hebrew_flash_cards

Author	SHA1	Message	Date
Sochen	c85063ee2f	Sprint 15: example sentence pipeline overhaul + corpus expansion + card improvements. Regenerated all example sentences from scratch (deleted legacy + stale entries). Added .txt file support to epub_examples.py for Ben Yehuda corpus. 7 Ben Yehuda nikkud'd children's texts + 3 new Time Tunnel EPUBs. Maqaf-stripped construct form indexing (+68% inflected matches). Total: 3,598 words with examples, 3,289 with cloze (was ~2,900). Cloze prefix preservation (_cloze_prefix_len). Hebrew spoiler stripping from English meanings. Gender field (זָכָר/נְקֵבָה) on vocab cards. sec-table CSS layout for aligned key:value pairs. Mishkal uses mishkal_hebrew on plural cards. Improved mishkal extraction from pealim detail pages. 21 new pytest tests (cloze, PoS, Hebrew stripping, gender, mishkal). 2 new validate_data.py tests + mishkal stats. Colliding forms tracking (local-only). Release tag v0.17. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 10:44:14 +00:00
Sochen	b2fef5aa8a	Sprint 11.1: strip_nikkud cleanup, dead code removal, test fixes Remove strip_nikkud from all pipeline files — use ktiv_male directly. Fix case-insensitive binyan matching in detail scraper (og:description uses UPPERCASE). Fix integration test slugs and test limits. Delete legacy CSVs, stale .apkg, and dead scripts from git. Add vulture to pre-commit hook. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 04:03:47 +00:00
Sochen	5685270dfa	chore: add PROJECT_NOTES.md, update .gitignore Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 08:20:31 +00:00
Sochen	17f7458d19	Sprint 9: cloze cards, plurals deck, project reorg, lint tooling. Cloze card pipeline: 924 cards from 2,296 AI-vetted Hebrew book sentences. Plurals deck: 375 notes (144 irregular + 231 regular from 86 mishkal patterns). Ktiv male forms expanded to 20,711 entries for sentence matching. Project reorg: helpers.py (deduped strip_nikkud from 10 files), scripts/ for one-off tools, tests/ with smoke tests, deleted 3 dead files. Lint tooling: pyproject.toml with ruff/vulture/bandit/pytest config, .editorconfig, fixed all 129 ruff errors (B023 closure fix, SIM103, unused vars). validate_apkg.py: card count range check for optional cloze template. Data caches committed: vetted_sentences, ktiv_male_forms, noun_plurals, noun_slug_map, vocab_sentence_matches, epub_sentence_index. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 08:09:39 +00:00
Sochen	f8e4873349	Remove releases/ binary from git; add to .gitignore Release artifacts (.apkg files) are distributed via Forgejo releases, not committed to the repository tree. Also gitignore CLAUDE.md (internal). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-05 05:26:18 +00:00
Sochen	39fb388f6c	first release	2026-03-04 00:27:46 -08:00
Sochen	58dc1b8d9b	fix: correct word/verb counts in README, add missing .gitignore entries - README: ~14,400 → ~9,100 words (actual scrape count) - README: 71 → 69 verbs (current verb list; 2 short of Coffin & Bolozky — to investigate) - .gitignore: add data/audio_conj/, data/image_cache.json, data/images/ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-04 06:32:44 +00:00
Sochen	b018f21b1d	feat: Sprint 2 + Sprint 3 — verb list, audio, passive forms, CSS/UX, validation, Heebo font, images Sprint 2: - extract_verb_list.py (NEW): downloads Coffin & Bolozky PDF, extracts 71-verb paradigm list from Appendix 1 with hardcoded fallback. Pu'al/Huf'al use '# 3ms:' prefix for 3ms search. - conjugation_extract.py: audio URL capture per form, passive forms parsing (Pu'al/Huf'al partner tables), 3ms search support. - benyehuda.py: nikkud corpus (txt.zip), index by nikkud word form, single best example (longest ≤200 chars), --refresh-examples rebuild. - apkg_builder.py: Hebrew labels, centered dark Hebrew text, freq-badge, related words grouped by PoS. Conjugation: Voice/Audio fields, present-tense 12-card expansion, 2fp/3fp modern fallback with classical in parens, פָּעִיל/סָבִיל voice labels. - README.md: rewritten — learner-first structure, data sources. - run.py: --refresh-examples flag, conjugation audio download (step 4b). - data/conjugations.json: rebuilt with 70 verbs, audio URLs, passive partner data. Sprint 3: - validate_verb_list.py (NEW): queries pealim.com for all entries in verb input list, classifies as OK/3ms/REVIEW/NOT_FOUND, writes cleaned verbs_input.txt. Results: 51 OK, 15 3ms-past, 4 REVIEW. - apkg_builder.py: binyan in Hebrew (BINYAN_TO_HEBREW map) on its own line; remove "דוגמה:" label; "Other" related-words shown unlabeled; "50k+" freq display for unlisted words; Image field in VOCAB_MODEL. - image_fetch.py (NEW): Wikipedia/Commons thumbnails for concrete nouns, caches in data/image_cache.json, downloads to data/images/. - Heebo variable font TTF bundled in both .apkg files via @font-face. - run.py: step_fonts(), step_images(), --skip-images flag. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 08:36:51 +00:00
Sochen	b086123bec	feat: add apkg builder, frequency, Ben Yehuda examples, conjugation deck Implements four major improvements to the Pealim Anki deck pipeline: 1. Automated .apkg generation (genanki) — no more manual Anki Desktop step. Both vocabulary and conjugation decks are built programmatically. 2. Word frequency ranking from hermitdave/FrequencyWords he_50k corpus. Notes sorted by rank so Anki presents most common words first. 3. Example sentences from Ben Yehuda public domain corpus (not pealim.com). Downloads txt_stripped.zip, indexes 25k texts, ~89% coverage on test set. 4. Conjugation drill deck — one card per form × verb. Input: verbs_input.txt (Hebrew infinitives). Initial set: 7 verbs (one per binyan). Extracts 28 forms each via pealim.com/search/ + table parse. New files: apkg_builder.py — genanki deck builder for both decks benyehuda.py — Ben Yehuda corpus downloader + sentence indexer frequency_lookup.py — FrequencyWords downloader + rank lookup verbs_input.txt — verb input list (7 test verbs, one per binyan) data/ — baseline CSVs + generated caches Updated: conjugation_extract.py — rewritten: reads verbs_input.txt, searches /search/?q= for slug, parses table by row labels requirements.txt — add genanki, beautifulsoup4, lxml run.py — full orchestration pipeline with CLI flags .gitignore — exclude venv/, benyehuda_index.json, audio/, output/ CLI: python run.py --skip-scrape --skip-audio --test 20 (quick test) python run.py --skip-scrape (full build) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 01:58:31 +00:00
Sochen	158f0477a3	added extraction of verb conjugations	2025-07-21 01:43:47 -07:00

10 commits
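
Several commits above mention stripping nikkud (vowel points) and indexing maqaf-joined construct forms for sentence matching. The repository's own implementation is not shown in this log, so the following is only a minimal sketch of the general idea. The function names `strip_nikkud` and `index_keys` and the choice to index each maqaf-separated part are assumptions made for illustration, not the project's actual code.

```python
import unicodedata

MAQAF = "\u05BE"  # Hebrew hyphen joining construct-state words


def strip_nikkud(text: str) -> str:
    """Drop combining marks (vowel points, cantillation) and keep base letters.

    Assumption: treating every Unicode nonspacing mark (category Mn) as
    removable is adequate for Hebrew text; the project may use another rule.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def index_keys(token: str) -> set[str]:
    """Return lookup keys for a token: the bare form plus each maqaf-separated part.

    Hypothetical helper: illustrates one way a maqaf-joined construct form
    could be made findable under its component words.
    """
    bare = strip_nikkud(token)
    keys = {bare}
    if MAQAF in bare:
        keys.update(part for part in bare.split(MAQAF) if part)
    return keys


if __name__ == "__main__":
    print(strip_nikkud("שָׁלוֹם"))
    print(sorted(index_keys("בֵּית־סֵפֶר")))
```

The maqaf itself is punctuation rather than a combining mark, so `strip_nikkud` leaves it in place and the splitting is handled separately in `index_keys`.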