hebrew_flash_cards

Author	SHA1	Message	Date
Sochen	f3496998f5	feat: confusables show ktiv male, emoji/prep stripping fully upstream - Confusables deck front now shows shared ktiv male form instead of nikkud variants joined by "/". Back still shows nikkud with definitions. - Fixed list scraper EMOJI_RE to catch variation selectors (U+FE0F) and ZWJ (U+200D) — cleaned 17 entries with leftover selectors in meaning. - Removed build-time prep extraction fallback (0 entries relied on it). - release.py: fix keeshare field name (API_TOKEN → password). Closes: Pealim #11 (emoji/prep upstream), Pealim #16 (confusables ktiv male) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-21 02:19:03 +00:00
Sochen	0a85291975	feat(v0.20): adaptive sentence difficulty scoring — 3163 scored cloze sentences Replaces length-based sentence selection with frequency-based difficulty scoring. Easiest sentences (most common context words) become cloze candidates. Score range: 1-50000, median=179. Coverage: 3163/3325 cloze entries scored. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-15 13:35:23 +00:00
Sochen	af186e2030	Sprint 17: homograph example dedup + plural audio + prep extraction - Homograph collision fix: _deduplicate_confusable_examples() clears shared examples from less-common confusable group members (36 entries fixed). Keeps examples only on highest-frequency meaning. - Plural deck audio: wired up PluralAudio field in apkg_builder.py, downloaded 613 plural audio files from pealim.com for all deck entries. - Prep extraction upstream: moved Hebrew preposition parsing from build time into list/detail scrapers (SCHEMA.yaml prep field added). - Validation: new no_shared_confusable_examples check in validate_data.py - Tests: 9 new unit tests for confusable deduplication (98 total) - Release: v0.19 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 21:51:35 +00:00
Sochen	c85063ee2f	Sprint 15: example sentence pipeline overhaul + corpus expansion + card improvements - Regenerated all example sentences from scratch (deleted legacy + stale entries) - Added .txt file support to epub_examples.py for Ben Yehuda corpus - 7 Ben Yehuda nikkud'd children's texts + 3 new Time Tunnel EPUBs - Maqaf-stripped construct form indexing (+68% inflected matches) - Total: 3,598 words with examples, 3,289 with cloze (was ~2,900) - Cloze prefix preservation (_cloze_prefix_len) - Hebrew spoiler stripping from English meanings - Gender field (זָכָר/נְקֵבָה) on vocab cards - sec-table CSS layout for aligned key:value pairs - Mishkal uses mishkal_hebrew on plural cards - Improved mishkal extraction from pealim detail pages - 21 new pytest tests (cloze, PoS, Hebrew stripping, gender, mishkal) - 2 new validate_data.py tests + mishkal stats - Colliding forms tracking (local-only) - Release tag v0.17 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 10:44:14 +00:00
Sochen	3b0f9defa9	feat: YAP-cleaned frequency corpus + two-tier assignment pipeline - Add clean_frequency_corpus.py: YAP morphological analyzer removes prefix+word combos (e.g. בבית=ב+בית) from he_50k frequency data. Headwords always protected. 30,430 clean entries from 49,999 raw. - Add assign_frequency.py: two-tier assignment with PoS-aware homograph handling. Tier 1 matches headwords; Tier 2 matches inflections (any rank) and conjugations (rank>5000 only, to avoid false positives). Function words claim frequency over content words in homograph groups, with manual overrides for 12 common dual-use words. - frequency_lookup.py auto-prefers frequency_clean.json when available - 6,691 entries now have frequency (was 5,974), 717 newly assigned Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 06:22:55 +00:00
Sochen	04a4b52113	fix: deduplicate 66 plural GUIDs for homograph nouns Homographs (same nikkud form, different meanings) had identical plurals_guid values. Regenerated unique GUIDs by including meaning in the hash. Also updated build-time fallback to use meaning. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 04:12:45 +00:00
Sochen	08fb7009d8	Sprint 11: unified JSON architecture + consolidated scraping pipeline Migrate from fragmented CSV + 10 JSON files to a single data/words.json (9,104 entries) as the unified data store. All GUIDs preserved for Anki study progress continuity. New files: - SCHEMA.yaml: authoritative schema for words.json - pealim_list_scrape.py: consolidated list page scraper → words.json - pealim_detail_scrape.py: noun/verb detail scraper → words.json - pealim_audio_download.py: audio downloader reading from words.json - scripts/migrate_to_json.py: one-time CSV→JSON migration - scripts/validate_data.py: 17 data integrity tests - scripts/check_guid_coverage.py: GUID preservation checker - scripts/repair_slugs.py: slug deduplication repair tool - tests/test_scraper_integration.py: live scraper integration tests Updated: - apkg_builder.py: reads from words.json (no more pandas) - run.py: 8-step pipeline (list scrape → frequency → examples → detail scrape → audio download → fonts → images → build) - benyehuda.py, frequency_lookup.py, image_fetch.py: TODO markers for future words.json integration Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 10:54:58 +00:00
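
Commit 04a4b52113 says the duplicate plural GUIDs were fixed by including the meaning in the hash. The log does not show the actual code, so the following is only a minimal sketch of that idea. The function name `plural_guid`, the choice of SHA-1, the separator byte, and the 10-character truncation are all assumptions made for illustration, not the repository's implementation.

```python
import hashlib


def plural_guid(nikkud_form: str, meaning: str) -> str:
    """Hypothetical helper: derive a stable GUID from the pointed form plus its meaning.

    Hashing the form alone gives homographs the same GUID; adding the
    meaning to the hashed key makes each sense distinct while staying
    deterministic across builds.
    """
    # The unit separator keeps ("ab", "c") and ("a", "bc") from colliding.
    key = f"{nikkud_form}\x1f{meaning}".encode("utf-8")
    return hashlib.sha1(key).hexdigest()[:10]  # truncation length is an assumption


# Illustrative values only: one pointed form with two placeholder meanings.
form = "בַּיִת"
guid_a = plural_guid(form, "first meaning")
guid_b = plural_guid(form, "second meaning")
```

Because the key depends only on the form and the meaning, rebuilding the deck yields the same GUIDs, which is the property Anki needs to keep study progress attached to a note.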

7 commits
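
Commit f3496998f5 mentions extending the list scraper's `EMOJI_RE` to catch variation selectors (U+FE0F) and zero width joiners (U+200D). The real pattern is not shown in the log, so the sketch below is an assumed, non-exhaustive pattern that only demonstrates why those two code points need to be in the character class. The helper name `strip_emoji` and the chosen Unicode ranges are hypothetical.

```python
import re

# Assumed pattern, not the scraper's actual EMOJI_RE. Ranges are not exhaustive
# (for example, regional-indicator flags are not covered).
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16, left behind if not matched
    "\u200D"                 # zero width joiner, glues emoji sequences together
    "]+"
)


def strip_emoji(meaning: str) -> str:
    """Hypothetical helper: remove emoji and their invisible companions, then tidy spaces."""
    return " ".join(EMOJI_RE.sub(" ", meaning).split())


# A heart followed by a variation selector, and a ZWJ family sequence.
cleaned_heart = strip_emoji("heart \u2764\uFE0F")
cleaned_family = strip_emoji("family \U0001F468\u200D\U0001F469\u200D\U0001F467")
```

Without the last two entries in the class, the visible emoji would be removed but the invisible U+FE0F and U+200D characters would remain in the meaning text, which matches the leftover selectors the commit describes.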