- Regenerated all example sentences from scratch (deleted legacy + stale entries)
- Added .txt file support to epub_examples.py for Ben Yehuda corpus
- 7 Ben Yehuda nikkud'd children's texts + 3 new Time Tunnel EPUBs
- Maqaf-stripped construct form indexing (+68% inflected matches)
- Total: 3,598 words with examples, 3,289 with cloze (was ~2,900)
- Cloze prefix preservation (_cloze_prefix_len)
- Hebrew spoiler stripping from English meanings
- Gender field (זָכָר/נְקֵבָה) on vocab cards
- sec-table CSS layout for aligned key:value pairs
- Mishkal uses mishkal_hebrew on plural cards
- Improved mishkal extraction from pealim detail pages
- 21 new pytest tests (cloze, PoS, Hebrew stripping, gender, mishkal)
- 2 new validate_data.py tests + mishkal stats
- Colliding forms tracking (local-only)
- Release tag v0.17
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add clean_frequency_corpus.py: YAP morphological analyzer removes
prefix+word combos (e.g. בבית=ב+בית) from he_50k frequency data.
Headwords always protected. 30,430 clean entries from 49,999 raw.
- Add assign_frequency.py: two-tier assignment with PoS-aware homograph
handling. Tier 1 matches headwords; Tier 2 matches inflections (any rank)
and conjugations (rank>5000 only, to avoid false positives).
Function words claim frequency over content words in homograph groups,
with manual overrides for 12 common dual-use words.
- frequency_lookup.py auto-prefers frequency_clean.json when available
- 6,691 entries now have frequency (was 5,974), 717 newly assigned
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Homographs (same nikkud form, different meanings) had identical
plurals_guid values. Regenerated unique GUIDs by including meaning
in the hash. Also updated build-time fallback to use meaning.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>