264 confusable groups where all entries shared the same Hebrew frequency
now have differentiated pseudo_frequency values based on English word
commonality (hermitdave en_50k.txt). The most common meaning keeps the base
rank; each less common meaning adds +100 per position in the English ordering.
Examples:
- אב: "father" (en:194) → 2491, "bud" (en:2963) → 2591
- אח: "brother" (en:300) → 911, "fireplace" (en:9389) → 1011
Builder uses pseudo_frequency for sort order when available.
Confusable card definitions now sorted most-common-first.
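A minimal sketch of the offset rule described above, assuming the English ranks are available as a meaning-to-rank mapping. The function name `assign_pseudo_frequency` and the `step` parameter are illustrative and not taken from the builder:

```python
def assign_pseudo_frequency(base_rank, english_ranks, step=100):
    """Map each meaning to base_rank + step * position.

    english_ranks maps meaning -> English rank (lower = more common).
    Position 0 is the most common meaning, so it keeps the base rank.
    """
    ordered = sorted(english_ranks, key=english_ranks.get)
    return {meaning: base_rank + step * pos for pos, meaning in enumerate(ordered)}


# Values from the אב example: base rank 2491, "father" en:194, "bud" en:2963
av = assign_pseudo_frequency(2491, {"bud": 2963, "father": 194})
print(av)  # {'father': 2491, 'bud': 2591}
```

Sorting ascending by the result puts the most common meaning first.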
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Homograph collision fix: _deduplicate_confusable_examples() clears
shared examples from less-common confusable group members (36 entries
fixed). Keeps examples only on highest-frequency meaning.
- Plural deck audio: wired up PluralAudio field in apkg_builder.py,
downloaded 613 plural audio files from pealim.com for all deck entries.
- Prep extraction upstream: moved Hebrew preposition parsing from build
time into list/detail scrapers (SCHEMA.yaml prep field added).
- Validation: new no_shared_confusable_examples check in validate_data.py
- Tests: 9 new unit tests for confusable deduplication (98 total)
- Release: v0.19
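A rough sketch of the deduplication idea, not the actual `_deduplicate_confusable_examples()` code. It assumes each group member is a dict with hypothetical `pseudo_frequency` and `examples` keys, where a lower `pseudo_frequency` means a more common meaning:

```python
def deduplicate_confusable_examples(group):
    """Remove examples from less common members that a more common member has.

    Mutates the entries in place and returns how many entries were changed.
    """
    ordered = sorted(group, key=lambda entry: entry["pseudo_frequency"])
    seen = set()
    changed = 0
    for entry in ordered:
        original = entry["examples"]
        kept = [example for example in original if example not in seen]
        if len(kept) != len(original):
            changed += 1
        seen.update(original)
        entry["examples"] = kept
    return changed


group = [
    {"meaning": "bud", "pseudo_frequency": 2591, "examples": ["sentence 1"]},
    {"meaning": "father", "pseudo_frequency": 2491, "examples": ["sentence 1", "sentence 2"]},
]
fixed = deduplicate_confusable_examples(group)
```

After the call, only the "father" entry still has examples, and `fixed` counts the one entry that was cleared.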
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Regenerated all example sentences from scratch (deleted legacy + stale entries)
- Added .txt file support to epub_examples.py for Ben Yehuda corpus
- Added 7 vocalized (nikkud) Ben Yehuda children's texts + 3 new Time Tunnel EPUBs
- Maqaf-stripped construct form indexing (+68% inflected matches)
- Total: 3,598 words with examples, 3,289 with cloze (was ~2,900)
- Cloze prefix preservation (_cloze_prefix_len)
- Hebrew spoiler stripping from English meanings
- Gender field (זָכָר/נְקֵבָה) on vocab cards
- sec-table CSS layout for aligned key:value pairs
- Mishkal uses mishkal_hebrew on plural cards
- Improved mishkal extraction from pealim detail pages
- 21 new pytest tests (cloze, PoS, Hebrew stripping, gender, mishkal)
- 2 new validate_data.py tests + mishkal stats
- Colliding forms tracking (local-only)
- Release tag v0.17
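One way the maqaf-stripped indexing could work, shown as a sketch. It assumes construct forms are stored with a trailing maqaf (U+05BE) and that matching uses a form-to-headword dict. `build_form_index` is a hypothetical name, not a function from epub_examples.py:

```python
MAQAF = "\u05be"  # Hebrew punctuation maqaf


def build_form_index(forms_by_headword):
    """Index every inflected form, plus a maqaf-stripped variant, by headword."""
    index = {}
    for headword, forms in forms_by_headword.items():
        for form in forms:
            index.setdefault(form, headword)
            stripped = form.replace(MAQAF, "")
            if stripped:
                index.setdefault(stripped, headword)
    return index


# Construct plural stored with a trailing maqaf, as an illustration
index = build_form_index({"ספר": ["ספרים", "ספרי" + MAQAF]})
```

A text token without the maqaf then resolves to the same headword as the stored construct form.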
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Template & CSS fixes (15 items from Mar 9 feedback):
- Fix conjugation front showing 3ms form instead of infinitive
- Rename conjugation model to "Hebrew Conjugation"
- Strip Hebrew parenthesized text from English meanings
- Shoresh separator: spaces → dots (א.כ.ל)
- Remove duplicate English meaning from cloze back
- Remove example sentences from vocab front/back (cloze only)
- Center-align audio buttons on all decks
- Fix parenthesis spacing: "you(feminine,singular)" → "you (feminine, singular)"
- Unify sec-key/sec-label fonts, make keys bold
- Size overhaul: bigger Hebrew (42px), meaning (34px), secondary (28px)
- Center-align related words groups
- Sort confusables by average frequency
- Plurals: show Gender (Hebrew) before Mishkal, strip emoji from meaning
- Clean duplicate quotation marks in cloze sentences
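Two of the text fixes above (parenthesis spacing and the shoresh separator) can be sketched with small helpers. The function names are illustrative, and the comma rule is an assumption that would also split numbers such as "1,000":

```python
import re


def fix_paren_spacing(text):
    """Insert a space before "(" glued to a word and after commas."""
    text = re.sub(r"(?<=\w)\(", " (", text)
    text = re.sub(r",(?=\S)", ", ", text)
    return text


def shoresh_with_dots(shoresh):
    """Join space-separated root letters with dots."""
    return ".".join(shoresh.split())


fixed_meaning = fix_paren_spacing("you(feminine,singular)")
dotted_root = shoresh_with_dots("א כ ל")
```

Both helpers leave already-correct input unchanged, so they are safe to run on every build.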
Sprint 12 carry-forward (detail scrape + EPUB):
- Adjective/preposition detail scraping in pealim_detail_scrape.py
- EPUB example matching rewrite in epub_examples.py
- Delete benyehuda.py and rebuild_sentence_matches.py (merged)
- 49 parser tests for detail scraping
- SCHEMA.yaml updates for new fields
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add clean_frequency_corpus.py: YAP morphological analyzer removes
prefix+word combos (e.g. בבית=ב+בית) from he_50k frequency data.
Headwords always protected. 30,430 clean entries from 49,999 raw.
- Add assign_frequency.py: two-tier assignment with PoS-aware homograph
handling. Tier 1 matches headwords; Tier 2 matches inflections (any rank)
and conjugations (rank>5000 only, to avoid false positives).
Function words claim frequency over content words in homograph groups,
with manual overrides for 12 common dual-use words.
- frequency_lookup.py auto-prefers frequency_clean.json when available
- 6,691 entries now have frequency (was 5,974), 717 newly assigned
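A simplified sketch of the two-tier lookup, leaving out the homograph and override handling. The entry keys (`headword`, `inflections`, `conjugations`) and the choice of the best (lowest) rank among Tier 2 candidates are assumptions, not details confirmed by assign_frequency.py:

```python
CONJUGATION_MIN_RANK = 5000  # conjugations only count when rank > 5000


def assign_frequency(entry, freq):
    """Return (rank, tier) for an entry, or None if nothing matches.

    freq maps a surface form to its corpus rank (lower = more frequent).
    """
    rank = freq.get(entry["headword"])
    if rank is not None:
        return rank, 1  # Tier 1: headword match
    candidates = [freq[f] for f in entry.get("inflections", []) if f in freq]
    candidates += [
        freq[f]
        for f in entry.get("conjugations", [])
        if f in freq and freq[f] > CONJUGATION_MIN_RANK
    ]
    return (min(candidates), 2) if candidates else None


# Placeholder forms stand in for Hebrew words
freq = {"head": 120, "infl": 800, "conj_common": 40, "conj_rare": 7200}
```

With this table, an entry whose only match is `conj_common` stays unassigned, which is the false-positive guard the commit describes.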
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix PoS substring bug: "Pronoun" no longer matches "Noun"
- CSS: reduce sec-label/sec-key font sizes, add .definitions/.conf-entry
- Slug-based audio filenames for confusable words (no more collisions)
- Scraper captures slug from pealim.com list page links
- Confusables: RTL alignment, re-enable audio (remove all-must-have gate)
- Plurals: blue given word, gray meaning, labeled mishkal badge
- Conjugation: add "אֵיךְ אוֹמְרִים" prompt, tense prefix (בְּ),
Prep field from HBPAREN_RE, labeled RelatedVocab
- Ben Yehuda: skip stripped fallback for confusable words
- Bump RELEASE_TAG to v0.15
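The PoS substring bug can be illustrated as follows. This is a guess at the failure mode (a case-insensitive substring test) and at one possible fix (whole-word matching). `has_pos` is a hypothetical helper, not the project's function:

```python
import re


def has_pos(pos_field, target):
    """True if target appears as a whole word in the PoS field."""
    tokens = re.findall(r"[a-z]+", pos_field.lower())
    return target.lower() in tokens


# The substring check is what goes wrong: "noun" is contained in "pronoun"
buggy = "noun" in "Pronoun".lower()
```

Here `buggy` is true, while `has_pos("Pronoun", "Noun")` is false and `has_pos("Noun, masculine", "Noun")` is true.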
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>