264 confusable groups whose entries all shared one Hebrew frequency rank
now have distinct pseudo_frequency values, ordered by English word
commonality (hermitdave en_50k.txt). The most common meaning keeps the
base rank; each less common meaning adds +100 per position after it.
Examples:
- אב: "father" (en:194) → 2491, "bud" (en:2963) → 2591
- אח: "brother" (en:300) → 911, "fireplace" (en:9389) → 1011
Builder uses pseudo_frequency for sort order when available.
Confusable card definitions now sorted most-common-first.
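A minimal sketch of the offset rule, for reference. The function name and
argument shapes are hypothetical and not taken from the builder; it assumes
every meaning in the group has an English rank.

```python
def assign_pseudo_frequency(base_rank, en_ranks, step=100):
    """Return {meaning: pseudo_frequency} for one confusable group.

    base_rank: the Hebrew frequency rank shared by the whole group.
    en_ranks:  {meaning: English rank}; a lower rank means more common.
    """
    # Most common English meaning first, so it keeps the base rank.
    ordered = sorted(en_ranks, key=en_ranks.get)
    return {meaning: base_rank + step * pos for pos, meaning in enumerate(ordered)}


group = assign_pseudo_frequency(2491, {"bud": 2963, "father": 194})
print(group)  # {'father': 2491, 'bud': 2591}
```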
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reviewed all 117 pealim-inherited emoji assignments:
- Made 114 correct assignments visible (emoji_visible: true)
- Removed: goblet (🏆 is trophy), fitness (🏋 too abstract), red (💄 is lipstick)
- Fixed: onion 🌰→🧅 (was chestnut emoji)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Confusables deck front now shows shared ktiv male form instead of
nikkud variants joined by "/". Back still shows nikkud with definitions.
- Fixed list scraper EMOJI_RE to catch variation selectors (U+FE0F) and
ZWJ (U+200D) — cleaned 17 entries with leftover selectors in meaning.
- Removed build-time prep extraction fallback (0 entries relied on it).
- release.py: fix keeshare field name (API_TOKEN → password).
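Sketch of the kind of pattern the EMOJI_RE fix calls for. The code point
ranges and the split_meaning helper are assumptions for illustration; the
scraper's real pattern may cover more Unicode blocks.

```python
import re

# Assumed emoji ranges (pictographs plus misc symbols and dingbats).
_EMOJI_CHAR = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"

# One emoji, an optional variation selector (U+FE0F), then any number of
# ZWJ-joined (U+200D) continuations, each with its own optional selector.
EMOJI_RE = re.compile(f"{_EMOJI_CHAR}\uFE0F?(?:\u200D{_EMOJI_CHAR}\uFE0F?)*")


def split_meaning(raw):
    """Return (meaning_without_emoji, list_of_emoji_sequences)."""
    emoji = EMOJI_RE.findall(raw)
    # Replace each sequence with a space, then collapse leftover whitespace.
    meaning = " ".join(EMOJI_RE.sub(" ", raw).split())
    return meaning, emoji
```

Because the selector and ZWJ are consumed as part of the match, nothing
invisible is left behind in the meaning text.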
Closes: Pealim #11 (emoji/prep upstream), Pealim #16 (confusables ktiv male)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Homograph collision fix: _deduplicate_confusable_examples() clears
shared examples from less-common confusable group members (36 entries
fixed). Keeps examples only on highest-frequency meaning.
- Plural deck audio: wired up PluralAudio field in apkg_builder.py,
downloaded 613 plural audio files from pealim.com for all deck entries.
- Prep extraction upstream: moved Hebrew preposition parsing from build
time into list/detail scrapers (SCHEMA.yaml prep field added).
- Validation: new no_shared_confusable_examples check in validate_data.py
- Tests: 9 new unit tests for confusable deduplication (98 total)
- Release: v0.19
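Rough shape of the deduplication, as a sketch only. The entry keys
(ktiv_male, pseudo_frequency, examples) and the return value are assumed
here; the real _deduplicate_confusable_examples() may differ in detail.

```python
from collections import defaultdict


def deduplicate_confusable_examples(entries):
    """Drop examples that a more common group member already carries.

    Mutates the entry dicts in place and returns how many entries changed.
    """
    groups = defaultdict(list)
    for entry in entries:
        groups[entry["ktiv_male"]].append(entry)

    changed = 0
    for members in groups.values():
        if len(members) < 2:
            continue  # not a confusable group
        # Lower pseudo_frequency means more common, so that member goes first.
        members.sort(key=lambda e: e["pseudo_frequency"])
        claimed = set(members[0]["examples"])
        for entry in members[1:]:
            kept = [ex for ex in entry["examples"] if ex not in claimed]
            if len(kept) != len(entry["examples"]):
                entry["examples"] = kept
                changed += 1
            claimed.update(kept)
    return changed
```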
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Regenerated all example sentences from scratch (deleted legacy + stale entries)
- Added .txt file support to epub_examples.py for Ben Yehuda corpus
- 7 Ben Yehuda nikkud'd children's texts + 3 new Time Tunnel EPUBs
- Maqaf-stripped construct form indexing (+68% inflected matches)
- Total: 3,598 words with examples, 3,289 with cloze (was ~2,900)
- Cloze prefix preservation (_cloze_prefix_len)
- Hebrew spoiler stripping from English meanings
- Gender field (זָכָר/נְקֵבָה) on vocab cards
- sec-table CSS layout for aligned key:value pairs
- Mishkal uses mishkal_hebrew on plural cards
- Improved mishkal extraction from pealim detail pages
- 21 new pytest tests (cloze, PoS, Hebrew stripping, gender, mishkal)
- 2 new validate_data.py tests + mishkal stats
- Colliding forms tracking (local-only)
- Release tag v0.17
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add clean_frequency_corpus.py: YAP morphological analyzer removes
prefix+word combos (e.g. בבית=ב+בית) from he_50k frequency data.
Headwords always protected. 30,430 clean entries from 49,999 raw.
- Add assign_frequency.py: two-tier assignment with PoS-aware homograph
handling. Tier 1 matches headwords; Tier 2 matches inflections (any rank)
and conjugations (rank>5000 only, to avoid false positives).
Function words claim frequency over content words in homograph groups,
with manual overrides for 12 common dual-use words.
- frequency_lookup.py auto-prefers frequency_clean.json when available
- 6,691 entries now have frequency (was 5,974), 717 newly assigned
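The two-tier rule, sketched without the PoS and homograph handling. Entry
keys and the choice of the best (lowest) matching rank in Tier 2 are
assumptions, not a copy of assign_frequency.py.

```python
def assign_rank(entry, ranks, conj_min_rank=5000):
    """Return (rank, tier) for one entry, or (None, None) if nothing matches.

    entry: dict with 'headword' and optional 'inflections' / 'conjugations'
           lists of surface forms (hypothetical field names).
    ranks: {surface form: corpus rank}; a lower rank means more common.
    """
    # Tier 1: direct headword match.
    if entry["headword"] in ranks:
        return ranks[entry["headword"]], 1

    # Tier 2: inflections match at any rank; conjugations only past the
    # cutoff, since very frequent forms are more likely to be collisions.
    candidates = [ranks[f] for f in entry.get("inflections", []) if f in ranks]
    candidates += [
        ranks[f]
        for f in entry.get("conjugations", [])
        if f in ranks and ranks[f] > conj_min_rank
    ]
    return (min(candidates), 2) if candidates else (None, None)
```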
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Homographs (same nikkud form, different meanings) had identical
plurals_guid values. Regenerated unique GUIDs by including meaning
in the hash. Also updated build-time fallback to use meaning.
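Illustration of the idea, assuming a plain hashlib digest; the function
name, separator, and digest length are placeholders, and the builder may
use a different hash.

```python
import hashlib


def plurals_guid(nikkud_form, meaning):
    """Stable GUID derived from the vocalized form plus its meaning."""
    # The unit separator keeps ("ab", "c") and ("a", "bc") from producing
    # the same hash input.
    key = f"{nikkud_form}\x1f{meaning}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]


# Same nikkud form, different meanings: the GUIDs no longer collide.
assert plurals_guid("אָב", "father") != plurals_guid("אָב", "bud")
```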
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove strip_nikkud from all pipeline files — use ktiv_male directly.
Fix case-insensitive binyan matching in detail scraper (og:description
uses UPPERCASE). Fix integration test slugs and test limits. Delete
legacy CSVs, stale .apkg, and dead scripts from git. Add vulture to
pre-commit hook.
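
Sketch of case-insensitive binyan matching. The transliterations, the
helper name, and the sample description string are assumptions; the real
og:description text on pealim.com may be spelled or punctuated differently.

```python
import re

BINYANIM = ["Pa'al", "Nif'al", "Pi'el", "Pu'al", "Hif'il", "Huf'al", "Hitpa'el"]

# Match any binyan name as a whole word, regardless of case.
_BINYAN_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(b) for b in BINYANIM) + r")\b",
    re.IGNORECASE,
)
_CANONICAL = {b.lower(): b for b in BINYANIM}


def find_binyan(description):
    """Return the canonical binyan name found in `description`, or None."""
    m = _BINYAN_RE.search(description)
    return _CANONICAL[m.group(0).lower()] if m else None


print(find_binyan("Verb - PA'AL"))  # Pa'al
```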
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix PoS substring bug: "Pronoun" no longer matches "Noun"
- CSS: reduce sec-label/sec-key font sizes, add .definitions/.conf-entry
- Slug-based audio filenames for confusable words (no more collisions)
- Scraper captures slug from pealim.com list page links
- Confusables: RTL alignment, re-enable audio (remove all-must-have gate)
- Plurals: blue given word, gray meaning, labeled mishkal badge
- Conjugation: add "אֵיךְ אוֹמְרִים" prompt, tense prefix (בְּ),
Prep field from HBPAREN_RE, labeled RelatedVocab
- Ben Yehuda: skip stripped fallback for confusable words
- Bump RELEASE_TAG to v0.15
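For the PoS fix, a whole-word match is one way to avoid the substring
trap (a lowercase substring test finds "noun" inside "pronoun"). The
helper below is a hypothetical sketch, not the pipeline's code.

```python
import re


def has_pos(pos_field, target):
    """True if `target` appears in `pos_field` as a whole word."""
    pattern = rf"\b{re.escape(target)}\b"
    return re.search(pattern, pos_field, re.IGNORECASE) is not None


print(has_pos("Pronoun", "Noun"))          # False
print(has_pos("Noun, masculine", "Noun"))  # True
```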
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Full pealim.com rescrape: 9,120 words (15 new), all with audio URLs
- Plurals deck: 2:1 regular:irregular ratio (649 notes), RTL arrows, 1.6x hint text
- Conjugation deck: blue infinitive on front, plain meaning on back, nikkud labels
- Confusables deck: larger prompt text (32px), audio only when all words have it
- Validator: non-audio variants no longer false-fail on audio check
- 14 new audio files downloaded
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Expanded _EMOJI_STOP from ~20 to ~80 keywords after manual review
of all 2,261 emoji-word pairs. Removes false positives from
polysemous words (french→🍟, water→🤽, rock→🪨, etc.)
- Emoji count: 2,261 → 1,820 (removed ~440 bad matches)
- hebrew_dict.csv now populated with audio_url from pealim.com scrape
(8,727 words with audio URLs)
- Cached emoji_lookup.json (1,749 keywords from Unicode emoji-test.txt)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Item 1/2: Extract emoji and Hebrew parentheticals (prepositions) from
Meaning field; display emoji with 3.5em font, prep inline after Hebrew
word. Add Emoji and Prep fields to Hebrew Flash Cards model.
- Item 3: Seeded RNG per verb reduces conjugation cards by ~630 (4 present
forms → 1 pronoun each; past_3p → 1 gender). 1st-person forms gain gender
label (זכר/נקבה). Total: 1,834 conj cards (was ~2,464).
- Item 4: hebrew_extract.py uses BeautifulSoup to capture data-audio URLs
from pealim.com list pages during scraping. step_audio() reads audio_url
column from CSV (no longer needs audio_extract.py).
- Item 5: Rename to 'Hebrew Flash Cards'. New filenames: hebrew_dict.csv,
hebrew_extract.py, hebrew_vocabulary.apkg, hebrew_conjugations.apkg.
Deck/model names updated throughout. Forgejo repo rename pending (sochen
lacks admin rights — Nevo must do via UI).
- Fix: Deduplicate entries with same Hebrew word before adding notes
(eliminates GUID collisions from duplicate source CSV rows).
- Bump RELEASE_TAG to v0.11.
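Sketch of the per-verb seeded choice from Item 3. The form keys, pronoun
lists, and helper names are illustrative assumptions. The seed comes from
a digest of the slug because Python salts str hash() per process, which
would otherwise reshuffle cards on every build.

```python
import hashlib
import random

# Each present-tense form can be prompted with several pronouns.
PRONOUNS = {
    "present_ms": ["אני", "אתה", "הוא"],
    "present_fs": ["אני", "את", "היא"],
    "present_mp": ["אנחנו", "אתם", "הם"],
    "present_fp": ["אנחנו", "אתן", "הן"],
}


def verb_rng(slug):
    """Return a random.Random seeded deterministically from the verb slug."""
    digest = hashlib.sha256(slug.encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def pick_pronouns(slug):
    """Pick one pronoun per present form, stable across builds."""
    rng = verb_rng(slug)
    return {form: rng.choice(options) for form, options in PRONOUNS.items()}
```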
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- add @media (prefers-color-scheme: dark) block to CARD_CSS covering all hardcoded colors
- _parse_table: add table_el param to parse a specific table directly
- _extract_conjugations: detect second active conjugation table; store alternate_forms
- build_conj_deck: show "primary / alternate" when alternate form exists for a key
- README: fix dead ../../releases link → git.nevo.engineer/nevo/pealim/releases
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
להתלקלח in the original source was a typo for להתקלח (1896-lehitkaleach),
not for להתקלקל as previously assumed — it's a completely different word.
Conjugation deck now has the correct 70 paradigm verbs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- vocab deck uses frequency insertion order (genanki.Package); conjugation deck random (_RandomOrderPackage)
- skip infinitive form_key in conjugation deck build (reference only, not a quiz target)
- PAST_3P_EXPANSION: split past_3p into separate הֵם and הֵן cards
- SECTION_BINYAN parsing: read section headers from verbs_input.txt as binyan hints
- add binyan_hint param to _extract_conjugations and _extract_passive_from_active_slug
- patch 20 cached entries with empty binyan (Pa'al, Nif'al) using section hints
- result: 2428 notes across 69 verbs, all with populated binyan
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Conjugation extraction:
- Active entries now extract active forms only (no auto passive partner)
- Passive (# 3ms:) entries extract passive section only via new
_extract_passive_from_active_slug(); search-based fallback also uses
this path so no active forms leak into passive entries
- # slug: VERB SLUG override syntax for search-ambiguous active verbs
- # 3ms: FORM ACTIVE-SLUG syntax for passive entries with known active page
- Fixed verb spellings: בוטל (was בותל), slug overrides for תואם →
2344-letaem, זוכה → 503-lezakot, לָשִׂים → 45-lasim, העבר → 1442-lehaavir
Card UX:
- Passive card front: shows active partner infinitive (e.g. לְבַטֵּל) with
(סָבִיל) inline in smaller font instead of bare 3ms past form
- Removed פָּעִיל label from active cards; only passive cards carry voice label
- New cards introduced in random order (new.order=0 via _RandomOrderPackage)
- Frequency badge: words outside top 50k show "50k+" instead of blank
README: updated CLI options, output files table, pipeline list, card
descriptions to reflect Sprint 3 state
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- data/fonts/: Heebo variable font TTF (Regular + Bold) for bundling in .apkg
- image_fetch.py: Wikipedia/Commons image fetch for concrete nouns
- validate_verb_list.py: pealim.com validator for verb input list
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements four major improvements to the Pealim Anki deck pipeline:
1. Automated .apkg generation (genanki) — no more manual Anki Desktop step.
Both vocabulary and conjugation decks are built programmatically.
2. Word frequency ranking from hermitdave/FrequencyWords he_50k corpus.
Notes sorted by rank so Anki presents most common words first.
3. Example sentences from Ben Yehuda public domain corpus (not pealim.com).
Downloads txt_stripped.zip, indexes 25k texts, ~89% coverage on test set.
4. Conjugation drill deck — one card per form × verb.
Input: verbs_input.txt (Hebrew infinitives). Initial set: 7 verbs (one
per binyan). Extracts 28 forms each via pealim.com/search/ + table parse.
New files:
apkg_builder.py — genanki deck builder for both decks
benyehuda.py — Ben Yehuda corpus downloader + sentence indexer
frequency_lookup.py — FrequencyWords downloader + rank lookup
verbs_input.txt — verb input list (7 test verbs, one per binyan)
data/ — baseline CSVs + generated caches
Updated:
conjugation_extract.py — rewritten: reads verbs_input.txt, searches
/search/?q= for slug, parses table by row labels
requirements.txt — add genanki, beautifulsoup4, lxml
run.py — full orchestration pipeline with CLI flags
.gitignore — exclude venv/, benyehuda_index.json, audio/, output/
CLI:
python run.py --skip-scrape --skip-audio --test 20 (quick test)
python run.py --skip-scrape (full build)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>