Template & CSS fixes (15 items from Mar 9 feedback):
- Fix conjugation front showing 3ms form instead of infinitive
- Rename conjugation model to "Hebrew Conjugation"
- Strip Hebrew parenthesized text from English meanings
- Shoresh separator: spaces → dots (א.כ.ל)
- Remove duplicate English meaning from cloze back
- Remove example sentences from vocab front/back (they now appear on cloze cards only)
- Center-align audio buttons on all decks
- Fix parenthesis spacing: "you(feminine,singular)" → "you (feminine, singular)"
- Unify sec-key/sec-label fonts, make keys bold
- Size overhaul: bigger Hebrew (42px), meaning (34px), secondary (28px)
- Center-align related words groups
- Sort confusables by average frequency
- Plurals: show Gender (Hebrew) before Mishkal, strip emoji from meaning
- Clean duplicate quotation marks in cloze sentences
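A minimal sketch of two of the text fixes above (shoresh separator and parenthesis spacing). The function names are hypothetical and the regexes are an assumption about how the cleanup could be done, not the actual template code:

```python
import re


def format_shoresh(root: str) -> str:
    """Join space-separated root letters with dots, e.g. 'א כ ל' -> 'א.כ.ל'."""
    return ".".join(root.split())


def fix_paren_spacing(text: str) -> str:
    """Turn 'you(feminine,singular)' into 'you (feminine, singular)'."""
    # Insert a space before "(" when it directly follows a non-space character.
    text = re.sub(r"(?<=[^\s(])\(", " (", text)

    # Inside each parenthesized group, normalise "a,b" to "a, b".
    def _space_commas(match: re.Match) -> str:
        return re.sub(r",\s*", ", ", match.group(0))

    return re.sub(r"\([^()]*\)", _space_commas, text)
```

Both helpers are idempotent, so running them over already-clean fields leaves the text unchanged.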
Sprint 12 carry-forward (detail scrape + EPUB):
- Adjective/preposition detail scraping in pealim_detail_scrape.py
- EPUB example matching rewrite in epub_examples.py
- Delete benyehuda.py and rebuild_sentence_matches.py (merged)
- 49 parser tests for detail scraping
- SCHEMA.yaml updates for new fields
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove strip_nikkud from all pipeline files — use ktiv_male directly.
Fix case-insensitive binyan matching in detail scraper (og:description
uses UPPERCASE). Fix integration test slugs and test limits. Delete
legacy CSVs, stale .apkg, and dead scripts from git. Add vulture to
pre-commit hook.
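Sketch of the case-insensitive binyan match, for illustration only. The contents of BINYAN_NAMES, the helper name, and the sample description string are assumptions; the real og:description format may differ:

```python
from typing import Optional

# Assumed transliterations of the seven binyanim (hypothetical constant contents).
BINYAN_NAMES = ("pa'al", "nif'al", "pi'el", "pu'al", "hif'il", "huf'al", "hitpa'el")


def find_binyan(description: str) -> Optional[str]:
    """Return the first binyan name found in the text, ignoring case."""
    lowered = description.casefold()
    for name in BINYAN_NAMES:
        if name in lowered:
            return name
    return None
```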
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The list scrape captures the slugs that the detail scrape needs, so the two
steps should run back to back. Reordered pipeline: list→detail→frequency→examples→audio→fonts→images→build
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix PoS substring bug: "Pronoun" no longer matches "Noun"
- CSS: reduce sec-label/sec-key font sizes, add .definitions/.conf-entry
- Slug-based audio filenames for confusable words (no more collisions)
- Scraper captures slug from pealim.com list page links
- Confusables: RTL alignment, re-enable audio (remove all-must-have gate)
- Plurals: blue given word, gray meaning, labeled mishkal badge
- Conjugation: add "אֵיךְ אוֹמְרִים" prompt, tense prefix (בְּ),
Prep field from HBPAREN_RE, labeled RelatedVocab
- Ben Yehuda: skip stripped fallback for confusable words
- Bump RELEASE_TAG to v0.15
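Illustration of the PoS substring bug and one possible fix. The helper name and the word-boundary approach are assumptions; the actual fix in the pipeline may compare fields differently:

```python
import re


def pos_matches(pos_field: str, wanted: str) -> bool:
    """Match a part-of-speech label as a whole word, not as a substring."""
    pattern = rf"\b{re.escape(wanted)}\b"
    return re.search(pattern, pos_field, flags=re.IGNORECASE) is not None


# The naive substring check is what made "Pronoun" look like a "Noun".
naive_hit = "noun" in "Pronoun".lower()
```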
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Emoji: _load_emoji_lookup() fetches unicode.org emoji-test.txt, builds
{keyword: emoji_char} map cached in data/emoji_lookup.json. Falls back
to empty dict on network failure. build_all_variants() loads once and
passes to all build_vocab_deck() calls. For each word without pealim
emoji, tries first 5 keywords from English meaning against lookup.
- Nikkud: זכר→זָכָר, נקבה→נְקֵבָה in PRESENT_EXPANSION constants and
build_conj_deck() 1st-person gender labels.
- Summary: conj audio file count now excludes the _infinitive and _passive_
files that exist on disk but are never bundled in the .apkg (was 2235, now ~1765).
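Rough sketch of the keyword lookup described in the Emoji item. pick_emoji is a hypothetical name, and the tokenisation is an assumption; the network fetch and JSON cache in _load_emoji_lookup() are not reproduced here:

```python
import re


def pick_emoji(meaning: str, lookup: dict[str, str], max_keywords: int = 5) -> str:
    """Try the first few keywords of an English meaning against the lookup."""
    keywords = re.findall(r"[a-z]+", meaning.lower())[:max_keywords]
    for keyword in keywords:
        if keyword in lookup:
            return lookup[keyword]
    # An empty lookup (e.g. after a network failure) simply yields no emoji.
    return ""
```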
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously --skip-conjugations returned None, causing build_all_variants()
to produce near-empty conjugation decks (0.3MB font-only files). Now loads
from conjugations.json cache so all 6 release variants build correctly.
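Sketch of the cache fallback, assuming a simple JSON file. The function name, its signature, and the sample data are hypothetical; only the idea (read the cache instead of returning None) comes from this change:

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def get_conjugations(skip: bool, cache_path: Path, scrape: Callable[[], dict]) -> dict:
    """Return conjugation data, reading the JSON cache when scraping is skipped."""
    if skip:
        # Load cached data rather than returning None, which left decks near-empty.
        return json.loads(cache_path.read_text(encoding="utf-8"))
    data = scrape()
    cache_path.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")
    return data


# Demo: the first call scrapes and writes the cache, the second only reads it.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "conjugations.json"
    fresh = get_conjugations(False, cache, lambda: {"45-lasim": ["form"]})
    cached = get_conjugations(True, cache, lambda: {})
```

A missing or corrupt cache raises here instead of silently producing an empty deck.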
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Item 1/2: Extract emoji and Hebrew parentheticals (prepositions) from
Meaning field; display emoji with 3.5em font, prep inline after Hebrew
word. Add Emoji and Prep fields to Hebrew Flash Cards model.
- Item 3: Seeded RNG per verb reduces conjugation cards by ~630 (4 present
forms → 1 pronoun each; past_3p → 1 gender). 1st-person forms gain gender
label (זכר/נקבה). Total: 1,834 conj cards (was ~2,464).
- Item 4: hebrew_extract.py uses BeautifulSoup to capture data-audio URLs
from pealim.com list pages during scraping. step_audio() reads audio_url
column from CSV (no longer needs audio_extract.py).
- Item 5: Rename to 'Hebrew Flash Cards'. New filenames: hebrew_dict.csv,
hebrew_extract.py, hebrew_vocabulary.apkg, hebrew_conjugations.apkg.
Deck/model names updated throughout. Forgejo repo rename pending (sochen
lacks admin rights — Nevo must do via UI).
- Fix: Deduplicate entries with same Hebrew word before adding notes
(eliminates GUID collisions from duplicate source CSV rows).
- Bump RELEASE_TAG to v0.11.
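Sketch of the per-verb seeded choice from Item 3. The form keys and function name are hypothetical, and seeding directly on the verb string is an assumption; the point is that the same verb always gets the same pronoun per present form across builds:

```python
import random

# Pronouns that share each present-tense form (form keys are assumed names).
PRESENT_PRONOUNS = {
    "present_ms": ["אני", "אתה", "הוא"],
    "present_fs": ["אני", "את", "היא"],
    "present_mp": ["אנחנו", "אתם", "הם"],
    "present_fp": ["אנחנו", "אתן", "הן"],
}


def choose_present_pronouns(verb: str) -> dict[str, str]:
    """Pick one pronoun per present form, deterministically for a given verb."""
    # random.Random seeded with a str is stable across runs, unlike hash(str).
    rng = random.Random(verb)
    return {form: rng.choice(options) for form, options in PRESENT_PRONOUNS.items()}
```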
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Type annotations: dict|None defaults, return types, nested func annotations
- Dead code: removed unused row_forms_with_audio(), duplicate _strip_nikkud defs,
redundant guards, duplicate 'ism' in ABSTRACT_SUFFIXES
- Exceptions: narrowed bare except to (ValueError, pd.errors.ParserError) and
(json.JSONDecodeError, OSError) throughout; every raise ValueError now includes a message
- Deduplication: extracted deduplicate() helper in _parse_table; setdefault() for
dict building in benyehuda and apkg_builder; list comprehension in benyehuda
- Correctness: limit=0 guard fixed (is not None); audio tag parsing uses
removeprefix/removesuffix instead of magic offsets; vectorized pandas sum
- Constants: BINYAN_NAMES extracted; unicodedata imports moved to top level
- benyehuda load(): removed wasted cache read on force_rebuild; word-boundary
regex simplified from double-negative to \w
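Sketch of two of the correctness fixes above, assuming Anki-style [sound:...] tags and a list-slicing limit. Both helper names are hypothetical. str.removeprefix/removesuffix require Python 3.9+:

```python
from typing import Optional


def sound_filename(tag: str) -> str:
    """Extract the filename from '[sound:name.mp3]' without magic offsets."""
    return tag.removeprefix("[sound:").removesuffix("]")


def apply_limit(rows: list, limit: Optional[int] = None) -> list:
    """Apply a row limit; 'is not None' keeps limit=0 meaning 'no rows'."""
    # A truthiness check ('if limit:') would wrongly treat 0 as 'no limit'.
    return rows[:limit] if limit is not None else rows
```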
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Conjugation extraction:
- Active entries now extract active forms only (no auto passive partner)
- Passive (# 3ms:) entries extract passive section only via new
_extract_passive_from_active_slug(); search-based fallback also uses
this path so no active forms leak into passive entries
- # slug: VERB SLUG override syntax for search-ambiguous active verbs
- # 3ms: FORM ACTIVE-SLUG syntax for passive entries with known active page
- Fixed verb spellings: בוטל (was בותל), slug overrides for תואם →
2344-letaem, זוכה → 503-lezakot, לָשִׂים → 45-lasim, העבר → 1442-lehaavir
Card UX:
- Passive card front: shows active partner infinitive (e.g. לְבַטֵּל) with
(סָבִיל) inline in smaller font instead of bare 3ms past form
- Removed פָּעִיל label from active cards; only passive cards carry voice label
- New cards introduced in random order (new.order=0 via _RandomOrderPackage)
- Frequency badge: words outside top 50k show "50k+" instead of blank
README: updated CLI options, output files table, pipeline list, and card
descriptions to reflect Sprint 3 state
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements four major improvements to the Pealim Anki deck pipeline:
1. Automated .apkg generation (genanki) — no more manual Anki Desktop step.
Both vocabulary and conjugation decks are built programmatically.
2. Word frequency ranking from hermitdave/FrequencyWords he_50k corpus.
Notes sorted by rank so Anki presents most common words first.
3. Example sentences from Ben Yehuda public domain corpus (not pealim.com).
Downloads txt_stripped.zip, indexes 25k texts, ~89% coverage on test set.
4. Conjugation drill deck — one card per form × verb.
Input: verbs_input.txt (Hebrew infinitives). Initial set: 7 verbs (one
per binyan). Extracts 28 forms each via pealim.com/search/ + table parse.
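Sketch of the rank-based ordering from item 2, assuming ranks are held in a plain word-to-rank dict (a hypothetical shape; frequency_lookup.py may expose something different). Words missing from the corpus sort last:

```python
import math


def sort_by_frequency(words: list[str], ranks: dict[str, int]) -> list[str]:
    """Order words by corpus rank, most common first; unranked words go last."""
    return sorted(words, key=lambda word: ranks.get(word, math.inf))
```

sorted() is stable, so unranked words keep their original relative order.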
New files:
apkg_builder.py — genanki deck builder for both decks
benyehuda.py — Ben Yehuda corpus downloader + sentence indexer
frequency_lookup.py — FrequencyWords downloader + rank lookup
verbs_input.txt — verb input list (7 test verbs, one per binyan)
data/ — baseline CSVs + generated caches
Updated:
conjugation_extract.py — rewritten: reads verbs_input.txt, searches
/search/?q= for slug, parses table by row labels
requirements.txt — add genanki, beautifulsoup4, lxml
run.py — full orchestration pipeline with CLI flags
.gitignore — exclude venv/, benyehuda_index.json, audio/, output/
CLI:
python run.py --skip-scrape --skip-audio --test 20 (quick test)
python run.py --skip-scrape (full build)
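Sketch of how the three flags shown above could be declared with argparse. This is not the actual run.py; the help texts and the reading of --test N as a word limit are assumptions:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare only the flags used in the CLI examples above."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("--skip-scrape", action="store_true",
                        help="reuse previously scraped data")
    parser.add_argument("--skip-audio", action="store_true",
                        help="do not download audio")
    parser.add_argument("--test", type=int, default=None, metavar="N",
                        help="assumed meaning: limit the build to N words")
    return parser


# Equivalent of: python run.py --skip-scrape --skip-audio --test 20
args = build_parser().parse_args(["--skip-scrape", "--skip-audio", "--test", "20"])
```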
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>