Commit graph

16 commits

Author SHA1 Message Date
efd0745ada Sprint 14: deck template/CSS overhaul + Sprint 12 detail scrape
Template & CSS fixes (15 items from Mar 9 feedback):
- Fix conjugation front showing 3ms form instead of infinitive
- Rename conjugation model to "Hebrew Conjugation"
- Strip Hebrew parenthesized text from English meanings
- Shoresh separator: spaces → dots (א.כ.ל)
- Remove duplicate English meaning from cloze back
- Remove example sentences from vocab front/back (cloze only)
- Center-align audio buttons on all decks
- Fix parenthesis spacing: "you(feminine,singular)" → "you (feminine, singular)"
- Unify sec-key/sec-label fonts, make keys bold
- Size overhaul: bigger Hebrew (42px), meaning (34px), secondary (28px)
- Center-align related words groups
- Sort confusables by average frequency
- Plurals: show Gender (Hebrew) before Mishkal, strip emoji from meaning
- Clean duplicate quotation marks in cloze sentences

Sprint 12 carry-forward (detail scrape + EPUB):
- Adjective/preposition detail scraping in pealim_detail_scrape.py
- EPUB example matching rewrite in epub_examples.py
- Delete benyehuda.py and rebuild_sentence_matches.py (merged)
- 49 parser tests for detail scraping
- SCHEMA.yaml updates for new fields

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 07:44:47 +00:00
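Note on the text cleanups in this commit: the parenthesis-spacing and shoresh-separator items can be illustrated with a minimal sketch. The function names below are hypothetical and the regular expressions are only one plausible way to get the stated before/after results; the repository's actual implementation is not shown in this log.

```python
import re


def fix_paren_spacing(text: str) -> str:
    """Normalize spacing around parentheses and commas in an English gloss."""
    # Put a space before "(" when it is glued to the preceding word.
    text = re.sub(r"(?<=\w)\(", " (", text)
    # Use exactly one space after a comma that is followed by more text.
    return re.sub(r",\s*(?=\S)", ", ", text)


def dot_shoresh(shoresh: str) -> str:
    """Join space-separated root letters with dots (hypothetical helper)."""
    return ".".join(shoresh.split())
```

Both helpers leave already-clean input unchanged, so they could be run repeatedly over the same data without drifting.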
b2fef5aa8a Sprint 11.1: strip_nikkud cleanup, dead code removal, test fixes
Remove strip_nikkud from all pipeline files — use ktiv_male directly.
Fix case-insensitive binyan matching in detail scraper (og:description
uses UPPERCASE). Fix integration test slugs and test limits. Delete
legacy CSVs, stale .apkg, and dead scripts from git. Add vulture to
pre-commit hook.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 04:03:47 +00:00
a1d970a782 fix: reorder pipeline — detail scrape immediately after list scrape
List scrape captures slugs needed by detail scrape, so they should be
adjacent. Reordered: list→detail→frequency→examples→audio→fonts→images→build

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 19:16:57 +00:00
08fb7009d8 Sprint 11: unified JSON architecture + consolidated scraping pipeline
Migrate from fragmented CSV + 10 JSON files to a single data/words.json
(9,104 entries) as the unified data store. All GUIDs preserved for Anki
study progress continuity.

New files:
- SCHEMA.yaml: authoritative schema for words.json
- pealim_list_scrape.py: consolidated list page scraper → words.json
- pealim_detail_scrape.py: noun/verb detail scraper → words.json
- pealim_audio_download.py: audio downloader reading from words.json
- scripts/migrate_to_json.py: one-time CSV→JSON migration
- scripts/validate_data.py: 17 data integrity tests
- scripts/check_guid_coverage.py: GUID preservation checker
- scripts/repair_slugs.py: slug deduplication repair tool
- tests/test_scraper_integration.py: live scraper integration tests

Updated:
- apkg_builder.py: reads from words.json (no more pandas)
- run.py: 8-step pipeline (list scrape → frequency → examples →
  detail scrape → audio download → fonts → images → build)
- benyehuda.py, frequency_lookup.py, image_fetch.py: TODO markers
  for future words.json integration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 10:54:58 +00:00
2e48109d7f v0.15: PoS fix, slug-based audio, CSS cleanup, template improvements
- Fix PoS substring bug: "Pronoun" no longer matches "Noun"
- CSS: reduce sec-label/sec-key font sizes, add .definitions/.conf-entry
- Slug-based audio filenames for confusable words (no more collisions)
- Scraper captures slug from pealim.com list page links
- Confusables: RTL alignment, re-enable audio (remove all-must-have gate)
- Plurals: blue given word, gray meaning, labeled mishkal badge
- Conjugation: add "אֵיךְ אוֹמְרִים" prompt, tense prefix (בְּ),
  Prep field from HBPAREN_RE, labeled RelatedVocab
- Ben Yehuda: skip stripped fallback for confusable words
- Bump RELEASE_TAG to v0.15

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 17:50:23 +00:00
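Note on the PoS substring bug: a case-insensitive substring test such as "noun" in "pronoun" is true, which is presumably how "Pronoun" came to match "Noun". One way to avoid that is to compare whole tokens. The helper name and the shape of the part-of-speech field below are assumptions for illustration only.

```python
import re


def has_pos(pos_field: str, wanted: str) -> bool:
    """True only if `wanted` appears as a whole word in the PoS field."""
    tokens = {t.casefold() for t in re.findall(r"[A-Za-z]+", pos_field)}
    return wanted.casefold() in tokens
```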
17f7458d19 Sprint 9: cloze cards, plurals deck, project reorg, lint tooling
- Cloze card pipeline: 924 cards from 2,296 AI-vetted Hebrew book sentences
- Plurals deck: 375 notes (144 irregular + 231 regular from 86 mishkal patterns)
- Ktiv male forms expanded to 20,711 entries for sentence matching
- Project reorg: helpers.py (deduped strip_nikkud from 10 files), scripts/ for
  one-off tools, tests/ with smoke tests, deleted 3 dead files
- Lint tooling: pyproject.toml with ruff/vulture/bandit/pytest config, .editorconfig,
  fixed all 129 ruff errors (B023 closure fix, SIM103, unused vars)
- validate_apkg.py: card count range check for optional cloze template
- Data caches committed: vetted_sentences, ktiv_male_forms, noun_plurals,
  noun_slug_map, vocab_sentence_matches, epub_sentence_index

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 08:09:39 +00:00
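Note on the B023 closure fix: ruff's B023 flags functions defined in a loop that read the loop variable, because the closure sees the variable's final value rather than the value at definition time. The sketch below shows the general pattern and one common remedy (binding through a default argument); which remedy the repository used is not stated in this log.

```python
def make_getters_late_bound():
    getters = []
    for i in range(3):
        getters.append(lambda: i)  # B023: i is looked up when the lambda runs
    return getters


def make_getters_bound():
    getters = []
    for i in range(3):
        getters.append(lambda i=i: i)  # default argument captures the current i
    return getters
```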
607fd1a3bc feat: emoji Unicode lookup, conj nikkud, fix summary metric
- Emoji: _load_emoji_lookup() fetches unicode.org emoji-test.txt, builds
  {keyword: emoji_char} map cached in data/emoji_lookup.json. Falls back
  to empty dict on network failure. build_all_variants() loads once and
  passes to all build_vocab_deck() calls. For each word without pealim
  emoji, tries first 5 keywords from English meaning against lookup.
- Nikkud: זכר→זָכָר, נקבה→נְקֵבָה in PRESENT_EXPANSION constants and
  build_conj_deck() 1st-person gender labels.
- Summary: conj audio file count now excludes the _infinitive and _passive_
  files that exist on disk but are never bundled in the .apkg (was 2235, now shows ~1765).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 21:24:10 +00:00
3fc3a21a33 fix: load conjugations from cache when --skip-conjugations passed
Previously --skip-conjugations returned None, causing build_all_variants()
to produce near-empty conjugation decks (0.3MB font-only files). Now loads
from conjugations.json cache so all 6 release variants build correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 21:01:47 +00:00
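Note on the --skip-conjugations fix: the behavior described (read the cache instead of returning None) could look roughly like the sketch below. The function name, its parameters, and the empty-dict fallback for a missing cache are assumptions; only the conjugations.json cache itself is named in the commit.

```python
import json
from pathlib import Path
from typing import Callable


def load_conjugations(skip_extract: bool, cache_path: Path, extract: Callable[[], dict]) -> dict:
    """Return conjugation data, reusing the JSON cache when extraction is skipped."""
    if skip_extract:
        if cache_path.exists():
            return json.loads(cache_path.read_text(encoding="utf-8"))
        return {}  # nothing cached yet; the caller gets an empty mapping, not None
    data = extract()
    cache_path.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")
    return data
```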
ccd7d61efb Add 6-variant release build (4 vocab + 2 conj), bump to v0.12
- build_vocab_deck(): include_audio/include_images flags
- build_conj_deck(): include_audio flag
- build_all_variants(): builds all 6 apkg files in one call
- Variants: hebrew_vocabulary{,_audio,_images,_audio_images}.apkg
            hebrew_conjugations{,_audio}.apkg
- run.py: step_build_all() replaces step_build_vocab(); conjugation
  extraction reuses cached conjugations.json unless refreshed
- RELEASE_TAG bumped to v0.12

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 20:58:06 +00:00
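Note on the six variants: the brace patterns above expand to four vocabulary files and two conjugation files. A small sketch that enumerates them from the include_audio/include_images flags follows; the helper name is hypothetical and this is not the body of build_all_variants().

```python
from itertools import product


def variant_filenames() -> list[str]:
    """List the six release filenames implied by the audio/images flags."""
    names = []
    for audio, images in product((False, True), repeat=2):
        suffix = ("_audio" if audio else "") + ("_images" if images else "")
        names.append(f"hebrew_vocabulary{suffix}.apkg")
    for audio in (False, True):
        names.append(f"hebrew_conjugations{'_audio' if audio else ''}.apkg")
    return names
```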
64a1b18951 Sprint 7: emoji/prep extraction, conjugation reduction, project rename
- Item 1/2: Extract emoji and Hebrew parentheticals (prepositions) from
  Meaning field; display emoji with 3.5em font, prep inline after Hebrew
  word. Add Emoji and Prep fields to Hebrew Flash Cards model.
- Item 3: Seeded RNG per verb reduces conjugation cards by ~630 (4 present
  forms → 1 pronoun each; past_3p → 1 gender). 1st-person forms gain gender
  label (זכר/נקבה). Total: 1,834 conj cards (was ~2,464).
- Item 4: hebrew_extract.py uses BeautifulSoup to capture data-audio URLs
  from pealim.com list pages during scraping. step_audio() reads audio_url
  column from CSV (no longer needs audio_extract.py).
- Item 5: Rename to 'Hebrew Flash Cards'. New filenames: hebrew_dict.csv,
  hebrew_extract.py, hebrew_vocabulary.apkg, hebrew_conjugations.apkg.
  Deck/model names updated throughout. Forgejo repo rename pending (sochen
  lacks admin rights — Nevo must do via UI).
- Fix: Deduplicate entries with same Hebrew word before adding notes
  (eliminates GUID collisions from duplicate source CSV rows).
- Bump RELEASE_TAG to v0.11.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 05:49:51 +00:00
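Note on Item 3 (seeded RNG per verb): seeding a random.Random instance with the verb string gives a choice that differs between verbs but is reproducible across builds. The sketch below picks one pronoun per present-tense form. The PRESENT_PRONOUNS table and the function name are assumptions for illustration, not the project's constants.

```python
import random

# Candidate pronouns for each present-tense form (illustrative table).
PRESENT_PRONOUNS = {
    "ms": ["אני", "אתה", "הוא"],
    "fs": ["אני", "את", "היא"],
    "mp": ["אנחנו", "אתם", "הם"],
    "fp": ["אנחנו", "אתן", "הן"],
}


def pick_present_pronouns(verb: str) -> dict[str, str]:
    """Choose one pronoun per present form, deterministically for a given verb."""
    rng = random.Random(verb)  # string seeds are hashed deterministically
    return {form: rng.choice(options) for form, options in PRESENT_PRONOUNS.items()}
```

Under these assumptions, going from three pronouns to one for each of four present forms saves eight cards per verb, and collapsing past_3p to one gender would save one more.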
bb79725a7f Python code cleanup (python-pro review)
- Type annotations: dict|None defaults, return types, nested func annotations
- Dead code: removed unused row_forms_with_audio(), duplicate _strip_nikkud defs,
  redundant guards, duplicate 'ism' in ABSTRACT_SUFFIXES
- Exceptions: narrowed bare except to (ValueError, pd.errors.ParserError) and
  (json.JSONDecodeError, OSError) throughout; every raise ValueError now carries a message
- Deduplication: extracted deduplicate() helper in _parse_table; setdefault() for
  dict building in benyehuda and apkg_builder; list comprehension in benyehuda
- Correctness: limit=0 guard fixed (is not None); audio tag parsing uses
  removeprefix/removesuffix instead of magic offsets; vectorized pandas sum
- Constants: BINYAN_NAMES extracted; unicodedata imports moved to top level
- benyehuda load(): removed wasted cache read on force_rebuild; word-boundary
  regex simplified from double-negative to \w

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-04 07:33:14 +00:00
0686298610 Deprecate --skip-conjugations in favor of --only vocab
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-04 07:21:03 +00:00
d26e4c8ce5 feat: Sprint 3 — passive/active separation, random card order, card UX fixes
Conjugation extraction:
- Active entries now extract active forms only (no auto passive partner)
- Passive (# 3ms:) entries extract passive section only via new
  _extract_passive_from_active_slug(); search-based fallback also uses
  this path so no active forms leak into passive entries
- # slug: VERB SLUG override syntax for search-ambiguous active verbs
- # 3ms: FORM ACTIVE-SLUG syntax for passive entries with known active page
- Fixed verb spellings: בוטל (was בותל), slug overrides for תואם →
  2344-letaem, זוכה → 503-lezakot, לָשִׂים → 45-lasim, העבר → 1442-lehaavir

Card UX:
- Passive card front: shows active partner infinitive (e.g. לְבַטֵּל) with
  (סָבִיל) inline in smaller font instead of bare 3ms past form
- Removed פָּעִיל label from active cards; only passive cards carry voice label
- New cards introduced in random order (new.order=0 via _RandomOrderPackage)
- Frequency badge: words outside top 50k show "50k+" instead of blank

README: updated CLI options, output files table, pipeline list, card
descriptions to reflect Sprint 3 state

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 10:16:50 +00:00
b018f21b1d feat: Sprint 2 + Sprint 3 — verb list, audio, passive forms, CSS/UX, validation, Heebo font, images
Sprint 2:
- extract_verb_list.py (NEW): downloads Coffin & Bolozky PDF, extracts
  71-verb paradigm list from Appendix 1 with hardcoded fallback.
  Pu'al/Huf'al use '# 3ms:' prefix for 3ms search.
- conjugation_extract.py: audio URL capture per form, passive forms
  parsing (Pu'al/Huf'al partner tables), 3ms search support.
- benyehuda.py: nikkud corpus (txt.zip), index by nikkud word form,
  single best example (longest ≤200 chars), --refresh-examples rebuild.
- apkg_builder.py: Hebrew labels, centered dark Hebrew text, freq-badge,
  related words grouped by PoS. Conjugation: Voice/Audio fields,
  present-tense 12-card expansion, 2fp/3fp modern fallback with
  classical in parens, פָּעִיל/סָבִיל voice labels.
- README.md: rewritten — learner-first structure, data sources.
- run.py: --refresh-examples flag, conjugation audio download (step 4b).
- data/conjugations.json: rebuilt with 70 verbs, audio URLs, passive
  partner data.

Sprint 3:
- validate_verb_list.py (NEW): queries pealim.com for all entries in
  verb input list, classifies as OK/3ms/REVIEW/NOT_FOUND, writes
  cleaned verbs_input.txt. Results: 51 OK, 15 3ms-past, 4 REVIEW.
- apkg_builder.py: binyan in Hebrew (BINYAN_TO_HEBREW map) on its own
  line; remove "דוגמה:" label; "Other" related-words shown unlabeled;
  "50k+" freq display for unlisted words; Image field in VOCAB_MODEL.
- image_fetch.py (NEW): Wikipedia/Commons thumbnails for concrete nouns,
  caches in data/image_cache.json, downloads to data/images/.
- Heebo variable font TTF bundled in both .apkg files via @font-face.
- run.py: step_fonts(), step_images(), --skip-images flag.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 08:36:51 +00:00
b086123bec feat: add apkg builder, frequency, Ben Yehuda examples, conjugation deck
Implements four major improvements to the Pealim Anki deck pipeline:

1. Automated .apkg generation (genanki) — no more manual Anki Desktop step.
   Both vocabulary and conjugation decks are built programmatically.

2. Word frequency ranking from hermitdave/FrequencyWords he_50k corpus.
   Notes sorted by rank so Anki presents most common words first.

3. Example sentences from Ben Yehuda public domain corpus (not pealim.com).
   Downloads txt_stripped.zip, indexes 25k texts, ~89% coverage on test set.

4. Conjugation drill deck — one card per form × verb.
   Input: verbs_input.txt (Hebrew infinitives). Initial set: 7 verbs (one
   per binyan). Extracts 28 forms each via pealim.com/search/ + table parse.

New files:
  apkg_builder.py     — genanki deck builder for both decks
  benyehuda.py        — Ben Yehuda corpus downloader + sentence indexer
  frequency_lookup.py — FrequencyWords downloader + rank lookup
  verbs_input.txt     — verb input list (7 test verbs, one per binyan)
  data/               — baseline CSVs + generated caches

Updated:
  conjugation_extract.py — rewritten: reads verbs_input.txt, searches
                           /search/?q= for slug, parses table by row labels
  requirements.txt       — add genanki, beautifulsoup4, lxml
  run.py                 — full orchestration pipeline with CLI flags
  .gitignore             — exclude venv/, benyehuda_index.json, audio/, output/

CLI:
  python run.py --skip-scrape --skip-audio --test 20  (quick test)
  python run.py --skip-scrape                          (full build)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 01:58:31 +00:00
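Note on frequency ordering: sorting notes by corpus rank so the most common words come first can be sketched as below. The function name and the plain dict of ranks are assumptions; placing unranked words last is one reasonable choice, not necessarily the one made in apkg_builder.py.

```python
def sort_by_rank(words: list[str], ranks: dict[str, int]) -> list[str]:
    """Order words by frequency rank (1 = most common); unranked words go last."""
    return sorted(words, key=lambda w: ranks.get(w, float("inf")))
```

Because sorted() is stable, words that are missing from the frequency list keep their original relative order at the end.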
e23b353064 Improve scraper robustness and Hebrew text handling
2026-02-26 21:57:20 +00:00