Commit graph

7 commits

Author SHA1 Message Date
d0f4aea58d feat(frequency_lookup): add get_freq_data() for batch frequency access
Exposes the full word→rank dict for use by the upcoming sentence_difficulty
module without requiring per-word lookups.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-15 13:18:11 +00:00
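
A minimal sketch of the batch access described in d0f4aea58d. Only the name get_freq_data() comes from the commit message; the module-level list, the rank-by-position convention and the rank_sentence() helper are assumptions made for illustration.

```python
from __future__ import annotations

# Assumed storage: corpus words ordered from most to least frequent.
_FREQ_WORDS = ["של", "את", "על", "לא"]


def get_freq_data() -> dict[str, int]:
    """Return the full word -> rank mapping (rank 1 = most frequent)."""
    return {word: rank for rank, word in enumerate(_FREQ_WORDS, start=1)}


def rank_sentence(words: list[str]) -> list[int | None]:
    """Hypothetical caller: one dict fetch, then plain lookups per word."""
    freq = get_freq_data()
    return [freq.get(w) for w in words]
```

A caller that scores whole sentences fetches the dict once and reuses it, instead of paying a function call per word.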
3b0f9defa9 feat: YAP-cleaned frequency corpus + two-tier assignment pipeline
- Add clean_frequency_corpus.py: YAP morphological analyzer removes
  prefix+word combos (e.g. בבית=ב+בית) from he_50k frequency data.
  Headwords always protected. 30,430 clean entries from 49,999 raw.
- Add assign_frequency.py: two-tier assignment with PoS-aware homograph
  handling. Tier 1 matches headwords; Tier 2 matches inflections (any rank)
  and conjugations (rank>5000 only, to avoid false positives).
  Function words claim frequency over content words in homograph groups,
  with manual overrides for 12 common dual-use words.
- frequency_lookup.py auto-prefers frequency_clean.json when available
- 6,691 entries now have frequency (was 5,974), 717 newly assigned

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 06:22:55 +00:00
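
The two-tier rule from 3b0f9defa9 can be sketched roughly as follows. The entry field names (headword, inflections, conjugations) and the choice of the lowest matching rank in Tier 2 are assumptions; the PoS-aware homograph handling and the manual overrides are left out.

```python
from __future__ import annotations

# From the commit message: conjugations match only at rank > 5000.
CONJUGATION_MIN_RANK = 5000


def assign_frequency(entry: dict, freq: dict[str, int]) -> int | None:
    """Return a frequency rank for one entry, or None if nothing matches."""
    # Tier 1: the headword itself appears in the frequency data.
    rank = freq.get(entry["headword"])
    if rank is not None:
        return rank

    # Tier 2a: inflected forms may match at any rank.
    candidates = [freq[f] for f in entry.get("inflections", []) if f in freq]
    # Tier 2b: conjugated forms count only when they are rare enough.
    candidates += [
        freq[f]
        for f in entry.get("conjugations", [])
        if f in freq and freq[f] > CONJUGATION_MIN_RANK
    ]
    # Assumption: when several forms match, keep the most frequent one.
    return min(candidates) if candidates else None
```

The rank threshold on conjugations keeps a common conjugated form from lending its rank to an unrelated entry, which is the false-positive case the commit mentions.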
b2fef5aa8a Sprint 11.1: strip_nikkud cleanup, dead code removal, test fixes
Remove strip_nikkud from all pipeline files — use ktiv_male directly.
Fix case-insensitive binyan matching in detail scraper (og:description
uses UPPERCASE). Fix integration test slugs and test limits. Delete
legacy CSVs, stale .apkg, and dead scripts from git. Add vulture to
pre-commit hook.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 04:03:47 +00:00
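
A sketch of the case-insensitive binyan match mentioned in b2fef5aa8a. The spellings in BINYAN_NAMES and the find_binyan() helper are assumptions; the scraper's actual constant and apostrophe style may differ.

```python
from __future__ import annotations

# Assumed spellings; the real constant in the repo may differ.
BINYAN_NAMES = ("pa'al", "nif'al", "pi'el", "pu'al", "hif'il", "huf'al", "hitpa'el")


def find_binyan(description: str) -> str | None:
    """Return the binyan named in a description, ignoring letter case."""
    lowered = description.casefold()
    for name in BINYAN_NAMES:
        if name in lowered:
            return name
    return None
```

Folding the input once lets an uppercase og:description value match the lowercase constant.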
08fb7009d8 Sprint 11: unified JSON architecture + consolidated scraping pipeline
Migrate from fragmented CSV + 10 JSON files to a single data/words.json
(9,104 entries) as the unified data store. All GUIDs preserved for Anki
study progress continuity.

New files:
- SCHEMA.yaml: authoritative schema for words.json
- pealim_list_scrape.py: consolidated list page scraper → words.json
- pealim_detail_scrape.py: noun/verb detail scraper → words.json
- pealim_audio_download.py: audio downloader reading from words.json
- scripts/migrate_to_json.py: one-time CSV→JSON migration
- scripts/validate_data.py: 17 data integrity tests
- scripts/check_guid_coverage.py: GUID preservation checker
- scripts/repair_slugs.py: slug deduplication repair tool
- tests/test_scraper_integration.py: live scraper integration tests

Updated:
- apkg_builder.py: reads from words.json (no more pandas)
- run.py: 8-step pipeline (list scrape → frequency → examples →
  detail scrape → audio download → fonts → images → build)
- benyehuda.py, frequency_lookup.py, image_fetch.py: TODO markers
  for future words.json integration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 10:54:58 +00:00
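
The GUID preservation check in 08fb7009d8 could look roughly like this. The guid field name and the missing_guids() helper are hypothetical; the actual layout is defined by SCHEMA.yaml, which is not shown here.

```python
def missing_guids(old_guids: set[str], entries: list[dict]) -> set[str]:
    """Return the GUIDs from the old data that no migrated entry carries."""
    # Assumed field name: each words.json entry stores its note GUID under "guid".
    new_guids = {entry["guid"] for entry in entries}
    return old_guids - new_guids
```

An empty result means every existing note keeps its GUID, so Anki can match the rebuilt notes to existing study progress.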
17f7458d19 Sprint 9: cloze cards, plurals deck, project reorg, lint tooling
- Cloze card pipeline: 924 cards from 2,296 AI-vetted Hebrew book sentences
- Plurals deck: 375 notes (144 irregular + 231 regular from 86 mishkal patterns)
- Ktiv male forms expanded to 20,711 entries for sentence matching
- Project reorg: helpers.py (deduped strip_nikkud from 10 files), scripts/ for
  one-off tools, tests/ with smoke tests, deleted 3 dead files
- Lint tooling: pyproject.toml with ruff/vulture/bandit/pytest config, .editorconfig,
  fixed all 129 ruff errors (B023 closure fix, SIM103, unused vars)
- validate_apkg.py: card count range check for optional cloze template
- Data caches committed: vetted_sentences, ktiv_male_forms, noun_plurals,
  noun_slug_map, vocab_sentence_matches, epub_sentence_index

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 08:09:39 +00:00
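
For context, a shared strip_nikkud helper of the kind consolidated into helpers.py in 17f7458d19 might look like the sketch below. This is one common approach (decompose, then drop combining marks) and not necessarily the repo's implementation.

```python
import unicodedata


def strip_nikkud(text: str) -> str:
    """Remove Hebrew vowel points and other combining marks from text."""
    # NFD splits presentation forms into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Hebrew points and cantillation marks are nonspacing marks (category Mn).
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
```

Note that this removes every nonspacing mark, not only Hebrew ones, which is fine for Hebrew-only input but worth narrowing for mixed-script text.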
bb79725a7f Python code cleanup (python-pro review)
- Type annotations: dict|None defaults, return types, nested func annotations
- Dead code: removed unused row_forms_with_audio(), duplicate _strip_nikkud defs,
  redundant guards, duplicate 'ism' in ABSTRACT_SUFFIXES
- Exceptions: narrowed bare except to (ValueError, pd.errors.ParserError) and
  (json.JSONDecodeError, OSError) throughout; every raise ValueError now has a message
- Deduplication: extracted deduplicate() helper in _parse_table; setdefault() for
  dict building in benyehuda and apkg_builder; list comprehension in benyehuda
- Correctness: limit=0 guard fixed (is not None); audio tag parsing uses
  removeprefix/removesuffix instead of magic offsets; vectorized pandas sum
- Constants: BINYAN_NAMES extracted; unicodedata imports moved to top level
- benyehuda load(): removed wasted cache read on force_rebuild; word-boundary
  regex simplified from double-negative to \w

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-04 07:33:14 +00:00
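
Two of the correctness fixes in bb79725a7f, sketched in isolation. The function names take() and audio_filename() are hypothetical, and the audio tag is assumed to use the Anki sound-tag form [sound:name.mp3].

```python
from __future__ import annotations


def take(items: list[str], limit: int | None = None) -> list[str]:
    """Return at most `limit` items; None means no limit."""
    # A truthiness test (`if limit:`) would treat limit=0 as "no limit".
    if limit is not None:
        return items[:limit]
    return items


def audio_filename(tag: str) -> str:
    """Extract the file name from a tag such as "[sound:name.mp3]"."""
    # removeprefix/removesuffix (Python 3.9+) replace slicing by magic offsets
    # and leave the string unchanged when the affix is absent.
    return tag.removeprefix("[sound:").removesuffix("]")
```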
b086123bec feat: add apkg builder, frequency, Ben Yehuda examples, conjugation deck
Implements four major improvements to the Pealim Anki deck pipeline:

1. Automated .apkg generation (genanki) — no more manual Anki Desktop step.
   Both vocabulary and conjugation decks are built programmatically.

2. Word frequency ranking from hermitdave/FrequencyWords he_50k corpus.
   Notes sorted by rank so Anki presents most common words first.

3. Example sentences from Ben Yehuda public domain corpus (not pealim.com).
   Downloads txt_stripped.zip, indexes 25k texts, ~89% coverage on test set.

4. Conjugation drill deck — one card per form × verb.
   Input: verbs_input.txt (Hebrew infinitives). Initial set: 7 verbs (one
   per binyan). Extracts 28 forms each via pealim.com/search/ + table parse.

New files:
  apkg_builder.py     — genanki deck builder for both decks
  benyehuda.py        — Ben Yehuda corpus downloader + sentence indexer
  frequency_lookup.py — FrequencyWords downloader + rank lookup
  verbs_input.txt     — verb input list (7 test verbs, one per binyan)
  data/               — baseline CSVs + generated caches

Updated:
  conjugation_extract.py — rewritten: reads verbs_input.txt, searches
                           /search/?q= for slug, parses table by row labels
  requirements.txt       — add genanki, beautifulsoup4, lxml
  run.py                 — full orchestration pipeline with CLI flags
  .gitignore             — exclude venv/, benyehuda_index.json, audio/, output/

CLI:
  python run.py --skip-scrape --skip-audio --test 20  (quick test)
  python run.py --skip-scrape                          (full build)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 01:58:31 +00:00
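
Points 2 and 4 of b086123bec can be sketched as follows. The note layout (a dict with a rank key), the placement of unranked notes at the end, and both helper names are assumptions made for illustration.

```python
from __future__ import annotations

from itertools import product


def sort_by_rank(notes: list[dict]) -> list[dict]:
    """Order notes so the most frequent words (lowest rank) come first."""
    # Assumption: notes without a rank (None) sort after all ranked notes.
    return sorted(notes, key=lambda n: (n["rank"] is None, n["rank"] or 0))


def drill_cards(verbs: list[str], forms: list[str]) -> list[tuple[str, str]]:
    """One card per form x verb: the Cartesian product of both lists."""
    return list(product(verbs, forms))
```

With the initial set from the commit message (7 verbs, 28 forms each), drill_cards() yields 7 x 28 = 196 cards.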