Commit graph

7 commits

Author SHA1 Message Date
f3496998f5 feat: confusables show ktiv male, emoji/prep stripping fully upstream
- Confusables deck front now shows shared ktiv male form instead of
  nikkud variants joined by "/". Back still shows nikkud with definitions.
- Fixed list scraper EMOJI_RE to catch variation selectors (U+FE0F) and
  ZWJ (U+200D) — cleaned 17 entries with leftover selectors in meaning.
- Removed build-time prep extraction fallback (0 entries relied on it).
- release.py: fix keeshare field name (API_TOKEN → password).

Closes: Pealim #11 (emoji/prep upstream), Pealim #16 (confusables ktiv male)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-21 02:19:03 +00:00
0a85291975 feat(v0.20): adaptive sentence difficulty scoring — 3163 scored cloze sentences
Replaces length-based sentence selection with frequency-based difficulty
scoring. Easiest sentences (most common context words) become cloze candidates.

Score range: 1-50000, median=179. Coverage: 3163/3325 cloze entries scored.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 13:35:23 +00:00
af186e2030 Sprint 17: homograph example dedup + plural audio + prep extraction
- Homograph collision fix: _deduplicate_confusable_examples() clears
  shared examples from less-common confusable group members (36 entries
  fixed). Keeps examples only on highest-frequency meaning.
- Plural deck audio: wired up PluralAudio field in apkg_builder.py,
  downloaded 613 plural audio files from pealim.com for all deck entries.
- Prep extraction upstream: moved Hebrew preposition parsing from build
  time into list/detail scrapers (SCHEMA.yaml prep field added).
- Validation: new no_shared_confusable_examples check in validate_data.py
- Tests: 9 new unit tests for confusable deduplication (98 total)
- Release: v0.19

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 21:51:35 +00:00
c85063ee2f Sprint 15: example sentence pipeline overhaul + corpus expansion + card improvements
- Regenerated all example sentences from scratch (deleted legacy + stale entries)
- Added .txt file support to epub_examples.py for Ben Yehuda corpus
- 7 Ben Yehuda nikkud'd children's texts + 3 new Time Tunnel EPUBs
- Maqaf-stripped construct form indexing (+68% inflected matches)
- Total: 3,598 words with examples, 3,289 with cloze (was ~2,900)
- Cloze prefix preservation (_cloze_prefix_len)
- Hebrew spoiler stripping from English meanings
- Gender field (זָכָר/נְקֵבָה) on vocab cards
- sec-table CSS layout for aligned key:value pairs
- Mishkal uses mishkal_hebrew on plural cards
- Improved mishkal extraction from pealim detail pages
- 21 new pytest tests (cloze, PoS, Hebrew stripping, gender, mishkal)
- 2 new validate_data.py tests + mishkal stats
- Colliding forms tracking (local-only)
- Release tag v0.17

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 10:44:14 +00:00
3b0f9defa9 feat: YAP-cleaned frequency corpus + two-tier assignment pipeline
- Add clean_frequency_corpus.py: YAP morphological analyzer removes
  prefix+word combos (e.g. בבית=ב+בית) from he_50k frequency data.
  Headwords always protected. 30,430 clean entries from 49,999 raw.
- Add assign_frequency.py: two-tier assignment with PoS-aware homograph
  handling. Tier 1 matches headwords; Tier 2 matches inflections (any rank)
  and conjugations (rank>5000 only, to avoid false positives).
  Function words claim frequency over content words in homograph groups,
  with manual overrides for 12 common dual-use words.
- frequency_lookup.py auto-prefers frequency_clean.json when available
- 6,691 entries now have frequency (was 5,974), 717 newly assigned

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 06:22:55 +00:00
04a4b52113 fix: deduplicate 66 plural GUIDs for homograph nouns
Homographs (same nikkud form, different meanings) had identical
plurals_guid values. Regenerated unique GUIDs by including meaning
in the hash. Also updated build-time fallback to use meaning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 04:12:45 +00:00
08fb7009d8 Sprint 11: unified JSON architecture + consolidated scraping pipeline
Migrate from fragmented CSV + 10 JSON files to a single data/words.json
(9,104 entries) as the unified data store. All GUIDs preserved for Anki
study progress continuity.

New files:
- SCHEMA.yaml: authoritative schema for words.json
- pealim_list_scrape.py: consolidated list page scraper → words.json
- pealim_detail_scrape.py: noun/verb detail scraper → words.json
- pealim_audio_download.py: audio downloader reading from words.json
- scripts/migrate_to_json.py: one-time CSV→JSON migration
- scripts/validate_data.py: 17 data integrity tests
- scripts/check_guid_coverage.py: GUID preservation checker
- scripts/repair_slugs.py: slug deduplication repair tool
- tests/test_scraper_integration.py: live scraper integration tests

Updated:
- apkg_builder.py: reads from words.json (no more pandas)
- run.py: 8-step pipeline (list scrape → frequency → examples →
  detail scrape → audio download → fonts → images → build)
- benyehuda.py, frequency_lookup.py, image_fetch.py: TODO markers
  for future words.json integration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 10:54:58 +00:00