hebrew_flash_cards/data
Sochen 08fb7009d8 Sprint 11: unified JSON architecture + consolidated scraping pipeline
Migrate from fragmented CSV + 10 JSON files to a single data/words.json
(9,104 entries) as the unified data store. All GUIDs preserved for Anki
study progress continuity.
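
A minimal sketch of the CSV→JSON idea described above, not the actual scripts/migrate_to_json.py. The column names (`guid`, `hebrew`, `meaning`) and the helper `migrate_rows` are assumptions made for illustration; the real layout is defined in SCHEMA.yaml. The point it shows is that each row's GUID is copied through unchanged, so Anki can keep matching existing notes.

```python
import csv
import io
import json


def migrate_rows(csv_text, guid_field="guid"):
    """Convert CSV rows into JSON-ready dicts, copying each GUID unchanged.

    Hypothetical helper: the field names are assumptions, not the real schema.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    entries = []
    seen = set()
    for row in reader:
        guid = row[guid_field]
        # A duplicate GUID would merge two notes in Anki, so fail loudly.
        if guid in seen:
            raise ValueError(f"duplicate GUID: {guid}")
        seen.add(guid)
        entries.append(dict(row))
    return entries


# Made-up sample data, standing in for one of the legacy CSV files.
sample_csv = "guid,hebrew,meaning\nabc123,שלום,peace\ndef456,ספר,book\n"
entries = migrate_rows(sample_csv)

# ensure_ascii=False keeps the Hebrew text readable in the output file.
words_json = json.dumps(entries, ensure_ascii=False, indent=2)
```

Raising on a duplicate GUID rather than silently overwriting is a design choice for this sketch; the real migration may handle collisions differently.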

New files:
- SCHEMA.yaml: authoritative schema for words.json
- pealim_list_scrape.py: consolidated list page scraper → words.json
- pealim_detail_scrape.py: noun/verb detail scraper → words.json
- pealim_audio_download.py: audio downloader reading from words.json
- scripts/migrate_to_json.py: one-time CSV→JSON migration
- scripts/validate_data.py: 17 data integrity tests
- scripts/check_guid_coverage.py: GUID preservation checker
- scripts/repair_slugs.py: slug deduplication repair tool
- tests/test_scraper_integration.py: live scraper integration tests
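
As a rough illustration of what a GUID preservation check could look like (this is not the contents of scripts/check_guid_coverage.py), the sketch below compares a set of legacy GUIDs against the entries of the unified store. The function name `missing_guids` and the `guid` field name are assumptions.

```python
def missing_guids(legacy_guids, entries, guid_field="guid"):
    """Return legacy GUIDs that no longer appear in the unified entries.

    Hypothetical helper: an empty result means every old note still has
    a matching entry, so study progress would carry over.
    """
    current = {entry[guid_field] for entry in entries}
    return sorted(set(legacy_guids) - current)


# Made-up data for illustration only.
legacy = ["abc123", "def456", "ghi789"]
unified = [{"guid": "abc123"}, {"guid": "def456"}]
lost = missing_guids(legacy, unified)
```

In this made-up example `lost` contains the single GUID that was dropped, which a checker could report before a deck is rebuilt.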

Updated:
- apkg_builder.py: reads from words.json (no more pandas)
- run.py: 8-step pipeline (list scrape → frequency → examples →
  detail scrape → audio download → fonts → images → build)
- benyehuda.py, frequency_lookup.py, image_fetch.py: TODO markers
  for future words.json integration
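
The 8-step order listed for run.py can be sketched as a simple sequential runner. This is an assumption about the shape of the pipeline, not the code in run.py: the step names come from the list above, while `run_pipeline` and the no-op handlers are placeholders.

```python
# Step names taken from the commit message; the order matters because
# later steps (detail scrape, audio download) read what earlier steps wrote.
PIPELINE_STEPS = [
    "list scrape",
    "frequency",
    "examples",
    "detail scrape",
    "audio download",
    "fonts",
    "images",
    "build",
]


def run_pipeline(steps, handlers):
    """Run each step's handler in order and return the names completed.

    Hypothetical runner: a handler that raises stops the pipeline, so a
    failed scrape never reaches the build step.
    """
    completed = []
    for name in steps:
        handlers[name]()
        completed.append(name)
    return completed


# Placeholder handlers that do nothing, just to show the call shape.
noop_handlers = {name: (lambda: None) for name in PIPELINE_STEPS}
done = run_pipeline(PIPELINE_STEPS, noop_handlers)
```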

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 10:54:58 +00:00
fonts feat: Sprint 3 — Heebo font files, image fetch, verb validator scripts 2026-03-03 08:37:08 +00:00
conjugations.json Sprint 9: cloze cards, plurals deck, project reorg, lint tooling 2026-03-07 08:09:39 +00:00
emoji_lookup.json feat: curated emoji denylist, vocab audio URLs in CSV 2026-03-06 12:29:15 +00:00
epub_sentence_index.json Sprint 9: cloze cards, plurals deck, project reorg, lint tooling 2026-03-07 08:09:39 +00:00
examples_cache.json Sprint 9: cloze cards, plurals deck, project reorg, lint tooling 2026-03-07 08:09:39 +00:00
frequency_cache.json feat: add apkg builder, frequency, Ben Yehuda examples, conjugation deck 2026-03-03 01:58:31 +00:00
hebrew_dict.csv v0.14: rescrape vocab, formatting fixes for all decks 2026-03-07 09:26:41 +00:00
hebrew_dict_for_anki.csv Sprint 11: unified JSON architecture + consolidated scraping pipeline 2026-03-08 10:54:58 +00:00
ktiv_male_forms.json v0.14: rescrape vocab, formatting fixes for all decks 2026-03-07 09:26:41 +00:00
legacy_guid_map.json Sprint 9: cloze cards, plurals deck, project reorg, lint tooling 2026-03-07 08:09:39 +00:00
noun_plurals.json v0.14: rescrape vocab, formatting fixes for all decks 2026-03-07 09:26:41 +00:00
noun_slug_map.json Sprint 9: cloze cards, plurals deck, project reorg, lint tooling 2026-03-07 08:09:39 +00:00
pealim_dict.csv feat: add apkg builder, frequency, Ben Yehuda examples, conjugation deck 2026-03-03 01:58:31 +00:00
pealim_dict_for_anki.csv feat: add apkg builder, frequency, Ben Yehuda examples, conjugation deck 2026-03-03 01:58:31 +00:00
refined_meanings.json Sprint 9: cloze cards, plurals deck, project reorg, lint tooling 2026-03-07 08:09:39 +00:00
vetted_sentences.json Sprint 9: cloze cards, plurals deck, project reorg, lint tooling 2026-03-07 08:09:39 +00:00
vocab_sentence_matches.json Sprint 9: cloze cards, plurals deck, project reorg, lint tooling 2026-03-07 08:09:39 +00:00
words.json Sprint 11: unified JSON architecture + consolidated scraping pipeline 2026-03-08 10:54:58 +00:00