hebrew_flash_cards/scripts
Sochen 08fb7009d8 Sprint 11: unified JSON architecture + consolidated scraping pipeline
Migrate from a fragmented CSV plus 10 JSON files to a single
data/words.json (9,104 entries) as the unified data store. All GUIDs are
preserved so that existing Anki study progress carries over.
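
A minimal sketch of the CSV→JSON idea described above, not the actual scripts/migrate_to_json.py. The column names (guid, hebrew, english) and the helper name rows_to_entries are assumptions for illustration; the real field layout is defined in SCHEMA.yaml, which is not shown here.

```python
import csv
import io
import json


def rows_to_entries(csv_text):
    """Turn CSV rows into a list of dict entries, copying every column as-is.

    The 'guid' column (a hypothetical name) is carried over unchanged, which is
    the property that would keep Anki study progress attached to each note.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


# Illustrative input only; not data from the repository.
sample = "guid,hebrew,english\nabc123,שלום,hello\n"
entries = rows_to_entries(sample)

# ensure_ascii=False keeps Hebrew text readable in the JSON output.
print(json.dumps(entries, ensure_ascii=False, indent=2))
```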

New files:
- SCHEMA.yaml: authoritative schema for words.json
- pealim_list_scrape.py: consolidated list page scraper → words.json
- pealim_detail_scrape.py: noun/verb detail scraper → words.json
- pealim_audio_download.py: audio downloader reading from words.json
- scripts/migrate_to_json.py: one-time CSV→JSON migration
- scripts/validate_data.py: 17 data integrity tests
- scripts/check_guid_coverage.py: GUID preservation checker
- scripts/repair_slugs.py: slug deduplication repair tool
- tests/test_scraper_integration.py: live scraper integration tests
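
The GUID preservation check listed above could look roughly like the following. This is a sketch under assumptions, not the contents of scripts/check_guid_coverage.py: the function name missing_guids and the "guid" key are hypothetical.

```python
def missing_guids(old_guids, words):
    """Return the old GUIDs that no entry in `words` carries, sorted.

    `words` is assumed to be a list of dicts with a 'guid' key (hypothetical
    field name); an empty result means every old GUID survived the migration.
    """
    new_guids = {entry.get("guid") for entry in words}
    return sorted(set(old_guids) - new_guids)


# Illustrative data only.
old = ["a1", "b2", "c3"]
words = [{"guid": "a1"}, {"guid": "c3"}]
print(missing_guids(old, words))  # ['b2']
```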

Updated:
- apkg_builder.py: reads from words.json (no more pandas)
- run.py: 8-step pipeline (list scrape → frequency → examples →
  detail scrape → audio download → fonts → images → build)
- benyehuda.py, frequency_lookup.py, image_fetch.py: TODO markers
  for future words.json integration
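
The 8-step order listed for run.py can be pictured with a small runner like the one below. It is a sketch only: the step names come from this commit message, while run_pipeline and the no-op step functions are hypothetical stand-ins for whatever run.py actually calls.

```python
# Step names in the order given in the commit message.
STEP_NAMES = [
    "list scrape",
    "frequency",
    "examples",
    "detail scrape",
    "audio download",
    "fonts",
    "images",
    "build",
]


def run_pipeline(steps):
    """Run (name, func) pairs in order and return the names that completed.

    An exception raised by any step propagates, so later steps do not run
    on top of a failed earlier one.
    """
    completed = []
    for name, func in steps:
        func()
        completed.append(name)
    return completed


# No-op placeholders stand in for the real step implementations.
steps = [(name, lambda: None) for name in STEP_NAMES]
print(len(run_pipeline(steps)))  # 8
```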

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 10:54:58 +00:00
add_slugs.py v0.15: PoS fix, slug-based audio, CSS cleanup, template improvements 2026-03-07 17:50:23 +00:00
check_guid_coverage.py Sprint 11: unified JSON architecture + consolidated scraping pipeline 2026-03-08 10:54:58 +00:00
extract_pdf_sentences.py Sprint 9: cloze cards, plurals deck, project reorg, lint tooling 2026-03-07 08:09:39 +00:00
extract_verb_list.py Sprint 9: cloze cards, plurals deck, project reorg, lint tooling 2026-03-07 08:09:39 +00:00
migrate_to_json.py Sprint 11: unified JSON architecture + consolidated scraping pipeline 2026-03-08 10:54:58 +00:00
repair_slugs.py Sprint 11: unified JSON architecture + consolidated scraping pipeline 2026-03-08 10:54:58 +00:00
scrape_ktiv_male.py Sprint 9: cloze cards, plurals deck, project reorg, lint tooling 2026-03-07 08:09:39 +00:00
scrape_noun_plurals.py Sprint 9: cloze cards, plurals deck, project reorg, lint tooling 2026-03-07 08:09:39 +00:00
scrape_verb_ktiv.py Sprint 9: cloze cards, plurals deck, project reorg, lint tooling 2026-03-07 08:09:39 +00:00
validate_data.py Sprint 11: unified JSON architecture + consolidated scraping pipeline 2026-03-08 10:54:58 +00:00