hebrew_flash_cards/frequency_lookup.py
Sochen 08fb7009d8 Sprint 11: unified JSON architecture + consolidated scraping pipeline
Migrate from fragmented CSV + 10 JSON files to a single data/words.json
(9,104 entries) as the unified data store. All GUIDs preserved for Anki
study progress continuity.

New files:
- SCHEMA.yaml: authoritative schema for words.json
- pealim_list_scrape.py: consolidated list page scraper → words.json
- pealim_detail_scrape.py: noun/verb detail scraper → words.json
- pealim_audio_download.py: audio downloader reading from words.json
- scripts/migrate_to_json.py: one-time CSV→JSON migration
- scripts/validate_data.py: 17 data integrity tests
- scripts/check_guid_coverage.py: GUID preservation checker
- scripts/repair_slugs.py: slug deduplication repair tool
- tests/test_scraper_integration.py: live scraper integration tests

Updated:
- apkg_builder.py: reads from words.json (no more pandas)
- run.py: 8-step pipeline (list scrape → frequency → examples →
  detail scrape → audio download → fonts → images → build)
- benyehuda.py, frequency_lookup.py, image_fetch.py: TODO markers
  for future words.json integration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 10:54:58 +00:00


#!/usr/bin/env python3
"""
Hebrew word frequency lookup from hermitdave/FrequencyWords corpus.

Downloads he_50k.txt once; subsequent runs read from cache.

Exposed API: get_frequency_rank(word_no_nikkud) -> int | None

TODO: Rewrite to update words.json frequency field directly instead of
writing to a separate frequency_cache.json. Currently the migration script
bridges the gap. See Phase 5 in SPRINT_LOG.md.
"""
import json
import logging
from pathlib import Path

import requests

from helpers import strip_nikkud as _strip_nikkud

logger = logging.getLogger(__name__)

FREQ_URL = "https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content/2016/he/he_50k.txt"
CACHE_PATH = Path(__file__).parent / "data" / "frequency_cache.json"
REQUEST_TIMEOUT = 30

# Module-level cache: word_no_nikkud -> rank (1 = most common)
_freq: dict[str, int] = {}

def load(cache_path: Path = CACHE_PATH) -> None:
    """Load frequency data from cache, downloading if not present."""
    global _freq

    if cache_path.exists():
        with open(cache_path, encoding="utf-8") as f:
            _freq = json.load(f)
        logger.info(f"Frequency cache loaded: {len(_freq)} entries")
        return

    logger.info("Downloading FrequencyWords he_50k.txt …")
    resp = requests.get(FREQ_URL, timeout=REQUEST_TIMEOUT)
    resp.raise_for_status()

    rank = 1
    for line in resp.text.splitlines():
        line = line.strip()
        if not line:
            continue
        word = _strip_nikkud(line.split()[0])
        if word and word not in _freq:
            _freq[word] = rank
            rank += 1

    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(_freq, f, ensure_ascii=False)
    logger.info(f"Frequency cache saved: {len(_freq)} entries → {cache_path}")

def get_frequency_rank(word_no_nikkud: str) -> int | None:
    """
    Return the frequency rank of a word (1 = most common).

    Returns None if not found in the corpus.
    Strips nikkud from the input before lookup.
    """
    if not _freq:
        load()
    clean = _strip_nikkud(word_no_nikkud.strip())
    return _freq.get(clean)

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
    load()
    tests = ["שלום", "ספר", "בית", "מים", "כלב"]
    for w in tests:
        print(f"{w}: rank {get_frequency_rank(w)}")
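
The ranking step in load() can be tried without network access. The following is a minimal sketch, not the module's actual code path. It assumes each corpus line has the form "word count", and that ranks are assigned consecutively to unique words after nikkud stripping. The strip_nikkud function here is a hypothetical stand-in for helpers.strip_nikkud (which is not shown in this file), and build_ranks is an illustrative name.

```python
import unicodedata


def strip_nikkud(text: str) -> str:
    """Hypothetical stand-in for helpers.strip_nikkud.

    Drops every Unicode nonspacing mark (category "Mn"), which covers
    the Hebrew vowel points and cantillation marks.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def build_ranks(corpus_text: str) -> dict[str, int]:
    """Map each unique stripped word to its rank (1 = most common).

    Assumes lines are already sorted from most to least frequent and
    that the first whitespace-separated token on each line is the word.
    """
    ranks: dict[str, int] = {}
    for line in corpus_text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word = strip_nikkud(parts[0])
        if word and word not in ranks:
            ranks[word] = len(ranks) + 1
    return ranks


# The third entry is the first word written with vowel points
# (shin, segol, shin dot, lamed); after stripping it is a duplicate.
pointed = "\u05e9\u05b6\u05c1\u05dc"
sample = f"של 100\nאת 90\n\n{pointed} 5\nלא 80\n"
ranks = build_ranks(sample)
```

Here the pointed duplicate is skipped, so the three remaining words receive ranks 1, 2 and 3 in order of appearance.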