feat: pseudo-frequency for confusables using English word frequency

264 confusable groups where all entries shared the same Hebrew frequency now have differentiated pseudo_frequency values based on English word commonality (hermitdave en_50k.txt). Most common meaning keeps base rank; less common meanings get +100 offset per position. Examples: - אב: "father" (en:194) → 2491, "bud" (en:2963) → 2591 - אח: "brother" (en:300) → 911, "fireplace" (en:9389) → 1011 Builder uses pseudo_frequency for sort order when available. Confusable card definitions now sorted most-common-first. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: vet fallback emoji — verb gate + expanded stop list removes 852 bad matches
2026-04-03 05:28:30 +00:00 · 2026-04-03 05:17:31 +00:00 · 2026-03-26 23:19:43 +00:00 · 2026-03-21 02:19:03 +00:00 · 2026-03-15 14:09:19 +00:00 · 2026-03-15 13:35:23 +00:00
17 changed files with 103719 additions and 32828 deletions
--- a/SCHEMA.yaml
+++ b/SCHEMA.yaml
@ -27,6 +27,7 @@ entry:
  pos_hebrew: "שֵׁם עֶצֶם"          # Part of speech in Hebrew (with nikkud)
  meaning: "father"                # English meaning (cleaned — no inline emoji, no Hebrew prepositions)
  meaning_raw: "father 👨"         # Original meaning as scraped (may contain emoji and/or Hebrew preps)
  prep: "על"                       # Hebrew preposition(s) governing this word, extracted from meaning_raw (e.g. "(על)" → "על"); null if none
  audio_url: "https://..."         # Pealim audio URL
  audio_file: "6009-av.mp3"        # Local filename (slug-based for confusables, consonant-based otherwise)
  tags: ""                         # Pealim tags if any
@ -68,6 +69,7 @@ entry:
      cloze_word_end: 4            # End offset — enables exact extraction regardless of nikkud changes
      cloze_hint: "family member"
      cloze_guid: "def456..."      # GUID for the cloze note
      difficulty_score: 234          # Median frequency rank of context words (lower = easier); optional
    rejected_count: 0
  # --- Noun-specific: Inflection Forms ---
--- a/apkg_builder.py
+++ b/apkg_builder.py
@ -35,7 +35,7 @@ COMPLETE_PLURAL_DECK_ID = 1_234_567_903
 # Release version tag added to all notes so users can identify which release
 # their cards come from (visible in Anki's Browse view and card info).
-RELEASE_TAG = "v0.18"
+RELEASE_TAG = "v0.20"
 # Regex for extracting emoji and Hebrew prepositions from meaning strings
 EMOJI_RE = re.compile(r"[\U0001F000-\U0001FFFF\u2600-\u27FF\u2300-\u23FF\uFE00-\uFE0F]+")
@ -265,6 +265,14 @@ details[open] > .more-header::before { content: "● "; }
  text-align: center;
  margin: 0.3em 0;
 }
 .plural-direction {
  font-size: 32px;
  color: #444;
  text-align: center;
  direction: rtl;
  margin: 8px 0;
  font-weight: bold;
 }
 .card [type="button"], .card button, .replay-button {
  display: block !important;
  margin: 4px auto !important;
@ -288,7 +296,26 @@ details[open] > .more-header::before { content: "● "; }
  .related-header { color: #999; }
  .rw-word     { color: #e0e0e0; }
  .rw-meaning  { color: #999; }
  .plural-direction { color: #aaa; }
 }
 .nightMode .card        { color: #e8e8e8; background: #1c1c1e; }
 .nightMode .hebrew      { color: #f0f0f0; }
 .nightMode .hebrew-sm   { color: #e0e0e0; }
 .nightMode .meaning     { color: #82b0ff; }
 .nightMode .sec-label   { color: #e0e0e0; }
 .nightMode .sec-key     { color: #e0e0e0; }
 .nightMode .sec-val     { color: #e0e0e0; }
 .nightMode .conf-entry  { color: #ddd; }
 .nightMode .hint        { color: #777; }
 .nightMode .voice-label { color: #888; }
 .nightMode .example     { color: #e0e0e0; border-right-color: #555; }
 .nightMode .divider     { border-top-color: #333; }
 .nightMode .freq-badge  { color: #888; border-color: #444; }
 .nightMode .more-header { color: #bbb; background: #2a2a2e; border-color: #555; }
 .nightMode .related-header { color: #999; }
 .nightMode .rw-word     { color: #e0e0e0; }
 .nightMode .rw-meaning  { color: #999; }
 .nightMode .plural-direction { color: #aaa; }
 """
 # ──────────────────────────────────────────────────────────────────────────────
@ -422,7 +449,7 @@ CONJ_BACK = """
 <div class="hebrew">{{ConjugatedForm}}{{#Prep}} ({{Prep}}){{/Prep}}</div>
 {{#Audio}}<div>{{Audio}}</div>{{/Audio}}
 <details class="more-toggle"><summary class="more-header">מידע נוסף</summary>
-{{#Meaning}}<div class="sec-label" style="text-align:center;display:block;">{{Meaning}}</div>{{/Meaning}}
+{{#Meaning}}<div class="meaning" style="font-size:28px;">{{Meaning}}</div>{{/Meaning}}
 <div class="sec-table">
 <div class="sec-label"><span class="sec-key">שֹׁרֶשׁ:</span><span class="sec-val">{{Root}}</span></div>
 <div class="sec-label"><span class="sec-key">בִּנְיָן:</span><span class="sec-val">{{Binyan}}</span></div>
@ -693,6 +720,116 @@ _EMOJI_STOP = frozenset(
        "bar",
        "wheel",
        "horizontal",
        # Polysemous keywords producing wrong-sense emoji (Sprint 17 audit)
        "high",  # ⚡ high voltage, not "tall"
        "down",  # 🫳 palm down, not "descend"
        "off",  # 📴 phone off, not "remove"
        "away",  # 💨 dashing away, not "depart"
        "together",  # 🤲 palms together, not "unite"
        "top",  # 🎩 top hat, not "upper"
        "low",  # 🔈 low volume, not "short"
        "flat",  # 🥿 ballet flat, not "apartment"
        "soft",  # 🍦 soft serve, not "quiet"
        "broken",  # 💔 broken heart, not "damaged"
        "round",  # 📍 round pushpin, not "circular"
        "cool",  # 🆒 COOL button, not "cold"
        "free",  # 🆓 FREE button, not "liberated"
        "long",  # 🪘 long drum, not "lengthy"
        "straight",  # 📏 straight ruler, not "direct"
        "empty",  # 🪹 empty nest, not "void"
        "hot",  # 🥵 hot face, not "warm"
        "cross",  # ✝️ latin cross, not "intersect"
        "bright",  # 🔆 bright button, not "luminous"
        "old",  # 👴 old man, not "aged"
        "head",  # 🙂‍↔️ shaking head, not "leader"
        # Category words that match generic emoji
        "military",  # 🎖️ military medal for any military term
        "sports",  # 🏅 sports medal for any sports term
        "food",  # 😋 yummy face for any food term
        "city",  # 🇻🇦 Vatican flag for any city
        "china",  # 🇨🇳 China flag for "porcelain"
        "polish",  # 💅 nail polish for "to polish/shine"
        "aid",  # 🦻 hearing aid for "to help"
        "office",  # 🧑‍💼 office worker for "bureau"
        "construction",  # 🏛️ classical building, not construction
        "cinema",  # 🎦 cinema emoji for any film term
        "ceremony",  # 🎑 moon ceremony for any ceremony
        "building",  # 🏛️ classical building for any structure
        # Body parts / human features → wrong emoji
        "arm",  # 🦾 mechanical arm for "to arm"
        "hair",  # 👱 blond person for "hair"
        "nose",  # 😤 steam from nose
        "tongue",  # 😛 tongue-out face
        "chest",  # 🪎 not a chest
        "eyes",  # 😃 face with eyes
        # Abstract/vague words
        "fear",  # 😱 screaming face
        "anger",  # 💢 anger symbol
        "angry",  # 😠 angry face
        "tired",  # 😫 tired face
        "sad",  # 😥 sad face
        "joy",  # 😂 tears of joy
        "love",  # 💌 love letter
        "cold",  # 🥶 cold face
        "pile",  # 💩 pile of poo
        "man",  # 👨 man
        "woman",  # 👩 woman
        "boy",  # 👦 boy
        "girl",  # 👧 girl
        "baby",  # 👶 baby
        "children",  # 🚸 children crossing
        "student",  # 🧑‍🎓 student
        "adult",  # 🧑‍🧑‍🧒 family
        "name",  # 📛 name badge
        "check",  # ✅ check mark
        "line",  # 🫥 dotted line face
        "floor",  # 🤣 ROFL (rolling on floor)
        "room",  # 🧖 person in steamy room
        "bubble",  # 👁️‍🗨️ speech bubble
        "car",  # 🚃 railway car, not automobile
        "bullet",  # 🚅 bullet train
        "steam",  # 😤 face with steam
        "fly",  # 🪰 the insect, not the verb
        "plant",  # 🪴 potted plant for all "X (plant)" entries
        "tree",  # 🌲 evergreen for all "X (tree)" entries
        "ball",  # ⛹️ person bouncing ball
        "bag",  # 👝 clutch bag
        "fight",  # 🫯 not a fight
        "cloud",  # 🫯 not a cloud
        "video",  # 🎮 video game, not video
        "rescue",  # ⛑️ rescue worker helmet
        "exchange",  # 💱 currency exchange
        "cut",  # 🥩 cut of meat, not "to cut"
        "key",  # 🔐 locked with key
        "walking",  # 🚶 person walking
        "running",  # 🏃 person running
        "climbing",  # 🧗 person climbing
        "speaking",  # 🗣️ speaking head
        "playing",  # 🤽 person playing
        "feeding",  # 👩‍🍼 person feeding
        "shooting",  # 🌠 shooting star
        "clapping",  # 👏 clapping hands
        "cooking",  # 🍳 cooking emoji
        "holding",  # 🥹 face holding back tears
        # More wrong-sense matches from remaining audit
        "paper",  # 🏮 red lantern for "paper"
        "track",  # 🛤️ railroad for "track record"
        "vertical",  # 🚦 traffic light for "vertical"
        "speaker",  # 🔇 muted speaker for "speaker (person)"
        "square",  # 🟥 red square for "plaza"
        "wrapped",  # 🎁 gift for "wrapped, bound"
        "volume",  # 🔈 speaker for "volume (book)"
        "mobile",  # 📱 phone for "mobile, moveable"
        "flash",  # 📸 camera flash for "to shine"
        "identification",  # 🪪 ID card for "locating"
        "service",  # 🐕‍🦺 service dog for "service, term"
        "ground",  # ⛱️ umbrella on ground
        "machine",  # 🎰 slot machine for "mechanism"
        "liquid",  # 🫗 pouring for "liquid, drop"
        "vehicle",  # 🚙 SUV for any vehicle mention
        "window",  # 🪟 window pane for "window, gap"
        "information",  # ℹ️ info symbol
        "child",  # 🧒 child emoji
    }
 )
@ -832,9 +969,11 @@ def build_vocab_deck(
        if word_nikkud not in word_to_pos_cat:
            word_to_pos_cat[word_nikkud] = _categorize_pos(pos_raw) if pos_raw else "Other"
-    # Sort entries by frequency (null → 999999), applying limit after sort
+    # Sort entries by effective frequency (pseudo_frequency for confusables,
    # else regular frequency; null → 999999), applying limit after sort
    def _freq_key(item: tuple[str, dict]) -> int:
-        return item[1].get("frequency") or 999_999
+        e = item[1]
        return e.get("pseudo_frequency") or e.get("frequency") or 999_999
    sorted_entries = sorted(words.items(), key=_freq_key)
    if limit:
@ -860,7 +999,6 @@ def build_vocab_deck(
        meaning = re.sub(r"\s{2,}", " ", meaning).strip(", ;:")
        meaning = re.sub(r"(\w)\(", r"\1 (", meaning)  # space before opening paren
        meaning = re.sub(r",(\S)", r", \1", meaning)  # space after comma
        meaning_raw = entry.get("meaning_raw", "") or ""
        slug = entry.get("slug", "") or ""
        frequency = entry.get("frequency") or 999_999
        audio_file = entry.get("audio_file", "") or ""
@ -895,20 +1033,22 @@ def build_vocab_deck(
        else:
            freq_display = "Unlisted"
-        # Emoji: use entry's emoji if emoji_visible, else fall back to emoji_lookup
+        # Emoji: use entry's emoji if emoji_visible, else fall back to emoji_lookup.
        # Skip fallback for verbs — keyword matching on verb definitions produces
        # wrong-sense emoji (e.g. "to cut" → 🥩, "to arm" → 🦾).
        emoji_str = ""
        if entry.get("emoji_visible") and entry.get("emoji"):
            emoji_str = entry["emoji"]
-        elif not emoji_str and emoji_lookup:
+        elif emoji_lookup and not meaning.startswith("to "):
            meaning_clean_for_emoji = EMOJI_RE.sub("", meaning).strip()
            for kw in re.sub(r"[^\w\s]", " ", meaning_clean_for_emoji.lower()).split()[:5]:
                if len(kw) > 2 and kw not in _EMOJI_STOP and kw in emoji_lookup:
                    emoji_str = emoji_lookup[kw]
                    break
-        # Extract Hebrew prepositions from meaning_raw
+        # Hebrew prepositions — extracted upstream by list scraper
-        preps = HBPAREN_RE.findall(meaning_raw)
+        entry_prep = entry.get("prep")
-        prep_str = " ".join(f"({p})" for p in preps)
+        prep_str = " ".join(f"({p})" for p in entry_prep.split()) if entry_prep else ""
        # Audio — use audio_file from entry; for confusables it's already slug-based
        audio_tag = ""
@ -1118,25 +1258,12 @@ def build_conj_deck(
        root = ".".join(root_list)
        voice = VOICE_MAP.get(binyan, "")
        meaning_raw = entry.get("meaning_raw", "") or ""
        meaning = entry.get("meaning", "") or ""
-        # Extract Hebrew preposition — strip from meaning, show on Hebrew side
+        # Hebrew preposition — extracted upstream by scraper
        prep_str = ""
-        conj_prep = conj.get("prep")
+        conj_prep = conj.get("prep") or entry.get("prep")
        if conj_prep:
            # Strip any parentheses from stored prep value
            prep_str = conj_prep.strip("() ")
        elif meaning_raw:
            preps = HBPAREN_RE.findall(meaning_raw)
            if preps:
                prep_str = preps[0]
        # Strip Hebrew prepositions from English meaning to avoid duplication
        if prep_str:
            meaning = HBPAREN_RE.sub("", meaning).strip()
            # Also strip from meaning_raw patterns like "(על)"
            meaning = re.sub(r"\(\s*" + re.escape(prep_str) + r"\s*-?\s*\)", "", meaning).strip()
            # Clean up double spaces and trailing commas
            meaning = re.sub(r"\s{2,}", " ", meaning).strip(", ")
        related = [(f, w, m) for f, w, m in root_words.get(root, []) if w != infinitive]
        if related:
@ -1433,9 +1560,12 @@ def build_confusables_deck(
            guid = genanki.guid_for("confusable", entry["word"].get("ktiv_male", unique_key))
        guid_to_entries.setdefault(guid, []).append(entry)
    def _eff_freq(e: dict) -> int:
        return e.get("pseudo_frequency") or e.get("frequency") or 999_999
    for guid, group_entries in sorted(
        guid_to_entries.items(),
-        key=lambda x: sum(e.get("frequency") or 999_999 for e in x[1]) / len(x[1]),
+        key=lambda x: sum(_eff_freq(e) for e in x[1]) / len(x[1]),
    ):
        if guid in seen_guids:
            continue
@ -1454,9 +1584,13 @@ def build_confusables_deck(
                unique_entries.append(e)
        if len(unique_entries) < 2:
            continue
        # Sort by pseudo/frequency so most common meaning appears first
        unique_entries.sort(key=_eff_freq)
        if len(unique_entries) < 2:
            continue
        word_no_nik = unique_entries[0]["word"].get("ktiv_male", "")
-        words_display = " / ".join(e["word"]["nikkud"] for e in unique_entries)
+        words_display = word_no_nik  # Show ktiv male (shared form) on front
        defs_parts: list[str] = []
        audio_parts: list[str] = []
@ -1525,8 +1659,8 @@ def write_conf_apkg(
 PLURAL_FRONT_SG = """
 <div class="hebrew" style="color:#1a1a8c;">{{Singular}}</div>
 {{#SingularAudio}}<div>{{SingularAudio}}</div>{{/SingularAudio}}
-<div class="sec-label">{{Meaning}}</div>
+<div class="meaning" style="font-size:28px;">{{Meaning}}</div>
-<div class="hint" style="font-size:28px;">יָחִיד ← רַבִּים</div>
+<div class="plural-direction">יָחִיד ← רַבִּים</div>
 """
 PLURAL_BACK_SG = """
@ -1542,14 +1676,14 @@ PLURAL_BACK_SG = """
 PLURAL_FRONT_PL = """
 <div class="hebrew" style="color:#1a1a8c;">{{Plural}}</div>
 {{#PluralAudio}}<div>{{PluralAudio}}</div>{{/PluralAudio}}
-<div class="hint" style="font-size:28px;">רַבִּים ← יָחִיד</div>
+<div class="plural-direction">רַבִּים ← יָחִיד</div>
 """
 PLURAL_BACK_PL = """
 {{FrontSide}}<hr>
 <div class="hebrew">{{Singular}}</div>
 {{#SingularAudio}}<div>{{SingularAudio}}</div>{{/SingularAudio}}
-<div class="sec-label" style="text-align:center;display:block;">{{Meaning}}</div>
+<div class="meaning" style="font-size:28px;">{{Meaning}}</div>
 <div class="sec-table">
 {{#Gender}}<div class="sec-label"><span class="sec-key">מִין:</span><span class="sec-val">{{Gender}}</span></div>{{/Gender}}
 {{#Mishkal}}<div class="sec-label"><span class="sec-key">מִשְׁקָל:</span><span class="sec-val">{{Mishkal}}</span></div>{{/Mishkal}}
@ -1646,9 +1780,9 @@ def build_plural_deck(
    irregular_count = len(irregulars)
    target_regular = irregular_count * 2
    mishkal_count = len(by_mishkal) or 1
-    per_mishkal = max(2, target_regular // mishkal_count)
+    # Over-sample per mishkal to compensate for small patterns, then trim
    per_mishkal = max(3, (target_regular * 3) // (mishkal_count * 2))
    selected: list[tuple[str, dict, dict]] = list(irregulars)
    regular_pool: list[tuple[str, dict, dict]] = []
    for _mishkal, entries in sorted(by_mishkal.items()):
        entries.sort(key=lambda e: e[1].get("frequency") or 999_999)
@ -1659,7 +1793,24 @@ def build_plural_deck(
        regular_pool.sort(key=lambda e: e[1].get("frequency") or 999_999)
        regular_pool = regular_pool[:target_regular]
-    selected.extend(regular_pool)
+    # Sort both pools by frequency, then interleave for homogeneous 2:1 regular:irregular
    irregulars.sort(key=lambda e: e[1].get("frequency") or 999_999)
    regular_pool.sort(key=lambda e: e[1].get("frequency") or 999_999)
    # Interleave: for every 1 irregular, insert 2 regulars
    selected: list[tuple[str, dict, dict]] = []
    ri = 0  # regular index
    for _ii, irr in enumerate(irregulars):
        # Insert 2 regulars before each irregular (when available)
        for _ in range(2):
            if ri < len(regular_pool):
                selected.append(regular_pool[ri])
                ri += 1
        selected.append(irr)
    # Append remaining regulars
    while ri < len(regular_pool):
        selected.append(regular_pool[ri])
        ri += 1
    note_count = 0
    for _unique_key, entry, noun_inflection in selected:
@ -1682,12 +1833,20 @@ def build_plural_deck(
        sg_audio = ""
        pl_audio = ""
        if include_audio:
-            sg_tag = _audio_tag(singular_ktiv)
+            slug = entry.get("slug", "")
            sg_tag = _audio_tag(singular_ktiv, slug=slug)
            if sg_tag:
                sg_audio = sg_tag
                mp3_path = AUDIO_DIR / sg_tag.removeprefix("[sound:").removesuffix("]")
                if mp3_path not in media_files:
                    media_files.append(mp3_path)
            # Plural audio: {slug}_plural.mp3
            if slug:
                pl_mp3 = AUDIO_DIR / f"{slug}_plural.mp3"
                if pl_mp3.exists():
                    pl_audio = f"[sound:{pl_mp3.name}]"
                    if pl_mp3 not in media_files:
                        media_files.append(pl_mp3)
        mishkal_eng = noun_inflection.get("mishkal") or ""
        tags = [RELEASE_TAG]
--- a/data/en_50k.txt
+++ b/data/en_50k.txt
--- a/data/words.json
+++ b/data/words.json
--- a/docs/superpowers/specs/2026-03-15-adaptive-sentence-difficulty-design.md
+++ b/docs/superpowers/specs/2026-03-15-adaptive-sentence-difficulty-design.md
@ -0,0 +1,150 @@
 # Adaptive Sentence Difficulty Cloze — v0.20 Design Spec
 **Date:** 2026-03-15
 **Status:** Approved
 **Release:** v0.20
 ## Problem
 Cloze cards currently select the example sentence closest to 9 words in length. This ignores whether the surrounding context words are familiar to the learner. A sentence full of rare words is harder than one with common words, regardless of length.
 ## Solution
 Replace the length-based `_score()` function in `epub_examples.py` with a **frequency-based difficulty score**. The easiest sentence (most common context words) becomes the cloze. All vetted sentences remain on the card, ordered easy→hard.
 ## Scoring Pipeline
 ### Token Frequency Lookup (5-tier)
 Given a nikkud sentence token, resolve its frequency rank:
 1. **Known mapping** — look up token in the nikkud→ktiv_male map built from words.json headwords, conjugations, and inflections (94k mappings). If found, look up the ktiv_male in the frequency data.
 2. **Nikkud prefix stripping** — use `_try_strip_prefix()` to strip validated Hebrew prefixes (בהוכלמש), then resolve the remainder via the known mapping.
 3. **Academy rules converter** — apply `nikkud_to_ktiv_male.convert()` (91.6% accuracy) to produce ktiv_male, look up in frequency data.
 4. **strip_nikkud fallback** — use `helpers.strip_nikkud()` as a lossy fallback.
 5. **Ktiv_male prefix stripping** — strip 1-2 character Hebrew prefixes from the converted/stripped form and look up the stem.
 Tokens not found in any tier are assigned a default high rank (50,000).
 **Coverage:** ~93% of example sentence tokens resolve to a frequency rank (measured empirically on 7,588 sentences).
 **Frequency data source:** Use `frequency_lookup.py` which auto-selects `frequency_clean.json` when available, falling back to `frequency_cache.json`.
 ### Sentence Difficulty Score
 For a given word's candidate sentence:
 1. Tokenize: split on whitespace, strip punctuation (.,!?;:"'"״׳–—()[]{}), split on maqaf (־).
 2. Exclude the target word's token using `cloze_word_start`/`cloze_word_end` offsets from the matched sentence.
 3. For each remaining token (length >= 2), resolve its frequency rank via the 5-tier pipeline.
 4. **Score = median frequency rank of context tokens.**
 Lower score = easier (context words are more common). Median resists outliers (one rare proper noun shouldn't dominate).
 ### Integration Point
 The scoring integrates into `epub_examples.py`'s existing `_score()` closure inside `update_words_json()` (line ~677). Currently:
 ```python
 def _score(s: dict) -> tuple[int,]:
    wc = s["word_count"]
    length_score = abs(wc - 9) if not (6 <= wc <= 12) else 0
    return (length_score,)
 ```
 New scoring replaces length with frequency-based difficulty. The `_score` function gains access to the frequency pipeline via closure over the nikkud_map, nikkud_index, and freq_data built once at the start of `update_words_json()`.
 **Minimum sentence length:** Reduced from 4 words to 3 words (`MIN_WORDS = 3` in epub_examples.py). Hebrew is more concise than English — 3-word sentences are valid and common. This expands the candidate pool for cloze selection.
 **Behavioral change:** Because `pool.sort(key=_score)` determines which 3 sentences are selected as `best = pool[:3]`, changing the scoring function changes **which sentences are selected**, not just their order. This is intentional — we want the easiest sentences as cloze candidates, not the closest-to-9-words ones. Existing cloze GUIDs will be preserved when the same sentence text is re-selected; entries where a different sentence wins will get new GUIDs.
 ## Data Model Changes
 ### words.json
 The `examples.cloze` dict (single sentence) gains an optional `difficulty_score` field:
 ```json
 {
  "examples": {
    "vetted": [
      {"text": "...", "source": "...", "match_method": "..."},
      {"text": "...", "source": "...", "match_method": "..."}
    ],
    "cloze": {
      "text": "...",
      "cloze_word_start": 5,
      "cloze_word_end": 10,
      "cloze_hint": null,
      "cloze_guid": "abc123",
      "difficulty_score": 234
    }
  }
 }
 ```
 The vetted list is also sorted by difficulty (easiest first), so the card back shows sentences in pedagogically useful order.
 ### SCHEMA.yaml
 Add `difficulty_score` as optional integer field under `examples.cloze`.
 ## Implementation Scope
 ### New file: `sentence_difficulty.py`
 Standalone module for sentence scoring. No pipeline step — called by `epub_examples.py`.
 - `score_sentence(sentence_text: str, target_start: int, target_end: int, nikkud_map: dict, nikkud_index: dict, freq_data: dict) -> int` — returns median context frequency rank. Uses `target_start`/`target_end` character offsets to exclude the cloze target token.
 - `build_nikkud_map(words: dict) -> dict[str, str]` — builds nikkud→ktiv_male lookup from words.json (headwords + conjugation forms + noun inflections). Returns `{nikkud_form: ktiv_male_form}`. Implementation note: should share iteration logic with `epub_examples._build_nikkud_index()` or derive from its output to avoid duplicating the traversal of words.json forms.
 - `_resolve_token_frequency(token: str, nikkud_map: dict, nikkud_index: dict, freq_data: dict) -> int` — the 5-tier lookup. Uses `_try_strip_prefix` from epub_examples (made importable by removing underscore or adding a public wrapper).
 ### Modified files
 - **`epub_examples.py`**:
  - Import `sentence_difficulty.score_sentence` and `sentence_difficulty.build_nikkud_map`
  - In `update_words_json()`: build nikkud_map and load freq_data once at start (before per-word loop)
  - Replace `_score()` closure with frequency-based scoring that calls `score_sentence()`
  - Sort vetted list by difficulty score (easiest first)
  - Store `difficulty_score` in the cloze dict
  - Make `_try_strip_prefix` importable (rename to `try_strip_prefix` or add public alias)
 - **`frequency_lookup.py`** — add `get_freq_data() -> dict` public accessor to expose the loaded frequency dict (avoids accessing private `_freq` directly)
 - **`SCHEMA.yaml`** — add `difficulty_score` field
 - **`run.py`** — no changes; scoring happens inside epub_examples step
 ### Not modified
 - **`apkg_builder.py`** — reads cloze as-is; vetted order is already respected
 - **`nikkud_to_ktiv_male.py`** — used as-is
 - **Card templates** — no changes needed
 ## Dependencies
 - `nikkud_to_ktiv_male.convert()` — Academy rules converter (already written)
 - `epub_examples._try_strip_prefix()` / `_build_nikkud_index()` — nikkud prefix stripping and index
 - `frequency_lookup.py` — loads frequency data (auto-selects clean vs cache)
 - `helpers.strip_nikkud()` — fallback converter
 ## Validation
 - **Unit tests** for `score_sentence()` with known easy/hard sentences
 - **Unit tests** for `_resolve_token_frequency()` covering all 5 tiers
 - **Integration test**: verify cloze selection picks easiest sentence, vetted list is sorted
 - **Spot check**: manually review 10 words with 3+ sentences to confirm ordering
 - **Regression**: existing tests pass, GUID coverage unchanged, deck validates
 ## Constraints
 - `examples.cloze` remains a single dict (not converted to list)
 - No new Anki card types or fields
 - No runtime JS in Anki cards
 - No network calls during scoring
 - `difficulty_score` is informational metadata; card rendering doesn't depend on it
 - Existing cloze GUIDs preserved when the same sentence is re-selected
 ## Scope Exclusions (Future Work)
 - **Pronominal suffix stripping** — would improve the ~7% unscored token rate; deferred (PROJECT_NOTES.md)
 - **Kamatz katan disambiguation** — requires morphological analysis; accepted limitation
 - **Per-learner adaptive difficulty** — requires Anki plugin; out of scope for static deck
 - **Multiple cloze sentences per card** — would require schema migration to list; deferred
--- a/epub_examples.py
+++ b/epub_examples.py
@ -18,7 +18,9 @@ import zipfile
 from html.parser import HTMLParser
 from pathlib import Path
 import frequency_lookup
 from helpers import strip_nikkud
 from sentence_difficulty import build_nikkud_map, score_sentence
 logger = logging.getLogger(__name__)
@ -57,7 +59,7 @@ def _discover_epubs() -> dict[str, str]:
 # Sentence length bounds (word count)
-MIN_WORDS = 4
+MIN_WORDS = 3
 MAX_WORDS = 15
@ -448,6 +450,10 @@ def _try_strip_prefix(token: str, nikkud_index: dict) -> list[tuple[str, str, st
    return results
 # Public alias for use by sentence_difficulty module
 try_strip_prefix = _try_strip_prefix
 def _build_nikkud_index(words: dict) -> dict[str, list[tuple[str, str]]]:
    """Build a mapping from nikkud form to list of (unique_key, match_type).
@ -654,6 +660,11 @@ def update_words_json(words: dict, matches: dict, confusable_keys: set[str]) ->
    updated = 0
    # Build frequency scoring infrastructure (once for all words)
    nikkud_index = _build_nikkud_index(words)
    nikkud_map = build_nikkud_map(words)
    freq_data = frequency_lookup.get_freq_data()
    for unique_key, sent_list in matches.items():
        if unique_key not in words:
            continue
@ -673,11 +684,18 @@ def update_words_json(words: dict, matches: dict, confusable_keys: set[str]) ->
        prefix_only = [s for s in unique if "prefix" in s["match_method"]]
        pool = direct if direct else prefix_only
-        # Score: prefer 6–12 word sentences
+        # Score: prefer sentences with easier (more common) context words
        def _score(s: dict) -> tuple[int,]:
-            wc = s["word_count"]
+            return (
-            length_score = abs(wc - 9) if not (6 <= wc <= 12) else 0
+                score_sentence(
-            return (length_score,)
+                    s["text"],
                    s["char_offset"],
                    s["char_end"],
                    nikkud_map,
                    nikkud_index,
                    freq_data,
                ),
            )
        pool.sort(key=_score)
        best = pool[:3]
@ -712,6 +730,7 @@ def update_words_json(words: dict, matches: dict, confusable_keys: set[str]) ->
                "cloze_word_end": top["char_end"],
                "cloze_hint": None,
                "cloze_guid": cloze_guid,
                "difficulty_score": _score(top)[0],
            }
        elif is_confusable:
            examples.pop("cloze", None)
@ -719,9 +738,87 @@ def update_words_json(words: dict, matches: dict, confusable_keys: set[str]) ->
        examples["rejected_count"] = 0
        updated += 1
    # Deduplicate shared examples across confusable groups
    cleared = _deduplicate_confusable_examples(words)
    if cleared:
        logger.info(f"    Cleared shared examples from {cleared} confusable entries")
    return updated
 def _deduplicate_confusable_examples(words: dict) -> int:
    """Remove shared examples from less-common confusable group members.
    After example matching assigns sentences, confusable entries often share
    identical examples (matched via shared nikkud forms).  This function keeps
    examples only on the highest-frequency member, clearing others.
    Args:
        words: The full words.json dict, modified in place (examples already
            assigned).
    Returns:
        Count of entries whose examples were cleared.
    """
    from collections import defaultdict
    # Build confusable group map: group_id → [unique_key, ...]
    group_map: dict[tuple[str, ...], list[str]] = defaultdict(list)
    for key, entry in words.items():
        cg = entry.get("confusable_group")
        if cg:
            group_id = tuple(sorted(cg))
            group_map[group_id].append(key)
    cleared = 0
    for _group_id, members in group_map.items():
        if len(members) < 2:
            continue
        # Collect vetted sentence text sets per member
        member_texts: dict[str, frozenset[str]] = {}
        for key in members:
            vetted = (words[key].get("examples") or {}).get("vetted") or []
            texts = frozenset(e.get("text", "") for e in vetted)
            member_texts[key] = texts
        # Find members with identical non-empty sentence sets
        # Group members by their sentence set
        text_groups: dict[frozenset[str], list[str]] = defaultdict(list)
        for key, texts in member_texts.items():
            if texts:  # skip entries with no examples
                text_groups[texts].append(key)
        # For each set of members sharing identical examples, keep only the
        # highest-frequency one
        for _texts, sharing_keys in text_groups.items():
            if len(sharing_keys) < 2:
                continue
            # Sort by frequency_rank (lower = more common = winner).
            # No frequency → sort last (use large sentinel).
            # Tie-break: alphabetical by unique_key.
            def _sort_key(k: str) -> tuple[int, str]:
                rank = words[k].get("frequency_rank")
                return (rank if rank is not None else 999999, k)
            sharing_keys.sort(key=_sort_key)
            winner = sharing_keys[0]
            losers = sharing_keys[1:]
            for loser_key in losers:
                entry = words[loser_key]
                examples = entry.get("examples") or {}
                examples["vetted"] = []
                examples.pop("cloze", None)
                entry["examples"] = examples
                cleared += 1
                logger.debug(f"      Cleared examples from {loser_key} (kept on {winner})")
    return cleared
 # ── Public API ───────────────────────────────────────────────────
--- a/frequency_lookup.py
+++ b/frequency_lookup.py
@ -74,6 +74,16 @@ def get_frequency_rank(word_no_nikkud: str) -> int | None:
    return _freq.get(clean)
 def get_freq_data() -> dict[str, int]:
    """Return the full frequency dict (word -> rank).
    Auto-loads from cache if not yet loaded.
    """
    if not _freq:
        load()
    return _freq
 if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
    load()
--- a/nikkud_to_ktiv_male.py
+++ b/nikkud_to_ktiv_male.py
@ -0,0 +1,185 @@
 """Convert nikkud (vocalized) Hebrew to ktiv male (plene spelling).
 Implements Hebrew Academy rules for matres lectionis insertion:
 - Rule A: U vowel (kubutz) → always insert vav
 - Rule B: O vowel (holam on non-vav) → insert vav
 - Rule C: I vowel (hiriq) → insert yod (conditionally)
 - Rule D: E vowel (tsere) → insert yod (limited cases)
 - Rule E/F: Consonantal vav/yod doubling
 Reference: https://hebrew-academy.org.il/topic/hahlatot/missingvocalizationspelling/
 """
 import unicodedata
 # Hebrew nikkud code points
 SHVA = "\u05b0"
 HATAF_SEGOL = "\u05b1"
 HATAF_PATAH = "\u05b2"
 HATAF_KAMATZ = "\u05b3"
 HIRIQ = "\u05b4"
 TSERE = "\u05b5"
 SEGOL = "\u05b6"
 PATAH = "\u05b7"
 KAMATZ = "\u05b8"
 HOLAM = "\u05b9"
 HOLAM_HASER = "\u05ba"
 KUBUTZ = "\u05bb"
 DAGESH = "\u05bc"
 METEG = "\u05bd"
 RAFE = "\u05bf"
 SHIN_DOT = "\u05c1"
 SIN_DOT = "\u05c2"
 VAV = "ו"
 YOD = "י"
 MAQAF = "־"
 VOWELS = {SHVA, HATAF_SEGOL, HATAF_PATAH, HATAF_KAMATZ, HIRIQ, TSERE, SEGOL, PATAH, KAMATZ, HOLAM, HOLAM_HASER, KUBUTZ}
 NIKKUD_MARKS = VOWELS | {DAGESH, METEG, RAFE, SHIN_DOT, SIN_DOT}
 def _parse_segments(text: str) -> list[tuple[str, list[str]]]:
    """Parse nikkud text into (character, [marks]) segments."""
    segments: list[tuple[str, list[str]]] = []
    cur_char: str | None = None
    cur_marks: list[str] = []
    for ch in text:
        if unicodedata.category(ch) == "Mn":
            cur_marks.append(ch)
        else:
            if cur_char is not None:
                segments.append((cur_char, cur_marks))
            cur_char = ch
            cur_marks = []
    if cur_char is not None:
        segments.append((cur_char, cur_marks))
    return segments
 def _get_vowel(marks: list[str]) -> str | None:
    """Extract the vowel mark from a list of combining marks."""
    for m in marks:
        if m in VOWELS:
            return m
    return None
 def _has_dagesh(marks: list[str]) -> bool:
    return DAGESH in marks
 def _is_hebrew_letter(ch: str) -> bool:
    return "\u05d0" <= ch <= "\u05ea"
 def convert(text: str) -> str:
    """Convert nikkud Hebrew text to ktiv male.
    Strips all nikkud marks and inserts matres lectionis (vav/yod)
    according to Hebrew Academy spelling rules.
    """
    segments = _parse_segments(text)
    result: list[str] = []
    for i, (ch, marks) in enumerate(segments):
        if not _is_hebrew_letter(ch):
            # Non-Hebrew character: output as-is (no marks)
            result.append(ch)
            continue
        vowel = _get_vowel(marks)
        has_dag = _has_dagesh(marks)
        # Output the base letter (strip all nikkud marks)
        result.append(ch)
        # --- Rule A: U vowel (kubutz) → always add vav ---
        if vowel == KUBUTZ:
            result.append(VAV)
            continue
        # --- Shuruk detection ---
        # Vav with dagesh and no other vowel = shuruk (already a mater)
        # Vav with dagesh AND a vowel = consonantal vav (ב with dagesh)
        # If letter is vav with dagesh only → it's shuruk, already output
        if ch == VAV and has_dag and vowel is None:
            # Shuruk: vav IS the mater lectionis, already output
            continue
        # --- Rule B: O vowel (holam) → add vav ---
        if vowel in (HOLAM, HOLAM_HASER):
            if ch != VAV:
                # Exception: holam before aleph (pe-aleph verbs) — no vav
                # e.g., תֹּאבַד→תאבד, יֹאבַד→יאבד, נֹאבַד→נאבד
                next_is_aleph = i + 1 < len(segments) and segments[i + 1][0] == "א"
                if not next_is_aleph:
                    result.append(VAV)
            # If ch IS vav (holam male), vav already output
            continue
        # --- Rule C: I vowel (hiriq) → conditionally add yod ---
        if vowel == HIRIQ:
            if ch == YOD:
                # Yod already present, don't double
                continue
            # Don't insert yod if next letter is already yod
            if i + 1 < len(segments) and segments[i + 1][0] == YOD:
                continue
            # Rule C Section 3: Don't add yod if the NEXT consonant
            # has shva (indicating shva nach on that consonant)
            add_yod = True
            if i + 1 < len(segments):
                next_ch, next_marks = segments[i + 1]
                next_vowel = _get_vowel(next_marks)
                # Shva on next consonant = shva nach → don't add yod
                # UNLESS next consonant also has dagesh (= shva na / doubled)
                next_has_dagesh = _has_dagesh(next_marks)
                if next_vowel == SHVA and not next_has_dagesh:
                    add_yod = False
                # No vowel on next consonant (word-final) = closed syllable
                # → don't add yod (e.g., suffix -תי -נו -תם)
                elif next_vowel is None and _is_hebrew_letter(next_ch):
                    # Check if this is truly word-final or next-to-last
                    remaining_letters = sum(1 for j in range(i + 1, len(segments)) if _is_hebrew_letter(segments[j][0]))
                    if remaining_letters <= 2:
                        # Short suffix like תי, נו — don't add yod
                        add_yod = False
            if add_yod:
                result.append(YOD)
            continue
        # --- Rule D: E vowel (tsere/segol) → generally NO yod ---
        # Exception (b): tsere before guttural/resh gets yod ONLY
        # in word-initial position (dagesh substitution in Hif'il/noun patterns)
        # e.g., הֵחֵל→היחל, תֵּאָבֵד→תיאבד, הֵרִיעַ→היריע
        # but NOT mid-word: מְסַפֵּר→מספר, מְעַבֵּר→מעבר
        if vowel == TSERE:
            add_yod = False
            if i + 1 < len(segments):
                next_ch = segments[i + 1][0]
                if next_ch in "אהחער":
                    # Only at word-initial (pos 0) or after prefix (pos 1)
                    # where dagesh substitution applies
                    hebrew_pos = sum(1 for j in range(i) if _is_hebrew_letter(segments[j][0]))
                    if hebrew_pos <= 1:
                        add_yod = True
            if add_yod:
                result.append(YOD)
            continue
        # All other vowels (patah, kamatz, segol, shva, hataf-*):
        # No mater lectionis insertion needed
    return "".join(result)
--- a/pealim_detail_scrape.py
+++ b/pealim_detail_scrape.py
@ -40,6 +40,9 @@ SAVE_INTERVAL = 50  # write words.json every N processed entries
 WORDS_JSON = Path(__file__).parent / "data" / "words.json"
 # Regex for Hebrew prepositions wrapped in parentheses, e.g. "(על)" or "(ב-)"
 HBPAREN_RE = re.compile(r"\(([\u05b0-\u05ea\u05f0-\u05f4\-]+)\)")
 BINYAN_NAMES: tuple[str, ...] = ("Pa'al", "Nif'al", "Pi'el", "Pu'al", "Hitpa'el", "Hif'il", "Huf'al")
 _BINYAN_NAMES_LOWER: tuple[str, ...] = tuple(b.lower() for b in BINYAN_NAMES)
@ -948,9 +951,17 @@ def _scrape_verb_detail(slug: str, mo_html: str, vl_html: str, existing_conj: di
    binyan = _extract_binyan_from_page(mo_soup)
    meaning = ""
    prep: str | None = None
    lead_div = mo_soup.find("div", class_="lead")
    if lead_div:
        meaning = lead_div.get_text(strip=True)
        # Extract preposition(s) from the lead text, e.g. "(על)" → "על"
        prep_matches = HBPAREN_RE.findall(meaning)
        if prep_matches:
            prep = " ".join(prep_matches)
    # Fall back to any prep already stored (e.g. from a previous manual edit)
    if prep is None:
        prep = existing.get("prep")
    # Parse active forms
    mo_active = _parse_conjugation_table(mo_soup, passive=False)
@ -1002,7 +1013,7 @@ def _scrape_verb_detail(slug: str, mo_html: str, vl_html: str, existing_conj: di
        "binyan": binyan,
        "binyan_hebrew": BINYAN_HEBREW.get(binyan, ""),
        "meaning": meaning,
-        "prep": existing.get("prep"),
+        "prep": prep,
        "active_forms": active_forms,
        "hufal_pual_forms": hufal_pual_forms,
        "reference_form_passive": reference_form_passive,
--- a/pealim_list_scrape.py
+++ b/pealim_list_scrape.py
@ -82,10 +82,13 @@ BINYAN_HEBREW: dict[str, str] = {
 # Regex for extracting emoji characters
 EMOJI_RE = re.compile(
-    r"[\U0001F300-\U0001FFFF\U00002600-\U000027BF\U0001F000-\U0001F9FF\u2600-\u26FF\u2700-\u27BF]+",
+    r"[\U0001F300-\U0001FFFF\U00002600-\U000027BF\U0001F000-\U0001F9FF\u2600-\u26FF\u2700-\u27BF\uFE0E\uFE0F\u200D]+",
    re.UNICODE,
 )
 # Regex for extracting Hebrew prepositions wrapped in parentheses, e.g. "(על)" or "(ב-)"
 HBPAREN_RE = re.compile(r"\(([\u05b0-\u05ea\u05f0-\u05f4\-]+)\)")
 # Fields that must never be overwritten when updating an existing entry
 PROTECTED_FIELDS = frozenset(
    [
@ -149,6 +152,7 @@ def _default_entry() -> dict:
        "image": None,
        "image_source": None,
        "hint": "",
        "prep": None,
        "shared_roots": [],
        "confusable_group": None,
        "confusables_guid": None,
@ -170,8 +174,9 @@ def _extract_emoji(text: str) -> str | None:
 def _clean_meaning(raw: str) -> str:
-    """Strip emoji and extra whitespace from a raw meaning string."""
+    """Strip emoji, Hebrew parenthesized prepositions, and extra whitespace from a raw meaning string."""
    cleaned = EMOJI_RE.sub("", raw)
    cleaned = HBPAREN_RE.sub("", cleaned)
    return " ".join(cleaned.split())
@ -453,6 +458,9 @@ def _merge_row(
    emoji = _extract_emoji(meaning_raw_raw)
    tags = _build_tags(pos_en, root)
    audio_file = _compute_audio_file(slug, ktiv_male)
    # Extract Hebrew preposition(s) from the raw meaning (e.g. "(על)" → "על")
    prep_matches = HBPAREN_RE.findall(meaning_raw)
    prep: str | None = " ".join(prep_matches) if prep_matches else None
    # ---- locate existing entry ----
    unique_key: str | None = slug_index.get(slug) if slug else None
@ -468,6 +476,7 @@ def _merge_row(
        entry["pos_hebrew"] = pos_heb
        entry["meaning"] = meaning
        entry["meaning_raw"] = meaning_raw
        entry["prep"] = prep
        entry["audio_url"] = audio_url
        entry["audio_file"] = audio_file
        entry["tags"] = tags
@ -484,6 +493,7 @@ def _merge_row(
        entry["pos_hebrew"] = pos_heb
        entry["meaning"] = meaning
        entry["meaning_raw"] = meaning_raw
        entry["prep"] = prep
        entry["emoji"] = emoji
        entry["emoji_source"] = "from_pealim" if emoji else None
        entry["audio_url"] = audio_url
--- a/release.py
+++ b/release.py
@ -20,8 +20,11 @@ from pathlib import Path
 import requests
 sys.path.insert(0, "/home/node/projects")
 import load_keeshare
 REPO_API = "https://git.nevo.engineer/api/v1/repos/nevo/hebrew_flash_cards"
-FORGEJO_TOKEN = "f023bd4cfd4b77aac584647f2fa8481df3906578"
+FORGEJO_TOKEN: str = load_keeshare.get_entry("git.nevo.engineer")["password"]
 OUTPUT_DIR = Path(__file__).parent / "output"
 # All deck variants to include in release
--- a/scripts/assign_pseudo_frequency.py
+++ b/scripts/assign_pseudo_frequency.py
@ -0,0 +1,269 @@
 #!/usr/bin/env python3
 """Assign pseudo-frequency to confusable groups using English word frequency.
 Problem: Confusable entries share the same ktiv_male and thus the same Hebrew
 frequency rank.  This script uses English frequency to differentiate them so
 Anki sorts more-common meanings first.
 Algorithm:
  1. For each confusable group where all entries share the same Hebrew frequency,
     extract the first meaningful English keyword from each entry's meaning field.
  2. Look up English frequency rank for each keyword.
  3. Assign pseudo_frequency: the most frequent English meaning keeps the original
     Hebrew rank; less frequent meanings get progressively higher (worse) ranks
     by adding an offset (100 * position in group).
 Usage:
    python3 scripts/assign_pseudo_frequency.py              # assign and save
    python3 scripts/assign_pseudo_frequency.py --dry-run    # preview only
 """
 from __future__ import annotations
 import argparse
 import json
 import logging
 import re
 from collections import defaultdict
 from pathlib import Path
 logger = logging.getLogger(__name__)
 PROJECT_ROOT = Path(__file__).parent.parent
 WORDS_JSON = PROJECT_ROOT / "data" / "words.json"
 EN_FREQ_PATH = PROJECT_ROOT / "data" / "en_50k.txt"
 # Words too common/vague to use as frequency signal
 _EN_STOP = frozenset(
    {
        "to",
        "be",
        "a",
        "an",
        "the",
        "of",
        "in",
        "on",
        "at",
        "for",
        "and",
        "with",
        "by",
        "or",
        "but",
        "not",
        "as",
        "its",
        "it",
        "is",
        "was",
        "are",
        "from",
        "that",
        "this",
        "have",
        "has",
        "had",
        "do",
        "does",
        "did",
        "will",
        "would",
        "can",
        "could",
        "may",
        "might",
        "shall",
        "should",
        "must",
        "no",
        "yes",
        "very",
        "too",
        "also",
        "just",
        "only",
        "so",
        "up",
        "out",
        "into",
        "over",
        "after",
        "before",
        "about",
        "more",
        "than",
        "other",
        "some",
        "any",
        "all",
        "each",
        "every",
        "both",
        "few",
        "many",
        "much",
        "most",
        "such",
        "own",
        "same",
        "well",
        "still",
        "even",
        "how",
        "what",
        "when",
        "where",
        "which",
        "who",
        "whom",
        "whose",
        "why",
        "because",
        "if",
        "then",
        "else",
        "while",
        "until",
        "though",
        "whether",
    }
 )
 def _load_en_freq() -> dict[str, int]:
    """Load English frequency data: word -> rank (1 = most common)."""
    freq: dict[str, int] = {}
    rank = 1
    with open(EN_FREQ_PATH, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split()
            if parts:
                word = parts[0].lower()
                if word not in freq:
                    freq[word] = rank
                    rank += 1
    return freq
 def _extract_keywords(meaning: str) -> list[str]:
    """Extract meaningful English keywords from a meaning string.
    Returns list of lowercase words, filtered for stop words and short words.
    """
    # Strip parenthesized content, punctuation
    cleaned = re.sub(r"\([^)]*\)", " ", meaning)
    cleaned = re.sub(r"[^\w\s]", " ", cleaned)
    return [w.lower() for w in cleaned.split() if len(w) > 2 and w.lower() not in _EN_STOP]
 def assign_pseudo_frequencies(
    words: dict,
    en_freq: dict[str, int],
    dry_run: bool = False,
 ) -> int:
    """Assign pseudo_frequency to confusable groups. Returns count of changes."""
    # Group by confusables_guid
    groups: dict[str, list[str]] = defaultdict(list)
    for key, entry in words.items():
        cg = entry.get("confusables_guid")
        if cg:
            groups[cg].append(key)
    changes = 0
    assigned_groups = 0
    skipped_diff = 0
    skipped_no_en = 0
    for _guid, keys in groups.items():
        entries = [words[k] for k in keys]
        freqs = [e.get("frequency") for e in entries]
        # Skip groups that are already differentiated
        unique_freqs = set(freqs)
        if len(unique_freqs) > 1:
            skipped_diff += 1
            continue
        base_freq = freqs[0]  # All same (or all None)
        # Look up English frequency for each entry
        en_ranks: list[tuple[int, str]] = []  # (en_rank, key)
        for key, entry in zip(keys, entries, strict=True):
            keywords = _extract_keywords(entry.get("meaning", ""))
            en_rank = 999_999
            for kw in keywords[:5]:
                r = en_freq.get(kw)
                if r is not None:
                    en_rank = r
                    break
            en_ranks.append((en_rank, key))
        # Sort by English frequency (lower rank = more common)
        en_ranks.sort()
        # Check if all entries have the same English rank (no signal)
        if len({r for r, _ in en_ranks}) <= 1:
            skipped_no_en += 1
            continue
        assigned_groups += 1
        # Assign pseudo_frequency: most common gets base, others get offset
        for position, (en_rank, key) in enumerate(en_ranks):
            pseudo = base_freq + position * 100 if base_freq is not None else 50000 + en_rank
            if not dry_run:
                words[key]["pseudo_frequency"] = pseudo
            changes += 1
            if dry_run:
                meaning = words[key].get("meaning", "")[:40]
                logger.info(
                    "  [en:%5d] pseudo=%6d  %s",
                    en_rank,
                    pseudo,
                    meaning,
                )
    logger.info(
        "Pseudo-frequency: %d groups assigned, %d already differentiated, %d no English signal",
        assigned_groups,
        skipped_diff,
        skipped_no_en,
    )
    return changes
 def main() -> None:
    parser = argparse.ArgumentParser(description="Assign pseudo-frequency to confusables")
    parser.add_argument("--dry-run", action="store_true", help="Preview without saving")
    args = parser.parse_args()
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    logger.info("Loading English frequency data: %s", EN_FREQ_PATH)
    en_freq = _load_en_freq()
    logger.info("English frequency: %d entries", len(en_freq))
    with open(WORDS_JSON, encoding="utf-8") as f:
        words: dict = json.load(f)
    changes = assign_pseudo_frequencies(words, en_freq, dry_run=args.dry_run)
    if args.dry_run:
        logger.info("Dry run — %d changes would be made", changes)
        return
    with open(WORDS_JSON, "w", encoding="utf-8") as f:
        json.dump(words, f, ensure_ascii=False, indent=2)
    logger.info("Saved %d pseudo-frequency assignments to words.json", changes)
 if __name__ == "__main__":
    main()
--- a/scripts/validate_data.py
+++ b/scripts/validate_data.py
@ -685,6 +685,57 @@ def test_no_stripped_form_sentence_collisions(data: dict[str, Any]) -> None:
    _pass(name)
 def test_no_shared_confusable_examples(data: dict[str, Any]) -> None:
    """Within each confusable group, no two entries should share the same set of vetted sentence texts.
    Shared examples indicate the deduplication step in epub_examples.py
    failed to assign examples to only the highest-frequency member.
    """
    name = "no_shared_confusable_examples"
    errors: list[str] = []
    from collections import defaultdict
    # Build confusable group map
    group_map: dict[tuple[str, ...], list[str]] = defaultdict(list)
    for key, entry in data.items():
        cg = entry.get("confusable_group")
        if cg:
            group_id = tuple(sorted(cg))
            group_map[group_id].append(key)
    for _group_id, members in group_map.items():
        if len(members) < 2:
            continue
        # Collect sentence text sets per member
        text_sets: dict[str, frozenset[str]] = {}
        for key in members:
            vetted = (data[key].get("examples") or {}).get("vetted") or []
            texts = frozenset(e.get("text", "") for e in vetted)
            if texts:
                text_sets[key] = texts
        # Check for identical sets
        seen: dict[frozenset[str], str] = {}
        for key, texts in text_sets.items():
            if texts in seen:
                meaning_a = (data[seen[texts]].get("meaning") or "")[:30]
                meaning_b = (data[key].get("meaning") or "")[:30]
                errors.append(
                    f"{seen[texts]} ({meaning_a}) and {key} ({meaning_b}) share {len(texts)} identical example(s)"
                )
            else:
                seen[texts] = key
    if errors:
        _fail(name, errors[:20] if not _verbose else errors)
        if len(errors) > 20 and not _verbose:
            print(f"          ... ({len(errors) - 20} more; use --verbose)")
    else:
        _pass(name)
 def test_no_hebrew_in_meaning(data: dict[str, Any]) -> None:
    """English meanings must not contain bare Hebrew text (spoils the card)."""
    name = "no_hebrew_in_meaning"
@ -801,6 +852,7 @@ ALL_TESTS: dict[str, Any] = {
    "conjugation_form_guids": test_conjugation_form_guids,
    "conjugation_person_codes": test_conjugation_person_codes,
    "no_stripped_form_sentence_collisions": test_no_stripped_form_sentence_collisions,
    "no_shared_confusable_examples": test_no_shared_confusable_examples,
    "no_hebrew_in_meaning": test_no_hebrew_in_meaning,
    "mishkal_consistency": test_mishkal_consistency,
 }
--- a/sentence_difficulty.py
+++ b/sentence_difficulty.py
@ -0,0 +1,198 @@
 """Sentence difficulty scoring by context-word frequency.
 Scores sentences by the median frequency rank of context words
 (excluding the cloze target). Lower score = easier sentence.
 Used by epub_examples.py to select the best cloze sentence.
 """
 from __future__ import annotations
 from statistics import median
 import helpers
 import nikkud_to_ktiv_male
 DEFAULT_RANK = 50_000
 # Hebrew prefix consonants for ktiv_male prefix stripping (tier 5)
 _KM_PREFIX_CHARS = set("בהוכלמשע")
 # Punctuation to strip from tokens
 _PUNCT = set('.,!?;:"\'"״׳–—()[]{}')
 # Maqaf (Hebrew hyphen) — splits tokens
 _MAQAF = "־"
 def build_nikkud_map(words: dict) -> dict[str, str]:
    """Build nikkud→ktiv_male lookup from words.json.
    Indexes: headwords, conjugation forms (active, passive, infinitive,
    reference_form), noun inflections (singular, plural, construct,
    pronominal suffixes), and adjective inflections (ms/fs/mp/fp).
    Args:
        words: The full words.json dict keyed by unique_key.
    Returns:
        Dict mapping nikkud form to ktiv_male string.
        When collisions occur, last-write wins (acceptable for frequency lookup).
    """
    nmap: dict[str, str] = {}
    def _add(nikkud: str | None, ktiv_male: str | None) -> None:
        if nikkud and ktiv_male:
            nmap[nikkud] = ktiv_male
    for entry in words.values():
        word = entry.get("word") or {}
        _add(word.get("nikkud"), word.get("ktiv_male"))
        # Conjugation forms
        conj = entry.get("conjugation") or {}
        for form_entry in conj.get("active_forms") or []:
            form = form_entry.get("form") or {}
            _add(form.get("nikkud"), form.get("ktiv_male"))
        for form_entry in conj.get("hufal_pual_forms") or []:
            form = form_entry.get("form") or {}
            _add(form.get("nikkud"), form.get("ktiv_male"))
        inf = conj.get("infinitive") or {}
        _add(inf.get("nikkud"), inf.get("ktiv_male"))
        ref = conj.get("reference_form") or {}
        _add(ref.get("nikkud"), ref.get("ktiv_male"))
        # Noun inflection forms
        noun = entry.get("noun_inflection") or {}
        for field in ("singular", "plural", "construct_singular", "construct_plural"):
            sub = noun.get(field) or {}
            nikkud_form = sub.get("nikkud")
            ktiv = sub.get("ktiv_male")
            _add(nikkud_form, ktiv)
            # Index construct forms without maqaf
            if nikkud_form and nikkud_form.endswith("־") and ktiv:
                _add(nikkud_form[:-1], ktiv)
        pronominal = noun.get("pronominal_suffixes") or {}
        for sub in pronominal.values():
            if isinstance(sub, dict):
                _add(sub.get("nikkud"), sub.get("ktiv_male"))
        # Adjective inflection forms
        adj = entry.get("adjective_inflection") or {}
        for field in ("ms", "fs", "mp", "fp"):
            sub = adj.get(field) or {}
            _add(sub.get("nikkud"), sub.get("ktiv_male"))
    return nmap
 def _resolve_token_frequency(
    token: str,
    nikkud_map: dict[str, str],
    nikkud_index: dict,
    freq_data: dict[str, int],
 ) -> int:
    """Resolve a nikkud sentence token to its frequency rank.
    Uses a 5-tier pipeline:
    1. Known mapping (nikkud_map from words.json)
    2. Nikkud prefix stripping (epub_examples.try_strip_prefix)
    3. Academy rules converter (nikkud_to_ktiv_male.convert)
    4. strip_nikkud fallback (helpers.strip_nikkud)
    5. Ktiv_male prefix stripping on the converted form
    Returns:
        Frequency rank (1 = most common). DEFAULT_RANK (50000) if not found.
    """
    # Tier 1: Direct lookup in nikkud→ktiv_male map
    ktiv = nikkud_map.get(token)
    if ktiv and ktiv in freq_data:
        return freq_data[ktiv]
    # Tier 2: Nikkud prefix stripping → resolve remainder via nikkud_map
    from epub_examples import try_strip_prefix
    prefix_hits = try_strip_prefix(token, nikkud_index)
    for _unique_key, _match_type, matched_remainder in prefix_hits:
        remainder_ktiv = nikkud_map.get(matched_remainder)
        if remainder_ktiv and remainder_ktiv in freq_data:
            return freq_data[remainder_ktiv]
    # Tier 3: Academy rules converter
    converted = nikkud_to_ktiv_male.convert(token)
    if converted in freq_data:
        return freq_data[converted]
    # Tier 4: strip_nikkud fallback
    stripped = helpers.strip_nikkud(token)
    if stripped != converted and stripped in freq_data:
        return freq_data[stripped]
    # Tier 5: Ktiv_male prefix stripping on converted/stripped form
    for form in (converted, stripped):
        for prefix_len in (1, 2):
            if len(form) > prefix_len + 1:
                prefix = form[:prefix_len]
                if all(c in _KM_PREFIX_CHARS for c in prefix):
                    stem = form[prefix_len:]
                    if stem in freq_data:
                        return freq_data[stem]
    return DEFAULT_RANK
 def score_sentence(
    text: str,
    target_start: int,
    target_end: int,
    nikkud_map: dict[str, str],
    nikkud_index: dict,
    freq_data: dict[str, int],
 ) -> int:
    """Score a sentence by median frequency rank of context words.
    Args:
        text: The full sentence text (with nikkud).
        target_start: Character offset where the cloze target word starts.
        target_end: Character offset where the cloze target word ends.
        nikkud_map: nikkud→ktiv_male mapping from build_nikkud_map().
        nikkud_index: nikkud index from epub_examples._build_nikkud_index().
        freq_data: Frequency dict from frequency_lookup.get_freq_data().
    Returns:
        Median frequency rank of context tokens (int). Lower = easier.
        Returns DEFAULT_RANK if no scoreable context tokens.
    """
    # Tokenize: split on whitespace, then split on maqaf
    raw_tokens = text.split()
    tokens_with_pos: list[tuple[str, int, int]] = []
    pos = 0
    for raw in raw_tokens:
        start = text.index(raw, pos)
        # Split on maqaf
        parts = raw.split(_MAQAF)
        sub_pos = start
        for part in parts:
            if part:
                tokens_with_pos.append((part, sub_pos, sub_pos + len(part)))
            sub_pos += len(part) + 1  # +1 for maqaf
        pos = start + len(raw)
    # Filter: exclude target word, strip punctuation, skip short tokens
    context_ranks: list[int] = []
    for token, tok_start, tok_end in tokens_with_pos:
        # Exclude target word by overlap with char offsets
        if tok_start < target_end and tok_end > target_start:
            continue
        # Strip punctuation from edges
        cleaned = token.strip("".join(_PUNCT))
        if len(cleaned) < 2:
            continue
        rank = _resolve_token_frequency(cleaned, nikkud_map, nikkud_index, freq_data)
        context_ranks.append(rank)
    if not context_ranks:
        return DEFAULT_RANK
    return int(median(context_ranks))
--- a/tests/test_epub_examples.py
+++ b/tests/test_epub_examples.py
@ -0,0 +1,127 @@
 """Tests for epub_examples deduplication of confusable group examples."""
 from epub_examples import _deduplicate_confusable_examples
 def _make_entry(meaning, confusable_group, vetted_texts=None, frequency_rank=None):
    """Build a minimal words.json entry for testing."""
    entry = {
        "meaning": meaning,
        "confusable_group": confusable_group,
    }
    if vetted_texts is not None:
        entry["examples"] = {
            "vetted": [{"text": t, "source": "test", "match_method": "direct"} for t in vetted_texts],
        }
    if frequency_rank is not None:
        entry["frequency_rank"] = frequency_rank
    return entry
 class TestDeduplicateConfusableExamples:
    """Tests for _deduplicate_confusable_examples()."""
    def test_shared_examples_kept_on_higher_frequency(self):
        """When two confusables share identical examples, the one with
        lower frequency_rank (more common) keeps them."""
        group = ["key_a", "key_b"]
        words = {
            "key_a": _make_entry("brother", group, ["sent1", "sent2"], frequency_rank=500),
            "key_b": _make_entry("fireplace", group, ["sent1", "sent2"], frequency_rank=8000),
        }
        cleared = _deduplicate_confusable_examples(words)
        assert cleared == 1
        assert len(words["key_a"]["examples"]["vetted"]) == 2
        assert words["key_b"]["examples"]["vetted"] == []
    def test_no_action_when_examples_differ(self):
        """Groups with different example sets are left untouched."""
        group = ["key_a", "key_b"]
        words = {
            "key_a": _make_entry("meaning1", group, ["sent1"], frequency_rank=100),
            "key_b": _make_entry("meaning2", group, ["sent2"], frequency_rank=200),
        }
        cleared = _deduplicate_confusable_examples(words)
        assert cleared == 0
        assert len(words["key_a"]["examples"]["vetted"]) == 1
        assert len(words["key_b"]["examples"]["vetted"]) == 1
    def test_no_action_when_one_has_no_examples(self):
        """If only one member has examples, nothing to deduplicate."""
        group = ["key_a", "key_b"]
        words = {
            "key_a": _make_entry("meaning1", group, ["sent1"], frequency_rank=100),
            "key_b": _make_entry("meaning2", group, frequency_rank=200),
        }
        cleared = _deduplicate_confusable_examples(words)
        assert cleared == 0
    def test_no_frequency_uses_alphabetical_tiebreak(self):
        """When no member has frequency data, first alphabetically wins."""
        group = ["alpha_key", "beta_key"]
        words = {
            "alpha_key": _make_entry("meaning1", group, ["sent1"]),
            "beta_key": _make_entry("meaning2", group, ["sent1"]),
        }
        cleared = _deduplicate_confusable_examples(words)
        assert cleared == 1
        assert len(words["alpha_key"]["examples"]["vetted"]) == 1
        assert words["beta_key"]["examples"]["vetted"] == []
    def test_three_way_group(self):
        """Three-member group: highest frequency wins, other two cleared."""
        group = ["key_a", "key_b", "key_c"]
        words = {
            "key_a": _make_entry("yes", group, ["sent1", "sent2"], frequency_rank=50),
            "key_b": _make_entry("honest", group, ["sent1", "sent2"], frequency_rank=3000),
            "key_c": _make_entry("pedestal", group, ["sent1", "sent2"], frequency_rank=15000),
        }
        cleared = _deduplicate_confusable_examples(words)
        assert cleared == 2
        assert len(words["key_a"]["examples"]["vetted"]) == 2
        assert words["key_b"]["examples"]["vetted"] == []
        assert words["key_c"]["examples"]["vetted"] == []
    def test_cloze_removed_from_losers(self):
        """Losing entries should have their cloze data removed too."""
        group = ["key_a", "key_b"]
        words = {
            "key_a": _make_entry("common", group, ["sent1"], frequency_rank=100),
            "key_b": _make_entry("rare", group, ["sent1"], frequency_rank=9000),
        }
        # Add cloze to both
        words["key_b"]["examples"]["cloze"] = {"text": "sent1", "cloze_guid": "abc"}
        cleared = _deduplicate_confusable_examples(words)
        assert cleared == 1
        assert "cloze" not in words["key_b"]["examples"]
    def test_no_confusable_groups_returns_zero(self):
        """Words without confusable_group are ignored."""
        words = {
            "key_a": {"meaning": "word1", "examples": {"vetted": [{"text": "s1"}]}},
            "key_b": {"meaning": "word2", "examples": {"vetted": [{"text": "s1"}]}},
        }
        cleared = _deduplicate_confusable_examples(words)
        assert cleared == 0
    def test_mixed_frequency_and_none(self):
        """Member with frequency beats member without."""
        group = ["key_a", "key_b"]
        words = {
            "key_a": _make_entry("has_freq", group, ["sent1"], frequency_rank=5000),
            "key_b": _make_entry("no_freq", group, ["sent1"]),
        }
        cleared = _deduplicate_confusable_examples(words)
        assert cleared == 1
        assert len(words["key_a"]["examples"]["vetted"]) == 1
        assert words["key_b"]["examples"]["vetted"] == []
    def test_partial_overlap_not_deduplicated(self):
        """Groups with overlapping but not identical sentence sets are not touched."""
        group = ["key_a", "key_b"]
        words = {
            "key_a": _make_entry("m1", group, ["sent1", "sent2"], frequency_rank=100),
            "key_b": _make_entry("m2", group, ["sent1", "sent3"], frequency_rank=200),
        }
        cleared = _deduplicate_confusable_examples(words)
        assert cleared == 0
--- a/tests/test_scoring_integration.py
+++ b/tests/test_scoring_integration.py
@ -0,0 +1,83 @@
 """Integration tests for frequency-based sentence scoring in update_words_json."""
 def _make_sentence(text, source="test", match_method="direct", word_count=None, char_offset=0, char_end=3):
    """Build a minimal sentence dict as match_sentences would produce."""
    if word_count is None:
        word_count = len(text.split())
    return {
        "text": text,
        "source": source,
        "match_method": match_method,
        "word_count": word_count,
        "char_offset": char_offset,
        "char_end": char_end,
    }
 class TestScoringIntegration:
    """Tests that update_words_json uses frequency scoring."""
    def test_cloze_has_difficulty_score(self):
        """Cloze dict includes difficulty_score field."""
        from epub_examples import update_words_json
        words = {
            "טוֹב": {
                "word": {"nikkud": "טוֹב", "ktiv_male": "טוב"},
                "examples": {},
            }
        }
        matches = {
            "טוֹב": [
                _make_sentence("הוּא אָדָם טוֹב מְאוֹד", char_offset=10, char_end=13),
            ]
        }
        update_words_json(words, matches, confusable_keys=set())
        cloze = words["טוֹב"]["examples"].get("cloze")
        assert cloze is not None
        assert "difficulty_score" in cloze
        assert isinstance(cloze["difficulty_score"], int)
    def test_vetted_sorted_by_difficulty(self):
        """Vetted sentences are sorted easiest first."""
        from epub_examples import update_words_json
        words = {
            "טוֹב": {
                "word": {"nikkud": "טוֹב", "ktiv_male": "טוב"},
                "examples": {},
            }
        }
        matches = {
            "טוֹב": [
                _make_sentence("הוּא טוֹב", char_offset=4, char_end=7),
                _make_sentence("הַתַּפְנִיט טוֹב בְּיוֹתֵר", char_offset=10, char_end=13),
                _make_sentence("אֲנִי טוֹב הַיּוֹם", char_offset=5, char_end=8),
            ]
        }
        update_words_json(words, matches, confusable_keys=set())
        vetted = words["טוֹב"]["examples"]["vetted"]
        assert len(vetted) == 3
    def test_easiest_sentence_becomes_cloze(self):
        """The sentence with the lowest difficulty score becomes the cloze."""
        from epub_examples import update_words_json
        words = {
            "טוֹב": {
                "word": {"nikkud": "טוֹב", "ktiv_male": "טוב"},
                "examples": {},
            }
        }
        easy_text = "הוּא טוֹב מְאוֹד"
        hard_text = "הַפַּרְנָסִימוֹן טוֹב לְהַפְלִיא"
        matches = {
            "טוֹב": [
                _make_sentence(hard_text, char_offset=14, char_end=17),
                _make_sentence(easy_text, char_offset=4, char_end=7),
            ]
        }
        update_words_json(words, matches, confusable_keys=set())
        cloze = words["טוֹב"]["examples"]["cloze"]
        assert cloze["text"] == easy_text
--- a/tests/test_sentence_difficulty.py
+++ b/tests/test_sentence_difficulty.py
@ -0,0 +1,207 @@
 """Tests for sentence difficulty scoring."""
 import json
 from pathlib import Path
 import pytest
 import frequency_lookup
 from sentence_difficulty import DEFAULT_RANK, _resolve_token_frequency, build_nikkud_map, score_sentence
 class TestBuildNikkudMap:
    def test_maps_direct_headwords(self):
        words = {"אָב": {"word": {"nikkud": "אָב", "ktiv_male": "אב"}}}
        nmap = build_nikkud_map(words)
        assert nmap["אָב"] == "אב"
    def test_maps_conjugation_forms(self):
        words = {
            "שָׁמַר": {
                "word": {"nikkud": "שָׁמַר", "ktiv_male": "שמר"},
                "conjugation": {
                    "active_forms": [
                        {
                            "person": "1s",
                            "tense": "עָבָר",
                            "form": {"nikkud": "שָׁמַרְתִּי", "ktiv_male": "שמרתי"},
                        },
                    ],
                    "infinitive": {"nikkud": "לִשְׁמֹר", "ktiv_male": "לשמור"},
                    "reference_form": {"nikkud": "שָׁמַר", "ktiv_male": "שמר"},
                },
            }
        }
        nmap = build_nikkud_map(words)
        assert nmap["שָׁמַרְתִּי"] == "שמרתי"
        assert nmap["לִשְׁמֹר"] == "לשמור"
    def test_maps_noun_inflections(self):
        words = {
            "אָב": {
                "word": {"nikkud": "אָב", "ktiv_male": "אב"},
                "noun_inflection": {
                    "singular": {"nikkud": "אָב", "ktiv_male": "אב"},
                    "plural": {"nikkud": "אָבוֹת", "ktiv_male": "אבות"},
                    "pronominal_suffixes": {"1s": {"nikkud": "אָבִי", "ktiv_male": "אבי"}},
                },
            }
        }
        nmap = build_nikkud_map(words)
        assert nmap["אָבוֹת"] == "אבות"
        assert nmap["אָבִי"] == "אבי"
    def test_maps_adjective_inflections(self):
        words = {
            "גָּדוֹל": {
                "word": {"nikkud": "גָּדוֹל", "ktiv_male": "גדול"},
                "adjective_inflection": {
                    "ms": {"nikkud": "גָּדוֹל", "ktiv_male": "גדול"},
                    "fs": {"nikkud": "גְּדוֹלָה", "ktiv_male": "גדולה"},
                    "mp": {"nikkud": "גְּדוֹלִים", "ktiv_male": "גדולים"},
                    "fp": {"nikkud": "גְּדוֹלוֹת", "ktiv_male": "גדולות"},
                },
            }
        }
        nmap = build_nikkud_map(words)
        assert nmap["גְּדוֹלָה"] == "גדולה"
        assert nmap["גְּדוֹלִים"] == "גדולים"
    def test_construct_forms_strip_maqaf(self):
        words = {
            "בֵּית": {
                "word": {"nikkud": "בֵּית", "ktiv_male": "בית"},
                "noun_inflection": {
                    "construct_singular": {"nikkud": "בֵּית־", "ktiv_male": "בית"},
                },
            }
        }
        nmap = build_nikkud_map(words)
        assert "בֵּית־" in nmap
        assert "בֵּית" in nmap
    def test_handles_missing_fields(self):
        words = {
            "test": {
                "word": {"nikkud": "טֶסְט", "ktiv_male": "טסט"},
                "conjugation": None,
                "noun_inflection": None,
                "adjective_inflection": None,
            }
        }
        nmap = build_nikkud_map(words)
        assert nmap["טֶסְט"] == "טסט"
    def test_real_words_json_coverage(self):
        words_path = Path(__file__).parent.parent / "data" / "words.json"
        if not words_path.exists():
            pytest.skip("words.json not available")
        with open(words_path, encoding="utf-8") as f:
            words = json.load(f)
        nmap = build_nikkud_map(words)
        assert len(nmap) > 90_000
 class TestResolveTokenFrequency:
    @pytest.fixture()
    def freq_setup(self):
        frequency_lookup.load()
        freq_data = frequency_lookup.get_freq_data()
        words_path = Path(__file__).parent.parent / "data" / "words.json"
        if not words_path.exists():
            pytest.skip("words.json not available")
        with open(words_path, encoding="utf-8") as f:
            words = json.load(f)
        from epub_examples import _build_nikkud_index
        nikkud_map = build_nikkud_map(words)
        nikkud_index = _build_nikkud_index(words)
        return nikkud_map, nikkud_index, freq_data
    def test_tier1_known_mapping(self, freq_setup):
        nikkud_map, nikkud_index, freq_data = freq_setup
        rank = _resolve_token_frequency("אָב", nikkud_map, nikkud_index, freq_data)
        assert rank is not None
        assert rank < 50_000
    def test_tier3_academy_converter(self, freq_setup):
        nikkud_map, nikkud_index, freq_data = freq_setup
        rank = _resolve_token_frequency("שָׁלוֹם", nikkud_map, nikkud_index, freq_data)
        assert rank is not None
        assert rank < 1000
    def test_unknown_token_returns_default(self, freq_setup):
        nikkud_map, nikkud_index, freq_data = freq_setup
        rank = _resolve_token_frequency("קְסַנְתּוֹפּוּלוֹס", nikkud_map, nikkud_index, freq_data)
        assert rank == 50_000
    def test_tier5_ktiv_male_prefix_strip(self, freq_setup):
        nikkud_map, nikkud_index, freq_data = freq_setup
        assert freq_data.get("שלום") is not None
 class TestScoreSentence:
    @pytest.fixture()
    def scoring_setup(self):
        frequency_lookup.load()
        freq_data = frequency_lookup.get_freq_data()
        words_path = Path(__file__).parent.parent / "data" / "words.json"
        if not words_path.exists():
            pytest.skip("words.json not available")
        with open(words_path, encoding="utf-8") as f:
            words = json.load(f)
        from epub_examples import _build_nikkud_index
        nikkud_map = build_nikkud_map(words)
        nikkud_index = _build_nikkud_index(words)
        return nikkud_map, nikkud_index, freq_data
    def test_returns_integer(self, scoring_setup):
        nmap, nidx, freq = scoring_setup
        text = "הוּא הָלַךְ הַבַּיְתָה"
        start = text.index("הָלַךְ")
        end = start + len("הָלַךְ")
        score = score_sentence(text, start, end, nmap, nidx, freq)
        assert isinstance(score, int)
    def test_easy_sentence_scores_lower(self, scoring_setup):
        nmap, nidx, freq = scoring_setup
        easy = "הוּא אָמַר שָׁלוֹם"
        easy_start = easy.index("אָמַר")
        easy_end = easy_start + len("אָמַר")
        hard = "הַפַּרְדֵּס נִשְׁתַּטֵּחַ בַּדַּהֲרָה"
        hard_start = hard.index("נִשְׁתַּטֵּחַ")
        hard_end = hard_start + len("נִשְׁתַּטֵּחַ")
        easy_score = score_sentence(easy, easy_start, easy_end, nmap, nidx, freq)
        hard_score = score_sentence(hard, hard_start, hard_end, nmap, nidx, freq)
        assert easy_score < hard_score
    def test_single_context_token(self, scoring_setup):
        nmap, nidx, freq = scoring_setup
        text = "הוּא טוֹב"
        start = 0
        end = len("הוּא")
        score = score_sentence(text, start, end, nmap, nidx, freq)
        assert isinstance(score, int)
    def test_handles_punctuation(self, scoring_setup):
        nmap, nidx, freq = scoring_setup
        text = '"הוּא טוֹב!"'
        start = text.index("טוֹב")
        end = start + len("טוֹב")
        score = score_sentence(text, start, end, nmap, nidx, freq)
        assert isinstance(score, int)
    def test_splits_on_maqaf(self, scoring_setup):
        nmap, nidx, freq = scoring_setup
        text = "בֵּית־סֵפֶר גָּדוֹל"
        start = text.index("גָּדוֹל")
        end = start + len("גָּדוֹל")
        score = score_sentence(text, start, end, nmap, nidx, freq)
        assert isinstance(score, int)
    def test_no_context_tokens_returns_default(self, scoring_setup):
        nmap, nidx, freq = scoring_setup
        text = "א ב"
        score = score_sentence(text, 0, 1, nmap, nidx, freq)
        assert score == DEFAULT_RANK
Author	SHA1	Message	Date
Sochen	6d2d446ed5	feat: pseudo-frequency for confusables using English word frequency 264 confusable groups where all entries shared the same Hebrew frequency now have differentiated pseudo_frequency values based on English word commonality (hermitdave en_50k.txt). Most common meaning keeps base rank; less common meanings get +100 offset per position. Examples: - אב: "father" (en:194) → 2491, "bud" (en:2963) → 2591 - אח: "brother" (en:300) → 911, "fireplace" (en:9389) → 1011 Builder uses pseudo_frequency for sort order when available. Confusable card definitions now sorted most-common-first. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-03 05:28:30 +00:00
Sochen	f978e5f39a	fix: vet fallback emoji — verb gate + expanded stop list removes 852 bad matches The fallback emoji system (keyword→Unicode char matching at build time) was producing 1,733 matches, many with wrong-sense emoji: - "high, tall" → ⚡ (from "high voltage") - "to cut" → 🥩 (cut of meat) - "city" → 🇻🇦 (Vatican flag) Two fixes: 1. Skip fallback for verbs (meanings starting "to ") — 476 removed 2. Expand _EMOJI_STOP with 100+ polysemous/abstract keywords — 376 more Result: 1733 → 881 fallback matches (49% reduction). The 114 from_pealim emojis (concrete nouns like 🍎 apple, 🐪 camel) are unaffected. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-03 05:17:31 +00:00
Sochen	5f617af4ba	feat: vet emoji assignments — 114 visible, 3 removed, 1 fixed Reviewed all 117 pealim-inherited emoji assignments: - Made 114 correct assignments visible (emoji_visible: true) - Removed: goblet (🏆 is trophy), fitness (🏋 too abstract), red (💄 is lipstick) - Fixed: onion 🌰→🧅 (was chestnut emoji) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-26 23:19:43 +00:00
Sochen	f3496998f5	feat: confusables show ktiv male, emoji/prep stripping fully upstream - Confusables deck front now shows shared ktiv male form instead of nikkud variants joined by "/". Back still shows nikkud with definitions. - Fixed list scraper EMOJI_RE to catch variation selectors (U+FE0F) and ZWJ (U+200D) — cleaned 17 entries with leftover selectors in meaning. - Removed build-time prep extraction fallback (0 entries relied on it). - release.py: fix keeshare field name (API_TOKEN → password). Closes: Pealim #11 (emoji/prep upstream), Pealim #16 (confusables ktiv male) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-21 02:19:03 +00:00
Sochen	138acb06d8	bump RELEASE_TAG to v0.20 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-15 14:09:19 +00:00
Sochen	0a85291975	feat(v0.20): adaptive sentence difficulty scoring — 3163 scored cloze sentences Replaces length-based sentence selection with frequency-based difficulty scoring. Easiest sentences (most common context words) become cloze candidates. Score range: 1-50000, median=179. Coverage: 3163/3325 cloze entries scored. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-15 13:35:23 +00:00
Sochen	14d567a261	schema: add difficulty_score field + update spec with MIN_WORDS=3 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-15 13:30:13 +00:00
Sochen	8b24d0fd26	Task 7: Add integration tests for frequency-based sentence scoring Tests verify that update_words_json produces a cloze with `difficulty_score`, that vetted sentences are sorted by difficulty, and that the easiest sentence becomes the cloze candidate. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-15 13:29:25 +00:00
Sochen	272a2a080d	Task 7: Replace length-based scoring with frequency-based scoring in update_words_json - Import `sentence_difficulty.build_nikkud_map`, `score_sentence`, and `frequency_lookup` - Build `nikkud_index`, `nikkud_map`, `freq_data` once before the per-word loop - Replace `_score()` closure with call to `score_sentence()` (median context-word rank) - Store `difficulty_score` in cloze dict for downstream use Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-15 13:29:22 +00:00
Sochen	fb12f806a8	feat: add sentence_difficulty module with 5-tier frequency scoring Implements build_nikkud_map(), _resolve_token_frequency(), and score_sentence() for v0.20 adaptive cloze sentence selection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-15 13:23:21 +00:00
Sochen	00fba934fb	feat(epub_examples): export try_strip_prefix as public alias Exposes _try_strip_prefix under a public name so the upcoming sentence_difficulty module can reuse Hebrew prefix stripping logic without duplicating it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-15 13:18:55 +00:00
Sochen	d2a7c9d483	feat(epub_examples): lower MIN_WORDS from 4 to 3 Hebrew is more concise than English — 3-word sentences are valid candidates for the example sentence pool. Expands the pool for the upcoming adaptive sentence difficulty ranking. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-15 13:18:44 +00:00
Sochen	d0f4aea58d	feat(frequency_lookup): add get_freq_data() for batch frequency access Exposes the full word→rank dict for use by the upcoming sentence_difficulty module without requiring per-word lookups. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-15 13:18:11 +00:00
Sochen	b3ea086e85	v0.20 design spec + nikkud-to-ktiv-male converter Add Academy-rules-based nikkud→ktiv male converter (91.6% accuracy vs 77.2% for strip_nikkud) and v0.20 adaptive sentence difficulty cloze design spec. The converter enables frequency-based sentence scoring by properly resolving nikkud tokens to their ktiv male forms for frequency corpus lookup. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-15 12:57:14 +00:00
Sochen	af186e2030	Sprint 17: homograph example dedup + plural audio + prep extraction - Homograph collision fix: _deduplicate_confusable_examples() clears shared examples from less-common confusable group members (36 entries fixed). Keeps examples only on highest-frequency meaning. - Plural deck audio: wired up PluralAudio field in apkg_builder.py, downloaded 613 plural audio files from pealim.com for all deck entries. - Prep extraction upstream: moved Hebrew preposition parsing from build time into list/detail scrapers (SCHEMA.yaml prep field added). - Validation: new no_shared_confusable_examples check in validate_data.py - Tests: 9 new unit tests for confusable deduplication (98 total) - Release: v0.19 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 21:51:35 +00:00