v0.20 design spec + nikkud-to-ktiv-male converter

Add Academy-rules-based nikkud→ktiv male converter (91.6% accuracy vs 77.2% for strip_nikkud) and v0.20 adaptive sentence difficulty cloze design spec. The converter enables frequency-based sentence scoring by properly resolving nikkud tokens to their ktiv male forms for frequency corpus lookup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 12:57:14 +00:00
commit b3ea086e85
parent af186e2030
2 changed files with 333 additions and 0 deletions
--- a/docs/superpowers/specs/2026-03-15-adaptive-sentence-difficulty-design.md
+++ b/docs/superpowers/specs/2026-03-15-adaptive-sentence-difficulty-design.md
@@ -0,0 +1,148 @@
 # Adaptive Sentence Difficulty Cloze — v0.20 Design Spec
 **Date:** 2026-03-15
 **Status:** Approved
 **Release:** v0.20
 ## Problem
 Cloze cards currently rank candidate sentences by length alone: any sentence of 6 to 12 words scores best, and sentences outside that band are ranked by their distance from 9 words. This ignores whether the surrounding context words are familiar to the learner. A sentence full of rare words is harder than one with common words, regardless of length.
 ## Solution
 Replace the length-based `_score()` function in `epub_examples.py` with a **frequency-based difficulty score**. The easiest sentence (most common context words) becomes the cloze. All vetted sentences remain on the card, ordered easy→hard.
 ## Scoring Pipeline
 ### Token Frequency Lookup (5-tier)
 Given a nikkud sentence token, resolve its frequency rank:
 1. **Known mapping** — look up token in the nikkud→ktiv_male map built from words.json headwords, conjugations, and inflections (94k mappings). If found, look up the ktiv_male in the frequency data.
 2. **Nikkud prefix stripping** — use `_try_strip_prefix()` to strip validated Hebrew prefixes (בהוכלמש), then resolve the remainder via the known mapping.
 3. **Academy rules converter** — apply `nikkud_to_ktiv_male.convert()` (91.6% accuracy) to produce ktiv_male, look up in frequency data.
 4. **strip_nikkud fallback** — use `helpers.strip_nikkud()` as a lossy fallback.
 5. **Ktiv_male prefix stripping** — strip 1-2 character Hebrew prefixes from the converted/stripped form and look up the stem.
 Tokens not found in any tier are assigned a default high rank (50,000).
 **Coverage:** ~93% of example sentence tokens resolve to a frequency rank (measured empirically on 7,588 sentences).
 **Frequency data source:** Use `frequency_lookup.py` which auto-selects `frequency_clean.json` when available, falling back to `frequency_cache.json`.
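
The tier order above can be sketched as a chain of lookups that returns on the first hit. This is a minimal illustration, not the planned `_resolve_token_frequency()`: the name `resolve_rank`, the helper `_stems`, and the callables passed in for prefix stripping and conversion are hypothetical stand-ins, and `strip_marks` only approximates `helpers.strip_nikkud()`. It also assumes the frequency data is a plain `{ktiv_male: rank}` dict, which the spec does not state.

```python
import unicodedata
from typing import Callable, Optional

DEFAULT_RANK = 50_000        # rank for tokens that no tier resolves
PREFIX_LETTERS = "בהוכלמש"   # prefix letters listed in tier 2


def strip_marks(token: str) -> str:
    """Stand-in for helpers.strip_nikkud(): drop every combining mark."""
    return "".join(ch for ch in token if unicodedata.category(ch) != "Mn")


def _stems(form: str) -> list[str]:
    """Tier 5 candidates: the form minus a 1- or 2-letter prefix."""
    return [
        form[n:]
        for n in (1, 2)
        if len(form) > n and all(ch in PREFIX_LETTERS for ch in form[:n])
    ]


def resolve_rank(
    token: str,
    nikkud_map: dict[str, str],
    freq: dict[str, int],
    strip_prefix: Callable[[str], Optional[str]],
    convert: Callable[[str], str],
) -> int:
    """Return the first frequency rank found, trying the tiers in order."""
    # Tier 1: known nikkud -> ktiv male mapping.
    plain = nikkud_map.get(token)
    if plain in freq:
        return freq[plain]
    # Tier 2: strip a nikkud prefix, then retry the known mapping.
    stem = strip_prefix(token)
    if stem is not None and nikkud_map.get(stem) in freq:
        return freq[nikkud_map[stem]]
    # Tier 3 (rules converter), then tier 4 (lossy mark stripping).
    candidates = [convert(token), strip_marks(token)]
    for form in candidates:
        if form in freq:
            return freq[form]
    # Tier 5: drop a short prefix from the converted/stripped forms.
    for form in candidates:
        for stem_form in _stems(form):
            if stem_form in freq:
                return freq[stem_form]
    return DEFAULT_RANK
```

Passing the prefix stripper and converter as arguments keeps the sketch testable without the real modules; the actual function is specified to take `nikkud_index` instead.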
 ### Sentence Difficulty Score
 For a given word's candidate sentence:
 1. Tokenize: split on whitespace, strip punctuation (.,!?;:"'"״׳–—()[]{}), split on maqaf (־).
 2. Exclude the target word's token using `cloze_word_start`/`cloze_word_end` offsets from the matched sentence.
 3. For each remaining token (length >= 2), resolve its frequency rank via the 5-tier pipeline.
 4. **Score = median frequency rank of context tokens.**
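 Lower score = easier (context words are more common). The median is used rather than the mean so that a single outlier, such as one rare proper noun or one unresolved token at the default rank of 50,000, does not dominate the score.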
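
The four scoring steps could look roughly like the sketch below. Several details are assumptions rather than part of the spec: `rank_of` is a hypothetical callable standing in for the 5-tier lookup, token length is counted in code points (so nikkud marks count toward the `>= 2` threshold), an empty context falls back to the default rank, and `statistics.median_low` is used so that the score stays an integer when the number of context tokens is even.

```python
import re
import statistics
from typing import Callable

# Punctuation stripped from token edges (quotes, gershayim, geresh, dashes, brackets).
PUNCT = ".,!?;:\"'\u05f4\u05f3\u2013\u2014()[]{}"
DEFAULT_RANK = 50_000


def context_tokens(sentence: str, target_start: int, target_end: int) -> list[str]:
    """Tokenize and drop the chunk(s) overlapping the cloze target span."""
    tokens = []
    # A chunk is a run of characters that is neither whitespace nor maqaf (U+05BE).
    for match in re.finditer(r"[^\s\u05be]+", sentence):
        if match.start() < target_end and match.end() > target_start:
            continue  # overlaps the target word: not context
        token = match.group().strip(PUNCT)
        if len(token) >= 2:
            tokens.append(token)
    return tokens


def score_sentence(
    sentence: str,
    target_start: int,
    target_end: int,
    rank_of: Callable[[str], int],
) -> int:
    """Median frequency rank of the context tokens (lower means easier)."""
    ranks = [rank_of(t) for t in context_tokens(sentence, target_start, target_end)]
    if not ranks:
        return DEFAULT_RANK  # assumption: no context is treated as hardest
    return int(statistics.median_low(ranks))
```

For example, with ranks `[1, 5, 300, 700, 900]` the score is 300, and the single rank of 900 has no more influence than a rank of 701 would.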
 ### Integration Point
 The scoring integrates into `epub_examples.py`'s existing `_score()` closure inside `update_words_json()` (line ~677). Currently:
 ```python
 def _score(s: dict) -> tuple[int,]:
    wc = s["word_count"]
    length_score = abs(wc - 9) if not (6 <= wc <= 12) else 0
    return (length_score,)
 ```
 New scoring replaces length with frequency-based difficulty. The `_score` function gains access to the frequency pipeline via closure over the nikkud_map, nikkud_index, and freq_data built once at the start of `update_words_json()`.
 **Behavioral change:** Because `pool.sort(key=_score)` determines which 3 sentences are selected as `best = pool[:3]`, changing the scoring function changes **which sentences are selected**, not just their order. This is intentional — we want the easiest sentences as cloze candidates, not the closest-to-9-words ones. Existing cloze GUIDs will be preserved when the same sentence text is re-selected; entries where a different sentence wins will get new GUIDs.
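
A sketch of the selection and GUID behavior described above, under stated assumptions: `select_examples`, `carry_over_guid`, `difficulty_of`, and `make_guid` are hypothetical names, and the real closure lives inside `update_words_json()` with access to the frequency pipeline.

```python
from typing import Callable, Optional


def select_examples(
    pool: list[dict],
    difficulty_of: Callable[[dict], int],
    keep: int = 3,
) -> list[dict]:
    """Sort candidates easiest-first and keep the top `keep`."""
    def _score(s: dict) -> tuple[int]:
        # Same tuple-shaped key as the current _score(), new criterion.
        return (difficulty_of(s),)

    # sorted() is stable, so equally difficult sentences keep their pool order.
    return sorted(pool, key=_score)[:keep]


def carry_over_guid(
    new_cloze: dict,
    old_cloze: Optional[dict],
    make_guid: Callable[[], str],
) -> dict:
    """Reuse the old GUID only when the same sentence text wins again."""
    if old_cloze is not None and old_cloze.get("text") == new_cloze["text"]:
        guid = old_cloze["cloze_guid"]
    else:
        guid = make_guid()
    return {**new_cloze, "cloze_guid": guid}
```

Because the cut to three happens after sorting, a sentence that ranked fourth by length can now displace one that ranked first.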
 ## Data Model Changes
 ### words.json
 The `examples.cloze` dict (single sentence) gains an optional `difficulty_score` field:
 ```json
 {
  "examples": {
    "vetted": [
      {"text": "...", "source": "...", "match_method": "..."},
      {"text": "...", "source": "...", "match_method": "..."}
    ],
    "cloze": {
      "text": "...",
      "cloze_word_start": 5,
      "cloze_word_end": 10,
      "cloze_hint": null,
      "cloze_guid": "abc123",
      "difficulty_score": 234
    }
  }
 }
 ```
 The vetted list is also sorted by difficulty (easiest first), so the card back shows sentences in pedagogically useful order.
 ### SCHEMA.yaml
 Add `difficulty_score` as optional integer field under `examples.cloze`.
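
The format of `SCHEMA.yaml` is not shown here, so the following is only a Python-side illustration of what "optional integer" means for a cloze dict. The function name `check_difficulty_score` is hypothetical.

```python
def check_difficulty_score(cloze: dict) -> None:
    """Raise ValueError unless difficulty_score is absent or a plain int."""
    if "difficulty_score" not in cloze:
        return  # optional field: entries written before v0.20 lack it
    value = cloze["difficulty_score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"difficulty_score must be an integer, got {value!r}")
```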
 ## Implementation Scope
 ### New file: `sentence_difficulty.py`
 Standalone module for sentence scoring. No pipeline step — called by `epub_examples.py`.
 - `score_sentence(sentence_text: str, target_start: int, target_end: int, nikkud_map: dict, nikkud_index: dict, freq_data: dict) -> int` — returns median context frequency rank. Uses `target_start`/`target_end` character offsets to exclude the cloze target token.
 - `build_nikkud_map(words: dict) -> dict[str, str]` — builds nikkud→ktiv_male lookup from words.json (headwords + conjugation forms + noun inflections). Returns `{nikkud_form: ktiv_male_form}`. Implementation note: should share iteration logic with `epub_examples._build_nikkud_index()` or derive from its output to avoid duplicating the traversal of words.json forms.
 - `_resolve_token_frequency(token: str, nikkud_map: dict, nikkud_index: dict, freq_data: dict) -> int` — the 5-tier lookup. Uses `_try_strip_prefix` from epub_examples (made importable by removing underscore or adding a public wrapper).
 ### Modified files
 - **`epub_examples.py`**:
  - Import `sentence_difficulty.score_sentence` and `sentence_difficulty.build_nikkud_map`
  - In `update_words_json()`: build nikkud_map and load freq_data once at start (before per-word loop)
  - Replace `_score()` closure with frequency-based scoring that calls `score_sentence()`
  - Sort vetted list by difficulty score (easiest first)
  - Store `difficulty_score` in the cloze dict
  - Make `_try_strip_prefix` importable (rename to `try_strip_prefix` or add public alias)
 - **`frequency_lookup.py`** — add `get_freq_data() -> dict` public accessor to expose the loaded frequency dict (avoids accessing private `_freq` directly)
 - **`SCHEMA.yaml`** — add `difficulty_score` field
 - **`run.py`** — no changes; scoring happens inside epub_examples step
 ### Not modified
 - **`apkg_builder.py`** — reads cloze as-is; vetted order is already respected
 - **`nikkud_to_ktiv_male.py`** — used as-is
 - **Card templates** — no changes needed
 ## Dependencies
 - `nikkud_to_ktiv_male.convert()` — Academy rules converter (already written)
 - `epub_examples._try_strip_prefix()` / `_build_nikkud_index()` — nikkud prefix stripping and index
 - `frequency_lookup.py` — loads frequency data (auto-selects clean vs cache)
 - `helpers.strip_nikkud()` — fallback converter
 ## Validation
 - **Unit tests** for `score_sentence()` with known easy/hard sentences
 - **Unit tests** for `_resolve_token_frequency()` covering all 5 tiers
 - **Integration test**: verify cloze selection picks easiest sentence, vetted list is sorted
 - **Spot check**: manually review 10 words with 3+ sentences to confirm ordering
 - **Regression**: existing tests pass, GUID coverage unchanged, deck validates
 ## Constraints
 - `examples.cloze` remains a single dict (not converted to list)
 - No new Anki card types or fields
 - No runtime JS in Anki cards
 - No network calls during scoring
 - `difficulty_score` is informational metadata; card rendering doesn't depend on it
 - Existing cloze GUIDs preserved when the same sentence is re-selected
 ## Scope Exclusions (Future Work)
 - **Pronominal suffix stripping** — would improve the ~7% unscored token rate; deferred (PROJECT_NOTES.md)
 - **Kamatz katan disambiguation** — requires morphological analysis; accepted limitation
 - **Per-learner adaptive difficulty** — requires Anki plugin; out of scope for static deck
 - **Multiple cloze sentences per card** — would require schema migration to list; deferred
--- a/nikkud_to_ktiv_male.py
+++ b/nikkud_to_ktiv_male.py
@@ -0,0 +1,185 @@
 """Convert nikkud (vocalized) Hebrew to ktiv male (plene spelling).
 Implements Hebrew Academy rules for matres lectionis insertion:
 - Rule A: U vowel (kubutz) → always insert vav
 - Rule B: O vowel (holam on non-vav) → insert vav
 - Rule C: I vowel (hiriq) → insert yod (conditionally)
 - Rule D: E vowel (tsere) → insert yod (limited cases)
 - Rule E/F: Consonantal vav/yod doubling
 Reference: https://hebrew-academy.org.il/topic/hahlatot/missingvocalizationspelling/
 """
 import unicodedata
 # Hebrew nikkud code points
 SHVA = "\u05b0"
 HATAF_SEGOL = "\u05b1"
 HATAF_PATAH = "\u05b2"
 HATAF_KAMATZ = "\u05b3"
 HIRIQ = "\u05b4"
 TSERE = "\u05b5"
 SEGOL = "\u05b6"
 PATAH = "\u05b7"
 KAMATZ = "\u05b8"
 HOLAM = "\u05b9"
 HOLAM_HASER = "\u05ba"
 KUBUTZ = "\u05bb"
 DAGESH = "\u05bc"
 METEG = "\u05bd"
 RAFE = "\u05bf"
 SHIN_DOT = "\u05c1"
 SIN_DOT = "\u05c2"
 VAV = "ו"
 YOD = "י"
 MAQAF = "־"
 VOWELS = {SHVA, HATAF_SEGOL, HATAF_PATAH, HATAF_KAMATZ, HIRIQ, TSERE, SEGOL, PATAH, KAMATZ, HOLAM, HOLAM_HASER, KUBUTZ}
 NIKKUD_MARKS = VOWELS | {DAGESH, METEG, RAFE, SHIN_DOT, SIN_DOT}
 def _parse_segments(text: str) -> list[tuple[str, list[str]]]:
    """Parse nikkud text into (character, [marks]) segments."""
    segments: list[tuple[str, list[str]]] = []
    cur_char: str | None = None
    cur_marks: list[str] = []
    for ch in text:
        if unicodedata.category(ch) == "Mn":
            cur_marks.append(ch)
        else:
            if cur_char is not None:
                segments.append((cur_char, cur_marks))
            cur_char = ch
            cur_marks = []
    if cur_char is not None:
        segments.append((cur_char, cur_marks))
    return segments
 def _get_vowel(marks: list[str]) -> str | None:
    """Extract the vowel mark from a list of combining marks."""
    for m in marks:
        if m in VOWELS:
            return m
    return None
 def _has_dagesh(marks: list[str]) -> bool:
    return DAGESH in marks
 def _is_hebrew_letter(ch: str) -> bool:
    return "\u05d0" <= ch <= "\u05ea"
 def convert(text: str) -> str:
    """Convert nikkud Hebrew text to ktiv male.
    Strips all nikkud marks and inserts matres lectionis (vav/yod)
    according to Hebrew Academy spelling rules. Rules C and D measure
    positions from the start and end of ``text``, so pass one word at a time.
    """
    segments = _parse_segments(text)
    result: list[str] = []
    for i, (ch, marks) in enumerate(segments):
        if not _is_hebrew_letter(ch):
            # Non-Hebrew character: output as-is (no marks)
            result.append(ch)
            continue
        vowel = _get_vowel(marks)
        has_dag = _has_dagesh(marks)
        # Output the base letter (strip all nikkud marks)
        result.append(ch)
        # --- Rule A: U vowel (kubutz) → always add vav ---
        if vowel == KUBUTZ:
            result.append(VAV)
            continue
        # --- Shuruk detection ---
        # Vav with dagesh and no other vowel = shuruk (already a mater)
        # Vav with dagesh AND a vowel = consonantal vav (doubled, not a mater)
        # If letter is vav with dagesh only → it's shuruk, already output
        if ch == VAV and has_dag and vowel is None:
            # Shuruk: vav IS the mater lectionis, already output
            continue
        # --- Rule B: O vowel (holam) → add vav ---
        if vowel in (HOLAM, HOLAM_HASER):
            if ch != VAV:
                # Exception: holam before aleph (pe-aleph verbs) — no vav
                # e.g., תֹּאבַד→תאבד, יֹאבַד→יאבד, נֹאבַד→נאבד
                next_is_aleph = i + 1 < len(segments) and segments[i + 1][0] == "א"
                if not next_is_aleph:
                    result.append(VAV)
            # If ch IS vav (holam male), vav already output
            continue
        # --- Rule C: I vowel (hiriq) → conditionally add yod ---
        if vowel == HIRIQ:
            if ch == YOD:
                # Yod already present, don't double
                continue
            # Don't insert yod if next letter is already yod
            if i + 1 < len(segments) and segments[i + 1][0] == YOD:
                continue
            # Rule C Section 3: Don't add yod if the NEXT consonant
            # has shva (indicating shva nach on that consonant)
            add_yod = True
            if i + 1 < len(segments):
                next_ch, next_marks = segments[i + 1]
                next_vowel = _get_vowel(next_marks)
                # Shva on next consonant = shva nach → don't add yod
                # UNLESS next consonant also has dagesh (= shva na / doubled)
                next_has_dagesh = _has_dagesh(next_marks)
                if next_vowel == SHVA and not next_has_dagesh:
                    add_yod = False
                # No vowel on next consonant (word-final) = closed syllable
                # → don't add yod (e.g., suffix -תי -נו -תם)
                elif next_vowel is None and _is_hebrew_letter(next_ch):
                    # Check if this is truly word-final or next-to-last
                    remaining_letters = sum(1 for j in range(i + 1, len(segments)) if _is_hebrew_letter(segments[j][0]))
                    if remaining_letters <= 2:
                        # Short suffix like תי, נו — don't add yod
                        add_yod = False
            if add_yod:
                result.append(YOD)
            continue
        # --- Rule D: E vowel (tsere/segol) → generally NO yod ---
        # Exception (b): tsere before guttural/resh gets yod ONLY
        # in word-initial position (dagesh substitution in Hif'il/noun patterns)
        # e.g., הֵחֵל→היחל, תֵּאָבֵד→תיאבד, הֵרִיעַ→היריע
        # but NOT mid-word: מְסַפֵּר→מספר, מְעַבֵּר→מעבר
        if vowel == TSERE:
            add_yod = False
            if i + 1 < len(segments):
                next_ch = segments[i + 1][0]
                if next_ch in "אהחער":
                    # Only at word-initial (pos 0) or after prefix (pos 1)
                    # where dagesh substitution applies
                    hebrew_pos = sum(1 for j in range(i) if _is_hebrew_letter(segments[j][0]))
                    if hebrew_pos <= 1:
                        add_yod = True
            if add_yod:
                result.append(YOD)
            continue
        # All other vowels (patah, kamatz, segol, shva, hataf-*):
        # No mater lectionis insertion needed
    return "".join(result)
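
To see the segment parsing and Rule A in isolation, here is a reduced sketch that is separate from the module above. It reimplements only the grouping of combining marks and the kubutz rule; `parse_segments` and `rule_a_only` are illustrative names, and every other rule is deliberately omitted.

```python
import unicodedata

KUBUTZ = "\u05bb"
VAV = "\u05d5"


def parse_segments(text: str) -> list[tuple[str, list[str]]]:
    """Group each base character with the combining marks that follow it."""
    segments: list[tuple[str, list[str]]] = []
    for ch in text:
        if unicodedata.category(ch) == "Mn":
            if segments:  # marks before any base character are dropped
                segments[-1][1].append(ch)
        else:
            segments.append((ch, []))
    return segments


def rule_a_only(word: str) -> str:
    """Strip all marks and insert vav after any letter carrying kubutz."""
    out: list[str] = []
    for ch, marks in parse_segments(word):
        out.append(ch)
        if KUBUTZ in marks:
            out.append(VAV)
    return "".join(out)


# shin + shin dot + kubutz, lamed + shva, het + kamatz, final nun
vocalized = "\u05e9\u05c1\u05bb\u05dc\u05b0\u05d7\u05b8\u05df"
print(rule_a_only(vocalized))  # prints the five letters shin, vav, lamed, het, final nun
```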