diff --git a/docs/superpowers/specs/2026-03-15-adaptive-sentence-difficulty-design.md b/docs/superpowers/specs/2026-03-15-adaptive-sentence-difficulty-design.md
new file mode 100644
index 0000000..7ab5e58
--- /dev/null
+++ b/docs/superpowers/specs/2026-03-15-adaptive-sentence-difficulty-design.md
@@ -0,0 +1,148 @@
+# Adaptive Sentence Difficulty Cloze — v0.20 Design Spec
+
+**Date:** 2026-03-15
+**Status:** Approved
+**Release:** v0.20
+
+## Problem
+
+Cloze cards currently select the example sentence closest to 9 words in length. This ignores whether the surrounding context words are familiar to the learner. A sentence full of rare words is harder than one with common words, regardless of length.
+
+## Solution
+
+Replace the length-based `_score()` function in `epub_examples.py` with a **frequency-based difficulty score**. The easiest sentence (most common context words) becomes the cloze. All vetted sentences remain on the card, ordered easy→hard.
+
+## Scoring Pipeline
+
+### Token Frequency Lookup (5-tier)
+
+Given a nikkud sentence token, resolve its frequency rank:
+
+1. **Known mapping** — look up token in the nikkud→ktiv_male map built from words.json headwords, conjugations, and inflections (94k mappings). If found, look up the ktiv_male in the frequency data.
+2. **Nikkud prefix stripping** — use `_try_strip_prefix()` to strip validated Hebrew prefixes (בהוכלמש), then resolve the remainder via the known mapping.
+3. **Academy rules converter** — apply `nikkud_to_ktiv_male.convert()` (91.6% accuracy) to produce ktiv_male, look up in frequency data.
+4. **strip_nikkud fallback** — use `helpers.strip_nikkud()` as a lossy fallback.
+5. **Ktiv_male prefix stripping** — strip 1-2 character Hebrew prefixes from the converted/stripped form and look up the stem.
+
+Tokens not found in any tier are assigned a default high rank (50,000).
+
+**Coverage:** ~93% of example sentence tokens resolve to a frequency rank (measured empirically on 7,588 sentences).
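
The tiered lookup above can be sketched roughly as follows. This is a minimal illustration and not the module's actual code: `resolve_rank`, `strip_marks`, and `prefix_stems` are hypothetical names, `strip_marks` only stands in for `helpers.strip_nikkud()`, the `convert` argument stands in for `nikkud_to_ktiv_male.convert()`, and `strip_prefix` stands in for `_try_strip_prefix()`, whose real signature the spec does not show.

```python
import unicodedata
from typing import Callable, Optional

DEFAULT_RANK = 50_000            # rank the spec assigns to unresolved tokens
PREFIX_LETTERS = set("בהוכלמש")   # prefix letters listed in the spec


def strip_marks(token: str) -> str:
    """Hypothetical stand-in for helpers.strip_nikkud(): drop combining marks."""
    return "".join(ch for ch in token if unicodedata.category(ch) != "Mn")


def prefix_stems(word: str) -> list[str]:
    """Tier 5 candidates: the word minus a 1- or 2-letter prefix (stem keeps >= 2 letters)."""
    return [
        word[n:]
        for n in (1, 2)
        if len(word) > n + 1 and all(ch in PREFIX_LETTERS for ch in word[:n])
    ]


def resolve_rank(
    token: str,
    nikkud_map: dict[str, str],
    freq: dict[str, int],
    convert: Callable[[str], str] = strip_marks,  # assumption: real code passes the rules converter
    strip_prefix: Optional[Callable[[str], Optional[str]]] = None,  # assumed shape of tier 2 helper
) -> int:
    # Tier 1: known nikkud -> ktiv_male mapping.
    known = nikkud_map.get(token)
    if known in freq:
        return freq[known]
    # Tier 2: strip a nikkud prefix, then retry the known mapping.
    if strip_prefix is not None:
        remainder = strip_prefix(token)
        known = nikkud_map.get(remainder) if remainder else None
        if known in freq:
            return freq[known]
    # Tier 3 (rules converter), then tier 4 (lossy mark stripping).
    plain_forms = [convert(token), strip_marks(token)]
    for form in plain_forms:
        if form in freq:
            return freq[form]
    # Tier 5: strip a short prefix from the unvocalized forms.
    for form in plain_forms:
        for stem in prefix_stems(form):
            if stem in freq:
                return freq[stem]
    return DEFAULT_RANK
```

Passing the helpers in as callables keeps the sketch free of imports from `epub_examples.py`; whether the real module injects them or imports them directly is a design choice the spec leaves open.
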
+
+**Frequency data source:** Use `frequency_lookup.py` which auto-selects `frequency_clean.json` when available, falling back to `frequency_cache.json`.
+
+### Sentence Difficulty Score
+
+For a given word's candidate sentence:
+
+1. Tokenize: split on whitespace, strip punctuation (.,!?;:"'"״׳–—()[]{}), split on maqaf (־).
+2. Exclude the target word's token using `cloze_word_start`/`cloze_word_end` offsets from the matched sentence.
+3. For each remaining token (length >= 2), resolve its frequency rank via the 5-tier pipeline.
+4. **Score = median frequency rank of context tokens.**
+
+Lower score = easier (context words are more common). Median resists outliers (one rare proper noun shouldn't dominate).
+
+### Integration Point
+
+The scoring integrates into `epub_examples.py`'s existing `_score()` closure inside `update_words_json()` (line ~677). Currently:
+
+```python
+def _score(s: dict) -> tuple[int,]:
+    wc = s["word_count"]
+    length_score = abs(wc - 9) if not (6 <= wc <= 12) else 0
+    return (length_score,)
+```
+
+New scoring replaces length with frequency-based difficulty. The `_score` function gains access to the frequency pipeline via closure over the nikkud_map, nikkud_index, and freq_data built once at the start of `update_words_json()`.
+
+**Behavioral change:** Because `pool.sort(key=_score)` determines which 3 sentences are selected as `best = pool[:3]`, changing the scoring function changes **which sentences are selected**, not just their order. This is intentional — we want the easiest sentences as cloze candidates, not the closest-to-9-words ones. Existing cloze GUIDs will be preserved when the same sentence text is re-selected; entries where a different sentence wins will get new GUIDs.
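
The four scoring steps under "Sentence Difficulty Score" might look roughly like this. It is a sketch under stated assumptions, not the planned `score_sentence()`: `context_tokens` and `difficulty` are hypothetical names, the whole whitespace-delimited chunk overlapping the target offsets is dropped (the spec does not say what happens to a maqaf-joined neighbour), and an empty context falls back to the default rank of 50,000, which the spec also leaves unspecified.

```python
import re
from statistics import median
from typing import Callable

PUNCT = ".,!?;:\"'\u05f4\u05f3\u2013\u2014()[]{}"  # punctuation set from the spec
MAQAF = "\u05be"                                    # Hebrew hyphen joining words
DEFAULT_RANK = 50_000                               # assumed fallback for an empty context


def context_tokens(text: str, target_start: int, target_end: int) -> list[str]:
    """Tokenize the sentence and leave out the cloze target."""
    tokens: list[str] = []
    for chunk in re.finditer(r"\S+", text):
        # Assumption: skip any whitespace-delimited chunk overlapping the target span.
        if chunk.start() < target_end and target_start < chunk.end():
            continue
        # Strip punctuation, then split on maqaf, as in step 1 of the spec.
        for part in chunk.group().strip(PUNCT).split(MAQAF):
            if len(part) >= 2:  # step 3: ignore one-character tokens
                tokens.append(part)
    return tokens


def difficulty(
    text: str,
    target_start: int,
    target_end: int,
    rank_of: Callable[[str], int],
) -> float:
    """Median frequency rank of the context tokens; lower means easier."""
    ranks = [rank_of(tok) for tok in context_tokens(text, target_start, target_end)]
    return median(ranks) if ranks else DEFAULT_RANK
```

With an even number of context tokens `statistics.median` averages the two middle ranks, so the result can be fractional; storing it in an integer `difficulty_score` field would then need an explicit rounding rule.
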
+
+## Data Model Changes
+
+### words.json
+
+The `examples.cloze` dict (single sentence) gains an optional `difficulty_score` field:
+
+```json
+{
+  "examples": {
+    "vetted": [
+      {"text": "...", "source": "...", "match_method": "..."},
+      {"text": "...", "source": "...", "match_method": "..."}
+    ],
+    "cloze": {
+      "text": "...",
+      "cloze_word_start": 5,
+      "cloze_word_end": 10,
+      "cloze_hint": null,
+      "cloze_guid": "abc123",
+      "difficulty_score": 234
+    }
+  }
+}
+```
+
+The vetted list is also sorted by difficulty (easiest first), so the card back shows sentences in pedagogically useful order.
+
+### SCHEMA.yaml
+
+Add `difficulty_score` as optional integer field under `examples.cloze`.
+
+## Implementation Scope
+
+### New file: `sentence_difficulty.py`
+
+Standalone module for sentence scoring. No pipeline step — called by `epub_examples.py`.
+
+- `score_sentence(sentence_text: str, target_start: int, target_end: int, nikkud_map: dict, nikkud_index: dict, freq_data: dict) -> int` — returns median context frequency rank. Uses `target_start`/`target_end` character offsets to exclude the cloze target token.
+- `build_nikkud_map(words: dict) -> dict[str, str]` — builds nikkud→ktiv_male lookup from words.json (headwords + conjugation forms + noun inflections). Returns `{nikkud_form: ktiv_male_form}`. Implementation note: should share iteration logic with `epub_examples._build_nikkud_index()` or derive from its output to avoid duplicating the traversal of words.json forms.
+- `_resolve_token_frequency(token: str, nikkud_map: dict, nikkud_index: dict, freq_data: dict) -> int` — the 5-tier lookup. Uses `_try_strip_prefix` from epub_examples (made importable by removing underscore or adding a public wrapper).
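
As a rough illustration of how a caller such as `update_words_json()` might consume a sentence scorer, the sketch below ranks a candidate pool, keeps the three easiest sentences, and annotates the easiest one with `difficulty_score`. The helper name `select_examples`, the `score` callable, and building the cloze dict by copying the sentence dict are all assumptions for the example; the real cloze entry also carries the offsets, hint, and GUID shown in the JSON above.

```python
from typing import Callable, Optional


def select_examples(
    pool: list[dict],
    score: Callable[[dict], float],
    keep: int = 3,
) -> tuple[list[dict], Optional[dict]]:
    """Return (vetted, cloze): the `keep` easiest sentences, easiest first,
    plus a copy of the easiest one annotated with its difficulty score."""
    scored = [(score(s), s) for s in pool]  # score each sentence exactly once
    scored.sort(key=lambda pair: pair[0])   # list.sort is stable: ties keep pool order
    best = scored[:keep]
    if not best:
        return [], None
    top_score, top = best[0]
    # Copy rather than mutate, so the vetted entries stay free of the new field.
    cloze = {**top, "difficulty_score": round(top_score)}
    return [s for _, s in best], cloze
```

Scoring once and sorting the `(score, sentence)` pairs avoids re-running the frequency lookup inside the sort key and again when the score is stored.
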
+
+### Modified files
+
+- **`epub_examples.py`**:
+  - Import `sentence_difficulty.score_sentence` and `sentence_difficulty.build_nikkud_map`
+  - In `update_words_json()`: build nikkud_map and load freq_data once at start (before per-word loop)
+  - Replace `_score()` closure with frequency-based scoring that calls `score_sentence()`
+  - Sort vetted list by difficulty score (easiest first)
+  - Store `difficulty_score` in the cloze dict
+  - Make `_try_strip_prefix` importable (rename to `try_strip_prefix` or add public alias)
+- **`frequency_lookup.py`** — add `get_freq_data() -> dict` public accessor to expose the loaded frequency dict (avoids accessing private `_freq` directly)
+- **`SCHEMA.yaml`** — add `difficulty_score` field
+- **`run.py`** — no changes; scoring happens inside epub_examples step
+
+### Not modified
+
+- **`apkg_builder.py`** — reads cloze as-is; vetted order is already respected
+- **`nikkud_to_ktiv_male.py`** — used as-is
+- **Card templates** — no changes needed
+
+## Dependencies
+
+- `nikkud_to_ktiv_male.convert()` — Academy rules converter (already written)
+- `epub_examples._try_strip_prefix()` / `_build_nikkud_index()` — nikkud prefix stripping and index
+- `frequency_lookup.py` — loads frequency data (auto-selects clean vs cache)
+- `helpers.strip_nikkud()` — fallback converter
+
+## Validation
+
+- **Unit tests** for `score_sentence()` with known easy/hard sentences
+- **Unit tests** for `_resolve_token_frequency()` covering all 5 tiers
+- **Integration test**: verify cloze selection picks easiest sentence, vetted list is sorted
+- **Spot check**: manually review 10 words with 3+ sentences to confirm ordering
+- **Regression**: existing tests pass, GUID coverage unchanged, deck validates
+
+## Constraints
+
+- `examples.cloze` remains a single dict (not converted to list)
+- No new Anki card types or fields
+- No runtime JS in Anki cards
+- No network calls during scoring
+- `difficulty_score` is informational metadata; card rendering doesn't depend on it
+- Existing cloze GUIDs preserved when the same sentence is re-selected
+
+## Scope Exclusions (Future Work)
+
+- **Pronominal suffix stripping** — would improve the ~7% unscored token rate; deferred (PROJECT_NOTES.md)
+- **Kamatz katan disambiguation** — requires morphological analysis; accepted limitation
+- **Per-learner adaptive difficulty** — requires Anki plugin; out of scope for static deck
+- **Multiple cloze sentences per card** — would require schema migration to list; deferred
diff --git a/nikkud_to_ktiv_male.py b/nikkud_to_ktiv_male.py
new file mode 100644
index 0000000..e761a99
--- /dev/null
+++ b/nikkud_to_ktiv_male.py
@@ -0,0 +1,185 @@
+"""Convert nikkud (vocalized) Hebrew to ktiv male (plene spelling).
+
+Implements Hebrew Academy rules for matres lectionis insertion:
+- Rule A: U vowel (kubutz) → always insert vav
+- Rule B: O vowel (holam on non-vav) → insert vav
+- Rule C: I vowel (hiriq) → insert yod (conditionally)
+- Rule D: E vowel (tsere) → insert yod (limited cases)
+- Rule E/F: Consonantal vav/yod doubling
+
+Reference: https://hebrew-academy.org.il/topic/hahlatot/missingvocalizationspelling/
+"""
+
+import unicodedata
+
+# Hebrew nikkud code points
+SHVA = "\u05b0"
+HATAF_SEGOL = "\u05b1"
+HATAF_PATAH = "\u05b2"
+HATAF_KAMATZ = "\u05b3"
+HIRIQ = "\u05b4"
+TSERE = "\u05b5"
+SEGOL = "\u05b6"
+PATAH = "\u05b7"
+KAMATZ = "\u05b8"
+HOLAM = "\u05b9"
+HOLAM_HASER = "\u05ba"
+KUBUTZ = "\u05bb"
+DAGESH = "\u05bc"
+METEG = "\u05bd"
+RAFE = "\u05bf"
+SHIN_DOT = "\u05c1"
+SIN_DOT = "\u05c2"
+
+VAV = "ו"
+YOD = "י"
+MAQAF = "־"
+
+VOWELS = {SHVA, HATAF_SEGOL, HATAF_PATAH, HATAF_KAMATZ, HIRIQ, TSERE, SEGOL, PATAH, KAMATZ, HOLAM, HOLAM_HASER, KUBUTZ}
+
+NIKKUD_MARKS = VOWELS | {DAGESH, METEG, RAFE, SHIN_DOT, SIN_DOT}
+
+
+def _parse_segments(text: str) -> list[tuple[str, list[str]]]:
+    """Parse nikkud text into (character, [marks]) segments."""
+    segments: list[tuple[str, list[str]]] = []
+    cur_char: str | None = None
+    cur_marks: list[str] = []
+
+    for ch in text:
+        if unicodedata.category(ch) == "Mn":
+            cur_marks.append(ch)
+        else:
+            if cur_char is not None:
+                segments.append((cur_char, cur_marks))
+            cur_char = ch
+            cur_marks = []
+
+    if cur_char is not None:
+        segments.append((cur_char, cur_marks))
+
+    return segments
+
+
+def _get_vowel(marks: list[str]) -> str | None:
+    """Extract the vowel mark from a list of combining marks."""
+    for m in marks:
+        if m in VOWELS:
+            return m
+    return None
+
+
+def _has_dagesh(marks: list[str]) -> bool:
+    return DAGESH in marks
+
+
+def _is_hebrew_letter(ch: str) -> bool:
+    return "\u05d0" <= ch <= "\u05ea"
+
+
+def convert(text: str) -> str:
+    """Convert nikkud Hebrew text to ktiv male.
+
+    Strips all nikkud marks and inserts matres lectionis (vav/yod)
+    according to Hebrew Academy spelling rules.
+    """
+    segments = _parse_segments(text)
+    result: list[str] = []
+
+    for i, (ch, marks) in enumerate(segments):
+        if not _is_hebrew_letter(ch):
+            # Non-Hebrew character: output as-is (no marks)
+            result.append(ch)
+            continue
+
+        vowel = _get_vowel(marks)
+        has_dag = _has_dagesh(marks)
+
+        # Output the base letter (strip all nikkud marks)
+        result.append(ch)
+
+        # --- Rule A: U vowel (kubutz) → always add vav ---
+        if vowel == KUBUTZ:
+            result.append(VAV)
+            continue
+
+        # --- Shuruk detection ---
+        # Vav with dagesh and no other vowel = shuruk (already a mater)
+        # Vav with dagesh AND a vowel = consonantal vav (ב with dagesh)
+        # If letter is vav with dagesh only → it's shuruk, already output
+        if ch == VAV and has_dag and vowel is None:
+            # Shuruk: vav IS the mater lectionis, already output
+            continue
+
+        # --- Rule B: O vowel (holam) → add vav ---
+        if vowel in (HOLAM, HOLAM_HASER):
+            if ch != VAV:
+                # Exception: holam before aleph (pe-aleph verbs) — no vav
+                # e.g., תֹּאבַד→תאבד, יֹאבַד→יאבד, נֹאבַד→נאבד
+                next_is_aleph = i + 1 < len(segments) and segments[i + 1][0] == "א"
+                if not next_is_aleph:
+                    result.append(VAV)
+            # If ch IS vav (holam male), vav already output
+            continue
+
+        # --- Rule C: I vowel (hiriq) → conditionally add yod ---
+        if vowel == HIRIQ:
+            if ch == YOD:
+                # Yod already present, don't double
+                continue
+
+            # Don't insert yod if next letter is already yod
+            if i + 1 < len(segments) and segments[i + 1][0] == YOD:
+                continue
+
+            # Rule C Section 3: Don't add yod if the NEXT consonant
+            # has shva (indicating shva nach on that consonant)
+            add_yod = True
+
+            if i + 1 < len(segments):
+                next_ch, next_marks = segments[i + 1]
+                next_vowel = _get_vowel(next_marks)
+
+                # Shva on next consonant = shva nach → don't add yod
+                # UNLESS next consonant also has dagesh (= shva na / doubled)
+                next_has_dagesh = _has_dagesh(next_marks)
+                if next_vowel == SHVA and not next_has_dagesh:
+                    add_yod = False
+                # No vowel on next consonant (word-final) = closed syllable
+                # → don't add yod (e.g., suffix -תי -נו -תם)
+                elif next_vowel is None and _is_hebrew_letter(next_ch):
+                    # Check if this is truly word-final or next-to-last
+                    remaining_letters = sum(1 for j in range(i + 1, len(segments)) if _is_hebrew_letter(segments[j][0]))
+                    if remaining_letters <= 2:
+                        # Short suffix like תי, נו — don't add yod
+                        add_yod = False
+
+            if add_yod:
+                result.append(YOD)
+            continue
+
+        # --- Rule D: E vowel (tsere/segol) → generally NO yod ---
+        # Exception (b): tsere before guttural/resh gets yod ONLY
+        # in word-initial position (dagesh substitution in Hif'il/noun patterns)
+        # e.g., הֵחֵל→היחל, תֵּאָבֵד→תיאבד, הֵרִיעַ→היריע
+        # but NOT mid-word: מְסַפֵּר→מספר, מְעַבֵּר→מעבר
+        if vowel == TSERE:
+            add_yod = False
+
+            if i + 1 < len(segments):
+                next_ch = segments[i + 1][0]
+                if next_ch in "אהחער":
+                    # Only at word-initial (pos 0) or after prefix (pos 1)
+                    # where dagesh substitution applies
+                    hebrew_pos = sum(1 for j in range(i) if _is_hebrew_letter(segments[j][0]))
+                    if hebrew_pos <= 1:
+                        add_yod = True
+
+            if add_yod:
+                result.append(YOD)
+            continue
+
+        # All other vowels (patah, kamatz, segol, shva, hataf-*):
+        # No mater lectionis insertion needed
+
+    return "".join(result)