v0.20 design spec + nikkud-to-ktiv-male converter

Add Academy-rules-based nikkud→ktiv male converter (91.6% accuracy
vs 77.2% for strip_nikkud) and v0.20 adaptive sentence difficulty
cloze design spec. The converter enables frequency-based sentence
scoring by properly resolving nikkud tokens to their ktiv male forms
for frequency corpus lookup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sochen 2026-03-15 12:57:14 +00:00
parent af186e2030
commit b3ea086e85
2 changed files with 333 additions and 0 deletions


@@ -0,0 +1,148 @@
# Adaptive Sentence Difficulty Cloze — v0.20 Design Spec
**Date:** 2026-03-15
**Status:** Approved
**Release:** v0.20
## Problem
Cloze cards currently pick example sentences by length alone: any sentence of 6 to 12 words scores best, and anything outside that range is penalized by its distance from 9 words. This ignores whether the surrounding context words are familiar to the learner. A sentence full of rare words is harder than one with common words, regardless of length.
## Solution
Replace the length-based `_score()` function in `epub_examples.py` with a **frequency-based difficulty score**. The easiest sentence (most common context words) becomes the cloze. All vetted sentences remain on the card, ordered easy→hard.
## Scoring Pipeline
### Token Frequency Lookup (5-tier)
Given a nikkud sentence token, resolve its frequency rank:
1. **Known mapping** — look up token in the nikkud→ktiv_male map built from words.json headwords, conjugations, and inflections (94k mappings). If found, look up the ktiv_male in the frequency data.
2. **Nikkud prefix stripping** — use `_try_strip_prefix()` to strip validated Hebrew prefixes (בהוכלמש), then resolve the remainder via the known mapping.
3. **Academy rules converter** — apply `nikkud_to_ktiv_male.convert()` (91.6% accuracy) to produce ktiv_male, look up in frequency data.
4. **strip_nikkud fallback** — use `helpers.strip_nikkud()` as a lossy fallback.
5. **Ktiv_male prefix stripping** — strip 1-2 character Hebrew prefixes from the converted/stripped form and look up the stem.
Tokens not found in any tier are assigned a default high rank (50,000).
**Coverage:** ~93% of example sentence tokens resolve to a frequency rank (measured empirically on 7,588 sentences).
**Frequency data source:** Use `frequency_lookup.py` which auto-selects `frequency_clean.json` when available, falling back to `frequency_cache.json`.
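The five tiers can be read as a single fall-through function. The following is a minimal sketch, not the project's implementation. The three helpers are passed in as callables standing in for `_try_strip_prefix()`, `nikkud_to_ktiv_male.convert()`, and `helpers.strip_nikkud()`. It assumes that the prefix stripper returns the remainder or `None`, that the frequency data is a plain `{ktiv_male: rank}` dict, and that tier 5 strips only letters from the same prefix set (בהוכלמש). The names `resolve_token_frequency`, `DEFAULT_RANK`, and `HEBREW_PREFIX_LETTERS` are hypothetical.

```python
from typing import Callable, Optional

DEFAULT_RANK = 50_000  # rank assigned when no tier resolves the token
HEBREW_PREFIX_LETTERS = "בהוכלמש"  # assumption: tier 5 reuses the tier 2 prefix set


def resolve_token_frequency(
    token: str,
    nikkud_map: dict[str, str],
    freq_ranks: dict[str, int],
    strip_prefix: Callable[[str], Optional[str]],
    convert: Callable[[str], str],
    strip_nikkud: Callable[[str], str],
) -> int:
    """Return a frequency rank for a nikkud token, trying the five tiers in order."""
    # Tier 1: known nikkud -> ktiv male mapping.
    plain = nikkud_map.get(token)
    if plain in freq_ranks:
        return freq_ranks[plain]

    # Tier 2: strip a validated nikkud prefix, then retry the known mapping.
    remainder = strip_prefix(token)
    if remainder is not None:
        plain = nikkud_map.get(remainder)
        if plain in freq_ranks:
            return freq_ranks[plain]

    # Tier 3 (rules converter) and tier 4 (lossy strip), tried in that order.
    candidates = [convert(token), strip_nikkud(token)]
    for form in candidates:
        if form in freq_ranks:
            return freq_ranks[form]

    # Tier 5: drop a 1-2 letter prefix from the converted/stripped form.
    for form in candidates:
        for n in (1, 2):
            if len(form) > n and all(c in HEBREW_PREFIX_LETTERS for c in form[:n]):
                stem = form[n:]
                if stem in freq_ranks:
                    return freq_ranks[stem]

    # No tier matched: fall back to the default high rank.
    return DEFAULT_RANK
```

Injecting the helpers keeps the sketch testable without the real modules; the actual `_resolve_token_frequency()` would presumably import them directly.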
### Sentence Difficulty Score
For a given word's candidate sentence:
1. Tokenize: split on whitespace, strip punctuation (.,!?;:"'"״׳–—()[]{}), split on maqaf (־).
2. Exclude the target word's token using `cloze_word_start`/`cloze_word_end` offsets from the matched sentence.
3. For each remaining token (length >= 2), resolve its frequency rank via the 5-tier pipeline.
4. **Score = median frequency rank of context tokens.**
Lower score = easier (context words are more common). Median resists outliers (one rare proper noun shouldn't dominate).
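The four steps above could look roughly like the sketch below. It is illustrative only and makes several assumptions the spec does not settle: the whole whitespace-delimited token overlapping the target offsets is dropped, token length is measured in code points (so nikkud marks count), an empty context falls back to the default rank, and a fractional median is truncated to an integer. The `rank_of` callable stands in for the 5-tier lookup, and `context_tokens` is a hypothetical helper name.

```python
import re
from statistics import median
from typing import Callable

# Punctuation from step 1; the en dash and em dash are written as escapes.
PUNCTUATION = ".,!?;:\"'״׳\u2013\u2014()[]{}"
MAQAF = "־"
DEFAULT_RANK = 50_000  # assumption: used when no context tokens remain


def context_tokens(text: str, target_start: int, target_end: int) -> list[str]:
    """Tokenize a sentence, skipping any token that overlaps the cloze target span."""
    tokens: list[str] = []
    for match in re.finditer(r"\S+", text):
        # Step 2: drop the whitespace-delimited token that contains the target.
        if match.start() < target_end and match.end() > target_start:
            continue
        # Step 1: split on maqaf, then strip surrounding punctuation.
        for part in match.group().split(MAQAF):
            part = part.strip(PUNCTUATION)
            # Step 3: only tokens of length >= 2 are scored.
            if len(part) >= 2:
                tokens.append(part)
    return tokens


def score_sentence(
    text: str, target_start: int, target_end: int, rank_of: Callable[[str], int]
) -> int:
    """Step 4: median frequency rank of the context tokens (lower is easier)."""
    ranks = [rank_of(tok) for tok in context_tokens(text, target_start, target_end)]
    if not ranks:
        return DEFAULT_RANK
    return int(median(ranks))
```

With four context ranks of 1, 3, 500, and 2000, the median is 251.5 and the sketch returns 251, so the single rank of 2000 does not dominate the score.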
### Integration Point
The scoring integrates into `epub_examples.py`'s existing `_score()` closure inside `update_words_json()` (line ~677). Currently:
```python
def _score(s: dict) -> tuple[int,]:
    wc = s["word_count"]
    length_score = abs(wc - 9) if not (6 <= wc <= 12) else 0
    return (length_score,)
```
New scoring replaces length with frequency-based difficulty. The `_score` function gains access to the frequency pipeline via closure over the nikkud_map, nikkud_index, and freq_data built once at the start of `update_words_json()`.
**Behavioral change:** Because `pool.sort(key=_score)` determines which 3 sentences are selected as `best = pool[:3]`, changing the scoring function changes **which sentences are selected**, not just their order. This is intentional — we want the easiest sentences as cloze candidates, not the closest-to-9-words ones. Existing cloze GUIDs will be preserved when the same sentence text is re-selected; entries where a different sentence wins will get new GUIDs.
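The selection change and the GUID rule can be illustrated together. This is a hedged sketch, not the code in `epub_examples.py`: `select_best` and `choose_guid` are hypothetical names, `difficulty` stands in for a call to `score_sentence()`, and `make_guid` stands in for however the pipeline mints new GUIDs, which this spec does not describe.

```python
from typing import Callable, Optional


def select_best(pool: list[dict], difficulty: Callable[[dict], int]) -> list[dict]:
    """Sort candidates easy -> hard and keep the top three, as `best = pool[:3]` does."""

    def _score(s: dict) -> tuple[int,]:
        # Frequency-based difficulty replaces the old length penalty.
        return (difficulty(s),)

    pool.sort(key=_score)
    return pool[:3]


def choose_guid(
    old_cloze: Optional[dict], new_text: str, make_guid: Callable[[], str]
) -> str:
    """Reuse the stored GUID only when the same sentence text wins again."""
    if old_cloze and old_cloze.get("text") == new_text and old_cloze.get("cloze_guid"):
        return old_cloze["cloze_guid"]
    # A different sentence won (or there was no cloze before): mint a new GUID.
    return make_guid()
```

Because the sort key decides which entries land in `pool[:3]`, a candidate that the length-based key ranked fourth can displace one of the previous three; only then does `choose_guid` fall through to a new GUID.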
## Data Model Changes
### words.json
The `examples.cloze` dict (single sentence) gains an optional `difficulty_score` field:
```json
{
  "examples": {
    "vetted": [
      {"text": "...", "source": "...", "match_method": "..."},
      {"text": "...", "source": "...", "match_method": "..."}
    ],
    "cloze": {
      "text": "...",
      "cloze_word_start": 5,
      "cloze_word_end": 10,
      "cloze_hint": null,
      "cloze_guid": "abc123",
      "difficulty_score": 234
    }
  }
}
The vetted list is also sorted by difficulty (easiest first), so the card back shows sentences in pedagogically useful order.
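As a rough illustration of the resulting shape, the sketch below sorts the vetted list and attaches the score to the cloze dict. It assumes scores are available keyed by sentence text and that the cloze `text` matches one of the scored sentences; `order_examples` and the `scores` mapping are hypothetical, not part of the existing pipeline.

```python
def order_examples(examples: dict, scores: dict[str, int]) -> dict:
    """Return a copy of `examples` with vetted sorted easy -> hard and the cloze scored."""
    ordered = dict(examples)
    # Easiest (lowest median rank) sentence first.
    ordered["vetted"] = sorted(examples["vetted"], key=lambda s: scores[s["text"]])
    # Copy the cloze dict so the input is left untouched, then add the optional field.
    cloze = dict(examples["cloze"])
    cloze["difficulty_score"] = scores[cloze["text"]]
    ordered["cloze"] = cloze
    return ordered
```

Returning a copy rather than mutating in place is a choice made for the sketch; the real step may well update words.json entries directly.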
### SCHEMA.yaml
Add `difficulty_score` as optional integer field under `examples.cloze`.
## Implementation Scope
### New file: `sentence_difficulty.py`
Standalone module for sentence scoring. No pipeline step — called by `epub_examples.py`.
- `score_sentence(sentence_text: str, target_start: int, target_end: int, nikkud_map: dict, nikkud_index: dict, freq_data: dict) -> int` — returns median context frequency rank. Uses `target_start`/`target_end` character offsets to exclude the cloze target token.
- `build_nikkud_map(words: dict) -> dict[str, str]` — builds nikkud→ktiv_male lookup from words.json (headwords + conjugation forms + noun inflections). Returns `{nikkud_form: ktiv_male_form}`. Implementation note: should share iteration logic with `epub_examples._build_nikkud_index()` or derive from its output to avoid duplicating the traversal of words.json forms.
- `_resolve_token_frequency(token: str, nikkud_map: dict, nikkud_index: dict, freq_data: dict) -> int` — the 5-tier lookup. Uses `_try_strip_prefix` from epub_examples (made importable by removing underscore or adding a public wrapper).
### Modified files
- **`epub_examples.py`**:
  - Import `sentence_difficulty.score_sentence` and `sentence_difficulty.build_nikkud_map`
  - In `update_words_json()`: build nikkud_map and load freq_data once at start (before per-word loop)
  - Replace `_score()` closure with frequency-based scoring that calls `score_sentence()`
  - Sort vetted list by difficulty score (easiest first)
  - Store `difficulty_score` in the cloze dict
  - Make `_try_strip_prefix` importable (rename to `try_strip_prefix` or add public alias)
- **`frequency_lookup.py`** — add `get_freq_data() -> dict` public accessor to expose the loaded frequency dict (avoids accessing private `_freq` directly)
- **`SCHEMA.yaml`** — add `difficulty_score` field
- **`run.py`** — no changes; scoring happens inside epub_examples step
### Not modified
- **`apkg_builder.py`** — reads cloze as-is; vetted order is already respected
- **`nikkud_to_ktiv_male.py`** — used as-is
- **Card templates** — no changes needed
## Dependencies
- `nikkud_to_ktiv_male.convert()` — Academy rules converter (already written)
- `epub_examples._try_strip_prefix()` / `_build_nikkud_index()` — nikkud prefix stripping and index
- `frequency_lookup.py` — loads frequency data (auto-selects clean vs cache)
- `helpers.strip_nikkud()` — fallback converter
## Validation
- **Unit tests** for `score_sentence()` with known easy/hard sentences
- **Unit tests** for `_resolve_token_frequency()` covering all 5 tiers
- **Integration test**: verify cloze selection picks easiest sentence, vetted list is sorted
- **Spot check**: manually review 10 words with 3+ sentences to confirm ordering
- **Regression**: existing tests pass, GUID coverage unchanged, deck validates
## Constraints
- `examples.cloze` remains a single dict (not converted to list)
- No new Anki card types or fields
- No runtime JS in Anki cards
- No network calls during scoring
- `difficulty_score` is informational metadata; card rendering doesn't depend on it
- Existing cloze GUIDs preserved when the same sentence is re-selected
## Scope Exclusions (Future Work)
- **Pronominal suffix stripping** — would improve the ~7% unscored token rate; deferred (PROJECT_NOTES.md)
- **Kamatz katan disambiguation** — requires morphological analysis; accepted limitation
- **Per-learner adaptive difficulty** — requires Anki plugin; out of scope for static deck
- **Multiple cloze sentences per card** — would require schema migration to list; deferred

nikkud_to_ktiv_male.py

@@ -0,0 +1,185 @@
"""Convert nikkud (vocalized) Hebrew to ktiv male (plene spelling).
Implements Hebrew Academy rules for matres lectionis insertion:
- Rule A: U vowel (kubutz) always insert vav
- Rule B: O vowel (holam on non-vav) insert vav
- Rule C: I vowel (hiriq) insert yod (conditionally)
- Rule D: E vowel (tsere) insert yod (limited cases)
- Rule E/F: Consonantal vav/yod doubling
Reference: https://hebrew-academy.org.il/topic/hahlatot/missingvocalizationspelling/
"""
import unicodedata
# Hebrew nikkud code points
SHVA = "\u05b0"
HATAF_SEGOL = "\u05b1"
HATAF_PATAH = "\u05b2"
HATAF_KAMATZ = "\u05b3"
HIRIQ = "\u05b4"
TSERE = "\u05b5"
SEGOL = "\u05b6"
PATAH = "\u05b7"
KAMATZ = "\u05b8"
HOLAM = "\u05b9"
HOLAM_HASER = "\u05ba"
KUBUTZ = "\u05bb"
DAGESH = "\u05bc"
METEG = "\u05bd"
RAFE = "\u05bf"
SHIN_DOT = "\u05c1"
SIN_DOT = "\u05c2"
VAV = "ו"
YOD = "י"
MAQAF = "־"
VOWELS = {SHVA, HATAF_SEGOL, HATAF_PATAH, HATAF_KAMATZ, HIRIQ, TSERE, SEGOL, PATAH, KAMATZ, HOLAM, HOLAM_HASER, KUBUTZ}
NIKKUD_MARKS = VOWELS | {DAGESH, METEG, RAFE, SHIN_DOT, SIN_DOT}


def _parse_segments(text: str) -> list[tuple[str, list[str]]]:
    """Parse nikkud text into (character, [marks]) segments."""
    segments: list[tuple[str, list[str]]] = []
    cur_char: str | None = None
    cur_marks: list[str] = []
    for ch in text:
        if unicodedata.category(ch) == "Mn":
            cur_marks.append(ch)
        else:
            if cur_char is not None:
                segments.append((cur_char, cur_marks))
            cur_char = ch
            cur_marks = []
    if cur_char is not None:
        segments.append((cur_char, cur_marks))
    return segments


def _get_vowel(marks: list[str]) -> str | None:
    """Extract the vowel mark from a list of combining marks."""
    for m in marks:
        if m in VOWELS:
            return m
    return None


def _has_dagesh(marks: list[str]) -> bool:
    return DAGESH in marks


def _is_hebrew_letter(ch: str) -> bool:
    return "\u05d0" <= ch <= "\u05ea"


def convert(text: str) -> str:
    """Convert nikkud Hebrew text to ktiv male.

    Strips all nikkud marks and inserts matres lectionis (vav/yod)
    according to Hebrew Academy spelling rules.
    """
    segments = _parse_segments(text)
    result: list[str] = []
    for i, (ch, marks) in enumerate(segments):
        if not _is_hebrew_letter(ch):
            # Non-Hebrew character: output as-is (no marks)
            result.append(ch)
            continue
        vowel = _get_vowel(marks)
        has_dag = _has_dagesh(marks)
        # Output the base letter (strip all nikkud marks)
        result.append(ch)

        # --- Rule A: U vowel (kubutz) → always add vav ---
        if vowel == KUBUTZ:
            result.append(VAV)
            continue

        # --- Shuruk detection ---
        # Vav with dagesh and no other vowel = shuruk (already a mater)
        # Vav with dagesh AND a vowel = consonantal vav (ב with dagesh)
        # If letter is vav with dagesh only → it's shuruk, already output
        if ch == VAV and has_dag and vowel is None:
            # Shuruk: vav IS the mater lectionis, already output
            continue

        # --- Rule B: O vowel (holam) → add vav ---
        if vowel in (HOLAM, HOLAM_HASER):
            if ch != VAV:
                # Exception: holam before aleph (pe-aleph verbs): no vav
                # e.g., תֹּאבַד→תאבד, יֹאבַד→יאבד, נֹאבַד→נאבד
                next_is_aleph = i + 1 < len(segments) and segments[i + 1][0] == "א"
                if not next_is_aleph:
                    result.append(VAV)
            # If ch IS vav (holam male), vav already output
            continue

        # --- Rule C: I vowel (hiriq) → conditionally add yod ---
        if vowel == HIRIQ:
            if ch == YOD:
                # Yod already present, don't double
                continue
            # Don't insert yod if next letter is already yod
            if i + 1 < len(segments) and segments[i + 1][0] == YOD:
                continue
            # Rule C Section 3: Don't add yod if the NEXT consonant
            # has shva (indicating shva nach on that consonant)
            add_yod = True
            if i + 1 < len(segments):
                next_ch, next_marks = segments[i + 1]
                next_vowel = _get_vowel(next_marks)
                # Shva on next consonant = shva nach → don't add yod
                # UNLESS next consonant also has dagesh (= shva na / doubled)
                next_has_dagesh = _has_dagesh(next_marks)
                if next_vowel == SHVA and not next_has_dagesh:
                    add_yod = False
                # No vowel on next consonant (word-final) = closed syllable
                # → don't add yod (e.g., suffix -תי -נו -תם)
                elif next_vowel is None and _is_hebrew_letter(next_ch):
                    # Check if this is truly word-final or next-to-last
                    remaining_letters = sum(
                        1
                        for j in range(i + 1, len(segments))
                        if _is_hebrew_letter(segments[j][0])
                    )
                    if remaining_letters <= 2:
                        # Short suffix like תי, נו: don't add yod
                        add_yod = False
            if add_yod:
                result.append(YOD)
            continue

        # --- Rule D: E vowel (tsere/segol) → generally NO yod ---
        # Exception (b): tsere before guttural/resh gets yod ONLY
        # in word-initial position (dagesh substitution in Hif'il/noun patterns)
        # e.g., הֵחֵל→היחל, תֵּאָבֵד→תיאבד, הֵרִיעַ→היריע
        # but NOT mid-word: מְסַפֵּר→מספר, מְעַבֵּר→מעבר
        if vowel == TSERE:
            add_yod = False
            if i + 1 < len(segments):
                next_ch = segments[i + 1][0]
                if next_ch in "אהחער":
                    # Only at word-initial (pos 0) or after prefix (pos 1)
                    # where dagesh substitution applies
                    hebrew_pos = sum(1 for j in range(i) if _is_hebrew_letter(segments[j][0]))
                    if hebrew_pos <= 1:
                        add_yod = True
            if add_yod:
                result.append(YOD)
            continue

        # All other vowels (patah, kamatz, segol, shva, hataf-*):
        # No mater lectionis insertion needed
    return "".join(result)