v0.20 design spec + nikkud-to-ktiv-male converter
Add Academy-rules-based nikkud→ktiv male converter (91.6% accuracy vs 77.2% for strip_nikkud) and v0.20 adaptive sentence difficulty cloze design spec.

The converter enables frequency-based sentence scoring by properly resolving nikkud tokens to their ktiv male forms for frequency corpus lookup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent af186e2030
commit b3ea086e85
2 changed files with 333 additions and 0 deletions
@@ -0,0 +1,148 @@
# Adaptive Sentence Difficulty Cloze — v0.20 Design Spec

**Date:** 2026-03-15
**Status:** Approved
**Release:** v0.20

## Problem

Cloze cards currently select the example sentence closest to 9 words in length. This ignores whether the surrounding context words are familiar to the learner. A sentence full of rare words is harder than one with common words, regardless of length.

## Solution

Replace the length-based `_score()` function in `epub_examples.py` with a **frequency-based difficulty score**. The easiest sentence (most common context words) becomes the cloze. All vetted sentences remain on the card, ordered easy→hard.

## Scoring Pipeline

### Token Frequency Lookup (5-tier)

Given a nikkud sentence token, resolve its frequency rank:

1. **Known mapping** — look up token in the nikkud→ktiv_male map built from words.json headwords, conjugations, and inflections (94k mappings). If found, look up the ktiv_male in the frequency data.
2. **Nikkud prefix stripping** — use `_try_strip_prefix()` to strip validated Hebrew prefixes (בהוכלמש), then resolve the remainder via the known mapping.
3. **Academy rules converter** — apply `nikkud_to_ktiv_male.convert()` (91.6% accuracy) to produce ktiv_male, look up in frequency data.
4. **strip_nikkud fallback** — use `helpers.strip_nikkud()` as a lossy fallback.
5. **Ktiv_male prefix stripping** — strip 1-2 character Hebrew prefixes from the converted/stripped form and look up the stem.

Tokens not found in any tier are assigned a default high rank (50,000).
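
The tier order above can be sketched as a single fall-through function. This is a minimal illustration, not the planned `_resolve_token_frequency()`: the name `resolve_rank`, the callable parameters, and the shape of the frequency data (a plain `{ktiv_male: rank}` dict) are all assumptions, and the real `_try_strip_prefix()` may have a different signature. Restricting tier 5 to the letters בהוכלמש is also an assumption; the spec only says "1-2 character Hebrew prefixes".

```python
from typing import Callable, Optional

DEFAULT_RANK = 50_000  # rank assigned when no tier resolves the token
PREFIX_LETTERS = "בהוכלמש"  # assumption: tier 5 reuses the tier 2 prefix letters


def resolve_rank(
    token: str,
    nikkud_map: dict[str, str],
    freq_rank: dict[str, int],
    strip_nikkud_prefix: Callable[[str], Optional[str]],  # stand-in for _try_strip_prefix
    convert: Callable[[str], str],  # stand-in for nikkud_to_ktiv_male.convert
    strip_nikkud: Callable[[str], str],  # stand-in for helpers.strip_nikkud
) -> int:
    """Walk the five tiers in order and return the first rank found."""
    # Tier 1: known nikkud -> ktiv male mapping.
    known = nikkud_map.get(token)
    if known in freq_rank:
        return freq_rank[known]

    # Tier 2: strip a validated prefix from the nikkud form, retry the map.
    remainder = strip_nikkud_prefix(token)
    if remainder is not None:
        known = nikkud_map.get(remainder)
        if known in freq_rank:
            return freq_rank[known]

    # Tier 3 (rules converter), then tier 4 (lossy mark stripping).
    converted = convert(token)
    stripped = strip_nikkud(token)
    for form in (converted, stripped):
        if form in freq_rank:
            return freq_rank[form]

    # Tier 5: drop one or two prefix letters from the unvocalized forms.
    for form in (converted, stripped):
        for n in (1, 2):
            prefix, stem = form[:n], form[n:]
            if stem and all(c in PREFIX_LETTERS for c in prefix):
                if stem in freq_rank:
                    return freq_rank[stem]

    return DEFAULT_RANK
```

Passing the three helpers in as callables keeps the sketch self-contained; the spec instead imports them from `epub_examples`, `nikkud_to_ktiv_male`, and `helpers`.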

**Coverage:** ~93% of example sentence tokens resolve to a frequency rank (measured empirically on 7,588 sentences).

**Frequency data source:** Use `frequency_lookup.py` which auto-selects `frequency_clean.json` when available, falling back to `frequency_cache.json`.

### Sentence Difficulty Score

For a given word's candidate sentence:

1. Tokenize: split on whitespace, strip punctuation (`.,!?;:"'"״׳–—()[]{}`), split on maqaf (־).
2. Exclude the target word's token using `cloze_word_start`/`cloze_word_end` offsets from the matched sentence.
3. For each remaining token (length >= 2), resolve its frequency rank via the 5-tier pipeline.
4. **Score = median frequency rank of context tokens.**

Lower score = easier (context words are more common). Median resists outliers (one rare proper noun shouldn't dominate).
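
The four steps above might look roughly like the following. It is a sketch under several assumptions that the spec does not settle: the offsets are treated as a half-open character range, any whitespace-delimited chunk overlapping the target is dropped whole, token length is measured with plain `len()` (so combining marks count), an empty context falls back to the default rank, and a fractional median is truncated to an integer. `rank_of` is a hypothetical stand-in for the 5-tier lookup.

```python
import re
import statistics
from typing import Callable

DEFAULT_RANK = 50_000
MAQAF = "\u05be"
PUNCT = ".,!?;:\"'״׳–—()[]{}"  # characters stripped from token edges


def context_tokens(text: str, target_start: int, target_end: int) -> list[str]:
    """Tokenize a sentence and leave out the cloze target."""
    tokens: list[str] = []
    for match in re.finditer(r"\S+", text):
        # Skip the chunk that overlaps the [target_start, target_end) span.
        if match.start() < target_end and match.end() > target_start:
            continue
        for part in match.group().split(MAQAF):
            part = part.strip(PUNCT)
            if len(part) >= 2:
                tokens.append(part)
    return tokens


def score_sentence(
    text: str,
    target_start: int,
    target_end: int,
    rank_of: Callable[[str], int],
) -> int:
    """Return the median frequency rank of the context tokens."""
    ranks = [rank_of(t) for t in context_tokens(text, target_start, target_end)]
    if not ranks:
        return DEFAULT_RANK  # assumption: no context means "unknown difficulty"
    # statistics.median averages the two middle values for even-length input;
    # int() truncates so the score stays an integer.
    return int(statistics.median(ranks))
```

With one very rare token among several common ones, the median stays close to the common ranks, which is the outlier behaviour the spec is after; a mean would be pulled toward 50,000.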

### Integration Point

The scoring integrates into `epub_examples.py`'s existing `_score()` closure inside `update_words_json()` (line ~677). Currently:

```python
def _score(s: dict) -> tuple[int,]:
    wc = s["word_count"]
    length_score = abs(wc - 9) if not (6 <= wc <= 12) else 0
    return (length_score,)
```

New scoring replaces length with frequency-based difficulty. The `_score` function gains access to the frequency pipeline via closure over the nikkud_map, nikkud_index, and freq_data built once at the start of `update_words_json()`.

**Behavioral change:** Because `pool.sort(key=_score)` determines which 3 sentences are selected as `best = pool[:3]`, changing the scoring function changes **which sentences are selected**, not just their order. This is intentional — we want the easiest sentences as cloze candidates, not the closest-to-9-words ones. Existing cloze GUIDs will be preserved when the same sentence text is re-selected; entries where a different sentence wins will get new GUIDs.
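
A toy pool makes the selection change concrete. The sketch below assumes each candidate already carries a precomputed `difficulty` value (a hypothetical field used only for illustration; in the design the score comes from `score_sentence()`), and the helper names `length_score`, `difficulty_key`, and `select` are likewise invented here.

```python
def length_score(s: dict) -> tuple[int,]:
    # Current behaviour, as quoted in the spec above.
    wc = s["word_count"]
    return (abs(wc - 9) if not (6 <= wc <= 12) else 0,)


def difficulty_key(s: dict) -> tuple[int,]:
    # Hypothetical: reads a precomputed difficulty instead of scoring live.
    return (s["difficulty"],)


def select(pool: list[dict], key, limit: int = 3) -> list[str]:
    # sorted() is stable, so candidates with equal scores keep input order.
    return [s["text"] for s in sorted(pool, key=key)[:limit]]


pool = [
    {"text": "A", "word_count": 9, "difficulty": 12000},
    {"text": "B", "word_count": 8, "difficulty": 300},
    {"text": "C", "word_count": 10, "difficulty": 9000},
    {"text": "D", "word_count": 15, "difficulty": 150},
    {"text": "E", "word_count": 4, "difficulty": 700},
]

by_length = select(pool, length_score)
by_difficulty = select(pool, difficulty_key)
```

Here `by_length` is `["A", "B", "C"]` while `by_difficulty` is `["D", "B", "E"]`: only sentence B survives in both selections, so under the GUID rule above a word whose cloze was A would receive a new GUID.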

## Data Model Changes

### words.json

The `examples.cloze` dict (single sentence) gains an optional `difficulty_score` field:

```json
{
  "examples": {
    "vetted": [
      {"text": "...", "source": "...", "match_method": "..."},
      {"text": "...", "source": "...", "match_method": "..."}
    ],
    "cloze": {
      "text": "...",
      "cloze_word_start": 5,
      "cloze_word_end": 10,
      "cloze_hint": null,
      "cloze_guid": "abc123",
      "difficulty_score": 234
    }
  }
}
```

The vetted list is also sorted by difficulty (easiest first), so the card back shows sentences in pedagogically useful order.

### SCHEMA.yaml

Add `difficulty_score` as optional integer field under `examples.cloze`.

## Implementation Scope

### New file: `sentence_difficulty.py`

Standalone module for sentence scoring. No pipeline step — called by `epub_examples.py`.

- `score_sentence(sentence_text: str, target_start: int, target_end: int, nikkud_map: dict, nikkud_index: dict, freq_data: dict) -> int` — returns median context frequency rank. Uses `target_start`/`target_end` character offsets to exclude the cloze target token.
- `build_nikkud_map(words: dict) -> dict[str, str]` — builds nikkud→ktiv_male lookup from words.json (headwords + conjugation forms + noun inflections). Returns `{nikkud_form: ktiv_male_form}`. Implementation note: should share iteration logic with `epub_examples._build_nikkud_index()` or derive from its output to avoid duplicating the traversal of words.json forms.
- `_resolve_token_frequency(token: str, nikkud_map: dict, nikkud_index: dict, freq_data: dict) -> int` — the 5-tier lookup. Uses `_try_strip_prefix` from epub_examples (made importable by removing underscore or adding a public wrapper).

### Modified files

- **`epub_examples.py`**:
  - Import `sentence_difficulty.score_sentence` and `sentence_difficulty.build_nikkud_map`
  - In `update_words_json()`: build nikkud_map and load freq_data once at start (before per-word loop)
  - Replace `_score()` closure with frequency-based scoring that calls `score_sentence()`
  - Sort vetted list by difficulty score (easiest first)
  - Store `difficulty_score` in the cloze dict
  - Make `_try_strip_prefix` importable (rename to `try_strip_prefix` or add public alias)
- **`frequency_lookup.py`** — add `get_freq_data() -> dict` public accessor to expose the loaded frequency dict (avoids accessing private `_freq` directly)
- **`SCHEMA.yaml`** — add `difficulty_score` field
- **`run.py`** — no changes; scoring happens inside epub_examples step

### Not modified

- **`apkg_builder.py`** — reads cloze as-is; vetted order is already respected
- **`nikkud_to_ktiv_male.py`** — used as-is
- **Card templates** — no changes needed

## Dependencies

- `nikkud_to_ktiv_male.convert()` — Academy rules converter (already written)
- `epub_examples._try_strip_prefix()` / `_build_nikkud_index()` — nikkud prefix stripping and index
- `frequency_lookup.py` — loads frequency data (auto-selects clean vs cache)
- `helpers.strip_nikkud()` — fallback converter

## Validation

- **Unit tests** for `score_sentence()` with known easy/hard sentences
- **Unit tests** for `_resolve_token_frequency()` covering all 5 tiers
- **Integration test**: verify cloze selection picks easiest sentence, vetted list is sorted
- **Spot check**: manually review 10 words with 3+ sentences to confirm ordering
- **Regression**: existing tests pass, GUID coverage unchanged, deck validates

## Constraints

- `examples.cloze` remains a single dict (not converted to list)
- No new Anki card types or fields
- No runtime JS in Anki cards
- No network calls during scoring
- `difficulty_score` is informational metadata; card rendering doesn't depend on it
- Existing cloze GUIDs preserved when the same sentence is re-selected

## Scope Exclusions (Future Work)

- **Pronominal suffix stripping** — would improve the ~7% unscored token rate; deferred (PROJECT_NOTES.md)
- **Kamatz katan disambiguation** — requires morphological analysis; accepted limitation
- **Per-learner adaptive difficulty** — requires Anki plugin; out of scope for static deck
- **Multiple cloze sentences per card** — would require schema migration to list; deferred
185
nikkud_to_ktiv_male.py
Normal file
@@ -0,0 +1,185 @@
"""Convert nikkud (vocalized) Hebrew to ktiv male (plene spelling).

Implements Hebrew Academy rules for matres lectionis insertion:
- Rule A: U vowel (kubutz) → always insert vav
- Rule B: O vowel (holam on non-vav) → insert vav
- Rule C: I vowel (hiriq) → insert yod (conditionally)
- Rule D: E vowel (tsere) → insert yod (limited cases)
- Rule E/F: Consonantal vav/yod doubling (listed for reference; convert() below does not apply it)

Reference: https://hebrew-academy.org.il/topic/hahlatot/missingvocalizationspelling/
"""

import unicodedata

# Hebrew nikkud code points
SHVA = "\u05b0"
HATAF_SEGOL = "\u05b1"
HATAF_PATAH = "\u05b2"
HATAF_KAMATZ = "\u05b3"
HIRIQ = "\u05b4"
TSERE = "\u05b5"
SEGOL = "\u05b6"
PATAH = "\u05b7"
KAMATZ = "\u05b8"
HOLAM = "\u05b9"
HOLAM_HASER = "\u05ba"
KUBUTZ = "\u05bb"
DAGESH = "\u05bc"
METEG = "\u05bd"
RAFE = "\u05bf"
SHIN_DOT = "\u05c1"
SIN_DOT = "\u05c2"

VAV = "ו"
YOD = "י"
MAQAF = "־"

VOWELS = {SHVA, HATAF_SEGOL, HATAF_PATAH, HATAF_KAMATZ, HIRIQ, TSERE, SEGOL, PATAH, KAMATZ, HOLAM, HOLAM_HASER, KUBUTZ}

NIKKUD_MARKS = VOWELS | {DAGESH, METEG, RAFE, SHIN_DOT, SIN_DOT}


def _parse_segments(text: str) -> list[tuple[str, list[str]]]:
    """Parse nikkud text into (character, [marks]) segments."""
    segments: list[tuple[str, list[str]]] = []
    cur_char: str | None = None
    cur_marks: list[str] = []

    for ch in text:
        if unicodedata.category(ch) == "Mn":
            cur_marks.append(ch)
        else:
            if cur_char is not None:
                segments.append((cur_char, cur_marks))
            cur_char = ch
            cur_marks = []

    if cur_char is not None:
        segments.append((cur_char, cur_marks))

    return segments


def _get_vowel(marks: list[str]) -> str | None:
    """Extract the vowel mark from a list of combining marks."""
    for m in marks:
        if m in VOWELS:
            return m
    return None


def _has_dagesh(marks: list[str]) -> bool:
    return DAGESH in marks


def _is_hebrew_letter(ch: str) -> bool:
    return "\u05d0" <= ch <= "\u05ea"


def convert(text: str) -> str:
    """Convert nikkud Hebrew text to ktiv male.

    Strips all nikkud marks and inserts matres lectionis (vav/yod)
    according to Hebrew Academy spelling rules.
    """
    segments = _parse_segments(text)
    result: list[str] = []

    for i, (ch, marks) in enumerate(segments):
        if not _is_hebrew_letter(ch):
            # Non-Hebrew character: output as-is (no marks)
            result.append(ch)
            continue

        vowel = _get_vowel(marks)
        has_dag = _has_dagesh(marks)

        # Output the base letter (strip all nikkud marks)
        result.append(ch)

        # --- Rule A: U vowel (kubutz) → always add vav ---
        if vowel == KUBUTZ:
            result.append(VAV)
            continue

        # --- Shuruk detection ---
        # Vav with dagesh and no other vowel = shuruk (already a mater)
        # Vav with dagesh AND a vowel = consonantal vav (ב with dagesh)
        # If letter is vav with dagesh only → it's shuruk, already output
        if ch == VAV and has_dag and vowel is None:
            # Shuruk: vav IS the mater lectionis, already output
            continue

        # --- Rule B: O vowel (holam) → add vav ---
        if vowel in (HOLAM, HOLAM_HASER):
            if ch != VAV:
                # Exception: holam before aleph (pe-aleph verbs) — no vav
                # e.g., תֹּאבַד→תאבד, יֹאבַד→יאבד, נֹאבַד→נאבד
                next_is_aleph = i + 1 < len(segments) and segments[i + 1][0] == "א"
                if not next_is_aleph:
                    result.append(VAV)
            # If ch IS vav (holam male), vav already output
            continue

        # --- Rule C: I vowel (hiriq) → conditionally add yod ---
        if vowel == HIRIQ:
            if ch == YOD:
                # Yod already present, don't double
                continue

            # Don't insert yod if next letter is already yod
            if i + 1 < len(segments) and segments[i + 1][0] == YOD:
                continue

            # Rule C Section 3: Don't add yod if the NEXT consonant
            # has shva (indicating shva nach on that consonant)
            add_yod = True

            if i + 1 < len(segments):
                next_ch, next_marks = segments[i + 1]
                next_vowel = _get_vowel(next_marks)

                # Shva on next consonant = shva nach → don't add yod
                # UNLESS next consonant also has dagesh (= shva na / doubled)
                next_has_dagesh = _has_dagesh(next_marks)
                if next_vowel == SHVA and not next_has_dagesh:
                    add_yod = False
                # No vowel on next consonant (word-final) = closed syllable
                # → don't add yod (e.g., suffix -תי -נו -תם)
                elif next_vowel is None and _is_hebrew_letter(next_ch):
                    # Check if this is truly word-final or next-to-last
                    remaining_letters = sum(1 for j in range(i + 1, len(segments)) if _is_hebrew_letter(segments[j][0]))
                    if remaining_letters <= 2:
                        # Short suffix like תי, נו — don't add yod
                        add_yod = False

            if add_yod:
                result.append(YOD)
            continue

        # --- Rule D: E vowel (tsere/segol) → generally NO yod ---
        # Exception (b): tsere before guttural/resh gets yod ONLY
        # in word-initial position (dagesh substitution in Hif'il/noun patterns)
        # e.g., הֵחֵל→היחל, תֵּאָבֵד→תיאבד, הֵרִיעַ→היריע
        # but NOT mid-word: מְסַפֵּר→מספר, מְעַבֵּר→מעבר
        if vowel == TSERE:
            add_yod = False

            if i + 1 < len(segments):
                next_ch = segments[i + 1][0]
                if next_ch in "אהחער":
                    # Only at word-initial (pos 0) or after prefix (pos 1)
                    # where dagesh substitution applies
                    hebrew_pos = sum(1 for j in range(i) if _is_hebrew_letter(segments[j][0]))
                    if hebrew_pos <= 1:
                        add_yod = True

            if add_yod:
                result.append(YOD)
            continue

        # All other vowels (patah, kamatz, segol, shva, hataf-*):
        # No mater lectionis insertion needed

    return "".join(result)