Sochen 14d567a261 schema: add difficulty_score field + update spec with MIN_WORDS=3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-15 13:30:13 +00:00


Adaptive Sentence Difficulty Cloze — v0.20 Design Spec

Date: 2026-03-15 | Status: Approved | Release: v0.20

Problem

Cloze cards currently select example sentences by length alone: any sentence of 6 to 12 words scores best, and the rest are ranked by their distance from 9 words. This ignores whether the surrounding context words are familiar to the learner. A sentence full of rare words is harder than one with common words, regardless of length.

Solution

Replace the length-based _score() function in epub_examples.py with a frequency-based difficulty score. The easiest sentence (most common context words) becomes the cloze. All vetted sentences remain on the card, ordered easy→hard.

Scoring Pipeline

Token Frequency Lookup (5-tier)

Given a nikkud sentence token, resolve its frequency rank:

1. Known mapping: look up the token in the nikkud→ktiv_male map built from words.json headwords, conjugations, and inflections (94k mappings). If found, look up the ktiv_male form in the frequency data.
2. Nikkud prefix stripping: use _try_strip_prefix() to strip validated Hebrew prefixes (בהוכלמש), then resolve the remainder via the known mapping.
3. Academy rules converter: apply nikkud_to_ktiv_male.convert() (91.6% accuracy) to produce ktiv_male, then look it up in the frequency data.
4. strip_nikkud fallback: use helpers.strip_nikkud() as a lossy fallback.
5. Ktiv_male prefix stripping: strip 1-2 character Hebrew prefixes from the converted/stripped form and look up the stem.

Tokens not found in any tier are assigned a default high rank (50,000).

Coverage: ~93% of example sentence tokens resolve to a frequency rank (measured empirically on 7,588 sentences).

Frequency data source: use frequency_lookup.py, which auto-selects frequency_clean.json when available and falls back to frequency_cache.json.
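The five tiers above can be sketched as a single fall-through function. This is a minimal illustration, not the module's actual code: it assumes freq is a plain dict mapping a ktiv_male form to an integer rank, and it takes the prefix stripper and the Academy converter as optional callables (strip_prefix and convert are hypothetical parameter names standing in for _try_strip_prefix() and nikkud_to_ktiv_male.convert(), whose real signatures are not shown in this spec). The local strip_nikkud is likewise a stand-in for helpers.strip_nikkud().

```python
import unicodedata
from typing import Callable, Dict, List, Optional

# Rank assigned when no tier resolves the token (value from the spec).
DEFAULT_RANK = 50_000
# Prefix letters listed in the spec for prefix stripping.
PREFIX_LETTERS = "בהוכלמש"


def strip_nikkud(text: str) -> str:
    """Stand-in for helpers.strip_nikkud(): drop combining marks (category Mn)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def resolve_rank(
    token: str,
    nikkud_map: Dict[str, str],
    freq: Dict[str, int],
    strip_prefix: Optional[Callable[[str], Optional[str]]] = None,  # hypothetical
    convert: Optional[Callable[[str], str]] = None,                 # hypothetical
) -> int:
    # Tier 1: known nikkud -> ktiv_male mapping.
    ktiv = nikkud_map.get(token)
    if ktiv in freq:
        return freq[ktiv]

    # Tier 2: strip a nikkud prefix, then retry the known mapping.
    if strip_prefix is not None:
        stem = strip_prefix(token)
        ktiv = nikkud_map.get(stem) if stem else None
        if ktiv in freq:
            return freq[ktiv]

    # Tiers 3 and 4: rule-based conversion, then the lossy strip_nikkud fallback.
    forms: List[str] = []
    if convert is not None:
        forms.append(convert(token))
    forms.append(strip_nikkud(token))
    for form in forms:
        if form in freq:
            return freq[form]

    # Tier 5: strip a 1-2 letter prefix from the unpointed forms.
    for form in forms:
        for n in (1, 2):
            stem = form[n:]
            if stem and all(ch in PREFIX_LETTERS for ch in form[:n]) and stem in freq:
                return freq[stem]

    return DEFAULT_RANK
```

Keeping the tiers in one function makes the fall-through order explicit, which may help when writing the per-tier unit tests.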

Sentence Difficulty Score

For a given word's candidate sentence:

1. Tokenize: split on whitespace, strip punctuation (.,!?;:"'"״׳–—()[]{}), split on maqaf (־).
2. Exclude the target word's token using cloze_word_start/cloze_word_end offsets from the matched sentence.
3. For each remaining token (length >= 2), resolve its frequency rank via the 5-tier pipeline.
4. Score = median frequency rank of context tokens.

Lower score = easier (context words are more common). Median resists outliers (one rare proper noun shouldn't dominate).
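The scoring steps above could look roughly like the following sketch. Several details are assumptions rather than spec: the offsets are treated as end-exclusive character offsets, a float median (even token count) is truncated to int, a sentence with no scorable context tokens falls back to the default rank, and resolve is a hypothetical callable that stands in for the 5-tier lookup.

```python
import re
import statistics
from typing import Callable, List

# Punctuation set from the spec (Hebrew gershayim/geresh and dashes as escapes).
PUNCT = ".,!?;:\"'\u05f4\u05f3\u2013\u2014()[]{}"
# A piece is a run of characters containing no whitespace and no maqaf (U+05BE).
PIECE_RE = re.compile(r"[^\s\u05be]+")
DEFAULT_RANK = 50_000


def score_sentence_sketch(
    text: str,
    target_start: int,
    target_end: int,
    resolve: Callable[[str], int],  # hypothetical: token -> frequency rank
) -> int:
    ranks: List[int] = []
    for match in PIECE_RE.finditer(text):
        # Skip any piece that overlaps the cloze target span.
        if match.start() < target_end and match.end() > target_start:
            continue
        token = match.group().strip(PUNCT)
        if len(token) >= 2:
            ranks.append(resolve(token))
    if not ranks:
        # Assumption: the spec does not say what to return here.
        return DEFAULT_RANK
    # Assumption: truncate the median so the score stays an integer.
    return int(statistics.median(ranks))
```

Tokenizing with regex matches rather than str.split() keeps character offsets available, so the target can be excluded by position instead of by string comparison.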

Integration Point

The scoring integrates into epub_examples.py's existing _score() closure inside update_words_json() (line ~677). Currently:

def _score(s: dict) -> tuple[int,]:
    wc = s["word_count"]
    length_score = abs(wc - 9) if not (6 <= wc <= 12) else 0
    return (length_score,)

New scoring replaces length with frequency-based difficulty. The _score function gains access to the frequency pipeline via closure over the nikkud_map, nikkud_index, and freq_data built once at the start of update_words_json().

Minimum sentence length: Reduced from 4 words to 3 words (MIN_WORDS = 3 in epub_examples.py). Hebrew is more concise than English — 3-word sentences are valid and common. This expands the candidate pool for cloze selection.

Behavioral change: Because pool.sort(key=_score) determines which 3 sentences are selected as best = pool[:3], changing the scoring function changes which sentences are selected, not just their order. This is intentional — we want the easiest sentences as cloze candidates, not the closest-to-9-words ones. Existing cloze GUIDs will be preserved when the same sentence text is re-selected; entries where a different sentence wins will get new GUIDs.
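The selection and GUID behavior described above can be illustrated with a small sketch. It is not the code in update_words_json(): select_examples, make_guid, and previous_cloze are hypothetical names, and the sentence dicts are assumed to carry text, cloze_word_start, and cloze_word_end keys.

```python
from typing import Callable, List, Optional, Tuple


def select_examples(
    pool: List[dict],
    difficulty: Callable[[dict], int],   # hypothetical: sentence dict -> score
    make_guid: Callable[[str], str],     # hypothetical: sentence text -> new GUID
    previous_cloze: Optional[dict] = None,
    keep: int = 3,
) -> Tuple[List[dict], Optional[dict]]:
    # Sorting by difficulty decides which sentences survive, not just their order.
    best = sorted(pool, key=difficulty)[:keep]
    if not best:
        return [], None

    easiest = best[0]
    # Reuse the old GUID only when the same sentence text wins again.
    if previous_cloze and previous_cloze.get("text") == easiest["text"]:
        guid = previous_cloze["cloze_guid"]
    else:
        guid = make_guid(easiest["text"])

    cloze = {
        "text": easiest["text"],
        "cloze_word_start": easiest["cloze_word_start"],
        "cloze_word_end": easiest["cloze_word_end"],
        "cloze_hint": None,
        "cloze_guid": guid,
        "difficulty_score": difficulty(easiest),
    }
    return best, cloze
```

Because sorted() is stable, sentences with equal scores keep their original pool order.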

Data Model Changes

words.json

The examples.cloze dict (single sentence) gains an optional difficulty_score field:

{
  "examples": {
    "vetted": [
      {"text": "...", "source": "...", "match_method": "..."},
      {"text": "...", "source": "...", "match_method": "..."}
    ],
    "cloze": {
      "text": "...",
      "cloze_word_start": 5,
      "cloze_word_end": 10,
      "cloze_hint": null,
      "cloze_guid": "abc123",
      "difficulty_score": 234
    }
  }
}

The vetted list is also sorted by difficulty (easiest first), so the card back shows sentences in pedagogically useful order.

SCHEMA.yaml

Add difficulty_score as optional integer field under examples.cloze.

Implementation Scope

New file: `sentence_difficulty.py`

Standalone module for sentence scoring. It is not a pipeline step of its own; epub_examples.py calls it directly.

score_sentence(sentence_text: str, target_start: int, target_end: int, nikkud_map: dict, nikkud_index: dict, freq_data: dict) -> int — returns median context frequency rank. Uses target_start/target_end character offsets to exclude the cloze target token.
build_nikkud_map(words: dict) -> dict[str, str] — builds nikkud→ktiv_male lookup from words.json (headwords + conjugation forms + noun inflections). Returns {nikkud_form: ktiv_male_form}. Implementation note: should share iteration logic with epub_examples._build_nikkud_index() or derive from its output to avoid duplicating the traversal of words.json forms.
_resolve_token_frequency(token: str, nikkud_map: dict, nikkud_index: dict, freq_data: dict) -> int — the 5-tier lookup. Uses _try_strip_prefix from epub_examples (made importable by removing underscore or adding a public wrapper).

Modified files

epub_examples.py:
- Import sentence_difficulty.score_sentence and sentence_difficulty.build_nikkud_map
- In update_words_json(): build nikkud_map and load freq_data once at start (before per-word loop)
- Replace _score() closure with frequency-based scoring that calls score_sentence()
- Sort vetted list by difficulty score (easiest first)
- Store difficulty_score in the cloze dict
- Make _try_strip_prefix importable (rename to try_strip_prefix or add public alias)
frequency_lookup.py — add get_freq_data() -> dict public accessor to expose the loaded frequency dict (avoids accessing private _freq directly)
SCHEMA.yaml — add difficulty_score field
run.py — no changes; scoring happens inside epub_examples step
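For the frequency_lookup.py accessor listed above, one possible shape is sketched below. It assumes both frequency files are JSON objects living in the same directory; the data_dir parameter and the use of lru_cache for one-time loading are illustrative choices, not taken from the existing module.

```python
import json
from functools import lru_cache
from pathlib import Path

# Preference order from the spec: clean data first, cache as fallback.
_CANDIDATES = ("frequency_clean.json", "frequency_cache.json")


@lru_cache(maxsize=None)
def get_freq_data(data_dir: str = ".") -> dict:
    """Load the frequency dict once per directory and return it.

    data_dir is a hypothetical parameter; the real module may locate
    its files differently.
    """
    for name in _CANDIDATES:
        path = Path(data_dir) / name
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
    raise FileNotFoundError(f"no frequency data found in {data_dir!r}")
```

Callers such as update_words_json() would then call get_freq_data() once before the per-word loop instead of reaching into the private _freq.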

Not modified

apkg_builder.py — reads cloze as-is; vetted order is already respected
nikkud_to_ktiv_male.py — used as-is
Card templates — no changes needed

Dependencies

nikkud_to_ktiv_male.convert() — Academy rules converter (already written)
epub_examples._try_strip_prefix() / _build_nikkud_index() — nikkud prefix stripping and index
frequency_lookup.py — loads frequency data (auto-selects clean vs cache)
helpers.strip_nikkud() — fallback converter

Validation

Unit tests for score_sentence() with known easy/hard sentences
Unit tests for _resolve_token_frequency() covering all 5 tiers
Integration test: verify cloze selection picks easiest sentence, vetted list is sorted
Spot check: manually review 10 words with 3+ sentences to confirm ordering
Regression: existing tests pass, GUID coverage unchanged, deck validates

Constraints

examples.cloze remains a single dict (not converted to list)
No new Anki card types or fields
No runtime JS in Anki cards
No network calls during scoring
difficulty_score is informational metadata; card rendering doesn't depend on it
Existing cloze GUIDs preserved when the same sentence is re-selected

Scope Exclusions (Future Work)

Pronominal suffix stripping — would improve the ~7% unscored token rate; deferred (PROJECT_NOTES.md)
Kamatz katan disambiguation — requires morphological analysis; accepted limitation
Per-learner adaptive difficulty — requires Anki plugin; out of scope for static deck
Multiple cloze sentences per card — would require schema migration to list; deferred
