schema: add difficulty_score field + update spec with MIN_WORDS=3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sochen 2026-03-15 13:30:13 +00:00
parent 8b24d0fd26
commit 14d567a261
2 changed files with 3 additions and 0 deletions


@@ -69,6 +69,7 @@ entry:
cloze_word_end: 4 # End offset — enables exact extraction regardless of nikkud changes
cloze_hint: "family member"
cloze_guid: "def456..." # GUID for the cloze note
difficulty_score: 234 # Median frequency rank of context words (lower = easier); optional
rejected_count: 0
# --- Noun-specific: Inflection Forms ---


@@ -54,6 +54,8 @@ def _score(s: dict) -> tuple[int,]:
The new scoring replaces sentence length with frequency-based difficulty. The `_score` function gains access to the frequency pipeline via a closure over `nikkud_map`, `nikkud_index`, and `freq_data`, which are built once at the start of `update_words_json()`.
**Minimum sentence length:** Reduced from 4 words to 3 words (`MIN_WORDS = 3` in `epub_examples.py`). Hebrew is more concise than English, so 3-word sentences are valid and common. This expands the candidate pool for cloze selection.
**Behavioral change:** Because `pool.sort(key=_score)` determines which 3 sentences are selected as `best = pool[:3]`, changing the scoring function changes **which sentences are selected**, not just their order. This is intentional: we want the easiest sentences as cloze candidates, not the ones closest to 9 words long. Existing cloze GUIDs will be preserved when the same sentence text is re-selected; entries where a different sentence wins will get new GUIDs.
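The selection behavior described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `difficulty_score`, `select_best`, `UNKNOWN_RANK`, and the plain `freq_rank` dictionary are hypothetical stand-ins for the real `_score` closure and frequency pipeline, and the nikkud handling is omitted entirely. Only `MIN_WORDS = 3`, the median-of-context-ranks definition, and the sort-then-take-three step come from the spec.

```python
from statistics import median

MIN_WORDS = 3          # minimum sentence length, as stated in the spec
UNKNOWN_RANK = 10_000  # hypothetical fallback rank for words missing from the frequency data


def difficulty_score(sentence: str, target: str, freq_rank: dict[str, int]) -> int:
    """Median frequency rank of the context words (lower = easier)."""
    # Context words are every token except the cloze target (assumed whitespace-tokenized).
    context = [w for w in sentence.split() if w != target]
    if not context:
        return UNKNOWN_RANK
    return int(median(freq_rank.get(w, UNKNOWN_RANK) for w in context))


def select_best(sentences: list[str], target: str,
                freq_rank: dict[str, int], n: int = 3) -> list[str]:
    """Drop sentences shorter than MIN_WORDS, sort easiest first, keep the top n."""
    pool = [s for s in sentences if len(s.split()) >= MIN_WORDS]
    # The key is a 1-tuple, mirroring the `tuple[int,]` return type of `_score`.
    pool.sort(key=lambda s: (difficulty_score(s, target, freq_rank),))
    return pool[:n]
```

Because the sort key decides which sentences reach `pool[:n]`, swapping a length-based key for this one changes the selected set and not only its ordering, which is the behavioral change noted above.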
## Data Model Changes