Sprint 9: cloze cards, plurals deck, project reorg, lint tooling

- Cloze card pipeline: 924 cards from 2,296 AI-vetted Hebrew book sentences
- Plurals deck: 375 notes (144 irregular + 231 regular from 86 mishkal patterns)
- Ktiv male forms expanded to 20,711 entries for sentence matching
- Project reorg: helpers.py (deduped strip_nikkud from 10 files), scripts/ for one-off tools, tests/ with smoke tests, deleted 3 dead files
- Lint tooling: pyproject.toml with ruff/vulture/bandit/pytest config, .editorconfig, fixed all 129 ruff errors (B023 closure fix, SIM103, unused vars)
- validate_apkg.py: card count range check for optional cloze template
- Data caches committed: vetted_sentences, ktiv_male_forms, noun_plurals, noun_slug_map, vocab_sentence_matches, epub_sentence_index

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
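
Note on the "B023 closure fix" item: ruff's B023 rule flags a function defined inside a loop that reads the loop variable, because the closure looks the variable up when it is called, not when it is defined. The commit's actual fix is not visible in this diff, so the following is only a minimal sketch of the pattern and one common remedy. The function names and sample words are hypothetical.

```python
def make_callbacks_buggy(words):
    """Each lambda closes over the loop variable itself (the B023 pattern)."""
    callbacks = []
    for word in words:
        # All lambdas share one `word` binding and see its final value.
        callbacks.append(lambda: word)
    return callbacks


def make_callbacks_fixed(words):
    """Bind the current value at definition time via a default argument."""
    callbacks = []
    for word in words:
        callbacks.append(lambda word=word: word)
    return callbacks


if __name__ == "__main__":
    sample = ["a", "b", "c"]  # hypothetical input
    print([f() for f in make_callbacks_buggy(sample)])
    print([f() for f in make_callbacks_fixed(sample)])
```

Binding through a default argument is one of several possible fixes; `functools.partial` or a small factory function would work as well.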
2026-03-07 08:09:39 +00:00
commit 17f7458d19
parent 419e952389
37 changed files with 330541 additions and 871 deletions
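
Note on the `strip_nikkud` dedupe: the commit message says the helper was consolidated into `helpers.py`, but that file's contents are not shown in this diff. A minimal sketch of what the shared helper might look like, assuming it keeps the same body as the per-module copies this commit removes (NFD normalization, then dropping every character in Unicode category `Mn`):

```python
import unicodedata


def strip_nikkud(text: str) -> str:
    """Remove Hebrew nikkud (combining marks) from a string.

    Assumed to mirror the per-module `_strip_nikkud` copies; the real
    helpers.py may differ.
    """
    return "".join(
        ch
        for ch in unicodedata.normalize("NFD", text)
        if unicodedata.category(ch) != "Mn"
    )


if __name__ == "__main__":
    # shin + shin dot + qamats, mem + patah, resh
    print(strip_nikkud("\u05e9\u05c1\u05b8\u05de\u05b7\u05e8"))
```

Modules can then keep their private name with `from helpers import strip_nikkud as _strip_nikkud`, which is the import form the changed files use.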
--- a/.editorconfig
+++ b/.editorconfig
@ -0,0 +1,15 @@
 root = true
 [*]
 indent_style = space
 indent_size = 4
 end_of_line = lf
 charset = utf-8
 trim_trailing_whitespace = true
 insert_final_newline = true
 [*.{json,yml,yaml,toml}]
 indent_size = 2
 [*.md]
 trim_trailing_whitespace = false
--- a/.gitignore
+++ b/.gitignore
@ -11,6 +11,7 @@ pyvenv.cfg
 venv/
 __pycache__/
 *.pyc
 .pytest_cache/
 # Large generated cache files (rebuild locally)
 data/benyehuda_index.json
@ -31,6 +32,20 @@ ANKIWEB_DESCRIPTION.md
 PROJECTS.md
 SPRINT_LOG.md
 CLAUDE.md
 RECOMMENDATIONS.md
 # Intermediate scrape progress files
 data/ktiv_male_forms.json.partial
 data/ktiv_male_forms_partial.json
 data/ktiv_scrape_progress.json
 data/noun_slug_map_progress.json
 data/top_verbs_to_scrape.json
 # EPUB source files (large; user-specific)
 data/epubs/
 # Stray deck files
 Everything__*.apkg
 # Release artifacts — distributed via Forgejo releases, not committed to tree
 releases/
--- a/README.md
+++ b/README.md
@ -6,16 +6,17 @@
 ## For Hebrew learners
-This project generates two Anki decks for learning Modern Hebrew:
+A set of Anki flashcard decks for learning Modern Hebrew — vocabulary, verb conjugations, and more. All words include nikkud (vowel marks), audio, and are sorted by frequency so you learn the most useful words first.
- **Vocabulary deck** — ~9,100 words from [pealim.com](https://www.pealim.com/dict/), with nikkud (vowel marks), roots, parts of speech, related words, and example sentences from classic Hebrew literature.
+### What's included
 - **Conjugation deck** — 70 paradigm verbs from Coffin & Bolozky's *A Reference Grammar of Modern Hebrew* (2005), fully conjugated in all tenses and persons, across all seven binyanim.
-All card data comes from open or academic sources:
+- **Vocabulary** — ~9,100 Hebrew words with pronunciation audio, roots, example sentences from Hebrew literature, images, and frequency rankings.
- Word data: [pealim.com](https://www.pealim.com) — a free Modern Hebrew dictionary
+- **Verb conjugations** — 71 core verbs fully conjugated in all tenses and persons, covering all seven binyanim (verb patterns).
- Example sentences: [Project Ben-Yehuda](https://benyehuda.org) — public-domain Hebrew literature corpus
+- **Confusables** — Words that look the same without vowel marks (e.g., דָּבָר "thing" vs. דִּבֵּר "spoke") shown side by side so you can tell them apart.
- Word frequency: [hermitdave/FrequencyWords](https://github.com/hermitdave/FrequencyWords) — Hebrew frequency list
+- **Noun plurals** — Practice forming singular↔plural pairs, with a focus on irregular plurals and common patterns.
- Verb paradigm list: Coffin, Edna Amir and Shmuel Bolozky. *A Reference Grammar of Modern Hebrew*. Cambridge University Press, 2005.
+- **All-in-one** — A combined deck with everything above, organized as subdecks.
 You can download and import any deck individually — or use the combined deck to get everything at once.
 ---
@ -25,17 +26,19 @@ All card data comes from open or academic sources:
 2. Double-click to import into [Anki](https://apps.ankiweb.net/) (free, cross-platform)
 3. Start studying
-Both decks can be imported independently. If you already have one, re-importing the same file updates your deck without losing study progress.
+All decks can be imported independently — pick just the ones you want. Re-importing the same file later updates your deck without losing study progress.
 ---
 ## What's in the vocabulary deck
-Each card has two sides:
+Each note generates up to three cards:
 **Hebrew → English:** See the Hebrew word (with nikkud) + hear audio → recall the meaning.
-**English → Hebrew:** See the English meaning → recall the Hebrew word, its root, and how to write it.
+**English → Hebrew:** See the English meaning → recall the Hebrew word. When multiple words share the same English meaning, a disambiguation hint (part of speech + binyan) helps you know which word is expected.
 **Sentence Cloze:** A Hebrew sentence with the target word blanked out → fill in the missing word. Only generated for words with a vetted example sentence. Tests recognition in context.
 Fields on each card:
 | Field | Example |
@ -43,56 +46,84 @@ Fields on each card:
 | Hebrew word (nikkud) | שָׁמַר |
 | Meaning | kept, watched over |
 | Root | שמ״ר |
-| Part of speech | פועל (verb) |
+| Part of speech | פועל — פָּעַל |
 | Without nikkud | שמר |
-| Related words | שׁוֹמֵר, שְׁמִירָה |
+| Related words | שׁוֹמֵר, שְׁמִירָה (grouped by Part of Speech) |
-| Example sentence | from Ben-Yehuda corpus |
+| Example sentence | from nikkud'd Hebrew books |
 | Audio | pronunciation from pealim.com |
 | Frequency rank | #412 |
 | Image / Emoji | for concrete nouns |
 | Plural form | for nouns: רבים: שֻׁלְחָנוֹת |
 | Disambiguation hint | for ambiguous Eng→Heb cards |
-Cards are presented in **frequency order** — Anki will show you the most common words first. Frequency rank is displayed on every card so you can see how common each word is. Words not in the top 50,000 show a "50k+" badge.
+Cards are presented in **frequency order** — Anki will show you the most common words first.
 ### Eng→Heb disambiguation
 When two Hebrew words translate to the same English (e.g., both mean "to return"), the Eng→Heb card shows a hint to tell them apart:
 - **Layer 1:** Automatic Part of Speech + binyan hints for words with different parts of speech (163 words)
 - **Layer 2:** AI-refined distinct glosses for true synonyms sharing the same Part of Speech (440 words)
 ---
 ## What's in the conjugation deck
-70 paradigm verbs from Coffin & Bolozky's *A Reference Grammar of Modern Hebrew* (Appendix 1), covering all seven binyanim:
+71 verbs listed in Appendix 1 of Coffin & Bolozky's *A Reference Grammar of Modern Hebrew*, covering all seven binyanim and **all irregular forms**:
 - פָּעַל (Pa'al), נִפְעַל (Nif'al), פִּעֵל (Pi'el), פֻּעַל (Pu'al)
 - הִתְפַּעֵל (Hitpa'el), הִפְעִיל (Hif'il), הֻפְעַל (Huf'al)
-Each verb is drilled in: present, past, future, and imperative — all persons and genders. The infinitive is shown on the card front as context but is not quizzed.
+Each verb is drilled in: present, past, future, and imperative — all persons and genders. Each card shows the English meaning and related vocabulary from the same root.
-**Present tense expansion:** Each present form generates 3 cards (one per pronoun that uses it), so you learn אֲנִי, אַתָּה, and הוּא all separately with the same masculine singular form.
+**Present tense expansion:** Each present tense form is shown with a randomly chosen pronoun on the front of the card, so you get used to seeing אֲנִי, אַתָּה, and הוּא with the conjugated verb, even though all three take the same form in the present tense.
-**Modern Hebrew 2fp/3fp:** Classical feminine plural future forms (e.g., תִּשְׁמֹרְנָה) are shown in parentheses; the card's primary answer is the modern masculine plural form used in everyday speech.
+**Modern Hebrew 2fp/3fp:** Classical feminine plural future forms (e.g., תִּשְׁמֹרְנָה) are shown in parentheses and played as audio (in the decks that include audio). The card's primary answer is the modern masculine plural form used in everyday speech.
-**Passive label:** Pu'al and Huf'al cards show the active partner's infinitive on the front (e.g., לְבַטֵּל) followed by **(סָבִיל)** in smaller text, so you know you're drilling the passive conjugation. Active verbs show no label.
+**Passive label:** Pu'al and Huf'al cards show the active partner's infinitive on the front (e.g., לְבַטֵּל) followed by **(סָבִיל)** in smaller text, so you know you're drilling the passive conjugation.
-**Card order:** New cards are introduced in random order.
+**Card order:** New conjugation cards are introduced in random order (not grouped by verb).
-**Citation:** Coffin, Edna Amir and Shmuel Bolozky. *A Reference Grammar of Modern Hebrew*. Cambridge University Press, 2005.
+---
 ## What's in the confusables deck
 Hebrew without vowel marks is full of lookalikes. This deck groups words that are spelled identically without nikkud and asks "מה ההבדל?" (what's the difference?). The answer reveals all the words side by side with their nikkud and definitions.
 Examples: דָּבָר (thing) vs. דִּבֵּר (spoke), סֵפֶר (book) vs. סָפַר (counted) vs. סַפָּר (barber).
 ---
 ## What's in the plurals deck
 Two card directions for each noun:
 - **Singular → Plural:** See שֻׁלְחָן → produce שֻׁלְחָנוֹת
 - **Plural → Singular:** See שֻׁלְחָנוֹת → produce שֻׁלְחָן
 Focuses on irregular plurals (the tricky ones that don't follow the rules) and common examples from each noun pattern. Cards are tagged by pattern for filtered study.
 ---
 ## Suggested study strategy
-Start with the vocabulary deck. Anki will present the most frequent words first. Don't try to study to many cards every single day-- Anki suggests 20 per day. 
+Start with the vocabulary deck. Anki will present the most frequent words first. Don't try to study too many cards every single day — Anki suggests 20 per day.
-The conjugation cards reinforce verb forms you've already seen in vocabulary. 
+The conjugation cards reinforce verb forms you've already seen in vocabulary.
-Use the Hebrew → English direction to build reading comprehension. Use the English → Hebrew direction to build writing and speaking recall.
+Use the Hebrew → English direction to build reading comprehension. Use the English → Hebrew direction to build writing and speaking recall. The sentence cloze cards test whether you can recognize words in real Hebrew text.
 ---
 ## About the data sources
-**pealim.com** — A comprehensive free Modern Hebrew dictionary with nikkud, roots, conjugations, and audio. This project scrapes the public dictionary and conjugation tables. 
+**pealim.com** — A comprehensive free Modern Hebrew dictionary with nikkud, roots, conjugations, and audio. This project scrapes the public dictionary and conjugation tables.
 **Project Ben-Yehuda** — A public-domain digital library of Hebrew literature. Example sentences come from the nikkud corpus (classic texts with full vowel marks).
 **Hebrew books** — Additional example sentences from nikkud'd (menukad) Hebrew books, quality-filtered by Claude Sonnet. The AI does not generate the sentences; it only judges whether each one is a high-quality example.
 **FrequencyWords** — An open Hebrew word frequency list derived from subtitle data. Used to sort vocabulary cards from most to least common.
-**Coffin & Bolozky** — The verb paradigm list for the conjugation deck comes from Appendix 1 of *A Reference Grammar of Modern Hebrew* (Cambridge University Press, 2005), which provides a comprehensive reference for Modern Hebrew verbal morphology.
+**Coffin & Bolozky** — The verb list and the known-good conjugation reference for the conjugation deck both come from Appendix 1 of *A Reference Grammar of Modern Hebrew* (Cambridge University Press, 2005).
 ---
@ -100,9 +131,9 @@ Use the Hebrew → English direction to build reading comprehension. Use the Eng
 If you notice a wrong translation, missing audio, or incorrect conjugation:
- For vocabulary errors: the source is pealim.com — you can suggest corrections there. But if you think morfix has a correct translation and pealim.com does not, we may be able to encode an override. 
+- For vocabulary errors: the source is pealim.com — you can suggest corrections there. But if you think morfix has a correct translation and pealim.com does not, we may be able to encode an override.
-For any other issue, whether you know to code or not: Email me at pealim [at] nevo [dot] engineer
+For any other issue, whether you know how to code or not: Email me at hebrew [at] nevo [dot] engineer
 ---
@ -136,45 +167,78 @@ python run.py --skip-scrape --refresh-examples
 ```
 python run.py [options]
-  --skip-scrape        Use cached data/hebrew_dict.csv (no pealim.com scraping)
+  --only {vocab,conjugations,confusables,plurals,complete}
-  --skip-audio         Skip audio .mp3 downloads
+                         Build only one deck type
-  --skip-examples      Skip Ben Yehuda example fetching
+  --skip-scrape          Use cached data/hebrew_dict.csv
-  --only {vocab,conjugations}  Run only one deck (skips all unrelated steps)
+  --skip-audio           Skip audio .mp3 downloads
-  --skip-conjugations  Skip verb conjugation extraction (deprecated: use --only vocab)
+  --skip-examples        Skip Ben Yehuda example fetching
-  --skip-images        Skip image fetching for concrete nouns
+  --skip-conjugations    Skip verb conjugation extraction
-  --refresh-examples   Force rebuild of Ben Yehuda index (nikkud corpus)
+  --skip-images          Skip image fetching for concrete nouns
-  --test N             Process only first N words
+  --refresh-examples     Force rebuild of Ben Yehuda index
  --test N               Process only first N words
 ```
 ### Output files
 | File | Description |
 |------|-------------|
-| `data/hebrew_dict.csv` | Raw dictionary |
+| `output/hebrew_vocabulary.apkg` | Vocabulary deck (text only) |
-| `data/hebrew_dict_for_anki.csv` | Enriched Anki CSV |
+| `output/hebrew_vocabulary_audio.apkg` | Vocabulary deck + audio |
-| `data/conjugations.json` | Verb conjugation data |
+| `output/hebrew_vocabulary_images.apkg` | Vocabulary deck + images |
-| `data/audio/` | Vocabulary audio (.mp3) |
+| `output/hebrew_vocabulary_audio_images.apkg` | Vocabulary deck + audio + images |
-| `data/audio_conj/` | Conjugation audio (.mp3) |
+| `output/hebrew_conjugations.apkg` | Conjugation deck |
-| `data/fonts/` | Heebo font files (bundled in .apkg) |
+| `output/hebrew_conjugations_audio.apkg` | Conjugation deck + audio |
-| `data/images/` | Noun images from Wikipedia/Commons |
+| `output/hebrew_confusables.apkg` | Confusables deck |
-| `data/image_cache.json` | Image fetch cache |
+| `output/hebrew_confusables_audio.apkg` | Confusables deck + audio |
-| `output/hebrew_vocabulary.apkg` | Vocabulary Anki deck |
+| `output/hebrew_plurals.apkg` | Plurals deck |
-| `output/hebrew_conjugations.apkg` | Conjugation Anki deck |
+| `output/hebrew_plurals_audio.apkg` | Plurals deck + audio |
 | `output/hebrew_complete.apkg` | All decks combined |
 | `output/hebrew_complete_audio.apkg` | All decks combined + audio |
 ### Data files
 | File | Description |
 |------|-------------|
 | `data/hebrew_dict_for_anki.csv` | Enriched vocabulary CSV |
 | `data/conjugations.json` | Verb conjugation data (71 verbs) |
 | `data/noun_plurals.json` | Noun plural/construct forms |
 | `data/refined_meanings.json` | AI-disambiguated meanings (440 words) |
 | `data/vetted_sentences.json` | AI-vetted example sentences |
 | `data/ktiv_male_forms.json` | Ktiv male (plene) forms for sentence matching |
 | `data/legacy_guid_map.json` | Legacy GUIDs for study progress preservation |
 ### Pipeline overview
 1. `hebrew_extract.py` — scrapes pealim.com dictionary
 2. `frequency_lookup.py` — downloads/loads Hebrew frequency data
-3. `benyehuda.py` — builds sentence index from Ben-Yehuda corpus
+3. `benyehuda.py` — builds sentence index from Ben-Yehuda nikkud corpus
 4. `extract_verb_list.py` — extracts verb list from Coffin & Bolozky PDF
-5. `conjugation_extract.py` — fetches conjugation tables from pealim.com
+5. `conjugation_extract.py` — fetches conjugation tables + meanings from pealim.com
 6. `image_fetch.py` — fetches Wikipedia/Commons images for concrete nouns
-7. `validate_verb_list.py` — validates verb list against pealim.com
+7. `scrape_noun_plurals.py` — scrapes noun plural/construct forms from pealim.com
-8. `apkg_builder.py` — assembles both `.apkg` files
+8. `scrape_ktiv_male.py` — scrapes ktiv male (plene) forms for sentence matching
-9. `run.py` — orchestrates all steps
+9. `rebuild_sentence_matches.py` — matches vocab words to book sentences
 10. `apkg_builder.py` — assembles all `.apkg` files
 11. `run.py` — orchestrates all steps
 12. `validate_apkg.py` — validates output decks
 ---
 ## Deck variants
 | Variant | Contents | Size |
 |---------|----------|------|
 | `hebrew_vocabulary.apkg` | Text + images | ~15 MB |
 | `hebrew_vocabulary_audio.apkg` | Text + images + audio | ~80 MB |
 | `hebrew_conjugations.apkg` | Text only | ~1 MB |
 | `hebrew_conjugations_audio.apkg` | Text + audio | ~5 MB |
 | `hebrew_confusables.apkg` | Text only | ~1 MB |
 | `hebrew_plurals.apkg` | Text only | ~1 MB |
 | `hebrew_complete.apkg` | Everything combined | ~20 MB |
 | `hebrew_complete_audio.apkg` | Everything + audio | ~90 MB |
 ---
 ## AnkiWeb
-The decks will be published as shared decks on AnkiWeb (TBD). 
+The decks will be published as shared decks on AnkiWeb (TBD).
--- a/apkg_builder.py
+++ b/apkg_builder.py
--- a/benyehuda.py
+++ b/benyehuda.py
@ -14,20 +14,18 @@ Exposed API:
 import json
 import logging
 import re
 import unicodedata
 import zipfile
 from io import BytesIO
 from pathlib import Path
 import requests
 from helpers import strip_nikkud as _strip_nikkud
 logger = logging.getLogger(__name__)
 # Nikkud-bearing corpus (txt.zip instead of txt_stripped.zip)
-CORPUS_URL = (
-    "https://github.com/projectbenyehuda/public_domain_dump/releases/"
-    "download/2025-10/txt.zip"
-)
+CORPUS_URL = "https://github.com/projectbenyehuda/public_domain_dump/releases/download/2025-10/txt.zip"
 INDEX_PATH = Path(__file__).parent / "data" / "benyehuda_index.json"
 EXAMPLES_CACHE_PATH = Path(__file__).parent / "data" / "examples_cache.json"
 REQUEST_TIMEOUT = 120
@ -36,15 +34,8 @@ MAX_SENTENCE_LEN = 200
 MAX_INDEX_ENTRIES = 500  # cap examples kept per word in index to limit memory
 # Module-level state
-_index: dict[str, list[str]] = {}          # word (with nikkud) -> [sentence, ...]
+_index: dict[str, list[str]] = {}  # word (with nikkud) -> [sentence, ...]
-_examples_cache: dict[str, list[str]] = {} # word -> cached result for this run
+_examples_cache: dict[str, list[str]] = {}  # word -> cached result for this run
-def _strip_nikkud(text: str) -> str:
-    return "".join(
-        ch for ch in unicodedata.normalize("NFD", text)
-        if unicodedata.category(ch) != "Mn"
-    )
 def _split_sentences(text: str) -> list[str]:
@ -73,7 +64,7 @@ def _build_index(corpus_zip_bytes: bytes) -> None:
        for fname in txt_files:
            try:
                raw = zf.read(fname).decode("utf-8", errors="ignore")
-            except Exception:
+            except Exception:  # noqa: S112
                continue
            for sentence in _split_sentences(raw):
                # Index by each unique Hebrew token (with nikkud) in the sentence
--- a/conjugation_extract.py
+++ b/conjugation_extract.py
@ -19,13 +19,14 @@ import json
 import logging
 import re
 import time
 import unicodedata
 import urllib.parse
 from pathlib import Path
 import requests
 from bs4 import BeautifulSoup
 from helpers import strip_nikkud as _strip_nikkud
 logger = logging.getLogger(__name__)
 PEALIM_BASE = "https://www.pealim.com"
@ -34,10 +35,14 @@ REQUEST_TIMEOUT = 15
 VERBS_INPUT = Path(__file__).parent / "verbs_input.txt"
 CONJUGATIONS_PATH = Path(__file__).parent / "data" / "conjugations.json"
 DICT_CSV = next(
-    (p for p in [
-        Path(__file__).parent / "data" / "hebrew_dict_for_anki.csv",
-        Path(__file__).parent / "data" / "pealim_dict_for_anki.csv",
-    ] if p.exists()),
+    (
+        p
+        for p in [
+            Path(__file__).parent / "data" / "hebrew_dict_for_anki.csv",
+            Path(__file__).parent / "data" / "pealim_dict_for_anki.csv",
+        ]
+        if p.exists()
+    ),
    Path(__file__).parent / "data" / "hebrew_dict_for_anki.csv",
 )
@ -47,17 +52,17 @@ PRONOUN_LABELS = {
    "present_fs": "",
    "present_mp": "",
    "present_fp": "",
-    "past_1s":    "אֲנִי",
+    "past_1s": "אֲנִי",
-    "past_1p":    "אֲנַחְנוּ",
+    "past_1p": "אֲנַחְנוּ",
-    "past_2ms":   "אַתָּה",
+    "past_2ms": "אַתָּה",
-    "past_2fs":   "אַתְּ",
+    "past_2fs": "אַתְּ",
-    "past_2mp":   "אַתֶּם",
+    "past_2mp": "אַתֶּם",
-    "past_2fp":   "אַתֶּן",
+    "past_2fp": "אַתֶּן",
-    "past_3ms":   "הוּא",
+    "past_3ms": "הוּא",
-    "past_3fs":   "הִיא",
+    "past_3fs": "הִיא",
-    "past_3p":    "הֵם / הֵן",
+    "past_3p": "הֵם / הֵן",
-    "future_1s":  "אֲנִי",
+    "future_1s": "אֲנִי",
-    "future_1p":  "אֲנַחְנוּ",
+    "future_1p": "אֲנַחְנוּ",
    "future_2ms": "אַתָּה",
    "future_2fs": "אַתְּ",
    "future_2mp": "אַתֶּם",
@ -79,17 +84,17 @@ TENSE_DESCRIPTION = {
    "present_fs": "הוֹוֶה",
    "present_mp": "הוֹוֶה",
    "present_fp": "הוֹוֶה",
-    "past_1s":    "עָבָר",
+    "past_1s": "עָבָר",
-    "past_1p":    "עָבָר",
+    "past_1p": "עָבָר",
-    "past_2ms":   "עָבָר",
+    "past_2ms": "עָבָר",
-    "past_2fs":   "עָבָר",
+    "past_2fs": "עָבָר",
-    "past_2mp":   "עָבָר",
+    "past_2mp": "עָבָר",
-    "past_2fp":   "עָבָר",
+    "past_2fp": "עָבָר",
-    "past_3ms":   "עָבָר",
+    "past_3ms": "עָבָר",
-    "past_3fs":   "עָבָר",
+    "past_3fs": "עָבָר",
-    "past_3p":    "עָבָר",
+    "past_3p": "עָבָר",
-    "future_1s":  "עָתִיד",
+    "future_1s": "עָתִיד",
-    "future_1p":  "עָתִיד",
+    "future_1p": "עָתִיד",
    "future_2ms": "עָתִיד",
    "future_2fs": "עָתִיד",
    "future_2mp": "עָתִיד",
@ -105,21 +110,12 @@ TENSE_DESCRIPTION = {
    "infinitive": "מְקוֹר",
 }
-BINYAN_NAMES: tuple[str, ...] = (
-    "Pa'al", "Nif'al", "Pi'el", "Pu'al", "Hitpa'el", "Hif'il", "Huf'al"
-)
+BINYAN_NAMES: tuple[str, ...] = ("Pa'al", "Nif'al", "Pi'el", "Pu'al", "Hitpa'el", "Hif'il", "Huf'al")
 session = requests.Session()
 session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; pealim-anki/2.0)"})
-def _strip_nikkud(text: str) -> str:
-    """Remove Hebrew nikkud (diacritics) from a string."""
-    return "".join(
-        ch for ch in unicodedata.normalize("NFD", text)
-        if unicodedata.category(ch) != "Mn"
-    )
 def _build_pos_lookup() -> dict[str, str]:
    """Build word_stripped → binyan dict from pealim_dict_for_anki.csv."""
@ -129,6 +125,7 @@ def _build_pos_lookup() -> dict[str, str]:
    try:
        import pandas as pd
        try:
            df = pd.read_csv(DICT_CSV, sep=";", index_col=0)
            if df.shape[1] < 3:
@ -168,13 +165,13 @@ def _binyan_from_pos(word: str) -> str:
    pos_lower = pos_str.lower()
    # Map lowercase pealim.com PoS variants → canonical names
    for bname, variants in [
-        ("Pa'al",    ["pa'al", "paal"]),
+        ("Pa'al", ["pa'al", "paal"]),
-        ("Nif'al",   ["nif'al", "nifal"]),
+        ("Nif'al", ["nif'al", "nifal"]),
-        ("Pi'el",    ["pi'el", "piel"]),
+        ("Pi'el", ["pi'el", "piel"]),
-        ("Pu'al",    ["pu'al", "pual"]),
+        ("Pu'al", ["pu'al", "pual"]),
        ("Hitpa'el", ["hitpa'el", "hitpael"]),
-        ("Hif'il",   ["hif'il", "hifil"]),
+        ("Hif'il", ["hif'il", "hifil"]),
-        ("Huf'al",   ["huf'al", "hufal"]),
+        ("Huf'al", ["huf'al", "hufal"]),
    ]:
        if any(v in pos_lower for v in variants):
            return bname
@ -305,7 +302,7 @@ def _parse_table(soup: BeautifulSoup, passive: bool = False, table_el=None) -> d
    if present_row >= 0:
        hf = first_heb_forms(present_row)
        keys = ["present_ms", "present_fs", "present_mp", "present_fp"]
-        for k, (v, au) in zip(keys, hf):
+        for k, (v, au) in zip(keys, hf, strict=False):
            store(k, v, au)
    # Past tense
@ -319,13 +316,13 @@ def _parse_table(soup: BeautifulSoup, passive: bool = False, table_el=None) -> d
        if past_row + 1 < len(rows):
            hf2 = first_heb_forms(past_row + 1)
            keys2 = ["past_2ms", "past_2fs", "past_2mp", "past_2fp"]
-            for k, (v, au) in zip(keys2, hf2):
+            for k, (v, au) in zip(keys2, hf2, strict=False):
                store(k, v, au)
        if past_row + 2 < len(rows):
            unique3 = deduplicate(first_heb_forms(past_row + 2))
            keys3 = ["past_3ms", "past_3fs", "past_3p"]
-            for k, (v, au) in zip(keys3, unique3):
+            for k, (v, au) in zip(keys3, unique3, strict=False):
                store(k, v, au)
    # Future tense
@ -339,20 +336,20 @@ def _parse_table(soup: BeautifulSoup, passive: bool = False, table_el=None) -> d
        if future_row + 1 < len(rows):
            hf2 = first_heb_forms(future_row + 1)
            keys2 = ["future_2ms", "future_2fs", "future_2mp", "future_2fp"]
-            for k, (v, au) in zip(keys2, hf2):
+            for k, (v, au) in zip(keys2, hf2, strict=False):
                store(k, v, au)
        if future_row + 2 < len(rows):
            hf3 = first_heb_forms(future_row + 2)
            keys3 = ["future_3ms", "future_3fs", "future_3mp", "future_3fp"]
-            for k, (v, au) in zip(keys3, hf3):
+            for k, (v, au) in zip(keys3, hf3, strict=False):
                store(k, v, au)
    # Imperative
    if imp_row >= 0:
        hf = first_heb_forms(imp_row)
        keys = ["imperative_ms", "imperative_fs", "imperative_mp", "imperative_fp"]
-        for k, (v, au) in zip(keys, hf):
+        for k, (v, au) in zip(keys, hf, strict=False):
            store(k, v, au)
    # Infinitive
@ -399,7 +396,9 @@ def _extract_passive_binyan_from_page(soup: BeautifulSoup) -> str:
    return ""
-def _extract_conjugations(slug: str, search_term: str, is_3ms_search: bool = False, binyan_hint: str = "") -> dict | None:
+def _extract_conjugations(
+    slug: str, search_term: str, is_3ms_search: bool = False, binyan_hint: str = ""
+) -> dict | None:
    """Fetch /dict/<slug>/ and parse conjugation table (active + passive)."""
    url = f"{PEALIM_BASE}/dict/{slug}/"
    try:
@ -411,6 +410,12 @@ def _extract_conjugations(slug: str, search_term: str, is_3ms_search: bool = Fal
    soup = BeautifulSoup(resp.text, "lxml")
    # Extract meaning from <div class="lead"> (English translation)
    meaning = ""
    lead_div = soup.find("div", class_="lead")
    if lead_div:
        meaning = lead_div.get_text(strip=True)
    # Extract root
    root = ""
    for span in soup.find_all("span", class_="menukad"):
@ -440,10 +445,7 @@ def _extract_conjugations(slug: str, search_term: str, is_3ms_search: bool = Fal
    infinitive_form = forms_raw.get("infinitive", {}).get("form", "") if not is_passive else ""
    past_3ms_form = forms_raw.get("past_3ms", {}).get("form", "")
-    if is_passive:
-        reference_form = past_3ms_form or search_term
-    else:
-        reference_form = infinitive_form or search_term
+    reference_form = (past_3ms_form or search_term) if is_passive else (infinitive_form or search_term)
    # Build active result
    result = {
@ -451,6 +453,7 @@ def _extract_conjugations(slug: str, search_term: str, is_3ms_search: bool = Fal
        "slug": slug,
        "root": root,
        "binyan": binyan,
        "meaning": meaning,
        "is_passive": is_passive,
        "reference_form": reference_form,
        "forms": {},
@ -474,10 +477,7 @@ def _extract_conjugations(slug: str, search_term: str, is_3ms_search: bool = Fal
    passive_table_ids = {
        id(t) for t in (passive_h3.find_all_next("table", class_="conjugation-table") if passive_h3 else [])
    }
-    active_tables = [
-        t for t in soup.find_all("table", class_="conjugation-table")
-        if id(t) not in passive_table_ids
-    ]
+    active_tables = [t for t in soup.find_all("table", class_="conjugation-table") if id(t) not in passive_table_ids]
    if len(active_tables) >= 2:
        alt_raw = _parse_table(soup, passive=False, table_el=active_tables[1])
        alternate_forms = {}
@ -521,6 +521,12 @@ def _extract_passive_from_active_slug(active_slug: str, search_term: str, binyan
    soup = BeautifulSoup(resp.text, "lxml")
    # Extract meaning (this is the active verb's meaning — useful context for passive)
    meaning = ""
    lead_div = soup.find("div", class_="lead")
    if lead_div:
        meaning = lead_div.get_text(strip=True)
    root = ""
    for span in soup.find_all("span", class_="menukad"):
        txt = span.get_text(strip=True)
@ -548,6 +554,7 @@ def _extract_passive_from_active_slug(active_slug: str, search_term: str, binyan
        "slug": active_slug,
        "root": root,
        "binyan": passive_binyan,
        "meaning": meaning,
        "is_passive": True,
        "reference_form": active_infinitive or search_term,
        "forms": {},
@ -578,14 +585,19 @@ def main(verbs_file: Path = VERBS_INPUT) -> dict:
    for line in raw_lines:
        stripped = line.strip()
        if stripped.startswith("# slug:"):
-            parts = stripped[len("# slug:"):].strip().split()
+            parts = stripped[len("# slug:") :].strip().split()
            if len(parts) >= 2:
                slug_overrides[parts[0]] = parts[1]
    # Map section header keywords → binyan name (for binyan_hint fallback)
    SECTION_BINYAN = {
-        "pa'al": "Pa'al", "nif'al": "Nif'al", "pi'el": "Pi'el",
-        "pu'al": "Pu'al", "hitpa'el": "Hitpa'el", "hif'il": "Hif'il", "huf'al": "Huf'al",
+        "pa'al": "Pa'al",
+        "nif'al": "Nif'al",
+        "pi'el": "Pi'el",
+        "pu'al": "Pu'al",
+        "hitpa'el": "Hitpa'el",
+        "hif'il": "Hif'il",
+        "huf'al": "Huf'al",
    }
    # Parse: regular verbs and # 3ms: lines (optional active slug on 3ms lines)
@ -597,7 +609,7 @@ def main(verbs_file: Path = VERBS_INPUT) -> dict:
        if not stripped or stripped.startswith("# slug:"):
            continue
        if stripped.startswith("# 3ms:"):
-            parts = stripped[len("# 3ms:"):].strip().split()
+            parts = stripped[len("# 3ms:") :].strip().split()
            if parts:
                form = parts[0]
                active_slug = parts[1] if len(parts) >= 2 else None
@ -612,8 +624,7 @@ def main(verbs_file: Path = VERBS_INPUT) -> dict:
        else:
            verbs.append((stripped, False, None, current_binyan_hint))
-    logger.info(f"Loaded {len(verbs)} verbs from {verbs_file} "
-                f"({sum(1 for _, p, _, _ in verbs if p)} passive 3ms)")
+    logger.info(f"Loaded {len(verbs)} verbs from {verbs_file} ({sum(1 for _, p, _, _ in verbs if p)} passive 3ms)")
    if slug_overrides:
        logger.info(f"  Slug overrides: {slug_overrides}")
--- a/data/conjugations.json
+++ b/data/conjugations.json
@ -175,7 +175,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to guard; to keep, to maintain (על)"
  },
  "ללמוד": {
    "infinitive": "ללמוד",
@ -353,7 +354,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to learn, to study"
  },
  "לאסוף": {
    "infinitive": "לאסוף",
@ -531,7 +533,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to collect, to pick up, to reap"
  },
  "לעבוד": {
    "infinitive": "לעבוד",
@ -709,7 +712,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to work; to operate, to function"
  },
  "לחבוש": {
    "infinitive": "לחבוש",
@ -887,7 +891,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to bandage; to put on (a hat)"
  },
  "לאכול": {
    "infinitive": "לאכול",
@ -1065,7 +1070,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to eat"
  },
  "לשאול": {
    "infinitive": "לשאול",
@ -1243,7 +1249,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to ask; to borrow"
  },
  "לשלוח": {
    "infinitive": "לשלוח",
@ -1421,7 +1428,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to send, to dispatch"
  },
  "לגבוה": {
    "infinitive": "לגבוה",
@ -1599,7 +1607,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to be high, exalted"
  },
  "לשבת": {
    "infinitive": "לשבת",
@ -1777,7 +1786,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to sit, to settle"
  },
  "לרשת": {
    "infinitive": "לרשת",
@ -1955,7 +1965,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to inherit"
  },
  "לִיפּוֹל": {
    "infinitive": "לִיפּוֹל",
@ -2133,7 +2144,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to fall, to drop"
  },
  "לקום": {
    "infinitive": "לקום",
@ -2311,7 +2323,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to get up, to stand up, to arise; to be established, to come into being"
  },
  "לחון": {
    "infinitive": "לחון",
@ -2489,7 +2502,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to pardon, to amnesty; to endow"
  },
  "לקרוא": {
    "infinitive": "לקרוא",
@ -2667,7 +2681,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to read (ב-, את); to call (ל-)"
  },
  "לקנות": {
    "infinitive": "לקנות",
@ -2845,7 +2860,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to buy, to purchase"
  },
  "להיבדק": {
    "infinitive": "להיבדק",
@ -3023,7 +3039,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to be tested, examined"
  },
  "להרדם": {
    "infinitive": "להרדם",
@ -3201,7 +3218,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to fall asleep, to doze off"
  },
  "להיהרג": {
    "infinitive": "להיהרג",
@ -3379,7 +3397,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to be killed"
  },
  "להחקר": {
    "infinitive": "להחקר",
@ -3557,7 +3576,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to be investigated, explored"
  },
  "להישאר": {
    "infinitive": "להישאר",
@ -3735,7 +3755,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to remain"
  },
  "להיפגע": {
    "infinitive": "להיפגע",
@ -3913,7 +3934,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to be damaged, to be injured, to be wounded; to be insulted, to be offended"
  },
  "להיוולד": {
    "infinitive": "להיוולד",
@ -4091,7 +4113,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to be born"
  },
  "להנצל": {
    "infinitive": "להנצל",
@ -4269,7 +4292,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to be saved, to be rescued, to survive"
  },
  "להיסוג": {
    "infinitive": "להיסוג",
@ -4447,7 +4471,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to withdraw, to retreat"
  },
  "להימצא": {
    "infinitive": "להימצא",
@ -4625,7 +4650,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to be found, discovered; to be present, to be located"
  },
  "להיבנות": {
    "infinitive": "להיבנות",
@ -4803,7 +4829,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to be built, constructed"
  },
  "לדבר": {
    "infinitive": "לדבר",
@ -5130,7 +5157,8 @@
          "tense": "עָתִיד"
        }
      }
-    }
+    },
    "meaning": "to speak, to talk"
  },
  "לברך": {
    "infinitive": "לברך",
@ -5457,7 +5485,8 @@
          "tense": "עָתִיד"
        }
      }
-    }
+    },
    "meaning": "to bless, to greet, to felicitate"
  },
  "לנהל": {
    "infinitive": "לנהל",
@ -5784,7 +5813,8 @@
          "tense": "עָתִיד"
        }
      }
-    }
+    },
    "meaning": "to manage, to organize"
  },
  "לנצח": {
    "infinitive": "לנצח",
@ -6111,7 +6141,8 @@
          "tense": "עָתִיד"
        }
      }
-    }
+    },
    "meaning": "to win; to overcome, to beat; to conduct, to orchestrate"
  },
  "לקומם": {
    "infinitive": "לקומם",
@ -6438,7 +6469,8 @@
          "tense": "עָתִיד"
        }
      }
-    }
+    },
    "meaning": "to outrage, to anger"
  },
  "למלא": {
    "infinitive": "למלא",
@ -6765,7 +6797,8 @@
          "tense": "עָתִיד"
        }
      }
-    }
+    },
    "meaning": "to fill; to fill out; to fulfil"
  },
  "לחכות": {
    "infinitive": "לחכות",
@ -7092,7 +7125,8 @@
          "tense": "עָתִיד"
        }
      }
-    }
+    },
    "meaning": "to await, to wait for (ל-)"
  },
  "לגלגל": {
    "infinitive": "לגלגל",
@ -7419,7 +7453,8 @@
          "tense": "עָתִיד"
        }
      }
-    }
+    },
    "meaning": "to roll, to revolve (transitive)"
  },
  "להתלבש": {
    "infinitive": "להתלבש",
@ -7597,7 +7632,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to dress oneself"
  },
  "להסתלק": {
    "infinitive": "להסתלק",
@ -7775,7 +7811,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to leave, to go away"
  },
  "להצטלם": {
    "infinitive": "להצטלם",
@ -7953,7 +7990,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to pose for a photograph, to be photographed"
  },
  "להזדקק": {
    "infinitive": "להזדקק",
@ -8131,7 +8169,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to need, to require (ל-)"
  },
  "להתנהג": {
    "infinitive": "להתנהג",
@ -8309,7 +8348,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to behave"
  },
  "להתקומם": {
    "infinitive": "להתקומם",
@ -8487,7 +8527,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to rebel, to revolt"
  },
  "להתפלא": {
    "infinitive": "להתפלא",
@ -8665,7 +8706,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to wonder, to be surprised"
  },
  "להתקלקל": {
    "infinitive": "להתקלקל",
@ -8843,7 +8885,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to be damaged, to be spoiled (of food products)"
  },
  "להכניס": {
    "infinitive": "להכניס",
@ -9170,7 +9213,8 @@
          "tense": "עָתִיד"
        }
      }
-    }
+    },
    "meaning": "to insert, to bring in"
  },
  "להעסיק": {
    "infinitive": "להעסיק",
@ -9497,7 +9541,8 @@
          "tense": "עָתִיד"
        }
      }
-    }
+    },
    "meaning": "to keep busy; to employ"
  },
  "להחליט": {
    "infinitive": "להחליט",
@ -9824,7 +9869,8 @@
          "tense": "עָתִיד"
        }
      }
-    }
+    },
    "meaning": "to decide"
  },
  "להבטיח": {
    "infinitive": "להבטיח",
@ -10151,7 +10197,8 @@
          "tense": "עָתִיד"
        }
      }
-    }
+    },
    "meaning": "to ensure, to promise"
  },
  "להוריד": {
    "infinitive": "להוריד",
@ -10478,7 +10525,8 @@
          "tense": "עָתִיד"
        }
      }
-    }
+    },
    "meaning": "to lower, to reduce; to download (computing)"
  },
  "להפיל": {
    "infinitive": "להפיל",
@ -10805,7 +10853,8 @@
          "tense": "עָתִיד"
        }
      }
-    }
+    },
    "meaning": "to drop, to throw down"
  },
  "להקים": {
    "infinitive": "להקים",
@ -11132,7 +11181,8 @@
          "tense": "עָתִיד"
        }
      }
-    }
+    },
    "meaning": "to build, to found, to establish"
  },
  "להמציא": {
    "infinitive": "להמציא",
@ -11459,7 +11509,8 @@
          "tense": "עָתִיד"
        }
      }
-    }
+    },
    "meaning": "to invent; to make up; to present"
  },
  "להרשות": {
    "infinitive": "להרשות",
@ -11786,7 +11837,8 @@
          "tense": "עָתִיד"
        }
      }
-    }
+    },
    "meaning": "to allow, to permit"
  },
  "להקל": {
    "infinitive": "להקל",
@ -12113,7 +12165,8 @@
          "tense": "עָתִיד"
        }
      }
-    }
+    },
    "meaning": "to ease, to alleviate"
  },
  "לָשִׂים": {
    "infinitive": "לָשִׂים",
@ -12291,7 +12344,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to put, to put on"
  },
  "בוטל": {
    "infinitive": "בוטל",
@ -12439,7 +12493,8 @@
        "pronoun": "הֵן",
        "tense": "עָתִיד"
      }
-    }
+    },
    "meaning": "to cancel, to undo"
  },
  "תואם": {
    "infinitive": "תואם",
@ -12587,7 +12642,8 @@
        "pronoun": "הֵן",
        "tense": "עָתִיד"
      }
-    }
+    },
    "meaning": "to coordinate"
  },
  "קומם": {
    "infinitive": "קומם",
@ -12735,7 +12791,8 @@
        "pronoun": "הֵן",
        "tense": "עָתִיד"
      }
-    }
+    },
    "meaning": "to outrage, to anger"
  },
  "דוכא": {
    "infinitive": "דוכא",
@ -12883,7 +12940,8 @@
        "pronoun": "הֵן",
        "tense": "עָתִיד"
      }
-    }
+    },
    "meaning": "to oppress, to crush; to cause depression"
  },
  "זוכה": {
    "infinitive": "זוכה",
@ -13031,7 +13089,8 @@
        "pronoun": "הֵן",
        "tense": "עָתִיד"
      }
-    }
+    },
    "meaning": "to achieve; to credit"
  },
  "פורסם": {
    "infinitive": "פורסם",
@ -13179,7 +13238,8 @@
        "pronoun": "הֵן",
        "tense": "עָתִיד"
      }
-    }
+    },
    "meaning": "to advertise, to publish, to publicize"
  },
  "הוגבל": {
    "infinitive": "הוגבל",
@ -13327,7 +13387,8 @@
        "pronoun": "הֵן",
        "tense": "עָתִיד"
      }
-    }
+    },
    "meaning": "to limit, to restrict, to confine"
  },
  "העבר": {
    "infinitive": "העבר",
@ -13475,7 +13536,8 @@
        "pronoun": "הֵן",
        "tense": "עָתִיד"
      }
-    }
+    },
    "meaning": "to transfer, to pass something"
  },
  "הוזהר": {
    "infinitive": "הוזהר",
@ -13623,7 +13685,8 @@
        "pronoun": "הֵן",
        "tense": "עָתִיד"
      }
-    }
+    },
    "meaning": "to warn"
  },
  "הופל": {
    "infinitive": "הופל",
@ -13771,7 +13834,8 @@
        "pronoun": "הֵן",
        "tense": "עָתִיד"
      }
-    }
+    },
    "meaning": "to drop, to throw down"
  },
  "הוקם": {
    "infinitive": "הוקם",
@ -13919,7 +13983,8 @@
        "pronoun": "הֵן",
        "tense": "עָתִיד"
      }
-    }
+    },
    "meaning": "to build, to found, to establish"
  },
  "הוחל": {
    "infinitive": "הוחל",
@ -14067,7 +14132,8 @@
        "pronoun": "הֵן",
        "tense": "עָתִיד"
      }
-    }
+    },
    "meaning": "to apply, to enforce, to put in force"
  },
  "הוקפא": {
    "infinitive": "הוקפא",
@ -14215,7 +14281,8 @@
        "pronoun": "הֵן",
        "tense": "עָתִיד"
      }
-    }
+    },
    "meaning": "to freeze (something)"
  },
  "הופנה": {
    "infinitive": "הופנה",
@ -14363,7 +14430,8 @@
        "pronoun": "הֵן",
        "tense": "עָתִיד"
      }
-    }
+    },
    "meaning": "to direct; to refer someone"
  },
  "להתקלח": {
    "infinitive": "להתקלח",
@ -14541,7 +14609,8 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to take a shower"
  },
  "להתגלות": {
    "infinitive": "להתגלות",
@ -14719,6 +14788,162 @@
        "pronoun": "",
        "tense": "מְקוֹר"
      }
-    }
+    },
    "meaning": "to be discovered, to appear"
  },
  "להיות": {
    "infinitive": "להיות",
    "slug": "454-lihyot",
    "root": "ה - י - ה",
    "binyan": "Pa'al",
    "is_passive": false,
    "reference_form": "לִהְיוֹת",
    "forms": {
      "past_1s": {
        "form": "הָיִיתִי",
        "audio_url": "https://audio.pealim.com/v0/bx/bxtedharx4kd.mp3",
        "pronoun": "אֲנִי",
        "tense": "עָבָר"
      },
      "past_1p": {
        "form": "הָיִינוּ",
        "audio_url": "https://audio.pealim.com/v0/bz/bztr7bt7yw8j.mp3",
        "pronoun": "אֲנַחְנוּ",
        "tense": "עָבָר"
      },
      "past_2ms": {
        "form": "הָיִיתָ",
        "audio_url": "https://audio.pealim.com/v0/1i/1imxfddysg8d8.mp3",
        "pronoun": "אַתָּה",
        "tense": "עָבָר"
      },
      "past_2fs": {
        "form": "הָיִית",
        "audio_url": "https://audio.pealim.com/v0/si/sizbwqsi2wej.mp3",
        "pronoun": "אַתְּ",
        "tense": "עָבָר"
      },
      "past_2mp": {
        "form": "הֱיִיתֶם",
        "audio_url": "https://audio.pealim.com/v0/31/31081nk4lvxj.mp3",
        "pronoun": "אַתֶּם",
        "tense": "עָבָר"
      },
      "past_2fp": {
        "form": "הֱיִיתֶן",
        "audio_url": "https://audio.pealim.com/v0/30/30zpav63u9ig.mp3",
        "pronoun": "אַתֶּן",
        "tense": "עָבָר"
      },
      "past_3ms": {
        "form": "הָיָה",
        "audio_url": "https://audio.pealim.com/v0/1h/1hxhgoyxra6fs.mp3",
        "pronoun": "הוּא",
        "tense": "עָבָר"
      },
      "past_3fs": {
        "form": "הָיְתָה",
        "audio_url": "https://audio.pealim.com/v0/17/17fb6fulu2da8.mp3",
        "pronoun": "הִיא",
        "tense": "עָבָר"
      },
      "past_3p": {
        "form": "הָיוּ",
        "audio_url": "https://audio.pealim.com/v0/1h/1hxhgf26s3ou9.mp3",
        "pronoun": "הֵם / הֵן",
        "tense": "עָבָר"
      },
      "future_1s": {
        "form": "אֶהְיֶה",
        "audio_url": "https://audio.pealim.com/v0/at/atd2i0kljhge.mp3",
        "pronoun": "אֲנִי",
        "tense": "עָתִיד"
      },
      "future_1p": {
        "form": "נִהְיֶה",
        "audio_url": "https://audio.pealim.com/v0/2a/2a41xa7h8jei.mp3",
        "pronoun": "אֲנַחְנוּ",
        "tense": "עָתִיד"
      },
      "future_2ms": {
        "form": "תִּהְיֶה",
        "audio_url": "https://audio.pealim.com/v0/g6/g6saa9abkllk.mp3",
        "pronoun": "אַתָּה",
        "tense": "עָתִיד"
      },
      "future_2fs": {
        "form": "תִּהְיִי",
        "audio_url": "https://audio.pealim.com/v0/g6/g6s9q8uugtnx.mp3",
        "pronoun": "אַתְּ",
        "tense": "עָתִיד"
      },
      "future_2mp": {
        "form": "תִּהְיוּ",
        "audio_url": "https://audio.pealim.com/v0/g6/g6sjf854r5a7.mp3",
        "pronoun": "אַתֶּם",
        "tense": "עָתִיד"
      },
      "future_2fp": {
        "form": "תִּהְיֶינָה",
        "audio_url": "https://audio.pealim.com/v0/12/12upso035jy8g.mp3",
        "pronoun": "אַתֶּן",
        "tense": "עָתִיד"
      },
      "future_3ms": {
        "form": "יִהְיֶה",
        "audio_url": "https://audio.pealim.com/v0/yy/yyo97spf6rob.mp3",
        "pronoun": "הוּא",
        "tense": "עָתִיד"
      },
      "future_3fs": {
        "form": "תִּהְיֶה",
        "audio_url": "https://audio.pealim.com/v0/g6/g6saa9abkllk.mp3",
        "pronoun": "הִיא",
        "tense": "עָתִיד"
      },
      "future_3mp": {
        "form": "יִהְיוּ",
        "audio_url": "https://audio.pealim.com/v0/yy/yyo02tum07zo.mp3",
        "pronoun": "הֵם",
        "tense": "עָתִיד"
      },
      "future_3fp": {
        "form": "תִּהְיֶינָה",
        "audio_url": "https://audio.pealim.com/v0/12/12upso035jy8g.mp3",
        "pronoun": "הֵן",
        "tense": "עָתִיד"
      },
      "imperative_ms": {
        "form": "הֱיֵה!‏",
        "audio_url": "https://audio.pealim.com/v0/1h/1hxjabs7uspli.mp3",
        "pronoun": "אַתָּה",
        "tense": "צִוּוּי"
      },
      "imperative_fs": {
        "form": "הֱיִי!‏",
        "audio_url": "https://audio.pealim.com/v0/1h/1hxjac2th43as.mp3",
        "pronoun": "אַתְּ",
        "tense": "צִוּוּי"
      },
      "imperative_mp": {
        "form": "הֱיוּ!‏",
        "audio_url": "https://audio.pealim.com/v0/1h/1hxja0tjuptcu.mp3",
        "pronoun": "אַתֶּם",
        "tense": "צִוּוּי"
      },
      "imperative_fp": {
        "form": "הֱיֶינָה!‏",
        "audio_url": "https://audio.pealim.com/v0/xe/xef6kg7mexvb.mp3",
        "pronoun": "אַתֶּן",
        "tense": "צִוּוּי"
      },
      "infinitive": {
        "form": "לִהְיוֹת",
        "audio_url": "https://audio.pealim.com/v0/1n/1nej50k4t35xi.mp3",
        "pronoun": "",
        "tense": "מְקוֹר"
      }
    },
    "meaning": "to be"
  }
 }
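Every hunk in `data/conjugations.json` above adds a `"meaning"` key next to `"forms"` in one verb entry. A small check along the following lines could confirm that no entry was missed. This is a sketch only: `entries_missing_meaning` is a hypothetical helper, not something in the repository, and the sample data is a cut-down stand-in for the real file.

```python
import json
from typing import Dict, List


def entries_missing_meaning(conjugations: Dict[str, dict]) -> List[str]:
    """Return the keys of entries whose "meaning" is absent, non-string, or blank."""
    missing = []
    for key, entry in conjugations.items():
        meaning = entry.get("meaning")
        if not isinstance(meaning, str) or not meaning.strip():
            missing.append(key)
    return missing


# Cut-down sample in the same shape as data/conjugations.json (illustrative only).
sample = json.loads(
    '{"להיות": {"infinitive": "להיות", "forms": {}, "meaning": "to be"},'
    ' "לדבר": {"infinitive": "לדבר", "forms": {}}}'
)
missing = entries_missing_meaning(sample)
```

Against the real file, the same function could be called on `json.load(open("data/conjugations.json", encoding="utf-8"))`, assuming the file keeps the top-level mapping of infinitive to entry shown in the diff.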
--- a/data/epub_sentence_index.json
+++ b/data/epub_sentence_index.json
--- a/data/examples_cache.json
+++ b/data/examples_cache.json
--- a/data/ktiv_male_forms.json
+++ b/data/ktiv_male_forms.json
--- a/data/legacy_guid_map.json
+++ b/data/legacy_guid_map.json
--- a/data/noun_plurals.json
+++ b/data/noun_plurals.json
--- a/data/noun_slug_map.json
+++ b/data/noun_slug_map.json
--- a/data/refined_meanings.json
+++ b/data/refined_meanings.json
@ -0,0 +1,442 @@
 {
  "שְׁלָל": "abundance; loot, plunder, spoils",
  "שֶׁפַע": "abundance, plenty, profusion",
  "נַר": "acquaintance (person one knows)",
  "הֶכֵּרוּת": "acquaintance (the state of knowing someone)",
  "כְּתֹבֶת": "address (postal/location)",
  "מַעַן": "address (formal, for the sake of; destination)",
  "שׁוּב": "again (once more, to repeat an action)",
  "שֵׁנִית": "again; a second time, secondly",
  "כְּנֶגֶד": "against; compared to, as opposed to",
  "מוּל": "opposite, facing; against",
  "נֶגֶד": "against; contrary to",
  "נֶכֶס": "asset, property (financial/material possession)",
  "קִנְיָן": "asset, property; possession, ownership (abstract or acquired)",
  "הִתְבּוֹלְלוּת": "assimilation (cultural/ethnic blending in)",
  "הִטַּמְּעוּת": "assimilation (absorption, integration into surroundings)",
  "כְּפִיפָה": "basket (woven, traditional/biblical)",
  "סַל": "basket (general, everyday)",
  "מַשְׁמִים": "boring, dreary (causing desolation/boredom)",
  "מְשַׁעְמֵם": "boring, tedious (causing boredom, common usage)",
  "מַשָּׂא": "burden, load (heavy cargo; figurative weight)",
  "נֵטֶל": "burden, load; ballast (dead weight)",
  "טָרוּד": "busy, preoccupied (mentally troubled/distracted)",
  "עָסוּק": "busy, occupied (engaged in an activity)",
  "מַמְתָּק": "candy, sweet (generic confection)",
  "סֻכָּרִיָּה": "candy, sweet (individual wrapped candy piece)",
  "מַרְבָד": "carpet, rug (literary/poetic); bedspread",
  "שָׁטִיחַ": "carpet, rug (standard, everyday word)",
  "כַּרְפַּס": "celery (also: the Passover seder vegetable)",
  "סֶלֶרִי": "celery (modern loanword, everyday usage)",
  "שַׁלְשֶׁלֶת": "chain (figurative: chain of events, lineage)",
  "שַׁרְשֶׁרֶת": "chain (physical chain, links)",
  "אָפְיָן": "characteristic (trait, attribute of a person/thing)",
  "סַמְמָן": "characteristic; indicator, hallmark",
  "שׁוֹקוֹלָד": "chocolate (the substance, mass noun, masc.)",
  "שׁוֹקוֹלָדָה": "chocolate (a piece of chocolate; hot chocolate, fem.)",
  "עִגּוּל": "circle (the shape); rounding",
  "מַעֲגָל": "circle (circular path, cycle, circuit)",
  "נִקּוּי": "cleaning (the act of cleaning, removing dirt)",
  "נִקָּיוֹן": "cleanliness, tidiness (state of being clean)",
  "בִּקּוּעַ": "cleaving, splitting (a single crack or fissure)",
  "הִתְבַּקְּעוּת": "cleaving, splitting (the process of cracking apart)",
  "בְּעִילָה": "coitus, sexual intercourse (legal/halachic term)",
  "מִשְׁגָּל": "coitus, sexual intercourse (formal/literary)",
  "מִדְרָשָׁה": "college (religious seminary, study institute)",
  "מִכְלָלָה": "college (academic institution, secular)",
  "תַּחֲרוּת": "competition, contest (an event or rivalry)",
  "הִתְחָרוּת": "competition (the act/process of competing)",
  "לְגַמְרֵי": "completely, totally (colloquial, very common)",
  "כָּלִיל": "completely, entirely (literary/formal); wholly",
  "רְכִיב": "component (technical part, element in a system)",
  "מַרְכִּיב": "component, ingredient (constituent that makes up a whole)",
  "תַּבְעֵרָה": "conflagration, fire (intense blaze, biblical/literary)",
  "דְּלֵקָה": "fire (accidental fire, house fire, everyday)",
  "צַרְכָנוּת": "consumerism; consumer advocacy",
  "צְרִיכָה": "consumption (using up resources, usage)",
  "קֵרוּר": "cooling, refrigeration (active process of making cold)",
  "הִתְקָרְרוּת": "cooling (becoming cold); catching a cold",
  "חָשׁוּךְ": "dark (of a place, lacking light; figuratively bleak)",
  "כֵּהֶה": "dark (of a color, shade; dim)",
  "אֲפֵלָה": "darkness (deep gloom; figurative despair)",
  "אֹפֶל": "darkness (poetic/literary, deep darkness)",
  "חֹשֶׁךְ": "darkness (general, common word)",
  "יַקִּיר": "darling, dear (masculine form)",
  "יַקִּירָה": "darling, dear (feminine form)",
  "מִרְמָה": "deceit, fraud (cunning deception, trickery)",
  "תַּרְמִית": "deceit, fraud (a specific act of swindling)",
  "אֲבַדּוֹן": "destruction (total ruin, perdition; the abyss)",
  "הֶרֶס": "destruction, demolition (physical wreckage)",
  "הֶבְדֵּל": "difference, distinction (between two things)",
  "שֹׁנִי": "difference (variance, otherness)",
  "הֵעָלְמוּת": "disappearance (the act of vanishing, going missing)",
  "הֶעֱלֵם": "disappearance (concealment, suppression of information)",
  "נְדָבָה": "donation (voluntary, charitable gift; tip)",
  "תְּרוּמָה": "donation, contribution (formal; also: religious offering)",
  "הִשְׁתַּעְבְּדוּת": "enslavement (the process of becoming enslaved)",
  "שִׁעְבּוּד": "enslavement, subjugation; mortgaging (finance)",
  "טָעוּת": "mistake, error (common, everyday blunder)",
  "שְׁגִיאָה": "error, mistake (formal, technical error)",
  "הִתְאַדּוּת": "evaporation (natural process of turning to vapor)",
  "הִתְאַיְּדוּת": "evaporation (process of dissipating, vaporizing)",
  "דֻּגְמָה": "example, sample (concrete instance or specimen)",
  "מָשָׁל": "example; parable, allegory, proverb",
  "גּוֹלָה": "exile, diaspora (the community in exile)",
  "גָּלוּת": "exile, diaspora (the state/condition of being exiled)",
  "חֲוָיָה": "experience (a lived event, an adventure)",
  "הִתְנַסּוּת": "experience (the process of trying/experimenting)",
  "נִסָּיוֹן": "experience (accumulated knowledge); attempt, trial",
  "בֵּאוּר": "explanation, elucidation (detailed clarification)",
  "הֶסְבֵּר": "explanation (the act of explaining, making understood)",
  "פָּנִים": "face (standard word); surface",
  "פַּרְצוּף": "face (appearance, facial expression; colloquial)",
  "מֶחְדָּל": "failure, omission (negligent failure to act)",
  "כִּשָּׁלוֹן": "failure (general: failed attempt or endeavor)",
  "כֶּשֶׁל": "failure, malfunction (technical breakdown)",
  "תַּעְנִית": "fast (religious fast day, formal term)",
  "צוֹם": "fast, fasting (the act of fasting, general)",
  "תְּחוּשָׁה": "feeling, sensation (physical or gut feeling)",
  "הַרְגָּשָׁה": "feeling (emotional sense; well-being)",
  "רֶגֶשׁ": "feeling, emotion (inner emotional state)",
  "לֶהָבָה": "flame (common word for a flame)",
  "שַׁלְהֶבֶת": "flame (poetic/literary, blazing flame)",
  "כָּפִיף": "flexible, pliable (can be bent physically)",
  "מָתִיחַ": "flexible, elastic (stretchy, resilient)",
  "זֶרֶם": "flow, current (of water, electricity, or ideas)",
  "זְרִימָה": "flow, flowing (the act/process of flowing)",
  "אֹכֶל": "food (general, everyday word for food/meal)",
  "מַאֲכָל": "food (a specific dish, a prepared food item)",
  "מָזוֹן": "food, nourishment (sustenance, nutrition)",
  "חֹפֶשׁ": "freedom; vacation, time off (colloquial)",
  "חֵרוּת": "freedom, liberty (formal, political/ideological)",
  "הַקְפָּאָה": "freezing (active act of freezing something; a freeze/suspension)",
  "קִפָּאוֹן": "freezing; standstill, stagnation (frozen state)",
  "תְּדִירוּת": "frequency (how often something occurs)",
  "תֶּדֶר": "frequency (radio/physics frequency)",
  "תָּדִיר": "frequent, regular (happening at steady intervals)",
  "תָּכוּף": "frequent, rapid (happening in quick succession)",
  "גָּאוֹן": "genius (title of greatness; rabbinical title Gaon)",
  "עִלּוּי": "genius, prodigy (exceptionally gifted person)",
  "תְּשׁוּרָה": "gift, present (formal/literary offering)",
  "שַׁי": "gift, present (a token gift, small present)",
  "אַכְלָן": "glutton (big eater, food-lover, common)",
  "רְעַבְתָּן": "glutton (insatiably hungry person)",
  "מֶמְשֶׁלֶת": "government (construct state form, used in compounds)",
  "מֶמְשָׁלָה": "government (standard form)",
  "מֶמְשַׁלְתִּי": "governmental (relating to the government/cabinet)",
  "שִׁלְטוֹנִי": "governmental (relating to ruling authority/regime)",
  "חֹפֶן": "handful (cupped palm, a scooped amount)",
  "קֹמֶץ": "handful (a pinch, a small quantity)",
  "יָד": "handle (of a tool, door); hand",
  "יָדִית": "handle (a knob or grip, specifically a handle)",
  "כָּאן": "here (standard, common usage)",
  "פֹּה": "here (colloquial/informal variant)",
  "טָמוּן": "hidden (buried, latent, lying within)",
  "נִסְתָּר": "hidden, concealed (secret, mysterious; grammar: 3rd person)",
  "מֻצְנָע": "hidden, concealed (modestly tucked away, discreet)",
  "תְּמוּנָה": "image, picture (photo, illustration, scene)",
  "צֶלֶם": "image (likeness, form); idol",
  "הִתְרַשְּׁמוּת": "impression (the experience of being impressed)",
  "רֹשֶׁם": "impression (a mark left; an effect on someone)",
  "בִּפְנִים": "inside (location: on the inside, indoors)",
  "פְּנִימָה": "inside (direction: inward, toward the inside)",
  "עֶלְבּוֹן": "insult, offence (the slight or affront itself)",
  "הַעֲלָבָה": "insult (the act of insulting someone)",
  "פְּנִים": "interior, inside (inner part, inner side)",
  "קֶרֶב": "interior; innards, midst (among, in the thick of)",
  "תָּוֶךְ": "interior, inside; center, middle; essence",
  "תַּחְקִיר": "investigation (journalistic/official inquiry)",
  "חֲקִירָה": "investigation, inquiry (police/legal; research)",
  "רִנָּה": "joy; joyful song, singing (literary)",
  "מָשׂוֹשׂ": "joy, delight (source of joy, literary)",
  "גִּיל": "joy, elation (exuberant happiness; age)",
  "שִׂמְחָה": "joy, happiness (celebration, festive occasion)",
  "עֶלְצוֹן": "jubilance, exultation (archaic, the feeling)",
  "עֶלְצָה": "jubilance, exultation (archaic, feminine noun form)",
  "עָצֵל": "lazy, idle (basic adjective form)",
  "עַצְלָן": "lazy, lazybones (characteristically lazy person)",
  "תְּחִקָּה": "legislation (a specific statute or enacted law)",
  "חֲקִיקָה": "legislation (the process/act of legislating)",
  "הִתְהוֹלְלוּת": "licentiousness, revelry (wild raucous behavior)",
  "הוֹלֵלוּת": "licentiousness, debauchery (moral depravity)",
  "שׁוֹשָׁן": "lily (the flower, masculine; also: the name Shoshan)",
  "שׁוֹשַׁנָּה": "lily; rose (archaic); the name Shoshana",
  "הִמָּצְאוּת": "location; presence (being found/situated somewhere)",
  "מִקּוּם": "location, positioning (placing in a specific spot)",
  "נַעֲלֶה": "lofty, exalted (elevated, superior in quality)",
  "נִשְׂגָּב": "lofty, exalted (sublime, beyond reach, grand)",
  "תַּאֲוָה": "lust, craving (appetite, physical desire)",
  "תְּשׁוּקָה": "passion, desire (deep longing, yearning)",
  "אַחְזָקָה": "maintenance; holding (corporate; upkeep of property)",
  "תַּחְזוּקָה": "maintenance (technical upkeep of systems/equipment)",
  "תִּחְזוּק": "maintenance (the process/act of maintaining)",
  "מִנְהָל": "administration, management (the office/system)",
  "נִהוּל": "management (the act/process of managing)",
  "הַנְהָלָה": "management (the managing body, executive board)",
  "פֵּרוּשׁ": "meaning; interpretation, commentary",
  "מַשְׁמָעוּת": "meaning, significance (broader importance)",
  "מַשְׁמָע": "meaning, implication (what is implied)",
  "לַחַן": "melody, tune (a musical composition)",
  "נִגּוּן": "melody, tune (a chant; Hasidic wordless melody)",
  "נְעִימָה": "melody, tune; tone, intonation (of voice)",
  "נֵס": "miracle (divine intervention; common word)",
  "פֶּלֶא": "wonder, marvel (something astonishing)",
  "תְּזוּזָה": "movement (a budge, slight motion, shift)",
  "תְּנוּעָה": "movement (broad: traffic; organization; vowel mark)",
  "מִסְתּוֹרִין": "mystery (enigma, something hidden/secret)",
  "תַּעֲלוּמָה": "mystery (unsolved puzzle, unknown secret)",
  "עֵירֹם": "naked (completely nude, formal)",
  "עָרֹם": "naked (nude; also: shrewd, cunning in biblical Hebrew)",
  "אֻמָּה": "nation (a unified political/cultural entity)",
  "לְאֹם": "nation, people (ethnic group; literary/formal)",
  "זִלְזוּל": "negligence; contempt, disrespect (dismissive attitude)",
  "הִתְרַשְּׁלוּת": "negligence (carelessness, failure to take proper care)",
  "נֵיטְרָלִי": "neutral (politically/scientifically neutral, loanword)",
  "סְתָמִי": "neutral; vague, nondescript, generic",
  "אֲצֻלָּה": "nobility, aristocracy (the aristocratic class)",
  "אֲצִילוּת": "nobility (the quality of being noble, refinement)",
  "הִסְתַּכְּלוּת": "observation (looking, watching, contemplation)",
  "תַּצְפִּית": "observation (military/scientific lookout; observation post)",
  "מִכְשׁוֹל": "obstacle, stumbling block (impediment to progress)",
  "נֶגֶף": "obstacle; plague, affliction (biblical)",
  "עַל": "on, upon; about, regarding",
  "עַל גַּב": "on, upon (on the back/surface of)",
  "עַל גַּבֵּי": "on, upon (on top of, on the surface of)",
  "פְּקֻדָּה": "order, command (military/authoritative directive)",
  "צַו": "order, decree (legal injunction, official order)",
  "בָּחוּץ": "outside (location: on the outside, outdoors)",
  "הַחוּצָה": "outside (direction: outward, to the outside)",
  "מַאֲרָז": "package (a packed container, packaging)",
  "חֲבִילָה": "package, parcel (a bundle, a wrapped item)",
  "מְחִילָה": "pardon, forgiveness (personal, between individuals)",
  "סְלִיחָה": "pardon, forgiveness (also: excuse me; liturgical pardon)",
  "סַיֶּרֶת": "patrol (elite military unit, commando squad)",
  "סִיּוּר": "patrol; tour (a round of inspection or sightseeing)",
  "שָׂכָר": "payment; salary, wage (earned compensation)",
  "תַּשְׁלוּם": "payment (a single payment/installment; compensation)",
  "עֲצוּמָה": "petition (public petition with signatures)",
  "עֲתִירָה": "petition (legal petition, court appeal)",
  "דַּלּוּת": "poverty; meagerness, paucity (scarcity of quality/quantity)",
  "עֹנִי": "poverty (destitution, financial hardship)",
  "עָצְמָתִי": "powerful (having great inherent power)",
  "רַב עָצְמָה": "powerful (of great might, formidable)",
  "הַאֲמָרָה": "price increase (deliberate raising of prices)",
  "הִתְיַקְּרוּת": "price increase (becoming more expensive, rising costs)",
  "קִדְמָה": "progress (general/societal advancement, modernity)",
  "הִתְקַדְּמוּת": "progress (the process of advancing, making headway)",
  "הַסְבָּרָה": "propaganda; public diplomacy (Israeli hasbara)",
  "תַּעֲמוּלָה": "propaganda (political propaganda, agitation)",
  "סְמִיכוּת": "proximity; construct state (grammar term)",
  "קִרְבָה": "proximity; kinship, closeness (relational nearness)",
  "תְּהִלּוֹת": "Psalms (variant plural form)",
  "תְּהִלִּים": "Psalms (standard name for the Book of Psalms)",
  "קְנִיָּה": "purchase (a buy, an act of buying, everyday)",
  "רְכִישָׁה": "acquisition (formal purchase, procurement)",
  "בִּזְרִיזוּת": "quickly, nimbly (with agile efficiency)",
  "בִּמְהִירוּת": "quickly, at high speed (with velocity)",
  "רִיצָה": "running (the activity of running)",
  "מְרוּצָה": "race (a competitive running event)",
  "גְּאֻלָּה": "redemption (national/messianic deliverance)",
  "פְּדוּת": "redemption (ransoming, being redeemed; literary)",
  "הוֹצָאָה": "removal; expense, expenditure; publishing house",
  "הַסָּחָה": "removal; deflection, diversion, distraction",
  "יִצּוּג": "representation (acting on behalf of; depiction)",
  "נְצִיגוּת": "representation (the body of representatives, delegation)",
  "מְכִירָה": "sale (the act of selling, a transaction)",
  "מֶכֶר": "sale; merchandise, value (literary/biblical)",
  "יֶשַׁע": "salvation, deliverance (divine rescue, literary)",
  "תְּשׁוּעָה": "salvation, victory (triumphant rescue, literary)",
  "הַפְרָדָה": "separation (active act of separating things/people)",
  "הִפָּרְדוּת": "separation (the process of parting ways)",
  "חַד": "sharp (of edges, blades; clear-cut)",
  "חָרִיף": "sharp, acute; spicy, pungent; keen, witty",
  "חָסוּת": "shelter, patronage (protection under authority)",
  "מִקְלָט": "shelter, refuge (bomb shelter, safe haven, physical place)",
  "חֻלְצָה": "shirt, blouse (modern everyday word)",
  "כֻּתֹּנֶת": "shirt; tunic, gown (biblical/traditional garment)",
  "שֶׁקֶט": "silence, quiet (peaceful calm, serenity)",
  "שְׁתִיקָה": "silence (the act of keeping silent, not speaking)",
  "חֶטְא": "sin (a specific transgression, missing the mark)",
  "עָווֹן": "sin, iniquity (moral guilt; legal: misdemeanor)",
  "זִמְרָה": "singing (musical performance, song/hymn)",
  "רְנָנָה": "singing; joyful song, jubilant cry (literary)",
  "נָטוּי": "slanted, inclined (tilted, leaning; grammar: inflected)",
  "מְשֻׁפָּע": "slanted, inclined; having an abundance of something",
  "כִּשּׁוּף": "sorcery, witchcraft (dark magic, spellcasting)",
  "קֶסֶם": "magic, charm (enchantment, allure)",
  "נֶפֶשׁ": "soul (life force, self, being; appetite)",
  "נְשָׁמָה": "soul (divine breath of life, spiritual essence)",
  "מַצָּת": "spark plug (automotive ignition component)",
  "פְּלָג": "spark plug (variant/slang term)",
  "דּוֹבֵר": "speaker, spokesman (masculine form)",
  "דּוֹבֶרֶת": "speaker, spokeswoman (feminine form)",
  "סוּפָה": "storm, tempest (violent windstorm)",
  "סְעָרָה": "storm, tempest (raging storm; figurative turmoil)",
  "קַשׁ": "straw (dry stalks; figuratively: trivial thing)",
  "תֶּבֶן": "straw, hay (animal feed, dried grass)",
  "עִקֵּשׁ": "stubborn, obstinate (perversely rigid)",
  "עַקְשָׁן": "stubborn, obstinate (characteristically persistent/stubborn person)",
  "חָנִיךְ": "student, pupil (trainee, apprentice, cadet)",
  "תַּלְמִיד": "student, pupil (school student, common word)",
  "פִּקּוּחַ": "supervision (regulatory oversight, monitoring)",
  "הַשְׁגָּחָה": "supervision (watchful care, divine providence; kosher certification)",
  "הַסְפָּקָה": "supply, provision (the act of supplying goods)",
  "אַסְפָּקָה": "supply, provision (military/logistical provisioning)",
  "אֲרָעִי": "temporary, provisional (makeshift, not permanent)",
  "זְמַנִּי": "temporary, time-limited (for a limited period)",
  "אֵלֶה": "these (standard demonstrative pronoun)",
  "אֵלוּ": "these (literary/Mishnaic variant)",
  "בֹּהֶן": "thumb; big toe (anatomical term)",
  "אֲגוּדָל": "thumb (common/colloquial word for thumb)",
  "זְמַן": "time (general, measurable time; tense in grammar)",
  "עֵת": "time (a specific moment, epoch, literary/biblical)",
  "עִתּוּי": "timing (choosing the right moment)",
  "תִּזְמוּן": "timing (synchronization, technical scheduling)",
  "לְכַתֵּב": "to address (write an address on); to engrave",
  "לְמַעֵן": "to address (direct/target communication toward)",
  "לְזַיֵּן": "to arm (equip with weapons; vulgar slang)",
  "לְחַמֵּשׁ": "to arm (equip/furnish with armaments)",
  "לְהִתְאַסֵּף": "to assemble, to gather together (of people collecting)",
  "לְהִתְכַּנֵּס": "to assemble, to convene (a formal meeting/conference)",
  "לְהִכָּבֵל": "to be bound (chained, shackled with chains)",
  "לְהִכָּפֵת": "to be bound (handcuffed, tied up physically)",
  "לְהִבָּרֵא": "to be created (divine/fundamental creation, ex nihilo)",
  "לְהִוָּצֵר": "to be created (formed, shaped, manufactured)",
  "לְהִגָּזֵז": "to be cut off (sheared, trimmed, as hair/wool)",
  "לְהִגָּזֵר": "to be cut off (decreed, sentenced; derived from)",
  "לְהִקָּטֵעַ": "to be cut off (interrupted, severed abruptly)",
  "לְהִנָּגֵף": "to be defeated (struck down, plagued; biblical)",
  "לְהֵרָעֵץ": "to be defeated (crushed, shattered; literary)",
  "לְהֵהָרֵס": "to be destroyed (demolished, wrecked; slang: exhausted)",
  "לְהֵחָרֵב": "to be destroyed (laid waste, devastated; of cities/temples)",
  "לְהִסָּתֵר": "to be hidden; to hide oneself (take cover)",
  "לְהִצָּפֵן": "to be hidden (encoded, concealed from view)",
  "לְהִנָּטֵעַ": "to be planted (of trees/plants, set in soil)",
  "לְהִשָּׁתֵל": "to be planted (implanted, transplanted; of an organ or undercover agent)",
  "לָדֹם": "to be silent (to become utterly still; literary)",
  "לִשְׁתֹּק": "to be silent (to stop talking, keep quiet; common)",
  "לְהִתְקַמֵּץ": "to be stingy (to pinch pennies, scrimp)",
  "לְהִתְקַמְצֵן": "to be stingy (to act like a miser, be miserly)",
  "לְהִבָּדֵק": "to be tested, checked (verified, inspected)",
  "לְהִבָּחֵן": "to be tested, examined (undergo a formal exam/evaluation)",
  "נִהְיָה": "to become (turn into, come to be; common)",
  "לְהֵעָשׂוֹת": "to become; to be made, to be done, to be carried out",
  "לְהִתְבַּהֵר": "to become clear (clarified, understood)",
  "לְהִצְטַלֵּל": "to become clear (of liquid becoming transparent/limpid)",
  "לְכוֹפֵף": "to bend (flex, bow down, curve something)",
  "לְקַמֵּר": "to bend, to vault (arch over, create a dome shape)",
  "לְקַשֵּׁת": "to bend, to curve (form into a bow/arc shape)",
  "לְפַחֵם": "to blacken (carbonize, char with coal/charcoal)",
  "לְפַיֵּחַ": "to blacken (cover with soot, smoke residue)",
  "לְמַצְמֵץ": "to blink (rapidly open and close one's eyes)",
  "לְעַפְעֵף": "to blink (flutter one's eyelids)",
  "לִנְפֹּחַ": "to blow (puff up, inflate; blow air)",
  "לִנְשֹׁף": "to blow, to exhale; to play a wind instrument",
  "לְצַיֵּץ": "to chirp, to tweet (of birds; to post on social media)",
  "לְצַפְצֵף": "to chirp, to whistle (shrill piping sound; to not care — slang)",
  "לְחַבֵּר": "to connect, to join (attach together; to compose/write)",
  "לְקַשֵּׁר": "to connect, to link (establish a relationship/connection)",
  "לְהָסִיחַ": "to converse (engage in casual talk; to divert attention)",
  "לְהָשִׂיחַ": "to converse, to talk (literary; to speak with)",
  "לְסַלְסֵל": "to curl (hair); to trill (music)",
  "לְתַלְתֵּל": "to curl (hair into ringlets/curls)",
  "לְיַפּוֹת": "to beautify, to embellish (make more attractive)",
  "לְפַרְכֵּס": "to embellish; to squirm, to flounder",
  "לִדְרֹשׁ": "to demand; to inquire, to preach (seek/expound)",
  "לִתְבֹּעַ": "to demand; to sue, to claim (legal demand)",
  "לְהֵישִׁיר": "to direct; to straighten, to look straight at",
  "לְהַפְנוֹת": "to direct; to refer someone (redirect attention/person)",
  "לְהַגְזִים": "to exaggerate (overstate, blow out of proportion; common)",
  "לְהַפְרִיז": "to exaggerate (go to extremes, overdo; formal)",
  "לְהִמּוֹג": "to fade, to dissolve (melt away, lose form; literary)",
  "לְהִנָּדֵף": "to fade, to dissipate (blown away, scattered by wind)",
  "לִפֹּל": "to fall (general: fall down, collapse; common word)",
  "לִנְשֹׁר": "to fall, to drop (shed: leaves, hair; drop out of school)",
  "לְכַלּוֹת": "to finish (consume entirely, exhaust; to annihilate)",
  "לְסַיֵּם": "to finish, to complete (conclude, bring to an end; common)",
  "לִנְהֹר": "to flow (stream toward); to shine, to glow",
  "לִשְׁתֹּת": "to flow (pour forth, stream out; literary)",
  "לִמְחֹל": "to forgive (pardon on a personal level, waive a claim)",
  "לִסְלֹחַ": "to forgive, to pardon (general, standard word for forgiving)",
  "לְהַחְבִּיא": "to hide, to conceal (physically stash away; common)",
  "לְהַעֲלִים": "to hide, to conceal (suppress information; to evade)",
  "לִדְלֹף": "to leak (of a pipe, roof; seep through)",
  "לִנְזֹל": "to drip, to trickle (flow in drops, ooze)",
  "לִזְנֹחַ": "to abandon, to neglect (forsake, discard)",
  "לַעֲזֹב": "to leave, to abandon (depart from; give up; common word)",
  "לְהַנִּיחַ": "to place, to put (set down carefully); to assume",
  "לְהָשִׂים": "to place, to put (set/assign); to turn into something",
  "לְפָאֵר": "to glorify, to adorn (extol with grandeur)",
  "לְשַׁבֵּחַ": "to praise, to commend (express approval; common)",
  "לִדְחֹף": "to push, to shove (physically push forward; common)",
  "לִדְחֹק": "to push, to press (squeeze, crowd; urge insistently)",
  "לְהַבְרִיא": "to recover (regain health, get well; common)",
  "לְהַחְלִים": "to recover, to convalesce (heal fully from illness; formal)",
  "לַעֲלֹץ": "to rejoice, to exult (leap with joy; literary)",
  "לָשׂוּשׂ": "to rejoice (be glad, delight in; biblical/literary)",
  "לְהוֹשִׁיעַ": "to rescue, to save (deliver from danger; biblical/literary)",
  "לְהַצִּיל": "to rescue, to save (common, everyday word)",
  "לְחַכֵּךְ": "to rub (scratch an itch, abrade gently)",
  "לְשַׁפְשֵׁף": "to rub (scrub, polish by rubbing repeatedly)",
  "לִסְרֹט": "to scratch (scrape with a sharp object; to make a video/film)",
  "לִשְׂרֹט": "to scratch (draw a line, score a surface)",
  "לִנְגֹּהַּ": "to shine (glow with bright light; literary)",
  "לִקְרֹן": "to shine, to beam (radiate light, as from horns of light)",
  "לְהַחֲרִישׁ": "to silence; to be silent (choose not to respond; literary)",
  "לְהַשְׁתִּיק": "to silence (make someone/something stop making noise; common)",
  "לִטְבֹּחַ": "to slaughter (massacre, butcher violently)",
  "לִשְׁחֹט": "to slaughter (ritually slaughter an animal; shecht)",
  "לְהִתְמַחוֹת": "to specialize (become an expert in a field)",
  "לְהִתְמַקְצֵעַ": "to specialize (become a professional, gain proficiency)",
  "לְבַקֵּעַ": "to split, to cleave (crack open forcefully)",
  "לְבַתֵּק": "to split, to cleave; to pierce (cut through)",
  "לִמְרֹחַ": "to spread (smear, apply a spread on surface)",
  "לִשְׁטֹחַ": "to spread (lay out flat, unfurl); to present, explicate",
  "לְאַשֵּׁשׁ": "to strengthen, to establish (shore up, substantiate)",
  "לְחַזֵּק": "to strengthen (make stronger, reinforce; common word)",
  "לְהִתְיַסֵּר": "to suffer (be tormented, endure agony)",
  "לְהִתְעַנּוֹת": "to suffer; to fast (endure hardship/deprivation; literary)",
  "לִידוֹת": "to throw, to hurl (cast, fling; biblical)",
  "לִרְמוֹת": "to throw, to hurl (toss; biblical)",
  "לִגְזֹז": "to trim (shear wool/hair, clip close)",
  "לִגְזֹם": "to trim (prune branches/bushes, cut back vegetation)",
  "לְאַדּוֹת": "to vaporize (steam, evaporate); to simmer, to poach (cooking)",
  "לְאַיֵּד": "to vaporize, to evaporate (cause to turn into vapor)",
  "לֶאֱרֹג": "to weave (on a loom, produce fabric; common word)",
  "לִשְׁזֹר": "to weave (intertwine, braid, thread together)",
  "בְּיַחַד": "together (as a group, common usage with 'be-')",
  "יַחַד": "together (jointly, in unison; literary)",
  "יַחְדָּו": "together (jointly; biblical/poetic variant)",
  "מִסְחָר": "trade, commerce (the business/sector of trading)",
  "סַחַר": "trade, commerce (goods traded, merchandise; literary)",
  "אֱמֶת": "truth (common word for truth, verity)",
  "אֲמִתָּה": "truth; axiom (fundamental truth, literary)",
  "מִצְנֶפֶת": "turban (formal headdress, priestly turban)",
  "צָנִיף": "turban, head wrap (wrapped head covering)",
  "אַחְדוּת": "unity (state of being united, solidarity)",
  "אִחוּד": "unification (the act of uniting, merging)",
  "בִּקְעָה": "valley (broad, flat valley plain)",
  "עֵמֶק": "valley (deep valley between mountains/hills)",
  "אִשְׁרָה": "visa; approval (entry permit; formal approval)",
  "וִיזָה": "visa (travel visa, loanword)",
  "כֹּתֶל": "wall (the Western Wall; a freestanding stone wall)",
  "קִיר": "wall (common word for wall of a room/building)",
  "אַזְהָרָה": "warning (a caution, alert; legal/safety warning)",
  "הַזְהָרָה": "warning (the act of warning someone; admonition)",
  "רַהַט": "water trough (channel, gutter for water flow)",
  "שֹׁקֶת": "water trough (feeding/drinking trough for animals)",
  "אִלּוּלֵי": "were it not for (standard conditional; common)",
  "לוּלֵא": "were it not for (literary/Talmudic variant)",
  "אוֹפַןּ": "wheel (a single wheel; biblical/poetic)",
  "גַּלְגַּל": "wheel (rolling wheel; cycle, pulley)",
  "אַיֵּה": "where? (literary/biblical: where is?)",
  "הֵיכָן": "where? (standard literary form of 'where')",
  "לֹבֶן": "whiteness (white of the eye; white color)",
  "צְחוֹר": "whiteness; purity (brilliant white, radiance)",
  "עוֹלָם": "world (the world, universe; eternity; common word)",
  "תֵּבֵל": "world, universe (the inhabited world; poetic/literary)",
  "פֶּצַע": "wound (a specific cut, gash, open wound)",
  "פְּצִיעָה": "wound, injury (the event/act of being wounded)",
  "כִּסּוּפִים": "yearning, longing (wistful craving, literary; plural)",
  "עֶרְגָּה": "yearning, longing (deep nostalgic longing, literary)"
 }
--- a/data/vetted_sentences.json
+++ b/data/vetted_sentences.json
--- a/data/vocab_sentence_matches.json
+++ b/data/vocab_sentence_matches.json
--- a/epub_examples.py
+++ b/epub_examples.py
@ -0,0 +1,446 @@
 #!/usr/bin/env python3
 """
 Extract example sentences from nikud'd Hebrew EPUBs (and PDFs where possible),
 match them against the vocab list, and produce examples_cache.json.
 Usage:
    python3 epub_examples.py
 Outputs:
    data/epub_sentence_index.json  — full sentence corpus
    data/examples_cache.json       — best sentence(s) per vocab word
 """
 import csv
 import json
 import os
 import re
 import zipfile
 from html.parser import HTMLParser
 from pathlib import Path
 from helpers import strip_nikkud
 DATA_DIR = Path(__file__).parent / "data"
 EPUB_DIR = DATA_DIR / "epubs"
 DICT_CSV = DATA_DIR / "hebrew_dict_for_anki.csv"
 # Book metadata: filename -> display name
 EPUB_BOOKS = {
    "little_prince.epub": "הנסיך הקטן",
    "time_tunnel_82.epub": "מנהרת הזמן 82",
 }
 # PDF books are excluded — pypdf produces garbled RTL text (reversed chars within
 # words). If/when a proper EPUB version becomes available on Calibre, add it to
 # EPUB_BOOKS above instead.
 PDF_BOOKS: dict[str, str] = {}
 # Sentence length bounds (word count)
 MIN_WORDS = 4
 MAX_WORDS = 15
 # ── HTML text extraction ─────────────────────────────────────────
 class _TextExtractor(HTMLParser):
    """Extract text content from HTML, skipping script/style tags."""
    SKIP_TAGS = {"script", "style", "head"}
    def __init__(self):
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0
    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        # Insert a newline for block-level elements to avoid word concatenation
        if tag in (
            "p",
            "div",
            "br",
            "li",
            "h1",
            "h2",
            "h3",
            "h4",
            "h5",
            "h6",
            "td",
            "th",
            "tr",
            "blockquote",
            "section",
        ):
            self.parts.append("\n")
    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)
    def get_text(self) -> str:
        return "".join(self.parts)
 def extract_text_from_html(html: str) -> str:
    """Parse HTML and return plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    return parser.get_text()
 # ── EPUB processing ──────────────────────────────────────────────
 def _content_files_from_epub(zf: zipfile.ZipFile) -> list[str]:
    """Get ordered list of content XHTML files from the OPF manifest."""
    # Find the OPF file
    opf_path = None
    for name in zf.namelist():
        if name.endswith(".opf"):
            opf_path = name
            break
    if not opf_path:
        # Fallback: just use all xhtml files
        return sorted(
            n
            for n in zf.namelist()
            if n.endswith((".xhtml", ".html"))
            and "toc" not in n.lower()
            and "cover" not in n.lower()
            and "nav" not in n.lower()
        )
    # Parse OPF to get spine order
    opf_content = zf.read(opf_path).decode("utf-8")
    opf_dir = os.path.dirname(opf_path)
    # Extract manifest items: id -> href
    manifest = {}
    for m in re.finditer(r'<item\s+[^>]*id="([^"]+)"[^>]*href="([^"]+)"', opf_content):
        manifest[m.group(1)] = m.group(2)
    # Also try reversed attribute order
    for m in re.finditer(r'<item\s+[^>]*href="([^"]+)"[^>]*id="([^"]+)"', opf_content):
        manifest[m.group(2)] = m.group(1)
    # Extract spine order
    spine_ids = re.findall(r'<itemref\s+[^>]*idref="([^"]+)"', opf_content)
    result = []
    for sid in spine_ids:
        href = manifest.get(sid, "")
        if href and href.endswith((".xhtml", ".html")):
            full_path = os.path.join(opf_dir, href) if opf_dir else href
            # Normalize path separators
            full_path = full_path.replace("\\", "/")
            if full_path in zf.namelist():
                result.append(full_path)
    if not result:
        # Fallback
        return sorted(
            n
            for n in zf.namelist()
            if n.endswith((".xhtml", ".html")) and "toc" not in n.lower() and "cover" not in n.lower()
        )
    return result
 def extract_sentences_from_epub(epub_path: Path, book_name: str) -> list[dict]:
    """Extract sentences from an EPUB file.
    Returns list of {"text": str, "book": str, "stripped": str}
    """
    # Use a context manager so the archive handle is closed after reading
    with zipfile.ZipFile(epub_path) as zf:
        content_files = _content_files_from_epub(zf)
        all_text = []
        for cf in content_files:
            try:
                html = zf.read(cf).decode("utf-8")
            except (KeyError, UnicodeDecodeError):
                continue
            text = extract_text_from_html(html)
            all_text.append(text)
    full_text = "\n".join(all_text)
    return _split_into_sentences(full_text, book_name)
 # ── PDF processing ───────────────────────────────────────────────
 def extract_sentences_from_pdf(pdf_path: Path, book_name: str) -> list[dict]:
    """Extract sentences from a PDF file (best-effort, handles RTL reversal)."""
    try:
        import pypdf
    except ImportError:
        print(f"  [SKIP] pypdf not installed, cannot process {pdf_path.name}")
        return []
    reader = pypdf.PdfReader(pdf_path)
    all_text_parts = []
    for page in reader.pages:
        raw = page.extract_text()
        if not raw:
            continue
        # pypdf often reverses word order for RTL text; fix it
        fixed_lines = []
        for line in raw.split("\n"):
            words = line.split()
            # Check if this line is predominantly Hebrew
            hebrew_chars = sum(1 for c in line if "\u0590" <= c <= "\u05ff")
            if hebrew_chars > len(line) * 0.3 and len(words) > 1:
                # Reverse word order
                fixed_lines.append(" ".join(reversed(words)))
            else:
                fixed_lines.append(line)
        all_text_parts.append("\n".join(fixed_lines))
    full_text = "\n".join(all_text_parts)
    return _split_into_sentences(full_text, book_name)
 # ── Sentence splitting ───────────────────────────────────────────
 # Hebrew sentence terminators: period, exclamation, question mark, sof pasuk
 _SENT_SPLIT = re.compile(r"[.!?\u05C3]+")
 # Punctuation to strip from word boundaries when matching
 _PUNCT = re.compile(
    r'^[\u0022\u0027\u05F4\u05F3,;:\-–—…\u201C\u201D\u201E\u201F\u2018\u2019()\[\]{}«»"\']+|[\u0022\u0027\u05F4\u05F3,;:\-–—…\u201C\u201D\u201E\u201F\u2018\u2019()\[\]{}«»"\']+$'
 )
 def _split_into_sentences(text: str, book_name: str) -> list[dict]:
    """Split text into sentences and filter by length."""
    # Normalize whitespace
    text = re.sub(r"\s+", " ", text).strip()
    raw_sentences = _SENT_SPLIT.split(text)
    results = []
    seen = set()
    for sent in raw_sentences:
        sent = sent.strip()
        if not sent:
            continue
        # Count Hebrew words (skip non-Hebrew tokens like numbers)
        words = sent.split()
        hebrew_words = [w for w in words if any("\u0590" <= c <= "\u05ff" for c in w)]
        if len(hebrew_words) < MIN_WORDS or len(hebrew_words) > MAX_WORDS:
            continue
        # Skip duplicates
        stripped = strip_nikkud(sent)
        if stripped in seen:
            continue
        seen.add(stripped)
        results.append(
            {
                "text": sent,
                "book": book_name,
                "stripped": stripped,
            }
        )
    return results
 # ── Vocab loading ────────────────────────────────────────────────
 def load_vocab(csv_path: Path) -> dict[str, list[str]]:
    """Load vocab CSV and return a {stripped_form: [nikkud_words]} mapping.
    Each word is indexed under both its nikkud-stripped "Word" value and its
    "Word Without Nikkud" column, so either spelling can be looked up.
    """
    words_by_stripped: dict[str, list[str]] = {}  # stripped -> [nikkud words]
    with open(csv_path, encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter=";")
        for row in reader:
            nikkud_word = row.get("Word", "").strip()
            word_no_nik = row.get("Word Without Nikkud", "").strip()
            if not nikkud_word:
                continue
            # Method 1: strip nikkud from the Word column
            stripped_from_nikkud = strip_nikkud(nikkud_word)
            # Add both forms for matching
            for form in {stripped_from_nikkud, word_no_nik}:
                if form:
                    words_by_stripped.setdefault(form, []).append(nikkud_word)
    return words_by_stripped
 # ── Matching ─────────────────────────────────────────────────────
 def match_sentences(sentences: list[dict], words_by_stripped: dict) -> dict:
    """Match sentences against vocab words.
    Returns {nikkud_word: [sentences]} with best (shortest) first.
    """
    # Build a set of all stripped forms for fast lookup
    all_forms = set(words_by_stripped.keys())
    # Hebrew single-letter prefixes: ב, ה, ו, כ, ל, מ, ש, ד (של)
    _HEB_PREFIXES = set("בהוכלמשד")
    # For each sentence, extract stripped words
    matches: dict[str, list[tuple[int, str]]] = {}  # nikkud_word -> [(word_count, sentence)]
    for sent_info in sentences:
        sent_text = sent_info["text"]
        sent_stripped = sent_info["stripped"]
        word_count = len(sent_text.split())
        # Get stripped words from the sentence
        raw_words = sent_stripped.split()
        # Map: candidate form -> the cleaned sentence word that produced it
        # (direct match or after prefix stripping; the last word seen wins)
        candidates: dict[str, str] = {}  # form -> original_word
        for w in raw_words:
            cleaned = _PUNCT.sub("", w)
            if not cleaned:
                continue
            # Direct match (always try)
            candidates[cleaned] = cleaned
            # Prefix stripping: only if remaining stem is >= 2 chars
            # and the prefix char is a known Hebrew prefix letter
            for prefix_len in (1, 2):
                if len(cleaned) > prefix_len + 1:
                    prefix = cleaned[:prefix_len]
                    stem = cleaned[prefix_len:]
                    if all(c in _HEB_PREFIXES for c in prefix) and len(stem) >= 2:
                        candidates[stem] = cleaned
        # Check which vocab words appear in this sentence
        matched_forms = set(candidates.keys()) & all_forms
        for form in matched_forms:
            # Skip spurious matches: very short vocab forms (1-2 chars)
            # should only match via direct word match, not prefix stripping
            if len(form) <= 2 and form not in {_PUNCT.sub("", w) for w in raw_words}:
                continue
            for nikkud_word in words_by_stripped[form]:
                matches.setdefault(nikkud_word, []).append((word_count, sent_text))
    # Sort by word count (prefer shorter sentences) and deduplicate
    result = {}
    for nikkud_word, sent_list in matches.items():
        sent_list.sort(key=lambda x: x[0])
        seen = set()
        unique = []
        for _, sent in sent_list:
            if sent not in seen:
                seen.add(sent)
                unique.append(sent)
                if len(unique) >= 5:  # Keep top 5 per word
                    break
        result[nikkud_word] = unique
    return result
 # ── Main ─────────────────────────────────────────────────────────
 def main():
    print("=" * 60)
    print("EPUB Example Sentence Extraction Pipeline")
    print("=" * 60)
    # Step 1: Extract sentences from all books
    all_sentences = []
    book_counts = {}
    for filename, book_name in EPUB_BOOKS.items():
        path = EPUB_DIR / filename
        if not path.exists():
            print(f"\n[SKIP] {filename} not found")
            continue
        print(f"\n[EPUB] Extracting: {book_name} ({filename})")
        sentences = extract_sentences_from_epub(path, book_name)
        book_counts[book_name] = len(sentences)
        all_sentences.extend(sentences)
        print(f"  -> {len(sentences)} sentences")
    for filename, book_name in PDF_BOOKS.items():
        path = EPUB_DIR / filename
        if not path.exists():
            print(f"\n[SKIP] {filename} not found")
            continue
        print(f"\n[PDF]  Extracting: {book_name} ({filename})")
        sentences = extract_sentences_from_pdf(path, book_name)
        book_counts[book_name] = len(sentences)
        all_sentences.extend(sentences)
        print(f"  -> {len(sentences)} sentences")
    print(f"\nTotal sentences: {len(all_sentences)}")
    # Step 2: Save sentence index
    index_path = DATA_DIR / "epub_sentence_index.json"
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump({"sentences": all_sentences}, f, ensure_ascii=False, indent=2)
    print(f"\nSaved sentence index: {index_path}")
    # Step 3: Load vocab and match
    print(f"\nLoading vocab from {DICT_CSV} ...")
    words_by_stripped = load_vocab(DICT_CSV)
    total_vocab = len({w for wlist in words_by_stripped.values() for w in wlist})
    print(f"  {total_vocab} unique vocab words ({len(words_by_stripped)} lookup forms)")
    print("\nMatching sentences against vocab ...")
    examples_cache = match_sentences(all_sentences, words_by_stripped)
    # Step 4: Save examples_cache
    cache_path = DATA_DIR / "examples_cache.json"
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(examples_cache, f, ensure_ascii=False, indent=2)
    print(f"Saved examples cache: {cache_path}")
    # Step 5: Summary stats
    print("\n" + "=" * 60)
    print("SUMMARY")
    print("=" * 60)
    print("\nSentences per book:")
    for book_name, count in book_counts.items():
        print(f"  {book_name}: {count}")
    print(f"  Total: {len(all_sentences)}")
    print("\nVocab matching:")
    print(f"  Total vocab words: {total_vocab}")
    print(f"  Words with examples: {len(examples_cache)}")
    coverage = 100 * len(examples_cache) / total_vocab if total_vocab else 0
    print(f"  Coverage: {coverage:.1f}%")
    # Show some sample matches
    print("\nSample matches:")
    count = 0
    for word, sents in examples_cache.items():
        if count >= 5:
            break
        print(f"  {word} -> {sents[0][:60]}...")
        count += 1
    return examples_cache
 if __name__ == "__main__":
    main()
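The prefix-stripping step in match_sentences can be illustrated in isolation. The sketch below is not part of this commit: candidate_forms is a hypothetical helper, and its punctuation regex is a simplified stand-in for _PUNCT. It assumes the input word already has its nikkud removed.

```python
import re

# Same prefix letters as _HEB_PREFIXES in epub_examples.py
HEB_PREFIXES = set("בהוכלמשד")

# Assumption: a simplified stand-in for _PUNCT that trims any leading or
# trailing non-word characters (Hebrew letters count as word characters).
EDGE_PUNCT = re.compile(r"^[\W_]+|[\W_]+$")


def candidate_forms(word: str) -> set[str]:
    """Return the lookup forms for one nikkud-free sentence word.

    The word itself is always a candidate. Stems are added after removing
    one or two prefix letters, provided at least two characters remain.
    """
    cleaned = EDGE_PUNCT.sub("", word)
    if not cleaned:
        return set()
    forms = {cleaned}
    for prefix_len in (1, 2):
        # The length check guarantees the remaining stem has >= 2 chars
        if len(cleaned) > prefix_len + 1:
            prefix, stem = cleaned[:prefix_len], cleaned[prefix_len:]
            if all(c in HEB_PREFIXES for c in prefix):
                forms.add(stem)
    return forms


if __name__ == "__main__":
    print(sorted(candidate_forms("והבית,")))
```

For the word והבית this yields the word itself plus the stems הבית and בית, while a two-letter word such as של yields only itself. Intersecting these candidates with the vocab's stripped forms is what produces the matches.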
--- a/frequency_lookup.py
+++ b/frequency_lookup.py
@ -7,18 +7,15 @@ Exposed API: get_frequency_rank(word_no_nikkud) -> int | None
 import json
 import logging
 import re
 import unicodedata
 from pathlib import Path
 import requests
 from helpers import strip_nikkud as _strip_nikkud
 logger = logging.getLogger(__name__)
-FREQ_URL = (
+FREQ_URL = "https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content/2016/he/he_50k.txt"
    "https://raw.githubusercontent.com/hermitdave/FrequencyWords/"
    "master/content/2016/he/he_50k.txt"
 )
 CACHE_PATH = Path(__file__).parent / "data" / "frequency_cache.json"
 REQUEST_TIMEOUT = 30
@ -26,14 +23,6 @@ REQUEST_TIMEOUT = 30
 _freq: dict[str, int] = {}
-def _strip_nikkud(text: str) -> str:
-    """Remove Hebrew nikkud (diacritics) from a string."""
-    return "".join(
-        ch for ch in unicodedata.normalize("NFD", text)
-        if unicodedata.category(ch) != "Mn"
-    )
 def load(cache_path: Path = CACHE_PATH) -> None:
    """Load frequency data from cache, downloading if not present."""
    global _freq
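helpers.py itself is not shown in this section. As a rough sketch, assuming the shared helper keeps the behavior of the local _strip_nikkud definition in the hunk above, it would decompose the text to NFD and drop every nonspacing mark:

```python
import unicodedata


def strip_nikkud(text: str) -> str:
    """Remove Hebrew nikkud (diacritics) from a string.

    Assumption: mirrors the local definition from frequency_lookup.py;
    the real helpers.strip_nikkud may differ in detail.
    """
    return "".join(
        ch
        for ch in unicodedata.normalize("NFD", text)
        # Category "Mn" (nonspacing mark) covers the Hebrew vowel points
        if unicodedata.category(ch) != "Mn"
    )


if __name__ == "__main__":
    print(strip_nikkud("זְמַן"))
```

Because the filter is based on Unicode category rather than a Hebrew-specific range, it also removes combining marks from other scripts. That is harmless for this vocab but worth knowing if mixed-script text is ever passed in.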
--- a/hebrew_extract.py
+++ b/hebrew_extract.py
@ -4,25 +4,20 @@ Extract Hebrew vocabulary from pealim.com dictionary.
 Scrapes word entries, roots, parts of speech, and audio URLs for Anki flashcards.
 """
-import requests
-import pandas as pd
-from bs4 import BeautifulSoup
 import logging
 import time
-from typing import Optional
+
+import pandas as pd
+import requests
+from bs4 import BeautifulSoup
 # Configure logging
-logging.basicConfig(
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
-    level=logging.INFO,
-    format='%(asctime)s - %(levelname)s - %(message)s'
-)
 logger = logging.getLogger(__name__)
 # Session for connection pooling
 session = requests.Session()
-session.headers.update({
+session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; pealim-scraper/1.0)"})
-    'User-Agent': 'Mozilla/5.0 (compatible; pealim-scraper/1.0)'
-})
 PEALIM_DICT_URL = "https://www.pealim.com/dict/"
 REQUEST_DELAY = 1.5  # seconds between requests (respectful scraping)
@ -33,7 +28,7 @@ def get_total_pages() -> int:
    """Dynamically determine total pages from first request."""
    try:
        logger.info("Fetching total page count...")
-        cookies = {'translit': 'none', 'hebstyle': 'mo'}
+        cookies = {"translit": "none", "hebstyle": "mo"}
        response = session.get(PEALIM_DICT_URL, cookies=cookies, timeout=REQUEST_TIMEOUT)
        response.raise_for_status()
        # Hardcoded — pealim.com has ~608 pages at ~15 words/page
@ -48,17 +43,17 @@ def _parse_page_with_audio(html_bytes: bytes) -> list[dict]:
    Parse a dict page with BeautifulSoup to extract word data + audio URL.
    Returns list of dicts with keys: Word, Root, Part of Speech, Meaning, audio_url.
    """
-    soup = BeautifulSoup(html_bytes, 'html.parser')
+    soup = BeautifulSoup(html_bytes, "html.parser")
    rows = []
-    for tr in soup.select('table tr'):
+    for tr in soup.select("table tr"):
-        tds = tr.find_all('td')
+        tds = tr.find_all("td")
        if len(tds) < 4:
            continue
        # Audio URL from span[data-audio] in first td
-        audio_span = tds[0].find(attrs={'data-audio': True})
+        audio_span = tds[0].find(attrs={"data-audio": True})
-        audio_url = audio_span['data-audio'] if audio_span else ''
+        audio_url = audio_span["data-audio"] if audio_span else ""
        # Word with nikkud
-        menukad = tds[0].find('span', class_='menukad')
+        menukad = tds[0].find("span", class_="menukad")
        word = menukad.get_text(strip=True) if menukad else tds[0].get_text(strip=True)
        # Root (may be link or plain text)
        root = tds[1].get_text(strip=True)
@ -67,17 +62,19 @@ def _parse_page_with_audio(html_bytes: bytes) -> list[dict]:
        # Meaning
        meaning = tds[3].get_text(strip=True)
        if word:
-            rows.append({
+            rows.append(
-                'Word': word,
+                {
-                'Root': root if root else '-',
+                    "Word": word,
-                'Part of Speech': pos,
+                    "Root": root if root else "-",
-                'Meaning': meaning,
+                    "Part of Speech": pos,
-                'audio_url': audio_url,
+                    "Meaning": meaning,
-            })
+                    "audio_url": audio_url,
+                }
+            )
    return rows
-def extract_from_website(max_pages: Optional[int] = None) -> pd.DataFrame:
+def extract_from_website(max_pages: int | None = None) -> pd.DataFrame:
    """
    Extract dictionary entries from pealim.com.
    Captures audio URLs from each word entry's data-audio attribute.
@ -93,33 +90,33 @@ def extract_from_website(max_pages: Optional[int] = None) -> pd.DataFrame:
    all_rows: list[dict] = []
-    for page_num in range(1, total_pages):
+    for page_num in range(1, total_pages + 1):
        try:
            url = f"{PEALIM_DICT_URL}?page={page_num}"
            # First request: with nikkud — parse with BeautifulSoup for audio URL
-            cookies = {'translit': 'none', 'hebstyle': 'mo'}
+            cookies = {"translit": "none", "hebstyle": "mo"}
            response = session.get(url, cookies=cookies, timeout=REQUEST_TIMEOUT)
            response.raise_for_status()
            page_rows = _parse_page_with_audio(response.content)
            # Second request: without nikkud — just get the word column
-            cookies_vl = {'translit': 'none', 'hebstyle': 'vl', 'showmeaning': 'off'}
+            cookies_vl = {"translit": "none", "hebstyle": "vl", "showmeaning": "off"}
            resp_vl = session.get(url, cookies=cookies_vl, timeout=REQUEST_TIMEOUT)
            resp_vl.raise_for_status()
-            soup_vl = BeautifulSoup(resp_vl.content, 'html.parser')
+            soup_vl = BeautifulSoup(resp_vl.content, "html.parser")
            no_nik_words = []
-            for tr in soup_vl.select('table tr'):
+            for tr in soup_vl.select("table tr"):
-                tds = tr.find_all('td')
+                tds = tr.find_all("td")
                if len(tds) < 4:
                    continue
-                menukad = tds[0].find('span', class_='menukad')
+                menukad = tds[0].find("span", class_="menukad")
                w = menukad.get_text(strip=True) if menukad else tds[0].get_text(strip=True)
                no_nik_words.append(w)
            # Merge no-nikkud words into rows
            for i, row in enumerate(page_rows):
-                row['Word Without Nikkud'] = no_nik_words[i] if i < len(no_nik_words) else ''
+                row["Word Without Nikkud"] = no_nik_words[i] if i < len(no_nik_words) else ""
            all_rows.extend(page_rows)
@ -136,7 +133,7 @@ def extract_from_website(max_pages: Optional[int] = None) -> pd.DataFrame:
            continue
    df = pd.DataFrame(all_rows)
-    audio_count = (df['audio_url'] != '').sum() if 'audio_url' in df.columns else 0
+    audio_count = (df["audio_url"] != "").sum() if "audio_url" in df.columns else 0
    logger.info(f"Extraction complete. Total words: {len(df)}, with audio URL: {audio_count}")
    return df
@ -150,39 +147,39 @@ def modify_for_anki(df: pd.DataFrame) -> pd.DataFrame:
    # Find shared root words
    shared_root_words = []
-    for idx, row in df.iterrows():
+    for _idx, row in df.iterrows():
-        root = row['Root']
+        root = row["Root"]
-        word = row['Word']
+        word = row["Word"]
-        if root != '-' and pd.notna(root):
+        if root != "-" and pd.notna(root):
-            same_root = df[(df['Root'] == root) & (df['Word'] != word)]['Word'].values
+            same_root = df[(df["Root"] == root) & (df["Word"] != word)]["Word"].values
-            shared = ' '.join(str(w) for w in same_root)
+            shared = " ".join(str(w) for w in same_root)
            shared_root_words.append(shared)
        else:
-            shared_root_words.append('')
+            shared_root_words.append("")
-    df['shared roots'] = shared_root_words
+    df["shared roots"] = shared_root_words
    # Generate Hebrew tags
    tags = []
-    for idx, row in df.iterrows():
+    for _idx, row in df.iterrows():
        tag_parts = []
-        root = str(row['Root']).replace(' ', '').replace('-', '')
+        root = str(row["Root"]).replace(" ", "").replace("-", "")
-        if 'nan' not in root and root:
+        if "nan" not in root and root:
-            root_clean = root.replace('.', '')
+            root_clean = root.replace(".", "")
            tag_parts.append(f"שורש::{root_clean}")
-        pos = str(row['Part of Speech'])
+        pos = str(row["Part of Speech"])
        pos_tags = {
-            'Adverb': 'תוארי_הפועל',
+            "Adverb": "תוארי_הפועל",
-            'Pronoun': 'כינויי_גוף',
+            "Pronoun": "כינויי_גוף",
-            'Noun': 'שם_עצם',
+            "Noun": "שם_עצם",
-            'Verb': 'פעלים',
+            "Verb": "פעלים",
-            'Adjective': 'שם_תואר',
+            "Adjective": "שם_תואר",
-            'Preposition': 'מילות_יחס',
+            "Preposition": "מילות_יחס",
-            'Conjunction': 'מילות_חיבור',
+            "Conjunction": "מילות_חיבור",
-            'Particle': 'מילית'
+            "Particle": "מילית",
        }
        for key, value in pos_tags.items():
@ -190,9 +187,9 @@ def modify_for_anki(df: pd.DataFrame) -> pd.DataFrame:
                tag_parts.append(value)
                break
-        tags.append(' '.join(tag_parts))
+        tags.append(" ".join(tag_parts))
-    df['tags'] = tags
+    df["tags"] = tags
    logger.info("Anki preparation complete.")
    return df
@ -201,11 +198,11 @@ def main():
    """Main entry point."""
    try:
        df = extract_from_website()
-        df.to_csv('hebrew_dict.csv', index=True)
+        df.to_csv("hebrew_dict.csv", index=True)
        logger.info("Saved: hebrew_dict.csv")
        df = modify_for_anki(df)
-        df.to_csv('hebrew_dict_for_anki.csv', sep=';', index=True)
+        df.to_csv("hebrew_dict_for_anki.csv", sep=";", index=True)
        logger.info("Saved: hebrew_dict_for_anki.csv")
        logger.info("Complete!")
@ -215,5 +212,5 @@ def main():
        raise
-if __name__ == '__main__':
+if __name__ == "__main__":
    main()
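The `range(1, total_pages + 1)` change in `extract_from_website` fixes an off-by-one: the old loop never requested the last page. The sketch below illustrates that, together with the positional merge of the no-nikkud column. The function names `page_numbers` and `merge_plain_words` are hypothetical and exist only for this illustration; they are not part of `hebrew_extract.py`.

```python
def page_numbers(total_pages: int) -> list[int]:
    """Pages visited after the fix: 1 through total_pages inclusive."""
    return list(range(1, total_pages + 1))


def merge_plain_words(rows: list[dict], plain_words: list[str]) -> list[dict]:
    """Attach the unpointed word to each row by position.

    Rows beyond the end of plain_words get an empty string, mirroring the
    `if i < len(no_nik_words) else ""` guard in the diff.
    """
    for i, row in enumerate(rows):
        row["Word Without Nikkud"] = plain_words[i] if i < len(plain_words) else ""
    return rows


old_pages = list(range(1, 3))  # pre-fix behaviour for total_pages == 3
new_pages = page_numbers(3)
merged = merge_plain_words([{"Word": "a"}, {"Word": "b"}], ["x"])
print(old_pages, new_pages, merged)
```

Because the merge is purely positional, it appears to assume that both requests return rows in the same order; a skipped row in either response would shift every later pairing on that page.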
--- a/helpers.py
+++ b/helpers.py
@ -0,0 +1,8 @@
 """Shared helper functions for the Hebrew Flash Cards project."""
 import unicodedata
 def strip_nikkud(text: str) -> str:
    """Remove Hebrew nikkud (diacritics) from a string."""
    return "".join(ch for ch in unicodedata.normalize("NFD", text) if unicodedata.category(ch) != "Mn")
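A minimal sketch of what the shared helper does, using the same one-line body as `helpers.strip_nikkud` above. The sample strings are illustrative assumptions. Note that the filter is category-based (`Mn`), so it removes every nonspacing mark, not only Hebrew vowel points, and it leaves the maqaf (U+05BE, category `Pd`) in place.

```python
import unicodedata


def strip_nikkud(text: str) -> str:
    """Same logic as helpers.strip_nikkud: decompose, then drop nonspacing marks."""
    return "".join(ch for ch in unicodedata.normalize("NFD", text) if unicodedata.category(ch) != "Mn")


# "shalom" with shin dot, qamats and holam, written with escapes to keep the code points explicit
pointed = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"
plain = strip_nikkud(pointed)

# Latin accents are also nonspacing marks after NFD, so they are stripped too
latin = strip_nikkud("caf\u00e9")

# The maqaf is punctuation (Pd), not a mark, so it survives
maqaf = strip_nikkud("\u05be")

print(plain, latin, maqaf)
```

Callers that need the maqaf removed have to handle it separately, as `rebuild_sentence_matches.py` does with its punctuation regex.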
--- a/image_fetch.py
+++ b/image_fetch.py
@ -22,40 +22,43 @@ import argparse
 import json
 import logging
 import re
 import sys
 import time
 import unicodedata
 from pathlib import Path
 import requests
 from helpers import strip_nikkud as _strip_nikkud
 logger = logging.getLogger(__name__)
-DATA_DIR    = Path(__file__).parent / "data"
+DATA_DIR = Path(__file__).parent / "data"
-IMAGES_DIR  = DATA_DIR / "images"
+IMAGES_DIR = DATA_DIR / "images"
-CACHE_PATH  = DATA_DIR / "image_cache.json"
+CACHE_PATH = DATA_DIR / "image_cache.json"
-REQUEST_DELAY   = 0.5
+REQUEST_DELAY = 0.5
 REQUEST_TIMEOUT = 10
 # Abstract noun suffixes — words whose English meaning ends in these are skipped
 ABSTRACT_SUFFIXES = (
-    "tion", "ity", "ness", "ment", "ance", "ence", "ism",
-    "hood", "ship", "ure", "age",
+    "tion",
+    "ity",
+    "ness",
+    "ment",
+    "ance",
+    "ence",
+    "ism",
+    "hood",
+    "ship",
+    "ure",
+    "age",
 )
 session = requests.Session()
-session.headers.update({
+session.headers.update(
-    "User-Agent": "pealim-anki/3.0 (educational Hebrew Anki deck builder; contact: anki@pealim.invalid)"
+    {"User-Agent": "pealim-anki/3.0 (educational Hebrew Anki deck builder; contact: anki@pealim.invalid)"}
-})
+)
-def _strip_nikkud(text: str) -> str:
-    return "".join(
-        ch for ch in unicodedata.normalize("NFD", text)
-        if unicodedata.category(ch) != "Mn"
-    )
 def is_concrete(english_meaning: str) -> bool:
    """Return True if the English meaning looks like a concrete noun."""
@ -196,7 +199,7 @@ def load_cache() -> dict:
        try:
            with open(CACHE_PATH, encoding="utf-8") as f:
                return json.load(f)
-        except Exception:
+        except Exception:  # noqa: S110
            pass
    return {}
@ -242,10 +245,10 @@ def run(limit: int | None = None, dry_run: bool = False, single_word: str | None
        if limit is not None and processed >= limit:
            break
-        word      = str(row.get("Word", "")).strip()
+        word = str(row.get("Word", "")).strip()
-        meaning   = str(row.get("Meaning", "")).strip()
+        meaning = str(row.get("Meaning", "")).strip()
        word_plain = str(row.get("Word Without Nikkud", "")).strip()
-        pos_raw   = str(row.get("Part of speech", row.get("Part of Speech", ""))).strip()
+        pos_raw = str(row.get("Part of speech", row.get("Part of Speech", ""))).strip()
        if not word or not meaning or meaning in ("nan", "None"):
            continue
--- a/pealim_extract.py
+++ b/pealim_extract.py
@ -1,187 +0,0 @@
 #!/usr/bin/env python3
 """
 Extract Hebrew vocabulary from pealim.com dictionary.
 Scrapes word entries, roots, and parts of speech for Anki flashcards.
 """
 import requests
 import pandas as pd
 import logging
 import time
 from typing import Optional
 # Configure logging
 logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
 )
 logger = logging.getLogger(__name__)
 # Session for connection pooling
 session = requests.Session()
 session.headers.update({
    'User-Agent': 'Mozilla/5.0 (compatible; pealim-scraper/1.0)'
 })
 PEALIM_DICT_URL = "https://www.pealim.com/dict/"
 REQUEST_DELAY = 1.5  # seconds between requests (respectful scraping)
 REQUEST_TIMEOUT = 10  # seconds
 def get_total_pages() -> int:
    """Dynamically determine total pages from first request."""
    try:
        logger.info("Fetching total page count...")
        cookies = {'translit': 'none', 'hebstyle': 'mo'}
        response = session.get(PEALIM_DICT_URL, cookies=cookies, timeout=REQUEST_TIMEOUT)
        response.raise_for_status()
        dfs = pd.read_html(response.content)
        if dfs:
            # Estimate pages from first page (typically 15 words per page)
            # For now, use hardcoded value but this could be improved
            return 608
    except Exception as e:
        logger.error(f"Error fetching page count: {e}. Using default (608).")
        return 608
 def extract_from_website(max_pages: Optional[int] = None) -> pd.DataFrame:
    """
    Extract dictionary entries from pealim.com.
    Args:
        max_pages: Maximum pages to scrape (None = all)
    Returns:
        DataFrame with Word, Root, Part of Speech, and Word Without Nikkud columns
    """
    total_pages = max_pages or get_total_pages()
    logger.info(f"Starting extraction from {total_pages} pages...")
    df = pd.DataFrame()
    for page_num in range(1, total_pages):
        try:
            url = f"{PEALIM_DICT_URL}?page={page_num}"
            # First request: with nikkud
            cookies = {'translit': 'none', 'hebstyle': 'mo'}
            response = session.get(url, cookies=cookies, timeout=REQUEST_TIMEOUT)
            response.raise_for_status()
            df_list = pd.read_html(response.content)
            # Second request: without nikkud
            cookies = {'translit': 'none', 'hebstyle': 'vl', 'showmeaning': 'off'}
            response = session.get(url, cookies=cookies, timeout=REQUEST_TIMEOUT)
            response.raise_for_status()
            without_nikkud_words = pd.read_html(response.content)[-1]['Word']
            without_nikkud_words = without_nikkud_words.rename('Word Without Nikkud')
            # Combine and append
            df_to_add = pd.concat([df_list[-1], without_nikkud_words], axis=1)
            df = pd.concat([df, df_to_add], ignore_index=True)
            if page_num % 50 == 0:
                logger.info(f"Processed {page_num}/{total_pages} pages...")
            time.sleep(REQUEST_DELAY)
        except requests.RequestException as e:
            logger.error(f"Error fetching page {page_num}: {e}. Retrying...")
            time.sleep(REQUEST_DELAY * 2)
        except Exception as e:
            logger.error(f"Unexpected error on page {page_num}: {e}")
            continue
    logger.info(f"Extraction complete. Total words: {len(df)}")
    return df
 def modify_for_anki(df: pd.DataFrame) -> pd.DataFrame:
    """
    Transform dictionary DataFrame for Anki import.
    Adds shared root words and Hebrew tags.
    Args:
        df: Dictionary DataFrame
    Returns:
        Modified DataFrame ready for Anki
    """
    logger.info("Preparing data for Anki...")
    # Find shared root words
    shared_root_words = []
    for idx, row in df.iterrows():
        root = row['Root']
        word = row['Word']
        if root != '-' and pd.notna(root):
            # Find other words with same root
            same_root = df[(df['Root'] == root) & (df['Word'] != word)]['Word'].values
            shared = ' '.join(str(w) for w in same_root)
            shared_root_words.append(shared)
        else:
            shared_root_words.append('')
    df['shared roots'] = shared_root_words
    # Generate Hebrew tags
    tags = []
    for idx, row in df.iterrows():
        tag_parts = []
        # Root tag
        root = str(row['Root']).replace(' ', '').replace('-', '')
        if 'nan' not in root and root:
            root_clean = root.replace('.', '')
            tag_parts.append(f"שורש::{root_clean}")
        # Part of speech tag
        pos = str(row['Part of Speech'])
        pos_tags = {
            'Adverb': 'תוארי_הפועל',
            'Pronoun': 'כינויי_גוף',
            'Noun': 'שם_עצם',
            'Verb': 'פעלים',
            'Adjective': 'שם_תואר',
            'Preposition': 'מילות_יחס',
            'Conjunction': 'מילות_חיבור',
            'Particle': 'מילית'
        }
        for key, value in pos_tags.items():
            if key in pos:
                tag_parts.append(value)
                break
        tags.append(' '.join(tag_parts))
    df['tags'] = tags
    logger.info("Anki preparation complete.")
    return df
 def main():
    """Main entry point."""
    try:
        # Extract from website
        df = extract_from_website()
        df.to_csv('pealim_dict.csv', index=True)
        logger.info("Saved: pealim_dict.csv")
        # Transform for Anki
        df = modify_for_anki(df)
        df.to_csv('pealim_dict_for_anki.csv', sep=';', index=True)
        logger.info("Saved: pealim_dict_for_anki.csv")
        logger.info("✅ Complete!")
    except Exception as e:
        logger.error(f"Fatal error: {e}")
        raise
 if __name__ == '__main__':
    main()
--- a/pyproject.toml
+++ b/pyproject.toml
@ -0,0 +1,80 @@
 [project]
 name = "hebrew-flash-cards"
 version = "0.13"
 description = "Hebrew vocabulary & verb conjugation flashcards for Anki"
 requires-python = ">=3.11"
 dependencies = [
    "beautifulsoup4>=4.11.0",
    "genanki>=0.8.0",
    "lxml>=4.9.0",
    "numpy>=1.21.0",
    "pandas>=1.3.0",
    "pymupdf>=1.23.0",
    "pypdf>=3.0.0",
    "python-bidi>=0.4.2",
    "requests>=2.26.0",
 ]
 [project.optional-dependencies]
 dev = [
    "bandit",
    "pytest",
    "ruff",
    "vulture",
 ]
 [tool.pytest.ini_options]
 testpaths = ["tests"]
 [tool.ruff]
 target-version = "py311"
 line-length = 120
 exclude = [
    "lib/",
    "bin/",
    "include/",
    "lib64/",
    "archive/",
    "venv/",
 ]
 [tool.ruff.lint]
 select = [
    "E",     # pycodestyle errors
    "W",     # pycodestyle warnings
    "F",     # pyflakes
    "I",     # isort
    "UP",    # pyupgrade
    "B",     # flake8-bugbear
    "SIM",   # flake8-simplify
    "PIE",   # flake8-pie
    "T20",   # flake8-print (flag print statements)
    "RET",   # flake8-return
    "C4",    # flake8-comprehensions
    "S",     # flake8-bandit (security)
 ]
 ignore = [
    "T201",  # allow print() — this is a CLI tool, not a library
    "S603",  # subprocess call with shell=False is fine
    "S607",  # partial executable path is fine for CLI tools
    "S105",  # PASS = "✓" is not a password
    "S108",  # /tmp paths are intentional for temp downloads
    "S311",  # random.Random() is for card ordering, not crypto
    "E501",  # line too long — handled by formatter
 ]
 [tool.ruff.lint.per-file-ignores]
 "test_*.py" = ["S101"]  # allow assert in tests
 [tool.ruff.format]
 quote-style = "double"
 indent-style = "space"
 [tool.vulture]
 paths = ["."]
 exclude = ["lib/", "bin/", "include/", "lib64/", "venv/", "archive/"]
 min_confidence = 80
 [tool.bandit]
 exclude_dirs = ["lib", "bin", "include", "lib64", "venv", "archive"]
 skips = ["B101"]  # allow assert
--- a/rebuild_sentence_matches.py
+++ b/rebuild_sentence_matches.py
@ -0,0 +1,183 @@
 #!/usr/bin/env python3
 """
 Rebuild vocab_sentence_matches.json using both direct word matching
 and ktiv male conjugated/declined form matching.
 This dramatically improves sentence coverage by matching not just
 dictionary forms but all conjugated verbs and declined nouns.
 """
 import json
 import logging
 import re
 from pathlib import Path
 import pandas as pd
 from helpers import strip_nikkud as _strip_nikkud
 logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
 logger = logging.getLogger(__name__)
 DATA_DIR = Path(__file__).parent / "data"
 def main():
    # Load sentences
    with open(DATA_DIR / "epub_sentence_index.json") as f:
        sentences = json.load(f).get("sentences", [])
    logger.info(f"Loaded {len(sentences)} sentences")
    # Load vocab CSV
    csv_path = DATA_DIR / "hebrew_dict_for_anki.csv"
    try:
        df = pd.read_csv(csv_path, sep=";", index_col=0)
        if df.shape[1] < 3:
            raise ValueError
    except (ValueError, pd.errors.ParserError):
        df = pd.read_csv(csv_path, index_col=0)
    logger.info(f"Loaded {len(df)} vocab entries")
    # Build word lookup: stripped_form → (word_nikkud, word_no_nikkud)
    word_lookup: dict[str, list[tuple[str, str]]] = {}
    for _, row in df.iterrows():
        word = str(row.get("Word", "")).strip()
        wni = str(row.get("Word Without Nikkud", "")).strip()
        if not word or word in ("nan", "None"):
            continue
        stripped = _strip_nikkud(word)
        if stripped:
            word_lookup.setdefault(stripped, []).append((word, wni))
    # Load ktiv male forms: ktiv_male_form → [{word_nikkud, form_type, ...}]
    ktiv_path = DATA_DIR / "ktiv_male_forms.json"
    ktiv_forms: dict[str, list[dict]] = {}
    if ktiv_path.exists():
        with open(ktiv_path) as f:
            ktiv_forms = json.load(f)
        logger.info(f"Loaded {len(ktiv_forms)} ktiv male forms")
    else:
        logger.warning("No ktiv_male_forms.json — only using direct matching")
    # Build reverse lookup: ktiv_male → set of dictionary words (nikkud)
    ktiv_to_word: dict[str, set[str]] = {}
    for ktiv, entries in ktiv_forms.items():
        for entry in entries:
            word_nikkud = entry.get("word_nikkud", "")
            if word_nikkud:
                ktiv_to_word.setdefault(ktiv, set()).add(word_nikkud)
    # Also add all vocab words' own stripped forms to ktiv_to_word
    for stripped, entries in word_lookup.items():
        for word_nikkud, _ in entries:
            ktiv_to_word.setdefault(stripped, set()).add(word_nikkud)
    logger.info(f"Total matchable forms: {len(ktiv_to_word)}")
    # Tokenize all sentences once
    sentence_tokens: list[tuple[dict, list[str]]] = []
    for s in sentences:
        stripped = s.get("stripped", _strip_nikkud(s.get("text", "")))
        tokens = [re.sub(r'[.,!?;:"\'\u05be]', "", t) for t in stripped.split()]
        tokens = [t for t in tokens if t]  # remove empty
        sentence_tokens.append((s, tokens))
    # Match: for each sentence token, check ktiv_to_word lookup
    # Build word_nikkud → [sentence_info]
    matches: dict[str, list[dict]] = {}  # word_nikkud → [sentences]
    for sent, tokens in sentence_tokens:
        text = sent.get("text", "")
        book = sent.get("book", "")
        word_len = len(tokens)
        # Skip sentences that are too short or too long
        if word_len < 4 or word_len > 15:
            continue
        for tok in tokens:
            if tok in ktiv_to_word:
                for word_nikkud in ktiv_to_word[tok]:
                    matches.setdefault(word_nikkud, []).append(
                        {
                            "text": text,
                            "book": book,
                            "matched_form": tok,
                            "word_count": word_len,
                        }
                    )
    logger.info(f"Words with at least 1 match: {len(matches)}")
    # Deduplicate and limit to 3 best sentences per word
    # Prefer shorter sentences (6-12 words ideal)
    output: dict[str, dict] = {}
    for word_nikkud, sents in matches.items():
        # Deduplicate by text
        seen_texts = set()
        unique = []
        for s in sents:
            if s["text"] not in seen_texts:
                seen_texts.add(s["text"])
                unique.append(s)
        # Score: prefer 6-12 word sentences
        def score(s):
            wc = s["word_count"]
            if 6 <= wc <= 12:
                return 0  # ideal
            return abs(wc - 9)  # distance from ideal
        unique.sort(key=score)
        best = unique[:3]
        # Find the Word Without Nikkud for this word
        stripped = _strip_nikkud(word_nikkud)
        wni = stripped  # default
        if stripped in word_lookup:
            for wn, w_wni in word_lookup[stripped]:
                if wn == word_nikkud:
                    wni = w_wni
                    break
        output[wni] = {
            "word_nikkud": word_nikkud,
            "sentences": [{"text": s["text"], "book": s["book"]} for s in best],
        }
    # Save
    out_path = DATA_DIR / "vocab_sentence_matches.json"
    with open(out_path, "w") as f:
        json.dump(output, f, ensure_ascii=False, indent=1)
    total_sents = sum(len(v["sentences"]) for v in output.values())
    logger.info(f"Saved {len(output)} words with {total_sents} sentences → {out_path}")
    # Stats
    total_vocab = len(df)
    pct = len(output) * 100 / total_vocab
    logger.info(f"Coverage: {len(output)}/{total_vocab} ({pct:.1f}%)")
    # Breakdown by match type
    direct_only = 0
    ktiv_only = 0
    both = 0
    for _wni, info in output.items():
        word = info["word_nikkud"]
        stripped = _strip_nikkud(word)
        # output["sentences"] drops matched_form, so read the matched forms from matches
        forms = {s["matched_form"] for s in matches.get(word, [])}
        has_direct = stripped in forms
        has_ktiv = any(form != stripped for form in forms)
        if has_direct and has_ktiv:
            both += 1
        elif has_ktiv:
            ktiv_only += 1
        else:
            direct_only += 1
    logger.info(f"  Direct matches only: {direct_only}")
    logger.info(f"  Ktiv male matches only: {ktiv_only}")
    logger.info(f"  Both: {both}")
 if __name__ == "__main__":
    main()
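The tokenizing and ranking steps in `rebuild_sentence_matches.py` can be sketched in isolation as below. The helper names `tokenize`, `score` and `pick_best` are hypothetical (the script inlines this logic in `main`), and the sample candidates are made-up placeholders rather than real sentence data.

```python
import re

# Same character class as the script: common punctuation plus the maqaf (U+05BE)
PUNCT = re.compile(r'[.,!?;:"\'\u05be]')


def tokenize(stripped_sentence: str) -> list[str]:
    """Split on whitespace, delete punctuation inside each token, drop empties."""
    tokens = [PUNCT.sub("", t) for t in stripped_sentence.split()]
    return [t for t in tokens if t]


def score(word_count: int) -> int:
    """0 for the preferred 6-12 word range, otherwise distance from 9."""
    if 6 <= word_count <= 12:
        return 0
    return abs(word_count - 9)


def pick_best(candidates: list[dict], limit: int = 3) -> list[dict]:
    """Deduplicate by text, then keep the `limit` lowest-scoring sentences."""
    seen: set[str] = set()
    unique = []
    for cand in candidates:
        if cand["text"] not in seen:
            seen.add(cand["text"])
            unique.append(cand)
    unique.sort(key=lambda c: score(c["word_count"]))  # stable sort keeps input order on ties
    return unique[:limit]


tokens = tokenize('a, b. "c" d!')
joined = tokenize("a\u05beb")
best = pick_best(
    [
        {"text": "s1", "word_count": 14},
        {"text": "s2", "word_count": 7},
        {"text": "s2", "word_count": 7},  # duplicate text, dropped
        {"text": "s3", "word_count": 5},
        {"text": "s4", "word_count": 10},
    ]
)
print(tokens, joined, [b["text"] for b in best])
```

One consequence worth noting: the regex deletes the maqaf instead of splitting on it, so two maqaf-joined words collapse into a single token and will only match a lookup key that has the same fused spelling.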
--- a/run.py
+++ b/run.py
@ -6,7 +6,7 @@ Usage:
  python run.py [options]
 Options:
-  --only {vocab,conjugations}  Run only one deck (skips all unrelated steps)
+  --only {vocab,conjugations,confusables,plurals,complete}  Run only one deck
  --skip-scrape        Use existing data/pealim_dict.csv (no pealim.com dict scraping)
  --skip-audio         Skip audio .mp3 downloads
  --skip-examples      Skip Ben Yehuda example fetching
@ -22,9 +22,10 @@ import logging
 import re
 import sys
 import time
 import unicodedata
 from pathlib import Path
 from helpers import strip_nikkud
 sys.path.insert(0, str(Path(__file__).parent))
 logging.basicConfig(
@ -33,23 +34,31 @@ logging.basicConfig(
 )
 logger = logging.getLogger(__name__)
-DATA_DIR       = Path(__file__).parent / "data"
+DATA_DIR = Path(__file__).parent / "data"
-OUTPUT_DIR     = Path(__file__).parent / "output"
+OUTPUT_DIR = Path(__file__).parent / "output"
-AUDIO_DIR      = DATA_DIR / "audio"
+AUDIO_DIR = DATA_DIR / "audio"
 AUDIO_CONJ_DIR = DATA_DIR / "audio_conj"
-FONTS_DIR      = DATA_DIR / "fonts"
+FONTS_DIR = DATA_DIR / "fonts"
 def parse_args():
    p = argparse.ArgumentParser(description="Pealim Anki deck builder")
-    p.add_argument("--only",               choices=["vocab", "conjugations"], help="Run only one deck (skips all unrelated steps)")
-    p.add_argument("--skip-scrape",        action="store_true", help="Skip dict scraping; use cached CSV")
-    p.add_argument("--skip-audio",         action="store_true", help="Skip audio downloads")
-    p.add_argument("--skip-examples",      action="store_true", help="Skip Ben Yehuda example lookup")
-    p.add_argument("--skip-conjugations",  action="store_true", help="Skip verb conjugation extraction (deprecated: use --only vocab)")
-    p.add_argument("--skip-images",        action="store_true", help="Skip image fetching")
-    p.add_argument("--refresh-examples",   action="store_true", help="Force rebuild of Ben Yehuda index")
-    p.add_argument("--test",               type=int, metavar="N", help="Limit to first N words")
+    p.add_argument(
+        "--only",
+        choices=["vocab", "conjugations", "confusables", "plurals", "complete"],
+        help="Run only one deck (skips all unrelated steps)",
+    )
+    p.add_argument("--skip-scrape", action="store_true", help="Skip dict scraping; use cached CSV")
+    p.add_argument("--skip-audio", action="store_true", help="Skip audio downloads")
+    p.add_argument("--skip-examples", action="store_true", help="Skip Ben Yehuda example lookup")
+    p.add_argument(
+        "--skip-conjugations",
+        action="store_true",
+        help="Skip verb conjugation extraction (deprecated: use --only vocab)",
+    )
+    p.add_argument("--skip-images", action="store_true", help="Skip image fetching")
+    p.add_argument("--refresh-examples", action="store_true", help="Force rebuild of Ben Yehuda index")
+    p.add_argument("--test", type=int, metavar="N", help="Limit to first N words")
    return p.parse_args()
@ -59,8 +68,6 @@ def step_scrape(args):
    anki_csv = DATA_DIR / "hebrew_dict_for_anki.csv"
    # Legacy fallback names
    legacy_dict = DATA_DIR / "pealim_dict.csv"
    legacy_anki = DATA_DIR / "pealim_dict_for_anki.csv"
    if args.skip_scrape:
        if dict_csv.exists():
            logger.info(f"[1] Using existing {dict_csv}")
@ -72,8 +79,8 @@ def step_scrape(args):
        return
    logger.info("[1] Scraping dictionary from pealim.com …")
    import hebrew_extract
    import pandas as pd
    df = hebrew_extract.extract_from_website()
    df.to_csv(dict_csv, index=True)
@ -88,6 +95,7 @@ def step_frequency() -> dict[str, int]:
    """Step 2 — load/download word frequency data."""
    logger.info("[2] Loading word frequency data …")
    import frequency_lookup
    frequency_lookup.load()
    return frequency_lookup._freq
@ -104,6 +112,7 @@ def step_examples(args, freq_cache: dict):
    logger.info("[3] Loading Ben Yehuda example index …")
    import benyehuda
    benyehuda.load(force_rebuild=args.refresh_examples)
    dict_csv = DATA_DIR / "hebrew_dict_for_anki.csv"
@ -116,6 +125,7 @@ def step_examples(args, freq_cache: dict):
    try:
        import pandas as pd
        try:
            df = pd.read_csv(dict_csv, sep=";", index_col=0)
            if df.shape[1] < 3:
@ -158,6 +168,7 @@ def step_audio(args):
    import pandas as pd
    import requests
    try:
        try:
            df = pd.read_csv(dict_csv, sep=";", index_col=0)
@ -166,7 +177,7 @@ def step_audio(args):
        except (ValueError, pd.errors.ParserError):
            df = pd.read_csv(dict_csv, index_col=0)
-        if 'audio_url' not in df.columns:
+        if "audio_url" not in df.columns:
            logger.warning("    No audio_url column in CSV — re-scrape with hebrew_extract.py to capture audio URLs")
            return
@ -178,10 +189,6 @@ def step_audio(args):
        skipped = 0
        no_url = 0
-        def strip_nik(t: str) -> str:
-            return "".join(c for c in unicodedata.normalize("NFD", t)
-                           if unicodedata.category(c) != "Mn")
        for _, row in df.iterrows():
            word = str(row.get("Word", "")).strip()
            word_plain = str(row.get("Word Without Nikkud", "")).strip()
@ -190,7 +197,7 @@ def step_audio(args):
            if not word:
                continue
-            safe_name = re.sub(r"[^\u05d0-\u05ea]", "", strip_nik(word_plain or word))
+            safe_name = re.sub(r"[^\u05d0-\u05ea]", "", strip_nikkud(word_plain or word))
            if not safe_name:
                continue
            mp3_path = AUDIO_DIR / f"{safe_name}.mp3"
@ -228,11 +235,12 @@ def step_conj_audio(args, conjugations: dict):
    AUDIO_CONJ_DIR.mkdir(parents=True, exist_ok=True)
    import requests
    downloaded = 0
    skipped = 0
    failed = 0
-    for infinitive, data in conjugations.items():
+    for _infinitive, data in conjugations.items():
        if not data or not data.get("forms"):
            continue
@ -282,17 +290,14 @@ def step_conj_audio(args, conjugations: dict):
                    logger.debug(f"    Conj audio failed {filename}: {e}")
                    failed += 1
-    logger.info(
-        f"    Conjugation audio: {downloaded} downloaded, "
-        f"{skipped} cached, {failed} failed"
-    )
+    logger.info(f"    Conjugation audio: {downloaded} downloaded, {skipped} cached, {failed} failed")
 def step_fonts(args):
    """Step 4c — download Heebo font files (one-time, cached)."""
    FONTS_DIR.mkdir(parents=True, exist_ok=True)
    regular = FONTS_DIR / "_Heebo-Regular.ttf"
-    bold    = FONTS_DIR / "_Heebo-Bold.ttf"
+    bold = FONTS_DIR / "_Heebo-Bold.ttf"
    if regular.exists() and bold.exists():
        logger.info("[4c] Heebo fonts already cached")
@ -302,6 +307,7 @@ def step_fonts(args):
    # Fetch CSS to get actual TTF source URLs (static subset for Hebrew + Latin)
    import requests as _req
    headers = {
        # Request TTF (not woff2) so Anki can embed them
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/120.0"
@ -355,10 +361,13 @@ def step_images(args) -> dict:
    limit = args.test  # When in test mode, limit images too
    logger.info("[4d] Fetching images for concrete nouns …")
    import image_fetch
    return image_fetch.run(limit=limit)
-def step_build_all(args, examples_cache: dict, freq_cache: dict, conjugations: dict | None, image_cache: dict | None = None):
+def step_build_all(
+    args, examples_cache: dict, freq_cache: dict, conjugations: dict | None, image_cache: dict | None = None
+):
    """Step 5 — build all 6 release variants (4 vocab + 2 conj)."""
    logger.info("[5] Building all deck variants …")
    import apkg_builder
@ -394,6 +403,7 @@ def step_conjugations(args):
            logger.info("[6] --skip-conjugations: loading from cache …")
            with open(conj_cache) as f:
                import json as _json
                return _json.load(f)
        logger.info("[6] --skip-conjugations: no cache found, skipping conj decks")
        return None
@ -407,10 +417,12 @@ def step_conjugations(args):
        logger.info("[6] Using cached conjugations.json …")
        with open(conj_cache) as f:
            import json as _json
            conjugations = _json.load(f)
    else:
        logger.info("[6] Extracting verb conjugations …")
        import conjugation_extract
        conjugations = conjugation_extract.main(verbs_file)
    # Download conjugation audio
@ -434,6 +446,7 @@ def print_summary(args, examples_cache, freq_cache, conjugations):
        dict_csv = DATA_DIR / "pealim_dict.csv"
    if dict_csv.exists():
        import pandas as pd
        try:
            df = pd.read_csv(dict_csv, sep=";", index_col=0)
            if df.shape[1] < 3:
@ -446,7 +459,7 @@ def print_summary(args, examples_cache, freq_cache, conjugations):
    logger.info(f"  Example cache entries: {len(examples_cache)}")
    covered = sum(1 for v in examples_cache.values() if v)
    if examples_cache:
-        logger.info(f"  Example coverage: {covered}/{len(examples_cache)} ({100*covered//len(examples_cache)}%)")
+        logger.info(f"  Example coverage: {covered}/{len(examples_cache)} ({100 * covered // len(examples_cache)}%)")
    if AUDIO_DIR.exists():
        mp3s = list(AUDIO_DIR.glob("*.mp3"))
@ -455,9 +468,9 @@ def print_summary(args, examples_cache, freq_cache, conjugations):
    if AUDIO_CONJ_DIR.exists():
        # Count only files that will be bundled: active non-infinitive forms
        # (excludes {slug}_passive_* and {slug}_infinitive.mp3 on-disk extras)
-        mp3s = [p for p in AUDIO_CONJ_DIR.glob("*.mp3")
-                if not p.stem.endswith("_infinitive")
-                and "_passive_" not in p.stem]
+        mp3s = [
+            p for p in AUDIO_CONJ_DIR.glob("*.mp3") if not p.stem.endswith("_infinitive") and "_passive_" not in p.stem
+        ]
        logger.info(f"  Conjugation audio files (bundled): {len(mp3s)}")
    image_cache_path = DATA_DIR / "image_cache.json"
@ -468,9 +481,18 @@ def print_summary(args, examples_cache, freq_cache, conjugations):
        logger.info(f"  Images: {found_imgs}/{len(ic)} nouns with images")
    import apkg_builder as _ab
    all_apkgs = [
-        _ab.VOCAB_APKG, _ab.VOCAB_APKG_AUDIO, _ab.VOCAB_APKG_IMAGES, _ab.VOCAB_APKG_AUDIO_IMAGES,
-        _ab.CONJ_APKG, _ab.CONJ_APKG_AUDIO,
+        _ab.VOCAB_APKG,
+        _ab.VOCAB_APKG_AUDIO,
+        _ab.VOCAB_APKG_IMAGES,
+        _ab.VOCAB_APKG_AUDIO_IMAGES,
+        _ab.CONJ_APKG,
+        _ab.CONJ_APKG_AUDIO,
+        _ab.CONF_APKG,
+        _ab.CONF_APKG_AUDIO,
+        _ab.COMPLETE_APKG,
+        _ab.COMPLETE_APKG_AUDIO,
    ]
    for apkg in all_apkgs:
        if apkg.exists():
@ -502,24 +524,80 @@ def main():
        conjugations = step_conjugations(args)
        if conjugations:
            import apkg_builder
-            apkg_builder.build_all_variants(
-                DATA_DIR / "hebrew_dict_for_anki.csv",
-                conjugations=conjugations,
-                limit=args.test,
-            )
+
+            dict_csv = DATA_DIR / "hebrew_dict_for_anki.csv"
+            if not dict_csv.exists():
+                dict_csv = DATA_DIR / "hebrew_dict.csv"
+            for audio, path in [(False, apkg_builder.CONJ_APKG), (True, apkg_builder.CONJ_APKG_AUDIO)]:
+                deck, media = apkg_builder.build_conj_deck(
+                    conjugations,
+                    include_audio=audio,
+                    dict_csv=dict_csv,
+                )
+                apkg_builder.write_conj_apkg(deck, media, out_path=path)
        print_summary(args, {}, {}, conjugations or {})
        return
    if args.only == "confusables":
        step_fonts(args)
        import apkg_builder
        dict_csv = DATA_DIR / "hebrew_dict_for_anki.csv"
        for audio, path in [(False, apkg_builder.CONF_APKG), (True, apkg_builder.CONF_APKG_AUDIO)]:
            deck, media = apkg_builder.build_confusables_deck(dict_csv, include_audio=audio)
            apkg_builder.write_conf_apkg(deck, media, out_path=path)
        print_summary(args, {}, {}, {})
        return
    if args.only == "plurals":
        step_fonts(args)
        import apkg_builder
        dict_csv = DATA_DIR / "hebrew_dict_for_anki.csv"
        if not dict_csv.exists():
            dict_csv = DATA_DIR / "hebrew_dict.csv"
        for audio, path in [(False, apkg_builder.PLURAL_APKG), (True, apkg_builder.PLURAL_APKG_AUDIO)]:
            deck, media = apkg_builder.build_plural_deck(dict_csv=dict_csv, include_audio=audio)
            apkg_builder.write_plural_apkg(deck, media, out_path=path)
        print_summary(args, {}, {}, {})
        return
    if args.only == "complete":
        step_fonts(args)
        freq_cache = step_frequency() if not args.skip_scrape else {}
        examples_cache = step_examples(args, freq_cache) if not args.skip_examples else {}
        image_cache = step_images(args) if not args.skip_images else {}
        conjugations = step_conjugations(args)
        import apkg_builder
        dict_csv = DATA_DIR / "hebrew_dict_for_anki.csv"
        if not dict_csv.exists():
            dict_csv = DATA_DIR / "hebrew_dict.csv"
        emoji_lookup = apkg_builder._load_emoji_lookup()
        for audio, path in [(False, apkg_builder.COMPLETE_APKG), (True, apkg_builder.COMPLETE_APKG_AUDIO)]:
            decks, media = apkg_builder.build_complete_deck(
                dict_csv,
                conjugations=conjugations or {},
                examples_cache=examples_cache,
                freq_cache=freq_cache,
                image_cache=image_cache,
                emoji_lookup=emoji_lookup,
                include_audio=audio,
            )
            apkg_builder.write_complete_apkg(decks, media, out_path=path)
        print_summary(args, examples_cache, freq_cache, conjugations or {})
        return
    if args.only == "vocab":
        args.skip_conjugations = True
    step_scrape(args)
-    freq_cache     = step_frequency()
+    freq_cache = step_frequency()
    examples_cache = step_examples(args, freq_cache)
    step_audio(args)
    step_fonts(args)
-    image_cache    = step_images(args)
+    image_cache = step_images(args)
-    conjugations   = step_conjugations(args)
+    conjugations = step_conjugations(args)
    step_build_all(args, examples_cache, freq_cache, conjugations, image_cache)
    print_summary(args, examples_cache, freq_cache, conjugations or {})
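Review note: the `--only` branches in this hunk repeat one pattern, namely build a deck without and with audio, then write each variant to its own path. A minimal sketch of how that loop could be factored out, assuming a hypothetical helper `build_audio_variants` (not part of this commit) and stand-in `fake_build` / `fake_write` callables in place of `apkg_builder`:

```python
from pathlib import Path
from typing import Any, Callable


def build_audio_variants(
    build: Callable[[bool], tuple[Any, list]],
    write: Callable[[Any, list, Path], None],
    plain_path: Path,
    audio_path: Path,
) -> list[Path]:
    """Build the no-audio and audio variants of one deck and write both."""
    written = []
    for include_audio, out_path in [(False, plain_path), (True, audio_path)]:
        deck, media = build(include_audio)
        write(deck, media, out_path)
        written.append(out_path)
    return written


# Stand-ins so the sketch runs without apkg_builder (hypothetical, illustration only).
def fake_build(include_audio: bool) -> tuple[str, list]:
    media = ["word.mp3"] if include_audio else []
    return ("deck+audio" if include_audio else "deck", media)


calls = []


def fake_write(deck: str, media: list, out_path: Path) -> None:
    calls.append((deck, len(media), out_path.name))


written = build_audio_variants(fake_build, fake_write, Path("plurals.apkg"), Path("plurals_audio.apkg"))
print(calls)
```

In `main()` the plurals branch might then pass `lambda audio: apkg_builder.build_plural_deck(dict_csv=dict_csv, include_audio=audio)` as `build` and a small wrapper around `apkg_builder.write_plural_apkg(deck, media, out_path=path)` as `write`. Whether that is worth the indirection is a judgment call for the author.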
--- a/scripts/extract_pdf_sentences.py
+++ b/scripts/extract_pdf_sentences.py
@ -0,0 +1,405 @@
 #!/usr/bin/env python3
 """
 Extract sentences from PDF books and match vocab words to sentences.
 1. Extract sentences from alice.pdf and lion_strawberry.pdf
 2. Merge into existing epub_sentence_index.json
 3. Match vocab words to sentences, produce vocab_sentence_matches.json
 """
 import json
 import os
 import re
 import sys
 # Resolve paths relative to the project root (this script lives in scripts/)
 BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
 # Use the venv with pymupdf
 sys.path.insert(0, os.path.join(BASE_DIR, "venv_pdf", "lib", "python3.11", "site-packages"))
 # Also need the main venv for pandas
 sys.path.insert(0, os.path.join(BASE_DIR, "lib", "python3.11", "site-packages"))
 import fitz
 import pandas as pd
 DATA_DIR = os.path.join(BASE_DIR, "data")
 EPUBS_DIR = os.path.join(DATA_DIR, "epubs")
 SENTENCE_INDEX = os.path.join(DATA_DIR, "epub_sentence_index.json")
 VOCAB_CSV = os.path.join(DATA_DIR, "hebrew_dict_for_anki.csv")
 MATCHES_FILE = os.path.join(DATA_DIR, "vocab_sentence_matches.json")
 NIKKUD_RE = re.compile(r"[\u0591-\u05C7]")
 HEBREW_RE = re.compile(r"[\u05d0-\u05ea]")
 HEBREW_CHAR_RE = re.compile(r"[\u05d0-\u05ea\ufb20-\ufb4f]")
 def strip_nikkud(text):
    """Remove all Hebrew nikkud/cantillation marks."""
    return NIKKUD_RE.sub("", text)
 def collapse_hebrew_spaces(text):
    """Collapse spaces between Hebrew letter fragments (for badly-encoded PDFs).
    Strategy: strip nikkud first, then iteratively remove spaces between
    Hebrew characters. Real word boundaries are detected by:
    - Final-form letters (ם ן ף ך ץ) followed by space
    - Punctuation (.,;:!?"')
    - Non-Hebrew characters
    """
    stripped = strip_nikkud(text)
    # Normalize presentation forms (U+FB2A-U+FB4F) to standard Hebrew letters.
    # FB2A = שׁ (shin+dot), FB2B = שׂ (sin+dot); the rest are dagesh/holam forms.
    # Final kaf/pe with dagesh (FB3A, FB43) map to the final letters so the
    # final-form boundary check below still recognizes them.
    base_map = {
        "\ufb2a": "ש",
        "\ufb2b": "ש",
        "\ufb35": "ו",
        "\ufb4b": "ו",
        "\ufb30": "א",
        "\ufb31": "ב",
        "\ufb32": "ג",
        "\ufb33": "ד",
        "\ufb34": "ה",
        "\ufb36": "ז",
        "\ufb38": "ט",
        "\ufb39": "י",
        "\ufb3a": "ך",
        "\ufb3b": "כ",
        "\ufb3c": "ל",
        "\ufb3e": "מ",
        "\ufb40": "נ",
        "\ufb41": "ס",
        "\ufb43": "ף",
        "\ufb44": "פ",
        "\ufb46": "צ",
        "\ufb47": "ק",
        "\ufb48": "ר",
        "\ufb49": "ש",
        "\ufb4a": "ת",
    }
    stripped = stripped.translate(str.maketrans(base_map))
    # Replace multiple spaces with single
    stripped = re.sub(r" {2,}", " ", stripped)
    # Now rebuild text, keeping spaces only at word boundaries
    # Word boundary markers: final-form letters, punctuation, non-Hebrew
    final_forms = set("םןףךץ")
    result = []
    i = 0
    chars = list(stripped)
    while i < len(chars):
        if chars[i] != " ":
            result.append(chars[i])
            i += 1
            continue
        # It's a space. Decide if it's a word boundary.
        # Look back for the last non-space character
        prev_ch = None
        for j in range(len(result) - 1, -1, -1):
            if result[j] != " ":
                prev_ch = result[j]
                break
        # Look forward for next non-space character
        next_ch = None
        for j in range(i + 1, len(chars)):
            if chars[j] != " ":
                next_ch = chars[j]
                break
        is_boundary = False
        # After final-form letter = word boundary
        if prev_ch and prev_ch in final_forms:
            is_boundary = True
        # Before/after punctuation or non-Hebrew = word boundary
        if prev_ch and not HEBREW_RE.match(prev_ch):
            is_boundary = True
        if next_ch and not HEBREW_RE.match(next_ch):
            is_boundary = True
        # If either side is not Hebrew at all, boundary
        if prev_ch is None or next_ch is None:
            is_boundary = True
        if is_boundary:
            result.append(" ")
        # else: skip the space (collapse intra-word gap)
        i += 1
    return "".join(result).strip()
 def extract_pdf_sentences(pdf_path, book_name):
    """Extract sentences from a PDF file."""
    doc = fitz.open(pdf_path)
    sentences = []
    for page_num in range(len(doc)):
        page = doc[page_num]
        text = page.get_text()
        if not text.strip():
            continue
        # Split into lines first, then split on sentence-ending punctuation
        lines = text.split("\n")
        raw_sentences = []
        for line in lines:
            line = line.strip()
            if not line:
                continue
            # Split on sentence-ending punctuation followed by space or at end
            parts = re.split(r"(?<=[.?!])\s+", line)
            raw_sentences.extend(parts)
        for sent in raw_sentences:
            sent = sent.strip()
            if not sent:
                continue
            # Must contain Hebrew characters
            if not HEBREW_RE.search(sent):
                continue
            # Create stripped version (no nikkud, collapsed spaces for PDF)
            stripped = collapse_hebrew_spaces(sent)
            # Count Hebrew words in stripped version
            words = [w for w in stripped.split() if HEBREW_RE.search(w)]
            word_count = len(words)
            # Filter: 4-15 Hebrew words
            if word_count < 4 or word_count > 15:
                continue
            # Drop metadata-like lines
            # Page numbers (just digits)
            if re.match(r"^\d+$", sent.strip()):
                continue
            # Copyright text
            if any(kw in sent.lower() for kw in ["copyright", "©", "isbn", "printed in"]):
                continue
            sentences.append(
                {
                    "text": sent,
                    "book": book_name,
                    "stripped": stripped,
                }
            )
    doc.close()
    return sentences
 def has_extractable_text(pdf_path):
    """Check if a PDF has extractable text."""
    doc = fitz.open(pdf_path)
    text_found = False
    for i in range(min(len(doc), 10)):
        if doc[i].get_text().strip():
            text_found = True
            break
    doc.close()
    return text_found
 def load_sentence_index():
    """Load existing sentence index."""
    if os.path.exists(SENTENCE_INDEX):
        with open(SENTENCE_INDEX, encoding="utf-8") as f:
            return json.load(f)
    return {"sentences": []}
 def save_sentence_index(data):
    """Save sentence index."""
    with open(SENTENCE_INDEX, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
 def match_vocab_to_sentences(sentences, vocab_df):
    """Match vocab words to sentences."""
    matches = {}
    # Build lookup: word_no_nikkud -> word_nikkud
    vocab_words = []
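    for _, row in vocab_df.iterrows():
        raw_no_nik = row.get("Word Without Nikkud", "")
        raw_nik = row.get("Word", "")
        # Skip rows with missing values; str(NaN) would otherwise yield the word "nan"
        if pd.isna(raw_no_nik) or pd.isna(raw_nik):
            continue
        word_no_nik = str(raw_no_nik).strip()
        word_nik = str(raw_nik).strip()
        if word_no_nik and word_nik:
            vocab_words.append((word_no_nik, word_nik))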
    print(f"Matching {len(vocab_words)} vocab words against {len(sentences)} sentences...")
    # Precompute: for each sentence, get the stripped text
    sent_data = []
    for s in sentences:
        stripped = s.get("stripped", "")
        # For PDF sentences, stripped already has collapsed spaces but words may be joined
        # For EPUB sentences, stripped has proper word spacing
        sent_data.append(
            {
                "text": s["text"],
                "book": s["book"],
                "stripped": stripped,
                "word_count": len(stripped.split()),
            }
        )
    matched_count = 0
    for word_no_nik, word_nik in vocab_words:
        if len(word_no_nik) < 2:
            continue
        # Build regex for word boundary matching
        # Use both approaches: proper word boundary and substring for PDF text
        pattern = re.compile(r"(?:^|\s)" + re.escape(word_no_nik) + r"(?:\s|$)")
        # For PDF texts with collapsed spaces, also try substring match
        # but only for words >= 3 chars to avoid false positives
        use_substring = len(word_no_nik) >= 3
        word_matches = []
        for sd in sent_data:
            stripped = sd["stripped"]
            # Try word-boundary match first
            if pattern.search(stripped):
                word_matches.append(sd)
            elif use_substring and word_no_nik in stripped:
                # Substring match for PDF texts with collapsed spaces
                # Verify it's not part of a longer word by checking the character
                # before and after in the collapsed text
                idx = stripped.find(word_no_nik)
                before_ok = idx == 0 or not HEBREW_RE.match(stripped[idx - 1])
                after_idx = idx + len(word_no_nik)
                after_ok = after_idx >= len(stripped) or not HEBREW_RE.match(stripped[after_idx])
                # Only count if at least one boundary is clear
                # (for PDF collapsed text, boundaries are often missing)
                # For PDF books, we accept substring matches
                if sd["book"] in ("אליס בארץ הפלאות", "האריה שאהב תות") or before_ok or after_ok:
                    word_matches.append(sd)
        if word_matches:
            matched_count += 1
            # Sort by preference: 6-12 words first (shorter first), then under 6 (longer first), then over 12 (shorter first)
            def score(sd):
                wc = sd["word_count"]
                if 6 <= wc <= 12:
                    return (0, wc)  # ideal range, prefer shorter
                if wc < 6:
                    return (1, -wc)  # too short
                return (2, wc)  # too long
            word_matches.sort(key=score)
            best = word_matches[:3]
            matches[word_no_nik] = {
                "word_nikkud": word_nik,
                "sentences": [{"text": m["text"], "book": m["book"]} for m in best],
            }
    print(
        f"Words with at least 1 match: {matched_count}/{len(vocab_words)} ({100 * matched_count / len(vocab_words):.1f}%)"
    )
    return matches
 def main():
    # ── Step 1: Extract from PDFs ──
    pdfs = [
        ("alice.pdf", "אליס בארץ הפלאות"),
        ("lion_strawberry.pdf", "האריה שאהב תות"),
    ]
    all_new_sentences = []
    for filename, book_name in pdfs:
        pdf_path = os.path.join(EPUBS_DIR, filename)
        if not os.path.exists(pdf_path):
            print(f"SKIP: {filename} not found")
            continue
        if not has_extractable_text(pdf_path):
            print(f"SKIP: {filename} has no extractable text (likely scanned images)")
            continue
        print(f"Extracting from {filename} ({book_name})...")
        sentences = extract_pdf_sentences(pdf_path, book_name)
        print(f"  Extracted {len(sentences)} sentences")
        all_new_sentences.extend(sentences)
    # ── Step 2: Merge with existing index ──
    index = load_sentence_index()
    existing_count = len(index["sentences"])
    # Deduplicate by (stripped, book)
    existing_keys = set()
    for s in index["sentences"]:
        key = (s.get("stripped", ""), s.get("book", ""))
        existing_keys.add(key)
    added = 0
    for s in all_new_sentences:
        key = (s["stripped"], s["book"])
        if key not in existing_keys:
            index["sentences"].append(s)
            existing_keys.add(key)
            added += 1
    save_sentence_index(index)
    total = len(index["sentences"])
    print(f"\nSentence index: {existing_count} existing + {added} new = {total} total")
    # ── Per-book stats ──
    book_counts = {}
    for s in index["sentences"]:
        book = s.get("book", "unknown")
        book_counts[book] = book_counts.get(book, 0) + 1
    print("\nSentences per book:")
    for book, count in sorted(book_counts.items(), key=lambda x: -x[1]):
        print(f"  {book}: {count}")
    # ── Step 3: Match vocab words to sentences ──
    print(f"\nLoading vocab from {VOCAB_CSV}...")
    vocab_df = pd.read_csv(VOCAB_CSV, sep=";", index_col=0)
    print(f"  {len(vocab_df)} vocab words loaded")
    matches = match_vocab_to_sentences(index["sentences"], vocab_df)
    with open(MATCHES_FILE, "w", encoding="utf-8") as f:
        json.dump(matches, f, ensure_ascii=False, indent=2)
    print(f"\nWrote {len(matches)} word matches to {MATCHES_FILE}")
    # ── Step 4: Summary stats ──
    total_words = len(vocab_df)
    matched_words = len(matches)
    print(f"\n{'=' * 50}")
    print("SUMMARY")
    print(f"{'=' * 50}")
    print(f"Total sentences: {total}")
    for book, count in sorted(book_counts.items(), key=lambda x: -x[1]):
        print(f"  {book}: {count}")
    print(f"Total vocab words: {total_words}")
    print(f"Words with sentences: {matched_words} ({100 * matched_words / total_words:.1f}%)")
    print(f"Words without sentences: {total_words - matched_words}")
 if __name__ == "__main__":
    main()
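Review note: the ranking used in `match_vocab_to_sentences` is easier to check in isolation. The sketch below copies the tier logic of the nested `score` function into a standalone function (the name `score_word_count` and the sample word counts are made up for illustration) and sorts a handful of counts with it:

```python
def score_word_count(word_count: int) -> tuple[int, int]:
    """Sort key: lower tuples rank first."""
    if 6 <= word_count <= 12:
        return (0, word_count)   # ideal range, shorter first
    if word_count < 6:
        return (1, -word_count)  # too short, longer first
    return (2, word_count)       # too long, shorter first


candidates = [4, 15, 7, 12, 5, 6, 13]  # invented word counts
ranked = sorted(candidates, key=score_word_count)
print(ranked)  # [6, 7, 12, 5, 4, 13, 15]
```

Note that within the too-short tier the negated count prefers the longer sentence, so a 5-word sentence outranks a 4-word one.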
--- a/scripts/extract_verb_list.py
+++ b/scripts/extract_verb_list.py
@ -21,9 +21,10 @@ from pathlib import Path
 logger = logging.getLogger(__name__)
-PDF_URL = "https://books.nevo.engineer/opds/download/117/pdf/"
+PROJECT_ROOT = Path(__file__).resolve().parent.parent
+PDF_URL = ""  # Set to URL or local path of Coffin & Bolozky PDF
 PDF_PATH = Path("/tmp/coffin_bolozky.pdf")
-OUTPUT_PATH = Path(__file__).parent / "verbs_input.txt"
+OUTPUT_PATH = PROJECT_ROOT / "verbs_input.txt"
 # Pages to scan (Appendix 1)
 PAGE_START = 390
@ -31,24 +32,38 @@ PAGE_END = 411
 # Binyan headings in Hebrew (vowelled and unvowelled variants)
 BINYAN_HEADINGS_HEB = [
-    "פָּעַל", "פעל",
-    "נִפְעַל", "נפעל",
-    "פִּעֵל", "פיעל",
-    "פֻּעַל", "פועל",
-    "הִתְפַּעֵל", "התפעל",
-    "הִפְעִיל", "הפעיל",
-    "הֻפְעַל", "הופעל",
+    "פָּעַל",
+    "פעל",
+    "נִפְעַל",
+    "נפעל",
+    "פִּעֵל",
+    "פיעל",
+    "פֻּעַל",
+    "פועל",
+    "הִתְפַּעֵל",
+    "התפעל",
+    "הִפְעִיל",
+    "הפעיל",
+    "הֻפְעַל",
+    "הופעל",
 ]
 # Binyan heading → canonical name
 BINYAN_CANONICAL = {
-    "פָּעַל": "Pa'al", "פעל": "Pa'al",
-    "נִפְעַל": "Nif'al", "נפעל": "Nif'al",
-    "פִּעֵל": "Pi'el", "פיעל": "Pi'el",
-    "פֻּעַל": "Pu'al", "פועל": "Pu'al",
-    "הִתְפַּעֵל": "Hitpa'el", "התפעל": "Hitpa'el",
-    "הִפְעִיל": "Hif'il", "הפעיל": "Hif'il",
-    "הֻפְעַל": "Huf'al", "הופעל": "Huf'al",
+    "פָּעַל": "Pa'al",
+    "פעל": "Pa'al",
+    "נִפְעַל": "Nif'al",
+    "נפעל": "Nif'al",
+    "פִּעֵל": "Pi'el",
+    "פיעל": "Pi'el",
+    "פֻּעַל": "Pu'al",
+    "פועל": "Pu'al",
+    "הִתְפַּעֵל": "Hitpa'el",
+    "התפעל": "Hitpa'el",
+    "הִפְעִיל": "Hif'il",
+    "הפעיל": "Hif'il",
+    "הֻפְעַל": "Huf'al",
+    "הופעל": "Huf'al",
 }
 # Passive binyan names — no infinitive, use 3ms past
@ -156,15 +171,16 @@ FALLBACK_VERBS = """# Verb list from Coffin & Bolozky, A Reference Grammar of Mo
 def _install_deps():
    """Install pymupdf and python-bidi if not available."""
    try:
-        import fitz  # noqa: F401
        import bidi  # noqa: F401
+        import fitz  # noqa: F401
        return True
    except ImportError:
        logger.info("Installing pymupdf and python-bidi …")
        import subprocess
        result = subprocess.run(
-            [sys.executable, "-m", "pip", "install",
-             "pymupdf", "python-bidi", "--break-system-packages", "-q"],
+            [sys.executable, "-m", "pip", "install", "pymupdf", "python-bidi", "--break-system-packages", "-q"],
            capture_output=True,
        )
        if result.returncode != 0:
@ -182,6 +198,7 @@ def _download_pdf() -> bool:
    logger.info(f"Downloading PDF from {PDF_URL} …")
    try:
        import requests
        resp = requests.get(PDF_URL, timeout=120, stream=True)
        resp.raise_for_status()
        PDF_PATH.write_bytes(resp.content)
@ -211,10 +228,7 @@ def _needs_bidi_fix(text: str) -> bool:
 def _strip_nikkud(text: str) -> str:
-    return "".join(
-        ch for ch in unicodedata.normalize("NFD", text)
-        if unicodedata.category(ch) != "Mn"
-    )
+    return "".join(ch for ch in unicodedata.normalize("NFD", text) if unicodedata.category(ch) != "Mn")
 def _extract_from_pdf() -> list[tuple[str, str, str]]:
@ -244,10 +258,9 @@ def _extract_from_pdf() -> list[tuple[str, str, str]]:
    # Check if we need bidi correction
    test_text = ""
    try:
-        for page_num in range(min(PAGE_START, doc.page_count - 1),
-                              min(PAGE_START + 3, doc.page_count)):
+        for page_num in range(min(PAGE_START, doc.page_count - 1), min(PAGE_START + 3, doc.page_count)):
            test_text += doc[page_num].get_text("text")
-    except Exception:
+    except Exception:  # noqa: S110
        pass
    use_bidi = _needs_bidi_fix(test_text)
@ -259,6 +272,7 @@ def _extract_from_pdf() -> list[tuple[str, str, str]]:
            return t
        try:
            from bidi.algorithm import get_display
            lines = t.split("\n")
            fixed = []
            for line in lines:
@ -274,7 +288,7 @@ def _extract_from_pdf() -> list[tuple[str, str, str]]:
    for page_num in range(PAGE_START - 1, page_end):  # fitz is 0-indexed
        try:
            raw = doc[page_num].get_text("text")
-        except Exception:
+        except Exception:  # noqa: S112
            continue
        text = fix_text(raw)
@ -316,9 +330,12 @@ def _extract_from_pdf() -> list[tuple[str, str, str]]:
                heb_words = re.findall(r"[\u05d0-\u05ea\u05b0-\u05c7]{3,}", line)
                for w in heb_words:
                    stripped_w = _strip_nikkud(w)
-                    if current_binyan == "Pu'al" and stripped_w.startswith("פ"):
-                        entries.append((current_binyan, "3ms", w))
-                    elif current_binyan == "Huf'al" and stripped_w.startswith("ה"):
+                    if (
+                        current_binyan == "Pu'al"
+                        and stripped_w.startswith("פ")
+                        or current_binyan == "Huf'al"
+                        and stripped_w.startswith("ה")
+                    ):
                        entries.append((current_binyan, "3ms", w))
    doc.close()
@ -357,16 +374,20 @@ def _write_output(entries: list[tuple[str, str, str]]) -> None:
            lines.append(form)
    OUTPUT_PATH.write_text("\n".join(lines) + "\n", encoding="utf-8")
-    verb_count = sum(1 for l in lines if l and not l.startswith("#"))
+    verb_count = sum(1 for ln in lines if ln and not ln.startswith("#"))
-    passive_count = sum(1 for l in lines if l.startswith("# 3ms:"))
+    passive_count = sum(1 for ln in lines if ln.startswith("# 3ms:"))
    logger.info(f"Written {verb_count} active verbs + {passive_count} passive (3ms) → {OUTPUT_PATH}")
 def _binyan_heb(name: str) -> str:
    mapping = {
-        "Pa'al": "פָּעַל", "Nif'al": "נִפְעַל", "Pi'el": "פִּעֵל",
-        "Pu'al": "פֻּעַל", "Hitpa'el": "הִתְפַּעֵל",
-        "Hif'il": "הִפְעִיל", "Huf'al": "הֻפְעַל",
+        "Pa'al": "פָּעַל",
+        "Nif'al": "נִפְעַל",
+        "Pi'el": "פִּעֵל",
+        "Pu'al": "פֻּעַל",
+        "Hitpa'el": "הִתְפַּעֵל",
+        "Hif'il": "הִפְעִיל",
+        "Huf'al": "הֻפְעַל",
    }
    return mapping.get(name, name)
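Review note: `_strip_nikkud` in this file removes marks by Unicode category after NFD normalization, whereas `scripts/extract_pdf_sentences.py` uses the regex range `[\u0591-\u05C7]`. The two are not equivalent. A small sketch of the category-based variant, written as a standalone function with sample inputs chosen for illustration:

```python
import unicodedata


def strip_nikkud(text: str) -> str:
    """Drop every nonspacing mark (category Mn) after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


pointed = "\u05e4\u05bc\u05b8\u05e2\u05b7\u05dc"  # pe + dagesh + qamats, ayin + patah, lamed
plain = strip_nikkud(pointed)                      # pe, ayin, lamed

# NFD also decomposes presentation forms, e.g. U+FB2A (shin with shin dot).
shin = strip_nikkud("\ufb2a")

# Maqaf (U+05BE) is punctuation, not a mark, so it survives here,
# although it falls inside the regex range used by the other script.
maqaf = strip_nikkud("\u05be")

print(plain, shin, maqaf)
```

One side effect to keep in mind: the category test is not Hebrew-specific, so accents on Latin text in the same string are stripped as well.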
--- a/scripts/scrape_ktiv_male.py
+++ b/scripts/scrape_ktiv_male.py
@ -0,0 +1,237 @@
 #!/usr/bin/env python3
 """
 Scrape ktiv male (plene/vowelless) forms from pealim.com.
 Uses hebstyle=vl cookie to get vowelless writing with matres lectionis.
 Builds a lookup: ktiv_male_form → [{word_nikkud, form_type, pos, slug}]
 This enables matching Hebrew text (which is normally in ktiv male)
 against our vocabulary, including conjugated verbs and noun plurals.
 """
 import json
 import logging
 import sys
 import time
 from pathlib import Path
 import requests
 from bs4 import BeautifulSoup
 logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
 logger = logging.getLogger(__name__)
 DATA_DIR = Path(__file__).resolve().parent.parent / "data"
 OUTPUT_PATH = DATA_DIR / "ktiv_male_forms.json"
 COOKIES = {"translit": "none", "hebstyle": "vl"}
 REQUEST_TIMEOUT = 15
 DELAY = 1.5  # seconds between requests
 def fetch_verb_ktiv_male(slug: str, infinitive_nikkud: str) -> list[dict]:
    """Fetch all conjugated forms in ktiv male for a verb."""
    url = f"https://www.pealim.com/dict/{slug}/"
    resp = requests.get(url, cookies=COOKIES, timeout=REQUEST_TIMEOUT)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    forms = []
    table = soup.find("table", class_="conjugation-table")
    if not table:
        return forms
    # Also get the infinitive from the page
    lead = soup.find("div", class_="lead")
    if lead:
        inf_spans = lead.find_all("span", class_="menukad")
        for s in inf_spans:
            ktiv = s.text.strip()
            if ktiv:
                forms.append(
                    {
                        "ktiv_male": ktiv,
                        "word_nikkud": infinitive_nikkud,
                        "form_type": "infinitive",
                        "pos": "Verb",
                        "slug": slug,
                    }
                )
    rows = table.find_all("tr")
    # Track forms already collected so the duplicate check is a set lookup
    seen = {f["ktiv_male"] for f in forms}
    for row in rows:
        menukad_spans = row.find_all("span", class_="menukad")
        for span in menukad_spans:
            ktiv = span.text.strip()
            if ktiv and ktiv not in seen:
                seen.add(ktiv)
                forms.append(
                    {
                        "ktiv_male": ktiv,
                        "word_nikkud": infinitive_nikkud,
                        "form_type": "conjugation",
                        "pos": "Verb",
                        "slug": slug,
                    }
                )
    return forms
 def fetch_noun_ktiv_male(slug: str, singular_nikkud: str, gender: str) -> list[dict]:
    """Fetch noun declension forms in ktiv male."""
    url = f"https://www.pealim.com/dict/{slug}/"
    resp = requests.get(url, cookies=COOKIES, timeout=REQUEST_TIMEOUT)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    forms = []
    table = soup.find("table", class_="conjugation-table")
    if not table:
        return forms
    rows = table.find_all("tr")
    form_labels = ["absolute_singular", "absolute_plural", "construct_singular", "construct_plural"]
    label_idx = 0
    for row in rows:
        menukad_spans = row.find_all("span", class_="menukad")
        for span in menukad_spans:
            ktiv = span.text.strip()
            if ktiv:
                ft = form_labels[label_idx] if label_idx < len(form_labels) else "other"
                forms.append(
                    {
                        "ktiv_male": ktiv,
                        "word_nikkud": singular_nikkud,
                        "form_type": ft,
                        "pos": "Noun",
                        "slug": slug,
                        "gender": gender,
                    }
                )
                label_idx += 1
    return forms
 def scrape_verbs() -> list[dict]:
    """Scrape ktiv male forms for all verbs in conjugations.json."""
    conj_path = DATA_DIR / "conjugations.json"
    if not conj_path.exists():
        logger.warning("No conjugations.json found")
        return []
    with open(conj_path, encoding="utf-8") as f:
        conjugations = json.load(f)
    all_forms = []
    slugs_done = set()
    for verb, data in conjugations.items():
        if not data or not data.get("slug"):
            continue
        slug = data["slug"]
        if slug in slugs_done:
            continue
        slugs_done.add(slug)
        try:
            forms = fetch_verb_ktiv_male(slug, verb)
            all_forms.extend(forms)
            logger.info(f"  Verb {verb} ({slug}): {len(forms)} forms")
        except Exception as e:
            logger.warning(f"  Verb {verb} ({slug}) failed: {e}")
        time.sleep(DELAY)
    return all_forms
 def scrape_nouns() -> list[dict]:
    """Scrape ktiv male forms for all nouns in noun_slug_map.json."""
    slug_path = DATA_DIR / "noun_slug_map.json"
    if not slug_path.exists():
        logger.warning("No noun_slug_map.json found")
        return []
    with open(slug_path, encoding="utf-8") as f:
        slug_map = json.load(f)
    # Also load existing plurals to get nikkud singular form
    plurals_path = DATA_DIR / "noun_plurals.json"
    plurals = {}
    if plurals_path.exists():
        with open(plurals_path, encoding="utf-8") as f:
            plurals = json.load(f)
    all_forms = []
    done = 0
    total = len(slug_map)
    for word, info in slug_map.items():
        slug = info.get("slug", "")
        if not slug:
            continue
        # Get nikkud form from plurals data or slug map
        nikkud = info.get("word_nikkud", word)
        if word in plurals:
            nikkud = plurals[word].get("singular", nikkud)
        gender = info.get("gender", "")
        try:
            forms = fetch_noun_ktiv_male(slug, nikkud, gender)
            all_forms.extend(forms)
            done += 1
            if done % 50 == 0:
                logger.info(f"  Nouns: {done}/{total} ({len(all_forms)} forms)")
                # Save incrementally
                _save_forms(all_forms, partial=True)
        except Exception as e:
            logger.warning(f"  Noun {word} ({slug}) failed: {e}")
            done += 1
        time.sleep(DELAY)
    return all_forms
 def _save_forms(all_forms: list[dict], partial: bool = False):
    """Build and save the ktiv male lookup dict."""
    lookup: dict[str, list[dict]] = {}
    for entry in all_forms:
        ktiv = entry["ktiv_male"]
        # Don't include ktiv_male in the stored entry (it's the key)
        stored = {k: v for k, v in entry.items() if k != "ktiv_male"}
        lookup.setdefault(ktiv, []).append(stored)
    suffix = ".partial" if partial else ""
    out = OUTPUT_PATH.parent / (OUTPUT_PATH.name + suffix)
    with open(out, "w", encoding="utf-8") as f:
        json.dump(lookup, f, ensure_ascii=False, indent=1)
    logger.info(f"  Saved {len(lookup)} unique ktiv male forms → {out}")
 def main():
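    mode = sys.argv[1] if len(sys.argv) > 1 else "all"
    # Reject unknown modes early; otherwise an empty lookup would overwrite the output file
    if mode not in ("all", "verbs", "nouns"):
        sys.exit(f"Unknown mode {mode!r}; expected 'all', 'verbs', or 'nouns'")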
    all_forms = []
    if mode in ("all", "verbs"):
        logger.info("=== Scraping verb ktiv male forms ===")
        verb_forms = scrape_verbs()
        all_forms.extend(verb_forms)
        logger.info(f"Verbs done: {len(verb_forms)} forms from {len({f['slug'] for f in verb_forms})} verbs")
    if mode in ("all", "nouns"):
        logger.info("=== Scraping noun ktiv male forms ===")
        noun_forms = scrape_nouns()
        all_forms.extend(noun_forms)
        logger.info(f"Nouns done: {len(noun_forms)} forms")
    _save_forms(all_forms)
    logger.info(f"Total: {len(all_forms)} forms → {OUTPUT_PATH}")
 if __name__ == "__main__":
    main()
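Review note: `_save_forms` writes a lookup keyed by ktiv male form, with a list of entries per key. A minimal sketch of how such a lookup might be used to find vocabulary in a sentence, assuming simple tokenization on runs of Hebrew letters. The helper names `build_lookup` and `match_sentence`, the sample entries, and the placeholder slugs are invented for illustration and are not taken from the scraped data:

```python
import re

HEBREW_TOKEN_RE = re.compile(r"[\u05d0-\u05ea]+")


def build_lookup(all_forms: list[dict]) -> dict[str, list[dict]]:
    """Same grouping as _save_forms: the ktiv male form is the key, not a stored field."""
    lookup: dict[str, list[dict]] = {}
    for entry in all_forms:
        stored = {k: v for k, v in entry.items() if k != "ktiv_male"}
        lookup.setdefault(entry["ktiv_male"], []).append(stored)
    return lookup


def match_sentence(sentence: str, lookup: dict[str, list[dict]]) -> list[tuple[str, list[dict]]]:
    """Return (token, entries) for every Hebrew token that is a known form."""
    return [(tok, lookup[tok]) for tok in HEBREW_TOKEN_RE.findall(sentence) if tok in lookup]


# Invented sample entries with placeholder slugs.
sample_forms = [
    {"ktiv_male": "ספרים", "word_nikkud": "סֵפֶר", "form_type": "absolute_plural",
     "pos": "Noun", "slug": "placeholder-noun", "gender": "masculine"},
    {"ktiv_male": "כותב", "word_nikkud": "לִכְתֹּב", "form_type": "conjugation",
     "pos": "Verb", "slug": "placeholder-verb"},
]

lookup = build_lookup(sample_forms)
hits = match_sentence("הוא כותב ספרים.", lookup)
for token, entries in hits:
    print(token, [e["form_type"] for e in entries])
```

Exact token lookup like this does not handle attached prefixes such as ה, ו, or ב, so a real matcher would need an extra step for those.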
--- a/scripts/scrape_noun_plurals.py
+++ b/scripts/scrape_noun_plurals.py
@ -0,0 +1,365 @@
 #!/usr/bin/env python3
 """
 Scrape pealim.com for noun plural and construct forms.
 Step 1: Collect noun slugs from list pages (/dict/?pos=noun&page=N)
 Step 2: Fetch detail pages for plural + construct forms
 Step 3: Print summary statistics
 """
 import json
 import re
 import time
 from pathlib import Path
 import requests
 from bs4 import BeautifulSoup
 BASE_URL = "https://www.pealim.com"
 COOKIES = {"translit": "none", "hebstyle": "mo"}
 HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; PealimScraper/1.0)"}
 DATA_DIR = Path(__file__).resolve().parent.parent / "data"
 SLUG_MAP_FILE = DATA_DIR / "noun_slug_map.json"
 PROGRESS_FILE = DATA_DIR / "noun_slug_map_progress.json"
 PLURALS_FILE = DATA_DIR / "noun_plurals.json"
 DELAY = 1.5  # seconds between requests
 def load_json(path, default=None):
    if path.exists():
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    return default if default is not None else {}
 def save_json(path, data):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
 def fetch_with_retry(url, max_retries=5):
    """Fetch URL with exponential backoff."""
    for attempt in range(max_retries):
        try:
            r = requests.get(url, cookies=COOKIES, headers=HEADERS, timeout=30)
            r.raise_for_status()
            return r
        except (requests.RequestException, ConnectionError) as e:
            wait = min(2**attempt * 2, 60)
            print(f"  Retry {attempt + 1}/{max_retries} for {url}: {e} (waiting {wait}s)")
            time.sleep(wait)
    print(f"  FAILED after {max_retries} retries: {url}")
    return None
 def get_total_pages():
    """Get total number of noun list pages."""
    r = fetch_with_retry(f"{BASE_URL}/dict/?pos=noun&page=1")
    if not r:
        return 0
    soup = BeautifulSoup(r.text, "lxml")
    pages = set()
    for a in soup.select("ul.pagination li a"):
        href = a.get("href", "")
        m = re.search(r"page=(\d+)", href)
        if m:
            pages.add(int(m.group(1)))
    return max(pages) if pages else 1
 def parse_list_page(html):
    """Parse a noun list page and return list of noun entries."""
    soup = BeautifulSoup(html, "lxml")
    table = soup.select_one("table.dict-table")
    if not table:
        return []
    entries = []
    for row in table.select("tr")[1:]:  # skip header
        tds = row.select("td")
        if len(tds) < 3:
            continue
        # First td: word + link
        first_td = tds[0]
        a = first_td.select_one("a")
        if not a:
            continue
        href = a.get("href", "")
        slug_match = re.search(r"/dict/([^/]+)/", href)
        if not slug_match:
            continue
        slug = slug_match.group(1)
        menukad = first_td.select_one("span.menukad")
        word_nikkud = menukad.get_text(strip=True) if menukad else ""
        # Word without nikkud (strip combining marks)
        word_plain = re.sub(r"[\u0591-\u05C7]", "", word_nikkud)
        # Third td: part of speech
        pos_text = tds[2].get_text(strip=True)
        # Gender
        gender = ""
        if "masculine" in pos_text.lower():
            gender = "masculine"
        elif "feminine" in pos_text.lower():
            gender = "feminine"
        # Mishkal pattern
        mishkal = ""
        m = re.search(r"(\w+)\s*pattern", pos_text.lower())
        if m:
            mishkal = m.group(1)
        entries.append(
            {
                "word_plain": word_plain,
                "slug": slug,
                "word_nikkud": word_nikkud,
                "pos": pos_text,
                "gender": gender,
                "mishkal": mishkal,
            }
        )
    return entries
 def step1_collect_slugs():
    """Step 1: Collect noun slugs from list pages."""
    print("=" * 60)
    print("STEP 1: Collecting noun slugs from list pages")
    print("=" * 60)
    slug_map = load_json(SLUG_MAP_FILE, {})
    progress = load_json(PROGRESS_FILE, [])
    completed_pages = set(progress) if isinstance(progress, list) else set()
    # Get total pages
    total_pages = get_total_pages()
    print(f"Total pages: {total_pages}")
    print(f"Already completed: {len(completed_pages)} pages, {len(slug_map)} nouns")
    remaining = [p for p in range(1, total_pages + 1) if p not in completed_pages]
    print(f"Remaining pages: {len(remaining)}")
    if not remaining:
        print("All pages already scraped!")
        return slug_map
    for i, page_num in enumerate(remaining):
        url = f"{BASE_URL}/dict/?pos=noun&page={page_num}"
        r = fetch_with_retry(url)
        if not r:
            print(f"  Skipping page {page_num}")
            continue
        entries = parse_list_page(r.text)
        for entry in entries:
            word = entry["word_plain"]
            slug_map[word] = {
                "slug": entry["slug"],
                "word_nikkud": entry["word_nikkud"],
                "pos": entry["pos"],
                "gender": entry["gender"],
                "mishkal": entry["mishkal"],
            }
        completed_pages.add(page_num)
        done = len(completed_pages)
        print(f"  Page {page_num} ({done}/{total_pages}): {len(entries)} nouns (total: {len(slug_map)})")
        # Save progress every 10 pages
        if (i + 1) % 10 == 0 or page_num == remaining[-1]:
            save_json(SLUG_MAP_FILE, slug_map)
            save_json(PROGRESS_FILE, sorted(completed_pages))
            print(f"  [Saved progress: {len(slug_map)} nouns, {done} pages]")
        time.sleep(DELAY)
    # Final save
    save_json(SLUG_MAP_FILE, slug_map)
    save_json(PROGRESS_FILE, sorted(completed_pages))
    print(f"\nStep 1 complete: {len(slug_map)} total nouns from {len(completed_pages)} pages")
    return slug_map
 def parse_detail_page(html, slug, gender, mishkal):
    """Parse a noun detail page for plural/construct forms."""
    soup = BeautifulSoup(html, "lxml")
    tables = soup.select("table.conjugation-table")
    if not tables:
        return None
    table = tables[0]
    rows = table.select("tr")
    result = {
        "slug": slug,
        "singular": "",
        "singular_audio": "",
        "plural": "",
        "plural_audio": "",
        "construct_singular": "",
        "construct_plural": "",
        "gender": gender,
        "mishkal": mishkal,
    }
    for row in rows:
        th = row.select_one("th")
        if not th:
            continue
        label = th.get_text(strip=True).lower()
        tds = row.select("td")
        if "absolute" in label:
            if len(tds) >= 1:
                td = tds[0]
                m = td.select_one("span.menukad")
                result["singular"] = m.get_text(strip=True) if m else ""
                audio_el = td.select_one("[data-audio]")
                result["singular_audio"] = audio_el.get("data-audio", "") if audio_el else td.get("data-audio", "")
            if len(tds) >= 2:
                td = tds[1]
                m = td.select_one("span.menukad")
                result["plural"] = m.get_text(strip=True) if m else ""
                audio_el = td.select_one("[data-audio]")
                result["plural_audio"] = audio_el.get("data-audio", "") if audio_el else td.get("data-audio", "")
        elif "construct" in label:
            if len(tds) >= 1:
                td = tds[0]
                m = td.select_one("span.menukad")
                result["construct_singular"] = m.get_text(strip=True) if m else ""
            if len(tds) >= 2:
                td = tds[1]
                m = td.select_one("span.menukad")
                result["construct_plural"] = m.get_text(strip=True) if m else ""
    return result
 def step2_fetch_plurals(slug_map):
    """Step 2: Fetch detail pages for plural + construct forms."""
    print("\n" + "=" * 60)
    print("STEP 2: Fetching plural + construct forms from detail pages")
    print("=" * 60)
    plurals = load_json(PLURALS_FILE, {})
    already_done = set(plurals.keys())
    # Build work list: nouns not yet in plurals
    work = []
    for word, info in slug_map.items():
        if word not in already_done:
            work.append((word, info))
    print(f"Already have plural data: {len(already_done)}")
    print(f"Remaining to fetch: {len(work)}")
    if not work:
        print("All nouns already have plural data!")
        return plurals
    skipped = 0
    for i, (word, info) in enumerate(work):
        slug = info["slug"]
        url = f"{BASE_URL}/dict/{slug}/"
        r = fetch_with_retry(url)
        if not r:
            print(f"  Skipping {word} ({slug})")
            skipped += 1
            continue
        entry = parse_detail_page(r.text, slug, info.get("gender", ""), info.get("mishkal", ""))
        if entry:
            plurals[word] = entry
        else:
            # No declension table - store minimal entry
            plurals[word] = {
                "slug": slug,
                "singular": info.get("word_nikkud", ""),
                "singular_audio": "",
                "plural": "",
                "plural_audio": "",
                "construct_singular": "",
                "construct_plural": "",
                "gender": info.get("gender", ""),
                "mishkal": info.get("mishkal", ""),
                "no_declension_table": True,
            }
        done = len(already_done) + i + 1 - skipped
        total = len(already_done) + len(work)
        if (i + 1) % 50 == 0 or i == 0:
            print(
                f"  [{i + 1}/{len(work)}] {word} ({slug}): "
                f"plural={entry['plural'] if entry else 'N/A'} "
                f"(total: {done}/{total})"
            )
        # Save every 50 entries
        if (i + 1) % 50 == 0 or i == len(work) - 1:
            save_json(PLURALS_FILE, plurals)
            print(f"  [Saved: {len(plurals)} entries]")
        time.sleep(DELAY)
    save_json(PLURALS_FILE, plurals)
    print(f"\nStep 2 complete: {len(plurals)} total noun entries with plural data")
    return plurals
 def step3_summary(slug_map, plurals):
    """Step 3: Print summary statistics."""
    print("\n" + "=" * 60)
    print("SUMMARY")
    print("=" * 60)
    total_slugs = len(slug_map)
    total_plurals = len(plurals)
    has_plural = sum(1 for v in plurals.values() if v.get("plural"))
    has_construct = sum(1 for v in plurals.values() if v.get("construct_singular") or v.get("construct_plural"))
    has_audio = sum(1 for v in plurals.values() if v.get("singular_audio") or v.get("plural_audio"))
    no_table = sum(1 for v in plurals.values() if v.get("no_declension_table"))
    # Irregular plurals: masculine with ות- ending, feminine with ים- ending
    irregular = 0
    for _word, v in plurals.items():
        plural = v.get("plural", "")
        gender = v.get("gender", "")
        if not plural or not gender:
            continue
        plain_plural = re.sub(r"[\u0591-\u05C7]", "", plural)
        if (
            gender == "masculine"
            and plain_plural.endswith("ות")
            or gender == "feminine"
            and plain_plural.endswith("ים")
        ):
            irregular += 1
    print(f"Total nouns in slug map:       {total_slugs}")
    print(f"Total nouns with plural data:  {total_plurals}")
    print(f"  - With plural form:          {has_plural}")
    print(f"  - With construct forms:       {has_construct}")
    print(f"  - With audio URLs:            {has_audio}")
    print(f"  - No declension table:        {no_table}")
    print(f"  - Irregular plurals:          {irregular}")
 def main():
    print("Pealim Noun Plural Scraper")
    print(f"Data directory: {DATA_DIR}")
    print()
    slug_map = step1_collect_slugs()
    plurals = step2_fetch_plurals(slug_map)
    step3_summary(slug_map, plurals)
 if __name__ == "__main__":
    main()
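Review note: the irregular-plural count in `step3_summary` relies on a gender/ending mismatch heuristic. A minimal sketch of that rule as a standalone predicate follows. The name `is_irregular_plural` is hypothetical, and the rule is only the heuristic used in the summary, not a full account of Hebrew plural irregularity.

```python
import re

# Same range the scraper strips before comparing endings.
NIKKUD_RE = re.compile(r"[\u0591-\u05C7]")


def is_irregular_plural(plural: str, gender: str) -> bool:
    """Return True when the plural ending contradicts the noun's gender."""
    plain = NIKKUD_RE.sub("", plural)
    if not plain or not gender:
        return False
    if gender == "masculine":
        return plain.endswith("ות")  # masculine noun with a feminine-style ending
    if gender == "feminine":
        return plain.endswith("ים")  # feminine noun with a masculine-style ending
    return False
```

For example, `is_irregular_plural("שולחנות", "masculine")` is true, while `is_irregular_plural("ספרים", "masculine")` is false.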
--- a/scripts/scrape_verb_ktiv.py
+++ b/scripts/scrape_verb_ktiv.py
@ -0,0 +1,250 @@
 #!/usr/bin/env python3
 """Scrape ktiv male (vowelless plene) conjugation forms for top 500 verbs from pealim.com."""
 import json
 import os
 import re
 import sys
 import time
 sys.stdout.reconfigure(line_buffering=True)
 import requests  # noqa: E402
 from bs4 import BeautifulSoup  # noqa: E402
 DATA_DIR = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "data")
 INPUT_FILE = os.path.join(DATA_DIR, "top_verbs_to_scrape.json")
 OUTPUT_FILE = os.path.join(DATA_DIR, "ktiv_male_forms.json")
 PARTIAL_FILE = os.path.join(DATA_DIR, "ktiv_male_forms_partial.json")
 PROGRESS_FILE = os.path.join(DATA_DIR, "ktiv_scrape_progress.json")
 COOKIES = {"translit": "none", "hebstyle": "vl"}
 HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; PealimScraper/1.0)"}
 DELAY = 1.5
 session = requests.Session()
 session.cookies.update(COOKIES)
 session.headers.update(HEADERS)
 def load_json(path):
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    return {}
 def save_json(data, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=1)
 def search_slug(wni):
    """Search pealim for a verb and return the first result's slug."""
    url = "https://www.pealim.com/search/"
    resp = session.get(url, params={"q": wni}, timeout=15)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Look for result links like /dict/SLUG/
    for a in soup.select("a[href]"):
        href = a["href"]
        m = re.match(r"/dict/(\d+-[^/]+)/", href)
        if m:
            return m.group(1)
    return None
 def scrape_verb_forms(slug):
    """Fetch a verb's detail page and extract all ktiv male conjugation forms."""
    url = f"https://www.pealim.com/dict/{slug}/"
    resp = session.get(url, timeout=15)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    forms = set()
    # Get infinitive from div.lead or page title
    lead = soup.select_one("div.lead")
    if lead:
        menukad_spans = lead.select("span.menukad")
        for span in menukad_spans:
            text = span.get_text(strip=True)
            if text:
                forms.add(text)
    # Get word_nikkud (the nikkud form of the infinitive) from the page
    # We need to fetch with mo cookie for that, but we already have it from input data
    # Instead, get the page title which usually has the nikkud form
    word_nikkud = None
    title = soup.select_one("h1")
    if title:
        menukad_in_title = title.select_one("span.menukad")
        if menukad_in_title:
            word_nikkud = menukad_in_title.get_text(strip=True)
    # Get ALL span.menukad elements from conjugation tables
    for span in soup.select("span.menukad"):
        text = span.get_text(strip=True)
        if text:
            forms.add(text)
    return forms, word_nikkud
 def main():
    verbs = load_json(INPUT_FILE)
    if not verbs:
        print("ERROR: No verbs found in input file")
        sys.exit(1)
    # Load existing forms
    existing_forms = load_json(OUTPUT_FILE)
    new_forms = {}  # Will be merged into existing at the end
    # Load progress to resume
    progress = load_json(PROGRESS_FILE)
    done_wnis = set(progress.get("done_wnis", []))
    slug_cache = progress.get("slug_cache", {})
    # Pre-populate slug cache from conjugations.json
    conj_file = os.path.join(DATA_DIR, "conjugations.json")
    if os.path.exists(conj_file):
        conj_data = load_json(conj_file)
        for wni_key, cdata in conj_data.items():
            if isinstance(cdata, dict) and "slug" in cdata and wni_key not in slug_cache:
                slug_cache[wni_key] = cdata["slug"]
        print(f"Pre-populated {len(slug_cache)} slugs from conjugations.json")
    # Deduplicate verbs by wni
    seen_wni = set()
    unique_verbs = []
    for v in verbs:
        if v["wni"] not in seen_wni:
            seen_wni.add(v["wni"])
            unique_verbs.append(v)
    total = len(unique_verbs)
    to_scrape = [v for v in unique_verbs if v["wni"] not in done_wnis]
    print(f"Total unique verbs: {total}, already done: {total - len(to_scrape)}, to scrape: {len(to_scrape)}")
    scraped_count = 0
    skipped_count = 0
    total_new_forms = 0
    sample_verbs = {}  # For summary: wni -> list of forms
    for i, verb in enumerate(to_scrape):
        wni = verb["wni"]
        word_nikkud_input = verb["word"]
        try:
            # Step 1: Find slug
            if wni in slug_cache:
                slug = slug_cache[wni]
            else:
                slug = search_slug(wni)
                time.sleep(DELAY)
            if not slug:
                print(f"  [{i + 1}/{len(to_scrape)}] SKIP {wni} - not found on pealim")
                skipped_count += 1
                done_wnis.add(wni)
                continue
            slug_cache[wni] = slug
            # Step 2: Scrape forms
            forms, page_nikkud = scrape_verb_forms(slug)
            time.sleep(DELAY)
            # Use the nikkud form from our input data (more reliable)
            nikkud_to_use = word_nikkud_input
            # Build entries for each form
            for form in forms:
                entry = {
                    "word_nikkud": nikkud_to_use,
                    "form_type": "conjugation",
                    "pos": "Verb",
                    "slug": slug,
                }
                if form not in new_forms:
                    new_forms[form] = []
                # Check for duplicate entry
                if not any(e["slug"] == slug for e in new_forms[form]):
                    new_forms[form].append(entry)
                    total_new_forms += 1
            scraped_count += 1
            # Collect samples (first 3 completed)
            if len(sample_verbs) < 3:
                sample_verbs[wni] = sorted(forms)
            print(f"  [{i + 1}/{len(to_scrape)}] {wni} -> {slug} ({len(forms)} forms)")
            done_wnis.add(wni)
        except Exception as e:
            print(f"  [{i + 1}/{len(to_scrape)}] ERROR {wni}: {e}")
            skipped_count += 1
            done_wnis.add(wni)
        # Save progress every 50 verbs
        if (i + 1) % 50 == 0:
            progress = {"done_wnis": list(done_wnis), "slug_cache": slug_cache}
            save_json(progress, PROGRESS_FILE)
            # Save partial merged result
            merged = dict(existing_forms)
            for form, entries in new_forms.items():
                if form in merged:
                    existing_slugs = {e["slug"] for e in merged[form]}
                    for entry in entries:
                        if entry["slug"] not in existing_slugs:
                            merged[form].append(entry)
                else:
                    merged[form] = entries
            save_json(merged, PARTIAL_FILE)
            print(f"  -- Progress saved at {i + 1}/{len(to_scrape)} --")
    # Final merge
    merged = dict(existing_forms)
    for form, entries in new_forms.items():
        if form in merged:
            existing_slugs = {e["slug"] for e in merged[form]}
            for entry in entries:
                if entry["slug"] not in existing_slugs:
                    merged[form].append(entry)
        else:
            merged[form] = entries
    save_json(merged, OUTPUT_FILE)
    # Save final progress
    progress = {"done_wnis": list(done_wnis), "slug_cache": slug_cache}
    save_json(progress, PROGRESS_FILE)
    # Clean up partial file
    if os.path.exists(PARTIAL_FILE):
        os.remove(PARTIAL_FILE)
    # Summary
    print(f"\n{'=' * 50}")
    print("SUMMARY")
    print(f"{'=' * 50}")
    print(f"Verbs scraped:         {scraped_count}")
    print(f"Verbs skipped:         {skipped_count}")
    print(f"New forms added:       {total_new_forms}")
    print(f"Total unique ktiv male forms: {len(merged)}")
    print(f"Previous forms count:  {len(existing_forms)}")
    print(f"Net new form keys:     {len(merged) - len(existing_forms)}")
    if sample_verbs:
        print("\nSample verbs:")
        for wni, forms in list(sample_verbs.items())[:3]:
            print(f"\n  {wni} ({len(forms)} forms):")
            for f in forms[:8]:
                print(f"    {f}")
            if len(forms) > 8:
                print(f"    ... and {len(forms) - 8} more")
 if __name__ == "__main__":
    main()
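Review note: the merge logic in `main()` appears twice, once for the partial save and once for the final save. One way to factor it out is sketched below. The helper name `merge_forms` is hypothetical. Unlike the `dict(existing_forms)` shallow copy in the script, this version copies each list so the inputs are left untouched.

```python
def merge_forms(existing, new_forms):
    """Merge form -> entries mappings, keeping one entry per slug for each form."""
    # Copy the lists so neither input mapping is mutated.
    merged = {form: list(entries) for form, entries in existing.items()}
    for form, entries in new_forms.items():
        bucket = merged.setdefault(form, [])
        seen = {e["slug"] for e in bucket}
        for entry in entries:
            if entry["slug"] not in seen:
                bucket.append(entry)
                seen.add(entry["slug"])
    return merged
```

Both save points could then call `merge_forms(existing_forms, new_forms)` and write the result.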
--- a/test_scrape.py
+++ b/test_scrape.py
@ -1,31 +0,0 @@
 #!/usr/bin/env python3
 import requests
 from bs4 import BeautifulSoup
 word = 'אבל'
 url = f'https://www.pealim.com/search/?q={word}'
 headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
 }
 try:
    response = requests.get(url, headers=headers, timeout=10)
    print(f'Status: {response.status_code}')
    soup = BeautifulSoup(response.content, 'html.parser')
    # Debug: check what we find
    word_elem = soup.find('h1', class_='word-title')
    pos_elem = soup.find('span', class_='pos')
    definition_elem = soup.find('div', class_='definition')
    print(f'word_elem found: {word_elem is not None}')
    print(f'pos_elem found: {pos_elem is not None}')
    print(f'definition_elem found: {definition_elem is not None}')
    print('\n--- HTML snippet (first 3000 chars) ---')
    print(soup.prettify()[:3000])
 except Exception as e:
    print(f'Error: {e}')
    import traceback
    traceback.print_exc()
--- a/tests/__init__.py
+++ b/tests/__init__.py
--- a/tests/test_smoke.py
+++ b/tests/test_smoke.py
@ -0,0 +1,45 @@
 """Smoke tests for the Hebrew Flash Cards project."""
 import sys
 from pathlib import Path
 # Ensure project root is on path
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
 def test_helpers_strip_nikkud():
    from helpers import strip_nikkud
    assert strip_nikkud("שָׁלוֹם") == "שלום"
    assert strip_nikkud("hello") == "hello"
    assert strip_nikkud("") == ""
 def test_apkg_builder_imports():
    import apkg_builder
    assert hasattr(apkg_builder, "build_vocab_deck")
    assert hasattr(apkg_builder, "build_conj_deck")
    assert apkg_builder.VOCAB_MODEL_ID == 1_701_222_017_968
 def test_data_files_exist():
    data_dir = Path(__file__).resolve().parent.parent / "data"
    assert (data_dir / "hebrew_dict_for_anki.csv").exists(), "vocab CSV missing"
    assert (data_dir / "conjugations.json").exists(), "conjugations cache missing"
 def test_strip_nikkud_idempotent():
    from helpers import strip_nikkud
    plain = "שלום"
    assert strip_nikkud(plain) == plain
 def test_strip_nikkud_all_marks():
    from helpers import strip_nikkud
    # Covers the marks present in this word: patach, dagesh, shva, kamatz
    nikkud = "הַמַּלְכָּה"
    plain = strip_nikkud(nikkud)
    assert all(ch < "\u0591" or ch > "\u05C7" for ch in plain), f"Residual nikkud in: {plain}"
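Review note: `helpers.py` is not shown in this part of the diff, so the following is only a guess at what `strip_nikkud` might look like. It assumes the same U+0591 to U+05C7 range that `scrape_noun_plurals.py` uses inline. That range also contains a few punctuation characters such as maqaf (U+05BE), so they are removed as well.

```python
import re

# Assumed range: Hebrew accents and points, plus the punctuation interleaved with them.
_NIKKUD_RE = re.compile(r"[\u0591-\u05C7]")


def strip_nikkud(text: str) -> str:
    """Remove Hebrew vowel points and cantillation marks from text."""
    return _NIKKUD_RE.sub("", text)
```

If maqaf should survive stripping, the pattern would need to exclude U+05BE explicitly.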
--- a/validate_apkg.py
+++ b/validate_apkg.py
@ -14,7 +14,6 @@ import json
 import os
 import re
 import sqlite3
-import struct
 import sys
 import tempfile
 import zipfile
@ -22,6 +21,9 @@ from pathlib import Path
 VOCAB_APKG = Path("output/hebrew_vocabulary.apkg")
 CONJ_APKG = Path("output/hebrew_conjugations.apkg")
+CONF_APKG = Path("output/hebrew_confusables.apkg")
+PLURAL_APKG = Path("output/hebrew_plurals.apkg")
+COMPLETE_APKG = Path("output/hebrew_complete.apkg")
 PASS = "\033[32m✓\033[0m"
 FAIL = "\033[31m✗\033[0m"
@ -60,10 +62,9 @@ def _detect_format(data: bytes) -> str:
 def validate_apkg(apkg_path: Path) -> int:
    """Run all checks. Returns number of failures."""
-    name = apkg_path.name
-    print(f"\n{'='*60}")
+    print(f"\n{'=' * 60}")
    print(f"  Validating: {apkg_path}")
-    print(f"{'='*60}")
+    print(f"{'=' * 60}")
    failures = 0
@ -78,16 +79,17 @@ def validate_apkg(apkg_path: Path) -> int:
    print("\n[ZIP structure]")
    try:
        zf = zipfile.ZipFile(apkg_path)
+    except zipfile.BadZipFile as e:
+        print(f"  {FAIL}  Invalid ZIP: {e}")
+        return 1
+    with zf, tempfile.TemporaryDirectory() as tmpdir:
         namelist = zf.namelist()
         has_db = "collection.anki2" in namelist
         has_media = "media" in namelist
         failures += 0 if check("collection.anki2 present", has_db) else 1
         failures += 0 if check("media manifest present", has_media) else 1
-    except zipfile.BadZipFile as e:
-        print(f"  {FAIL}  Invalid ZIP: {e}")
-        return 1
-    with tempfile.TemporaryDirectory() as tmpdir:
        zf.extractall(tmpdir)
        # --- Media manifest ---
@ -116,8 +118,11 @@ def validate_apkg(apkg_path: Path) -> int:
            size = zf.getinfo(num).file_size if num in zf.NameToInfo else -1
            if size == 0:
                zero_byte.append(orig)
-        failures += 0 if check("No zero-byte media files", len(zero_byte) == 0,
-                               f"{len(zero_byte)} empty" if zero_byte else "") else 1
+        failures += (
+            0
+            if check("No zero-byte media files", len(zero_byte) == 0, f"{len(zero_byte)} empty" if zero_byte else "")
+            else 1
+        )
        # Check audio format sample (first 20 mp3s)
        mp3_names = [num for num, orig in media_map.items() if orig.endswith(".mp3")]
@ -127,16 +132,19 @@ def validate_apkg(apkg_path: Path) -> int:
            fmt = _detect_format(data)
            if "MP3" not in fmt:
                bad_format.append(f"{media_map[num]}: {fmt}")
-        failures += 0 if check(
-            f"Audio format (sampled {min(20, len(mp3_names))} files)",
-            len(bad_format) == 0,
-            "; ".join(bad_format) if bad_format else f"all MP3",
-        ) else 1
+        failures += (
+            0
+            if check(
+                f"Audio format (sampled {min(20, len(mp3_names))} files)",
+                len(bad_format) == 0,
+                "; ".join(bad_format) if bad_format else "all MP3",
+            )
+            else 1
+        )
        # Fonts present
        font_files = [v for v in original_names if v.endswith(".ttf")]
-        check("Heebo font files bundled", len(font_files) >= 1,
-              ", ".join(font_files) if font_files else "none found")
+        check("Heebo font files bundled", len(font_files) >= 1, ", ".join(font_files) if font_files else "none found")
        # --- Database ---
        print("\n[Database]")
@ -144,8 +152,7 @@ def validate_apkg(apkg_path: Path) -> int:
        conn = sqlite3.connect(db_path)
        schema_ver = conn.execute("SELECT ver FROM col").fetchone()[0]
-        failures += 0 if check("Schema version 11 (Anki 2.1)", schema_ver == 11,
-                               f"got {schema_ver}") else 1
+        failures += 0 if check("Schema version 11 (Anki 2.1)", schema_ver == 11, f"got {schema_ver}") else 1
        note_count = conn.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
        card_count = conn.execute("SELECT COUNT(*) FROM cards").fetchone()[0]
@ -153,33 +160,37 @@ def validate_apkg(apkg_path: Path) -> int:
        failures += 0 if check("Cards present", card_count > 0, f"{card_count:,} cards") else 1
        # Determine expected cards per note from model templates
+        # Some templates are optional (e.g. cloze only generates when field is non-empty),
+        # so we check that cards fall between min and max expected range.
        models_json_raw = conn.execute("SELECT models FROM col").fetchone()[0]
        models_raw = json.loads(models_json_raw)
        tmpl_counts = [len(m["tmpls"]) for m in models_raw.values()]
-        expected_ratio = tmpl_counts[0] if len(set(tmpl_counts)) == 1 else None
-        if expected_ratio:
-            failures += 0 if check(
-                f"{expected_ratio} card(s) per note",
-                card_count == note_count * expected_ratio,
-                f"{note_count} notes × {expected_ratio} = {note_count * expected_ratio}, got {card_count}",
-            ) else 1
+        if len(set(tmpl_counts)) == 1 and len(tmpl_counts) == 1:
+            expected_ratio = tmpl_counts[0]
+            # Allow fewer cards when optional templates exist (e.g. cloze)
+            min_cards = note_count  # at least 1 card per note
+            max_cards = note_count * expected_ratio
+            failures += (
+                0
+                if check(
+                    f"Cards per note (1–{expected_ratio} templates)",
+                    min_cards <= card_count <= max_cards,
+                    f"{card_count:,} cards from {note_count:,} notes",
+                )
+                else 1
+            )
        # Duplicate GUIDs
-        dup_guids = conn.execute(
-            "SELECT guid, COUNT(*) c FROM notes GROUP BY guid HAVING c > 1"
-        ).fetchall()
-        failures += 0 if check("No duplicate GUIDs", len(dup_guids) == 0,
-                               f"{len(dup_guids)} duplicates") else 1
+        dup_guids = conn.execute("SELECT guid, COUNT(*) c FROM notes GROUP BY guid HAVING c > 1").fetchall()
+        failures += 0 if check("No duplicate GUIDs", len(dup_guids) == 0, f"{len(dup_guids)} duplicates") else 1
        # Card queue states
-        queues = conn.execute(
-            "SELECT type, queue, COUNT(*) FROM cards GROUP BY type, queue"
-        ).fetchall()
+        queues = conn.execute("SELECT type, queue, COUNT(*) FROM cards GROUP BY type, queue").fetchall()
        queue_map = {(t, q): cnt for t, q, cnt in queues}
        new_cards = queue_map.get((0, 0), 0)
        suspended = queue_map.get((0, -1), 0) + queue_map.get((1, -1), 0) + queue_map.get((2, -1), 0)
        if new_cards > 0:
-            check(f"Cards in new queue (type=0, queue=0)", True, f"{new_cards:,}")
+            check("Cards in new queue (type=0, queue=0)", True, f"{new_cards:,}")
        if suspended > 0:
            warn("Suspended cards", f"{suspended:,}")
@ -190,23 +201,18 @@ def validate_apkg(apkg_path: Path) -> int:
        per_days = {dc.get("new", {}).get("perDay") for dc in dconf.values() if isinstance(dc, dict)}
        check("new.order configured", bool(orders), f"{orders}")
        if per_days:
-            check("new.perDay > 0", all(p and p > 0 for p in per_days if p is not None),
-                  f"perDay={per_days}")
+            check("new.perDay > 0", all(p and p > 0 for p in per_days if p is not None), f"perDay={per_days}")
        # Deck assignment
        decks_json = conn.execute("SELECT decks FROM col").fetchone()[0]
        decks = json.loads(decks_json)
        real_decks = {did: d for did, d in decks.items() if did != "1"}
        if real_decks:
-            check("Custom deck exists (not Default only)", True,
-                  ", ".join(d["name"] for d in real_decks.values()))
+            check("Custom deck exists (not Default only)", True, ", ".join(d["name"] for d in real_decks.values()))
            # All cards in the custom deck?
            for did_str in real_decks:
-                assigned = conn.execute(
-                    "SELECT COUNT(*) FROM cards WHERE did=?", [int(did_str)]
-                ).fetchone()[0]
-                check(f"Cards in deck '{real_decks[did_str]['name']}'", assigned > 0,
-                      f"{assigned:,}/{card_count:,}")
+                assigned = conn.execute("SELECT COUNT(*) FROM cards WHERE did=?", [int(did_str)]).fetchone()[0]
+                check(f"Cards in deck '{real_decks[did_str]['name']}'", assigned > 0, f"{assigned:,}/{card_count:,}")
        # --- Sound references vs media manifest ---
        print("\n[Sound references]")
@ -218,16 +224,21 @@ def validate_apkg(apkg_path: Path) -> int:
        missing_audio = sound_refs - original_names
        orphaned_audio = original_names - sound_refs - set(font_files)
-        failures += 0 if check("All sound refs in media manifest", len(missing_audio) == 0,
-                               f"{len(missing_audio)} missing" if missing_audio else "") else 1
+        failures += (
+            0
+            if check(
+                "All sound refs in media manifest",
+                len(missing_audio) == 0,
+                f"{len(missing_audio)} missing" if missing_audio else "",
+            )
+            else 1
+        )
        if orphaned_audio:
            warn("Media files not referenced by any card", f"{len(orphaned_audio)} orphaned")
-        notes_with_audio = sum(
-            1 for (flds,) in notes_flds if "[sound:" in flds
-        )
+        notes_with_audio = sum(1 for (flds,) in notes_flds if "[sound:" in flds)
        pct = notes_with_audio / note_count * 100 if note_count else 0
-        check(f"Notes with audio", notes_with_audio > 0, f"{notes_with_audio:,}/{note_count:,} ({pct:.0f}%)")
+        check("Notes with audio", notes_with_audio > 0, f"{notes_with_audio:,}/{note_count:,} ({pct:.0f}%)")
        # --- Empty fields check ---
        print("\n[Field content]")
@ -236,22 +247,12 @@ def validate_apkg(apkg_path: Path) -> int:
            field_names = [f["name"] for f in model["flds"]]
            # Check required fields (first 3) are not empty
            required_idx = list(range(min(3, len(field_names))))
+            all_notes_for_model = conn.execute("SELECT flds FROM notes WHERE mid=?", [int(mid_str)]).fetchall()
            for idx in required_idx:
                fname = field_names[idx]
-                empty_count = conn.execute(
-                    """SELECT COUNT(*) FROM notes
-                       WHERE mid=? AND (
-                           flds LIKE ? OR
-                           instr(flds, char(31)) = 0
-                       )""",
-                    [int(mid_str), "\x1f" * idx + "\x1f%"],
-                ).fetchone()[0]
-                # Simpler: count notes where field idx is empty
-                all_notes_for_model = conn.execute(
-                    "SELECT flds FROM notes WHERE mid=?", [int(mid_str)]
-                ).fetchall()
                empty = sum(
-                    1 for (flds,) in all_notes_for_model
+                    1
+                    for (flds,) in all_notes_for_model
                    if len(flds.split("\x1f")) <= idx or not flds.split("\x1f")[idx].strip()
                )
                if empty > 0:
@ -271,6 +272,9 @@ def main() -> None:
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--vocab", action="store_true", help="Validate vocabulary deck only")
    group.add_argument("--conjugations", action="store_true", help="Validate conjugation deck only")
+    group.add_argument("--confusables", action="store_true", help="Validate confusables deck only")
+    group.add_argument("--plurals", action="store_true", help="Validate plurals deck only")
+    group.add_argument("--complete", action="store_true", help="Validate complete combined deck only")
    args = parser.parse_args()
    targets: list[Path] = []
@ -280,19 +284,25 @@ def main() -> None:
        targets = [VOCAB_APKG]
    elif args.conjugations:
        targets = [CONJ_APKG]
+    elif args.confusables:
+        targets = [CONF_APKG]
+    elif args.plurals:
+        targets = [PLURAL_APKG]
+    elif args.complete:
+        targets = [COMPLETE_APKG]
    else:
-        targets = [VOCAB_APKG, CONJ_APKG]
+        targets = [VOCAB_APKG, CONJ_APKG, CONF_APKG, PLURAL_APKG, COMPLETE_APKG]
    total_failures = 0
    for path in targets:
        total_failures += validate_apkg(path)
-    print(f"\n{'='*60}")
+    print(f"\n{'=' * 60}")
    if total_failures == 0:
        print(f"  {PASS}  All checks passed")
    else:
        print(f"  {FAIL}  {total_failures} check(s) failed")
-    print(f"{'='*60}\n")
+    print(f"{'=' * 60}\n")
    sys.exit(0 if total_failures == 0 else 1)
--- a/validate_verb_list.py
+++ b/validate_verb_list.py
@ -28,42 +28,42 @@ from pathlib import Path
 import requests
 from bs4 import BeautifulSoup
-PEALIM_BASE    = "https://www.pealim.com"
+PEALIM_BASE = "https://www.pealim.com"
-REQUEST_DELAY  = 1.5
+REQUEST_DELAY = 1.5
 REQUEST_TIMEOUT = 15
-SOURCE_FILE    = Path(__file__).parent / "nevo_typed_verbs_from_modern_hebrew"
+SOURCE_FILE = Path(__file__).parent / "nevo_typed_verbs_from_modern_hebrew"
-OUTPUT_FILE    = Path(__file__).parent / "verbs_input.txt"
+OUTPUT_FILE = Path(__file__).parent / "verbs_input.txt"
 # Known problem entries: word → (action, note)
 # action: "REVIEW" = comment out and flag, "3ms" = treat as 3ms past form
 KNOWN_ISSUES: dict[str, tuple[str, str]] = {
-    "לגבוה":   ("REVIEW", "not a standard infinitive form; likely defective spelling or wrong word"),
+    "לגבוה": ("REVIEW", "not a standard infinitive form; likely defective spelling or wrong word"),
-    "לההרג":   ("REVIEW", "extra ה; should probably be להיהרג (Nif'al of הרג)"),
+    "לההרג": ("REVIEW", "extra ה; should probably be להיהרג (Nif'al of הרג)"),
    "להתלקלח": ("REVIEW", "not a real word; likely typo for להתקלקל"),
-    "להקלל":   ("REVIEW", "ambiguous: could be Hif'il לְהָקֵל (to ease) or Nif'al of קלל"),
+    "להקלל": ("REVIEW", "ambiguous: could be Hif'il לְהָקֵל (to ease) or Nif'al of קלל"),
-    "המציא":   ("3ms",    "Hif'il 3ms past form, not an infinitive"),
+    "המציא": ("3ms", "Hif'il 3ms past form, not an infinitive"),
-    "קומם":    ("3ms",    "ambiguous: Pu'al 3ms past; Pi'el infinitive is לְקוֹמֵם"),
+    "קומם": ("3ms", "ambiguous: Pu'al 3ms past; Pi'el infinitive is לְקוֹמֵם"),
 }
 # Expected binyan by line range (1-indexed) per plan analysis
 LINE_RANGES: list[tuple[range, str]] = [
-    (range(1,  18),  "Pa'al"),
+    (range(1, 18), "Pa'al"),
-    (range(18, 29),  "Nif'al"),
+    (range(18, 29), "Nif'al"),
-    (range(29, 37),  "Pi'el"),
+    (range(29, 37), "Pi'el"),
-    (range(37, 43),  "Pu'al"),
+    (range(37, 43), "Pu'al"),
-    (range(43, 53),  "Hitpa'el"),
+    (range(43, 53), "Hitpa'el"),
-    (range(53, 63),  "Hif'il"),
+    (range(53, 63), "Hif'il"),
-    (range(63, 71),  "Huf'al"),
+    (range(63, 71), "Huf'al"),
 ]
 SECTION_HEADERS: dict[str, str] = {
-    "Pa'al":    "# Pa'al (פָּעַל)",
+    "Pa'al": "# Pa'al (פָּעַל)",
-    "Nif'al":   "# Nif'al (נִפְעַל)",
+    "Nif'al": "# Nif'al (נִפְעַל)",
-    "Pi'el":    "# Pi'el (פִּעֵל)",
+    "Pi'el": "# Pi'el (פִּעֵל)",
-    "Pu'al":    "# Pu'al (פֻּעַל) — 3ms past, no infinitive",
+    "Pu'al": "# Pu'al (פֻּעַל) — 3ms past, no infinitive",
    "Hitpa'el": "# Hitpa'el (הִתְפַּעֵל)",
-    "Hif'il":   "# Hif'il (הִפְעִיל)",
+    "Hif'il": "# Hif'il (הִפְעִיל)",
-    "Huf'al":   "# Huf'al (הֻפְעַל) — 3ms past, no infinitive",
+    "Huf'al": "# Huf'al (הֻפְעַל) — 3ms past, no infinitive",
 }
 session = requests.Session()
@ -120,7 +120,7 @@ def main() -> None:
        print(f"ERROR: {SOURCE_FILE} not found", file=sys.stderr)
        sys.exit(1)
-    lines = [l.strip() for l in SOURCE_FILE.read_text(encoding="utf-8").splitlines() if l.strip()]
+    lines = [line.strip() for line in SOURCE_FILE.read_text(encoding="utf-8").splitlines() if line.strip()]
    print(f"Loaded {len(lines)} entries from {SOURCE_FILE.name}")
    print(f"Querying pealim.com (delay {REQUEST_DELAY}s per request)…\n")
@ -137,14 +137,19 @@ def main() -> None:
        if issue_type == "REVIEW":
            # Don't query pealim for known-bad entries
-            print(f"REVIEW  (skipping query)")
+            print("REVIEW  (skipping query)")
-            results.append({
-                "line": line_num, "word": word,
-                "expected_binyan": expected_binyan,
-                "slug": "", "page_binyan": "",
-                "status": "REVIEW", "notes": issue_note,
-                "is_3ms": is_3ms_by_position,
-            })
+            results.append(
+                {
+                    "line": line_num,
+                    "word": word,
+                    "expected_binyan": expected_binyan,
+                    "slug": "",
+                    "page_binyan": "",
+                    "status": "REVIEW",
+                    "notes": issue_note,
+                    "is_3ms": is_3ms_by_position,
+                }
+            )
            continue
        time.sleep(REQUEST_DELAY)
@ -171,13 +176,18 @@ def main() -> None:
            notes = ""
        print(f"{status:<12}  slug={slug or '-':<35}  binyan={page_binyan or '-'}")
-        results.append({
-            "line": line_num, "word": word,
-            "expected_binyan": expected_binyan,
-            "slug": slug or "", "page_binyan": page_binyan,
-            "status": status, "notes": notes,
-            "is_3ms": is_3ms_by_position or issue_type == "3ms",
-        })
+        results.append(
+            {
+                "line": line_num,
+                "word": word,
+                "expected_binyan": expected_binyan,
+                "slug": slug or "",
+                "page_binyan": page_binyan,
+                "status": status,
+                "notes": notes,
+                "is_3ms": is_3ms_by_position or issue_type == "3ms",
+            }
+        )
    # ── Write cleaned verbs_input.txt ────────────────────────────────────────────
    sections: dict[str, list[str]] = {b: [] for b in SECTION_HEADERS}
@ -219,7 +229,6 @@ def main() -> None:
    print(f"\nWrote → {OUTPUT_FILE}")
    # ── Print summary table ──────────────────────────────────────────────────────
-    col_w = [4, 22, 14, 38, 12]
    print("\n" + "=" * 95)
    print("VALIDATION REPORT")
    print("=" * 95)
@ -232,8 +241,7 @@ def main() -> None:
        )
    print("=" * 95)
-    counts = {s: sum(1 for r in results if r["status"] == s)
-              for s in ("OK", "3ms", "MISMATCH", "REVIEW", "NOT_FOUND")}
+    counts = {s: sum(1 for r in results if r["status"] == s) for s in ("OK", "3ms", "MISMATCH", "REVIEW", "NOT_FOUND")}
    print(
        f"\nSummary: {counts['OK']} OK | {counts['3ms']} 3ms-past | "
        f"{counts['MISMATCH']} MISMATCH | {counts['REVIEW']} REVIEW | {counts['NOT_FOUND']} NOT_FOUND"
@ -241,10 +249,7 @@ def main() -> None:
    print(f"Total entries: {len(results)}")
    if counts["REVIEW"] > 0 or counts["NOT_FOUND"] > 0 or counts["MISMATCH"] > 0:
-        print(
-            "\n⚠  Review flagged entries in verbs_input.txt before running:\n"
-            "   python3 conjugation_extract.py"
-        )
+        print("\n⚠  Review flagged entries in verbs_input.txt before running:\n   python3 conjugation_extract.py")
 if __name__ == "__main__":
--- a/verbs_input.txt
+++ b/verbs_input.txt
@ -2,6 +2,8 @@
 # Lines prefixed '# 3ms:' are searched by 3ms past form (Pu'al/Huf'al).
 # Pa'al (פָּעַל)
 # slug: להיות 454-lihyot
 להיות
 לשמור
 ללמוד
 לאסוף
--- a/vulture_whitelist.py
+++ b/vulture_whitelist.py
@ -0,0 +1,3 @@
 # Vulture whitelist: suppress false positives for interface methods
 # HTMLParser.handle_starttag requires (self, tag, attrs) signature
 attrs  # noqa