feat: Sprint 3 — passive/active separation, random card order, card UX fixes

Conjugation extraction:
- Active entries now extract active forms only (no auto passive partner)
- Passive (# 3ms:) entries extract passive section only via new _extract_passive_from_active_slug(); search-based fallback also uses this path so no active forms leak into passive entries
- # slug: VERB SLUG override syntax for search-ambiguous active verbs
- # 3ms: FORM ACTIVE-SLUG syntax for passive entries with known active page
- Fixed verb spellings: בוטל (was בותל), slug overrides for תואם → 2344-letaem, זוכה → 503-lezakot, לָשִׂים → 45-lasim, העבר → 1442-lehaavir

Card UX:
- Passive card front: shows active partner infinitive (e.g. לְבַטֵּל) with (סָבִיל) inline in smaller font instead of bare 3ms past form
- Removed פָּעִיל label from active cards; only passive cards carry voice label
- New cards introduced in random order (new.order=0 via _RandomOrderPackage)
- Frequency badge: words outside top 50k show "50k+" instead of blank

README: updated CLI options, output files table, pipeline list, card descriptions to reflect Sprint 3 state

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 10:16:50 +00:00
commit d26e4c8ce5
parent ca7ca74a39
6 changed files with 8038 additions and 9615 deletions
--- a/README.md
+++ b/README.md
@ -50,7 +50,7 @@ Fields on each card:
 | Audio | pronunciation from pealim.com |
 | Frequency rank | #412 |

-Cards are sorted by frequency (most common words first), so you learn the most useful vocabulary earliest.
+Cards are presented in **random order** within Anki's spaced-repetition system, but frequency rank is displayed on every card so you can see how common each word is. Words not in the top 50,000 show a "50k+" badge.

 ---

@ -66,7 +66,9 @@ Each verb is drilled in: present, past, future, imperative, infinitive — all p

 **Modern Hebrew 2fp/3fp:** Classical feminine plural future forms (e.g., תִּשְׁמֹרְנָה) are shown in parentheses; the card's primary answer is the modern masculine plural form used in everyday speech.

-**Voice labels:** Pi'el and Hif'il cards are labeled פָּעִיל (active); Pu'al and Huf'al cards are labeled סָבִיל (passive).
+**Passive label:** Pu'al and Huf'al cards show the active partner's infinitive on the front (e.g., לְבַטֵּל) followed by **(סָבִיל)** in smaller text, so you know you're drilling the passive conjugation. Active verbs show no label.
+
+**Card order:** New cards are introduced in random order.

 **Citation:** Coffin, Edna Amir and Shmuel Bolozky. *A Reference Grammar of Modern Hebrew*. Cambridge University Press, 2005.

@ -137,7 +139,9 @@ python run.py [options]
  --skip-scrape        Use cached data/pealim_dict.csv (no pealim.com scraping)
  --skip-audio         Skip audio .mp3 downloads
  --skip-examples      Skip Ben Yehuda example fetching
+  --only {vocab,conjugations}  Run only one deck (skips all unrelated steps)
  --skip-conjugations  Skip verb conjugation extraction
+  --skip-images        Skip image fetching for concrete nouns
  --refresh-examples   Force rebuild of Ben Yehuda index (nikkud corpus)
  --test N             Process only first N words
 ```
@ -151,6 +155,9 @@ python run.py [options]
 | `data/conjugations.json` | Verb conjugation data |
 | `data/audio/` | Vocabulary audio (.mp3) |
 | `data/audio_conj/` | Conjugation audio (.mp3) |
+| `data/fonts/` | Heebo font files (bundled in .apkg) |
+| `data/images/` | Noun images from Wikipedia/Commons |
+| `data/image_cache.json` | Image fetch cache |
 | `output/pealim_vocabulary.apkg` | Vocabulary Anki deck |
 | `output/pealim_conjugations.apkg` | Conjugation Anki deck |

@ -161,8 +168,10 @@ python run.py [options]
 3. `benyehuda.py` — builds sentence index from Ben-Yehuda corpus
 4. `extract_verb_list.py` — extracts verb list from Coffin & Bolozky PDF
 5. `conjugation_extract.py` — fetches conjugation tables from pealim.com
-6. `apkg_builder.py` — assembles both `.apkg` files
-7. `run.py` — orchestrates all steps
+6. `image_fetch.py` — fetches Wikipedia/Commons images for concrete nouns
+7. `validate_verb_list.py` — validates verb list against pealim.com
+8. `apkg_builder.py` — assembles both `.apkg` files
+9. `run.py` — orchestrates all steps

 ---
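
Note on the frequency badge: the badge code is not part of this diff, so the following is only a minimal sketch of the behaviour the README describes ("#412" for ranked words, "50k+" otherwise). The function name `frequency_badge`, the `TOP_N` constant, and the use of `None` for an unranked word are assumptions made for illustration, not names from the repository.

```python
from typing import Optional

# Assumption: ranks are 1-based and the frequency list covers the top 50,000 words.
TOP_N = 50_000


def frequency_badge(rank: Optional[int]) -> str:
    """Return the badge text for a word's frequency rank (hypothetical helper)."""
    # Unranked words, or words beyond the covered range, get the catch-all badge
    # rather than an empty string.
    if rank is None or rank > TOP_N:
        return "50k+"
    return f"#{rank}"


print(frequency_badge(412))   # #412
print(frequency_badge(None))  # 50k+
```
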

--- a/apkg_builder.py
+++ b/apkg_builder.py
@ -141,6 +141,11 @@ CARD_CSS = """
  padding: 2px 8px;
  margin-top: 4px;
 }
+.voice-label {
+  font-size: 0.6em;
+  font-weight: normal;
+  color: #555;
+}
 .sec-label {
  font-size: 16px;
  color: #555;
@ -236,8 +241,7 @@ VOCAB_MODEL = genanki.Model(
 # ──────────────────────────────────────────────────────────────────────────────

 CONJ_FRONT = """
-<div class="hebrew">{{ReferenceForm}}</div>
-{{#Voice}}<div class="hebrew">{{Voice}}</div>{{/Voice}}
+<div class="hebrew">{{ReferenceForm}}{{#Voice}} <span class="voice-label">({{Voice}})</span>{{/Voice}}</div>
 <div class="hebrew">{{Pronoun}}</div>
 <div class="hebrew">{{Tense}}</div>
 """
@ -307,10 +311,8 @@ FP_MODERN_FALLBACK = {
    "imperative_fp": "imperative_mp",
 }

-# Voice field: active/passive label per binyan
+# Voice field: passive label only (shown inline on card front for Pu'al/Huf'al)
 VOICE_MAP = {
-    "Pi'el":  "פָּעִיל",
-    "Hif'il": "פָּעִיל",
    "Pu'al":  "סָבִיל",
    "Huf'al": "סָבִיל",
 }
@ -605,50 +607,6 @@ def build_conj_deck(
            tense   = form_data.get("tense", "")
            add_note(pronoun, tense, conj_form, audio_tag)

-        # Also process passive partner forms if present
-        passive = data.get("passive_partner")
-        if passive and passive.get("forms"):
-            passive_root       = passive.get("root", root)
-            passive_binyan     = passive.get("binyan", "")
-            passive_binyan_heb = BINYAN_TO_HEBREW.get(passive_binyan, passive_binyan)
-            passive_ref        = passive.get("reference_form", ref_form)
-            passive_voice      = VOICE_MAP.get(passive_binyan, "")
-            passive_slug       = passive.get("slug", slug)
-
-            for form_key, form_data in passive["forms"].items():
-                conj_form = form_data.get("form", "")
-                if not conj_form or not re.search(r"[\u05d0-\u05ea]", conj_form):
-                    continue
-
-                audio_tag = ""
-                if passive_slug:
-                    audio_tag = _conj_audio_tag(passive_slug, f"passive_{form_key}")
-                    if audio_tag:
-                        mp3_path = audio_dir / f"{passive_slug}_passive_{form_key}.mp3"
-                        if mp3_path not in media_files:
-                            media_files.append(mp3_path)
-
-                pronoun = form_data.get("pronoun", "")
-                tense   = form_data.get("tense", "")
-
-                if not conj_form:
-                    continue
-                note = genanki.Note(
-                    model=CONJ_MODEL,
-                    fields=[
-                        infinitive,
-                        passive_ref,
-                        pronoun,
-                        tense,
-                        conj_form,
-                        passive_root,
-                        passive_binyan_heb,
-                        passive_voice,
-                        audio_tag,
-                    ],
-                )
-                deck.add_note(note)
-                note_count += 1

    logger.info(
        f"Conjugation deck: {note_count} notes across "
@ -663,13 +621,27 @@ def _font_media_files() -> list[str]:
    return [str(p) for p in font_paths if p.exists()]


+class _RandomOrderPackage(genanki.Package):
+    """genanki.Package subclass that sets new card order to random (0) instead of insertion order (1)."""
+
+    def write_to_db(self, cursor, timestamp, id_gen):
+        super().write_to_db(cursor, timestamp, id_gen)
+        row = cursor.execute("SELECT dconf FROM col").fetchone()
+        if row:
+            dconf = json.loads(row[0])
+            for conf in dconf.values():
+                if isinstance(conf, dict) and "new" in conf:
+                    conf["new"]["order"] = 0
+            cursor.execute("UPDATE col SET dconf = ?", [json.dumps(dconf)])
+
+
 def write_vocab_apkg(
    deck: genanki.Deck,
    media_files: list[Path],
    out_path: Path = VOCAB_APKG,
 ) -> None:
    out_path.parent.mkdir(parents=True, exist_ok=True)
-    pkg = genanki.Package(deck)
+    pkg = _RandomOrderPackage(deck)
    pkg.media_files = [str(p) for p in media_files if p.exists()] + _font_media_files()
    pkg.write_to_file(str(out_path))
    logger.info(f"Vocabulary deck written → {out_path}")
@ -681,7 +653,7 @@ def write_conj_apkg(
    out_path: Path = CONJ_APKG,
 ) -> None:
    out_path.parent.mkdir(parents=True, exist_ok=True)
-    pkg = genanki.Package(deck)
+    pkg = _RandomOrderPackage(deck)
    base = [str(p) for p in (media_files or []) if p.exists()]
    pkg.media_files = base + _font_media_files()
    pkg.write_to_file(str(out_path))
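
Note on `_RandomOrderPackage`: the override patches the deck configuration JSON stored in the `dconf` column of the `col` table after genanki has written it. The sketch below reproduces only that JSON patch against an in-memory SQLite database, without genanki. The one-column `col` table and the sample config are stand-ins invented for the demo; the meaning of the values (0 = random, 1 = insertion order) is taken from the class docstring above.

```python
import json
import sqlite3


def set_new_card_order_random(cursor: sqlite3.Cursor) -> None:
    """Set new.order = 0 on every deck config found in col.dconf."""
    row = cursor.execute("SELECT dconf FROM col").fetchone()
    if not row:
        return
    dconf = json.loads(row[0])
    for conf in dconf.values():
        # Only touch entries that look like deck configs with a "new" section.
        if isinstance(conf, dict) and "new" in conf:
            conf["new"]["order"] = 0
    cursor.execute("UPDATE col SET dconf = ?", [json.dumps(dconf)])


# Demo database: a simplified stand-in for the collection table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE col (dconf TEXT)")
sample = {"1": {"name": "Default", "new": {"order": 1, "perDay": 20}}}
cur.execute("INSERT INTO col VALUES (?)", [json.dumps(sample)])

set_new_card_order_random(cur)
patched = json.loads(cur.execute("SELECT dconf FROM col").fetchone()[0])
print(patched["1"]["new"]["order"])  # 0
```

Other keys in each config (here `perDay`) are left untouched, which is why the patch edits the parsed JSON in place instead of writing a fresh config.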
--- a/conjugation_extract.py
+++ b/conjugation_extract.py
@ -478,36 +478,6 @@ def _extract_conjugations(slug: str, search_term: str, is_3ms_search: bool = Fal
                "tense": TENSE_DESCRIPTION.get(key, ""),
            }

-    # Parse passive forms if present on this page (Pi'el/Hif'il pages have passive partner)
-    passive_forms_raw = _parse_table(soup, passive=True)
-    if passive_forms_raw:
-        passive_binyan = _extract_passive_binyan_from_page(soup)
-        if not passive_binyan:
-            # Infer: Pi'el → Pu'al, Hif'il → Huf'al
-            passive_binyan = "Pu'al" if binyan == "Pi'el" else "Huf'al" if binyan == "Hif'il" else ""
-
-        passive_past_3ms = passive_forms_raw.get("past_3ms", {}).get("form", "")
-        passive_result = {
-            "infinitive": search_term,
-            "slug": slug,
-            "root": root,
-            "binyan": passive_binyan,
-            "is_passive": True,
-            "reference_form": passive_past_3ms or search_term,
-            "reference_active_infinitive": reference_form,
-            "forms": {},
-        }
-        for key, form_data in passive_forms_raw.items():
-            if key in PRONOUN_LABELS:
-                passive_result["forms"][key] = {
-                    "form": form_data["form"],
-                    "audio_url": form_data.get("audio_url", ""),
-                    "pronoun": PRONOUN_LABELS[key],
-                    "tense": TENSE_DESCRIPTION.get(key, ""),
-                }
-        result["passive_partner"] = passive_result
-        logger.info(f"  Passive partner ({passive_binyan}): {len(passive_result['forms'])} forms")
-
    logger.info(f"  Extracted {len(result['forms'])} forms for {search_term}")
    return result

@ -525,6 +495,61 @@ def _save_conjugations(data: dict) -> None:
        json.dump(data, f, ensure_ascii=False, indent=2)


+def _extract_passive_from_active_slug(active_slug: str, search_term: str) -> dict | None:
+    """Fetch active verb page and extract only the passive section forms.
+    Used for Pu'al/Huf'al 3ms entries where we know the active verb's slug."""
+    url = f"{PEALIM_BASE}/dict/{active_slug}/"
+    try:
+        resp = session.get(url, cookies={"hebstyle": "mo"}, timeout=REQUEST_TIMEOUT)
+        resp.raise_for_status()
+    except Exception as e:
+        logger.error(f"  Error fetching {url}: {e}")
+        return None
+
+    soup = BeautifulSoup(resp.text, "lxml")
+
+    root = ""
+    for span in soup.find_all("span", class_="menukad"):
+        txt = span.get_text(strip=True)
+        if txt and re.search(r"[\u05d0-\u05ea]", txt) and "-" in txt:
+            root = txt
+            break
+
+    active_binyan = _extract_binyan_from_page(soup)
+    active_forms_raw = _parse_table(soup, passive=False)
+    active_infinitive = active_forms_raw.get("infinitive", {}).get("form", "")
+
+    passive_forms_raw = _parse_table(soup, passive=True)
+    if not passive_forms_raw:
+        logger.warning(f"  No passive forms found on {active_slug} for {search_term}")
+        return None
+
+    passive_binyan = _extract_passive_binyan_from_page(soup)
+    if not passive_binyan:
+        passive_binyan = "Pu'al" if active_binyan == "Pi'el" else "Huf'al" if active_binyan == "Hif'il" else ""
+
+    result = {
+        "infinitive": search_term,
+        "slug": active_slug,
+        "root": root,
+        "binyan": passive_binyan,
+        "is_passive": True,
+        "reference_form": active_infinitive or search_term,
+        "forms": {},
+    }
+    for key, form_data in passive_forms_raw.items():
+        if key in PRONOUN_LABELS:
+            result["forms"][key] = {
+                "form": form_data["form"],
+                "audio_url": form_data.get("audio_url", ""),
+                "pronoun": PRONOUN_LABELS[key],
+                "tense": TENSE_DESCRIPTION.get(key, ""),
+            }
+
+    logger.info(f"  Extracted {len(result['forms'])} passive forms for {search_term} from {active_slug}")
+    return result
+
+
 def main(verbs_file: Path = VERBS_INPUT) -> dict:
    """Read verbs from file and extract conjugations. Returns full conjugations dict."""
    if not verbs_file.exists():
@ -533,43 +558,78 @@ def main(verbs_file: Path = VERBS_INPUT) -> dict:

    raw_lines = verbs_file.read_text(encoding="utf-8").splitlines()

-    # Parse: regular verbs and # 3ms: lines
-    verbs: list[tuple[str, bool]] = []  # (search_term, is_3ms_search)
+    # Parse slug overrides: "# slug: VERB SLUG" anywhere in the file
+    slug_overrides: dict[str, str] = {}
    for line in raw_lines:
-        line = line.strip()
-        if not line:
+        stripped = line.strip()
+        if stripped.startswith("# slug:"):
+            parts = stripped[len("# slug:"):].strip().split()
+            if len(parts) >= 2:
+                slug_overrides[parts[0]] = parts[1]
+
+    # Parse: regular verbs and # 3ms: lines (optional active slug on 3ms lines)
+    verbs: list[tuple[str, bool, str | None]] = []  # (search_term, is_3ms_search, active_slug)
+    for line in raw_lines:
+        stripped = line.strip()
+        if not stripped or stripped.startswith("# slug:"):
            continue
-        if line.startswith("# 3ms:"):
-            form = line[len("# 3ms:"):].strip()
-            if form:
-                verbs.append((form, True))
-        elif line.startswith("#"):
+        if stripped.startswith("# 3ms:"):
+            parts = stripped[len("# 3ms:"):].strip().split()
+            if parts:
+                form = parts[0]
+                active_slug = parts[1] if len(parts) >= 2 else None
+                verbs.append((form, True, active_slug))
+        elif stripped.startswith("#"):
            continue
        else:
-            verbs.append((line, False))
+            verbs.append((stripped, False, None))

    logger.info(f"Loaded {len(verbs)} verbs from {verbs_file} "
-                f"({sum(1 for _, p in verbs if p)} passive 3ms)")
+                f"({sum(1 for _, p, _ in verbs if p)} passive 3ms)")
+    if slug_overrides:
+        logger.info(f"  Slug overrides: {slug_overrides}")

    conjugations = _load_conjugations()
    new_count = 0

-    for verb, is_3ms in verbs:
+    for verb, is_3ms, active_slug in verbs:
        if verb in conjugations:
            logger.info(f"Skipping {verb} (cached)")
            continue

        logger.info(f"Processing: {verb} {'(3ms search)' if is_3ms else ''}")
        time.sleep(REQUEST_DELAY)
-        slug = _find_slug(verb)
-        if not slug:
-            logger.warning(f"  No slug found for {verb}")
-            conjugations[verb] = None
-            _save_conjugations(conjugations)
-            continue

-        time.sleep(REQUEST_DELAY)
-        data = _extract_conjugations(slug, verb, is_3ms_search=is_3ms)
+        if is_3ms:
+            # Passive-only extraction: use provided active slug or search to find it
+            if active_slug:
+                slug = active_slug
+                logger.info(f"  Using active slug {slug} for passive extraction")
+            else:
+                slug = _find_slug(verb)
+                if not slug:
+                    logger.warning(f"  No slug found for {verb}")
+                    conjugations[verb] = None
+                    _save_conjugations(conjugations)
+                    continue
+                logger.info(f"  Found active slug {slug} for passive extraction")
+            time.sleep(REQUEST_DELAY)
+            data = _extract_passive_from_active_slug(slug, verb)
+        else:
+            override = slug_overrides.get(verb)
+            if override:
+                logger.info(f"  Slug override: {override}")
+                slug = override
+            else:
+                slug = _find_slug(verb)
+            if not slug:
+                logger.warning(f"  No slug found for {verb}")
+                conjugations[verb] = None
+                _save_conjugations(conjugations)
+                continue
+            time.sleep(REQUEST_DELAY)
+            data = _extract_conjugations(slug, verb, is_3ms_search=False)
+
        conjugations[verb] = data
        _save_conjugations(conjugations)
        new_count += 1
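
Note on the verb-list syntax: the two-pass parsing in `main()` can be tried in isolation. The sketch below lifts that logic into a standalone function; the name `parse_verb_lines` and the sample lines are illustrative, while the `# slug:` and `# 3ms:` prefixes and the tuple layout `(search_term, is_3ms_search, active_slug)` follow the diff above.

```python
def parse_verb_lines(raw_lines):
    """Return (verbs, slug_overrides) parsed from verb-list lines."""
    # Pass 1: collect "# slug: VERB SLUG" overrides from anywhere in the file.
    slug_overrides = {}
    for line in raw_lines:
        stripped = line.strip()
        if stripped.startswith("# slug:"):
            parts = stripped[len("# slug:"):].strip().split()
            if len(parts) >= 2:
                slug_overrides[parts[0]] = parts[1]

    # Pass 2: collect verbs; "# 3ms:" lines may carry an optional active slug.
    verbs = []
    for line in raw_lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("# slug:"):
            continue
        if stripped.startswith("# 3ms:"):
            parts = stripped[len("# 3ms:"):].strip().split()
            if parts:
                active_slug = parts[1] if len(parts) >= 2 else None
                verbs.append((parts[0], True, active_slug))
        elif stripped.startswith("#"):
            continue  # ordinary comment or section heading
        else:
            verbs.append((stripped, False, None))
    return verbs, slug_overrides


LASIM = "לָשִׂים"
sample_lines = [
    "# Pa'al",
    f"# slug: {LASIM} 45-lasim",
    LASIM,
    "# 3ms: בוטל 214-levatel",
    "# 3ms: קומם",
]
verbs, overrides = parse_verb_lines(sample_lines)
for entry in verbs:
    print(entry)
print(overrides)
```

The override key must match the verb line character for character, including nikkud, because the lookup is a plain dictionary access on the stripped line.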
--- a/data/conjugations.json
+++ b/data/conjugations.json
--- a/run.py
+++ b/run.py
@ -6,6 +6,7 @@ Usage:
  python run.py [options]

 Options:
+  --only {vocab,conjugations}  Run only one deck (skips all unrelated steps)
  --skip-scrape        Use existing data/pealim_dict.csv (no pealim.com dict scraping)
  --skip-audio         Skip audio .mp3 downloads
  --skip-examples      Skip Ben Yehuda example fetching
@ -40,6 +41,7 @@ FONTS_DIR      = DATA_DIR / "fonts"

 def parse_args():
    p = argparse.ArgumentParser(description="Pealim Anki deck builder")
+    p.add_argument("--only",               choices=["vocab", "conjugations"], help="Run only one deck (skips all unrelated steps)")
    p.add_argument("--skip-scrape",        action="store_true", help="Skip dict scraping; use cached CSV")
    p.add_argument("--skip-audio",         action="store_true", help="Skip audio downloads")
    p.add_argument("--skip-examples",      action="store_true", help="Skip Ben Yehuda example lookup")
@ -451,12 +453,23 @@ def main():

    logger.info("=" * 60)
    logger.info("PEALIM ANKI DECK BUILDER")
+    if args.only:
+        logger.info(f"  MODE: --only {args.only}")
    if args.test:
        logger.info(f"  TEST MODE: {args.test} words")
    if args.refresh_examples:
        logger.info("  REFRESH EXAMPLES: Ben Yehuda index will be rebuilt")
    logger.info("=" * 60)

+    if args.only == "conjugations":
+        step_fonts(args)
+        conjugations = step_conjugations(args)
+        print_summary(args, {}, {}, conjugations or {})
+        return
+
+    if args.only == "vocab":
+        args.skip_conjugations = True
+
    step_scrape(args)
    freq_cache     = step_frequency()
    examples_cache = step_examples(args, freq_cache)
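
Note on `--only`: the dispatch added to `main()` can be modelled as a function that returns which steps would run. This is a simplified sketch, not the project's code. The step labels are illustrative strings, `planned_steps` is a hypothetical helper, and the default list stops at the steps visible in this hunk, so it is deliberately incomplete.

```python
import argparse


def parse_args(argv=None):
    p = argparse.ArgumentParser(description="Pealim Anki deck builder")
    p.add_argument("--only", choices=["vocab", "conjugations"],
                   help="Run only one deck (skips all unrelated steps)")
    p.add_argument("--skip-conjugations", action="store_true",
                   help="Skip verb conjugation extraction")
    return p.parse_args(argv)


def planned_steps(args):
    """Return labels for the steps that would run, in order (hypothetical)."""
    if args.only == "conjugations":
        # Early return: only fonts and conjugations are needed for this deck.
        return ["fonts", "conjugations"]
    if args.only == "vocab":
        # Reuse the existing flag instead of adding a second code path.
        args.skip_conjugations = True
    steps = ["scrape", "frequency", "examples"]
    if not args.skip_conjugations:
        steps.append("conjugations")
    return steps


print(planned_steps(parse_args(["--only", "conjugations"])))
print(planned_steps(parse_args(["--only", "vocab"])))
```

Because `choices` is set, argparse rejects any other value for `--only` before `main()` runs.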
--- a/verbs_input.txt
+++ b/verbs_input.txt
@ -1,7 +1,5 @@
 # Verb list — validated against pealim.com from nevo_typed_verbs_from_modern_hebrew
 # Lines prefixed '# 3ms:' are searched by 3ms past form (Pu'al/Huf'al).
-# Lines prefixed '# REVIEW:' need manual correction before conjugation extraction.
-# Lines prefixed '# NOT_FOUND:' had no pealim.com result — check spelling.

 # Pa'al (פָּעַל)
 לשמור
@ -12,11 +10,13 @@
 לאכול
 לשאול
 לשלוח
+לגבוה
 לשבת
 לרשת
-לפול
+לִיפּוֹל
 לקום
-לשים
+# slug: לָשִׂים 45-lasim
+לָשִׂים
 לחון
 לקרוא
 לקנות
@ -24,6 +24,7 @@
 # Nif'al (נִפְעַל)
 להיבדק
 להרדם
+להיהרג
 להחקר
 להישאר
 להיפגע
@ -44,11 +45,11 @@
 לגלגל

 # Pu'al (פֻּעַל) — 3ms past, no infinitive
-# 3ms: בותל
-# 3ms: תואם
+# 3ms: בוטל 214-levatel
+# 3ms: תואם 2344-letaem
 # 3ms: קומם
 # 3ms: דוכא
-# 3ms: זוכה
+# 3ms: זוכה 503-lezakot
 # 3ms: פורסם

 # Hitpa'el (הִתְפַּעֵל)
@ -72,19 +73,14 @@
 להקים
 להמציא
 להרשות
+להקל

 # Huf'al (הֻפְעַל) — 3ms past, no infinitive
 # 3ms: הוגבל
-# 3ms: העבר
+# 3ms: העבר 1442-lehaavir
 # 3ms: הוזהר
 # 3ms: הופל
 # 3ms: הוקם
 # 3ms: הוחל
 # 3ms: הוקפא
 # 3ms: הופנה
-
-# ── Entries flagged for manual review ──────────────────────────────────────────
-# REVIEW: לגבוה  — not a standard infinitive form; likely defective spelling or wrong word. Response: see slug 286-ligboah
-# REVIEW: לההרג  — extra ה; should probably be להיהרג (Nif'al of הרג) Response: correct, nifal of harag
-# REVIEW: להתלקלח  — not a real word; likely typo for להתקלקל Response: correct, it's a typo 
-# REVIEW: להקלל  — ambiguous: could be Hif'il לְהָקֵל (to ease) or Nif'al of קלל Response: it's lehakel, to ease