feat: Sprint 3 — passive/active separation, random card order, card UX fixes

Conjugation extraction:
- Active entries now extract active forms only (no auto passive partner)
- Passive (# 3ms:) entries extract passive section only via new
  _extract_passive_from_active_slug(); search-based fallback also uses
  this path so no active forms leak into passive entries
- # slug: VERB SLUG override syntax for search-ambiguous active verbs
- # 3ms: FORM ACTIVE-SLUG syntax for passive entries with known active page
- Fixed verb spellings: בוטל (was בותל), slug overrides for תואם →
  2344-letaem, זוכה → 503-lezakot, לָשִׂים → 45-lasim, העבר → 1442-lehaavir
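
A minimal sketch of how the two directive syntaxes above can be read from
the verb list. The function name parse_verb_list is hypothetical; the commit
does this inline in main() with two passes over the file, while this sketch
uses a single pass and returns both results together.

```python
def parse_verb_list(lines):
    """Return (slug_overrides, verbs) parsed from verb-list lines.

    verbs holds (search_term, is_3ms_search, active_slug) tuples.
    """
    slug_overrides = {}
    verbs = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("# slug:"):
            # "# slug: VERB SLUG" pins an active verb to a known page slug
            parts = stripped[len("# slug:"):].split()
            if len(parts) >= 2:
                slug_overrides[parts[0]] = parts[1]
        elif stripped.startswith("# 3ms:"):
            # "# 3ms: FORM [ACTIVE-SLUG]" marks a passive entry
            parts = stripped[len("# 3ms:"):].split()
            if parts:
                active_slug = parts[1] if len(parts) >= 2 else None
                verbs.append((parts[0], True, active_slug))
        elif stripped.startswith("#"):
            continue  # any other comment line
        else:
            verbs.append((stripped, False, None))
    return slug_overrides, verbs


verb = "לשים"  # illustrative entry, written without nikkud
sample = [
    "# Pa'al",
    f"# slug: {verb} 45-lasim",
    verb,
    "# 3ms: בוטל 214-levatel",
    "# 3ms: קומם",
]
overrides, verbs = parse_verb_list(sample)
```

A passive entry without a second token keeps active_slug as None, which is
the case where the search-based fallback applies.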

Card UX:
- Passive card front: shows active partner infinitive (e.g. לְבַטֵּל) with
  (סָבִיל) inline in smaller font instead of bare 3ms past form
- Removed פָּעִיל label from active cards; only passive cards carry voice label
- New cards introduced in random order (new.order=0 via _RandomOrderPackage)
- Frequency badge: words outside top 50k show "50k+" instead of blank
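
The new.order=0 change can be illustrated outside genanki. The sketch below
applies the same JSON patch to a stand-in table; the one-column col table and
the name set_new_order_random are assumptions made for illustration only, not
the real collection schema or the commit's code.

```python
import json
import sqlite3


def set_new_order_random(cursor):
    """Set new.order to 0 (random) in every deck config stored in col.dconf."""
    row = cursor.execute("SELECT dconf FROM col").fetchone()
    if row is None:
        return
    dconf = json.loads(row[0])
    for conf in dconf.values():
        if isinstance(conf, dict) and "new" in conf:
            conf["new"]["order"] = 0
    cursor.execute("UPDATE col SET dconf = ?", [json.dumps(dconf)])


# Stand-in collection: only the dconf column is modelled here.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE col (dconf TEXT)")
initial = {"1": {"name": "Default", "new": {"order": 1, "perDay": 20}}}
cur.execute("INSERT INTO col VALUES (?)", [json.dumps(initial)])

set_new_order_random(cur)
patched = json.loads(cur.execute("SELECT dconf FROM col").fetchone()[0])
```

Only the order key is rewritten; other settings in the same config, such as
perDay in this sample, are left untouched.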

README: updated CLI options, output files table, pipeline list, card
descriptions to reflect Sprint 3 state

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sochen 2026-03-03 10:16:50 +00:00
parent ca7ca74a39
commit d26e4c8ce5
6 changed files with 8038 additions and 9615 deletions

@ -50,7 +50,7 @@ Fields on each card:
| Audio | pronunciation from pealim.com |
| Frequency rank | #412 |
Cards are sorted by frequency (most common words first), so you learn the most useful vocabulary earliest.
Cards are presented in **random order** within Anki's spaced-repetition system, but frequency rank is displayed on every card so you can see how common each word is. Words not in the top 50,000 show a "50k+" badge.
---
@ -66,7 +66,9 @@ Each verb is drilled in: present, past, future, imperative, infinitive — all p
**Modern Hebrew 2fp/3fp:** Classical feminine plural future forms (e.g., תִּשְׁמֹרְנָה) are shown in parentheses; the card's primary answer is the modern masculine plural form used in everyday speech.
**Voice labels:** Pi'el and Hif'il cards are labeled פָּעִיל (active); Pu'al and Huf'al cards are labeled סָבִיל (passive).
**Passive label:** Pu'al and Huf'al cards show the active partner's infinitive on the front (e.g., לְבַטֵּל) followed by **(סָבִיל)** in smaller text, so you know you're drilling the passive conjugation. Active verbs show no label.
**Card order:** New cards are introduced in random order.
**Citation:** Coffin, Edna Amir and Shmuel Bolozky. *A Reference Grammar of Modern Hebrew*. Cambridge University Press, 2005.
@ -137,7 +139,9 @@ python run.py [options]
--skip-scrape Use cached data/pealim_dict.csv (no pealim.com scraping)
--skip-audio Skip audio .mp3 downloads
--skip-examples Skip Ben Yehuda example fetching
--only {vocab,conjugations} Run only one deck (skips all unrelated steps)
--skip-conjugations Skip verb conjugation extraction
--skip-images Skip image fetching for concrete nouns
--refresh-examples Force rebuild of Ben Yehuda index (nikkud corpus)
--test N Process only first N words
```
@ -151,6 +155,9 @@ python run.py [options]
| `data/conjugations.json` | Verb conjugation data |
| `data/audio/` | Vocabulary audio (.mp3) |
| `data/audio_conj/` | Conjugation audio (.mp3) |
| `data/fonts/` | Heebo font files (bundled in .apkg) |
| `data/images/` | Noun images from Wikipedia/Commons |
| `data/image_cache.json` | Image fetch cache |
| `output/pealim_vocabulary.apkg` | Vocabulary Anki deck |
| `output/pealim_conjugations.apkg` | Conjugation Anki deck |
@ -161,8 +168,10 @@ python run.py [options]
3. `benyehuda.py` — builds sentence index from Ben-Yehuda corpus
4. `extract_verb_list.py` — extracts verb list from Coffin & Bolozky PDF
5. `conjugation_extract.py` — fetches conjugation tables from pealim.com
6. `apkg_builder.py` — assembles both `.apkg` files
7. `run.py` — orchestrates all steps
6. `image_fetch.py` — fetches Wikipedia/Commons images for concrete nouns
7. `validate_verb_list.py` — validates verb list against pealim.com
8. `apkg_builder.py` — assembles both `.apkg` files
9. `run.py` — orchestrates all steps
---

@ -141,6 +141,11 @@ CARD_CSS = """
padding: 2px 8px;
margin-top: 4px;
}
.voice-label {
font-size: 0.6em;
font-weight: normal;
color: #555;
}
.sec-label {
font-size: 16px;
color: #555;
@ -236,8 +241,7 @@ VOCAB_MODEL = genanki.Model(
# ──────────────────────────────────────────────────────────────────────────────
CONJ_FRONT = """
<div class="hebrew">{{ReferenceForm}}</div>
{{#Voice}}<div class="hebrew">{{Voice}}</div>{{/Voice}}
<div class="hebrew">{{ReferenceForm}}{{#Voice}} <span class="voice-label">({{Voice}})</span>{{/Voice}}</div>
<div class="hebrew">{{Pronoun}}</div>
<div class="hebrew">{{Tense}}</div>
"""
@ -307,10 +311,8 @@ FP_MODERN_FALLBACK = {
"imperative_fp": "imperative_mp",
}
# Voice field: active/passive label per binyan
# Voice field: passive label only (shown inline on card front for Pu'al/Huf'al)
VOICE_MAP = {
"Pi'el": "פָּעִיל",
"Hif'il": "פָּעִיל",
"Pu'al": "סָבִיל",
"Huf'al": "סָבִיל",
}
@ -605,50 +607,6 @@ def build_conj_deck(
tense = form_data.get("tense", "")
add_note(pronoun, tense, conj_form, audio_tag)
# Also process passive partner forms if present
passive = data.get("passive_partner")
if passive and passive.get("forms"):
passive_root = passive.get("root", root)
passive_binyan = passive.get("binyan", "")
passive_binyan_heb = BINYAN_TO_HEBREW.get(passive_binyan, passive_binyan)
passive_ref = passive.get("reference_form", ref_form)
passive_voice = VOICE_MAP.get(passive_binyan, "")
passive_slug = passive.get("slug", slug)
for form_key, form_data in passive["forms"].items():
conj_form = form_data.get("form", "")
if not conj_form or not re.search(r"[\u05d0-\u05ea]", conj_form):
continue
audio_tag = ""
if passive_slug:
audio_tag = _conj_audio_tag(passive_slug, f"passive_{form_key}")
if audio_tag:
mp3_path = audio_dir / f"{passive_slug}_passive_{form_key}.mp3"
if mp3_path not in media_files:
media_files.append(mp3_path)
pronoun = form_data.get("pronoun", "")
tense = form_data.get("tense", "")
if not conj_form:
continue
note = genanki.Note(
model=CONJ_MODEL,
fields=[
infinitive,
passive_ref,
pronoun,
tense,
conj_form,
passive_root,
passive_binyan_heb,
passive_voice,
audio_tag,
],
)
deck.add_note(note)
note_count += 1
logger.info(
f"Conjugation deck: {note_count} notes across "
@ -663,13 +621,27 @@ def _font_media_files() -> list[str]:
return [str(p) for p in font_paths if p.exists()]
class _RandomOrderPackage(genanki.Package):
    """genanki.Package subclass that sets new card order to random (0) instead of insertion order (1)."""
    def write_to_db(self, cursor, timestamp, id_gen):
        super().write_to_db(cursor, timestamp, id_gen)
        row = cursor.execute("SELECT dconf FROM col").fetchone()
        if row:
            dconf = json.loads(row[0])
            for conf in dconf.values():
                if isinstance(conf, dict) and "new" in conf:
                    conf["new"]["order"] = 0
            cursor.execute("UPDATE col SET dconf = ?", [json.dumps(dconf)])
def write_vocab_apkg(
deck: genanki.Deck,
media_files: list[Path],
out_path: Path = VOCAB_APKG,
) -> None:
out_path.parent.mkdir(parents=True, exist_ok=True)
pkg = genanki.Package(deck)
pkg = _RandomOrderPackage(deck)
pkg.media_files = [str(p) for p in media_files if p.exists()] + _font_media_files()
pkg.write_to_file(str(out_path))
logger.info(f"Vocabulary deck written → {out_path}")
@ -681,7 +653,7 @@ def write_conj_apkg(
out_path: Path = CONJ_APKG,
) -> None:
out_path.parent.mkdir(parents=True, exist_ok=True)
pkg = genanki.Package(deck)
pkg = _RandomOrderPackage(deck)
base = [str(p) for p in (media_files or []) if p.exists()]
pkg.media_files = base + _font_media_files()
pkg.write_to_file(str(out_path))

@ -478,36 +478,6 @@ def _extract_conjugations(slug: str, search_term: str, is_3ms_search: bool = Fal
"tense": TENSE_DESCRIPTION.get(key, ""),
}
# Parse passive forms if present on this page (Pi'el/Hif'il pages have passive partner)
passive_forms_raw = _parse_table(soup, passive=True)
if passive_forms_raw:
passive_binyan = _extract_passive_binyan_from_page(soup)
if not passive_binyan:
# Infer: Pi'el → Pu'al, Hif'il → Huf'al
passive_binyan = "Pu'al" if binyan == "Pi'el" else "Huf'al" if binyan == "Hif'il" else ""
passive_past_3ms = passive_forms_raw.get("past_3ms", {}).get("form", "")
passive_result = {
"infinitive": search_term,
"slug": slug,
"root": root,
"binyan": passive_binyan,
"is_passive": True,
"reference_form": passive_past_3ms or search_term,
"reference_active_infinitive": reference_form,
"forms": {},
}
for key, form_data in passive_forms_raw.items():
if key in PRONOUN_LABELS:
passive_result["forms"][key] = {
"form": form_data["form"],
"audio_url": form_data.get("audio_url", ""),
"pronoun": PRONOUN_LABELS[key],
"tense": TENSE_DESCRIPTION.get(key, ""),
}
result["passive_partner"] = passive_result
logger.info(f" Passive partner ({passive_binyan}): {len(passive_result['forms'])} forms")
logger.info(f" Extracted {len(result['forms'])} forms for {search_term}")
return result
@ -525,6 +495,61 @@ def _save_conjugations(data: dict) -> None:
json.dump(data, f, ensure_ascii=False, indent=2)
def _extract_passive_from_active_slug(active_slug: str, search_term: str) -> dict | None:
    """Fetch active verb page and extract only the passive section forms.
    Used for Pu'al/Huf'al 3ms entries where we know the active verb's slug."""
    url = f"{PEALIM_BASE}/dict/{active_slug}/"
    try:
        resp = session.get(url, cookies={"hebstyle": "mo"}, timeout=REQUEST_TIMEOUT)
        resp.raise_for_status()
    except Exception as e:
        logger.error(f" Error fetching {url}: {e}")
        return None
    soup = BeautifulSoup(resp.text, "lxml")
    root = ""
    for span in soup.find_all("span", class_="menukad"):
        txt = span.get_text(strip=True)
        if txt and re.search(r"[\u05d0-\u05ea]", txt) and "-" in txt:
            root = txt
            break
    active_binyan = _extract_binyan_from_page(soup)
    active_forms_raw = _parse_table(soup, passive=False)
    active_infinitive = active_forms_raw.get("infinitive", {}).get("form", "")
    passive_forms_raw = _parse_table(soup, passive=True)
    if not passive_forms_raw:
        logger.warning(f" No passive forms found on {active_slug} for {search_term}")
        return None
    passive_binyan = _extract_passive_binyan_from_page(soup)
    if not passive_binyan:
        passive_binyan = "Pu'al" if active_binyan == "Pi'el" else "Huf'al" if active_binyan == "Hif'il" else ""
    result = {
        "infinitive": search_term,
        "slug": active_slug,
        "root": root,
        "binyan": passive_binyan,
        "is_passive": True,
        "reference_form": active_infinitive or search_term,
        "forms": {},
    }
    for key, form_data in passive_forms_raw.items():
        if key in PRONOUN_LABELS:
            result["forms"][key] = {
                "form": form_data["form"],
                "audio_url": form_data.get("audio_url", ""),
                "pronoun": PRONOUN_LABELS[key],
                "tense": TENSE_DESCRIPTION.get(key, ""),
            }
    logger.info(f" Extracted {len(result['forms'])} passive forms for {search_term} from {active_slug}")
    return result
def main(verbs_file: Path = VERBS_INPUT) -> dict:
"""Read verbs from file and extract conjugations. Returns full conjugations dict."""
if not verbs_file.exists():
@ -533,43 +558,78 @@ def main(verbs_file: Path = VERBS_INPUT) -> dict:
raw_lines = verbs_file.read_text(encoding="utf-8").splitlines()
# Parse: regular verbs and # 3ms: lines
verbs: list[tuple[str, bool]] = [] # (search_term, is_3ms_search)
# Parse slug overrides: "# slug: VERB SLUG" anywhere in the file
slug_overrides: dict[str, str] = {}
for line in raw_lines:
line = line.strip()
if not line:
stripped = line.strip()
if stripped.startswith("# slug:"):
parts = stripped[len("# slug:"):].strip().split()
if len(parts) >= 2:
slug_overrides[parts[0]] = parts[1]
# Parse: regular verbs and # 3ms: lines (optional active slug on 3ms lines)
verbs: list[tuple[str, bool, str | None]] = [] # (search_term, is_3ms_search, active_slug)
for line in raw_lines:
stripped = line.strip()
if not stripped or stripped.startswith("# slug:"):
continue
if line.startswith("# 3ms:"):
form = line[len("# 3ms:"):].strip()
if form:
verbs.append((form, True))
elif line.startswith("#"):
if stripped.startswith("# 3ms:"):
parts = stripped[len("# 3ms:"):].strip().split()
if parts:
form = parts[0]
active_slug = parts[1] if len(parts) >= 2 else None
verbs.append((form, True, active_slug))
elif stripped.startswith("#"):
continue
else:
verbs.append((line, False))
verbs.append((stripped, False, None))
logger.info(f"Loaded {len(verbs)} verbs from {verbs_file} "
f"({sum(1 for _, p in verbs if p)} passive 3ms)")
f"({sum(1 for _, p, _ in verbs if p)} passive 3ms)")
if slug_overrides:
logger.info(f" Slug overrides: {slug_overrides}")
conjugations = _load_conjugations()
new_count = 0
for verb, is_3ms in verbs:
for verb, is_3ms, active_slug in verbs:
if verb in conjugations:
logger.info(f"Skipping {verb} (cached)")
continue
logger.info(f"Processing: {verb} {'(3ms search)' if is_3ms else ''}")
time.sleep(REQUEST_DELAY)
slug = _find_slug(verb)
if not slug:
logger.warning(f" No slug found for {verb}")
conjugations[verb] = None
_save_conjugations(conjugations)
continue
time.sleep(REQUEST_DELAY)
data = _extract_conjugations(slug, verb, is_3ms_search=is_3ms)
if is_3ms:
# Passive-only extraction: use provided active slug or search to find it
if active_slug:
slug = active_slug
logger.info(f" Using active slug {slug} for passive extraction")
else:
slug = _find_slug(verb)
if not slug:
logger.warning(f" No slug found for {verb}")
conjugations[verb] = None
_save_conjugations(conjugations)
continue
logger.info(f" Found active slug {slug} for passive extraction")
time.sleep(REQUEST_DELAY)
data = _extract_passive_from_active_slug(slug, verb)
else:
override = slug_overrides.get(verb)
if override:
logger.info(f" Slug override: {override}")
slug = override
else:
slug = _find_slug(verb)
if not slug:
logger.warning(f" No slug found for {verb}")
conjugations[verb] = None
_save_conjugations(conjugations)
continue
time.sleep(REQUEST_DELAY)
data = _extract_conjugations(slug, verb, is_3ms_search=False)
conjugations[verb] = data
_save_conjugations(conjugations)
new_count += 1

File diff suppressed because it is too large

13
run.py

@ -6,6 +6,7 @@ Usage:
python run.py [options]
Options:
--only {vocab,conjugations} Run only one deck (skips all unrelated steps)
--skip-scrape Use existing data/pealim_dict.csv (no pealim.com dict scraping)
--skip-audio Skip audio .mp3 downloads
--skip-examples Skip Ben Yehuda example fetching
@ -40,6 +41,7 @@ FONTS_DIR = DATA_DIR / "fonts"
def parse_args():
p = argparse.ArgumentParser(description="Pealim Anki deck builder")
p.add_argument("--only", choices=["vocab", "conjugations"], help="Run only one deck (skips all unrelated steps)")
p.add_argument("--skip-scrape", action="store_true", help="Skip dict scraping; use cached CSV")
p.add_argument("--skip-audio", action="store_true", help="Skip audio downloads")
p.add_argument("--skip-examples", action="store_true", help="Skip Ben Yehuda example lookup")
@ -451,12 +453,23 @@ def main():
logger.info("=" * 60)
logger.info("PEALIM ANKI DECK BUILDER")
if args.only:
logger.info(f" MODE: --only {args.only}")
if args.test:
logger.info(f" TEST MODE: {args.test} words")
if args.refresh_examples:
logger.info(" REFRESH EXAMPLES: Ben Yehuda index will be rebuilt")
logger.info("=" * 60)
if args.only == "conjugations":
step_fonts(args)
conjugations = step_conjugations(args)
print_summary(args, {}, {}, conjugations or {})
return
if args.only == "vocab":
args.skip_conjugations = True
step_scrape(args)
freq_cache = step_frequency()
examples_cache = step_examples(args, freq_cache)

@ -1,7 +1,5 @@
# Verb list — validated against pealim.com from nevo_typed_verbs_from_modern_hebrew
# Lines prefixed '# 3ms:' are searched by 3ms past form (Pu'al/Huf'al).
# Lines prefixed '# REVIEW:' need manual correction before conjugation extraction.
# Lines prefixed '# NOT_FOUND:' had no pealim.com result — check spelling.
# Pa'al (פָּעַל)
לשמור
@ -12,11 +10,13 @@
לאכול
לשאול
לשלוח
לגבוה
לשבת
לרשת
לפול
לִיפּוֹל
לקום
לשים
# slug: לָשִׂים 45-lasim
לָשִׂים
לחון
לקרוא
לקנות
@ -24,6 +24,7 @@
# Nif'al (נִפְעַל)
להיבדק
להרדם
להיהרג
להחקר
להישאר
להיפגע
@ -44,11 +45,11 @@
לגלגל
# Pu'al (פֻּעַל) — 3ms past, no infinitive
# 3ms: בותל
# 3ms: תואם
# 3ms: בוטל 214-levatel
# 3ms: תואם 2344-letaem
# 3ms: קומם
# 3ms: דוכא
# 3ms: זוכה
# 3ms: זוכה 503-lezakot
# 3ms: פורסם
# Hitpa'el (הִתְפַּעֵל)
@ -72,19 +73,14 @@
להקים
להמציא
להרשות
להקל
# Huf'al (הֻפְעַל) — 3ms past, no infinitive
# 3ms: הוגבל
# 3ms: העבר
# 3ms: העבר 1442-lehaavir
# 3ms: הוזהר
# 3ms: הופל
# 3ms: הוקם
# 3ms: הוחל
# 3ms: הוקפא
# 3ms: הופנה
# ── Entries flagged for manual review ──────────────────────────────────────────
# REVIEW: לגבוה — not a standard infinitive form; likely defective spelling or wrong word. Response: see slug 286-ligboah
# REVIEW: לההרג — extra ה; should probably be להיהרג (Nif'al of הרג) Response: correct, nifal of harag
# REVIEW: להתלקלח — not a real word; likely typo for להתקלקל Response: correct, it's a typo
# REVIEW: להקלל — ambiguous: could be Hif'il לְהָקֵל (to ease) or Nif'al of קלל Response: it's lehakel, to ease