Sprint 9: cloze cards, plurals deck, project reorg, lint tooling

- Cloze card pipeline: 924 cards from 2,296 AI-vetted Hebrew book sentences
- Plurals deck: 375 notes (144 irregular + 231 regular from 86 mishkal patterns)
- Ktiv male forms expanded to 20,711 entries for sentence matching
- Project reorg: helpers.py (deduped strip_nikkud from 10 files), scripts/ for
  one-off tools, tests/ with smoke tests, deleted 3 dead files
- Lint tooling: pyproject.toml with ruff/vulture/bandit/pytest config, .editorconfig,
  fixed all 129 ruff errors (B023 closure fix, SIM103, unused vars)
- validate_apkg.py: card count range check for optional cloze template
- Data caches committed: vetted_sentences, ktiv_male_forms, noun_plurals,
  noun_slug_map, vocab_sentence_matches, epub_sentence_index

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sochen 2026-03-07 08:09:39 +00:00
parent 419e952389
commit 17f7458d19
37 changed files with 330541 additions and 871 deletions

.editorconfig (new file, 15 lines changed)

@@ -0,0 +1,15 @@
root = true
[*]
indent_style = space
indent_size = 4
end_of_line = lf
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true
[*.{json,yml,yaml,toml}]
indent_size = 2
[*.md]
trim_trailing_whitespace = false

.gitignore (15 lines changed)

@@ -11,6 +11,7 @@ pyvenv.cfg
 venv/
 __pycache__/
 *.pyc
+.pytest_cache/
 # Large generated cache files (rebuild locally)
 data/benyehuda_index.json
@@ -31,6 +32,20 @@ ANKIWEB_DESCRIPTION.md
 PROJECTS.md
 SPRINT_LOG.md
 CLAUDE.md
+RECOMMENDATIONS.md
+# Intermediate scrape progress files
+data/ktiv_male_forms.json.partial
+data/ktiv_male_forms_partial.json
+data/ktiv_scrape_progress.json
+data/noun_slug_map_progress.json
+data/top_verbs_to_scrape.json
+# EPUB source files (large; user-specific)
+data/epubs/
+# Stray deck files
+Everything__*.apkg
 # Release artifacts — distributed via Forgejo releases, not committed to tree
 releases/

README.md (170 lines changed)

@@ -6,16 +6,17 @@
## For Hebrew learners ## For Hebrew learners
This project generates two Anki decks for learning Modern Hebrew: A set of Anki flashcard decks for learning Modern Hebrew — vocabulary, verb conjugations, and more. All words include nikkud (vowel marks), audio, and are sorted by frequency so you learn the most useful words first.
- **Vocabulary deck** — ~9,100 words from [pealim.com](https://www.pealim.com/dict/), with nikkud (vowel marks), roots, parts of speech, related words, and example sentences from classic Hebrew literature. ### What's included
- **Conjugation deck** — 70 paradigm verbs from Coffin & Bolozky's *A Reference Grammar of Modern Hebrew* (2005), fully conjugated in all tenses and persons, across all seven binyanim.
All card data comes from open or academic sources: - **Vocabulary** — ~9,100 Hebrew words with pronunciation audio, roots, example sentences from Hebrew literature, images, and frequency rankings.
- Word data: [pealim.com](https://www.pealim.com) — a free Modern Hebrew dictionary - **Verb conjugations** — 71 core verbs fully conjugated in all tenses and persons, covering all seven binyanim (verb patterns).
- Example sentences: [Project Ben-Yehuda](https://benyehuda.org) — public-domain Hebrew literature corpus - **Confusables** — Words that look the same without vowel marks (e.g., דָּבָר "thing" vs. דִּבֵּר "spoke") shown side by side so you can tell them apart.
- Word frequency: [hermitdave/FrequencyWords](https://github.com/hermitdave/FrequencyWords) — Hebrew frequency list - **Noun plurals** — Practice forming singular↔plural pairs, with a focus on irregular plurals and common patterns.
- Verb paradigm list: Coffin, Edna Amir and Shmuel Bolozky. *A Reference Grammar of Modern Hebrew*. Cambridge University Press, 2005. - **All-in-one** — A combined deck with everything above, organized as subdecks.
You can download and import any deck individually — or use the combined deck to get everything at once.
--- ---
@@ -25,17 +26,19 @@ All card data comes from open or academic sources:
2. Double-click to import into [Anki](https://apps.ankiweb.net/) (free, cross-platform) 2. Double-click to import into [Anki](https://apps.ankiweb.net/) (free, cross-platform)
3. Start studying 3. Start studying
Both decks can be imported independently. If you already have one, re-importing the same file updates your deck without losing study progress. All decks can be imported independently — pick just the ones you want. Re-importing the same file later updates your deck without losing study progress.
--- ---
## What's in the vocabulary deck ## What's in the vocabulary deck
Each card has two sides: Each note generates up to three cards:
**Hebrew → English:** See the Hebrew word (with nikkud) + hear audio → recall the meaning. **Hebrew → English:** See the Hebrew word (with nikkud) + hear audio → recall the meaning.
**English → Hebrew:** See the English meaning → recall the Hebrew word, its root, and how to write it. **English → Hebrew:** See the English meaning → recall the Hebrew word. When multiple words share the same English meaning, a disambiguation hint (part of speech + binyan) helps you know which word is expected.
**Sentence Cloze:** A Hebrew sentence with the target word blanked out → fill in the missing word. Only generated for words with a vetted example sentence. Tests recognition in context.
Fields on each card: Fields on each card:
| Field | Example | | Field | Example |
@@ -43,56 +46,84 @@ Fields on each card:
| Hebrew word (nikkud) | שָׁמַר | | Hebrew word (nikkud) | שָׁמַר |
| Meaning | kept, watched over | | Meaning | kept, watched over |
| Root | שמ״ר | | Root | שמ״ר |
| Part of speech | פועל (verb) | | Part of speech | פועל — פָּעַל |
| Without nikkud | שמר | | Without nikkud | שמר |
| Related words | שׁוֹמֵר, שְׁמִירָה | | Related words | שׁוֹמֵר, שְׁמִירָה (grouped by Part of Speech) |
| Example sentence | from Ben-Yehuda corpus | | Example sentence | from nikkud'd Hebrew books |
| Audio | pronunciation from pealim.com | | Audio | pronunciation from pealim.com |
| Frequency rank | #412 | | Frequency rank | #412 |
| Image / Emoji | for concrete nouns |
| Plural form | for nouns: רבים: שֻׁלְחָנוֹת |
| Disambiguation hint | for ambiguous Eng→Heb cards |
Cards are presented in **frequency order** — Anki will show you the most common words first. Frequency rank is displayed on every card so you can see how common each word is. Words not in the top 50,000 show a "50k+" badge. Cards are presented in **frequency order** — Anki will show you the most common words first.
### Eng→Heb disambiguation
When two Hebrew words translate to the same English (e.g., both mean "to return"), the Eng→Heb card shows a hint to tell them apart:
- **Layer 1:** Automatic Part of Speech + binyan hints for words with different parts of speech (163 words)
- **Layer 2:** AI-refined distinct glosses for true synonyms sharing the same Part of Speech (440 words)
--- ---
## What's in the conjugation deck ## What's in the conjugation deck
70 paradigm verbs from Coffin & Bolozky's *A Reference Grammar of Modern Hebrew* (Appendix 1), covering all seven binyanim: 71 verbs listed in Appendix 1 of Coffin & Bolozky's *A Reference Grammar of Modern Hebrew* covering all seven binyanim, and **all irregular forms**
- פָּעַל (Pa'al), נִפְעַל (Nif'al), פִּעֵל (Pi'el), פֻּעַל (Pu'al) - פָּעַל (Pa'al), נִפְעַל (Nif'al), פִּעֵל (Pi'el), פֻּעַל (Pu'al)
- הִתְפַּעֵל (Hitpa'el), הִפְעִיל (Hif'il), הֻפְעַל (Huf'al) - הִתְפַּעֵל (Hitpa'el), הִפְעִיל (Hif'il), הֻפְעַל (Huf'al)
Each verb is drilled in: present, past, future, and imperative — all persons and genders. The infinitive is shown on the card front as context but is not quizzed. Each verb is drilled in: present, past, future, and imperative — all persons and genders. Each card shows the English meaning and related vocabulary from the same root.
**Present tense expansion:** Each present form generates 3 cards (one per pronoun that uses it), so you learn אֲנִי, אַתָּה, and הוּא all separately with the same masculine singular form. **Present tense expansion:** Each present tense form randomly generates a pronoun to be shown in the front of the card, so you acclimate to seeing אֲנִי, אַתָּה, and הוּא with the conjugated verb, even though they are all conjugated the same in present tense.
**Modern Hebrew 2fp/3fp:** Classical feminine plural future forms (e.g., תִּשְׁמֹרְנָה) are shown in parentheses; the card's primary answer is the modern masculine plural form used in everyday speech. **Modern Hebrew 2fp/3fp:** Classical feminine plural future forms (e.g., תִּשְׁמֹרְנָה) are shown in parentheses, and played via audio (for the audio-included decks). the card's primary answer is the modern masculine plural form used in everyday speech.
**Passive label:** Pu'al and Huf'al cards show the active partner's infinitive on the front (e.g., לְבַטֵּל) followed by **(סָבִיל)** in smaller text, so you know you're drilling the passive conjugation. Active verbs show no label. **Passive label:** Pu'al and Huf'al cards show the active partner's infinitive on the front (e.g., לְבַטֵּל) followed by **(סָבִיל)** in smaller text, so you know you're drilling the passive conjugation.
**Card order:** New cards are introduced in random order. **Card order:** New conjugation cards are introduced in random order (not grouped by verb).
**Citation:** Coffin, Edna Amir and Shmuel Bolozky. *A Reference Grammar of Modern Hebrew*. Cambridge University Press, 2005. ---
## What's in the confusables deck
Hebrew without vowel marks is full of lookalikes. This deck groups words that are spelled identically without nikkud and asks "מה ההבדל?" (what's the difference?). The answer reveals all the words side by side with their nikkud and definitions.
Examples: דָּבָר (thing) vs. דִּבֵּר (spoke), סֵפֶר (book) vs. סָפַר (counted) vs. סַפָּר (barber).
---
## What's in the plurals deck
Two card directions for each noun:
- **Singular → Plural:** See שֻׁלְחָן → produce שֻׁלְחָנוֹת
- **Plural → Singular:** See שֻׁלְחָנוֹת → produce שֻׁלְחָן
Focuses on irregular plurals (the tricky ones that don't follow the rules) and common examples from each noun pattern. Cards are tagged by pattern for filtered study.
--- ---
## Suggested study strategy ## Suggested study strategy
Start with the vocabulary deck. Anki will present the most frequent words first. Don't try to study to many cards every single day-- Anki suggests 20 per day. Start with the vocabulary deck. Anki will present the most frequent words first. Don't try to study too many cards every single day — Anki suggests 20 per day.
The conjugation cards reinforce verb forms you've already seen in vocabulary. The conjugation cards reinforce verb forms you've already seen in vocabulary.
Use the Hebrew → English direction to build reading comprehension. Use the English → Hebrew direction to build writing and speaking recall. Use the Hebrew → English direction to build reading comprehension. Use the English → Hebrew direction to build writing and speaking recall. The sentence cloze cards test whether you can recognize words in real Hebrew text.
--- ---
## About the data sources ## About the data sources
**pealim.com** — A comprehensive free Modern Hebrew dictionary with nikkud, roots, conjugations, and audio. This project scrapes the public dictionary and conjugation tables. **pealim.com** — A comprehensive free Modern Hebrew dictionary with nikkud, roots, conjugations, and audio. This project scrapes the public dictionary and conjugation tables.
**Project Ben-Yehuda** — A public-domain digital library of Hebrew literature. Example sentences come from the nikkud corpus (classic texts with full vowel marks). **Project Ben-Yehuda** — A public-domain digital library of Hebrew literature. Example sentences come from the nikkud corpus (classic texts with full vowel marks).
**Hebrew books** — Additional example sentences from nikkud'd (menukad) Hebrew books, with Claude Sonnet AI-vetted quality filtering. The AI doesn't generate the sentences, it just determines whether it is a high quality sentence as an example, or not.
**FrequencyWords** — An open Hebrew word frequency list derived from subtitle data. Used to sort vocabulary cards from most to least common. **FrequencyWords** — An open Hebrew word frequency list derived from subtitle data. Used to sort vocabulary cards from most to least common.
**Coffin & Bolozky** — The verb paradigm list for the conjugation deck comes from Appendix 1 of *A Reference Grammar of Modern Hebrew* (Cambridge University Press, 2005), which provides a comprehensive reference for Modern Hebrew verbal morphology. **Coffin & Bolozky** — The verb list, and known good conjugation reference for the conjugation deck comes from Appendix 1 of *A Reference Grammar of Modern Hebrew* (Cambridge University Press, 2005).
--- ---
@@ -100,9 +131,9 @@ Use the Hebrew → English direction to build reading comprehension. Use the Eng
If you notice a wrong translation, missing audio, or incorrect conjugation: If you notice a wrong translation, missing audio, or incorrect conjugation:
- For vocabulary errors: the source is pealim.com — you can suggest corrections there. But if you think morfix has a correct translation and pealim.com does not, we may be able to encode an override. - For vocabulary errors: the source is pealim.com — you can suggest corrections there. But if you think morfix has a correct translation and pealim.com does not, we may be able to encode an override.
For any other issue, whether you know to code or not: Email me at pealim [at] nevo [dot] engineer For any other issue, whether you know how to code or not: Email me at hebrew [at] nevo [dot] engineer
--- ---
@@ -136,45 +167,78 @@ python run.py --skip-scrape --refresh-examples
``` ```
python run.py [options] python run.py [options]
--skip-scrape Use cached data/hebrew_dict.csv (no pealim.com scraping) --only {vocab,conjugations,confusables,plurals,complete}
--skip-audio Skip audio .mp3 downloads Build only one deck type
--skip-examples Skip Ben Yehuda example fetching --skip-scrape Use cached data/hebrew_dict.csv
--only {vocab,conjugations} Run only one deck (skips all unrelated steps) --skip-audio Skip audio .mp3 downloads
--skip-conjugations Skip verb conjugation extraction (deprecated: use --only vocab) --skip-examples Skip Ben Yehuda example fetching
--skip-images Skip image fetching for concrete nouns --skip-conjugations Skip verb conjugation extraction
--refresh-examples Force rebuild of Ben Yehuda index (nikkud corpus) --skip-images Skip image fetching for concrete nouns
--test N Process only first N words --refresh-examples Force rebuild of Ben Yehuda index
--test N Process only first N words
``` ```
### Output files ### Output files
| File | Description | | File | Description |
|------|-------------| |------|-------------|
| `data/hebrew_dict.csv` | Raw dictionary | | `output/hebrew_vocabulary.apkg` | Vocabulary deck (text only) |
| `data/hebrew_dict_for_anki.csv` | Enriched Anki CSV | | `output/hebrew_vocabulary_audio.apkg` | Vocabulary deck + audio |
| `data/conjugations.json` | Verb conjugation data | | `output/hebrew_vocabulary_images.apkg` | Vocabulary deck + images |
| `data/audio/` | Vocabulary audio (.mp3) | | `output/hebrew_vocabulary_audio_images.apkg` | Vocabulary deck + audio + images |
| `data/audio_conj/` | Conjugation audio (.mp3) | | `output/hebrew_conjugations.apkg` | Conjugation deck |
| `data/fonts/` | Heebo font files (bundled in .apkg) | | `output/hebrew_conjugations_audio.apkg` | Conjugation deck + audio |
| `data/images/` | Noun images from Wikipedia/Commons | | `output/hebrew_confusables.apkg` | Confusables deck |
| `data/image_cache.json` | Image fetch cache | | `output/hebrew_confusables_audio.apkg` | Confusables deck + audio |
| `output/hebrew_vocabulary.apkg` | Vocabulary Anki deck | | `output/hebrew_plurals.apkg` | Plurals deck |
| `output/hebrew_conjugations.apkg` | Conjugation Anki deck | | `output/hebrew_plurals_audio.apkg` | Plurals deck + audio |
| `output/hebrew_complete.apkg` | All decks combined |
| `output/hebrew_complete_audio.apkg` | All decks combined + audio |
### Data files
| File | Description |
|------|-------------|
| `data/hebrew_dict_for_anki.csv` | Enriched vocabulary CSV |
| `data/conjugations.json` | Verb conjugation data (71 verbs) |
| `data/noun_plurals.json` | Noun plural/construct forms |
| `data/refined_meanings.json` | AI-disambiguated meanings (440 words) |
| `data/vetted_sentences.json` | AI-vetted example sentences |
| `data/ktiv_male_forms.json` | Ktiv male (plene) forms for sentence matching |
| `data/legacy_guid_map.json` | Legacy GUIDs for study progress preservation |
### Pipeline overview ### Pipeline overview
1. `hebrew_extract.py` — scrapes pealim.com dictionary 1. `hebrew_extract.py` — scrapes pealim.com dictionary
2. `frequency_lookup.py` — downloads/loads Hebrew frequency data 2. `frequency_lookup.py` — downloads/loads Hebrew frequency data
3. `benyehuda.py` — builds sentence index from Ben-Yehuda corpus 3. `benyehuda.py` — builds sentence index from Ben-Yehuda nikkud corpus
4. `extract_verb_list.py` — extracts verb list from Coffin & Bolozky PDF 4. `extract_verb_list.py` — extracts verb list from Coffin & Bolozky PDF
5. `conjugation_extract.py` — fetches conjugation tables from pealim.com 5. `conjugation_extract.py` — fetches conjugation tables + meanings from pealim.com
6. `image_fetch.py` — fetches Wikipedia/Commons images for concrete nouns 6. `image_fetch.py` — fetches Wikipedia/Commons images for concrete nouns
7. `validate_verb_list.py` — validates verb list against pealim.com 7. `scrape_noun_plurals.py` — scrapes noun plural/construct forms from pealim.com
8. `apkg_builder.py` — assembles both `.apkg` files 8. `scrape_ktiv_male.py` — scrapes ktiv male (plene) forms for sentence matching
9. `run.py` — orchestrates all steps 9. `rebuild_sentence_matches.py` — matches vocab words to book sentences
10. `apkg_builder.py` — assembles all `.apkg` files
11. `run.py` — orchestrates all steps
12. `validate_apkg.py` — validates output decks
---
## Deck variants
| Variant | Contents | Size |
|---------|----------|------|
| `hebrew_vocabulary.apkg` | Text + images | ~15 MB |
| `hebrew_vocabulary_audio.apkg` | Text + images + audio | ~80 MB |
| `hebrew_conjugations.apkg` | Text only | ~1 MB |
| `hebrew_conjugations_audio.apkg` | Text + audio | ~5 MB |
| `hebrew_confusables.apkg` | Text only | ~1 MB |
| `hebrew_plurals.apkg` | Text only | ~1 MB |
| `hebrew_complete.apkg` | Everything combined | ~20 MB |
| `hebrew_complete_audio.apkg` | Everything + audio | ~90 MB |
--- ---
## AnkiWeb ## AnkiWeb
The decks will be published as shared decks on AnkiWeb (TBD). The decks will be published as shared decks on AnkiWeb (TBD).
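
The option list in the README hunk above could be parsed along these lines. This is a sketch only: `run.py` itself is not shown in this diff, so the use of `argparse` and the names `build_parser`, `DECKS`, and `SKIP_FLAGS` are assumptions. Only the flag names, choices, and help strings come from the README.

```python
import argparse

# Deck choices as listed for --only in the README.
DECKS = ["vocab", "conjugations", "confusables", "plurals", "complete"]

# (flag, help text) pairs copied from the README option list.
SKIP_FLAGS = [
    ("--skip-scrape", "Use cached data/hebrew_dict.csv"),
    ("--skip-audio", "Skip audio .mp3 downloads"),
    ("--skip-examples", "Skip Ben Yehuda example fetching"),
    ("--skip-conjugations", "Skip verb conjugation extraction"),
    ("--skip-images", "Skip image fetching for concrete nouns"),
    ("--refresh-examples", "Force rebuild of Ben Yehuda index"),
]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("--only", choices=DECKS, help="Build only one deck type")
    for flag, help_text in SKIP_FLAGS:
        # Every skip/refresh option is a plain on/off switch.
        parser.add_argument(flag, action="store_true", help=help_text)
    parser.add_argument("--test", type=int, metavar="N", help="Process only first N words")
    return parser


# Example: build only the plurals deck from cached data, limited to 50 words.
args = build_parser().parse_args(["--only", "plurals", "--skip-scrape", "--test", "50"])
```

With `choices=DECKS`, an unknown deck name passed to `--only` is rejected at parse time rather than deep inside the pipeline.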

File diff suppressed because it is too large


@@ -14,20 +14,18 @@ Exposed API:
import json import json
import logging import logging
import re import re
import unicodedata
import zipfile import zipfile
from io import BytesIO from io import BytesIO
from pathlib import Path from pathlib import Path
import requests import requests
from helpers import strip_nikkud as _strip_nikkud
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
# Nikkud-bearing corpus (txt.zip instead of txt_stripped.zip) # Nikkud-bearing corpus (txt.zip instead of txt_stripped.zip)
CORPUS_URL = ( CORPUS_URL = "https://github.com/projectbenyehuda/public_domain_dump/releases/download/2025-10/txt.zip"
"https://github.com/projectbenyehuda/public_domain_dump/releases/"
"download/2025-10/txt.zip"
)
INDEX_PATH = Path(__file__).parent / "data" / "benyehuda_index.json" INDEX_PATH = Path(__file__).parent / "data" / "benyehuda_index.json"
EXAMPLES_CACHE_PATH = Path(__file__).parent / "data" / "examples_cache.json" EXAMPLES_CACHE_PATH = Path(__file__).parent / "data" / "examples_cache.json"
REQUEST_TIMEOUT = 120 REQUEST_TIMEOUT = 120
@@ -36,15 +34,8 @@ MAX_SENTENCE_LEN = 200
MAX_INDEX_ENTRIES = 500 # cap examples kept per word in index to limit memory MAX_INDEX_ENTRIES = 500 # cap examples kept per word in index to limit memory
# Module-level state # Module-level state
_index: dict[str, list[str]] = {} # word (with nikkud) -> [sentence, ...] _index: dict[str, list[str]] = {} # word (with nikkud) -> [sentence, ...]
_examples_cache: dict[str, list[str]] = {} # word -> cached result for this run _examples_cache: dict[str, list[str]] = {} # word -> cached result for this run
def _strip_nikkud(text: str) -> str:
return "".join(
ch for ch in unicodedata.normalize("NFD", text)
if unicodedata.category(ch) != "Mn"
)
def _split_sentences(text: str) -> list[str]: def _split_sentences(text: str) -> list[str]:
@@ -73,7 +64,7 @@ def _build_index(corpus_zip_bytes: bytes) -> None:
for fname in txt_files: for fname in txt_files:
try: try:
raw = zf.read(fname).decode("utf-8", errors="ignore") raw = zf.read(fname).decode("utf-8", errors="ignore")
except Exception: except Exception: # noqa: S112
continue continue
for sentence in _split_sentences(raw): for sentence in _split_sentences(raw):
# Index by each unique Hebrew token (with nikkud) in the sentence # Index by each unique Hebrew token (with nikkud) in the sentence
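
The local `_strip_nikkud` deleted above is now imported from `helpers.py`, which is not shown in this diff. A minimal sketch of the shared helper, assuming it keeps the body of the removed copies (NFD normalization, then dropping every character in Unicode category `Mn`):

```python
import unicodedata


def strip_nikkud(text: str) -> str:
    """Remove Hebrew nikkud (and any other combining marks) from a string."""
    # NFD splits precomposed characters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the Hebrew vowel points and dots.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    print(strip_nikkud("שָׁמַר"))
```

Because the filter is by Unicode category rather than by Hebrew code-point range, it also strips combining accents from non-Hebrew text (for example, `é` becomes `e`).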


@@ -19,13 +19,14 @@ import json
import logging import logging
import re import re
import time import time
import unicodedata
import urllib.parse import urllib.parse
from pathlib import Path from pathlib import Path
import requests import requests
from bs4 import BeautifulSoup from bs4 import BeautifulSoup
from helpers import strip_nikkud as _strip_nikkud
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
PEALIM_BASE = "https://www.pealim.com" PEALIM_BASE = "https://www.pealim.com"
@@ -34,10 +35,14 @@ REQUEST_TIMEOUT = 15
VERBS_INPUT = Path(__file__).parent / "verbs_input.txt" VERBS_INPUT = Path(__file__).parent / "verbs_input.txt"
CONJUGATIONS_PATH = Path(__file__).parent / "data" / "conjugations.json" CONJUGATIONS_PATH = Path(__file__).parent / "data" / "conjugations.json"
DICT_CSV = next( DICT_CSV = next(
(p for p in [ (
Path(__file__).parent / "data" / "hebrew_dict_for_anki.csv", p
Path(__file__).parent / "data" / "pealim_dict_for_anki.csv", for p in [
] if p.exists()), Path(__file__).parent / "data" / "hebrew_dict_for_anki.csv",
Path(__file__).parent / "data" / "pealim_dict_for_anki.csv",
]
if p.exists()
),
Path(__file__).parent / "data" / "hebrew_dict_for_anki.csv", Path(__file__).parent / "data" / "hebrew_dict_for_anki.csv",
) )
@@ -47,17 +52,17 @@ PRONOUN_LABELS = {
"present_fs": "", "present_fs": "",
"present_mp": "", "present_mp": "",
"present_fp": "", "present_fp": "",
"past_1s": "אֲנִי", "past_1s": "אֲנִי",
"past_1p": "אֲנַחְנוּ", "past_1p": "אֲנַחְנוּ",
"past_2ms": "אַתָּה", "past_2ms": "אַתָּה",
"past_2fs": "אַתְּ", "past_2fs": "אַתְּ",
"past_2mp": "אַתֶּם", "past_2mp": "אַתֶּם",
"past_2fp": "אַתֶּן", "past_2fp": "אַתֶּן",
"past_3ms": "הוּא", "past_3ms": "הוּא",
"past_3fs": "הִיא", "past_3fs": "הִיא",
"past_3p": "הֵם / הֵן", "past_3p": "הֵם / הֵן",
"future_1s": "אֲנִי", "future_1s": "אֲנִי",
"future_1p": "אֲנַחְנוּ", "future_1p": "אֲנַחְנוּ",
"future_2ms": "אַתָּה", "future_2ms": "אַתָּה",
"future_2fs": "אַתְּ", "future_2fs": "אַתְּ",
"future_2mp": "אַתֶּם", "future_2mp": "אַתֶּם",
@@ -79,17 +84,17 @@ TENSE_DESCRIPTION = {
"present_fs": "הוֹוֶה", "present_fs": "הוֹוֶה",
"present_mp": "הוֹוֶה", "present_mp": "הוֹוֶה",
"present_fp": "הוֹוֶה", "present_fp": "הוֹוֶה",
"past_1s": "עָבָר", "past_1s": "עָבָר",
"past_1p": "עָבָר", "past_1p": "עָבָר",
"past_2ms": "עָבָר", "past_2ms": "עָבָר",
"past_2fs": "עָבָר", "past_2fs": "עָבָר",
"past_2mp": "עָבָר", "past_2mp": "עָבָר",
"past_2fp": "עָבָר", "past_2fp": "עָבָר",
"past_3ms": "עָבָר", "past_3ms": "עָבָר",
"past_3fs": "עָבָר", "past_3fs": "עָבָר",
"past_3p": "עָבָר", "past_3p": "עָבָר",
"future_1s": "עָתִיד", "future_1s": "עָתִיד",
"future_1p": "עָתִיד", "future_1p": "עָתִיד",
"future_2ms": "עָתִיד", "future_2ms": "עָתִיד",
"future_2fs": "עָתִיד", "future_2fs": "עָתִיד",
"future_2mp": "עָתִיד", "future_2mp": "עָתִיד",
@@ -105,21 +110,12 @@ TENSE_DESCRIPTION = {
"infinitive": "מְקוֹר", "infinitive": "מְקוֹר",
} }
BINYAN_NAMES: tuple[str, ...] = ( BINYAN_NAMES: tuple[str, ...] = ("Pa'al", "Nif'al", "Pi'el", "Pu'al", "Hitpa'el", "Hif'il", "Huf'al")
"Pa'al", "Nif'al", "Pi'el", "Pu'al", "Hitpa'el", "Hif'il", "Huf'al"
)
session = requests.Session() session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; pealim-anki/2.0)"}) session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; pealim-anki/2.0)"})
def _strip_nikkud(text: str) -> str:
"""Remove Hebrew nikkud (diacritics) from a string."""
return "".join(
ch for ch in unicodedata.normalize("NFD", text)
if unicodedata.category(ch) != "Mn"
)
def _build_pos_lookup() -> dict[str, str]: def _build_pos_lookup() -> dict[str, str]:
"""Build word_stripped → binyan dict from pealim_dict_for_anki.csv.""" """Build word_stripped → binyan dict from pealim_dict_for_anki.csv."""
@@ -129,6 +125,7 @@ def _build_pos_lookup() -> dict[str, str]:
try: try:
import pandas as pd import pandas as pd
try: try:
df = pd.read_csv(DICT_CSV, sep=";", index_col=0) df = pd.read_csv(DICT_CSV, sep=";", index_col=0)
if df.shape[1] < 3: if df.shape[1] < 3:
@@ -168,13 +165,13 @@ def _binyan_from_pos(word: str) -> str:
pos_lower = pos_str.lower() pos_lower = pos_str.lower()
# Map lowercase pealim.com PoS variants → canonical names # Map lowercase pealim.com PoS variants → canonical names
for bname, variants in [ for bname, variants in [
("Pa'al", ["pa'al", "paal"]), ("Pa'al", ["pa'al", "paal"]),
("Nif'al", ["nif'al", "nifal"]), ("Nif'al", ["nif'al", "nifal"]),
("Pi'el", ["pi'el", "piel"]), ("Pi'el", ["pi'el", "piel"]),
("Pu'al", ["pu'al", "pual"]), ("Pu'al", ["pu'al", "pual"]),
("Hitpa'el", ["hitpa'el", "hitpael"]), ("Hitpa'el", ["hitpa'el", "hitpael"]),
("Hif'il", ["hif'il", "hifil"]), ("Hif'il", ["hif'il", "hifil"]),
("Huf'al", ["huf'al", "hufal"]), ("Huf'al", ["huf'al", "hufal"]),
]: ]:
if any(v in pos_lower for v in variants): if any(v in pos_lower for v in variants):
return bname return bname
@@ -305,7 +302,7 @@ def _parse_table(soup: BeautifulSoup, passive: bool = False, table_el=None) -> d
if present_row >= 0: if present_row >= 0:
hf = first_heb_forms(present_row) hf = first_heb_forms(present_row)
keys = ["present_ms", "present_fs", "present_mp", "present_fp"] keys = ["present_ms", "present_fs", "present_mp", "present_fp"]
for k, (v, au) in zip(keys, hf): for k, (v, au) in zip(keys, hf, strict=False):
store(k, v, au) store(k, v, au)
# Past tense # Past tense
@@ -319,13 +316,13 @@ def _parse_table(soup: BeautifulSoup, passive: bool = False, table_el=None) -> d
if past_row + 1 < len(rows): if past_row + 1 < len(rows):
hf2 = first_heb_forms(past_row + 1) hf2 = first_heb_forms(past_row + 1)
keys2 = ["past_2ms", "past_2fs", "past_2mp", "past_2fp"] keys2 = ["past_2ms", "past_2fs", "past_2mp", "past_2fp"]
for k, (v, au) in zip(keys2, hf2): for k, (v, au) in zip(keys2, hf2, strict=False):
store(k, v, au) store(k, v, au)
if past_row + 2 < len(rows): if past_row + 2 < len(rows):
unique3 = deduplicate(first_heb_forms(past_row + 2)) unique3 = deduplicate(first_heb_forms(past_row + 2))
keys3 = ["past_3ms", "past_3fs", "past_3p"] keys3 = ["past_3ms", "past_3fs", "past_3p"]
for k, (v, au) in zip(keys3, unique3): for k, (v, au) in zip(keys3, unique3, strict=False):
store(k, v, au) store(k, v, au)
# Future tense # Future tense
@@ -339,20 +336,20 @@ def _parse_table(soup: BeautifulSoup, passive: bool = False, table_el=None) -> d
if future_row + 1 < len(rows): if future_row + 1 < len(rows):
hf2 = first_heb_forms(future_row + 1) hf2 = first_heb_forms(future_row + 1)
keys2 = ["future_2ms", "future_2fs", "future_2mp", "future_2fp"] keys2 = ["future_2ms", "future_2fs", "future_2mp", "future_2fp"]
for k, (v, au) in zip(keys2, hf2): for k, (v, au) in zip(keys2, hf2, strict=False):
store(k, v, au) store(k, v, au)
if future_row + 2 < len(rows): if future_row + 2 < len(rows):
hf3 = first_heb_forms(future_row + 2) hf3 = first_heb_forms(future_row + 2)
keys3 = ["future_3ms", "future_3fs", "future_3mp", "future_3fp"] keys3 = ["future_3ms", "future_3fs", "future_3mp", "future_3fp"]
for k, (v, au) in zip(keys3, hf3): for k, (v, au) in zip(keys3, hf3, strict=False):
store(k, v, au) store(k, v, au)
# Imperative # Imperative
if imp_row >= 0: if imp_row >= 0:
hf = first_heb_forms(imp_row) hf = first_heb_forms(imp_row)
keys = ["imperative_ms", "imperative_fs", "imperative_mp", "imperative_fp"] keys = ["imperative_ms", "imperative_fs", "imperative_mp", "imperative_fp"]
for k, (v, au) in zip(keys, hf): for k, (v, au) in zip(keys, hf, strict=False):
store(k, v, au) store(k, v, au)
# Infinitive # Infinitive
@@ -399,7 +396,9 @@ def _extract_passive_binyan_from_page(soup: BeautifulSoup) -> str:
return "" return ""
def _extract_conjugations(slug: str, search_term: str, is_3ms_search: bool = False, binyan_hint: str = "") -> dict | None: def _extract_conjugations(
slug: str, search_term: str, is_3ms_search: bool = False, binyan_hint: str = ""
) -> dict | None:
"""Fetch /dict/<slug>/ and parse conjugation table (active + passive).""" """Fetch /dict/<slug>/ and parse conjugation table (active + passive)."""
url = f"{PEALIM_BASE}/dict/{slug}/" url = f"{PEALIM_BASE}/dict/{slug}/"
try: try:
@@ -411,6 +410,12 @@ def _extract_conjugations(slug: str, search_term: str, is_3ms_search: bool = Fal
soup = BeautifulSoup(resp.text, "lxml") soup = BeautifulSoup(resp.text, "lxml")
# Extract meaning from <div class="lead"> (English translation)
meaning = ""
lead_div = soup.find("div", class_="lead")
if lead_div:
meaning = lead_div.get_text(strip=True)
# Extract root # Extract root
root = "" root = ""
for span in soup.find_all("span", class_="menukad"): for span in soup.find_all("span", class_="menukad"):
@@ -440,10 +445,7 @@ def _extract_conjugations(slug: str, search_term: str, is_3ms_search: bool = Fal
infinitive_form = forms_raw.get("infinitive", {}).get("form", "") if not is_passive else "" infinitive_form = forms_raw.get("infinitive", {}).get("form", "") if not is_passive else ""
past_3ms_form = forms_raw.get("past_3ms", {}).get("form", "") past_3ms_form = forms_raw.get("past_3ms", {}).get("form", "")
if is_passive: reference_form = (past_3ms_form or search_term) if is_passive else (infinitive_form or search_term)
reference_form = past_3ms_form or search_term
else:
reference_form = infinitive_form or search_term
# Build active result # Build active result
result = { result = {
@@ -451,6 +453,7 @@ def _extract_conjugations(slug: str, search_term: str, is_3ms_search: bool = Fal
"slug": slug, "slug": slug,
"root": root, "root": root,
"binyan": binyan, "binyan": binyan,
"meaning": meaning,
"is_passive": is_passive, "is_passive": is_passive,
"reference_form": reference_form, "reference_form": reference_form,
"forms": {}, "forms": {},
@@ -474,10 +477,7 @@ def _extract_conjugations(slug: str, search_term: str, is_3ms_search: bool = Fal
passive_table_ids = { passive_table_ids = {
id(t) for t in (passive_h3.find_all_next("table", class_="conjugation-table") if passive_h3 else []) id(t) for t in (passive_h3.find_all_next("table", class_="conjugation-table") if passive_h3 else [])
} }
active_tables = [ active_tables = [t for t in soup.find_all("table", class_="conjugation-table") if id(t) not in passive_table_ids]
t for t in soup.find_all("table", class_="conjugation-table")
if id(t) not in passive_table_ids
]
if len(active_tables) >= 2: if len(active_tables) >= 2:
alt_raw = _parse_table(soup, passive=False, table_el=active_tables[1]) alt_raw = _parse_table(soup, passive=False, table_el=active_tables[1])
alternate_forms = {} alternate_forms = {}
@@ -521,6 +521,12 @@ def _extract_passive_from_active_slug(active_slug: str, search_term: str, binyan
soup = BeautifulSoup(resp.text, "lxml") soup = BeautifulSoup(resp.text, "lxml")
# Extract meaning (this is the active verb's meaning — useful context for passive)
meaning = ""
lead_div = soup.find("div", class_="lead")
if lead_div:
meaning = lead_div.get_text(strip=True)
root = "" root = ""
for span in soup.find_all("span", class_="menukad"): for span in soup.find_all("span", class_="menukad"):
txt = span.get_text(strip=True) txt = span.get_text(strip=True)
@ -548,6 +554,7 @@ def _extract_passive_from_active_slug(active_slug: str, search_term: str, binyan
"slug": active_slug, "slug": active_slug,
"root": root, "root": root,
"binyan": passive_binyan, "binyan": passive_binyan,
"meaning": meaning,
"is_passive": True, "is_passive": True,
"reference_form": active_infinitive or search_term, "reference_form": active_infinitive or search_term,
"forms": {}, "forms": {},
@ -578,14 +585,19 @@ def main(verbs_file: Path = VERBS_INPUT) -> dict:
for line in raw_lines: for line in raw_lines:
stripped = line.strip() stripped = line.strip()
if stripped.startswith("# slug:"): if stripped.startswith("# slug:"):
parts = stripped[len("# slug:"):].strip().split() parts = stripped[len("# slug:") :].strip().split()
if len(parts) >= 2: if len(parts) >= 2:
slug_overrides[parts[0]] = parts[1] slug_overrides[parts[0]] = parts[1]
# Map section header keywords → binyan name (for binyan_hint fallback) # Map section header keywords → binyan name (for binyan_hint fallback)
SECTION_BINYAN = { SECTION_BINYAN = {
"pa'al": "Pa'al", "nif'al": "Nif'al", "pi'el": "Pi'el", "pa'al": "Pa'al",
"pu'al": "Pu'al", "hitpa'el": "Hitpa'el", "hif'il": "Hif'il", "huf'al": "Huf'al", "nif'al": "Nif'al",
"pi'el": "Pi'el",
"pu'al": "Pu'al",
"hitpa'el": "Hitpa'el",
"hif'il": "Hif'il",
"huf'al": "Huf'al",
} }
# Parse: regular verbs and # 3ms: lines (optional active slug on 3ms lines) # Parse: regular verbs and # 3ms: lines (optional active slug on 3ms lines)
@ -597,7 +609,7 @@ def main(verbs_file: Path = VERBS_INPUT) -> dict:
if not stripped or stripped.startswith("# slug:"): if not stripped or stripped.startswith("# slug:"):
continue continue
if stripped.startswith("# 3ms:"): if stripped.startswith("# 3ms:"):
parts = stripped[len("# 3ms:"):].strip().split() parts = stripped[len("# 3ms:") :].strip().split()
if parts: if parts:
form = parts[0] form = parts[0]
active_slug = parts[1] if len(parts) >= 2 else None active_slug = parts[1] if len(parts) >= 2 else None
@ -612,8 +624,7 @@ def main(verbs_file: Path = VERBS_INPUT) -> dict:
else: else:
verbs.append((stripped, False, None, current_binyan_hint)) verbs.append((stripped, False, None, current_binyan_hint))
logger.info(f"Loaded {len(verbs)} verbs from {verbs_file} " logger.info(f"Loaded {len(verbs)} verbs from {verbs_file} ({sum(1 for _, p, _, _ in verbs if p)} passive 3ms)")
f"({sum(1 for _, p, _, _ in verbs if p)} passive 3ms)")
if slug_overrides: if slug_overrides:
logger.info(f" Slug overrides: {slug_overrides}") logger.info(f" Slug overrides: {slug_overrides}")
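Reviewer note: the directive handling touched in the `main()` hunks above can be read as the single-pass sketch below. It is a minimal sketch, not the committed code. The function name `parse_verb_lines` is hypothetical, and two parts of it are assumptions because the diff does not show them: how section headers update the binyan hint, and what tuple a `# 3ms:` line appends. The tuple layout `(form, is_passive_3ms, active_slug, binyan_hint)` is inferred from the visible `verbs.append(...)` call and the `logger.info` generator. The committed code appears to collect slug overrides in a separate earlier loop; they are merged into one pass here only for brevity.

```python
# Map section header keywords to binyan names (copied from the diff).
SECTION_BINYAN = {
    "pa'al": "Pa'al",
    "nif'al": "Nif'al",
    "pi'el": "Pi'el",
    "pu'al": "Pu'al",
    "hitpa'el": "Hitpa'el",
    "hif'il": "Hif'il",
    "huf'al": "Huf'al",
}


def parse_verb_lines(raw_lines):
    """Return (verbs, slug_overrides) parsed from the lines of a verbs file."""
    slug_overrides = {}
    verbs = []
    binyan_hint = None
    for line in raw_lines:
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("# slug:"):
            # "# slug: <verb> <slug>" records an explicit slug for a verb.
            parts = stripped[len("# slug:") :].strip().split()
            if len(parts) >= 2:
                slug_overrides[parts[0]] = parts[1]
        elif stripped.startswith("# 3ms:"):
            # "# 3ms: <form> [active_slug]" marks a passive verb by its 3ms form.
            parts = stripped[len("# 3ms:") :].strip().split()
            if parts:
                active_slug = parts[1] if len(parts) >= 2 else None
                # Assumption: passive entries use the same tuple layout.
                verbs.append((parts[0], True, active_slug, binyan_hint))
        elif stripped.startswith("#"):
            # Assumption: any other comment line is a section header whose
            # text may name a binyan, which becomes the hint for later verbs.
            lowered = stripped.lower()
            for keyword, name in SECTION_BINYAN.items():
                if keyword in lowered:
                    binyan_hint = name
                    break
        else:
            verbs.append((stripped, False, None, binyan_hint))
    return verbs, slug_overrides
```

A function of this shape takes a list of strings, so the parsing could be covered by the new smoke tests without reading the real verbs file.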


@ -175,7 +175,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to guard; to keep, to maintain (על)"
}, },
"ללמוד": { "ללמוד": {
"infinitive": "ללמוד", "infinitive": "ללמוד",
@ -353,7 +354,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to learn, to study"
}, },
"לאסוף": { "לאסוף": {
"infinitive": "לאסוף", "infinitive": "לאסוף",
@ -531,7 +533,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to collect, to pick up, to reap"
}, },
"לעבוד": { "לעבוד": {
"infinitive": "לעבוד", "infinitive": "לעבוד",
@ -709,7 +712,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to work; to operate, to function"
}, },
"לחבוש": { "לחבוש": {
"infinitive": "לחבוש", "infinitive": "לחבוש",
@ -887,7 +891,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to bandage; to put on (a hat)"
}, },
"לאכול": { "לאכול": {
"infinitive": "לאכול", "infinitive": "לאכול",
@ -1065,7 +1070,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to eat"
}, },
"לשאול": { "לשאול": {
"infinitive": "לשאול", "infinitive": "לשאול",
@ -1243,7 +1249,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to ask; to borrow"
}, },
"לשלוח": { "לשלוח": {
"infinitive": "לשלוח", "infinitive": "לשלוח",
@ -1421,7 +1428,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to send, to dispatch"
}, },
"לגבוה": { "לגבוה": {
"infinitive": "לגבוה", "infinitive": "לגבוה",
@ -1599,7 +1607,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to be high, exalted"
}, },
"לשבת": { "לשבת": {
"infinitive": "לשבת", "infinitive": "לשבת",
@ -1777,7 +1786,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to sit, to settle"
}, },
"לרשת": { "לרשת": {
"infinitive": "לרשת", "infinitive": "לרשת",
@ -1955,7 +1965,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to inherit"
}, },
"לִיפּוֹל": { "לִיפּוֹל": {
"infinitive": "לִיפּוֹל", "infinitive": "לִיפּוֹל",
@ -2133,7 +2144,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to fall, to drop"
}, },
"לקום": { "לקום": {
"infinitive": "לקום", "infinitive": "לקום",
@ -2311,7 +2323,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to get up, to stand up, to arise; to be established, to come into being"
}, },
"לחון": { "לחון": {
"infinitive": "לחון", "infinitive": "לחון",
@ -2489,7 +2502,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to pardon, to amnesty; to endow"
}, },
"לקרוא": { "לקרוא": {
"infinitive": "לקרוא", "infinitive": "לקרוא",
@ -2667,7 +2681,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to read (ב-, את); to call (ל-)"
}, },
"לקנות": { "לקנות": {
"infinitive": "לקנות", "infinitive": "לקנות",
@ -2845,7 +2860,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to buy, to purchase"
}, },
"להיבדק": { "להיבדק": {
"infinitive": "להיבדק", "infinitive": "להיבדק",
@ -3023,7 +3039,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to be tested, examined"
}, },
"להרדם": { "להרדם": {
"infinitive": "להרדם", "infinitive": "להרדם",
@ -3201,7 +3218,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to fall asleep, to doze off"
}, },
"להיהרג": { "להיהרג": {
"infinitive": "להיהרג", "infinitive": "להיהרג",
@ -3379,7 +3397,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to be killed"
}, },
"להחקר": { "להחקר": {
"infinitive": "להחקר", "infinitive": "להחקר",
@ -3557,7 +3576,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to be investigated, explored"
}, },
"להישאר": { "להישאר": {
"infinitive": "להישאר", "infinitive": "להישאר",
@ -3735,7 +3755,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to remain"
}, },
"להיפגע": { "להיפגע": {
"infinitive": "להיפגע", "infinitive": "להיפגע",
@ -3913,7 +3934,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to be damaged, to be injured, to be wounded; to be insulted, to be offended"
}, },
"להיוולד": { "להיוולד": {
"infinitive": "להיוולד", "infinitive": "להיוולד",
@ -4091,7 +4113,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to be born"
}, },
"להנצל": { "להנצל": {
"infinitive": "להנצל", "infinitive": "להנצל",
@ -4269,7 +4292,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to be saved, to be rescued, to survive"
}, },
"להיסוג": { "להיסוג": {
"infinitive": "להיסוג", "infinitive": "להיסוג",
@ -4447,7 +4471,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to withdraw, to retreat"
}, },
"להימצא": { "להימצא": {
"infinitive": "להימצא", "infinitive": "להימצא",
@ -4625,7 +4650,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to be found, discovered; to be present, to be located"
}, },
"להיבנות": { "להיבנות": {
"infinitive": "להיבנות", "infinitive": "להיבנות",
@ -4803,7 +4829,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to be built, constructed"
}, },
"לדבר": { "לדבר": {
"infinitive": "לדבר", "infinitive": "לדבר",
@ -5130,7 +5157,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to speak, to talk"
}, },
"לברך": { "לברך": {
"infinitive": "לברך", "infinitive": "לברך",
@ -5457,7 +5485,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to bless, to greet, to felicitate"
}, },
"לנהל": { "לנהל": {
"infinitive": "לנהל", "infinitive": "לנהל",
@ -5784,7 +5813,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to manage, to organize"
}, },
"לנצח": { "לנצח": {
"infinitive": "לנצח", "infinitive": "לנצח",
@ -6111,7 +6141,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to win; to overcome, to beat; to conduct, to orchestrate"
}, },
"לקומם": { "לקומם": {
"infinitive": "לקומם", "infinitive": "לקומם",
@ -6438,7 +6469,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to outrage, to anger"
}, },
"למלא": { "למלא": {
"infinitive": "למלא", "infinitive": "למלא",
@ -6765,7 +6797,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to fill; to fill out; to fulfil"
}, },
"לחכות": { "לחכות": {
"infinitive": "לחכות", "infinitive": "לחכות",
@ -7092,7 +7125,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to await, to wait for (ל-)"
}, },
"לגלגל": { "לגלגל": {
"infinitive": "לגלגל", "infinitive": "לגלגל",
@ -7419,7 +7453,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to roll, to revolve (transitive)"
}, },
"להתלבש": { "להתלבש": {
"infinitive": "להתלבש", "infinitive": "להתלבש",
@ -7597,7 +7632,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to dress oneself"
}, },
"להסתלק": { "להסתלק": {
"infinitive": "להסתלק", "infinitive": "להסתלק",
@ -7775,7 +7811,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to leave, to go away"
}, },
"להצטלם": { "להצטלם": {
"infinitive": "להצטלם", "infinitive": "להצטלם",
@ -7953,7 +7990,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to pose for a photograph, to be photographed"
}, },
"להזדקק": { "להזדקק": {
"infinitive": "להזדקק", "infinitive": "להזדקק",
@ -8131,7 +8169,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to need, to require (ל-)"
}, },
"להתנהג": { "להתנהג": {
"infinitive": "להתנהג", "infinitive": "להתנהג",
@ -8309,7 +8348,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to behave"
}, },
"להתקומם": { "להתקומם": {
"infinitive": "להתקומם", "infinitive": "להתקומם",
@ -8487,7 +8527,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to rebel, to revolt"
}, },
"להתפלא": { "להתפלא": {
"infinitive": "להתפלא", "infinitive": "להתפלא",
@ -8665,7 +8706,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to wonder, to be surprised"
}, },
"להתקלקל": { "להתקלקל": {
"infinitive": "להתקלקל", "infinitive": "להתקלקל",
@ -8843,7 +8885,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to be damaged, to be spoiled (of food products)"
}, },
"להכניס": { "להכניס": {
"infinitive": "להכניס", "infinitive": "להכניס",
@ -9170,7 +9213,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to insert, to bring in"
}, },
"להעסיק": { "להעסיק": {
"infinitive": "להעסיק", "infinitive": "להעסיק",
@ -9497,7 +9541,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to keep busy; to employ"
}, },
"להחליט": { "להחליט": {
"infinitive": "להחליט", "infinitive": "להחליט",
@ -9824,7 +9869,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to decide"
}, },
"להבטיח": { "להבטיח": {
"infinitive": "להבטיח", "infinitive": "להבטיח",
@ -10151,7 +10197,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to ensure, to promise"
}, },
"להוריד": { "להוריד": {
"infinitive": "להוריד", "infinitive": "להוריד",
@ -10478,7 +10525,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to lower, to reduce; to download (computing)"
}, },
"להפיל": { "להפיל": {
"infinitive": "להפיל", "infinitive": "להפיל",
@ -10805,7 +10853,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to drop, to throw down"
}, },
"להקים": { "להקים": {
"infinitive": "להקים", "infinitive": "להקים",
@ -11132,7 +11181,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to build, to found, to establish"
}, },
"להמציא": { "להמציא": {
"infinitive": "להמציא", "infinitive": "להמציא",
@ -11459,7 +11509,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to invent; to make up; to present"
}, },
"להרשות": { "להרשות": {
"infinitive": "להרשות", "infinitive": "להרשות",
@ -11786,7 +11837,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to allow, to permit"
}, },
"להקל": { "להקל": {
"infinitive": "להקל", "infinitive": "להקל",
@ -12113,7 +12165,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to ease, to alleviate"
}, },
"לָשִׂים": { "לָשִׂים": {
"infinitive": "לָשִׂים", "infinitive": "לָשִׂים",
@ -12291,7 +12344,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to put, to put on"
}, },
"בוטל": { "בוטל": {
"infinitive": "בוטל", "infinitive": "בוטל",
@ -12439,7 +12493,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to cancel, to undo"
}, },
"תואם": { "תואם": {
"infinitive": "תואם", "infinitive": "תואם",
@ -12587,7 +12642,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to coordinate"
}, },
"קומם": { "קומם": {
"infinitive": "קומם", "infinitive": "קומם",
@ -12735,7 +12791,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to outrage, to anger"
}, },
"דוכא": { "דוכא": {
"infinitive": "דוכא", "infinitive": "דוכא",
@ -12883,7 +12940,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to oppress, to crush; to cause depression"
}, },
"זוכה": { "זוכה": {
"infinitive": "זוכה", "infinitive": "זוכה",
@ -13031,7 +13089,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to achieve; to credit"
}, },
"פורסם": { "פורסם": {
"infinitive": "פורסם", "infinitive": "פורסם",
@ -13179,7 +13238,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to advertise, to publish, to publicize"
}, },
"הוגבל": { "הוגבל": {
"infinitive": "הוגבל", "infinitive": "הוגבל",
@ -13327,7 +13387,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to limit, to restrict, to confine"
}, },
"העבר": { "העבר": {
"infinitive": "העבר", "infinitive": "העבר",
@ -13475,7 +13536,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to transfer, to pass something"
}, },
"הוזהר": { "הוזהר": {
"infinitive": "הוזהר", "infinitive": "הוזהר",
@ -13623,7 +13685,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to warn"
}, },
"הופל": { "הופל": {
"infinitive": "הופל", "infinitive": "הופל",
@ -13771,7 +13834,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to drop, to throw down"
}, },
"הוקם": { "הוקם": {
"infinitive": "הוקם", "infinitive": "הוקם",
@ -13919,7 +13983,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to build, to found, to establish"
}, },
"הוחל": { "הוחל": {
"infinitive": "הוחל", "infinitive": "הוחל",
@ -14067,7 +14132,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to apply, to enforce, to put in force"
}, },
"הוקפא": { "הוקפא": {
"infinitive": "הוקפא", "infinitive": "הוקפא",
@ -14215,7 +14281,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to freeze (something)"
}, },
"הופנה": { "הופנה": {
"infinitive": "הופנה", "infinitive": "הופנה",
@ -14363,7 +14430,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to direct; to refer someone"
}, },
"להתקלח": { "להתקלח": {
"infinitive": "להתקלח", "infinitive": "להתקלח",
@ -14541,7 +14609,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to take a shower"
}, },
"להתגלות": { "להתגלות": {
"infinitive": "להתגלות", "infinitive": "להתגלות",
@ -14719,6 +14788,162 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to be discovered, to appear"
},
"להיות": {
"infinitive": "להיות",
"slug": "454-lihyot",
"root": "ה - י - ה",
"binyan": "Pa'al",
"is_passive": false,
"reference_form": "לִהְיוֹת",
"forms": {
"past_1s": {
"form": "הָיִיתִי",
"audio_url": "https://audio.pealim.com/v0/bx/bxtedharx4kd.mp3",
"pronoun": "אֲנִי",
"tense": "עָבָר"
},
"past_1p": {
"form": "הָיִינוּ",
"audio_url": "https://audio.pealim.com/v0/bz/bztr7bt7yw8j.mp3",
"pronoun": "אֲנַחְנוּ",
"tense": "עָבָר"
},
"past_2ms": {
"form": "הָיִיתָ",
"audio_url": "https://audio.pealim.com/v0/1i/1imxfddysg8d8.mp3",
"pronoun": "אַתָּה",
"tense": "עָבָר"
},
"past_2fs": {
"form": "הָיִית",
"audio_url": "https://audio.pealim.com/v0/si/sizbwqsi2wej.mp3",
"pronoun": "אַתְּ",
"tense": "עָבָר"
},
"past_2mp": {
"form": "הֱיִיתֶם",
"audio_url": "https://audio.pealim.com/v0/31/31081nk4lvxj.mp3",
"pronoun": "אַתֶּם",
"tense": "עָבָר"
},
"past_2fp": {
"form": "הֱיִיתֶן",
"audio_url": "https://audio.pealim.com/v0/30/30zpav63u9ig.mp3",
"pronoun": "אַתֶּן",
"tense": "עָבָר"
},
"past_3ms": {
"form": "הָיָה",
"audio_url": "https://audio.pealim.com/v0/1h/1hxhgoyxra6fs.mp3",
"pronoun": "הוּא",
"tense": "עָבָר"
},
"past_3fs": {
"form": "הָיְתָה",
"audio_url": "https://audio.pealim.com/v0/17/17fb6fulu2da8.mp3",
"pronoun": "הִיא",
"tense": "עָבָר"
},
"past_3p": {
"form": "הָיוּ",
"audio_url": "https://audio.pealim.com/v0/1h/1hxhgf26s3ou9.mp3",
"pronoun": "הֵם / הֵן",
"tense": "עָבָר"
},
"future_1s": {
"form": "אֶהְיֶה",
"audio_url": "https://audio.pealim.com/v0/at/atd2i0kljhge.mp3",
"pronoun": "אֲנִי",
"tense": "עָתִיד"
},
"future_1p": {
"form": "נִהְיֶה",
"audio_url": "https://audio.pealim.com/v0/2a/2a41xa7h8jei.mp3",
"pronoun": "אֲנַחְנוּ",
"tense": "עָתִיד"
},
"future_2ms": {
"form": "תִּהְיֶה",
"audio_url": "https://audio.pealim.com/v0/g6/g6saa9abkllk.mp3",
"pronoun": "אַתָּה",
"tense": "עָתִיד"
},
"future_2fs": {
"form": "תִּהְיִי",
"audio_url": "https://audio.pealim.com/v0/g6/g6s9q8uugtnx.mp3",
"pronoun": "אַתְּ",
"tense": "עָתִיד"
},
"future_2mp": {
"form": "תִּהְיוּ",
"audio_url": "https://audio.pealim.com/v0/g6/g6sjf854r5a7.mp3",
"pronoun": "אַתֶּם",
"tense": "עָתִיד"
},
"future_2fp": {
"form": "תִּהְיֶינָה",
"audio_url": "https://audio.pealim.com/v0/12/12upso035jy8g.mp3",
"pronoun": "אַתֶּן",
"tense": "עָתִיד"
},
"future_3ms": {
"form": "יִהְיֶה",
"audio_url": "https://audio.pealim.com/v0/yy/yyo97spf6rob.mp3",
"pronoun": "הוּא",
"tense": "עָתִיד"
},
"future_3fs": {
"form": "תִּהְיֶה",
"audio_url": "https://audio.pealim.com/v0/g6/g6saa9abkllk.mp3",
"pronoun": "הִיא",
"tense": "עָתִיד"
},
"future_3mp": {
"form": "יִהְיוּ",
"audio_url": "https://audio.pealim.com/v0/yy/yyo02tum07zo.mp3",
"pronoun": "הֵם",
"tense": "עָתִיד"
},
"future_3fp": {
"form": "תִּהְיֶינָה",
"audio_url": "https://audio.pealim.com/v0/12/12upso035jy8g.mp3",
"pronoun": "הֵן",
"tense": "עָתִיד"
},
"imperative_ms": {
"form": "הֱיֵה!",
"audio_url": "https://audio.pealim.com/v0/1h/1hxjabs7uspli.mp3",
"pronoun": "אַתָּה",
"tense": "צִוּוּי"
},
"imperative_fs": {
"form": "הֱיִי!",
"audio_url": "https://audio.pealim.com/v0/1h/1hxjac2th43as.mp3",
"pronoun": "אַתְּ",
"tense": "צִוּוּי"
},
"imperative_mp": {
"form": "הֱיוּ!",
"audio_url": "https://audio.pealim.com/v0/1h/1hxja0tjuptcu.mp3",
"pronoun": "אַתֶּם",
"tense": "צִוּוּי"
},
"imperative_fp": {
"form": "הֱיֶינָה!",
"audio_url": "https://audio.pealim.com/v0/xe/xef6kg7mexvb.mp3",
"pronoun": "אַתֶּן",
"tense": "צִוּוּי"
},
"infinitive": {
"form": "לִהְיוֹת",
"audio_url": "https://audio.pealim.com/v0/1n/1nej50k4t35xi.mp3",
"pronoun": "",
"tense": "מְקוֹר"
}
},
"meaning": "to be"
} }
} }
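Reviewer note: every entry in the JSON diff above gains a top-level `"meaning"` key, so a small consistency check may be worth adding to the smoke tests. The sketch below is one possible shape for it. The helper name `entries_missing_meaning` is hypothetical, and it takes an already-loaded dict rather than a path because this view does not show the data file's name.

```python
def entries_missing_meaning(conjugations):
    """Return the keys of verb entries whose "meaning" is absent or blank."""
    missing = []
    for key, entry in conjugations.items():
        meaning = entry.get("meaning", "")
        # Treat non-string or whitespace-only values as missing.
        if not isinstance(meaning, str) or not meaning.strip():
            missing.append(key)
    return missing


# Illustrative data shaped like the entries in the diff (trimmed).
sample = {
    "להיות": {"binyan": "Pa'al", "is_passive": False, "meaning": "to be"},
    "בוטל": {"is_passive": True, "meaning": "to cancel, to undo"},
    "placeholder": {"is_passive": False, "forms": {}},
}
print(entries_missing_meaning(sample))
```

Passive entries such as `בוטל` carry the active verb's gloss, as the code comment in `_extract_passive_from_active_slug` says, so a check like this can confirm presence but not whether the wording fits the passive sense.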

15904
data/epub_sentence_index.json Normal file

File diff suppressed because it is too large

File diff suppressed because it is too large

186078
data/ktiv_male_forms.json Normal file

File diff suppressed because it is too large

9242
data/legacy_guid_map.json Normal file

File diff suppressed because it is too large

37457
data/noun_plurals.json Normal file

File diff suppressed because it is too large

29598
data/noun_slug_map.json Normal file

File diff suppressed because it is too large

442
data/refined_meanings.json Normal file

@ -0,0 +1,442 @@
{
"שְׁלָל": "abundance; loot, plunder, spoils",
"שֶׁפַע": "abundance, plenty, profusion",
"נַר": "acquaintance (person one knows)",
"הֶכֵּרוּת": "acquaintance (the state of knowing someone)",
"כְּתֹבֶת": "address (postal/location)",
"מַעַן": "address (formal, for the sake of; destination)",
"שׁוּב": "again (once more, to repeat an action)",
"שֵׁנִית": "again; a second time, secondly",
"כְּנֶגֶד": "against; compared to, as opposed to",
"מוּל": "opposite, facing; against",
"נֶגֶד": "against; contrary to",
"נֶכֶס": "asset, property (financial/material possession)",
"קִנְיָן": "asset, property; possession, ownership (abstract or acquired)",
"הִתְבּוֹלְלוּת": "assimilation (cultural/ethnic blending in)",
"הִטַּמְּעוּת": "assimilation (absorption, integration into surroundings)",
"כְּפִיפָה": "basket (woven, traditional/biblical)",
"סַל": "basket (general, everyday)",
"מַשְׁמִים": "boring, dreary (causing desolation/boredom)",
"מְשַׁעְמֵם": "boring, tedious (causing boredom, common usage)",
"מַשָּׂא": "burden, load (heavy cargo; figurative weight)",
"נֵטֶל": "burden, load; ballast (dead weight)",
"טָרוּד": "busy, preoccupied (mentally troubled/distracted)",
"עָסוּק": "busy, occupied (engaged in an activity)",
"מַמְתָּק": "candy, sweet (generic confection)",
"סֻכָּרִיָּה": "candy, sweet (individual wrapped candy piece)",
"מַרְבָד": "carpet, rug (literary/poetic); bedspread",
"שָׁטִיחַ": "carpet, rug (standard, everyday word)",
"כַּרְפַּס": "celery (also: the Passover seder vegetable)",
"סֶלֶרִי": "celery (modern loanword, everyday usage)",
"שַׁלְשֶׁלֶת": "chain (figurative: chain of events, lineage)",
"שַׁרְשֶׁרֶת": "chain (physical chain, links)",
"אָפְיָן": "characteristic (trait, attribute of a person/thing)",
"סַמְמָן": "characteristic; indicator, hallmark",
"שׁוֹקוֹלָד": "chocolate (the substance, mass noun, masc.)",
"שׁוֹקוֹלָדָה": "chocolate (a piece of chocolate; hot chocolate, fem.)",
"עִגּוּל": "circle (the shape); rounding",
"מַעֲגָל": "circle (circular path, cycle, circuit)",
"נִקּוּי": "cleaning (the act of cleaning, removing dirt)",
"נִקָּיוֹן": "cleanliness, tidiness (state of being clean)",
"בִּקּוּעַ": "cleaving, splitting (a single crack or fissure)",
"הִתְבַּקְּעוּת": "cleaving, splitting (the process of cracking apart)",
"בְּעִילָה": "coitus, sexual intercourse (legal/halachic term)",
"מִשְׁגָּל": "coitus, sexual intercourse (formal/literary)",
"מִדְרָשָׁה": "college (religious seminary, study institute)",
"מִכְלָלָה": "college (academic institution, secular)",
"תַּחֲרוּת": "competition, contest (an event or rivalry)",
"הִתְחָרוּת": "competition (the act/process of competing)",
"לְגַמְרֵי": "completely, totally (colloquial, very common)",
"כָּלִיל": "completely, entirely (literary/formal); wholly",
"רְכִיב": "component (technical part, element in a system)",
"מַרְכִּיב": "component, ingredient (constituent that makes up a whole)",
"תַּבְעֵרָה": "conflagration, fire (intense blaze, biblical/literary)",
"דְּלֵקָה": "fire (accidental fire, house fire, everyday)",
"צַרְכָנוּת": "consumerism; consumer advocacy",
"צְרִיכָה": "consumption (using up resources, usage)",
"קֵרוּר": "cooling, refrigeration (active process of making cold)",
"הִתְקָרְרוּת": "cooling (becoming cold); catching a cold",
"חָשׁוּךְ": "dark (of a place, lacking light; figuratively bleak)",
"כֵּהֶה": "dark (of a color, shade; dim)",
"אֲפֵלָה": "darkness (deep gloom; figurative despair)",
"אֹפֶל": "darkness (poetic/literary, deep darkness)",
"חֹשֶׁךְ": "darkness (general, common word)",
"יַקִּיר": "darling, dear (masculine form)",
"יַקִּירָה": "darling, dear (feminine form)",
"מִרְמָה": "deceit, fraud (cunning deception, trickery)",
"תַּרְמִית": "deceit, fraud (a specific act of swindling)",
"אֲבַדּוֹן": "destruction (total ruin, perdition; the abyss)",
"הֶרֶס": "destruction, demolition (physical wreckage)",
"הֶבְדֵּל": "difference, distinction (between two things)",
"שֹׁנִי": "difference (variance, otherness)",
"הֵעָלְמוּת": "disappearance (the act of vanishing, going missing)",
"הֶעֱלֵם": "disappearance (concealment, suppression of information)",
"נְדָבָה": "donation (voluntary, charitable gift; tip)",
"תְּרוּמָה": "donation, contribution (formal; also: religious offering)",
"הִשְׁתַּעְבְּדוּת": "enslavement (the process of becoming enslaved)",
"שִׁעְבּוּד": "enslavement, subjugation; mortgaging (finance)",
"טָעוּת": "mistake, error (common, everyday blunder)",
"שְׁגִיאָה": "error, mistake (formal, technical error)",
"הִתְאַדּוּת": "evaporation (natural process of turning to vapor)",
"הִתְאַיְּדוּת": "evaporation (process of dissipating, vaporizing)",
"דֻּגְמָה": "example, sample (concrete instance or specimen)",
"מָשָׁל": "example; parable, allegory, proverb",
"גּוֹלָה": "exile, diaspora (the community in exile)",
"גָּלוּת": "exile, diaspora (the state/condition of being exiled)",
"חֲוָיָה": "experience (a lived event, an adventure)",
"הִתְנַסּוּת": "experience (the process of trying/experimenting)",
"נִסָּיוֹן": "experience (accumulated knowledge); attempt, trial",
"בֵּאוּר": "explanation, elucidation (detailed clarification)",
"הֶסְבֵּר": "explanation (the act of explaining, making understood)",
"פָּנִים": "face (standard word); surface",
"פַּרְצוּף": "face (appearance, facial expression; colloquial)",
"מֶחְדָּל": "failure, omission (negligent failure to act)",
"כִּשָּׁלוֹן": "failure (general: failed attempt or endeavor)",
"כֶּשֶׁל": "failure, malfunction (technical breakdown)",
"תַּעְנִית": "fast (religious fast day, formal term)",
"צוֹם": "fast, fasting (the act of fasting, general)",
"תְּחוּשָׁה": "feeling, sensation (physical or gut feeling)",
"הַרְגָּשָׁה": "feeling (emotional sense; well-being)",
"רֶגֶשׁ": "feeling, emotion (inner emotional state)",
"לֶהָבָה": "flame (common word for a flame)",
"שַׁלְהֶבֶת": "flame (poetic/literary, blazing flame)",
"כָּפִיף": "flexible, pliable (can be bent physically)",
"מָתִיחַ": "flexible, elastic (stretchy, resilient)",
"זֶרֶם": "flow, current (of water, electricity, or ideas)",
"זְרִימָה": "flow, flowing (the act/process of flowing)",
"אֹכֶל": "food (general, everyday word for food/meal)",
"מַאֲכָל": "food (a specific dish, a prepared food item)",
"מָזוֹן": "food, nourishment (sustenance, nutrition)",
"חֹפֶשׁ": "freedom; vacation, time off (colloquial)",
"חֵרוּת": "freedom, liberty (formal, political/ideological)",
"הַקְפָּאָה": "freezing (active act of freezing something; a freeze/suspension)",
"קִפָּאוֹן": "freezing; standstill, stagnation (frozen state)",
"תְּדִירוּת": "frequency (how often something occurs)",
"תֶּדֶר": "frequency (radio/physics frequency)",
"תָּדִיר": "frequent, regular (happening at steady intervals)",
"תָּכוּף": "frequent, rapid (happening in quick succession)",
"גָּאוֹן": "genius (title of greatness; rabbinical title Gaon)",
"עִלּוּי": "genius, prodigy (exceptionally gifted person)",
"תְּשׁוּרָה": "gift, present (formal/literary offering)",
"שַׁי": "gift, present (a token gift, small present)",
"אַכְלָן": "glutton (big eater, food-lover, common)",
"רְעַבְתָּן": "glutton (insatiably hungry person)",
"מֶמְשֶׁלֶת": "government (construct state form, used in compounds)",
"מֶמְשָׁלָה": "government (standard form)",
"מֶמְשַׁלְתִּי": "governmental (relating to the government/cabinet)",
"שִׁלְטוֹנִי": "governmental (relating to ruling authority/regime)",
"חֹפֶן": "handful (cupped palm, a scooped amount)",
"קֹמֶץ": "handful (a pinch, a small quantity)",
"יָד": "handle (of a tool, door); hand",
"יָדִית": "handle (a knob or grip, specifically a handle)",
"כָּאן": "here (standard, common usage)",
"פֹּה": "here (colloquial/informal variant)",
"טָמוּן": "hidden (buried, latent, lying within)",
"נִסְתָּר": "hidden, concealed (secret, mysterious; grammar: 3rd person)",
"מֻצְנָע": "hidden, concealed (modestly tucked away, discreet)",
"תְּמוּנָה": "image, picture (photo, illustration, scene)",
"צֶלֶם": "image (likeness, form); idol",
"הִתְרַשְּׁמוּת": "impression (the experience of being impressed)",
"רֹשֶׁם": "impression (a mark left; an effect on someone)",
"בִּפְנִים": "inside (location: on the inside, indoors)",
"פְּנִימָה": "inside (direction: inward, toward the inside)",
"עֶלְבּוֹן": "insult, offence (the slight or affront itself)",
"הַעֲלָבָה": "insult (the act of insulting someone)",
"פְּנִים": "interior, inside (inner part, inner side)",
"קֶרֶב": "interior; innards, midst (among, in the thick of)",
"תָּוֶךְ": "interior, inside; center, middle; essence",
"תַּחְקִיר": "investigation (journalistic/official inquiry)",
"חֲקִירָה": "investigation, inquiry (police/legal; research)",
"רִנָּה": "joy; joyful song, singing (literary)",
"מָשׂוֹשׂ": "joy, delight (source of joy, literary)",
"גִּיל": "joy, elation (exuberant happiness; age)",
"שִׂמְחָה": "joy, happiness (celebration, festive occasion)",
"עֶלְצוֹן": "jubilance, exultation (archaic, the feeling)",
"עֶלְצָה": "jubilance, exultation (archaic, feminine noun form)",
"עָצֵל": "lazy, idle (basic adjective form)",
"עַצְלָן": "lazy, lazybones (characteristically lazy person)",
"תְּחִקָּה": "legislation (a specific statute or enacted law)",
"חֲקִיקָה": "legislation (the process/act of legislating)",
"הִתְהוֹלְלוּת": "licentiousness, revelry (wild raucous behavior)",
"הוֹלֵלוּת": "licentiousness, debauchery (moral depravity)",
"שׁוֹשָׁן": "lily (the flower, masculine; also: the name Shoshan)",
"שׁוֹשַׁנָּה": "lily; rose (archaic); the name Shoshana",
"הִמָּצְאוּת": "location; presence (being found/situated somewhere)",
"מִקּוּם": "location, positioning (placing in a specific spot)",
"נַעֲלֶה": "lofty, exalted (elevated, superior in quality)",
"נִשְׂגָּב": "lofty, exalted (sublime, beyond reach, grand)",
"תַּאֲוָה": "lust, craving (appetite, physical desire)",
"תְּשׁוּקָה": "passion, desire (deep longing, yearning)",
"אַחְזָקָה": "maintenance; holding (corporate; upkeep of property)",
"תַּחְזוּקָה": "maintenance (technical upkeep of systems/equipment)",
"תִּחְזוּק": "maintenance (the process/act of maintaining)",
"מִנְהָל": "administration, management (the office/system)",
"נִהוּל": "management (the act/process of managing)",
"הַנְהָלָה": "management (the managing body, executive board)",
"פֵּרוּשׁ": "meaning; interpretation, commentary",
"מַשְׁמָעוּת": "meaning, significance (broader importance)",
"מַשְׁמָע": "meaning, implication (what is implied)",
"לַחַן": "melody, tune (a musical composition)",
"נִגּוּן": "melody, tune (a chant; Hasidic wordless melody)",
"נְעִימָה": "melody, tune; tone, intonation (of voice)",
"נֵס": "miracle (divine intervention; common word)",
"פֶּלֶא": "wonder, marvel (something astonishing)",
"תְּזוּזָה": "movement (a budge, slight motion, shift)",
"תְּנוּעָה": "movement (broad: traffic; organization; vowel mark)",
"מִסְתּוֹרִין": "mystery (enigma, something hidden/secret)",
"תַּעֲלוּמָה": "mystery (unsolved puzzle, unknown secret)",
"עֵירֹם": "naked (completely nude, formal)",
"עָרֹם": "naked (nude; also: shrewd, cunning in biblical Hebrew)",
"אֻמָּה": "nation (a unified political/cultural entity)",
"לְאֹם": "nation, people (ethnic group; literary/formal)",
"זִלְזוּל": "negligence; contempt, disrespect (dismissive attitude)",
"הִתְרַשְּׁלוּת": "negligence (carelessness, failure to take proper care)",
"נֵיטְרָלִי": "neutral (politically/scientifically neutral, loanword)",
"סְתָמִי": "neutral; vague, nondescript, generic",
"אֲצֻלָּה": "nobility, aristocracy (the aristocratic class)",
"אֲצִילוּת": "nobility (the quality of being noble, refinement)",
"הִסְתַּכְּלוּת": "observation (looking, watching, contemplation)",
"תַּצְפִּית": "observation (military/scientific lookout; observation post)",
"מִכְשׁוֹל": "obstacle, stumbling block (impediment to progress)",
"נֶגֶף": "obstacle; plague, affliction (biblical)",
"עַל": "on, upon; about, regarding",
"עַל גַּב": "on, upon (on the back/surface of)",
"עַל גַּבֵּי": "on, upon (on top of, on the surface of)",
"פְּקֻדָּה": "order, command (military/authoritative directive)",
"צַו": "order, decree (legal injunction, official order)",
"בָּחוּץ": "outside (location: on the outside, outdoors)",
"הַחוּצָה": "outside (direction: outward, to the outside)",
"מַאֲרָז": "package (a packed container, packaging)",
"חֲבִילָה": "package, parcel (a bundle, a wrapped item)",
"מְחִילָה": "pardon, forgiveness (personal, between individuals)",
"סְלִיחָה": "pardon, forgiveness (also: excuse me; liturgical pardon)",
"סַיֶּרֶת": "patrol (elite military unit, commando squad)",
"סִיּוּר": "patrol; tour (a round of inspection or sightseeing)",
"שָׂכָר": "payment; salary, wage (earned compensation)",
"תַּשְׁלוּם": "payment (a single payment/installment; compensation)",
"עֲצוּמָה": "petition (public petition with signatures)",
"עֲתִירָה": "petition (legal petition, court appeal)",
"דַּלּוּת": "poverty; meagerness, paucity (scarcity of quality/quantity)",
"עֹנִי": "poverty (destitution, financial hardship)",
"עָצְמָתִי": "powerful (having great inherent power)",
"רַב עָצְמָה": "powerful (of great might, formidable)",
"הַאֲמָרָה": "price increase (deliberate raising of prices)",
"הִתְיַקְּרוּת": "price increase (becoming more expensive, rising costs)",
"קִדְמָה": "progress (general/societal advancement, modernity)",
"הִתְקַדְּמוּת": "progress (the process of advancing, making headway)",
"הַסְבָּרָה": "propaganda; public diplomacy (Israeli hasbara)",
"תַּעֲמוּלָה": "propaganda (political propaganda, agitation)",
"סְמִיכוּת": "proximity; construct state (grammar term)",
"קִרְבָה": "proximity; kinship, closeness (relational nearness)",
"תְּהִלּוֹת": "Psalms (variant plural form)",
"תְּהִלִּים": "Psalms (standard name for the Book of Psalms)",
"קְנִיָּה": "purchase (a buy, an act of buying, everyday)",
"רְכִישָׁה": "acquisition (formal purchase, procurement)",
"בִּזְרִיזוּת": "quickly, nimbly (with agile efficiency)",
"בִּמְהִירוּת": "quickly, at high speed (with velocity)",
"רִיצָה": "running (the activity of running)",
"מְרוּצָה": "race (a competitive running event)",
"גְּאֻלָּה": "redemption (national/messianic deliverance)",
"פְּדוּת": "redemption (ransoming, being redeemed; literary)",
"הוֹצָאָה": "removal; expense, expenditure; publishing house",
"הַסָּחָה": "removal; deflection, diversion, distraction",
"יִצּוּג": "representation (acting on behalf of; depiction)",
"נְצִיגוּת": "representation (the body of representatives, delegation)",
"מְכִירָה": "sale (the act of selling, a transaction)",
"מֶכֶר": "sale; merchandise, value (literary/biblical)",
"יֶשַׁע": "salvation, deliverance (divine rescue, literary)",
"תְּשׁוּעָה": "salvation, victory (triumphant rescue, literary)",
"הַפְרָדָה": "separation (active act of separating things/people)",
"הִפָּרְדוּת": "separation (the process of parting ways)",
"חַד": "sharp (of edges, blades; clear-cut)",
"חָרִיף": "sharp, acute; spicy, pungent; keen, witty",
"חָסוּת": "shelter, patronage (protection under authority)",
"מִקְלָט": "shelter, refuge (bomb shelter, safe haven, physical place)",
"חֻלְצָה": "shirt, blouse (modern everyday word)",
"כֻּתֹּנֶת": "shirt; tunic, gown (biblical/traditional garment)",
"שֶׁקֶט": "silence, quiet (peaceful calm, serenity)",
"שְׁתִיקָה": "silence (the act of keeping silent, not speaking)",
"חֶטְא": "sin (a specific transgression, missing the mark)",
"עָווֹן": "sin, iniquity (moral guilt; legal: misdemeanor)",
"זִמְרָה": "singing (musical performance, song/hymn)",
"רְנָנָה": "singing; joyful song, jubilant cry (literary)",
"נָטוּי": "slanted, inclined (tilted, leaning; grammar: inflected)",
"מְשֻׁפָּע": "slanted, inclined; having an abundance of something",
"כִּשּׁוּף": "sorcery, witchcraft (dark magic, spellcasting)",
"קֶסֶם": "magic, charm (enchantment, allure)",
"נֶפֶשׁ": "soul (life force, self, being; appetite)",
"נְשָׁמָה": "soul (divine breath of life, spiritual essence)",
"מַצָּת": "spark plug (automotive ignition component)",
"פְּלָג": "spark plug (variant/slang term)",
"דּוֹבֵר": "speaker, spokesman (masculine form)",
"דּוֹבֶרֶת": "speaker, spokeswoman (feminine form)",
"סוּפָה": "storm, tempest (violent windstorm)",
"סְעָרָה": "storm, tempest (raging storm; figurative turmoil)",
"קַשׁ": "straw (dry stalks; figuratively: trivial thing)",
"תֶּבֶן": "straw, hay (animal feed, dried grass)",
"עִקֵּשׁ": "stubborn, obstinate (perversely rigid)",
"עַקְשָׁן": "stubborn, obstinate (characteristically persistent/stubborn person)",
"חָנִיךְ": "student, pupil (trainee, apprentice, cadet)",
"תַּלְמִיד": "student, pupil (school student, common word)",
"פִּקּוּחַ": "supervision (regulatory oversight, monitoring)",
"הַשְׁגָּחָה": "supervision (watchful care, divine providence; kosher certification)",
"הַסְפָּקָה": "supply, provision (the act of supplying goods)",
"אַסְפָּקָה": "supply, provision (military/logistical provisioning)",
"אֲרָעִי": "temporary, provisional (makeshift, not permanent)",
"זְמַנִּי": "temporary, time-limited (for a limited period)",
"אֵלֶה": "these (standard demonstrative pronoun)",
"אֵלוּ": "these (literary/Mishnaic variant)",
"בֹּהֶן": "thumb; big toe (anatomical term)",
"אֲגוּדָל": "thumb (common/colloquial word for thumb)",
"זְמַן": "time (general, measurable time; tense in grammar)",
"עֵת": "time (a specific moment, epoch, literary/biblical)",
"עִתּוּי": "timing (choosing the right moment)",
"תִּזְמוּן": "timing (synchronization, technical scheduling)",
"לְכַתֵּב": "to address (write an address on); to engrave",
"לְמַעֵן": "to address (direct/target communication toward)",
"לְזַיֵּן": "to arm (equip with weapons; vulgar slang)",
"לְחַמֵּשׁ": "to arm (equip/furnish with armaments)",
"לְהִתְאַסֵּף": "to assemble, to gather together (of people collecting)",
"לְהִתְכַּנֵּס": "to assemble, to convene (a formal meeting/conference)",
"לְהִכָּבֵל": "to be bound (chained, shackled with chains)",
"לְהִכָּפֵת": "to be bound (handcuffed, tied up physically)",
"לְהִבָּרֵא": "to be created (divine/fundamental creation, ex nihilo)",
"לְהִוָּצֵר": "to be created (formed, shaped, manufactured)",
"לְהִגָּזֵז": "to be cut off (sheared, trimmed, as hair/wool)",
"לְהִגָּזֵר": "to be cut off (decreed, sentenced; derived from)",
"לְהִקָּטֵעַ": "to be cut off (interrupted, severed abruptly)",
"לְהִנָּגֵף": "to be defeated (struck down, plagued; biblical)",
"לְהֵרָעֵץ": "to be defeated (crushed, shattered; literary)",
"לְהֵהָרֵס": "to be destroyed (demolished, wrecked; slang: exhausted)",
"לְהֵחָרֵב": "to be destroyed (laid waste, devastated; of cities/temples)",
"לְהִסָּתֵר": "to be hidden; to hide oneself (take cover)",
"לְהִצָּפֵן": "to be hidden (encoded, concealed from view)",
"לְהִנָּטֵעַ": "to be planted (of trees/plants, set in soil)",
"לְהִשָּׁתֵל": "to be planted (implanted, transplanted; of an organ or undercover agent)",
"לָדֹם": "to be silent (to become utterly still; literary)",
"לִשְׁתֹּק": "to be silent (to stop talking, keep quiet; common)",
"לְהִתְקַמֵּץ": "to be stingy (to pinch pennies, scrimp)",
"לְהִתְקַמְצֵן": "to be stingy (to act like a miser, be miserly)",
"לְהִבָּדֵק": "to be tested, checked (verified, inspected)",
"לְהִבָּחֵן": "to be tested, examined (undergo a formal exam/evaluation)",
"נִהְיָה": "to become (turn into, come to be; common)",
"לְהֵעָשׂוֹת": "to become; to be made, to be done, to be carried out",
"לְהִתְבַּהֵר": "to become clear (clarified, understood)",
"לְהִצְטַלֵּל": "to become clear (of liquid becoming transparent/limpid)",
"לְכוֹפֵף": "to bend (flex, bow down, curve something)",
"לְקַמֵּר": "to bend, to vault (arch over, create a dome shape)",
"לְקַשֵּׁת": "to bend, to curve (form into a bow/arc shape)",
"לְפַחֵם": "to blacken (carbonize, char with coal/charcoal)",
"לְפַיֵּחַ": "to blacken (cover with soot, smoke residue)",
"לְמַצְמֵץ": "to blink (rapidly open and close one's eyes)",
"לְעַפְעֵף": "to blink (flutter one's eyelids)",
"לִנְפֹּחַ": "to blow (puff up, inflate; blow air)",
"לִנְשֹׁף": "to blow, to exhale; to play a wind instrument",
"לְצַיֵּץ": "to chirp, to tweet (of birds; to post on social media)",
"לְצַפְצֵף": "to chirp, to whistle (shrill piping sound; to not care — slang)",
"לְחַבֵּר": "to connect, to join (attach together; to compose/write)",
"לְקַשֵּׁר": "to connect, to link (establish a relationship/connection)",
"לְהָסִיחַ": "to converse (engage in casual talk; to divert attention)",
"לְהָשִׂיחַ": "to converse, to talk (literary; to speak with)",
"לְסַלְסֵל": "to curl (hair); to trill (music)",
"לְתַלְתֵּל": "to curl (hair into ringlets/curls)",
"לְיַפּוֹת": "to beautify, to embellish (make more attractive)",
"לְפַרְכֵּס": "to embellish; to squirm, to flounder",
"לִדְרֹשׁ": "to demand; to inquire, to preach (seek/expound)",
"לִתְבֹּעַ": "to demand; to sue, to claim (legal demand)",
"לְהֵישִׁיר": "to direct; to straighten, to look straight at",
"לְהַפְנוֹת": "to direct; to refer someone (redirect attention/person)",
"לְהַגְזִים": "to exaggerate (overstate, blow out of proportion; common)",
"לְהַפְרִיז": "to exaggerate (go to extremes, overdo; formal)",
"לְהִמּוֹג": "to fade, to dissolve (melt away, lose form; literary)",
"לְהִנָּדֵף": "to fade, to dissipate (blown away, scattered by wind)",
"לִפֹּל": "to fall (general: fall down, collapse; common word)",
"לִנְשֹׁר": "to fall, to drop (shed: leaves, hair; drop out of school)",
"לְכַלּוֹת": "to finish (consume entirely, exhaust; to annihilate)",
"לְסַיֵּם": "to finish, to complete (conclude, bring to an end; common)",
"לִנְהֹר": "to flow (stream toward); to shine, to glow",
"לִשְׁתֹּת": "to flow (pour forth, stream out; literary)",
"לִמְחֹל": "to forgive (pardon on a personal level, waive a claim)",
"לִסְלֹחַ": "to forgive, to pardon (general, standard word for forgiving)",
"לְהַחְבִּיא": "to hide, to conceal (physically stash away; common)",
"לְהַעֲלִים": "to hide, to conceal (suppress information; to evade)",
"לִדְלֹף": "to leak (of a pipe, roof; seep through)",
"לִנְזֹל": "to drip, to trickle (flow in drops, ooze)",
"לִזְנֹחַ": "to abandon, to neglect (forsake, discard)",
"לַעֲזֹב": "to leave, to abandon (depart from; give up; common word)",
"לְהַנִּיחַ": "to place, to put (set down carefully); to assume",
"לְהָשִׂים": "to place, to put (set/assign); to turn into something",
"לְפָאֵר": "to glorify, to adorn (extol with grandeur)",
"לְשַׁבֵּחַ": "to praise, to commend (express approval; common)",
"לִדְחֹף": "to push, to shove (physically push forward; common)",
"לִדְחֹק": "to push, to press (squeeze, crowd; urge insistently)",
"לְהַבְרִיא": "to recover (regain health, get well; common)",
"לְהַחְלִים": "to recover, to convalesce (heal fully from illness; formal)",
"לַעֲלֹץ": "to rejoice, to exult (leap with joy; literary)",
"לָשׂוּשׂ": "to rejoice (be glad, delight in; biblical/literary)",
"לְהוֹשִׁיעַ": "to rescue, to save (deliver from danger; biblical/literary)",
"לְהַצִּיל": "to rescue, to save (common, everyday word)",
"לְחַכֵּךְ": "to rub (scratch an itch, abrade gently)",
"לְשַׁפְשֵׁף": "to rub (scrub, polish by rubbing repeatedly)",
"לִסְרֹט": "to scratch (scrape with a sharp object; to make a video/film)",
"לִשְׂרֹט": "to scratch (draw a line, score a surface)",
"לִנְגֹּהַּ": "to shine (glow with bright light; literary)",
"לִקְרֹן": "to shine, to beam (radiate light, as from horns of light)",
"לְהַחֲרִישׁ": "to silence; to be silent (choose not to respond; literary)",
"לְהַשְׁתִּיק": "to silence (make someone/something stop making noise; common)",
"לִטְבֹּחַ": "to slaughter (massacre, butcher violently)",
"לִשְׁחֹט": "to slaughter (ritually slaughter an animal; shecht)",
"לְהִתְמַחוֹת": "to specialize (become an expert in a field)",
"לְהִתְמַקְצֵעַ": "to specialize (become a professional, gain proficiency)",
"לְבַקֵּעַ": "to split, to cleave (crack open forcefully)",
"לְבַתֵּק": "to split, to cleave; to pierce (cut through)",
"לִמְרֹחַ": "to spread (smear, apply a spread on surface)",
"לִשְׁטֹחַ": "to spread (lay out flat, unfurl); to present, explicate",
"לְאַשֵּׁשׁ": "to strengthen, to establish (shore up, substantiate)",
"לְחַזֵּק": "to strengthen (make stronger, reinforce; common word)",
"לְהִתְיַסֵּר": "to suffer (be tormented, endure agony)",
"לְהִתְעַנּוֹת": "to suffer; to fast (endure hardship/deprivation; literary)",
"לִידוֹת": "to throw, to hurl (cast, fling; biblical)",
"לִרְמוֹת": "to throw, to hurl (toss; biblical)",
"לִגְזֹז": "to trim (shear wool/hair, clip close)",
"לִגְזֹם": "to trim (prune branches/bushes, cut back vegetation)",
"לְאַדּוֹת": "to vaporize (steam, evaporate); to simmer, to poach (cooking)",
"לְאַיֵּד": "to vaporize, to evaporate (cause to turn into vapor)",
"לֶאֱרֹג": "to weave (on a loom, produce fabric; common word)",
"לִשְׁזֹר": "to weave (intertwine, braid, thread together)",
"בְּיַחַד": "together (as a group, common usage with 'be-')",
"יַחַד": "together (jointly, in unison; literary)",
"יַחְדָּו": "together (jointly; biblical/poetic variant)",
"מִסְחָר": "trade, commerce (the business/sector of trading)",
"סַחַר": "trade, commerce (goods traded, merchandise; literary)",
"אֱמֶת": "truth (common word for truth, verity)",
"אֲמִתָּה": "truth; axiom (fundamental truth, literary)",
"מִצְנֶפֶת": "turban (formal headdress, priestly turban)",
"צָנִיף": "turban, head wrap (wrapped head covering)",
"אַחְדוּת": "unity (state of being united, solidarity)",
"אִחוּד": "unification (the act of uniting, merging)",
"בִּקְעָה": "valley (broad, flat valley plain)",
"עֵמֶק": "valley (deep valley between mountains/hills)",
"אִשְׁרָה": "visa; approval (entry permit; formal approval)",
"וִיזָה": "visa (travel visa, loanword)",
"כֹּתֶל": "wall (the Western Wall; a freestanding stone wall)",
"קִיר": "wall (common word for wall of a room/building)",
"אַזְהָרָה": "warning (a caution, alert; legal/safety warning)",
"הַזְהָרָה": "warning (the act of warning someone; admonition)",
"רַהַט": "water trough (channel, gutter for water flow)",
"שֹׁקֶת": "water trough (feeding/drinking trough for animals)",
"אִלּוּלֵי": "were it not for (standard conditional; common)",
"לוּלֵא": "were it not for (literary/Talmudic variant)",
"אוֹפַןּ": "wheel (a single wheel; biblical/poetic)",
"גַּלְגַּל": "wheel (rolling wheel; cycle, pulley)",
"אַיֵּה": "where? (literary/biblical: where is?)",
"הֵיכָן": "where? (standard literary form of 'where')",
"לֹבֶן": "whiteness (white of the eye; white color)",
"צְחוֹר": "whiteness; purity (brilliant white, radiance)",
"עוֹלָם": "world (the world, universe; eternity; common word)",
"תֵּבֵל": "world, universe (the inhabited world; poetic/literary)",
"פֶּצַע": "wound (a specific cut, gash, open wound)",
"פְּצִיעָה": "wound, injury (the event/act of being wounded)",
"כִּסּוּפִים": "yearning, longing (wistful craving, literary; plural)",
"עֶרְגָּה": "yearning, longing (deep nostalgic longing, literary)"
}

19525
data/vetted_sentences.json Normal file

File diff suppressed because it is too large Load diff


446
epub_examples.py Normal file
View file

@ -0,0 +1,446 @@
#!/usr/bin/env python3
"""
Extract example sentences from nikud'd Hebrew EPUBs (and PDFs where possible),
match them against the vocab list, and produce examples_cache.json.

Usage:
    python3 epub_examples.py

Outputs:
    data/epub_sentence_index.json   full sentence corpus
    data/examples_cache.json        best sentence(s) per vocab word
"""

import csv
import json
import os
import re
import zipfile
from html.parser import HTMLParser
from pathlib import Path

from helpers import strip_nikkud

DATA_DIR = Path(__file__).parent / "data"
EPUB_DIR = DATA_DIR / "epubs"
DICT_CSV = DATA_DIR / "hebrew_dict_for_anki.csv"

# Book metadata: filename -> display name
EPUB_BOOKS = {
    "little_prince.epub": "הנסיך הקטן",
    "time_tunnel_82.epub": "מנהרת הזמן 82",
}

# PDF books are excluded — pypdf produces garbled RTL text (reversed chars within
# words). If/when a proper EPUB version becomes available on Calibre, add it to
# EPUB_BOOKS above instead.
PDF_BOOKS: dict[str, str] = {}

# Sentence length bounds (word count)
MIN_WORDS = 4
MAX_WORDS = 15

# ── HTML text extraction ─────────────────────────────────────────


class _TextExtractor(HTMLParser):
    """Extract text content from HTML, skipping script/style tags."""

    SKIP_TAGS = {"script", "style", "head"}

    def __init__(self):
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        # Insert space for block-level elements to avoid word concatenation
        if tag in (
            "p",
            "div",
            "br",
            "li",
            "h1",
            "h2",
            "h3",
            "h4",
            "h5",
            "h6",
            "td",
            "th",
            "tr",
            "blockquote",
            "section",
        ):
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)

    def get_text(self) -> str:
        return "".join(self.parts)


def extract_text_from_html(html: str) -> str:
    """Parse HTML and return plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    return parser.get_text()

# ── EPUB processing ──────────────────────────────────────────────


def _content_files_from_epub(zf: zipfile.ZipFile) -> list[str]:
    """Get ordered list of content XHTML files from the OPF manifest."""
    # Find the OPF file
    opf_path = None
    for name in zf.namelist():
        if name.endswith(".opf"):
            opf_path = name
            break
    if not opf_path:
        # Fallback: just use all xhtml files
        return sorted(
            n
            for n in zf.namelist()
            if n.endswith((".xhtml", ".html"))
            and "toc" not in n.lower()
            and "cover" not in n.lower()
            and "nav" not in n.lower()
        )
    # Parse OPF to get spine order
    opf_content = zf.read(opf_path).decode("utf-8")
    opf_dir = os.path.dirname(opf_path)
    # Extract manifest items: id -> href
    manifest = {}
    for m in re.finditer(r'<item\s+[^>]*id="([^"]+)"[^>]*href="([^"]+)"', opf_content):
        manifest[m.group(1)] = m.group(2)
    # Also try reversed attribute order
    for m in re.finditer(r'<item\s+[^>]*href="([^"]+)"[^>]*id="([^"]+)"', opf_content):
        manifest[m.group(2)] = m.group(1)
    # Extract spine order
    spine_ids = re.findall(r'<itemref\s+[^>]*idref="([^"]+)"', opf_content)
    result = []
    for sid in spine_ids:
        href = manifest.get(sid, "")
        if href and href.endswith((".xhtml", ".html")):
            full_path = os.path.join(opf_dir, href) if opf_dir else href
            # Normalize path separators
            full_path = full_path.replace("\\", "/")
            if full_path in zf.namelist():
                result.append(full_path)
    if not result:
        # Fallback
        return sorted(
            n
            for n in zf.namelist()
            if n.endswith((".xhtml", ".html")) and "toc" not in n.lower() and "cover" not in n.lower()
        )
    return result


def extract_sentences_from_epub(epub_path: Path, book_name: str) -> list[dict]:
    """Extract sentences from an EPUB file.

    Returns list of {"text": str, "book": str, "stripped": str}
    """
    zf = zipfile.ZipFile(epub_path)
    content_files = _content_files_from_epub(zf)
    all_text = []
    for cf in content_files:
        try:
            html = zf.read(cf).decode("utf-8")
        except (KeyError, UnicodeDecodeError):
            continue
        text = extract_text_from_html(html)
        all_text.append(text)
    full_text = "\n".join(all_text)
    return _split_into_sentences(full_text, book_name)

# ── PDF processing ───────────────────────────────────────────────


def extract_sentences_from_pdf(pdf_path: Path, book_name: str) -> list[dict]:
    """Extract sentences from a PDF file (best-effort, handles RTL reversal)."""
    try:
        import pypdf
    except ImportError:
        print(f" [SKIP] pypdf not installed, cannot process {pdf_path.name}")
        return []
    reader = pypdf.PdfReader(pdf_path)
    all_text_parts = []
    for page in reader.pages:
        raw = page.extract_text()
        if not raw:
            continue
        # pypdf often reverses word order for RTL text; fix it
        fixed_lines = []
        for line in raw.split("\n"):
            words = line.split()
            # Check if this line is predominantly Hebrew
            hebrew_chars = sum(1 for c in line if "\u0590" <= c <= "\u05ff")
            if hebrew_chars > len(line) * 0.3 and len(words) > 1:
                # Reverse word order
                fixed_lines.append(" ".join(reversed(words)))
            else:
                fixed_lines.append(line)
        all_text_parts.append("\n".join(fixed_lines))
    full_text = "\n".join(all_text_parts)
    return _split_into_sentences(full_text, book_name)

# ── Sentence splitting ───────────────────────────────────────────
# Hebrew sentence terminators: period, exclamation, question mark, sof pasuk
_SENT_SPLIT = re.compile(r"[.!?\u05C3]+")
# Punctuation to strip from word boundaries when matching
_PUNCT = re.compile(
r'^[\u0022\u0027\u05F4\u05F3,;:\-–—…\u201C\u201D\u201E\u201F\u2018\u2019()\[\]{}«»"\']+|[\u0022\u0027\u05F4\u05F3,;:\-–—…\u201C\u201D\u201E\u201F\u2018\u2019()\[\]{}«»"\']+$'
)

def _split_into_sentences(text: str, book_name: str) -> list[dict]:
    """Split text into sentences and filter by length."""
    # Normalize whitespace
    text = re.sub(r"\s+", " ", text).strip()
    raw_sentences = _SENT_SPLIT.split(text)
    results = []
    seen = set()
    for sent in raw_sentences:
        sent = sent.strip()
        if not sent:
            continue
        # Count Hebrew words (skip non-Hebrew tokens like numbers)
        words = sent.split()
        hebrew_words = [w for w in words if any("\u0590" <= c <= "\u05ff" for c in w)]
        if len(hebrew_words) < MIN_WORDS or len(hebrew_words) > MAX_WORDS:
            continue
        # Skip duplicates
        stripped = strip_nikkud(sent)
        if stripped in seen:
            continue
        seen.add(stripped)
        results.append(
            {
                "text": sent,
                "book": book_name,
                "stripped": stripped,
            }
        )
    return results

# ── Vocab loading ────────────────────────────────────────────────


def load_vocab(csv_path: Path) -> dict[str, list[str]]:
    """Load vocab CSV and return a {stripped_form: [nikkud_words]} mapping.

    Each word is indexed under both the nikkud-stripped form of its "Word"
    column and the value of its "Word Without Nikkud" column.
    """
    words_by_stripped: dict[str, list[str]] = {}  # stripped -> [nikkud words]
    with open(csv_path, encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter=";")
        for row in reader:
            nikkud_word = row.get("Word", "").strip()
            word_no_nik = row.get("Word Without Nikkud", "").strip()
            if not nikkud_word:
                continue
            # Method 1: strip nikkud from the Word column
            stripped_from_nikkud = strip_nikkud(nikkud_word)
            # Add both forms for matching
            for form in {stripped_from_nikkud, word_no_nik}:
                if form:
                    words_by_stripped.setdefault(form, []).append(nikkud_word)
    return words_by_stripped

# ── Matching ─────────────────────────────────────────────────────


def match_sentences(sentences: list[dict], words_by_stripped: dict) -> dict:
    """Match sentences against vocab words.

    Returns {nikkud_word: [sentences]} with best (shortest) first.
    """
    # Build a set of all stripped forms for fast lookup
    all_forms = set(words_by_stripped.keys())
    # Hebrew single-letter prefixes: ב, ה, ו, כ, ל, מ, ש, ד (של)
    _HEB_PREFIXES = set("בהוכלמשד")
    # For each sentence, extract stripped words
    matches: dict[str, list[tuple[int, str]]] = {}  # nikkud_word -> [(word_count, sentence)]
    for sent_info in sentences:
        sent_text = sent_info["text"]
        sent_stripped = sent_info["stripped"]
        word_count = len(sent_text.split())
        # Get stripped words from the sentence
        raw_words = sent_stripped.split()
        # Map: candidate_form -> set of original cleaned words that produced it
        # This lets us verify that prefix stripping is plausible
        candidates: dict[str, str] = {}  # form -> original_word
        for w in raw_words:
            cleaned = _PUNCT.sub("", w)
            if not cleaned:
                continue
            # Direct match (always try)
            candidates[cleaned] = cleaned
            # Prefix stripping: only if remaining stem is >= 2 chars
            # and the prefix char is a known Hebrew prefix letter
            for prefix_len in (1, 2):
                if len(cleaned) > prefix_len + 1:
                    prefix = cleaned[:prefix_len]
                    stem = cleaned[prefix_len:]
                    if all(c in _HEB_PREFIXES for c in prefix) and len(stem) >= 2:
                        candidates[stem] = cleaned
        # Check which vocab words appear in this sentence
        matched_forms = set(candidates.keys()) & all_forms
        for form in matched_forms:
            # Skip spurious matches: very short vocab forms (1-2 chars)
            # should only match via direct word match, not prefix stripping
            if len(form) <= 2 and form not in {_PUNCT.sub("", w) for w in raw_words}:
                continue
            for nikkud_word in words_by_stripped[form]:
                matches.setdefault(nikkud_word, []).append((word_count, sent_text))
    # Sort by word count (prefer shorter sentences) and deduplicate
    result = {}
    for nikkud_word, sent_list in matches.items():
        sent_list.sort(key=lambda x: x[0])
        seen = set()
        unique = []
        for _, sent in sent_list:
            if sent not in seen:
                seen.add(sent)
                unique.append(sent)
            if len(unique) >= 5:  # Keep top 5 per word
                break
        result[nikkud_word] = unique
    return result

# ── Main ─────────────────────────────────────────────────────────


def main():
    print("=" * 60)
    print("EPUB Example Sentence Extraction Pipeline")
    print("=" * 60)
    # Step 1: Extract sentences from all books
    all_sentences = []
    book_counts = {}
    for filename, book_name in EPUB_BOOKS.items():
        path = EPUB_DIR / filename
        if not path.exists():
            print(f"\n[SKIP] {filename} not found")
            continue
        print(f"\n[EPUB] Extracting: {book_name} ({filename})")
        sentences = extract_sentences_from_epub(path, book_name)
        book_counts[book_name] = len(sentences)
        all_sentences.extend(sentences)
        print(f" -> {len(sentences)} sentences")
    for filename, book_name in PDF_BOOKS.items():
        path = EPUB_DIR / filename
        if not path.exists():
            print(f"\n[SKIP] {filename} not found")
            continue
        print(f"\n[PDF] Extracting: {book_name} ({filename})")
        sentences = extract_sentences_from_pdf(path, book_name)
        book_counts[book_name] = len(sentences)
        all_sentences.extend(sentences)
        print(f" -> {len(sentences)} sentences")
    print(f"\nTotal sentences: {len(all_sentences)}")
    # Step 2: Save sentence index
    index_path = DATA_DIR / "epub_sentence_index.json"
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump({"sentences": all_sentences}, f, ensure_ascii=False, indent=2)
    print(f"\nSaved sentence index: {index_path}")
    # Step 3: Load vocab and match
    print(f"\nLoading vocab from {DICT_CSV} ...")
    words_by_stripped = load_vocab(DICT_CSV)
    total_vocab = len({w for wlist in words_by_stripped.values() for w in wlist})
    print(f" {total_vocab} unique vocab words ({len(words_by_stripped)} lookup forms)")
    print("\nMatching sentences against vocab ...")
    examples_cache = match_sentences(all_sentences, words_by_stripped)
    # Step 4: Save examples_cache
    cache_path = DATA_DIR / "examples_cache.json"
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(examples_cache, f, ensure_ascii=False, indent=2)
    print(f"Saved examples cache: {cache_path}")
    # Step 5: Summary stats
    print("\n" + "=" * 60)
    print("SUMMARY")
    print("=" * 60)
    print("\nSentences per book:")
    for book_name, count in book_counts.items():
        print(f" {book_name}: {count}")
    print(f" Total: {len(all_sentences)}")
    print("\nVocab matching:")
    print(f" Total vocab words: {total_vocab}")
    print(f" Words with examples: {len(examples_cache)}")
    coverage = 100 * len(examples_cache) / total_vocab if total_vocab else 0
    print(f" Coverage: {coverage:.1f}%")
    # Show some sample matches
    print("\nSample matches:")
    count = 0
    for word, sents in examples_cache.items():
        if count >= 5:
            break
        print(f" {word} -> {sents[0][:60]}...")
        count += 1
    return examples_cache


if __name__ == "__main__":
    main()
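
The prefix-stripping step in match_sentences can be tried in isolation. The sketch below is a hypothetical standalone helper (the name candidate_forms is not part of this commit) that reproduces the same rule: keep the word itself, then try removing one or two leading letters if every removed letter is in the prefix set and at least two characters remain.

```python
# Hypothetical standalone sketch of the prefix-stripping rule used in
# match_sentences; candidate_forms is an illustrative name, not repo code.
HEB_PREFIXES = set("בהוכלמשד")


def candidate_forms(word: str) -> set[str]:
    """Return the word plus the stems left after removing 1-2 prefix letters."""
    forms = {word}
    for prefix_len in (1, 2):
        # Require a stem of at least two characters after the prefix.
        if len(word) > prefix_len + 1:
            prefix, stem = word[:prefix_len], word[prefix_len:]
            if all(c in HEB_PREFIXES for c in prefix):
                forms.add(stem)
    return forms


# A toy vocab set; intersecting it with the candidates finds the match.
vocab = {"בית", "ילד"}
print(candidate_forms("ובבית") & vocab)
```

Because the rule is purely orthographic, it will also strip letters that belong to the root of a word, which is presumably why the pipeline adds the extra guard for very short vocab forms.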

View file

@ -7,18 +7,15 @@ Exposed API: get_frequency_rank(word_no_nikkud) -> int | None
 import json
 import logging
-import re
-import unicodedata
 from pathlib import Path
 import requests
+from helpers import strip_nikkud as _strip_nikkud
 logger = logging.getLogger(__name__)
-FREQ_URL = (
-    "https://raw.githubusercontent.com/hermitdave/FrequencyWords/"
-    "master/content/2016/he/he_50k.txt"
-)
+FREQ_URL = "https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content/2016/he/he_50k.txt"
 CACHE_PATH = Path(__file__).parent / "data" / "frequency_cache.json"
 REQUEST_TIMEOUT = 30
@ -26,14 +23,6 @@ REQUEST_TIMEOUT = 30
 _freq: dict[str, int] = {}
-def _strip_nikkud(text: str) -> str:
-    """Remove Hebrew nikkud (diacritics) from a string."""
-    return "".join(
-        ch for ch in unicodedata.normalize("NFD", text)
-        if unicodedata.category(ch) != "Mn"
-    )
 def load(cache_path: Path = CACHE_PATH) -> None:
     """Load frequency data from cache, downloading if not present."""
     global _freq
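
For reference, the removed local helper can be run on its own. The sketch below mirrors the deleted _strip_nikkud body; whether helpers.strip_nikkud is implemented the same way is an assumption, since helpers.py is not shown in this part of the diff.

```python
import unicodedata


def strip_nikkud(text: str) -> str:
    """Drop every combining mark (Unicode category Mn) after NFD normalization."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# A pointed Hebrew word written with explicit escapes so the vowel points
# (U+05B4, U+05C2, U+05B0, U+05B8) are visible in the source.
pointed = "\u05e9\u05b4\u05c2\u05de\u05b0\u05d7\u05b8\u05d4"
print(strip_nikkud(pointed))
```

Note that this approach removes combining marks from any script, not only Hebrew, so accented Latin text passed through it would lose its accents as well.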

View file

@ -4,25 +4,20 @@ Extract Hebrew vocabulary from pealim.com dictionary.
 Scrapes word entries, roots, parts of speech, and audio URLs for Anki flashcards.
 """
-import requests
-import pandas as pd
-from bs4 import BeautifulSoup
 import logging
 import time
-from typing import Optional
+import pandas as pd
+import requests
+from bs4 import BeautifulSoup
 # Configure logging
-logging.basicConfig(
-    level=logging.INFO,
-    format='%(asctime)s - %(levelname)s - %(message)s'
-)
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
 logger = logging.getLogger(__name__)
 # Session for connection pooling
 session = requests.Session()
-session.headers.update({
-    'User-Agent': 'Mozilla/5.0 (compatible; pealim-scraper/1.0)'
-})
+session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; pealim-scraper/1.0)"})
 PEALIM_DICT_URL = "https://www.pealim.com/dict/"
 REQUEST_DELAY = 1.5  # seconds between requests (respectful scraping)
@ -33,7 +28,7 @@ def get_total_pages() -> int:
     """Dynamically determine total pages from first request."""
     try:
         logger.info("Fetching total page count...")
-        cookies = {'translit': 'none', 'hebstyle': 'mo'}
+        cookies = {"translit": "none", "hebstyle": "mo"}
         response = session.get(PEALIM_DICT_URL, cookies=cookies, timeout=REQUEST_TIMEOUT)
         response.raise_for_status()
         # Hardcoded — pealim.com has ~608 pages at ~15 words/page
@ -48,17 +43,17 @@ def _parse_page_with_audio(html_bytes: bytes) -> list[dict]:
     Parse a dict page with BeautifulSoup to extract word data + audio URL.
     Returns list of dicts with keys: Word, Root, Part of Speech, Meaning, audio_url.
     """
-    soup = BeautifulSoup(html_bytes, 'html.parser')
+    soup = BeautifulSoup(html_bytes, "html.parser")
     rows = []
-    for tr in soup.select('table tr'):
-        tds = tr.find_all('td')
+    for tr in soup.select("table tr"):
+        tds = tr.find_all("td")
         if len(tds) < 4:
             continue
         # Audio URL from span[data-audio] in first td
-        audio_span = tds[0].find(attrs={'data-audio': True})
-        audio_url = audio_span['data-audio'] if audio_span else ''
+        audio_span = tds[0].find(attrs={"data-audio": True})
+        audio_url = audio_span["data-audio"] if audio_span else ""
         # Word with nikkud
-        menukad = tds[0].find('span', class_='menukad')
+        menukad = tds[0].find("span", class_="menukad")
         word = menukad.get_text(strip=True) if menukad else tds[0].get_text(strip=True)
         # Root (may be link or plain text)
         root = tds[1].get_text(strip=True)
@ -67,17 +62,19 @@ def _parse_page_with_audio(html_bytes: bytes) -> list[dict]:
# Meaning # Meaning
meaning = tds[3].get_text(strip=True) meaning = tds[3].get_text(strip=True)
if word: if word:
rows.append({ rows.append(
'Word': word, {
'Root': root if root else '-', "Word": word,
'Part of Speech': pos, "Root": root if root else "-",
'Meaning': meaning, "Part of Speech": pos,
'audio_url': audio_url, "Meaning": meaning,
}) "audio_url": audio_url,
}
)
return rows return rows
def extract_from_website(max_pages: Optional[int] = None) -> pd.DataFrame: def extract_from_website(max_pages: int | None = None) -> pd.DataFrame:
""" """
Extract dictionary entries from pealim.com. Extract dictionary entries from pealim.com.
Captures audio URLs from each word entry's data-audio attribute. Captures audio URLs from each word entry's data-audio attribute.
@ -93,33 +90,33 @@ def extract_from_website(max_pages: Optional[int] = None) -> pd.DataFrame:
all_rows: list[dict] = [] all_rows: list[dict] = []
for page_num in range(1, total_pages): for page_num in range(1, total_pages + 1):
try: try:
url = f"{PEALIM_DICT_URL}?page={page_num}" url = f"{PEALIM_DICT_URL}?page={page_num}"
# First request: with nikkud — parse with BeautifulSoup for audio URL # First request: with nikkud — parse with BeautifulSoup for audio URL
cookies = {'translit': 'none', 'hebstyle': 'mo'} cookies = {"translit": "none", "hebstyle": "mo"}
response = session.get(url, cookies=cookies, timeout=REQUEST_TIMEOUT) response = session.get(url, cookies=cookies, timeout=REQUEST_TIMEOUT)
response.raise_for_status() response.raise_for_status()
page_rows = _parse_page_with_audio(response.content) page_rows = _parse_page_with_audio(response.content)
# Second request: without nikkud — just get the word column # Second request: without nikkud — just get the word column
cookies_vl = {'translit': 'none', 'hebstyle': 'vl', 'showmeaning': 'off'} cookies_vl = {"translit": "none", "hebstyle": "vl", "showmeaning": "off"}
resp_vl = session.get(url, cookies=cookies_vl, timeout=REQUEST_TIMEOUT) resp_vl = session.get(url, cookies=cookies_vl, timeout=REQUEST_TIMEOUT)
resp_vl.raise_for_status() resp_vl.raise_for_status()
soup_vl = BeautifulSoup(resp_vl.content, 'html.parser') soup_vl = BeautifulSoup(resp_vl.content, "html.parser")
no_nik_words = [] no_nik_words = []
for tr in soup_vl.select('table tr'): for tr in soup_vl.select("table tr"):
tds = tr.find_all('td') tds = tr.find_all("td")
if len(tds) < 4: if len(tds) < 4:
continue continue
menukad = tds[0].find('span', class_='menukad') menukad = tds[0].find("span", class_="menukad")
w = menukad.get_text(strip=True) if menukad else tds[0].get_text(strip=True) w = menukad.get_text(strip=True) if menukad else tds[0].get_text(strip=True)
no_nik_words.append(w) no_nik_words.append(w)
# Merge no-nikkud words into rows # Merge no-nikkud words into rows
for i, row in enumerate(page_rows): for i, row in enumerate(page_rows):
row['Word Without Nikkud'] = no_nik_words[i] if i < len(no_nik_words) else '' row["Word Without Nikkud"] = no_nik_words[i] if i < len(no_nik_words) else ""
all_rows.extend(page_rows) all_rows.extend(page_rows)
@ -136,7 +133,7 @@ def extract_from_website(max_pages: Optional[int] = None) -> pd.DataFrame:
continue continue
df = pd.DataFrame(all_rows) df = pd.DataFrame(all_rows)
audio_count = (df['audio_url'] != '').sum() if 'audio_url' in df.columns else 0 audio_count = (df["audio_url"] != "").sum() if "audio_url" in df.columns else 0
logger.info(f"Extraction complete. Total words: {len(df)}, with audio URL: {audio_count}") logger.info(f"Extraction complete. Total words: {len(df)}, with audio URL: {audio_count}")
return df return df
@ -150,39 +147,39 @@ def modify_for_anki(df: pd.DataFrame) -> pd.DataFrame:
# Find shared root words # Find shared root words
shared_root_words = [] shared_root_words = []
for idx, row in df.iterrows(): for _idx, row in df.iterrows():
root = row['Root'] root = row["Root"]
word = row['Word'] word = row["Word"]
if root != '-' and pd.notna(root): if root != "-" and pd.notna(root):
same_root = df[(df['Root'] == root) & (df['Word'] != word)]['Word'].values same_root = df[(df["Root"] == root) & (df["Word"] != word)]["Word"].values
shared = ' '.join(str(w) for w in same_root) shared = " ".join(str(w) for w in same_root)
shared_root_words.append(shared) shared_root_words.append(shared)
else: else:
shared_root_words.append('') shared_root_words.append("")
df['shared roots'] = shared_root_words df["shared roots"] = shared_root_words
# Generate Hebrew tags # Generate Hebrew tags
tags = [] tags = []
for idx, row in df.iterrows(): for _idx, row in df.iterrows():
tag_parts = [] tag_parts = []
root = str(row['Root']).replace(' ', '').replace('-', '') root = str(row["Root"]).replace(" ", "").replace("-", "")
if 'nan' not in root and root: if "nan" not in root and root:
root_clean = root.replace('.', '') root_clean = root.replace(".", "")
tag_parts.append(f"שורש::{root_clean}") tag_parts.append(f"שורש::{root_clean}")
pos = str(row['Part of Speech']) pos = str(row["Part of Speech"])
pos_tags = { pos_tags = {
'Adverb': 'תוארי_הפועל', "Adverb": "תוארי_הפועל",
'Pronoun': 'כינוייוף', "Pronoun": "כינוייוף",
'Noun': 'שם_עצם', "Noun": "שם_עצם",
'Verb': 'פעלים', "Verb": "פעלים",
'Adjective': 'שם_תואר', "Adjective": "שם_תואר",
'Preposition': 'מילות_יחס', "Preposition": "מילות_יחס",
'Conjunction': 'מילות_חיבור', "Conjunction": "מילות_חיבור",
'Particle': 'מילית' "Particle": "מילית",
} }
for key, value in pos_tags.items(): for key, value in pos_tags.items():
@ -190,9 +187,9 @@ def modify_for_anki(df: pd.DataFrame) -> pd.DataFrame:
tag_parts.append(value) tag_parts.append(value)
break break
tags.append(' '.join(tag_parts)) tags.append(" ".join(tag_parts))
df['tags'] = tags df["tags"] = tags
logger.info("Anki preparation complete.") logger.info("Anki preparation complete.")
return df return df
@ -201,11 +198,11 @@ def main():
"""Main entry point.""" """Main entry point."""
try: try:
df = extract_from_website() df = extract_from_website()
df.to_csv('hebrew_dict.csv', index=True) df.to_csv("hebrew_dict.csv", index=True)
logger.info("Saved: hebrew_dict.csv") logger.info("Saved: hebrew_dict.csv")
df = modify_for_anki(df) df = modify_for_anki(df)
df.to_csv('hebrew_dict_for_anki.csv', sep=';', index=True) df.to_csv("hebrew_dict_for_anki.csv", sep=";", index=True)
logger.info("Saved: hebrew_dict_for_anki.csv") logger.info("Saved: hebrew_dict_for_anki.csv")
logger.info("Complete!") logger.info("Complete!")
@ -215,5 +212,5 @@ def main():
raise raise
if __name__ == '__main__': if __name__ == "__main__":
main() main()
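
Review note on the loop-bound change in `extract_from_website` (`range(1, total_pages)` became `range(1, total_pages + 1)`): the sketch below shows why the old bound never requested the last page. The helper name `page_numbers` is hypothetical and not part of the commit; 608 is the hardcoded page count from `get_total_pages`.

```python
def page_numbers(total_pages: int, inclusive: bool) -> range:
    """Return the page numbers a scrape loop would visit.

    Hypothetical helper for illustration only; the commit inlines the range call.
    """
    # range() excludes its stop value, so the stop must be total_pages + 1
    # for the final page to be included.
    stop = total_pages + 1 if inclusive else total_pages
    return range(1, stop)


old_pages = page_numbers(608, inclusive=False)  # behaviour before the fix
new_pages = page_numbers(608, inclusive=True)   # behaviour after the fix

last_old = old_pages[-1]
last_new = new_pages[-1]
```

With the old bound the loop stops at page 607, so the entries on page 608 are silently missing from the CSV.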

8
helpers.py Normal file

@ -0,0 +1,8 @@
"""Shared helper functions for the Hebrew Flash Cards project."""
import unicodedata
def strip_nikkud(text: str) -> str:
"""Remove Hebrew nikkud (diacritics) from a string."""
return "".join(ch for ch in unicodedata.normalize("NFD", text) if unicodedata.category(ch) != "Mn")
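
Review note: a minimal usage sketch of the shared helper, with the function body copied from the hunk above so the block runs on its own. The sample word is written with escape sequences to keep the code points explicit. Because the filter drops every character in Unicode category `Mn`, it also removes combining accents from non-Hebrew text, which may or may not be intended.

```python
import unicodedata


def strip_nikkud(text: str) -> str:
    """Remove Hebrew nikkud (diacritics) from a string."""
    return "".join(ch for ch in unicodedata.normalize("NFD", text) if unicodedata.category(ch) != "Mn")


# shin + qamats + shin dot, lamed, vav + holam, final mem ("shalom" with nikkud)
pointed = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
plain = strip_nikkud(pointed)

# Side effect: NFD splits "é" into "e" + combining acute, and the accent is dropped too.
latin = strip_nikkud("caf\u00e9")
```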


@ -22,40 +22,43 @@ import argparse
import json import json
import logging import logging
import re import re
import sys
import time import time
import unicodedata
from pathlib import Path from pathlib import Path
import requests import requests
from helpers import strip_nikkud as _strip_nikkud
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
DATA_DIR = Path(__file__).parent / "data" DATA_DIR = Path(__file__).parent / "data"
IMAGES_DIR = DATA_DIR / "images" IMAGES_DIR = DATA_DIR / "images"
CACHE_PATH = DATA_DIR / "image_cache.json" CACHE_PATH = DATA_DIR / "image_cache.json"
REQUEST_DELAY = 0.5 REQUEST_DELAY = 0.5
REQUEST_TIMEOUT = 10 REQUEST_TIMEOUT = 10
# Abstract noun suffixes — words whose English meaning ends in these are skipped # Abstract noun suffixes — words whose English meaning ends in these are skipped
ABSTRACT_SUFFIXES = ( ABSTRACT_SUFFIXES = (
"tion", "ity", "ness", "ment", "ance", "ence", "ism", "tion",
"hood", "ship", "ure", "age", "ity",
"ness",
"ment",
"ance",
"ence",
"ism",
"hood",
"ship",
"ure",
"age",
) )
session = requests.Session() session = requests.Session()
session.headers.update({ session.headers.update(
"User-Agent": "pealim-anki/3.0 (educational Hebrew Anki deck builder; contact: anki@pealim.invalid)" {"User-Agent": "pealim-anki/3.0 (educational Hebrew Anki deck builder; contact: anki@pealim.invalid)"}
}) )
def _strip_nikkud(text: str) -> str:
return "".join(
ch for ch in unicodedata.normalize("NFD", text)
if unicodedata.category(ch) != "Mn"
)
def is_concrete(english_meaning: str) -> bool: def is_concrete(english_meaning: str) -> bool:
"""Return True if the English meaning looks like a concrete noun.""" """Return True if the English meaning looks like a concrete noun."""
@ -196,7 +199,7 @@ def load_cache() -> dict:
try: try:
with open(CACHE_PATH, encoding="utf-8") as f: with open(CACHE_PATH, encoding="utf-8") as f:
return json.load(f) return json.load(f)
except Exception: except Exception: # noqa: S110
pass pass
return {} return {}
@ -242,10 +245,10 @@ def run(limit: int | None = None, dry_run: bool = False, single_word: str | None
if limit is not None and processed >= limit: if limit is not None and processed >= limit:
break break
word = str(row.get("Word", "")).strip() word = str(row.get("Word", "")).strip()
meaning = str(row.get("Meaning", "")).strip() meaning = str(row.get("Meaning", "")).strip()
word_plain = str(row.get("Word Without Nikkud", "")).strip() word_plain = str(row.get("Word Without Nikkud", "")).strip()
pos_raw = str(row.get("Part of speech", row.get("Part of Speech", ""))).strip() pos_raw = str(row.get("Part of speech", row.get("Part of Speech", ""))).strip()
if not word or not meaning or meaning in ("nan", "None"): if not word or not meaning or meaning in ("nan", "None"):
continue continue


@ -1,187 +0,0 @@
#!/usr/bin/env python3
"""
Extract Hebrew vocabulary from pealim.com dictionary.
Scrapes word entries, roots, and parts of speech for Anki flashcards.
"""
import requests
import pandas as pd
import logging
import time
from typing import Optional
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
# Session for connection pooling
session = requests.Session()
session.headers.update({
'User-Agent': 'Mozilla/5.0 (compatible; pealim-scraper/1.0)'
})
PEALIM_DICT_URL = "https://www.pealim.com/dict/"
REQUEST_DELAY = 1.5 # seconds between requests (respectful scraping)
REQUEST_TIMEOUT = 10 # seconds
def get_total_pages() -> int:
"""Dynamically determine total pages from first request."""
try:
logger.info("Fetching total page count...")
cookies = {'translit': 'none', 'hebstyle': 'mo'}
response = session.get(PEALIM_DICT_URL, cookies=cookies, timeout=REQUEST_TIMEOUT)
response.raise_for_status()
dfs = pd.read_html(response.content)
if dfs:
# Estimate pages from first page (typically 15 words per page)
# For now, use hardcoded value but this could be improved
return 608
except Exception as e:
logger.error(f"Error fetching page count: {e}. Using default (608).")
return 608
def extract_from_website(max_pages: Optional[int] = None) -> pd.DataFrame:
"""
Extract dictionary entries from pealim.com.
Args:
max_pages: Maximum pages to scrape (None = all)
Returns:
DataFrame with Word, Root, Part of Speech, and Word Without Nikkud columns
"""
total_pages = max_pages or get_total_pages()
logger.info(f"Starting extraction from {total_pages} pages...")
df = pd.DataFrame()
for page_num in range(1, total_pages):
try:
url = f"{PEALIM_DICT_URL}?page={page_num}"
# First request: with nikkud
cookies = {'translit': 'none', 'hebstyle': 'mo'}
response = session.get(url, cookies=cookies, timeout=REQUEST_TIMEOUT)
response.raise_for_status()
df_list = pd.read_html(response.content)
# Second request: without nikkud
cookies = {'translit': 'none', 'hebstyle': 'vl', 'showmeaning': 'off'}
response = session.get(url, cookies=cookies, timeout=REQUEST_TIMEOUT)
response.raise_for_status()
without_nikkud_words = pd.read_html(response.content)[-1]['Word']
without_nikkud_words = without_nikkud_words.rename('Word Without Nikkud')
# Combine and append
df_to_add = pd.concat([df_list[-1], without_nikkud_words], axis=1)
df = pd.concat([df, df_to_add], ignore_index=True)
if page_num % 50 == 0:
logger.info(f"Processed {page_num}/{total_pages} pages...")
time.sleep(REQUEST_DELAY)
except requests.RequestException as e:
logger.error(f"Error fetching page {page_num}: {e}. Retrying...")
time.sleep(REQUEST_DELAY * 2)
except Exception as e:
logger.error(f"Unexpected error on page {page_num}: {e}")
continue
logger.info(f"Extraction complete. Total words: {len(df)}")
return df
def modify_for_anki(df: pd.DataFrame) -> pd.DataFrame:
"""
Transform dictionary DataFrame for Anki import.
Adds shared root words and Hebrew tags.
Args:
df: Dictionary DataFrame
Returns:
Modified DataFrame ready for Anki
"""
logger.info("Preparing data for Anki...")
# Find shared root words
shared_root_words = []
for idx, row in df.iterrows():
root = row['Root']
word = row['Word']
if root != '-' and pd.notna(root):
# Find other words with same root
same_root = df[(df['Root'] == root) & (df['Word'] != word)]['Word'].values
shared = ' '.join(str(w) for w in same_root)
shared_root_words.append(shared)
else:
shared_root_words.append('')
df['shared roots'] = shared_root_words
# Generate Hebrew tags
tags = []
for idx, row in df.iterrows():
tag_parts = []
# Root tag
root = str(row['Root']).replace(' ', '').replace('-', '')
if 'nan' not in root and root:
root_clean = root.replace('.', '')
tag_parts.append(f"שורש::{root_clean}")
# Part of speech tag
pos = str(row['Part of Speech'])
pos_tags = {
'Adverb': 'תוארי_הפועל',
'Pronoun': 'כינוייוף',
'Noun': 'שם_עצם',
'Verb': 'פעלים',
'Adjective': 'שם_תואר',
'Preposition': 'מילות_יחס',
'Conjunction': 'מילות_חיבור',
'Particle': 'מילית'
}
for key, value in pos_tags.items():
if key in pos:
tag_parts.append(value)
break
tags.append(' '.join(tag_parts))
df['tags'] = tags
logger.info("Anki preparation complete.")
return df
def main():
"""Main entry point."""
try:
# Extract from website
df = extract_from_website()
df.to_csv('pealim_dict.csv', index=True)
logger.info("Saved: pealim_dict.csv")
# Transform for Anki
df = modify_for_anki(df)
df.to_csv('pealim_dict_for_anki.csv', sep=';', index=True)
logger.info("Saved: pealim_dict_for_anki.csv")
logger.info("✅ Complete!")
except Exception as e:
logger.error(f"Fatal error: {e}")
raise
if __name__ == '__main__':
main()

80
pyproject.toml Normal file

@ -0,0 +1,80 @@
[project]
name = "hebrew-flash-cards"
version = "0.13"
description = "Hebrew vocabulary & verb conjugation flashcards for Anki"
requires-python = ">=3.11"
dependencies = [
"beautifulsoup4>=4.11.0",
"genanki>=0.8.0",
"lxml>=4.9.0",
"numpy>=1.21.0",
"pandas>=1.3.0",
"pymupdf>=1.23.0",
"pypdf>=3.0.0",
"python-bidi>=0.4.2",
"requests>=2.26.0",
]
[project.optional-dependencies]
dev = [
"bandit",
"pytest",
"ruff",
"vulture",
]
[tool.pytest.ini_options]
testpaths = ["tests"]
[tool.ruff]
target-version = "py311"
line-length = 120
exclude = [
"lib/",
"bin/",
"include/",
"lib64/",
"archive/",
"venv/",
]
[tool.ruff.lint]
select = [
"E", # pycodestyle errors
"W", # pycodestyle warnings
"F", # pyflakes
"I", # isort
"UP", # pyupgrade
"B", # flake8-bugbear
"SIM", # flake8-simplify
"PIE", # flake8-pie
"T20", # flake8-print (flag print statements)
"RET", # flake8-return
"C4", # flake8-comprehensions
"S", # flake8-bandit (security)
]
ignore = [
"T201", # allow print() — this is a CLI tool, not a library
"S603", # subprocess call with shell=False is fine
"S607", # partial executable path is fine for CLI tools
"S105", # PASS = "✓" is not a password
"S108", # /tmp paths are intentional for temp downloads
"S311", # random.Random() is for card ordering, not crypto
"E501", # line too long — handled by formatter
]
[tool.ruff.lint.per-file-ignores]
"test_*.py" = ["S101"] # allow assert in tests
[tool.ruff.format]
quote-style = "double"
indent-style = "space"
[tool.vulture]
paths = ["."]
exclude = ["lib/", "bin/", "include/", "lib64/", "venv/", "archive/"]
min_confidence = 80
[tool.bandit]
exclude_dirs = ["lib", "bin", "include", "lib64", "venv", "archive"]
skips = ["B101"] # allow assert
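
Review note: the `B` (flake8-bugbear) selection above is what surfaces B023, the closure issue mentioned in the commit message. The sketch below is a generic illustration of that class of bug, not the code that was changed in this commit; both function names are hypothetical.

```python
def make_callbacks_late() -> list:
    """Closures defined in a loop that all read the loop variable lazily (what B023 flags)."""
    callbacks = []
    for n in range(3):
        # The lambda captures the variable n, not its value at this iteration.
        callbacks.append(lambda: n)
    return callbacks


def make_callbacks_bound() -> list:
    """Same loop, with the current value bound through a default argument."""
    callbacks = []
    for n in range(3):
        # n=n is evaluated at definition time, so each lambda keeps its own value.
        callbacks.append(lambda n=n: n)
    return callbacks


late = [cb() for cb in make_callbacks_late()]
bound = [cb() for cb in make_callbacks_bound()]
```

Every closure in the first list returns the final loop value, while the second list returns one value per iteration.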

183
rebuild_sentence_matches.py Normal file

@ -0,0 +1,183 @@
#!/usr/bin/env python3
"""
Rebuild vocab_sentence_matches.json using both direct word matching
and ktiv male conjugated/declined form matching.
This dramatically improves sentence coverage by matching not just
dictionary forms but all conjugated verbs and declined nouns.
"""
import json
import logging
import re
from pathlib import Path
import pandas as pd
from helpers import strip_nikkud as _strip_nikkud
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger(__name__)
DATA_DIR = Path(__file__).parent / "data"
def main():
# Load sentences
with open(DATA_DIR / "epub_sentence_index.json") as f:
sentences = json.load(f).get("sentences", [])
logger.info(f"Loaded {len(sentences)} sentences")
# Load vocab CSV
csv_path = DATA_DIR / "hebrew_dict_for_anki.csv"
try:
df = pd.read_csv(csv_path, sep=";", index_col=0)
if df.shape[1] < 3:
raise ValueError
except (ValueError, pd.errors.ParserError):
df = pd.read_csv(csv_path, index_col=0)
logger.info(f"Loaded {len(df)} vocab entries")
# Build word lookup: stripped_form → (word_nikkud, word_no_nikkud)
word_lookup: dict[str, list[tuple[str, str]]] = {}
for _, row in df.iterrows():
word = str(row.get("Word", "")).strip()
wni = str(row.get("Word Without Nikkud", "")).strip()
if not word or word in ("nan", "None"):
continue
stripped = _strip_nikkud(word)
if stripped:
word_lookup.setdefault(stripped, []).append((word, wni))
# Load ktiv male forms: ktiv_male_form → [{word_nikkud, form_type, ...}]
ktiv_path = DATA_DIR / "ktiv_male_forms.json"
ktiv_forms: dict[str, list[dict]] = {}
if ktiv_path.exists():
with open(ktiv_path) as f:
ktiv_forms = json.load(f)
logger.info(f"Loaded {len(ktiv_forms)} ktiv male forms")
else:
logger.warning("No ktiv_male_forms.json — only using direct matching")
# Build reverse lookup: ktiv_male → set of dictionary words (nikkud)
ktiv_to_word: dict[str, set[str]] = {}
for ktiv, entries in ktiv_forms.items():
for entry in entries:
word_nikkud = entry.get("word_nikkud", "")
if word_nikkud:
ktiv_to_word.setdefault(ktiv, set()).add(word_nikkud)
# Also add all vocab words' own stripped forms to ktiv_to_word
for stripped, entries in word_lookup.items():
for word_nikkud, _ in entries:
ktiv_to_word.setdefault(stripped, set()).add(word_nikkud)
logger.info(f"Total matchable forms: {len(ktiv_to_word)}")
# Tokenize all sentences once
sentence_tokens: list[tuple[dict, list[str]]] = []
for s in sentences:
stripped = s.get("stripped", _strip_nikkud(s.get("text", "")))
tokens = [re.sub(r'[.,!?;:"\'\u05be]', "", t) for t in stripped.split()]
tokens = [t for t in tokens if t] # remove empty
sentence_tokens.append((s, tokens))
# Match: for each sentence token, check ktiv_to_word lookup
# Build word_nikkud → [sentence_info]
matches: dict[str, list[dict]] = {} # word_nikkud → [sentences]
for sent, tokens in sentence_tokens:
text = sent.get("text", "")
book = sent.get("book", "")
word_len = len(tokens)
# Skip sentences that are too short or too long
if word_len < 4 or word_len > 15:
continue
for tok in tokens:
if tok in ktiv_to_word:
for word_nikkud in ktiv_to_word[tok]:
matches.setdefault(word_nikkud, []).append(
{
"text": text,
"book": book,
"matched_form": tok,
"word_count": word_len,
}
)
logger.info(f"Words with at least 1 match: {len(matches)}")
# Deduplicate and limit to 3 best sentences per word
# Prefer shorter sentences (6-12 words ideal)
output: dict[str, dict] = {}
for word_nikkud, sents in matches.items():
# Deduplicate by text
seen_texts = set()
unique = []
for s in sents:
if s["text"] not in seen_texts:
seen_texts.add(s["text"])
unique.append(s)
# Score: prefer 6-12 word sentences
def score(s):
wc = s["word_count"]
if 6 <= wc <= 12:
return 0 # ideal
return abs(wc - 9) # distance from ideal
unique.sort(key=score)
best = unique[:3]
# Find the Word Without Nikkud for this word
stripped = _strip_nikkud(word_nikkud)
wni = stripped # default
if stripped in word_lookup:
for wn, w_wni in word_lookup[stripped]:
if wn == word_nikkud:
wni = w_wni
break
output[wni] = {
"word_nikkud": word_nikkud,
"sentences": [{"text": s["text"], "book": s["book"]} for s in best],
}
# Save
out_path = DATA_DIR / "vocab_sentence_matches.json"
with open(out_path, "w") as f:
json.dump(output, f, ensure_ascii=False, indent=1)
total_sents = sum(len(v["sentences"]) for v in output.values())
logger.info(f"Saved {len(output)} words with {total_sents} sentences → {out_path}")
# Stats
total_vocab = len(df)
pct = len(output) * 100 / total_vocab
logger.info(f"Coverage: {len(output)}/{total_vocab} ({pct:.1f}%)")
# Breakdown by match type
direct_only = 0
ktiv_only = 0
both = 0
for _wni, info in output.items():
word = info["word_nikkud"]
stripped = _strip_nikkud(word)
has_direct = stripped in word_lookup
has_ktiv = any(s.get("matched_form", "") != stripped for s in info["sentences"])
if has_direct and has_ktiv:
both += 1
elif has_ktiv:
ktiv_only += 1
else:
direct_only += 1
logger.info(f" Direct matches only: {direct_only}")
logger.info(f" Ktiv male matches only: {ktiv_only}")
logger.info(f" Both: {both}")
if __name__ == "__main__":
main()
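
Review note: the dedupe-and-rank step in `main()` can be read in isolation as the sketch below. The scoring rule is copied from the nested `score` function; the wrapper name `pick_best` and the sample data are hypothetical. One property worth noticing is the jump at the band edge: a 12-word sentence scores 0, while a 13-word sentence scores 4.

```python
def score(word_count: int) -> int:
    """Lower is better: 0 inside the 6-12 word band, else distance from 9."""
    if 6 <= word_count <= 12:
        return 0
    return abs(word_count - 9)


def pick_best(sentences: list[dict], limit: int = 3) -> list[dict]:
    """Drop duplicate texts, then keep the `limit` best-scoring sentences."""
    seen_texts = set()
    unique = []
    for s in sentences:
        if s["text"] not in seen_texts:
            seen_texts.add(s["text"])
            unique.append(s)
    # list.sort is stable, so ties keep their original (corpus) order.
    unique.sort(key=lambda s: score(s["word_count"]))
    return unique[:limit]


candidates = [
    {"text": "a", "word_count": 4},
    {"text": "b", "word_count": 8},
    {"text": "c", "word_count": 15},
    {"text": "b", "word_count": 8},   # duplicate text, dropped
    {"text": "d", "word_count": 13},
    {"text": "e", "word_count": 5},
]
best_texts = [s["text"] for s in pick_best(candidates)]
```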

166
run.py

@ -6,7 +6,7 @@ Usage:
python run.py [options] python run.py [options]
Options: Options:
--only {vocab,conjugations} Run only one deck (skips all unrelated steps) --only {vocab,conjugations,confusables,plurals,complete} Run only one deck
--skip-scrape Use existing data/pealim_dict.csv (no pealim.com dict scraping) --skip-scrape Use existing data/pealim_dict.csv (no pealim.com dict scraping)
--skip-audio Skip audio .mp3 downloads --skip-audio Skip audio .mp3 downloads
--skip-examples Skip Ben Yehuda example fetching --skip-examples Skip Ben Yehuda example fetching
@ -22,9 +22,10 @@ import logging
import re import re
import sys import sys
import time import time
import unicodedata
from pathlib import Path from pathlib import Path
from helpers import strip_nikkud
sys.path.insert(0, str(Path(__file__).parent)) sys.path.insert(0, str(Path(__file__).parent))
logging.basicConfig( logging.basicConfig(
@ -33,23 +34,31 @@ logging.basicConfig(
) )
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
DATA_DIR = Path(__file__).parent / "data" DATA_DIR = Path(__file__).parent / "data"
OUTPUT_DIR = Path(__file__).parent / "output" OUTPUT_DIR = Path(__file__).parent / "output"
AUDIO_DIR = DATA_DIR / "audio" AUDIO_DIR = DATA_DIR / "audio"
AUDIO_CONJ_DIR = DATA_DIR / "audio_conj" AUDIO_CONJ_DIR = DATA_DIR / "audio_conj"
FONTS_DIR = DATA_DIR / "fonts" FONTS_DIR = DATA_DIR / "fonts"
def parse_args(): def parse_args():
p = argparse.ArgumentParser(description="Pealim Anki deck builder") p = argparse.ArgumentParser(description="Pealim Anki deck builder")
p.add_argument("--only", choices=["vocab", "conjugations"], help="Run only one deck (skips all unrelated steps)") p.add_argument(
p.add_argument("--skip-scrape", action="store_true", help="Skip dict scraping; use cached CSV") "--only",
p.add_argument("--skip-audio", action="store_true", help="Skip audio downloads") choices=["vocab", "conjugations", "confusables", "plurals", "complete"],
p.add_argument("--skip-examples", action="store_true", help="Skip Ben Yehuda example lookup") help="Run only one deck (skips all unrelated steps)",
p.add_argument("--skip-conjugations", action="store_true", help="Skip verb conjugation extraction (deprecated: use --only vocab)") )
p.add_argument("--skip-images", action="store_true", help="Skip image fetching") p.add_argument("--skip-scrape", action="store_true", help="Skip dict scraping; use cached CSV")
p.add_argument("--refresh-examples", action="store_true", help="Force rebuild of Ben Yehuda index") p.add_argument("--skip-audio", action="store_true", help="Skip audio downloads")
p.add_argument("--test", type=int, metavar="N", help="Limit to first N words") p.add_argument("--skip-examples", action="store_true", help="Skip Ben Yehuda example lookup")
p.add_argument(
"--skip-conjugations",
action="store_true",
help="Skip verb conjugation extraction (deprecated: use --only vocab)",
)
p.add_argument("--skip-images", action="store_true", help="Skip image fetching")
p.add_argument("--refresh-examples", action="store_true", help="Force rebuild of Ben Yehuda index")
p.add_argument("--test", type=int, metavar="N", help="Limit to first N words")
return p.parse_args() return p.parse_args()
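
Review note: a reduced sketch of how the widened `--only` choices behave, assuming only the standard `argparse` semantics. `build_parser` is a hypothetical name and carries just three of the options from `parse_args`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Reduced copy of the run.py parser, for illustration only."""
    p = argparse.ArgumentParser(description="Pealim Anki deck builder")
    p.add_argument(
        "--only",
        choices=["vocab", "conjugations", "confusables", "plurals", "complete"],
        help="Run only one deck (skips all unrelated steps)",
    )
    p.add_argument("--skip-audio", action="store_true", help="Skip audio downloads")
    p.add_argument("--test", type=int, metavar="N", help="Limit to first N words")
    return p


args = build_parser().parse_args(["--only", "plurals", "--test", "25"])

# A value outside `choices` makes argparse print an error and raise SystemExit.
try:
    build_parser().parse_args(["--only", "bogus"])
    rejected = False
except SystemExit:
    rejected = True
```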
@ -59,8 +68,6 @@ def step_scrape(args):
anki_csv = DATA_DIR / "hebrew_dict_for_anki.csv" anki_csv = DATA_DIR / "hebrew_dict_for_anki.csv"
# Legacy fallback names # Legacy fallback names
legacy_dict = DATA_DIR / "pealim_dict.csv" legacy_dict = DATA_DIR / "pealim_dict.csv"
legacy_anki = DATA_DIR / "pealim_dict_for_anki.csv"
if args.skip_scrape: if args.skip_scrape:
if dict_csv.exists(): if dict_csv.exists():
logger.info(f"[1] Using existing {dict_csv}") logger.info(f"[1] Using existing {dict_csv}")
@ -72,8 +79,8 @@ def step_scrape(args):
return return
logger.info("[1] Scraping dictionary from pealim.com …") logger.info("[1] Scraping dictionary from pealim.com …")
import hebrew_extract import hebrew_extract
import pandas as pd
df = hebrew_extract.extract_from_website() df = hebrew_extract.extract_from_website()
df.to_csv(dict_csv, index=True) df.to_csv(dict_csv, index=True)
@ -88,6 +95,7 @@ def step_frequency() -> dict[str, int]:
"""Step 2 — load/download word frequency data.""" """Step 2 — load/download word frequency data."""
logger.info("[2] Loading word frequency data …") logger.info("[2] Loading word frequency data …")
import frequency_lookup import frequency_lookup
frequency_lookup.load() frequency_lookup.load()
return frequency_lookup._freq return frequency_lookup._freq
@ -104,6 +112,7 @@ def step_examples(args, freq_cache: dict):
logger.info("[3] Loading Ben Yehuda example index …") logger.info("[3] Loading Ben Yehuda example index …")
import benyehuda import benyehuda
benyehuda.load(force_rebuild=args.refresh_examples) benyehuda.load(force_rebuild=args.refresh_examples)
dict_csv = DATA_DIR / "hebrew_dict_for_anki.csv" dict_csv = DATA_DIR / "hebrew_dict_for_anki.csv"
@ -116,6 +125,7 @@ def step_examples(args, freq_cache: dict):
try: try:
import pandas as pd import pandas as pd
try: try:
df = pd.read_csv(dict_csv, sep=";", index_col=0) df = pd.read_csv(dict_csv, sep=";", index_col=0)
if df.shape[1] < 3: if df.shape[1] < 3:
@ -158,6 +168,7 @@ def step_audio(args):
import pandas as pd import pandas as pd
import requests import requests
try: try:
try: try:
df = pd.read_csv(dict_csv, sep=";", index_col=0) df = pd.read_csv(dict_csv, sep=";", index_col=0)
@ -166,7 +177,7 @@ def step_audio(args):
except (ValueError, pd.errors.ParserError): except (ValueError, pd.errors.ParserError):
df = pd.read_csv(dict_csv, index_col=0) df = pd.read_csv(dict_csv, index_col=0)
if 'audio_url' not in df.columns: if "audio_url" not in df.columns:
logger.warning(" No audio_url column in CSV — re-scrape with hebrew_extract.py to capture audio URLs") logger.warning(" No audio_url column in CSV — re-scrape with hebrew_extract.py to capture audio URLs")
return return
@ -178,10 +189,6 @@ def step_audio(args):
skipped = 0 skipped = 0
no_url = 0 no_url = 0
def strip_nik(t: str) -> str:
return "".join(c for c in unicodedata.normalize("NFD", t)
if unicodedata.category(c) != "Mn")
for _, row in df.iterrows(): for _, row in df.iterrows():
word = str(row.get("Word", "")).strip() word = str(row.get("Word", "")).strip()
word_plain = str(row.get("Word Without Nikkud", "")).strip() word_plain = str(row.get("Word Without Nikkud", "")).strip()
@ -190,7 +197,7 @@ def step_audio(args):
if not word: if not word:
continue continue
safe_name = re.sub(r"[^\u05d0-\u05ea]", "", strip_nik(word_plain or word)) safe_name = re.sub(r"[^\u05d0-\u05ea]", "", strip_nikkud(word_plain or word))
if not safe_name: if not safe_name:
continue continue
mp3_path = AUDIO_DIR / f"{safe_name}.mp3" mp3_path = AUDIO_DIR / f"{safe_name}.mp3"
@ -228,11 +235,12 @@ def step_conj_audio(args, conjugations: dict):
AUDIO_CONJ_DIR.mkdir(parents=True, exist_ok=True) AUDIO_CONJ_DIR.mkdir(parents=True, exist_ok=True)
import requests import requests
downloaded = 0 downloaded = 0
skipped = 0 skipped = 0
failed = 0 failed = 0
for infinitive, data in conjugations.items(): for _infinitive, data in conjugations.items():
if not data or not data.get("forms"): if not data or not data.get("forms"):
continue continue
@ -282,17 +290,14 @@ def step_conj_audio(args, conjugations: dict):
logger.debug(f" Conj audio failed {filename}: {e}") logger.debug(f" Conj audio failed {filename}: {e}")
failed += 1 failed += 1
logger.info( logger.info(f" Conjugation audio: {downloaded} downloaded, {skipped} cached, {failed} failed")
f" Conjugation audio: {downloaded} downloaded, "
f"{skipped} cached, {failed} failed"
)
def step_fonts(args): def step_fonts(args):
"""Step 4c — download Heebo font files (one-time, cached).""" """Step 4c — download Heebo font files (one-time, cached)."""
FONTS_DIR.mkdir(parents=True, exist_ok=True) FONTS_DIR.mkdir(parents=True, exist_ok=True)
regular = FONTS_DIR / "_Heebo-Regular.ttf" regular = FONTS_DIR / "_Heebo-Regular.ttf"
bold = FONTS_DIR / "_Heebo-Bold.ttf" bold = FONTS_DIR / "_Heebo-Bold.ttf"
if regular.exists() and bold.exists(): if regular.exists() and bold.exists():
logger.info("[4c] Heebo fonts already cached") logger.info("[4c] Heebo fonts already cached")
@ -302,6 +307,7 @@ def step_fonts(args):
# Fetch CSS to get actual TTF source URLs (static subset for Hebrew + Latin) # Fetch CSS to get actual TTF source URLs (static subset for Hebrew + Latin)
import requests as _req import requests as _req
headers = { headers = {
# Request TTF (not woff2) so Anki can embed them # Request TTF (not woff2) so Anki can embed them
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/120.0" "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/120.0"
@ -355,10 +361,13 @@ def step_images(args) -> dict:
     limit = args.test  # When in test mode, limit images too
     logger.info("[4d] Fetching images for concrete nouns …")
     import image_fetch
+
     return image_fetch.run(limit=limit)


-def step_build_all(args, examples_cache: dict, freq_cache: dict, conjugations: dict | None, image_cache: dict | None = None):
+def step_build_all(
+    args, examples_cache: dict, freq_cache: dict, conjugations: dict | None, image_cache: dict | None = None
+):
     """Step 5 — build all 6 release variants (4 vocab + 2 conj)."""
     logger.info("[5] Building all deck variants …")
     import apkg_builder
@ -394,6 +403,7 @@ def step_conjugations(args):
             logger.info("[6] --skip-conjugations: loading from cache …")
             with open(conj_cache) as f:
                 import json as _json
+
                 return _json.load(f)
         logger.info("[6] --skip-conjugations: no cache found, skipping conj decks")
         return None
@ -407,10 +417,12 @@ def step_conjugations(args):
         logger.info("[6] Using cached conjugations.json …")
         with open(conj_cache) as f:
             import json as _json
+
             conjugations = _json.load(f)
     else:
         logger.info("[6] Extracting verb conjugations …")
         import conjugation_extract
+
         conjugations = conjugation_extract.main(verbs_file)

     # Download conjugation audio
@ -434,6 +446,7 @@ def print_summary(args, examples_cache, freq_cache, conjugations):
     dict_csv = DATA_DIR / "pealim_dict.csv"
     if dict_csv.exists():
         import pandas as pd
+
         try:
             df = pd.read_csv(dict_csv, sep=";", index_col=0)
             if df.shape[1] < 3:
@ -446,7 +459,7 @@ def print_summary(args, examples_cache, freq_cache, conjugations):
     logger.info(f" Example cache entries: {len(examples_cache)}")
     covered = sum(1 for v in examples_cache.values() if v)
     if examples_cache:
-        logger.info(f" Example coverage: {covered}/{len(examples_cache)} ({100*covered//len(examples_cache)}%)")
+        logger.info(f" Example coverage: {covered}/{len(examples_cache)} ({100 * covered // len(examples_cache)}%)")

     if AUDIO_DIR.exists():
         mp3s = list(AUDIO_DIR.glob("*.mp3"))
@ -455,9 +468,9 @@ def print_summary(args, examples_cache, freq_cache, conjugations):
     if AUDIO_CONJ_DIR.exists():
         # Count only files that will be bundled: active non-infinitive forms
         # (excludes {slug}_passive_* and {slug}_infinitive.mp3 on-disk extras)
-        mp3s = [p for p in AUDIO_CONJ_DIR.glob("*.mp3")
-                if not p.stem.endswith("_infinitive")
-                and "_passive_" not in p.stem]
+        mp3s = [
+            p for p in AUDIO_CONJ_DIR.glob("*.mp3") if not p.stem.endswith("_infinitive") and "_passive_" not in p.stem
+        ]
         logger.info(f" Conjugation audio files (bundled): {len(mp3s)}")

     image_cache_path = DATA_DIR / "image_cache.json"
@ -468,9 +481,18 @@ def print_summary(args, examples_cache, freq_cache, conjugations):
         logger.info(f" Images: {found_imgs}/{len(ic)} nouns with images")

     import apkg_builder as _ab
+
     all_apkgs = [
-        _ab.VOCAB_APKG, _ab.VOCAB_APKG_AUDIO, _ab.VOCAB_APKG_IMAGES, _ab.VOCAB_APKG_AUDIO_IMAGES,
-        _ab.CONJ_APKG, _ab.CONJ_APKG_AUDIO,
+        _ab.VOCAB_APKG,
+        _ab.VOCAB_APKG_AUDIO,
+        _ab.VOCAB_APKG_IMAGES,
+        _ab.VOCAB_APKG_AUDIO_IMAGES,
+        _ab.CONJ_APKG,
+        _ab.CONJ_APKG_AUDIO,
+        _ab.CONF_APKG,
+        _ab.CONF_APKG_AUDIO,
+        _ab.COMPLETE_APKG,
+        _ab.COMPLETE_APKG_AUDIO,
     ]
     for apkg in all_apkgs:
         if apkg.exists():
@ -502,24 +524,80 @@ def main():
         conjugations = step_conjugations(args)
         if conjugations:
             import apkg_builder
-            apkg_builder.build_all_variants(
-                DATA_DIR / "hebrew_dict_for_anki.csv",
-                conjugations=conjugations,
-                limit=args.test,
-            )
+
+            dict_csv = DATA_DIR / "hebrew_dict_for_anki.csv"
+            if not dict_csv.exists():
+                dict_csv = DATA_DIR / "hebrew_dict.csv"
+            for audio, path in [(False, apkg_builder.CONJ_APKG), (True, apkg_builder.CONJ_APKG_AUDIO)]:
+                deck, media = apkg_builder.build_conj_deck(
+                    conjugations,
+                    include_audio=audio,
+                    dict_csv=dict_csv,
+                )
+                apkg_builder.write_conj_apkg(deck, media, out_path=path)
         print_summary(args, {}, {}, conjugations or {})
         return
+
+    if args.only == "confusables":
+        step_fonts(args)
+        import apkg_builder
+
+        dict_csv = DATA_DIR / "hebrew_dict_for_anki.csv"
+        for audio, path in [(False, apkg_builder.CONF_APKG), (True, apkg_builder.CONF_APKG_AUDIO)]:
+            deck, media = apkg_builder.build_confusables_deck(dict_csv, include_audio=audio)
+            apkg_builder.write_conf_apkg(deck, media, out_path=path)
+        print_summary(args, {}, {}, {})
+        return
+
+    if args.only == "plurals":
+        step_fonts(args)
+        import apkg_builder
+
+        dict_csv = DATA_DIR / "hebrew_dict_for_anki.csv"
+        if not dict_csv.exists():
+            dict_csv = DATA_DIR / "hebrew_dict.csv"
+        for audio, path in [(False, apkg_builder.PLURAL_APKG), (True, apkg_builder.PLURAL_APKG_AUDIO)]:
+            deck, media = apkg_builder.build_plural_deck(dict_csv=dict_csv, include_audio=audio)
+            apkg_builder.write_plural_apkg(deck, media, out_path=path)
+        print_summary(args, {}, {}, {})
+        return
+
+    if args.only == "complete":
+        step_fonts(args)
+        freq_cache = step_frequency() if not args.skip_scrape else {}
+        examples_cache = step_examples(args, freq_cache) if not args.skip_examples else {}
+        image_cache = step_images(args) if not args.skip_images else {}
+        conjugations = step_conjugations(args)
+        import apkg_builder
+
+        dict_csv = DATA_DIR / "hebrew_dict_for_anki.csv"
+        if not dict_csv.exists():
+            dict_csv = DATA_DIR / "hebrew_dict.csv"
+        emoji_lookup = apkg_builder._load_emoji_lookup()
+        for audio, path in [(False, apkg_builder.COMPLETE_APKG), (True, apkg_builder.COMPLETE_APKG_AUDIO)]:
+            decks, media = apkg_builder.build_complete_deck(
+                dict_csv,
+                conjugations=conjugations or {},
+                examples_cache=examples_cache,
+                freq_cache=freq_cache,
+                image_cache=image_cache,
+                emoji_lookup=emoji_lookup,
+                include_audio=audio,
+            )
+            apkg_builder.write_complete_apkg(decks, media, out_path=path)
+        print_summary(args, examples_cache, freq_cache, conjugations or {})
+        return
     if args.only == "vocab":
         args.skip_conjugations = True

     step_scrape(args)
     freq_cache = step_frequency()
     examples_cache = step_examples(args, freq_cache)
     step_audio(args)
     step_fonts(args)
     image_cache = step_images(args)
     conjugations = step_conjugations(args)
     step_build_all(args, examples_cache, freq_cache, conjugations, image_cache)
     print_summary(args, examples_cache, freq_cache, conjugations or {})


@ -0,0 +1,405 @@
#!/usr/bin/env python3
"""
Extract sentences from PDF books and match vocab words to sentences.
1. Extract sentences from alice.pdf and lion_strawberry.pdf
2. Merge into existing epub_sentence_index.json
3. Match vocab words to sentences, produce vocab_sentence_matches.json
"""
import json
import os
import re
import sys
# Use the venv with pymupdf
sys.path.insert(0, "/home/node/projects/pealim/venv_pdf/lib/python3.11/site-packages")
# Also need the main venv for pandas
sys.path.insert(0, "/home/node/projects/pealim/lib/python3.11/site-packages")
import fitz
import pandas as pd
BASE_DIR = "/home/node/projects/pealim"
DATA_DIR = os.path.join(BASE_DIR, "data")
EPUBS_DIR = os.path.join(DATA_DIR, "epubs")
SENTENCE_INDEX = os.path.join(DATA_DIR, "epub_sentence_index.json")
VOCAB_CSV = os.path.join(DATA_DIR, "hebrew_dict_for_anki.csv")
MATCHES_FILE = os.path.join(DATA_DIR, "vocab_sentence_matches.json")
NIKKUD_RE = re.compile(r"[\u0591-\u05C7]")
HEBREW_RE = re.compile(r"[\u05d0-\u05ea]")
HEBREW_CHAR_RE = re.compile(r"[\u05d0-\u05ea\ufb20-\ufb4f]")
def strip_nikkud(text):
    """Remove all Hebrew nikkud/cantillation marks."""
    return NIKKUD_RE.sub("", text)


def collapse_hebrew_spaces(text):
    """Collapse spaces between Hebrew letter fragments (for badly-encoded PDFs).

    Strategy: strip nikkud first, then iteratively remove spaces between
    Hebrew characters. Real word boundaries are detected by:
    - Final-form letters (ם ן ף ך ץ) followed by space
    - Punctuation (.,;:!?"')
    - Non-Hebrew characters
    """
    stripped = strip_nikkud(text)
    # Normalize presentation forms to standard Hebrew
    # FB20-FB4F contains presentation forms
    for code in range(0xFB2A, 0xFB50):
        ch = chr(code)
        if ch in stripped:
            # Map shin/sin dots, dagesh forms back to base
            # FB2A = שׁ (shin+dot), FB2B = שׂ (sin+dot)
            base_map = {
                "\ufb2a": "ש",
                "\ufb2b": "ש",
                "\ufb35": "ו",
                "\ufb4b": "ו",
                "\ufb30": "א",
                "\ufb31": "ב",
                "\ufb32": "ג",
                "\ufb33": "ד",
                "\ufb34": "ה",
                "\ufb36": "ז",
                "\ufb38": "ט",
                "\ufb39": "י",
                "\ufb3a": "כ",
                "\ufb3b": "כ",
                "\ufb3c": "ל",
                "\ufb3e": "מ",
                "\ufb40": "נ",
                "\ufb41": "ס",
                "\ufb43": "פ",
                "\ufb44": "פ",
                "\ufb46": "צ",
                "\ufb47": "ק",
                "\ufb48": "ר",
                "\ufb49": "ש",
                "\ufb4a": "ת",
            }
            if ch in base_map:
                stripped = stripped.replace(ch, base_map[ch])
    # Replace multiple spaces with single
    stripped = re.sub(r" {2,}", " ", stripped)
    # Now rebuild text, keeping spaces only at word boundaries
    # Word boundary markers: final-form letters, punctuation, non-Hebrew
    final_forms = set("םןףךץ")
    result = []
    i = 0
    chars = list(stripped)
    while i < len(chars):
        if chars[i] != " ":
            result.append(chars[i])
            i += 1
            continue
        # It's a space. Decide if it's a word boundary.
        # Look back for the last non-space character
        prev_ch = None
        for j in range(len(result) - 1, -1, -1):
            if result[j] != " ":
                prev_ch = result[j]
                break
        # Look forward for next non-space character
        next_ch = None
        for j in range(i + 1, len(chars)):
            if chars[j] != " ":
                next_ch = chars[j]
                break
        is_boundary = False
        # After final-form letter = word boundary
        if prev_ch and prev_ch in final_forms:
            is_boundary = True
        # Before/after punctuation or non-Hebrew = word boundary
        if prev_ch and not HEBREW_RE.match(prev_ch):
            is_boundary = True
        if next_ch and not HEBREW_RE.match(next_ch):
            is_boundary = True
        # If either side is not Hebrew at all, boundary
        if prev_ch is None or next_ch is None:
            is_boundary = True
        if is_boundary:
            result.append(" ")
        # else: skip the space (collapse intra-word gap)
        i += 1
    return "".join(result).strip()
def extract_pdf_sentences(pdf_path, book_name):
    """Extract sentences from a PDF file."""
    doc = fitz.open(pdf_path)
    sentences = []
    for page_num in range(len(doc)):
        page = doc[page_num]
        text = page.get_text()
        if not text.strip():
            continue
        # Split into lines first, then split on sentence-ending punctuation
        lines = text.split("\n")
        raw_sentences = []
        for line in lines:
            line = line.strip()
            if not line:
                continue
            # Split on sentence-ending punctuation followed by space or at end
            parts = re.split(r"(?<=[.?!])\s+", line)
            raw_sentences.extend(parts)
        for sent in raw_sentences:
            sent = sent.strip()
            if not sent:
                continue
            # Must contain Hebrew characters
            if not HEBREW_RE.search(sent):
                continue
            # Create stripped version (no nikkud, collapsed spaces for PDF)
            stripped = collapse_hebrew_spaces(sent)
            # Count Hebrew words in stripped version
            words = [w for w in stripped.split() if HEBREW_RE.search(w)]
            word_count = len(words)
            # Filter: 4-15 Hebrew words
            if word_count < 4 or word_count > 15:
                continue
            # Drop metadata-like lines
            # Page numbers (just digits)
            if re.match(r"^\d+$", sent.strip()):
                continue
            # Copyright text
            if any(kw in sent.lower() for kw in ["copyright", "©", "isbn", "printed in"]):
                continue
            sentences.append(
                {
                    "text": sent,
                    "book": book_name,
                    "stripped": stripped,
                }
            )
    doc.close()
    return sentences


def has_extractable_text(pdf_path):
    """Check if a PDF has extractable text."""
    doc = fitz.open(pdf_path)
    text_found = False
    for i in range(min(len(doc), 10)):
        if doc[i].get_text().strip():
            text_found = True
            break
    doc.close()
    return text_found


def load_sentence_index():
    """Load existing sentence index."""
    if os.path.exists(SENTENCE_INDEX):
        with open(SENTENCE_INDEX, encoding="utf-8") as f:
            return json.load(f)
    return {"sentences": []}


def save_sentence_index(data):
    """Save sentence index."""
    with open(SENTENCE_INDEX, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
def match_vocab_to_sentences(sentences, vocab_df):
    """Match vocab words to sentences."""
    matches = {}
    # Build lookup: word_no_nikkud -> word_nikkud
    vocab_words = []
    for _, row in vocab_df.iterrows():
        word_no_nik = str(row.get("Word Without Nikkud", "")).strip()
        word_nik = str(row.get("Word", "")).strip()
        if word_no_nik and word_nik:
            vocab_words.append((word_no_nik, word_nik))
    print(f"Matching {len(vocab_words)} vocab words against {len(sentences)} sentences...")
    # Precompute: for each sentence, get the stripped text
    sent_data = []
    for s in sentences:
        stripped = s.get("stripped", "")
        # For PDF sentences, stripped already has collapsed spaces but words may be joined
        # For EPUB sentences, stripped has proper word spacing
        sent_data.append(
            {
                "text": s["text"],
                "book": s["book"],
                "stripped": stripped,
                "word_count": len(stripped.split()),
            }
        )
    matched_count = 0
    for word_no_nik, word_nik in vocab_words:
        if len(word_no_nik) < 2:
            continue
        # Build regex for word boundary matching
        # Use both approaches: proper word boundary and substring for PDF text
        pattern = re.compile(r"(?:^|\s)" + re.escape(word_no_nik) + r"(?:\s|$)")
        # For PDF texts with collapsed spaces, also try substring match
        # but only for words >= 3 chars to avoid false positives
        use_substring = len(word_no_nik) >= 3
        word_matches = []
        for sd in sent_data:
            stripped = sd["stripped"]
            # Try word-boundary match first
            if pattern.search(stripped):
                word_matches.append(sd)
            elif use_substring and word_no_nik in stripped:
                # Substring match for PDF texts with collapsed spaces
                # Verify it's not part of a longer word by checking the character
                # before and after in the collapsed text
                idx = stripped.find(word_no_nik)
                before_ok = idx == 0 or not HEBREW_RE.match(stripped[idx - 1])
                after_idx = idx + len(word_no_nik)
                after_ok = after_idx >= len(stripped) or not HEBREW_RE.match(stripped[after_idx])
                # Only count if at least one boundary is clear
                # (for PDF collapsed text, boundaries are often missing)
                # For PDF books, we accept substring matches
                if sd["book"] in ("אליס בארץ הפלאות", "האריה שאהב תות") or before_ok or after_ok:
                    word_matches.append(sd)
        if word_matches:
            matched_count += 1

            # Sort by preference: 6-12 words ideal, then shorter is better
            def score(sd):
                wc = sd["word_count"]
                if 6 <= wc <= 12:
                    return (0, wc)  # ideal range, prefer shorter
                if wc < 6:
                    return (1, -wc)  # too short
                return (2, wc)  # too long

            word_matches.sort(key=score)
            best = word_matches[:3]
            matches[word_no_nik] = {
                "word_nikkud": word_nik,
                "sentences": [{"text": m["text"], "book": m["book"]} for m in best],
            }
    print(
        f"Words with at least 1 match: {matched_count}/{len(vocab_words)} ({100 * matched_count / len(vocab_words):.1f}%)"
    )
    return matches
def main():
    # ── Step 1: Extract from PDFs ──
    pdfs = [
        ("alice.pdf", "אליס בארץ הפלאות"),
        ("lion_strawberry.pdf", "האריה שאהב תות"),
    ]
    all_new_sentences = []
    for filename, book_name in pdfs:
        pdf_path = os.path.join(EPUBS_DIR, filename)
        if not os.path.exists(pdf_path):
            print(f"SKIP: {filename} not found")
            continue
        if not has_extractable_text(pdf_path):
            print(f"SKIP: {filename} has no extractable text (likely scanned images)")
            continue
        print(f"Extracting from {filename} ({book_name})...")
        sentences = extract_pdf_sentences(pdf_path, book_name)
        print(f" Extracted {len(sentences)} sentences")
        all_new_sentences.extend(sentences)

    # ── Step 2: Merge with existing index ──
    index = load_sentence_index()
    existing_count = len(index["sentences"])
    # Deduplicate by (stripped, book)
    existing_keys = set()
    for s in index["sentences"]:
        key = (s.get("stripped", ""), s.get("book", ""))
        existing_keys.add(key)
    added = 0
    for s in all_new_sentences:
        key = (s["stripped"], s["book"])
        if key not in existing_keys:
            index["sentences"].append(s)
            existing_keys.add(key)
            added += 1
    save_sentence_index(index)
    total = len(index["sentences"])
    print(f"\nSentence index: {existing_count} existing + {added} new = {total} total")

    # ── Per-book stats ──
    book_counts = {}
    for s in index["sentences"]:
        book = s.get("book", "unknown")
        book_counts[book] = book_counts.get(book, 0) + 1
    print("\nSentences per book:")
    for book, count in sorted(book_counts.items(), key=lambda x: -x[1]):
        print(f" {book}: {count}")

    # ── Step 3: Match vocab words to sentences ──
    print(f"\nLoading vocab from {VOCAB_CSV}...")
    vocab_df = pd.read_csv(VOCAB_CSV, sep=";", index_col=0)
    print(f" {len(vocab_df)} vocab words loaded")
    matches = match_vocab_to_sentences(index["sentences"], vocab_df)
    with open(MATCHES_FILE, "w", encoding="utf-8") as f:
        json.dump(matches, f, ensure_ascii=False, indent=2)
    print(f"\nWrote {len(matches)} word matches to {MATCHES_FILE}")

    # ── Step 4: Summary stats ──
    total_words = len(vocab_df)
    matched_words = len(matches)
    print(f"\n{'=' * 50}")
    print("SUMMARY")
    print(f"{'=' * 50}")
    print(f"Total sentences: {total}")
    for book, count in sorted(book_counts.items(), key=lambda x: -x[1]):
        print(f" {book}: {count}")
    print(f"Total vocab words: {total_words}")
    print(f"Words with sentences: {matched_words} ({100 * matched_words / total_words:.1f}%)")
    print(f"Words without sentences: {total_words - matched_words}")


if __name__ == "__main__":
    main()
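The candidate ranking inside match_vocab_to_sentences is easier to follow in isolation. The sketch below reproduces only the sort key from the nested score() function and applies it to bare word counts. The rank_word_counts wrapper and the sample counts are illustrative and not part of the commit.

```python
def score(word_count: int) -> tuple[int, int]:
    """Sort key mirroring the nested score() in match_vocab_to_sentences."""
    if 6 <= word_count <= 12:
        return (0, word_count)  # ideal range: shorter sentences first
    if word_count < 6:
        return (1, -word_count)  # too short: longer sentences first
    return (2, word_count)  # too long: shorter sentences first


def rank_word_counts(counts: list[int]) -> list[int]:
    """Hypothetical helper: order sentence lengths the way the matcher would."""
    return sorted(counts, key=score)


ranked = rank_word_counts([15, 4, 9, 6, 13, 5, 12])
print(ranked)  # [6, 9, 12, 5, 4, 13, 15]
```

Because the key is negated in the too-short bucket, a 5-word sentence outranks a 4-word one there, while the other two buckets prefer the shorter sentence.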


@ -21,9 +21,10 @@ from pathlib import Path
 logger = logging.getLogger(__name__)

-PDF_URL = "https://books.nevo.engineer/opds/download/117/pdf/"
+PROJECT_ROOT = Path(__file__).resolve().parent.parent
+PDF_URL = ""  # Set to URL or local path of Coffin & Bolozky PDF
 PDF_PATH = Path("/tmp/coffin_bolozky.pdf")
-OUTPUT_PATH = Path(__file__).parent / "verbs_input.txt"
+OUTPUT_PATH = PROJECT_ROOT / "verbs_input.txt"

 # Pages to scan (Appendix 1)
 PAGE_START = 390
@ -31,24 +32,38 @@ PAGE_END = 411
 # Binyan headings in Hebrew (vowelled and unvowelled variants)
 BINYAN_HEADINGS_HEB = [
-    "פָּעַל", "פעל",
-    "נִפְעַל", "נפעל",
-    "פִּעֵל", "פיעל",
-    "פֻּעַל", "פועל",
-    "הִתְפַּעֵל", "התפעל",
-    "הִפְעִיל", "הפעיל",
-    "הֻפְעַל", "הופעל",
+    "פָּעַל",
+    "פעל",
+    "נִפְעַל",
+    "נפעל",
+    "פִּעֵל",
+    "פיעל",
+    "פֻּעַל",
+    "פועל",
+    "הִתְפַּעֵל",
+    "התפעל",
+    "הִפְעִיל",
+    "הפעיל",
+    "הֻפְעַל",
+    "הופעל",
 ]

 # Binyan heading → canonical name
 BINYAN_CANONICAL = {
-    "פָּעַל": "Pa'al", "פעל": "Pa'al",
-    "נִפְעַל": "Nif'al", "נפעל": "Nif'al",
-    "פִּעֵל": "Pi'el", "פיעל": "Pi'el",
-    "פֻּעַל": "Pu'al", "פועל": "Pu'al",
-    "הִתְפַּעֵל": "Hitpa'el", "התפעל": "Hitpa'el",
-    "הִפְעִיל": "Hif'il", "הפעיל": "Hif'il",
-    "הֻפְעַל": "Huf'al", "הופעל": "Huf'al",
+    "פָּעַל": "Pa'al",
+    "פעל": "Pa'al",
+    "נִפְעַל": "Nif'al",
+    "נפעל": "Nif'al",
+    "פִּעֵל": "Pi'el",
+    "פיעל": "Pi'el",
+    "פֻּעַל": "Pu'al",
+    "פועל": "Pu'al",
+    "הִתְפַּעֵל": "Hitpa'el",
+    "התפעל": "Hitpa'el",
+    "הִפְעִיל": "Hif'il",
+    "הפעיל": "Hif'il",
+    "הֻפְעַל": "Huf'al",
+    "הופעל": "Huf'al",
 }

 # Passive binyan names — no infinitive, use 3ms past
@ -156,15 +171,16 @@ FALLBACK_VERBS = """# Verb list from Coffin & Bolozky, A Reference Grammar of Mo
 def _install_deps():
     """Install pymupdf and python-bidi if not available."""
     try:
-        import fitz  # noqa: F401
         import bidi  # noqa: F401
+        import fitz  # noqa: F401
+
         return True
     except ImportError:
         logger.info("Installing pymupdf and python-bidi …")
         import subprocess
+
         result = subprocess.run(
-            [sys.executable, "-m", "pip", "install",
-             "pymupdf", "python-bidi", "--break-system-packages", "-q"],
+            [sys.executable, "-m", "pip", "install", "pymupdf", "python-bidi", "--break-system-packages", "-q"],
             capture_output=True,
         )
         if result.returncode != 0:
@ -182,6 +198,7 @@ def _download_pdf() -> bool:
     logger.info(f"Downloading PDF from {PDF_URL}")
     try:
         import requests
+
         resp = requests.get(PDF_URL, timeout=120, stream=True)
         resp.raise_for_status()
         PDF_PATH.write_bytes(resp.content)
@ -211,10 +228,7 @@ def _needs_bidi_fix(text: str) -> bool:
 def _strip_nikkud(text: str) -> str:
-    return "".join(
-        ch for ch in unicodedata.normalize("NFD", text)
-        if unicodedata.category(ch) != "Mn"
-    )
+    return "".join(ch for ch in unicodedata.normalize("NFD", text) if unicodedata.category(ch) != "Mn")


 def _extract_from_pdf() -> list[tuple[str, str, str]]:
@ -244,10 +258,9 @@ def _extract_from_pdf() -> list[tuple[str, str, str]]:
     # Check if we need bidi correction
     test_text = ""
     try:
-        for page_num in range(min(PAGE_START, doc.page_count - 1),
-                              min(PAGE_START + 3, doc.page_count)):
+        for page_num in range(min(PAGE_START, doc.page_count - 1), min(PAGE_START + 3, doc.page_count)):
             test_text += doc[page_num].get_text("text")
-    except Exception:
+    except Exception:  # noqa: S110
         pass

     use_bidi = _needs_bidi_fix(test_text)
@ -259,6 +272,7 @@ def _extract_from_pdf() -> list[tuple[str, str, str]]:
             return t
         try:
             from bidi.algorithm import get_display
+
             lines = t.split("\n")
             fixed = []
             for line in lines:
@ -274,7 +288,7 @@ def _extract_from_pdf() -> list[tuple[str, str, str]]:
     for page_num in range(PAGE_START - 1, page_end):  # fitz is 0-indexed
         try:
             raw = doc[page_num].get_text("text")
-        except Exception:
+        except Exception:  # noqa: S112
             continue

         text = fix_text(raw)
@ -316,9 +330,12 @@ def _extract_from_pdf() -> list[tuple[str, str, str]]:
             heb_words = re.findall(r"[\u05d0-\u05ea\u05b0-\u05c7]{3,}", line)
             for w in heb_words:
                 stripped_w = _strip_nikkud(w)
-                if current_binyan == "Pu'al" and stripped_w.startswith("פ"):
-                    entries.append((current_binyan, "3ms", w))
-                elif current_binyan == "Huf'al" and stripped_w.startswith("ה"):
+                if (
+                    current_binyan == "Pu'al"
+                    and stripped_w.startswith("פ")
+                    or current_binyan == "Huf'al"
+                    and stripped_w.startswith("ה")
+                ):
                     entries.append((current_binyan, "3ms", w))

     doc.close()
@ -357,16 +374,20 @@ def _write_output(entries: list[tuple[str, str, str]]) -> None:
         lines.append(form)

     OUTPUT_PATH.write_text("\n".join(lines) + "\n", encoding="utf-8")
-    verb_count = sum(1 for l in lines if l and not l.startswith("#"))
-    passive_count = sum(1 for l in lines if l.startswith("# 3ms:"))
+    verb_count = sum(1 for ln in lines if ln and not ln.startswith("#"))
+    passive_count = sum(1 for ln in lines if ln.startswith("# 3ms:"))
     logger.info(f"Written {verb_count} active verbs + {passive_count} passive (3ms) → {OUTPUT_PATH}")


 def _binyan_heb(name: str) -> str:
     mapping = {
-        "Pa'al": "פָּעַל", "Nif'al": "נִפְעַל", "Pi'el": "פִּעֵל",
-        "Pu'al": "פֻּעַל", "Hitpa'el": "הִתְפַּעֵל",
-        "Hif'il": "הִפְעִיל", "Huf'al": "הֻפְעַל",
+        "Pa'al": "פָּעַל",
+        "Nif'al": "נִפְעַל",
+        "Pi'el": "פִּעֵל",
+        "Pu'al": "פֻּעַל",
+        "Hitpa'el": "הִתְפַּעֵל",
+        "Hif'il": "הִפְעִיל",
+        "Huf'al": "הֻפְעַל",
     }
     return mapping.get(name, name)
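_strip_nikkud in this file removes vowel points by decomposing the text to NFD and dropping every character in Unicode category Mn. A minimal standalone sketch of that technique follows. It uses escaped code points so the sample does not depend on how the page renders Hebrew, and the variable names are illustrative rather than taken from the commit.

```python
import unicodedata


def strip_nikkud(text: str) -> str:
    """Drop nonspacing marks (category Mn) after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# pe + qamats + dagesh, ayin + patah, lamed: the vowelled Pa'al heading
vowelled = "\u05e4\u05b8\u05bc\u05e2\u05b7\u05dc"
plain = strip_nikkud(vowelled)

# U+FB44 (pe with dagesh, a presentation form) decomposes to pe + dagesh
# under NFD, so the same function also folds it back to a bare pe.
from_presentation_form = strip_nikkud("\ufb44")

print(plain == "\u05e4\u05e2\u05dc")  # True
print(from_presentation_form == "\u05e4")  # True
```

One consequence of this approach, compared with a fixed character-range regex, is that punctuation such as maqaf (U+05BE, category Pd) is left in place, because only combining marks are removed.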

scripts/scrape_ktiv_male.py Normal file

@ -0,0 +1,237 @@
#!/usr/bin/env python3
"""
Scrape ktiv male (plene/vowelless) forms from pealim.com.
Uses hebstyle=vl cookie to get vowelless writing with matres lectionis.
Builds a lookup: ktiv_male_form -> [{word_nikkud, form_type, pos, slug}]
This enables matching Hebrew text (which is normally in ktiv male)
against our vocabulary, including conjugated verbs and noun plurals.
"""
import json
import logging
import sys
import time
from pathlib import Path
import requests
from bs4 import BeautifulSoup
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger(__name__)
DATA_DIR = Path(__file__).resolve().parent.parent / "data"
OUTPUT_PATH = DATA_DIR / "ktiv_male_forms.json"
COOKIES = {"translit": "none", "hebstyle": "vl"}
REQUEST_TIMEOUT = 15
DELAY = 1.5 # seconds between requests
def fetch_verb_ktiv_male(slug: str, infinitive_nikkud: str) -> list[dict]:
    """Fetch all conjugated forms in ktiv male for a verb."""
    url = f"https://www.pealim.com/dict/{slug}/"
    resp = requests.get(url, cookies=COOKIES, timeout=REQUEST_TIMEOUT)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    forms = []
    table = soup.find("table", class_="conjugation-table")
    if not table:
        return forms
    # Also get the infinitive from the page
    lead = soup.find("div", class_="lead")
    if lead:
        inf_spans = lead.find_all("span", class_="menukad")
        for s in inf_spans:
            ktiv = s.text.strip()
            if ktiv:
                forms.append(
                    {
                        "ktiv_male": ktiv,
                        "word_nikkud": infinitive_nikkud,
                        "form_type": "infinitive",
                        "pos": "Verb",
                        "slug": slug,
                    }
                )
    rows = table.find_all("tr")
    for row in rows:
        menukad_spans = row.find_all("span", class_="menukad")
        for span in menukad_spans:
            ktiv = span.text.strip()
            if ktiv and ktiv not in {f["ktiv_male"] for f in forms}:
                forms.append(
                    {
                        "ktiv_male": ktiv,
                        "word_nikkud": infinitive_nikkud,
                        "form_type": "conjugation",
                        "pos": "Verb",
                        "slug": slug,
                    }
                )
    return forms


def fetch_noun_ktiv_male(slug: str, singular_nikkud: str, gender: str) -> list[dict]:
    """Fetch noun declension forms in ktiv male."""
    url = f"https://www.pealim.com/dict/{slug}/"
    resp = requests.get(url, cookies=COOKIES, timeout=REQUEST_TIMEOUT)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    forms = []
    table = soup.find("table", class_="conjugation-table")
    if not table:
        return forms
    rows = table.find_all("tr")
    form_labels = ["absolute_singular", "absolute_plural", "construct_singular", "construct_plural"]
    label_idx = 0
    for row in rows:
        menukad_spans = row.find_all("span", class_="menukad")
        for span in menukad_spans:
            ktiv = span.text.strip()
            if ktiv:
                ft = form_labels[label_idx] if label_idx < len(form_labels) else "other"
                forms.append(
                    {
                        "ktiv_male": ktiv,
                        "word_nikkud": singular_nikkud,
                        "form_type": ft,
                        "pos": "Noun",
                        "slug": slug,
                        "gender": gender,
                    }
                )
                label_idx += 1
    return forms
def scrape_verbs() -> list[dict]:
    """Scrape ktiv male forms for all verbs in conjugations.json."""
    conj_path = DATA_DIR / "conjugations.json"
    if not conj_path.exists():
        logger.warning("No conjugations.json found")
        return []
    with open(conj_path) as f:
        conjugations = json.load(f)
    all_forms = []
    slugs_done = set()
    for verb, data in conjugations.items():
        if not data or not data.get("slug"):
            continue
        slug = data["slug"]
        if slug in slugs_done:
            continue
        slugs_done.add(slug)
        try:
            forms = fetch_verb_ktiv_male(slug, verb)
            all_forms.extend(forms)
            logger.info(f" Verb {verb} ({slug}): {len(forms)} forms")
        except Exception as e:
            logger.warning(f" Verb {verb} ({slug}) failed: {e}")
        time.sleep(DELAY)
    return all_forms


def scrape_nouns() -> list[dict]:
    """Scrape ktiv male forms for all nouns in noun_slug_map.json."""
    slug_path = DATA_DIR / "noun_slug_map.json"
    if not slug_path.exists():
        logger.warning("No noun_slug_map.json found")
        return []
    with open(slug_path) as f:
        slug_map = json.load(f)
    # Also load existing plurals to get nikkud singular form
    plurals_path = DATA_DIR / "noun_plurals.json"
    plurals = {}
    if plurals_path.exists():
        with open(plurals_path) as f:
            plurals = json.load(f)
    all_forms = []
    done = 0
    total = len(slug_map)
    for word, info in slug_map.items():
        slug = info.get("slug", "")
        if not slug:
            continue
        # Get nikkud form from plurals data or slug map
        nikkud = info.get("word_nikkud", word)
        if word in plurals:
            nikkud = plurals[word].get("singular", nikkud)
        gender = info.get("gender", "")
        try:
            forms = fetch_noun_ktiv_male(slug, nikkud, gender)
            all_forms.extend(forms)
            done += 1
            if done % 50 == 0:
                logger.info(f" Nouns: {done}/{total} ({len(all_forms)} forms)")
                # Save incrementally
                _save_forms(all_forms, partial=True)
        except Exception as e:
            logger.warning(f" Noun {word} ({slug}) failed: {e}")
            done += 1
        time.sleep(DELAY)
    return all_forms
def _save_forms(all_forms: list[dict], partial: bool = False):
    """Build and save the ktiv male lookup dict."""
    lookup: dict[str, list[dict]] = {}
    for entry in all_forms:
        ktiv = entry["ktiv_male"]
        # Don't include ktiv_male in the stored entry (it's the key)
        stored = {k: v for k, v in entry.items() if k != "ktiv_male"}
        lookup.setdefault(ktiv, []).append(stored)
    suffix = ".partial" if partial else ""
    out = OUTPUT_PATH.parent / (OUTPUT_PATH.name + suffix)
    with open(out, "w") as f:
        json.dump(lookup, f, ensure_ascii=False, indent=1)
    logger.info(f" Saved {len(lookup)} unique ktiv male forms → {out}")


def main():
    mode = sys.argv[1] if len(sys.argv) > 1 else "all"
    all_forms = []
    if mode in ("all", "verbs"):
        logger.info("=== Scraping verb ktiv male forms ===")
        verb_forms = scrape_verbs()
        all_forms.extend(verb_forms)
        logger.info(f"Verbs done: {len(verb_forms)} forms from {len({f['slug'] for f in verb_forms})} verbs")
    if mode in ("all", "nouns"):
        logger.info("=== Scraping noun ktiv male forms ===")
        noun_forms = scrape_nouns()
        all_forms.extend(noun_forms)
        logger.info(f"Nouns done: {len(noun_forms)} forms")
    _save_forms(all_forms)
    logger.info(f"Total: {len(all_forms)} forms → {OUTPUT_PATH}")


if __name__ == "__main__":
    main()
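The grouping step in _save_forms turns a flat list of scraped entries into a lookup keyed by the ktiv male spelling, so a single unvowelled spelling can point at several vocabulary items. The sketch below isolates that step. build_lookup is a hypothetical name, and the slugs and field values in the sample entries are made up for illustration.

```python
def build_lookup(all_forms: list[dict]) -> dict[str, list[dict]]:
    """Group entries by ktiv male spelling, dropping the key from each stored entry."""
    lookup: dict[str, list[dict]] = {}
    for entry in all_forms:
        stored = {k: v for k, v in entry.items() if k != "ktiv_male"}
        lookup.setdefault(entry["ktiv_male"], []).append(stored)
    return lookup


SFR = "\u05e1\u05e4\u05e8"  # samekh-pe-resh, shared by a verb form and a noun
SFRIM = SFR + "\u05d9\u05dd"  # the same noun's plural spelling

# Hypothetical entries: slugs and values are placeholders, not scraped data.
sample_forms = [
    {"ktiv_male": SFR, "word_nikkud": "verb-lemma", "form_type": "conjugation", "pos": "Verb", "slug": "example-verb"},
    {"ktiv_male": SFR, "word_nikkud": "noun-lemma", "form_type": "absolute_singular", "pos": "Noun", "slug": "example-noun"},
    {"ktiv_male": SFRIM, "word_nikkud": "noun-lemma", "form_type": "absolute_plural", "pos": "Noun", "slug": "example-noun"},
]

lookup = build_lookup(sample_forms)
print(len(lookup), len(lookup[SFR]))  # 2 2
```

Keeping a list per key, rather than overwriting, is what lets a sentence matcher see every candidate reading of an ambiguous spelling and decide between them later.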


@ -0,0 +1,365 @@
#!/usr/bin/env python3
"""
Scrape pealim.com for noun plural and construct forms.
Step 1: Collect noun slugs from list pages (/dict/?pos=noun&page=N)
Step 2: Fetch detail pages for plural + construct forms
Step 3: Print summary statistics
"""
import json
import re
import time
from pathlib import Path
import requests
from bs4 import BeautifulSoup
BASE_URL = "https://www.pealim.com"
COOKIES = {"translit": "none", "hebstyle": "mo"}
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; PealimScraper/1.0)"}
DATA_DIR = Path(__file__).resolve().parent.parent / "data"
SLUG_MAP_FILE = DATA_DIR / "noun_slug_map.json"
PROGRESS_FILE = DATA_DIR / "noun_slug_map_progress.json"
PLURALS_FILE = DATA_DIR / "noun_plurals.json"
DELAY = 1.5 # seconds between requests
def load_json(path, default=None):
    if path.exists():
        with open(path) as f:
            return json.load(f)
    return default if default is not None else {}


def save_json(path, data):
    with open(path, "w") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


def fetch_with_retry(url, max_retries=5):
    """Fetch URL with exponential backoff."""
    for attempt in range(max_retries):
        try:
            r = requests.get(url, cookies=COOKIES, headers=HEADERS, timeout=30)
            r.raise_for_status()
            return r
        except (requests.RequestException, ConnectionError) as e:
            wait = min(2**attempt * 2, 60)
            print(f" Retry {attempt + 1}/{max_retries} for {url}: {e} (waiting {wait}s)")
            time.sleep(wait)
    print(f" FAILED after {max_retries} retries: {url}")
    return None


def get_total_pages():
    """Get total number of noun list pages."""
    r = fetch_with_retry(f"{BASE_URL}/dict/?pos=noun&page=1")
    if not r:
        return 0
    soup = BeautifulSoup(r.text, "lxml")
    pages = set()
    for a in soup.select("ul.pagination li a"):
        href = a.get("href", "")
        m = re.search(r"page=(\d+)", href)
        if m:
            pages.add(int(m.group(1)))
    return max(pages) if pages else 1
def parse_list_page(html):
"""Parse a noun list page and return list of noun entries."""
soup = BeautifulSoup(html, "lxml")
table = soup.select_one("table.dict-table")
if not table:
return []
entries = []
for row in table.select("tr")[1:]: # skip header
tds = row.select("td")
if len(tds) < 3:
continue
# First td: word + link
first_td = tds[0]
a = first_td.select_one("a")
if not a:
continue
href = a.get("href", "")
slug_match = re.search(r"/dict/([^/]+)/", href)
if not slug_match:
continue
slug = slug_match.group(1)
menukad = first_td.select_one("span.menukad")
word_nikkud = menukad.get_text(strip=True) if menukad else ""
# Word without nikkud (strip combining marks)
word_plain = re.sub(r"[\u0591-\u05C7]", "", word_nikkud)
# Third td: part of speech
pos_text = tds[2].get_text(strip=True)
# Gender
gender = ""
if "masculine" in pos_text.lower():
gender = "masculine"
elif "feminine" in pos_text.lower():
gender = "feminine"
# Mishkal pattern
mishkal = ""
m = re.search(r"(\w+)\s*pattern", pos_text.lower())
if m:
mishkal = m.group(1)
entries.append(
{
"word_plain": word_plain,
"slug": slug,
"word_nikkud": word_nikkud,
"pos": pos_text,
"gender": gender,
"mishkal": mishkal,
}
)
return entries
def step1_collect_slugs():
"""Step 1: Collect noun slugs from list pages."""
print("=" * 60)
print("STEP 1: Collecting noun slugs from list pages")
print("=" * 60)
slug_map = load_json(SLUG_MAP_FILE, {})
progress = load_json(PROGRESS_FILE, [])
completed_pages = set(progress) if isinstance(progress, list) else set()
# Get total pages
total_pages = get_total_pages()
print(f"Total pages: {total_pages}")
print(f"Already completed: {len(completed_pages)} pages, {len(slug_map)} nouns")
remaining = [p for p in range(1, total_pages + 1) if p not in completed_pages]
print(f"Remaining pages: {len(remaining)}")
if not remaining:
print("All pages already scraped!")
return slug_map
for i, page_num in enumerate(remaining):
url = f"{BASE_URL}/dict/?pos=noun&page={page_num}"
r = fetch_with_retry(url)
if not r:
print(f" Skipping page {page_num}")
continue
entries = parse_list_page(r.text)
for entry in entries:
word = entry["word_plain"]
slug_map[word] = {
"slug": entry["slug"],
"word_nikkud": entry["word_nikkud"],
"pos": entry["pos"],
"gender": entry["gender"],
"mishkal": entry["mishkal"],
}
completed_pages.add(page_num)
done = len(completed_pages)
print(f" Page {page_num} ({done}/{total_pages}): {len(entries)} nouns (total: {len(slug_map)})")
# Save progress every 10 pages
if (i + 1) % 10 == 0 or page_num == remaining[-1]:
save_json(SLUG_MAP_FILE, slug_map)
save_json(PROGRESS_FILE, sorted(completed_pages))
print(f" [Saved progress: {len(slug_map)} nouns, {done} pages]")
time.sleep(DELAY)
# Final save
save_json(SLUG_MAP_FILE, slug_map)
save_json(PROGRESS_FILE, sorted(completed_pages))
print(f"\nStep 1 complete: {len(slug_map)} total nouns from {len(completed_pages)} pages")
return slug_map
def parse_detail_page(html, slug, gender, mishkal):
"""Parse a noun detail page for plural/construct forms."""
soup = BeautifulSoup(html, "lxml")
tables = soup.select("table.conjugation-table")
if not tables:
return None
table = tables[0]
rows = table.select("tr")
result = {
"slug": slug,
"singular": "",
"singular_audio": "",
"plural": "",
"plural_audio": "",
"construct_singular": "",
"construct_plural": "",
"gender": gender,
"mishkal": mishkal,
}
for row in rows:
th = row.select_one("th")
if not th:
continue
label = th.get_text(strip=True).lower()
tds = row.select("td")
if "absolute" in label:
if len(tds) >= 1:
td = tds[0]
m = td.select_one("span.menukad")
result["singular"] = m.get_text(strip=True) if m else ""
audio_el = td.select_one("[data-audio]")
result["singular_audio"] = audio_el.get("data-audio", "") if audio_el else td.get("data-audio", "")
if len(tds) >= 2:
td = tds[1]
m = td.select_one("span.menukad")
result["plural"] = m.get_text(strip=True) if m else ""
audio_el = td.select_one("[data-audio]")
result["plural_audio"] = audio_el.get("data-audio", "") if audio_el else td.get("data-audio", "")
elif "construct" in label:
if len(tds) >= 1:
td = tds[0]
m = td.select_one("span.menukad")
result["construct_singular"] = m.get_text(strip=True) if m else ""
if len(tds) >= 2:
td = tds[1]
m = td.select_one("span.menukad")
result["construct_plural"] = m.get_text(strip=True) if m else ""
return result
def step2_fetch_plurals(slug_map):
"""Step 2: Fetch detail pages for plural + construct forms."""
print("\n" + "=" * 60)
print("STEP 2: Fetching plural + construct forms from detail pages")
print("=" * 60)
plurals = load_json(PLURALS_FILE, {})
already_done = set(plurals.keys())
# Build work list: nouns not yet in plurals
work = []
for word, info in slug_map.items():
if word not in already_done:
work.append((word, info))
print(f"Already have plural data: {len(already_done)}")
print(f"Remaining to fetch: {len(work)}")
if not work:
print("All nouns already have plural data!")
return plurals
skipped = 0
for i, (word, info) in enumerate(work):
slug = info["slug"]
url = f"{BASE_URL}/dict/{slug}/"
r = fetch_with_retry(url)
if not r:
print(f" Skipping {word} ({slug})")
skipped += 1
continue
entry = parse_detail_page(r.text, slug, info.get("gender", ""), info.get("mishkal", ""))
if entry:
plurals[word] = entry
else:
# No declension table - store minimal entry
plurals[word] = {
"slug": slug,
"singular": info.get("word_nikkud", ""),
"singular_audio": "",
"plural": "",
"plural_audio": "",
"construct_singular": "",
"construct_plural": "",
"gender": info.get("gender", ""),
"mishkal": info.get("mishkal", ""),
"no_declension_table": True,
}
done = len(already_done) + i + 1 - skipped
total = len(already_done) + len(work)
if (i + 1) % 50 == 0 or i == 0:
print(
f" [{i + 1}/{len(work)}] {word} ({slug}): "
f"plural={entry['plural'] if entry else 'N/A'} "
f"(total: {done}/{total})"
)
# Save every 50 entries
if (i + 1) % 50 == 0 or i == len(work) - 1:
save_json(PLURALS_FILE, plurals)
print(f" [Saved: {len(plurals)} entries]")
time.sleep(DELAY)
save_json(PLURALS_FILE, plurals)
print(f"\nStep 2 complete: {len(plurals)} total noun entries with plural data")
return plurals
def step3_summary(slug_map, plurals):
"""Step 3: Print summary statistics."""
print("\n" + "=" * 60)
print("SUMMARY")
print("=" * 60)
total_slugs = len(slug_map)
total_plurals = len(plurals)
has_plural = sum(1 for v in plurals.values() if v.get("plural"))
has_construct = sum(1 for v in plurals.values() if v.get("construct_singular") or v.get("construct_plural"))
has_audio = sum(1 for v in plurals.values() if v.get("singular_audio") or v.get("plural_audio"))
no_table = sum(1 for v in plurals.values() if v.get("no_declension_table"))
# Irregular plurals: masculine with ות- ending, feminine with ים- ending
irregular = 0
for _word, v in plurals.items():
plural = v.get("plural", "")
gender = v.get("gender", "")
if not plural or not gender:
continue
plain_plural = re.sub(r"[\u0591-\u05C7]", "", plural)
if (gender == "masculine" and plain_plural.endswith("ות")) or (
gender == "feminine" and plain_plural.endswith("ים")
):
irregular += 1
print(f"Total nouns in slug map: {total_slugs}")
print(f"Total nouns with plural data: {total_plurals}")
print(f" - With plural form: {has_plural}")
print(f" - With construct forms: {has_construct}")
print(f" - With audio URLs: {has_audio}")
print(f" - No declension table: {no_table}")
print(f" - Irregular plurals: {irregular}")
def main():
print("Pealim Noun Plural Scraper")
print(f"Data directory: {DATA_DIR}")
print()
slug_map = step1_collect_slugs()
plurals = step2_fetch_plurals(slug_map)
step3_summary(slug_map, plurals)
if __name__ == "__main__":
main()
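
The irregular-plural rule counted in step3_summary (a masculine noun whose plural ends in ות, or a feminine noun whose plural ends in ים) can be sketched as a standalone predicate. This is a minimal illustration rather than code from the commit: the name `is_irregular_plural` and the sample words are assumptions made for the example, and the nikkud-stripping range is the one the script itself uses.

```python
import re

# Combining-mark range the scraper strips before comparing endings (U+0591..U+05C7).
NIKKUD_RE = re.compile(r"[\u0591-\u05C7]")


def is_irregular_plural(plural: str, gender: str) -> bool:
    """Return True when the plural ending does not match the noun's gender.

    Hypothetical helper: mirrors the condition in step3_summary, with explicit
    parentheses around each and-clause.
    """
    if not plural or not gender:
        return False
    plain = NIKKUD_RE.sub("", plural)
    return (gender == "masculine" and plain.endswith("ות")) or (
        gender == "feminine" and plain.endswith("ים")
    )


# Sample words chosen for illustration only.
print(is_irregular_plural("אָבוֹת", "masculine"))
print(is_irregular_plural("סְפָרִים", "masculine"))
```

Keeping the rule in one function would let the summary loop and any later deck-building code share the same definition of "irregular".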

250
scripts/scrape_verb_ktiv.py Normal file

@ -0,0 +1,250 @@
#!/usr/bin/env python3
"""Scrape ktiv male (vowelless plene) conjugation forms for top 500 verbs from pealim.com."""
import json
import os
import re
import sys
import time
sys.stdout.reconfigure(line_buffering=True)
import requests # noqa: E402
from bs4 import BeautifulSoup # noqa: E402
DATA_DIR = os.path.join(os.path.dirname(os.path.dirname(__file__)), "data")
INPUT_FILE = os.path.join(DATA_DIR, "top_verbs_to_scrape.json")
OUTPUT_FILE = os.path.join(DATA_DIR, "ktiv_male_forms.json")
PARTIAL_FILE = os.path.join(DATA_DIR, "ktiv_male_forms_partial.json")
PROGRESS_FILE = os.path.join(DATA_DIR, "ktiv_scrape_progress.json")
COOKIES = {"translit": "none", "hebstyle": "vl"}
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; PealimScraper/1.0)"}
DELAY = 1.5
session = requests.Session()
session.cookies.update(COOKIES)
session.headers.update(HEADERS)
def load_json(path):
if os.path.exists(path):
with open(path, encoding="utf-8") as f:
return json.load(f)
return {}
def save_json(data, path):
with open(path, "w", encoding="utf-8") as f:
json.dump(data, f, ensure_ascii=False, indent=1)
def search_slug(wni):
"""Search pealim for a verb and return the first result's slug."""
url = "https://www.pealim.com/search/"
resp = session.get(url, params={"q": wni}, timeout=15)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
# Look for result links like /dict/SLUG/
for a in soup.select("a[href]"):
href = a["href"]
m = re.match(r"/dict/(\d+-[^/]+)/", href)
if m:
return m.group(1)
return None
def scrape_verb_forms(slug):
"""Fetch a verb's detail page and extract all ktiv male conjugation forms."""
url = f"https://www.pealim.com/dict/{slug}/"
resp = session.get(url, timeout=15)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
forms = set()
# Get infinitive from div.lead or page title
lead = soup.select_one("div.lead")
if lead:
menukad_spans = lead.select("span.menukad")
for span in menukad_spans:
text = span.get_text(strip=True)
if text:
forms.add(text)
# Read the headword from the page's <h1>. This session sends hebstyle=vl, so the
# text may lack nikkud (the nikkud form would need the "mo" cookie); main() relies
# on the nikkud form from the input data instead.
word_nikkud = None
title = soup.select_one("h1")
if title:
menukad_in_title = title.select_one("span.menukad")
if menukad_in_title:
word_nikkud = menukad_in_title.get_text(strip=True)
# Collect every span.menukad on the page (conjugation tables included; this also covers the lead spans above)
for span in soup.select("span.menukad"):
text = span.get_text(strip=True)
if text:
forms.add(text)
return forms, word_nikkud
def main():
verbs = load_json(INPUT_FILE)
if not verbs:
print("ERROR: No verbs found in input file")
sys.exit(1)
# Load existing forms
existing_forms = load_json(OUTPUT_FILE)
new_forms = {} # Will be merged into existing at the end
# Load progress to resume
progress = load_json(PROGRESS_FILE)
done_wnis = set(progress.get("done_wnis", []))
slug_cache = progress.get("slug_cache", {})
# Pre-populate slug cache from conjugations.json
conj_file = os.path.join(DATA_DIR, "conjugations.json")
if os.path.exists(conj_file):
conj_data = load_json(conj_file)
for wni_key, cdata in conj_data.items():
if isinstance(cdata, dict) and "slug" in cdata and wni_key not in slug_cache:
slug_cache[wni_key] = cdata["slug"]
print(f"Pre-populated {len(slug_cache)} slugs from conjugations.json")
# Deduplicate verbs by wni
seen_wni = set()
unique_verbs = []
for v in verbs:
if v["wni"] not in seen_wni:
seen_wni.add(v["wni"])
unique_verbs.append(v)
total = len(unique_verbs)
to_scrape = [v for v in unique_verbs if v["wni"] not in done_wnis]
print(f"Total unique verbs: {total}, already done: {total - len(to_scrape)}, to scrape: {len(to_scrape)}")
scraped_count = 0
skipped_count = 0
total_new_forms = 0
sample_verbs = {} # For summary: wni -> list of forms
for i, verb in enumerate(to_scrape):
wni = verb["wni"]
word_nikkud_input = verb["word"]
try:
# Step 1: Find slug
if wni in slug_cache:
slug = slug_cache[wni]
else:
slug = search_slug(wni)
time.sleep(DELAY)
if not slug:
print(f" [{i + 1}/{len(to_scrape)}] SKIP {wni} - not found on pealim")
skipped_count += 1
done_wnis.add(wni)
continue
slug_cache[wni] = slug
# Step 2: Scrape forms
forms, page_nikkud = scrape_verb_forms(slug)
time.sleep(DELAY)
# Use the nikkud form from our input data (more reliable)
nikkud_to_use = word_nikkud_input
# Build entries for each form
for form in forms:
entry = {
"word_nikkud": nikkud_to_use,
"form_type": "conjugation",
"pos": "Verb",
"slug": slug,
}
if form not in new_forms:
new_forms[form] = []
# Check for duplicate entry
if not any(e["slug"] == slug for e in new_forms[form]):
new_forms[form].append(entry)
total_new_forms += 1
scraped_count += 1
# Collect samples (first 3 completed)
if len(sample_verbs) < 3:
sample_verbs[wni] = sorted(forms)
print(f" [{i + 1}/{len(to_scrape)}] {wni} -> {slug} ({len(forms)} forms)")
done_wnis.add(wni)
except Exception as e:
print(f" [{i + 1}/{len(to_scrape)}] ERROR {wni}: {e}")
skipped_count += 1
done_wnis.add(wni)
# Save progress every 50 verbs
if (i + 1) % 50 == 0:
progress = {"done_wnis": list(done_wnis), "slug_cache": slug_cache}
save_json(progress, PROGRESS_FILE)
# Save partial merged result
merged = dict(existing_forms)
for form, entries in new_forms.items():
if form in merged:
existing_slugs = {e["slug"] for e in merged[form]}
for entry in entries:
if entry["slug"] not in existing_slugs:
merged[form].append(entry)
else:
merged[form] = entries
save_json(merged, PARTIAL_FILE)
print(f" -- Progress saved at {i + 1}/{len(to_scrape)} --")
# Final merge
merged = dict(existing_forms)
for form, entries in new_forms.items():
if form in merged:
existing_slugs = {e["slug"] for e in merged[form]}
for entry in entries:
if entry["slug"] not in existing_slugs:
merged[form].append(entry)
else:
merged[form] = entries
save_json(merged, OUTPUT_FILE)
# Save final progress
progress = {"done_wnis": list(done_wnis), "slug_cache": slug_cache}
save_json(progress, PROGRESS_FILE)
# Clean up partial file
if os.path.exists(PARTIAL_FILE):
os.remove(PARTIAL_FILE)
# Summary
print(f"\n{'=' * 50}")
print("SUMMARY")
print(f"{'=' * 50}")
print(f"Verbs scraped: {scraped_count}")
print(f"Verbs skipped: {skipped_count}")
print(f"New forms added: {total_new_forms}")
print(f"Total unique ktiv male forms: {len(merged)}")
print(f"Previous forms count: {len(existing_forms)}")
print(f"Net new form keys: {len(merged) - len(existing_forms)}")
if sample_verbs:
print("\nSample verbs:")
for wni, forms in list(sample_verbs.items())[:3]:
print(f"\n {wni} ({len(forms)} forms):")
for f in forms[:8]:
print(f" {f}")
if len(forms) > 8:
print(f" ... and {len(forms) - 8} more")
if __name__ == "__main__":
main()
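
The merge that main() performs twice (once for the partial save, once at the end) could be factored into a single helper. The sketch below is an assumption about how that might look, not code from the commit; `merge_forms` is a hypothetical name. Unlike the inline version, it copies each entry list, so neither input mapping is mutated.

```python
def merge_forms(existing: dict, new: dict) -> dict:
    """Merge two form -> entries mappings, skipping entries whose slug is already listed.

    Hypothetical helper: each entry is assumed to be a dict with a "slug" key,
    as in the entries built by main().
    """
    # Copy the lists so the caller's data is left untouched.
    merged = {form: list(entries) for form, entries in existing.items()}
    for form, entries in new.items():
        bucket = merged.setdefault(form, [])
        seen = {e["slug"] for e in bucket}
        for entry in entries:
            if entry["slug"] not in seen:
                bucket.append(entry)
                seen.add(entry["slug"])
    return merged


# Illustrative data only; the slugs are made up.
existing = {"a": [{"slug": "1-x"}]}
new = {"a": [{"slug": "1-x"}, {"slug": "2-y"}], "b": [{"slug": "3-z"}]}
print(merge_forms(existing, new))
```

With such a helper, both the periodic partial save and the final save would reduce to one call each.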


@ -1,31 +0,0 @@
#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup
word = 'אבל'
url = f'https://www.pealim.com/search/?q={word}'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
try:
response = requests.get(url, headers=headers, timeout=10)
print(f'Status: {response.status_code}')
soup = BeautifulSoup(response.content, 'html.parser')
# Debug: check what we find
word_elem = soup.find('h1', class_='word-title')
pos_elem = soup.find('span', class_='pos')
definition_elem = soup.find('div', class_='definition')
print(f'word_elem found: {word_elem is not None}')
print(f'pos_elem found: {pos_elem is not None}')
print(f'definition_elem found: {definition_elem is not None}')
print('\n--- HTML snippet (first 3000 chars) ---')
print(soup.prettify()[:3000])
except Exception as e:
print(f'Error: {e}')
import traceback
traceback.print_exc()

0
tests/__init__.py Normal file

45
tests/test_smoke.py Normal file

@ -0,0 +1,45 @@
"""Smoke tests for the Hebrew Flash Cards project."""
import sys
from pathlib import Path
# Ensure project root is on path
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
def test_helpers_strip_nikkud():
from helpers import strip_nikkud
assert strip_nikkud("שָׁלוֹם") == "שלום"
assert strip_nikkud("hello") == "hello"
assert strip_nikkud("") == ""
def test_apkg_builder_imports():
import apkg_builder
assert hasattr(apkg_builder, "build_vocab_deck")
assert hasattr(apkg_builder, "build_conj_deck")
assert apkg_builder.VOCAB_MODEL_ID == 1_701_222_017_968
def test_data_files_exist():
data_dir = Path(__file__).resolve().parent.parent / "data"
assert (data_dir / "hebrew_dict_for_anki.csv").exists(), "vocab CSV missing"
assert (data_dir / "conjugations.json").exists(), "conjugations cache missing"
def test_strip_nikkud_idempotent():
from helpers import strip_nikkud
plain = "שלום"
assert strip_nikkud(plain) == plain
def test_strip_nikkud_all_marks():
from helpers import strip_nikkud
# Sample word carrying patach, shva, kamatz, and dagesh
nikkud = "הַמַּלְכָּה"
plain = strip_nikkud(nikkud)
assert all(ch < "\u0591" or ch > "\u05C7" for ch in plain), f"Residual nikkud in: {plain}"


@ -14,7 +14,6 @@ import json
 import os
 import re
 import sqlite3
-import struct
 import sys
 import tempfile
 import zipfile
@ -22,6 +21,9 @@ from pathlib import Path
 VOCAB_APKG = Path("output/hebrew_vocabulary.apkg")
 CONJ_APKG = Path("output/hebrew_conjugations.apkg")
+CONF_APKG = Path("output/hebrew_confusables.apkg")
+PLURAL_APKG = Path("output/hebrew_plurals.apkg")
+COMPLETE_APKG = Path("output/hebrew_complete.apkg")
 PASS = "\033[32m✓\033[0m"
 FAIL = "\033[31m✗\033[0m"
@ -60,10 +62,9 @@ def _detect_format(data: bytes) -> str:
 def validate_apkg(apkg_path: Path) -> int:
 """Run all checks. Returns number of failures."""
-name = apkg_path.name
-print(f"\n{'='*60}")
+print(f"\n{'=' * 60}")
 print(f" Validating: {apkg_path}")
-print(f"{'='*60}")
+print(f"{'=' * 60}")
 failures = 0
@ -78,16 +79,17 @@ def validate_apkg(apkg_path: Path) -> int:
 print("\n[ZIP structure]")
 try:
 zf = zipfile.ZipFile(apkg_path)
+except zipfile.BadZipFile as e:
+print(f" {FAIL} Invalid ZIP: {e}")
+return 1
+with zf, tempfile.TemporaryDirectory() as tmpdir:
 namelist = zf.namelist()
 has_db = "collection.anki2" in namelist
 has_media = "media" in namelist
 failures += 0 if check("collection.anki2 present", has_db) else 1
 failures += 0 if check("media manifest present", has_media) else 1
-except zipfile.BadZipFile as e:
-print(f" {FAIL} Invalid ZIP: {e}")
-return 1
-with tempfile.TemporaryDirectory() as tmpdir:
 zf.extractall(tmpdir)
 # --- Media manifest ---
@ -116,8 +118,11 @@ def validate_apkg(apkg_path: Path) -> int:
 size = zf.getinfo(num).file_size if num in zf.NameToInfo else -1
 if size == 0:
 zero_byte.append(orig)
-failures += 0 if check("No zero-byte media files", len(zero_byte) == 0,
-f"{len(zero_byte)} empty" if zero_byte else "") else 1
+failures += (
+0
+if check("No zero-byte media files", len(zero_byte) == 0, f"{len(zero_byte)} empty" if zero_byte else "")
+else 1
+)
 # Check audio format sample (first 20 mp3s)
 mp3_names = [num for num, orig in media_map.items() if orig.endswith(".mp3")]
@ -127,16 +132,19 @@ def validate_apkg(apkg_path: Path) -> int:
 fmt = _detect_format(data)
 if "MP3" not in fmt:
 bad_format.append(f"{media_map[num]}: {fmt}")
-failures += 0 if check(
-f"Audio format (sampled {min(20, len(mp3_names))} files)",
-len(bad_format) == 0,
-"; ".join(bad_format) if bad_format else f"all MP3",
-) else 1
+failures += (
+0
+if check(
+f"Audio format (sampled {min(20, len(mp3_names))} files)",
+len(bad_format) == 0,
+"; ".join(bad_format) if bad_format else "all MP3",
+)
+else 1
+)
 # Fonts present
 font_files = [v for v in original_names if v.endswith(".ttf")]
-check("Heebo font files bundled", len(font_files) >= 1,
-", ".join(font_files) if font_files else "none found")
+check("Heebo font files bundled", len(font_files) >= 1, ", ".join(font_files) if font_files else "none found")
 # --- Database ---
 print("\n[Database]")
@ -144,8 +152,7 @@ def validate_apkg(apkg_path: Path) -> int:
 conn = sqlite3.connect(db_path)
 schema_ver = conn.execute("SELECT ver FROM col").fetchone()[0]
-failures += 0 if check("Schema version 11 (Anki 2.1)", schema_ver == 11,
-f"got {schema_ver}") else 1
+failures += 0 if check("Schema version 11 (Anki 2.1)", schema_ver == 11, f"got {schema_ver}") else 1
 note_count = conn.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
 card_count = conn.execute("SELECT COUNT(*) FROM cards").fetchone()[0]
@ -153,33 +160,37 @@ def validate_apkg(apkg_path: Path) -> int:
 failures += 0 if check("Cards present", card_count > 0, f"{card_count:,} cards") else 1
 # Determine expected cards per note from model templates
+# Some templates are optional (e.g. cloze only generates when field is non-empty),
+# so we check that cards fall between min and max expected range.
 models_json_raw = conn.execute("SELECT models FROM col").fetchone()[0]
 models_raw = json.loads(models_json_raw)
 tmpl_counts = [len(m["tmpls"]) for m in models_raw.values()]
-expected_ratio = tmpl_counts[0] if len(set(tmpl_counts)) == 1 else None
-if expected_ratio:
-failures += 0 if check(
-f"{expected_ratio} card(s) per note",
-card_count == note_count * expected_ratio,
-f"{note_count} notes × {expected_ratio} = {note_count * expected_ratio}, got {card_count}",
-) else 1
+if len(set(tmpl_counts)) == 1 and len(tmpl_counts) == 1:
+expected_ratio = tmpl_counts[0]
+# Allow fewer cards when optional templates exist (e.g. cloze)
+min_cards = note_count # at least 1 card per note
+max_cards = note_count * expected_ratio
+failures += (
+0
+if check(
+f"Cards per note (1-{expected_ratio} templates)",
+min_cards <= card_count <= max_cards,
+f"{card_count:,} cards from {note_count:,} notes",
+)
+else 1
+)
 # Duplicate GUIDs
-dup_guids = conn.execute(
-"SELECT guid, COUNT(*) c FROM notes GROUP BY guid HAVING c > 1"
-).fetchall()
-failures += 0 if check("No duplicate GUIDs", len(dup_guids) == 0,
-f"{len(dup_guids)} duplicates") else 1
+dup_guids = conn.execute("SELECT guid, COUNT(*) c FROM notes GROUP BY guid HAVING c > 1").fetchall()
+failures += 0 if check("No duplicate GUIDs", len(dup_guids) == 0, f"{len(dup_guids)} duplicates") else 1
 # Card queue states
-queues = conn.execute(
-"SELECT type, queue, COUNT(*) FROM cards GROUP BY type, queue"
-).fetchall()
+queues = conn.execute("SELECT type, queue, COUNT(*) FROM cards GROUP BY type, queue").fetchall()
 queue_map = {(t, q): cnt for t, q, cnt in queues}
 new_cards = queue_map.get((0, 0), 0)
 suspended = queue_map.get((0, -1), 0) + queue_map.get((1, -1), 0) + queue_map.get((2, -1), 0)
 if new_cards > 0:
-check(f"Cards in new queue (type=0, queue=0)", True, f"{new_cards:,}")
+check("Cards in new queue (type=0, queue=0)", True, f"{new_cards:,}")
 if suspended > 0:
 warn("Suspended cards", f"{suspended:,}")
@ -190,23 +201,18 @@ def validate_apkg(apkg_path: Path) -> int:
 per_days = {dc.get("new", {}).get("perDay") for dc in dconf.values() if isinstance(dc, dict)}
 check("new.order configured", bool(orders), f"{orders}")
 if per_days:
-check("new.perDay > 0", all(p and p > 0 for p in per_days if p is not None),
-f"perDay={per_days}")
+check("new.perDay > 0", all(p and p > 0 for p in per_days if p is not None), f"perDay={per_days}")
 # Deck assignment
 decks_json = conn.execute("SELECT decks FROM col").fetchone()[0]
 decks = json.loads(decks_json)
 real_decks = {did: d for did, d in decks.items() if did != "1"}
 if real_decks:
-check("Custom deck exists (not Default only)", True,
-", ".join(d["name"] for d in real_decks.values()))
+check("Custom deck exists (not Default only)", True, ", ".join(d["name"] for d in real_decks.values()))
 # All cards in the custom deck?
 for did_str in real_decks:
-assigned = conn.execute(
-"SELECT COUNT(*) FROM cards WHERE did=?", [int(did_str)]
-).fetchone()[0]
-check(f"Cards in deck '{real_decks[did_str]['name']}'", assigned > 0,
-f"{assigned:,}/{card_count:,}")
+assigned = conn.execute("SELECT COUNT(*) FROM cards WHERE did=?", [int(did_str)]).fetchone()[0]
+check(f"Cards in deck '{real_decks[did_str]['name']}'", assigned > 0, f"{assigned:,}/{card_count:,}")
 # --- Sound references vs media manifest ---
 print("\n[Sound references]")
@ -218,16 +224,21 @@ def validate_apkg(apkg_path: Path) -> int:
 missing_audio = sound_refs - original_names
 orphaned_audio = original_names - sound_refs - set(font_files)
-failures += 0 if check("All sound refs in media manifest", len(missing_audio) == 0,
-f"{len(missing_audio)} missing" if missing_audio else "") else 1
+failures += (
+0
+if check(
+"All sound refs in media manifest",
+len(missing_audio) == 0,
+f"{len(missing_audio)} missing" if missing_audio else "",
+)
+else 1
+)
 if orphaned_audio:
 warn("Media files not referenced by any card", f"{len(orphaned_audio)} orphaned")
-notes_with_audio = sum(
-1 for (flds,) in notes_flds if "[sound:" in flds
-)
+notes_with_audio = sum(1 for (flds,) in notes_flds if "[sound:" in flds)
 pct = notes_with_audio / note_count * 100 if note_count else 0
-check(f"Notes with audio", notes_with_audio > 0, f"{notes_with_audio:,}/{note_count:,} ({pct:.0f}%)")
+check("Notes with audio", notes_with_audio > 0, f"{notes_with_audio:,}/{note_count:,} ({pct:.0f}%)")
 # --- Empty fields check ---
 print("\n[Field content]")
@ -236,22 +247,12 @@ def validate_apkg(apkg_path: Path) -> int:
 field_names = [f["name"] for f in model["flds"]]
 # Check required fields (first 3) are not empty
 required_idx = list(range(min(3, len(field_names))))
+all_notes_for_model = conn.execute("SELECT flds FROM notes WHERE mid=?", [int(mid_str)]).fetchall()
 for idx in required_idx:
 fname = field_names[idx]
-empty_count = conn.execute(
-"""SELECT COUNT(*) FROM notes
-WHERE mid=? AND (
-flds LIKE ? OR
-instr(flds, char(31)) = 0
-)""",
-[int(mid_str), "\x1f" * idx + "\x1f%"],
-).fetchone()[0]
-# Simpler: count notes where field idx is empty
-all_notes_for_model = conn.execute(
-"SELECT flds FROM notes WHERE mid=?", [int(mid_str)]
-).fetchall()
 empty = sum(
-1 for (flds,) in all_notes_for_model
+1
+for (flds,) in all_notes_for_model
 if len(flds.split("\x1f")) <= idx or not flds.split("\x1f")[idx].strip()
 )
 if empty > 0:
@ -271,6 +272,9 @@ def main() -> None:
 group = parser.add_mutually_exclusive_group()
 group.add_argument("--vocab", action="store_true", help="Validate vocabulary deck only")
 group.add_argument("--conjugations", action="store_true", help="Validate conjugation deck only")
+group.add_argument("--confusables", action="store_true", help="Validate confusables deck only")
+group.add_argument("--plurals", action="store_true", help="Validate plurals deck only")
+group.add_argument("--complete", action="store_true", help="Validate complete combined deck only")
 args = parser.parse_args()
 targets: list[Path] = []
@ -280,19 +284,25 @@ def main() -> None:
 targets = [VOCAB_APKG]
 elif args.conjugations:
 targets = [CONJ_APKG]
+elif args.confusables:
+targets = [CONF_APKG]
+elif args.plurals:
+targets = [PLURAL_APKG]
+elif args.complete:
+targets = [COMPLETE_APKG]
 else:
-targets = [VOCAB_APKG, CONJ_APKG]
+targets = [VOCAB_APKG, CONJ_APKG, CONF_APKG, PLURAL_APKG, COMPLETE_APKG]
 total_failures = 0
 for path in targets:
 total_failures += validate_apkg(path)
-print(f"\n{'='*60}")
+print(f"\n{'=' * 60}")
 if total_failures == 0:
 print(f" {PASS} All checks passed")
 else:
 print(f" {FAIL} {total_failures} check(s) failed")
-print(f"{'='*60}\n")
+print(f"{'=' * 60}\n")
 sys.exit(0 if total_failures == 0 else 1)
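
The relaxed card-count check in this file accepts any total between one card per note and one card per template per note, because an optional template such as cloze only generates a card when its field is non-empty. A minimal sketch of that bound follows; the function name and the counts are hypothetical and serve only to illustrate the inequality.

```python
def card_count_in_range(note_count: int, card_count: int, templates: int) -> bool:
    """Hypothetical helper: True when card_count lies between one card per note
    and one card per template per note (both bounds inclusive)."""
    min_cards = note_count
    max_cards = note_count * templates
    return min_cards <= card_count <= max_cards


# Made-up numbers: 100 notes using a model with 2 templates.
print(card_count_in_range(100, 150, 2))  # True: half the notes lack the optional card
print(card_count_in_range(100, 250, 2))  # False: more cards than the templates allow
```

The range is deliberately loose: it catches decks with missing or surplus cards without needing to know which notes carry the optional field.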


@ -28,42 +28,42 @@ from pathlib import Path
 import requests
 from bs4 import BeautifulSoup
 PEALIM_BASE = "https://www.pealim.com"
 REQUEST_DELAY = 1.5
 REQUEST_TIMEOUT = 15
 SOURCE_FILE = Path(__file__).parent / "nevo_typed_verbs_from_modern_hebrew"
 OUTPUT_FILE = Path(__file__).parent / "verbs_input.txt"
 # Known problem entries: word → (action, note)
 # action: "REVIEW" = comment out and flag, "3ms" = treat as 3ms past form
 KNOWN_ISSUES: dict[str, tuple[str, str]] = {
 "לגבוה": ("REVIEW", "not a standard infinitive form; likely defective spelling or wrong word"),
 "לההרג": ("REVIEW", "extra ה; should probably be להיהרג (Nif'al of הרג)"),
 "להתלקלח": ("REVIEW", "not a real word; likely typo for להתקלקל"),
 "להקלל": ("REVIEW", "ambiguous: could be Hif'il לְהָקֵל (to ease) or Nif'al of קלל"),
 "המציא": ("3ms", "Hif'il 3ms past form, not an infinitive"),
 "קומם": ("3ms", "ambiguous: Pu'al 3ms past; Pi'el infinitive is לְקוֹמֵם"),
 }
 # Expected binyan by line range (1-indexed) per plan analysis
 LINE_RANGES: list[tuple[range, str]] = [
 (range(1, 18), "Pa'al"),
 (range(18, 29), "Nif'al"),
 (range(29, 37), "Pi'el"),
 (range(37, 43), "Pu'al"),
 (range(43, 53), "Hitpa'el"),
 (range(53, 63), "Hif'il"),
 (range(63, 71), "Huf'al"),
 ]
 SECTION_HEADERS: dict[str, str] = {
 "Pa'al": "# Pa'al (פָּעַל)",
 "Nif'al": "# Nif'al (נִפְעַל)",
 "Pi'el": "# Pi'el (פִּעֵל)",
 "Pu'al": "# Pu'al (פֻּעַל) — 3ms past, no infinitive",
 "Hitpa'el": "# Hitpa'el (הִתְפַּעֵל)",
 "Hif'il": "# Hif'il (הִפְעִיל)",
 "Huf'al": "# Huf'al (הֻפְעַל) — 3ms past, no infinitive",
 }
 session = requests.Session()
@ -120,7 +120,7 @@ def main() -> None:
print(f"ERROR: {SOURCE_FILE} not found", file=sys.stderr) print(f"ERROR: {SOURCE_FILE} not found", file=sys.stderr)
sys.exit(1) sys.exit(1)
lines = [l.strip() for l in SOURCE_FILE.read_text(encoding="utf-8").splitlines() if l.strip()] lines = [line.strip() for line in SOURCE_FILE.read_text(encoding="utf-8").splitlines() if line.strip()]
print(f"Loaded {len(lines)} entries from {SOURCE_FILE.name}") print(f"Loaded {len(lines)} entries from {SOURCE_FILE.name}")
print(f"Querying pealim.com (delay {REQUEST_DELAY}s per request)…\n") print(f"Querying pealim.com (delay {REQUEST_DELAY}s per request)…\n")
@ -137,14 +137,19 @@ def main() -> None:
         if issue_type == "REVIEW":
             # Don't query pealim for known-bad entries
-            print(f"REVIEW (skipping query)")
-            results.append({
-                "line": line_num, "word": word,
-                "expected_binyan": expected_binyan,
-                "slug": "", "page_binyan": "",
-                "status": "REVIEW", "notes": issue_note,
-                "is_3ms": is_3ms_by_position,
-            })
+            print("REVIEW (skipping query)")
+            results.append(
+                {
+                    "line": line_num,
+                    "word": word,
+                    "expected_binyan": expected_binyan,
+                    "slug": "",
+                    "page_binyan": "",
+                    "status": "REVIEW",
+                    "notes": issue_note,
+                    "is_3ms": is_3ms_by_position,
+                }
+            )
             continue
time.sleep(REQUEST_DELAY) time.sleep(REQUEST_DELAY)
@ -171,13 +176,18 @@ def main() -> None:
notes = "" notes = ""
         print(f"{status:<12} slug={slug or '-':<35} binyan={page_binyan or '-'}")
-        results.append({
-            "line": line_num, "word": word,
-            "expected_binyan": expected_binyan,
-            "slug": slug or "", "page_binyan": page_binyan,
-            "status": status, "notes": notes,
-            "is_3ms": is_3ms_by_position or issue_type == "3ms",
-        })
+        results.append(
+            {
+                "line": line_num,
+                "word": word,
+                "expected_binyan": expected_binyan,
+                "slug": slug or "",
+                "page_binyan": page_binyan,
+                "status": status,
+                "notes": notes,
+                "is_3ms": is_3ms_by_position or issue_type == "3ms",
+            }
+        )
# ── Write cleaned verbs_input.txt ──────────────────────────────────────────── # ── Write cleaned verbs_input.txt ────────────────────────────────────────────
sections: dict[str, list[str]] = {b: [] for b in SECTION_HEADERS} sections: dict[str, list[str]] = {b: [] for b in SECTION_HEADERS}
@ -219,7 +229,6 @@ def main() -> None:
print(f"\nWrote → {OUTPUT_FILE}") print(f"\nWrote → {OUTPUT_FILE}")
# ── Print summary table ────────────────────────────────────────────────────── # ── Print summary table ──────────────────────────────────────────────────────
-    col_w = [4, 22, 14, 38, 12]
print("\n" + "=" * 95) print("\n" + "=" * 95)
print("VALIDATION REPORT") print("VALIDATION REPORT")
print("=" * 95) print("=" * 95)
@ -232,8 +241,7 @@ def main() -> None:
) )
print("=" * 95) print("=" * 95)
-    counts = {s: sum(1 for r in results if r["status"] == s)
-              for s in ("OK", "3ms", "MISMATCH", "REVIEW", "NOT_FOUND")}
+    counts = {s: sum(1 for r in results if r["status"] == s) for s in ("OK", "3ms", "MISMATCH", "REVIEW", "NOT_FOUND")}
     print(
         f"\nSummary: {counts['OK']} OK | {counts['3ms']} 3ms-past | "
         f"{counts['MISMATCH']} MISMATCH | {counts['REVIEW']} REVIEW | {counts['NOT_FOUND']} NOT_FOUND"
@ -241,10 +249,7 @@ def main() -> None:
print(f"Total entries: {len(results)}") print(f"Total entries: {len(results)}")
     if counts["REVIEW"] > 0 or counts["NOT_FOUND"] > 0 or counts["MISMATCH"] > 0:
-        print(
-            "\n⚠ Review flagged entries in verbs_input.txt before running:\n"
-            " python3 conjugation_extract.py"
-        )
+        print("\n⚠ Review flagged entries in verbs_input.txt before running:\n python3 conjugation_extract.py")
if __name__ == "__main__": if __name__ == "__main__":


@ -2,6 +2,8 @@
# Lines prefixed '# 3ms:' are searched by 3ms past form (Pu'al/Huf'al). # Lines prefixed '# 3ms:' are searched by 3ms past form (Pu'al/Huf'al).
# Pa'al (פָּעַל) # Pa'al (פָּעַל)
# slug: להיות 454-lihyot
להיות
לשמור לשמור
ללמוד ללמוד
לאסוף לאסוף

3
vulture_whitelist.py Normal file

@ -0,0 +1,3 @@
# Vulture whitelist: suppress false positives for interface methods
# HTMLParser.handle_starttag requires (self, tag, attrs) signature
attrs # noqa
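
Note on the whitelist: `html.parser.HTMLParser` calls `handle_starttag(tag, attrs)` with both arguments, so an override must accept `attrs` even when it never reads it. A dead-code scanner can then report the parameter as unused. A minimal sketch of that situation, using a made-up `TagCollector` class rather than the project's actual parser:

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Illustrative parser that records start-tag names only."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is required by the HTMLParser interface but unused here,
        # which is the kind of false positive the whitelist suppresses.
        self.tags.append(tag)


def collect_tags(html: str) -> list[str]:
    """Feed html through TagCollector and return the start tags in order."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags
```

Renaming the parameter to `_attrs` would also work, since the base class passes it positionally, but the whitelist entry keeps the override's signature identical to the documented one.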