Sprint 9: cloze cards, plurals deck, project reorg, lint tooling
- Cloze card pipeline: 924 cards from 2,296 AI-vetted Hebrew book sentences
- Plurals deck: 375 notes (144 irregular + 231 regular from 86 mishkal patterns)
- Ktiv male forms expanded to 20,711 entries for sentence matching
- Project reorg: helpers.py (deduped strip_nikkud from 10 files), scripts/ for one-off tools, tests/ with smoke tests, deleted 3 dead files
- Lint tooling: pyproject.toml with ruff/vulture/bandit/pytest config, .editorconfig, fixed all 129 ruff errors (B023 closure fix, SIM103, unused vars)
- validate_apkg.py: card count range check for optional cloze template
- Data caches committed: vetted_sentences, ktiv_male_forms, noun_plurals, noun_slug_map, vocab_sentence_matches, epub_sentence_index

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent 419e952389
commit 17f7458d19
37 changed files with 330541 additions and 871 deletions
15
.editorconfig
Normal file
@@ -0,0 +1,15 @@
+root = true
+
+[*]
+indent_style = space
+indent_size = 4
+end_of_line = lf
+charset = utf-8
+trim_trailing_whitespace = true
+insert_final_newline = true
+
+[*.{json,yml,yaml,toml}]
+indent_size = 2
+
+[*.md]
+trim_trailing_whitespace = false
15
.gitignore
vendored
@@ -11,6 +11,7 @@ pyvenv.cfg
 venv/
 __pycache__/
 *.pyc
+.pytest_cache/
 
 # Large generated cache files (rebuild locally)
 data/benyehuda_index.json
@@ -31,6 +32,20 @@ ANKIWEB_DESCRIPTION.md
 PROJECTS.md
 SPRINT_LOG.md
 CLAUDE.md
+RECOMMENDATIONS.md
+
+# Intermediate scrape progress files
+data/ktiv_male_forms.json.partial
+data/ktiv_male_forms_partial.json
+data/ktiv_scrape_progress.json
+data/noun_slug_map_progress.json
+data/top_verbs_to_scrape.json
+
+# EPUB source files (large; user-specific)
+data/epubs/
+
+# Stray deck files
+Everything__*.apkg
 
 # Release artifacts — distributed via Forgejo releases, not committed to tree
 releases/
154
README.md
|
|
@ -6,16 +6,17 @@
|
|||
|
||||
## For Hebrew learners
|
||||
|
||||
This project generates two Anki decks for learning Modern Hebrew:
|
||||
A set of Anki flashcard decks for learning Modern Hebrew — vocabulary, verb conjugations, and more. All words include nikkud (vowel marks), audio, and are sorted by frequency so you learn the most useful words first.
|
||||
|
||||
- **Vocabulary deck** — ~9,100 words from [pealim.com](https://www.pealim.com/dict/), with nikkud (vowel marks), roots, parts of speech, related words, and example sentences from classic Hebrew literature.
|
||||
- **Conjugation deck** — 70 paradigm verbs from Coffin & Bolozky's *A Reference Grammar of Modern Hebrew* (2005), fully conjugated in all tenses and persons, across all seven binyanim.
|
||||
### What's included
|
||||
|
||||
All card data comes from open or academic sources:
|
||||
- Word data: [pealim.com](https://www.pealim.com) — a free Modern Hebrew dictionary
|
||||
- Example sentences: [Project Ben-Yehuda](https://benyehuda.org) — public-domain Hebrew literature corpus
|
||||
- Word frequency: [hermitdave/FrequencyWords](https://github.com/hermitdave/FrequencyWords) — Hebrew frequency list
|
||||
- Verb paradigm list: Coffin, Edna Amir and Shmuel Bolozky. *A Reference Grammar of Modern Hebrew*. Cambridge University Press, 2005.
|
||||
- **Vocabulary** — ~9,100 Hebrew words with pronunciation audio, roots, example sentences from Hebrew literature, images, and frequency rankings.
|
||||
- **Verb conjugations** — 71 core verbs fully conjugated in all tenses and persons, covering all seven binyanim (verb patterns).
|
||||
- **Confusables** — Words that look the same without vowel marks (e.g., דָּבָר "thing" vs. דִּבֵּר "spoke") shown side by side so you can tell them apart.
|
||||
- **Noun plurals** — Practice forming singular↔plural pairs, with a focus on irregular plurals and common patterns.
|
||||
- **All-in-one** — A combined deck with everything above, organized as subdecks.
|
||||
|
||||
You can download and import any deck individually — or use the combined deck to get everything at once.
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -25,17 +26,19 @@ All card data comes from open or academic sources:
|
|||
2. Double-click to import into [Anki](https://apps.ankiweb.net/) (free, cross-platform)
|
||||
3. Start studying
|
||||
|
||||
Both decks can be imported independently. If you already have one, re-importing the same file updates your deck without losing study progress.
|
||||
All decks can be imported independently — pick just the ones you want. Re-importing the same file later updates your deck without losing study progress.
|
||||
|
||||
---
|
||||
|
||||
## What's in the vocabulary deck
|
||||
|
||||
Each card has two sides:
|
||||
Each note generates up to three cards:
|
||||
|
||||
**Hebrew → English:** See the Hebrew word (with nikkud) + hear audio → recall the meaning.
|
||||
|
||||
**English → Hebrew:** See the English meaning → recall the Hebrew word, its root, and how to write it.
|
||||
**English → Hebrew:** See the English meaning → recall the Hebrew word. When multiple words share the same English meaning, a disambiguation hint (part of speech + binyan) helps you know which word is expected.
|
||||
|
||||
**Sentence Cloze:** A Hebrew sentence with the target word blanked out → fill in the missing word. Only generated for words with a vetted example sentence. Tests recognition in context.
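As a rough illustration of how a sentence cloze could be produced, the sketch below wraps the first occurrence of a target word in Anki's `{{c1::...}}` cloze syntax. The function name `make_cloze`, the sample sentence, and the exact-match lookup are assumptions made for illustration; the project's real sentence matching (which relies on ktiv male forms) is not shown in this diff.

```python
from typing import Optional


def make_cloze(sentence: str, target: str) -> Optional[str]:
    """Wrap the first occurrence of target in Anki cloze markup.

    Hypothetical helper: returns None when target does not appear verbatim,
    so callers can skip words without a usable sentence.
    """
    index = sentence.find(target)
    if index == -1:
        return None
    end = index + len(target)
    # "{{c1::" and "}}" are the standard Anki cloze delimiters.
    return sentence[:index] + "{{c1::" + target + "}}" + sentence[end:]


# Made-up sample sentence, not taken from the vetted data.
example = make_cloze("הוא שמר על הבית", "שמר")
missing = make_cloze("הוא שמר על הבית", "ספר")
```

Returning `None` instead of an unmodified sentence keeps the "only generated for words with a vetted example sentence" rule easy to enforce at the call site.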
|
||||
|
||||
Fields on each card:
|
||||
| Field | Example |
|
||||
|
|
@ -43,44 +46,70 @@ Fields on each card:
|
|||
| Hebrew word (nikkud) | שָׁמַר |
|
||||
| Meaning | kept, watched over |
|
||||
| Root | שמ״ר |
|
||||
| Part of speech | פועל (verb) |
|
||||
| Part of speech | פועל — פָּעַל |
|
||||
| Without nikkud | שמר |
|
||||
| Related words | שׁוֹמֵר, שְׁמִירָה |
|
||||
| Example sentence | from Ben-Yehuda corpus |
|
||||
| Related words | שׁוֹמֵר, שְׁמִירָה (grouped by Part of Speech) |
|
||||
| Example sentence | from nikkud'd Hebrew books |
|
||||
| Audio | pronunciation from pealim.com |
|
||||
| Frequency rank | #412 |
|
||||
| Image / Emoji | for concrete nouns |
|
||||
| Plural form | for nouns: רבים: שֻׁלְחָנוֹת |
|
||||
| Disambiguation hint | for ambiguous Eng→Heb cards |
|
||||
|
||||
Cards are presented in **frequency order** — Anki will show you the most common words first. Frequency rank is displayed on every card so you can see how common each word is. Words not in the top 50,000 show a "50k+" badge.
|
||||
Cards are presented in **frequency order** — Anki will show you the most common words first.
|
||||
|
||||
### Eng→Heb disambiguation
|
||||
|
||||
When two Hebrew words translate to the same English (e.g., both mean "to return"), the Eng→Heb card shows a hint to tell them apart:
|
||||
|
||||
- **Layer 1:** Automatic Part of Speech + binyan hints for words with different parts of speech (163 words)
|
||||
- **Layer 2:** AI-refined distinct glosses for true synonyms sharing the same Part of Speech (440 words)
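A minimal sketch of the Layer 1 idea, assuming each entry carries a Hebrew word, an English meaning, and a part-of-speech label. The `Entry` type, the `pos_hints` function, and the placeholder data are hypothetical and are not taken from the project code.

```python
from collections import defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    hebrew: str   # placeholder identifiers are used in the sample data below
    meaning: str  # English gloss shown on the Eng->Heb card
    pos: str      # part of speech, optionally with binyan


def pos_hints(entries: list[Entry]) -> dict[str, str]:
    """Map word -> hint for words whose English meaning is shared by
    another word with a different part of speech."""
    by_meaning: dict[str, list[Entry]] = defaultdict(list)
    for entry in entries:
        by_meaning[entry.meaning].append(entry)

    hints: dict[str, str] = {}
    for group in by_meaning.values():
        # A hint only helps when the parts of speech actually differ.
        if len({e.pos for e in group}) > 1:
            for e in group:
                hints[e.hebrew] = e.pos
    return hints


sample = [
    Entry("word_a", "return", "verb (pa'al)"),
    Entry("word_b", "return", "noun"),
    Entry("word_c", "book", "noun"),
]
hints = pos_hints(sample)
```

Words that share both meaning and part of speech get no useful hint from this approach, which is the gap Layer 2 is described as covering.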
|
||||
|
||||
---
|
||||
|
||||
## What's in the conjugation deck
|
||||
|
||||
70 paradigm verbs from Coffin & Bolozky's *A Reference Grammar of Modern Hebrew* (Appendix 1), covering all seven binyanim:
|
||||
71 verbs listed in Appendix 1 of Coffin & Bolozky's *A Reference Grammar of Modern Hebrew*, covering all seven binyanim and **all irregular forms**:
|
||||
- פָּעַל (Pa'al), נִפְעַל (Nif'al), פִּעֵל (Pi'el), פֻּעַל (Pu'al)
|
||||
- הִתְפַּעֵל (Hitpa'el), הִפְעִיל (Hif'il), הֻפְעַל (Huf'al)
|
||||
|
||||
Each verb is drilled in: present, past, future, and imperative — all persons and genders. The infinitive is shown on the card front as context but is not quizzed.
|
||||
Each verb is drilled in: present, past, future, and imperative — all persons and genders. Each card shows the English meaning and related vocabulary from the same root.
|
||||
|
||||
**Present tense expansion:** Each present form generates 3 cards (one per pronoun that uses it), so you learn אֲנִי, אַתָּה, and הוּא all separately with the same masculine singular form.
|
||||
**Present tense expansion:** Each present-tense card shows a randomly chosen pronoun on the front, so you get used to seeing אֲנִי, אַתָּה, and הוּא with the same verb form, since all three share one form in the present tense.
|
||||
|
||||
**Modern Hebrew 2fp/3fp:** Classical feminine plural future forms (e.g., תִּשְׁמֹרְנָה) are shown in parentheses; the card's primary answer is the modern masculine plural form used in everyday speech.
|
||||
**Modern Hebrew 2fp/3fp:** Classical feminine plural future forms (e.g., תִּשְׁמֹרְנָה) are shown in parentheses and played as audio in the audio-included decks. The card's primary answer is the modern masculine plural form used in everyday speech.
|
||||
|
||||
**Passive label:** Pu'al and Huf'al cards show the active partner's infinitive on the front (e.g., לְבַטֵּל) followed by **(סָבִיל)** in smaller text, so you know you're drilling the passive conjugation. Active verbs show no label.
|
||||
**Passive label:** Pu'al and Huf'al cards show the active partner's infinitive on the front (e.g., לְבַטֵּל) followed by **(סָבִיל)** in smaller text, so you know you're drilling the passive conjugation.
|
||||
|
||||
**Card order:** New cards are introduced in random order.
|
||||
**Card order:** New conjugation cards are introduced in random order (not grouped by verb).
|
||||
|
||||
**Citation:** Coffin, Edna Amir and Shmuel Bolozky. *A Reference Grammar of Modern Hebrew*. Cambridge University Press, 2005.
|
||||
---
|
||||
|
||||
## What's in the confusables deck
|
||||
|
||||
Hebrew without vowel marks is full of lookalikes. This deck groups words that are spelled identically without nikkud and asks "מה ההבדל?" (what's the difference?). The answer reveals all the words side by side with their nikkud and definitions.
|
||||
|
||||
Examples: דָּבָר (thing) vs. דִּבֵּר (spoke), סֵפֶר (book) vs. סָפַר (counted) vs. סַפָּר (barber).
|
||||
|
||||
---
|
||||
|
||||
## What's in the plurals deck
|
||||
|
||||
Two card directions for each noun:
|
||||
- **Singular → Plural:** See שֻׁלְחָן → produce שֻׁלְחָנוֹת
|
||||
- **Plural → Singular:** See שֻׁלְחָנוֹת → produce שֻׁלְחָן
|
||||
|
||||
Focuses on irregular plurals (the tricky ones that don't follow the rules) and common examples from each noun pattern. Cards are tagged by pattern for filtered study.
|
||||
|
||||
---
|
||||
|
||||
## Suggested study strategy
|
||||
|
||||
Start with the vocabulary deck. Anki will present the most frequent words first. Don't try to study to many cards every single day-- Anki suggests 20 per day.
|
||||
Start with the vocabulary deck. Anki will present the most frequent words first. Don't try to study too many cards every single day — Anki suggests 20 per day.
|
||||
|
||||
The conjugation cards reinforce verb forms you've already seen in vocabulary.
|
||||
|
||||
Use the Hebrew → English direction to build reading comprehension. Use the English → Hebrew direction to build writing and speaking recall.
|
||||
Use the Hebrew → English direction to build reading comprehension. Use the English → Hebrew direction to build writing and speaking recall. The sentence cloze cards test whether you can recognize words in real Hebrew text.
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -90,9 +119,11 @@ Use the Hebrew → English direction to build reading comprehension. Use the Eng
|
|||
|
||||
**Project Ben-Yehuda** — A public-domain digital library of Hebrew literature. Example sentences come from the nikkud corpus (classic texts with full vowel marks).
|
||||
|
||||
**Hebrew books** — Additional example sentences from nikkud'd (menukad) Hebrew books, quality-filtered by Claude Sonnet. The AI does not generate the sentences; it only judges whether each one is a high-quality example.
|
||||
|
||||
**FrequencyWords** — An open Hebrew word frequency list derived from subtitle data. Used to sort vocabulary cards from most to least common.
|
||||
|
||||
**Coffin & Bolozky** — The verb paradigm list for the conjugation deck comes from Appendix 1 of *A Reference Grammar of Modern Hebrew* (Cambridge University Press, 2005), which provides a comprehensive reference for Modern Hebrew verbal morphology.
|
||||
**Coffin & Bolozky** — The verb list and the known-good conjugation reference for the conjugation deck come from Appendix 1 of *A Reference Grammar of Modern Hebrew* (Cambridge University Press, 2005).
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -102,7 +133,7 @@ If you notice a wrong translation, missing audio, or incorrect conjugation:
|
|||
|
||||
- For vocabulary errors: the source is pealim.com — you can suggest corrections there. But if you think morfix has a correct translation and pealim.com does not, we may be able to encode an override.
|
||||
|
||||
For any other issue, whether you know to code or not: Email me at pealim [at] nevo [dot] engineer
|
||||
For any other issue, whether you know how to code or not: Email me at hebrew [at] nevo [dot] engineer
|
||||
|
||||
---
|
||||
|
||||
@@ -136,13 +167,14 @@ python run.py --skip-scrape --refresh-examples
 ```
 python run.py [options]
 
-  --skip-scrape         Use cached data/hebrew_dict.csv (no pealim.com scraping)
+  --only {vocab,conjugations,confusables,plurals,complete}
+                        Build only one deck type
+  --skip-scrape         Use cached data/hebrew_dict.csv
   --skip-audio          Skip audio .mp3 downloads
   --skip-examples       Skip Ben Yehuda example fetching
-  --only {vocab,conjugations}  Run only one deck (skips all unrelated steps)
-  --skip-conjugations   Skip verb conjugation extraction (deprecated: use --only vocab)
+  --skip-conjugations   Skip verb conjugation extraction
   --skip-images         Skip image fetching for concrete nouns
-  --refresh-examples    Force rebuild of Ben Yehuda index (nikkud corpus)
+  --refresh-examples    Force rebuild of Ben Yehuda index
   --test N              Process only first N words
 ```
 
|
|
@ -150,28 +182,60 @@ python run.py [options]
|
|||
|
||||
| File | Description |
|
||||
|------|-------------|
|
||||
| `data/hebrew_dict.csv` | Raw dictionary |
|
||||
| `data/hebrew_dict_for_anki.csv` | Enriched Anki CSV |
|
||||
| `data/conjugations.json` | Verb conjugation data |
|
||||
| `data/audio/` | Vocabulary audio (.mp3) |
|
||||
| `data/audio_conj/` | Conjugation audio (.mp3) |
|
||||
| `data/fonts/` | Heebo font files (bundled in .apkg) |
|
||||
| `data/images/` | Noun images from Wikipedia/Commons |
|
||||
| `data/image_cache.json` | Image fetch cache |
|
||||
| `output/hebrew_vocabulary.apkg` | Vocabulary Anki deck |
|
||||
| `output/hebrew_conjugations.apkg` | Conjugation Anki deck |
|
||||
| `output/hebrew_vocabulary.apkg` | Vocabulary deck (text only) |
|
||||
| `output/hebrew_vocabulary_audio.apkg` | Vocabulary deck + audio |
|
||||
| `output/hebrew_vocabulary_images.apkg` | Vocabulary deck + images |
|
||||
| `output/hebrew_vocabulary_audio_images.apkg` | Vocabulary deck + audio + images |
|
||||
| `output/hebrew_conjugations.apkg` | Conjugation deck |
|
||||
| `output/hebrew_conjugations_audio.apkg` | Conjugation deck + audio |
|
||||
| `output/hebrew_confusables.apkg` | Confusables deck |
|
||||
| `output/hebrew_confusables_audio.apkg` | Confusables deck + audio |
|
||||
| `output/hebrew_plurals.apkg` | Plurals deck |
|
||||
| `output/hebrew_plurals_audio.apkg` | Plurals deck + audio |
|
||||
| `output/hebrew_complete.apkg` | All decks combined |
|
||||
| `output/hebrew_complete_audio.apkg` | All decks combined + audio |
|
||||
|
||||
### Data files
|
||||
|
||||
| File | Description |
|
||||
|------|-------------|
|
||||
| `data/hebrew_dict_for_anki.csv` | Enriched vocabulary CSV |
|
||||
| `data/conjugations.json` | Verb conjugation data (71 verbs) |
|
||||
| `data/noun_plurals.json` | Noun plural/construct forms |
|
||||
| `data/refined_meanings.json` | AI-disambiguated meanings (440 words) |
|
||||
| `data/vetted_sentences.json` | AI-vetted example sentences |
|
||||
| `data/ktiv_male_forms.json` | Ktiv male (plene) forms for sentence matching |
|
||||
| `data/legacy_guid_map.json` | Legacy GUIDs for study progress preservation |
|
||||
|
||||
### Pipeline overview
|
||||
|
||||
1. `hebrew_extract.py` — scrapes pealim.com dictionary
|
||||
2. `frequency_lookup.py` — downloads/loads Hebrew frequency data
|
||||
3. `benyehuda.py` — builds sentence index from Ben-Yehuda corpus
|
||||
3. `benyehuda.py` — builds sentence index from Ben-Yehuda nikkud corpus
|
||||
4. `extract_verb_list.py` — extracts verb list from Coffin & Bolozky PDF
|
||||
5. `conjugation_extract.py` — fetches conjugation tables from pealim.com
|
||||
5. `conjugation_extract.py` — fetches conjugation tables + meanings from pealim.com
|
||||
6. `image_fetch.py` — fetches Wikipedia/Commons images for concrete nouns
|
||||
7. `validate_verb_list.py` — validates verb list against pealim.com
|
||||
8. `apkg_builder.py` — assembles both `.apkg` files
|
||||
9. `run.py` — orchestrates all steps
|
||||
7. `scrape_noun_plurals.py` — scrapes noun plural/construct forms from pealim.com
|
||||
8. `scrape_ktiv_male.py` — scrapes ktiv male (plene) forms for sentence matching
|
||||
9. `rebuild_sentence_matches.py` — matches vocab words to book sentences
|
||||
10. `apkg_builder.py` — assembles all `.apkg` files
|
||||
11. `run.py` — orchestrates all steps
|
||||
12. `validate_apkg.py` — validates output decks
|
||||
|
||||
---
|
||||
|
||||
## Deck variants
|
||||
|
||||
| Variant | Contents | Size |
|
||||
|---------|----------|------|
|
||||
| `hebrew_vocabulary.apkg` | Text + images | ~15 MB |
|
||||
| `hebrew_vocabulary_audio.apkg` | Text + images + audio | ~80 MB |
|
||||
| `hebrew_conjugations.apkg` | Text only | ~1 MB |
|
||||
| `hebrew_conjugations_audio.apkg` | Text + audio | ~5 MB |
|
||||
| `hebrew_confusables.apkg` | Text only | ~1 MB |
|
||||
| `hebrew_plurals.apkg` | Text only | ~1 MB |
|
||||
| `hebrew_complete.apkg` | Everything combined | ~20 MB |
|
||||
| `hebrew_complete_audio.apkg` | Everything + audio | ~90 MB |
|
||||
|
||||
---
|
||||
|
||||
|
|
|
|||
1111
apkg_builder.py
File diff suppressed because it is too large
17
benyehuda.py
@@ -14,20 +14,18 @@ Exposed API:
 import json
 import logging
 import re
-import unicodedata
 import zipfile
 from io import BytesIO
 from pathlib import Path
 
 import requests
 
+from helpers import strip_nikkud as _strip_nikkud
+
 logger = logging.getLogger(__name__)
 
 # Nikkud-bearing corpus (txt.zip instead of txt_stripped.zip)
-CORPUS_URL = (
-    "https://github.com/projectbenyehuda/public_domain_dump/releases/"
-    "download/2025-10/txt.zip"
-)
+CORPUS_URL = "https://github.com/projectbenyehuda/public_domain_dump/releases/download/2025-10/txt.zip"
 INDEX_PATH = Path(__file__).parent / "data" / "benyehuda_index.json"
 EXAMPLES_CACHE_PATH = Path(__file__).parent / "data" / "examples_cache.json"
 REQUEST_TIMEOUT = 120
@@ -40,13 +38,6 @@ _index: dict[str, list[str]] = {} # word (with nikkud) -> [sentence, ..
 _examples_cache: dict[str, list[str]] = {} # word -> cached result for this run
 
 
-def _strip_nikkud(text: str) -> str:
-    return "".join(
-        ch for ch in unicodedata.normalize("NFD", text)
-        if unicodedata.category(ch) != "Mn"
-    )
-
-
 def _split_sentences(text: str) -> list[str]:
     """
     Split text into sentences on newlines only (Hebrew sentences don't have
@@ -73,7 +64,7 @@ def _build_index(corpus_zip_bytes: bytes) -> None:
         for fname in txt_files:
             try:
                 raw = zf.read(fname).decode("utf-8", errors="ignore")
-            except Exception:
+            except Exception: # noqa: S112
                 continue
             for sentence in _split_sentences(raw):
                 # Index by each unique Hebrew token (with nikkud) in the sentence
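For reference, the shared helper that `benyehuda.py` now imports might look like the following. This is a sketch reconstructed from the `_strip_nikkud` body removed in this diff; the actual contents of `helpers.py` are not shown here, so the exact definition is an assumption.

```python
import unicodedata


def strip_nikkud(text: str) -> str:
    """Remove Hebrew nikkud (combining marks) from a string.

    NFD normalization separates base letters from their combining marks;
    every character in Unicode category "Mn" (nonspacing mark) is dropped.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Same word pair the README uses for "with nikkud" and "without nikkud".
bare = strip_nikkud("שָׁמַר")
```

Because the filter is category-based rather than a fixed list of code points, it also removes cantillation marks and leaves non-Hebrew text such as ASCII untouched.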
|
|
@ -19,13 +19,14 @@ import json
|
|||
import logging
|
||||
import re
|
||||
import time
|
||||
import unicodedata
|
||||
import urllib.parse
|
||||
from pathlib import Path
|
||||
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
from helpers import strip_nikkud as _strip_nikkud
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
PEALIM_BASE = "https://www.pealim.com"
|
||||
|
|
@ -34,10 +35,14 @@ REQUEST_TIMEOUT = 15
|
|||
VERBS_INPUT = Path(__file__).parent / "verbs_input.txt"
|
||||
CONJUGATIONS_PATH = Path(__file__).parent / "data" / "conjugations.json"
|
||||
DICT_CSV = next(
|
||||
(p for p in [
|
||||
(
|
||||
p
|
||||
for p in [
|
||||
Path(__file__).parent / "data" / "hebrew_dict_for_anki.csv",
|
||||
Path(__file__).parent / "data" / "pealim_dict_for_anki.csv",
|
||||
] if p.exists()),
|
||||
]
|
||||
if p.exists()
|
||||
),
|
||||
Path(__file__).parent / "data" / "hebrew_dict_for_anki.csv",
|
||||
)
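The `DICT_CSV` assignment relies on `next()` with a default: the generator yields the first candidate path that exists, and the second argument is returned when none do. Below is a small standalone sketch of the same idiom. The `first_existing` helper and the file names are made up for illustration and do not appear in the project.

```python
import tempfile
from pathlib import Path


def first_existing(candidates: list[Path], default: Path) -> Path:
    """Return the first candidate that exists on disk, else the default."""
    return next((p for p in candidates if p.exists()), default)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    preferred = root / "preferred.csv"  # deliberately not created
    legacy = root / "legacy.csv"
    legacy.write_text("id;word\n", encoding="utf-8")

    # Only the legacy file exists, so it wins over the preferred name.
    chosen_name = first_existing([preferred, legacy], default=preferred).name
    # Nothing exists here, so the default is returned unchanged.
    fallback_name = first_existing([root / "missing.csv"], default=preferred).name
```

The default matters because `next()` on an exhausted generator would otherwise raise `StopIteration` at import time.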
|
||||
|
||||
|
|
@ -105,21 +110,12 @@ TENSE_DESCRIPTION = {
|
|||
"infinitive": "מְקוֹר",
|
||||
}
|
||||
|
||||
BINYAN_NAMES: tuple[str, ...] = (
|
||||
"Pa'al", "Nif'al", "Pi'el", "Pu'al", "Hitpa'el", "Hif'il", "Huf'al"
|
||||
)
|
||||
BINYAN_NAMES: tuple[str, ...] = ("Pa'al", "Nif'al", "Pi'el", "Pu'al", "Hitpa'el", "Hif'il", "Huf'al")
|
||||
|
||||
session = requests.Session()
|
||||
session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; pealim-anki/2.0)"})
|
||||
|
||||
|
||||
def _strip_nikkud(text: str) -> str:
|
||||
"""Remove Hebrew nikkud (diacritics) from a string."""
|
||||
return "".join(
|
||||
ch for ch in unicodedata.normalize("NFD", text)
|
||||
if unicodedata.category(ch) != "Mn"
|
||||
)
|
||||
|
||||
|
||||
def _build_pos_lookup() -> dict[str, str]:
|
||||
"""Build word_stripped → binyan dict from pealim_dict_for_anki.csv."""
|
||||
|
|
@ -129,6 +125,7 @@ def _build_pos_lookup() -> dict[str, str]:
|
|||
|
||||
try:
|
||||
import pandas as pd
|
||||
|
||||
try:
|
||||
df = pd.read_csv(DICT_CSV, sep=";", index_col=0)
|
||||
if df.shape[1] < 3:
|
||||
|
|
@ -305,7 +302,7 @@ def _parse_table(soup: BeautifulSoup, passive: bool = False, table_el=None) -> d
|
|||
if present_row >= 0:
|
||||
hf = first_heb_forms(present_row)
|
||||
keys = ["present_ms", "present_fs", "present_mp", "present_fp"]
|
||||
for k, (v, au) in zip(keys, hf):
|
||||
for k, (v, au) in zip(keys, hf, strict=False):
|
||||
store(k, v, au)
|
||||
|
||||
# Past tense
|
||||
|
|
@ -319,13 +316,13 @@ def _parse_table(soup: BeautifulSoup, passive: bool = False, table_el=None) -> d
|
|||
if past_row + 1 < len(rows):
|
||||
hf2 = first_heb_forms(past_row + 1)
|
||||
keys2 = ["past_2ms", "past_2fs", "past_2mp", "past_2fp"]
|
||||
for k, (v, au) in zip(keys2, hf2):
|
||||
for k, (v, au) in zip(keys2, hf2, strict=False):
|
||||
store(k, v, au)
|
||||
|
||||
if past_row + 2 < len(rows):
|
||||
unique3 = deduplicate(first_heb_forms(past_row + 2))
|
||||
keys3 = ["past_3ms", "past_3fs", "past_3p"]
|
||||
for k, (v, au) in zip(keys3, unique3):
|
||||
for k, (v, au) in zip(keys3, unique3, strict=False):
|
||||
store(k, v, au)
|
||||
|
||||
# Future tense
|
||||
|
|
@ -339,20 +336,20 @@ def _parse_table(soup: BeautifulSoup, passive: bool = False, table_el=None) -> d
|
|||
if future_row + 1 < len(rows):
|
||||
hf2 = first_heb_forms(future_row + 1)
|
||||
keys2 = ["future_2ms", "future_2fs", "future_2mp", "future_2fp"]
|
||||
for k, (v, au) in zip(keys2, hf2):
|
||||
for k, (v, au) in zip(keys2, hf2, strict=False):
|
||||
store(k, v, au)
|
||||
|
||||
if future_row + 2 < len(rows):
|
||||
hf3 = first_heb_forms(future_row + 2)
|
||||
keys3 = ["future_3ms", "future_3fs", "future_3mp", "future_3fp"]
|
||||
for k, (v, au) in zip(keys3, hf3):
|
||||
for k, (v, au) in zip(keys3, hf3, strict=False):
|
||||
store(k, v, au)
|
||||
|
||||
# Imperative
|
||||
if imp_row >= 0:
|
||||
hf = first_heb_forms(imp_row)
|
||||
keys = ["imperative_ms", "imperative_fs", "imperative_mp", "imperative_fp"]
|
||||
for k, (v, au) in zip(keys, hf):
|
||||
for k, (v, au) in zip(keys, hf, strict=False):
|
||||
store(k, v, au)
|
||||
|
||||
# Infinitive
|
||||
|
|
@ -399,7 +396,9 @@ def _extract_passive_binyan_from_page(soup: BeautifulSoup) -> str:
|
|||
return ""
|
||||
|
||||
|
||||
def _extract_conjugations(slug: str, search_term: str, is_3ms_search: bool = False, binyan_hint: str = "") -> dict | None:
|
||||
def _extract_conjugations(
|
||||
slug: str, search_term: str, is_3ms_search: bool = False, binyan_hint: str = ""
|
||||
) -> dict | None:
|
||||
"""Fetch /dict/<slug>/ and parse conjugation table (active + passive)."""
|
||||
url = f"{PEALIM_BASE}/dict/{slug}/"
|
||||
try:
|
||||
|
|
@ -411,6 +410,12 @@ def _extract_conjugations(slug: str, search_term: str, is_3ms_search: bool = Fal
|
|||
|
||||
soup = BeautifulSoup(resp.text, "lxml")
|
||||
|
||||
# Extract meaning from <div class="lead"> (English translation)
|
||||
meaning = ""
|
||||
lead_div = soup.find("div", class_="lead")
|
||||
if lead_div:
|
||||
meaning = lead_div.get_text(strip=True)
|
||||
|
||||
# Extract root
|
||||
root = ""
|
||||
for span in soup.find_all("span", class_="menukad"):
|
||||
|
|
@ -440,10 +445,7 @@ def _extract_conjugations(slug: str, search_term: str, is_3ms_search: bool = Fal
|
|||
infinitive_form = forms_raw.get("infinitive", {}).get("form", "") if not is_passive else ""
|
||||
past_3ms_form = forms_raw.get("past_3ms", {}).get("form", "")
|
||||
|
||||
if is_passive:
|
||||
reference_form = past_3ms_form or search_term
|
||||
else:
|
||||
reference_form = infinitive_form or search_term
|
||||
reference_form = (past_3ms_form or search_term) if is_passive else (infinitive_form or search_term)
|
||||
|
||||
# Build active result
|
||||
result = {
|
||||
|
|
@ -451,6 +453,7 @@ def _extract_conjugations(slug: str, search_term: str, is_3ms_search: bool = Fal
|
|||
"slug": slug,
|
||||
"root": root,
|
||||
"binyan": binyan,
|
||||
"meaning": meaning,
|
||||
"is_passive": is_passive,
|
||||
"reference_form": reference_form,
|
||||
"forms": {},
|
||||
|
|
@ -474,10 +477,7 @@ def _extract_conjugations(slug: str, search_term: str, is_3ms_search: bool = Fal
|
|||
passive_table_ids = {
|
||||
id(t) for t in (passive_h3.find_all_next("table", class_="conjugation-table") if passive_h3 else [])
|
||||
}
|
||||
active_tables = [
|
||||
t for t in soup.find_all("table", class_="conjugation-table")
|
||||
if id(t) not in passive_table_ids
|
||||
]
|
||||
active_tables = [t for t in soup.find_all("table", class_="conjugation-table") if id(t) not in passive_table_ids]
|
||||
if len(active_tables) >= 2:
|
||||
alt_raw = _parse_table(soup, passive=False, table_el=active_tables[1])
|
||||
alternate_forms = {}
|
||||
|
|
@ -521,6 +521,12 @@ def _extract_passive_from_active_slug(active_slug: str, search_term: str, binyan
|
|||
|
||||
soup = BeautifulSoup(resp.text, "lxml")
|
||||
|
||||
# Extract meaning (this is the active verb's meaning — useful context for passive)
|
||||
meaning = ""
|
||||
lead_div = soup.find("div", class_="lead")
|
||||
if lead_div:
|
||||
meaning = lead_div.get_text(strip=True)
|
||||
|
||||
root = ""
|
||||
for span in soup.find_all("span", class_="menukad"):
|
||||
txt = span.get_text(strip=True)
|
||||
|
|
@ -548,6 +554,7 @@ def _extract_passive_from_active_slug(active_slug: str, search_term: str, binyan
|
|||
"slug": active_slug,
|
||||
"root": root,
|
||||
"binyan": passive_binyan,
|
||||
"meaning": meaning,
|
||||
"is_passive": True,
|
||||
"reference_form": active_infinitive or search_term,
|
||||
"forms": {},
|
||||
|
|
@ -584,8 +591,13 @@ def main(verbs_file: Path = VERBS_INPUT) -> dict:
|
|||
|
||||
# Map section header keywords → binyan name (for binyan_hint fallback)
|
||||
SECTION_BINYAN = {
|
||||
"pa'al": "Pa'al", "nif'al": "Nif'al", "pi'el": "Pi'el",
|
||||
"pu'al": "Pu'al", "hitpa'el": "Hitpa'el", "hif'il": "Hif'il", "huf'al": "Huf'al",
|
||||
"pa'al": "Pa'al",
|
||||
"nif'al": "Nif'al",
|
||||
"pi'el": "Pi'el",
|
||||
"pu'al": "Pu'al",
|
||||
"hitpa'el": "Hitpa'el",
|
||||
"hif'il": "Hif'il",
|
||||
"huf'al": "Huf'al",
|
||||
}
|
||||
|
||||
# Parse: regular verbs and # 3ms: lines (optional active slug on 3ms lines)
|
||||
|
|
@ -612,8 +624,7 @@ def main(verbs_file: Path = VERBS_INPUT) -> dict:
|
|||
else:
|
||||
verbs.append((stripped, False, None, current_binyan_hint))
|
||||
|
||||
logger.info(f"Loaded {len(verbs)} verbs from {verbs_file} "
|
||||
f"({sum(1 for _, p, _, _ in verbs if p)} passive 3ms)")
|
||||
logger.info(f"Loaded {len(verbs)} verbs from {verbs_file} ({sum(1 for _, p, _, _ in verbs if p)} passive 3ms)")
|
||||
if slug_overrides:
|
||||
logger.info(f" Slug overrides: {slug_overrides}")
|
||||
|
||||
|
|
|
|||
|
|
@ -175,7 +175,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to guard; to keep, to maintain (על)"
|
||||
},
|
||||
"ללמוד": {
|
||||
"infinitive": "ללמוד",
|
||||
|
|
@ -353,7 +354,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to learn, to study"
|
||||
},
|
||||
"לאסוף": {
|
||||
"infinitive": "לאסוף",
|
||||
|
|
@ -531,7 +533,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to collect, to pick up, to reap"
|
||||
},
|
||||
"לעבוד": {
|
||||
"infinitive": "לעבוד",
|
||||
|
|
@ -709,7 +712,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to work; to operate, to function"
|
||||
},
|
||||
"לחבוש": {
|
||||
"infinitive": "לחבוש",
|
||||
|
|
@ -887,7 +891,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to bandage; to put on (a hat)"
|
||||
},
|
||||
"לאכול": {
|
||||
"infinitive": "לאכול",
|
||||
|
|
@ -1065,7 +1070,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to eat"
|
||||
},
|
||||
"לשאול": {
|
||||
"infinitive": "לשאול",
|
||||
|
|
@ -1243,7 +1249,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to ask; to borrow"
|
||||
},
|
||||
"לשלוח": {
|
||||
"infinitive": "לשלוח",
|
||||
|
|
@ -1421,7 +1428,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to send, to dispatch"
|
||||
},
|
||||
"לגבוה": {
|
||||
"infinitive": "לגבוה",
|
||||
|
|
@ -1599,7 +1607,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to be high, exalted"
|
||||
},
|
||||
"לשבת": {
|
||||
"infinitive": "לשבת",
|
||||
|
|
@ -1777,7 +1786,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to sit, to settle"
|
||||
},
|
||||
"לרשת": {
|
||||
"infinitive": "לרשת",
|
||||
|
|
@ -1955,7 +1965,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to inherit"
|
||||
},
|
||||
"לִיפּוֹל": {
|
||||
"infinitive": "לִיפּוֹל",
|
||||
|
|
@ -2133,7 +2144,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to fall, to drop"
|
||||
},
|
||||
"לקום": {
|
||||
"infinitive": "לקום",
|
||||
|
|
@ -2311,7 +2323,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to get up, to stand up, to arise; to be established, to come into being"
|
||||
},
|
||||
"לחון": {
|
||||
"infinitive": "לחון",
|
||||
|
|
@ -2489,7 +2502,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to pardon, to amnesty; to endow"
|
||||
},
|
||||
"לקרוא": {
|
||||
"infinitive": "לקרוא",
|
||||
|
|
@ -2667,7 +2681,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to read (ב-, את); to call (ל-)"
|
||||
},
|
||||
"לקנות": {
|
||||
"infinitive": "לקנות",
|
||||
|
|
@ -2845,7 +2860,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to buy, to purchase"
|
||||
},
|
||||
"להיבדק": {
|
||||
"infinitive": "להיבדק",
|
||||
|
|
@ -3023,7 +3039,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to be tested, examined"
|
||||
},
|
||||
"להרדם": {
|
||||
"infinitive": "להרדם",
|
||||
|
|
@ -3201,7 +3218,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to fall asleep, to doze off"
|
||||
},
|
||||
"להיהרג": {
|
||||
"infinitive": "להיהרג",
|
||||
|
|
@ -3379,7 +3397,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to be killed"
|
||||
},
|
||||
"להחקר": {
|
||||
"infinitive": "להחקר",
|
||||
|
|
@ -3557,7 +3576,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to be investigated, explored"
|
||||
},
|
||||
"להישאר": {
|
||||
"infinitive": "להישאר",
|
||||
|
|
@ -3735,7 +3755,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to remain"
|
||||
},
|
||||
"להיפגע": {
|
||||
"infinitive": "להיפגע",
|
||||
|
|
@ -3913,7 +3934,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to be damaged, to be injured, to be wounded; to be insulted, to be offended"
|
||||
},
|
||||
"להיוולד": {
|
||||
"infinitive": "להיוולד",
|
||||
|
|
@ -4091,7 +4113,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to be born"
|
||||
},
|
||||
"להנצל": {
|
||||
"infinitive": "להנצל",
|
||||
|
|
@ -4269,7 +4292,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to be saved, to be rescued, to survive"
|
||||
},
|
||||
"להיסוג": {
|
||||
"infinitive": "להיסוג",
|
||||
|
|
@ -4447,7 +4471,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to withdraw, to retreat"
|
||||
},
|
||||
"להימצא": {
|
||||
"infinitive": "להימצא",
|
||||
|
|
@ -4625,7 +4650,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to be found, discovered; to be present, to be located"
|
||||
},
|
||||
"להיבנות": {
|
||||
"infinitive": "להיבנות",
|
||||
|
|
@ -4803,7 +4829,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to be built, constructed"
|
||||
},
|
||||
"לדבר": {
|
||||
"infinitive": "לדבר",
|
||||
|
|
@ -5130,7 +5157,8 @@
|
|||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to speak, to talk"
|
||||
},
|
||||
"לברך": {
|
||||
"infinitive": "לברך",
|
||||
|
|
@ -5457,7 +5485,8 @@
|
|||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to bless, to greet, to felicitate"
|
||||
},
|
||||
"לנהל": {
|
||||
"infinitive": "לנהל",
|
||||
|
|
@ -5784,7 +5813,8 @@
|
|||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to manage, to organize"
|
||||
},
|
||||
"לנצח": {
|
||||
"infinitive": "לנצח",
|
||||
|
|
@ -6111,7 +6141,8 @@
|
|||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to win; to overcome, to beat; to conduct, to orchestrate"
|
||||
},
|
||||
"לקומם": {
|
||||
"infinitive": "לקומם",
|
||||
|
|
@ -6438,7 +6469,8 @@
|
|||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to outrage, to anger"
|
||||
},
|
||||
"למלא": {
|
||||
"infinitive": "למלא",
|
||||
|
|
@ -6765,7 +6797,8 @@
|
|||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to fill; to fill out; to fulfil"
|
||||
},
|
||||
"לחכות": {
|
||||
"infinitive": "לחכות",
|
||||
|
|
@ -7092,7 +7125,8 @@
|
|||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to await, to wait for (ל-)"
|
||||
},
|
||||
"לגלגל": {
|
||||
"infinitive": "לגלגל",
|
||||
|
|
@ -7419,7 +7453,8 @@
|
|||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to roll, to revolve (transitive)"
|
||||
},
|
||||
"להתלבש": {
|
||||
"infinitive": "להתלבש",
|
||||
|
|
@ -7597,7 +7632,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to dress oneself"
|
||||
},
|
||||
"להסתלק": {
|
||||
"infinitive": "להסתלק",
|
||||
|
|
@ -7775,7 +7811,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to leave, to go away"
|
||||
},
|
||||
"להצטלם": {
|
||||
"infinitive": "להצטלם",
|
||||
|
|
@ -7953,7 +7990,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to pose for a photograph, to be photographed"
|
||||
},
|
||||
"להזדקק": {
|
||||
"infinitive": "להזדקק",
|
||||
|
|
@ -8131,7 +8169,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to need, to require (ל-)"
|
||||
},
|
||||
"להתנהג": {
|
||||
"infinitive": "להתנהג",
|
||||
|
|
@ -8309,7 +8348,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to behave"
|
||||
},
|
||||
"להתקומם": {
|
||||
"infinitive": "להתקומם",
|
||||
|
|
@ -8487,7 +8527,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to rebel, to revolt"
|
||||
},
|
||||
"להתפלא": {
|
||||
"infinitive": "להתפלא",
|
||||
|
|
@ -8665,7 +8706,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to wonder, to be surprised"
|
||||
},
|
||||
"להתקלקל": {
|
||||
"infinitive": "להתקלקל",
|
||||
|
|
@ -8843,7 +8885,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to be damaged, to be spoiled (of food products)"
|
||||
},
|
||||
"להכניס": {
|
||||
"infinitive": "להכניס",
|
||||
|
|
@ -9170,7 +9213,8 @@
|
|||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to insert, to bring in"
|
||||
},
|
||||
"להעסיק": {
|
||||
"infinitive": "להעסיק",
|
||||
|
|
@ -9497,7 +9541,8 @@
|
|||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to keep busy; to employ"
|
||||
},
|
||||
"להחליט": {
|
||||
"infinitive": "להחליט",
|
||||
|
|
@ -9824,7 +9869,8 @@
|
|||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to decide"
|
||||
},
|
||||
"להבטיח": {
|
||||
"infinitive": "להבטיח",
|
||||
|
|
@ -10151,7 +10197,8 @@
|
|||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to ensure, to promise"
|
||||
},
|
||||
"להוריד": {
|
||||
"infinitive": "להוריד",
|
||||
|
|
@ -10478,7 +10525,8 @@
|
|||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to lower, to reduce; to download (computing)"
|
||||
},
|
||||
"להפיל": {
|
||||
"infinitive": "להפיל",
|
||||
|
|
@ -10805,7 +10853,8 @@
|
|||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to drop, to throw down"
|
||||
},
|
||||
"להקים": {
|
||||
"infinitive": "להקים",
|
||||
|
|
@ -11132,7 +11181,8 @@
|
|||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to build, to found, to establish"
|
||||
},
|
||||
"להמציא": {
|
||||
"infinitive": "להמציא",
|
||||
|
|
@ -11459,7 +11509,8 @@
|
|||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to invent; to make up; to present"
|
||||
},
|
||||
"להרשות": {
|
||||
"infinitive": "להרשות",
|
||||
|
|
@ -11786,7 +11837,8 @@
|
|||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to allow, to permit"
|
||||
},
|
||||
"להקל": {
|
||||
"infinitive": "להקל",
|
||||
|
|
@ -12113,7 +12165,8 @@
|
|||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to ease, to alleviate"
|
||||
},
|
||||
"לָשִׂים": {
|
||||
"infinitive": "לָשִׂים",
|
||||
|
|
@ -12291,7 +12344,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to put, to put on"
|
||||
},
|
||||
"בוטל": {
|
||||
"infinitive": "בוטל",
|
||||
|
|
@ -12439,7 +12493,8 @@
|
|||
"pronoun": "הֵן",
|
||||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to cancel, to undo"
|
||||
},
|
||||
"תואם": {
|
||||
"infinitive": "תואם",
|
||||
|
|
@ -12587,7 +12642,8 @@
|
|||
"pronoun": "הֵן",
|
||||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to coordinate"
|
||||
},
|
||||
"קומם": {
|
||||
"infinitive": "קומם",
|
||||
|
|
@ -12735,7 +12791,8 @@
|
|||
"pronoun": "הֵן",
|
||||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to outrage, to anger"
|
||||
},
|
||||
"דוכא": {
|
||||
"infinitive": "דוכא",
|
||||
|
|
@ -12883,7 +12940,8 @@
|
|||
"pronoun": "הֵן",
|
||||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to oppress, to crush; to cause depression"
|
||||
},
|
||||
"זוכה": {
|
||||
"infinitive": "זוכה",
|
||||
|
|
@ -13031,7 +13089,8 @@
|
|||
"pronoun": "הֵן",
|
||||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to achieve; to credit"
|
||||
},
|
||||
"פורסם": {
|
||||
"infinitive": "פורסם",
|
||||
|
|
@ -13179,7 +13238,8 @@
|
|||
"pronoun": "הֵן",
|
||||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to advertise, to publish, to publicize"
|
||||
},
|
||||
"הוגבל": {
|
||||
"infinitive": "הוגבל",
|
||||
|
|
@ -13327,7 +13387,8 @@
|
|||
"pronoun": "הֵן",
|
||||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to limit, to restrict, to confine"
|
||||
},
|
||||
"העבר": {
|
||||
"infinitive": "העבר",
|
||||
|
|
@ -13475,7 +13536,8 @@
|
|||
"pronoun": "הֵן",
|
||||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to transfer, to pass something"
|
||||
},
|
||||
"הוזהר": {
|
||||
"infinitive": "הוזהר",
|
||||
|
|
@ -13623,7 +13685,8 @@
|
|||
"pronoun": "הֵן",
|
||||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to warn"
|
||||
},
|
||||
"הופל": {
|
||||
"infinitive": "הופל",
|
||||
|
|
@ -13771,7 +13834,8 @@
|
|||
"pronoun": "הֵן",
|
||||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to drop, to throw down"
|
||||
},
|
||||
"הוקם": {
|
||||
"infinitive": "הוקם",
|
||||
|
|
@ -13919,7 +13983,8 @@
|
|||
"pronoun": "הֵן",
|
||||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to build, to found, to establish"
|
||||
},
|
||||
"הוחל": {
|
||||
"infinitive": "הוחל",
|
||||
|
|
@ -14067,7 +14132,8 @@
|
|||
"pronoun": "הֵן",
|
||||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to apply, to enforce, to put in force"
|
||||
},
|
||||
"הוקפא": {
|
||||
"infinitive": "הוקפא",
|
||||
|
|
@ -14215,7 +14281,8 @@
|
|||
"pronoun": "הֵן",
|
||||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to freeze (something)"
|
||||
},
|
||||
"הופנה": {
|
||||
"infinitive": "הופנה",
|
||||
|
|
@ -14363,7 +14430,8 @@
|
|||
"pronoun": "הֵן",
|
||||
"tense": "עָתִיד"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to direct; to refer someone"
|
||||
},
|
||||
"להתקלח": {
|
||||
"infinitive": "להתקלח",
|
||||
|
|
@ -14541,7 +14609,8 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to take a shower"
|
||||
},
|
||||
"להתגלות": {
|
||||
"infinitive": "להתגלות",
|
||||
|
|
@ -14719,6 +14788,162 @@
|
|||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
}
|
||||
},
|
||||
"meaning": "to be discovered, to appear"
|
||||
},
|
||||
"להיות": {
|
||||
"infinitive": "להיות",
|
||||
"slug": "454-lihyot",
|
||||
"root": "ה - י - ה",
|
||||
"binyan": "Pa'al",
|
||||
"is_passive": false,
|
||||
"reference_form": "לִהְיוֹת",
|
||||
"forms": {
|
||||
"past_1s": {
|
||||
"form": "הָיִיתִי",
|
||||
"audio_url": "https://audio.pealim.com/v0/bx/bxtedharx4kd.mp3",
|
||||
"pronoun": "אֲנִי",
|
||||
"tense": "עָבָר"
|
||||
},
|
||||
"past_1p": {
|
||||
"form": "הָיִינוּ",
|
||||
"audio_url": "https://audio.pealim.com/v0/bz/bztr7bt7yw8j.mp3",
|
||||
"pronoun": "אֲנַחְנוּ",
|
||||
"tense": "עָבָר"
|
||||
},
|
||||
"past_2ms": {
|
||||
"form": "הָיִיתָ",
|
||||
"audio_url": "https://audio.pealim.com/v0/1i/1imxfddysg8d8.mp3",
|
||||
"pronoun": "אַתָּה",
|
||||
"tense": "עָבָר"
|
||||
},
|
||||
"past_2fs": {
|
||||
"form": "הָיִית",
|
||||
"audio_url": "https://audio.pealim.com/v0/si/sizbwqsi2wej.mp3",
|
||||
"pronoun": "אַתְּ",
|
||||
"tense": "עָבָר"
|
||||
},
|
||||
"past_2mp": {
|
||||
"form": "הֱיִיתֶם",
|
||||
"audio_url": "https://audio.pealim.com/v0/31/31081nk4lvxj.mp3",
|
||||
"pronoun": "אַתֶּם",
|
||||
"tense": "עָבָר"
|
||||
},
|
||||
"past_2fp": {
|
||||
"form": "הֱיִיתֶן",
|
||||
"audio_url": "https://audio.pealim.com/v0/30/30zpav63u9ig.mp3",
|
||||
"pronoun": "אַתֶּן",
|
||||
"tense": "עָבָר"
|
||||
},
|
||||
"past_3ms": {
|
||||
"form": "הָיָה",
|
||||
"audio_url": "https://audio.pealim.com/v0/1h/1hxhgoyxra6fs.mp3",
|
||||
"pronoun": "הוּא",
|
||||
"tense": "עָבָר"
|
||||
},
|
||||
"past_3fs": {
|
||||
"form": "הָיְתָה",
|
||||
"audio_url": "https://audio.pealim.com/v0/17/17fb6fulu2da8.mp3",
|
||||
"pronoun": "הִיא",
|
||||
"tense": "עָבָר"
|
||||
},
|
||||
"past_3p": {
|
||||
"form": "הָיוּ",
|
||||
"audio_url": "https://audio.pealim.com/v0/1h/1hxhgf26s3ou9.mp3",
|
||||
"pronoun": "הֵם / הֵן",
|
||||
"tense": "עָבָר"
|
||||
},
|
||||
"future_1s": {
|
||||
"form": "אֶהְיֶה",
|
||||
"audio_url": "https://audio.pealim.com/v0/at/atd2i0kljhge.mp3",
|
||||
"pronoun": "אֲנִי",
|
||||
"tense": "עָתִיד"
|
||||
},
|
||||
"future_1p": {
|
||||
"form": "נִהְיֶה",
|
||||
"audio_url": "https://audio.pealim.com/v0/2a/2a41xa7h8jei.mp3",
|
||||
"pronoun": "אֲנַחְנוּ",
|
||||
"tense": "עָתִיד"
|
||||
},
|
||||
"future_2ms": {
|
||||
"form": "תִּהְיֶה",
|
||||
"audio_url": "https://audio.pealim.com/v0/g6/g6saa9abkllk.mp3",
|
||||
"pronoun": "אַתָּה",
|
||||
"tense": "עָתִיד"
|
||||
},
|
||||
"future_2fs": {
|
||||
"form": "תִּהְיִי",
|
||||
"audio_url": "https://audio.pealim.com/v0/g6/g6s9q8uugtnx.mp3",
|
||||
"pronoun": "אַתְּ",
|
||||
"tense": "עָתִיד"
|
||||
},
|
||||
"future_2mp": {
|
||||
"form": "תִּהְיוּ",
|
||||
"audio_url": "https://audio.pealim.com/v0/g6/g6sjf854r5a7.mp3",
|
||||
"pronoun": "אַתֶּם",
|
||||
"tense": "עָתִיד"
|
||||
},
|
||||
"future_2fp": {
|
||||
"form": "תִּהְיֶינָה",
|
||||
"audio_url": "https://audio.pealim.com/v0/12/12upso035jy8g.mp3",
|
||||
"pronoun": "אַתֶּן",
|
||||
"tense": "עָתִיד"
|
||||
},
|
||||
"future_3ms": {
|
||||
"form": "יִהְיֶה",
|
||||
"audio_url": "https://audio.pealim.com/v0/yy/yyo97spf6rob.mp3",
|
||||
"pronoun": "הוּא",
|
||||
"tense": "עָתִיד"
|
||||
},
|
||||
"future_3fs": {
|
||||
"form": "תִּהְיֶה",
|
||||
"audio_url": "https://audio.pealim.com/v0/g6/g6saa9abkllk.mp3",
|
||||
"pronoun": "הִיא",
|
||||
"tense": "עָתִיד"
|
||||
},
|
||||
"future_3mp": {
|
||||
"form": "יִהְיוּ",
|
||||
"audio_url": "https://audio.pealim.com/v0/yy/yyo02tum07zo.mp3",
|
||||
"pronoun": "הֵם",
|
||||
"tense": "עָתִיד"
|
||||
},
|
||||
"future_3fp": {
|
||||
"form": "תִּהְיֶינָה",
|
||||
"audio_url": "https://audio.pealim.com/v0/12/12upso035jy8g.mp3",
|
||||
"pronoun": "הֵן",
|
||||
"tense": "עָתִיד"
|
||||
},
|
||||
"imperative_ms": {
|
||||
"form": "הֱיֵה!",
|
||||
"audio_url": "https://audio.pealim.com/v0/1h/1hxjabs7uspli.mp3",
|
||||
"pronoun": "אַתָּה",
|
||||
"tense": "צִוּוּי"
|
||||
},
|
||||
"imperative_fs": {
|
||||
"form": "הֱיִי!",
|
||||
"audio_url": "https://audio.pealim.com/v0/1h/1hxjac2th43as.mp3",
|
||||
"pronoun": "אַתְּ",
|
||||
"tense": "צִוּוּי"
|
||||
},
|
||||
"imperative_mp": {
|
||||
"form": "הֱיוּ!",
|
||||
"audio_url": "https://audio.pealim.com/v0/1h/1hxja0tjuptcu.mp3",
|
||||
"pronoun": "אַתֶּם",
|
||||
"tense": "צִוּוּי"
|
||||
},
|
||||
"imperative_fp": {
|
||||
"form": "הֱיֶינָה!",
|
||||
"audio_url": "https://audio.pealim.com/v0/xe/xef6kg7mexvb.mp3",
|
||||
"pronoun": "אַתֶּן",
|
||||
"tense": "צִוּוּי"
|
||||
},
|
||||
"infinitive": {
|
||||
"form": "לִהְיוֹת",
|
||||
"audio_url": "https://audio.pealim.com/v0/1n/1nej50k4t35xi.mp3",
|
||||
"pronoun": "",
|
||||
"tense": "מְקוֹר"
|
||||
}
|
||||
},
|
||||
"meaning": "to be"
|
||||
}
|
||||
}
|
||||
15904
data/epub_sentence_index.json
Normal file
File diff suppressed because it is too large
Load diff
186078
data/ktiv_male_forms.json
Normal file
File diff suppressed because it is too large
Load diff
9242
data/legacy_guid_map.json
Normal file
File diff suppressed because it is too large
Load diff
37457
data/noun_plurals.json
Normal file
File diff suppressed because it is too large
Load diff
29598
data/noun_slug_map.json
Normal file
File diff suppressed because it is too large
Load diff
442
data/refined_meanings.json
Normal file
|
|
@ -0,0 +1,442 @@
|
|||
{
|
||||
"שְׁלָל": "abundance; loot, plunder, spoils",
|
||||
"שֶׁפַע": "abundance, plenty, profusion",
|
||||
"נַר": "acquaintance (person one knows)",
|
||||
"הֶכֵּרוּת": "acquaintance (the state of knowing someone)",
|
||||
"כְּתֹבֶת": "address (postal/location)",
|
||||
"מַעַן": "address (formal, for the sake of; destination)",
|
||||
"שׁוּב": "again (once more, to repeat an action)",
|
||||
"שֵׁנִית": "again; a second time, secondly",
|
||||
"כְּנֶגֶד": "against; compared to, as opposed to",
|
||||
"מוּל": "opposite, facing; against",
|
||||
"נֶגֶד": "against; contrary to",
|
||||
"נֶכֶס": "asset, property (financial/material possession)",
|
||||
"קִנְיָן": "asset, property; possession, ownership (abstract or acquired)",
|
||||
"הִתְבּוֹלְלוּת": "assimilation (cultural/ethnic blending in)",
|
||||
"הִטַּמְּעוּת": "assimilation (absorption, integration into surroundings)",
|
||||
"כְּפִיפָה": "basket (woven, traditional/biblical)",
|
||||
"סַל": "basket (general, everyday)",
|
||||
"מַשְׁמִים": "boring, dreary (causing desolation/boredom)",
|
||||
"מְשַׁעְמֵם": "boring, tedious (causing boredom, common usage)",
|
||||
"מַשָּׂא": "burden, load (heavy cargo; figurative weight)",
|
||||
"נֵטֶל": "burden, load; ballast (dead weight)",
|
||||
"טָרוּד": "busy, preoccupied (mentally troubled/distracted)",
|
||||
"עָסוּק": "busy, occupied (engaged in an activity)",
|
||||
"מַמְתָּק": "candy, sweet (generic confection)",
|
||||
"סֻכָּרִיָּה": "candy, sweet (individual wrapped candy piece)",
|
||||
"מַרְבָד": "carpet, rug (literary/poetic); bedspread",
|
||||
"שָׁטִיחַ": "carpet, rug (standard, everyday word)",
|
||||
"כַּרְפַּס": "celery (also: the Passover seder vegetable)",
|
||||
"סֶלֶרִי": "celery (modern loanword, everyday usage)",
|
||||
"שַׁלְשֶׁלֶת": "chain (figurative: chain of events, lineage)",
|
||||
"שַׁרְשֶׁרֶת": "chain (physical chain, links)",
|
||||
"אָפְיָן": "characteristic (trait, attribute of a person/thing)",
|
||||
"סַמְמָן": "characteristic; indicator, hallmark",
|
||||
"שׁוֹקוֹלָד": "chocolate (the substance, mass noun, masc.)",
|
||||
"שׁוֹקוֹלָדָה": "chocolate (a piece of chocolate; hot chocolate, fem.)",
|
||||
"עִגּוּל": "circle (the shape); rounding",
|
||||
"מַעֲגָל": "circle (circular path, cycle, circuit)",
|
||||
"נִקּוּי": "cleaning (the act of cleaning, removing dirt)",
|
||||
"נִקָּיוֹן": "cleanliness, tidiness (state of being clean)",
|
||||
"בִּקּוּעַ": "cleaving, splitting (a single crack or fissure)",
|
||||
"הִתְבַּקְּעוּת": "cleaving, splitting (the process of cracking apart)",
|
||||
"בְּעִילָה": "coitus, sexual intercourse (legal/halachic term)",
|
||||
"מִשְׁגָּל": "coitus, sexual intercourse (formal/literary)",
|
||||
"מִדְרָשָׁה": "college (religious seminary, study institute)",
|
||||
"מִכְלָלָה": "college (academic institution, secular)",
|
||||
"תַּחֲרוּת": "competition, contest (an event or rivalry)",
|
||||
"הִתְחָרוּת": "competition (the act/process of competing)",
|
||||
"לְגַמְרֵי": "completely, totally (colloquial, very common)",
|
||||
"כָּלִיל": "completely, entirely (literary/formal); wholly",
|
||||
"רְכִיב": "component (technical part, element in a system)",
|
||||
"מַרְכִּיב": "component, ingredient (constituent that makes up a whole)",
|
||||
"תַּבְעֵרָה": "conflagration, fire (intense blaze, biblical/literary)",
|
||||
"דְּלֵקָה": "fire (accidental fire, house fire, everyday)",
|
||||
"צַרְכָנוּת": "consumerism; consumer advocacy",
|
||||
"צְרִיכָה": "consumption (using up resources, usage)",
|
||||
"קֵרוּר": "cooling, refrigeration (active process of making cold)",
|
||||
"הִתְקָרְרוּת": "cooling (becoming cold); catching a cold",
|
||||
"חָשׁוּךְ": "dark (of a place, lacking light; figuratively bleak)",
|
||||
"כֵּהֶה": "dark (of a color, shade; dim)",
|
||||
"אֲפֵלָה": "darkness (deep gloom; figurative despair)",
|
||||
"אֹפֶל": "darkness (poetic/literary, deep darkness)",
|
||||
"חֹשֶׁךְ": "darkness (general, common word)",
|
||||
"יַקִּיר": "darling, dear (masculine form)",
|
||||
"יַקִּירָה": "darling, dear (feminine form)",
|
||||
"מִרְמָה": "deceit, fraud (cunning deception, trickery)",
|
||||
"תַּרְמִית": "deceit, fraud (a specific act of swindling)",
|
||||
"אֲבַדּוֹן": "destruction (total ruin, perdition; the abyss)",
|
||||
"הֶרֶס": "destruction, demolition (physical wreckage)",
|
||||
"הֶבְדֵּל": "difference, distinction (between two things)",
|
||||
"שֹׁנִי": "difference (variance, otherness)",
|
||||
"הֵעָלְמוּת": "disappearance (the act of vanishing, going missing)",
|
||||
"הֶעֱלֵם": "disappearance (concealment, suppression of information)",
|
||||
"נְדָבָה": "donation (voluntary, charitable gift; tip)",
|
||||
"תְּרוּמָה": "donation, contribution (formal; also: religious offering)",
|
||||
"הִשְׁתַּעְבְּדוּת": "enslavement (the process of becoming enslaved)",
|
||||
"שִׁעְבּוּד": "enslavement, subjugation; mortgaging (finance)",
|
||||
"טָעוּת": "mistake, error (common, everyday blunder)",
|
||||
"שְׁגִיאָה": "error, mistake (formal, technical error)",
|
||||
"הִתְאַדּוּת": "evaporation (natural process of turning to vapor)",
|
||||
"הִתְאַיְּדוּת": "evaporation (process of dissipating, vaporizing)",
|
||||
"דֻּגְמָה": "example, sample (concrete instance or specimen)",
|
||||
"מָשָׁל": "example; parable, allegory, proverb",
|
||||
"גּוֹלָה": "exile, diaspora (the community in exile)",
|
||||
"גָּלוּת": "exile, diaspora (the state/condition of being exiled)",
|
||||
"חֲוָיָה": "experience (a lived event, an adventure)",
|
||||
"הִתְנַסּוּת": "experience (the process of trying/experimenting)",
|
||||
"נִסָּיוֹן": "experience (accumulated knowledge); attempt, trial",
|
||||
"בֵּאוּר": "explanation, elucidation (detailed clarification)",
|
||||
"הֶסְבֵּר": "explanation (the act of explaining, making understood)",
|
||||
"פָּנִים": "face (standard word); surface",
|
||||
"פַּרְצוּף": "face (appearance, facial expression; colloquial)",
|
||||
"מֶחְדָּל": "failure, omission (negligent failure to act)",
|
||||
"כִּשָּׁלוֹן": "failure (general: failed attempt or endeavor)",
|
||||
"כֶּשֶׁל": "failure, malfunction (technical breakdown)",
|
||||
"תַּעְנִית": "fast (religious fast day, formal term)",
|
||||
"צוֹם": "fast, fasting (the act of fasting, general)",
|
||||
"תְּחוּשָׁה": "feeling, sensation (physical or gut feeling)",
|
||||
"הַרְגָּשָׁה": "feeling (emotional sense; well-being)",
|
||||
"רֶגֶשׁ": "feeling, emotion (inner emotional state)",
|
||||
"לֶהָבָה": "flame (common word for a flame)",
|
||||
"שַׁלְהֶבֶת": "flame (poetic/literary, blazing flame)",
|
||||
"כָּפִיף": "flexible, pliable (can be bent physically)",
|
||||
"מָתִיחַ": "flexible, elastic (stretchy, resilient)",
|
||||
"זֶרֶם": "flow, current (of water, electricity, or ideas)",
|
||||
"זְרִימָה": "flow, flowing (the act/process of flowing)",
|
||||
"אֹכֶל": "food (general, everyday word for food/meal)",
|
||||
"מַאֲכָל": "food (a specific dish, a prepared food item)",
|
||||
"מָזוֹן": "food, nourishment (sustenance, nutrition)",
|
||||
"חֹפֶשׁ": "freedom; vacation, time off (colloquial)",
|
||||
"חֵרוּת": "freedom, liberty (formal, political/ideological)",
|
||||
"הַקְפָּאָה": "freezing (active act of freezing something; a freeze/suspension)",
|
||||
"קִפָּאוֹן": "freezing; standstill, stagnation (frozen state)",
|
||||
"תְּדִירוּת": "frequency (how often something occurs)",
|
||||
"תֶּדֶר": "frequency (radio/physics frequency)",
|
||||
"תָּדִיר": "frequent, regular (happening at steady intervals)",
|
||||
"תָּכוּף": "frequent, rapid (happening in quick succession)",
|
||||
"גָּאוֹן": "genius (title of greatness; rabbinical title Gaon)",
|
||||
"עִלּוּי": "genius, prodigy (exceptionally gifted person)",
|
||||
"תְּשׁוּרָה": "gift, present (formal/literary offering)",
|
||||
"שַׁי": "gift, present (a token gift, small present)",
|
||||
"אַכְלָן": "glutton (big eater, food-lover, common)",
|
||||
"רְעַבְתָּן": "glutton (insatiably hungry person)",
|
||||
"מֶמְשֶׁלֶת": "government (construct state form, used in compounds)",
|
||||
"מֶמְשָׁלָה": "government (standard form)",
|
||||
"מֶמְשַׁלְתִּי": "governmental (relating to the government/cabinet)",
|
||||
"שִׁלְטוֹנִי": "governmental (relating to ruling authority/regime)",
|
||||
"חֹפֶן": "handful (cupped palm, a scooped amount)",
|
||||
"קֹמֶץ": "handful (a pinch, a small quantity)",
|
||||
"יָד": "handle (of a tool, door); hand",
|
||||
"יָדִית": "handle (a knob or grip, specifically a handle)",
|
||||
"כָּאן": "here (standard, common usage)",
|
||||
"פֹּה": "here (colloquial/informal variant)",
|
||||
"טָמוּן": "hidden (buried, latent, lying within)",
|
||||
"נִסְתָּר": "hidden, concealed (secret, mysterious; grammar: 3rd person)",
|
||||
"מֻצְנָע": "hidden, concealed (modestly tucked away, discreet)",
|
||||
"תְּמוּנָה": "image, picture (photo, illustration, scene)",
|
||||
"צֶלֶם": "image (likeness, form); idol",
|
||||
"הִתְרַשְּׁמוּת": "impression (the experience of being impressed)",
|
||||
"רֹשֶׁם": "impression (a mark left; an effect on someone)",
|
||||
"בִּפְנִים": "inside (location: on the inside, indoors)",
|
||||
"פְּנִימָה": "inside (direction: inward, toward the inside)",
|
||||
"עֶלְבּוֹן": "insult, offence (the slight or affront itself)",
|
||||
"הַעֲלָבָה": "insult (the act of insulting someone)",
|
||||
"פְּנִים": "interior, inside (inner part, inner side)",
|
||||
"קֶרֶב": "interior; innards, midst (among, in the thick of)",
|
||||
"תָּוֶךְ": "interior, inside; center, middle; essence",
|
||||
"תַּחְקִיר": "investigation (journalistic/official inquiry)",
|
||||
"חֲקִירָה": "investigation, inquiry (police/legal; research)",
|
||||
"רִנָּה": "joy; joyful song, singing (literary)",
|
||||
"מָשׂוֹשׂ": "joy, delight (source of joy, literary)",
|
||||
"גִּיל": "joy, elation (exuberant happiness; age)",
|
||||
"שִׂמְחָה": "joy, happiness (celebration, festive occasion)",
|
||||
"עֶלְצוֹן": "jubilance, exultation (archaic, the feeling)",
|
||||
"עֶלְצָה": "jubilance, exultation (archaic, feminine noun form)",
|
||||
"עָצֵל": "lazy, idle (basic adjective form)",
|
||||
"עַצְלָן": "lazy, lazybones (characteristically lazy person)",
|
||||
"תְּחִקָּה": "legislation (a specific statute or enacted law)",
|
||||
"חֲקִיקָה": "legislation (the process/act of legislating)",
|
||||
"הִתְהוֹלְלוּת": "licentiousness, revelry (wild raucous behavior)",
|
||||
"הוֹלֵלוּת": "licentiousness, debauchery (moral depravity)",
|
||||
"שׁוֹשָׁן": "lily (the flower, masculine; also: the name Shoshan)",
|
||||
"שׁוֹשַׁנָּה": "lily; rose (archaic); the name Shoshana",
|
||||
"הִמָּצְאוּת": "location; presence (being found/situated somewhere)",
|
||||
"מִקּוּם": "location, positioning (placing in a specific spot)",
|
||||
"נַעֲלֶה": "lofty, exalted (elevated, superior in quality)",
|
||||
"נִשְׂגָּב": "lofty, exalted (sublime, beyond reach, grand)",
|
||||
"תַּאֲוָה": "lust, craving (appetite, physical desire)",
|
||||
"תְּשׁוּקָה": "passion, desire (deep longing, yearning)",
|
||||
"אַחְזָקָה": "maintenance; holding (corporate; upkeep of property)",
|
||||
"תַּחְזוּקָה": "maintenance (technical upkeep of systems/equipment)",
|
||||
"תִּחְזוּק": "maintenance (the process/act of maintaining)",
|
||||
"מִנְהָל": "administration, management (the office/system)",
|
||||
"נִהוּל": "management (the act/process of managing)",
|
||||
"הַנְהָלָה": "management (the managing body, executive board)",
|
||||
"פֵּרוּשׁ": "meaning; interpretation, commentary",
|
||||
"מַשְׁמָעוּת": "meaning, significance (broader importance)",
|
||||
"מַשְׁמָע": "meaning, implication (what is implied)",
|
||||
"לַחַן": "melody, tune (a musical composition)",
|
||||
"נִגּוּן": "melody, tune (a chant; Hasidic wordless melody)",
|
||||
"נְעִימָה": "melody, tune; tone, intonation (of voice)",
|
||||
"נֵס": "miracle (divine intervention; common word)",
|
||||
"פֶּלֶא": "wonder, marvel (something astonishing)",
|
||||
"תְּזוּזָה": "movement (a budge, slight motion, shift)",
|
||||
"תְּנוּעָה": "movement (broad: traffic; organization; vowel mark)",
|
||||
"מִסְתּוֹרִין": "mystery (enigma, something hidden/secret)",
|
||||
"תַּעֲלוּמָה": "mystery (unsolved puzzle, unknown secret)",
|
||||
"עֵירֹם": "naked (completely nude, formal)",
|
||||
"עָרֹם": "naked (nude; also: shrewd, cunning in biblical Hebrew)",
|
||||
"אֻמָּה": "nation (a unified political/cultural entity)",
|
||||
"לְאֹם": "nation, people (ethnic group; literary/formal)",
|
||||
"זִלְזוּל": "negligence; contempt, disrespect (dismissive attitude)",
|
||||
"הִתְרַשְּׁלוּת": "negligence (carelessness, failure to take proper care)",
|
||||
"נֵיטְרָלִי": "neutral (politically/scientifically neutral, loanword)",
|
||||
"סְתָמִי": "neutral; vague, nondescript, generic",
|
||||
"אֲצֻלָּה": "nobility, aristocracy (the aristocratic class)",
|
||||
"אֲצִילוּת": "nobility (the quality of being noble, refinement)",
|
||||
"הִסְתַּכְּלוּת": "observation (looking, watching, contemplation)",
|
||||
"תַּצְפִּית": "observation (military/scientific lookout; observation post)",
|
||||
"מִכְשׁוֹל": "obstacle, stumbling block (impediment to progress)",
|
||||
"נֶגֶף": "obstacle; plague, affliction (biblical)",
|
||||
"עַל": "on, upon; about, regarding",
|
||||
"עַל גַּב": "on, upon (on the back/surface of)",
|
||||
"עַל גַּבֵּי": "on, upon (on top of, on the surface of)",
|
||||
"פְּקֻדָּה": "order, command (military/authoritative directive)",
|
||||
"צַו": "order, decree (legal injunction, official order)",
|
||||
"בָּחוּץ": "outside (location: on the outside, outdoors)",
|
||||
"הַחוּצָה": "outside (direction: outward, to the outside)",
|
||||
"מַאֲרָז": "package (a packed container, packaging)",
|
||||
"חֲבִילָה": "package, parcel (a bundle, a wrapped item)",
|
||||
"מְחִילָה": "pardon, forgiveness (personal, between individuals)",
|
||||
"סְלִיחָה": "pardon, forgiveness (also: excuse me; liturgical pardon)",
|
||||
"סַיֶּרֶת": "patrol (elite military unit, commando squad)",
|
||||
"סִיּוּר": "patrol; tour (a round of inspection or sightseeing)",
|
||||
"שָׂכָר": "payment; salary, wage (earned compensation)",
|
||||
"תַּשְׁלוּם": "payment (a single payment/installment; compensation)",
|
||||
"עֲצוּמָה": "petition (public petition with signatures)",
|
||||
"עֲתִירָה": "petition (legal petition, court appeal)",
|
||||
"דַּלּוּת": "poverty; meagerness, paucity (scarcity of quality/quantity)",
|
||||
"עֹנִי": "poverty (destitution, financial hardship)",
|
||||
"עָצְמָתִי": "powerful (having great inherent power)",
|
||||
"רַב עָצְמָה": "powerful (of great might, formidable)",
|
||||
"הַאֲמָרָה": "price increase (deliberate raising of prices)",
|
||||
"הִתְיַקְּרוּת": "price increase (becoming more expensive, rising costs)",
|
||||
"קִדְמָה": "progress (general/societal advancement, modernity)",
|
||||
"הִתְקַדְּמוּת": "progress (the process of advancing, making headway)",
|
||||
"הַסְבָּרָה": "propaganda; public diplomacy (Israeli hasbara)",
|
||||
"תַּעֲמוּלָה": "propaganda (political propaganda, agitation)",
|
||||
"סְמִיכוּת": "proximity; construct state (grammar term)",
|
||||
"קִרְבָה": "proximity; kinship, closeness (relational nearness)",
|
||||
"תְּהִלּוֹת": "Psalms (variant plural form)",
|
||||
"תְּהִלִּים": "Psalms (standard name for the Book of Psalms)",
|
||||
"קְנִיָּה": "purchase (a buy, an act of buying, everyday)",
|
||||
"רְכִישָׁה": "acquisition (formal purchase, procurement)",
|
||||
"בִּזְרִיזוּת": "quickly, nimbly (with agile efficiency)",
|
||||
"בִּמְהִירוּת": "quickly, at high speed (with velocity)",
|
||||
"רִיצָה": "running (the activity of running)",
|
||||
"מְרוּצָה": "race (a competitive running event)",
|
||||
"גְּאֻלָּה": "redemption (national/messianic deliverance)",
|
||||
"פְּדוּת": "redemption (ransoming, being redeemed; literary)",
|
||||
"הוֹצָאָה": "removal; expense, expenditure; publishing house",
|
||||
"הַסָּחָה": "removal; deflection, diversion, distraction",
|
||||
"יִצּוּג": "representation (acting on behalf of; depiction)",
|
||||
"נְצִיגוּת": "representation (the body of representatives, delegation)",
|
||||
"מְכִירָה": "sale (the act of selling, a transaction)",
|
||||
"מֶכֶר": "sale; merchandise, value (literary/biblical)",
|
||||
"יֶשַׁע": "salvation, deliverance (divine rescue, literary)",
|
||||
"תְּשׁוּעָה": "salvation, victory (triumphant rescue, literary)",
|
||||
"הַפְרָדָה": "separation (active act of separating things/people)",
|
||||
"הִפָּרְדוּת": "separation (the process of parting ways)",
|
||||
"חַד": "sharp (of edges, blades; clear-cut)",
|
||||
"חָרִיף": "sharp, acute; spicy, pungent; keen, witty",
|
||||
"חָסוּת": "shelter, patronage (protection under authority)",
|
||||
"מִקְלָט": "shelter, refuge (bomb shelter, safe haven, physical place)",
|
||||
"חֻלְצָה": "shirt, blouse (modern everyday word)",
|
||||
"כֻּתֹּנֶת": "shirt; tunic, gown (biblical/traditional garment)",
|
||||
"שֶׁקֶט": "silence, quiet (peaceful calm, serenity)",
|
||||
"שְׁתִיקָה": "silence (the act of keeping silent, not speaking)",
|
||||
"חֶטְא": "sin (a specific transgression, missing the mark)",
|
||||
"עָווֹן": "sin, iniquity (moral guilt; legal: misdemeanor)",
|
||||
"זִמְרָה": "singing (musical performance, song/hymn)",
|
||||
"רְנָנָה": "singing; joyful song, jubilant cry (literary)",
|
||||
"נָטוּי": "slanted, inclined (tilted, leaning; grammar: inflected)",
|
||||
"מְשֻׁפָּע": "slanted, inclined; having an abundance of something",
|
||||
"כִּשּׁוּף": "sorcery, witchcraft (dark magic, spellcasting)",
|
||||
"קֶסֶם": "magic, charm (enchantment, allure)",
|
||||
"נֶפֶשׁ": "soul (life force, self, being; appetite)",
|
||||
"נְשָׁמָה": "soul (divine breath of life, spiritual essence)",
|
||||
"מַצָּת": "spark plug (automotive ignition component)",
|
||||
"פְּלָג": "spark plug (variant/slang term)",
|
||||
"דּוֹבֵר": "speaker, spokesman (masculine form)",
|
||||
"דּוֹבֶרֶת": "speaker, spokeswoman (feminine form)",
|
||||
"סוּפָה": "storm, tempest (violent windstorm)",
|
||||
"סְעָרָה": "storm, tempest (raging storm; figurative turmoil)",
|
||||
"קַשׁ": "straw (dry stalks; figuratively: trivial thing)",
|
||||
"תֶּבֶן": "straw, hay (animal feed, dried grass)",
|
||||
"עִקֵּשׁ": "stubborn, obstinate (perversely rigid)",
|
||||
"עַקְשָׁן": "stubborn, obstinate (characteristically persistent/stubborn person)",
|
||||
"חָנִיךְ": "student, pupil (trainee, apprentice, cadet)",
|
||||
"תַּלְמִיד": "student, pupil (school student, common word)",
|
||||
"פִּקּוּחַ": "supervision (regulatory oversight, monitoring)",
|
||||
"הַשְׁגָּחָה": "supervision (watchful care, divine providence; kosher certification)",
|
||||
"הַסְפָּקָה": "supply, provision (the act of supplying goods)",
|
||||
"אַסְפָּקָה": "supply, provision (military/logistical provisioning)",
|
||||
"אֲרָעִי": "temporary, provisional (makeshift, not permanent)",
|
||||
"זְמַנִּי": "temporary, time-limited (for a limited period)",
|
||||
"אֵלֶה": "these (standard demonstrative pronoun)",
|
||||
"אֵלוּ": "these (literary/Mishnaic variant)",
|
||||
"בֹּהֶן": "thumb; big toe (anatomical term)",
|
||||
"אֲגוּדָל": "thumb (common/colloquial word for thumb)",
|
||||
"זְמַן": "time (general, measurable time; tense in grammar)",
|
||||
"עֵת": "time (a specific moment, epoch, literary/biblical)",
|
||||
"עִתּוּי": "timing (choosing the right moment)",
|
||||
"תִּזְמוּן": "timing (synchronization, technical scheduling)",
|
||||
"לְכַתֵּב": "to address (write an address on); to engrave",
|
||||
"לְמַעֵן": "to address (direct/target communication toward)",
|
||||
"לְזַיֵּן": "to arm (equip with weapons; vulgar slang)",
|
||||
"לְחַמֵּשׁ": "to arm (equip/furnish with armaments)",
|
||||
"לְהִתְאַסֵּף": "to assemble, to gather together (of people collecting)",
|
||||
"לְהִתְכַּנֵּס": "to assemble, to convene (a formal meeting/conference)",
|
||||
"לְהִכָּבֵל": "to be bound (chained, shackled with chains)",
|
||||
"לְהִכָּפֵת": "to be bound (handcuffed, tied up physically)",
|
||||
"לְהִבָּרֵא": "to be created (divine/fundamental creation, ex nihilo)",
|
||||
"לְהִוָּצֵר": "to be created (formed, shaped, manufactured)",
|
||||
"לְהִגָּזֵז": "to be cut off (sheared, trimmed, as hair/wool)",
|
||||
"לְהִגָּזֵר": "to be cut off (decreed, sentenced; derived from)",
|
||||
"לְהִקָּטֵעַ": "to be cut off (interrupted, severed abruptly)",
|
||||
"לְהִנָּגֵף": "to be defeated (struck down, plagued; biblical)",
|
||||
"לְהֵרָעֵץ": "to be defeated (crushed, shattered; literary)",
|
||||
"לְהֵהָרֵס": "to be destroyed (demolished, wrecked; slang: exhausted)",
|
||||
"לְהֵחָרֵב": "to be destroyed (laid waste, devastated; of cities/temples)",
|
||||
"לְהִסָּתֵר": "to be hidden; to hide oneself (take cover)",
|
||||
"לְהִצָּפֵן": "to be hidden (encoded, concealed from view)",
|
||||
"לְהִנָּטֵעַ": "to be planted (of trees/plants, set in soil)",
|
||||
"לְהִשָּׁתֵל": "to be planted (implanted, transplanted; of an organ or undercover agent)",
|
||||
"לָדֹם": "to be silent (to become utterly still; literary)",
|
||||
"לִשְׁתֹּק": "to be silent (to stop talking, keep quiet; common)",
|
||||
"לְהִתְקַמֵּץ": "to be stingy (to pinch pennies, scrimp)",
|
||||
"לְהִתְקַמְצֵן": "to be stingy (to act like a miser, be miserly)",
|
||||
"לְהִבָּדֵק": "to be tested, checked (verified, inspected)",
|
||||
"לְהִבָּחֵן": "to be tested, examined (undergo a formal exam/evaluation)",
|
||||
"נִהְיָה": "to become (turn into, come to be; common)",
|
||||
"לְהֵעָשׂוֹת": "to become; to be made, to be done, to be carried out",
|
||||
"לְהִתְבַּהֵר": "to become clear (clarified, understood)",
|
||||
"לְהִצְטַלֵּל": "to become clear (of liquid becoming transparent/limpid)",
|
||||
"לְכוֹפֵף": "to bend (flex, bow down, curve something)",
|
||||
"לְקַמֵּר": "to bend, to vault (arch over, create a dome shape)",
|
||||
"לְקַשֵּׁת": "to bend, to curve (form into a bow/arc shape)",
|
||||
"לְפַחֵם": "to blacken (carbonize, char with coal/charcoal)",
|
||||
"לְפַיֵּחַ": "to blacken (cover with soot, smoke residue)",
|
||||
"לְמַצְמֵץ": "to blink (rapidly open and close one's eyes)",
|
||||
"לְעַפְעֵף": "to blink (flutter one's eyelids)",
|
||||
"לִנְפֹּחַ": "to blow (puff up, inflate; blow air)",
|
||||
"לִנְשֹׁף": "to blow, to exhale; to play a wind instrument",
|
||||
"לְצַיֵּץ": "to chirp, to tweet (of birds; to post on social media)",
|
||||
"לְצַפְצֵף": "to chirp, to whistle (shrill piping sound; to not care — slang)",
|
||||
"לְחַבֵּר": "to connect, to join (attach together; to compose/write)",
|
||||
"לְקַשֵּׁר": "to connect, to link (establish a relationship/connection)",
|
||||
"לְהָסִיחַ": "to converse (engage in casual talk; to divert attention)",
|
||||
"לְהָשִׂיחַ": "to converse, to talk (literary; to speak with)",
|
||||
"לְסַלְסֵל": "to curl (hair); to trill (music)",
|
||||
"לְתַלְתֵּל": "to curl (hair into ringlets/curls)",
|
||||
"לְיַפּוֹת": "to beautify, to embellish (make more attractive)",
|
||||
"לְפַרְכֵּס": "to embellish; to squirm, to flounder",
|
||||
"לִדְרֹשׁ": "to demand; to inquire, to preach (seek/expound)",
|
||||
"לִתְבֹּעַ": "to demand; to sue, to claim (legal demand)",
|
||||
"לְהֵישִׁיר": "to direct; to straighten, to look straight at",
|
||||
"לְהַפְנוֹת": "to direct; to refer someone (redirect attention/person)",
|
||||
"לְהַגְזִים": "to exaggerate (overstate, blow out of proportion; common)",
|
||||
"לְהַפְרִיז": "to exaggerate (go to extremes, overdo; formal)",
|
||||
"לְהִמּוֹג": "to fade, to dissolve (melt away, lose form; literary)",
|
||||
"לְהִנָּדֵף": "to fade, to dissipate (blown away, scattered by wind)",
|
||||
"לִפֹּל": "to fall (general: fall down, collapse; common word)",
|
||||
"לִנְשֹׁר": "to fall, to drop (shed: leaves, hair; drop out of school)",
|
||||
"לְכַלּוֹת": "to finish (consume entirely, exhaust; to annihilate)",
|
||||
"לְסַיֵּם": "to finish, to complete (conclude, bring to an end; common)",
|
||||
"לִנְהֹר": "to flow (stream toward); to shine, to glow",
|
||||
"לִשְׁתֹּת": "to flow (pour forth, stream out; literary)",
|
||||
"לִמְחֹל": "to forgive (pardon on a personal level, waive a claim)",
|
||||
"לִסְלֹחַ": "to forgive, to pardon (general, standard word for forgiving)",
|
||||
"לְהַחְבִּיא": "to hide, to conceal (physically stash away; common)",
|
||||
"לְהַעֲלִים": "to hide, to conceal (suppress information; to evade)",
|
||||
"לִדְלֹף": "to leak (of a pipe, roof; seep through)",
|
||||
"לִנְזֹל": "to drip, to trickle (flow in drops, ooze)",
|
||||
"לִזְנֹחַ": "to abandon, to neglect (forsake, discard)",
|
||||
"לַעֲזֹב": "to leave, to abandon (depart from; give up; common word)",
|
||||
"לְהַנִּיחַ": "to place, to put (set down carefully); to assume",
|
||||
"לְהָשִׂים": "to place, to put (set/assign); to turn into something",
|
||||
"לְפָאֵר": "to glorify, to adorn (extol with grandeur)",
|
||||
"לְשַׁבֵּחַ": "to praise, to commend (express approval; common)",
|
||||
"לִדְחֹף": "to push, to shove (physically push forward; common)",
|
||||
"לִדְחֹק": "to push, to press (squeeze, crowd; urge insistently)",
|
||||
"לְהַבְרִיא": "to recover (regain health, get well; common)",
|
||||
"לְהַחְלִים": "to recover, to convalesce (heal fully from illness; formal)",
|
||||
"לַעֲלֹץ": "to rejoice, to exult (leap with joy; literary)",
|
||||
"לָשׂוּשׂ": "to rejoice (be glad, delight in; biblical/literary)",
|
||||
"לְהוֹשִׁיעַ": "to rescue, to save (deliver from danger; biblical/literary)",
|
||||
"לְהַצִּיל": "to rescue, to save (common, everyday word)",
|
||||
"לְחַכֵּךְ": "to rub (scratch an itch, abrade gently)",
|
||||
"לְשַׁפְשֵׁף": "to rub (scrub, polish by rubbing repeatedly)",
|
||||
"לִסְרֹט": "to scratch (scrape with a sharp object; to make a video/film)",
|
||||
"לִשְׂרֹט": "to scratch (draw a line, score a surface)",
|
||||
"לִנְגֹּהַּ": "to shine (glow with bright light; literary)",
|
||||
"לִקְרֹן": "to shine, to beam (radiate light, as from horns of light)",
|
||||
"לְהַחֲרִישׁ": "to silence; to be silent (choose not to respond; literary)",
|
||||
"לְהַשְׁתִּיק": "to silence (make someone/something stop making noise; common)",
|
||||
"לִטְבֹּחַ": "to slaughter (massacre, butcher violently)",
|
||||
"לִשְׁחֹט": "to slaughter (ritually slaughter an animal; shecht)",
|
||||
"לְהִתְמַחוֹת": "to specialize (become an expert in a field)",
|
||||
"לְהִתְמַקְצֵעַ": "to specialize (become a professional, gain proficiency)",
|
||||
"לְבַקֵּעַ": "to split, to cleave (crack open forcefully)",
|
||||
"לְבַתֵּק": "to split, to cleave; to pierce (cut through)",
|
||||
"לִמְרֹחַ": "to spread (smear, apply a spread on surface)",
|
||||
"לִשְׁטֹחַ": "to spread (lay out flat, unfurl); to present, explicate",
|
||||
"לְאַשֵּׁשׁ": "to strengthen, to establish (shore up, substantiate)",
|
||||
"לְחַזֵּק": "to strengthen (make stronger, reinforce; common word)",
|
||||
"לְהִתְיַסֵּר": "to suffer (be tormented, endure agony)",
|
||||
"לְהִתְעַנּוֹת": "to suffer; to fast (endure hardship/deprivation; literary)",
|
||||
"לִידוֹת": "to throw, to hurl (cast, fling; biblical)",
|
||||
"לִרְמוֹת": "to throw, to hurl (toss; biblical)",
|
||||
"לִגְזֹז": "to trim (shear wool/hair, clip close)",
|
||||
"לִגְזֹם": "to trim (prune branches/bushes, cut back vegetation)",
|
||||
"לְאַדּוֹת": "to vaporize (steam, evaporate); to simmer, to poach (cooking)",
|
||||
"לְאַיֵּד": "to vaporize, to evaporate (cause to turn into vapor)",
|
||||
"לֶאֱרֹג": "to weave (on a loom, produce fabric; common word)",
|
||||
"לִשְׁזֹר": "to weave (intertwine, braid, thread together)",
|
||||
"בְּיַחַד": "together (as a group, common usage with 'be-')",
|
||||
"יַחַד": "together (jointly, in unison; literary)",
|
||||
"יַחְדָּו": "together (jointly; biblical/poetic variant)",
|
||||
"מִסְחָר": "trade, commerce (the business/sector of trading)",
|
||||
"סַחַר": "trade, commerce (goods traded, merchandise; literary)",
|
||||
"אֱמֶת": "truth (common word for truth, verity)",
|
||||
"אֲמִתָּה": "truth; axiom (fundamental truth, literary)",
|
||||
"מִצְנֶפֶת": "turban (formal headdress, priestly turban)",
|
||||
"צָנִיף": "turban, head wrap (wrapped head covering)",
|
||||
"אַחְדוּת": "unity (state of being united, solidarity)",
|
||||
"אִחוּד": "unification (the act of uniting, merging)",
|
||||
"בִּקְעָה": "valley (broad, flat valley plain)",
|
||||
"עֵמֶק": "valley (deep valley between mountains/hills)",
|
||||
"אִשְׁרָה": "visa; approval (entry permit; formal approval)",
|
||||
"וִיזָה": "visa (travel visa, loanword)",
|
||||
"כֹּתֶל": "wall (the Western Wall; a freestanding stone wall)",
|
||||
"קִיר": "wall (common word for wall of a room/building)",
|
||||
"אַזְהָרָה": "warning (a caution, alert; legal/safety warning)",
|
||||
"הַזְהָרָה": "warning (the act of warning someone; admonition)",
|
||||
"רַהַט": "water trough (channel, gutter for water flow)",
|
||||
"שֹׁקֶת": "water trough (feeding/drinking trough for animals)",
|
||||
"אִלּוּלֵי": "were it not for (standard conditional; common)",
|
||||
"לוּלֵא": "were it not for (literary/Talmudic variant)",
|
||||
"אוֹפַן": "wheel (a single wheel; biblical/poetic)",
|
||||
"גַּלְגַּל": "wheel (rolling wheel; cycle, pulley)",
|
||||
"אַיֵּה": "where? (literary/biblical: where is?)",
|
||||
"הֵיכָן": "where? (standard literary form of 'where')",
|
||||
"לֹבֶן": "whiteness (white of the eye; white color)",
|
||||
"צְחוֹר": "whiteness; purity (brilliant white, radiance)",
|
||||
"עוֹלָם": "world (the world, universe; eternity; common word)",
|
||||
"תֵּבֵל": "world, universe (the inhabited world; poetic/literary)",
|
||||
"פֶּצַע": "wound (a specific cut, gash, open wound)",
|
||||
"פְּצִיעָה": "wound, injury (the event/act of being wounded)",
|
||||
"כִּסּוּפִים": "yearning, longing (wistful craving, literary; plural)",
|
||||
"עֶרְגָּה": "yearning, longing (deep nostalgic longing, literary)"
|
||||
}
|
||||
19525
data/vetted_sentences.json
Normal file
File diff suppressed because it is too large
20159
data/vocab_sentence_matches.json
Normal file
File diff suppressed because it is too large
446
epub_examples.py
Normal file
|
|
@ -0,0 +1,446 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
Extract example sentences from Hebrew EPUBs with nikkud (and PDFs where possible),
|
||||
match them against the vocab list, and produce examples_cache.json.
|
||||
|
||||
Usage:
|
||||
python3 epub_examples.py
|
||||
|
||||
Outputs:
|
||||
data/epub_sentence_index.json — full sentence corpus
|
||||
data/examples_cache.json — best sentence(s) per vocab word
|
||||
"""
|
||||
|
||||
import csv
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import zipfile
|
||||
from html.parser import HTMLParser
|
||||
from pathlib import Path
|
||||
|
||||
from helpers import strip_nikkud
|
||||
|
||||
DATA_DIR = Path(__file__).parent / "data"
|
||||
EPUB_DIR = DATA_DIR / "epubs"
|
||||
DICT_CSV = DATA_DIR / "hebrew_dict_for_anki.csv"
|
||||
|
||||
# Book metadata: filename -> display name
|
||||
EPUB_BOOKS = {
|
||||
"little_prince.epub": "הנסיך הקטן",
|
||||
"time_tunnel_82.epub": "מנהרת הזמן 82",
|
||||
}
|
||||
|
||||
# PDF books are excluded — pypdf produces garbled RTL text (reversed chars within
|
||||
# words). If/when a proper EPUB version becomes available on Calibre, add it to
|
||||
# EPUB_BOOKS above instead.
|
||||
PDF_BOOKS: dict[str, str] = {}
|
||||
|
||||
# Sentence length bounds (word count)
|
||||
MIN_WORDS = 4
|
||||
MAX_WORDS = 15
|
||||
|
||||
|
||||
|
||||
# ── HTML text extraction ─────────────────────────────────────────
|
||||
|
||||
|
||||
class _TextExtractor(HTMLParser):
|
||||
"""Extract text content from HTML, skipping script/style tags."""
|
||||
|
||||
SKIP_TAGS = {"script", "style", "head"}
|
||||
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
self.parts: list[str] = []
|
||||
self._skip_depth = 0
|
||||
|
||||
def handle_starttag(self, tag, attrs):
|
||||
if tag in self.SKIP_TAGS:
|
||||
self._skip_depth += 1
|
||||
# Insert space for block-level elements to avoid word concatenation
|
||||
if tag in (
|
||||
"p",
|
||||
"div",
|
||||
"br",
|
||||
"li",
|
||||
"h1",
|
||||
"h2",
|
||||
"h3",
|
||||
"h4",
|
||||
"h5",
|
||||
"h6",
|
||||
"td",
|
||||
"th",
|
||||
"tr",
|
||||
"blockquote",
|
||||
"section",
|
||||
):
|
||||
self.parts.append("\n")
|
||||
|
||||
def handle_endtag(self, tag):
|
||||
if tag in self.SKIP_TAGS:
|
||||
self._skip_depth = max(0, self._skip_depth - 1)
|
||||
|
||||
def handle_data(self, data):
|
||||
if self._skip_depth == 0:
|
||||
self.parts.append(data)
|
||||
|
||||
def get_text(self) -> str:
|
||||
return "".join(self.parts)
|
||||
|
||||
|
||||
def extract_text_from_html(html: str) -> str:
|
||||
"""Parse HTML and return plain text."""
|
||||
parser = _TextExtractor()
|
||||
parser.feed(html)
|
||||
return parser.get_text()
|
||||
|
||||
|
||||
# ── EPUB processing ──────────────────────────────────────────────
|
||||
|
||||
|
||||
def _content_files_from_epub(zf: zipfile.ZipFile) -> list[str]:
|
||||
"""Get ordered list of content XHTML files from the OPF manifest."""
|
||||
# Find the OPF file
|
||||
opf_path = None
|
||||
for name in zf.namelist():
|
||||
if name.endswith(".opf"):
|
||||
opf_path = name
|
||||
break
|
||||
if not opf_path:
|
||||
# Fallback: just use all xhtml files
|
||||
return sorted(
|
||||
n
|
||||
for n in zf.namelist()
|
||||
if n.endswith((".xhtml", ".html"))
|
||||
and "toc" not in n.lower()
|
||||
and "cover" not in n.lower()
|
||||
and "nav" not in n.lower()
|
||||
)
|
||||
|
||||
# Parse OPF to get spine order
|
||||
opf_content = zf.read(opf_path).decode("utf-8")
|
||||
opf_dir = os.path.dirname(opf_path)
|
||||
|
||||
# Extract manifest items: id -> href
|
||||
manifest = {}
|
||||
for m in re.finditer(r'<item\s+[^>]*id="([^"]+)"[^>]*href="([^"]+)"', opf_content):
|
||||
manifest[m.group(1)] = m.group(2)
|
||||
# Also try reversed attribute order
|
||||
for m in re.finditer(r'<item\s+[^>]*href="([^"]+)"[^>]*id="([^"]+)"', opf_content):
|
||||
manifest[m.group(2)] = m.group(1)
|
||||
|
||||
# Extract spine order
|
||||
spine_ids = re.findall(r'<itemref\s+[^>]*idref="([^"]+)"', opf_content)
|
||||
|
||||
result = []
|
||||
for sid in spine_ids:
|
||||
href = manifest.get(sid, "")
|
||||
if href and href.endswith((".xhtml", ".html")):
|
||||
full_path = os.path.join(opf_dir, href) if opf_dir else href
|
||||
# Normalize path separators
|
||||
full_path = full_path.replace("\\", "/")
|
||||
if full_path in zf.namelist():
|
||||
result.append(full_path)
|
||||
|
||||
if not result:
|
||||
# Fallback
|
||||
return sorted(
|
||||
n
|
||||
for n in zf.namelist()
|
||||
if n.endswith((".xhtml", ".html")) and "toc" not in n.lower() and "cover" not in n.lower()
|
||||
)
|
||||
return result
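Review note: the regex-based manifest and spine extraction above needs two passes to cope with attribute order. A minimal sketch of an alternative, assuming the OPF file is well-formed XML in the standard `http://www.idpf.org/2007/opf` namespace, is to parse it with the standard library's `xml.etree.ElementTree`. The function name `spine_hrefs` and its parameters are hypothetical and do not appear in this commit.

```python
import posixpath
import xml.etree.ElementTree as ET

# Standard OPF package namespace (assumed to be the one used by the EPUBs).
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_hrefs(opf_xml: str, opf_dir: str = "") -> list[str]:
    """Return content-file paths in spine (reading) order.

    Hypothetical helper: attribute order inside <item> does not matter
    because the XML parser exposes attributes by name.
    """
    root = ET.fromstring(opf_xml)
    # manifest: id -> href
    manifest = {
        item.get("id"): item.get("href")
        for item in root.iterfind("opf:manifest/opf:item", OPF_NS)
    }
    hrefs = []
    for ref in root.iterfind("opf:spine/opf:itemref", OPF_NS):
        href = manifest.get(ref.get("idref"), "")
        if href and href.endswith((".xhtml", ".html")):
            # ZIP member names always use forward slashes, hence posixpath.
            hrefs.append(posixpath.join(opf_dir, href) if opf_dir else href)
    return hrefs
```

The caller would still need to check each returned path against `zf.namelist()` and keep the sorted-filename fallback for EPUBs whose OPF is missing or malformed.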
|
||||
|
||||
|
||||
def extract_sentences_from_epub(epub_path: Path, book_name: str) -> list[dict]:
|
||||
"""Extract sentences from an EPUB file.
|
||||
|
||||
Returns list of {"text": str, "book": str, "stripped": str}
|
||||
"""
|
||||
zf = zipfile.ZipFile(epub_path)
|
||||
content_files = _content_files_from_epub(zf)
|
||||
|
||||
all_text = []
|
||||
for cf in content_files:
|
||||
try:
|
||||
html = zf.read(cf).decode("utf-8")
|
||||
except (KeyError, UnicodeDecodeError):
|
||||
continue
|
||||
text = extract_text_from_html(html)
|
||||
all_text.append(text)
|
||||
|
||||
full_text = "\n".join(all_text)
|
||||
return _split_into_sentences(full_text, book_name)
|
||||
|
||||
|
||||
# ── PDF processing ───────────────────────────────────────────────
|
||||
|
||||
|
||||
def extract_sentences_from_pdf(pdf_path: Path, book_name: str) -> list[dict]:
|
||||
"""Extract sentences from a PDF file (best-effort; reverses word order on predominantly Hebrew lines)."""
|
||||
try:
|
||||
import pypdf
|
||||
except ImportError:
|
||||
print(f" [SKIP] pypdf not installed, cannot process {pdf_path.name}")
|
||||
return []
|
||||
|
||||
reader = pypdf.PdfReader(pdf_path)
|
||||
all_text_parts = []
|
||||
|
||||
for page in reader.pages:
|
||||
raw = page.extract_text()
|
||||
if not raw:
|
||||
continue
|
||||
# pypdf often reverses word order for RTL text; fix it
|
||||
fixed_lines = []
|
||||
for line in raw.split("\n"):
|
||||
words = line.split()
|
||||
# Check if this line is predominantly Hebrew
|
||||
hebrew_chars = sum(1 for c in line if "\u0590" <= c <= "\u05ff")
|
||||
if hebrew_chars > len(line) * 0.3 and len(words) > 1:
|
||||
# Reverse word order
|
||||
fixed_lines.append(" ".join(reversed(words)))
|
||||
else:
|
||||
fixed_lines.append(line)
|
||||
all_text_parts.append("\n".join(fixed_lines))
|
||||
|
||||
full_text = "\n".join(all_text_parts)
|
||||
return _split_into_sentences(full_text, book_name)
|
||||
|
||||
|
||||
# ── Sentence splitting ───────────────────────────────────────────
|
||||
|
||||
# Hebrew sentence terminators: period, exclamation, question mark, sof pasuk
|
||||
_SENT_SPLIT = re.compile(r"[.!?\u05C3]+")
|
||||
|
||||
# Punctuation to strip from word boundaries when matching
|
||||
_PUNCT = re.compile(
|
||||
r'^[\u0022\u0027\u05F4\u05F3,;:\-–—…\u201C\u201D\u201E\u201F\u2018\u2019()\[\]{}«»"\']+|[\u0022\u0027\u05F4\u05F3,;:\-–—…\u201C\u201D\u201E\u201F\u2018\u2019()\[\]{}«»"\']+$'
|
||||
)
|
||||
|
||||
|
||||
def _split_into_sentences(text: str, book_name: str) -> list[dict]:
|
||||
"""Split text into sentences and filter by length."""
|
||||
# Normalize whitespace
|
||||
text = re.sub(r"\s+", " ", text).strip()
|
||||
|
||||
raw_sentences = _SENT_SPLIT.split(text)
|
||||
results = []
|
||||
seen = set()
|
||||
|
||||
for sent in raw_sentences:
|
||||
sent = sent.strip()
|
||||
if not sent:
|
||||
continue
|
||||
|
||||
# Count Hebrew words (skip non-Hebrew tokens like numbers)
|
||||
words = sent.split()
|
||||
hebrew_words = [w for w in words if any("\u0590" <= c <= "\u05ff" for c in w)]
|
||||
|
||||
if len(hebrew_words) < MIN_WORDS or len(hebrew_words) > MAX_WORDS:
|
||||
continue
|
||||
|
||||
# Skip duplicates
|
||||
stripped = strip_nikkud(sent)
|
||||
if stripped in seen:
|
||||
continue
|
||||
seen.add(stripped)
|
||||
|
||||
results.append(
|
||||
{
|
||||
"text": sent,
|
||||
"book": book_name,
|
||||
"stripped": stripped,
|
||||
}
|
||||
)
|
||||
|
||||
return results
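Review note: the splitting and filtering rules above can be exercised in isolation with a small sketch. It assumes that `helpers.strip_nikkud` behaves like the NFD / category `Mn` filter that this commit removes from the frequency module; `strip_marks` below is a stand-in for it, and `split_sentences` is a hypothetical, simplified version of `_split_into_sentences` that returns plain strings instead of dicts.

```python
import re
import unicodedata

# Same terminators as _SENT_SPLIT: period, exclamation, question mark, sof pasuk.
SENT_SPLIT = re.compile(r"[.!?\u05C3]+")
MIN_WORDS, MAX_WORDS = 4, 15


def strip_marks(text: str) -> str:
    """Stand-in for helpers.strip_nikkud (assumption): drop combining marks."""
    return "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if unicodedata.category(ch) != "Mn"
    )


def split_sentences(text: str) -> list[str]:
    """Split on terminators, keep sentences with 4-15 Hebrew words, drop duplicates."""
    text = re.sub(r"\s+", " ", text).strip()
    kept, seen = [], set()
    for sent in SENT_SPLIT.split(text):
        sent = sent.strip()
        # A token counts as Hebrew if it contains any char in U+0590..U+05FF.
        hebrew_words = [w for w in sent.split() if any("\u0590" <= c <= "\u05ff" for c in w)]
        if not MIN_WORDS <= len(hebrew_words) <= MAX_WORDS:
            continue
        key = strip_marks(sent)  # duplicates are detected on the unpointed form
        if key in seen:
            continue
        seen.add(key)
        kept.append(sent)
    return kept
```

Because `re.split` discards the delimiters, the stored sentences lose their closing punctuation; if the cards should show it, a capturing group or `re.finditer` would be needed instead.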
|
||||
|
||||
|
||||
# ── Vocab loading ────────────────────────────────────────────────
|
||||
|
||||
|
||||
def load_vocab(csv_path: Path) -> dict:
|
||||
"""Load the vocab CSV and return a {stripped_form: [nikkud_words]} mapping.
|
||||
|
||||
Keys are lookup forms without nikkud: the Word column with nikkud stripped,
|
||||
plus the "Word Without Nikkud" column when it is non-empty.
|
||||
"""
|
||||
words_by_stripped: dict[str, list[str]] = {} # stripped -> [nikkud words]
|
||||
|
||||
with open(csv_path, encoding="utf-8") as f:
|
||||
reader = csv.DictReader(f, delimiter=";")
|
||||
for row in reader:
|
||||
nikkud_word = row.get("Word", "").strip()
|
||||
word_no_nik = row.get("Word Without Nikkud", "").strip()
|
||||
if not nikkud_word:
|
||||
continue
|
||||
|
||||
# Derive a lookup form by stripping nikkud from the Word column
|
||||
stripped_from_nikkud = strip_nikkud(nikkud_word)
|
||||
|
||||
# Add both forms for matching
|
||||
for form in {stripped_from_nikkud, word_no_nik}:
|
||||
if form:
|
||||
words_by_stripped.setdefault(form, []).append(nikkud_word)
|
||||
|
||||
return words_by_stripped
|
||||
|
||||
|
||||
# ── Matching ─────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def match_sentences(sentences: list[dict], words_by_stripped: dict) -> dict:
|
||||
"""Match sentences against vocab words.
|
||||
|
||||
Returns {nikkud_word: [sentences]}, shortest sentence first, at most 5 per word.
|
||||
"""
|
||||
# Build a set of all stripped forms for fast lookup
|
||||
all_forms = set(words_by_stripped.keys())
|
||||
|
||||
# Hebrew single-letter prefixes: ב, ה, ו, כ, ל, מ, ש, ד (של)
|
||||
_HEB_PREFIXES = set("בהוכלמשד")
|
||||
|
||||
# For each sentence, extract stripped words
|
||||
matches: dict[str, list[tuple[int, str]]] = {} # nikkud_word -> [(word_count, sentence)]
|
||||
|
||||
for sent_info in sentences:
|
||||
sent_text = sent_info["text"]
|
||||
sent_stripped = sent_info["stripped"]
|
||||
word_count = len(sent_text.split())
|
||||
|
||||
# Get stripped words from the sentence
|
||||
raw_words = sent_stripped.split()
|
||||
# Map: candidate_form -> the cleaned word that produced it (last one wins)
|
||||
# This lets us verify that prefix stripping is plausible
|
||||
candidates: dict[str, str] = {} # form -> original_word
|
||||
for w in raw_words:
|
||||
cleaned = _PUNCT.sub("", w)
|
||||
if not cleaned:
|
||||
continue
|
||||
# Direct match (always try)
|
||||
candidates[cleaned] = cleaned
|
||||
# Prefix stripping: only if remaining stem is >= 2 chars
|
||||
# and the prefix char is a known Hebrew prefix letter
|
||||
for prefix_len in (1, 2):
|
||||
if len(cleaned) > prefix_len + 1:
|
||||
prefix = cleaned[:prefix_len]
|
||||
stem = cleaned[prefix_len:]
|
||||
if all(c in _HEB_PREFIXES for c in prefix) and len(stem) >= 2:
|
||||
candidates[stem] = cleaned
|
||||
|
||||
# Check which vocab words appear in this sentence
|
||||
matched_forms = set(candidates.keys()) & all_forms
|
||||
for form in matched_forms:
|
||||
# Skip spurious matches: very short vocab forms (1-2 chars)
|
||||
# should only match via direct word match, not prefix stripping
|
||||
if len(form) <= 2 and form not in {_PUNCT.sub("", w) for w in raw_words}:
|
||||
continue
|
||||
for nikkud_word in words_by_stripped[form]:
|
||||
matches.setdefault(nikkud_word, []).append((word_count, sent_text))
|
||||
|
||||
# Sort by word count (prefer shorter sentences) and deduplicate
|
||||
result = {}
|
||||
for nikkud_word, sent_list in matches.items():
|
||||
sent_list.sort(key=lambda x: x[0])
|
||||
seen = set()
|
||||
unique = []
|
||||
for _, sent in sent_list:
|
||||
if sent not in seen:
|
||||
seen.add(sent)
|
||||
unique.append(sent)
|
||||
if len(unique) >= 5: # Keep top 5 per word
|
||||
break
|
||||
result[nikkud_word] = unique
|
||||
|
||||
return result
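Review note: the prefix-stripping step in `match_sentences` is easier to reason about when pulled out on its own. The sketch below mirrors the loop above under the same rules (one or two prefix letters, stem of at least two characters); the function name `candidate_forms` is hypothetical and is not part of this commit.

```python
# Same prefix letters as _HEB_PREFIXES in match_sentences.
HEB_PREFIXES = set("בהוכלמשד")


def candidate_forms(word: str) -> set[str]:
    """Return the word itself plus stems left after removing 1-2 prefix letters."""
    forms = {word}
    for prefix_len in (1, 2):
        # Require at least two characters to remain after the prefix.
        if len(word) > prefix_len + 1:
            prefix, stem = word[:prefix_len], word[prefix_len:]
            if all(c in HEB_PREFIXES for c in prefix) and len(stem) >= 2:
                forms.add(stem)
    return forms


# "והבית" (and-the-house) yields itself, "הבית" and "בית".
print(sorted(candidate_forms("והבית"), key=len))
```

The check is purely orthographic, so a word whose first root letter happens to be a prefix letter (for example one starting with מ or ש) also produces a shortened candidate. That is why the original code restricts forms of one or two characters to direct matches.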
|
||||
|
||||
|
||||
# ── Main ─────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def main():
|
||||
print("=" * 60)
|
||||
print("EPUB Example Sentence Extraction Pipeline")
|
||||
print("=" * 60)
|
||||
|
||||
# Step 1: Extract sentences from all books
|
||||
all_sentences = []
|
||||
book_counts = {}
|
||||
|
||||
for filename, book_name in EPUB_BOOKS.items():
|
||||
path = EPUB_DIR / filename
|
||||
if not path.exists():
|
||||
print(f"\n[SKIP] {filename} not found")
|
||||
continue
|
||||
print(f"\n[EPUB] Extracting: {book_name} ({filename})")
|
||||
sentences = extract_sentences_from_epub(path, book_name)
|
||||
book_counts[book_name] = len(sentences)
|
||||
all_sentences.extend(sentences)
|
||||
print(f" -> {len(sentences)} sentences")
|
||||
|
||||
for filename, book_name in PDF_BOOKS.items():
|
||||
path = EPUB_DIR / filename
|
||||
if not path.exists():
|
||||
print(f"\n[SKIP] {filename} not found")
|
||||
continue
|
||||
print(f"\n[PDF] Extracting: {book_name} ({filename})")
|
||||
sentences = extract_sentences_from_pdf(path, book_name)
|
||||
book_counts[book_name] = len(sentences)
|
||||
all_sentences.extend(sentences)
|
||||
print(f" -> {len(sentences)} sentences")
|
||||
|
||||
print(f"\nTotal sentences: {len(all_sentences)}")
|
||||
|
||||
# Step 2: Save sentence index
|
||||
index_path = DATA_DIR / "epub_sentence_index.json"
|
||||
with open(index_path, "w", encoding="utf-8") as f:
|
||||
json.dump({"sentences": all_sentences}, f, ensure_ascii=False, indent=2)
|
||||
print(f"\nSaved sentence index: {index_path}")
|
||||
|
||||
# Step 3: Load vocab and match
|
||||
print(f"\nLoading vocab from {DICT_CSV} ...")
|
||||
words_by_stripped = load_vocab(DICT_CSV)
|
||||
total_vocab = len({w for wlist in words_by_stripped.values() for w in wlist})
|
||||
print(f" {total_vocab} unique vocab words ({len(words_by_stripped)} lookup forms)")
|
||||
|
||||
print("\nMatching sentences against vocab ...")
|
||||
examples_cache = match_sentences(all_sentences, words_by_stripped)
|
||||
|
||||
# Step 4: Save examples_cache
|
||||
cache_path = DATA_DIR / "examples_cache.json"
|
||||
with open(cache_path, "w", encoding="utf-8") as f:
|
||||
json.dump(examples_cache, f, ensure_ascii=False, indent=2)
|
||||
print(f"Saved examples cache: {cache_path}")
|
||||
|
||||
# Step 5: Summary stats
|
||||
print("\n" + "=" * 60)
|
||||
print("SUMMARY")
|
||||
print("=" * 60)
|
||||
print("\nSentences per book:")
|
||||
for book_name, count in book_counts.items():
|
||||
print(f" {book_name}: {count}")
|
||||
print(f" Total: {len(all_sentences)}")
|
||||
|
||||
print("\nVocab matching:")
|
||||
print(f" Total vocab words: {total_vocab}")
|
||||
print(f" Words with examples: {len(examples_cache)}")
|
||||
coverage = 100 * len(examples_cache) / total_vocab if total_vocab else 0
|
||||
print(f" Coverage: {coverage:.1f}%")
|
||||
|
||||
# Show some sample matches
|
||||
print("\nSample matches:")
|
||||
count = 0
|
||||
for word, sents in examples_cache.items():
|
||||
if count >= 5:
|
||||
break
|
||||
print(f" {word} -> {sents[0][:60]}...")
|
||||
count += 1
|
||||
|
||||
return examples_cache
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
|
@ -7,18 +7,15 @@ Exposed API: get_frequency_rank(word_no_nikkud) -> int | None
|
|||
|
||||
import json
|
||||
import logging
|
||||
import re
|
||||
import unicodedata
|
||||
from pathlib import Path
|
||||
|
||||
import requests
|
||||
|
||||
from helpers import strip_nikkud as _strip_nikkud
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
FREQ_URL = (
|
||||
"https://raw.githubusercontent.com/hermitdave/FrequencyWords/"
|
||||
"master/content/2016/he/he_50k.txt"
|
||||
)
|
||||
FREQ_URL = "https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content/2016/he/he_50k.txt"
|
||||
CACHE_PATH = Path(__file__).parent / "data" / "frequency_cache.json"
|
||||
REQUEST_TIMEOUT = 30
|
||||
|
||||
|
|
@ -26,14 +23,6 @@ REQUEST_TIMEOUT = 30
|
|||
_freq: dict[str, int] = {}
|
||||
|
||||
|
||||
def _strip_nikkud(text: str) -> str:
|
||||
"""Remove Hebrew nikkud (diacritics) from a string."""
|
||||
return "".join(
|
||||
ch for ch in unicodedata.normalize("NFD", text)
|
||||
if unicodedata.category(ch) != "Mn"
|
||||
)
|
||||
|
||||
|
||||
def load(cache_path: Path = CACHE_PATH) -> None:
|
||||
"""Load frequency data from cache, downloading if not present."""
|
||||
global _freq
|
||||
|
|
|
|||
|
|
@ -4,25 +4,20 @@ Extract Hebrew vocabulary from pealim.com dictionary.
|
|||
Scrapes word entries, roots, parts of speech, and audio URLs for Anki flashcards.
|
||||
"""
|
||||
|
||||
import requests
|
||||
import pandas as pd
|
||||
from bs4 import BeautifulSoup
|
||||
import logging
|
||||
import time
|
||||
from typing import Optional
|
||||
|
||||
import pandas as pd
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
# Configure logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(levelname)s - %(message)s'
|
||||
)
|
||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Session for connection pooling
|
||||
session = requests.Session()
|
||||
session.headers.update({
|
||||
'User-Agent': 'Mozilla/5.0 (compatible; pealim-scraper/1.0)'
|
||||
})
|
||||
session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; pealim-scraper/1.0)"})
|
||||
|
||||
PEALIM_DICT_URL = "https://www.pealim.com/dict/"
|
||||
REQUEST_DELAY = 1.5 # seconds between requests (respectful scraping)
|
||||
|
|
@@ -33,7 +28,7 @@ def get_total_pages() -> int:
|
|||
"""Dynamically determine total pages from first request."""
|
||||
try:
|
||||
logger.info("Fetching total page count...")
|
||||
cookies = {'translit': 'none', 'hebstyle': 'mo'}
|
||||
cookies = {"translit": "none", "hebstyle": "mo"}
|
||||
response = session.get(PEALIM_DICT_URL, cookies=cookies, timeout=REQUEST_TIMEOUT)
|
||||
response.raise_for_status()
|
||||
# Hardcoded — pealim.com has ~608 pages at ~15 words/page
|
||||
|
|
@@ -48,17 +43,17 @@ def _parse_page_with_audio(html_bytes: bytes) -> list[dict]:
|
|||
Parse a dict page with BeautifulSoup to extract word data + audio URL.
|
||||
Returns list of dicts with keys: Word, Root, Part of Speech, Meaning, audio_url.
|
||||
"""
|
||||
soup = BeautifulSoup(html_bytes, 'html.parser')
|
||||
soup = BeautifulSoup(html_bytes, "html.parser")
|
||||
rows = []
|
||||
for tr in soup.select('table tr'):
|
||||
tds = tr.find_all('td')
|
||||
for tr in soup.select("table tr"):
|
||||
tds = tr.find_all("td")
|
||||
if len(tds) < 4:
|
||||
continue
|
||||
# Audio URL from span[data-audio] in first td
|
||||
audio_span = tds[0].find(attrs={'data-audio': True})
|
||||
audio_url = audio_span['data-audio'] if audio_span else ''
|
||||
audio_span = tds[0].find(attrs={"data-audio": True})
|
||||
audio_url = audio_span["data-audio"] if audio_span else ""
|
||||
# Word with nikkud
|
||||
menukad = tds[0].find('span', class_='menukad')
|
||||
menukad = tds[0].find("span", class_="menukad")
|
||||
word = menukad.get_text(strip=True) if menukad else tds[0].get_text(strip=True)
|
||||
# Root (may be link or plain text)
|
||||
root = tds[1].get_text(strip=True)
|
||||
|
|
@@ -67,17 +62,19 @@ def _parse_page_with_audio(html_bytes: bytes) -> list[dict]:
|
|||
# Meaning
|
||||
meaning = tds[3].get_text(strip=True)
|
||||
if word:
|
||||
rows.append({
|
||||
'Word': word,
|
||||
'Root': root if root else '-',
|
||||
'Part of Speech': pos,
|
||||
'Meaning': meaning,
|
||||
'audio_url': audio_url,
|
||||
})
|
||||
rows.append(
|
||||
{
|
||||
"Word": word,
|
||||
"Root": root if root else "-",
|
||||
"Part of Speech": pos,
|
||||
"Meaning": meaning,
|
||||
"audio_url": audio_url,
|
||||
}
|
||||
)
|
||||
return rows
|
||||
|
||||
|
||||
def extract_from_website(max_pages: Optional[int] = None) -> pd.DataFrame:
|
||||
def extract_from_website(max_pages: int | None = None) -> pd.DataFrame:
|
||||
"""
|
||||
Extract dictionary entries from pealim.com.
|
||||
Captures audio URLs from each word entry's data-audio attribute.
|
||||
|
|
@@ -93,33 +90,33 @@ def extract_from_website(max_pages: Optional[int] = None) -> pd.DataFrame:
|
|||
|
||||
all_rows: list[dict] = []
|
||||
|
||||
for page_num in range(1, total_pages):
|
||||
for page_num in range(1, total_pages + 1):
|
||||
try:
|
||||
url = f"{PEALIM_DICT_URL}?page={page_num}"
|
||||
|
||||
# First request: with nikkud — parse with BeautifulSoup for audio URL
|
||||
cookies = {'translit': 'none', 'hebstyle': 'mo'}
|
||||
cookies = {"translit": "none", "hebstyle": "mo"}
|
||||
response = session.get(url, cookies=cookies, timeout=REQUEST_TIMEOUT)
|
||||
response.raise_for_status()
|
||||
page_rows = _parse_page_with_audio(response.content)
|
||||
|
||||
# Second request: without nikkud — just get the word column
|
||||
cookies_vl = {'translit': 'none', 'hebstyle': 'vl', 'showmeaning': 'off'}
|
||||
cookies_vl = {"translit": "none", "hebstyle": "vl", "showmeaning": "off"}
|
||||
resp_vl = session.get(url, cookies=cookies_vl, timeout=REQUEST_TIMEOUT)
|
||||
resp_vl.raise_for_status()
|
||||
soup_vl = BeautifulSoup(resp_vl.content, 'html.parser')
|
||||
soup_vl = BeautifulSoup(resp_vl.content, "html.parser")
|
||||
no_nik_words = []
|
||||
for tr in soup_vl.select('table tr'):
|
||||
tds = tr.find_all('td')
|
||||
for tr in soup_vl.select("table tr"):
|
||||
tds = tr.find_all("td")
|
||||
if len(tds) < 4:
|
||||
continue
|
||||
menukad = tds[0].find('span', class_='menukad')
|
||||
menukad = tds[0].find("span", class_="menukad")
|
||||
w = menukad.get_text(strip=True) if menukad else tds[0].get_text(strip=True)
|
||||
no_nik_words.append(w)
|
||||
|
||||
# Merge no-nikkud words into rows
|
||||
for i, row in enumerate(page_rows):
|
||||
row['Word Without Nikkud'] = no_nik_words[i] if i < len(no_nik_words) else ''
|
||||
row["Word Without Nikkud"] = no_nik_words[i] if i < len(no_nik_words) else ""
|
||||
|
||||
all_rows.extend(page_rows)
|
||||
|
||||
|
|
@@ -136,7 +133,7 @@ def extract_from_website(max_pages: Optional[int] = None) -> pd.DataFrame:
|
|||
continue
|
||||
|
||||
df = pd.DataFrame(all_rows)
|
||||
audio_count = (df['audio_url'] != '').sum() if 'audio_url' in df.columns else 0
|
||||
audio_count = (df["audio_url"] != "").sum() if "audio_url" in df.columns else 0
|
||||
logger.info(f"Extraction complete. Total words: {len(df)}, with audio URL: {audio_count}")
|
||||
return df
|
||||
|
||||
|
|
@@ -150,39 +147,39 @@ def modify_for_anki(df: pd.DataFrame) -> pd.DataFrame:
|
|||
|
||||
# Find shared root words
|
||||
shared_root_words = []
|
||||
for idx, row in df.iterrows():
|
||||
root = row['Root']
|
||||
word = row['Word']
|
||||
for _idx, row in df.iterrows():
|
||||
root = row["Root"]
|
||||
word = row["Word"]
|
||||
|
||||
if root != '-' and pd.notna(root):
|
||||
same_root = df[(df['Root'] == root) & (df['Word'] != word)]['Word'].values
|
||||
shared = ' '.join(str(w) for w in same_root)
|
||||
if root != "-" and pd.notna(root):
|
||||
same_root = df[(df["Root"] == root) & (df["Word"] != word)]["Word"].values
|
||||
shared = " ".join(str(w) for w in same_root)
|
||||
shared_root_words.append(shared)
|
||||
else:
|
||||
shared_root_words.append('')
|
||||
shared_root_words.append("")
|
||||
|
||||
df['shared roots'] = shared_root_words
|
||||
df["shared roots"] = shared_root_words
|
||||
|
||||
# Generate Hebrew tags
|
||||
tags = []
|
||||
for idx, row in df.iterrows():
|
||||
for _idx, row in df.iterrows():
|
||||
tag_parts = []
|
||||
|
||||
root = str(row['Root']).replace(' ', '').replace('-', '')
|
||||
if 'nan' not in root and root:
|
||||
root_clean = root.replace('.', '')
|
||||
root = str(row["Root"]).replace(" ", "").replace("-", "")
|
||||
if "nan" not in root and root:
|
||||
root_clean = root.replace(".", "")
|
||||
tag_parts.append(f"שורש::{root_clean}")
|
||||
|
||||
pos = str(row['Part of Speech'])
|
||||
pos = str(row["Part of Speech"])
|
||||
pos_tags = {
|
||||
'Adverb': 'תוארי_הפועל',
|
||||
'Pronoun': 'כינויי_גוף',
|
||||
'Noun': 'שם_עצם',
|
||||
'Verb': 'פעלים',
|
||||
'Adjective': 'שם_תואר',
|
||||
'Preposition': 'מילות_יחס',
|
||||
'Conjunction': 'מילות_חיבור',
|
||||
'Particle': 'מילית'
|
||||
"Adverb": "תוארי_הפועל",
|
||||
"Pronoun": "כינויי_גוף",
|
||||
"Noun": "שם_עצם",
|
||||
"Verb": "פעלים",
|
||||
"Adjective": "שם_תואר",
|
||||
"Preposition": "מילות_יחס",
|
||||
"Conjunction": "מילות_חיבור",
|
||||
"Particle": "מילית",
|
||||
}
|
||||
|
||||
for key, value in pos_tags.items():
|
||||
|
|
@@ -190,9 +187,9 @@ def modify_for_anki(df: pd.DataFrame) -> pd.DataFrame:
|
|||
tag_parts.append(value)
|
||||
break
|
||||
|
||||
tags.append(' '.join(tag_parts))
|
||||
tags.append(" ".join(tag_parts))
|
||||
|
||||
df['tags'] = tags
|
||||
df["tags"] = tags
|
||||
logger.info("Anki preparation complete.")
|
||||
return df
|
||||
|
||||
|
|
@@ -201,11 +198,11 @@ def main():
|
|||
"""Main entry point."""
|
||||
try:
|
||||
df = extract_from_website()
|
||||
df.to_csv('hebrew_dict.csv', index=True)
|
||||
df.to_csv("hebrew_dict.csv", index=True)
|
||||
logger.info("Saved: hebrew_dict.csv")
|
||||
|
||||
df = modify_for_anki(df)
|
||||
df.to_csv('hebrew_dict_for_anki.csv', sep=';', index=True)
|
||||
df.to_csv("hebrew_dict_for_anki.csv", sep=";", index=True)
|
||||
logger.info("Saved: hebrew_dict_for_anki.csv")
|
||||
|
||||
logger.info("Complete!")
|
||||
|
|
@@ -215,5 +212,5 @@ def main():
|
|||
raise
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
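The loop-bound change in this file, from `range(1, total_pages)` to `range(1, total_pages + 1)`, matters because `range` excludes its stop value, so the earlier bounds never requested the final page. A minimal sketch of the difference follows; both function names are hypothetical and exist only for this illustration.

```python
def page_numbers(total_pages: int) -> list[int]:
    """Return every 1-based page number, including the last one."""
    # range() excludes its stop value, so the stop must be total_pages + 1.
    return list(range(1, total_pages + 1))


def page_numbers_off_by_one(total_pages: int) -> list[int]:
    """The earlier loop bounds: the final page is never visited."""
    return list(range(1, total_pages))


if __name__ == "__main__":
    print(page_numbers(3))             # [1, 2, 3]
    print(page_numbers_off_by_one(3))  # [1, 2]
```

With the hardcoded count of 608 pages, the earlier bounds would stop at page 607.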
|
|
|
|||
8
helpers.py
Normal file
@@ -0,0 +1,8 @@
"""Shared helper functions for the Hebrew Flash Cards project."""

import unicodedata


def strip_nikkud(text: str) -> str:
    """Remove Hebrew nikkud (diacritics) from a string."""
    return "".join(ch for ch in unicodedata.normalize("NFD", text) if unicodedata.category(ch) != "Mn")
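A short usage sketch for `strip_nikkud`, with the function body copied from `helpers.py` so the snippet runs on its own. The sample strings are illustrative and not taken from the project data. One thing it shows: the filter drops every character in Unicode category `Mn`, so it also removes Latin combining accents, not only Hebrew points.

```python
import unicodedata


def strip_nikkud(text: str) -> str:
    """Remove Hebrew nikkud (diacritics) from a string."""
    return "".join(ch for ch in unicodedata.normalize("NFD", text) if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    # Shin dot, qamats and holam are all category Mn, so only the letters remain.
    print(strip_nikkud("שָׁלוֹם"))  # שלום
    # NFD splits "é" into "e" plus a combining acute accent, which is also Mn.
    print(strip_nikkud("café"))    # cafe
```

If Hebrew-only stripping were ever needed, the filter would have to test the code point range as well; that is an observation about the technique, not something this commit does.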
|
|
@@ -22,13 +22,13 @@ import argparse
|
|||
import json
|
||||
import logging
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
import unicodedata
|
||||
from pathlib import Path
|
||||
|
||||
import requests
|
||||
|
||||
from helpers import strip_nikkud as _strip_nikkud
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
DATA_DIR = Path(__file__).parent / "data"
|
||||
|
|
@@ -40,23 +40,26 @@ REQUEST_TIMEOUT = 10
|
|||
|
||||
# Abstract noun suffixes — words whose English meaning ends in these are skipped
|
||||
ABSTRACT_SUFFIXES = (
|
||||
"tion", "ity", "ness", "ment", "ance", "ence", "ism",
|
||||
"hood", "ship", "ure", "age",
|
||||
"tion",
|
||||
"ity",
|
||||
"ness",
|
||||
"ment",
|
||||
"ance",
|
||||
"ence",
|
||||
"ism",
|
||||
"hood",
|
||||
"ship",
|
||||
"ure",
|
||||
"age",
|
||||
)
|
||||
|
||||
session = requests.Session()
|
||||
session.headers.update({
|
||||
"User-Agent": "pealim-anki/3.0 (educational Hebrew Anki deck builder; contact: anki@pealim.invalid)"
|
||||
})
|
||||
|
||||
|
||||
def _strip_nikkud(text: str) -> str:
|
||||
return "".join(
|
||||
ch for ch in unicodedata.normalize("NFD", text)
|
||||
if unicodedata.category(ch) != "Mn"
|
||||
session.headers.update(
|
||||
{"User-Agent": "pealim-anki/3.0 (educational Hebrew Anki deck builder; contact: anki@pealim.invalid)"}
|
||||
)
|
||||
|
||||
|
||||
|
||||
def is_concrete(english_meaning: str) -> bool:
|
||||
"""Return True if the English meaning looks like a concrete noun."""
|
||||
meaning = english_meaning.strip().lower()
|
||||
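The rest of `is_concrete` is outside this hunk, so the following is only a sketch of how a suffix filter of this kind can be written, assuming a plain "ends with any listed suffix" rule. The helper name `looks_abstract` is hypothetical; the suffix tuple is the one shown above.

```python
ABSTRACT_SUFFIXES = (
    "tion", "ity", "ness", "ment", "ance", "ence", "ism",
    "hood", "ship", "ure", "age",
)


def looks_abstract(english_meaning: str) -> bool:
    """Hypothetical helper: True if the gloss ends in an abstract-noun suffix."""
    meaning = english_meaning.strip().lower()
    # str.endswith accepts a tuple and returns True if any suffix matches.
    return meaning.endswith(ABSTRACT_SUFFIXES)


if __name__ == "__main__":
    for gloss in ("education", "table", "village"):
        print(gloss, looks_abstract(gloss))
```

Under this assumed rule, a purely suffix-based test also flags concrete words such as "village" (ends in "age"), which is the usual trade-off of this heuristic.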
|
|
@@ -196,7 +199,7 @@ def load_cache() -> dict:
|
|||
try:
|
||||
with open(CACHE_PATH, encoding="utf-8") as f:
|
||||
return json.load(f)
|
||||
except Exception:
|
||||
except Exception: # noqa: S110
|
||||
pass
|
||||
return {}
|
||||
|
||||
|
|
|
|||
|
|
@@ -1,187 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
Extract Hebrew vocabulary from pealim.com dictionary.
|
||||
Scrapes word entries, roots, and parts of speech for Anki flashcards.
|
||||
"""
|
||||
|
||||
import requests
|
||||
import pandas as pd
|
||||
import logging
|
||||
import time
|
||||
from typing import Optional
|
||||
|
||||
# Configure logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(levelname)s - %(message)s'
|
||||
)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Session for connection pooling
|
||||
session = requests.Session()
|
||||
session.headers.update({
|
||||
'User-Agent': 'Mozilla/5.0 (compatible; pealim-scraper/1.0)'
|
||||
})
|
||||
|
||||
PEALIM_DICT_URL = "https://www.pealim.com/dict/"
|
||||
REQUEST_DELAY = 1.5 # seconds between requests (respectful scraping)
|
||||
REQUEST_TIMEOUT = 10 # seconds
|
||||
|
||||
|
||||
def get_total_pages() -> int:
|
||||
"""Dynamically determine total pages from first request."""
|
||||
try:
|
||||
logger.info("Fetching total page count...")
|
||||
cookies = {'translit': 'none', 'hebstyle': 'mo'}
|
||||
response = session.get(PEALIM_DICT_URL, cookies=cookies, timeout=REQUEST_TIMEOUT)
|
||||
response.raise_for_status()
|
||||
|
||||
dfs = pd.read_html(response.content)
|
||||
if dfs:
|
||||
# Estimate pages from first page (typically 15 words per page)
|
||||
# For now, use hardcoded value but this could be improved
|
||||
return 608
|
||||
except Exception as e:
|
||||
logger.error(f"Error fetching page count: {e}. Using default (608).")
|
||||
return 608
|
||||
|
||||
|
||||
def extract_from_website(max_pages: Optional[int] = None) -> pd.DataFrame:
|
||||
"""
|
||||
Extract dictionary entries from pealim.com.
|
||||
|
||||
Args:
|
||||
max_pages: Maximum pages to scrape (None = all)
|
||||
|
||||
Returns:
|
||||
DataFrame with Word, Root, Part of Speech, and Word Without Nikkud columns
|
||||
"""
|
||||
total_pages = max_pages or get_total_pages()
|
||||
logger.info(f"Starting extraction from {total_pages} pages...")
|
||||
|
||||
df = pd.DataFrame()
|
||||
|
||||
for page_num in range(1, total_pages):
|
||||
try:
|
||||
url = f"{PEALIM_DICT_URL}?page={page_num}"
|
||||
|
||||
# First request: with nikkud
|
||||
cookies = {'translit': 'none', 'hebstyle': 'mo'}
|
||||
response = session.get(url, cookies=cookies, timeout=REQUEST_TIMEOUT)
|
||||
response.raise_for_status()
|
||||
df_list = pd.read_html(response.content)
|
||||
|
||||
# Second request: without nikkud
|
||||
cookies = {'translit': 'none', 'hebstyle': 'vl', 'showmeaning': 'off'}
|
||||
response = session.get(url, cookies=cookies, timeout=REQUEST_TIMEOUT)
|
||||
response.raise_for_status()
|
||||
without_nikkud_words = pd.read_html(response.content)[-1]['Word']
|
||||
without_nikkud_words = without_nikkud_words.rename('Word Without Nikkud')
|
||||
|
||||
# Combine and append
|
||||
df_to_add = pd.concat([df_list[-1], without_nikkud_words], axis=1)
|
||||
df = pd.concat([df, df_to_add], ignore_index=True)
|
||||
|
||||
if page_num % 50 == 0:
|
||||
logger.info(f"Processed {page_num}/{total_pages} pages...")
|
||||
|
||||
time.sleep(REQUEST_DELAY)
|
||||
|
||||
except requests.RequestException as e:
|
||||
logger.error(f"Error fetching page {page_num}: {e}. Retrying...")
|
||||
time.sleep(REQUEST_DELAY * 2)
|
||||
except Exception as e:
|
||||
logger.error(f"Unexpected error on page {page_num}: {e}")
|
||||
continue
|
||||
|
||||
logger.info(f"Extraction complete. Total words: {len(df)}")
|
||||
return df
|
||||
|
||||
|
||||
def modify_for_anki(df: pd.DataFrame) -> pd.DataFrame:
|
||||
"""
|
||||
Transform dictionary DataFrame for Anki import.
|
||||
Adds shared root words and Hebrew tags.
|
||||
|
||||
Args:
|
||||
df: Dictionary DataFrame
|
||||
|
||||
Returns:
|
||||
Modified DataFrame ready for Anki
|
||||
"""
|
||||
logger.info("Preparing data for Anki...")
|
||||
|
||||
# Find shared root words
|
||||
shared_root_words = []
|
||||
for idx, row in df.iterrows():
|
||||
root = row['Root']
|
||||
word = row['Word']
|
||||
|
||||
if root != '-' and pd.notna(root):
|
||||
# Find other words with same root
|
||||
same_root = df[(df['Root'] == root) & (df['Word'] != word)]['Word'].values
|
||||
shared = ' '.join(str(w) for w in same_root)
|
||||
shared_root_words.append(shared)
|
||||
else:
|
||||
shared_root_words.append('')
|
||||
|
||||
df['shared roots'] = shared_root_words
|
||||
|
||||
# Generate Hebrew tags
|
||||
tags = []
|
||||
for idx, row in df.iterrows():
|
||||
tag_parts = []
|
||||
|
||||
# Root tag
|
||||
root = str(row['Root']).replace(' ', '').replace('-', '')
|
||||
if 'nan' not in root and root:
|
||||
root_clean = root.replace('.', '')
|
||||
tag_parts.append(f"שורש::{root_clean}")
|
||||
|
||||
# Part of speech tag
|
||||
pos = str(row['Part of Speech'])
|
||||
pos_tags = {
|
||||
'Adverb': 'תוארי_הפועל',
|
||||
'Pronoun': 'כינויי_גוף',
|
||||
'Noun': 'שם_עצם',
|
||||
'Verb': 'פעלים',
|
||||
'Adjective': 'שם_תואר',
|
||||
'Preposition': 'מילות_יחס',
|
||||
'Conjunction': 'מילות_חיבור',
|
||||
'Particle': 'מילית'
|
||||
}
|
||||
|
||||
for key, value in pos_tags.items():
|
||||
if key in pos:
|
||||
tag_parts.append(value)
|
||||
break
|
||||
|
||||
tags.append(' '.join(tag_parts))
|
||||
|
||||
df['tags'] = tags
|
||||
logger.info("Anki preparation complete.")
|
||||
return df
|
||||
|
||||
|
||||
def main():
|
||||
"""Main entry point."""
|
||||
try:
|
||||
# Extract from website
|
||||
df = extract_from_website()
|
||||
df.to_csv('pealim_dict.csv', index=True)
|
||||
logger.info("Saved: pealim_dict.csv")
|
||||
|
||||
# Transform for Anki
|
||||
df = modify_for_anki(df)
|
||||
df.to_csv('pealim_dict_for_anki.csv', sep=';', index=True)
|
||||
logger.info("Saved: pealim_dict_for_anki.csv")
|
||||
|
||||
logger.info("✅ Complete!")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Fatal error: {e}")
|
||||
raise
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
80
pyproject.toml
Normal file
@@ -0,0 +1,80 @@
[project]
name = "hebrew-flash-cards"
version = "0.13"
description = "Hebrew vocabulary & verb conjugation flashcards for Anki"
requires-python = ">=3.11"
dependencies = [
    "beautifulsoup4>=4.11.0",
    "genanki>=0.8.0",
    "lxml>=4.9.0",
    "numpy>=1.21.0",
    "pandas>=1.3.0",
    "pymupdf>=1.23.0",
    "pypdf>=3.0.0",
    "python-bidi>=0.4.2",
    "requests>=2.26.0",
]

[project.optional-dependencies]
dev = [
    "bandit",
    "pytest",
    "ruff",
    "vulture",
]

[tool.pytest.ini_options]
testpaths = ["tests"]

[tool.ruff]
target-version = "py311"
line-length = 120
exclude = [
    "lib/",
    "bin/",
    "include/",
    "lib64/",
    "archive/",
    "venv/",
]

[tool.ruff.lint]
select = [
    "E",  # pycodestyle errors
    "W",  # pycodestyle warnings
    "F",  # pyflakes
    "I",  # isort
    "UP",  # pyupgrade
    "B",  # flake8-bugbear
    "SIM",  # flake8-simplify
    "PIE",  # flake8-pie
    "T20",  # flake8-print (flag print statements)
    "RET",  # flake8-return
    "C4",  # flake8-comprehensions
    "S",  # flake8-bandit (security)
]
ignore = [
    "T201",  # allow print() — this is a CLI tool, not a library
    "S603",  # subprocess call with shell=False is fine
    "S607",  # partial executable path is fine for CLI tools
    "S105",  # PASS = "✓" is not a password
    "S108",  # /tmp paths are intentional for temp downloads
    "S311",  # random.Random() is for card ordering, not crypto
    "E501",  # line too long — handled by formatter
]

[tool.ruff.lint.per-file-ignores]
"test_*.py" = ["S101"]  # allow assert in tests

[tool.ruff.format]
quote-style = "double"
indent-style = "space"

[tool.vulture]
paths = ["."]
exclude = ["lib/", "bin/", "include/", "lib64/", "venv/", "archive/"]
min_confidence = 80

[tool.bandit]
exclude_dirs = ["lib", "bin", "include", "lib64", "venv", "archive"]
skips = ["B101"]  # allow assert
183
rebuild_sentence_matches.py
Normal file
@@ -0,0 +1,183 @@
#!/usr/bin/env python3
"""
Rebuild vocab_sentence_matches.json using both direct word matching
and ktiv male conjugated/declined form matching.

This dramatically improves sentence coverage by matching not just
dictionary forms but all conjugated verbs and declined nouns.
"""

import json
import logging
import re
from pathlib import Path

import pandas as pd

from helpers import strip_nikkud as _strip_nikkud

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger(__name__)

DATA_DIR = Path(__file__).parent / "data"


def main():
    # Load sentences
    with open(DATA_DIR / "epub_sentence_index.json") as f:
        sentences = json.load(f).get("sentences", [])
    logger.info(f"Loaded {len(sentences)} sentences")

    # Load vocab CSV
    csv_path = DATA_DIR / "hebrew_dict_for_anki.csv"
    try:
        df = pd.read_csv(csv_path, sep=";", index_col=0)
        if df.shape[1] < 3:
            raise ValueError
    except (ValueError, pd.errors.ParserError):
        df = pd.read_csv(csv_path, index_col=0)
    logger.info(f"Loaded {len(df)} vocab entries")

    # Build word lookup: stripped_form → (word_nikkud, word_no_nikkud)
    word_lookup: dict[str, list[tuple[str, str]]] = {}
    for _, row in df.iterrows():
        word = str(row.get("Word", "")).strip()
        wni = str(row.get("Word Without Nikkud", "")).strip()
        if not word or word in ("nan", "None"):
            continue
        stripped = _strip_nikkud(word)
        if stripped:
            word_lookup.setdefault(stripped, []).append((word, wni))

    # Load ktiv male forms: ktiv_male_form → [{word_nikkud, form_type, ...}]
    ktiv_path = DATA_DIR / "ktiv_male_forms.json"
    ktiv_forms: dict[str, list[dict]] = {}
    if ktiv_path.exists():
        with open(ktiv_path) as f:
            ktiv_forms = json.load(f)
        logger.info(f"Loaded {len(ktiv_forms)} ktiv male forms")
    else:
        logger.warning("No ktiv_male_forms.json — only using direct matching")

    # Build reverse lookup: ktiv_male → set of dictionary words (nikkud)
    ktiv_to_word: dict[str, set[str]] = {}
    for ktiv, entries in ktiv_forms.items():
        for entry in entries:
            word_nikkud = entry.get("word_nikkud", "")
            if word_nikkud:
                ktiv_to_word.setdefault(ktiv, set()).add(word_nikkud)

    # Also add all vocab words' own stripped forms to ktiv_to_word
    for stripped, entries in word_lookup.items():
        for word_nikkud, _ in entries:
            ktiv_to_word.setdefault(stripped, set()).add(word_nikkud)

    logger.info(f"Total matchable forms: {len(ktiv_to_word)}")

    # Tokenize all sentences once
    sentence_tokens: list[tuple[dict, list[str]]] = []
    for s in sentences:
        stripped = s.get("stripped", _strip_nikkud(s.get("text", "")))
        tokens = [re.sub(r'[.,!?;:"\'\u05be]', "", t) for t in stripped.split()]
        tokens = [t for t in tokens if t]  # remove empty
        sentence_tokens.append((s, tokens))

    # Match: for each sentence token, check ktiv_to_word lookup
    # Build word_nikkud → [sentence_info]
    matches: dict[str, list[dict]] = {}  # word_nikkud → [sentences]

    for sent, tokens in sentence_tokens:
        text = sent.get("text", "")
        book = sent.get("book", "")
        word_len = len(tokens)

        # Skip sentences that are too short or too long
        if word_len < 4 or word_len > 15:
            continue

        for tok in tokens:
            if tok in ktiv_to_word:
                for word_nikkud in ktiv_to_word[tok]:
                    matches.setdefault(word_nikkud, []).append(
                        {
                            "text": text,
                            "book": book,
                            "matched_form": tok,
                            "word_count": word_len,
                        }
                    )

    logger.info(f"Words with at least 1 match: {len(matches)}")

    # Deduplicate and limit to 3 best sentences per word
    # Prefer shorter sentences (6-12 words ideal)
    output: dict[str, dict] = {}
    for word_nikkud, sents in matches.items():
        # Deduplicate by text
        seen_texts = set()
        unique = []
        for s in sents:
            if s["text"] not in seen_texts:
                seen_texts.add(s["text"])
                unique.append(s)

        # Score: prefer 6-12 word sentences
        def score(s):
            wc = s["word_count"]
            if 6 <= wc <= 12:
                return 0  # ideal
            return abs(wc - 9)  # distance from ideal

        unique.sort(key=score)
        best = unique[:3]

        # Find the Word Without Nikkud for this word
        stripped = _strip_nikkud(word_nikkud)
        wni = stripped  # default
        if stripped in word_lookup:
            for wn, w_wni in word_lookup[stripped]:
                if wn == word_nikkud:
                    wni = w_wni
                    break

        output[wni] = {
            "word_nikkud": word_nikkud,
            "sentences": [{"text": s["text"], "book": s["book"]} for s in best],
        }

    # Save
    out_path = DATA_DIR / "vocab_sentence_matches.json"
    with open(out_path, "w") as f:
        json.dump(output, f, ensure_ascii=False, indent=1)

    total_sents = sum(len(v["sentences"]) for v in output.values())
    logger.info(f"Saved {len(output)} words with {total_sents} sentences → {out_path}")

    # Stats
    total_vocab = len(df)
    pct = len(output) * 100 / total_vocab
    logger.info(f"Coverage: {len(output)}/{total_vocab} ({pct:.1f}%)")

    # Breakdown by match type
    direct_only = 0
    ktiv_only = 0
    both = 0
    for _wni, info in output.items():
        word = info["word_nikkud"]
        stripped = _strip_nikkud(word)
        has_direct = stripped in word_lookup
        has_ktiv = any(s.get("matched_form", "") != stripped for s in info["sentences"])
        if has_direct and has_ktiv:
            both += 1
        elif has_ktiv:
            ktiv_only += 1
        else:
            direct_only += 1

    logger.info(f" Direct matches only: {direct_only}")
    logger.info(f" Ktiv male matches only: {ktiv_only}")
    logger.info(f" Both: {both}")


if __name__ == "__main__":
    main()
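The selection step in this script (deduplicate by text, score by word count, keep three) can be exercised in isolation. The sketch below restates that logic as two standalone functions; the names `sentence_score` and `pick_best` and the sample candidates are hypothetical, while the 6-12 band, the distance-from-9 fallback and the limit of 3 come from the script.

```python
def sentence_score(word_count: int) -> int:
    """Lower is better: 0 inside the 6-12 word band, else distance from 9."""
    if 6 <= word_count <= 12:
        return 0
    return abs(word_count - 9)


def pick_best(candidates: list[dict], limit: int = 3) -> list[dict]:
    """Drop duplicate texts (first occurrence wins), then keep the best-scoring few."""
    seen: set[str] = set()
    unique = []
    for cand in candidates:
        if cand["text"] not in seen:
            seen.add(cand["text"])
            unique.append(cand)
    # sorted() is stable, so equally scored sentences keep their input order.
    return sorted(unique, key=lambda c: sentence_score(c["word_count"]))[:limit]


if __name__ == "__main__":
    sample = [
        {"text": "a", "word_count": 15},
        {"text": "b", "word_count": 8},
        {"text": "a", "word_count": 15},  # duplicate text, dropped
        {"text": "c", "word_count": 4},
        {"text": "d", "word_count": 12},
    ]
    print([c["text"] for c in pick_best(sample)])  # ['b', 'd', 'c']
```

Because the sort is stable, ties inside the ideal band are resolved by the order in which sentences were matched, not by length.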
134
run.py
|
|
@@ -6,7 +6,7 @@ Usage:
|
|||
python run.py [options]
|
||||
|
||||
Options:
|
||||
--only {vocab,conjugations} Run only one deck (skips all unrelated steps)
|
||||
--only {vocab,conjugations,confusables,plurals,complete} Run only one deck
|
||||
--skip-scrape Use existing data/pealim_dict.csv (no pealim.com dict scraping)
|
||||
--skip-audio Skip audio .mp3 downloads
|
||||
--skip-examples Skip Ben Yehuda example fetching
|
||||
|
|
@@ -22,9 +22,10 @@ import logging
|
|||
import re
|
||||
import sys
|
||||
import time
|
||||
import unicodedata
|
||||
from pathlib import Path
|
||||
|
||||
from helpers import strip_nikkud
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
|
||||
logging.basicConfig(
|
||||
|
|
@@ -42,11 +43,19 @@ FONTS_DIR = DATA_DIR / "fonts"
|
|||
|
||||
def parse_args():
|
||||
p = argparse.ArgumentParser(description="Pealim Anki deck builder")
|
||||
p.add_argument("--only", choices=["vocab", "conjugations"], help="Run only one deck (skips all unrelated steps)")
|
||||
p.add_argument(
|
||||
"--only",
|
||||
choices=["vocab", "conjugations", "confusables", "plurals", "complete"],
|
||||
help="Run only one deck (skips all unrelated steps)",
|
||||
)
|
||||
p.add_argument("--skip-scrape", action="store_true", help="Skip dict scraping; use cached CSV")
|
||||
p.add_argument("--skip-audio", action="store_true", help="Skip audio downloads")
|
||||
p.add_argument("--skip-examples", action="store_true", help="Skip Ben Yehuda example lookup")
|
||||
p.add_argument("--skip-conjugations", action="store_true", help="Skip verb conjugation extraction (deprecated: use --only vocab)")
|
||||
p.add_argument(
|
||||
"--skip-conjugations",
|
||||
action="store_true",
|
||||
help="Skip verb conjugation extraction (deprecated: use --only vocab)",
|
||||
)
|
||||
p.add_argument("--skip-images", action="store_true", help="Skip image fetching")
|
||||
p.add_argument("--refresh-examples", action="store_true", help="Force rebuild of Ben Yehuda index")
|
||||
p.add_argument("--test", type=int, metavar="N", help="Limit to first N words")
|
||||
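The widened `--only` option relies on argparse's `choices` validation: any value outside the list makes the parser print an error and exit with status 2. A minimal sketch under that reading follows; `build_parser` is a hypothetical name, and only two of the options from `parse_args` are reproduced.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical cut-down version of parse_args() with two of its options."""
    p = argparse.ArgumentParser(description="Pealim Anki deck builder")
    p.add_argument(
        "--only",
        choices=["vocab", "conjugations", "confusables", "plurals", "complete"],
        help="Run only one deck (skips all unrelated steps)",
    )
    p.add_argument("--test", type=int, metavar="N", help="Limit to first N words")
    return p


if __name__ == "__main__":
    args = build_parser().parse_args(["--only", "plurals", "--test", "50"])
    print(args.only, args.test)  # plurals 50
```

When `--only` is omitted, `args.only` is `None`, which is how the caller can distinguish "build everything" from a single-deck run.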
|
|
@@ -59,8 +68,6 @@ def step_scrape(args):
|
|||
anki_csv = DATA_DIR / "hebrew_dict_for_anki.csv"
|
||||
# Legacy fallback names
|
||||
legacy_dict = DATA_DIR / "pealim_dict.csv"
|
||||
legacy_anki = DATA_DIR / "pealim_dict_for_anki.csv"
|
||||
|
||||
if args.skip_scrape:
|
||||
if dict_csv.exists():
|
||||
logger.info(f"[1] Using existing {dict_csv}")
|
||||
|
|
@@ -72,8 +79,8 @@ def step_scrape(args):
|
|||
return
|
||||
|
||||
logger.info("[1] Scraping dictionary from pealim.com …")
|
||||
|
||||
import hebrew_extract
|
||||
import pandas as pd
|
||||
|
||||
df = hebrew_extract.extract_from_website()
|
||||
df.to_csv(dict_csv, index=True)
|
||||
|
|
@@ -88,6 +95,7 @@ def step_frequency() -> dict[str, int]:
|
|||
"""Step 2 — load/download word frequency data."""
|
||||
logger.info("[2] Loading word frequency data …")
|
||||
import frequency_lookup
|
||||
|
||||
frequency_lookup.load()
|
||||
return frequency_lookup._freq
|
||||
|
||||
|
|
@@ -104,6 +112,7 @@ def step_examples(args, freq_cache: dict):
|
|||
|
||||
logger.info("[3] Loading Ben Yehuda example index …")
|
||||
import benyehuda
|
||||
|
||||
benyehuda.load(force_rebuild=args.refresh_examples)
|
||||
|
||||
dict_csv = DATA_DIR / "hebrew_dict_for_anki.csv"
|
||||
|
|
@@ -116,6 +125,7 @@ def step_examples(args, freq_cache: dict):
|
|||
|
||||
try:
|
||||
import pandas as pd
|
||||
|
||||
try:
|
||||
df = pd.read_csv(dict_csv, sep=";", index_col=0)
|
||||
if df.shape[1] < 3:
|
||||
|
|
@@ -158,6 +168,7 @@ def step_audio(args):
|
|||
|
||||
import pandas as pd
|
||||
import requests
|
||||
|
||||
try:
|
||||
try:
|
||||
df = pd.read_csv(dict_csv, sep=";", index_col=0)
|
||||
|
|
@@ -166,7 +177,7 @@ def step_audio(args):
|
|||
except (ValueError, pd.errors.ParserError):
|
||||
df = pd.read_csv(dict_csv, index_col=0)
|
||||
|
||||
if 'audio_url' not in df.columns:
|
||||
if "audio_url" not in df.columns:
|
||||
logger.warning(" No audio_url column in CSV — re-scrape with hebrew_extract.py to capture audio URLs")
|
||||
return
|
||||
|
||||
|
|
@@ -178,10 +189,6 @@ def step_audio(args):
|
|||
skipped = 0
|
||||
no_url = 0
|
||||
|
||||
def strip_nik(t: str) -> str:
|
||||
return "".join(c for c in unicodedata.normalize("NFD", t)
|
||||
if unicodedata.category(c) != "Mn")
|
||||
|
||||
for _, row in df.iterrows():
|
||||
word = str(row.get("Word", "")).strip()
|
||||
word_plain = str(row.get("Word Without Nikkud", "")).strip()
|
||||
|
|
@@ -190,7 +197,7 @@ def step_audio(args):
|
|||
if not word:
|
||||
continue
|
||||
|
||||
safe_name = re.sub(r"[^\u05d0-\u05ea]", "", strip_nik(word_plain or word))
|
||||
safe_name = re.sub(r"[^\u05d0-\u05ea]", "", strip_nikkud(word_plain or word))
|
||||
if not safe_name:
|
||||
continue
|
||||
mp3_path = AUDIO_DIR / f"{safe_name}.mp3"
|
||||
|
|
@@ -228,11 +235,12 @@ def step_conj_audio(args, conjugations: dict):
|
|||
AUDIO_CONJ_DIR.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
import requests
|
||||
|
||||
downloaded = 0
|
||||
skipped = 0
|
||||
failed = 0
|
||||
|
||||
for infinitive, data in conjugations.items():
|
||||
for _infinitive, data in conjugations.items():
|
||||
if not data or not data.get("forms"):
|
||||
continue
|
||||
|
||||
|
|
@@ -282,10 +290,7 @@ def step_conj_audio(args, conjugations: dict):
|
|||
logger.debug(f" Conj audio failed {filename}: {e}")
|
||||
failed += 1
|
||||
|
||||
logger.info(
|
||||
f" Conjugation audio: {downloaded} downloaded, "
|
||||
f"{skipped} cached, {failed} failed"
|
||||
)
|
||||
logger.info(f" Conjugation audio: {downloaded} downloaded, {skipped} cached, {failed} failed")
|
||||
|
||||
|
||||
def step_fonts(args):
|
||||
|
|
@@ -302,6 +307,7 @@ def step_fonts(args):
|
|||
|
||||
# Fetch CSS to get actual TTF source URLs (static subset for Hebrew + Latin)
|
||||
import requests as _req
|
||||
|
||||
headers = {
|
||||
# Request TTF (not woff2) so Anki can embed them
|
||||
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/120.0"
|
||||
|
|
@@ -355,10 +361,13 @@ def step_images(args) -> dict:
|
|||
limit = args.test # When in test mode, limit images too
|
||||
logger.info("[4d] Fetching images for concrete nouns …")
|
||||
import image_fetch
|
||||
|
||||
return image_fetch.run(limit=limit)
|
||||
|
||||
|
||||
def step_build_all(args, examples_cache: dict, freq_cache: dict, conjugations: dict | None, image_cache: dict | None = None):
|
||||
def step_build_all(
|
||||
args, examples_cache: dict, freq_cache: dict, conjugations: dict | None, image_cache: dict | None = None
|
||||
):
|
||||
"""Step 5 — build all 6 release variants (4 vocab + 2 conj)."""
|
||||
logger.info("[5] Building all deck variants …")
|
||||
import apkg_builder
|
||||
|
|
@ -394,6 +403,7 @@ def step_conjugations(args):
|
|||
logger.info("[6] --skip-conjugations: loading from cache …")
|
||||
with open(conj_cache) as f:
|
||||
import json as _json
|
||||
|
||||
return _json.load(f)
|
||||
logger.info("[6] --skip-conjugations: no cache found, skipping conj decks")
|
||||
return None
|
||||
|
|
@ -407,10 +417,12 @@ def step_conjugations(args):
|
|||
logger.info("[6] Using cached conjugations.json …")
|
||||
with open(conj_cache) as f:
|
||||
import json as _json
|
||||
|
||||
conjugations = _json.load(f)
|
||||
else:
|
||||
logger.info("[6] Extracting verb conjugations …")
|
||||
import conjugation_extract
|
||||
|
||||
conjugations = conjugation_extract.main(verbs_file)
|
||||
|
||||
# Download conjugation audio
|
||||
|
|
@ -434,6 +446,7 @@ def print_summary(args, examples_cache, freq_cache, conjugations):
|
|||
dict_csv = DATA_DIR / "pealim_dict.csv"
|
||||
if dict_csv.exists():
|
||||
import pandas as pd
|
||||
|
||||
try:
|
||||
df = pd.read_csv(dict_csv, sep=";", index_col=0)
|
||||
if df.shape[1] < 3:
|
||||
|
|
@ -455,9 +468,9 @@ def print_summary(args, examples_cache, freq_cache, conjugations):
|
|||
if AUDIO_CONJ_DIR.exists():
|
||||
# Count only files that will be bundled: active non-infinitive forms
|
||||
# (excludes {slug}_passive_* and {slug}_infinitive.mp3 on-disk extras)
|
||||
mp3s = [p for p in AUDIO_CONJ_DIR.glob("*.mp3")
|
||||
if not p.stem.endswith("_infinitive")
|
||||
and "_passive_" not in p.stem]
|
||||
mp3s = [
|
||||
p for p in AUDIO_CONJ_DIR.glob("*.mp3") if not p.stem.endswith("_infinitive") and "_passive_" not in p.stem
|
||||
]
|
||||
logger.info(f" Conjugation audio files (bundled): {len(mp3s)}")
|
||||
|
||||
image_cache_path = DATA_DIR / "image_cache.json"
|
||||
|
|
@ -468,9 +481,18 @@ def print_summary(args, examples_cache, freq_cache, conjugations):
|
|||
logger.info(f" Images: {found_imgs}/{len(ic)} nouns with images")
|
||||
|
||||
import apkg_builder as _ab
|
||||
|
||||
all_apkgs = [
|
||||
_ab.VOCAB_APKG, _ab.VOCAB_APKG_AUDIO, _ab.VOCAB_APKG_IMAGES, _ab.VOCAB_APKG_AUDIO_IMAGES,
|
||||
_ab.CONJ_APKG, _ab.CONJ_APKG_AUDIO,
|
||||
_ab.VOCAB_APKG,
|
||||
_ab.VOCAB_APKG_AUDIO,
|
||||
_ab.VOCAB_APKG_IMAGES,
|
||||
_ab.VOCAB_APKG_AUDIO_IMAGES,
|
||||
_ab.CONJ_APKG,
|
||||
_ab.CONJ_APKG_AUDIO,
|
||||
_ab.CONF_APKG,
|
||||
_ab.CONF_APKG_AUDIO,
|
||||
_ab.COMPLETE_APKG,
|
||||
_ab.COMPLETE_APKG_AUDIO,
|
||||
]
|
||||
for apkg in all_apkgs:
|
||||
if apkg.exists():
|
||||
|
|
@ -502,14 +524,70 @@ def main():
|
|||
conjugations = step_conjugations(args)
|
||||
if conjugations:
|
||||
import apkg_builder
|
||||
apkg_builder.build_all_variants(
|
||||
DATA_DIR / "hebrew_dict_for_anki.csv",
|
||||
conjugations=conjugations,
|
||||
limit=args.test,
|
||||
|
||||
dict_csv = DATA_DIR / "hebrew_dict_for_anki.csv"
|
||||
if not dict_csv.exists():
|
||||
dict_csv = DATA_DIR / "hebrew_dict.csv"
|
||||
for audio, path in [(False, apkg_builder.CONJ_APKG), (True, apkg_builder.CONJ_APKG_AUDIO)]:
|
||||
deck, media = apkg_builder.build_conj_deck(
|
||||
conjugations,
|
||||
include_audio=audio,
|
||||
dict_csv=dict_csv,
|
||||
)
|
||||
apkg_builder.write_conj_apkg(deck, media, out_path=path)
|
||||
print_summary(args, {}, {}, conjugations or {})
|
||||
return
|
||||
|
||||
if args.only == "confusables":
|
||||
step_fonts(args)
|
||||
import apkg_builder
|
||||
|
||||
dict_csv = DATA_DIR / "hebrew_dict_for_anki.csv"
|
||||
for audio, path in [(False, apkg_builder.CONF_APKG), (True, apkg_builder.CONF_APKG_AUDIO)]:
|
||||
deck, media = apkg_builder.build_confusables_deck(dict_csv, include_audio=audio)
|
||||
apkg_builder.write_conf_apkg(deck, media, out_path=path)
|
||||
print_summary(args, {}, {}, {})
|
||||
return
|
||||
|
||||
if args.only == "plurals":
|
||||
step_fonts(args)
|
||||
import apkg_builder
|
||||
|
||||
dict_csv = DATA_DIR / "hebrew_dict_for_anki.csv"
|
||||
if not dict_csv.exists():
|
||||
dict_csv = DATA_DIR / "hebrew_dict.csv"
|
||||
for audio, path in [(False, apkg_builder.PLURAL_APKG), (True, apkg_builder.PLURAL_APKG_AUDIO)]:
|
||||
deck, media = apkg_builder.build_plural_deck(dict_csv=dict_csv, include_audio=audio)
|
||||
apkg_builder.write_plural_apkg(deck, media, out_path=path)
|
||||
print_summary(args, {}, {}, {})
|
||||
return
|
||||
|
||||
if args.only == "complete":
|
||||
step_fonts(args)
|
||||
freq_cache = step_frequency() if not args.skip_scrape else {}
|
||||
examples_cache = step_examples(args, freq_cache) if not args.skip_examples else {}
|
||||
image_cache = step_images(args) if not args.skip_images else {}
|
||||
conjugations = step_conjugations(args)
|
||||
import apkg_builder
|
||||
|
||||
dict_csv = DATA_DIR / "hebrew_dict_for_anki.csv"
|
||||
if not dict_csv.exists():
|
||||
dict_csv = DATA_DIR / "hebrew_dict.csv"
|
||||
emoji_lookup = apkg_builder._load_emoji_lookup()
|
||||
for audio, path in [(False, apkg_builder.COMPLETE_APKG), (True, apkg_builder.COMPLETE_APKG_AUDIO)]:
|
||||
decks, media = apkg_builder.build_complete_deck(
|
||||
dict_csv,
|
||||
conjugations=conjugations or {},
|
||||
examples_cache=examples_cache,
|
||||
freq_cache=freq_cache,
|
||||
image_cache=image_cache,
|
||||
emoji_lookup=emoji_lookup,
|
||||
include_audio=audio,
|
||||
)
|
||||
apkg_builder.write_complete_apkg(decks, media, out_path=path)
|
||||
print_summary(args, examples_cache, freq_cache, conjugations or {})
|
||||
return
|
||||
|
||||
if args.only == "vocab":
|
||||
args.skip_conjugations = True
|
||||
|
||||
|
|
|
|||
405
scripts/extract_pdf_sentences.py
Normal file
|
|
@ -0,0 +1,405 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
Extract sentences from PDF books and match vocab words to sentences.
|
||||
|
||||
1. Extract sentences from alice.pdf and lion_strawberry.pdf
|
||||
2. Merge into existing epub_sentence_index.json
|
||||
3. Match vocab words to sentences, produce vocab_sentence_matches.json
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import sys
|
||||
|
||||
# Use the venv with pymupdf
|
||||
sys.path.insert(0, "/home/node/projects/pealim/venv_pdf/lib/python3.11/site-packages")
|
||||
# Also need the main venv for pandas
|
||||
sys.path.insert(0, "/home/node/projects/pealim/lib/python3.11/site-packages")
|
||||
|
||||
import fitz
|
||||
import pandas as pd
|
||||
|
||||
BASE_DIR = "/home/node/projects/pealim"
|
||||
DATA_DIR = os.path.join(BASE_DIR, "data")
|
||||
EPUBS_DIR = os.path.join(DATA_DIR, "epubs")
|
||||
SENTENCE_INDEX = os.path.join(DATA_DIR, "epub_sentence_index.json")
|
||||
VOCAB_CSV = os.path.join(DATA_DIR, "hebrew_dict_for_anki.csv")
|
||||
MATCHES_FILE = os.path.join(DATA_DIR, "vocab_sentence_matches.json")
|
||||
|
||||
NIKKUD_RE = re.compile(r"[\u0591-\u05C7]")
|
||||
HEBREW_RE = re.compile(r"[\u05d0-\u05ea]")
|
||||
HEBREW_CHAR_RE = re.compile(r"[\u05d0-\u05ea\ufb20-\ufb4f]")
|
||||
|
||||
|
||||
def strip_nikkud(text):
|
||||
"""Remove all Hebrew nikkud/cantillation marks."""
|
||||
return NIKKUD_RE.sub("", text)
|
||||
|
||||
|
||||
def collapse_hebrew_spaces(text):
|
||||
"""Collapse spaces between Hebrew letter fragments (for badly-encoded PDFs).
|
||||
|
||||
Strategy: strip nikkud first, then iteratively remove spaces between
|
||||
Hebrew characters. Real word boundaries are detected by:
|
||||
- Final-form letters (ם ן ף ך ץ) followed by space
|
||||
- Punctuation (.,;:!?"')
|
||||
- Non-Hebrew characters
|
||||
"""
|
||||
stripped = strip_nikkud(text)
|
||||
# Normalize presentation forms to standard Hebrew
|
||||
# Hebrew presentation forms occupy U+FB1D-U+FB4F; the loop below covers U+FB2A-U+FB4F
|
||||
for code in range(0xFB2A, 0xFB50):
|
||||
ch = chr(code)
|
||||
if ch in stripped:
|
||||
# Map shin/sin dots, dagesh forms back to base
|
||||
# FB2A = שׁ (shin+dot), FB2B = שׂ (sin+dot)
|
||||
base_map = {
|
||||
"\ufb2a": "ש",
|
||||
"\ufb2b": "ש",
|
||||
"\ufb35": "ו",
|
||||
"\ufb4b": "ו",
|
||||
"\ufb30": "א",
|
||||
"\ufb31": "ב",
|
||||
"\ufb32": "ג",
|
||||
"\ufb33": "ד",
|
||||
"\ufb34": "ה",
|
||||
"\ufb36": "ז",
|
||||
"\ufb38": "ט",
|
||||
"\ufb39": "י",
|
||||
"\ufb3a": "ך",  # U+FB3A is FINAL kaf with dagesh, so keep the final form
|
||||
"\ufb3b": "כ",
|
||||
"\ufb3c": "ל",
|
||||
"\ufb3e": "מ",
|
||||
"\ufb40": "נ",
|
||||
"\ufb41": "ס",
|
||||
"\ufb43": "ף",  # U+FB43 is FINAL pe with dagesh, so keep the final form
|
||||
"\ufb44": "פ",
|
||||
"\ufb46": "צ",
|
||||
"\ufb47": "ק",
|
||||
"\ufb48": "ר",
|
||||
"\ufb49": "ש",
|
||||
"\ufb4a": "ת",
|
||||
}
|
||||
if ch in base_map:
|
||||
stripped = stripped.replace(ch, base_map[ch])
|
||||
|
||||
# Replace multiple spaces with single
|
||||
stripped = re.sub(r" {2,}", " ", stripped)
|
||||
|
||||
# Now rebuild text, keeping spaces only at word boundaries
|
||||
# Word boundary markers: final-form letters, punctuation, non-Hebrew
|
||||
final_forms = set("םןףךץ")
|
||||
result = []
|
||||
i = 0
|
||||
chars = list(stripped)
|
||||
|
||||
while i < len(chars):
|
||||
if chars[i] != " ":
|
||||
result.append(chars[i])
|
||||
i += 1
|
||||
continue
|
||||
|
||||
# It's a space. Decide if it's a word boundary.
|
||||
# Look back for the last non-space character
|
||||
prev_ch = None
|
||||
for j in range(len(result) - 1, -1, -1):
|
||||
if result[j] != " ":
|
||||
prev_ch = result[j]
|
||||
break
|
||||
|
||||
# Look forward for next non-space character
|
||||
next_ch = None
|
||||
for j in range(i + 1, len(chars)):
|
||||
if chars[j] != " ":
|
||||
next_ch = chars[j]
|
||||
break
|
||||
|
||||
is_boundary = False
|
||||
|
||||
# After final-form letter = word boundary
|
||||
if prev_ch and prev_ch in final_forms:
|
||||
is_boundary = True
|
||||
|
||||
# Before/after punctuation or non-Hebrew = word boundary
|
||||
if prev_ch and not HEBREW_RE.match(prev_ch):
|
||||
is_boundary = True
|
||||
if next_ch and not HEBREW_RE.match(next_ch):
|
||||
is_boundary = True
|
||||
|
||||
# If either side is not Hebrew at all, boundary
|
||||
if prev_ch is None or next_ch is None:
|
||||
is_boundary = True
|
||||
|
||||
if is_boundary:
|
||||
result.append(" ")
|
||||
# else: skip the space (collapse intra-word gap)
|
||||
i += 1
|
||||
|
||||
return "".join(result).strip()
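
The hand-written presentation-form table in `collapse_hebrew_spaces` could arguably be replaced by Unicode normalization. The sketch below is an alternative technique, not what this commit does: it assumes each presentation form of interest decomposes to a base letter plus combining marks, so NFKD followed by the script's own nikkud-stripping range leaves the base letter. The helper name `normalize_hebrew` is hypothetical.

```python
import re
import unicodedata

# Same character range as NIKKUD_RE in the script.
NIKKUD_RE = re.compile(r"[\u0591-\u05C7]")


def normalize_hebrew(text: str) -> str:
    """Decompose presentation forms, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return NIKKUD_RE.sub("", decomposed)


# U+FB2A is shin with shin dot; U+FB3A is final kaf with dagesh.
shin = normalize_hebrew("\ufb2a")
final_kaf = normalize_hebrew("\ufb3a")
```

Unlike the manual table, this route keeps final letters final, which matters because the function uses final forms as word-boundary markers.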
|
||||
|
||||
|
||||
def extract_pdf_sentences(pdf_path, book_name):
|
||||
"""Extract sentences from a PDF file."""
|
||||
doc = fitz.open(pdf_path)
|
||||
sentences = []
|
||||
|
||||
for page_num in range(len(doc)):
|
||||
page = doc[page_num]
|
||||
text = page.get_text()
|
||||
|
||||
if not text.strip():
|
||||
continue
|
||||
|
||||
# Split into lines first, then split on sentence-ending punctuation
|
||||
lines = text.split("\n")
|
||||
|
||||
raw_sentences = []
|
||||
for line in lines:
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
# Split after sentence-ending punctuation (. ? !) when it is followed by whitespace
|
||||
parts = re.split(r"(?<=[.?!])\s+", line)
|
||||
raw_sentences.extend(parts)
|
||||
|
||||
for sent in raw_sentences:
|
||||
sent = sent.strip()
|
||||
if not sent:
|
||||
continue
|
||||
|
||||
# Must contain Hebrew characters
|
||||
if not HEBREW_RE.search(sent):
|
||||
continue
|
||||
|
||||
# Create stripped version (no nikkud, collapsed spaces for PDF)
|
||||
stripped = collapse_hebrew_spaces(sent)
|
||||
|
||||
# Count Hebrew words in stripped version
|
||||
words = [w for w in stripped.split() if HEBREW_RE.search(w)]
|
||||
word_count = len(words)
|
||||
|
||||
# Filter: 4-15 Hebrew words
|
||||
if word_count < 4 or word_count > 15:
|
||||
continue
|
||||
|
||||
# Drop metadata-like lines
|
||||
# Page numbers (just digits)
|
||||
if re.match(r"^\d+$", sent.strip()):
|
||||
continue
|
||||
# Copyright text
|
||||
if any(kw in sent.lower() for kw in ["copyright", "©", "isbn", "printed in"]):
|
||||
continue
|
||||
|
||||
sentences.append(
|
||||
{
|
||||
"text": sent,
|
||||
"book": book_name,
|
||||
"stripped": stripped,
|
||||
}
|
||||
)
|
||||
|
||||
doc.close()
|
||||
return sentences
|
||||
|
||||
|
||||
def has_extractable_text(pdf_path):
|
||||
"""Check if a PDF has extractable text."""
|
||||
doc = fitz.open(pdf_path)
|
||||
text_found = False
|
||||
for i in range(min(len(doc), 10)):
|
||||
if doc[i].get_text().strip():
|
||||
text_found = True
|
||||
break
|
||||
doc.close()
|
||||
return text_found
|
||||
|
||||
|
||||
def load_sentence_index():
|
||||
"""Load existing sentence index."""
|
||||
if os.path.exists(SENTENCE_INDEX):
|
||||
with open(SENTENCE_INDEX, encoding="utf-8") as f:
|
||||
return json.load(f)
|
||||
return {"sentences": []}
|
||||
|
||||
|
||||
def save_sentence_index(data):
|
||||
"""Save sentence index."""
|
||||
with open(SENTENCE_INDEX, "w", encoding="utf-8") as f:
|
||||
json.dump(data, f, ensure_ascii=False, indent=2)
|
||||
|
||||
|
||||
def match_vocab_to_sentences(sentences, vocab_df):
|
||||
"""Match vocab words to sentences."""
|
||||
matches = {}
|
||||
|
||||
# Build lookup: word_no_nikkud -> word_nikkud
|
||||
vocab_words = []
|
||||
for _, row in vocab_df.iterrows():
|
||||
word_no_nik = str(row.get("Word Without Nikkud", "")).strip()
|
||||
word_nik = str(row.get("Word", "")).strip()
|
||||
if word_no_nik and word_nik:
|
||||
vocab_words.append((word_no_nik, word_nik))
|
||||
|
||||
print(f"Matching {len(vocab_words)} vocab words against {len(sentences)} sentences...")
|
||||
|
||||
# Precompute: for each sentence, get the stripped text
|
||||
sent_data = []
|
||||
for s in sentences:
|
||||
stripped = s.get("stripped", "")
|
||||
# For PDF sentences, stripped already has collapsed spaces but words may be joined
|
||||
# For EPUB sentences, stripped has proper word spacing
|
||||
sent_data.append(
|
||||
{
|
||||
"text": s["text"],
|
||||
"book": s["book"],
|
||||
"stripped": stripped,
|
||||
"word_count": len(stripped.split()),
|
||||
}
|
||||
)
|
||||
|
||||
matched_count = 0
|
||||
|
||||
for word_no_nik, word_nik in vocab_words:
|
||||
if len(word_no_nik) < 2:
|
||||
continue
|
||||
|
||||
# Build regex for word boundary matching
|
||||
# Use both approaches: proper word boundary and substring for PDF text
|
||||
pattern = re.compile(r"(?:^|\s)" + re.escape(word_no_nik) + r"(?:\s|$)")
|
||||
# For PDF texts with collapsed spaces, also try substring match
|
||||
# but only for words >= 3 chars to avoid false positives
|
||||
use_substring = len(word_no_nik) >= 3
|
||||
|
||||
word_matches = []
|
||||
|
||||
for sd in sent_data:
|
||||
stripped = sd["stripped"]
|
||||
|
||||
# Try word-boundary match first
|
||||
if pattern.search(stripped):
|
||||
word_matches.append(sd)
|
||||
elif use_substring and word_no_nik in stripped:
|
||||
# Substring match for PDF texts with collapsed spaces
|
||||
# Verify it's not part of a longer word by checking the character
|
||||
# before and after in the collapsed text
|
||||
idx = stripped.find(word_no_nik)
|
||||
before_ok = idx == 0 or not HEBREW_RE.match(stripped[idx - 1])
|
||||
after_idx = idx + len(word_no_nik)
|
||||
after_ok = after_idx >= len(stripped) or not HEBREW_RE.match(stripped[after_idx])
|
||||
# Only count if at least one boundary is clear
|
||||
# (for PDF collapsed text, boundaries are often missing)
|
||||
# For PDF books, we accept substring matches
|
||||
if sd["book"] in ("אליס בארץ הפלאות", "האריה שאהב תות") or before_ok or after_ok:
|
||||
word_matches.append(sd)
|
||||
|
||||
if word_matches:
|
||||
matched_count += 1
|
||||
|
||||
# Sort by preference: 6-12 words first (shorter first), then under 6 (longer first), then over 12 (shorter first)
|
||||
def score(sd):
|
||||
wc = sd["word_count"]
|
||||
if 6 <= wc <= 12:
|
||||
return (0, wc) # ideal range, prefer shorter
|
||||
if wc < 6:
|
||||
return (1, -wc) # too short
|
||||
return (2, wc) # too long
|
||||
|
||||
word_matches.sort(key=score)
|
||||
best = word_matches[:3]
|
||||
|
||||
matches[word_no_nik] = {
|
||||
"word_nikkud": word_nik,
|
||||
"sentences": [{"text": m["text"], "book": m["book"]} for m in best],
|
||||
}
|
||||
|
||||
print(
|
||||
f"Words with at least 1 match: {matched_count}/{len(vocab_words)} ({100 * matched_count / max(len(vocab_words), 1):.1f}%)"
|
||||
)
|
||||
return matches
|
||||
|
||||
|
||||
def main():
|
||||
# ── Step 1: Extract from PDFs ──
|
||||
pdfs = [
|
||||
("alice.pdf", "אליס בארץ הפלאות"),
|
||||
("lion_strawberry.pdf", "האריה שאהב תות"),
|
||||
]
|
||||
|
||||
all_new_sentences = []
|
||||
|
||||
for filename, book_name in pdfs:
|
||||
pdf_path = os.path.join(EPUBS_DIR, filename)
|
||||
if not os.path.exists(pdf_path):
|
||||
print(f"SKIP: {filename} not found")
|
||||
continue
|
||||
|
||||
if not has_extractable_text(pdf_path):
|
||||
print(f"SKIP: {filename} has no extractable text (likely scanned images)")
|
||||
continue
|
||||
|
||||
print(f"Extracting from {filename} ({book_name})...")
|
||||
sentences = extract_pdf_sentences(pdf_path, book_name)
|
||||
print(f" Extracted {len(sentences)} sentences")
|
||||
all_new_sentences.extend(sentences)
|
||||
|
||||
# ── Step 2: Merge with existing index ──
|
||||
index = load_sentence_index()
|
||||
existing_count = len(index["sentences"])
|
||||
|
||||
# Deduplicate by (stripped, book)
|
||||
existing_keys = set()
|
||||
for s in index["sentences"]:
|
||||
key = (s.get("stripped", ""), s.get("book", ""))
|
||||
existing_keys.add(key)
|
||||
|
||||
added = 0
|
||||
for s in all_new_sentences:
|
||||
key = (s["stripped"], s["book"])
|
||||
if key not in existing_keys:
|
||||
index["sentences"].append(s)
|
||||
existing_keys.add(key)
|
||||
added += 1
|
||||
|
||||
save_sentence_index(index)
|
||||
total = len(index["sentences"])
|
||||
print(f"\nSentence index: {existing_count} existing + {added} new = {total} total")
|
||||
|
||||
# ── Per-book stats ──
|
||||
book_counts = {}
|
||||
for s in index["sentences"]:
|
||||
book = s.get("book", "unknown")
|
||||
book_counts[book] = book_counts.get(book, 0) + 1
|
||||
|
||||
print("\nSentences per book:")
|
||||
for book, count in sorted(book_counts.items(), key=lambda x: -x[1]):
|
||||
print(f" {book}: {count}")
|
||||
|
||||
# ── Step 3: Match vocab words to sentences ──
|
||||
print(f"\nLoading vocab from {VOCAB_CSV}...")
|
||||
vocab_df = pd.read_csv(VOCAB_CSV, sep=";", index_col=0)
|
||||
print(f" {len(vocab_df)} vocab words loaded")
|
||||
|
||||
matches = match_vocab_to_sentences(index["sentences"], vocab_df)
|
||||
|
||||
with open(MATCHES_FILE, "w", encoding="utf-8") as f:
|
||||
json.dump(matches, f, ensure_ascii=False, indent=2)
|
||||
|
||||
print(f"\nWrote {len(matches)} word matches to {MATCHES_FILE}")
|
||||
|
||||
# ── Step 4: Summary stats ──
|
||||
total_words = len(vocab_df)
|
||||
matched_words = len(matches)
|
||||
print(f"\n{'=' * 50}")
|
||||
print("SUMMARY")
|
||||
print(f"{'=' * 50}")
|
||||
print(f"Total sentences: {total}")
|
||||
for book, count in sorted(book_counts.items(), key=lambda x: -x[1]):
|
||||
print(f" {book}: {count}")
|
||||
print(f"Total vocab words: {total_words}")
|
||||
print(f"Words with sentences: {matched_words} ({100 * matched_words / max(total_words, 1):.1f}%)")
|
||||
print(f"Words without sentences: {total_words - matched_words}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
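
The ranking rule inside `match_vocab_to_sentences` can be exercised on its own. This is a minimal sketch that copies the `score` tuple logic onto bare word counts; the sample counts are made up for illustration and do not come from the sentence index.

```python
def score(word_count: int) -> tuple[int, int]:
    """Sort key: lower tuples sort first."""
    if 6 <= word_count <= 12:
        return (0, word_count)   # ideal range, shorter first
    if word_count < 6:
        return (1, -word_count)  # too short, longer first
    return (2, word_count)       # too long, shorter first


sample_counts = [4, 15, 7, 12, 5, 9, 13]  # hypothetical sentence lengths
ranked = sorted(sample_counts, key=score)
top_three = ranked[:3]  # the script keeps the best three sentences per word
```

Because the first tuple element is the bucket, any sentence in the 6-12 range outranks every sentence outside it, whatever the lengths.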
|
||||
|
|
@ -21,9 +21,10 @@ from pathlib import Path
|
|||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
PDF_URL = "https://books.nevo.engineer/opds/download/117/pdf/"
|
||||
PROJECT_ROOT = Path(__file__).resolve().parent.parent
|
||||
PDF_URL = "" # Set to URL or local path of Coffin & Bolozky PDF
|
||||
PDF_PATH = Path("/tmp/coffin_bolozky.pdf")
|
||||
OUTPUT_PATH = Path(__file__).parent / "verbs_input.txt"
|
||||
OUTPUT_PATH = PROJECT_ROOT / "verbs_input.txt"
|
||||
|
||||
# Pages to scan (Appendix 1)
|
||||
PAGE_START = 390
|
||||
|
|
@ -31,24 +32,38 @@ PAGE_END = 411
|
|||
|
||||
# Binyan headings in Hebrew (vowelled and unvowelled variants)
|
||||
BINYAN_HEADINGS_HEB = [
|
||||
"פָּעַל", "פעל",
|
||||
"נִפְעַל", "נפעל",
|
||||
"פִּעֵל", "פיעל",
|
||||
"פֻּעַל", "פועל",
|
||||
"הִתְפַּעֵל", "התפעל",
|
||||
"הִפְעִיל", "הפעיל",
|
||||
"הֻפְעַל", "הופעל",
|
||||
"פָּעַל",
|
||||
"פעל",
|
||||
"נִפְעַל",
|
||||
"נפעל",
|
||||
"פִּעֵל",
|
||||
"פיעל",
|
||||
"פֻּעַל",
|
||||
"פועל",
|
||||
"הִתְפַּעֵל",
|
||||
"התפעל",
|
||||
"הִפְעִיל",
|
||||
"הפעיל",
|
||||
"הֻפְעַל",
|
||||
"הופעל",
|
||||
]
|
||||
|
||||
# Binyan heading → canonical name
|
||||
BINYAN_CANONICAL = {
|
||||
"פָּעַל": "Pa'al", "פעל": "Pa'al",
|
||||
"נִפְעַל": "Nif'al", "נפעל": "Nif'al",
|
||||
"פִּעֵל": "Pi'el", "פיעל": "Pi'el",
|
||||
"פֻּעַל": "Pu'al", "פועל": "Pu'al",
|
||||
"הִתְפַּעֵל": "Hitpa'el", "התפעל": "Hitpa'el",
|
||||
"הִפְעִיל": "Hif'il", "הפעיל": "Hif'il",
|
||||
"הֻפְעַל": "Huf'al", "הופעל": "Huf'al",
|
||||
"פָּעַל": "Pa'al",
|
||||
"פעל": "Pa'al",
|
||||
"נִפְעַל": "Nif'al",
|
||||
"נפעל": "Nif'al",
|
||||
"פִּעֵל": "Pi'el",
|
||||
"פיעל": "Pi'el",
|
||||
"פֻּעַל": "Pu'al",
|
||||
"פועל": "Pu'al",
|
||||
"הִתְפַּעֵל": "Hitpa'el",
|
||||
"התפעל": "Hitpa'el",
|
||||
"הִפְעִיל": "Hif'il",
|
||||
"הפעיל": "Hif'il",
|
||||
"הֻפְעַל": "Huf'al",
|
||||
"הופעל": "Huf'al",
|
||||
}
|
||||
|
||||
# Passive binyan names — no infinitive, use 3ms past
|
||||
|
|
@ -156,15 +171,16 @@ FALLBACK_VERBS = """# Verb list from Coffin & Bolozky, A Reference Grammar of Mo
|
|||
def _install_deps():
|
||||
"""Install pymupdf and python-bidi if not available."""
|
||||
try:
|
||||
import fitz # noqa: F401
|
||||
import bidi # noqa: F401
|
||||
import fitz # noqa: F401
|
||||
|
||||
return True
|
||||
except ImportError:
|
||||
logger.info("Installing pymupdf and python-bidi …")
|
||||
import subprocess
|
||||
|
||||
result = subprocess.run(
|
||||
[sys.executable, "-m", "pip", "install",
|
||||
"pymupdf", "python-bidi", "--break-system-packages", "-q"],
|
||||
[sys.executable, "-m", "pip", "install", "pymupdf", "python-bidi", "--break-system-packages", "-q"],
|
||||
capture_output=True,
|
||||
)
|
||||
if result.returncode != 0:
|
||||
|
|
@ -182,6 +198,7 @@ def _download_pdf() -> bool:
|
|||
logger.info(f"Downloading PDF from {PDF_URL} …")
|
||||
try:
|
||||
import requests
|
||||
|
||||
resp = requests.get(PDF_URL, timeout=120, stream=True)
|
||||
resp.raise_for_status()
|
||||
PDF_PATH.write_bytes(resp.content)
|
||||
|
|
@ -211,10 +228,7 @@ def _needs_bidi_fix(text: str) -> bool:
|
|||
|
||||
|
||||
def _strip_nikkud(text: str) -> str:
|
||||
return "".join(
|
||||
ch for ch in unicodedata.normalize("NFD", text)
|
||||
if unicodedata.category(ch) != "Mn"
|
||||
)
|
||||
return "".join(ch for ch in unicodedata.normalize("NFD", text) if unicodedata.category(ch) != "Mn")
|
||||
|
||||
|
||||
def _extract_from_pdf() -> list[tuple[str, str, str]]:
|
||||
|
|
@ -244,10 +258,9 @@ def _extract_from_pdf() -> list[tuple[str, str, str]]:
|
|||
# Check if we need bidi correction
|
||||
test_text = ""
|
||||
try:
|
||||
for page_num in range(min(PAGE_START, doc.page_count - 1),
|
||||
min(PAGE_START + 3, doc.page_count)):
|
||||
for page_num in range(min(PAGE_START, doc.page_count - 1), min(PAGE_START + 3, doc.page_count)):
|
||||
test_text += doc[page_num].get_text("text")
|
||||
except Exception:
|
||||
except Exception: # noqa: S110
|
||||
pass
|
||||
|
||||
use_bidi = _needs_bidi_fix(test_text)
|
||||
|
|
@ -259,6 +272,7 @@ def _extract_from_pdf() -> list[tuple[str, str, str]]:
|
|||
return t
|
||||
try:
|
||||
from bidi.algorithm import get_display
|
||||
|
||||
lines = t.split("\n")
|
||||
fixed = []
|
||||
for line in lines:
|
||||
|
|
@ -274,7 +288,7 @@ def _extract_from_pdf() -> list[tuple[str, str, str]]:
|
|||
for page_num in range(PAGE_START - 1, page_end): # fitz is 0-indexed
|
||||
try:
|
||||
raw = doc[page_num].get_text("text")
|
||||
except Exception:
|
||||
except Exception: # noqa: S112
|
||||
continue
|
||||
|
||||
text = fix_text(raw)
|
||||
|
|
@ -316,9 +330,12 @@ def _extract_from_pdf() -> list[tuple[str, str, str]]:
|
|||
heb_words = re.findall(r"[\u05d0-\u05ea\u05b0-\u05c7]{3,}", line)
|
||||
for w in heb_words:
|
||||
stripped_w = _strip_nikkud(w)
|
||||
if current_binyan == "Pu'al" and stripped_w.startswith("פ"):
|
||||
entries.append((current_binyan, "3ms", w))
|
||||
elif current_binyan == "Huf'al" and stripped_w.startswith("ה"):
|
||||
if (
|
||||
current_binyan == "Pu'al"
|
||||
and stripped_w.startswith("פ")
|
||||
or current_binyan == "Huf'al"
|
||||
and stripped_w.startswith("ה")
|
||||
):
|
||||
entries.append((current_binyan, "3ms", w))
|
||||
|
||||
doc.close()
|
||||
|
|
@ -357,16 +374,20 @@ def _write_output(entries: list[tuple[str, str, str]]) -> None:
|
|||
lines.append(form)
|
||||
|
||||
OUTPUT_PATH.write_text("\n".join(lines) + "\n", encoding="utf-8")
|
||||
verb_count = sum(1 for l in lines if l and not l.startswith("#"))
|
||||
passive_count = sum(1 for l in lines if l.startswith("# 3ms:"))
|
||||
verb_count = sum(1 for ln in lines if ln and not ln.startswith("#"))
|
||||
passive_count = sum(1 for ln in lines if ln.startswith("# 3ms:"))
|
||||
logger.info(f"Written {verb_count} active verbs + {passive_count} passive (3ms) → {OUTPUT_PATH}")
|
||||
|
||||
|
||||
def _binyan_heb(name: str) -> str:
|
||||
mapping = {
|
||||
"Pa'al": "פָּעַל", "Nif'al": "נִפְעַל", "Pi'el": "פִּעֵל",
|
||||
"Pu'al": "פֻּעַל", "Hitpa'el": "הִתְפַּעֵל",
|
||||
"Hif'il": "הִפְעִיל", "Huf'al": "הֻפְעַל",
|
||||
"Pa'al": "פָּעַל",
|
||||
"Nif'al": "נִפְעַל",
|
||||
"Pi'el": "פִּעֵל",
|
||||
"Pu'al": "פֻּעַל",
|
||||
"Hitpa'el": "הִתְפַּעֵל",
|
||||
"Hif'il": "הִפְעִיל",
|
||||
"Huf'al": "הֻפְעַל",
|
||||
}
|
||||
return mapping.get(name, name)
|
||||
|
||||
237
scripts/scrape_ktiv_male.py
Normal file
|
|
@ -0,0 +1,237 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
Scrape ktiv male (plene/vowelless) forms from pealim.com.
|
||||
|
||||
Uses hebstyle=vl cookie to get vowelless writing with matres lectionis.
|
||||
Builds a lookup: ktiv_male_form → [{word_nikkud, form_type, pos, slug}]
|
||||
|
||||
This enables matching Hebrew text (which is normally in ktiv male)
|
||||
against our vocabulary, including conjugated verbs and noun plurals.
|
||||
"""
|
||||
|
||||
import json
|
||||
import logging
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
DATA_DIR = Path(__file__).resolve().parent.parent / "data"
|
||||
OUTPUT_PATH = DATA_DIR / "ktiv_male_forms.json"
|
||||
COOKIES = {"translit": "none", "hebstyle": "vl"}
|
||||
REQUEST_TIMEOUT = 15
|
||||
DELAY = 1.5 # seconds between requests
|
||||
|
||||
|
||||
def fetch_verb_ktiv_male(slug: str, infinitive_nikkud: str) -> list[dict]:
|
||||
"""Fetch all conjugated forms in ktiv male for a verb."""
|
||||
url = f"https://www.pealim.com/dict/{slug}/"
|
||||
resp = requests.get(url, cookies=COOKIES, timeout=REQUEST_TIMEOUT)
|
||||
resp.raise_for_status()
|
||||
soup = BeautifulSoup(resp.text, "html.parser")
|
||||
|
||||
forms = []
|
||||
table = soup.find("table", class_="conjugation-table")
|
||||
if not table:
|
||||
return forms
|
||||
|
||||
# Also get the infinitive from the page
|
||||
lead = soup.find("div", class_="lead")
|
||||
if lead:
|
||||
inf_spans = lead.find_all("span", class_="menukad")
|
||||
for s in inf_spans:
|
||||
ktiv = s.text.strip()
|
||||
if ktiv:
|
||||
forms.append(
|
||||
{
|
||||
"ktiv_male": ktiv,
|
||||
"word_nikkud": infinitive_nikkud,
|
||||
"form_type": "infinitive",
|
||||
"pos": "Verb",
|
||||
"slug": slug,
|
||||
}
|
||||
)
|
||||
|
||||
rows = table.find_all("tr")
|
||||
for row in rows:
|
||||
menukad_spans = row.find_all("span", class_="menukad")
|
||||
for span in menukad_spans:
|
||||
ktiv = span.text.strip()
|
||||
if ktiv and ktiv not in {f["ktiv_male"] for f in forms}:
|
||||
forms.append(
|
||||
{
|
||||
"ktiv_male": ktiv,
|
||||
"word_nikkud": infinitive_nikkud,
|
||||
"form_type": "conjugation",
|
||||
"pos": "Verb",
|
||||
"slug": slug,
|
||||
}
|
||||
)
|
||||
|
||||
return forms
|
||||
|
||||
|
||||
def fetch_noun_ktiv_male(slug: str, singular_nikkud: str, gender: str) -> list[dict]:
|
||||
"""Fetch noun declension forms in ktiv male."""
|
||||
url = f"https://www.pealim.com/dict/{slug}/"
|
||||
resp = requests.get(url, cookies=COOKIES, timeout=REQUEST_TIMEOUT)
|
||||
resp.raise_for_status()
|
||||
soup = BeautifulSoup(resp.text, "html.parser")
|
||||
|
||||
forms = []
|
||||
table = soup.find("table", class_="conjugation-table")
|
||||
if not table:
|
||||
return forms
|
||||
|
||||
rows = table.find_all("tr")
|
||||
form_labels = ["absolute_singular", "absolute_plural", "construct_singular", "construct_plural"]
|
||||
label_idx = 0
|
||||
|
||||
for row in rows:
|
||||
menukad_spans = row.find_all("span", class_="menukad")
|
||||
for span in menukad_spans:
|
||||
ktiv = span.text.strip()
|
||||
if ktiv:
|
||||
ft = form_labels[label_idx] if label_idx < len(form_labels) else "other"
|
||||
forms.append(
|
||||
{
|
||||
"ktiv_male": ktiv,
|
||||
"word_nikkud": singular_nikkud,
|
||||
"form_type": ft,
|
||||
"pos": "Noun",
|
||||
"slug": slug,
|
||||
"gender": gender,
|
||||
}
|
||||
)
|
||||
label_idx += 1
|
||||
|
||||
return forms
|
||||
|
||||
|
||||
def scrape_verbs() -> list[dict]:
|
||||
"""Scrape ktiv male forms for all verbs in conjugations.json."""
|
||||
conj_path = DATA_DIR / "conjugations.json"
|
||||
if not conj_path.exists():
|
||||
logger.warning("No conjugations.json found")
|
||||
return []
|
||||
|
||||
with open(conj_path) as f:
|
||||
conjugations = json.load(f)
|
||||
|
||||
all_forms = []
|
||||
slugs_done = set()
|
||||
|
||||
for verb, data in conjugations.items():
|
||||
if not data or not data.get("slug"):
|
||||
continue
|
||||
slug = data["slug"]
|
||||
if slug in slugs_done:
|
||||
continue
|
||||
slugs_done.add(slug)
|
||||
|
||||
try:
|
||||
forms = fetch_verb_ktiv_male(slug, verb)
|
||||
all_forms.extend(forms)
|
||||
logger.info(f" Verb {verb} ({slug}): {len(forms)} forms")
|
||||
except Exception as e:
|
||||
logger.warning(f" Verb {verb} ({slug}) failed: {e}")
|
||||
|
||||
time.sleep(DELAY)
|
||||
|
||||
return all_forms
|
||||
|
||||
|
||||
def scrape_nouns() -> list[dict]:
|
||||
"""Scrape ktiv male forms for all nouns in noun_slug_map.json."""
|
||||
slug_path = DATA_DIR / "noun_slug_map.json"
|
||||
if not slug_path.exists():
|
||||
logger.warning("No noun_slug_map.json found")
|
||||
return []
|
||||
|
||||
with open(slug_path) as f:
|
||||
slug_map = json.load(f)
|
||||
|
||||
# Also load existing plurals to get nikkud singular form
|
||||
plurals_path = DATA_DIR / "noun_plurals.json"
|
||||
plurals = {}
|
||||
if plurals_path.exists():
|
||||
with open(plurals_path) as f:
|
||||
plurals = json.load(f)
|
||||
|
||||
all_forms = []
|
||||
done = 0
|
||||
total = len(slug_map)
|
||||
|
||||
for word, info in slug_map.items():
|
||||
slug = info.get("slug", "")
|
||||
if not slug:
|
||||
continue
|
||||
|
||||
# Get nikkud form from plurals data or slug map
|
||||
nikkud = info.get("word_nikkud", word)
|
||||
if word in plurals:
|
||||
nikkud = plurals[word].get("singular", nikkud)
|
||||
gender = info.get("gender", "")
|
||||
|
||||
try:
|
||||
forms = fetch_noun_ktiv_male(slug, nikkud, gender)
|
||||
all_forms.extend(forms)
|
||||
done += 1
|
||||
if done % 50 == 0:
|
||||
logger.info(f" Nouns: {done}/{total} ({len(all_forms)} forms)")
|
||||
# Save incrementally
|
||||
_save_forms(all_forms, partial=True)
|
||||
except Exception as e:
|
||||
logger.warning(f" Noun {word} ({slug}) failed: {e}")
|
||||
done += 1
|
||||
|
||||
time.sleep(DELAY)
|
||||
|
||||
return all_forms
|
||||
|
||||
|
||||
def _save_forms(all_forms: list[dict], partial: bool = False):
|
||||
"""Build and save the ktiv male lookup dict."""
|
||||
lookup: dict[str, list[dict]] = {}
|
||||
for entry in all_forms:
|
||||
ktiv = entry["ktiv_male"]
|
||||
# Don't include ktiv_male in the stored entry (it's the key)
|
||||
stored = {k: v for k, v in entry.items() if k != "ktiv_male"}
|
||||
lookup.setdefault(ktiv, []).append(stored)
|
||||
|
||||
suffix = ".partial" if partial else ""
|
||||
out = OUTPUT_PATH.parent / (OUTPUT_PATH.name + suffix)
|
||||
with open(out, "w", encoding="utf-8") as f:
|
||||
json.dump(lookup, f, ensure_ascii=False, indent=1)
|
||||
|
||||
logger.info(f" Saved {len(lookup)} unique ktiv male forms → {out}")
|
||||
|
||||
|
||||
def main():
|
||||
mode = sys.argv[1] if len(sys.argv) > 1 else "all"
|
||||
|
||||
all_forms = []
|
||||
|
||||
if mode in ("all", "verbs"):
|
||||
logger.info("=== Scraping verb ktiv male forms ===")
|
||||
verb_forms = scrape_verbs()
|
||||
all_forms.extend(verb_forms)
|
||||
logger.info(f"Verbs done: {len(verb_forms)} forms from {len({f['slug'] for f in verb_forms})} verbs")
|
||||
|
||||
if mode in ("all", "nouns"):
|
||||
logger.info("=== Scraping noun ktiv male forms ===")
|
||||
noun_forms = scrape_nouns()
|
||||
all_forms.extend(noun_forms)
|
||||
logger.info(f"Nouns done: {len(noun_forms)} forms")
|
||||
|
||||
_save_forms(all_forms)
|
||||
logger.info(f"Total: {len(all_forms)} forms → {OUTPUT_PATH}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
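Reviewer sketch for `_save_forms` above: the function groups scraped entries by their ktiv male spelling and drops that key from each stored entry. The snippet below reproduces only the grouping step so it can be checked without touching disk. The helper name `build_lookup`, the sample entries, and the `hypothetical-*` slugs are illustrative assumptions, not part of the commit.

```python
def build_lookup(all_forms: list[dict]) -> dict[str, list[dict]]:
    """Group entries by ktiv male spelling, as _save_forms does before dumping JSON."""
    lookup: dict[str, list[dict]] = {}
    for entry in all_forms:
        key = entry["ktiv_male"]
        # The spelling becomes the dict key, so it is not repeated in the stored entry.
        stored = {k: v for k, v in entry.items() if k != "ktiv_male"}
        lookup.setdefault(key, []).append(stored)
    return lookup


# Hypothetical sample data: two entries share one unpointed spelling.
sample_forms = [
    {"ktiv_male": "ספרים", "pos": "Noun", "slug": "hypothetical-1"},
    {"ktiv_male": "ספרים", "pos": "Noun", "slug": "hypothetical-2"},
    {"ktiv_male": "כותב", "pos": "Verb", "slug": "hypothetical-3"},
]
lookup = build_lookup(sample_forms)
```

Keeping a list per key matters because one unpointed spelling can belong to several dictionary entries, which is what sentence matching has to disambiguate later.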
365
scripts/scrape_noun_plurals.py
Normal file
|
|
@ -0,0 +1,365 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
Scrape pealim.com for noun plural and construct forms.
|
||||
|
||||
Step 1: Collect noun slugs from list pages (/dict/?pos=noun&page=N)
|
||||
Step 2: Fetch detail pages for plural + construct forms
|
||||
Step 3: Print summary statistics
|
||||
"""
|
||||
|
||||
import json
|
||||
import re
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
BASE_URL = "https://www.pealim.com"
|
||||
COOKIES = {"translit": "none", "hebstyle": "mo"}
|
||||
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; PealimScraper/1.0)"}
|
||||
DATA_DIR = Path(__file__).resolve().parent.parent / "data"
|
||||
SLUG_MAP_FILE = DATA_DIR / "noun_slug_map.json"
|
||||
PROGRESS_FILE = DATA_DIR / "noun_slug_map_progress.json"
|
||||
PLURALS_FILE = DATA_DIR / "noun_plurals.json"
|
||||
DELAY = 1.5 # seconds between requests
|
||||
|
||||
|
||||
def load_json(path, default=None):
|
||||
if path.exists():
|
||||
with open(path, encoding="utf-8") as f:
|
||||
return json.load(f)
|
||||
return default if default is not None else {}
|
||||
|
||||
|
||||
def save_json(path, data):
|
||||
with open(path, "w", encoding="utf-8") as f:
|
||||
json.dump(data, f, ensure_ascii=False, indent=2)
|
||||
|
||||
|
||||
def fetch_with_retry(url, max_retries=5):
|
||||
"""Fetch URL with exponential backoff."""
|
||||
for attempt in range(max_retries):
|
||||
try:
|
||||
r = requests.get(url, cookies=COOKIES, headers=HEADERS, timeout=30)
|
||||
r.raise_for_status()
|
||||
return r
|
||||
except (requests.RequestException, ConnectionError) as e:
|
||||
wait = min(2**attempt * 2, 60)
|
||||
print(f" Retry {attempt + 1}/{max_retries} for {url}: {e} (waiting {wait}s)")
|
||||
time.sleep(wait)
|
||||
print(f" FAILED after {max_retries} retries: {url}")
|
||||
return None
|
||||
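Reviewer sketch for `fetch_with_retry` above: the wait before each retry is `min(2**attempt * 2, 60)`, so it doubles from 2 seconds and is capped at 60. The helper below only computes that schedule; the name `backoff_waits` and its parameters are assumptions introduced for illustration.

```python
def backoff_waits(max_retries: int = 5, base: int = 2, cap: int = 60) -> list[int]:
    """Return the sleep (in seconds) used after each failed attempt."""
    return [min(2**attempt * base, cap) for attempt in range(max_retries)]


default_schedule = backoff_waits()
longer_schedule = backoff_waits(max_retries=7)
```

With the default of five retries the cap is never reached. Note that the function as written also sleeps after the final failed attempt, just before reporting the failure.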
|
||||
|
||||
def get_total_pages():
|
||||
"""Get total number of noun list pages."""
|
||||
r = fetch_with_retry(f"{BASE_URL}/dict/?pos=noun&page=1")
|
||||
if not r:
|
||||
return 0
|
||||
soup = BeautifulSoup(r.text, "lxml")
|
||||
pages = set()
|
||||
for a in soup.select("ul.pagination li a"):
|
||||
href = a.get("href", "")
|
||||
m = re.search(r"page=(\d+)", href)
|
||||
if m:
|
||||
pages.add(int(m.group(1)))
|
||||
return max(pages) if pages else 1
|
||||
|
||||
|
||||
def parse_list_page(html):
|
||||
"""Parse a noun list page and return list of noun entries."""
|
||||
soup = BeautifulSoup(html, "lxml")
|
||||
table = soup.select_one("table.dict-table")
|
||||
if not table:
|
||||
return []
|
||||
|
||||
entries = []
|
||||
for row in table.select("tr")[1:]: # skip header
|
||||
tds = row.select("td")
|
||||
if len(tds) < 3:
|
||||
continue
|
||||
|
||||
# First td: word + link
|
||||
first_td = tds[0]
|
||||
a = first_td.select_one("a")
|
||||
if not a:
|
||||
continue
|
||||
href = a.get("href", "")
|
||||
slug_match = re.search(r"/dict/([^/]+)/", href)
|
||||
if not slug_match:
|
||||
continue
|
||||
slug = slug_match.group(1)
|
||||
|
||||
menukad = first_td.select_one("span.menukad")
|
||||
word_nikkud = menukad.get_text(strip=True) if menukad else ""
|
||||
|
||||
# Word without nikkud (strip combining marks)
|
||||
word_plain = re.sub(r"[\u0591-\u05C7]", "", word_nikkud)
|
||||
|
||||
# Third td: part of speech
|
||||
pos_text = tds[2].get_text(strip=True)
|
||||
|
||||
# Gender
|
||||
gender = ""
|
||||
if "masculine" in pos_text.lower():
|
||||
gender = "masculine"
|
||||
elif "feminine" in pos_text.lower():
|
||||
gender = "feminine"
|
||||
|
||||
# Mishkal pattern
|
||||
mishkal = ""
|
||||
m = re.search(r"(\w+)\s*pattern", pos_text.lower())
|
||||
if m:
|
||||
mishkal = m.group(1)
|
||||
|
||||
entries.append(
|
||||
{
|
||||
"word_plain": word_plain,
|
||||
"slug": slug,
|
||||
"word_nikkud": word_nikkud,
|
||||
"pos": pos_text,
|
||||
"gender": gender,
|
||||
"mishkal": mishkal,
|
||||
}
|
||||
)
|
||||
|
||||
return entries
|
||||
|
||||
|
||||
def step1_collect_slugs():
|
||||
"""Step 1: Collect noun slugs from list pages."""
|
||||
print("=" * 60)
|
||||
print("STEP 1: Collecting noun slugs from list pages")
|
||||
print("=" * 60)
|
||||
|
||||
slug_map = load_json(SLUG_MAP_FILE, {})
|
||||
progress = load_json(PROGRESS_FILE, [])
|
||||
completed_pages = set(progress) if isinstance(progress, list) else set()
|
||||
|
||||
# Get total pages
|
||||
total_pages = get_total_pages()
|
||||
print(f"Total pages: {total_pages}")
|
||||
print(f"Already completed: {len(completed_pages)} pages, {len(slug_map)} nouns")
|
||||
|
||||
remaining = [p for p in range(1, total_pages + 1) if p not in completed_pages]
|
||||
print(f"Remaining pages: {len(remaining)}")
|
||||
|
||||
if not remaining:
|
||||
print("All pages already scraped!")
|
||||
return slug_map
|
||||
|
||||
for i, page_num in enumerate(remaining):
|
||||
url = f"{BASE_URL}/dict/?pos=noun&page={page_num}"
|
||||
r = fetch_with_retry(url)
|
||||
if not r:
|
||||
print(f" Skipping page {page_num}")
|
||||
continue
|
||||
|
||||
entries = parse_list_page(r.text)
|
||||
for entry in entries:
|
||||
word = entry["word_plain"]
|
||||
slug_map[word] = {
|
||||
"slug": entry["slug"],
|
||||
"word_nikkud": entry["word_nikkud"],
|
||||
"pos": entry["pos"],
|
||||
"gender": entry["gender"],
|
||||
"mishkal": entry["mishkal"],
|
||||
}
|
||||
|
||||
completed_pages.add(page_num)
|
||||
done = len(completed_pages)
|
||||
print(f" Page {page_num} ({done}/{total_pages}): {len(entries)} nouns (total: {len(slug_map)})")
|
||||
|
||||
# Save progress every 10 pages
|
||||
if (i + 1) % 10 == 0 or page_num == remaining[-1]:
|
||||
save_json(SLUG_MAP_FILE, slug_map)
|
||||
save_json(PROGRESS_FILE, sorted(completed_pages))
|
||||
print(f" [Saved progress: {len(slug_map)} nouns, {done} pages]")
|
||||
|
||||
time.sleep(DELAY)
|
||||
|
||||
# Final save
|
||||
save_json(SLUG_MAP_FILE, slug_map)
|
||||
save_json(PROGRESS_FILE, sorted(completed_pages))
|
||||
print(f"\nStep 1 complete: {len(slug_map)} total nouns from {len(completed_pages)} pages")
|
||||
return slug_map
|
||||
|
||||
|
||||
def parse_detail_page(html, slug, gender, mishkal):
|
||||
"""Parse a noun detail page for plural/construct forms."""
|
||||
soup = BeautifulSoup(html, "lxml")
|
||||
tables = soup.select("table.conjugation-table")
|
||||
if not tables:
|
||||
return None
|
||||
|
||||
table = tables[0]
|
||||
rows = table.select("tr")
|
||||
|
||||
result = {
|
||||
"slug": slug,
|
||||
"singular": "",
|
||||
"singular_audio": "",
|
||||
"plural": "",
|
||||
"plural_audio": "",
|
||||
"construct_singular": "",
|
||||
"construct_plural": "",
|
||||
"gender": gender,
|
||||
"mishkal": mishkal,
|
||||
}
|
||||
|
||||
for row in rows:
|
||||
th = row.select_one("th")
|
||||
if not th:
|
||||
continue
|
||||
label = th.get_text(strip=True).lower()
|
||||
tds = row.select("td")
|
||||
|
||||
if "absolute" in label:
|
||||
if len(tds) >= 1:
|
||||
td = tds[0]
|
||||
m = td.select_one("span.menukad")
|
||||
result["singular"] = m.get_text(strip=True) if m else ""
|
||||
audio_el = td.select_one("[data-audio]")
|
||||
result["singular_audio"] = audio_el.get("data-audio", "") if audio_el else td.get("data-audio", "")
|
||||
if len(tds) >= 2:
|
||||
td = tds[1]
|
||||
m = td.select_one("span.menukad")
|
||||
result["plural"] = m.get_text(strip=True) if m else ""
|
||||
audio_el = td.select_one("[data-audio]")
|
||||
result["plural_audio"] = audio_el.get("data-audio", "") if audio_el else td.get("data-audio", "")
|
||||
|
||||
elif "construct" in label:
|
||||
if len(tds) >= 1:
|
||||
td = tds[0]
|
||||
m = td.select_one("span.menukad")
|
||||
result["construct_singular"] = m.get_text(strip=True) if m else ""
|
||||
if len(tds) >= 2:
|
||||
td = tds[1]
|
||||
m = td.select_one("span.menukad")
|
||||
result["construct_plural"] = m.get_text(strip=True) if m else ""
|
||||
|
||||
return result
|
||||
|
||||
|
||||
def step2_fetch_plurals(slug_map):
|
||||
"""Step 2: Fetch detail pages for plural + construct forms."""
|
||||
print("\n" + "=" * 60)
|
||||
print("STEP 2: Fetching plural + construct forms from detail pages")
|
||||
print("=" * 60)
|
||||
|
||||
plurals = load_json(PLURALS_FILE, {})
|
||||
already_done = set(plurals.keys())
|
||||
|
||||
# Build work list: nouns not yet in plurals
|
||||
work = []
|
||||
for word, info in slug_map.items():
|
||||
if word not in already_done:
|
||||
work.append((word, info))
|
||||
|
||||
print(f"Already have plural data: {len(already_done)}")
|
||||
print(f"Remaining to fetch: {len(work)}")
|
||||
|
||||
if not work:
|
||||
print("All nouns already have plural data!")
|
||||
return plurals
|
||||
|
||||
skipped = 0
|
||||
for i, (word, info) in enumerate(work):
|
||||
slug = info["slug"]
|
||||
url = f"{BASE_URL}/dict/{slug}/"
|
||||
r = fetch_with_retry(url)
|
||||
if not r:
|
||||
print(f" Skipping {word} ({slug})")
|
||||
skipped += 1
|
||||
continue
|
||||
|
||||
entry = parse_detail_page(r.text, slug, info.get("gender", ""), info.get("mishkal", ""))
|
||||
if entry:
|
||||
plurals[word] = entry
|
||||
else:
|
||||
# No declension table - store minimal entry
|
||||
plurals[word] = {
|
||||
"slug": slug,
|
||||
"singular": info.get("word_nikkud", ""),
|
||||
"singular_audio": "",
|
||||
"plural": "",
|
||||
"plural_audio": "",
|
||||
"construct_singular": "",
|
||||
"construct_plural": "",
|
||||
"gender": info.get("gender", ""),
|
||||
"mishkal": info.get("mishkal", ""),
|
||||
"no_declension_table": True,
|
||||
}
|
||||
|
||||
done = len(already_done) + i + 1 - skipped
|
||||
total = len(already_done) + len(work)
|
||||
if (i + 1) % 50 == 0 or i == 0:
|
||||
print(
|
||||
f" [{i + 1}/{len(work)}] {word} ({slug}): "
|
||||
f"plural={entry['plural'] if entry else 'N/A'} "
|
||||
f"(total: {done}/{total})"
|
||||
)
|
||||
|
||||
# Save every 50 entries
|
||||
if (i + 1) % 50 == 0 or i == len(work) - 1:
|
||||
save_json(PLURALS_FILE, plurals)
|
||||
print(f" [Saved: {len(plurals)} entries]")
|
||||
|
||||
time.sleep(DELAY)
|
||||
|
||||
save_json(PLURALS_FILE, plurals)
|
||||
print(f"\nStep 2 complete: {len(plurals)} total noun entries with plural data")
|
||||
return plurals
|
||||
|
||||
|
||||
def step3_summary(slug_map, plurals):
|
||||
"""Step 3: Print summary statistics."""
|
||||
print("\n" + "=" * 60)
|
||||
print("SUMMARY")
|
||||
print("=" * 60)
|
||||
|
||||
total_slugs = len(slug_map)
|
||||
total_plurals = len(plurals)
|
||||
has_plural = sum(1 for v in plurals.values() if v.get("plural"))
|
||||
has_construct = sum(1 for v in plurals.values() if v.get("construct_singular") or v.get("construct_plural"))
|
||||
has_audio = sum(1 for v in plurals.values() if v.get("singular_audio") or v.get("plural_audio"))
|
||||
no_table = sum(1 for v in plurals.values() if v.get("no_declension_table"))
|
||||
|
||||
# Irregular plurals: masculine with ות- ending, feminine with ים- ending
|
||||
irregular = 0
|
||||
for _word, v in plurals.items():
|
||||
plural = v.get("plural", "")
|
||||
gender = v.get("gender", "")
|
||||
if not plural or not gender:
|
||||
continue
|
||||
plain_plural = re.sub(r"[\u0591-\u05C7]", "", plural)
|
||||
if (
|
||||
gender == "masculine"
|
||||
and plain_plural.endswith("ות")
|
||||
or gender == "feminine"
|
||||
and plain_plural.endswith("ים")
|
||||
):
|
||||
irregular += 1
|
||||
|
||||
print(f"Total nouns in slug map: {total_slugs}")
|
||||
print(f"Total nouns with plural data: {total_plurals}")
|
||||
print(f" - With plural form: {has_plural}")
|
||||
print(f" - With construct forms: {has_construct}")
|
||||
print(f" - With audio URLs: {has_audio}")
|
||||
print(f" - No declension table: {no_table}")
|
||||
print(f" - Irregular plurals: {irregular}")
|
||||
|
||||
|
||||
def main():
|
||||
print("Pealim Noun Plural Scraper")
|
||||
print(f"Data directory: {DATA_DIR}")
|
||||
print()
|
||||
|
||||
slug_map = step1_collect_slugs()
|
||||
plurals = step2_fetch_plurals(slug_map)
|
||||
step3_summary(slug_map, plurals)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
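Reviewer sketch for the irregular-plural count in `step3_summary` above: a plural is counted as irregular when a masculine noun takes the ות- ending or a feminine noun takes the ים- ending, after nikkud is stripped. The function name `is_irregular_plural` and the sample words are assumptions for illustration; the character range is the one used in the script.

```python
import re

# Same combining-mark range the scraper strips before comparing endings.
NIKKUD_RE = re.compile(r"[\u0591-\u05C7]")


def is_irregular_plural(plural: str, gender: str) -> bool:
    """True when the plural ending does not match the noun's grammatical gender."""
    plain = NIKKUD_RE.sub("", plural)
    if gender == "masculine":
        return plain.endswith("ות")
    if gender == "feminine":
        return plain.endswith("ים")
    return False  # unknown gender: not counted, as in the summary loop


examples = {
    "tables": is_irregular_plural("שֻׁלְחָנוֹת", "masculine"),
    "books": is_irregular_plural("סְפָרִים", "masculine"),
    "women": is_irregular_plural("נָשִׁים", "feminine"),
}
```

Splitting the check by gender gives the same result as the chained `and`/`or` expression in the script, and avoids relying on operator precedence.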
250
scripts/scrape_verb_ktiv.py
Normal file
|
|
@ -0,0 +1,250 @@
|
|||
#!/usr/bin/env python3
|
||||
"""Scrape ktiv male (vowelless plene) conjugation forms for top 500 verbs from pealim.com."""
|
||||
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
|
||||
sys.stdout.reconfigure(line_buffering=True)
|
||||
import requests # noqa: E402
|
||||
from bs4 import BeautifulSoup # noqa: E402
|
||||
|
||||
DATA_DIR = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "data")
|
||||
INPUT_FILE = os.path.join(DATA_DIR, "top_verbs_to_scrape.json")
|
||||
OUTPUT_FILE = os.path.join(DATA_DIR, "ktiv_male_forms.json")
|
||||
PARTIAL_FILE = os.path.join(DATA_DIR, "ktiv_male_forms_partial.json")
|
||||
PROGRESS_FILE = os.path.join(DATA_DIR, "ktiv_scrape_progress.json")
|
||||
|
||||
COOKIES = {"translit": "none", "hebstyle": "vl"}
|
||||
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; PealimScraper/1.0)"}
|
||||
DELAY = 1.5
|
||||
|
||||
session = requests.Session()
|
||||
session.cookies.update(COOKIES)
|
||||
session.headers.update(HEADERS)
|
||||
|
||||
|
||||
def load_json(path):
|
||||
if os.path.exists(path):
|
||||
with open(path, encoding="utf-8") as f:
|
||||
return json.load(f)
|
||||
return {}
|
||||
|
||||
|
||||
def save_json(data, path):
|
||||
with open(path, "w", encoding="utf-8") as f:
|
||||
json.dump(data, f, ensure_ascii=False, indent=1)
|
||||
|
||||
|
||||
def search_slug(wni):
|
||||
"""Search pealim for a verb and return the first result's slug."""
|
||||
url = "https://www.pealim.com/search/"
|
||||
resp = session.get(url, params={"q": wni}, timeout=15)
|
||||
resp.raise_for_status()
|
||||
soup = BeautifulSoup(resp.text, "html.parser")
|
||||
|
||||
# Look for result links like /dict/SLUG/
|
||||
for a in soup.select("a[href]"):
|
||||
href = a["href"]
|
||||
m = re.match(r"/dict/(\d+-[^/]+)/", href)
|
||||
if m:
|
||||
return m.group(1)
|
||||
return None
|
||||
|
||||
|
||||
def scrape_verb_forms(slug):
|
||||
"""Fetch a verb's detail page and extract all ktiv male conjugation forms."""
|
||||
url = f"https://www.pealim.com/dict/{slug}/"
|
||||
resp = session.get(url, timeout=15)
|
||||
resp.raise_for_status()
|
||||
soup = BeautifulSoup(resp.text, "html.parser")
|
||||
|
||||
forms = set()
|
||||
|
||||
# Get infinitive from div.lead or page title
|
||||
lead = soup.select_one("div.lead")
|
||||
if lead:
|
||||
menukad_spans = lead.select("span.menukad")
|
||||
for span in menukad_spans:
|
||||
text = span.get_text(strip=True)
|
||||
if text:
|
||||
forms.add(text)
|
||||
|
||||
# Get word_nikkud (the nikkud form of the infinitive) from the page
|
||||
# We need to fetch with mo cookie for that, but we already have it from input data
|
||||
# Instead, get the page title which usually has the nikkud form
|
||||
word_nikkud = None
|
||||
title = soup.select_one("h1")
|
||||
if title:
|
||||
menukad_in_title = title.select_one("span.menukad")
|
||||
if menukad_in_title:
|
||||
word_nikkud = menukad_in_title.get_text(strip=True)
|
||||
|
||||
# Get ALL span.menukad elements from conjugation tables
|
||||
for span in soup.select("span.menukad"):
|
||||
text = span.get_text(strip=True)
|
||||
if text:
|
||||
forms.add(text)
|
||||
|
||||
return forms, word_nikkud
|
||||
|
||||
|
||||
def main():
|
||||
verbs = load_json(INPUT_FILE)
|
||||
if not verbs:
|
||||
print("ERROR: No verbs found in input file")
|
||||
sys.exit(1)
|
||||
|
||||
# Load existing forms
|
||||
existing_forms = load_json(OUTPUT_FILE)
|
||||
new_forms = {} # Will be merged into existing at the end
|
||||
|
||||
# Load progress to resume
|
||||
progress = load_json(PROGRESS_FILE)
|
||||
done_wnis = set(progress.get("done_wnis", []))
|
||||
slug_cache = progress.get("slug_cache", {})
|
||||
|
||||
# Pre-populate slug cache from conjugations.json
|
||||
conj_file = os.path.join(DATA_DIR, "conjugations.json")
|
||||
if os.path.exists(conj_file):
|
||||
conj_data = load_json(conj_file)
|
||||
for wni_key, cdata in conj_data.items():
|
||||
if isinstance(cdata, dict) and "slug" in cdata and wni_key not in slug_cache:
|
||||
slug_cache[wni_key] = cdata["slug"]
|
||||
print(f"Pre-populated {len(slug_cache)} slugs from conjugations.json")
|
||||
|
||||
# Deduplicate verbs by wni
|
||||
seen_wni = set()
|
||||
unique_verbs = []
|
||||
for v in verbs:
|
||||
if v["wni"] not in seen_wni:
|
||||
seen_wni.add(v["wni"])
|
||||
unique_verbs.append(v)
|
||||
|
||||
total = len(unique_verbs)
|
||||
to_scrape = [v for v in unique_verbs if v["wni"] not in done_wnis]
|
||||
print(f"Total unique verbs: {total}, already done: {total - len(to_scrape)}, to scrape: {len(to_scrape)}")
|
||||
|
||||
scraped_count = 0
|
||||
skipped_count = 0
|
||||
total_new_forms = 0
|
||||
sample_verbs = {} # For summary: wni -> list of forms
|
||||
|
||||
for i, verb in enumerate(to_scrape):
|
||||
wni = verb["wni"]
|
||||
word_nikkud_input = verb["word"]
|
||||
|
||||
try:
|
||||
# Step 1: Find slug
|
||||
if wni in slug_cache:
|
||||
slug = slug_cache[wni]
|
||||
else:
|
||||
slug = search_slug(wni)
|
||||
time.sleep(DELAY)
|
||||
|
||||
if not slug:
|
||||
print(f" [{i + 1}/{len(to_scrape)}] SKIP {wni} - not found on pealim")
|
||||
skipped_count += 1
|
||||
done_wnis.add(wni)
|
||||
continue
|
||||
|
||||
slug_cache[wni] = slug
|
||||
|
||||
# Step 2: Scrape forms
|
||||
forms, page_nikkud = scrape_verb_forms(slug)
|
||||
time.sleep(DELAY)
|
||||
|
||||
# Use the nikkud form from our input data (more reliable)
|
||||
nikkud_to_use = word_nikkud_input
|
||||
|
||||
# Build entries for each form
|
||||
for form in forms:
|
||||
entry = {
|
||||
"word_nikkud": nikkud_to_use,
|
||||
"form_type": "conjugation",
|
||||
"pos": "Verb",
|
||||
"slug": slug,
|
||||
}
|
||||
if form not in new_forms:
|
||||
new_forms[form] = []
|
||||
# Check for duplicate entry
|
||||
if not any(e["slug"] == slug for e in new_forms[form]):
|
||||
new_forms[form].append(entry)
|
||||
total_new_forms += 1
|
||||
|
||||
scraped_count += 1
|
||||
# Collect samples (first 3 completed)
|
||||
if len(sample_verbs) < 3:
|
||||
sample_verbs[wni] = sorted(forms)
|
||||
|
||||
print(f" [{i + 1}/{len(to_scrape)}] {wni} -> {slug} ({len(forms)} forms)")
|
||||
done_wnis.add(wni)
|
||||
|
||||
except Exception as e:
|
||||
print(f" [{i + 1}/{len(to_scrape)}] ERROR {wni}: {e}")
|
||||
skipped_count += 1
|
||||
done_wnis.add(wni)
|
||||
|
||||
# Save progress every 50 verbs
|
||||
if (i + 1) % 50 == 0:
|
||||
progress = {"done_wnis": list(done_wnis), "slug_cache": slug_cache}
|
||||
save_json(progress, PROGRESS_FILE)
|
||||
# Save partial merged result
|
||||
merged = dict(existing_forms)
|
||||
for form, entries in new_forms.items():
|
||||
if form in merged:
|
||||
existing_slugs = {e["slug"] for e in merged[form]}
|
||||
for entry in entries:
|
||||
if entry["slug"] not in existing_slugs:
|
||||
merged[form].append(entry)
|
||||
else:
|
||||
merged[form] = entries
|
||||
save_json(merged, PARTIAL_FILE)
|
||||
print(f" -- Progress saved at {i + 1}/{len(to_scrape)} --")
|
||||
|
||||
# Final merge
|
||||
merged = dict(existing_forms)
|
||||
for form, entries in new_forms.items():
|
||||
if form in merged:
|
||||
existing_slugs = {e["slug"] for e in merged[form]}
|
||||
for entry in entries:
|
||||
if entry["slug"] not in existing_slugs:
|
||||
merged[form].append(entry)
|
||||
else:
|
||||
merged[form] = entries
|
||||
|
||||
save_json(merged, OUTPUT_FILE)
|
||||
|
||||
# Save final progress
|
||||
progress = {"done_wnis": list(done_wnis), "slug_cache": slug_cache}
|
||||
save_json(progress, PROGRESS_FILE)
|
||||
|
||||
# Clean up partial file
|
||||
if os.path.exists(PARTIAL_FILE):
|
||||
os.remove(PARTIAL_FILE)
|
||||
|
||||
# Summary
|
||||
print(f"\n{'=' * 50}")
|
||||
print("SUMMARY")
|
||||
print(f"{'=' * 50}")
|
||||
print(f"Verbs scraped: {scraped_count}")
|
||||
print(f"Verbs skipped: {skipped_count}")
|
||||
print(f"New forms added: {total_new_forms}")
|
||||
print(f"Total unique ktiv male forms: {len(merged)}")
|
||||
print(f"Previous forms count: {len(existing_forms)}")
|
||||
print(f"Net new form keys: {len(merged) - len(existing_forms)}")
|
||||
|
||||
if sample_verbs:
|
||||
print("\nSample verbs:")
|
||||
for wni, forms in list(sample_verbs.items())[:3]:
|
||||
print(f"\n {wni} ({len(forms)} forms):")
|
||||
for f in forms[:8]:
|
||||
print(f" {f}")
|
||||
if len(forms) > 8:
|
||||
print(f" ... and {len(forms) - 8} more")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
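Reviewer sketch for the merge logic in `main` above: the same "merge new forms into existing, deduplicating by slug" loop appears twice, once for the partial save and once for the final save. One possible way to factor it out is shown below. The helper name `merge_forms` and the sample data are assumptions, and unlike `dict(existing_forms)` this version copies the per-form lists so the input is not mutated.

```python
def merge_forms(
    existing: dict[str, list[dict]],
    new: dict[str, list[dict]],
) -> dict[str, list[dict]]:
    """Merge new form entries into existing ones, at most one entry per slug per form."""
    # Copy each list so appending below leaves `existing` untouched.
    merged = {form: list(entries) for form, entries in existing.items()}
    for form, entries in new.items():
        bucket = merged.setdefault(form, [])
        seen = {e["slug"] for e in bucket}
        for entry in entries:
            if entry["slug"] not in seen:
                bucket.append(entry)
                seen.add(entry["slug"])
    return merged


# Hypothetical sample data with placeholder slugs.
existing_sample = {"כתב": [{"slug": "a"}]}
new_sample = {"כתב": [{"slug": "a"}, {"slug": "b"}], "כתבה": [{"slug": "a"}]}
merged_sample = merge_forms(existing_sample, new_sample)
```

Calling one helper from both save points would keep the partial file and the final file consistent if the dedupe rule ever changes.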
|
|
@ -1,31 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
word = 'אבל'
|
||||
url = f'https://www.pealim.com/search/?q={word}'
|
||||
headers = {
|
||||
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
|
||||
}
|
||||
|
||||
try:
|
||||
response = requests.get(url, headers=headers, timeout=10)
|
||||
print(f'Status: {response.status_code}')
|
||||
soup = BeautifulSoup(response.content, 'html.parser')
|
||||
|
||||
# Debug: check what we find
|
||||
word_elem = soup.find('h1', class_='word-title')
|
||||
pos_elem = soup.find('span', class_='pos')
|
||||
definition_elem = soup.find('div', class_='definition')
|
||||
|
||||
print(f'word_elem found: {word_elem is not None}')
|
||||
print(f'pos_elem found: {pos_elem is not None}')
|
||||
print(f'definition_elem found: {definition_elem is not None}')
|
||||
|
||||
print('\n--- HTML snippet (first 3000 chars) ---')
|
||||
print(soup.prettify()[:3000])
|
||||
|
||||
except Exception as e:
|
||||
print(f'Error: {e}')
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
0
tests/__init__.py
Normal file
45
tests/test_smoke.py
Normal file
|
|
@ -0,0 +1,45 @@
|
|||
"""Smoke tests for the Hebrew Flash Cards project."""
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
# Ensure project root is on path
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||
|
||||
|
||||
def test_helpers_strip_nikkud():
|
||||
from helpers import strip_nikkud
|
||||
|
||||
assert strip_nikkud("שָׁלוֹם") == "שלום"
|
||||
assert strip_nikkud("hello") == "hello"
|
||||
assert strip_nikkud("") == ""
|
||||
|
||||
|
||||
def test_apkg_builder_imports():
|
||||
import apkg_builder
|
||||
|
||||
assert hasattr(apkg_builder, "build_vocab_deck")
|
||||
assert hasattr(apkg_builder, "build_conj_deck")
|
||||
assert apkg_builder.VOCAB_MODEL_ID == 1_701_222_017_968
|
||||
|
||||
|
||||
def test_data_files_exist():
|
||||
data_dir = Path(__file__).resolve().parent.parent / "data"
|
||||
assert (data_dir / "hebrew_dict_for_anki.csv").exists(), "vocab CSV missing"
|
||||
assert (data_dir / "conjugations.json").exists(), "conjugations cache missing"
|
||||
|
||||
|
||||
def test_strip_nikkud_idempotent():
|
||||
from helpers import strip_nikkud
|
||||
|
||||
plain = "שלום"
|
||||
assert strip_nikkud(plain) == plain
|
||||
|
||||
|
||||
def test_strip_nikkud_all_marks():
|
||||
from helpers import strip_nikkud
|
||||
|
||||
# Sample word containing patach, dagesh, shva, and kamatz
|
||||
nikkud = "הַמַּלְכָּה"
|
||||
plain = strip_nikkud(nikkud)
|
||||
assert all(ch < "\u0591" or ch > "\u05C7" for ch in plain), f"Residual nikkud in: {plain}"
|
||||
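Reviewer sketch for the smoke tests above: `helpers.py` is not shown in this part of the diff, so the following is only a guess at a `strip_nikkud` that would satisfy these tests. It assumes the helper removes the same code-point range (U+0591 to U+05C7) that the scraper strips inline and that the last test checks for.

```python
import re

# Assumed range: Hebrew cantillation marks and vowel points, as used elsewhere in this commit.
_NIKKUD_RE = re.compile(r"[\u0591-\u05C7]")


def strip_nikkud(text: str) -> str:
    """Remove Hebrew points and accents, leaving all other characters unchanged."""
    return _NIKKUD_RE.sub("", text)


stripped = strip_nikkud("שָׁלוֹם")
```

One consequence of this range worth knowing: it also covers the maqaf (U+05BE), so hyphen-joined words would be fused if the real helper uses the same range.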
130
validate_apkg.py
|
|
@ -14,7 +14,6 @@ import json
|
|||
import os
|
||||
import re
|
||||
import sqlite3
|
||||
import struct
|
||||
import sys
|
||||
import tempfile
|
||||
import zipfile
|
||||
|
|
@ -22,6 +21,9 @@ from pathlib import Path
|
|||
|
||||
VOCAB_APKG = Path("output/hebrew_vocabulary.apkg")
|
||||
CONJ_APKG = Path("output/hebrew_conjugations.apkg")
|
||||
CONF_APKG = Path("output/hebrew_confusables.apkg")
|
||||
PLURAL_APKG = Path("output/hebrew_plurals.apkg")
|
||||
COMPLETE_APKG = Path("output/hebrew_complete.apkg")
|
||||
|
||||
PASS = "\033[32m✓\033[0m"
|
||||
FAIL = "\033[31m✗\033[0m"
|
||||
|
|
@ -60,7 +62,6 @@ def _detect_format(data: bytes) -> str:
|
|||
|
||||
def validate_apkg(apkg_path: Path) -> int:
|
||||
"""Run all checks. Returns number of failures."""
|
||||
name = apkg_path.name
|
||||
print(f"\n{'=' * 60}")
|
||||
print(f" Validating: {apkg_path}")
|
||||
print(f"{'=' * 60}")
|
||||
|
|
@ -78,16 +79,17 @@ def validate_apkg(apkg_path: Path) -> int:
|
|||
print("\n[ZIP structure]")
|
||||
try:
|
||||
zf = zipfile.ZipFile(apkg_path)
|
||||
except zipfile.BadZipFile as e:
|
||||
print(f" {FAIL} Invalid ZIP: {e}")
|
||||
return 1
|
||||
|
||||
with zf, tempfile.TemporaryDirectory() as tmpdir:
|
||||
namelist = zf.namelist()
|
||||
has_db = "collection.anki2" in namelist
|
||||
has_media = "media" in namelist
|
||||
failures += 0 if check("collection.anki2 present", has_db) else 1
|
||||
failures += 0 if check("media manifest present", has_media) else 1
|
||||
except zipfile.BadZipFile as e:
|
||||
print(f" {FAIL} Invalid ZIP: {e}")
|
||||
return 1
|
||||
|
||||
with tempfile.TemporaryDirectory() as tmpdir:
|
||||
zf.extractall(tmpdir)
|
||||
|
||||
# --- Media manifest ---
|
||||
|
|
@ -116,8 +118,11 @@ def validate_apkg(apkg_path: Path) -> int:
|
|||
size = zf.getinfo(num).file_size if num in zf.NameToInfo else -1
|
||||
if size == 0:
|
||||
zero_byte.append(orig)
|
||||
failures += 0 if check("No zero-byte media files", len(zero_byte) == 0,
|
||||
f"{len(zero_byte)} empty" if zero_byte else "") else 1
|
||||
failures += (
|
||||
0
|
||||
if check("No zero-byte media files", len(zero_byte) == 0, f"{len(zero_byte)} empty" if zero_byte else "")
|
||||
else 1
|
||||
)
|
||||
|
||||
# Check audio format sample (first 20 mp3s)
|
||||
mp3_names = [num for num, orig in media_map.items() if orig.endswith(".mp3")]
|
||||
|
|
@ -127,16 +132,19 @@ def validate_apkg(apkg_path: Path) -> int:
|
|||
fmt = _detect_format(data)
|
||||
if "MP3" not in fmt:
|
||||
bad_format.append(f"{media_map[num]}: {fmt}")
|
||||
failures += 0 if check(
|
||||
failures += (
|
||||
0
|
||||
if check(
|
||||
f"Audio format (sampled {min(20, len(mp3_names))} files)",
|
||||
len(bad_format) == 0,
|
||||
"; ".join(bad_format) if bad_format else f"all MP3",
|
||||
) else 1
|
||||
"; ".join(bad_format) if bad_format else "all MP3",
|
||||
)
|
||||
else 1
|
||||
)
|
||||
|
||||
# Fonts present
|
||||
font_files = [v for v in original_names if v.endswith(".ttf")]
|
||||
check("Heebo font files bundled", len(font_files) >= 1,
|
||||
", ".join(font_files) if font_files else "none found")
|
||||
check("Heebo font files bundled", len(font_files) >= 1, ", ".join(font_files) if font_files else "none found")
|
||||
|
||||
# --- Database ---
|
||||
print("\n[Database]")
|
||||
|
|
@ -144,8 +152,7 @@ def validate_apkg(apkg_path: Path) -> int:
|
|||
conn = sqlite3.connect(db_path)
|
||||
|
||||
schema_ver = conn.execute("SELECT ver FROM col").fetchone()[0]
|
||||
failures += 0 if check("Schema version 11 (Anki 2.1)", schema_ver == 11,
|
||||
f"got {schema_ver}") else 1
|
||||
failures += 0 if check("Schema version 11 (Anki 2.1)", schema_ver == 11, f"got {schema_ver}") else 1
|
||||
|
||||
note_count = conn.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
|
||||
card_count = conn.execute("SELECT COUNT(*) FROM cards").fetchone()[0]
|
||||
|
|
@ -153,33 +160,37 @@ def validate_apkg(apkg_path: Path) -> int:
|
|||
failures += 0 if check("Cards present", card_count > 0, f"{card_count:,} cards") else 1
|
||||
|
||||
# Determine expected cards per note from model templates
|
||||
# Some templates are optional (e.g. cloze only generates when field is non-empty),
|
||||
# so we check that cards fall between min and max expected range.
|
||||
models_json_raw = conn.execute("SELECT models FROM col").fetchone()[0]
|
||||
models_raw = json.loads(models_json_raw)
|
||||
tmpl_counts = [len(m["tmpls"]) for m in models_raw.values()]
|
||||
expected_ratio = tmpl_counts[0] if len(set(tmpl_counts)) == 1 else None
|
||||
if expected_ratio:
|
||||
failures += 0 if check(
|
||||
f"{expected_ratio} card(s) per note",
|
||||
card_count == note_count * expected_ratio,
|
||||
f"{note_count} notes × {expected_ratio} = {note_count * expected_ratio}, got {card_count}",
|
||||
) else 1
|
||||
if len(set(tmpl_counts)) == 1 and len(tmpl_counts) == 1:
|
||||
expected_ratio = tmpl_counts[0]
|
||||
# Allow fewer cards when optional templates exist (e.g. cloze)
|
||||
min_cards = note_count # at least 1 card per note
|
||||
max_cards = note_count * expected_ratio
|
||||
failures += (
|
||||
0
|
||||
if check(
|
||||
f"Cards per note (1–{expected_ratio} templates)",
|
||||
min_cards <= card_count <= max_cards,
|
||||
f"{card_count:,} cards from {note_count:,} notes",
|
||||
)
|
||||
else 1
|
||||
)
|
||||
|
||||
# Duplicate GUIDs
|
||||
dup_guids = conn.execute(
|
||||
"SELECT guid, COUNT(*) c FROM notes GROUP BY guid HAVING c > 1"
|
||||
).fetchall()
|
||||
failures += 0 if check("No duplicate GUIDs", len(dup_guids) == 0,
|
||||
f"{len(dup_guids)} duplicates") else 1
|
||||
dup_guids = conn.execute("SELECT guid, COUNT(*) c FROM notes GROUP BY guid HAVING c > 1").fetchall()
|
||||
failures += 0 if check("No duplicate GUIDs", len(dup_guids) == 0, f"{len(dup_guids)} duplicates") else 1
|
||||
|
||||
# Card queue states
|
||||
queues = conn.execute(
|
||||
"SELECT type, queue, COUNT(*) FROM cards GROUP BY type, queue"
|
||||
).fetchall()
|
||||
queues = conn.execute("SELECT type, queue, COUNT(*) FROM cards GROUP BY type, queue").fetchall()
|
||||
queue_map = {(t, q): cnt for t, q, cnt in queues}
|
||||
new_cards = queue_map.get((0, 0), 0)
|
||||
suspended = queue_map.get((0, -1), 0) + queue_map.get((1, -1), 0) + queue_map.get((2, -1), 0)
|
||||
if new_cards > 0:
|
||||
check(f"Cards in new queue (type=0, queue=0)", True, f"{new_cards:,}")
|
||||
check("Cards in new queue (type=0, queue=0)", True, f"{new_cards:,}")
|
||||
if suspended > 0:
|
||||
warn("Suspended cards", f"{suspended:,}")
|
||||
|
||||
|
|
@ -190,23 +201,18 @@ def validate_apkg(apkg_path: Path) -> int:
|
|||
per_days = {dc.get("new", {}).get("perDay") for dc in dconf.values() if isinstance(dc, dict)}
|
||||
check("new.order configured", bool(orders), f"{orders}")
|
||||
if per_days:
|
||||
check("new.perDay > 0", all(p and p > 0 for p in per_days if p is not None),
|
||||
f"perDay={per_days}")
|
||||
check("new.perDay > 0", all(p and p > 0 for p in per_days if p is not None), f"perDay={per_days}")
|
||||
|
||||
# Deck assignment
|
||||
decks_json = conn.execute("SELECT decks FROM col").fetchone()[0]
|
||||
decks = json.loads(decks_json)
|
||||
real_decks = {did: d for did, d in decks.items() if did != "1"}
|
||||
if real_decks:
|
||||
check("Custom deck exists (not Default only)", True,
|
||||
", ".join(d["name"] for d in real_decks.values()))
|
||||
check("Custom deck exists (not Default only)", True, ", ".join(d["name"] for d in real_decks.values()))
|
||||
# All cards in the custom deck?
|
||||
for did_str in real_decks:
|
||||
assigned = conn.execute(
|
||||
"SELECT COUNT(*) FROM cards WHERE did=?", [int(did_str)]
|
||||
).fetchone()[0]
|
||||
check(f"Cards in deck '{real_decks[did_str]['name']}'", assigned > 0,
|
||||
f"{assigned:,}/{card_count:,}")
|
||||
assigned = conn.execute("SELECT COUNT(*) FROM cards WHERE did=?", [int(did_str)]).fetchone()[0]
|
||||
check(f"Cards in deck '{real_decks[did_str]['name']}'", assigned > 0, f"{assigned:,}/{card_count:,}")
|
||||
|
||||
# --- Sound references vs media manifest ---
|
||||
print("\n[Sound references]")
|
||||
|
|
@@ -218,16 +224,21 @@ def validate_apkg(apkg_path: Path) -> int:
|
|||
|
||||
missing_audio = sound_refs - original_names
|
||||
orphaned_audio = original_names - sound_refs - set(font_files)
|
||||
failures += 0 if check("All sound refs in media manifest", len(missing_audio) == 0,
|
||||
f"{len(missing_audio)} missing" if missing_audio else "") else 1
|
||||
failures += (
|
||||
0
|
||||
if check(
|
||||
"All sound refs in media manifest",
|
||||
len(missing_audio) == 0,
|
||||
f"{len(missing_audio)} missing" if missing_audio else "",
|
||||
)
|
||||
else 1
|
||||
)
|
||||
if orphaned_audio:
|
||||
warn("Media files not referenced by any card", f"{len(orphaned_audio)} orphaned")
|
||||
|
||||
notes_with_audio = sum(
|
||||
1 for (flds,) in notes_flds if "[sound:" in flds
|
||||
)
|
||||
notes_with_audio = sum(1 for (flds,) in notes_flds if "[sound:" in flds)
|
||||
pct = notes_with_audio / note_count * 100 if note_count else 0
|
||||
check(f"Notes with audio", notes_with_audio > 0, f"{notes_with_audio:,}/{note_count:,} ({pct:.0f}%)")
|
||||
check("Notes with audio", notes_with_audio > 0, f"{notes_with_audio:,}/{note_count:,} ({pct:.0f}%)")
|
||||
|
||||
# --- Empty fields check ---
|
||||
print("\n[Field content]")
|
||||
|
|
@@ -236,22 +247,12 @@ def validate_apkg(apkg_path: Path) -> int:
|
|||
field_names = [f["name"] for f in model["flds"]]
|
||||
# Check required fields (first 3) are not empty
|
||||
required_idx = list(range(min(3, len(field_names))))
|
||||
all_notes_for_model = conn.execute("SELECT flds FROM notes WHERE mid=?", [int(mid_str)]).fetchall()
|
||||
for idx in required_idx:
|
||||
fname = field_names[idx]
|
||||
empty_count = conn.execute(
|
||||
"""SELECT COUNT(*) FROM notes
|
||||
WHERE mid=? AND (
|
||||
flds LIKE ? OR
|
||||
instr(flds, char(31)) = 0
|
||||
)""",
|
||||
[int(mid_str), "\x1f" * idx + "\x1f%"],
|
||||
).fetchone()[0]
|
||||
# Simpler: count notes where field idx is empty
|
||||
all_notes_for_model = conn.execute(
|
||||
"SELECT flds FROM notes WHERE mid=?", [int(mid_str)]
|
||||
).fetchall()
|
||||
empty = sum(
|
||||
1 for (flds,) in all_notes_for_model
|
||||
1
|
||||
for (flds,) in all_notes_for_model
|
||||
if len(flds.split("\x1f")) <= idx or not flds.split("\x1f")[idx].strip()
|
||||
)
|
||||
if empty > 0:
|
||||
|
|
@@ -271,6 +272,9 @@ def main() -> None:
|
|||
group = parser.add_mutually_exclusive_group()
|
||||
group.add_argument("--vocab", action="store_true", help="Validate vocabulary deck only")
|
||||
group.add_argument("--conjugations", action="store_true", help="Validate conjugation deck only")
|
||||
group.add_argument("--confusables", action="store_true", help="Validate confusables deck only")
|
||||
group.add_argument("--plurals", action="store_true", help="Validate plurals deck only")
|
||||
group.add_argument("--complete", action="store_true", help="Validate complete combined deck only")
|
||||
args = parser.parse_args()
|
||||
|
||||
targets: list[Path] = []
|
||||
|
|
@@ -280,8 +284,14 @@ def main() -> None:
|
|||
targets = [VOCAB_APKG]
|
||||
elif args.conjugations:
|
||||
targets = [CONJ_APKG]
|
||||
elif args.confusables:
|
||||
targets = [CONF_APKG]
|
||||
elif args.plurals:
|
||||
targets = [PLURAL_APKG]
|
||||
elif args.complete:
|
||||
targets = [COMPLETE_APKG]
|
||||
else:
|
||||
targets = [VOCAB_APKG, CONJ_APKG]
|
||||
targets = [VOCAB_APKG, CONJ_APKG, CONF_APKG, PLURAL_APKG, COMPLETE_APKG]
|
||||
|
||||
total_failures = 0
|
||||
for path in targets:
|
||||
|
|
|
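For reference, the relaxed card-count check introduced in validate_apkg.py above can be read as a simple range test. The following is a minimal sketch, not the project's code: the function name `card_count_in_range` and the parameter `templates_per_note` are hypothetical stand-ins for the `expected_ratio` value taken from `tmpl_counts[0]` in the diff.

```python
def card_count_in_range(note_count: int, card_count: int, templates_per_note: int) -> bool:
    """Return True when each note could have produced 1..templates_per_note cards.

    Hypothetical helper mirroring the range check in the diff: optional
    templates (e.g. cloze) may generate no card for some notes, so the
    total is allowed to fall below note_count * templates_per_note.
    """
    min_cards = note_count  # at least 1 card per note
    max_cards = note_count * templates_per_note  # every template rendered
    return min_cards <= card_count <= max_cards
```

With two templates per note, 100 notes may yield anywhere from 100 to 200 cards; anything outside that range would count as a failure.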
|||
|
|
@@ -120,7 +120,7 @@ def main() -> None:
|
|||
print(f"ERROR: {SOURCE_FILE} not found", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
lines = [l.strip() for l in SOURCE_FILE.read_text(encoding="utf-8").splitlines() if l.strip()]
|
||||
lines = [line.strip() for line in SOURCE_FILE.read_text(encoding="utf-8").splitlines() if line.strip()]
|
||||
print(f"Loaded {len(lines)} entries from {SOURCE_FILE.name}")
|
||||
print(f"Querying pealim.com (delay {REQUEST_DELAY}s per request)…\n")
|
||||
|
||||
|
|
@@ -137,14 +137,19 @@ def main() -> None:
|
|||
|
||||
if issue_type == "REVIEW":
|
||||
# Don't query pealim for known-bad entries
|
||||
print(f"REVIEW (skipping query)")
|
||||
results.append({
|
||||
"line": line_num, "word": word,
|
||||
print("REVIEW (skipping query)")
|
||||
results.append(
|
||||
{
|
||||
"line": line_num,
|
||||
"word": word,
|
||||
"expected_binyan": expected_binyan,
|
||||
"slug": "", "page_binyan": "",
|
||||
"status": "REVIEW", "notes": issue_note,
|
||||
"slug": "",
|
||||
"page_binyan": "",
|
||||
"status": "REVIEW",
|
||||
"notes": issue_note,
|
||||
"is_3ms": is_3ms_by_position,
|
||||
})
|
||||
}
|
||||
)
|
||||
continue
|
||||
|
||||
time.sleep(REQUEST_DELAY)
|
||||
|
|
@@ -171,13 +176,18 @@ def main() -> None:
|
|||
notes = ""
|
||||
|
||||
print(f"{status:<12} slug={slug or '-':<35} binyan={page_binyan or '-'}")
|
||||
results.append({
|
||||
"line": line_num, "word": word,
|
||||
results.append(
|
||||
{
|
||||
"line": line_num,
|
||||
"word": word,
|
||||
"expected_binyan": expected_binyan,
|
||||
"slug": slug or "", "page_binyan": page_binyan,
|
||||
"status": status, "notes": notes,
|
||||
"slug": slug or "",
|
||||
"page_binyan": page_binyan,
|
||||
"status": status,
|
||||
"notes": notes,
|
||||
"is_3ms": is_3ms_by_position or issue_type == "3ms",
|
||||
})
|
||||
}
|
||||
)
|
||||
|
||||
# ── Write cleaned verbs_input.txt ────────────────────────────────────────────
|
||||
sections: dict[str, list[str]] = {b: [] for b in SECTION_HEADERS}
|
||||
|
|
@@ -219,7 +229,6 @@ def main() -> None:
|
|||
print(f"\nWrote → {OUTPUT_FILE}")
|
||||
|
||||
# ── Print summary table ──────────────────────────────────────────────────────
|
||||
col_w = [4, 22, 14, 38, 12]
|
||||
print("\n" + "=" * 95)
|
||||
print("VALIDATION REPORT")
|
||||
print("=" * 95)
|
||||
|
|
@@ -232,8 +241,7 @@ def main() -> None:
|
|||
)
|
||||
print("=" * 95)
|
||||
|
||||
counts = {s: sum(1 for r in results if r["status"] == s)
|
||||
for s in ("OK", "3ms", "MISMATCH", "REVIEW", "NOT_FOUND")}
|
||||
counts = {s: sum(1 for r in results if r["status"] == s) for s in ("OK", "3ms", "MISMATCH", "REVIEW", "NOT_FOUND")}
|
||||
print(
|
||||
f"\nSummary: {counts['OK']} OK | {counts['3ms']} 3ms-past | "
|
||||
f"{counts['MISMATCH']} MISMATCH | {counts['REVIEW']} REVIEW | {counts['NOT_FOUND']} NOT_FOUND"
|
||||
|
|
@@ -241,10 +249,7 @@ def main() -> None:
|
|||
print(f"Total entries: {len(results)}")
|
||||
|
||||
if counts["REVIEW"] > 0 or counts["NOT_FOUND"] > 0 or counts["MISMATCH"] > 0:
|
||||
print(
|
||||
"\n⚠ Review flagged entries in verbs_input.txt before running:\n"
|
||||
" python3 conjugation_extract.py"
|
||||
)
|
||||
print("\n⚠ Review flagged entries in verbs_input.txt before running:\n python3 conjugation_extract.py")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
|
|
|||
|
|
@@ -2,6 +2,8 @@
|
|||
# Lines prefixed '# 3ms:' are searched by 3ms past form (Pu'al/Huf'al).
|
||||
|
||||
# Pa'al (פָּעַל)
|
||||
# slug: להיות 454-lihyot
|
||||
להיות
|
||||
לשמור
|
||||
ללמוד
|
||||
לאסוף
|
||||
|
|
|
|||
3
vulture_whitelist.py
Normal file
|
|
@@ -0,0 +1,3 @@
|
|||
# Vulture whitelist: suppress false positives for interface methods
|
||||
# HTMLParser.handle_starttag requires (self, tag, attrs) signature
|
||||
attrs # noqa
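To illustrate why the whitelist entry above is needed, here is a minimal sketch of the kind of parser that triggers the false positive. The class `TagCollector` and the helper `collect_tags` are hypothetical and do not come from this repository; only `html.parser.HTMLParser` and its `handle_starttag(self, tag, attrs)` signature are real.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical parser that records tag names and ignores attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []

    def handle_starttag(self, tag, attrs):
        # `attrs` must stay in the signature because HTMLParser calls this
        # method with it, but this subclass never reads it. A dead-code
        # scanner may therefore report it as an unused variable.
        self.tags.append(tag)


def collect_tags(html: str) -> list[str]:
    """Feed the markup to the parser and return start tags in document order."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags
```

Listing the bare name `attrs` in a whitelist module counts as a use, so the scanner stops flagging the parameter without the method signature having to change.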
|
||||