Compare commits

..

38 commits

Author SHA1 Message Date
6d2d446ed5 feat: pseudo-frequency for confusables using English word frequency
264 confusable groups where all entries shared the same Hebrew frequency
now have differentiated pseudo_frequency values based on English word
commonality (hermitdave en_50k.txt). Most common meaning keeps base
rank; less common meanings get +100 offset per position.

Examples:
- אב: "father" (en:194) → 2491, "bud" (en:2963) → 2591
- אח: "brother" (en:300) → 911, "fireplace" (en:9389) → 1011

Builder uses pseudo_frequency for sort order when available.
Confusable card definitions now sorted most-common-first.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-03 05:28:30 +00:00
f978e5f39a fix: vet fallback emoji — verb gate + expanded stop list removes 852 bad matches
The fallback emoji system (keyword→Unicode char matching at build time)
was producing 1,733 matches, many with wrong-sense emoji:
- "high, tall" →  (from "high voltage")
- "to cut" → 🥩 (cut of meat)
- "city" → 🇻🇦 (Vatican flag)

Two fixes:
1. Skip fallback for verbs (meanings starting "to ") — 476 removed
2. Expand _EMOJI_STOP with 100+ polysemous/abstract keywords — 376 more

Result: 1733 → 881 fallback matches (49% reduction). The 114 from_pealim
emojis (concrete nouns like 🍎 apple, 🐪 camel) are unaffected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-03 05:17:31 +00:00
5f617af4ba feat: vet emoji assignments — 114 visible, 3 removed, 1 fixed
Reviewed all 117 pealim-inherited emoji assignments:
- Made 114 correct assignments visible (emoji_visible: true)
- Removed: goblet (🏆 is trophy), fitness (🏋 too abstract), red (💄 is lipstick)
- Fixed: onion 🌰🧅 (was chestnut emoji)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-26 23:19:43 +00:00
f3496998f5 feat: confusables show ktiv male, emoji/prep stripping fully upstream
- Confusables deck front now shows shared ktiv male form instead of
  nikkud variants joined by "/". Back still shows nikkud with definitions.
- Fixed list scraper EMOJI_RE to catch variation selectors (U+FE0F) and
  ZWJ (U+200D) — cleaned 17 entries with leftover selectors in meaning.
- Removed build-time prep extraction fallback (0 entries relied on it).
- release.py: fix keeshare field name (API_TOKEN → password).

Closes: Pealim #11 (emoji/prep upstream), Pealim #16 (confusables ktiv male)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-21 02:19:03 +00:00
138acb06d8 bump RELEASE_TAG to v0.20
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 14:09:19 +00:00
0a85291975 feat(v0.20): adaptive sentence difficulty scoring — 3163 scored cloze sentences
Replaces length-based sentence selection with frequency-based difficulty
scoring. Easiest sentences (most common context words) become cloze candidates.

Score range: 1-50000, median=179. Coverage: 3163/3325 cloze entries scored.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 13:35:23 +00:00
14d567a261 schema: add difficulty_score field + update spec with MIN_WORDS=3
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 13:30:13 +00:00
8b24d0fd26 Task 7: Add integration tests for frequency-based sentence scoring
Tests verify that update_words_json produces a cloze with `difficulty_score`,
that vetted sentences are sorted by difficulty, and that the easiest sentence
becomes the cloze candidate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-15 13:29:25 +00:00
272a2a080d Task 7: Replace length-based scoring with frequency-based scoring in update_words_json
- Import `sentence_difficulty.build_nikkud_map`, `score_sentence`, and `frequency_lookup`
- Build `nikkud_index`, `nikkud_map`, `freq_data` once before the per-word loop
- Replace `_score()` closure with call to `score_sentence()` (median context-word rank)
- Store `difficulty_score` in cloze dict for downstream use

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-15 13:29:22 +00:00
fb12f806a8 feat: add sentence_difficulty module with 5-tier frequency scoring
Implements build_nikkud_map(), _resolve_token_frequency(), and
score_sentence() for v0.20 adaptive cloze sentence selection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 13:23:21 +00:00
00fba934fb feat(epub_examples): export try_strip_prefix as public alias
Exposes _try_strip_prefix under a public name so the upcoming
sentence_difficulty module can reuse Hebrew prefix stripping logic
without duplicating it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-15 13:18:55 +00:00
d2a7c9d483 feat(epub_examples): lower MIN_WORDS from 4 to 3
Hebrew is more concise than English — 3-word sentences are valid
candidates for the example sentence pool. Expands the pool for the
upcoming adaptive sentence difficulty ranking.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-15 13:18:44 +00:00
d0f4aea58d feat(frequency_lookup): add get_freq_data() for batch frequency access
Exposes the full word→rank dict for use by the upcoming sentence_difficulty
module without requiring per-word lookups.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-15 13:18:11 +00:00
b3ea086e85 v0.20 design spec + nikkud-to-ktiv-male converter
Add Academy-rules-based nikkud→ktiv male converter (91.6% accuracy
vs 77.2% for strip_nikkud) and v0.20 adaptive sentence difficulty
cloze design spec. The converter enables frequency-based sentence
scoring by properly resolving nikkud tokens to their ktiv male forms
for frequency corpus lookup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 12:57:14 +00:00
af186e2030 Sprint 17: homograph example dedup + plural audio + prep extraction
- Homograph collision fix: _deduplicate_confusable_examples() clears
  shared examples from less-common confusable group members (36 entries
  fixed). Keeps examples only on highest-frequency meaning.
- Plural deck audio: wired up PluralAudio field in apkg_builder.py,
  downloaded 613 plural audio files from pealim.com for all deck entries.
- Prep extraction upstream: moved Hebrew preposition parsing from build
  time into list/detail scrapers (SCHEMA.yaml prep field added).
- Validation: new no_shared_confusable_examples check in validate_data.py
- Tests: 9 new unit tests for confusable deduplication (98 total)
- Release: v0.19

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 21:51:35 +00:00
0d92451271 Sprint 16: collapsible card details + related words table
- All secondary fields (shoresh, PoS, ktiv male, plural, related words)
  behind a "מידע נוסף" toggle button using HTML <details>/<summary>
- Conjugation back: English meaning, binyan also behind toggle
- Related words: table format with word + meaning, sorted by frequency
- Hebrew words not bold, English meanings 24px gray (#555)
- "מִילִים קְשׁוּרוֹת" sub-header with nikkud inside toggle
- "אֵיךְ אוֹמְרִים" prompt centered using hint class
- New CSS: .more-toggle, .more-header, .related-header, .rw-word, .rw-meaning
- Dark mode support for all new classes
- Bump to v0.18

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 01:34:14 +00:00
c85063ee2f Sprint 15: example sentence pipeline overhaul + corpus expansion + card improvements
- Regenerated all example sentences from scratch (deleted legacy + stale entries)
- Added .txt file support to epub_examples.py for Ben Yehuda corpus
- 7 Ben Yehuda nikkud'd children's texts + 3 new Time Tunnel EPUBs
- Maqaf-stripped construct form indexing (+68% inflected matches)
- Total: 3,598 words with examples, 3,289 with cloze (was ~2,900)
- Cloze prefix preservation (_cloze_prefix_len)
- Hebrew spoiler stripping from English meanings
- Gender field (זָכָר/נְקֵבָה) on vocab cards
- sec-table CSS layout for aligned key:value pairs
- Mishkal uses mishkal_hebrew on plural cards
- Improved mishkal extraction from pealim detail pages
- 21 new pytest tests (cloze, PoS, Hebrew stripping, gender, mishkal)
- 2 new validate_data.py tests + mishkal stats
- Colliding forms tracking (local-only)
- Release tag v0.17

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 10:44:14 +00:00
efd0745ada Sprint 14: deck template/CSS overhaul + Sprint 12 detail scrape
Template & CSS fixes (15 items from Mar 9 feedback):
- Fix conjugation front showing 3ms form instead of infinitive
- Rename conjugation model to "Hebrew Conjugation"
- Strip Hebrew parenthesized text from English meanings
- Shoresh separator: spaces → dots (א.כ.ל)
- Remove duplicate English meaning from cloze back
- Remove example sentences from vocab front/back (cloze only)
- Center-align audio buttons on all decks
- Fix parenthesis spacing: "you(feminine,singular)" → "you (feminine, singular)"
- Unify sec-key/sec-label fonts, make keys bold
- Size overhaul: bigger Hebrew (42px), meaning (34px), secondary (28px)
- Center-align related words groups
- Sort confusables by average frequency
- Plurals: show Gender (Hebrew) before Mishkal, strip emoji from meaning
- Clean duplicate quotation marks in cloze sentences

Sprint 12 carry-forward (detail scrape + EPUB):
- Adjective/preposition detail scraping in pealim_detail_scrape.py
- EPUB example matching rewrite in epub_examples.py
- Delete benyehuda.py and rebuild_sentence_matches.py (merged)
- 49 parser tests for detail scraping
- SCHEMA.yaml updates for new fields

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 07:44:47 +00:00
3b0f9defa9 feat: YAP-cleaned frequency corpus + two-tier assignment pipeline
- Add clean_frequency_corpus.py: YAP morphological analyzer removes
  prefix+word combos (e.g. בבית=ב+בית) from he_50k frequency data.
  Headwords always protected. 30,430 clean entries from 49,999 raw.
- Add assign_frequency.py: two-tier assignment with PoS-aware homograph
  handling. Tier 1 matches headwords; Tier 2 matches inflections (any rank)
  and conjugations (rank>5000 only, to avoid false positives).
  Function words claim frequency over content words in homograph groups,
  with manual overrides for 12 common dual-use words.
- frequency_lookup.py auto-prefers frequency_clean.json when available
- 6,691 entries now have frequency (was 5,974), 717 newly assigned

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 06:22:55 +00:00
b8b65442cb restore epub_examples.py and rebuild_sentence_matches.py
Accidentally removed in 6c2a0f8 — these are the EPUB sentence
extraction and matching scripts used to build vetted_sentences.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 04:33:32 +00:00
04a4b52113 fix: deduplicate 66 plural GUIDs for homograph nouns
Homographs (same nikkud form, different meanings) had identical
plurals_guid values. Regenerated unique GUIDs by including meaning
in the hash. Also updated build-time fallback to use meaning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 04:12:45 +00:00
f6af714e22 bump RELEASE_TAG to v0.15.1
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 04:08:35 +00:00
b2fef5aa8a Sprint 11.1: strip_nikkud cleanup, dead code removal, test fixes
Remove strip_nikkud from all pipeline files — use ktiv_male directly.
Fix case-insensitive binyan matching in detail scraper (og:description
uses UPPERCASE). Fix integration test slugs and test limits. Delete
legacy CSVs, stale .apkg, and dead scripts from git. Add vulture to
pre-commit hook.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 04:03:47 +00:00
a1d970a782 fix: reorder pipeline — detail scrape immediately after list scrape
List scrape captures slugs needed by detail scrape, so they should be
adjacent. Reordered: list→detail→frequency→examples→audio→fonts→images→build

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 19:16:57 +00:00
6c2a0f8eed chore: remove legacy scraping scripts replaced by unified pipeline
Removed 11 files that are no longer called by the active pipeline:
- hebrew_extract.py (replaced by pealim_list_scrape.py)
- conjugation_extract.py (replaced by pealim_detail_scrape.py)
- scripts/scrape_noun_plurals.py, scrape_verb_ktiv.py, scrape_ktiv_male.py
  (all replaced by pealim_detail_scrape.py)
- scripts/migrate_to_json.py, repair_slugs.py (one-time migration, complete)
- epub_examples.py, rebuild_sentence_matches.py (unused utilities)
- scripts/extract_pdf_sentences.py, add_slugs.py (unused one-off scripts)

Kept: check_guid_coverage.py, validate_data.py, extract_verb_list.py,
validate_apkg.py, validate_verb_list.py, release.py (standalone utilities)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 11:08:33 +00:00
08fb7009d8 Sprint 11: unified JSON architecture + consolidated scraping pipeline
Migrate from fragmented CSV + 10 JSON files to a single data/words.json
(9,104 entries) as the unified data store. All GUIDs preserved for Anki
study progress continuity.

New files:
- SCHEMA.yaml: authoritative schema for words.json
- pealim_list_scrape.py: consolidated list page scraper → words.json
- pealim_detail_scrape.py: noun/verb detail scraper → words.json
- pealim_audio_download.py: audio downloader reading from words.json
- scripts/migrate_to_json.py: one-time CSV→JSON migration
- scripts/validate_data.py: 17 data integrity tests
- scripts/check_guid_coverage.py: GUID preservation checker
- scripts/repair_slugs.py: slug deduplication repair tool
- tests/test_scraper_integration.py: live scraper integration tests

Updated:
- apkg_builder.py: reads from words.json (no more pandas)
- run.py: 8-step pipeline (list scrape → frequency → examples →
  detail scrape → audio download → fonts → images → build)
- benyehuda.py, frequency_lookup.py, image_fetch.py: TODO markers
  for future words.json integration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 10:54:58 +00:00
2e48109d7f v0.15: PoS fix, slug-based audio, CSS cleanup, template improvements
- Fix PoS substring bug: "Pronoun" no longer matches "Noun"
- CSS: reduce sec-label/sec-key font sizes, add .definitions/.conf-entry
- Slug-based audio filenames for confusable words (no more collisions)
- Scraper captures slug from pealim.com list page links
- Confusables: RTL alignment, re-enable audio (remove all-must-have gate)
- Plurals: blue given word, gray meaning, labeled mishkal badge
- Conjugation: add "אֵיךְ אוֹמְרִים" prompt, tense prefix (בְּ),
  Prep field from HBPAREN_RE, labeled RelatedVocab
- Ben Yehuda: skip stripped fallback for confusable words
- Bump RELEASE_TAG to v0.15

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 17:50:23 +00:00
802c369365 v0.14: rescrape vocab, formatting fixes for all decks
- Full pealim.com rescrape: 9,120 words (15 new), all with audio URLs
- Plurals deck: 2:1 regular:irregular ratio (649 notes), RTL arrows, 1.6x hint text
- Conjugation deck: blue infinitive on front, plain meaning on back, nikkud labels
- Confusables deck: larger prompt text (32px), audio only when all words have it
- Validator: non-audio variants no longer false-fail on audio check
- 14 new audio files downloaded

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 09:26:41 +00:00
def2fc1aca fix: card formatting, example sentence homograph protection, plural coverage
Formatting (#5):
- Labels now display with nikkud (שֹׁרֶשׁ, חֵלֶק דִּיבּוּר, רַבִּים, etc.)
- Secondary fields below audio 1.6x bigger (20px → 32px)
- Label keys styled separately (.sec-key class, smaller/dimmer than values)
- Example sentences centered on card (margin: auto, max-width: 90%)
- Emoji only on English side (removed duplicate from Eng→Heb back)
- Broken images hidden via onerror handler

Example sentences (#6):
- Confusable words (same consonants, different nikkud) now only match
  example sentences by exact nikkud form, preventing wrong-word sentences
- Same protection applied to cloze sentence and vetted sentence lookups

Plural coverage (#3):
- Added stripped-nikkud fallback for noun plural matching
- 3,918 nouns now show plurals (was ~3,604, +314 from fallback)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 08:45:53 +00:00
5685270dfa chore: add PROJECT_NOTES.md, update .gitignore
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 08:20:31 +00:00
34bec8f4ce feat: add release.py for automated Forgejo releases
Creates git tag, Forgejo release, uploads all 12 deck variants.
Supports --dry-run, --validate, --force for re-uploading.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 08:12:53 +00:00
17f7458d19 Sprint 9: cloze cards, plurals deck, project reorg, lint tooling
- Cloze card pipeline: 924 cards from 2,296 AI-vetted Hebrew book sentences
- Plurals deck: 375 notes (144 irregular + 231 regular from 86 mishkal patterns)
- Ktiv male forms expanded to 20,711 entries for sentence matching
- Project reorg: helpers.py (deduped strip_nikkud from 10 files), scripts/ for
  one-off tools, tests/ with smoke tests, deleted 3 dead files
- Lint tooling: pyproject.toml with ruff/vulture/bandit/pytest config, .editorconfig,
  fixed all 129 ruff errors (B023 closure fix, SIM103, unused vars)
- validate_apkg.py: card count range check for optional cloze template
- Data caches committed: vetted_sentences, ktiv_male_forms, noun_plurals,
  noun_slug_map, vocab_sentence_matches, epub_sentence_index

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 08:09:39 +00:00
419e952389 feat: curated emoji denylist, vocab audio URLs in CSV
- Expanded _EMOJI_STOP from ~20 to ~80 keywords after manual review
  of all 2,261 emoji-word pairs. Removes false positives from
  polysemous words (french→🍟, water→🤽, rock→🪨, etc.)
- Emoji count: 2,261 → 1,820 (removed ~440 bad matches)
- hebrew_dict.csv now populated with audio_url from pealim.com scrape
  (8,727 words with audio URLs)
- Cached emoji_lookup.json (1,749 keywords from Unicode emoji-test.txt)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 12:29:15 +00:00
607fd1a3bc feat: emoji Unicode lookup, conj nikkud, fix summary metric
- Emoji: _load_emoji_lookup() fetches unicode.org emoji-test.txt, builds
  {keyword: emoji_char} map cached in data/emoji_lookup.json. Falls back
  to empty dict on network failure. build_all_variants() loads once and
  passes to all build_vocab_deck() calls. For each word without pealim
  emoji, tries first 5 keywords from English meaning against lookup.
- Nikkud: זכר→זָכָר, נקבה→נְקֵבָה in PRESENT_EXPANSION constants and
  build_conj_deck() 1st-person gender labels.
- Summary: conj audio file count now excludes _infinitive and _passive_
  on-disk extras never bundled in .apkg (was 2235, now shows ~1765).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 21:24:10 +00:00
3fc3a21a33 fix: load conjugations from cache when --skip-conjugations passed
Previously --skip-conjugations returned None, causing build_all_variants()
to produce near-empty conjugation decks (0.3MB font-only files). Now loads
from conjugations.json cache so all 6 release variants build correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 21:01:47 +00:00
ccd7d61efb Add 6-variant release build (4 vocab + 2 conj), bump to v0.12
- build_vocab_deck(): include_audio/include_images flags
- build_conj_deck(): include_audio flag
- build_all_variants(): builds all 6 apkg files in one call
- Variants: hebrew_vocabulary{,_audio,_images,_audio_images}.apkg
            hebrew_conjugations{,_audio}.apkg
- run.py: step_build_all() replaces step_build_vocab(); conjugation
  extraction reuses cached conjugations.json unless refreshed
- RELEASE_TAG bumped to v0.12

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 20:58:06 +00:00
62c92ffae0 Emoji/image mutual exclusion, font size increases, layout
- Emoji and image are now mutually exclusive: emoji shown if present,
  image used as fallback ({{^Emoji}}{{#Image}}...{{/Image}}{{/Emoji}})
- Emoji shown on English card front (under meaning) — both card directions
- Emoji appears directly under meaning on backs, before secondary info
- sec-label: 16px → 20px; root-info/example: 16px → 18px; related-group: 15px → 18px
- hebrew-sm font-weight:normal (prep label no longer inherits bold)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 20:51:20 +00:00
0e4b041331 Fix emoji ordering and prep font weight
- Emoji now shown above image on both back templates (was below)
- Emoji also shown on English→Hebrew card front (visual cue with meaning)
- hebrew-sm: add font-weight:normal (was inheriting bold from .hebrew parent)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 06:09:55 +00:00
65 changed files with 2746378 additions and 66138 deletions

26
.claude/settings.json Normal file
View file

@ -0,0 +1,26 @@
{
"hooks": {
"PostToolUse": [
{
"matcher": "Edit|Write",
"hooks": [
{
"type": "command",
"command": "file=\"$CLAUDE_FILE_PATH\"; if [ -n \"$file\" ] && echo \"$file\" | grep -q '\\.py$'; then ruff format --quiet \"$file\" && ruff check --fix --quiet \"$file\" 2>/dev/null; fi"
}
]
}
],
"PreToolUse": [
{
"matcher": "Edit|Write",
"hooks": [
{
"type": "command",
"command": "file=\"$CLAUDE_FILE_PATH\"; if echo \"$file\" | grep -qE '(legacy_guid_map\\.json|\\.env)$'; then echo 'BLOCKED: Protected file — legacy_guid_map.json and .env are read-only' >&2; exit 2; fi"
}
]
}
]
}
}

15
.editorconfig Normal file
View file

@ -0,0 +1,15 @@
root = true
[*]
indent_style = space
indent_size = 4
end_of_line = lf
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true
[*.{json,yml,yaml,toml}]
indent_size = 2
[*.md]
trim_trailing_whitespace = false

25
.gitignore vendored
View file

@ -11,9 +11,11 @@ pyvenv.cfg
venv/ venv/
__pycache__/ __pycache__/
*.pyc *.pyc
.pytest_cache/
# Large generated cache files (rebuild locally) # Large generated cache files (rebuild locally)
data/benyehuda_index.json data/benyehuda_index.json
data/colliding_forms.json
# Audio directories (large; rebuild locally) # Audio directories (large; rebuild locally)
data/audio/ data/audio/
@ -28,9 +30,32 @@ output/
# Internal / private files — not for public repo # Internal / private files — not for public repo
ANKIWEB_DESCRIPTION.md ANKIWEB_DESCRIPTION.md
PROJECT_NOTES.md
PROJECTS.md PROJECTS.md
SPRINT_LOG.md SPRINT_LOG.md
CLAUDE.md CLAUDE.md
RECOMMENDATIONS.md
# Intermediate scrape progress files
data/ktiv_male_forms.json.partial
data/ktiv_male_forms_partial.json
data/ktiv_scrape_progress.json
data/noun_slug_map_progress.json
data/top_verbs_to_scrape.json
# EPUB source files (large; user-specific)
data/epubs/
# Stray deck files
Everything__*.apkg
*.apkg
# Legacy CSV files (replaced by data/words.json)
*.csv
data/*.csv
# Dead whitelist files
vulture_whitelist.py
# Release artifacts — distributed via Forgejo releases, not committed to tree # Release artifacts — distributed via Forgejo releases, not committed to tree
releases/ releases/

154
README.md
View file

@ -6,16 +6,17 @@
## For Hebrew learners ## For Hebrew learners
This project generates two Anki decks for learning Modern Hebrew: A set of Anki flashcard decks for learning Modern Hebrew — vocabulary, verb conjugations, and more. All words include nikkud (vowel marks), audio, and are sorted by frequency so you learn the most useful words first.
- **Vocabulary deck** — ~9,100 words from [pealim.com](https://www.pealim.com/dict/), with nikkud (vowel marks), roots, parts of speech, related words, and example sentences from classic Hebrew literature. ### What's included
- **Conjugation deck** — 70 paradigm verbs from Coffin & Bolozky's *A Reference Grammar of Modern Hebrew* (2005), fully conjugated in all tenses and persons, across all seven binyanim.
All card data comes from open or academic sources: - **Vocabulary** — ~9,100 Hebrew words with pronunciation audio, roots, example sentences from Hebrew literature, images, and frequency rankings.
- Word data: [pealim.com](https://www.pealim.com) — a free Modern Hebrew dictionary - **Verb conjugations** — 71 core verbs fully conjugated in all tenses and persons, covering all seven binyanim (verb patterns).
- Example sentences: [Project Ben-Yehuda](https://benyehuda.org) — public-domain Hebrew literature corpus - **Confusables** — Words that look the same without vowel marks (e.g., דָּבָר "thing" vs. דִּבֵּר "spoke") shown side by side so you can tell them apart.
- Word frequency: [hermitdave/FrequencyWords](https://github.com/hermitdave/FrequencyWords) — Hebrew frequency list - **Noun plurals** — Practice forming singular↔plural pairs, with a focus on irregular plurals and common patterns.
- Verb paradigm list: Coffin, Edna Amir and Shmuel Bolozky. *A Reference Grammar of Modern Hebrew*. Cambridge University Press, 2005. - **All-in-one** — A combined deck with everything above, organized as subdecks.
You can download and import any deck individually — or use the combined deck to get everything at once.
--- ---
@ -25,17 +26,19 @@ All card data comes from open or academic sources:
2. Double-click to import into [Anki](https://apps.ankiweb.net/) (free, cross-platform) 2. Double-click to import into [Anki](https://apps.ankiweb.net/) (free, cross-platform)
3. Start studying 3. Start studying
Both decks can be imported independently. If you already have one, re-importing the same file updates your deck without losing study progress. All decks can be imported independently — pick just the ones you want. Re-importing the same file later updates your deck without losing study progress.
--- ---
## What's in the vocabulary deck ## What's in the vocabulary deck
Each card has two sides: Each note generates up to three cards:
**Hebrew → English:** See the Hebrew word (with nikkud) + hear audio → recall the meaning. **Hebrew → English:** See the Hebrew word (with nikkud) + hear audio → recall the meaning.
**English → Hebrew:** See the English meaning → recall the Hebrew word, its root, and how to write it. **English → Hebrew:** See the English meaning → recall the Hebrew word. When multiple words share the same English meaning, a disambiguation hint (part of speech + binyan) helps you know which word is expected.
**Sentence Cloze:** A Hebrew sentence with the target word blanked out → fill in the missing word. Only generated for words with a vetted example sentence. Tests recognition in context.
Fields on each card: Fields on each card:
| Field | Example | | Field | Example |
@ -43,44 +46,70 @@ Fields on each card:
| Hebrew word (nikkud) | שָׁמַר | | Hebrew word (nikkud) | שָׁמַר |
| Meaning | kept, watched over | | Meaning | kept, watched over |
| Root | שמ״ר | | Root | שמ״ר |
| Part of speech | פועל (verb) | | Part of speech | פועל — פָּעַל |
| Without nikkud | שמר | | Without nikkud | שמר |
| Related words | שׁוֹמֵר, שְׁמִירָה | | Related words | שׁוֹמֵר, שְׁמִירָה (grouped by Part of Speech) |
| Example sentence | from Ben-Yehuda corpus | | Example sentence | from nikkud'd Hebrew books |
| Audio | pronunciation from pealim.com | | Audio | pronunciation from pealim.com |
| Frequency rank | #412 | | Frequency rank | #412 |
| Image / Emoji | for concrete nouns |
| Plural form | for nouns: רבים: שֻׁלְחָנוֹת |
| Disambiguation hint | for ambiguous Eng→Heb cards |
Cards are presented in **frequency order** — Anki will show you the most common words first. Frequency rank is displayed on every card so you can see how common each word is. Words not in the top 50,000 show a "50k+" badge. Cards are presented in **frequency order** — Anki will show you the most common words first. Note that because frequency is collected with words without nikkud, words that have the same letters but different nikkud will be assigned the same frequency.
### Eng→Heb disambiguation
When two Hebrew words translate to the same English (e.g., both mean "to return"), the Eng→Heb card shows a hint to tell them apart:
- **Layer 1:** Automatic Part of Speech + binyan hints for words with different parts of speech (163 words)
- **Layer 2:** AI-refined distinct glosses for true synonyms sharing the same Part of Speech (440 words)
--- ---
## What's in the conjugation deck ## What's in the conjugation deck
70 paradigm verbs from Coffin & Bolozky's *A Reference Grammar of Modern Hebrew* (Appendix 1), covering all seven binyanim: 71 verbs listed in Appendix 1 of Coffin & Bolozky's *A Reference Grammar of Modern Hebrew* covering all seven binyanim, and **all irregular forms**
- פָּעַל (Pa'al), נִפְעַל (Nif'al), פִּעֵל (Pi'el), פֻּעַל (Pu'al) - פָּעַל (Pa'al), נִפְעַל (Nif'al), פִּעֵל (Pi'el), פֻּעַל (Pu'al)
- הִתְפַּעֵל (Hitpa'el), הִפְעִיל (Hif'il), הֻפְעַל (Huf'al) - הִתְפַּעֵל (Hitpa'el), הִפְעִיל (Hif'il), הֻפְעַל (Huf'al)
Each verb is drilled in: present, past, future, and imperative — all persons and genders. The infinitive is shown on the card front as context but is not quizzed. Each verb is drilled in: present, past, future, and imperative — all persons and genders. Each card shows the English meaning and related vocabulary from the same root.
**Present tense expansion:** Each present form generates 3 cards (one per pronoun that uses it), so you learn אֲנִי, אַתָּה, and הוּא all separately with the same masculine singular form. **Present tense expansion:** Each present tense form randomly generates a pronoun to be shown in the front of the card, so you acclimate to seeing אֲנִי, אַתָּה, and הוּא with the conjugated verb, even though they are all conjugated the same in present tense.
**Modern Hebrew 2fp/3fp:** Classical feminine plural future forms (e.g., תִּשְׁמֹרְנָה) are shown in parentheses; the card's primary answer is the modern masculine plural form used in everyday speech. **Modern Hebrew 2fp/3fp:** Classical feminine plural future forms (e.g., תִּשְׁמֹרְנָה) are shown in parentheses, and played via audio (for the audio-included decks). the card's primary answer is the modern masculine plural form used in everyday speech.
**Passive label:** Pu'al and Huf'al cards show the active partner's infinitive on the front (e.g., לְבַטֵּל) followed by **(סָבִיל)** in smaller text, so you know you're drilling the passive conjugation. Active verbs show no label. **Passive label:** Pu'al and Huf'al cards show the active partner's infinitive on the front (e.g., לְבַטֵּל) followed by **(סָבִיל)** in smaller text, so you know you're drilling the passive conjugation.
**Card order:** New cards are introduced in random order. **Card order:** New conjugation cards are introduced in random order (not grouped by verb).
**Citation:** Coffin, Edna Amir and Shmuel Bolozky. *A Reference Grammar of Modern Hebrew*. Cambridge University Press, 2005. ---
## What's in the confusables deck
Hebrew without vowel marks is full of lookalikes. This deck groups words that are spelled identically without nikkud and asks "מה ההבדל?" (what's the difference?). The answer reveals all the words side by side with their nikkud and definitions.
Examples: דָּבָר (thing) vs. דִּבֵּר (spoke), סֵפֶר (book) vs. סָפַר (counted) vs. סַפָּר (barber).
---
## What's in the plurals deck
Two card directions for each noun:
- **Singular → Plural:** See שֻׁלְחָן → produce שֻׁלְחָנוֹת
- **Plural → Singular:** See שֻׁלְחָנוֹת → produce שֻׁלְחָן
Focuses on irregular plurals (the tricky ones that don't follow the rules) and common examples from each noun pattern. Cards are tagged by pattern for filtered study.
--- ---
## Suggested study strategy ## Suggested study strategy
Start with the vocabulary deck. Anki will present the most frequent words first. Don't try to study to many cards every single day-- Anki suggests 20 per day. Start with the vocabulary deck. Anki will present the most frequent words first. Don't try to study too many cards every single day — Anki suggests 20 per day.
The conjugation cards reinforce verb forms you've already seen in vocabulary. The conjugation cards reinforce verb forms you've already seen in vocabulary.
Use the Hebrew → English direction to build reading comprehension. Use the English → Hebrew direction to build writing and speaking recall. Use the Hebrew → English direction to build reading comprehension. Use the English → Hebrew direction to build writing and speaking recall. The sentence cloze cards test whether you can recognize words in real Hebrew text.
--- ---
@ -90,9 +119,11 @@ Use the Hebrew → English direction to build reading comprehension. Use the Eng
**Project Ben-Yehuda** — A public-domain digital library of Hebrew literature. Example sentences come from the nikkud corpus (classic texts with full vowel marks). **Project Ben-Yehuda** — A public-domain digital library of Hebrew literature. Example sentences come from the nikkud corpus (classic texts with full vowel marks).
**Hebrew books** — Additional example sentences from nikkud'd (menukad) Hebrew books, with Claude Sonnet AI-vetted quality filtering. The AI doesn't generate the sentences, it just determines whether it is a high quality sentence as an example, or not.
**FrequencyWords** — An open Hebrew word frequency list derived from subtitle data. Used to sort vocabulary cards from most to least common. **FrequencyWords** — An open Hebrew word frequency list derived from subtitle data. Used to sort vocabulary cards from most to least common.
**Coffin & Bolozky** — The verb paradigm list for the conjugation deck comes from Appendix 1 of *A Reference Grammar of Modern Hebrew* (Cambridge University Press, 2005), which provides a comprehensive reference for Modern Hebrew verbal morphology. **Coffin & Bolozky** — The verb list, and known good conjugation reference for the conjugation deck comes from Appendix 1 of *A Reference Grammar of Modern Hebrew* (Cambridge University Press, 2005).
--- ---
@ -102,7 +133,7 @@ If you notice a wrong translation, missing audio, or incorrect conjugation:
- For vocabulary errors: the source is pealim.com — you can suggest corrections there. But if you think morfix has a correct translation and pealim.com does not, we may be able to encode an override. - For vocabulary errors: the source is pealim.com — you can suggest corrections there. But if you think morfix has a correct translation and pealim.com does not, we may be able to encode an override.
For any other issue, whether you know to code or not: Email me at pealim [at] nevo [dot] engineer For any other issue, whether you know how to code or not: Email me at hebrew [at] nevo [dot] engineer
--- ---
@ -136,13 +167,14 @@ python run.py --skip-scrape --refresh-examples
``` ```
python run.py [options] python run.py [options]
--skip-scrape Use cached data/hebrew_dict.csv (no pealim.com scraping) --only {vocab,conjugations,confusables,plurals,complete}
Build only one deck type
--skip-scrape Use cached data/hebrew_dict.csv
--skip-audio Skip audio .mp3 downloads --skip-audio Skip audio .mp3 downloads
--skip-examples Skip Ben Yehuda example fetching --skip-examples Skip Ben Yehuda example fetching
--only {vocab,conjugations} Run only one deck (skips all unrelated steps) --skip-conjugations Skip verb conjugation extraction
--skip-conjugations Skip verb conjugation extraction (deprecated: use --only vocab)
--skip-images Skip image fetching for concrete nouns --skip-images Skip image fetching for concrete nouns
--refresh-examples Force rebuild of Ben Yehuda index (nikkud corpus) --refresh-examples Force rebuild of Ben Yehuda index
--test N Process only first N words --test N Process only first N words
``` ```
@ -150,28 +182,60 @@ python run.py [options]
| File | Description | | File | Description |
|------|-------------| |------|-------------|
| `data/hebrew_dict.csv` | Raw dictionary | | `output/hebrew_vocabulary.apkg` | Vocabulary deck (text only) |
| `data/hebrew_dict_for_anki.csv` | Enriched Anki CSV | | `output/hebrew_vocabulary_audio.apkg` | Vocabulary deck + audio |
| `data/conjugations.json` | Verb conjugation data | | `output/hebrew_vocabulary_images.apkg` | Vocabulary deck + images |
| `data/audio/` | Vocabulary audio (.mp3) | | `output/hebrew_vocabulary_audio_images.apkg` | Vocabulary deck + audio + images |
| `data/audio_conj/` | Conjugation audio (.mp3) | | `output/hebrew_conjugations.apkg` | Conjugation deck |
| `data/fonts/` | Heebo font files (bundled in .apkg) | | `output/hebrew_conjugations_audio.apkg` | Conjugation deck + audio |
| `data/images/` | Noun images from Wikipedia/Commons | | `output/hebrew_confusables.apkg` | Confusables deck |
| `data/image_cache.json` | Image fetch cache | | `output/hebrew_confusables_audio.apkg` | Confusables deck + audio |
| `output/hebrew_vocabulary.apkg` | Vocabulary Anki deck | | `output/hebrew_plurals.apkg` | Plurals deck |
| `output/hebrew_conjugations.apkg` | Conjugation Anki deck | | `output/hebrew_plurals_audio.apkg` | Plurals deck + audio |
| `output/hebrew_complete.apkg` | All decks combined |
| `output/hebrew_complete_audio.apkg` | All decks combined + audio |
### Data files
| File | Description |
|------|-------------|
| `data/hebrew_dict_for_anki.csv` | Enriched vocabulary CSV |
| `data/conjugations.json` | Verb conjugation data (71 verbs) |
| `data/noun_plurals.json` | Noun plural/construct forms |
| `data/refined_meanings.json` | AI-disambiguated meanings (440 words) |
| `data/vetted_sentences.json` | AI-vetted example sentences |
| `data/ktiv_male_forms.json` | Ktiv male (plene) forms for sentence matching |
| `data/legacy_guid_map.json` | Legacy GUIDs for study progress preservation |
### Pipeline overview ### Pipeline overview
1. `hebrew_extract.py` — scrapes pealim.com dictionary 1. `hebrew_extract.py` — scrapes pealim.com dictionary
2. `frequency_lookup.py` — downloads/loads Hebrew frequency data 2. `frequency_lookup.py` — downloads/loads Hebrew frequency data
3. `benyehuda.py` — builds sentence index from Ben-Yehuda corpus 3. `benyehuda.py` — builds sentence index from Ben-Yehuda nikkud corpus
4. `extract_verb_list.py` — extracts verb list from Coffin & Bolozky PDF 4. `extract_verb_list.py` — extracts verb list from Coffin & Bolozky PDF
5. `conjugation_extract.py` — fetches conjugation tables from pealim.com 5. `conjugation_extract.py` — fetches conjugation tables + meanings from pealim.com
6. `image_fetch.py` — fetches Wikipedia/Commons images for concrete nouns 6. `image_fetch.py` — fetches Wikipedia/Commons images for concrete nouns
7. `validate_verb_list.py` — validates verb list against pealim.com 7. `scrape_noun_plurals.py` — scrapes noun plural/construct forms from pealim.com
8. `apkg_builder.py` — assembles both `.apkg` files 8. `scrape_ktiv_male.py` — scrapes ktiv male (plene) forms for sentence matching
9. `run.py` — orchestrates all steps 9. `rebuild_sentence_matches.py` — matches vocab words to book sentences
10. `apkg_builder.py` — assembles all `.apkg` files
11. `run.py` — orchestrates all steps
12. `validate_apkg.py` — validates output decks
---
## Deck variants
| Variant | Contents | Size |
|---------|----------|------|
| `hebrew_vocabulary.apkg` | Text + images | ~15 MB |
| `hebrew_vocabulary_audio.apkg` | Text + images + audio | ~80 MB |
| `hebrew_conjugations.apkg` | Text only | ~1 MB |
| `hebrew_conjugations_audio.apkg` | Text + audio | ~5 MB |
| `hebrew_confusables.apkg` | Text only | ~1 MB |
| `hebrew_plurals.apkg` | Text only | ~1 MB |
| `hebrew_complete.apkg` | Everything combined | ~20 MB |
| `hebrew_complete_audio.apkg` | Everything + audio | ~90 MB |
--- ---

192
SCHEMA.yaml Normal file
View file

@ -0,0 +1,192 @@
# Hebrew Flash Cards — Unified Data Schema (words.json)
# Revised based on Nevo's feedback (2026-03-08)
#
# Top-level: dict keyed by unique_key
# Unique key: nikkud word for most entries (e.g. "אָב")
# For 146 homographs (same nikkud, different meaning): "word|pos" e.g. "אָח|Noun"
# For same nikkud AND same pos: "word|pos|meaning" e.g. "אָח|Noun|brother"
#
# Hebrew text fields use nikkud/ktiv_male subfields:
# field:
# nikkud: "אָב" # with nikkud (hebstyle=mo)
# ktiv_male: "אב" # plene spelling (hebstyle=vl)
# This pattern applies to: word, singular, plural, construct forms, conjugated forms, etc.
#
# Pronoun notation for conjugation forms uses grammatical codes:
# 1s, 1p, 2ms, 2fs, 2mp, 2fp, 3ms, 3fs, 3mp, 3fp
# (not Hebrew pronoun strings, which are ambiguous for gender in some persons)
entry:
# --- Core Identity ---
word:
nikkud: "אָב"
ktiv_male: "אב"
slug: "6009-av" # Pealim URL slug (e.g. pealim.com/dict/6009-av/)
root: ["א", "ב"] # Shoresh as list of consonant chars
pos: "Noun" # Part of speech in English (as from pealim)
pos_hebrew: "שֵׁם עֶצֶם" # Part of speech in Hebrew (with nikkud)
meaning: "father" # English meaning (cleaned — no inline emoji, no Hebrew prepositions)
meaning_raw: "father 👨" # Original meaning as scraped (may contain emoji and/or Hebrew preps)
prep: "על" # Hebrew preposition(s) governing this word, extracted from meaning_raw (e.g. "(על)" → "על"); null if none
audio_url: "https://..." # Pealim audio URL
audio_file: "6009-av.mp3" # Local filename (slug-based for confusables, consonant-based otherwise)
tags: "" # Pealim tags if any
last_scrape_date: "2026-03-08" # ISO date of most recent pealim.com scrape for this entry
# --- Identity & Progress ---
vocab_legacy_guid: "abc123..." # Vocab note GUID from legacy_guid_map.json
# Other note GUIDs stored in their respective sections (cloze, plurals, conjugation)
# --- Frequency ---
frequency: 412 # Hebrew frequency rank from hermitdave/FrequencyWords he_50k (ktiv male based)
pseudo_frequency: null # Adjusted frequency for confusable homographs (deferred to future sprint)
# --- Display Enrichment ---
emoji: "👨"
emoji_source: "ai_vetted" # One of: ai_vetted, from_pealim, null
emoji_visible: false # Whether to show on cards (false until emoji vetting is done)
image: "father.jpg" # Wikipedia/Commons image filename, or null
image_source: "wikipedia" # One of: wikipedia, commons, null
hint: "" # Eng→Heb disambiguation hint (from refined_meanings.json)
# --- Shared Roots ---
shared_roots: [] # List of unique_keys of other words sharing the same root
# Computed by iterating all entries and grouping by root
# --- Confusables ---
confusable_group: null # List of unique_keys sharing same ktiv_male, or null
# e.g. ["אָח|Noun|brother", "אָח|Noun|fireplace"]
# --- Example Sentences ---
examples:
vetted: # AI-vetted sentences from Ben Yehuda / EPUB corpus
- text: "הָאָב הָלַךְ לַעֲבוֹדָה"
source: "ben_yehuda" # One of: ben_yehuda, epub_little_prince, epub_alice, ...
vetted: true
cloze: # Best sentence for cloze card, or null
text: "הָאָב הָלַךְ לַעֲבוֹדָה"
cloze_word_start: 0 # Character offset of the clozed word in text
cloze_word_end: 4 # End offset — enables exact extraction regardless of nikkud changes
cloze_hint: "family member"
cloze_guid: "def456..." # GUID for the cloze note
difficulty_score: 234 # Median frequency rank of context words (lower = easier); optional
rejected_count: 0
# --- Noun-specific: Inflection Forms ---
noun_inflection: null # null for non-nouns
# When populated:
# plurals_guid: "ghi789..." # GUID for plurals deck note
# singular: # null if noun is inherently plural (e.g. bicycle/אופניים)
# nikkud: "אָב"
# ktiv_male: "אב"
# plural:
# nikkud: "אָבוֹת"
# ktiv_male: "אבות"
# singular_audio: "6009-av.mp3"
# plural_audio: null # TODO: scrape from detail pages
# construct_singular:
# nikkud: "אֲבִי"
# ktiv_male: "אבי"
# construct_plural:
# nikkud: "אֲבוֹת"
# ktiv_male: "אבות"
# pronominal_suffixes: # Scraped from pealim "forms with pronominal affixes" section
# 1s:
# nikkud: "אָבִי"
# ktiv_male: "אבי"
# 1p:
# nikkud: "אָבִינוּ"
# ktiv_male: "אבינו"
# 2ms: ...
# 2fs: ...
# 2mp: ...
# 2fp: ...
# 3ms: ...
# 3fs: ...
# 3mp: ...
# 3fp: ...
# gender: "masculine"
# gender_hebrew:
# nikkud: "זָכָר"
# ktiv_male: "זכר"
# mishkal: "CaCaC" # English mishkal name (scraped from pealim PoS section)
# mishkal_hebrew: "קָטָל" # Hebrew mishkal name (computed via mapping)
# --- Verb-specific: Conjugation Data ---
conjugation: null # null for non-verbs
# When populated:
# in_conjugation_deck: true # Whether this verb is in the 71-verb conjugation deck
# infinitive:
# nikkud: "לִשְׁמֹר"
# ktiv_male: "לשמור"
# reference_form: # 3ms past (the citation form)
# nikkud: "שָׁמַר"
# ktiv_male: "שמר"
# binyan: "Pa'al" # English binyan name
# binyan_hebrew: "פָּעַל" # Hebrew binyan name (with nikkud)
# prep: "על" # Hebrew preposition the verb takes, or null
# active_forms:
# - person: "1s" # Grammatical code: 1s, 1p, 2ms, 2fs, 2mp, 2fp, 3ms, 3fs, 3mp, 3fp
# tense: "עָבָר"
# form:
# nikkud: "שָׁמַרְתִּי"
# ktiv_male: "שמרתי"
# audio_url: "https://..."
# audio_file: null # For future use
# hufal_pual_forms: null # Same structure as active_forms; non-null only for hif'il/pi'el verbs
# # When non-null, binyan MUST be Hif'il or Pi'el (validated)
# reference_form_passive: # 3ms past of the huf'al/pu'al counterpart, or null
# nikkud: "שֻׁמַּר"
# ktiv_male: "שומר"
# --- Adjective-specific ---
adjective_inflection: null # null for non-adjectives
# When populated:
# ms:
# nikkud: "גָּדוֹל"
# ktiv_male: "גדול"
# fs:
# nikkud: "גְּדוֹלָה"
# ktiv_male: "גדולה"
# mp:
# nikkud: "גְּדוֹלִים"
# ktiv_male: "גדולים"
# fp:
# nikkud: "גְּדוֹלוֹת"
# ktiv_male: "גדולות"
# mishkal: "CaCaC" # English mishkal name (scraped from pealim PoS section)
# mishkal_hebrew: "קָטָל" # Hebrew mishkal name (computed via mapping)
# --- Preposition-specific ---
preposition_inflection: null # null for non-prepositions
# When populated:
# 1s:
# nikkud: "שֶׁלִּי"
# ktiv_male: "שלי"
# 1p:
# nikkud: "שֶׁלָּנוּ"
# ktiv_male: "שלנו"
# 2ms:
# nikkud: "שֶׁלְּךָ"
# ktiv_male: "שלך"
# 2fs:
# nikkud: "שֶׁלָּךְ"
# ktiv_male: "שלך"
# 2mp:
# nikkud: "שֶׁלָּכֶם"
# ktiv_male: "שלכם"
# 2fp:
# nikkud: "שֶׁלָּכֶן"
# ktiv_male: "שלכן"
# 3ms:
# nikkud: "שֶׁלּוֹ"
# ktiv_male: "שלו"
# 3fs:
# nikkud: "שֶׁלָּהּ"
# ktiv_male: "שלה"
# 3mp:
# nikkud: "שֶׁלָּהֶם"
# ktiv_male: "שלהם"
# 3fp:
# nikkud: "שֶׁלָּהֶן"
# ktiv_male: "שלהן"

File diff suppressed because it is too large Load diff

View file

@ -1,205 +0,0 @@
#!/usr/bin/env python3
"""
Ben Yehuda corpus example-sentence lookup (nikkud corpus).
Downloads the nikkud-bearing plaintext ZIP once, indexes sentences by nikkud word form,
then answers queries locally.
Exposed API:
load(force_rebuild=False)
get_examples(word_nikkud) -> list[str] (returns 0 or 1 examples)
save_examples_cache()
"""
import json
import logging
import re
import unicodedata
import zipfile
from io import BytesIO
from pathlib import Path
import requests
logger = logging.getLogger(__name__)
# Nikkud-bearing corpus (txt.zip instead of txt_stripped.zip)
CORPUS_URL = (
"https://github.com/projectbenyehuda/public_domain_dump/releases/"
"download/2025-10/txt.zip"
)
INDEX_PATH = Path(__file__).parent / "data" / "benyehuda_index.json"
EXAMPLES_CACHE_PATH = Path(__file__).parent / "data" / "examples_cache.json"
REQUEST_TIMEOUT = 120
MIN_SENTENCE_LEN = 20
MAX_SENTENCE_LEN = 200
MAX_INDEX_ENTRIES = 500 # cap examples kept per word in index to limit memory
# Module-level state
_index: dict[str, list[str]] = {} # word (with nikkud) -> [sentence, ...]
_examples_cache: dict[str, list[str]] = {} # word -> cached result for this run
def _strip_nikkud(text: str) -> str:
return "".join(
ch for ch in unicodedata.normalize("NFD", text)
if unicodedata.category(ch) != "Mn"
)
def _split_sentences(text: str) -> list[str]:
"""
Split text into sentences on newlines only (Hebrew sentences don't have
mid-word period issues like English). Min 20 chars, max 200 chars.
"""
out = []
for line in text.split("\n"):
s = line.strip().strip("\"'.,;:!?")
s = s.strip()
if MIN_SENTENCE_LEN <= len(s) <= MAX_SENTENCE_LEN:
out.append(s)
return out
def _build_index(corpus_zip_bytes: bytes) -> None:
"""Parse corpus ZIP and build word (nikkud) → sentences index."""
global _index
_index = {}
logger.info("Building Ben Yehuda index from nikkud corpus …")
with zipfile.ZipFile(BytesIO(corpus_zip_bytes)) as zf:
txt_files = [n for n in zf.namelist() if n.endswith(".txt")]
logger.info(f" Corpus contains {len(txt_files)} text files")
for fname in txt_files:
try:
raw = zf.read(fname).decode("utf-8", errors="ignore")
except Exception:
continue
for sentence in _split_sentences(raw):
# Index by each unique Hebrew token (with nikkud) in the sentence
words = re.findall(r"[\u05d0-\u05ea\u05b0-\u05c7'\"]+", sentence)
for w in set(words):
if len(w) >= 2:
bucket = _index.setdefault(w, [])
if len(bucket) < MAX_INDEX_ENTRIES:
bucket.append(sentence)
logger.info(f"Index built: {len(_index)} unique word forms")
def _save_index() -> None:
INDEX_PATH.parent.mkdir(parents=True, exist_ok=True)
with open(INDEX_PATH, "w", encoding="utf-8") as f:
json.dump(_index, f, ensure_ascii=False)
logger.info(f"Ben Yehuda index saved → {INDEX_PATH}")
def _load_index() -> None:
global _index
with open(INDEX_PATH, encoding="utf-8") as f:
_index = json.load(f)
logger.info(f"Ben Yehuda index loaded: {len(_index)} word forms")
def load(force_rebuild: bool = False) -> None:
"""Load or build the Ben Yehuda index. Downloads corpus if needed."""
global _index, _examples_cache
if _index and not force_rebuild:
return
if force_rebuild:
# Delete old index and discard examples cache
if INDEX_PATH.exists():
INDEX_PATH.unlink()
logger.info("Deleted old Ben Yehuda index (force rebuild)")
_examples_cache = {}
else:
# Load persisted examples cache (not needed on rebuild)
if EXAMPLES_CACHE_PATH.exists():
with open(EXAMPLES_CACHE_PATH, encoding="utf-8") as f:
_examples_cache = json.load(f)
if INDEX_PATH.exists():
_load_index()
return
logger.info("Downloading Ben Yehuda nikkud corpus … (this may take 2-3 minutes)")
resp = requests.get(CORPUS_URL, timeout=REQUEST_TIMEOUT, stream=True)
resp.raise_for_status()
data = resp.content
logger.info(f"Corpus downloaded: {len(data) / 1e6:.1f} MB")
_build_index(data)
_save_index()
def save_examples_cache() -> None:
EXAMPLES_CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
with open(EXAMPLES_CACHE_PATH, "w", encoding="utf-8") as f:
json.dump(_examples_cache, f, ensure_ascii=False)
logger.info(f"Examples cache saved: {len(_examples_cache)} entries → {EXAMPLES_CACHE_PATH}")
def get_examples(word_nikkud: str) -> list[str]:
"""
Return 0 or 1 example sentences for the given word (nikkud form).
Lookup strategy:
1. Try exact nikkud match in index.
2. Fall back to stripped (no-nikkud) match against index keys.
Returns the single longest sentence MAX_SENTENCE_LEN that contains
the word as a whole token.
"""
if not _index:
load()
word = word_nikkud.strip()
word_stripped = _strip_nikkud(word)
cache_key = word
if cache_key in _examples_cache:
return _examples_cache[cache_key]
# Lookup: try exact nikkud first, then stripped fallback
candidates = _index.get(word, [])
if not candidates and word_stripped:
# Try looking up by stripped form across index keys
for k, v in _index.items():
if _strip_nikkud(k) == word_stripped:
candidates = v
break
# Filter: word must appear as a whole token
# Match the stripped form (for robustness with nikkud variants in sentence)
if word_stripped:
pattern = r"(?<!\w)" + re.escape(word_stripped) + r"(?!\w)"
matched = [s for s in candidates if re.search(pattern, _strip_nikkud(s))]
else:
matched = candidates[:]
# Filter by length
matched = [s for s in matched if MIN_SENTENCE_LEN <= len(s) <= MAX_SENTENCE_LEN]
# Return the single longest sentence ≤ MAX_SENTENCE_LEN
if matched:
best = max(matched, key=len)
result = [best]
else:
result = []
_examples_cache[cache_key] = result
return result
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
load()
tests = ["שָׁלוֹם", "בַּיִת", "סֵפֶר", "מַיִם", "אַהֲבָה", "יֶלֶד"]
for w in tests:
exs = get_examples(w)
print(f"\n{w}: {len(exs)} example(s)")
for ex in exs:
print(f"{ex[:100]}")
save_examples_cache()

110
card_preview.html Normal file
View file

@ -0,0 +1,110 @@
<!DOCTYPE html>
<html dir="rtl">
<head>
<meta charset="utf-8">
<style>
body { font-family: 'Heebo', 'Arial Hebrew', sans-serif; background: #fff; max-width: 600px; margin: 20px auto; }
.card-container { border: 1px solid #ccc; border-radius: 8px; margin: 20px 0; overflow: hidden; }
.card-label { background: #333; color: #fff; padding: 6px 12px; font-size: 14px; font-family: sans-serif; direction: ltr; }
.card-content { padding: 16px; text-align: center; }
.card-content hr { border: none; border-top: 1px solid #ccc; margin: 12px 0; }
.hebrew { font-size: 48px; font-weight: bold; color: #222; direction: rtl; text-align: center; }
.hebrew-sm { font-size: 28px; font-weight: normal; color: #222; direction: rtl; }
.meaning { font-size: 28px; color: #1a1a8c; text-align: center; direction: ltr; margin: 4px 0; }
.emoji-img { font-size: 48px; text-align: center; margin: 4px 0; }
.divider { border-top: 1px solid #ccc; margin: 8px 0; }
.sec-table { display: table; margin: 6px auto 0; direction: rtl; border-collapse: collapse; }
.sec-label { display: table-row; font-size: 28px; font-weight: normal; color: #222; direction: rtl; }
.sec-key { display: table-cell; font-size: 28px; color: #222; font-weight: bold; text-align: right; padding: 2px 0 2px 8px; white-space: nowrap; }
.sec-val { display: table-cell; font-size: 28px; color: #222; text-align: right; padding: 2px 0; }
.hint { font-size: 22px; color: #555; margin: 4px 0; direction: rtl; text-align: center; }
.example { font-size: 24px; color: #222; padding: 6px 8px; direction: rtl; text-align: center; border-left: 3px solid #ccc; font-style: italic; margin: 6px auto; max-width: 90%; }
.voice-label { font-size: 20px; color: #888; }
.more-toggle { text-align: center; direction: rtl; margin-top: 8px; }
.more-header {
display: inline-block; font-size: 18px; color: #555; cursor: pointer; list-style: none;
border: 1px solid #ccc; border-radius: 16px; padding: 4px 16px; margin: 4px 0; background: #f8f8f8;
}
.more-header::-webkit-details-marker { display: none; }
.more-header::before { content: "○ "; font-size: 14px; }
details[open] > .more-header::before { content: "● "; }
.related-header { font-size: 22px; color: #555; text-align: center; margin: 4px 0; }
.rw-word { display: table-cell; font-size: 28px; color: #222; font-weight: normal; text-align: right; padding: 2px 0 2px 8px; white-space: nowrap; }
.rw-meaning { display: table-cell; font-size: 24px; color: #555; text-align: left; direction: ltr; padding: 2px 0; }
</style>
</head>
<body>
<h2 style="font-family:sans-serif;direction:ltr;">Vocab: English → Hebrew (BACK) — collapsed</h2>
<div class="card-container">
<div class="card-label">English → Hebrew — Back (default: collapsed)</div>
<div class="card-content">
<div class="meaning">time (occasion), time round; once (when used as an adverb)</div>
<div class="emoji-img">📍</div>
<div class="divider"></div>
<div class="hebrew">פַּעַם</div>
<details class="more-toggle"><summary class="more-header">מידע נוסף</summary>
<div class="sec-table">
<div class="sec-label"><span class="sec-key">לְלֹא נִיקּוּד:</span><span class="sec-val">פעם</span></div>
<div class="sec-label"><span class="sec-key">שֹׁרֶשׁ:</span><span class="sec-val">פ.ע.ם</span></div>
<div class="sec-label"><span class="sec-key">חֵלֶק דִּיבּוּר:</span><span class="sec-val">שֵׁם עֶצֶם, נְקֵבָה</span></div>
<div class="sec-label"><span class="sec-key">רַבִּים:</span><span class="sec-val">פְּעָמִים</span></div>
</div>
<div class="divider" style="margin:6px 0;"></div>
<div class="related-header">מִילִים קְשׁוּרוֹת</div>
<div class="sec-table">
<div class="sec-label"><span class="rw-word">פַּעְמַיִם</span><span class="rw-meaning">twice, two times</span></div>
<div class="sec-label"><span class="rw-word">לְפַעֵם</span><span class="rw-meaning">to surge (feeling, emotion)</span></div>
<div class="sec-label"><span class="rw-word">פַּעֲמוֹן</span><span class="rw-meaning">bell</span></div>
<div class="sec-label"><span class="rw-word">פְּעִימָה</span><span class="rw-meaning">heartbeat; beat; stroke (technolo…</span></div>
<div class="sec-label"><span class="rw-word">לִפְעֹם</span><span class="rw-meaning">to beat, to pulse, to throb</span></div>
<div class="sec-label"><span class="rw-word">לְהִתְפַּעֵם</span><span class="rw-meaning">to be excited (emotionally)</span></div>
<div class="sec-label"><span class="rw-word">לְהַפְעִים</span><span class="rw-meaning">to excite, to agitate (lit.)</span></div>
<div class="sec-label"><span class="rw-word">לְהִפָּעֵם</span><span class="rw-meaning">to be excited, to be thrilled</span></div>
</div>
</details>
</div>
</div>
<h2 style="font-family:sans-serif;direction:ltr;">Same card — EXPANDED</h2>
<div class="card-container">
<div class="card-label">English → Hebrew — Back (expanded)</div>
<div class="card-content">
<div class="meaning">time (occasion), time round; once (when used as an adverb)</div>
<div class="emoji-img">📍</div>
<div class="divider"></div>
<div class="hebrew">פַּעַם</div>
<details class="more-toggle" open><summary class="more-header">מידע נוסף</summary>
<div class="sec-table">
<div class="sec-label"><span class="sec-key">לְלֹא נִיקּוּד:</span><span class="sec-val">פעם</span></div>
<div class="sec-label"><span class="sec-key">שֹׁרֶשׁ:</span><span class="sec-val">פ.ע.ם</span></div>
<div class="sec-label"><span class="sec-key">חֵלֶק דִּיבּוּר:</span><span class="sec-val">שֵׁם עֶצֶם, נְקֵבָה</span></div>
<div class="sec-label"><span class="sec-key">רַבִּים:</span><span class="sec-val">פְּעָמִים</span></div>
</div>
<div class="divider" style="margin:6px 0;"></div>
<div class="related-header">מִילִים קְשׁוּרוֹת</div>
<div class="sec-table">
<div class="sec-label"><span class="rw-word">פַּעְמַיִם</span><span class="rw-meaning">twice, two times</span></div>
<div class="sec-label"><span class="rw-word">לְפַעֵם</span><span class="rw-meaning">to surge (feeling, emotion)</span></div>
<div class="sec-label"><span class="rw-word">פַּעֲמוֹן</span><span class="rw-meaning">bell</span></div>
<div class="sec-label"><span class="rw-word">פְּעִימָה</span><span class="rw-meaning">heartbeat; beat; stroke (technolo…</span></div>
<div class="sec-label"><span class="rw-word">לִפְעֹם</span><span class="rw-meaning">to beat, to pulse, to throb</span></div>
<div class="sec-label"><span class="rw-word">לְהִתְפַּעֵם</span><span class="rw-meaning">to be excited (emotionally)</span></div>
<div class="sec-label"><span class="rw-word">לְהַפְעִים</span><span class="rw-meaning">to excite, to agitate (lit.)</span></div>
<div class="sec-label"><span class="rw-word">לְהִפָּעֵם</span><span class="rw-meaning">to be excited, to be thrilled</span></div>
</div>
</details>
</div>
</div>
</body>
</html>

114
card_preview_conj.html Normal file
View file

@ -0,0 +1,114 @@
<!DOCTYPE html>
<html dir="rtl">
<head>
<meta charset="utf-8">
<style>
body { font-family: 'Heebo', 'Arial Hebrew', sans-serif; background: #fff; max-width: 600px; margin: 20px auto; }
.card-container { border: 1px solid #ccc; border-radius: 8px; margin: 20px 0; overflow: hidden; }
.card-label { background: #333; color: #fff; padding: 6px 12px; font-size: 14px; font-family: sans-serif; direction: ltr; }
.card-content { padding: 16px; text-align: center; }
.card-content hr { border: none; border-top: 1px solid #ccc; margin: 12px 0; }
.hebrew { font-size: 48px; font-weight: bold; color: #222; direction: rtl; text-align: center; }
.hebrew-sm { font-size: 28px; font-weight: normal; color: #222; direction: rtl; }
.meaning { font-size: 28px; color: #1a1a8c; text-align: center; direction: ltr; margin: 4px 0; }
.hint { font-size: 22px; color: #555; margin: 4px 0; direction: rtl; text-align: center; }
.divider { border-top: 1px solid #ccc; margin: 8px 0; }
.sec-table { display: table; margin: 6px auto 0; direction: rtl; border-collapse: collapse; }
.sec-label { display: table-row; font-size: 28px; font-weight: normal; color: #222; direction: rtl; }
.sec-key { display: table-cell; font-size: 28px; color: #222; font-weight: bold; text-align: right; padding: 2px 0 2px 8px; white-space: nowrap; }
.sec-val { display: table-cell; font-size: 28px; color: #222; text-align: right; padding: 2px 0; }
.voice-label { font-size: 20px; color: #888; }
.more-toggle { text-align: center; direction: rtl; margin-top: 8px; }
.more-header {
display: inline-block; font-size: 18px; color: #555; cursor: pointer; list-style: none;
border: 1px solid #ccc; border-radius: 16px; padding: 4px 16px; margin: 4px 0; background: #f8f8f8;
}
.more-header::-webkit-details-marker { display: none; }
.more-header::before { content: "○ "; font-size: 14px; }
details[open] > .more-header::before { content: "● "; }
.related-header { font-size: 22px; color: #555; text-align: center; margin: 4px 0; }
.rw-word { display: table-cell; font-size: 28px; color: #222; font-weight: normal; text-align: right; padding: 2px 0 2px 8px; white-space: nowrap; }
.rw-meaning { display: table-cell; font-size: 24px; color: #555; text-align: left; direction: ltr; padding: 2px 0; }
</style>
</head>
<body>
<h2 style="font-family:sans-serif;direction:ltr;">Conjugation Card — FRONT</h2>
<div class="card-container">
<div class="card-label">Front</div>
<div class="card-content">
<div class="hint">אֵיךְ אוֹמְרִים</div>
<div class="hebrew">אַתָּה</div>
<div class="hebrew" style="color:#1a1a8c;">לִשְׁמֹר <span class="hebrew-sm">(על)</span></div>
<div class="hebrew">בַּהוֹוֶה</div>
</div>
</div>
<h2 style="font-family:sans-serif;direction:ltr;">Conjugation Card — BACK (collapsed)</h2>
<div class="card-container">
<div class="card-label">Back — default state</div>
<div class="card-content">
<div class="hint">אֵיךְ אוֹמְרִים</div>
<div class="hebrew">אַתָּה</div>
<div class="hebrew" style="color:#1a1a8c;">לִשְׁמֹר <span class="hebrew-sm">(על)</span></div>
<div class="hebrew">בַּהוֹוֶה</div>
<hr>
<div class="hebrew">שׁוֹמֵר (על)</div>
<details class="more-toggle"><summary class="more-header">מידע נוסף</summary>
<div class="sec-label" style="text-align:center;display:block;">to guard; to keep, to maintain</div>
<div class="sec-table">
<div class="sec-label"><span class="sec-key">שֹׁרֶשׁ:</span><span class="sec-val">שׁ.מ.ר</span></div>
<div class="sec-label"><span class="sec-key">בִּנְיָן:</span><span class="sec-val">פָּעַל</span></div>
</div>
<div class="divider" style="margin:6px 0;"></div>
<div class="related-header">מִילִים קְשׁוּרוֹת</div>
<div class="sec-table">
<div class="sec-label"><span class="rw-word">מִשְׁמָר</span><span class="rw-meaning">guard, watch; shift</span></div>
<div class="sec-label"><span class="rw-word">שׁוֹמֵר</span><span class="rw-meaning">guard, watchman</span></div>
<div class="sec-label"><span class="rw-word">שְׁמִירָה</span><span class="rw-meaning">guarding, watching</span></div>
<div class="sec-label"><span class="rw-word">לְהִשָּׁמֵר</span><span class="rw-meaning">to beware, to be careful</span></div>
</div>
</details>
</div>
</div>
<h2 style="font-family:sans-serif;direction:ltr;">Conjugation Card — BACK (expanded)</h2>
<div class="card-container">
<div class="card-label">Back — expanded</div>
<div class="card-content">
<div class="hint">אֵיךְ אוֹמְרִים</div>
<div class="hebrew">אַתָּה</div>
<div class="hebrew" style="color:#1a1a8c;">לִשְׁמֹר <span class="hebrew-sm">(על)</span></div>
<div class="hebrew">בַּהוֹוֶה</div>
<hr>
<div class="hebrew">שׁוֹמֵר (על)</div>
<details class="more-toggle" open><summary class="more-header">מידע נוסף</summary>
<div class="sec-label" style="text-align:center;display:block;">to guard; to keep, to maintain</div>
<div class="sec-table">
<div class="sec-label"><span class="sec-key">שֹׁרֶשׁ:</span><span class="sec-val">שׁ.מ.ר</span></div>
<div class="sec-label"><span class="sec-key">בִּנְיָן:</span><span class="sec-val">פָּעַל</span></div>
</div>
<div class="divider" style="margin:6px 0;"></div>
<div class="related-header">מִילִים קְשׁוּרוֹת</div>
<div class="sec-table">
<div class="sec-label"><span class="rw-word">מִשְׁמָר</span><span class="rw-meaning">guard, watch; shift</span></div>
<div class="sec-label"><span class="rw-word">שׁוֹמֵר</span><span class="rw-meaning">guard, watchman</span></div>
<div class="sec-label"><span class="rw-word">שְׁמִירָה</span><span class="rw-meaning">guarding, watching</span></div>
<div class="sec-label"><span class="rw-word">לְהִשָּׁמֵר</span><span class="rw-meaning">to beware, to be careful</span></div>
</div>
</details>
</div>
</div>
</body>
</html>

View file

@ -1,679 +0,0 @@
#!/usr/bin/env python3
"""
Extract Hebrew verb conjugations from pealim.com.
Input: verbs_input.txt (one Hebrew infinitive per line;
lines starting with '# 3ms:' search by 3ms past form for Pu'al/Huf'al)
Output: data/conjugations.json
For each verb:
1. Search pealim.com/search/?q=<verb> to find URL slug
2. Fetch /dict/<slug>/ with hebstyle=mo cookie
3. Parse conjugation table by row labels
4. Capture audio URLs per form
5. Parse passive (Pu'al/Huf'al) forms from the same page
Resume-safe: verbs already in conjugations.json are skipped.
"""
import json
import logging
import re
import time
import unicodedata
import urllib.parse
from pathlib import Path
import requests
from bs4 import BeautifulSoup
logger = logging.getLogger(__name__)
PEALIM_BASE = "https://www.pealim.com"
REQUEST_DELAY = 1.5
REQUEST_TIMEOUT = 15
VERBS_INPUT = Path(__file__).parent / "verbs_input.txt"
CONJUGATIONS_PATH = Path(__file__).parent / "data" / "conjugations.json"
DICT_CSV = next(
(p for p in [
Path(__file__).parent / "data" / "hebrew_dict_for_anki.csv",
Path(__file__).parent / "data" / "pealim_dict_for_anki.csv",
] if p.exists()),
Path(__file__).parent / "data" / "hebrew_dict_for_anki.csv",
)
# Pronoun labels (for card front display)
PRONOUN_LABELS = {
"present_ms": "",
"present_fs": "",
"present_mp": "",
"present_fp": "",
"past_1s": "אֲנִי",
"past_1p": "אֲנַחְנוּ",
"past_2ms": "אַתָּה",
"past_2fs": "אַתְּ",
"past_2mp": "אַתֶּם",
"past_2fp": "אַתֶּן",
"past_3ms": "הוּא",
"past_3fs": "הִיא",
"past_3p": "הֵם / הֵן",
"future_1s": "אֲנִי",
"future_1p": "אֲנַחְנוּ",
"future_2ms": "אַתָּה",
"future_2fs": "אַתְּ",
"future_2mp": "אַתֶּם",
"future_2fp": "אַתֶּן",
"future_3ms": "הוּא",
"future_3fs": "הִיא",
"future_3mp": "הֵם",
"future_3fp": "הֵן",
"imperative_ms": "אַתָּה",
"imperative_fs": "אַתְּ",
"imperative_mp": "אַתֶּם",
"imperative_fp": "אַתֶּן",
"infinitive": "",
}
# Human-readable tense description for card front
TENSE_DESCRIPTION = {
"present_ms": "הוֹוֶה",
"present_fs": "הוֹוֶה",
"present_mp": "הוֹוֶה",
"present_fp": "הוֹוֶה",
"past_1s": "עָבָר",
"past_1p": "עָבָר",
"past_2ms": "עָבָר",
"past_2fs": "עָבָר",
"past_2mp": "עָבָר",
"past_2fp": "עָבָר",
"past_3ms": "עָבָר",
"past_3fs": "עָבָר",
"past_3p": "עָבָר",
"future_1s": "עָתִיד",
"future_1p": "עָתִיד",
"future_2ms": "עָתִיד",
"future_2fs": "עָתִיד",
"future_2mp": "עָתִיד",
"future_2fp": "עָתִיד",
"future_3ms": "עָתִיד",
"future_3fs": "עָתִיד",
"future_3mp": "עָתִיד",
"future_3fp": "עָתִיד",
"imperative_ms": "צִוּוּי",
"imperative_fs": "צִוּוּי",
"imperative_mp": "צִוּוּי",
"imperative_fp": "צִוּוּי",
"infinitive": "מְקוֹר",
}
BINYAN_NAMES: tuple[str, ...] = (
"Pa'al", "Nif'al", "Pi'el", "Pu'al", "Hitpa'el", "Hif'il", "Huf'al"
)
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; pealim-anki/2.0)"})
def _strip_nikkud(text: str) -> str:
"""Remove Hebrew nikkud (diacritics) from a string."""
return "".join(
ch for ch in unicodedata.normalize("NFD", text)
if unicodedata.category(ch) != "Mn"
)
def _build_pos_lookup() -> dict[str, str]:
"""Build word_stripped → binyan dict from pealim_dict_for_anki.csv."""
lookup: dict[str, str] = {}
if not DICT_CSV.exists():
return lookup
try:
import pandas as pd
try:
df = pd.read_csv(DICT_CSV, sep=";", index_col=0)
if df.shape[1] < 3:
raise ValueError("too few columns")
except (ValueError, pd.errors.ParserError):
df = pd.read_csv(DICT_CSV, index_col=0)
for _, row in df.iterrows():
word = str(row.get("Word", "")).strip()
pos = str(row.get("Part of speech", row.get("Part of Speech", ""))).strip()
if word and pos and "nan" not in pos.lower():
lookup[_strip_nikkud(word)] = pos
except Exception as e:
logger.debug(f"Could not load PoS lookup: {e}")
return lookup
# Cache PoS lookup (built once)
_pos_lookup: dict[str, str] | None = None
def _get_pos_lookup() -> dict[str, str]:
global _pos_lookup
if _pos_lookup is None:
_pos_lookup = _build_pos_lookup()
return _pos_lookup
def _binyan_from_pos(word: str) -> str:
"""Look up binyan from PoS field: 'Verb pa\'al' or 'Verb Pi\'el' → canonical name."""
lookup = _get_pos_lookup()
pos_str = lookup.get(_strip_nikkud(word), "")
if not pos_str:
return ""
pos_lower = pos_str.lower()
# Map lowercase pealim.com PoS variants → canonical names
for bname, variants in [
("Pa'al", ["pa'al", "paal"]),
("Nif'al", ["nif'al", "nifal"]),
("Pi'el", ["pi'el", "piel"]),
("Pu'al", ["pu'al", "pual"]),
("Hitpa'el", ["hitpa'el", "hitpael"]),
("Hif'il", ["hif'il", "hifil"]),
("Huf'al", ["huf'al", "hufal"]),
]:
if any(v in pos_lower for v in variants):
return bname
return ""
def _find_slug(query: str) -> str | None:
"""Search pealim.com/search/?q=<verb> and return the URL slug."""
url = f"{PEALIM_BASE}/search/?q={urllib.parse.quote(query)}"
try:
resp = session.get(url, timeout=REQUEST_TIMEOUT)
resp.raise_for_status()
slugs = re.findall(r"/dict/(\d+[-][^/?\"<>\s]+)/", resp.text)
if slugs:
slug = slugs[0]
logger.info(f" Slug: {slug}")
return slug
except Exception as e:
logger.error(f" Error searching for '{query}': {e}")
return None
def _is_passive_binyan(binyan: str) -> bool:
"""Return True if the binyan is a passive (Pu'al or Huf'al)."""
return any(marker in binyan for marker in ("פֻּעַל", "הֻפְעַל", "Pu'al", "Huf'al"))
def _get_menukad(cell) -> tuple[str, str]:
"""
Extract nikkud Hebrew text and audio URL from a table cell.
Returns (form_text, audio_url).
"""
# Audio URL
audio_span = cell.find("span", class_=lambda c: c and "audio-play" in c)
audio_url = ""
if audio_span:
audio_url = audio_span.get("data-audio", "")
span = cell.find("span", class_="menukad")
if span:
return span.get_text(strip=True), audio_url
txt = cell.get_text(strip=True)
if re.search(r"[\u05d0-\u05ea]", txt):
return txt, audio_url
return "", audio_url
def _parse_table(soup: BeautifulSoup, passive: bool = False, table_el=None) -> dict[str, dict]:
"""
Parse the pealim conjugation table and return form_key -> {form, audio_url} mapping.
If passive=True, look for the passive table (after "Passive" heading).
If table_el is provided (and passive=False), parse that table directly.
"""
if passive:
# Find <h3> containing "Passive"
passive_h3 = None
for h3 in soup.find_all("h3"):
if "passive" in h3.get_text(strip=True).lower():
passive_h3 = h3
break
if not passive_h3:
return {}
# Find next conjugation table after this heading
table = None
for sib in passive_h3.find_all_next():
if sib.name == "table" and "conjugation-table" in sib.get("class", []):
table = sib
break
if not table:
return {}
elif table_el is not None:
table = table_el
else:
table = soup.find("table", class_="conjugation-table")
if not table:
return {}
rows = table.find_all("tr")
if len(rows) < 9:
return {}
forms: dict[str, dict] = {}
def first_heb_forms(row_idx: int) -> list[tuple[str, str]]:
"""Get only the Hebrew-text cells from a row (skip label cells)."""
cells = rows[row_idx].find_all(["th", "td"])
result = []
for cell in cells:
txt, audio_url = _get_menukad(cell)
colspan = int(cell.get("colspan", 1))
if txt and re.search(r"[\u05d0-\u05ea]", txt):
for _ in range(colspan):
result.append((txt, audio_url))
return result
def deduplicate(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
"""Return pairs with duplicate form-text entries removed (first occurrence kept)."""
seen: set[str] = set()
out: list[tuple[str, str]] = []
for pair in pairs:
if pair[0] not in seen:
seen.add(pair[0])
out.append(pair)
return out
# Find rows by tense label
present_row = past_row = future_row = imp_row = inf_row = -1
for i, row in enumerate(rows):
label = row.get_text(" ", strip=True).lower()
if "present" in label and present_row < 0:
present_row = i
elif "past" in label and past_row < 0:
past_row = i
elif "future" in label and future_row < 0:
future_row = i
elif "imperative" in label and imp_row < 0:
imp_row = i
elif "infinitive" in label and inf_row < 0:
inf_row = i
def store(key: str, form: str, audio_url: str) -> None:
if form:
forms[key] = {"form": form, "audio_url": audio_url}
# Present tense (4 forms: ms fs mp fp)
if present_row >= 0:
hf = first_heb_forms(present_row)
keys = ["present_ms", "present_fs", "present_mp", "present_fp"]
for k, (v, au) in zip(keys, hf):
store(k, v, au)
# Past tense
if past_row >= 0:
unique = deduplicate(first_heb_forms(past_row))
if len(unique) >= 1:
store("past_1s", unique[0][0], unique[0][1])
if len(unique) >= 2:
store("past_1p", unique[1][0], unique[1][1])
if past_row + 1 < len(rows):
hf2 = first_heb_forms(past_row + 1)
keys2 = ["past_2ms", "past_2fs", "past_2mp", "past_2fp"]
for k, (v, au) in zip(keys2, hf2):
store(k, v, au)
if past_row + 2 < len(rows):
unique3 = deduplicate(first_heb_forms(past_row + 2))
keys3 = ["past_3ms", "past_3fs", "past_3p"]
for k, (v, au) in zip(keys3, unique3):
store(k, v, au)
# Future tense
if future_row >= 0:
unique_f = deduplicate(first_heb_forms(future_row))
if len(unique_f) >= 1:
store("future_1s", unique_f[0][0], unique_f[0][1])
if len(unique_f) >= 2:
store("future_1p", unique_f[1][0], unique_f[1][1])
if future_row + 1 < len(rows):
hf2 = first_heb_forms(future_row + 1)
keys2 = ["future_2ms", "future_2fs", "future_2mp", "future_2fp"]
for k, (v, au) in zip(keys2, hf2):
store(k, v, au)
if future_row + 2 < len(rows):
hf3 = first_heb_forms(future_row + 2)
keys3 = ["future_3ms", "future_3fs", "future_3mp", "future_3fp"]
for k, (v, au) in zip(keys3, hf3):
store(k, v, au)
# Imperative
if imp_row >= 0:
hf = first_heb_forms(imp_row)
keys = ["imperative_ms", "imperative_fs", "imperative_mp", "imperative_fp"]
for k, (v, au) in zip(keys, hf):
store(k, v, au)
# Infinitive
if inf_row >= 0:
hf = first_heb_forms(inf_row)
if hf:
store("infinitive", hf[0][0], hf[0][1])
return forms
def _extract_binyan_from_page(soup: BeautifulSoup) -> str:
"""Extract binyan from page header span."""
for h3 in soup.find_all("h3", class_="page-header"):
text = h3.get_text(" ", strip=True)
for bname in BINYAN_NAMES:
if bname in text:
return bname
# Also try og:description
meta = soup.find("meta", {"property": "og:description"})
if meta:
desc = meta.get("content", "")
for bname in BINYAN_NAMES:
if bname in desc:
return bname
return ""
def _extract_passive_binyan_from_page(soup: BeautifulSoup) -> str:
"""Extract passive binyan name from passive section heading."""
for h3 in soup.find_all("h3"):
text = h3.get_text(" ", strip=True)
if "passive" in text.lower():
for bname in ("Pu'al", "Huf'al"):
if bname in text:
return bname
# Infer: Pa'al/Pi'el → Pu'al; Hif'il → Huf'al (stored as span text)
span = h3.find("span", class_="small")
if span:
span_text = span.get_text(strip=True)
for bname in ("Pu'al", "Huf'al"):
if bname in span_text:
return bname
return ""
def _extract_conjugations(slug: str, search_term: str, is_3ms_search: bool = False, binyan_hint: str = "") -> dict | None:
"""Fetch /dict/<slug>/ and parse conjugation table (active + passive)."""
url = f"{PEALIM_BASE}/dict/{slug}/"
try:
resp = session.get(url, cookies={"hebstyle": "mo"}, timeout=REQUEST_TIMEOUT)
resp.raise_for_status()
except Exception as e:
logger.error(f" Error fetching {url}: {e}")
return None
soup = BeautifulSoup(resp.text, "lxml")
# Extract root
root = ""
for span in soup.find_all("span", class_="menukad"):
txt = span.get_text(strip=True)
if txt and re.search(r"[\u05d0-\u05ea]", txt) and "-" in txt:
root = txt
break
# Extract binyan: try PoS lookup first, then page header, then section hint
binyan = _binyan_from_pos(search_term) if not is_3ms_search else ""
if not binyan:
binyan = _extract_binyan_from_page(soup)
if not binyan:
binyan = binyan_hint
# Parse active forms table
forms_raw = _parse_table(soup, passive=False)
if not forms_raw:
logger.warning(f" No forms found for {slug}")
return None
is_passive = _is_passive_binyan(binyan)
# For passive binyan search (3ms search), the "active" table is actually the passive one
# Determine reference form
infinitive_form = forms_raw.get("infinitive", {}).get("form", "") if not is_passive else ""
past_3ms_form = forms_raw.get("past_3ms", {}).get("form", "")
if is_passive:
reference_form = past_3ms_form or search_term
else:
reference_form = infinitive_form or search_term
# Build active result
result = {
"infinitive": search_term,
"slug": slug,
"root": root,
"binyan": binyan,
"is_passive": is_passive,
"reference_form": reference_form,
"forms": {},
}
for key, form_data in forms_raw.items():
if key in PRONOUN_LABELS:
result["forms"][key] = {
"form": form_data["form"],
"audio_url": form_data.get("audio_url", ""),
"pronoun": PRONOUN_LABELS[key],
"tense": TENSE_DESCRIPTION.get(key, ""),
}
# Check for a second conjugation table (alternate paradigm, e.g. להתגלות)
# Collect all active tables (exclude passive tables which follow the "Passive" h3)
passive_h3 = next(
(h for h in soup.find_all("h3") if "passive" in h.get_text(strip=True).lower()),
None,
)
passive_table_ids = {
id(t) for t in (passive_h3.find_all_next("table", class_="conjugation-table") if passive_h3 else [])
}
active_tables = [
t for t in soup.find_all("table", class_="conjugation-table")
if id(t) not in passive_table_ids
]
if len(active_tables) >= 2:
alt_raw = _parse_table(soup, passive=False, table_el=active_tables[1])
alternate_forms = {}
for key, form_data in alt_raw.items():
if key in PRONOUN_LABELS:
alt_form = form_data["form"]
primary_form = forms_raw.get(key, {}).get("form", "")
if alt_form and alt_form != primary_form:
alternate_forms[key] = alt_form
if alternate_forms:
result["alternate_forms"] = alternate_forms
logger.info(f" Found {len(alternate_forms)} alternate forms for {search_term}")
logger.info(f" Extracted {len(result['forms'])} forms for {search_term}")
return result
def _load_conjugations() -> dict:
if CONJUGATIONS_PATH.exists():
with open(CONJUGATIONS_PATH, encoding="utf-8") as f:
return json.load(f)
return {}
def _save_conjugations(data: dict) -> None:
CONJUGATIONS_PATH.parent.mkdir(parents=True, exist_ok=True)
with open(CONJUGATIONS_PATH, "w", encoding="utf-8") as f:
json.dump(data, f, ensure_ascii=False, indent=2)
def _extract_passive_from_active_slug(active_slug: str, search_term: str, binyan_hint: str = "") -> dict | None:
"""Fetch active verb page and extract only the passive section forms.
Used for Pu'al/Huf'al 3ms entries where we know the active verb's slug."""
url = f"{PEALIM_BASE}/dict/{active_slug}/"
try:
resp = session.get(url, cookies={"hebstyle": "mo"}, timeout=REQUEST_TIMEOUT)
resp.raise_for_status()
except Exception as e:
logger.error(f" Error fetching {url}: {e}")
return None
soup = BeautifulSoup(resp.text, "lxml")
root = ""
for span in soup.find_all("span", class_="menukad"):
txt = span.get_text(strip=True)
if txt and re.search(r"[\u05d0-\u05ea]", txt) and "-" in txt:
root = txt
break
active_binyan = _extract_binyan_from_page(soup)
active_forms_raw = _parse_table(soup, passive=False)
active_infinitive = active_forms_raw.get("infinitive", {}).get("form", "")
passive_forms_raw = _parse_table(soup, passive=True)
if not passive_forms_raw:
logger.warning(f" No passive forms found on {active_slug} for {search_term}")
return None
passive_binyan = _extract_passive_binyan_from_page(soup)
if not passive_binyan:
passive_binyan = "Pu'al" if active_binyan == "Pi'el" else "Huf'al" if active_binyan == "Hif'il" else ""
if not passive_binyan:
passive_binyan = binyan_hint
result = {
"infinitive": search_term,
"slug": active_slug,
"root": root,
"binyan": passive_binyan,
"is_passive": True,
"reference_form": active_infinitive or search_term,
"forms": {},
}
for key, form_data in passive_forms_raw.items():
if key in PRONOUN_LABELS:
result["forms"][key] = {
"form": form_data["form"],
"audio_url": form_data.get("audio_url", ""),
"pronoun": PRONOUN_LABELS[key],
"tense": TENSE_DESCRIPTION.get(key, ""),
}
logger.info(f" Extracted {len(result['forms'])} passive forms for {search_term} from {active_slug}")
return result
def main(verbs_file: Path = VERBS_INPUT) -> dict:
"""Read verbs from file and extract conjugations. Returns full conjugations dict."""
if not verbs_file.exists():
logger.warning(f"verbs_input.txt not found at {verbs_file} — skipping")
return _load_conjugations()
raw_lines = verbs_file.read_text(encoding="utf-8").splitlines()
# Parse slug overrides: "# slug: VERB SLUG" anywhere in the file
slug_overrides: dict[str, str] = {}
for line in raw_lines:
stripped = line.strip()
if stripped.startswith("# slug:"):
parts = stripped[len("# slug:"):].strip().split()
if len(parts) >= 2:
slug_overrides[parts[0]] = parts[1]
# Map section header keywords → binyan name (for binyan_hint fallback)
SECTION_BINYAN = {
"pa'al": "Pa'al", "nif'al": "Nif'al", "pi'el": "Pi'el",
"pu'al": "Pu'al", "hitpa'el": "Hitpa'el", "hif'il": "Hif'il", "huf'al": "Huf'al",
}
# Parse: regular verbs and # 3ms: lines (optional active slug on 3ms lines)
# Track current section binyan from comment headers for use as a hint
verbs: list[tuple[str, bool, str | None, str]] = [] # (search_term, is_3ms_search, active_slug, binyan_hint)
current_binyan_hint = ""
for line in raw_lines:
stripped = line.strip()
if not stripped or stripped.startswith("# slug:"):
continue
if stripped.startswith("# 3ms:"):
parts = stripped[len("# 3ms:"):].strip().split()
if parts:
form = parts[0]
active_slug = parts[1] if len(parts) >= 2 else None
verbs.append((form, True, active_slug, current_binyan_hint))
elif stripped.startswith("#"):
# Check if this is a section header setting the binyan context
low = stripped.lower()
for key, bname in SECTION_BINYAN.items():
if key in low:
current_binyan_hint = bname
break
else:
verbs.append((stripped, False, None, current_binyan_hint))
logger.info(f"Loaded {len(verbs)} verbs from {verbs_file} "
f"({sum(1 for _, p, _, _ in verbs if p)} passive 3ms)")
if slug_overrides:
logger.info(f" Slug overrides: {slug_overrides}")
conjugations = _load_conjugations()
new_count = 0
for verb, is_3ms, active_slug, binyan_hint in verbs:
if verb in conjugations:
logger.info(f"Skipping {verb} (cached)")
continue
logger.info(f"Processing: {verb} {'(3ms search)' if is_3ms else ''}")
time.sleep(REQUEST_DELAY)
if is_3ms:
# Passive-only extraction: use provided active slug or search to find it
if active_slug:
slug = active_slug
logger.info(f" Using active slug {slug} for passive extraction")
else:
slug = _find_slug(verb)
if not slug:
logger.warning(f" No slug found for {verb}")
conjugations[verb] = None
_save_conjugations(conjugations)
continue
logger.info(f" Found active slug {slug} for passive extraction")
time.sleep(REQUEST_DELAY)
data = _extract_passive_from_active_slug(slug, verb, binyan_hint=binyan_hint)
else:
override = slug_overrides.get(verb)
if override:
logger.info(f" Slug override: {override}")
slug = override
else:
slug = _find_slug(verb)
if not slug:
logger.warning(f" No slug found for {verb}")
conjugations[verb] = None
_save_conjugations(conjugations)
continue
time.sleep(REQUEST_DELAY)
data = _extract_conjugations(slug, verb, is_3ms_search=False, binyan_hint=binyan_hint)
conjugations[verb] = data
_save_conjugations(conjugations)
new_count += 1
logger.info(f"Done: {new_count} new verbs processed")
return conjugations
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
result = main()
for verb, data in result.items():
if data:
forms = data.get("forms", {})
print(f"{verb}: {len(forms)} forms, binyan={data.get('binyan')}")
sample_form = next(iter(forms.values()), {}) if forms else {}
print(f" sample audio_url: {sample_form.get('audio_url', 'MISSING')[:60]}")
else:
print(f"{verb}: no data")

View file

@ -175,7 +175,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to guard; to keep, to maintain (על)"
}, },
"ללמוד": { "ללמוד": {
"infinitive": "ללמוד", "infinitive": "ללמוד",
@ -353,7 +354,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to learn, to study"
}, },
"לאסוף": { "לאסוף": {
"infinitive": "לאסוף", "infinitive": "לאסוף",
@ -531,7 +533,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to collect, to pick up, to reap"
}, },
"לעבוד": { "לעבוד": {
"infinitive": "לעבוד", "infinitive": "לעבוד",
@ -709,7 +712,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to work; to operate, to function"
}, },
"לחבוש": { "לחבוש": {
"infinitive": "לחבוש", "infinitive": "לחבוש",
@ -887,7 +891,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to bandage; to put on (a hat)"
}, },
"לאכול": { "לאכול": {
"infinitive": "לאכול", "infinitive": "לאכול",
@ -1065,7 +1070,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to eat"
}, },
"לשאול": { "לשאול": {
"infinitive": "לשאול", "infinitive": "לשאול",
@ -1243,7 +1249,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to ask; to borrow"
}, },
"לשלוח": { "לשלוח": {
"infinitive": "לשלוח", "infinitive": "לשלוח",
@ -1421,7 +1428,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to send, to dispatch"
}, },
"לגבוה": { "לגבוה": {
"infinitive": "לגבוה", "infinitive": "לגבוה",
@ -1599,7 +1607,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to be high, exalted"
}, },
"לשבת": { "לשבת": {
"infinitive": "לשבת", "infinitive": "לשבת",
@ -1777,7 +1786,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to sit, to settle"
}, },
"לרשת": { "לרשת": {
"infinitive": "לרשת", "infinitive": "לרשת",
@ -1955,7 +1965,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to inherit"
}, },
"לִיפּוֹל": { "לִיפּוֹל": {
"infinitive": "לִיפּוֹל", "infinitive": "לִיפּוֹל",
@ -2133,7 +2144,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to fall, to drop"
}, },
"לקום": { "לקום": {
"infinitive": "לקום", "infinitive": "לקום",
@ -2311,7 +2323,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to get up, to stand up, to arise; to be established, to come into being"
}, },
"לחון": { "לחון": {
"infinitive": "לחון", "infinitive": "לחון",
@ -2489,7 +2502,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to pardon, to amnesty; to endow"
}, },
"לקרוא": { "לקרוא": {
"infinitive": "לקרוא", "infinitive": "לקרוא",
@ -2667,7 +2681,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to read (ב-, את); to call (ל-)"
}, },
"לקנות": { "לקנות": {
"infinitive": "לקנות", "infinitive": "לקנות",
@ -2845,7 +2860,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to buy, to purchase"
}, },
"להיבדק": { "להיבדק": {
"infinitive": "להיבדק", "infinitive": "להיבדק",
@ -3023,7 +3039,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to be tested, examined"
}, },
"להרדם": { "להרדם": {
"infinitive": "להרדם", "infinitive": "להרדם",
@ -3201,7 +3218,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to fall asleep, to doze off"
}, },
"להיהרג": { "להיהרג": {
"infinitive": "להיהרג", "infinitive": "להיהרג",
@ -3379,7 +3397,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to be killed"
}, },
"להחקר": { "להחקר": {
"infinitive": "להחקר", "infinitive": "להחקר",
@ -3557,7 +3576,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to be investigated, explored"
}, },
"להישאר": { "להישאר": {
"infinitive": "להישאר", "infinitive": "להישאר",
@ -3735,7 +3755,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to remain"
}, },
"להיפגע": { "להיפגע": {
"infinitive": "להיפגע", "infinitive": "להיפגע",
@ -3913,7 +3934,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to be damaged, to be injured, to be wounded; to be insulted, to be offended"
}, },
"להיוולד": { "להיוולד": {
"infinitive": "להיוולד", "infinitive": "להיוולד",
@ -4091,7 +4113,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to be born"
}, },
"להנצל": { "להנצל": {
"infinitive": "להנצל", "infinitive": "להנצל",
@ -4269,7 +4292,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to be saved, to be rescued, to survive"
}, },
"להיסוג": { "להיסוג": {
"infinitive": "להיסוג", "infinitive": "להיסוג",
@ -4447,7 +4471,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to withdraw, to retreat"
}, },
"להימצא": { "להימצא": {
"infinitive": "להימצא", "infinitive": "להימצא",
@ -4625,7 +4650,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to be found, discovered; to be present, to be located"
}, },
"להיבנות": { "להיבנות": {
"infinitive": "להיבנות", "infinitive": "להיבנות",
@ -4803,7 +4829,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to be built, constructed"
}, },
"לדבר": { "לדבר": {
"infinitive": "לדבר", "infinitive": "לדבר",
@ -5130,7 +5157,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to speak, to talk"
}, },
"לברך": { "לברך": {
"infinitive": "לברך", "infinitive": "לברך",
@ -5457,7 +5485,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to bless, to greet, to felicitate"
}, },
"לנהל": { "לנהל": {
"infinitive": "לנהל", "infinitive": "לנהל",
@ -5784,7 +5813,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to manage, to organize"
}, },
"לנצח": { "לנצח": {
"infinitive": "לנצח", "infinitive": "לנצח",
@ -6111,7 +6141,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to win; to overcome, to beat; to conduct, to orchestrate"
}, },
"לקומם": { "לקומם": {
"infinitive": "לקומם", "infinitive": "לקומם",
@ -6438,7 +6469,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to outrage, to anger"
}, },
"למלא": { "למלא": {
"infinitive": "למלא", "infinitive": "למלא",
@ -6765,7 +6797,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to fill; to fill out; to fulfil"
}, },
"לחכות": { "לחכות": {
"infinitive": "לחכות", "infinitive": "לחכות",
@ -7092,7 +7125,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to await, to wait for (ל-)"
}, },
"לגלגל": { "לגלגל": {
"infinitive": "לגלגל", "infinitive": "לגלגל",
@ -7419,7 +7453,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to roll, to revolve (transitive)"
}, },
"להתלבש": { "להתלבש": {
"infinitive": "להתלבש", "infinitive": "להתלבש",
@ -7597,7 +7632,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to dress oneself"
}, },
"להסתלק": { "להסתלק": {
"infinitive": "להסתלק", "infinitive": "להסתלק",
@ -7775,7 +7811,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to leave, to go away"
}, },
"להצטלם": { "להצטלם": {
"infinitive": "להצטלם", "infinitive": "להצטלם",
@ -7953,7 +7990,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to pose for a photograph, to be photographed"
}, },
"להזדקק": { "להזדקק": {
"infinitive": "להזדקק", "infinitive": "להזדקק",
@ -8131,7 +8169,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to need, to require (ל-)"
}, },
"להתנהג": { "להתנהג": {
"infinitive": "להתנהג", "infinitive": "להתנהג",
@ -8309,7 +8348,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to behave"
}, },
"להתקומם": { "להתקומם": {
"infinitive": "להתקומם", "infinitive": "להתקומם",
@ -8487,7 +8527,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to rebel, to revolt"
}, },
"להתפלא": { "להתפלא": {
"infinitive": "להתפלא", "infinitive": "להתפלא",
@ -8665,7 +8706,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to wonder, to be surprised"
}, },
"להתקלקל": { "להתקלקל": {
"infinitive": "להתקלקל", "infinitive": "להתקלקל",
@ -8843,7 +8885,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to be damaged, to be spoiled (of food products)"
}, },
"להכניס": { "להכניס": {
"infinitive": "להכניס", "infinitive": "להכניס",
@ -9170,7 +9213,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to insert, to bring in"
}, },
"להעסיק": { "להעסיק": {
"infinitive": "להעסיק", "infinitive": "להעסיק",
@ -9497,7 +9541,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to keep busy; to employ"
}, },
"להחליט": { "להחליט": {
"infinitive": "להחליט", "infinitive": "להחליט",
@ -9824,7 +9869,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to decide"
}, },
"להבטיח": { "להבטיח": {
"infinitive": "להבטיח", "infinitive": "להבטיח",
@ -10151,7 +10197,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to ensure, to promise"
}, },
"להוריד": { "להוריד": {
"infinitive": "להוריד", "infinitive": "להוריד",
@ -10478,7 +10525,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to lower, to reduce; to download (computing)"
}, },
"להפיל": { "להפיל": {
"infinitive": "להפיל", "infinitive": "להפיל",
@ -10805,7 +10853,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to drop, to throw down"
}, },
"להקים": { "להקים": {
"infinitive": "להקים", "infinitive": "להקים",
@ -11132,7 +11181,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to build, to found, to establish"
}, },
"להמציא": { "להמציא": {
"infinitive": "להמציא", "infinitive": "להמציא",
@ -11459,7 +11509,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to invent; to make up; to present"
}, },
"להרשות": { "להרשות": {
"infinitive": "להרשות", "infinitive": "להרשות",
@ -11786,7 +11837,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to allow, to permit"
}, },
"להקל": { "להקל": {
"infinitive": "להקל", "infinitive": "להקל",
@ -12113,7 +12165,8 @@
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} }
} },
"meaning": "to ease, to alleviate"
}, },
"לָשִׂים": { "לָשִׂים": {
"infinitive": "לָשִׂים", "infinitive": "לָשִׂים",
@ -12291,7 +12344,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to put, to put on"
}, },
"בוטל": { "בוטל": {
"infinitive": "בוטל", "infinitive": "בוטל",
@ -12439,7 +12493,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to cancel, to undo"
}, },
"תואם": { "תואם": {
"infinitive": "תואם", "infinitive": "תואם",
@ -12587,7 +12642,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to coordinate"
}, },
"קומם": { "קומם": {
"infinitive": "קומם", "infinitive": "קומם",
@ -12735,7 +12791,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to outrage, to anger"
}, },
"דוכא": { "דוכא": {
"infinitive": "דוכא", "infinitive": "דוכא",
@ -12883,7 +12940,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to oppress, to crush; to cause depression"
}, },
"זוכה": { "זוכה": {
"infinitive": "זוכה", "infinitive": "זוכה",
@ -13031,7 +13089,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to achieve; to credit"
}, },
"פורסם": { "פורסם": {
"infinitive": "פורסם", "infinitive": "פורסם",
@ -13179,7 +13238,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to advertise, to publish, to publicize"
}, },
"הוגבל": { "הוגבל": {
"infinitive": "הוגבל", "infinitive": "הוגבל",
@ -13327,7 +13387,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to limit, to restrict, to confine"
}, },
"העבר": { "העבר": {
"infinitive": "העבר", "infinitive": "העבר",
@ -13475,7 +13536,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to transfer, to pass something"
}, },
"הוזהר": { "הוזהר": {
"infinitive": "הוזהר", "infinitive": "הוזהר",
@ -13623,7 +13685,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to warn"
}, },
"הופל": { "הופל": {
"infinitive": "הופל", "infinitive": "הופל",
@ -13771,7 +13834,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to drop, to throw down"
}, },
"הוקם": { "הוקם": {
"infinitive": "הוקם", "infinitive": "הוקם",
@ -13919,7 +13983,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to build, to found, to establish"
}, },
"הוחל": { "הוחל": {
"infinitive": "הוחל", "infinitive": "הוחל",
@ -14067,7 +14132,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to apply, to enforce, to put in force"
}, },
"הוקפא": { "הוקפא": {
"infinitive": "הוקפא", "infinitive": "הוקפא",
@ -14215,7 +14281,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to freeze (something)"
}, },
"הופנה": { "הופנה": {
"infinitive": "הופנה", "infinitive": "הופנה",
@ -14363,7 +14430,8 @@
"pronoun": "הֵן", "pronoun": "הֵן",
"tense": "עָתִיד" "tense": "עָתִיד"
} }
} },
"meaning": "to direct; to refer someone"
}, },
"להתקלח": { "להתקלח": {
"infinitive": "להתקלח", "infinitive": "להתקלח",
@ -14541,7 +14609,8 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
} },
"meaning": "to take a shower"
}, },
"להתגלות": { "להתגלות": {
"infinitive": "להתגלות", "infinitive": "להתגלות",
@ -14719,6 +14788,162 @@
"pronoun": "", "pronoun": "",
"tense": "מְקוֹר" "tense": "מְקוֹר"
} }
},
"meaning": "to be discovered, to appear"
},
"להיות": {
"infinitive": "להיות",
"slug": "454-lihyot",
"root": "ה - י - ה",
"binyan": "Pa'al",
"is_passive": false,
"reference_form": "לִהְיוֹת",
"forms": {
"past_1s": {
"form": "הָיִיתִי",
"audio_url": "https://audio.pealim.com/v0/bx/bxtedharx4kd.mp3",
"pronoun": "אֲנִי",
"tense": "עָבָר"
},
"past_1p": {
"form": "הָיִינוּ",
"audio_url": "https://audio.pealim.com/v0/bz/bztr7bt7yw8j.mp3",
"pronoun": "אֲנַחְנוּ",
"tense": "עָבָר"
},
"past_2ms": {
"form": "הָיִיתָ",
"audio_url": "https://audio.pealim.com/v0/1i/1imxfddysg8d8.mp3",
"pronoun": "אַתָּה",
"tense": "עָבָר"
},
"past_2fs": {
"form": "הָיִית",
"audio_url": "https://audio.pealim.com/v0/si/sizbwqsi2wej.mp3",
"pronoun": "אַתְּ",
"tense": "עָבָר"
},
"past_2mp": {
"form": "הֱיִיתֶם",
"audio_url": "https://audio.pealim.com/v0/31/31081nk4lvxj.mp3",
"pronoun": "אַתֶּם",
"tense": "עָבָר"
},
"past_2fp": {
"form": "הֱיִיתֶן",
"audio_url": "https://audio.pealim.com/v0/30/30zpav63u9ig.mp3",
"pronoun": "אַתֶּן",
"tense": "עָבָר"
},
"past_3ms": {
"form": "הָיָה",
"audio_url": "https://audio.pealim.com/v0/1h/1hxhgoyxra6fs.mp3",
"pronoun": "הוּא",
"tense": "עָבָר"
},
"past_3fs": {
"form": "הָיְתָה",
"audio_url": "https://audio.pealim.com/v0/17/17fb6fulu2da8.mp3",
"pronoun": "הִיא",
"tense": "עָבָר"
},
"past_3p": {
"form": "הָיוּ",
"audio_url": "https://audio.pealim.com/v0/1h/1hxhgf26s3ou9.mp3",
"pronoun": "הֵם / הֵן",
"tense": "עָבָר"
},
"future_1s": {
"form": "אֶהְיֶה",
"audio_url": "https://audio.pealim.com/v0/at/atd2i0kljhge.mp3",
"pronoun": "אֲנִי",
"tense": "עָתִיד"
},
"future_1p": {
"form": "נִהְיֶה",
"audio_url": "https://audio.pealim.com/v0/2a/2a41xa7h8jei.mp3",
"pronoun": "אֲנַחְנוּ",
"tense": "עָתִיד"
},
"future_2ms": {
"form": "תִּהְיֶה",
"audio_url": "https://audio.pealim.com/v0/g6/g6saa9abkllk.mp3",
"pronoun": "אַתָּה",
"tense": "עָתִיד"
},
"future_2fs": {
"form": "תִּהְיִי",
"audio_url": "https://audio.pealim.com/v0/g6/g6s9q8uugtnx.mp3",
"pronoun": "אַתְּ",
"tense": "עָתִיד"
},
"future_2mp": {
"form": "תִּהְיוּ",
"audio_url": "https://audio.pealim.com/v0/g6/g6sjf854r5a7.mp3",
"pronoun": "אַתֶּם",
"tense": "עָתִיד"
},
"future_2fp": {
"form": "תִּהְיֶינָה",
"audio_url": "https://audio.pealim.com/v0/12/12upso035jy8g.mp3",
"pronoun": "אַתֶּן",
"tense": "עָתִיד"
},
"future_3ms": {
"form": "יִהְיֶה",
"audio_url": "https://audio.pealim.com/v0/yy/yyo97spf6rob.mp3",
"pronoun": "הוּא",
"tense": "עָתִיד"
},
"future_3fs": {
"form": "תִּהְיֶה",
"audio_url": "https://audio.pealim.com/v0/g6/g6saa9abkllk.mp3",
"pronoun": "הִיא",
"tense": "עָתִיד"
},
"future_3mp": {
"form": "יִהְיוּ",
"audio_url": "https://audio.pealim.com/v0/yy/yyo02tum07zo.mp3",
"pronoun": "הֵם",
"tense": "עָתִיד"
},
"future_3fp": {
"form": "תִּהְיֶינָה",
"audio_url": "https://audio.pealim.com/v0/12/12upso035jy8g.mp3",
"pronoun": "הֵן",
"tense": "עָתִיד"
},
"imperative_ms": {
"form": "הֱיֵה!",
"audio_url": "https://audio.pealim.com/v0/1h/1hxjabs7uspli.mp3",
"pronoun": "אַתָּה",
"tense": "צִוּוּי"
},
"imperative_fs": {
"form": "הֱיִי!",
"audio_url": "https://audio.pealim.com/v0/1h/1hxjac2th43as.mp3",
"pronoun": "אַתְּ",
"tense": "צִוּוּי"
},
"imperative_mp": {
"form": "הֱיוּ!",
"audio_url": "https://audio.pealim.com/v0/1h/1hxja0tjuptcu.mp3",
"pronoun": "אַתֶּם",
"tense": "צִוּוּי"
},
"imperative_fp": {
"form": "הֱיֶינָה!",
"audio_url": "https://audio.pealim.com/v0/xe/xef6kg7mexvb.mp3",
"pronoun": "אַתֶּן",
"tense": "צִוּוּי"
},
"infinitive": {
"form": "לִהְיוֹת",
"audio_url": "https://audio.pealim.com/v0/1n/1nej50k4t35xi.mp3",
"pronoun": "",
"tense": "מְקוֹר"
} }
},
"meaning": "to be"
} }
} }

1
data/emoji_lookup.json Normal file

File diff suppressed because one or more lines are too long

50000
data/en_50k.txt Normal file

File diff suppressed because it is too large Load diff

15904
data/epub_sentence_index.json Normal file

File diff suppressed because it is too large Load diff

File diff suppressed because it is too large Load diff

File diff suppressed because one or more lines are too long

97847
data/frequency_discarded.json Normal file

File diff suppressed because it is too large Load diff

File diff suppressed because it is too large Load diff

File diff suppressed because it is too large Load diff

140248
data/ktiv_male_forms.json Normal file

File diff suppressed because it is too large Load diff

9242
data/legacy_guid_map.json Normal file

File diff suppressed because it is too large Load diff

46510
data/noun_plurals.json Normal file

File diff suppressed because it is too large Load diff

29598
data/noun_slug_map.json Normal file

File diff suppressed because it is too large Load diff

File diff suppressed because it is too large Load diff

File diff suppressed because it is too large Load diff

442
data/refined_meanings.json Normal file
View file

@ -0,0 +1,442 @@
{
"שְׁלָל": "abundance; loot, plunder, spoils",
"שֶׁפַע": "abundance, plenty, profusion",
"נַר": "acquaintance (person one knows)",
"הֶכֵּרוּת": "acquaintance (the state of knowing someone)",
"כְּתֹבֶת": "address (postal/location)",
"מַעַן": "address (formal, for the sake of; destination)",
"שׁוּב": "again (once more, to repeat an action)",
"שֵׁנִית": "again; a second time, secondly",
"כְּנֶגֶד": "against; compared to, as opposed to",
"מוּל": "opposite, facing; against",
"נֶגֶד": "against; contrary to",
"נֶכֶס": "asset, property (financial/material possession)",
"קִנְיָן": "asset, property; possession, ownership (abstract or acquired)",
"הִתְבּוֹלְלוּת": "assimilation (cultural/ethnic blending in)",
"הִטַּמְּעוּת": "assimilation (absorption, integration into surroundings)",
"כְּפִיפָה": "basket (woven, traditional/biblical)",
"סַל": "basket (general, everyday)",
"מַשְׁמִים": "boring, dreary (causing desolation/boredom)",
"מְשַׁעְמֵם": "boring, tedious (causing boredom, common usage)",
"מַשָּׂא": "burden, load (heavy cargo; figurative weight)",
"נֵטֶל": "burden, load; ballast (dead weight)",
"טָרוּד": "busy, preoccupied (mentally troubled/distracted)",
"עָסוּק": "busy, occupied (engaged in an activity)",
"מַמְתָּק": "candy, sweet (generic confection)",
"סֻכָּרִיָּה": "candy, sweet (individual wrapped candy piece)",
"מַרְבָד": "carpet, rug (literary/poetic); bedspread",
"שָׁטִיחַ": "carpet, rug (standard, everyday word)",
"כַּרְפַּס": "celery (also: the Passover seder vegetable)",
"סֶלֶרִי": "celery (modern loanword, everyday usage)",
"שַׁלְשֶׁלֶת": "chain (figurative: chain of events, lineage)",
"שַׁרְשֶׁרֶת": "chain (physical chain, links)",
"אָפְיָן": "characteristic (trait, attribute of a person/thing)",
"סַמְמָן": "characteristic; indicator, hallmark",
"שׁוֹקוֹלָד": "chocolate (the substance, mass noun, masc.)",
"שׁוֹקוֹלָדָה": "chocolate (a piece of chocolate; hot chocolate, fem.)",
"עִגּוּל": "circle (the shape); rounding",
"מַעֲגָל": "circle (circular path, cycle, circuit)",
"נִקּוּי": "cleaning (the act of cleaning, removing dirt)",
"נִקָּיוֹן": "cleanliness, tidiness (state of being clean)",
"בִּקּוּעַ": "cleaving, splitting (a single crack or fissure)",
"הִתְבַּקְּעוּת": "cleaving, splitting (the process of cracking apart)",
"בְּעִילָה": "coitus, sexual intercourse (legal/halachic term)",
"מִשְׁגָּל": "coitus, sexual intercourse (formal/literary)",
"מִדְרָשָׁה": "college (religious seminary, study institute)",
"מִכְלָלָה": "college (academic institution, secular)",
"תַּחֲרוּת": "competition, contest (an event or rivalry)",
"הִתְחָרוּת": "competition (the act/process of competing)",
"לְגַמְרֵי": "completely, totally (colloquial, very common)",
"כָּלִיל": "completely, entirely (literary/formal); wholly",
"רְכִיב": "component (technical part, element in a system)",
"מַרְכִּיב": "component, ingredient (constituent that makes up a whole)",
"תַּבְעֵרָה": "conflagration, fire (intense blaze, biblical/literary)",
"דְּלֵקָה": "fire (accidental fire, house fire, everyday)",
"צַרְכָנוּת": "consumerism; consumer advocacy",
"צְרִיכָה": "consumption (using up resources, usage)",
"קֵרוּר": "cooling, refrigeration (active process of making cold)",
"הִתְקָרְרוּת": "cooling (becoming cold); catching a cold",
"חָשׁוּךְ": "dark (of a place, lacking light; figuratively bleak)",
"כֵּהֶה": "dark (of a color, shade; dim)",
"אֲפֵלָה": "darkness (deep gloom; figurative despair)",
"אֹפֶל": "darkness (poetic/literary, deep darkness)",
"חֹשֶׁךְ": "darkness (general, common word)",
"יַקִּיר": "darling, dear (masculine form)",
"יַקִּירָה": "darling, dear (feminine form)",
"מִרְמָה": "deceit, fraud (cunning deception, trickery)",
"תַּרְמִית": "deceit, fraud (a specific act of swindling)",
"אֲבַדּוֹן": "destruction (total ruin, perdition; the abyss)",
"הֶרֶס": "destruction, demolition (physical wreckage)",
"הֶבְדֵּל": "difference, distinction (between two things)",
"שֹׁנִי": "difference (variance, otherness)",
"הֵעָלְמוּת": "disappearance (the act of vanishing, going missing)",
"הֶעֱלֵם": "disappearance (concealment, suppression of information)",
"נְדָבָה": "donation (voluntary, charitable gift; tip)",
"תְּרוּמָה": "donation, contribution (formal; also: religious offering)",
"הִשְׁתַּעְבְּדוּת": "enslavement (the process of becoming enslaved)",
"שִׁעְבּוּד": "enslavement, subjugation; mortgaging (finance)",
"טָעוּת": "mistake, error (common, everyday blunder)",
"שְׁגִיאָה": "error, mistake (formal, technical error)",
"הִתְאַדּוּת": "evaporation (natural process of turning to vapor)",
"הִתְאַיְּדוּת": "evaporation (process of dissipating, vaporizing)",
"דֻּגְמָה": "example, sample (concrete instance or specimen)",
"מָשָׁל": "example; parable, allegory, proverb",
"גּוֹלָה": "exile, diaspora (the community in exile)",
"גָּלוּת": "exile, diaspora (the state/condition of being exiled)",
"חֲוָיָה": "experience (a lived event, an adventure)",
"הִתְנַסּוּת": "experience (the process of trying/experimenting)",
"נִסָּיוֹן": "experience (accumulated knowledge); attempt, trial",
"בֵּאוּר": "explanation, elucidation (detailed clarification)",
"הֶסְבֵּר": "explanation (the act of explaining, making understood)",
"פָּנִים": "face (standard word); surface",
"פַּרְצוּף": "face (appearance, facial expression; colloquial)",
"מֶחְדָּל": "failure, omission (negligent failure to act)",
"כִּשָּׁלוֹן": "failure (general: failed attempt or endeavor)",
"כֶּשֶׁל": "failure, malfunction (technical breakdown)",
"תַּעְנִית": "fast (religious fast day, formal term)",
"צוֹם": "fast, fasting (the act of fasting, general)",
"תְּחוּשָׁה": "feeling, sensation (physical or gut feeling)",
"הַרְגָּשָׁה": "feeling (emotional sense; well-being)",
"רֶגֶשׁ": "feeling, emotion (inner emotional state)",
"לֶהָבָה": "flame (common word for a flame)",
"שַׁלְהֶבֶת": "flame (poetic/literary, blazing flame)",
"כָּפִיף": "flexible, pliable (can be bent physically)",
"מָתִיחַ": "flexible, elastic (stretchy, resilient)",
"זֶרֶם": "flow, current (of water, electricity, or ideas)",
"זְרִימָה": "flow, flowing (the act/process of flowing)",
"אֹכֶל": "food (general, everyday word for food/meal)",
"מַאֲכָל": "food (a specific dish, a prepared food item)",
"מָזוֹן": "food, nourishment (sustenance, nutrition)",
"חֹפֶשׁ": "freedom; vacation, time off (colloquial)",
"חֵרוּת": "freedom, liberty (formal, political/ideological)",
"הַקְפָּאָה": "freezing (active act of freezing something; a freeze/suspension)",
"קִפָּאוֹן": "freezing; standstill, stagnation (frozen state)",
"תְּדִירוּת": "frequency (how often something occurs)",
"תֶּדֶר": "frequency (radio/physics frequency)",
"תָּדִיר": "frequent, regular (happening at steady intervals)",
"תָּכוּף": "frequent, rapid (happening in quick succession)",
"גָּאוֹן": "genius (title of greatness; rabbinical title Gaon)",
"עִלּוּי": "genius, prodigy (exceptionally gifted person)",
"תְּשׁוּרָה": "gift, present (formal/literary offering)",
"שַׁי": "gift, present (a token gift, small present)",
"אַכְלָן": "glutton (big eater, food-lover, common)",
"רְעַבְתָּן": "glutton (insatiably hungry person)",
"מֶמְשֶׁלֶת": "government (construct state form, used in compounds)",
"מֶמְשָׁלָה": "government (standard form)",
"מֶמְשַׁלְתִּי": "governmental (relating to the government/cabinet)",
"שִׁלְטוֹנִי": "governmental (relating to ruling authority/regime)",
"חֹפֶן": "handful (cupped palm, a scooped amount)",
"קֹמֶץ": "handful (a pinch, a small quantity)",
"יָד": "handle (of a tool, door); hand",
"יָדִית": "handle (a knob or grip, specifically a handle)",
"כָּאן": "here (standard, common usage)",
"פֹּה": "here (colloquial/informal variant)",
"טָמוּן": "hidden (buried, latent, lying within)",
"נִסְתָּר": "hidden, concealed (secret, mysterious; grammar: 3rd person)",
"מֻצְנָע": "hidden, concealed (modestly tucked away, discreet)",
"תְּמוּנָה": "image, picture (photo, illustration, scene)",
"צֶלֶם": "image (likeness, form); idol",
"הִתְרַשְּׁמוּת": "impression (the experience of being impressed)",
"רֹשֶׁם": "impression (a mark left; an effect on someone)",
"בִּפְנִים": "inside (location: on the inside, indoors)",
"פְּנִימָה": "inside (direction: inward, toward the inside)",
"עֶלְבּוֹן": "insult, offence (the slight or affront itself)",
"הַעֲלָבָה": "insult (the act of insulting someone)",
"פְּנִים": "interior, inside (inner part, inner side)",
"קֶרֶב": "interior; innards, midst (among, in the thick of)",
"תָּוֶךְ": "interior, inside; center, middle; essence",
"תַּחְקִיר": "investigation (journalistic/official inquiry)",
"חֲקִירָה": "investigation, inquiry (police/legal; research)",
"רִנָּה": "joy; joyful song, singing (literary)",
"מָשׂוֹשׂ": "joy, delight (source of joy, literary)",
"גִּיל": "joy, elation (exuberant happiness; age)",
"שִׂמְחָה": "joy, happiness (celebration, festive occasion)",
"עֶלְצוֹן": "jubilance, exultation (archaic, the feeling)",
"עֶלְצָה": "jubilance, exultation (archaic, feminine noun form)",
"עָצֵל": "lazy, idle (basic adjective form)",
"עַצְלָן": "lazy, lazybones (characteristically lazy person)",
"תְּחִקָּה": "legislation (a specific statute or enacted law)",
"חֲקִיקָה": "legislation (the process/act of legislating)",
"הִתְהוֹלְלוּת": "licentiousness, revelry (wild raucous behavior)",
"הוֹלֵלוּת": "licentiousness, debauchery (moral depravity)",
"שׁוֹשָׁן": "lily (the flower, masculine; also: the name Shoshan)",
"שׁוֹשַׁנָּה": "lily; rose (archaic); the name Shoshana",
"הִמָּצְאוּת": "location; presence (being found/situated somewhere)",
"מִקּוּם": "location, positioning (placing in a specific spot)",
"נַעֲלֶה": "lofty, exalted (elevated, superior in quality)",
"נִשְׂגָּב": "lofty, exalted (sublime, beyond reach, grand)",
"תַּאֲוָה": "lust, craving (appetite, physical desire)",
"תְּשׁוּקָה": "passion, desire (deep longing, yearning)",
"אַחְזָקָה": "maintenance; holding (corporate; upkeep of property)",
"תַּחְזוּקָה": "maintenance (technical upkeep of systems/equipment)",
"תִּחְזוּק": "maintenance (the process/act of maintaining)",
"מִנְהָל": "administration, management (the office/system)",
"נִהוּל": "management (the act/process of managing)",
"הַנְהָלָה": "management (the managing body, executive board)",
"פֵּרוּשׁ": "meaning; interpretation, commentary",
"מַשְׁמָעוּת": "meaning, significance (broader importance)",
"מַשְׁמָע": "meaning, implication (what is implied)",
"לַחַן": "melody, tune (a musical composition)",
"נִגּוּן": "melody, tune (a chant; Hasidic wordless melody)",
"נְעִימָה": "melody, tune; tone, intonation (of voice)",
"נֵס": "miracle (divine intervention; common word)",
"פֶּלֶא": "wonder, marvel (something astonishing)",
"תְּזוּזָה": "movement (a budge, slight motion, shift)",
"תְּנוּעָה": "movement (broad: traffic; organization; vowel mark)",
"מִסְתּוֹרִין": "mystery (enigma, something hidden/secret)",
"תַּעֲלוּמָה": "mystery (unsolved puzzle, unknown secret)",
"עֵירֹם": "naked (completely nude, formal)",
"עָרֹם": "naked (nude; also: shrewd, cunning in biblical Hebrew)",
"אֻמָּה": "nation (a unified political/cultural entity)",
"לְאֹם": "nation, people (ethnic group; literary/formal)",
"זִלְזוּל": "negligence; contempt, disrespect (dismissive attitude)",
"הִתְרַשְּׁלוּת": "negligence (carelessness, failure to take proper care)",
"נֵיטְרָלִי": "neutral (politically/scientifically neutral, loanword)",
"סְתָמִי": "neutral; vague, nondescript, generic",
"אֲצֻלָּה": "nobility, aristocracy (the aristocratic class)",
"אֲצִילוּת": "nobility (the quality of being noble, refinement)",
"הִסְתַּכְּלוּת": "observation (looking, watching, contemplation)",
"תַּצְפִּית": "observation (military/scientific lookout; observation post)",
"מִכְשׁוֹל": "obstacle, stumbling block (impediment to progress)",
"נֶגֶף": "obstacle; plague, affliction (biblical)",
"עַל": "on, upon; about, regarding",
"עַל גַּב": "on, upon (on the back/surface of)",
"עַל גַּבֵּי": "on, upon (on top of, on the surface of)",
"פְּקֻדָּה": "order, command (military/authoritative directive)",
"צַו": "order, decree (legal injunction, official order)",
"בָּחוּץ": "outside (location: on the outside, outdoors)",
"הַחוּצָה": "outside (direction: outward, to the outside)",
"מַאֲרָז": "package (a packed container, packaging)",
"חֲבִילָה": "package, parcel (a bundle, a wrapped item)",
"מְחִילָה": "pardon, forgiveness (personal, between individuals)",
"סְלִיחָה": "pardon, forgiveness (also: excuse me; liturgical pardon)",
"סַיֶּרֶת": "patrol (elite military unit, commando squad)",
"סִיּוּר": "patrol; tour (a round of inspection or sightseeing)",
"שָׂכָר": "payment; salary, wage (earned compensation)",
"תַּשְׁלוּם": "payment (a single payment/installment; compensation)",
"עֲצוּמָה": "petition (public petition with signatures)",
"עֲתִירָה": "petition (legal petition, court appeal)",
"דַּלּוּת": "poverty; meagerness, paucity (scarcity of quality/quantity)",
"עֹנִי": "poverty (destitution, financial hardship)",
"עָצְמָתִי": "powerful (having great inherent power)",
"רַב עָצְמָה": "powerful (of great might, formidable)",
"הַאֲמָרָה": "price increase (deliberate raising of prices)",
"הִתְיַקְּרוּת": "price increase (becoming more expensive, rising costs)",
"קִדְמָה": "progress (general/societal advancement, modernity)",
"הִתְקַדְּמוּת": "progress (the process of advancing, making headway)",
"הַסְבָּרָה": "propaganda; public diplomacy (Israeli hasbara)",
"תַּעֲמוּלָה": "propaganda (political propaganda, agitation)",
"סְמִיכוּת": "proximity; construct state (grammar term)",
"קִרְבָה": "proximity; kinship, closeness (relational nearness)",
"תְּהִלּוֹת": "Psalms (variant plural form)",
"תְּהִלִּים": "Psalms (standard name for the Book of Psalms)",
"קְנִיָּה": "purchase (a buy, an act of buying, everyday)",
"רְכִישָׁה": "acquisition (formal purchase, procurement)",
"בִּזְרִיזוּת": "quickly, nimbly (with agile efficiency)",
"בִּמְהִירוּת": "quickly, at high speed (with velocity)",
"רִיצָה": "running (the activity of running)",
"מְרוּצָה": "race (a competitive running event)",
"גְּאֻלָּה": "redemption (national/messianic deliverance)",
"פְּדוּת": "redemption (ransoming, being redeemed; literary)",
"הוֹצָאָה": "removal; expense, expenditure; publishing house",
"הַסָּחָה": "removal; deflection, diversion, distraction",
"יִצּוּג": "representation (acting on behalf of; depiction)",
"נְצִיגוּת": "representation (the body of representatives, delegation)",
"מְכִירָה": "sale (the act of selling, a transaction)",
"מֶכֶר": "sale; merchandise, value (literary/biblical)",
"יֶשַׁע": "salvation, deliverance (divine rescue, literary)",
"תְּשׁוּעָה": "salvation, victory (triumphant rescue, literary)",
"הַפְרָדָה": "separation (active act of separating things/people)",
"הִפָּרְדוּת": "separation (the process of parting ways)",
"חַד": "sharp (of edges, blades; clear-cut)",
"חָרִיף": "sharp, acute; spicy, pungent; keen, witty",
"חָסוּת": "shelter, patronage (protection under authority)",
"מִקְלָט": "shelter, refuge (bomb shelter, safe haven, physical place)",
"חֻלְצָה": "shirt, blouse (modern everyday word)",
"כֻּתֹּנֶת": "shirt; tunic, gown (biblical/traditional garment)",
"שֶׁקֶט": "silence, quiet (peaceful calm, serenity)",
"שְׁתִיקָה": "silence (the act of keeping silent, not speaking)",
"חֶטְא": "sin (a specific transgression, missing the mark)",
"עָווֹן": "sin, iniquity (moral guilt; legal: misdemeanor)",
"זִמְרָה": "singing (musical performance, song/hymn)",
"רְנָנָה": "singing; joyful song, jubilant cry (literary)",
"נָטוּי": "slanted, inclined (tilted, leaning; grammar: inflected)",
"מְשֻׁפָּע": "slanted, inclined; having an abundance of something",
"כִּשּׁוּף": "sorcery, witchcraft (dark magic, spellcasting)",
"קֶסֶם": "magic, charm (enchantment, allure)",
"נֶפֶשׁ": "soul (life force, self, being; appetite)",
"נְשָׁמָה": "soul (divine breath of life, spiritual essence)",
"מַצָּת": "spark plug (automotive ignition component)",
"פְּלָג": "spark plug (variant/slang term)",
"דּוֹבֵר": "speaker, spokesman (masculine form)",
"דּוֹבֶרֶת": "speaker, spokeswoman (feminine form)",
"סוּפָה": "storm, tempest (violent windstorm)",
"סְעָרָה": "storm, tempest (raging storm; figurative turmoil)",
"קַשׁ": "straw (dry stalks; figuratively: trivial thing)",
"תֶּבֶן": "straw, hay (animal feed, dried grass)",
"עִקֵּשׁ": "stubborn, obstinate (perversely rigid)",
"עַקְשָׁן": "stubborn, obstinate (characteristically persistent/stubborn person)",
"חָנִיךְ": "student, pupil (trainee, apprentice, cadet)",
"תַּלְמִיד": "student, pupil (school student, common word)",
"פִּקּוּחַ": "supervision (regulatory oversight, monitoring)",
"הַשְׁגָּחָה": "supervision (watchful care, divine providence; kosher certification)",
"הַסְפָּקָה": "supply, provision (the act of supplying goods)",
"אַסְפָּקָה": "supply, provision (military/logistical provisioning)",
"אֲרָעִי": "temporary, provisional (makeshift, not permanent)",
"זְמַנִּי": "temporary, time-limited (for a limited period)",
"אֵלֶה": "these (standard demonstrative pronoun)",
"אֵלוּ": "these (literary/Mishnaic variant)",
"בֹּהֶן": "thumb; big toe (anatomical term)",
"אֲגוּדָל": "thumb (common/colloquial word for thumb)",
"זְמַן": "time (general, measurable time; tense in grammar)",
"עֵת": "time (a specific moment, epoch, literary/biblical)",
"עִתּוּי": "timing (choosing the right moment)",
"תִּזְמוּן": "timing (synchronization, technical scheduling)",
"לְכַתֵּב": "to address (write an address on); to engrave",
"לְמַעֵן": "to address (direct/target communication toward)",
"לְזַיֵּן": "to arm (equip with weapons; vulgar slang)",
"לְחַמֵּשׁ": "to arm (equip/furnish with armaments)",
"לְהִתְאַסֵּף": "to assemble, to gather together (of people collecting)",
"לְהִתְכַּנֵּס": "to assemble, to convene (a formal meeting/conference)",
"לְהִכָּבֵל": "to be bound (chained, shackled with chains)",
"לְהִכָּפֵת": "to be bound (handcuffed, tied up physically)",
"לְהִבָּרֵא": "to be created (divine/fundamental creation, ex nihilo)",
"לְהִוָּצֵר": "to be created (formed, shaped, manufactured)",
"לְהִגָּזֵז": "to be cut off (sheared, trimmed, as hair/wool)",
"לְהִגָּזֵר": "to be cut off (decreed, sentenced; derived from)",
"לְהִקָּטֵעַ": "to be cut off (interrupted, severed abruptly)",
"לְהִנָּגֵף": "to be defeated (struck down, plagued; biblical)",
"לְהֵרָעֵץ": "to be defeated (crushed, shattered; literary)",
"לְהֵהָרֵס": "to be destroyed (demolished, wrecked; slang: exhausted)",
"לְהֵחָרֵב": "to be destroyed (laid waste, devastated; of cities/temples)",
"לְהִסָּתֵר": "to be hidden; to hide oneself (take cover)",
"לְהִצָּפֵן": "to be hidden (encoded, concealed from view)",
"לְהִנָּטֵעַ": "to be planted (of trees/plants, set in soil)",
"לְהִשָּׁתֵל": "to be planted (implanted, transplanted; of an organ or undercover agent)",
"לָדֹם": "to be silent (to become utterly still; literary)",
"לִשְׁתֹּק": "to be silent (to stop talking, keep quiet; common)",
"לְהִתְקַמֵּץ": "to be stingy (to pinch pennies, scrimp)",
"לְהִתְקַמְצֵן": "to be stingy (to act like a miser, be miserly)",
"לְהִבָּדֵק": "to be tested, checked (verified, inspected)",
"לְהִבָּחֵן": "to be tested, examined (undergo a formal exam/evaluation)",
"נִהְיָה": "to become (turn into, come to be; common)",
"לְהֵעָשׂוֹת": "to become; to be made, to be done, to be carried out",
"לְהִתְבַּהֵר": "to become clear (clarified, understood)",
"לְהִצְטַלֵּל": "to become clear (of liquid becoming transparent/limpid)",
"לְכוֹפֵף": "to bend (flex, bow down, curve something)",
"לְקַמֵּר": "to bend, to vault (arch over, create a dome shape)",
"לְקַשֵּׁת": "to bend, to curve (form into a bow/arc shape)",
"לְפַחֵם": "to blacken (carbonize, char with coal/charcoal)",
"לְפַיֵּחַ": "to blacken (cover with soot, smoke residue)",
"לְמַצְמֵץ": "to blink (rapidly open and close one's eyes)",
"לְעַפְעֵף": "to blink (flutter one's eyelids)",
"לִנְפֹּחַ": "to blow (puff up, inflate; blow air)",
"לִנְשֹׁף": "to blow, to exhale; to play a wind instrument",
"לְצַיֵּץ": "to chirp, to tweet (of birds; to post on social media)",
"לְצַפְצֵף": "to chirp, to whistle (shrill piping sound; to not care — slang)",
"לְחַבֵּר": "to connect, to join (attach together; to compose/write)",
"לְקַשֵּׁר": "to connect, to link (establish a relationship/connection)",
"לְהָסִיחַ": "to converse (engage in casual talk; to divert attention)",
"לְהָשִׂיחַ": "to converse, to talk (literary; to speak with)",
"לְסַלְסֵל": "to curl (hair); to trill (music)",
"לְתַלְתֵּל": "to curl (hair into ringlets/curls)",
"לְיַפּוֹת": "to beautify, to embellish (make more attractive)",
"לְפַרְכֵּס": "to embellish; to squirm, to flounder",
"לִדְרֹשׁ": "to demand; to inquire, to preach (seek/expound)",
"לִתְבֹּעַ": "to demand; to sue, to claim (legal demand)",
"לְהֵישִׁיר": "to direct; to straighten, to look straight at",
"לְהַפְנוֹת": "to direct; to refer someone (redirect attention/person)",
"לְהַגְזִים": "to exaggerate (overstate, blow out of proportion; common)",
"לְהַפְרִיז": "to exaggerate (go to extremes, overdo; formal)",
"לְהִמּוֹג": "to fade, to dissolve (melt away, lose form; literary)",
"לְהִנָּדֵף": "to fade, to dissipate (blown away, scattered by wind)",
"לִפֹּל": "to fall (general: fall down, collapse; common word)",
"לִנְשֹׁר": "to fall, to drop (shed: leaves, hair; drop out of school)",
"לְכַלּוֹת": "to finish (consume entirely, exhaust; to annihilate)",
"לְסַיֵּם": "to finish, to complete (conclude, bring to an end; common)",
"לִנְהֹר": "to flow (stream toward); to shine, to glow",
"לִשְׁתֹּת": "to flow (pour forth, stream out; literary)",
"לִמְחֹל": "to forgive (pardon on a personal level, waive a claim)",
"לִסְלֹחַ": "to forgive, to pardon (general, standard word for forgiving)",
"לְהַחְבִּיא": "to hide, to conceal (physically stash away; common)",
"לְהַעֲלִים": "to hide, to conceal (suppress information; to evade)",
"לִדְלֹף": "to leak (of a pipe, roof; seep through)",
"לִנְזֹל": "to drip, to trickle (flow in drops, ooze)",
"לִזְנֹחַ": "to abandon, to neglect (forsake, discard)",
"לַעֲזֹב": "to leave, to abandon (depart from; give up; common word)",
"לְהַנִּיחַ": "to place, to put (set down carefully); to assume",
"לְהָשִׂים": "to place, to put (set/assign); to turn into something",
"לְפָאֵר": "to glorify, to adorn (extol with grandeur)",
"לְשַׁבֵּחַ": "to praise, to commend (express approval; common)",
"לִדְחֹף": "to push, to shove (physically push forward; common)",
"לִדְחֹק": "to push, to press (squeeze, crowd; urge insistently)",
"לְהַבְרִיא": "to recover (regain health, get well; common)",
"לְהַחְלִים": "to recover, to convalesce (heal fully from illness; formal)",
"לַעֲלֹץ": "to rejoice, to exult (leap with joy; literary)",
"לָשׂוּשׂ": "to rejoice (be glad, delight in; biblical/literary)",
"לְהוֹשִׁיעַ": "to rescue, to save (deliver from danger; biblical/literary)",
"לְהַצִּיל": "to rescue, to save (common, everyday word)",
"לְחַכֵּךְ": "to rub (scratch an itch, abrade gently)",
"לְשַׁפְשֵׁף": "to rub (scrub, polish by rubbing repeatedly)",
"לִסְרֹט": "to scratch (scrape with a sharp object; to make a video/film)",
"לִשְׂרֹט": "to scratch (draw a line, score a surface)",
"לִנְגֹּהַּ": "to shine (glow with bright light; literary)",
"לִקְרֹן": "to shine, to beam (radiate light, as from horns of light)",
"לְהַחֲרִישׁ": "to silence; to be silent (choose not to respond; literary)",
"לְהַשְׁתִּיק": "to silence (make someone/something stop making noise; common)",
"לִטְבֹּחַ": "to slaughter (massacre, butcher violently)",
"לִשְׁחֹט": "to slaughter (ritually slaughter an animal; shecht)",
"לְהִתְמַחוֹת": "to specialize (become an expert in a field)",
"לְהִתְמַקְצֵעַ": "to specialize (become a professional, gain proficiency)",
"לְבַקֵּעַ": "to split, to cleave (crack open forcefully)",
"לְבַתֵּק": "to split, to cleave; to pierce (cut through)",
"לִמְרֹחַ": "to spread (smear, apply a spread on surface)",
"לִשְׁטֹחַ": "to spread (lay out flat, unfurl); to present, explicate",
"לְאַשֵּׁשׁ": "to strengthen, to establish (shore up, substantiate)",
"לְחַזֵּק": "to strengthen (make stronger, reinforce; common word)",
"לְהִתְיַסֵּר": "to suffer (be tormented, endure agony)",
"לְהִתְעַנּוֹת": "to suffer; to fast (endure hardship/deprivation; literary)",
"לִידוֹת": "to throw, to hurl (cast, fling; biblical)",
"לִרְמוֹת": "to throw, to hurl (toss; biblical)",
"לִגְזֹז": "to trim (shear wool/hair, clip close)",
"לִגְזֹם": "to trim (prune branches/bushes, cut back vegetation)",
"לְאַדּוֹת": "to vaporize (steam, evaporate); to simmer, to poach (cooking)",
"לְאַיֵּד": "to vaporize, to evaporate (cause to turn into vapor)",
"לֶאֱרֹג": "to weave (on a loom, produce fabric; common word)",
"לִשְׁזֹר": "to weave (intertwine, braid, thread together)",
"בְּיַחַד": "together (as a group, common usage with 'be-')",
"יַחַד": "together (jointly, in unison; literary)",
"יַחְדָּו": "together (jointly; biblical/poetic variant)",
"מִסְחָר": "trade, commerce (the business/sector of trading)",
"סַחַר": "trade, commerce (goods traded, merchandise; literary)",
"אֱמֶת": "truth (common word for truth, verity)",
"אֲמִתָּה": "truth; axiom (fundamental truth, literary)",
"מִצְנֶפֶת": "turban (formal headdress, priestly turban)",
"צָנִיף": "turban, head wrap (wrapped head covering)",
"אַחְדוּת": "unity (state of being united, solidarity)",
"אִחוּד": "unification (the act of uniting, merging)",
"בִּקְעָה": "valley (broad, flat valley plain)",
"עֵמֶק": "valley (deep valley between mountains/hills)",
"אִשְׁרָה": "visa; approval (entry permit; formal approval)",
"וִיזָה": "visa (travel visa, loanword)",
"כֹּתֶל": "wall (the Western Wall; a freestanding stone wall)",
"קִיר": "wall (common word for wall of a room/building)",
"אַזְהָרָה": "warning (a caution, alert; legal/safety warning)",
"הַזְהָרָה": "warning (the act of warning someone; admonition)",
"רַהַט": "water trough (channel, gutter for water flow)",
"שֹׁקֶת": "water trough (feeding/drinking trough for animals)",
"אִלּוּלֵי": "were it not for (standard conditional; common)",
"לוּלֵא": "were it not for (literary/Talmudic variant)",
"אוֹפַןּ": "wheel (a single wheel; biblical/poetic)",
"גַּלְגַּל": "wheel (rolling wheel; cycle, pulley)",
"אַיֵּה": "where? (literary/biblical: where is?)",
"הֵיכָן": "where? (standard literary form of 'where')",
"לֹבֶן": "whiteness (white of the eye; white color)",
"צְחוֹר": "whiteness; purity (brilliant white, radiance)",
"עוֹלָם": "world (the world, universe; eternity; common word)",
"תֵּבֵל": "world, universe (the inhabited world; poetic/literary)",
"פֶּצַע": "wound (a specific cut, gash, open wound)",
"פְּצִיעָה": "wound, injury (the event/act of being wounded)",
"כִּסּוּפִים": "yearning, longing (wistful craving, literary; plural)",
"עֶרְגָּה": "yearning, longing (deep nostalgic longing, literary)"
}

19525
data/vetted_sentences.json Normal file

File diff suppressed because it is too large Load diff

File diff suppressed because it is too large Load diff

2297586
data/words.json Normal file

File diff suppressed because it is too large Load diff

View file

@ -0,0 +1,150 @@
# Adaptive Sentence Difficulty Cloze — v0.20 Design Spec
**Date:** 2026-03-15
**Status:** Approved
**Release:** v0.20
## Problem
Cloze cards currently select the example sentence closest to 9 words in length. This ignores whether the surrounding context words are familiar to the learner. A sentence full of rare words is harder than one with common words, regardless of length.
## Solution
Replace the length-based `_score()` function in `epub_examples.py` with a **frequency-based difficulty score**. The easiest sentence (most common context words) becomes the cloze. All vetted sentences remain on the card, ordered easy→hard.
## Scoring Pipeline
### Token Frequency Lookup (5-tier)
Given a nikkud sentence token, resolve its frequency rank:
1. **Known mapping** — look up token in the nikkud→ktiv_male map built from words.json headwords, conjugations, and inflections (94k mappings). If found, look up the ktiv_male in the frequency data.
2. **Nikkud prefix stripping** — use `_try_strip_prefix()` to strip validated Hebrew prefixes (בהוכלמש), then resolve the remainder via the known mapping.
3. **Academy rules converter** — apply `nikkud_to_ktiv_male.convert()` (91.6% accuracy) to produce ktiv_male, look up in frequency data.
4. **strip_nikkud fallback** — use `helpers.strip_nikkud()` as a lossy fallback.
5. **Ktiv_male prefix stripping** — strip 1-2 character Hebrew prefixes from the converted/stripped form and look up the stem.
Tokens not found in any tier are assigned a default high rank (50,000).
**Coverage:** ~93% of example sentence tokens resolve to a frequency rank (measured empirically on 7,588 sentences).
**Frequency data source:** Use `frequency_lookup.py` which auto-selects `frequency_clean.json` when available, falling back to `frequency_cache.json`.
### Sentence Difficulty Score
For a given word's candidate sentence:
1. Tokenize: split on whitespace, strip punctuation (.,!?;:"'"״׳–—()[]{}), split on maqaf (־).
2. Exclude the target word's token using `cloze_word_start`/`cloze_word_end` offsets from the matched sentence.
3. For each remaining token (length >= 2), resolve its frequency rank via the 5-tier pipeline.
4. **Score = median frequency rank of context tokens.**
Lower score = easier (context words are more common). Median resists outliers (one rare proper noun shouldn't dominate).
### Integration Point
The scoring integrates into `epub_examples.py`'s existing `_score()` closure inside `update_words_json()` (line ~677). Currently:
```python
def _score(s: dict) -> tuple[int,]:
wc = s["word_count"]
length_score = abs(wc - 9) if not (6 <= wc <= 12) else 0
return (length_score,)
```
New scoring replaces length with frequency-based difficulty. The `_score` function gains access to the frequency pipeline via closure over the nikkud_map, nikkud_index, and freq_data built once at the start of `update_words_json()`.
**Minimum sentence length:** Reduced from 4 words to 3 words (`MIN_WORDS = 3` in epub_examples.py). Hebrew is more concise than English — 3-word sentences are valid and common. This expands the candidate pool for cloze selection.
**Behavioral change:** Because `pool.sort(key=_score)` determines which 3 sentences are selected as `best = pool[:3]`, changing the scoring function changes **which sentences are selected**, not just their order. This is intentional — we want the easiest sentences as cloze candidates, not the closest-to-9-words ones. Existing cloze GUIDs will be preserved when the same sentence text is re-selected; entries where a different sentence wins will get new GUIDs.
## Data Model Changes
### words.json
The `examples.cloze` dict (single sentence) gains an optional `difficulty_score` field:
```json
{
"examples": {
"vetted": [
{"text": "...", "source": "...", "match_method": "..."},
{"text": "...", "source": "...", "match_method": "..."}
],
"cloze": {
"text": "...",
"cloze_word_start": 5,
"cloze_word_end": 10,
"cloze_hint": null,
"cloze_guid": "abc123",
"difficulty_score": 234
}
}
}
```
The vetted list is also sorted by difficulty (easiest first), so the card back shows sentences in pedagogically useful order.
### SCHEMA.yaml
Add `difficulty_score` as optional integer field under `examples.cloze`.
## Implementation Scope
### New file: `sentence_difficulty.py`
Standalone module for sentence scoring. No pipeline step — called by `epub_examples.py`.
- `score_sentence(sentence_text: str, target_start: int, target_end: int, nikkud_map: dict, nikkud_index: dict, freq_data: dict) -> int` — returns median context frequency rank. Uses `target_start`/`target_end` character offsets to exclude the cloze target token.
- `build_nikkud_map(words: dict) -> dict[str, str]` — builds nikkud→ktiv_male lookup from words.json (headwords + conjugation forms + noun inflections). Returns `{nikkud_form: ktiv_male_form}`. Implementation note: should share iteration logic with `epub_examples._build_nikkud_index()` or derive from its output to avoid duplicating the traversal of words.json forms.
- `_resolve_token_frequency(token: str, nikkud_map: dict, nikkud_index: dict, freq_data: dict) -> int` — the 5-tier lookup. Uses `_try_strip_prefix` from epub_examples (made importable by removing underscore or adding a public wrapper).
### Modified files
- **`epub_examples.py`**:
- Import `sentence_difficulty.score_sentence` and `sentence_difficulty.build_nikkud_map`
- In `update_words_json()`: build nikkud_map and load freq_data once at start (before per-word loop)
- Replace `_score()` closure with frequency-based scoring that calls `score_sentence()`
- Sort vetted list by difficulty score (easiest first)
- Store `difficulty_score` in the cloze dict
- Make `_try_strip_prefix` importable (rename to `try_strip_prefix` or add public alias)
- **`frequency_lookup.py`** — add `get_freq_data() -> dict` public accessor to expose the loaded frequency dict (avoids accessing private `_freq` directly)
- **`SCHEMA.yaml`** — add `difficulty_score` field
- **`run.py`** — no changes; scoring happens inside epub_examples step
### Not modified
- **`apkg_builder.py`** — reads cloze as-is; vetted order is already respected
- **`nikkud_to_ktiv_male.py`** — used as-is
- **Card templates** — no changes needed
## Dependencies
- `nikkud_to_ktiv_male.convert()` — Academy rules converter (already written)
- `epub_examples._try_strip_prefix()` / `_build_nikkud_index()` — nikkud prefix stripping and index
- `frequency_lookup.py` — loads frequency data (auto-selects clean vs cache)
- `helpers.strip_nikkud()` — fallback converter
## Validation
- **Unit tests** for `score_sentence()` with known easy/hard sentences
- **Unit tests** for `_resolve_token_frequency()` covering all 5 tiers
- **Integration test**: verify cloze selection picks easiest sentence, vetted list is sorted
- **Spot check**: manually review 10 words with 3+ sentences to confirm ordering
- **Regression**: existing tests pass, GUID coverage unchanged, deck validates
## Constraints
- `examples.cloze` remains a single dict (not converted to list)
- No new Anki card types or fields
- No runtime JS in Anki cards
- No network calls during scoring
- `difficulty_score` is informational metadata; card rendering doesn't depend on it
- Existing cloze GUIDs preserved when the same sentence is re-selected
## Scope Exclusions (Future Work)
- **Pronominal suffix stripping** — would improve the ~7% unscored token rate; deferred (PROJECT_NOTES.md)
- **Kamatz katan disambiguation** — requires morphological analysis; accepted limitation
- **Per-learner adaptive difficulty** — requires Anki plugin; out of scope for static deck
- **Multiple cloze sentences per card** — would require schema migration to list; deferred

913
epub_examples.py Normal file
View file

@ -0,0 +1,913 @@
#!/usr/bin/env python3
"""
Extract example sentences from nikud'd Hebrew EPUB files, match them against
the vocabulary list in data/words.json, and write matched examples back into
words.json.
Usage (standalone):
python3 epub_examples.py
Called from run.py via:
run(words) words dict is passed in and updated in place
"""
import logging
import os
import re
import zipfile
from html.parser import HTMLParser
from pathlib import Path
import frequency_lookup
from helpers import strip_nikkud
from sentence_difficulty import build_nikkud_map, score_sentence
logger = logging.getLogger(__name__)
DATA_DIR = Path(__file__).parent / "data"
EPUB_DIR = DATA_DIR / "epubs"
WORDS_JSON = DATA_DIR / "words.json"
# Book metadata: filename -> display name
def _discover_epubs() -> dict[str, str]:
"""Auto-discover all .epub and .txt files in EPUB_DIR, returning {filepath: display_name}."""
if not EPUB_DIR.exists():
return {}
books: dict[str, str] = {}
for path in sorted(EPUB_DIR.glob("*.epub")):
stem = path.stem
stem_stripped = strip_nikkud(stem).lower()
# Derive a brief English display name from the filename
parts = stem.split(" -- ")
title_part = strip_nikkud(parts[0]).strip().lower()
if "alice" in stem_stripped or "אליס" in title_part:
name = "alice_wonderland"
elif "little_prince" in stem_stripped or "נסיך" in title_part:
name = "little_prince"
elif "מנהרת" in title_part or "time_tunnel" in stem_stripped:
num_match = re.search(r"(\d+)", stem_stripped)
num = num_match.group(1) if num_match else stem_stripped.replace("time_tunnel_", "")
name = f"time_tunnel_{num}"
else:
name = stem_stripped[:40]
books[str(path)] = name
# Also discover plain-text files (e.g. Ben Yehuda downloads)
for path in sorted(EPUB_DIR.glob("*.txt")):
books[str(path)] = path.stem
return books
# Sentence length bounds (word count)
MIN_WORDS = 3
MAX_WORDS = 15
# ── HTML text extraction ─────────────────────────────────────────
class _TextExtractor(HTMLParser):
"""Extract text content from HTML, skipping script/style tags."""
SKIP_TAGS = {"script", "style", "head"}
def __init__(self):
super().__init__()
self.parts: list[str] = []
self._skip_depth = 0
def handle_starttag(self, tag, attrs):
_ = attrs # required by HTMLParser interface
if tag in self.SKIP_TAGS:
self._skip_depth += 1
# Insert newline for block-level elements to avoid word concatenation
if tag in (
"p",
"div",
"br",
"li",
"h1",
"h2",
"h3",
"h4",
"h5",
"h6",
"td",
"th",
"tr",
"blockquote",
"section",
):
self.parts.append("\n")
def handle_endtag(self, tag):
if tag in self.SKIP_TAGS:
self._skip_depth = max(0, self._skip_depth - 1)
def handle_data(self, data):
if self._skip_depth == 0:
self.parts.append(data)
def get_text(self) -> str:
return "".join(self.parts)
def extract_text_from_html(html: str) -> str:
"""Parse HTML and return plain text."""
parser = _TextExtractor()
parser.feed(html)
return parser.get_text()
# ── EPUB processing ──────────────────────────────────────────────
def _content_files_from_epub(zf: zipfile.ZipFile) -> list[str]:
"""Get ordered list of content XHTML files from the OPF manifest."""
opf_path = None
for name in zf.namelist():
if name.endswith(".opf"):
opf_path = name
break
if not opf_path:
# Fallback: just use all xhtml files
return sorted(
n
for n in zf.namelist()
if n.endswith((".xhtml", ".html"))
and "toc" not in n.lower()
and "cover" not in n.lower()
and "nav" not in n.lower()
)
# Parse OPF to get spine order
opf_content = zf.read(opf_path).decode("utf-8")
opf_dir = os.path.dirname(opf_path)
# Extract manifest items: id -> href
manifest: dict[str, str] = {}
for m in re.finditer(r'<item\s+[^>]*id="([^"]+)"[^>]*href="([^"]+)"', opf_content):
manifest[m.group(1)] = m.group(2)
# Also try reversed attribute order
for m in re.finditer(r'<item\s+[^>]*href="([^"]+)"[^>]*id="([^"]+)"', opf_content):
manifest[m.group(2)] = m.group(1)
# Extract spine order
spine_ids = re.findall(r'<itemref\s+[^>]*idref="([^"]+)"', opf_content)
result = []
for sid in spine_ids:
href = manifest.get(sid, "")
if href and href.endswith((".xhtml", ".html")):
full_path = os.path.join(opf_dir, href) if opf_dir else href
# Normalize path separators
full_path = full_path.replace("\\", "/")
if full_path in zf.namelist():
result.append(full_path)
if not result:
# Fallback
return sorted(
n
for n in zf.namelist()
if n.endswith((".xhtml", ".html")) and "toc" not in n.lower() and "cover" not in n.lower()
)
return result
def extract_sentences_from_epub(epub_path: Path, book_name: str) -> list[dict]:
"""Extract sentences from an EPUB file.
Args:
epub_path: Path to the .epub file.
book_name: Human-readable book name used as the ``source`` field.
Returns:
List of ``{"text": str, "source": str}`` dicts.
"""
zf = zipfile.ZipFile(epub_path)
content_files = _content_files_from_epub(zf)
all_text = []
for cf in content_files:
try:
html = zf.read(cf).decode("utf-8")
except (KeyError, UnicodeDecodeError):
continue
text = extract_text_from_html(html)
all_text.append(text)
full_text = "\n".join(all_text)
return _split_into_sentences(full_text, book_name)
def extract_sentences_from_text(text_path: Path, book_name: str) -> list[dict]:
"""Extract sentences from a plain-text file (e.g. Ben Yehuda downloads).
Args:
text_path: Path to the .txt file.
book_name: Human-readable book name used as the ``source`` field.
Returns:
List of ``{"text": str, "source": str}`` dicts.
"""
full_text = text_path.read_text(encoding="utf-8")
return _split_into_sentences(full_text, book_name)
# ── Sentence splitting ───────────────────────────────────────────
# Hebrew sentence terminators: period, exclamation, question mark, sof pasuk
_SENT_SPLIT = re.compile(r"[.!?\u05C3]+")
# Punctuation to strip from word boundaries when matching
_PUNCT = re.compile(
r'^[\u0022\u0027\u05F4\u05F3,;:\-\u2013\u2014\u2026\u201C\u201D\u201E\u201F\u2018\u2019()\[\]{}«»"\']+|'
r'[\u0022\u0027\u05F4\u05F3,;:\-\u2013\u2014\u2026\u201C\u201D\u201E\u201F\u2018\u2019()\[\]{}«»"\']+$'
)
def _split_into_sentences(text: str, book_name: str) -> list[dict]:
"""Split text into Hebrew sentences and filter by word count.
Args:
text: Raw extracted text from an EPUB chapter.
book_name: Source label for each sentence dict.
Returns:
List of ``{"text": str, "source": str}`` dicts, deduplicated by exact text.
"""
# Normalize whitespace
text = re.sub(r"\s+", " ", text).strip()
raw_sentences = _SENT_SPLIT.split(text)
results: list[dict] = []
seen: set[str] = set()
for sent in raw_sentences:
sent = sent.strip()
if not sent:
continue
# Count Hebrew words (skip non-Hebrew tokens like numbers)
words = sent.split()
hebrew_words = [w for w in words if any("\u0590" <= c <= "\u05ff" for c in w)]
if len(hebrew_words) < MIN_WORDS or len(hebrew_words) > MAX_WORDS:
continue
# Deduplicate by exact nikkud text
if sent in seen:
continue
seen.add(sent)
results.append({"text": sent, "source": book_name})
return results
# ── Nikkud index ─────────────────────────────────────────────────
# Unicode ranges for Hebrew combining marks
_NIKKUD_LOW = 0x05B0 # start of vowel points (shva)
_NIKKUD_HIGH = 0x05BD # end of vowel range (meteg); 0x05BE is maqaf (punctuation)
_DAGESH = "\u05bc"
_SHIN_DOT = "\u05c1"
_SIN_DOT = "\u05c2"
# Valid prefix consonants
_PREFIX_CONSONANTS = set("בהוכלמש")
# Named vowel combining marks
_SHVA = "\u05b0"
_HIRIQ = "\u05b4"
_TSERE = "\u05b5"
_SEGOL = "\u05b6"
_PATACH = "\u05b7"
_QAMATZ = "\u05b8"
# Valid nikkud patterns on each prefix consonant.
# Key = consonant, Value = set of frozensets of combining marks valid for that prefix.
_VALID_PREFIX_MARKS: dict[str, set[frozenset]] = {
"ב": {
frozenset({_SHVA, _DAGESH}), # בְּ standard
frozenset({_HIRIQ, _DAGESH}), # בִּ before shva
frozenset({_PATACH, _DAGESH}), # בַּ with definite article
frozenset({_QAMATZ, _DAGESH}), # בָּ before chataf qamatz
frozenset({_SEGOL, _DAGESH}), # בֶּ before chataf segol
},
"כ": {
frozenset({_SHVA, _DAGESH}), # כְּ
frozenset({_HIRIQ, _DAGESH}), # כִּ
frozenset({_PATACH, _DAGESH}), # כַּ
frozenset({_QAMATZ, _DAGESH}), # כָּ
frozenset({_SEGOL, _DAGESH}), # כֶּ
},
"ל": {
frozenset({_SHVA}), # לְ standard
frozenset({_HIRIQ}), # לִ before shva
frozenset({_PATACH}), # לַ with definite article
frozenset({_QAMATZ}), # לָ demonstratives
frozenset({_SEGOL}), # לֶ before chataf segol
},
"ו": {
frozenset({_SHVA}), # וְ standard
frozenset({_DAGESH}), # וּ (shureq) before shva/bumf
frozenset({_PATACH}), # וַ before chataf patach
frozenset({_QAMATZ}), # וָ before chataf qamatz
frozenset({_SEGOL}), # וֶ before chataf segol
frozenset({_HIRIQ}), # וִ before yud-shva
},
"מ": {
frozenset({_HIRIQ}), # מִ standard
frozenset({_TSERE}), # מֵ before gutturals
},
"ש": {
frozenset({_SEGOL, _DAGESH}), # שֶׁ standard
frozenset({_SEGOL, _DAGESH, _SHIN_DOT}), # שֶׁ with explicit shin dot
},
"ה": {
frozenset({_PATACH}), # הַ standard definite article
frozenset({_QAMATZ}), # הָ before gutturals
frozenset({_SEGOL}), # הֶ before qamatz-bearing gutturals
},
}
def _is_combining_mark(ch: str) -> bool:
"""Return True if ch is a Hebrew combining mark (nikkud, dagesh, or dots)."""
cp = ord(ch)
if _NIKKUD_LOW <= cp <= _NIKKUD_HIGH:
return True
return ch in (_DAGESH, _SHIN_DOT, _SIN_DOT)
def _decompose_first_char(token: str) -> tuple[str, frozenset, str]:
"""Split token into (first_consonant, its_combining_marks, remainder).
Args:
token: A nikkud Hebrew token string.
Returns:
A tuple of (consonant, marks, rest). Returns ("", frozenset(), token)
if the token does not start with a Hebrew consonant (aleftav range).
"""
if not token:
return ("", frozenset(), token)
first = token[0]
# Check it's a Hebrew consonant (aleftav)
if not ("\u05d0" <= first <= "\u05ea"):
return ("", frozenset(), token)
# Collect all combining marks that follow the consonant
marks: set[str] = set()
i = 1
while i < len(token):
ch = token[i]
if _is_combining_mark(ch):
marks.add(ch)
i += 1
else:
break
return (first, frozenset(marks), token[i:])
def _is_valid_prefix(consonant: str, marks: frozenset) -> bool:
"""Check if consonant + marks form a valid Hebrew prefix combination.
Args:
consonant: The prefix consonant character.
marks: Frozenset of combining mark characters on that consonant.
Returns:
True if this is a recognised Hebrew prefix vocalization.
"""
valid = _VALID_PREFIX_MARKS.get(consonant)
if not valid:
return False
# For ש, allow shin dot to be present or absent
if consonant == "ש":
marks_without_shin = marks - {_SHIN_DOT}
return marks_without_shin in valid or marks in valid
return marks in valid
def _rebuild_token(consonant: str, marks: frozenset, rest: str) -> str:
"""Reassemble a token from its decomposed parts, sorting marks by codepoint."""
return consonant + "".join(sorted(marks)) + rest
def _try_strip_prefix(token: str, nikkud_index: dict) -> list[tuple[str, str, str]]:
"""Try stripping 1 or 2 prefix letters from a nikkud token.
Args:
token: A cleaned nikkud word token.
nikkud_index: Mapping from nikkud form to list of (unique_key, match_type).
Returns:
List of (unique_key, match_type, matched_remainder) for each hit found.
The match_type will have ``"_prefix"`` appended to the base type.
"""
results: list[tuple[str, str, str]] = []
# Try 1-letter prefix
c1, m1, rest1 = _decompose_first_char(token)
if not (c1 and _is_valid_prefix(c1, m1) and rest1):
return results
# Direct match on 1-prefix remainder
if rest1 in nikkud_index:
for unique_key, match_type in nikkud_index[rest1]:
results.append((unique_key, match_type + "_prefix", rest1))
# Try removing dagesh from first letter of remainder
# (handles absorbed definite article: לַמֶּלֶךְ → מֶּלֶךְ → מֶלֶךְ)
c2, m2, rest2_inner = _decompose_first_char(rest1)
if c2 and _DAGESH in m2:
without_dagesh = _rebuild_token(c2, m2 - {_DAGESH}, rest2_inner)
if without_dagesh != rest1 and without_dagesh in nikkud_index:
for unique_key, match_type in nikkud_index[without_dagesh]:
results.append((unique_key, match_type + "_prefix", without_dagesh))
# Try 2-letter prefix (ו and ש commonly stack with another prefix)
if c1 in "וש":
c2b, m2b, rest2b = _decompose_first_char(rest1)
if c2b and c2b in _PREFIX_CONSONANTS and _is_valid_prefix(c2b, m2b) and rest2b:
if rest2b in nikkud_index:
for unique_key, match_type in nikkud_index[rest2b]:
results.append((unique_key, match_type + "_prefix", rest2b))
# Also try dagesh removal on remainder of 2-letter prefix
c3, m3, rest3_inner = _decompose_first_char(rest2b)
if c3 and _DAGESH in m3:
without_dagesh2 = _rebuild_token(c3, m3 - {_DAGESH}, rest3_inner)
if without_dagesh2 != rest2b and without_dagesh2 in nikkud_index:
for unique_key, match_type in nikkud_index[without_dagesh2]:
results.append((unique_key, match_type + "_prefix", without_dagesh2))
return results
# Public alias for use by sentence_difficulty module
try_strip_prefix = _try_strip_prefix
def _build_nikkud_index(words: dict) -> dict[str, list[tuple[str, str]]]:
"""Build a mapping from nikkud form to list of (unique_key, match_type).
Indexes the following sources per entry:
- ``word.nikkud`` "direct"
- conjugation active/passive forms "conjugated"
- conjugation infinitive and reference_form "conjugated"
- noun inflection singular/plural/construct/pronominal "inflected"
Args:
words: The full words.json dict keyed by unique_key.
Returns:
Dict mapping each nikkud form to a list of (unique_key, match_type) tuples.
"""
index: dict[str, list[tuple[str, str]]] = {}
def _add(form: str | None, unique_key: str, match_type: str) -> None:
if form:
index.setdefault(form, []).append((unique_key, match_type))
for unique_key, entry in words.items():
# Direct word form
word = entry.get("word") or {}
_add(word.get("nikkud"), unique_key, "direct")
# Conjugation forms
conj = entry.get("conjugation") or {}
for form_entry in conj.get("active_forms") or []:
form = (form_entry.get("form") or {}).get("nikkud")
_add(form, unique_key, "conjugated")
for form_entry in conj.get("hufal_pual_forms") or []:
form = (form_entry.get("form") or {}).get("nikkud")
_add(form, unique_key, "conjugated")
inf = conj.get("infinitive") or {}
_add(inf.get("nikkud"), unique_key, "conjugated")
ref = conj.get("reference_form") or {}
_add(ref.get("nikkud"), unique_key, "conjugated")
# Noun inflection forms
noun = entry.get("noun_inflection") or {}
for field in ("singular", "plural", "construct_singular", "construct_plural"):
sub = noun.get(field) or {}
form = sub.get("nikkud")
_add(form, unique_key, "inflected")
# Index construct forms without maqaf too — modern text often
# writes smichut as two space-separated words without maqaf
if form and form.endswith("־"):
_add(form[:-1], unique_key, "inflected")
pronominal = noun.get("pronominal_suffixes") or {}
for _person, sub in pronominal.items():
if isinstance(sub, dict):
_add(sub.get("nikkud"), unique_key, "inflected")
return index
def _filter_collision_forms(nikkud_index: dict) -> dict:
"""Remove colliding forms for entries that have other unique forms.
A "colliding form" maps to 2+ unique_keys. For each unique_key that
appears in a collision, check whether it also has at least one
non-colliding form in the index. If so, remove it from the colliding
form's entry list. If a unique_key's *only* indexed forms all collide,
keep them (otherwise the entry would get zero matches).
Returns a new index dict with the same structure.
"""
# Identify collision forms and build reverse map (key → its forms)
collision_forms: set[str] = set()
key_to_forms: dict[str, set[str]] = {}
for form, entries in nikkud_index.items():
keys = {uk for uk, _ in entries}
if len(keys) >= 2:
collision_forms.add(form)
for uk, _ in entries:
key_to_forms.setdefault(uk, set()).add(form)
# For each key, check if it has any non-colliding form
keys_with_unique_forms: set[str] = set()
for uk, forms in key_to_forms.items():
if forms - collision_forms:
keys_with_unique_forms.add(uk)
# Build filtered index
filtered: dict[str, list[tuple[str, str]]] = {}
removed = 0
for form, entries in nikkud_index.items():
if form in collision_forms:
kept = [(uk, mt) for uk, mt in entries if uk not in keys_with_unique_forms]
removed += len(entries) - len(kept)
if kept:
filtered[form] = kept
else:
filtered[form] = entries
logger.info(f" Filtered {removed} collision mappings from entries with unique forms")
return filtered
# ── Matching ─────────────────────────────────────────────────────
def match_sentences(
sentences: list[dict],
nikkud_index: dict,
confusable_keys: set[str],
) -> dict:
"""Match sentences to vocab words using the nikkud index.
Args:
sentences: List of ``{"text": str, "source": str}`` dicts.
nikkud_index: Output of ``_build_nikkud_index``.
confusable_keys: Set of unique_keys that are in confusable groups.
Returns:
Dict mapping unique_key list of match dicts, each containing:
``text``, ``source``, ``match_method``, ``word_count``,
``matched_form``, ``char_offset``, ``char_end``.
"""
matches: dict[str, list[dict]] = {}
for sent_info in sentences:
text = sent_info["text"]
source = sent_info["source"]
words_in_sent = text.split()
word_count = len(words_in_sent)
char_pos = 0
for raw_word in words_in_sent:
cleaned = _PUNCT.sub("", raw_word)
if not cleaned:
word_start = text.find(raw_word, char_pos)
char_pos = word_start + len(raw_word) if word_start >= 0 else char_pos
continue
# Locate positions within the sentence
word_start_in_sent = text.find(raw_word, char_pos)
if word_start_in_sent < 0:
word_start_in_sent = char_pos
clean_offset_in_raw = raw_word.find(cleaned)
if clean_offset_in_raw < 0:
clean_offset_in_raw = 0
clean_start = word_start_in_sent + clean_offset_in_raw
clean_end = clean_start + len(cleaned)
found: list[tuple[str, str]] = []
# Direct nikkud match
if cleaned in nikkud_index:
for unique_key, match_type in nikkud_index[cleaned]:
found.append((unique_key, match_type))
# Prefix stripping — only if no direct match exists
if cleaned not in nikkud_index:
for unique_key, match_type, _remainder in _try_strip_prefix(cleaned, nikkud_index):
found.append((unique_key, match_type))
for unique_key, match_method in found:
matches.setdefault(unique_key, []).append(
{
"text": text,
"source": source,
"match_method": match_method,
"word_count": word_count,
"matched_form": cleaned,
"char_offset": clean_start,
"char_end": clean_end,
}
)
char_pos = word_start_in_sent + len(raw_word)
return matches
# ── Writing results ──────────────────────────────────────────────
def update_words_json(words: dict, matches: dict, confusable_keys: set[str]) -> int:
"""Update words dict entries with matched example sentences.
Selects up to 3 best sentences per word (scoring prefers 612 word
sentences and non-prefix matches). Also generates a cloze entry for
the top match, unless the word is in the confusable set.
Args:
words: The full words.json dict, modified in place.
matches: Output of ``match_sentences``.
confusable_keys: Set of unique_keys in confusable groups.
Returns:
Count of words.json entries that were updated.
"""
import genanki # noqa: PLC0415 — import only where needed
updated = 0
# Build frequency scoring infrastructure (once for all words)
nikkud_index = _build_nikkud_index(words)
nikkud_map = build_nikkud_map(words)
freq_data = frequency_lookup.get_freq_data()
for unique_key, sent_list in matches.items():
if unique_key not in words:
continue
entry = words[unique_key]
# Deduplicate by sentence text
seen_texts: set[str] = set()
unique: list[dict] = []
for s in sent_list:
if s["text"] not in seen_texts:
seen_texts.add(s["text"])
unique.append(s)
# Prefer direct matches; only fall back to prefix if none exist
direct = [s for s in unique if "prefix" not in s["match_method"]]
prefix_only = [s for s in unique if "prefix" in s["match_method"]]
pool = direct if direct else prefix_only
# Score: prefer sentences with easier (more common) context words
def _score(s: dict) -> tuple[int,]:
return (
score_sentence(
s["text"],
s["char_offset"],
s["char_end"],
nikkud_map,
nikkud_index,
freq_data,
),
)
pool.sort(key=_score)
best = pool[:3]
# Build vetted list
if not entry.get("examples"):
entry["examples"] = {}
examples: dict = entry["examples"]
examples["vetted"] = [
{
"text": s["text"],
"source": s["source"],
"match_method": s["match_method"],
}
for s in best
]
# Build cloze from best sentence (skip confusables)
is_confusable = unique_key in confusable_keys
if not is_confusable and best:
top = best[0]
# Preserve existing cloze_guid if sentence text unchanged
old_cloze = examples.get("cloze") or {}
if old_cloze.get("text") == top["text"]:
cloze_guid = old_cloze.get("cloze_guid")
else:
cloze_guid = genanki.guid_for("cloze", unique_key)
examples["cloze"] = {
"text": top["text"],
"cloze_word_start": top["char_offset"],
"cloze_word_end": top["char_end"],
"cloze_hint": None,
"cloze_guid": cloze_guid,
"difficulty_score": _score(top)[0],
}
elif is_confusable:
examples.pop("cloze", None)
examples["rejected_count"] = 0
updated += 1
# Deduplicate shared examples across confusable groups
cleared = _deduplicate_confusable_examples(words)
if cleared:
logger.info(f" Cleared shared examples from {cleared} confusable entries")
return updated
def _deduplicate_confusable_examples(words: dict) -> int:
"""Remove shared examples from less-common confusable group members.
After example matching assigns sentences, confusable entries often share
identical examples (matched via shared nikkud forms). This function keeps
examples only on the highest-frequency member, clearing others.
Args:
words: The full words.json dict, modified in place (examples already
assigned).
Returns:
Count of entries whose examples were cleared.
"""
from collections import defaultdict
# Build confusable group map: group_id → [unique_key, ...]
group_map: dict[tuple[str, ...], list[str]] = defaultdict(list)
for key, entry in words.items():
cg = entry.get("confusable_group")
if cg:
group_id = tuple(sorted(cg))
group_map[group_id].append(key)
cleared = 0
for _group_id, members in group_map.items():
if len(members) < 2:
continue
# Collect vetted sentence text sets per member
member_texts: dict[str, frozenset[str]] = {}
for key in members:
vetted = (words[key].get("examples") or {}).get("vetted") or []
texts = frozenset(e.get("text", "") for e in vetted)
member_texts[key] = texts
# Find members with identical non-empty sentence sets
# Group members by their sentence set
text_groups: dict[frozenset[str], list[str]] = defaultdict(list)
for key, texts in member_texts.items():
if texts: # skip entries with no examples
text_groups[texts].append(key)
# For each set of members sharing identical examples, keep only the
# highest-frequency one
for _texts, sharing_keys in text_groups.items():
if len(sharing_keys) < 2:
continue
# Sort by frequency_rank (lower = more common = winner).
# No frequency → sort last (use large sentinel).
# Tie-break: alphabetical by unique_key.
def _sort_key(k: str) -> tuple[int, str]:
rank = words[k].get("frequency_rank")
return (rank if rank is not None else 999999, k)
sharing_keys.sort(key=_sort_key)
winner = sharing_keys[0]
losers = sharing_keys[1:]
for loser_key in losers:
entry = words[loser_key]
examples = entry.get("examples") or {}
examples["vetted"] = []
examples.pop("cloze", None)
entry["examples"] = examples
cleared += 1
logger.debug(f" Cleared examples from {loser_key} (kept on {winner})")
return cleared
# ── Public API ───────────────────────────────────────────────────
def run(words: dict) -> dict:
"""Extract EPUB sentences, match against words, update words dict in place.
Called from run.py with the already-loaded words.json dict.
Args:
words: The full words.json dict keyed by unique_key. Modified in place.
Returns:
Summary stats dict with keys ``books``, ``matched``, ``total_vocab``.
"""
logger.info(" Extracting sentences from EPUBs ...")
all_sentences: list[dict] = []
book_counts: dict[str, int] = {}
for filepath, book_name in _discover_epubs().items():
path = Path(filepath)
if path.suffix == ".txt":
sentences = extract_sentences_from_text(path, book_name)
else:
sentences = extract_sentences_from_epub(path, book_name)
book_counts[book_name] = len(sentences)
all_sentences.extend(sentences)
logger.info(f" {book_name}: {len(sentences)} sentences")
if not all_sentences:
logger.warning(" No EPUB files found — skipping example extraction")
return {"books": {}, "matched": 0, "total_vocab": len(words)}
logger.info(f" Total sentences: {len(all_sentences)}")
# Build nikkud index
logger.info(" Building nikkud index from words.json ...")
nikkud_index = _build_nikkud_index(words)
logger.info(f" {len(nikkud_index)} unique nikkud forms indexed")
# Filter out collision forms for entries that have unique forms
nikkud_index = _filter_collision_forms(nikkud_index)
# Build confusable key set
confusable_keys: set[str] = set()
for key, entry in words.items():
if entry.get("confusable_group"):
confusable_keys.add(key)
# Match sentences
logger.info(" Matching sentences against vocab ...")
matches = match_sentences(all_sentences, nikkud_index, confusable_keys)
logger.info(f" {len(matches)} words matched")
# Break down by match method
method_counts: dict[str, int] = {}
for sent_list in matches.values():
for s in sent_list:
method = s["match_method"]
method_counts[method] = method_counts.get(method, 0) + 1
for method, count in sorted(method_counts.items()):
logger.info(f" {method}: {count} sentence-word pairs")
# Update words dict in place
updated = update_words_json(words, matches, confusable_keys)
logger.info(f" Updated {updated} entries in words.json")
return {
"books": book_counts,
"matched": len(matches),
"total_vocab": len(words),
}
# ── Standalone entry point ───────────────────────────────────────
if __name__ == "__main__":
import json
logging.basicConfig(level=logging.INFO, format="%(message)s")
words_path = DATA_DIR / "words.json"
with open(words_path, encoding="utf-8") as f:
words = json.load(f)
stats = run(words)
# Save updated words.json
with open(words_path, "w", encoding="utf-8") as f:
json.dump(words, f, ensure_ascii=False, indent=2)
coverage = stats["matched"] * 100 / stats["total_vocab"] if stats["total_vocab"] else 0
logger.info(f" Coverage: {stats['matched']}/{stats['total_vocab']} ({coverage:.1f}%)")

View file

@ -3,44 +3,43 @@
Hebrew word frequency lookup from hermitdave/FrequencyWords corpus. Hebrew word frequency lookup from hermitdave/FrequencyWords corpus.
Downloads he_50k.txt once; subsequent runs read from cache. Downloads he_50k.txt once; subsequent runs read from cache.
Exposed API: get_frequency_rank(word_no_nikkud) -> int | None Exposed API: get_frequency_rank(word_no_nikkud) -> int | None
TODO: Rewrite to update words.json frequency field directly instead of
writing to a separate frequency_cache.json. Currently the migration script
bridges the gap. See Phase 5 in SPRINT_LOG.md.
""" """
import json import json
import logging import logging
import re
import unicodedata
from pathlib import Path from pathlib import Path
import requests import requests
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
FREQ_URL = ( FREQ_URL = "https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content/2016/he/he_50k.txt"
"https://raw.githubusercontent.com/hermitdave/FrequencyWords/"
"master/content/2016/he/he_50k.txt"
)
CACHE_PATH = Path(__file__).parent / "data" / "frequency_cache.json" CACHE_PATH = Path(__file__).parent / "data" / "frequency_cache.json"
CLEAN_CACHE_PATH = Path(__file__).parent / "data" / "frequency_clean.json"
REQUEST_TIMEOUT = 30 REQUEST_TIMEOUT = 30
# Module-level cache: word_no_nikkud -> rank (1 = most common) # Module-level cache: word_no_nikkud -> rank (1 = most common)
_freq: dict[str, int] = {} _freq: dict[str, int] = {}
def _strip_nikkud(text: str) -> str:
"""Remove Hebrew nikkud (diacritics) from a string."""
return "".join(
ch for ch in unicodedata.normalize("NFD", text)
if unicodedata.category(ch) != "Mn"
)
def load(cache_path: Path = CACHE_PATH) -> None: def load(cache_path: Path = CACHE_PATH) -> None:
"""Load frequency data from cache, downloading if not present.""" """Load frequency data from cache, downloading if not present.
Prefers frequency_clean.json (YAP-filtered) over raw frequency_cache.json.
"""
global _freq global _freq
if cache_path.exists(): # Prefer YAP-cleaned frequency data if available
with open(cache_path, encoding="utf-8") as f: clean_path = cache_path.parent / "frequency_clean.json" if cache_path == CACHE_PATH else None
load_path = clean_path if clean_path and clean_path.exists() else cache_path
if load_path.exists():
with open(load_path, encoding="utf-8") as f:
_freq = json.load(f) _freq = json.load(f)
logger.info(f"Frequency cache loaded: {len(_freq)} entries") label = "clean" if load_path == clean_path else "raw"
logger.info(f"Frequency cache loaded ({label}): {len(_freq)} entries")
return return
logger.info("Downloading FrequencyWords he_50k.txt …") logger.info("Downloading FrequencyWords he_50k.txt …")
@ -52,7 +51,7 @@ def load(cache_path: Path = CACHE_PATH) -> None:
line = line.strip() line = line.strip()
if not line: if not line:
continue continue
word = _strip_nikkud(line.split()[0]) word = line.split()[0]
if word and word not in _freq: if word and word not in _freq:
_freq[word] = rank _freq[word] = rank
rank += 1 rank += 1
@ -67,14 +66,24 @@ def get_frequency_rank(word_no_nikkud: str) -> int | None:
""" """
Return the frequency rank of a word (1 = most common). Return the frequency rank of a word (1 = most common).
Returns None if not found in the corpus. Returns None if not found in the corpus.
Strips nikkud from the input before lookup. Expects ktiv male (no nikkud) input.
""" """
if not _freq: if not _freq:
load() load()
clean = _strip_nikkud(word_no_nikkud.strip()) clean = word_no_nikkud.strip()
return _freq.get(clean) return _freq.get(clean)
def get_freq_data() -> dict[str, int]:
"""Return the full frequency dict (word -> rank).
Auto-loads from cache if not yet loaded.
"""
if not _freq:
load()
return _freq
if __name__ == "__main__": if __name__ == "__main__":
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s") logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
load() load()

View file

@ -1,219 +0,0 @@
#!/usr/bin/env python3
"""
Extract Hebrew vocabulary from pealim.com dictionary.
Scrapes word entries, roots, parts of speech, and audio URLs for Anki flashcards.
"""
import requests
import pandas as pd
from bs4 import BeautifulSoup
import logging
import time
from typing import Optional
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
# Session for connection pooling
session = requests.Session()
session.headers.update({
'User-Agent': 'Mozilla/5.0 (compatible; pealim-scraper/1.0)'
})
PEALIM_DICT_URL = "https://www.pealim.com/dict/"
REQUEST_DELAY = 1.5 # seconds between requests (respectful scraping)
REQUEST_TIMEOUT = 10 # seconds
def get_total_pages() -> int:
"""Dynamically determine total pages from first request."""
try:
logger.info("Fetching total page count...")
cookies = {'translit': 'none', 'hebstyle': 'mo'}
response = session.get(PEALIM_DICT_URL, cookies=cookies, timeout=REQUEST_TIMEOUT)
response.raise_for_status()
# Hardcoded — pealim.com has ~608 pages at ~15 words/page
return 608
except Exception as e:
logger.error(f"Error fetching page count: {e}. Using default (608).")
return 608
def _parse_page_with_audio(html_bytes: bytes) -> list[dict]:
"""
Parse a dict page with BeautifulSoup to extract word data + audio URL.
Returns list of dicts with keys: Word, Root, Part of Speech, Meaning, audio_url.
"""
soup = BeautifulSoup(html_bytes, 'html.parser')
rows = []
for tr in soup.select('table tr'):
tds = tr.find_all('td')
if len(tds) < 4:
continue
# Audio URL from span[data-audio] in first td
audio_span = tds[0].find(attrs={'data-audio': True})
audio_url = audio_span['data-audio'] if audio_span else ''
# Word with nikkud
menukad = tds[0].find('span', class_='menukad')
word = menukad.get_text(strip=True) if menukad else tds[0].get_text(strip=True)
# Root (may be link or plain text)
root = tds[1].get_text(strip=True)
# Part of speech
pos = tds[2].get_text(strip=True)
# Meaning
meaning = tds[3].get_text(strip=True)
if word:
rows.append({
'Word': word,
'Root': root if root else '-',
'Part of Speech': pos,
'Meaning': meaning,
'audio_url': audio_url,
})
return rows
def extract_from_website(max_pages: Optional[int] = None) -> pd.DataFrame:
"""
Extract dictionary entries from pealim.com.
Captures audio URLs from each word entry's data-audio attribute.
Args:
max_pages: Maximum pages to scrape (None = all)
Returns:
DataFrame with Word, Root, Part of Speech, Meaning, Word Without Nikkud, audio_url columns
"""
total_pages = max_pages or get_total_pages()
logger.info(f"Starting extraction from {total_pages} pages...")
all_rows: list[dict] = []
for page_num in range(1, total_pages):
try:
url = f"{PEALIM_DICT_URL}?page={page_num}"
# First request: with nikkud — parse with BeautifulSoup for audio URL
cookies = {'translit': 'none', 'hebstyle': 'mo'}
response = session.get(url, cookies=cookies, timeout=REQUEST_TIMEOUT)
response.raise_for_status()
page_rows = _parse_page_with_audio(response.content)
# Second request: without nikkud — just get the word column
cookies_vl = {'translit': 'none', 'hebstyle': 'vl', 'showmeaning': 'off'}
resp_vl = session.get(url, cookies=cookies_vl, timeout=REQUEST_TIMEOUT)
resp_vl.raise_for_status()
soup_vl = BeautifulSoup(resp_vl.content, 'html.parser')
no_nik_words = []
for tr in soup_vl.select('table tr'):
tds = tr.find_all('td')
if len(tds) < 4:
continue
menukad = tds[0].find('span', class_='menukad')
w = menukad.get_text(strip=True) if menukad else tds[0].get_text(strip=True)
no_nik_words.append(w)
# Merge no-nikkud words into rows
for i, row in enumerate(page_rows):
row['Word Without Nikkud'] = no_nik_words[i] if i < len(no_nik_words) else ''
all_rows.extend(page_rows)
if page_num % 50 == 0:
logger.info(f"Processed {page_num}/{total_pages} pages ({len(all_rows)} words so far)...")
time.sleep(REQUEST_DELAY)
except requests.RequestException as e:
logger.error(f"Error fetching page {page_num}: {e}. Retrying...")
time.sleep(REQUEST_DELAY * 2)
except Exception as e:
logger.error(f"Unexpected error on page {page_num}: {e}")
continue
df = pd.DataFrame(all_rows)
audio_count = (df['audio_url'] != '').sum() if 'audio_url' in df.columns else 0
logger.info(f"Extraction complete. Total words: {len(df)}, with audio URL: {audio_count}")
return df
def modify_for_anki(df: pd.DataFrame) -> pd.DataFrame:
"""
Transform dictionary DataFrame for Anki import.
Adds shared root words and Hebrew tags. Preserves audio_url column.
"""
logger.info("Preparing data for Anki...")
# Find shared root words
shared_root_words = []
for idx, row in df.iterrows():
root = row['Root']
word = row['Word']
if root != '-' and pd.notna(root):
same_root = df[(df['Root'] == root) & (df['Word'] != word)]['Word'].values
shared = ' '.join(str(w) for w in same_root)
shared_root_words.append(shared)
else:
shared_root_words.append('')
df['shared roots'] = shared_root_words
# Generate Hebrew tags
tags = []
for idx, row in df.iterrows():
tag_parts = []
root = str(row['Root']).replace(' ', '').replace('-', '')
if 'nan' not in root and root:
root_clean = root.replace('.', '')
tag_parts.append(f"שורש::{root_clean}")
pos = str(row['Part of Speech'])
pos_tags = {
'Adverb': 'תוארי_הפועל',
'Pronoun': 'כינוייוף',
'Noun': 'שם_עצם',
'Verb': 'פעלים',
'Adjective': 'שם_תואר',
'Preposition': 'מילות_יחס',
'Conjunction': 'מילות_חיבור',
'Particle': 'מילית'
}
for key, value in pos_tags.items():
if key in pos:
tag_parts.append(value)
break
tags.append(' '.join(tag_parts))
df['tags'] = tags
logger.info("Anki preparation complete.")
return df
def main():
"""Main entry point."""
try:
df = extract_from_website()
df.to_csv('hebrew_dict.csv', index=True)
logger.info("Saved: hebrew_dict.csv")
df = modify_for_anki(df)
df.to_csv('hebrew_dict_for_anki.csv', sep=';', index=True)
logger.info("Saved: hebrew_dict_for_anki.csv")
logger.info("Complete!")
except Exception as e:
logger.error(f"Fatal error: {e}")
raise
if __name__ == '__main__':
main()

8
helpers.py Normal file
View file

@ -0,0 +1,8 @@
"""Shared helper functions for the Hebrew Flash Cards project."""
import unicodedata
def strip_nikkud(text: str) -> str:
"""Remove Hebrew nikkud (diacritics) from a string."""
return "".join(ch for ch in unicodedata.normalize("NFD", text) if unicodedata.category(ch) != "Mn")

View file

@ -2,6 +2,10 @@
""" """
Fetch images for concrete Hebrew nouns from Wikipedia / Wikimedia Commons. Fetch images for concrete Hebrew nouns from Wikipedia / Wikimedia Commons.
TODO: Rewrite to update words.json image/image_source fields directly instead of
writing to a separate image_cache.json. Currently the migration script bridges
the gap. See Phase 5 in SPRINT_LOG.md.
Scope: Noun PoS entries only. Concreteness heuristic: Scope: Noun PoS entries only. Concreteness heuristic:
- English meaning has no abstract suffixes (-tion, -ity, -ness, -ment, -ance, -ism, -hood, - English meaning has no abstract suffixes (-tion, -ity, -ness, -ment, -ance, -ism, -hood,
-ship, -ure, -al, -ing when not a gerund, -ence) -ship, -ure, -al, -ing when not a gerund, -ence)
@ -22,9 +26,7 @@ import argparse
import json import json
import logging import logging
import re import re
import sys
import time import time
import unicodedata
from pathlib import Path from pathlib import Path
import requests import requests
@ -40,21 +42,23 @@ REQUEST_TIMEOUT = 10
# Abstract noun suffixes — words whose English meaning ends in these are skipped # Abstract noun suffixes — words whose English meaning ends in these are skipped
ABSTRACT_SUFFIXES = ( ABSTRACT_SUFFIXES = (
"tion", "ity", "ness", "ment", "ance", "ence", "ism", "tion",
"hood", "ship", "ure", "age", "ity",
"ness",
"ment",
"ance",
"ence",
"ism",
"hood",
"ship",
"ure",
"age",
) )
session = requests.Session() session = requests.Session()
session.headers.update({ session.headers.update(
"User-Agent": "pealim-anki/3.0 (educational Hebrew Anki deck builder; contact: anki@pealim.invalid)" {"User-Agent": "pealim-anki/3.0 (educational Hebrew Anki deck builder; contact: anki@pealim.invalid)"}
}) )
def _strip_nikkud(text: str) -> str:
return "".join(
ch for ch in unicodedata.normalize("NFD", text)
if unicodedata.category(ch) != "Mn"
)
def is_concrete(english_meaning: str) -> bool: def is_concrete(english_meaning: str) -> bool:
@ -72,7 +76,7 @@ def is_concrete(english_meaning: str) -> bool:
def _safe_name(word_no_nikkud: str) -> str: def _safe_name(word_no_nikkud: str) -> str:
"""Create a safe ASCII-ish filename from a Hebrew word (strip to Hebrew letters only).""" """Create a safe ASCII-ish filename from a Hebrew word (strip to Hebrew letters only)."""
hebrew_only = re.sub(r"[^\u05d0-\u05ea]", "", _strip_nikkud(word_no_nikkud)) hebrew_only = re.sub(r"[^\u05d0-\u05ea]", "", word_no_nikkud)
return hebrew_only if hebrew_only else "unknown" return hebrew_only if hebrew_only else "unknown"
@ -196,7 +200,7 @@ def load_cache() -> dict:
try: try:
with open(CACHE_PATH, encoding="utf-8") as f: with open(CACHE_PATH, encoding="utf-8") as f:
return json.load(f) return json.load(f)
except Exception: except Exception: # noqa: S110
pass pass
return {} return {}
@ -255,7 +259,7 @@ def run(limit: int | None = None, dry_run: bool = False, single_word: str | None
if single_word and word_plain != single_word: if single_word and word_plain != single_word:
continue continue
cache_key = word_plain or _strip_nikkud(word) cache_key = word_plain
if cache_key in cache: if cache_key in cache:
skipped_cached += 1 skipped_cached += 1

185
nikkud_to_ktiv_male.py Normal file
View file

@ -0,0 +1,185 @@
"""Convert nikkud (vocalized) Hebrew to ktiv male (plene spelling).
Implements Hebrew Academy rules for matres lectionis insertion:
- Rule A: U vowel (kubutz) always insert vav
- Rule B: O vowel (holam on non-vav) insert vav
- Rule C: I vowel (hiriq) insert yod (conditionally)
- Rule D: E vowel (tsere) insert yod (limited cases)
- Rule E/F: Consonantal vav/yod doubling
Reference: https://hebrew-academy.org.il/topic/hahlatot/missingvocalizationspelling/
"""
import unicodedata
# Hebrew nikkud code points
SHVA = "\u05b0"
HATAF_SEGOL = "\u05b1"
HATAF_PATAH = "\u05b2"
HATAF_KAMATZ = "\u05b3"
HIRIQ = "\u05b4"
TSERE = "\u05b5"
SEGOL = "\u05b6"
PATAH = "\u05b7"
KAMATZ = "\u05b8"
HOLAM = "\u05b9"
HOLAM_HASER = "\u05ba"
KUBUTZ = "\u05bb"
DAGESH = "\u05bc"
METEG = "\u05bd"
RAFE = "\u05bf"
SHIN_DOT = "\u05c1"
SIN_DOT = "\u05c2"
VAV = "ו"
YOD = "י"
MAQAF = "־"
VOWELS = {SHVA, HATAF_SEGOL, HATAF_PATAH, HATAF_KAMATZ, HIRIQ, TSERE, SEGOL, PATAH, KAMATZ, HOLAM, HOLAM_HASER, KUBUTZ}
NIKKUD_MARKS = VOWELS | {DAGESH, METEG, RAFE, SHIN_DOT, SIN_DOT}
def _parse_segments(text: str) -> list[tuple[str, list[str]]]:
"""Parse nikkud text into (character, [marks]) segments."""
segments: list[tuple[str, list[str]]] = []
cur_char: str | None = None
cur_marks: list[str] = []
for ch in text:
if unicodedata.category(ch) == "Mn":
cur_marks.append(ch)
else:
if cur_char is not None:
segments.append((cur_char, cur_marks))
cur_char = ch
cur_marks = []
if cur_char is not None:
segments.append((cur_char, cur_marks))
return segments
def _get_vowel(marks: list[str]) -> str | None:
"""Extract the vowel mark from a list of combining marks."""
for m in marks:
if m in VOWELS:
return m
return None
def _has_dagesh(marks: list[str]) -> bool:
return DAGESH in marks
def _is_hebrew_letter(ch: str) -> bool:
return "\u05d0" <= ch <= "\u05ea"
def convert(text: str) -> str:
"""Convert nikkud Hebrew text to ktiv male.
Strips all nikkud marks and inserts matres lectionis (vav/yod)
according to Hebrew Academy spelling rules.
"""
segments = _parse_segments(text)
result: list[str] = []
for i, (ch, marks) in enumerate(segments):
if not _is_hebrew_letter(ch):
# Non-Hebrew character: output as-is (no marks)
result.append(ch)
continue
vowel = _get_vowel(marks)
has_dag = _has_dagesh(marks)
# Output the base letter (strip all nikkud marks)
result.append(ch)
# --- Rule A: U vowel (kubutz) → always add vav ---
if vowel == KUBUTZ:
result.append(VAV)
continue
# --- Shuruk detection ---
# Vav with dagesh and no other vowel = shuruk (already a mater)
# Vav with dagesh AND a vowel = consonantal vav (ב with dagesh)
# If letter is vav with dagesh only → it's shuruk, already output
if ch == VAV and has_dag and vowel is None:
# Shuruk: vav IS the mater lectionis, already output
continue
# --- Rule B: O vowel (holam) → add vav ---
if vowel in (HOLAM, HOLAM_HASER):
if ch != VAV:
# Exception: holam before aleph (pe-aleph verbs) — no vav
# e.g., תֹּאבַד→תאבד, יֹאבַד→יאבד, נֹאבַד→נאבד
next_is_aleph = i + 1 < len(segments) and segments[i + 1][0] == "א"
if not next_is_aleph:
result.append(VAV)
# If ch IS vav (holam male), vav already output
continue
# --- Rule C: I vowel (hiriq) → conditionally add yod ---
if vowel == HIRIQ:
if ch == YOD:
# Yod already present, don't double
continue
# Don't insert yod if next letter is already yod
if i + 1 < len(segments) and segments[i + 1][0] == YOD:
continue
# Rule C Section 3: Don't add yod if the NEXT consonant
# has shva (indicating shva nach on that consonant)
add_yod = True
if i + 1 < len(segments):
next_ch, next_marks = segments[i + 1]
next_vowel = _get_vowel(next_marks)
# Shva on next consonant = shva nach → don't add yod
# UNLESS next consonant also has dagesh (= shva na / doubled)
next_has_dagesh = _has_dagesh(next_marks)
if next_vowel == SHVA and not next_has_dagesh:
add_yod = False
# No vowel on next consonant (word-final) = closed syllable
# → don't add yod (e.g., suffix -תי -נו -תם)
elif next_vowel is None and _is_hebrew_letter(next_ch):
# Check if this is truly word-final or next-to-last
remaining_letters = sum(1 for j in range(i + 1, len(segments)) if _is_hebrew_letter(segments[j][0]))
if remaining_letters <= 2:
# Short suffix like תי, נו — don't add yod
add_yod = False
if add_yod:
result.append(YOD)
continue
# --- Rule D: E vowel (tsere/segol) → generally NO yod ---
# Exception (b): tsere before guttural/resh gets yod ONLY
# in word-initial position (dagesh substitution in Hif'il/noun patterns)
# e.g., הֵחֵל→היחל, תֵּאָבֵד→תיאבד, הֵרִיעַ→היריע
# but NOT mid-word: מְסַפֵּר→מספר, מְעַבֵּר→מעבר
if vowel == TSERE:
add_yod = False
if i + 1 < len(segments):
next_ch = segments[i + 1][0]
if next_ch in "אהחער":
# Only at word-initial (pos 0) or after prefix (pos 1)
# where dagesh substitution applies
hebrew_pos = sum(1 for j in range(i) if _is_hebrew_letter(segments[j][0]))
if hebrew_pos <= 1:
add_yod = True
if add_yod:
result.append(YOD)
continue
# All other vowels (patah, kamatz, segol, shva, hataf-*):
# No mater lectionis insertion needed
return "".join(result)

Binary file not shown.

348
pealim_audio_download.py Normal file
View file

@ -0,0 +1,348 @@
#!/usr/bin/env python3
"""Download audio files from URLs stored in words.json.
Three audio categories are handled:
1. Vocab audio data/audio/{audio_file}
2. Noun plural data/audio/{slug}_plural.mp3
3. Conjugation data/audio_conj/{slug}_{form_key}.mp3
data/audio_conj/{slug}_passive_{form_key}.mp3
"""
import argparse
import json
import logging
import re
import time
from pathlib import Path
import requests
logger = logging.getLogger(__name__)
DATA_DIR = Path(__file__).parent / "data"
AUDIO_DIR = DATA_DIR / "audio"
AUDIO_CONJ_DIR = DATA_DIR / "audio_conj"
WORDS_JSON = DATA_DIR / "words.json"
DOWNLOAD_DELAY = 0.3
MAX_RETRIES = 3
# Map Hebrew tense names to English prefixes for form_key construction.
# "מְקוֹר" (infinitive) is included for forward compatibility; it does not
# appear in the current dataset but the form_key collapses to bare "infinitive".
TENSE_TO_PREFIX = {
"הוֹוֶה": "present",
"עָבָר": "past",
"עָתִיד": "future",
"צִוּוּי": "imperative",
"מְקוֹר": "infinitive",
}
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _make_audio_file(entry: dict) -> str:
"""Derive the vocab audio filename when audio_file is absent.
Slug-based for confusable entries (slug contains the disambiguating ID),
consonant-only for all others.
Args:
entry: A words.json entry dict.
Returns:
Filename string, e.g. ``"1234-shalom.mp3"`` or ``"שלום.mp3"``.
"""
audio_file = entry.get("audio_file", "")
if audio_file:
return audio_file
# Fallback: use slug for confusables, ktiv_male for others
slug = entry.get("slug", "")
if entry.get("confusable_group"):
return f"{slug}.mp3"
ktiv_male = entry.get("word", {}).get("ktiv_male", "")
safe_name = re.sub(r"[^\u05d0-\u05ea]", "", ktiv_male)
return f"{safe_name}.mp3"
def _form_key(person: str, tense: str) -> str:
"""Build a filesystem-safe form key from person and tense fields.
Args:
person: Person code, e.g. ``"1s"``, ``"3fp"``, ``"ms"``.
tense: Hebrew tense string from the conjugation form.
Returns:
Form key such as ``"past_1s"`` or ``"present_ms"``.
Infinitive tense always returns ``"infinitive"`` (no person suffix).
"""
prefix = TENSE_TO_PREFIX.get(tense, tense)
if prefix == "infinitive":
return "infinitive"
return f"{prefix}_{person}"
def _download(url: str, dest: Path, session: requests.Session) -> bool:
"""Download *url* to *dest*, retrying up to MAX_RETRIES times.
Skips the download silently if *dest* already exists.
Args:
url: HTTP(S) URL to download.
dest: Local path to write the file to.
session: Shared requests session.
Returns:
``True`` if the file was downloaded (or already existed),
``False`` if all retries were exhausted.
"""
if dest.exists():
return True
for attempt in range(1, MAX_RETRIES + 1):
try:
resp = session.get(url, timeout=15)
resp.raise_for_status()
dest.write_bytes(resp.content)
logger.debug("Downloaded %s%s", url, dest.name)
return True
except requests.RequestException as exc:
wait = 2**attempt
if attempt < MAX_RETRIES:
logger.warning(
"Attempt %d/%d failed for %s (%s) — retrying in %ds",
attempt,
MAX_RETRIES,
url,
exc,
wait,
)
time.sleep(wait)
else:
logger.error("All %d attempts failed for %s: %s", MAX_RETRIES, url, exc)
return False
# ---------------------------------------------------------------------------
# Per-category downloaders
# ---------------------------------------------------------------------------
def download_vocab_audio(
entries: list[dict],
session: requests.Session,
) -> tuple[int, int, int]:
"""Download vocabulary audio files.
Args:
entries: List of words.json entry dicts.
session: Shared requests session.
Returns:
Tuple of (downloaded, cached, no_url) counts.
"""
downloaded = cached = no_url = 0
for entry in entries:
url: str | None = entry.get("audio_url")
if not url:
no_url += 1
continue
audio_file: str | None = entry.get("audio_file")
if not audio_file:
audio_file = _make_audio_file(entry)
dest = AUDIO_DIR / audio_file
if dest.exists():
cached += 1
continue
if _download(url, dest, session):
downloaded += 1
time.sleep(DOWNLOAD_DELAY)
else:
no_url += 1 # count persistent failures alongside missing URLs
return downloaded, cached, no_url
def download_noun_plural_audio(
entries: list[dict],
session: requests.Session,
) -> tuple[int, int]:
"""Download noun plural audio files.
Destination: ``data/audio/{slug}_plural.mp3``
Args:
entries: List of words.json entry dicts.
session: Shared requests session.
Returns:
Tuple of (downloaded, cached) counts.
"""
downloaded = cached = 0
for entry in entries:
ni = entry.get("noun_inflection")
if not ni or not isinstance(ni, dict):
continue
url: str | None = ni.get("plural_audio")
if not url or not url.startswith("http"):
continue
slug: str = entry["slug"]
dest = AUDIO_DIR / f"{slug}_plural.mp3"
if dest.exists():
cached += 1
continue
if _download(url, dest, session):
downloaded += 1
time.sleep(DOWNLOAD_DELAY)
return downloaded, cached
def download_conjugation_audio(
entries: list[dict],
session: requests.Session,
) -> tuple[int, int, int]:
"""Download conjugation form audio files.
Active forms ``data/audio_conj/{slug}_{form_key}.mp3``
Passive forms ``data/audio_conj/{slug}_passive_{form_key}.mp3``
Args:
entries: List of words.json entry dicts.
session: Shared requests session.
Returns:
Tuple of (downloaded, cached, failed) counts.
"""
downloaded = cached = failed = 0
for entry in entries:
conj = entry.get("conjugation")
if not conj:
continue
slug: str = entry["slug"]
form_sets: list[tuple[str, list]] = [
("", conj.get("active_forms") or []),
("passive_", conj.get("hufal_pual_forms") or []),
]
for prefix, forms in form_sets:
for form in forms:
url: str | None = form.get("audio_url")
if not url:
continue
key = _form_key(form.get("person", ""), form.get("tense", ""))
dest = AUDIO_CONJ_DIR / f"{slug}_{prefix}{key}.mp3"
if dest.exists():
cached += 1
continue
if _download(url, dest, session):
downloaded += 1
time.sleep(DOWNLOAD_DELAY)
else:
failed += 1
return downloaded, cached, failed
# ---------------------------------------------------------------------------
# Entry point
# ---------------------------------------------------------------------------
def main() -> None:
"""Parse CLI args and run the audio download pipeline."""
parser = argparse.ArgumentParser(description="Download Pealim audio files from words.json URLs.")
parser.add_argument(
"--skip-vocab",
action="store_true",
help="Skip vocabulary audio downloads.",
)
parser.add_argument(
"--skip-conj",
action="store_true",
help="Skip conjugation audio downloads.",
)
parser.add_argument(
"--test",
metavar="N",
type=int,
default=None,
help="Limit processing to the first N words.json entries.",
)
args = parser.parse_args()
logging.basicConfig(
level=logging.INFO,
format="%(message)s",
)
AUDIO_DIR.mkdir(parents=True, exist_ok=True)
AUDIO_CONJ_DIR.mkdir(parents=True, exist_ok=True)
with open(WORDS_JSON, encoding="utf-8") as fh:
raw: dict[str, dict] = json.load(fh)
entries = list(raw.values())
if args.test is not None:
entries = entries[: args.test]
logger.info("[4] Downloading audio files …")
session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (compatible; PealimAnkiDeck/1.0; audio-fetch)"
# --- Vocab ---
if not args.skip_vocab:
v_dl, v_cached, v_no_url = download_vocab_audio(entries, session)
else:
v_dl = v_cached = v_no_url = 0
# --- Noun plural ---
np_dl, np_cached = download_noun_plural_audio(entries, session)
# --- Conjugation ---
if not args.skip_conj:
c_dl, c_cached, c_failed = download_conjugation_audio(entries, session)
else:
c_dl = c_cached = c_failed = 0
# --- Summary ---
if not args.skip_vocab:
logger.info(
" Vocab: %d downloaded, %d cached, %d no URL",
v_dl,
v_cached,
v_no_url,
)
logger.info(" Noun plural: %d downloaded, %d cached", np_dl, np_cached)
if not args.skip_conj:
failed_msg = f", {c_failed} failed" if c_failed else ""
logger.info(
" Conjugation: %d downloaded, %d cached%s",
c_dl,
c_cached,
failed_msg,
)
if __name__ == "__main__":
main()

1593
pealim_detail_scrape.py Normal file

File diff suppressed because it is too large Load diff

File diff suppressed because it is too large Load diff

File diff suppressed because it is too large Load diff

View file

@ -1,187 +0,0 @@
#!/usr/bin/env python3
"""
Extract Hebrew vocabulary from pealim.com dictionary.
Scrapes word entries, roots, and parts of speech for Anki flashcards.
"""
import requests
import pandas as pd
import logging
import time
from typing import Optional
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
# Session for connection pooling
session = requests.Session()
session.headers.update({
'User-Agent': 'Mozilla/5.0 (compatible; pealim-scraper/1.0)'
})
PEALIM_DICT_URL = "https://www.pealim.com/dict/"
REQUEST_DELAY = 1.5 # seconds between requests (respectful scraping)
REQUEST_TIMEOUT = 10 # seconds
def get_total_pages() -> int:
"""Dynamically determine total pages from first request."""
try:
logger.info("Fetching total page count...")
cookies = {'translit': 'none', 'hebstyle': 'mo'}
response = session.get(PEALIM_DICT_URL, cookies=cookies, timeout=REQUEST_TIMEOUT)
response.raise_for_status()
dfs = pd.read_html(response.content)
if dfs:
# Estimate pages from first page (typically 15 words per page)
# For now, use hardcoded value but this could be improved
return 608
except Exception as e:
logger.error(f"Error fetching page count: {e}. Using default (608).")
return 608
def extract_from_website(max_pages: Optional[int] = None) -> pd.DataFrame:
"""
Extract dictionary entries from pealim.com.
Args:
max_pages: Maximum pages to scrape (None = all)
Returns:
DataFrame with Word, Root, Part of Speech, and Word Without Nikkud columns
"""
total_pages = max_pages or get_total_pages()
logger.info(f"Starting extraction from {total_pages} pages...")
df = pd.DataFrame()
for page_num in range(1, total_pages):
try:
url = f"{PEALIM_DICT_URL}?page={page_num}"
# First request: with nikkud
cookies = {'translit': 'none', 'hebstyle': 'mo'}
response = session.get(url, cookies=cookies, timeout=REQUEST_TIMEOUT)
response.raise_for_status()
df_list = pd.read_html(response.content)
# Second request: without nikkud
cookies = {'translit': 'none', 'hebstyle': 'vl', 'showmeaning': 'off'}
response = session.get(url, cookies=cookies, timeout=REQUEST_TIMEOUT)
response.raise_for_status()
without_nikkud_words = pd.read_html(response.content)[-1]['Word']
without_nikkud_words = without_nikkud_words.rename('Word Without Nikkud')
# Combine and append
df_to_add = pd.concat([df_list[-1], without_nikkud_words], axis=1)
df = pd.concat([df, df_to_add], ignore_index=True)
if page_num % 50 == 0:
logger.info(f"Processed {page_num}/{total_pages} pages...")
time.sleep(REQUEST_DELAY)
except requests.RequestException as e:
logger.error(f"Error fetching page {page_num}: {e}. Retrying...")
time.sleep(REQUEST_DELAY * 2)
except Exception as e:
logger.error(f"Unexpected error on page {page_num}: {e}")
continue
logger.info(f"Extraction complete. Total words: {len(df)}")
return df
def modify_for_anki(df: pd.DataFrame) -> pd.DataFrame:
"""
Transform dictionary DataFrame for Anki import.
Adds shared root words and Hebrew tags.
Args:
df: Dictionary DataFrame
Returns:
Modified DataFrame ready for Anki
"""
logger.info("Preparing data for Anki...")
# Find shared root words
shared_root_words = []
for idx, row in df.iterrows():
root = row['Root']
word = row['Word']
if root != '-' and pd.notna(root):
# Find other words with same root
same_root = df[(df['Root'] == root) & (df['Word'] != word)]['Word'].values
shared = ' '.join(str(w) for w in same_root)
shared_root_words.append(shared)
else:
shared_root_words.append('')
df['shared roots'] = shared_root_words
# Generate Hebrew tags
tags = []
for idx, row in df.iterrows():
tag_parts = []
# Root tag
root = str(row['Root']).replace(' ', '').replace('-', '')
if 'nan' not in root and root:
root_clean = root.replace('.', '')
tag_parts.append(f"שורש::{root_clean}")
# Part of speech tag
pos = str(row['Part of Speech'])
pos_tags = {
'Adverb': 'תוארי_הפועל',
'Pronoun': 'כינוייוף',
'Noun': 'שם_עצם',
'Verb': 'פעלים',
'Adjective': 'שם_תואר',
'Preposition': 'מילות_יחס',
'Conjunction': 'מילות_חיבור',
'Particle': 'מילית'
}
for key, value in pos_tags.items():
if key in pos:
tag_parts.append(value)
break
tags.append(' '.join(tag_parts))
df['tags'] = tags
logger.info("Anki preparation complete.")
return df
def main():
"""Main entry point."""
try:
# Extract from website
df = extract_from_website()
df.to_csv('pealim_dict.csv', index=True)
logger.info("Saved: pealim_dict.csv")
# Transform for Anki
df = modify_for_anki(df)
df.to_csv('pealim_dict_for_anki.csv', sep=';', index=True)
logger.info("Saved: pealim_dict_for_anki.csv")
logger.info("✅ Complete!")
except Exception as e:
logger.error(f"Fatal error: {e}")
raise
if __name__ == '__main__':
main()

714
pealim_list_scrape.py Normal file
View file

@ -0,0 +1,714 @@
#!/usr/bin/env python3
"""
Consolidated list page scraper for pealim.com.
Scrapes /dict/?page=N with two cookie variants (hebstyle=mo for nikkud,
hebstyle=vl for ktiv male) and writes results directly to data/words.json.
Usage:
python3 pealim_list_scrape.py [--test N] [--force-refresh]
"""
import argparse
import json
import logging
import os
import re
import time
from datetime import date
from pathlib import Path
import requests
from bs4 import BeautifulSoup
# ---------------------------------------------------------------------------
# Paths
# ---------------------------------------------------------------------------
PROJECT_ROOT = Path(__file__).parent
DATA_DIR = PROJECT_ROOT / "data"
WORDS_JSON = DATA_DIR / "words.json"
PROGRESS_JSON = DATA_DIR / "list_scrape_progress.json"
# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------
PEALIM_DICT_URL = "https://www.pealim.com/dict/"
REQUEST_DELAY = 1.5 # seconds between requests
REQUEST_TIMEOUT = 15 # seconds
DEFAULT_TOTAL_PAGES = 608
SAVE_EVERY = 10 # pages between incremental saves
TODAY = date.today().isoformat()
# Prefer lxml if available; html.parser is the fallback
try:
import lxml # type: ignore[import-untyped] # noqa: F401
BS4_PARSER = "lxml"
except ImportError:
BS4_PARSER = "html.parser"
# ---------------------------------------------------------------------------
# Part-of-speech mappings
# ---------------------------------------------------------------------------
POS_HEBREW: dict[str, str] = {
"Noun": "שֵׁם עֶצֶם",
"Verb": "פֹּעַל",
"Adjective": "שֵׁם תֹּאַר",
"Adverb": "תֹּאַר הַפֹּעַל",
"Pronoun": "כִּנּוּי גּוּף",
"Preposition": "מִילַּת יַחַס",
"Conjunction": "מִילַּת חִבּוּר",
"Interjection": "מִילַּת קְרִיאָה",
"Numeral": "שֵׁם מִסְפָּר",
"Cardinal numeral": "שֵׁם מִסְפָּר",
"Particle": "מִילִּית",
"Determiner": "מְגַדִּיר",
"Existential": "מִילַּת קִיּוּם",
"Interrogative": "מִילַּת שְׁאֵלָה",
}
# Use exact match on the POS string prefix; longer keys must be checked first.
POS_HEBREW_ORDERED: list[tuple[str, str]] = sorted(POS_HEBREW.items(), key=lambda x: -len(x[0]))
BINYAN_HEBREW: dict[str, str] = {
"Pa'al": "פָּעַל",
"Nif'al": "נִפְעַל",
"Pi'el": "פִּיעֵל",
"Pu'al": "פֻּעַל",
"Hif'il": "הִפְעִיל",
"Huf'al": "הֻפְעַל",
"Hitpa'el": "הִתְפַּעֵל",
}
# Regex for extracting emoji characters
EMOJI_RE = re.compile(
r"[\U0001F300-\U0001FFFF\U00002600-\U000027BF\U0001F000-\U0001F9FF\u2600-\u26FF\u2700-\u27BF\uFE0E\uFE0F\u200D]+",
re.UNICODE,
)
# Regex for extracting Hebrew prepositions wrapped in parentheses, e.g. "(על)" or "(ב-)"
HBPAREN_RE = re.compile(r"\(([\u05b0-\u05ea\u05f0-\u05f4\-]+)\)")
# Fields that must never be overwritten when updating an existing entry
PROTECTED_FIELDS = frozenset(
[
"vocab_legacy_guid",
"confusables_guid",
"frequency",
"pseudo_frequency",
"emoji",
"emoji_source",
"emoji_visible",
"image",
"image_source",
"hint",
"examples",
"noun_inflection",
"conjugation",
"adjective_inflection",
"preposition_inflection",
]
)
# ---------------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------------
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# HTTP session
# ---------------------------------------------------------------------------
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; pealim-anki-scraper/1.0)"})
# ---------------------------------------------------------------------------
# Default entry template
# ---------------------------------------------------------------------------
def _default_entry() -> dict:
"""Return a fresh entry with all fields initialised to safe defaults."""
return {
"word": {"nikkud": "", "ktiv_male": ""},
"slug": "",
"root": [],
"pos": "",
"pos_hebrew": "",
"meaning": "",
"meaning_raw": "",
"audio_url": "",
"audio_file": "",
"tags": "",
"last_scrape_date": "",
"vocab_legacy_guid": None,
"frequency": None,
"pseudo_frequency": None,
"emoji": None,
"emoji_source": None,
"emoji_visible": False,
"image": None,
"image_source": None,
"hint": "",
"prep": None,
"shared_roots": [],
"confusable_group": None,
"confusables_guid": None,
"examples": None,
"noun_inflection": None,
"conjugation": None,
"adjective_inflection": None,
"preposition_inflection": None,
}
# ---------------------------------------------------------------------------
# Parsing helpers
# ---------------------------------------------------------------------------
def _extract_emoji(text: str) -> str | None:
"""Return the first emoji run found in *text*, or None."""
m = EMOJI_RE.search(text)
return m.group(0) if m else None
def _clean_meaning(raw: str) -> str:
"""Strip emoji, Hebrew parenthesized prepositions, and extra whitespace from a raw meaning string."""
cleaned = EMOJI_RE.sub("", raw)
cleaned = HBPAREN_RE.sub("", cleaned)
return " ".join(cleaned.split())
def _parse_pos(pos_raw: str) -> tuple[str, str]:
"""
Parse raw PoS string into (pos_en, pos_hebrew).
Examples:
"Noun masculine" ("Noun", "שֵׁם עֶצֶם")
"Verb pa'al" ("Verb", "פֹּעַל — פָּעַל")
"Cardinal numeral" ("Cardinal numeral", "שֵׁם מִסְפָּר")
"""
# Strip leading/trailing whitespace; normalise dashes
pos_clean = pos_raw.strip()
# Determine the base English PoS with longest-match strategy
pos_en = ""
for key, _ in POS_HEBREW_ORDERED:
if pos_clean.startswith(key):
pos_en = key
break
if not pos_en:
# Fallback: take everything up to " " or the full string
pos_en = pos_clean.split(" ")[0].split(" - ")[0].strip()
pos_heb = POS_HEBREW.get(pos_en, pos_en)
# For verbs, attempt to append binyan
if pos_en == "Verb":
# Look for binyan after dash; pealim uses "Verb pa'al"
dash_parts = re.split(r"\s*[-]\s*", pos_clean)
if len(dash_parts) >= 2:
binyan_raw = dash_parts[1].strip()
# Normalise capitalisation for lookup: "pa'al" → "Pa'al"
binyan_key = binyan_raw.capitalize()
# Handle mixed-case entries like "Nif'al"
for bkey in BINYAN_HEBREW:
if bkey.lower() == binyan_raw.lower():
binyan_key = bkey
break
binyan_heb = BINYAN_HEBREW.get(binyan_key)
if binyan_heb:
pos_heb = f"{pos_heb}{binyan_heb}"
return pos_en, pos_heb
def _parse_root(root_raw: str) -> list[str]:
"""
Convert raw root text to a list of consonants.
Pealim shows roots as "פ - ע - ל" or "פ.ע.ל" or "" (no root).
"""
if not root_raw or root_raw in ("-", "", ""):
return []
# Split on " - " or "." separators
parts = re.split(r"\s*[-–—.]\s*", root_raw.strip())
return [p.strip() for p in parts if p.strip()]
def _build_tags(pos_en: str, root: list[str]) -> str:
"""
Generate Anki tags string matching the existing project convention.
Examples:
pos=Noun, root=[] "שם_עצם"
pos=Noun, root=["א","ב"] "שורש::אב שם_עצם"
pos=Verb, root=["שמר"] "שורש::שמר פעלים"
"""
pos_tag_map = {
"Noun": "שם_עצם",
"Verb": "פעלים",
"Adjective": "שם_תואר",
"Adverb": "תוארי_הפועל",
"Pronoun": "כינוייוף",
"Preposition": "מילות_יחס",
"Conjunction": "מילות_חיבור",
"Particle": "מילית",
"Numeral": "שם_מספר",
"Cardinal numeral": "שם_מספר",
"Determiner": "מגדיר",
"Existential": "מילת_קיום",
"Interrogative": "מילת_שאלה",
"Interjection": "מילת_קריאה",
}
parts: list[str] = []
if root:
root_str = "".join(root)
parts.append(f"שורש::{root_str}")
pos_heb_tag = pos_tag_map.get(pos_en, "")
if pos_heb_tag:
parts.append(pos_heb_tag)
return " ".join(parts)
def _compute_audio_file(slug: str, ktiv_male: str) -> str:
"""
Return the local audio filename for an entry.
The actual confusable detection happens later (after all pages are scraped);
here we store a placeholder that post_process() will correct.
We default to the consonant-based name; confusables get slug-based names.
"""
consonants = ktiv_male or ""
return f"{consonants}.mp3" if consonants else f"{slug}.mp3"
# ---------------------------------------------------------------------------
# Page parsing
# ---------------------------------------------------------------------------
def _parse_mo_page(html: bytes) -> list[dict]:
"""
Parse a hebstyle=mo (nikkud) list page.
Returns a list of raw row dicts with keys:
nikkud, slug, root_raw, pos_raw, meaning_raw, audio_url
"""
soup = BeautifulSoup(html, BS4_PARSER)
rows: list[dict] = []
for tr in soup.select("table tr"):
tds = tr.find_all("td")
if len(tds) < 4:
continue
# Audio URL
audio_span = tds[0].find(attrs={"data-audio": True})
audio_url: str = audio_span["data-audio"] if audio_span else ""
# Slug
slug = ""
link = tds[0].find("a", href=True)
if link:
m = re.search(r"/dict/([^/]+)/", link["href"])
if m:
slug = m.group(1)
# Nikkud word
menukad = tds[0].find("span", class_="menukad")
nikkud = menukad.get_text(strip=True) if menukad else tds[0].get_text(strip=True)
root_raw = tds[1].get_text(strip=True)
pos_raw = tds[2].get_text(strip=True)
meaning_raw = tds[3].get_text(strip=True)
if nikkud:
rows.append(
{
"nikkud": nikkud,
"slug": slug,
"root_raw": root_raw,
"pos_raw": pos_raw,
"meaning_raw": meaning_raw,
"audio_url": audio_url,
}
)
return rows
def _parse_vl_words(html: bytes) -> list[str]:
"""
Parse a hebstyle=vl (ktiv male) list page.
Returns ordered list of ktiv male strings (one per table row).
"""
soup = BeautifulSoup(html, BS4_PARSER)
words: list[str] = []
for tr in soup.select("table tr"):
tds = tr.find_all("td")
if len(tds) < 4:
continue
menukad = tds[0].find("span", class_="menukad")
word = menukad.get_text(strip=True) if menukad else tds[0].get_text(strip=True)
words.append(word)
return words
# ---------------------------------------------------------------------------
# words.json I/O
# ---------------------------------------------------------------------------
def _load_words() -> dict:
"""Load words.json; return empty dict if missing."""
if not WORDS_JSON.exists():
logger.info("data/words.json not found — starting fresh.")
return {}
with WORDS_JSON.open(encoding="utf-8") as fh:
return json.load(fh)
def _save_words(words: dict) -> None:
"""Atomically write words to words.json via a .tmp file."""
tmp = WORDS_JSON.with_suffix(".json.tmp")
with tmp.open("w", encoding="utf-8") as fh:
json.dump(words, fh, ensure_ascii=False, indent=2)
os.replace(tmp, WORDS_JSON)
logger.info("Saved data/words.json (%d entries)", len(words))
# ---------------------------------------------------------------------------
# Progress tracking
# ---------------------------------------------------------------------------
def _load_progress() -> set[int]:
"""Return set of already-completed page numbers."""
if not PROGRESS_JSON.exists():
return set()
with PROGRESS_JSON.open(encoding="utf-8") as fh:
data = json.load(fh)
return set(data.get("completed_pages", []))
def _save_progress(completed: set[int]) -> None:
"""Atomically write progress file."""
tmp = PROGRESS_JSON.with_suffix(".json.tmp")
with tmp.open("w", encoding="utf-8") as fh:
json.dump({"completed_pages": sorted(completed)}, fh)
os.replace(tmp, PROGRESS_JSON)
# ---------------------------------------------------------------------------
# Unique key generation
# ---------------------------------------------------------------------------
def _make_unique_key(nikkud: str, pos_en: str, meaning: str, existing_keys: set[str]) -> str:
"""
Generate a collision-free unique key for a new entry.
Escalation:
1. nikkud
2. nikkud|pos_en
3. nikkud|pos_en|meaning
4. nikkud|pos_en|meaning|N (N = 2, 3, )
"""
candidate = nikkud
if candidate not in existing_keys:
return candidate
candidate = f"{nikkud}|{pos_en}"
if candidate not in existing_keys:
return candidate
candidate = f"{nikkud}|{pos_en}|{meaning}"
if candidate not in existing_keys:
return candidate
n = 2
while True:
candidate = f"{nikkud}|{pos_en}|{meaning}|{n}"
if candidate not in existing_keys:
return candidate
n += 1
# ---------------------------------------------------------------------------
# Core: merge one scraped row into words dict
# ---------------------------------------------------------------------------
def _merge_row(
words: dict,
slug_index: dict[str, str],
nikkud: str,
ktiv_male: str,
slug: str,
root_raw: str,
pos_raw: str,
meaning_raw_raw: str,
audio_url: str,
) -> None:
"""
Upsert a single scraped row into *words* in-place.
*slug_index* maps slug unique_key for fast lookup and is updated here
when a new entry is created.
"""
# Derived fields
pos_en, pos_heb = _parse_pos(pos_raw)
root = _parse_root(root_raw)
meaning_raw = meaning_raw_raw
meaning = _clean_meaning(meaning_raw)
emoji = _extract_emoji(meaning_raw_raw)
tags = _build_tags(pos_en, root)
audio_file = _compute_audio_file(slug, ktiv_male)
# Extract Hebrew preposition(s) from the raw meaning (e.g. "(על)" → "על")
prep_matches = HBPAREN_RE.findall(meaning_raw)
prep: str | None = " ".join(prep_matches) if prep_matches else None
# ---- locate existing entry ----
unique_key: str | None = slug_index.get(slug) if slug else None
if unique_key and unique_key in words:
# Update list-level fields only; never touch protected fields
entry = words[unique_key]
entry["word"]["nikkud"] = nikkud
entry["word"]["ktiv_male"] = ktiv_male
entry["slug"] = slug
entry["root"] = root
entry["pos"] = pos_en
entry["pos_hebrew"] = pos_heb
entry["meaning"] = meaning
entry["meaning_raw"] = meaning_raw
entry["prep"] = prep
entry["audio_url"] = audio_url
entry["audio_file"] = audio_file
entry["tags"] = tags
entry["last_scrape_date"] = TODAY
else:
# Create new entry
unique_key = _make_unique_key(nikkud, pos_en, meaning, set(words.keys()))
entry = _default_entry()
entry["word"]["nikkud"] = nikkud
entry["word"]["ktiv_male"] = ktiv_male
entry["slug"] = slug
entry["root"] = root
entry["pos"] = pos_en
entry["pos_hebrew"] = pos_heb
entry["meaning"] = meaning
entry["meaning_raw"] = meaning_raw
entry["prep"] = prep
entry["emoji"] = emoji
entry["emoji_source"] = "from_pealim" if emoji else None
entry["audio_url"] = audio_url
entry["audio_file"] = audio_file
entry["tags"] = tags
entry["last_scrape_date"] = TODAY
words[unique_key] = entry
if slug:
slug_index[slug] = unique_key
# ---------------------------------------------------------------------------
# Post-processing: recompute confusable_group, shared_roots, audio_file
# ---------------------------------------------------------------------------
def _post_process(words: dict) -> None:
"""
After all pages are scraped, recompute derived cross-entry fields:
- confusable_group: entries sharing the same ktiv_male (2+)
- shared_roots: entries sharing the same root (excluding self)
- audio_file: slug-based for confusables, consonant-based otherwise
"""
logger.info("Post-processing: recomputing confusable groups and shared roots...")
# --- confusable groups ---
ktiv_to_keys: dict[str, list[str]] = {}
for key, entry in words.items():
ktiv = entry.get("word", {}).get("ktiv_male", "")
if ktiv:
ktiv_to_keys.setdefault(ktiv, []).append(key)
for _, entry in words.items():
ktiv = entry.get("word", {}).get("ktiv_male", "")
group = ktiv_to_keys.get(ktiv, [])
if len(group) >= 2:
entry["confusable_group"] = sorted(group)
# Confusable → slug-based audio filename
slug = entry.get("slug", "")
if slug:
entry["audio_file"] = f"{slug}.mp3"
else:
# Only clear confusable_group if it wasn't set by enrichment (i.e. no confusables_guid)
if not entry.get("confusables_guid"):
entry["confusable_group"] = None
# Non-confusable → consonant-based audio filename
ktiv_male = entry.get("word", {}).get("ktiv_male", "")
consonants = ktiv_male or ""
slug = entry.get("slug", "")
entry["audio_file"] = f"{consonants}.mp3" if consonants else f"{slug}.mp3"
# --- shared roots ---
root_to_keys: dict[str, list[str]] = {}
for key, entry in words.items():
root = entry.get("root")
if root:
root_str = "|".join(root) # canonical form for grouping
root_to_keys.setdefault(root_str, []).append(key)
for key, entry in words.items():
root = entry.get("root")
if root:
root_str = "|".join(root)
siblings = root_to_keys.get(root_str, [])
entry["shared_roots"] = sorted(k for k in siblings if k != key)
else:
entry["shared_roots"] = []
logger.info("Post-processing complete.")
# ---------------------------------------------------------------------------
# Scraping loop
# ---------------------------------------------------------------------------
def _build_slug_index(words: dict) -> dict[str, str]:
"""Build slug → unique_key lookup from the current words dict."""
index: dict[str, str] = {}
for key, entry in words.items():
slug = entry.get("slug", "")
if slug and slug not in index:
index[slug] = key
return index
def _fetch_page(url: str, cookies: dict) -> bytes | None:
"""Fetch a single page; return raw bytes or None on failure."""
try:
resp = session.get(url, cookies=cookies, timeout=REQUEST_TIMEOUT)
resp.raise_for_status()
return resp.content
except requests.RequestException as exc:
logger.error("Request failed for %s: %s", url, exc)
return None
def run_scrape(total_pages: int, force_refresh: bool) -> None:
"""
Main scrape loop.
Args:
total_pages: Number of list pages to scrape.
force_refresh: If True, ignore progress file and re-scrape all pages.
"""
words = _load_words()
slug_index = _build_slug_index(words)
completed = set() if force_refresh else _load_progress()
if force_refresh and completed:
logger.info("--force-refresh: ignoring %d completed pages.", len(completed))
pages_to_do = [p for p in range(1, total_pages + 1) if p not in completed]
logger.info(
"Pages to scrape: %d / %d (already done: %d)",
len(pages_to_do),
total_pages,
len(completed),
)
pages_since_save = 0
for page_num in pages_to_do:
url = f"{PEALIM_DICT_URL}?page={page_num}"
logger.info("Scraping page %d / %d", page_num, total_pages)
# --- hebstyle=mo (nikkud + audio + slug) ---
mo_html = _fetch_page(url, {"translit": "none", "hebstyle": "mo"})
if mo_html is None:
logger.warning("Skipping page %d (mo fetch failed).", page_num)
time.sleep(REQUEST_DELAY * 2)
continue
time.sleep(REQUEST_DELAY)
# --- hebstyle=vl (ktiv male) ---
vl_html = _fetch_page(url, {"translit": "none", "hebstyle": "vl"})
if vl_html is None:
logger.warning("Skipping page %d (vl fetch failed).", page_num)
time.sleep(REQUEST_DELAY * 2)
continue
# Parse
mo_rows = _parse_mo_page(mo_html)
vl_words = _parse_vl_words(vl_html)
if not mo_rows:
logger.warning("Page %d returned no rows — might be past end.", page_num)
completed.add(page_num)
_save_progress(completed)
time.sleep(REQUEST_DELAY)
continue
# Merge each row
for i, row in enumerate(mo_rows):
ktiv_male = vl_words[i] if i < len(vl_words) else ""
_merge_row(
words=words,
slug_index=slug_index,
nikkud=row["nikkud"],
ktiv_male=ktiv_male,
slug=row["slug"],
root_raw=row["root_raw"],
pos_raw=row["pos_raw"],
meaning_raw_raw=row["meaning_raw"],
audio_url=row["audio_url"],
)
completed.add(page_num)
pages_since_save += 1
# Incremental save every SAVE_EVERY pages
if pages_since_save >= SAVE_EVERY:
_save_words(words)
_save_progress(completed)
pages_since_save = 0
time.sleep(REQUEST_DELAY)
# Final save + post-processing
logger.info("All pages scraped. Running post-processing…")
_post_process(words)
_save_words(words)
_save_progress(completed)
logger.info("Done. Total entries in words.json: %d", len(words))
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def main() -> None:
"""Entry point."""
parser = argparse.ArgumentParser(description="Scrape pealim.com list pages into data/words.json.")
parser.add_argument(
"--test",
metavar="N",
type=int,
default=None,
help="Scrape only the first N pages (for testing).",
)
parser.add_argument(
"--force-refresh",
action="store_true",
default=False,
help="Re-scrape all pages, ignoring existing progress.",
)
args = parser.parse_args()
total_pages = args.test if args.test is not None else DEFAULT_TOTAL_PAGES
logger.info(
"Starting pealim list scraper | pages=%d | force=%s | parser=%s",
total_pages,
args.force_refresh,
BS4_PARSER,
)
run_scrape(total_pages=total_pages, force_refresh=args.force_refresh)
if __name__ == "__main__":
main()

83
pyproject.toml Normal file
View file

@ -0,0 +1,83 @@
[project]
name = "hebrew-flash-cards"
version = "0.13"
description = "Hebrew vocabulary & verb conjugation flashcards for Anki"
requires-python = ">=3.11"
dependencies = [
"beautifulsoup4>=4.11.0",
"genanki>=0.8.0",
"lxml>=4.9.0",
"numpy>=1.21.0",
"pandas>=1.3.0",
"pymupdf>=1.23.0",
"pypdf>=3.0.0",
"python-bidi>=0.4.2",
"requests>=2.26.0",
]
[project.optional-dependencies]
dev = [
"bandit",
"pytest",
"ruff",
"vulture",
]
[tool.pytest.ini_options]
testpaths = ["tests"]
markers = [
"integration: marks tests that hit the real pealim.com network (deselect with -m 'not integration')",
]
[tool.ruff]
target-version = "py311"
line-length = 120
exclude = [
"lib/",
"bin/",
"include/",
"lib64/",
"archive/",
"venv/",
]
[tool.ruff.lint]
select = [
"E", # pycodestyle errors
"W", # pycodestyle warnings
"F", # pyflakes
"I", # isort
"UP", # pyupgrade
"B", # flake8-bugbear
"SIM", # flake8-simplify
"PIE", # flake8-pie
"T20", # flake8-print (flag print statements)
"RET", # flake8-return
"C4", # flake8-comprehensions
"S", # flake8-bandit (security)
]
ignore = [
"T201", # allow print() — this is a CLI tool, not a library
"S603", # subprocess call with shell=False is fine
"S607", # partial executable path is fine for CLI tools
"S105", # PASS = "✓" is not a password
"S108", # /tmp paths are intentional for temp downloads
"S311", # random.Random() is for card ordering, not crypto
"E501", # line too long — handled by formatter
]
[tool.ruff.lint.per-file-ignores]
"test_*.py" = ["S101"] # allow assert in tests
[tool.ruff.format]
quote-style = "double"
indent-style = "space"
[tool.vulture]
paths = ["."]
exclude = ["lib/", "bin/", "include/", "lib64/", "venv/", "archive/"]
min_confidence = 80
[tool.bandit]
exclude_dirs = ["lib", "bin", "include", "lib64", "venv", "archive"]
skips = ["B101"] # allow assert

208
release.py Normal file
View file

@ -0,0 +1,208 @@
#!/usr/bin/env python3
"""
Create a Forgejo release and upload all .apkg deck variants.
Usage:
python3 release.py # uses RELEASE_TAG from apkg_builder.py
python3 release.py v0.14 # explicit tag
python3 release.py --dry-run # show what would be uploaded without doing it
python3 release.py --validate # run validate_apkg.py first, abort on failure
Requires:
FORGEJO_TOKEN env var or hardcoded token below.
Git tag must not already exist (creates tag + release).
"""
import argparse
import subprocess
import sys
from pathlib import Path
import requests
sys.path.insert(0, "/home/node/projects")
import load_keeshare
REPO_API = "https://git.nevo.engineer/api/v1/repos/nevo/hebrew_flash_cards"
FORGEJO_TOKEN: str = load_keeshare.get_entry("git.nevo.engineer")["password"]
OUTPUT_DIR = Path(__file__).parent / "output"
# All deck variants to include in release
DECK_PREFIX = "hebrew_"
DECK_VARIANTS = [
"hebrew_vocabulary.apkg",
"hebrew_vocabulary_audio.apkg",
"hebrew_vocabulary_images.apkg",
"hebrew_vocabulary_audio_images.apkg",
"hebrew_conjugations.apkg",
"hebrew_conjugations_audio.apkg",
"hebrew_confusables.apkg",
"hebrew_confusables_audio.apkg",
"hebrew_plurals.apkg",
"hebrew_plurals_audio.apkg",
"hebrew_complete.apkg",
"hebrew_complete_audio.apkg",
]
def get_release_tag() -> str:
"""Import RELEASE_TAG from apkg_builder."""
sys.path.insert(0, str(Path(__file__).parent))
from apkg_builder import RELEASE_TAG
return RELEASE_TAG
def api(method: str, endpoint: str, **kwargs) -> requests.Response:
url = f"{REPO_API}{endpoint}"
headers = {"Authorization": f"token {FORGEJO_TOKEN}"}
resp = requests.request(method, url, headers=headers, timeout=30, **kwargs)
resp.raise_for_status()
return resp
def tag_exists(tag: str) -> bool:
try:
api("GET", f"/tags/{tag}")
return True
except requests.HTTPError:
return False
def release_exists(tag: str) -> dict | None:
try:
resp = api("GET", f"/releases/tags/{tag}")
return resp.json()
except requests.HTTPError:
return None
def create_git_tag(tag: str) -> None:
subprocess.run(["git", "tag", tag], check=True)
subprocess.run(["git", "push", "origin", tag], check=True)
print(f" Created and pushed tag: {tag}")
def create_release(tag: str, assets: list[Path]) -> int:
"""Create release, return release ID."""
# Build release body from deck file sizes
lines = ["## Deck Variants\n", "| File | Size |", "|------|------|"]
for p in sorted(assets):
size_mb = p.stat().st_size / 1_048_576
lines.append(f"| {p.name} | {size_mb:.1f} MB |")
body = "\n".join(lines)
data = {
"tag_name": tag,
"name": f"{tag} — Hebrew Flash Cards",
"body": body,
"draft": False,
"prerelease": False,
}
resp = api("POST", "/releases", json=data)
release_id = resp.json()["id"]
print(f" Created release: {tag} (ID {release_id})")
return release_id
def delete_release_assets(release_id: int) -> int:
"""Delete all existing assets on a release. Returns count deleted."""
resp = api("GET", f"/releases/{release_id}/assets")
assets = resp.json()
for asset in assets:
api("DELETE", f"/releases/{release_id}/assets/{asset['id']}")
return len(assets)
def upload_assets(release_id: int, assets: list[Path]) -> None:
for p in sorted(assets):
size_mb = p.stat().st_size / 1_048_576
print(f" Uploading {p.name} ({size_mb:.1f} MB) ... ", end="", flush=True)
with open(p, "rb") as f:
api(
"POST",
f"/releases/{release_id}/assets?name={p.name}",
files={"attachment": (p.name, f, "application/octet-stream")},
)
print("ok")
def validate_decks() -> bool:
"""Run validate_apkg.py, return True if all checks pass."""
result = subprocess.run(
[sys.executable, "validate_apkg.py"],
capture_output=True,
text=True,
)
print(result.stdout)
if result.returncode != 0:
print(result.stderr)
return result.returncode == 0
def main():
parser = argparse.ArgumentParser(description="Create Forgejo release with deck assets")
parser.add_argument("tag", nargs="?", help="Release tag (default: from apkg_builder.RELEASE_TAG)")
parser.add_argument("--dry-run", action="store_true", help="Show what would be done without doing it")
parser.add_argument("--validate", action="store_true", help="Run validate_apkg.py before releasing")
parser.add_argument("--force", action="store_true", help="Re-upload assets if release already exists")
args = parser.parse_args()
tag = args.tag or get_release_tag()
print(f"Release tag: {tag}")
# Collect assets
assets = [OUTPUT_DIR / name for name in DECK_VARIANTS]
missing = [p for p in assets if not p.exists()]
if missing:
print("\nERROR: Missing deck files:")
for p in missing:
print(f" {p}")
print("\nRun the build pipeline first: python3 run.py --skip-scrape")
sys.exit(1)
print(f"Assets: {len(assets)} deck files")
total_mb = sum(p.stat().st_size for p in assets) / 1_048_576
print(f"Total size: {total_mb:.0f} MB")
if args.validate:
print("\nValidating decks ...")
if not validate_decks():
print("ERROR: Validation failed. Aborting release.")
sys.exit(1)
print("Validation passed.\n")
if args.dry_run:
print("\n[DRY RUN] Would upload:")
for p in sorted(assets):
size_mb = p.stat().st_size / 1_048_576
print(f" {p.name} ({size_mb:.1f} MB)")
print(f"\n[DRY RUN] Tag: {tag}")
return
# Check if release already exists
existing = release_exists(tag)
if existing and not args.force:
print(f"\nRelease {tag} already exists (ID {existing['id']}).")
print("Use --force to delete existing assets and re-upload.")
sys.exit(1)
if existing and args.force:
release_id = existing["id"]
deleted = delete_release_assets(release_id)
print(f" Deleted {deleted} existing assets from release {tag}")
else:
# Create tag if needed
if not tag_exists(tag):
create_git_tag(tag)
release_id = create_release(tag, assets)
# Upload
print(f"\nUploading {len(assets)} files ...")
upload_assets(release_id, assets)
print(f"\nDone. Release: https://git.nevo.engineer/nevo/hebrew_flash_cards/releases/tag/{tag}")
if __name__ == "__main__":
main()

531
run.py
View file

@ -6,14 +6,24 @@ Usage:
python run.py [options] python run.py [options]
Options: Options:
--only {vocab,conjugations} Run only one deck (skips all unrelated steps) --only {vocab,conjugations,confusables,plurals,complete} Run only one deck
--skip-scrape Use existing data/pealim_dict.csv (no pealim.com dict scraping) Pipeline steps:
1. List scrape scrape pealim.com list pages words.json (captures slugs)
2. Detail scrape scrape noun/verb detail pages using slugs words.json
3. Frequency load/download word frequency data
4. Examples extract example sentences from Hebrew EPUBs
5. Audio download download audio mp3 files
6. Fonts download Heebo font files
7. Images fetch noun images from Wikipedia
8. Build build all .apkg deck variants
Options:
--skip-scrape Skip list page scraping (use existing words.json)
--skip-detail Skip detail page scraping
--skip-audio Skip audio .mp3 downloads --skip-audio Skip audio .mp3 downloads
--skip-examples Skip Ben Yehuda example fetching --skip-examples Skip EPUB example extraction
--skip-conjugations Skip verb conjugation extraction
--skip-images Skip image fetching for concrete nouns --skip-images Skip image fetching for concrete nouns
--refresh-examples Force rebuild of Ben Yehuda index (delete old, download nikkud corpus) --test N Limit to first N words/pages
--test N Process only the first N dictionary words (for quick testing)
""" """
import argparse import argparse
@ -21,8 +31,6 @@ import json
import logging import logging
import re import re
import sys import sys
import time
import unicodedata
from pathlib import Path from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent)) sys.path.insert(0, str(Path(__file__).parent))
@ -38,286 +46,127 @@ OUTPUT_DIR = Path(__file__).parent / "output"
AUDIO_DIR = DATA_DIR / "audio" AUDIO_DIR = DATA_DIR / "audio"
AUDIO_CONJ_DIR = DATA_DIR / "audio_conj" AUDIO_CONJ_DIR = DATA_DIR / "audio_conj"
FONTS_DIR = DATA_DIR / "fonts" FONTS_DIR = DATA_DIR / "fonts"
WORDS_JSON = DATA_DIR / "words.json"
def parse_args(): def parse_args():
p = argparse.ArgumentParser(description="Pealim Anki deck builder") p = argparse.ArgumentParser(description="Pealim Anki deck builder")
p.add_argument("--only", choices=["vocab", "conjugations"], help="Run only one deck (skips all unrelated steps)") p.add_argument(
p.add_argument("--skip-scrape", action="store_true", help="Skip dict scraping; use cached CSV") "--only",
choices=["vocab", "conjugations", "confusables", "plurals", "complete"],
help="Run only one deck (skips all unrelated steps)",
)
p.add_argument("--skip-scrape", action="store_true", help="Skip list page scraping")
p.add_argument("--skip-detail", action="store_true", help="Skip detail page scraping")
p.add_argument("--skip-audio", action="store_true", help="Skip audio downloads") p.add_argument("--skip-audio", action="store_true", help="Skip audio downloads")
p.add_argument("--skip-examples", action="store_true", help="Skip Ben Yehuda example lookup") p.add_argument("--skip-examples", action="store_true", help="Skip EPUB example extraction")
p.add_argument("--skip-conjugations", action="store_true", help="Skip verb conjugation extraction (deprecated: use --only vocab)")
p.add_argument("--skip-images", action="store_true", help="Skip image fetching") p.add_argument("--skip-images", action="store_true", help="Skip image fetching")
p.add_argument("--refresh-examples", action="store_true", help="Force rebuild of Ben Yehuda index")
p.add_argument("--test", type=int, metavar="N", help="Limit to first N words") p.add_argument("--test", type=int, metavar="N", help="Limit to first N words")
return p.parse_args() return p.parse_args()
def step_scrape(args): def step_list_scrape(args):
"""Step 1 — scrape or load dictionary.""" """Step 1 — scrape pealim.com list pages → words.json."""
dict_csv = DATA_DIR / "hebrew_dict.csv"
anki_csv = DATA_DIR / "hebrew_dict_for_anki.csv"
# Legacy fallback names
legacy_dict = DATA_DIR / "pealim_dict.csv"
legacy_anki = DATA_DIR / "pealim_dict_for_anki.csv"
if args.skip_scrape: if args.skip_scrape:
if dict_csv.exists(): if WORDS_JSON.exists():
logger.info(f"[1] Using existing {dict_csv}") logger.info("[1] Using existing words.json (--skip-scrape)")
elif legacy_dict.exists():
logger.info(f"[1] Using legacy {legacy_dict} (consider renaming)")
else: else:
logger.error(f"[1] --skip-scrape set but {dict_csv} not found. Aborting.") logger.error(f"[1] --skip-scrape set but {WORDS_JSON} not found. Aborting.")
sys.exit(1) sys.exit(1)
return return
logger.info("[1] Scraping dictionary from pealim.com …") logger.info("[1] Scraping dictionary list pages from pealim.com …")
import hebrew_extract import pealim_list_scrape
import pandas as pd
df = hebrew_extract.extract_from_website() total_pages = args.test if args.test else None
df.to_csv(dict_csv, index=True) pealim_list_scrape.run_scrape(total_pages=total_pages, force_refresh=False)
logger.info(f" Saved {len(df)} words → {dict_csv}")
df = hebrew_extract.modify_for_anki(df)
df.to_csv(anki_csv, sep=";", index=True)
logger.info(f" Saved Anki CSV → {anki_csv}")
def step_frequency() -> dict[str, int]: def step_frequency() -> dict[str, int]:
"""Step 2 — load/download word frequency data.""" """Step 3 — load/download word frequency data."""
logger.info("[2] Loading word frequency data …") logger.info("[3] Loading word frequency data …")
import frequency_lookup import frequency_lookup
frequency_lookup.load() frequency_lookup.load()
return frequency_lookup._freq return frequency_lookup._freq
def step_examples(args, freq_cache: dict): def step_examples(args) -> dict:
"""Step 3 — load/build Ben Yehuda example index.""" """Step 4 — extract example sentences from Hebrew EPUBs."""
if args.skip_examples: if args.skip_examples:
logger.info("[3] Skipping examples (--skip-examples)") logger.info("[4] Skipping examples (--skip-examples)")
examples_path = DATA_DIR / "examples_cache.json"
if examples_path.exists():
with open(examples_path) as f:
return json.load(f)
return {} return {}
logger.info("[3] Loading Ben Yehuda example index …") logger.info("[4] Extracting EPUB example sentences …")
import benyehuda import epub_examples
benyehuda.load(force_rebuild=args.refresh_examples)
dict_csv = DATA_DIR / "hebrew_dict_for_anki.csv" if not WORDS_JSON.exists():
if not dict_csv.exists(): logger.warning("[4] words.json not found, skipping examples")
dict_csv = DATA_DIR / "hebrew_dict.csv" return {}
if not dict_csv.exists():
dict_csv = DATA_DIR / "pealim_dict_for_anki.csv"
if not dict_csv.exists():
dict_csv = DATA_DIR / "pealim_dict.csv"
try: with open(WORDS_JSON, encoding="utf-8") as f:
import pandas as pd words = json.load(f)
try:
df = pd.read_csv(dict_csv, sep=";", index_col=0)
if df.shape[1] < 3:
raise ValueError("too few columns")
except (ValueError, pd.errors.ParserError):
df = pd.read_csv(dict_csv, index_col=0)
if args.test: stats = epub_examples.run(words)
df = df.head(args.test)
logger.info(f" Pre-fetching examples for {len(df)} words …") # Save updated words.json
for _, row in df.iterrows(): with open(WORDS_JSON, "w", encoding="utf-8") as f:
# Use nikkud word form as primary key (nikkud corpus) json.dump(words, f, ensure_ascii=False, indent=2)
word_nikkud = str(row.get("Word", "")).strip()
if word_nikkud:
benyehuda.get_examples(word_nikkud)
except Exception as e: logger.info(f" Coverage: {stats['matched']}/{stats['total_vocab']}")
logger.warning(f" Could not pre-fetch all examples: {e}") return stats
benyehuda.save_examples_cache()
return benyehuda._examples_cache
def step_audio(args): def step_detail_scrape(args):
"""Step 4 — download vocabulary audio .mp3 files from audio_url column in CSV.""" """Step 2 — scrape detail pages for nouns and verbs → update words.json."""
if args.skip_detail:
logger.info("[2] Skipping detail scrape (--skip-detail)")
return
logger.info("[2] Scraping detail pages from pealim.com …")
import pealim_detail_scrape
test_limit = args.test if args.test else None
pealim_detail_scrape.run(test=test_limit, force_refresh=False)
def step_audio_download(args):
"""Step 5 — download audio .mp3 files from URLs in words.json."""
if args.skip_audio: if args.skip_audio:
logger.info("[4] Skipping audio (--skip-audio)") logger.info("[5] Skipping audio (--skip-audio)")
return return
logger.info("[4] Downloading vocabulary audio files …") logger.info("[5] Downloading audio files …")
dict_csv = DATA_DIR / "hebrew_dict_for_anki.csv" import pealim_audio_download
if not dict_csv.exists():
dict_csv = DATA_DIR / "hebrew_dict.csv"
if not dict_csv.exists():
dict_csv = DATA_DIR / "pealim_dict_for_anki.csv"
if not dict_csv.exists():
dict_csv = DATA_DIR / "pealim_dict.csv"
import pandas as pd test_limit = args.test if args.test else None
import requests pealim_audio_download.run(test=test_limit)
try:
try:
df = pd.read_csv(dict_csv, sep=";", index_col=0)
if df.shape[1] < 3:
raise ValueError("too few columns")
except (ValueError, pd.errors.ParserError):
df = pd.read_csv(dict_csv, index_col=0)
if 'audio_url' not in df.columns:
logger.warning(" No audio_url column in CSV — re-scrape with hebrew_extract.py to capture audio URLs")
return
if args.test:
df = df.head(args.test)
AUDIO_DIR.mkdir(parents=True, exist_ok=True)
downloaded = 0
skipped = 0
no_url = 0
def strip_nik(t: str) -> str:
return "".join(c for c in unicodedata.normalize("NFD", t)
if unicodedata.category(c) != "Mn")
for _, row in df.iterrows():
word = str(row.get("Word", "")).strip()
word_plain = str(row.get("Word Without Nikkud", "")).strip()
audio_url = str(row.get("audio_url", "")).strip()
if not word:
continue
safe_name = re.sub(r"[^\u05d0-\u05ea]", "", strip_nik(word_plain or word))
if not safe_name:
continue
mp3_path = AUDIO_DIR / f"{safe_name}.mp3"
if mp3_path.exists():
skipped += 1
continue
if not audio_url or audio_url in ("nan", "None", ""):
no_url += 1
continue
try:
resp = requests.get(audio_url, timeout=10)
resp.raise_for_status()
mp3_path.write_bytes(resp.content)
downloaded += 1
time.sleep(0.3)
except Exception as e:
logger.debug(f" Audio download failed for {word}: {e}")
logger.info(f" Audio: {downloaded} downloaded, {skipped} already cached, {no_url} without URL")
except Exception as e:
logger.warning(f" Audio step failed: {e}")
def step_conj_audio(args, conjugations: dict): def step_fonts(_args: argparse.Namespace):
"""Step 4b — download conjugation audio .mp3 files.""" """Step 6 — download Heebo font files (one-time, cached)."""
if args.skip_audio:
logger.info("[4b] Skipping conjugation audio (--skip-audio)")
return
logger.info("[4b] Downloading conjugation audio files …")
AUDIO_CONJ_DIR.mkdir(parents=True, exist_ok=True)
import requests
downloaded = 0
skipped = 0
failed = 0
for infinitive, data in conjugations.items():
if not data or not data.get("forms"):
continue
slug = data.get("slug", "")
if not slug:
continue
# Active forms
for form_key, form_data in data["forms"].items():
audio_url = form_data.get("audio_url", "")
if not audio_url:
continue
filename = f"{slug}_{form_key}.mp3"
mp3_path = AUDIO_CONJ_DIR / filename
if mp3_path.exists():
skipped += 1
continue
try:
resp = requests.get(audio_url, timeout=10)
resp.raise_for_status()
mp3_path.write_bytes(resp.content)
downloaded += 1
time.sleep(0.2)
except Exception as e:
logger.debug(f" Conj audio failed {filename}: {e}")
failed += 1
# Passive partner forms
passive = data.get("passive_partner")
if passive and passive.get("forms"):
for form_key, form_data in passive["forms"].items():
audio_url = form_data.get("audio_url", "")
if not audio_url:
continue
filename = f"{slug}_passive_{form_key}.mp3"
mp3_path = AUDIO_CONJ_DIR / filename
if mp3_path.exists():
skipped += 1
continue
try:
resp = requests.get(audio_url, timeout=10)
resp.raise_for_status()
mp3_path.write_bytes(resp.content)
downloaded += 1
time.sleep(0.2)
except Exception as e:
logger.debug(f" Conj audio failed {filename}: {e}")
failed += 1
logger.info(
f" Conjugation audio: {downloaded} downloaded, "
f"{skipped} cached, {failed} failed"
)
def step_fonts(args):
"""Step 4c — download Heebo font files (one-time, cached)."""
FONTS_DIR.mkdir(parents=True, exist_ok=True) FONTS_DIR.mkdir(parents=True, exist_ok=True)
regular = FONTS_DIR / "_Heebo-Regular.ttf" regular = FONTS_DIR / "_Heebo-Regular.ttf"
bold = FONTS_DIR / "_Heebo-Bold.ttf" bold = FONTS_DIR / "_Heebo-Bold.ttf"
if regular.exists() and bold.exists(): if regular.exists() and bold.exists():
logger.info("[4c] Heebo fonts already cached") logger.info("[6] Heebo fonts already cached")
return return
logger.info("[4c] Downloading Heebo fonts from Google Fonts …") logger.info("[6] Downloading Heebo fonts from Google Fonts …")
# Fetch CSS to get actual TTF source URLs (static subset for Hebrew + Latin)
import requests as _req import requests as _req
headers = {
# Request TTF (not woff2) so Anki can embed them headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/120.0"}
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/120.0"
}
css_url = "https://fonts.googleapis.com/css2?family=Heebo:wght@400;700" css_url = "https://fonts.googleapis.com/css2?family=Heebo:wght@400;700"
try: try:
css_resp = _req.get(css_url, headers=headers, timeout=15) css_resp = _req.get(css_url, headers=headers, timeout=15)
css_resp.raise_for_status() css_resp.raise_for_status()
css_text = css_resp.text css_text = css_resp.text
# Find all src: url(...) references (may be woff2 for modern UA)
font_urls = re.findall(r"src:\s*url\(([^)]+)\)", css_text) font_urls = re.findall(r"src:\s*url\(([^)]+)\)", css_text)
logger.debug(f" Found {len(font_urls)} font URL(s) in CSS")
# Prefer TTF; if only woff2 available, download first two and note
downloaded = []
for i, fu in enumerate(font_urls[:2]): for i, fu in enumerate(font_urls[:2]):
fu = fu.strip("'\"") fu = fu.strip("'\"")
dest = regular if i == 0 else bold dest = regular if i == 0 else bold
@ -326,125 +175,78 @@ def step_fonts(args):
fr = _req.get(fu, timeout=15) fr = _req.get(fu, timeout=15)
fr.raise_for_status() fr.raise_for_status()
dest.write_bytes(fr.content) dest.write_bytes(fr.content)
downloaded.append(dest.name)
logger.info(f" Downloaded → {dest.name}") logger.info(f" Downloaded → {dest.name}")
if not downloaded:
logger.info(" All font files already present")
except Exception as e: except Exception as e:
logger.warning(f" Heebo download failed: {e}") logger.warning(f" Heebo download failed: {e}")
logger.warning(" Cards will fall back to Arial Hebrew / David.") logger.warning(" Cards will fall back to Arial Hebrew / David.")
logger.warning(
" To install manually: download Heebo-Regular.ttf and Heebo-Bold.ttf "
"from https://fonts.google.com/specimen/Heebo and rename with _ prefix "
f"into {FONTS_DIR}"
)
def step_images(args) -> dict: def step_images(args) -> dict:
"""Step 4d — fetch images for concrete nouns (resume-safe).""" """Step 7 — fetch images for concrete nouns (resume-safe)."""
if args.skip_images: if args.skip_images:
logger.info("[4d] Skipping images (--skip-images)") logger.info("[7] Skipping images (--skip-images)")
cache_path = DATA_DIR / "image_cache.json" cache_path = DATA_DIR / "image_cache.json"
if cache_path.exists(): if cache_path.exists():
with open(cache_path) as f: with open(cache_path) as f:
return json.load(f) return json.load(f)
return {} return {}
limit = args.test # When in test mode, limit images too limit = args.test
logger.info("[4d] Fetching images for concrete nouns …") logger.info("[7] Fetching images for concrete nouns …")
import image_fetch import image_fetch
return image_fetch.run(limit=limit) return image_fetch.run(limit=limit)
def step_build_vocab(args, examples_cache: dict, freq_cache: dict, image_cache: dict | None = None): def step_build_all(args):
"""Step 5 — build vocabulary .apkg.""" """Step 8 — build all 12 release variants from the unified words.json."""
logger.info("[5] Building vocabulary deck") logger.info("[8] Building all deck variants")
import apkg_builder import apkg_builder
dict_csv = DATA_DIR / "hebrew_dict_for_anki.csv" if not WORDS_JSON.exists():
if not dict_csv.exists(): logger.error(f"[8] {WORDS_JSON} not found. Run the data pipeline first.")
dict_csv = DATA_DIR / "hebrew_dict.csv" sys.exit(1)
if not dict_csv.exists():
dict_csv = DATA_DIR / "pealim_dict_for_anki.csv"
if not dict_csv.exists():
dict_csv = DATA_DIR / "pealim_dict.csv"
deck, media = apkg_builder.build_vocab_deck( with open(WORDS_JSON, encoding="utf-8") as f:
dict_csv, words = json.load(f)
examples_cache=examples_cache,
freq_cache=freq_cache, apkg_builder.build_all_variants(words, limit=args.test)
image_cache=image_cache or {},
limit=args.test,
)
apkg_builder.write_vocab_apkg(deck, media)
logger.info(f" Vocabulary .apkg → {apkg_builder.VOCAB_APKG}")
return deck
def step_conjugations(args): def print_summary(_args: argparse.Namespace, example_stats: dict, freq_cache: dict):
"""Step 6 — extract conjugations and build conjugation deck."""
if args.skip_conjugations:
logger.info("[6] Skipping conjugations (--skip-conjugations)")
return None
verbs_file = Path(__file__).parent / "verbs_input.txt"
if not verbs_file.exists():
logger.info("[6] verbs_input.txt not found — skipping conjugation deck")
return None
logger.info("[6] Extracting verb conjugations …")
import conjugation_extract
conjugations = conjugation_extract.main(verbs_file)
# Download conjugation audio
step_conj_audio(args, conjugations)
import apkg_builder
conj_deck, conj_media = apkg_builder.build_conj_deck(conjugations)
apkg_builder.write_conj_apkg(conj_deck, conj_media)
logger.info(f" Conjugation .apkg → {apkg_builder.CONJ_APKG}")
return conjugations
def print_summary(args, examples_cache, freq_cache, conjugations):
logger.info("") logger.info("")
logger.info("=" * 60) logger.info("=" * 60)
logger.info("SUMMARY") logger.info("SUMMARY")
logger.info("=" * 60) logger.info("=" * 60)
dict_csv = DATA_DIR / "hebrew_dict_for_anki.csv" if WORDS_JSON.exists():
if not dict_csv.exists(): with open(WORDS_JSON, encoding="utf-8") as f:
dict_csv = DATA_DIR / "hebrew_dict.csv" words = json.load(f)
if not dict_csv.exists(): logger.info(f" Dictionary words: {len(words)}")
dict_csv = DATA_DIR / "pealim_dict_for_anki.csv"
if not dict_csv.exists(): nouns = sum(1 for e in words.values() if e.get("pos", "").startswith("Noun"))
dict_csv = DATA_DIR / "pealim_dict.csv" verbs = sum(1 for e in words.values() if e.get("pos", "").startswith("Verb"))
if dict_csv.exists(): detail_scraped = sum(1 for e in words.values() if e.get("detail_scraped"))
import pandas as pd logger.info(f" Nouns: {nouns}, Verbs: {verbs}, Detail-scraped: {detail_scraped}")
try:
df = pd.read_csv(dict_csv, sep=";", index_col=0)
if df.shape[1] < 3:
raise ValueError("too few columns")
except (ValueError, pd.errors.ParserError):
df = pd.read_csv(dict_csv, index_col=0)
logger.info(f" Dictionary words: {len(df)}")
logger.info(f" Frequency entries: {len(freq_cache)}") logger.info(f" Frequency entries: {len(freq_cache)}")
logger.info(f" Example cache entries: {len(examples_cache)}") matched = example_stats.get("matched", 0)
covered = sum(1 for v in examples_cache.values() if v) total = example_stats.get("total_vocab", 0)
if examples_cache: if total:
logger.info(f" Example coverage: {covered}/{len(examples_cache)} ({100*covered//len(examples_cache)}%)") logger.info(f" Example coverage: {matched}/{total} ({100 * matched // total}%)")
for book, count in example_stats.get("books", {}).items():
logger.info(f" {book}: {count} sentences")
if AUDIO_DIR.exists(): if AUDIO_DIR.exists():
mp3s = list(AUDIO_DIR.glob("*.mp3")) mp3s = list(AUDIO_DIR.glob("*.mp3"))
logger.info(f" Vocabulary audio files: {len(mp3s)}") logger.info(f" Vocabulary audio files: {len(mp3s)}")
if AUDIO_CONJ_DIR.exists(): if AUDIO_CONJ_DIR.exists():
mp3s = list(AUDIO_CONJ_DIR.glob("*.mp3")) mp3s = [
logger.info(f" Conjugation audio files: {len(mp3s)}") p for p in AUDIO_CONJ_DIR.glob("*.mp3") if not p.stem.endswith("_infinitive") and "_passive_" not in p.stem
]
logger.info(f" Conjugation audio files (bundled): {len(mp3s)}")
image_cache_path = DATA_DIR / "image_cache.json" image_cache_path = DATA_DIR / "image_cache.json"
if image_cache_path.exists(): if image_cache_path.exists():
@ -453,17 +255,24 @@ def print_summary(args, examples_cache, freq_cache, conjugations):
found_imgs = sum(1 for v in ic.values() if v) found_imgs = sum(1 for v in ic.values() if v)
logger.info(f" Images: {found_imgs}/{len(ic)} nouns with images") logger.info(f" Images: {found_imgs}/{len(ic)} nouns with images")
vocab_apkg = OUTPUT_DIR / "hebrew_vocabulary.apkg" import apkg_builder as _ab
conj_apkg = OUTPUT_DIR / "hebrew_conjugations.apkg"
if vocab_apkg.exists(): all_apkgs = [
size_mb = vocab_apkg.stat().st_size / 1e6 _ab.VOCAB_APKG,
logger.info(f" Vocabulary .apkg: {size_mb:.1f} MB → {vocab_apkg}") _ab.VOCAB_APKG_AUDIO,
if conj_apkg.exists(): _ab.VOCAB_APKG_IMAGES,
size_mb = conj_apkg.stat().st_size / 1e6 _ab.VOCAB_APKG_AUDIO_IMAGES,
logger.info(f" Conjugation .apkg: {size_mb:.1f} MB → {conj_apkg}") _ab.CONJ_APKG,
if conjugations: _ab.CONJ_APKG_AUDIO,
verb_count = sum(1 for v in conjugations.values() if v) _ab.CONF_APKG,
logger.info(f" Verbs in conjugation deck: {verb_count}") _ab.CONF_APKG_AUDIO,
_ab.COMPLETE_APKG,
_ab.COMPLETE_APKG_AUDIO,
]
for apkg in all_apkgs:
if apkg.exists():
size_mb = apkg.stat().st_size / 1e6
logger.info(f" {apkg.name}: {size_mb:.1f} MB")
logger.info("=" * 60) logger.info("=" * 60)
logger.info("DONE") logger.info("DONE")
@ -478,29 +287,75 @@ def main():
logger.info(f" MODE: --only {args.only}") logger.info(f" MODE: --only {args.only}")
if args.test: if args.test:
logger.info(f" TEST MODE: {args.test} words") logger.info(f" TEST MODE: {args.test} words")
if args.refresh_examples:
logger.info(" REFRESH EXAMPLES: Ben Yehuda index will be rebuilt")
logger.info("=" * 60) logger.info("=" * 60)
def _load_words_for_only() -> dict:
if not WORDS_JSON.exists():
logger.error(f"words.json not found at {WORDS_JSON}. Run the data pipeline first.")
sys.exit(1)
with open(WORDS_JSON, encoding="utf-8") as f:
return json.load(f)
if args.only == "conjugations": if args.only == "conjugations":
step_fonts(args) step_fonts(args)
conjugations = step_conjugations(args) import apkg_builder
print_summary(args, {}, {}, conjugations or {})
words = _load_words_for_only()
for audio, path in [(False, apkg_builder.CONJ_APKG), (True, apkg_builder.CONJ_APKG_AUDIO)]:
deck, media = apkg_builder.build_conj_deck(words, include_audio=audio)
apkg_builder.write_conj_apkg(deck, media, out_path=path)
print_summary(args, {}, {})
return return
if args.only == "vocab": if args.only == "confusables":
args.skip_conjugations = True
step_scrape(args)
freq_cache = step_frequency()
examples_cache = step_examples(args, freq_cache)
step_audio(args)
step_fonts(args) step_fonts(args)
image_cache = step_images(args) import apkg_builder
step_build_vocab(args, examples_cache, freq_cache, image_cache)
conjugations = step_conjugations(args)
print_summary(args, examples_cache, freq_cache, conjugations or {}) words = _load_words_for_only()
for audio, path in [(False, apkg_builder.CONF_APKG), (True, apkg_builder.CONF_APKG_AUDIO)]:
deck, media = apkg_builder.build_confusables_deck(words, include_audio=audio)
apkg_builder.write_conf_apkg(deck, media, out_path=path)
print_summary(args, {}, {})
return
if args.only == "plurals":
step_fonts(args)
import apkg_builder
words = _load_words_for_only()
for audio, path in [(False, apkg_builder.PLURAL_APKG), (True, apkg_builder.PLURAL_APKG_AUDIO)]:
deck, media = apkg_builder.build_plural_deck(words, include_audio=audio)
apkg_builder.write_plural_apkg(deck, media, out_path=path)
print_summary(args, {}, {})
return
if args.only == "complete":
step_fonts(args)
import apkg_builder
words = _load_words_for_only()
emoji_lookup = apkg_builder._load_emoji_lookup()
for audio, path in [(False, apkg_builder.COMPLETE_APKG), (True, apkg_builder.COMPLETE_APKG_AUDIO)]:
decks, media = apkg_builder.build_complete_deck(
words,
include_audio=audio,
emoji_lookup=emoji_lookup,
)
apkg_builder.write_complete_apkg(decks, media, out_path=path)
print_summary(args, {}, {})
return
# Full pipeline
step_list_scrape(args) # 1 — scrape list pages → words.json (captures slugs)
step_detail_scrape(args) # 2 — scrape detail pages using slugs → words.json
freq_cache = step_frequency() # 3 — word frequency data
example_stats = step_examples(args) # 4 — EPUB example sentences
step_audio_download(args) # 5 — download audio mp3s
step_fonts(args) # 6 — download Heebo fonts
step_images(args) # 7 — fetch noun images
step_build_all(args) # 8 — build all .apkg variants
print_summary(args, example_stats, freq_cache)
if __name__ == "__main__": if __name__ == "__main__":

392
scripts/assign_frequency.py Normal file
View file

@ -0,0 +1,392 @@
#!/usr/bin/env python3
"""Assign frequency ranks from the cleaned corpus to words.json entries.
Two-tier assignment with PoS priority:
Tier 1: Match headword ktiv_male directly against corpus
Tier 2: Match conjugated/inflected forms (only if no other entry already
claimed that corpus word via tier 1)
PoS priority (based on standalone-word likelihood in Hebrew text):
כינוייוף (Pronoun) > מילות_חיבור (Conjunction) > שם_תואר (Adjective) >
מילית (Particle) > שם_עצם (Noun) > תוארי_הפועל (Adverb) >
מילות_יחס (Preposition) > פעלים (Verb)
Usage:
python3 scripts/assign_frequency.py # assign and save
python3 scripts/assign_frequency.py --dry-run # preview only
python3 scripts/assign_frequency.py --stats # show statistics only
"""
from __future__ import annotations
import argparse
import json
import logging
from collections import defaultdict
from pathlib import Path
logger = logging.getLogger(__name__)
PROJECT_ROOT = Path(__file__).parent.parent
WORDS_JSON = PROJECT_ROOT / "data" / "words.json"
CLEAN_CACHE = PROJECT_ROOT / "data" / "frequency_clean.json"
RAW_CACHE = PROJECT_ROOT / "data" / "frequency_cache.json"
# Function word PoS — these dominate content words in homograph groups
FUNCTION_POS = frozenset({"כינוייוף", "מילות_חיבור", "מילית", "מילות_יחס", "תוארי_הפועל"})
# Content PoS that loses frequency when a function word dominates
# Adjectives also lose (e.g. כן "honest" vs כן "yes") — they're rare collisions
CONTENT_POS = frozenset({"שם_עצם", "שם_תואר", "פעלים"})
# Manual overrides: at these corpus ranks, ALL homographs share frequency.
# These are cases where the content word is genuinely common enough to deserve it.
# e.g. rank 15: עם "people" (NN) alongside עם "with" (PREP)
# Manual overrides: at these ktiv_male forms, ALL homographs share frequency.
# These are cases where the content word is genuinely common enough to deserve it.
SHARE_ALL_WORDS = frozenset(
{
"עם", # "people" (NN) + "with" (PREP)
"שם", # "name" (NN) + "there" (ADV)
"אל", # "god" (NN) + "to" (PREP) + "don't" (PART)
"עד", # "witness"/"eternity" (NN) + "until" (PREP)
"פה", # "mouth" (NN) + "here" (ADV)
"לאחר", # "to be late" (VB) + "after" (PREP)
"יופי", # "beauty" (NN) + "great!" (ADV)
"המון", # "crowd" (NN) + "lots of" (ADV)
"חבל", # "rope" (NN) + "it's a pity" (ADV)
"ראשית", # "beginning" (NN) + "firstly" (ADV)
"עקב", # "heel"/"footprint" (NN) + "due to" (CONJ)
"אולם", # "hall" (NN) + "however" (ADV)
}
)
def _get_pos_tag(entry: dict) -> str:
"""Extract primary PoS tag from entry's tags field."""
tags = (entry.get("tags") or "").split()
for t in tags:
if not t.startswith("שורש"):
return t
return "unknown"
def _build_form_index(words: dict) -> dict[str, list[tuple[str, str]]]:
"""Build reverse index: ktiv_male_form -> [(unique_key, match_type), ...]"""
index: dict[str, list[tuple[str, str]]] = defaultdict(list)
for key, entry in words.items():
w = entry.get("word") or {}
if km := w.get("ktiv_male"):
index[km].append((key, "headword"))
# Verb conjugations: indexed for new-assignment-only matching (no upgrades).
# Conjugated forms collide with unrelated headwords, so tier 2 only uses
# these for entries that have NO existing frequency.
conj = entry.get("conjugation") or {}
for form in conj.get("active_forms") or []:
if isinstance(form, dict):
form_data = form.get("form") or {}
if km2 := form_data.get("ktiv_male"):
km2 = km2.rstrip("!\u200f ")
index[km2].append((key, "conjugation"))
for hp in conj.get("hufal_pual_forms") or []:
if isinstance(hp, dict):
hp_data = hp.get("form") or {}
if km3 := hp_data.get("ktiv_male"):
km3 = km3.rstrip("!\u200f ")
index[km3].append((key, "conjugation"))
for field in ("noun_inflection", "preposition_inflection", "adjective_inflection"):
for inf_data in (entry.get(field) or {}).values():
if isinstance(inf_data, dict) and (km4 := inf_data.get("ktiv_male")):
index[km4].append((key, "inflection"))
return dict(index)
def _should_get_frequency(
entry: dict,
all_headword_entries: list[tuple[str, str]],
corpus_word: str,
words: dict,
) -> bool:
"""Decide if an entry should get frequency in a homograph group.
Rules:
- If only one entry matches, it always gets frequency.
- If SHARE_ALL_WORDS includes this corpus word, all entries share.
- If the group has function words AND content words, content words lose.
- Otherwise all entries share.
"""
if len(all_headword_entries) <= 1:
return True
if corpus_word in SHARE_ALL_WORDS:
return True
pos = _get_pos_tag(entry)
has_function = any(_get_pos_tag(words[k]) in FUNCTION_POS for k, _ in all_headword_entries)
return not (has_function and pos in CONTENT_POS)
def assign_frequencies(
words: dict,
freq_corpus: dict[str, int],
raw_corpus: dict[str, int] | None = None,
upgrade: bool = False,
) -> dict[str, dict]:
"""Assign frequency ranks to words.json entries. Returns assignment details.
freq_corpus controls which words are valid (cleaned corpus).
raw_corpus provides original rank numbers (with gaps). If not provided,
uses freq_corpus ranks (re-ranked, no gaps).
upgrade: if True, tier 2 can upgrade an entry's rank when a conjugated/inflected
form has a better (lower) rank than the headword match.
"""
rank_source = raw_corpus if raw_corpus is not None else freq_corpus
form_index = _build_form_index(words)
# Track which corpus words have been claimed by tier 1
tier1_claimed: set[str] = set()
# Results tracking
assignments: dict[str, dict] = {} # unique_key -> {rank, source, corpus_word}
# --- Tier 1: headword matches ---
# For each corpus word, find all headword matches and assign to eligible entries.
# Homograph groups: function words get frequency, content words don't (unless overridden).
corpus_by_rank = sorted(freq_corpus.items(), key=lambda x: x[1])
for corpus_word, _clean_rank in corpus_by_rank:
matches = form_index.get(corpus_word, [])
headword_matches = [(k, t) for k, t in matches if t == "headword"]
if not headword_matches:
continue
original_rank = rank_source.get(corpus_word, _clean_rank)
assigned_any = False
for entry_key, _ in headword_matches:
if entry_key in assignments:
continue
if _should_get_frequency(words[entry_key], headword_matches, corpus_word, words):
assignments[entry_key] = {
"rank": original_rank,
"source": "headword",
"corpus_word": corpus_word,
}
assigned_any = True
if assigned_any:
tier1_claimed.add(corpus_word)
tier1_count = len(assignments)
logger.info("Tier 1 (headword): %d entries assigned", tier1_count)
# --- Tier 2: conjugation/inflection matches ---
# Only use corpus words NOT claimed in tier 1.
# A corpus word that matches an inflection is "owned" by that headword —
# it cannot also upgrade an unrelated verb via conjugation.
# Upgrades (when enabled) only apply within the same match type priority.
for corpus_word, _clean_rank in corpus_by_rank:
if corpus_word in tier1_claimed:
continue
matches = form_index.get(corpus_word, [])
secondary_matches = [(k, t) for k, t in matches if t in ("conjugation", "inflection")]
if not secondary_matches:
continue
original_rank = rank_source.get(corpus_word, _clean_rank)
# Split by type: inflections take priority over conjugations
inflection_matches = [(k, t) for k, t in secondary_matches if t == "inflection"]
conjugation_matches = [(k, t) for k, t in secondary_matches if t == "conjugation"]
# If any inflection matches exist, this corpus word belongs to inflection.
# Don't let conjugations claim it.
active_matches = inflection_matches if inflection_matches else conjugation_matches
for entry_key, match_type in active_matches:
existing = assignments.get(entry_key)
if existing is None:
# New assignment — conjugations only allowed for rank > 5000
# (too many false positives in the important tiers)
if match_type == "conjugation" and original_rank <= 5000:
continue
assignments[entry_key] = {
"rank": original_rank,
"source": match_type,
"corpus_word": corpus_word,
}
break
if upgrade and match_type == "inflection" and original_rank < existing["rank"]:
# Upgrade — only allowed for inflections (conjugations collide too much)
assignments[entry_key] = {
"rank": original_rank,
"source": f"upgrade:{match_type}",
"corpus_word": corpus_word,
}
break
tier2_count = len(assignments) - tier1_count
logger.info("Tier 2 (conjugation/inflection): %d entries assigned", tier2_count)
return assignments
def print_stats(words: dict, assignments: dict, freq_corpus: dict) -> None:
"""Print detailed statistics about frequency assignment."""
total = len(words)
assigned = len(assignments)
previously_had = sum(1 for e in words.values() if e.get("frequency") is not None)
print(f"\n{'=' * 60}")
print("Frequency Assignment Statistics")
print(f"{'=' * 60}")
print(f"Words.json entries: {total}")
print(f"Clean corpus size: {len(freq_corpus)}")
print(f"Previously had freq: {previously_had}")
print(f"Now assigned: {assigned}")
print(f"Newly gained: {assigned - previously_had}")
print(f"Still unlisted: {total - assigned}")
# By tier
tier1 = sum(1 for a in assignments.values() if a["source"] == "headword")
tier2_conj = sum(1 for a in assignments.values() if a["source"] == "conjugation")
tier2_inf = sum(1 for a in assignments.values() if a["source"] == "inflection")
print("\nBy assignment tier:")
print(f" Tier 1 (headword): {tier1}")
print(f" Tier 2 (conjugation): {tier2_conj}")
print(f" Tier 2 (inflection): {tier2_inf}")
# By PoS
print("\nBy PoS:")
from collections import Counter
pos_assigned = Counter()
pos_total = Counter()
for k, v in words.items():
pos = _get_pos_tag(v)
pos_total[pos] += 1
if k in assignments:
pos_assigned[pos] += 1
pos_order = [
"כינוייוף",
"מילות_חיבור",
"שם_תואר",
"מילית",
"שם_עצם",
"תוארי_הפועל",
"מילות_יחס",
"פעלים",
"unknown",
]
for pos in sorted(pos_total, key=lambda p: pos_order.index(p) if p in pos_order else 99):
a = pos_assigned[pos]
t = pos_total[pos]
pct = a / t * 100 if t else 0
print(f" {pos:20s}: {a:5d}/{t:5d} ({pct:.0f}%)")
# By frequency tier (using apkg_builder tiers)
print("\nBy frequency tier:")
tiers = {
"Core (1-500)": (1, 500),
"Essential (501-1500)": (501, 1500),
"Intermediate (1501-3000)": (1501, 3000),
"Upper-intermediate (3001-5000)": (3001, 5000),
"Advanced (5001-10000)": (5001, 10000),
"Rare (10001+)": (10001, 999999),
}
for label, (lo, hi) in tiers.items():
count = sum(1 for a in assignments.values() if lo <= a["rank"] <= hi)
print(f" {label:35s}: {count}")
# Top 20 newly assigned (entries that didn't have frequency before)
newly = []
for k, a in assignments.items():
if words[k].get("frequency") is None:
w = words[k].get("word", {})
newly.append((a["rank"], k, w.get("ktiv_male", ""), a["source"], a["corpus_word"]))
newly.sort()
if newly:
print("\nTop 20 newly assigned entries:")
for rank, _key, ktiv, source, corpus_word in newly[:20]:
print(f" rank {rank:5d}: {ktiv:15s} via {source:12s} (corpus: {corpus_word})")
# Entries that LOST frequency (had it before, not assigned now)
lost = []
for k, v in words.items():
old_freq = v.get("frequency")
if old_freq is not None and k not in assignments:
w = v.get("word", {})
lost.append((old_freq, k, w.get("ktiv_male", "")))
lost.sort()
if lost:
print(f"\nEntries that would LOSE frequency ({len(lost)} total):")
for rank, _key, ktiv in lost[:20]:
print(f" was rank {rank:5d}: {ktiv}")
def main() -> None:
parser = argparse.ArgumentParser(description="Assign frequency to words.json")
parser.add_argument("--dry-run", action="store_true", help="Preview without saving")
parser.add_argument("--stats", action="store_true", help="Show statistics only")
parser.add_argument(
"--upgrade", action="store_true", help="Allow tier 2 to upgrade headword rank from conjugated forms"
)
args = parser.parse_args()
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
# Load data
freq_path = CLEAN_CACHE if CLEAN_CACHE.exists() else RAW_CACHE
logger.info("Loading frequency corpus: %s", freq_path)
with open(freq_path, encoding="utf-8") as f:
freq_corpus: dict[str, int] = json.load(f)
# Load raw corpus for original rank numbers (with gaps)
raw_corpus: dict[str, int] | None = None
if RAW_CACHE.exists() and freq_path != RAW_CACHE:
with open(RAW_CACHE, encoding="utf-8") as f:
raw_corpus = json.load(f)
logger.info("Using original ranks from %s", RAW_CACHE)
with open(WORDS_JSON, encoding="utf-8") as f:
words: dict = json.load(f)
logger.info("Corpus: %d entries, Words.json: %d entries", len(freq_corpus), len(words))
# Run assignment
assignments = assign_frequencies(words, freq_corpus, raw_corpus, upgrade=args.upgrade)
# Stats
print_stats(words, assignments, freq_corpus)
if args.stats or args.dry_run:
if args.dry_run:
logger.info("Dry run — no changes saved")
return
# Apply to words.json
changed = 0
for key, entry in words.items():
if key in assignments:
new_rank = assignments[key]["rank"]
if entry.get("frequency") != new_rank:
entry["frequency"] = new_rank
changed += 1
else:
if entry.get("frequency") is not None:
entry["frequency"] = None
changed += 1
with open(WORDS_JSON, "w", encoding="utf-8") as f:
json.dump(words, f, ensure_ascii=False, indent=2)
logger.info("Updated %d entries in words.json", changed)
if __name__ == "__main__":
main()

View file

@ -0,0 +1,269 @@
#!/usr/bin/env python3
"""Assign pseudo-frequency to confusable groups using English word frequency.
Problem: Confusable entries share the same ktiv_male and thus the same Hebrew
frequency rank. This script uses English frequency to differentiate them so
Anki sorts more-common meanings first.
Algorithm:
1. For each confusable group where all entries share the same Hebrew frequency,
extract the first meaningful English keyword from each entry's meaning field.
2. Look up English frequency rank for each keyword.
3. Assign pseudo_frequency: the most frequent English meaning keeps the original
Hebrew rank; less frequent meanings get progressively higher (worse) ranks
by adding an offset (100 * position in group).
Usage:
python3 scripts/assign_pseudo_frequency.py # assign and save
python3 scripts/assign_pseudo_frequency.py --dry-run # preview only
"""
from __future__ import annotations
import argparse
import json
import logging
import re
from collections import defaultdict
from pathlib import Path
logger = logging.getLogger(__name__)
PROJECT_ROOT = Path(__file__).parent.parent
WORDS_JSON = PROJECT_ROOT / "data" / "words.json"
EN_FREQ_PATH = PROJECT_ROOT / "data" / "en_50k.txt"
# Words too common/vague to use as frequency signal
_EN_STOP = frozenset(
{
"to",
"be",
"a",
"an",
"the",
"of",
"in",
"on",
"at",
"for",
"and",
"with",
"by",
"or",
"but",
"not",
"as",
"its",
"it",
"is",
"was",
"are",
"from",
"that",
"this",
"have",
"has",
"had",
"do",
"does",
"did",
"will",
"would",
"can",
"could",
"may",
"might",
"shall",
"should",
"must",
"no",
"yes",
"very",
"too",
"also",
"just",
"only",
"so",
"up",
"out",
"into",
"over",
"after",
"before",
"about",
"more",
"than",
"other",
"some",
"any",
"all",
"each",
"every",
"both",
"few",
"many",
"much",
"most",
"such",
"own",
"same",
"well",
"still",
"even",
"how",
"what",
"when",
"where",
"which",
"who",
"whom",
"whose",
"why",
"because",
"if",
"then",
"else",
"while",
"until",
"though",
"whether",
}
)
def _load_en_freq() -> dict[str, int]:
"""Load English frequency data: word -> rank (1 = most common)."""
freq: dict[str, int] = {}
rank = 1
with open(EN_FREQ_PATH, encoding="utf-8") as f:
for line in f:
parts = line.strip().split()
if parts:
word = parts[0].lower()
if word not in freq:
freq[word] = rank
rank += 1
return freq
def _extract_keywords(meaning: str) -> list[str]:
"""Extract meaningful English keywords from a meaning string.
Returns list of lowercase words, filtered for stop words and short words.
"""
# Strip parenthesized content, punctuation
cleaned = re.sub(r"\([^)]*\)", " ", meaning)
cleaned = re.sub(r"[^\w\s]", " ", cleaned)
return [w.lower() for w in cleaned.split() if len(w) > 2 and w.lower() not in _EN_STOP]
def assign_pseudo_frequencies(
words: dict,
en_freq: dict[str, int],
dry_run: bool = False,
) -> int:
"""Assign pseudo_frequency to confusable groups. Returns count of changes."""
# Group by confusables_guid
groups: dict[str, list[str]] = defaultdict(list)
for key, entry in words.items():
cg = entry.get("confusables_guid")
if cg:
groups[cg].append(key)
changes = 0
assigned_groups = 0
skipped_diff = 0
skipped_no_en = 0
for _guid, keys in groups.items():
entries = [words[k] for k in keys]
freqs = [e.get("frequency") for e in entries]
# Skip groups that are already differentiated
unique_freqs = set(freqs)
if len(unique_freqs) > 1:
skipped_diff += 1
continue
base_freq = freqs[0] # All same (or all None)
# Look up English frequency for each entry
en_ranks: list[tuple[int, str]] = [] # (en_rank, key)
for key, entry in zip(keys, entries, strict=True):
keywords = _extract_keywords(entry.get("meaning", ""))
en_rank = 999_999
for kw in keywords[:5]:
r = en_freq.get(kw)
if r is not None:
en_rank = r
break
en_ranks.append((en_rank, key))
# Sort by English frequency (lower rank = more common)
en_ranks.sort()
# Check if all entries have the same English rank (no signal)
if len({r for r, _ in en_ranks}) <= 1:
skipped_no_en += 1
continue
assigned_groups += 1
# Assign pseudo_frequency: most common gets base, others get offset
for position, (en_rank, key) in enumerate(en_ranks):
pseudo = base_freq + position * 100 if base_freq is not None else 50000 + en_rank
if not dry_run:
words[key]["pseudo_frequency"] = pseudo
changes += 1
if dry_run:
meaning = words[key].get("meaning", "")[:40]
logger.info(
" [en:%5d] pseudo=%6d %s",
en_rank,
pseudo,
meaning,
)
logger.info(
"Pseudo-frequency: %d groups assigned, %d already differentiated, %d no English signal",
assigned_groups,
skipped_diff,
skipped_no_en,
)
return changes
def main() -> None:
parser = argparse.ArgumentParser(description="Assign pseudo-frequency to confusables")
parser.add_argument("--dry-run", action="store_true", help="Preview without saving")
args = parser.parse_args()
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)s %(message)s",
)
logger.info("Loading English frequency data: %s", EN_FREQ_PATH)
en_freq = _load_en_freq()
logger.info("English frequency: %d entries", len(en_freq))
with open(WORDS_JSON, encoding="utf-8") as f:
words: dict = json.load(f)
changes = assign_pseudo_frequencies(words, en_freq, dry_run=args.dry_run)
if args.dry_run:
logger.info("Dry run — %d changes would be made", changes)
return
with open(WORDS_JSON, "w", encoding="utf-8") as f:
json.dump(words, f, ensure_ascii=False, indent=2)
logger.info("Saved %d pseudo-frequency assignments to words.json", changes)
if __name__ == "__main__":
main()

View file

@ -0,0 +1,212 @@
"""Check that every GUID in the last-release complete .apkg exists in words.json.
Extracts GUIDs from the Anki SQLite database inside the .apkg (zip) file,
then compares against all GUID fields stored in data/words.json.
Usage:
python3 scripts/check_guid_coverage.py
python3 scripts/check_guid_coverage.py --apkg output/hebrew_complete.apkg
python3 scripts/check_guid_coverage.py --verbose
"""
from __future__ import annotations
import argparse
import json
import os
import sqlite3
import sys
import tempfile
import zipfile
from pathlib import Path
from typing import Any
PROJECT_ROOT = Path(__file__).parent.parent
DEFAULT_APKG = PROJECT_ROOT / "output" / "hebrew_complete.apkg"
WORDS_JSON = PROJECT_ROOT / "data" / "words.json"
# Known model IDs (from apkg_builder.py)
MODEL_IDS = {
1701222017968: "vocab",
1234567893: "conjugation",
1234567897: "plurals",
1234567895: "confusables",
}
def extract_apkg_guids(apkg_path: Path) -> dict[int, set[str]]:
"""Extract GUIDs from .apkg grouped by model ID."""
by_model: dict[int, set[str]] = {}
with zipfile.ZipFile(apkg_path) as z, tempfile.TemporaryDirectory() as td:
z.extractall(td)
db_path = os.path.join(td, "collection.anki2")
conn = sqlite3.connect(db_path)
cur = conn.cursor()
cur.execute("SELECT guid, mid FROM notes")
for guid, mid in cur.fetchall():
by_model.setdefault(mid, set()).add(guid)
conn.close()
return by_model
def collect_words_json_guids(data: dict[str, Any]) -> dict[str, set[str]]:
"""Collect all GUIDs from words.json grouped by deck type."""
vocab_guids: set[str] = set()
cloze_guids: set[str] = set()
conj_guids: set[str] = set()
plurals_guids: set[str] = set()
confusables_guids: set[str] = set()
for entry in data.values():
# Vocab legacy GUID
g = entry.get("vocab_legacy_guid")
if g:
vocab_guids.add(g)
# Cloze GUID (stored in examples.cloze.cloze_guid)
examples = entry.get("examples")
if examples:
cloze = examples.get("cloze")
if cloze:
g = cloze.get("cloze_guid")
if g:
cloze_guids.add(g)
# Plurals GUID (stored inside noun_inflection)
ni = entry.get("noun_inflection")
if ni:
g = ni.get("plurals_guid")
if g:
plurals_guids.add(g)
# Confusables GUID (top-level)
g = entry.get("confusables_guid")
if g:
confusables_guids.add(g)
# Conjugation form GUIDs
conj = entry.get("conjugation")
if conj:
for form_list_key in ("active_forms", "hufal_pual_forms"):
forms = conj.get(form_list_key)
if not forms:
continue
for form in forms:
g = form.get("guid")
if g:
conj_guids.add(g)
gc = form.get("guid_candidates")
if gc:
for g2 in gc:
conj_guids.add(g2)
return {
"vocab": vocab_guids,
"cloze": cloze_guids,
"conjugation": conj_guids,
"plurals": plurals_guids,
"confusables": confusables_guids,
}
def main() -> None:
parser = argparse.ArgumentParser(description="Check GUID coverage between .apkg and words.json")
parser.add_argument(
"--apkg",
type=Path,
default=DEFAULT_APKG,
help=f"Path to .apkg file (default: {DEFAULT_APKG})",
)
parser.add_argument("--verbose", "-v", action="store_true")
args = parser.parse_args()
if not args.apkg.exists():
print(f"ERROR: apkg not found: {args.apkg}")
sys.exit(2)
if not WORDS_JSON.exists():
print(f"ERROR: words.json not found: {WORDS_JSON}")
sys.exit(2)
print(f"Checking: {args.apkg}")
print(f"Against: {WORDS_JSON}")
print()
apkg_by_model = extract_apkg_guids(args.apkg)
data = json.load(WORDS_JSON.open(encoding="utf-8"))
wj = collect_words_json_guids(data)
total_apkg = sum(len(s) for s in apkg_by_model.values())
total_wj = sum(len(s) for s in wj.values())
print(f"Total GUIDs in apkg: {total_apkg}")
print(f"Total GUIDs in words.json: {total_wj}")
print()
all_missing = 0
all_extra = 0
for mid, deck_name in MODEL_IDS.items():
apkg_set = apkg_by_model.get(mid, set())
# Map apkg model to words.json GUID sets
if deck_name == "vocab":
# Vocab notes cover both vocab cards (ord 0,1) and cloze (ord 2)
# They share the note GUID — vocab_legacy_guid IS the note guid
wj_set = wj["vocab"] | wj["cloze"]
elif deck_name == "conjugation":
wj_set = wj["conjugation"]
elif deck_name == "plurals":
wj_set = wj["plurals"]
elif deck_name == "confusables":
wj_set = wj["confusables"]
else:
wj_set = set()
missing = apkg_set - wj_set
extra = wj_set - apkg_set
matched = apkg_set & wj_set
all_missing += len(missing)
all_extra += len(extra)
status = "PASS" if not missing else "FAIL"
print(f" {status} {deck_name} (mid={mid})")
print(
f" apkg={len(apkg_set)}, words.json={len(wj_set)}, "
f"matched={len(matched)}, missing={len(missing)}, extra={len(extra)}"
)
if missing and args.verbose:
# Try to find what word each missing GUID belongs to in the apkg
print(" Missing GUIDs (in apkg, not in words.json):")
for g in sorted(missing)[:20]:
print(f" {g!r}")
if len(missing) > 20:
print(f" ... ({len(missing) - 20} more)")
if extra and args.verbose:
print(" Extra GUIDs (in words.json, not in apkg):")
for g in sorted(extra)[:10]:
print(f" {g!r}")
if len(extra) > 10:
print(f" ... ({len(extra) - 10} more)")
print()
# Check for unknown model IDs in apkg
unknown_mids = set(apkg_by_model.keys()) - set(MODEL_IDS.keys())
if unknown_mids:
print(f" WARNING: Unknown model IDs in apkg: {unknown_mids}")
for mid in unknown_mids:
print(f" mid={mid}: {len(apkg_by_model[mid])} notes")
print("" * 60)
if all_missing:
print(f" FAILED: {all_missing} apkg GUIDs not found in words.json")
print(" (These notes would lose study progress on reimport)")
sys.exit(1)
else:
print(f" All {total_apkg} apkg GUIDs accounted for in words.json.")
sys.exit(0)
if __name__ == "__main__":
main()

View file

@ -0,0 +1,400 @@
#!/usr/bin/env python3
"""Clean the Hebrew frequency corpus by removing prefix+word combinations.
Two modes:
--mode yap (default) Use YAP morphological analyzer for accurate prefix detection.
Requires YAP API running at localhost:8000.
--mode heuristic Use rule-based prefix stripping (no external dependencies).
Both modes preserve words that exist as known dictionary forms in words.json.
Usage:
python3 scripts/clean_frequency_corpus.py # YAP mode
python3 scripts/clean_frequency_corpus.py --mode heuristic # heuristic fallback
python3 scripts/clean_frequency_corpus.py --dry-run # preview only
python3 scripts/clean_frequency_corpus.py --resume # resume YAP from checkpoint
python3 scripts/clean_frequency_corpus.py --limit 1000 # process first N entries
Input: data/frequency_cache.json (raw he_50k.txt, 49999 entries)
Output: data/frequency_clean.json (filtered, prefix combos removed)
data/frequency_discarded.json (discarded entries with reason)
"""
from __future__ import annotations
import argparse
import json
import logging
import os
import sys
import time
from pathlib import Path
import requests
logger = logging.getLogger(__name__)
PROJECT_ROOT = Path(__file__).parent.parent
RAW_CACHE = PROJECT_ROOT / "data" / "frequency_cache.json"
CLEAN_CACHE = PROJECT_ROOT / "data" / "frequency_clean.json"
DISCARDED = PROJECT_ROOT / "data" / "frequency_discarded.json"
WORDS_JSON = PROJECT_ROOT / "data" / "words.json"
CHECKPOINT = PROJECT_ROOT / "data" / "_yap_checkpoint.json"
YAP_URL = os.environ.get("YAP_URL", "http://localhost:8000/yap/heb/joint")
YAP_TIMEOUT = 10
BATCH_SAVE_INTERVAL = 500
# --- YAP mode constants ---
# POS tags that indicate a prefix
PREFIX_POS = frozenset({"PREPOSITION", "CONJ", "DEF", "REL"})
# POS tags for the host word that make the combo a false positive
HOST_POS = frozenset({"NN", "NNP", "NNT", "PRP", "CD", "DT", "EX"})
# --- Heuristic mode constants ---
# Hebrew prefix combinations, longest first for greedy matching.
PREFIXES = [
# 4-char
"וכשמ",
"וכשב",
"וכשל",
"וכשה",
# 3-char
"וכש",
"ומה",
"ובה",
"וכה",
"ולה",
"ומש",
"ובש",
"וכב",
"ולב",
"ומב",
"וכל",
"ולכ",
"שבה",
"שמה",
# 2-char
"כש",
"מה",
"בה",
"כה",
"לה",
"מש",
"בש",
"וב",
"וה",
"וכ",
"ול",
"ומ",
"וש",
"כב",
"לב",
"מב",
"כל",
"לכ",
"שב",
"שה",
"שכ",
"של",
"שמ",
# 1-char
"ב",
"ה",
"ו",
"כ",
"ל",
"מ",
"ש",
]
MIN_REMAINDER_LEN = 2
def _load_known_forms(words_path: Path) -> set[str]:
"""Load all known ktiv_male forms from words.json."""
if not words_path.exists():
logger.warning("words.json not found at %s — no dictionary filter", words_path)
return set()
with open(words_path, encoding="utf-8") as f:
words = json.load(f)
known: set[str] = set()
for entry in words.values():
w = entry.get("word") or {}
if km := w.get("ktiv_male"):
known.add(km)
for form in entry.get("active_forms") or []:
if isinstance(form, dict) and (km2 := form.get("ktiv_male")):
known.add(km2)
for hp in entry.get("hufal_pual_forms") or []:
if isinstance(hp, dict) and (km3 := hp.get("ktiv_male")):
known.add(km3)
for field in ("noun_inflection", "preposition_inflection", "adjective_inflection"):
for inf_data in (entry.get(field) or {}).values():
if isinstance(inf_data, dict) and (km4 := inf_data.get("ktiv_male")):
known.add(km4)
logger.info("Loaded %d known dictionary forms from words.json", len(known))
return known
# ── YAP mode ──────────────────────────────────────────────────────────────
def query_yap(word: str) -> dict | None:
"""Send a single word to YAP and return the JSON response."""
payload = {"text": f"{word} "}
try:
resp = requests.post(YAP_URL, json=payload, timeout=YAP_TIMEOUT)
resp.raise_for_status()
return resp.json()
except requests.RequestException as e:
logger.warning("YAP request failed for '%s': %s", word, e)
return None
def is_prefix_combo_yap(yap_response: dict) -> tuple[bool, str]:
"""Check if any morphological analysis segments the word as prefix+host.
Conservative: if ANY analysis in the lattice shows prefix+host discard.
"""
lattice = yap_response.get("ma_lattice", "")
if not lattice:
return False, ""
arcs = []
for line in lattice.strip().split("\n"):
if not line.strip():
continue
parts = line.split("\t")
if len(parts) < 6:
continue
arcs.append(
{
"from": parts[0],
"to": parts[1],
"form": parts[2],
"lemma": parts[3],
"cpos": parts[4],
"pos": parts[5],
}
)
if len(arcs) < 2:
return False, ""
for a in arcs:
if a["cpos"] not in PREFIX_POS and a["pos"] not in PREFIX_POS:
continue
for b in arcs:
if b["from"] != a["to"]:
continue
if b["cpos"] in HOST_POS or b["pos"] in HOST_POS:
reason = f"{a['form']}({a['cpos']})+{b['form']}({b['cpos']})"
return True, reason
return False, ""
# ── Heuristic mode ────────────────────────────────────────────────────────
def find_prefix_decomposition(word: str, freq: dict[str, int]) -> tuple[str, str] | None:
"""Check if word is a prefix+higher-ranked-word combo (heuristic)."""
if len(word) <= MIN_REMAINDER_LEN:
return None
word_rank = freq.get(word, 999999)
for prefix in PREFIXES:
if not word.startswith(prefix):
continue
remainder = word[len(prefix) :]
if len(remainder) < MIN_REMAINDER_LEN:
continue
if remainder in freq and freq[remainder] < word_rank:
return prefix, remainder
return None
# ── Main ──────────────────────────────────────────────────────────────────
def main() -> None:
parser = argparse.ArgumentParser(description="Clean frequency corpus")
parser.add_argument("--mode", choices=["yap", "heuristic"], default="yap", help="Detection mode")
parser.add_argument("--dry-run", action="store_true", help="Show removals without saving")
parser.add_argument("--resume", action="store_true", help="Resume YAP mode from checkpoint")
parser.add_argument("--limit", type=int, default=0, help="Process only first N words (0=all)")
args = parser.parse_args()
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
if not RAW_CACHE.exists():
logger.error("Raw frequency cache not found: %s", RAW_CACHE)
sys.exit(1)
with open(RAW_CACHE, encoding="utf-8") as f:
raw_freq: dict[str, int] = json.load(f)
logger.info("Raw frequency corpus: %d entries", len(raw_freq))
# Sort by rank
words_by_rank = sorted(raw_freq.items(), key=lambda x: x[1])
if args.limit:
words_by_rank = words_by_rank[: args.limit]
if args.mode == "yap":
discarded_list = _run_yap_mode(words_by_rank, args)
else:
known_forms = _load_known_forms(WORDS_JSON)
discarded_list = _run_heuristic_mode(words_by_rank, raw_freq, known_forms)
kept_count = len(words_by_rank) - len(discarded_list)
logger.info("Done. Kept: %d, Discarded: %d", kept_count, len(discarded_list))
if args.dry_run:
logger.info("Dry run — no files written")
return
# Build clean frequency dict (re-ranked without gaps)
discarded_words = {d["word"] for d in discarded_list}
clean_freq: dict[str, int] = {}
new_rank = 1
for word, _rank in words_by_rank:
if word not in discarded_words:
clean_freq[word] = new_rank
new_rank += 1
with open(CLEAN_CACHE, "w", encoding="utf-8") as f:
json.dump(clean_freq, f, ensure_ascii=False)
logger.info("Clean frequency saved: %d entries → %s", len(clean_freq), CLEAN_CACHE)
with open(DISCARDED, "w", encoding="utf-8") as f:
json.dump(discarded_list, f, ensure_ascii=False, indent=2)
logger.info("Discarded entries saved: %d%s", len(discarded_list), DISCARDED)
def _run_yap_mode(
words_by_rank: list[tuple[str, int]],
args: argparse.Namespace,
) -> list[dict]:
"""Run YAP-based prefix detection."""
# Check YAP connectivity
test = query_yap("בדיקה")
if test is None:
logger.error("Cannot connect to YAP API at %s", YAP_URL)
sys.exit(1)
logger.info("YAP API connected")
# Load checkpoint if resuming
analyzed: dict[str, dict] = {}
if args.resume and CHECKPOINT.exists():
with open(CHECKPOINT, encoding="utf-8") as f:
analyzed = json.load(f)
logger.info("Resumed from checkpoint: %d words already analyzed", len(analyzed))
discarded_list: list[dict] = []
discarded_count = 0
kept_count = 0
error_count = 0
for i, (word, rank) in enumerate(words_by_rank):
# Already analyzed (from checkpoint)
if word in analyzed:
if analyzed[word]["discard"]:
discarded_count += 1
discarded_list.append({"word": word, "original_rank": rank, "reason": analyzed[word]["reason"]})
else:
kept_count += 1
continue
# Trivial: single char, ASCII, or too short
if len(word) <= 1 or word.isascii():
analyzed[word] = {"discard": False, "reason": ""}
kept_count += 1
continue
result = query_yap(word)
if result is None:
analyzed[word] = {"discard": False, "reason": "yap_error"}
error_count += 1
kept_count += 1
time.sleep(0.5)
continue
is_combo, reason = is_prefix_combo_yap(result)
analyzed[word] = {"discard": is_combo, "reason": reason}
if is_combo:
discarded_count += 1
discarded_list.append({"word": word, "original_rank": rank, "reason": reason})
if rank <= 500 or discarded_count <= 50:
logger.info(" DISCARD rank %5d: %s (%s)", rank, word, reason)
else:
kept_count += 1
# Rate limit
if i % 10 == 0:
time.sleep(0.01)
# Checkpoint
if (i + 1) % BATCH_SAVE_INTERVAL == 0:
if not args.dry_run:
with open(CHECKPOINT, "w", encoding="utf-8") as f:
json.dump(analyzed, f, ensure_ascii=False)
logger.info(
" [%d/%d] kept=%d discarded=%d errors=%d",
i + 1,
len(words_by_rank),
kept_count,
discarded_count,
error_count,
)
# Final checkpoint save
if not args.dry_run and CHECKPOINT.exists():
CHECKPOINT.unlink()
if error_count:
logger.warning("%d YAP errors encountered", error_count)
return discarded_list
def _run_heuristic_mode(
words_by_rank: list[tuple[str, int]],
raw_freq: dict[str, int],
known_forms: set[str],
) -> list[dict]:
"""Run heuristic prefix detection (no external dependencies)."""
discarded_list: list[dict] = []
discarded_count = 0
for word, rank in words_by_rank:
if len(word) <= 1 or word.isascii():
continue
# Known dictionary form → keep
if word in known_forms:
continue
result = find_prefix_decomposition(word, raw_freq)
if result is not None:
prefix, remainder = result
discarded_count += 1
reason = f"{prefix}+{remainder} (rank {raw_freq[remainder]})"
discarded_list.append({"word": word, "original_rank": rank, "reason": reason})
if rank <= 500 or discarded_count <= 50:
logger.info(" DISCARD rank %5d: %s = %s", rank, word, reason)
return discarded_list
if __name__ == "__main__":
main()

View file

@ -21,9 +21,10 @@ from pathlib import Path
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
PDF_URL = "https://books.nevo.engineer/opds/download/117/pdf/" PROJECT_ROOT = Path(__file__).resolve().parent.parent
PDF_URL = "" # Set to URL or local path of Coffin & Bolozky PDF
PDF_PATH = Path("/tmp/coffin_bolozky.pdf") PDF_PATH = Path("/tmp/coffin_bolozky.pdf")
OUTPUT_PATH = Path(__file__).parent / "verbs_input.txt" OUTPUT_PATH = PROJECT_ROOT / "verbs_input.txt"
# Pages to scan (Appendix 1) # Pages to scan (Appendix 1)
PAGE_START = 390 PAGE_START = 390
@ -31,24 +32,38 @@ PAGE_END = 411
# Binyan headings in Hebrew (vowelled and unvowelled variants) # Binyan headings in Hebrew (vowelled and unvowelled variants)
BINYAN_HEADINGS_HEB = [ BINYAN_HEADINGS_HEB = [
"פָּעַל", "פעל", "פָּעַל",
"נִפְעַל", "נפעל", "פעל",
"פִּעֵל", "פיעל", "נִפְעַל",
"פֻּעַל", "פועל", "נפעל",
"הִתְפַּעֵל", "התפעל", "פִּעֵל",
"הִפְעִיל", "הפעיל", "פיעל",
"הֻפְעַל", "הופעל", "פֻּעַל",
"פועל",
"הִתְפַּעֵל",
"התפעל",
"הִפְעִיל",
"הפעיל",
"הֻפְעַל",
"הופעל",
] ]
# Binyan heading → canonical name # Binyan heading → canonical name
BINYAN_CANONICAL = { BINYAN_CANONICAL = {
"פָּעַל": "Pa'al", "פעל": "Pa'al", "פָּעַל": "Pa'al",
"נִפְעַל": "Nif'al", "נפעל": "Nif'al", "פעל": "Pa'al",
"פִּעֵל": "Pi'el", "פיעל": "Pi'el", "נִפְעַל": "Nif'al",
"פֻּעַל": "Pu'al", "פועל": "Pu'al", "נפעל": "Nif'al",
"הִתְפַּעֵל": "Hitpa'el", "התפעל": "Hitpa'el", "פִּעֵל": "Pi'el",
"הִפְעִיל": "Hif'il", "הפעיל": "Hif'il", "פיעל": "Pi'el",
"הֻפְעַל": "Huf'al", "הופעל": "Huf'al", "פֻּעַל": "Pu'al",
"פועל": "Pu'al",
"הִתְפַּעֵל": "Hitpa'el",
"התפעל": "Hitpa'el",
"הִפְעִיל": "Hif'il",
"הפעיל": "Hif'il",
"הֻפְעַל": "Huf'al",
"הופעל": "Huf'al",
} }
# Passive binyan names — no infinitive, use 3ms past # Passive binyan names — no infinitive, use 3ms past
@ -156,15 +171,16 @@ FALLBACK_VERBS = """# Verb list from Coffin & Bolozky, A Reference Grammar of Mo
def _install_deps(): def _install_deps():
"""Install pymupdf and python-bidi if not available.""" """Install pymupdf and python-bidi if not available."""
try: try:
import fitz # noqa: F401
import bidi # noqa: F401 import bidi # noqa: F401
import fitz # noqa: F401
return True return True
except ImportError: except ImportError:
logger.info("Installing pymupdf and python-bidi …") logger.info("Installing pymupdf and python-bidi …")
import subprocess import subprocess
result = subprocess.run( result = subprocess.run(
[sys.executable, "-m", "pip", "install", [sys.executable, "-m", "pip", "install", "pymupdf", "python-bidi", "--break-system-packages", "-q"],
"pymupdf", "python-bidi", "--break-system-packages", "-q"],
capture_output=True, capture_output=True,
) )
if result.returncode != 0: if result.returncode != 0:
@ -182,6 +198,7 @@ def _download_pdf() -> bool:
logger.info(f"Downloading PDF from {PDF_URL}") logger.info(f"Downloading PDF from {PDF_URL}")
try: try:
import requests import requests
resp = requests.get(PDF_URL, timeout=120, stream=True) resp = requests.get(PDF_URL, timeout=120, stream=True)
resp.raise_for_status() resp.raise_for_status()
PDF_PATH.write_bytes(resp.content) PDF_PATH.write_bytes(resp.content)
@ -211,10 +228,7 @@ def _needs_bidi_fix(text: str) -> bool:
def _strip_nikkud(text: str) -> str: def _strip_nikkud(text: str) -> str:
return "".join( return "".join(ch for ch in unicodedata.normalize("NFD", text) if unicodedata.category(ch) != "Mn")
ch for ch in unicodedata.normalize("NFD", text)
if unicodedata.category(ch) != "Mn"
)
def _extract_from_pdf() -> list[tuple[str, str, str]]: def _extract_from_pdf() -> list[tuple[str, str, str]]:
@ -244,10 +258,9 @@ def _extract_from_pdf() -> list[tuple[str, str, str]]:
# Check if we need bidi correction # Check if we need bidi correction
test_text = "" test_text = ""
try: try:
for page_num in range(min(PAGE_START, doc.page_count - 1), for page_num in range(min(PAGE_START, doc.page_count - 1), min(PAGE_START + 3, doc.page_count)):
min(PAGE_START + 3, doc.page_count)):
test_text += doc[page_num].get_text("text") test_text += doc[page_num].get_text("text")
except Exception: except Exception: # noqa: S110
pass pass
use_bidi = _needs_bidi_fix(test_text) use_bidi = _needs_bidi_fix(test_text)
@ -259,6 +272,7 @@ def _extract_from_pdf() -> list[tuple[str, str, str]]:
return t return t
try: try:
from bidi.algorithm import get_display from bidi.algorithm import get_display
lines = t.split("\n") lines = t.split("\n")
fixed = [] fixed = []
for line in lines: for line in lines:
@ -274,7 +288,7 @@ def _extract_from_pdf() -> list[tuple[str, str, str]]:
for page_num in range(PAGE_START - 1, page_end): # fitz is 0-indexed for page_num in range(PAGE_START - 1, page_end): # fitz is 0-indexed
try: try:
raw = doc[page_num].get_text("text") raw = doc[page_num].get_text("text")
except Exception: except Exception: # noqa: S112
continue continue
text = fix_text(raw) text = fix_text(raw)
@ -316,9 +330,12 @@ def _extract_from_pdf() -> list[tuple[str, str, str]]:
heb_words = re.findall(r"[\u05d0-\u05ea\u05b0-\u05c7]{3,}", line) heb_words = re.findall(r"[\u05d0-\u05ea\u05b0-\u05c7]{3,}", line)
for w in heb_words: for w in heb_words:
stripped_w = _strip_nikkud(w) stripped_w = _strip_nikkud(w)
if current_binyan == "Pu'al" and stripped_w.startswith("פ"): if (
entries.append((current_binyan, "3ms", w)) current_binyan == "Pu'al"
elif current_binyan == "Huf'al" and stripped_w.startswith("ה"): and stripped_w.startswith("פ")
or current_binyan == "Huf'al"
and stripped_w.startswith("ה")
):
entries.append((current_binyan, "3ms", w)) entries.append((current_binyan, "3ms", w))
doc.close() doc.close()
@ -357,16 +374,20 @@ def _write_output(entries: list[tuple[str, str, str]]) -> None:
lines.append(form) lines.append(form)
OUTPUT_PATH.write_text("\n".join(lines) + "\n", encoding="utf-8") OUTPUT_PATH.write_text("\n".join(lines) + "\n", encoding="utf-8")
verb_count = sum(1 for l in lines if l and not l.startswith("#")) verb_count = sum(1 for ln in lines if ln and not ln.startswith("#"))
passive_count = sum(1 for l in lines if l.startswith("# 3ms:")) passive_count = sum(1 for ln in lines if ln.startswith("# 3ms:"))
logger.info(f"Written {verb_count} active verbs + {passive_count} passive (3ms) → {OUTPUT_PATH}") logger.info(f"Written {verb_count} active verbs + {passive_count} passive (3ms) → {OUTPUT_PATH}")
def _binyan_heb(name: str) -> str: def _binyan_heb(name: str) -> str:
mapping = { mapping = {
"Pa'al": "פָּעַל", "Nif'al": "נִפְעַל", "Pi'el": "פִּעֵל", "Pa'al": "פָּעַל",
"Pu'al": "פֻּעַל", "Hitpa'el": "הִתְפַּעֵל", "Nif'al": "נִפְעַל",
"Hif'il": "הִפְעִיל", "Huf'al": "הֻפְעַל", "Pi'el": "פִּעֵל",
"Pu'al": "פֻּעַל",
"Hitpa'el": "הִתְפַּעֵל",
"Hif'il": "הִפְעִיל",
"Huf'al": "הֻפְעַל",
} }
return mapping.get(name, name) return mapping.get(name, name)

919
scripts/validate_data.py Normal file
View file

@ -0,0 +1,919 @@
"""Standalone integrity validator for data/words.json.
Validates the unified Hebrew Flash Cards data against the schema defined in
SCHEMA.yaml. Each test prints PASS/FAIL with details on failures.
Usage:
python3 scripts/validate_data.py
python3 scripts/validate_data.py --verbose
python3 scripts/validate_data.py --test confusable_symmetric
"""
from __future__ import annotations
import argparse
import json
import re
import sys
import unicodedata
from pathlib import Path
from typing import Any
# ---------------------------------------------------------------------------
# Bootstrap: make project root importable so helpers.py is accessible
# ---------------------------------------------------------------------------
sys.path.insert(0, str(Path(__file__).parent.parent))
# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------
DATA_FILE = Path(__file__).parent.parent / "data" / "words.json"
HEBREW_CONSONANT_RANGE = (0x05D0, 0x05EA) # aleftav
VALID_PERSON_CODES: frozenset[str] = frozenset(
["inf", "1s", "1p", "2ms", "2fs", "2mp", "2fp", "3ms", "3fs", "3mp", "3fp", "3p", "ms", "fs", "mp", "fp"]
)
EMOJI_RE = re.compile(
r"[\U0001f600-\U0001f64f"
r"\U0001f300-\U0001f5ff"
r"\U0001f680-\U0001f6ff"
r"\U0001f1e0-\U0001f1ff"
r"\U00002702-\U000027b0"
r"\U0001f900-\U0001f9ff"
r"\U0001fa00-\U0001fa6f"
r"\U0001fa70-\U0001faff]"
)
# ---------------------------------------------------------------------------
# Result tracking
# ---------------------------------------------------------------------------
_failures: list[str] = []
_warnings: list[str] = []
_verbose: bool = False
def _pass(name: str) -> None:
print(f" PASS {name}")
def _fail(name: str, details: list[str]) -> None:
global _failures
_failures.append(name)
print(f" FAIL {name}")
for d in details:
print(f" {d}")
def _warn(name: str, details: list[str]) -> None:
global _warnings
_warnings.extend(details)
print(f" WARN {name}")
for d in details:
print(f" {d}")
def _verbose_print(msg: str) -> None:
if _verbose:
print(f" {msg}")
# ---------------------------------------------------------------------------
# Helper: load data
# ---------------------------------------------------------------------------
def load_data() -> dict[str, Any]:
"""Load words.json and return the parsed dict."""
if not DATA_FILE.exists():
print(f"ERROR: data file not found: {DATA_FILE}")
sys.exit(2)
with DATA_FILE.open(encoding="utf-8") as fh:
return json.load(fh)
def _is_hebrew_consonant(ch: str) -> bool:
"""Return True if ch is a Hebrew consonant (U+05D0..U+05EA).
Accepts multi-codepoint strings like 'שׁ' (shin + shin dot) by checking
only the first base character after NFD decomposition.
"""
normalized = unicodedata.normalize("NFD", ch)
# The first codepoint is the base consonant; the rest are combining marks.
base = normalized[0]
cp = ord(base)
return HEBREW_CONSONANT_RANGE[0] <= cp <= HEBREW_CONSONANT_RANGE[1]
# ---------------------------------------------------------------------------
# Individual tests
# ---------------------------------------------------------------------------
def test_required_fields(data: dict[str, Any]) -> None:
"""Every entry has word.nikkud, word.ktiv_male, slug, pos, meaning."""
name = "required_fields"
errors: list[str] = []
warn_details: list[str] = []
for key, entry in data.items():
word = entry.get("word")
if not isinstance(word, dict):
errors.append(f"[{key}] 'word' is missing or not a dict")
else:
if not word.get("nikkud"):
errors.append(f"[{key}] word.nikkud is missing or empty")
if not word.get("ktiv_male"):
errors.append(f"[{key}] word.ktiv_male is missing or empty")
if not entry.get("slug"):
errors.append(f"[{key}] 'slug' is missing or empty")
if not entry.get("pos"):
errors.append(f"[{key}] 'pos' is missing or empty")
if not entry.get("meaning"):
errors.append(f"[{key}] 'meaning' is missing or empty")
if entry.get("frequency") is None:
warn_details.append(f"[{key}] 'frequency' is null/missing")
if warn_details:
_warn("frequency_missing", warn_details[:20] if not _verbose else warn_details)
if len(warn_details) > 20 and not _verbose:
print(f" ... ({len(warn_details) - 20} more; use --verbose)")
if errors:
_fail(name, errors[:20] if not _verbose else errors)
if len(errors) > 20 and not _verbose:
print(f" ... ({len(errors) - 20} more; use --verbose)")
else:
_pass(name)
def test_root_format(data: dict[str, Any]) -> None:
"""root is a list of 2-5 Hebrew consonant chars, or an empty list."""
name = "root_format"
errors: list[str] = []
for key, entry in data.items():
root = entry.get("root")
if root is None:
errors.append(f"[{key}] 'root' key is absent (should be [] for rootless words)")
continue
if not isinstance(root, list):
errors.append(f"[{key}] 'root' is not a list: {root!r}")
continue
if len(root) == 0:
continue # rootless word — valid
if not (2 <= len(root) <= 5):
errors.append(f"[{key}] root has {len(root)} elements (expected 2-5): {root!r}")
continue
for ch in root:
# A root element may be multi-codepoint (e.g. 'שׁ' = shin + shin dot).
# Validate by checking the base consonant after NFD decomposition.
if not isinstance(ch, str) or not ch or not _is_hebrew_consonant(ch):
errors.append(f"[{key}] root char {ch!r} is not a Hebrew consonant (U+05D0..U+05EA)")
break
if errors:
_fail(name, errors[:20] if not _verbose else errors)
if len(errors) > 20 and not _verbose:
print(f" ... ({len(errors) - 20} more; use --verbose)")
else:
_pass(name)
def test_unique_slugs(data: dict[str, Any]) -> None:
"""All non-empty slugs are unique across entries — each pealim page is a distinct word."""
name = "unique_slugs"
seen: dict[str, list[str]] = {}
for key, entry in data.items():
slug = entry.get("slug")
if slug:
seen.setdefault(slug, []).append(key)
dups = {slug: keys for slug, keys in seen.items() if len(keys) > 1}
if dups:
errors = [f"slug={slug!r} shared by: {keys}" for slug, keys in dups.items()]
_fail(name, errors[:20] if not _verbose else errors)
if len(errors) > 20 and not _verbose:
print(f" ... ({len(errors) - 20} more; use --verbose)")
else:
_pass(name)
def test_no_duplicate_keys(_data: dict[str, Any]) -> None: # noqa: ARG001
"""JSON loaded without top-level key collisions.
Python's json.load silently keeps the last value on duplicate keys;
we re-parse with a custom object_pairs_hook to detect them.
The pre-parsed ``_data`` dict is not used here because we need to
re-read the raw file to catch duplicate keys that json.load would
silently merge.
"""
name = "no_duplicate_keys"
duplicates: list[str] = []
def _detect_dups(pairs: list[tuple[str, Any]]) -> dict[str, Any]:
d: dict[str, Any] = {}
for k, v in pairs:
if k in d:
duplicates.append(k)
d[k] = v
return d
with DATA_FILE.open(encoding="utf-8") as fh:
json.load(fh, object_pairs_hook=_detect_dups)
if duplicates:
_fail(name, [f"duplicate key: {k!r}" for k in duplicates])
else:
_pass(name)
def test_confusable_symmetric(data: dict[str, Any]) -> None:
"""If A lists B in confusable_group, B must list A."""
name = "confusable_symmetric"
errors: list[str] = []
for key, entry in data.items():
group = entry.get("confusable_group")
if not group:
continue
for other_key in group:
other = data.get(other_key)
if other is None:
errors.append(f"[{key}] confusable_group references non-existent key {other_key!r}")
continue
other_group = other.get("confusable_group") or []
if key not in other_group:
errors.append(f"[{key}] lists {other_key!r} as confusable, but {other_key!r} does not list {key!r}")
if errors:
_fail(name, errors[:20] if not _verbose else errors)
if len(errors) > 20 and not _verbose:
print(f" ... ({len(errors) - 20} more; use --verbose)")
else:
_pass(name)
def test_shared_roots_valid_keys(data: dict[str, Any]) -> None:
"""Every key in shared_roots must exist as a top-level key."""
name = "shared_roots_valid_keys"
errors: list[str] = []
for key, entry in data.items():
shared = entry.get("shared_roots")
if not shared:
continue
for ref_key in shared:
if ref_key not in data:
errors.append(f"[{key}] shared_roots references non-existent key {ref_key!r}")
if errors:
_fail(name, errors[:20] if not _verbose else errors)
if len(errors) > 20 and not _verbose:
print(f" ... ({len(errors) - 20} more; use --verbose)")
else:
_pass(name)
def test_unique_legacy_guids(data: dict[str, Any]) -> None:
"""No two entries share the same vocab_legacy_guid (excluding null).
Exception: entries that share the same word.nikkud value inherited the
same legacy Anki card (PoS homographs like חַד Particle vs Adjective).
These are tolerated the duplicate GUID is a known artefact of how
legacy GUIDs were generated from the nikkud word alone.
"""
name = "unique_legacy_guids"
seen: dict[str, list[str]] = {}
for key, entry in data.items():
guid = entry.get("vocab_legacy_guid")
if guid:
seen.setdefault(guid, []).append(key)
errors: list[str] = []
for guid, keys in seen.items():
if len(keys) <= 1:
continue
# Tolerate sharing if ALL entries with this GUID share the same word.nikkud
nikkud_values = {(data[k].get("word") or {}).get("nikkud") for k in keys}
if len(nikkud_values) == 1:
# Same nikkud -> inherited from same legacy card; tolerable
_verbose_print(
f"GUID {guid!r} shared by {len(keys)} entries with same nikkud ({next(iter(nikkud_values))!r}): {keys}"
)
continue
errors.append(f"guid={guid!r} shared by entries with DIFFERENT nikkud: {keys}")
if errors:
_fail(name, errors[:20] if not _verbose else errors)
if len(errors) > 20 and not _verbose:
print(f" ... ({len(errors) - 20} more; use --verbose)")
else:
_pass(name)
def test_no_noun_inflection_on_non_nouns(data: dict[str, Any]) -> None:
"""noun_inflection must be null if pos doesn't start with 'Noun'.
Explicit test case: 'גָּבוֹהַּ' (adjective) must NOT have noun_inflection.
"""
name = "no_noun_inflection_on_non_nouns"
errors: list[str] = []
for key, entry in data.items():
pos = entry.get("pos") or ""
noun_inf = entry.get("noun_inflection")
if not pos.startswith("Noun") and noun_inf is not None:
errors.append(f"[{key}] pos={pos!r} but noun_inflection is set")
_verbose_print(f"offending entry: {key!r}")
if errors:
_fail(name, errors[:20] if not _verbose else errors)
if len(errors) > 20 and not _verbose:
print(f" ... ({len(errors) - 20} more; use --verbose)")
else:
_pass(name)
def test_no_emoji_in_meaning(data: dict[str, Any]) -> None:
"""meaning field must not contain inline emoji characters."""
name = "no_emoji_in_meaning"
errors: list[str] = []
for key, entry in data.items():
meaning = entry.get("meaning") or ""
if EMOJI_RE.search(meaning):
errors.append(f"[{key}] meaning contains emoji: {meaning!r}")
if errors:
_fail(name, errors[:20] if not _verbose else errors)
if len(errors) > 20 and not _verbose:
print(f" ... ({len(errors) - 20} more; use --verbose)")
else:
_pass(name)
def test_example_sentences_contain_word(data: dict[str, Any]) -> None:
"""For entries with examples.vetted, the word.nikkud must appear in at least one sentence.
Uses nikkud (exact) matching, not stripped matching.
"""
name = "example_sentences_contain_word"
errors: list[str] = []
for key, entry in data.items():
examples = entry.get("examples")
if not examples:
continue
vetted = examples.get("vetted")
if not vetted:
continue
word_obj = entry.get("word") or {}
nikkud_word = word_obj.get("nikkud") or ""
if not nikkud_word:
continue
found = any(nikkud_word in (s.get("text") or "") for s in vetted)
if not found:
sentences_preview = [s.get("text", "") for s in vetted[:2]]
errors.append(
f"[{key}] word {nikkud_word!r} not found in any vetted sentence. Sentences: {sentences_preview!r}"
)
if errors:
_warn(name, errors[:20] if not _verbose else errors)
if len(errors) > 20 and not _verbose:
print(f" ... ({len(errors) - 20} more; use --verbose)")
_pass(name)
def test_cloze_offsets_valid(data: dict[str, Any]) -> None:
"""cloze_word_start/end must be within text bounds when present.
Null offsets are tolerated (and warned separately) because some sentences
contain only inflected/construct/plural forms that cannot be matched back
to the base nikkud or ktiv_male this is a data quality issue in
vetted_sentences.json, not a schema violation.
"""
name = "cloze_offsets_valid"
errors: list[str] = []
null_warn: list[str] = []
for key, entry in data.items():
examples = entry.get("examples")
if not examples:
continue
cloze = examples.get("cloze")
if not cloze:
continue
text = cloze.get("text") or ""
start = cloze.get("cloze_word_start")
end = cloze.get("cloze_word_end")
if start is None or end is None:
null_warn.append(f"[{key}] cloze present but cloze_word_start/end are null")
continue
text_len = len(text)
if not isinstance(start, int) or not isinstance(end, int):
errors.append(f"[{key}] cloze_word_start/end are not integers: {start!r}, {end!r}")
continue
if start < 0 or end < 0:
errors.append(f"[{key}] cloze offsets are negative: start={start}, end={end}")
continue
if start >= end:
errors.append(f"[{key}] cloze start >= end: start={start}, end={end}")
continue
if end > text_len:
errors.append(f"[{key}] cloze end={end} exceeds text length={text_len}: {text!r}")
if null_warn:
_warn(f"{name}_null_offsets", null_warn[:20] if not _verbose else null_warn)
if len(null_warn) > 20 and not _verbose:
print(f" ... ({len(null_warn) - 20} more; use --verbose)")
if errors:
_fail(name, errors[:20] if not _verbose else errors)
if len(errors) > 20 and not _verbose:
print(f" ... ({len(errors) - 20} more; use --verbose)")
else:
_pass(name)
def test_hufal_pual_only_on_hifil_piel(data: dict[str, Any]) -> None:
"""hufal_pual_forms must only be set for Hif'il or Pi'el verbs."""
name = "hufal_pual_only_on_hifil_piel"
errors: list[str] = []
for key, entry in data.items():
conj = entry.get("conjugation")
if not conj:
continue
hufal_pual = conj.get("hufal_pual_forms")
if hufal_pual is None:
continue
binyan = conj.get("binyan") or ""
binyan_lower = binyan.lower()
if "hif" not in binyan_lower and "pi" not in binyan_lower:
errors.append(f"[{key}] hufal_pual_forms is set but binyan={binyan!r} (expected Hif'il or Pi'el)")
if errors:
_fail(name, errors[:20] if not _verbose else errors)
if len(errors) > 20 and not _verbose:
print(f" ... ({len(errors) - 20} more; use --verbose)")
else:
_pass(name)
def test_confusable_group_shares_ktiv_male(data: dict[str, Any]) -> None:
"""All entries in a confusable_group must share the same word.ktiv_male."""
name = "confusable_group_shares_ktiv_male"
errors: list[str] = []
for key, entry in data.items():
group = entry.get("confusable_group")
if not group:
continue
my_word = entry.get("word") or {}
my_ktiv = my_word.get("ktiv_male")
if not my_ktiv:
continue
for other_key in group:
other = data.get(other_key)
if not other:
continue # already caught by confusable_symmetric
other_word = other.get("word") or {}
other_ktiv = other_word.get("ktiv_male")
if other_ktiv and other_ktiv != my_ktiv:
errors.append(
f"[{key}] ktiv_male={my_ktiv!r} but confusable member {other_key!r} has ktiv_male={other_ktiv!r}"
)
if errors:
_fail(name, errors[:20] if not _verbose else errors)
if len(errors) > 20 and not _verbose:
print(f" ... ({len(errors) - 20} more; use --verbose)")
else:
_pass(name)
def test_confusables_guid(data: dict[str, Any]) -> None:
"""confusables_guid must be consistent within each confusable_group.
Rules:
- If confusable_group is non-null, confusables_guid must be non-null.
- If confusable_group is null, confusables_guid must be null.
- All entries that share a confusable_group must share the same
confusables_guid value.
"""
name = "confusables_guid"
errors: list[str] = []
for key, entry in data.items():
group = entry.get("confusable_group")
guid = entry.get("confusables_guid")
if group and not guid:
errors.append(f"[{key}] has confusable_group but confusables_guid is null/missing")
elif not group and guid is not None:
errors.append(f"[{key}] has confusables_guid={guid!r} but confusable_group is null")
if not group or not guid:
continue
for other_key in group:
other = data.get(other_key)
if not other:
continue # already caught by confusable_symmetric
other_guid = other.get("confusables_guid")
if other_guid != guid:
errors.append(
f"[{key}] confusables_guid={guid!r} but confusable member "
f"{other_key!r} has confusables_guid={other_guid!r}"
)
if errors:
_fail(name, errors[:20] if not _verbose else errors)
if len(errors) > 20 and not _verbose:
print(f" ... ({len(errors) - 20} more; use --verbose)")
else:
_pass(name)
def test_conjugation_form_guids(data: dict[str, Any]) -> None:
"""Every conjugation form must have a guid or guid_candidates, and GUIDs must be unique within a verb.
Rules:
- Each form in active_forms and hufal_pual_forms must have a non-null ``guid``
OR a non-empty ``guid_candidates`` list (used for present tense, past 3p, and
1st person forms where multiple GUIDs are possible).
- No two forms within the same verb (across both form lists) may share a GUID.
"""
name = "conjugation_form_guids"
errors: list[str] = []
warnings: list[str] = []
for key, entry in data.items():
conj = entry.get("conjugation")
if not conj:
continue
seen_guids: dict[str, str] = {} # guid -> "form_list_key[person]" label
for form_list_key in ("active_forms", "hufal_pual_forms"):
forms = conj.get(form_list_key)
if not forms:
continue
for form in forms:
person = form.get("person", "?")
label = f"{form_list_key}[{person}]"
guid = form.get("guid")
guid_candidates = form.get("guid_candidates")
if not guid and not guid_candidates:
# New forms from rescrape use deterministic fallback — warn, don't fail
warnings.append(f"[{key}] {label}: missing or null 'guid' and no 'guid_candidates'")
continue
if guid:
if guid in seen_guids:
errors.append(f"[{key}] {label}: guid={guid!r} duplicates {seen_guids[guid]}")
else:
seen_guids[guid] = label
elif guid_candidates:
for candidate in guid_candidates:
if candidate in seen_guids:
errors.append(
f"[{key}] {label}: guid_candidate={candidate!r} duplicates {seen_guids[candidate]}"
)
else:
seen_guids[candidate] = label
if warnings:
_warn(name + "_missing", [f"{len(warnings)} forms missing guid (deterministic fallback used)"])
if errors:
_fail(name, errors[:20] if not _verbose else errors)
if len(errors) > 20 and not _verbose:
print(f" ... ({len(errors) - 20} more; use --verbose)")
else:
_pass(name)
def test_conjugation_person_codes(data: dict[str, Any]) -> None:
"""active_forms person codes must be from the defined valid set."""
name = "conjugation_person_codes"
errors: list[str] = []
for key, entry in data.items():
conj = entry.get("conjugation")
if not conj:
continue
for form_list_key in ("active_forms", "hufal_pual_forms"):
forms = conj.get(form_list_key)
if not forms:
continue
for form in forms:
person = form.get("person")
if person not in VALID_PERSON_CODES:
errors.append(f"[{key}] {form_list_key}: invalid person code {person!r}")
if errors:
_fail(name, errors[:20] if not _verbose else errors)
if len(errors) > 20 and not _verbose:
print(f" ... ({len(errors) - 20} more; use --verbose)")
else:
_pass(name)
def test_no_stripped_form_sentence_collisions(data: dict[str, Any]) -> None:
"""For confusable words, their example sentences must not contain the wrong
homograph's nikkud word.
Specifically: if A and B are confusable (same ktiv_male), A's vetted
sentences must not contain B's nikkud form, and vice versa.
"""
name = "no_stripped_form_sentence_collisions"
errors: list[str] = []
for key, entry in data.items():
group = entry.get("confusable_group")
if not group:
continue
examples = entry.get("examples")
if not examples:
continue
vetted = examples.get("vetted")
if not vetted:
continue
my_word = entry.get("word") or {}
my_nikkud = my_word.get("nikkud") or ""
my_texts = [s.get("text") or "" for s in vetted]
for other_key in group:
other = data.get(other_key)
if not other:
continue
other_word = other.get("word") or {}
other_nikkud = other_word.get("nikkud") or ""
if not other_nikkud or other_nikkud == my_nikkud:
continue # same nikkud homographs are ok (we can't distinguish by nikkud)
for text in my_texts:
if other_nikkud in text:
errors.append(f"[{key}] sentence contains wrong homograph {other_nikkud!r}: {text!r}")
_verbose_print(f" my word: {my_nikkud!r}, wrong form: {other_nikkud!r}")
break # one error per (key, other_key) pair is enough
if errors:
_warn(name, errors[:20] if not _verbose else errors)
if len(errors) > 20 and not _verbose:
print(f" ... ({len(errors) - 20} more; use --verbose)")
_pass(name)
def test_no_shared_confusable_examples(data: dict[str, Any]) -> None:
"""Within each confusable group, no two entries should share the same set of vetted sentence texts.
Shared examples indicate the deduplication step in epub_examples.py
failed to assign examples to only the highest-frequency member.
"""
name = "no_shared_confusable_examples"
errors: list[str] = []
from collections import defaultdict
# Build confusable group map
group_map: dict[tuple[str, ...], list[str]] = defaultdict(list)
for key, entry in data.items():
cg = entry.get("confusable_group")
if cg:
group_id = tuple(sorted(cg))
group_map[group_id].append(key)
for _group_id, members in group_map.items():
if len(members) < 2:
continue
# Collect sentence text sets per member
text_sets: dict[str, frozenset[str]] = {}
for key in members:
vetted = (data[key].get("examples") or {}).get("vetted") or []
texts = frozenset(e.get("text", "") for e in vetted)
if texts:
text_sets[key] = texts
# Check for identical sets
seen: dict[frozenset[str], str] = {}
for key, texts in text_sets.items():
if texts in seen:
meaning_a = (data[seen[texts]].get("meaning") or "")[:30]
meaning_b = (data[key].get("meaning") or "")[:30]
errors.append(
f"{seen[texts]} ({meaning_a}) and {key} ({meaning_b}) share {len(texts)} identical example(s)"
)
else:
seen[texts] = key
if errors:
_fail(name, errors[:20] if not _verbose else errors)
if len(errors) > 20 and not _verbose:
print(f" ... ({len(errors) - 20} more; use --verbose)")
else:
_pass(name)
def test_no_hebrew_in_meaning(data: dict[str, Any]) -> None:
"""English meanings must not contain bare Hebrew text (spoils the card)."""
name = "no_hebrew_in_meaning"
errors: list[str] = []
hebrew_re = re.compile(r"[\u05D0-\u05EA]")
for key, entry in data.items():
meaning = entry.get("meaning") or ""
# Apply same cleaning pipeline as apkg_builder
cleaned = re.sub(r"[\u0590-\u05FF][\u0590-\u05FF\u0591-\u05C7\s\-]*", "", meaning)
cleaned = re.sub(r"\s{2,}", " ", cleaned).strip(", ;:")
if hebrew_re.search(cleaned):
errors.append(f"[{key}] meaning still contains Hebrew after cleaning: {cleaned!r}")
if errors:
_fail(name, errors[:20] if not _verbose else errors)
if len(errors) > 20 and not _verbose:
print(f" ... ({len(errors) - 20} more; use --verbose)")
else:
_pass(name)
def test_mishkal_consistency(data: dict[str, Any]) -> None:
"""mishkal_hebrew must match mishkal via _mishkal_to_hebrew conversion."""
name = "mishkal_consistency"
errors: list[str] = []
try:
from pealim_detail_scrape import _mishkal_to_hebrew
except ImportError:
_warn(name, ["Could not import _mishkal_to_hebrew — skipping"])
return
for key, entry in data.items():
for infl_key in ("noun_inflection", "adjective_inflection"):
infl = entry.get(infl_key)
if not infl:
continue
mishkal_eng = infl.get("mishkal") or ""
mishkal_heb = infl.get("mishkal_hebrew") or ""
if mishkal_eng and mishkal_heb:
expected = _mishkal_to_hebrew(mishkal_eng) or ""
if expected and expected != mishkal_heb:
errors.append(f"[{key}] {infl_key}: {mishkal_eng}{mishkal_heb} (expected {expected})")
if mishkal_heb and not mishkal_eng:
errors.append(f"[{key}] {infl_key}: has mishkal_hebrew but no mishkal")
if errors:
_fail(name, errors[:20] if not _verbose else errors)
if len(errors) > 20 and not _verbose:
print(f" ... ({len(errors) - 20} more; use --verbose)")
else:
_pass(name)
# ---------------------------------------------------------------------------
# Stats summary
# ---------------------------------------------------------------------------
def print_stats(data: dict[str, Any]) -> None:
"""Print a summary of dataset coverage metrics."""
total = len(data)
with_conj = sum(1 for e in data.values() if e.get("conjugation"))
with_noun_inf = sum(1 for e in data.values() if e.get("noun_inflection"))
with_vetted = sum(1 for e in data.values() if (e.get("examples") or {}).get("vetted"))
with_cloze = sum(1 for e in data.values() if (e.get("examples") or {}).get("cloze"))
with_image = sum(1 for e in data.values() if e.get("image"))
with_emoji = sum(1 for e in data.values() if e.get("emoji"))
with_guid = sum(1 for e in data.values() if e.get("vocab_legacy_guid"))
in_confusable = sum(1 for e in data.values() if e.get("confusable_group"))
with_shared_roots = sum(1 for e in data.values() if e.get("shared_roots"))
with_mishkal = sum(
1
for e in data.values()
if (e.get("noun_inflection") or {}).get("mishkal") or (e.get("adjective_inflection") or {}).get("mishkal")
)
print()
print("Stats Summary")
print("" * 42)
print(f" Total entries: {total:>6}")
print(f" With conjugation data: {with_conj:>6}")
print(f" With noun_inflection: {with_noun_inf:>6}")
print(f" With mishkal: {with_mishkal:>6}")
print(f" With vetted examples: {with_vetted:>6}")
print(f" With cloze examples: {with_cloze:>6}")
print(f" With images: {with_image:>6}")
print(f" With emoji: {with_emoji:>6}")
print(f" With legacy GUIDs: {with_guid:>6}")
print(f" In confusable groups: {in_confusable:>6}")
print(f" With shared roots: {with_shared_roots:>6}")
# ---------------------------------------------------------------------------
# Test registry
# ---------------------------------------------------------------------------
ALL_TESTS: dict[str, Any] = {
"required_fields": test_required_fields,
"root_format": test_root_format,
"unique_slugs": test_unique_slugs,
"no_duplicate_keys": test_no_duplicate_keys,
"confusable_symmetric": test_confusable_symmetric,
"shared_roots_valid_keys": test_shared_roots_valid_keys,
"unique_legacy_guids": test_unique_legacy_guids,
"no_noun_inflection_on_non_nouns": test_no_noun_inflection_on_non_nouns,
"no_emoji_in_meaning": test_no_emoji_in_meaning,
"example_sentences_contain_word": test_example_sentences_contain_word,
"cloze_offsets_valid": test_cloze_offsets_valid,
"hufal_pual_only_on_hifil_piel": test_hufal_pual_only_on_hifil_piel,
"confusable_group_shares_ktiv_male": test_confusable_group_shares_ktiv_male,
"confusables_guid": test_confusables_guid,
"conjugation_form_guids": test_conjugation_form_guids,
"conjugation_person_codes": test_conjugation_person_codes,
"no_stripped_form_sentence_collisions": test_no_stripped_form_sentence_collisions,
"no_shared_confusable_examples": test_no_shared_confusable_examples,
"no_hebrew_in_meaning": test_no_hebrew_in_meaning,
"mishkal_consistency": test_mishkal_consistency,
}
# ---------------------------------------------------------------------------
# Entry point
# ---------------------------------------------------------------------------
def main() -> None:
global _verbose
parser = argparse.ArgumentParser(description="Validate data/words.json against the Hebrew Flash Cards schema.")
parser.add_argument(
"--verbose",
"-v",
action="store_true",
help="Print full details for all failures (not just first 20).",
)
parser.add_argument(
"--test",
metavar="NAME",
help=f"Run a single test by name. Available: {', '.join(ALL_TESTS)}",
)
args = parser.parse_args()
_verbose = args.verbose
data = load_data()
# Select tests to run
if args.test:
if args.test not in ALL_TESTS:
print(f"ERROR: unknown test {args.test!r}. Available: {', '.join(ALL_TESTS)}")
sys.exit(2)
tests_to_run = {args.test: ALL_TESTS[args.test]}
else:
tests_to_run = ALL_TESTS
print(f"Validating {DATA_FILE} ({len(data)} entries)")
print("" * 60)
# no_duplicate_keys needs the file, not the pre-parsed dict
for test_fn in tests_to_run.values():
test_fn(data)
# Summary
if not args.test:
print_stats(data)
print()
print("" * 60)
if _warnings:
print(f" Warnings : {len(_warnings)}")
if _failures:
print(f" FAILED: {len(_failures)} test(s): {', '.join(_failures)}")
sys.exit(1)
else:
print(f" All {len(tests_to_run)} test(s) passed.")
sys.exit(0)
if __name__ == "__main__":
main()

198
sentence_difficulty.py Normal file
View file

@ -0,0 +1,198 @@
"""Sentence difficulty scoring by context-word frequency.
Scores sentences by the median frequency rank of context words
(excluding the cloze target). Lower score = easier sentence.
Used by epub_examples.py to select the best cloze sentence.
"""
from __future__ import annotations
from statistics import median
import helpers
import nikkud_to_ktiv_male
DEFAULT_RANK = 50_000
# Hebrew prefix consonants for ktiv_male prefix stripping (tier 5)
_KM_PREFIX_CHARS = set("בהוכלמשע")
# Punctuation to strip from tokens
_PUNCT = set('.,!?;:"\'"״׳–—()[]{}')
# Maqaf (Hebrew hyphen) — splits tokens
_MAQAF = "־"
def build_nikkud_map(words: dict) -> dict[str, str]:
"""Build nikkud→ktiv_male lookup from words.json.
Indexes: headwords, conjugation forms (active, passive, infinitive,
reference_form), noun inflections (singular, plural, construct,
pronominal suffixes), and adjective inflections (ms/fs/mp/fp).
Args:
words: The full words.json dict keyed by unique_key.
Returns:
Dict mapping nikkud form to ktiv_male string.
When collisions occur, last-write wins (acceptable for frequency lookup).
"""
nmap: dict[str, str] = {}
def _add(nikkud: str | None, ktiv_male: str | None) -> None:
if nikkud and ktiv_male:
nmap[nikkud] = ktiv_male
for entry in words.values():
word = entry.get("word") or {}
_add(word.get("nikkud"), word.get("ktiv_male"))
# Conjugation forms
conj = entry.get("conjugation") or {}
for form_entry in conj.get("active_forms") or []:
form = form_entry.get("form") or {}
_add(form.get("nikkud"), form.get("ktiv_male"))
for form_entry in conj.get("hufal_pual_forms") or []:
form = form_entry.get("form") or {}
_add(form.get("nikkud"), form.get("ktiv_male"))
inf = conj.get("infinitive") or {}
_add(inf.get("nikkud"), inf.get("ktiv_male"))
ref = conj.get("reference_form") or {}
_add(ref.get("nikkud"), ref.get("ktiv_male"))
# Noun inflection forms
noun = entry.get("noun_inflection") or {}
for field in ("singular", "plural", "construct_singular", "construct_plural"):
sub = noun.get(field) or {}
nikkud_form = sub.get("nikkud")
ktiv = sub.get("ktiv_male")
_add(nikkud_form, ktiv)
# Index construct forms without maqaf
if nikkud_form and nikkud_form.endswith("־") and ktiv:
_add(nikkud_form[:-1], ktiv)
pronominal = noun.get("pronominal_suffixes") or {}
for sub in pronominal.values():
if isinstance(sub, dict):
_add(sub.get("nikkud"), sub.get("ktiv_male"))
# Adjective inflection forms
adj = entry.get("adjective_inflection") or {}
for field in ("ms", "fs", "mp", "fp"):
sub = adj.get(field) or {}
_add(sub.get("nikkud"), sub.get("ktiv_male"))
return nmap
def _resolve_token_frequency(
token: str,
nikkud_map: dict[str, str],
nikkud_index: dict,
freq_data: dict[str, int],
) -> int:
"""Resolve a nikkud sentence token to its frequency rank.
Uses a 5-tier pipeline:
1. Known mapping (nikkud_map from words.json)
2. Nikkud prefix stripping (epub_examples.try_strip_prefix)
3. Academy rules converter (nikkud_to_ktiv_male.convert)
4. strip_nikkud fallback (helpers.strip_nikkud)
5. Ktiv_male prefix stripping on the converted form
Returns:
Frequency rank (1 = most common). DEFAULT_RANK (50000) if not found.
"""
# Tier 1: Direct lookup in nikkud→ktiv_male map
ktiv = nikkud_map.get(token)
if ktiv and ktiv in freq_data:
return freq_data[ktiv]
# Tier 2: Nikkud prefix stripping → resolve remainder via nikkud_map
from epub_examples import try_strip_prefix
prefix_hits = try_strip_prefix(token, nikkud_index)
for _unique_key, _match_type, matched_remainder in prefix_hits:
remainder_ktiv = nikkud_map.get(matched_remainder)
if remainder_ktiv and remainder_ktiv in freq_data:
return freq_data[remainder_ktiv]
# Tier 3: Academy rules converter
converted = nikkud_to_ktiv_male.convert(token)
if converted in freq_data:
return freq_data[converted]
# Tier 4: strip_nikkud fallback
stripped = helpers.strip_nikkud(token)
if stripped != converted and stripped in freq_data:
return freq_data[stripped]
# Tier 5: Ktiv_male prefix stripping on converted/stripped form
for form in (converted, stripped):
for prefix_len in (1, 2):
if len(form) > prefix_len + 1:
prefix = form[:prefix_len]
if all(c in _KM_PREFIX_CHARS for c in prefix):
stem = form[prefix_len:]
if stem in freq_data:
return freq_data[stem]
return DEFAULT_RANK
def score_sentence(
text: str,
target_start: int,
target_end: int,
nikkud_map: dict[str, str],
nikkud_index: dict,
freq_data: dict[str, int],
) -> int:
"""Score a sentence by median frequency rank of context words.
Args:
text: The full sentence text (with nikkud).
target_start: Character offset where the cloze target word starts.
target_end: Character offset where the cloze target word ends.
nikkud_map: nikkudktiv_male mapping from build_nikkud_map().
nikkud_index: nikkud index from epub_examples._build_nikkud_index().
freq_data: Frequency dict from frequency_lookup.get_freq_data().
Returns:
Median frequency rank of context tokens (int). Lower = easier.
Returns DEFAULT_RANK if no scoreable context tokens.
"""
# Tokenize: split on whitespace, then split on maqaf
raw_tokens = text.split()
tokens_with_pos: list[tuple[str, int, int]] = []
pos = 0
for raw in raw_tokens:
start = text.index(raw, pos)
# Split on maqaf
parts = raw.split(_MAQAF)
sub_pos = start
for part in parts:
if part:
tokens_with_pos.append((part, sub_pos, sub_pos + len(part)))
sub_pos += len(part) + 1 # +1 for maqaf
pos = start + len(raw)
# Filter: exclude target word, strip punctuation, skip short tokens
context_ranks: list[int] = []
for token, tok_start, tok_end in tokens_with_pos:
# Exclude target word by overlap with char offsets
if tok_start < target_end and tok_end > target_start:
continue
# Strip punctuation from edges
cleaned = token.strip("".join(_PUNCT))
if len(cleaned) < 2:
continue
rank = _resolve_token_frequency(cleaned, nikkud_map, nikkud_index, freq_data)
context_ranks.append(rank)
if not context_ranks:
return DEFAULT_RANK
return int(median(context_ranks))

View file

@ -1,31 +0,0 @@
#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup
word = 'אבל'
url = f'https://www.pealim.com/search/?q={word}'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
try:
response = requests.get(url, headers=headers, timeout=10)
print(f'Status: {response.status_code}')
soup = BeautifulSoup(response.content, 'html.parser')
# Debug: check what we find
word_elem = soup.find('h1', class_='word-title')
pos_elem = soup.find('span', class_='pos')
definition_elem = soup.find('div', class_='definition')
print(f'word_elem found: {word_elem is not None}')
print(f'pos_elem found: {pos_elem is not None}')
print(f'definition_elem found: {definition_elem is not None}')
print('\n--- HTML snippet (first 3000 chars) ---')
print(soup.prettify()[:3000])
except Exception as e:
print(f'Error: {e}')
import traceback
traceback.print_exc()

0
tests/__init__.py Normal file
View file

246
tests/test_apkg_builder.py Normal file
View file

@ -0,0 +1,246 @@
"""Unit tests for apkg_builder — Sprint 15 learnings.
Tests cover: cloze prefix preservation, Hebrew spoiler stripping from English
meanings, PoS exact matching, gender field population, and mishkal data integrity.
"""
import json
import re
import sys
from pathlib import Path
import pytest
# Ensure project root is on path
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from apkg_builder import _categorize_pos, _cloze_prefix_len
# ---------------------------------------------------------------------------
# Cloze prefix preservation
# ---------------------------------------------------------------------------
class TestClozePrefix:
"""_cloze_prefix_len must detect Hebrew prefix letters before the word."""
def test_single_prefix_bet(self):
# בַּתּוֹר = bet + patach + tor
assert _cloze_prefix_len("בַּתּוֹר", "תּוֹר") > 0
def test_single_prefix_lamed(self):
# לַמֶּלֶךְ = lamed + patach + melech
assert _cloze_prefix_len("לַמֶּלֶךְ", "מֶּלֶךְ") > 0
def test_two_consonant_prefix(self):
# שֶׁבַּתּוֹר = shin + bet + tor (two prefix letters)
token = "שֶׁבַּתּוֹר"
word = "תּוֹר"
prefix_len = _cloze_prefix_len(token, word)
assert prefix_len > 0
assert token[prefix_len:].startswith(word)
def test_no_prefix_direct_match(self):
# Word appears at start — no prefix
assert _cloze_prefix_len("תּוֹר", "תּוֹר") == 0
def test_empty_inputs(self):
assert _cloze_prefix_len("", "תּוֹר") == 0
assert _cloze_prefix_len("בַּתּוֹר", "") == 0
assert _cloze_prefix_len("", "") == 0
def test_non_prefix_letter_returns_zero(self):
# If the "prefix" chars aren't valid prefix letters, return 0
# 'ת' is not in _PREFIX_LETTERS (בהוכלמש)
assert _cloze_prefix_len("תַּתּוֹר", "תּוֹר") == 0
def test_prefix_preserves_nikkud(self):
# Verify that prefix_len includes nikkud marks
token = "בַּתּוֹר"
word = "תּוֹר"
prefix_len = _cloze_prefix_len(token, word)
prefix = token[:prefix_len]
# Prefix should contain at least bet + nikkud mark(s)
base_letters = [c for c in prefix if "\u05d0" <= c <= "\u05ea"]
assert base_letters == ["ב"]
# ---------------------------------------------------------------------------
# PoS exact matching (no substring collisions)
# ---------------------------------------------------------------------------
class TestCategorizePos:
"""_categorize_pos must not let 'Pronoun' match 'Noun'."""
def test_noun_exact(self):
assert _categorize_pos("Noun") == "Noun"
def test_pronoun_is_other(self):
assert _categorize_pos("Pronoun") == "Other"
def test_verb_exact(self):
assert _categorize_pos("Verb") == "Verb"
def test_noun_with_dash(self):
assert _categorize_pos("Noun masculine") == "Noun"
def test_adjective(self):
assert _categorize_pos("Adjective") == "Adjective"
def test_conjunction_is_other(self):
assert _categorize_pos("Conjunction") == "Other"
# ---------------------------------------------------------------------------
# Hebrew spoiler stripping from English meanings
# ---------------------------------------------------------------------------
class TestHebrewSpoilerStripping:
"""English meanings must not contain Hebrew text (spoils the card)."""
# Use the same regex from apkg_builder.py
HEBREW_STRIP_RE = re.compile(r"[\u0590-\u05FF][\u0590-\u05FF\u0591-\u05C7\s\-]*")
@staticmethod
def _strip_hebrew(meaning: str) -> str:
"""Replicate the meaning cleaning pipeline from build_vocab_deck."""
meaning = re.sub(r"[\u0590-\u05FF][\u0590-\u05FF\u0591-\u05C7\s\-]*", "", meaning)
meaning = re.sub(r"[;:]\s*—", "", meaning)
meaning = re.sub(r";\s*:", ";", meaning)
return re.sub(r"\s{2,}", " ", meaning).strip(", ;:")
def test_pure_english_unchanged(self):
assert self._strip_hebrew("to eat, to consume") == "to eat, to consume"
def test_hebrew_word_removed(self):
result = self._strip_hebrew("to eat; אכל")
assert "אכל" not in result
def test_hebrew_with_nikkud_removed(self):
result = self._strip_hebrew("tall; גָּבוֹהַּ")
assert "גָּבוֹהַּ" not in result
assert "tall" in result
def test_no_residual_hebrew_in_real_data(self):
"""Scan actual words.json — no meaning should contain Hebrew after stripping."""
words_path = Path(__file__).resolve().parent.parent / "data" / "words.json"
if not words_path.exists():
pytest.skip("words.json not available")
with open(words_path, encoding="utf-8") as f:
words = json.load(f)
# The regex used in apkg_builder
hebrew_re = re.compile(r"[\u05D0-\u05EA]")
spoilers = []
for key, entry in words.items():
meaning = entry.get("meaning") or ""
cleaned = self._strip_hebrew(meaning)
if hebrew_re.search(cleaned):
spoilers.append(f"{key}: {cleaned!r}")
assert not spoilers, f"Hebrew found in {len(spoilers)} meanings after stripping: {spoilers[:5]}"
# ---------------------------------------------------------------------------
# Gender field for nouns (words.json data integrity)
# ---------------------------------------------------------------------------
class TestGenderDataIntegrity:
"""Nouns with noun_inflection should have gender populated."""
@pytest.fixture()
def words(self):
words_path = Path(__file__).resolve().parent.parent / "data" / "words.json"
if not words_path.exists():
pytest.skip("words.json not available")
with open(words_path, encoding="utf-8") as f:
return json.load(f)
def test_nouns_have_gender(self, words):
"""Nouns with noun_inflection should have a valid gender."""
missing = []
for key, entry in words.items():
pos = entry.get("pos") or ""
ni = entry.get("noun_inflection")
if pos.startswith("Noun") and ni:
gender = ni.get("gender") or ""
if gender not in ("masculine", "feminine", "masculine and feminine"):
missing.append(f"{key}: gender={gender!r}")
# Allow up to 7% missing (loan words, compound words, etc.)
noun_count = sum(
1 for e in words.values() if (e.get("pos") or "").startswith("Noun") and e.get("noun_inflection")
)
if noun_count > 0:
pct_missing = len(missing) / noun_count
assert pct_missing < 0.07, f"{len(missing)}/{noun_count} nouns missing gender: {missing[:10]}"
# ---------------------------------------------------------------------------
# Mishkal data integrity
# ---------------------------------------------------------------------------
class TestMishkalIntegrity:
"""Validate mishkal data consistency in words.json."""
@pytest.fixture()
def words(self):
words_path = Path(__file__).resolve().parent.parent / "data" / "words.json"
if not words_path.exists():
pytest.skip("words.json not available")
with open(words_path, encoding="utf-8") as f:
return json.load(f)
def test_mishkal_hebrew_matches_english(self, words):
"""If mishkal and mishkal_hebrew are both set, they should correspond via _mishkal_to_hebrew."""
from pealim_detail_scrape import _mishkal_to_hebrew
mismatches = []
for key, entry in words.items():
for infl_key in ("noun_inflection", "adjective_inflection"):
infl = entry.get(infl_key)
if not infl:
continue
mishkal_eng = infl.get("mishkal") or ""
mishkal_heb = infl.get("mishkal_hebrew") or ""
if mishkal_eng and mishkal_heb:
expected = _mishkal_to_hebrew(mishkal_eng) or ""
if expected and expected != mishkal_heb:
mismatches.append(f"{key}: {mishkal_eng}{mishkal_heb} (expected {expected})")
assert not mismatches, f"{len(mismatches)} mishkal mismatches: {mismatches[:10]}"
def test_mishkal_hebrew_is_hebrew(self, words):
"""mishkal_hebrew must contain Hebrew characters."""
hebrew_re = re.compile(r"[\u05D0-\u05EA]")
bad = []
for key, entry in words.items():
for infl_key in ("noun_inflection", "adjective_inflection"):
infl = entry.get(infl_key)
if not infl:
continue
mishkal_heb = infl.get("mishkal_hebrew") or ""
if mishkal_heb and not hebrew_re.search(mishkal_heb):
bad.append(f"{key}: mishkal_hebrew={mishkal_heb!r}")
assert not bad, f"{len(bad)} non-Hebrew mishkal_hebrew values: {bad[:10]}"
def test_no_orphaned_mishkal(self, words):
"""If mishkal_hebrew is set, mishkal (English) must also be set."""
orphans = []
for key, entry in words.items():
for infl_key in ("noun_inflection", "adjective_inflection"):
infl = entry.get(infl_key)
if not infl:
continue
mishkal_heb = infl.get("mishkal_hebrew") or ""
mishkal_eng = infl.get("mishkal") or ""
if mishkal_heb and not mishkal_eng:
orphans.append(f"{key}: has mishkal_hebrew but no mishkal")
assert not orphans, f"{len(orphans)} orphaned mishkal_hebrew: {orphans[:10]}"

524
tests/test_detail_scrape.py Normal file
View file

@ -0,0 +1,524 @@
"""Tests for adjective and preposition detail page parsing in pealim_detail_scrape.py."""
import sys
from pathlib import Path
import pytest
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from pealim_detail_scrape import (
_parse_adjective_table,
_parse_adjective_table_vl,
_parse_preposition_table,
_parse_preposition_table_vl,
_scrape_adjective_detail,
_scrape_preposition_detail,
)
# ---------------------------------------------------------------------------
# Fixtures — real HTML snippets from pealim.com
# ---------------------------------------------------------------------------
ADJECTIVE_MO_TABLE = """
<table class="table table-condensed conjugation-table">
<thead>
<tr>
<th class="column-header" colspan="2">Singular</th>
<th class="column-header" colspan="2">Plural</th>
</tr>
<tr>
<th class="column-header">Masculine</th>
<th class="column-header">Feminine</th>
<th class="column-header">Masculine</th>
<th class="column-header">Feminine</th>
</tr>
</thead>
<tbody>
<tr>
<td class="conj-td">
<div id="ms-a">
<div><div>
<span class="audio-play" data-audio="https://audio.pealim.com/v0/dn/dngfpnovmytc.mp3">&#128266;</span>
<span class="menukad">אֲבִיבִי</span>
</div></div>
<div class="meaning">spring-like, vernal</div>
</div>
</td>
<td class="conj-td">
<div id="fs-a">
<div><div>
<span class="audio-play" data-audio="https://audio.pealim.com/v0/1j/1j6srg3do7n5k.mp3">&#128266;</span>
<span class="menukad">אֲבִיבִית</span>
</div></div>
<div class="meaning">spring-like, vernal</div>
</div>
</td>
<td class="conj-td">
<div id="mp-a">
<div><div>
<span class="audio-play" data-audio="https://audio.pealim.com/v0/tj/tjrhw0b5dkhc.mp3">&#128266;</span>
<span class="menukad">אֲבִיבִיִּים</span>
</div></div>
<div class="meaning">spring-like, vernal</div>
</div>
</td>
<td class="conj-td">
<div id="fp-a">
<div><div>
<span class="audio-play" data-audio="https://audio.pealim.com/v0/h3/h3u1ml5a4xcf.mp3">&#128266;</span>
<span class="menukad">אֲבִיבִיּוֹת</span>
</div></div>
<div class="meaning">spring-like, vernal</div>
</div>
</td>
</tr>
</tbody>
</table>
"""
# VL version: menukad spans contain unvowelled text (hebstyle=vl)
ADJECTIVE_VL_TABLE = """
<table class="table table-condensed conjugation-table">
<tbody>
<tr>
<td class="conj-td">
<div id="ms-a"><div><div>
<span class="menukad">אביבי</span>
</div></div></div>
</td>
<td class="conj-td">
<div id="fs-a"><div><div>
<span class="menukad">אביבית</span>
</div></div></div>
</td>
<td class="conj-td">
<div id="mp-a"><div><div>
<span class="menukad">אביביים</span>
</div></div></div>
</td>
<td class="conj-td">
<div id="fp-a"><div><div>
<span class="menukad">אביביות</span>
</div></div></div>
</td>
</tr>
</tbody>
</table>
"""
PREPOSITION_MO_TABLE = """
<table class="table table-condensed conjugation-table">
<thead>
<tr>
<th rowspan="2">Person</th>
<th class="column-header" colspan="2">Singular</th>
<th class="column-header" colspan="2">Plural</th>
</tr>
<tr>
<th class="column-header">Masculine</th>
<th class="column-header">Feminine</th>
<th class="column-header">Masculine</th>
<th class="column-header">Feminine</th>
</tr>
</thead>
<tbody>
<tr>
<th>1st</th>
<td class="conj-td" colspan="2">
<div id="P-1s"><div><div>
<span class="audio-play" data-audio="https://audio.pealim.com/v0/fk/fkp5faeteecr.mp3">&#128266;</span>
<span class="menukad">שֶׁלִּי</span>
</div></div><div class="meaning"><strong>of mine</strong></div></div>
</td>
<td class="conj-td" colspan="2">
<div id="P-1p"><div><div>
<span class="audio-play" data-audio="https://audio.pealim.com/v0/13/13uvi0dz6tgcc.mp3">&#128266;</span>
<span class="menukad">שֶׁלָּנוּ</span>
</div></div><div class="meaning"><strong>of ours</strong></div></div>
</td>
</tr>
<tr>
<th>2nd</th>
<td class="conj-td">
<div id="P-2ms"><div><div>
<span class="audio-play" data-audio="https://audio.pealim.com/v0/sh/shbxafq8ietx.mp3">&#128266;</span>
<span class="menukad">שֶׁלְּךָ</span>
</div></div><div class="meaning"><strong>of yours</strong> <em>m. sg.</em></div></div>
</td>
<td class="conj-td">
<div id="P-2fs"><div><div>
<span class="audio-play" data-audio="https://audio.pealim.com/v0/sh/sh9ue3a8buo3.mp3">&#128266;</span>
<span class="menukad">שֶׁלָּךְ</span>
</div></div><div class="meaning"><strong>of yours</strong> <em>f. sg.</em></div></div>
</td>
<td class="conj-td">
<div id="P-2mp"><div><div>
<span class="audio-play" data-audio="https://audio.pealim.com/v0/ol/olx8vzsctlzn.mp3">&#128266;</span>
<span class="menukad">שֶׁלָּכֶם</span>
</div></div><div class="meaning"><strong>of yours</strong> <em>m. pl.</em></div></div>
</td>
<td class="conj-td">
<div id="P-2fp"><div><div>
<span class="audio-play" data-audio="https://audio.pealim.com/v0/ol/olxrms6dl8eq.mp3">&#128266;</span>
<span class="menukad">שֶׁלָּכֶן</span>
</div></div><div class="meaning"><strong>of yours</strong> <em>f. pl.</em></div></div>
</td>
</tr>
<tr>
<th>3rd</th>
<td class="conj-td">
<div id="P-3ms"><div><div>
<span class="audio-play" data-audio="https://audio.pealim.com/v0/fk/fkp5qigelthg.mp3">&#128266;</span>
<span class="menukad">שֶׁלּוֹ</span>
</div></div><div class="meaning"><strong>of his</strong></div></div>
</td>
<td class="conj-td">
<div id="P-3fs"><div><div>
<span class="audio-play" data-audio="https://audio.pealim.com/v0/sh/sh9w36hojm5w.mp3">&#128266;</span>
<span class="menukad">שֶׁלָּהּ</span>
</div></div><div class="meaning"><strong>of hers</strong></div></div>
</td>
<td class="conj-td">
<div id="P-3mp"><div><div>
<span class="audio-play" data-audio="https://audio.pealim.com/v0/n9/n99z0jr8pint.mp3">&#128266;</span>
<span class="menukad">שֶׁלָּהֶם</span>
</div></div><div class="meaning"><strong>of theirs</strong> <em>m.</em></div></div>
</td>
<td class="conj-td">
<div id="P-3fp"><div><div>
<span class="audio-play" data-audio="https://audio.pealim.com/v0/n9/n9ahrc59h52w.mp3">&#128266;</span>
<span class="menukad">שֶׁלָּהֶן</span>
</div></div><div class="meaning"><strong>of theirs</strong> <em>f.</em></div></div>
</td>
</tr>
</tbody>
</table>
"""
PREPOSITION_VL_TABLE = """
<table class="table table-condensed conjugation-table">
<tbody>
<tr>
<th>1st</th>
<td colspan="2"><div id="P-1s"><div><div>
<span class="menukad">שלי</span>
</div></div></div></td>
<td colspan="2"><div id="P-1p"><div><div>
<span class="menukad">שלנו</span>
</div></div></div></td>
</tr>
<tr>
<th>2nd</th>
<td><div id="P-2ms"><div><div>
<span class="menukad">שלך</span>
</div></div></div></td>
<td><div id="P-2fs"><div><div>
<span class="menukad">שלך</span>
</div></div></div></td>
<td><div id="P-2mp"><div><div>
<span class="menukad">שלכם</span>
</div></div></div></td>
<td><div id="P-2fp"><div><div>
<span class="menukad">שלכן</span>
</div></div></div></td>
</tr>
<tr>
<th>3rd</th>
<td><div id="P-3ms"><div><div>
<span class="menukad">שלו</span>
</div></div></div></td>
<td><div id="P-3fs"><div><div>
<span class="menukad">שלה</span>
</div></div></div></td>
<td><div id="P-3mp"><div><div>
<span class="menukad">שלהם</span>
</div></div></div></td>
<td><div id="P-3fp"><div><div>
<span class="menukad">שלהן</span>
</div></div></div></td>
</tr>
</tbody>
</table>
"""
# Minimal full-page wrappers so _scrape_*_detail() can parse them
_ADJECTIVE_MO_PAGE = f"<html><body>{ADJECTIVE_MO_TABLE}</body></html>"
_ADJECTIVE_VL_PAGE = f"<html><body>{ADJECTIVE_VL_TABLE}</body></html>"
_PREPOSITION_MO_PAGE = f"<html><body>{PREPOSITION_MO_TABLE}</body></html>"
_PREPOSITION_VL_PAGE = f"<html><body>{PREPOSITION_VL_TABLE}</body></html>"
# ---------------------------------------------------------------------------
# Adjective table tests
# ---------------------------------------------------------------------------
class TestParseAdjectiveTable:
"""Tests for _parse_adjective_table (mo/nikkud page)."""
def test_returns_four_form_keys(self) -> None:
result = _parse_adjective_table(__import__("bs4").BeautifulSoup(ADJECTIVE_MO_TABLE, "lxml"))
assert set(result.keys()) == {"ms", "fs", "mp", "fp"}
def test_ms_nikkud(self) -> None:
result = _parse_adjective_table(__import__("bs4").BeautifulSoup(ADJECTIVE_MO_TABLE, "lxml"))
assert result["ms"]["nikkud"] == "אֲבִיבִי"
def test_fs_nikkud(self) -> None:
result = _parse_adjective_table(__import__("bs4").BeautifulSoup(ADJECTIVE_MO_TABLE, "lxml"))
assert result["fs"]["nikkud"] == "אֲבִיבִית"
def test_mp_nikkud(self) -> None:
result = _parse_adjective_table(__import__("bs4").BeautifulSoup(ADJECTIVE_MO_TABLE, "lxml"))
assert result["mp"]["nikkud"] == "אֲבִיבִיִּים"
def test_fp_nikkud(self) -> None:
result = _parse_adjective_table(__import__("bs4").BeautifulSoup(ADJECTIVE_MO_TABLE, "lxml"))
assert result["fp"]["nikkud"] == "אֲבִיבִיּוֹת"
def test_audio_url_present(self) -> None:
result = _parse_adjective_table(__import__("bs4").BeautifulSoup(ADJECTIVE_MO_TABLE, "lxml"))
assert result["ms"]["audio_url"].startswith("https://audio.pealim.com/")
def test_empty_on_missing_table(self) -> None:
result = _parse_adjective_table(__import__("bs4").BeautifulSoup("<html><body></body></html>", "lxml"))
assert result == {}
class TestParseAdjectiveTableVl:
"""Tests for _parse_adjective_table_vl (ktiv male page)."""
def test_returns_four_form_keys(self) -> None:
result = _parse_adjective_table_vl(__import__("bs4").BeautifulSoup(ADJECTIVE_VL_TABLE, "lxml"))
assert set(result.keys()) == {"ms", "fs", "mp", "fp"}
def test_ms_ktiv(self) -> None:
result = _parse_adjective_table_vl(__import__("bs4").BeautifulSoup(ADJECTIVE_VL_TABLE, "lxml"))
assert result["ms"] == "אביבי"
def test_fs_ktiv(self) -> None:
result = _parse_adjective_table_vl(__import__("bs4").BeautifulSoup(ADJECTIVE_VL_TABLE, "lxml"))
assert result["fs"] == "אביבית"
def test_mp_ktiv(self) -> None:
result = _parse_adjective_table_vl(__import__("bs4").BeautifulSoup(ADJECTIVE_VL_TABLE, "lxml"))
assert result["mp"] == "אביביים"
def test_fp_ktiv(self) -> None:
result = _parse_adjective_table_vl(__import__("bs4").BeautifulSoup(ADJECTIVE_VL_TABLE, "lxml"))
assert result["fp"] == "אביביות"
# ---------------------------------------------------------------------------
# _scrape_adjective_detail tests
# ---------------------------------------------------------------------------
class TestScrapeAdjectiveDetail:
"""Tests for _scrape_adjective_detail — schema compliance."""
@pytest.fixture()
def result(self) -> dict:
return _scrape_adjective_detail("9098-avivi", _ADJECTIVE_MO_PAGE, _ADJECTIVE_VL_PAGE)
def test_returns_non_empty_dict(self, result: dict) -> None:
assert result
def test_ms_has_nikkud_and_ktiv(self, result: dict) -> None:
assert result["ms"]["nikkud"] == "אֲבִיבִי"
assert result["ms"]["ktiv_male"] == "אביבי"
def test_fs_has_nikkud_and_ktiv(self, result: dict) -> None:
assert result["fs"]["nikkud"] == "אֲבִיבִית"
assert result["fs"]["ktiv_male"] == "אביבית"
def test_mp_has_nikkud_and_ktiv(self, result: dict) -> None:
assert result["mp"]["nikkud"] == "אֲבִיבִיִּים"
assert result["mp"]["ktiv_male"] == "אביביים"
def test_fp_has_nikkud_and_ktiv(self, result: dict) -> None:
assert result["fp"]["nikkud"] == "אֲבִיבִיּוֹת"
assert result["fp"]["ktiv_male"] == "אביביות"
def test_mishkal_key_present(self, result: dict) -> None:
# mishkal may be None since no PoS section is in our minimal fixture
assert "mishkal" in result
def test_mishkal_hebrew_key_present(self, result: dict) -> None:
assert "mishkal_hebrew" in result
def test_all_schema_keys_present(self, result: dict) -> None:
expected = {"ms", "fs", "mp", "fp", "mishkal", "mishkal_hebrew"}
assert expected.issubset(result.keys())
def test_empty_on_no_table(self) -> None:
result = _scrape_adjective_detail("missing", "<html><body></body></html>", "<html><body></body></html>")
assert result == {}
# ---------------------------------------------------------------------------
# Preposition table tests
# ---------------------------------------------------------------------------
class TestParsePrepositionTable:
"""Tests for _parse_preposition_table (mo/nikkud page)."""
@pytest.fixture()
def result(self) -> dict:
return _parse_preposition_table(__import__("bs4").BeautifulSoup(PREPOSITION_MO_TABLE, "lxml"))
def test_returns_ten_form_keys(self, result: dict) -> None:
expected = {"1s", "1p", "2ms", "2fs", "2mp", "2fp", "3ms", "3fs", "3mp", "3fp"}
assert set(result.keys()) == expected
def test_1s_nikkud(self, result: dict) -> None:
assert result["1s"]["nikkud"] == "שֶׁלִּי"
def test_1p_nikkud(self, result: dict) -> None:
assert result["1p"]["nikkud"] == "שֶׁלָּנוּ"
def test_2ms_nikkud(self, result: dict) -> None:
assert result["2ms"]["nikkud"] == "שֶׁלְּךָ"
def test_2fs_nikkud(self, result: dict) -> None:
assert result["2fs"]["nikkud"] == "שֶׁלָּךְ"
def test_2mp_nikkud(self, result: dict) -> None:
assert result["2mp"]["nikkud"] == "שֶׁלָּכֶם"
def test_2fp_nikkud(self, result: dict) -> None:
assert result["2fp"]["nikkud"] == "שֶׁלָּכֶן"
def test_3ms_nikkud(self, result: dict) -> None:
assert result["3ms"]["nikkud"] == "שֶׁלּוֹ"
def test_3fs_nikkud(self, result: dict) -> None:
assert result["3fs"]["nikkud"] == "שֶׁלָּהּ"
def test_3mp_nikkud(self, result: dict) -> None:
assert result["3mp"]["nikkud"] == "שֶׁלָּהֶם"
def test_3fp_nikkud(self, result: dict) -> None:
assert result["3fp"]["nikkud"] == "שֶׁלָּהֶן"
def test_audio_url_present(self, result: dict) -> None:
assert result["1s"]["audio_url"].startswith("https://audio.pealim.com/")
def test_empty_on_missing_table(self) -> None:
result = _parse_preposition_table(__import__("bs4").BeautifulSoup("<html><body></body></html>", "lxml"))
assert result == {}
class TestParsePrepositionTableVl:
"""Tests for _parse_preposition_table_vl (ktiv male page)."""
@pytest.fixture()
def result(self) -> dict:
return _parse_preposition_table_vl(__import__("bs4").BeautifulSoup(PREPOSITION_VL_TABLE, "lxml"))
def test_returns_ten_form_keys(self, result: dict) -> None:
expected = {"1s", "1p", "2ms", "2fs", "2mp", "2fp", "3ms", "3fs", "3mp", "3fp"}
assert set(result.keys()) == expected
def test_1s_ktiv(self, result: dict) -> None:
assert result["1s"] == "שלי"
def test_1p_ktiv(self, result: dict) -> None:
assert result["1p"] == "שלנו"
def test_2ms_ktiv(self, result: dict) -> None:
assert result["2ms"] == "שלך"
def test_3ms_ktiv(self, result: dict) -> None:
assert result["3ms"] == "שלו"
def test_3fp_ktiv(self, result: dict) -> None:
assert result["3fp"] == "שלהן"
# ---------------------------------------------------------------------------
# _scrape_preposition_detail tests
# ---------------------------------------------------------------------------
class TestScrapePrepositionDetail:
"""Tests for _scrape_preposition_detail — schema compliance."""
@pytest.fixture()
def result(self) -> dict:
return _scrape_preposition_detail("2643-shel", _PREPOSITION_MO_PAGE, _PREPOSITION_VL_PAGE)
def test_returns_non_empty_dict(self, result: dict) -> None:
assert result
def test_all_ten_person_keys_present(self, result: dict) -> None:
expected = {"1s", "1p", "2ms", "2fs", "2mp", "2fp", "3ms", "3fs", "3mp", "3fp"}
assert expected.issubset(result.keys())
def test_1s_has_nikkud_and_ktiv(self, result: dict) -> None:
assert result["1s"]["nikkud"] == "שֶׁלִּי"
assert result["1s"]["ktiv_male"] == "שלי"
def test_1p_has_nikkud_and_ktiv(self, result: dict) -> None:
assert result["1p"]["nikkud"] == "שֶׁלָּנוּ"
assert result["1p"]["ktiv_male"] == "שלנו"
def test_2ms_has_nikkud_and_ktiv(self, result: dict) -> None:
assert result["2ms"]["nikkud"] == "שֶׁלְּךָ"
assert result["2ms"]["ktiv_male"] == "שלך"
def test_3ms_has_nikkud_and_ktiv(self, result: dict) -> None:
assert result["3ms"]["nikkud"] == "שֶׁלּוֹ"
assert result["3ms"]["ktiv_male"] == "שלו"
def test_3fs_has_nikkud_and_ktiv(self, result: dict) -> None:
assert result["3fs"]["nikkud"] == "שֶׁלָּהּ"
assert result["3fs"]["ktiv_male"] == "שלה"
def test_3fp_has_nikkud_and_ktiv(self, result: dict) -> None:
assert result["3fp"]["nikkud"] == "שֶׁלָּהֶן"
assert result["3fp"]["ktiv_male"] == "שלהן"
def test_empty_on_no_table(self) -> None:
result = _scrape_preposition_detail("missing", "<html><body></body></html>", "<html><body></body></html>")
assert result == {}
# ---------------------------------------------------------------------------
# Tests for _parse_noun_gender_mishkal mishkal extraction
# ---------------------------------------------------------------------------
from bs4 import BeautifulSoup # noqa: E402
from pealim_detail_scrape import _parse_noun_gender_mishkal # noqa: E402
class TestNounGenderMishkal:
def test_noun_with_mishkal(self):
html = '<p>Noun <a href="/dict/?pos=noun&amp;nm=qetel"><i>ketel</i> pattern</a>, masculine</p>'
soup = BeautifulSoup(html, "html.parser")
gender, mishkal = _parse_noun_gender_mishkal(soup)
assert gender == "masculine"
assert mishkal == "ketel"
def test_noun_without_mishkal(self):
html = "<p>Noun masculine</p>"
soup = BeautifulSoup(html, "html.parser")
gender, mishkal = _parse_noun_gender_mishkal(soup)
assert gender == "masculine"
assert mishkal == ""
def test_adjective_mishkal(self):
html = '<p>Adjective <a href="/dict/?pos=adjective&amp;am=qatul"><i>katul</i> pattern</a></p>'
soup = BeautifulSoup(html, "html.parser")
_, mishkal = _parse_noun_gender_mishkal(soup)
assert mishkal == "katul"
def test_feminine_noun(self):
html = '<p>Noun <a href="/dict/?pos=noun&amp;nm=qetel"><i>ketel</i> pattern</a>, feminine</p>'
soup = BeautifulSoup(html, "html.parser")
gender, mishkal = _parse_noun_gender_mishkal(soup)
assert gender == "feminine"
assert mishkal == "ketel"

127
tests/test_epub_examples.py Normal file
View file

@ -0,0 +1,127 @@
"""Tests for epub_examples deduplication of confusable group examples."""
from epub_examples import _deduplicate_confusable_examples
def _make_entry(meaning, confusable_group, vetted_texts=None, frequency_rank=None):
"""Build a minimal words.json entry for testing."""
entry = {
"meaning": meaning,
"confusable_group": confusable_group,
}
if vetted_texts is not None:
entry["examples"] = {
"vetted": [{"text": t, "source": "test", "match_method": "direct"} for t in vetted_texts],
}
if frequency_rank is not None:
entry["frequency_rank"] = frequency_rank
return entry
class TestDeduplicateConfusableExamples:
"""Tests for _deduplicate_confusable_examples()."""
def test_shared_examples_kept_on_higher_frequency(self):
"""When two confusables share identical examples, the one with
lower frequency_rank (more common) keeps them."""
group = ["key_a", "key_b"]
words = {
"key_a": _make_entry("brother", group, ["sent1", "sent2"], frequency_rank=500),
"key_b": _make_entry("fireplace", group, ["sent1", "sent2"], frequency_rank=8000),
}
cleared = _deduplicate_confusable_examples(words)
assert cleared == 1
assert len(words["key_a"]["examples"]["vetted"]) == 2
assert words["key_b"]["examples"]["vetted"] == []
def test_no_action_when_examples_differ(self):
"""Groups with different example sets are left untouched."""
group = ["key_a", "key_b"]
words = {
"key_a": _make_entry("meaning1", group, ["sent1"], frequency_rank=100),
"key_b": _make_entry("meaning2", group, ["sent2"], frequency_rank=200),
}
cleared = _deduplicate_confusable_examples(words)
assert cleared == 0
assert len(words["key_a"]["examples"]["vetted"]) == 1
assert len(words["key_b"]["examples"]["vetted"]) == 1
def test_no_action_when_one_has_no_examples(self):
"""If only one member has examples, nothing to deduplicate."""
group = ["key_a", "key_b"]
words = {
"key_a": _make_entry("meaning1", group, ["sent1"], frequency_rank=100),
"key_b": _make_entry("meaning2", group, frequency_rank=200),
}
cleared = _deduplicate_confusable_examples(words)
assert cleared == 0
def test_no_frequency_uses_alphabetical_tiebreak(self):
"""When no member has frequency data, first alphabetically wins."""
group = ["alpha_key", "beta_key"]
words = {
"alpha_key": _make_entry("meaning1", group, ["sent1"]),
"beta_key": _make_entry("meaning2", group, ["sent1"]),
}
cleared = _deduplicate_confusable_examples(words)
assert cleared == 1
assert len(words["alpha_key"]["examples"]["vetted"]) == 1
assert words["beta_key"]["examples"]["vetted"] == []
def test_three_way_group(self):
"""Three-member group: highest frequency wins, other two cleared."""
group = ["key_a", "key_b", "key_c"]
words = {
"key_a": _make_entry("yes", group, ["sent1", "sent2"], frequency_rank=50),
"key_b": _make_entry("honest", group, ["sent1", "sent2"], frequency_rank=3000),
"key_c": _make_entry("pedestal", group, ["sent1", "sent2"], frequency_rank=15000),
}
cleared = _deduplicate_confusable_examples(words)
assert cleared == 2
assert len(words["key_a"]["examples"]["vetted"]) == 2
assert words["key_b"]["examples"]["vetted"] == []
assert words["key_c"]["examples"]["vetted"] == []
def test_cloze_removed_from_losers(self):
"""Losing entries should have their cloze data removed too."""
group = ["key_a", "key_b"]
words = {
"key_a": _make_entry("common", group, ["sent1"], frequency_rank=100),
"key_b": _make_entry("rare", group, ["sent1"], frequency_rank=9000),
}
# Add cloze to both
words["key_b"]["examples"]["cloze"] = {"text": "sent1", "cloze_guid": "abc"}
cleared = _deduplicate_confusable_examples(words)
assert cleared == 1
assert "cloze" not in words["key_b"]["examples"]
def test_no_confusable_groups_returns_zero(self):
"""Words without confusable_group are ignored."""
words = {
"key_a": {"meaning": "word1", "examples": {"vetted": [{"text": "s1"}]}},
"key_b": {"meaning": "word2", "examples": {"vetted": [{"text": "s1"}]}},
}
cleared = _deduplicate_confusable_examples(words)
assert cleared == 0
def test_mixed_frequency_and_none(self):
"""Member with frequency beats member without."""
group = ["key_a", "key_b"]
words = {
"key_a": _make_entry("has_freq", group, ["sent1"], frequency_rank=5000),
"key_b": _make_entry("no_freq", group, ["sent1"]),
}
cleared = _deduplicate_confusable_examples(words)
assert cleared == 1
assert len(words["key_a"]["examples"]["vetted"]) == 1
assert words["key_b"]["examples"]["vetted"] == []
def test_partial_overlap_not_deduplicated(self):
"""Groups with overlapping but not identical sentence sets are not touched."""
group = ["key_a", "key_b"]
words = {
"key_a": _make_entry("m1", group, ["sent1", "sent2"], frequency_rank=100),
"key_b": _make_entry("m2", group, ["sent1", "sent3"], frequency_rank=200),
}
cleared = _deduplicate_confusable_examples(words)
assert cleared == 0

View file

@ -0,0 +1,83 @@
"""Integration tests for frequency-based sentence scoring in update_words_json."""
def _make_sentence(text, source="test", match_method="direct", word_count=None, char_offset=0, char_end=3):
"""Build a minimal sentence dict as match_sentences would produce."""
if word_count is None:
word_count = len(text.split())
return {
"text": text,
"source": source,
"match_method": match_method,
"word_count": word_count,
"char_offset": char_offset,
"char_end": char_end,
}
class TestScoringIntegration:
"""Tests that update_words_json uses frequency scoring."""
def test_cloze_has_difficulty_score(self):
"""Cloze dict includes difficulty_score field."""
from epub_examples import update_words_json
words = {
"טוֹב": {
"word": {"nikkud": "טוֹב", "ktiv_male": "טוב"},
"examples": {},
}
}
matches = {
"טוֹב": [
_make_sentence("הוּא אָדָם טוֹב מְאוֹד", char_offset=10, char_end=13),
]
}
update_words_json(words, matches, confusable_keys=set())
cloze = words["טוֹב"]["examples"].get("cloze")
assert cloze is not None
assert "difficulty_score" in cloze
assert isinstance(cloze["difficulty_score"], int)
def test_vetted_sorted_by_difficulty(self):
"""Vetted sentences are sorted easiest first."""
from epub_examples import update_words_json
words = {
"טוֹב": {
"word": {"nikkud": "טוֹב", "ktiv_male": "טוב"},
"examples": {},
}
}
matches = {
"טוֹב": [
_make_sentence("הוּא טוֹב", char_offset=4, char_end=7),
_make_sentence("הַתַּפְנִיט טוֹב בְּיוֹתֵר", char_offset=10, char_end=13),
_make_sentence("אֲנִי טוֹב הַיּוֹם", char_offset=5, char_end=8),
]
}
update_words_json(words, matches, confusable_keys=set())
vetted = words["טוֹב"]["examples"]["vetted"]
assert len(vetted) == 3
def test_easiest_sentence_becomes_cloze(self):
"""The sentence with the lowest difficulty score becomes the cloze."""
from epub_examples import update_words_json
words = {
"טוֹב": {
"word": {"nikkud": "טוֹב", "ktiv_male": "טוב"},
"examples": {},
}
}
easy_text = "הוּא טוֹב מְאוֹד"
hard_text = "הַפַּרְנָסִימוֹן טוֹב לְהַפְלִיא"
matches = {
"טוֹב": [
_make_sentence(hard_text, char_offset=14, char_end=17),
_make_sentence(easy_text, char_offset=4, char_end=7),
]
}
update_words_json(words, matches, confusable_keys=set())
cloze = words["טוֹב"]["examples"]["cloze"]
assert cloze["text"] == easy_text

View file

@ -0,0 +1,441 @@
#!/usr/bin/env python3
"""Integration tests: scrape real pealim.com pages and validate data.
These tests hit pealim.com directly. They are skipped when the environment
variable SKIP_INTEGRATION is set to any non-empty string.
Run with:
pytest tests/test_scraper_integration.py -v -m integration
"""
import json
import os
import re
import sys
import time
from pathlib import Path
import pytest
# Add project root to path so all sibling modules are importable
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
import pealim_detail_scrape
import pealim_list_scrape
# ---------------------------------------------------------------------------
# Skip marker
# ---------------------------------------------------------------------------
skip_integration = pytest.mark.skipif(
bool(os.environ.get("SKIP_INTEGRATION", "")),
reason="SKIP_INTEGRATION is set",
)
# A known Hif'il verb slug that is not page-1 dependent.
# לְהַגִּיד (to tell/say) — Hif'il, slug 1135-lehagid
HIFIL_VERB_SLUG = "1135-lehagid"
HIFIL_VERB_NIKKUD = "לְהַגִּיד"
HIFIL_VERB_MEANING = "to say, to tell"
# Minimum expected entries from a single list page
MIN_LIST_ENTRIES = 10
# Hebrew character regex (Unicode block U+05D0U+05EA)
HEBREW_CHAR_RE = re.compile(r"[\u05d0-\u05ea]")
# Slug pattern: one or more digits, hyphen, one or more word chars
SLUG_RE = re.compile(r"^\d+-\w+$")
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _has_hebrew(text: str) -> bool:
"""Return True if *text* contains at least one Hebrew consonant."""
return bool(HEBREW_CHAR_RE.search(text))
def _words_from_file(path: Path) -> dict:
with path.open(encoding="utf-8") as fh:
return json.load(fh)
# ---------------------------------------------------------------------------
# Test class: list page scrape
# ---------------------------------------------------------------------------
@pytest.mark.integration
@skip_integration
class TestListScrape:
"""Validate pealim_list_scrape against a real /dict/?page=1 fetch."""
def test_list_page_1_produces_entries(self, tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
"""Page 1 must yield at least MIN_LIST_ENTRIES entries in words.json."""
words_path = tmp_path / "words.json"
progress_path = tmp_path / "list_scrape_progress.json"
monkeypatch.setattr(pealim_list_scrape, "WORDS_JSON", words_path)
monkeypatch.setattr(pealim_list_scrape, "PROGRESS_JSON", progress_path)
# Scrape exactly one page
pealim_list_scrape.run_scrape(total_pages=1, force_refresh=True)
assert words_path.exists(), "words.json was not created after scrape"
words = _words_from_file(words_path)
assert len(words) >= MIN_LIST_ENTRIES, (
f"Expected at least {MIN_LIST_ENTRIES} entries from page 1, got {len(words)}"
)
def test_list_entries_have_required_fields(self, tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
"""Every entry must have non-empty nikkud, ktiv_male, slug, pos, meaning."""
words_path = tmp_path / "words.json"
progress_path = tmp_path / "list_scrape_progress.json"
monkeypatch.setattr(pealim_list_scrape, "WORDS_JSON", words_path)
monkeypatch.setattr(pealim_list_scrape, "PROGRESS_JSON", progress_path)
pealim_list_scrape.run_scrape(total_pages=1, force_refresh=True)
words = _words_from_file(words_path)
for key, entry in words.items():
word_block = entry.get("word", {})
nikkud = word_block.get("nikkud", "")
ktiv_male = word_block.get("ktiv_male", "")
slug = entry.get("slug", "")
pos = entry.get("pos", "")
meaning = entry.get("meaning", "")
assert nikkud, f"Entry '{key}': word.nikkud is empty"
assert _has_hebrew(nikkud), f"Entry '{key}': word.nikkud has no Hebrew chars: {nikkud!r}"
assert ktiv_male, f"Entry '{key}': word.ktiv_male is empty"
assert slug, f"Entry '{key}': slug is empty"
assert SLUG_RE.match(slug), f"Entry '{key}': slug does not match \\d+-\\w+ pattern: {slug!r}"
assert pos, f"Entry '{key}': pos is empty"
assert meaning, f"Entry '{key}': meaning is empty"
def test_list_at_least_one_entry_has_root(self, tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
"""At least one entry on page 1 must have a non-empty root list."""
words_path = tmp_path / "words.json"
progress_path = tmp_path / "list_scrape_progress.json"
monkeypatch.setattr(pealim_list_scrape, "WORDS_JSON", words_path)
monkeypatch.setattr(pealim_list_scrape, "PROGRESS_JSON", progress_path)
pealim_list_scrape.run_scrape(total_pages=1, force_refresh=True)
words = _words_from_file(words_path)
entries_with_root = [e for e in words.values() if e.get("root")]
assert entries_with_root, "No entries on page 1 have a non-empty root list"
def test_list_at_least_one_entry_has_audio(self, tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
"""At least one entry on page 1 must have a non-empty audio_url."""
words_path = tmp_path / "words.json"
progress_path = tmp_path / "list_scrape_progress.json"
monkeypatch.setattr(pealim_list_scrape, "WORDS_JSON", words_path)
monkeypatch.setattr(pealim_list_scrape, "PROGRESS_JSON", progress_path)
pealim_list_scrape.run_scrape(total_pages=1, force_refresh=True)
words = _words_from_file(words_path)
entries_with_audio = [e for e in words.values() if e.get("audio_url")]
assert entries_with_audio, "No entries on page 1 have a non-empty audio_url"
def test_list_post_process_fields_exist(self, tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
"""After scrape, every entry must have 'confusable_group' and 'shared_roots' keys (post-processed)."""
words_path = tmp_path / "words.json"
progress_path = tmp_path / "list_scrape_progress.json"
monkeypatch.setattr(pealim_list_scrape, "WORDS_JSON", words_path)
monkeypatch.setattr(pealim_list_scrape, "PROGRESS_JSON", progress_path)
pealim_list_scrape.run_scrape(total_pages=1, force_refresh=True)
words = _words_from_file(words_path)
for key, entry in words.items():
assert "confusable_group" in entry, f"Entry '{key}' missing 'confusable_group' key"
assert "shared_roots" in entry, f"Entry '{key}' missing 'shared_roots' key"
assert isinstance(entry["shared_roots"], list), f"Entry '{key}': shared_roots is not a list"
# ---------------------------------------------------------------------------
# Test class: noun detail scrape
# ---------------------------------------------------------------------------
@pytest.mark.integration
@skip_integration
class TestDetailScrapeNoun:
"""Validate pealim_detail_scrape for a real noun detail page."""
def _find_noun_with_root(self, words: dict) -> tuple[str, dict] | None:
"""Return the first (key, entry) pair that is a Noun with a non-empty root."""
for key, entry in words.items():
if entry.get("pos", "").startswith("Noun") and entry.get("root") and entry.get("slug"):
return key, entry
return None
def _prepare_words_json(self, tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> tuple[Path, dict]:
"""
Scrape page 1 into a fresh words.json and return (path, words).
Uses list scraper monkeypatched to tmp_path.
"""
words_path = tmp_path / "words.json"
progress_path = tmp_path / "list_scrape_progress.json"
monkeypatch.setattr(pealim_list_scrape, "WORDS_JSON", words_path)
monkeypatch.setattr(pealim_list_scrape, "PROGRESS_JSON", progress_path)
pealim_list_scrape.run_scrape(total_pages=1, force_refresh=True)
words = _words_from_file(words_path)
return words_path, words
def test_noun_detail_inflection_not_null(self, tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
"""After detail scrape, noun_inflection must not be null."""
words_path, words = self._prepare_words_json(tmp_path, monkeypatch)
pair = self._find_noun_with_root(words)
assert pair is not None, "No noun with a root found on page 1"
noun_key, noun_entry = pair
# Now monkeypatch detail scraper and run it on just this noun
monkeypatch.setattr(pealim_detail_scrape, "WORDS_JSON", words_path)
# Small rate-limit delay between list scrape and detail scrape
time.sleep(1.0)
pealim_detail_scrape.run(force_refresh=True, nouns_only=True)
updated_words = _words_from_file(words_path)
entry = updated_words.get(noun_key, {})
assert entry.get("noun_inflection") is not None, (
f"noun_inflection is None after detail scrape for '{noun_key}' (slug={noun_entry.get('slug')})"
)
def test_noun_detail_singular_and_plural_forms(self, tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
"""Noun singular and plural forms must have non-empty nikkud and ktiv_male."""
words_path, words = self._prepare_words_json(tmp_path, monkeypatch)
pair = self._find_noun_with_root(words)
assert pair is not None, "No noun with a root found on page 1"
noun_key, _noun_entry = pair
monkeypatch.setattr(pealim_detail_scrape, "WORDS_JSON", words_path)
time.sleep(1.0)
pealim_detail_scrape.run(force_refresh=True, nouns_only=True)
updated_words = _words_from_file(words_path)
ni = updated_words[noun_key].get("noun_inflection", {}) or {}
singular = ni.get("singular") or {}
plural = ni.get("plural") or {}
assert singular.get("nikkud"), f"noun_inflection.singular.nikkud is empty for '{noun_key}'"
assert singular.get("ktiv_male"), f"noun_inflection.singular.ktiv_male is empty for '{noun_key}'"
assert plural.get("nikkud"), f"noun_inflection.plural.nikkud is empty for '{noun_key}'"
assert plural.get("ktiv_male"), f"noun_inflection.plural.ktiv_male is empty for '{noun_key}'"
def test_noun_detail_gender(self, tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
"""Noun gender must be 'masculine' or 'feminine'."""
words_path, words = self._prepare_words_json(tmp_path, monkeypatch)
pair = self._find_noun_with_root(words)
assert pair is not None, "No noun with a root found on page 1"
noun_key, _noun_entry = pair
monkeypatch.setattr(pealim_detail_scrape, "WORDS_JSON", words_path)
time.sleep(1.0)
pealim_detail_scrape.run(force_refresh=True, nouns_only=True)
updated_words = _words_from_file(words_path)
ni = updated_words[noun_key].get("noun_inflection", {}) or {}
gender = ni.get("gender", "")
assert gender in ("masculine", "feminine"), (
f"noun_inflection.gender is {gender!r} for '{noun_key}' (expected 'masculine' or 'feminine')"
)
def test_noun_detail_scraped_flag(self, tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
"""detail_scraped must be True after a successful noun detail scrape."""
words_path, words = self._prepare_words_json(tmp_path, monkeypatch)
pair = self._find_noun_with_root(words)
assert pair is not None, "No noun with a root found on page 1"
noun_key, _ = pair
monkeypatch.setattr(pealim_detail_scrape, "WORDS_JSON", words_path)
time.sleep(1.0)
pealim_detail_scrape.run(force_refresh=True, nouns_only=True)
updated_words = _words_from_file(words_path)
assert updated_words[noun_key].get("detail_scraped") is True, (
f"detail_scraped is not True after scrape for '{noun_key}'"
)
# ---------------------------------------------------------------------------
# Test class: verb detail scrape (Hif'il)
# ---------------------------------------------------------------------------
@pytest.mark.integration
@skip_integration
class TestDetailScrapeVerb:
"""Validate pealim_detail_scrape for a known Hif'il verb (lehagid, slug 4183-lehagid)."""
def _build_test_words_json(self, tmp_path: Path) -> Path:
"""
Write a minimal words.json containing only the known Hif'il verb entry.
The detail scraper's run() will pick it up because pos starts with 'Verb'
and detail_scraped is absent/False.
"""
words_path = tmp_path / "words.json"
entry = {
"word": {"nikkud": HIFIL_VERB_NIKKUD, "ktiv_male": "להגיד"},
"slug": HIFIL_VERB_SLUG,
"root": ["נ", "ג", "ד"],
"pos": "Verb",
"pos_hebrew": "פֹּעַל — הִפְעִיל",
"meaning": HIFIL_VERB_MEANING,
"meaning_raw": HIFIL_VERB_MEANING,
"audio_url": "",
"audio_file": "להגיד.mp3",
"tags": "שורש::נגד פעלים",
"last_scrape_date": "2026-03-08",
"vocab_legacy_guid": None,
"frequency": None,
"pseudo_frequency": None,
"emoji": None,
"emoji_source": None,
"emoji_visible": False,
"image": None,
"image_source": None,
"hint": "",
"shared_roots": [],
"confusable_group": None,
"confusables_guid": None,
"examples": None,
"noun_inflection": None,
"conjugation": None,
"adjective_inflection": None,
"preposition_inflection": None,
# Intentionally no detail_scraped key so the scraper processes it
}
words = {HIFIL_VERB_NIKKUD: entry}
with words_path.open("w", encoding="utf-8") as fh:
json.dump(words, fh, ensure_ascii=False, indent=2)
return words_path
def test_verb_detail_conjugation_not_null(self, tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
"""After detail scrape, conjugation must not be null for the Hif'il verb."""
words_path = self._build_test_words_json(tmp_path)
monkeypatch.setattr(pealim_detail_scrape, "WORDS_JSON", words_path)
pealim_detail_scrape.run(test=1, force_refresh=True, verbs_only=True)
words = _words_from_file(words_path)
entry = words.get(HIFIL_VERB_NIKKUD, {})
assert entry.get("conjugation") is not None, f"conjugation is None after detail scrape for {HIFIL_VERB_SLUG}"
def test_verb_detail_binyan(self, tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
"""conjugation.binyan must be \"Hif'il\" and binyan_hebrew must be the correct nikkud."""
words_path = self._build_test_words_json(tmp_path)
monkeypatch.setattr(pealim_detail_scrape, "WORDS_JSON", words_path)
pealim_detail_scrape.run(test=1, force_refresh=True, verbs_only=True)
words = _words_from_file(words_path)
conj = words[HIFIL_VERB_NIKKUD].get("conjugation") or {}
assert conj.get("binyan") == "Hif'il", f"Expected binyan='Hif\\'il', got {conj.get('binyan')!r}"
assert conj.get("binyan_hebrew") == "הִפְעִיל", (
f"Expected binyan_hebrew='הִפְעִיל', got {conj.get('binyan_hebrew')!r}"
)
def test_verb_detail_infinitive_and_reference_form(self, tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
"""infinitive.nikkud and reference_form.nikkud must be non-empty Hebrew strings."""
words_path = self._build_test_words_json(tmp_path)
monkeypatch.setattr(pealim_detail_scrape, "WORDS_JSON", words_path)
pealim_detail_scrape.run(test=1, force_refresh=True, verbs_only=True)
words = _words_from_file(words_path)
conj = words[HIFIL_VERB_NIKKUD].get("conjugation") or {}
infinitive = conj.get("infinitive") or {}
reference_form = conj.get("reference_form") or {}
inf_nikkud = infinitive.get("nikkud", "")
ref_nikkud = reference_form.get("nikkud", "")
assert inf_nikkud and _has_hebrew(inf_nikkud), (
f"infinitive.nikkud is empty or has no Hebrew chars: {inf_nikkud!r}"
)
assert ref_nikkud and _has_hebrew(ref_nikkud), (
f"reference_form.nikkud (3ms past) is empty or has no Hebrew chars: {ref_nikkud!r}"
)
def test_verb_detail_active_forms_count_and_structure(
self, tmp_path: Path, monkeypatch: pytest.MonkeyPatch
) -> None:
"""active_forms must be a list of at least 20 entries, each with required sub-fields."""
words_path = self._build_test_words_json(tmp_path)
monkeypatch.setattr(pealim_detail_scrape, "WORDS_JSON", words_path)
pealim_detail_scrape.run(test=1, force_refresh=True, verbs_only=True)
words = _words_from_file(words_path)
conj = words[HIFIL_VERB_NIKKUD].get("conjugation") or {}
active_forms = conj.get("active_forms")
assert isinstance(active_forms, list), f"active_forms is not a list: {type(active_forms)}"
assert len(active_forms) >= 20, f"Expected at least 20 active forms, got {len(active_forms)}"
for i, form in enumerate(active_forms):
assert form.get("person"), f"active_forms[{i}].person is empty"
assert form.get("tense"), f"active_forms[{i}].tense is empty"
form_block = form.get("form") or {}
assert form_block.get("nikkud") and _has_hebrew(form_block["nikkud"]), (
f"active_forms[{i}].form.nikkud is empty or has no Hebrew: {form_block.get('nikkud')!r}"
)
assert form_block.get("ktiv_male") and _has_hebrew(form_block["ktiv_male"]), (
f"active_forms[{i}].form.ktiv_male is empty or has no Hebrew: {form_block.get('ktiv_male')!r}"
)
def test_verb_detail_hufal_passive_section(self, tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
"""Hif'il verb must have a non-null hufal_pual_forms list and reference_form_passive."""
words_path = self._build_test_words_json(tmp_path)
monkeypatch.setattr(pealim_detail_scrape, "WORDS_JSON", words_path)
pealim_detail_scrape.run(test=1, force_refresh=True, verbs_only=True)
words = _words_from_file(words_path)
conj = words[HIFIL_VERB_NIKKUD].get("conjugation") or {}
hufal_forms = conj.get("hufal_pual_forms")
assert hufal_forms is not None, "hufal_pual_forms is None — expected Huf'al passive section for a Hif'il verb"
assert isinstance(hufal_forms, list), f"hufal_pual_forms is not a list: {type(hufal_forms)}"
assert len(hufal_forms) > 0, "hufal_pual_forms list is empty"
ref_passive = conj.get("reference_form_passive")
assert ref_passive is not None, "reference_form_passive is None — expected a Huf'al 3ms past form"
passive_nikkud = (ref_passive or {}).get("nikkud", "")
assert passive_nikkud and _has_hebrew(passive_nikkud), (
f"reference_form_passive.nikkud is empty or has no Hebrew: {passive_nikkud!r}"
)
def test_verb_detail_scraped_flag(self, tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
"""detail_scraped must be True after a successful verb detail scrape."""
words_path = self._build_test_words_json(tmp_path)
monkeypatch.setattr(pealim_detail_scrape, "WORDS_JSON", words_path)
pealim_detail_scrape.run(test=1, force_refresh=True, verbs_only=True)
words = _words_from_file(words_path)
entry = words.get(HIFIL_VERB_NIKKUD, {})
assert entry.get("detail_scraped") is True, f"detail_scraped is not True after scrape for {HIFIL_VERB_SLUG}"

View file

@ -0,0 +1,207 @@
"""Tests for sentence difficulty scoring."""
import json
from pathlib import Path
import pytest
import frequency_lookup
from sentence_difficulty import DEFAULT_RANK, _resolve_token_frequency, build_nikkud_map, score_sentence
class TestBuildNikkudMap:
def test_maps_direct_headwords(self):
words = {"אָב": {"word": {"nikkud": "אָב", "ktiv_male": "אב"}}}
nmap = build_nikkud_map(words)
assert nmap["אָב"] == "אב"
def test_maps_conjugation_forms(self):
words = {
"שָׁמַר": {
"word": {"nikkud": "שָׁמַר", "ktiv_male": "שמר"},
"conjugation": {
"active_forms": [
{
"person": "1s",
"tense": "עָבָר",
"form": {"nikkud": "שָׁמַרְתִּי", "ktiv_male": "שמרתי"},
},
],
"infinitive": {"nikkud": "לִשְׁמֹר", "ktiv_male": "לשמור"},
"reference_form": {"nikkud": "שָׁמַר", "ktiv_male": "שמר"},
},
}
}
nmap = build_nikkud_map(words)
assert nmap["שָׁמַרְתִּי"] == "שמרתי"
assert nmap["לִשְׁמֹר"] == "לשמור"
def test_maps_noun_inflections(self):
words = {
"אָב": {
"word": {"nikkud": "אָב", "ktiv_male": "אב"},
"noun_inflection": {
"singular": {"nikkud": "אָב", "ktiv_male": "אב"},
"plural": {"nikkud": "אָבוֹת", "ktiv_male": "אבות"},
"pronominal_suffixes": {"1s": {"nikkud": "אָבִי", "ktiv_male": "אבי"}},
},
}
}
nmap = build_nikkud_map(words)
assert nmap["אָבוֹת"] == "אבות"
assert nmap["אָבִי"] == "אבי"
def test_maps_adjective_inflections(self):
words = {
"גָּדוֹל": {
"word": {"nikkud": "גָּדוֹל", "ktiv_male": "גדול"},
"adjective_inflection": {
"ms": {"nikkud": "גָּדוֹל", "ktiv_male": "גדול"},
"fs": {"nikkud": "גְּדוֹלָה", "ktiv_male": "גדולה"},
"mp": {"nikkud": "גְּדוֹלִים", "ktiv_male": "גדולים"},
"fp": {"nikkud": "גְּדוֹלוֹת", "ktiv_male": "גדולות"},
},
}
}
nmap = build_nikkud_map(words)
assert nmap["גְּדוֹלָה"] == "גדולה"
assert nmap["גְּדוֹלִים"] == "גדולים"
def test_construct_forms_strip_maqaf(self):
words = {
"בֵּית": {
"word": {"nikkud": "בֵּית", "ktiv_male": "בית"},
"noun_inflection": {
"construct_singular": {"nikkud": "בֵּית־", "ktiv_male": "בית"},
},
}
}
nmap = build_nikkud_map(words)
assert "בֵּית־" in nmap
assert "בֵּית" in nmap
def test_handles_missing_fields(self):
words = {
"test": {
"word": {"nikkud": "טֶסְט", "ktiv_male": "טסט"},
"conjugation": None,
"noun_inflection": None,
"adjective_inflection": None,
}
}
nmap = build_nikkud_map(words)
assert nmap["טֶסְט"] == "טסט"
def test_real_words_json_coverage(self):
words_path = Path(__file__).parent.parent / "data" / "words.json"
if not words_path.exists():
pytest.skip("words.json not available")
with open(words_path, encoding="utf-8") as f:
words = json.load(f)
nmap = build_nikkud_map(words)
assert len(nmap) > 90_000
class TestResolveTokenFrequency:
@pytest.fixture()
def freq_setup(self):
frequency_lookup.load()
freq_data = frequency_lookup.get_freq_data()
words_path = Path(__file__).parent.parent / "data" / "words.json"
if not words_path.exists():
pytest.skip("words.json not available")
with open(words_path, encoding="utf-8") as f:
words = json.load(f)
from epub_examples import _build_nikkud_index
nikkud_map = build_nikkud_map(words)
nikkud_index = _build_nikkud_index(words)
return nikkud_map, nikkud_index, freq_data
def test_tier1_known_mapping(self, freq_setup):
nikkud_map, nikkud_index, freq_data = freq_setup
rank = _resolve_token_frequency("אָב", nikkud_map, nikkud_index, freq_data)
assert rank is not None
assert rank < 50_000
def test_tier3_academy_converter(self, freq_setup):
nikkud_map, nikkud_index, freq_data = freq_setup
rank = _resolve_token_frequency("שָׁלוֹם", nikkud_map, nikkud_index, freq_data)
assert rank is not None
assert rank < 1000
def test_unknown_token_returns_default(self, freq_setup):
nikkud_map, nikkud_index, freq_data = freq_setup
rank = _resolve_token_frequency("קְסַנְתּוֹפּוּלוֹס", nikkud_map, nikkud_index, freq_data)
assert rank == 50_000
def test_tier5_ktiv_male_prefix_strip(self, freq_setup):
nikkud_map, nikkud_index, freq_data = freq_setup
assert freq_data.get("שלום") is not None
class TestScoreSentence:
@pytest.fixture()
def scoring_setup(self):
frequency_lookup.load()
freq_data = frequency_lookup.get_freq_data()
words_path = Path(__file__).parent.parent / "data" / "words.json"
if not words_path.exists():
pytest.skip("words.json not available")
with open(words_path, encoding="utf-8") as f:
words = json.load(f)
from epub_examples import _build_nikkud_index
nikkud_map = build_nikkud_map(words)
nikkud_index = _build_nikkud_index(words)
return nikkud_map, nikkud_index, freq_data
def test_returns_integer(self, scoring_setup):
nmap, nidx, freq = scoring_setup
text = "הוּא הָלַךְ הַבַּיְתָה"
start = text.index("הָלַךְ")
end = start + len("הָלַךְ")
score = score_sentence(text, start, end, nmap, nidx, freq)
assert isinstance(score, int)
def test_easy_sentence_scores_lower(self, scoring_setup):
nmap, nidx, freq = scoring_setup
easy = "הוּא אָמַר שָׁלוֹם"
easy_start = easy.index("אָמַר")
easy_end = easy_start + len("אָמַר")
hard = "הַפַּרְדֵּס נִשְׁתַּטֵּחַ בַּדַּהֲרָה"
hard_start = hard.index("נִשְׁתַּטֵּחַ")
hard_end = hard_start + len("נִשְׁתַּטֵּחַ")
easy_score = score_sentence(easy, easy_start, easy_end, nmap, nidx, freq)
hard_score = score_sentence(hard, hard_start, hard_end, nmap, nidx, freq)
assert easy_score < hard_score
def test_single_context_token(self, scoring_setup):
nmap, nidx, freq = scoring_setup
text = "הוּא טוֹב"
start = 0
end = len("הוּא")
score = score_sentence(text, start, end, nmap, nidx, freq)
assert isinstance(score, int)
def test_handles_punctuation(self, scoring_setup):
nmap, nidx, freq = scoring_setup
text = '"הוּא טוֹב!"'
start = text.index("טוֹב")
end = start + len("טוֹב")
score = score_sentence(text, start, end, nmap, nidx, freq)
assert isinstance(score, int)
def test_splits_on_maqaf(self, scoring_setup):
nmap, nidx, freq = scoring_setup
text = "בֵּית־סֵפֶר גָּדוֹל"
start = text.index("גָּדוֹל")
end = start + len("גָּדוֹל")
score = score_sentence(text, start, end, nmap, nidx, freq)
assert isinstance(score, int)
def test_no_context_tokens_returns_default(self, scoring_setup):
nmap, nidx, freq = scoring_setup
text = "א ב"
score = score_sentence(text, 0, 1, nmap, nidx, freq)
assert score == DEFAULT_RANK

58
tests/test_smoke.py Normal file
View file

@ -0,0 +1,58 @@
"""Smoke tests for the Hebrew Flash Cards project."""
import sys
from pathlib import Path
# Ensure project root is on path
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
def test_helpers_strip_nikkud():
from helpers import strip_nikkud
assert strip_nikkud("שָׁלוֹם") == "שלום"
assert strip_nikkud("hello") == "hello"
assert strip_nikkud("") == ""
def test_apkg_builder_imports():
import apkg_builder
assert hasattr(apkg_builder, "build_vocab_deck")
assert hasattr(apkg_builder, "build_conj_deck")
assert apkg_builder.VOCAB_MODEL_ID == 1_701_222_017_968
def test_data_files_exist():
data_dir = Path(__file__).resolve().parent.parent / "data"
assert (data_dir / "words.json").exists(), "words.json missing"
def test_strip_nikkud_idempotent():
from helpers import strip_nikkud
plain = "שלום"
assert strip_nikkud(plain) == plain
def test_strip_nikkud_all_marks():
from helpers import strip_nikkud
# Comprehensive: patach, kamatz, segol, tsere, hiriq, holam, kubutz, shva, dagesh
nikkud = "הַמַּלְכָּה"
plain = strip_nikkud(nikkud)
assert all(ch < "\u0591" or ch > "\u05c7" for ch in plain), f"Residual nikkud in: {plain}"
def test_categorize_pos_no_substring_match():
"""Regression: 'Pronoun' must NOT match 'Noun' category."""
from apkg_builder import _categorize_pos
assert _categorize_pos("Noun") == "Noun"
assert _categorize_pos("Verb") == "Verb"
assert _categorize_pos("Adjective") == "Adjective"
assert _categorize_pos("Adverb") == "Adverb"
assert _categorize_pos("Pronoun") == "Other", "Pronoun must not match Noun"
assert _categorize_pos("Preposition") == "Other"
assert _categorize_pos("Conjunction") == "Other"
assert _categorize_pos("Cardinal numeral") == "Other"

View file

@ -14,7 +14,6 @@ import json
import os import os
import re import re
import sqlite3 import sqlite3
import struct
import sys import sys
import tempfile import tempfile
import zipfile import zipfile
@ -22,6 +21,9 @@ from pathlib import Path
VOCAB_APKG = Path("output/hebrew_vocabulary.apkg") VOCAB_APKG = Path("output/hebrew_vocabulary.apkg")
CONJ_APKG = Path("output/hebrew_conjugations.apkg") CONJ_APKG = Path("output/hebrew_conjugations.apkg")
CONF_APKG = Path("output/hebrew_confusables.apkg")
PLURAL_APKG = Path("output/hebrew_plurals.apkg")
COMPLETE_APKG = Path("output/hebrew_complete.apkg")
PASS = "\033[32m✓\033[0m" PASS = "\033[32m✓\033[0m"
FAIL = "\033[31m✗\033[0m" FAIL = "\033[31m✗\033[0m"
@ -60,10 +62,9 @@ def _detect_format(data: bytes) -> str:
def validate_apkg(apkg_path: Path) -> int: def validate_apkg(apkg_path: Path) -> int:
"""Run all checks. Returns number of failures.""" """Run all checks. Returns number of failures."""
name = apkg_path.name print(f"\n{'=' * 60}")
print(f"\n{'='*60}")
print(f" Validating: {apkg_path}") print(f" Validating: {apkg_path}")
print(f"{'='*60}") print(f"{'=' * 60}")
failures = 0 failures = 0
@ -78,16 +79,17 @@ def validate_apkg(apkg_path: Path) -> int:
print("\n[ZIP structure]") print("\n[ZIP structure]")
try: try:
zf = zipfile.ZipFile(apkg_path) zf = zipfile.ZipFile(apkg_path)
except zipfile.BadZipFile as e:
print(f" {FAIL} Invalid ZIP: {e}")
return 1
with zf, tempfile.TemporaryDirectory() as tmpdir:
namelist = zf.namelist() namelist = zf.namelist()
has_db = "collection.anki2" in namelist has_db = "collection.anki2" in namelist
has_media = "media" in namelist has_media = "media" in namelist
failures += 0 if check("collection.anki2 present", has_db) else 1 failures += 0 if check("collection.anki2 present", has_db) else 1
failures += 0 if check("media manifest present", has_media) else 1 failures += 0 if check("media manifest present", has_media) else 1
except zipfile.BadZipFile as e:
print(f" {FAIL} Invalid ZIP: {e}")
return 1
with tempfile.TemporaryDirectory() as tmpdir:
zf.extractall(tmpdir) zf.extractall(tmpdir)
# --- Media manifest --- # --- Media manifest ---
@ -116,8 +118,11 @@ def validate_apkg(apkg_path: Path) -> int:
size = zf.getinfo(num).file_size if num in zf.NameToInfo else -1 size = zf.getinfo(num).file_size if num in zf.NameToInfo else -1
if size == 0: if size == 0:
zero_byte.append(orig) zero_byte.append(orig)
failures += 0 if check("No zero-byte media files", len(zero_byte) == 0, failures += (
f"{len(zero_byte)} empty" if zero_byte else "") else 1 0
if check("No zero-byte media files", len(zero_byte) == 0, f"{len(zero_byte)} empty" if zero_byte else "")
else 1
)
# Check audio format sample (first 20 mp3s) # Check audio format sample (first 20 mp3s)
mp3_names = [num for num, orig in media_map.items() if orig.endswith(".mp3")] mp3_names = [num for num, orig in media_map.items() if orig.endswith(".mp3")]
@ -127,16 +132,19 @@ def validate_apkg(apkg_path: Path) -> int:
fmt = _detect_format(data) fmt = _detect_format(data)
if "MP3" not in fmt: if "MP3" not in fmt:
bad_format.append(f"{media_map[num]}: {fmt}") bad_format.append(f"{media_map[num]}: {fmt}")
failures += 0 if check( failures += (
0
if check(
f"Audio format (sampled {min(20, len(mp3_names))} files)", f"Audio format (sampled {min(20, len(mp3_names))} files)",
len(bad_format) == 0, len(bad_format) == 0,
"; ".join(bad_format) if bad_format else f"all MP3", "; ".join(bad_format) if bad_format else "all MP3",
) else 1 )
else 1
)
# Fonts present # Fonts present
font_files = [v for v in original_names if v.endswith(".ttf")] font_files = [v for v in original_names if v.endswith(".ttf")]
check("Heebo font files bundled", len(font_files) >= 1, check("Heebo font files bundled", len(font_files) >= 1, ", ".join(font_files) if font_files else "none found")
", ".join(font_files) if font_files else "none found")
# --- Database --- # --- Database ---
print("\n[Database]") print("\n[Database]")
@ -144,8 +152,7 @@ def validate_apkg(apkg_path: Path) -> int:
conn = sqlite3.connect(db_path) conn = sqlite3.connect(db_path)
schema_ver = conn.execute("SELECT ver FROM col").fetchone()[0] schema_ver = conn.execute("SELECT ver FROM col").fetchone()[0]
failures += 0 if check("Schema version 11 (Anki 2.1)", schema_ver == 11, failures += 0 if check("Schema version 11 (Anki 2.1)", schema_ver == 11, f"got {schema_ver}") else 1
f"got {schema_ver}") else 1
note_count = conn.execute("SELECT COUNT(*) FROM notes").fetchone()[0] note_count = conn.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
card_count = conn.execute("SELECT COUNT(*) FROM cards").fetchone()[0] card_count = conn.execute("SELECT COUNT(*) FROM cards").fetchone()[0]
@ -153,33 +160,37 @@ def validate_apkg(apkg_path: Path) -> int:
failures += 0 if check("Cards present", card_count > 0, f"{card_count:,} cards") else 1 failures += 0 if check("Cards present", card_count > 0, f"{card_count:,} cards") else 1
# Determine expected cards per note from model templates # Determine expected cards per note from model templates
# Some templates are optional (e.g. cloze only generates when field is non-empty),
# so we check that cards fall between min and max expected range.
models_json_raw = conn.execute("SELECT models FROM col").fetchone()[0] models_json_raw = conn.execute("SELECT models FROM col").fetchone()[0]
models_raw = json.loads(models_json_raw) models_raw = json.loads(models_json_raw)
tmpl_counts = [len(m["tmpls"]) for m in models_raw.values()] tmpl_counts = [len(m["tmpls"]) for m in models_raw.values()]
expected_ratio = tmpl_counts[0] if len(set(tmpl_counts)) == 1 else None if len(set(tmpl_counts)) == 1 and len(tmpl_counts) == 1:
if expected_ratio: expected_ratio = tmpl_counts[0]
failures += 0 if check( # Allow fewer cards when optional templates exist (e.g. cloze)
f"{expected_ratio} card(s) per note", min_cards = note_count # at least 1 card per note
card_count == note_count * expected_ratio, max_cards = note_count * expected_ratio
f"{note_count} notes × {expected_ratio} = {note_count * expected_ratio}, got {card_count}", failures += (
) else 1 0
if check(
f"Cards per note (1{expected_ratio} templates)",
min_cards <= card_count <= max_cards,
f"{card_count:,} cards from {note_count:,} notes",
)
else 1
)
# Duplicate GUIDs # Duplicate GUIDs
dup_guids = conn.execute( dup_guids = conn.execute("SELECT guid, COUNT(*) c FROM notes GROUP BY guid HAVING c > 1").fetchall()
"SELECT guid, COUNT(*) c FROM notes GROUP BY guid HAVING c > 1" failures += 0 if check("No duplicate GUIDs", len(dup_guids) == 0, f"{len(dup_guids)} duplicates") else 1
).fetchall()
failures += 0 if check("No duplicate GUIDs", len(dup_guids) == 0,
f"{len(dup_guids)} duplicates") else 1
# Card queue states # Card queue states
queues = conn.execute( queues = conn.execute("SELECT type, queue, COUNT(*) FROM cards GROUP BY type, queue").fetchall()
"SELECT type, queue, COUNT(*) FROM cards GROUP BY type, queue"
).fetchall()
queue_map = {(t, q): cnt for t, q, cnt in queues} queue_map = {(t, q): cnt for t, q, cnt in queues}
new_cards = queue_map.get((0, 0), 0) new_cards = queue_map.get((0, 0), 0)
suspended = queue_map.get((0, -1), 0) + queue_map.get((1, -1), 0) + queue_map.get((2, -1), 0) suspended = queue_map.get((0, -1), 0) + queue_map.get((1, -1), 0) + queue_map.get((2, -1), 0)
if new_cards > 0: if new_cards > 0:
check(f"Cards in new queue (type=0, queue=0)", True, f"{new_cards:,}") check("Cards in new queue (type=0, queue=0)", True, f"{new_cards:,}")
if suspended > 0: if suspended > 0:
warn("Suspended cards", f"{suspended:,}") warn("Suspended cards", f"{suspended:,}")
@ -190,23 +201,18 @@ def validate_apkg(apkg_path: Path) -> int:
per_days = {dc.get("new", {}).get("perDay") for dc in dconf.values() if isinstance(dc, dict)} per_days = {dc.get("new", {}).get("perDay") for dc in dconf.values() if isinstance(dc, dict)}
check("new.order configured", bool(orders), f"{orders}") check("new.order configured", bool(orders), f"{orders}")
if per_days: if per_days:
check("new.perDay > 0", all(p and p > 0 for p in per_days if p is not None), check("new.perDay > 0", all(p and p > 0 for p in per_days if p is not None), f"perDay={per_days}")
f"perDay={per_days}")
# Deck assignment # Deck assignment
decks_json = conn.execute("SELECT decks FROM col").fetchone()[0] decks_json = conn.execute("SELECT decks FROM col").fetchone()[0]
decks = json.loads(decks_json) decks = json.loads(decks_json)
real_decks = {did: d for did, d in decks.items() if did != "1"} real_decks = {did: d for did, d in decks.items() if did != "1"}
if real_decks: if real_decks:
check("Custom deck exists (not Default only)", True, check("Custom deck exists (not Default only)", True, ", ".join(d["name"] for d in real_decks.values()))
", ".join(d["name"] for d in real_decks.values()))
# All cards in the custom deck? # All cards in the custom deck?
for did_str in real_decks: for did_str in real_decks:
assigned = conn.execute( assigned = conn.execute("SELECT COUNT(*) FROM cards WHERE did=?", [int(did_str)]).fetchone()[0]
"SELECT COUNT(*) FROM cards WHERE did=?", [int(did_str)] check(f"Cards in deck '{real_decks[did_str]['name']}'", assigned > 0, f"{assigned:,}/{card_count:,}")
).fetchone()[0]
check(f"Cards in deck '{real_decks[did_str]['name']}'", assigned > 0,
f"{assigned:,}/{card_count:,}")
# --- Sound references vs media manifest --- # --- Sound references vs media manifest ---
print("\n[Sound references]") print("\n[Sound references]")
@ -218,16 +224,25 @@ def validate_apkg(apkg_path: Path) -> int:
missing_audio = sound_refs - original_names missing_audio = sound_refs - original_names
orphaned_audio = original_names - sound_refs - set(font_files) orphaned_audio = original_names - sound_refs - set(font_files)
failures += 0 if check("All sound refs in media manifest", len(missing_audio) == 0, failures += (
f"{len(missing_audio)} missing" if missing_audio else "") else 1 0
if check(
"All sound refs in media manifest",
len(missing_audio) == 0,
f"{len(missing_audio)} missing" if missing_audio else "",
)
else 1
)
if orphaned_audio: if orphaned_audio:
warn("Media files not referenced by any card", f"{len(orphaned_audio)} orphaned") warn("Media files not referenced by any card", f"{len(orphaned_audio)} orphaned")
notes_with_audio = sum( notes_with_audio = sum(1 for (flds,) in notes_flds if "[sound:" in flds)
1 for (flds,) in notes_flds if "[sound:" in flds
)
pct = notes_with_audio / note_count * 100 if note_count else 0 pct = notes_with_audio / note_count * 100 if note_count else 0
check(f"Notes with audio", notes_with_audio > 0, f"{notes_with_audio:,}/{note_count:,} ({pct:.0f}%)") if notes_with_audio > 0:
check("Notes with audio", True, f"{notes_with_audio:,}/{note_count:,} ({pct:.0f}%)")
else:
# Non-audio variants intentionally have no audio — not a failure
warn("No audio in this deck variant", f"0/{note_count:,}")
# --- Empty fields check --- # --- Empty fields check ---
print("\n[Field content]") print("\n[Field content]")
@ -236,22 +251,12 @@ def validate_apkg(apkg_path: Path) -> int:
field_names = [f["name"] for f in model["flds"]] field_names = [f["name"] for f in model["flds"]]
# Check required fields (first 3) are not empty # Check required fields (first 3) are not empty
required_idx = list(range(min(3, len(field_names)))) required_idx = list(range(min(3, len(field_names))))
all_notes_for_model = conn.execute("SELECT flds FROM notes WHERE mid=?", [int(mid_str)]).fetchall()
for idx in required_idx: for idx in required_idx:
fname = field_names[idx] fname = field_names[idx]
empty_count = conn.execute(
"""SELECT COUNT(*) FROM notes
WHERE mid=? AND (
flds LIKE ? OR
instr(flds, char(31)) = 0
)""",
[int(mid_str), "\x1f" * idx + "\x1f%"],
).fetchone()[0]
# Simpler: count notes where field idx is empty
all_notes_for_model = conn.execute(
"SELECT flds FROM notes WHERE mid=?", [int(mid_str)]
).fetchall()
empty = sum( empty = sum(
1 for (flds,) in all_notes_for_model 1
for (flds,) in all_notes_for_model
if len(flds.split("\x1f")) <= idx or not flds.split("\x1f")[idx].strip() if len(flds.split("\x1f")) <= idx or not flds.split("\x1f")[idx].strip()
) )
if empty > 0: if empty > 0:
@ -271,6 +276,9 @@ def main() -> None:
group = parser.add_mutually_exclusive_group() group = parser.add_mutually_exclusive_group()
group.add_argument("--vocab", action="store_true", help="Validate vocabulary deck only") group.add_argument("--vocab", action="store_true", help="Validate vocabulary deck only")
group.add_argument("--conjugations", action="store_true", help="Validate conjugation deck only") group.add_argument("--conjugations", action="store_true", help="Validate conjugation deck only")
group.add_argument("--confusables", action="store_true", help="Validate confusables deck only")
group.add_argument("--plurals", action="store_true", help="Validate plurals deck only")
group.add_argument("--complete", action="store_true", help="Validate complete combined deck only")
args = parser.parse_args() args = parser.parse_args()
targets: list[Path] = [] targets: list[Path] = []
@ -280,19 +288,25 @@ def main() -> None:
targets = [VOCAB_APKG] targets = [VOCAB_APKG]
elif args.conjugations: elif args.conjugations:
targets = [CONJ_APKG] targets = [CONJ_APKG]
elif args.confusables:
targets = [CONF_APKG]
elif args.plurals:
targets = [PLURAL_APKG]
elif args.complete:
targets = [COMPLETE_APKG]
else: else:
targets = [VOCAB_APKG, CONJ_APKG] targets = [VOCAB_APKG, CONJ_APKG, CONF_APKG, PLURAL_APKG, COMPLETE_APKG]
total_failures = 0 total_failures = 0
for path in targets: for path in targets:
total_failures += validate_apkg(path) total_failures += validate_apkg(path)
print(f"\n{'='*60}") print(f"\n{'=' * 60}")
if total_failures == 0: if total_failures == 0:
print(f" {PASS} All checks passed") print(f" {PASS} All checks passed")
else: else:
print(f" {FAIL} {total_failures} check(s) failed") print(f" {FAIL} {total_failures} check(s) failed")
print(f"{'='*60}\n") print(f"{'=' * 60}\n")
sys.exit(0 if total_failures == 0 else 1) sys.exit(0 if total_failures == 0 else 1)

View file

@ -1,251 +0,0 @@
#!/usr/bin/env python3
"""
Validate nevo_typed_verbs_from_modern_hebrew against pealim.com.
For each verb:
1. Classifies it by position in the file (Pa'al/Nif'al/Pi'el/Pu'al/Hitpa'el/Hif'il/Huf'al)
2. Searches pealim.com to find URL slug
3. Fetches the page to confirm the binyan
4. Flags known-problem entries and detects: not-found, binyan mismatch, suspected typos
Output:
verbs_input.txt cleaned verb list for conjugation_extract.py
Printed validation report table
Usage:
python3 validate_verb_list.py
After running, review verbs_input.txt (especially REVIEW-flagged entries) before
running conjugation extraction.
"""
import re
import sys
import time
import urllib.parse
from pathlib import Path
import requests
from bs4 import BeautifulSoup
PEALIM_BASE = "https://www.pealim.com"
REQUEST_DELAY = 1.5
REQUEST_TIMEOUT = 15
SOURCE_FILE = Path(__file__).parent / "nevo_typed_verbs_from_modern_hebrew"
OUTPUT_FILE = Path(__file__).parent / "verbs_input.txt"
# Known problem entries: word → (action, note)
# action: "REVIEW" = comment out and flag, "3ms" = treat as 3ms past form
KNOWN_ISSUES: dict[str, tuple[str, str]] = {
"לגבוה": ("REVIEW", "not a standard infinitive form; likely defective spelling or wrong word"),
"לההרג": ("REVIEW", "extra ה; should probably be להיהרג (Nif'al of הרג)"),
"להתלקלח": ("REVIEW", "not a real word; likely typo for להתקלקל"),
"להקלל": ("REVIEW", "ambiguous: could be Hif'il לְהָקֵל (to ease) or Nif'al of קלל"),
"המציא": ("3ms", "Hif'il 3ms past form, not an infinitive"),
"קומם": ("3ms", "ambiguous: Pu'al 3ms past; Pi'el infinitive is לְקוֹמֵם"),
}
# Expected binyan by line range (1-indexed) per plan analysis
LINE_RANGES: list[tuple[range, str]] = [
(range(1, 18), "Pa'al"),
(range(18, 29), "Nif'al"),
(range(29, 37), "Pi'el"),
(range(37, 43), "Pu'al"),
(range(43, 53), "Hitpa'el"),
(range(53, 63), "Hif'il"),
(range(63, 71), "Huf'al"),
]
SECTION_HEADERS: dict[str, str] = {
"Pa'al": "# Pa'al (פָּעַל)",
"Nif'al": "# Nif'al (נִפְעַל)",
"Pi'el": "# Pi'el (פִּעֵל)",
"Pu'al": "# Pu'al (פֻּעַל) — 3ms past, no infinitive",
"Hitpa'el": "# Hitpa'el (הִתְפַּעֵל)",
"Hif'il": "# Hif'il (הִפְעִיל)",
"Huf'al": "# Huf'al (הֻפְעַל) — 3ms past, no infinitive",
}
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; pealim-anki/3.0)"})
def classify_by_line(line_num: int) -> str:
"""Return expected binyan for a 1-indexed line number."""
for r, binyan in LINE_RANGES:
if line_num in r:
return binyan
return "Unknown"
def find_slug(query: str) -> str | None:
"""Search pealim.com and return first URL slug found."""
url = f"{PEALIM_BASE}/search/?q={urllib.parse.quote(query)}"
try:
resp = session.get(url, timeout=REQUEST_TIMEOUT)
resp.raise_for_status()
slugs = re.findall(r"/dict/(\d+[-][^/?\"<>\s]+)/", resp.text)
return slugs[0] if slugs else None
except Exception as e:
print(f" ERROR searching {query!r}: {e}", file=sys.stderr)
return None
def get_page_binyan(slug: str) -> str:
"""Fetch /dict/<slug>/ and extract binyan from page header."""
url = f"{PEALIM_BASE}/dict/{slug}/"
try:
resp = session.get(url, cookies={"hebstyle": "mo"}, timeout=REQUEST_TIMEOUT)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
binyan_names = ["Pa'al", "Nif'al", "Pi'el", "Pu'al", "Hitpa'el", "Hif'il", "Huf'al"]
for h3 in soup.find_all("h3", class_="page-header"):
text = h3.get_text(" ", strip=True)
for bname in binyan_names:
if bname in text:
return bname
meta = soup.find("meta", {"property": "og:description"})
if meta:
desc = meta.get("content", "")
for bname in binyan_names:
if bname in desc:
return bname
except Exception as e:
print(f" ERROR fetching {slug}: {e}", file=sys.stderr)
return ""
def main() -> None:
if not SOURCE_FILE.exists():
print(f"ERROR: {SOURCE_FILE} not found", file=sys.stderr)
sys.exit(1)
lines = [l.strip() for l in SOURCE_FILE.read_text(encoding="utf-8").splitlines() if l.strip()]
print(f"Loaded {len(lines)} entries from {SOURCE_FILE.name}")
print(f"Querying pealim.com (delay {REQUEST_DELAY}s per request)…\n")
results = []
for line_num, word in enumerate(lines, start=1):
expected_binyan = classify_by_line(line_num)
issue_type, issue_note = KNOWN_ISSUES.get(word, (None, ""))
# Positions 37-42 (Pu'al) and 63-70 (Huf'al) are 3ms past forms
is_3ms_by_position = expected_binyan in ("Pu'al", "Huf'al")
print(f"[{line_num:2d}/{len(lines)}] {word:<20}", end=" ", flush=True)
if issue_type == "REVIEW":
# Don't query pealim for known-bad entries
print(f"REVIEW (skipping query)")
results.append({
"line": line_num, "word": word,
"expected_binyan": expected_binyan,
"slug": "", "page_binyan": "",
"status": "REVIEW", "notes": issue_note,
"is_3ms": is_3ms_by_position,
})
continue
time.sleep(REQUEST_DELAY)
slug = find_slug(word)
if slug:
time.sleep(REQUEST_DELAY)
page_binyan = get_page_binyan(slug)
else:
page_binyan = ""
# Determine status
if issue_type == "3ms" or is_3ms_by_position:
status = "3ms"
notes = issue_note or "Pu'al/Huf'al 3ms past form"
elif not slug:
status = "NOT_FOUND"
notes = "no search result on pealim.com"
elif page_binyan and expected_binyan and page_binyan != expected_binyan:
status = "MISMATCH"
notes = f"expected {expected_binyan}, page says {page_binyan}"
else:
status = "OK"
notes = ""
print(f"{status:<12} slug={slug or '-':<35} binyan={page_binyan or '-'}")
results.append({
"line": line_num, "word": word,
"expected_binyan": expected_binyan,
"slug": slug or "", "page_binyan": page_binyan,
"status": status, "notes": notes,
"is_3ms": is_3ms_by_position or issue_type == "3ms",
})
# ── Write cleaned verbs_input.txt ────────────────────────────────────────────
sections: dict[str, list[str]] = {b: [] for b in SECTION_HEADERS}
review_lines: list[str] = []
for r in results:
b = r["expected_binyan"]
if b not in sections:
b = list(sections.keys())[0]
if r["status"] == "REVIEW":
review_lines.append(f"# REVIEW: {r['word']}{r['notes']}")
elif r["status"] == "3ms":
sections[b].append(f"# 3ms: {r['word']}")
elif r["status"] in ("OK", "MISMATCH"):
sections[b].append(r["word"])
else: # NOT_FOUND
sections[b].append(f"# NOT_FOUND: {r['word']}{r['notes']}")
output_lines = [
"# Verb list — validated against pealim.com from nevo_typed_verbs_from_modern_hebrew",
"# Lines prefixed '# 3ms:' are searched by 3ms past form (Pu'al/Huf'al).",
"# Lines prefixed '# REVIEW:' need manual correction before conjugation extraction.",
"# Lines prefixed '# NOT_FOUND:' had no pealim.com result — check spelling.",
"",
]
for binyan, header in SECTION_HEADERS.items():
if sections.get(binyan):
output_lines.append(header)
output_lines.extend(sections[binyan])
output_lines.append("")
if review_lines:
output_lines.append("# ── Entries flagged for manual review ──────────────────────────────────────────")
output_lines.extend(review_lines)
output_lines.append("")
OUTPUT_FILE.write_text("\n".join(output_lines), encoding="utf-8")
print(f"\nWrote → {OUTPUT_FILE}")
# ── Print summary table ──────────────────────────────────────────────────────
col_w = [4, 22, 14, 38, 12]
print("\n" + "=" * 95)
print("VALIDATION REPORT")
print("=" * 95)
print(f"{'#':>4} {'Verb':<22} {'Status':<14} {'Slug':<38} {'Binyan':<12} Notes")
print("-" * 95)
for r in results:
print(
f"{r['line']:>4} {r['word']:<22} {r['status']:<14} "
f"{r['slug'][:36]:<38} {r['page_binyan'] or '-':<12} {r['notes']}"
)
print("=" * 95)
counts = {s: sum(1 for r in results if r["status"] == s)
for s in ("OK", "3ms", "MISMATCH", "REVIEW", "NOT_FOUND")}
print(
f"\nSummary: {counts['OK']} OK | {counts['3ms']} 3ms-past | "
f"{counts['MISMATCH']} MISMATCH | {counts['REVIEW']} REVIEW | {counts['NOT_FOUND']} NOT_FOUND"
)
print(f"Total entries: {len(results)}")
if counts["REVIEW"] > 0 or counts["NOT_FOUND"] > 0 or counts["MISMATCH"] > 0:
print(
"\n⚠ Review flagged entries in verbs_input.txt before running:\n"
" python3 conjugation_extract.py"
)
if __name__ == "__main__":
main()

View file

@ -2,6 +2,8 @@
# Lines prefixed '# 3ms:' are searched by 3ms past form (Pu'al/Huf'al). # Lines prefixed '# 3ms:' are searched by 3ms past form (Pu'al/Huf'al).
# Pa'al (פָּעַל) # Pa'al (פָּעַל)
# slug: להיות 454-lihyot
להיות
לשמור לשמור
ללמוד ללמוד
לאסוף לאסוף