Anki Flash Cards for Learning Hebrew Vocabulary and Conjugations!
Implements four major improvements to the Pealim Anki deck pipeline:
1. Automated .apkg generation (genanki) — no more manual Anki Desktop step.
Both vocabulary and conjugation decks are built programmatically.
2. Word frequency ranking from hermitdave/FrequencyWords he_50k corpus.
Notes sorted by rank so Anki presents most common words first.
3. Example sentences from Ben Yehuda public domain corpus (not pealim.com).
Downloads txt_stripped.zip, indexes 25k texts, ~89% coverage on test set.
4. Conjugation drill deck — one card per form × verb.
Input: verbs_input.txt (Hebrew infinitives). Initial set: 7 verbs (one
per binyan). Extracts 28 forms each via pealim.com/search/ + table parse.
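The frequency ordering in item 2 can be sketched roughly as follows. This assumes the FrequencyWords file is plain text with one `word count` pair per line, most frequent first; `load_ranks`, `sort_by_rank`, and the `UNRANKED` sentinel are hypothetical names, not taken from frequency_lookup.py.

```python
from typing import Dict, Iterable, List

UNRANKED = 10**9  # hypothetical sentinel: words missing from the list sort last


def load_ranks(lines: Iterable[str]) -> Dict[str, int]:
    """Map each word to its 1-based rank (position among valid entries)."""
    ranks: Dict[str, int] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        # Keep the first (highest) rank if a word appears twice.
        ranks.setdefault(parts[0], len(ranks) + 1)
    return ranks


def sort_by_rank(words: List[str], ranks: Dict[str, int]) -> List[str]:
    """Most common words first; unranked words keep their order at the end."""
    return sorted(words, key=lambda w: ranks.get(w, UNRANKED))


# Illustrative sample lines; the counts are made up.
sample = ["של 1000", "את 900", "לא 800"]
ranks = load_ranks(sample)
ordered = sort_by_rank(["לא", "שמירה", "של"], ranks)
```

Because `sorted` is stable, words that are absent from the frequency list stay in their original relative order after the ranked ones.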
New files:
apkg_builder.py — genanki deck builder for both decks
benyehuda.py — Ben Yehuda corpus downloader + sentence indexer
frequency_lookup.py — FrequencyWords downloader + rank lookup
verbs_input.txt — verb input list (7 test verbs, one per binyan)
data/ — baseline CSVs + generated caches
Updated:
conjugation_extract.py — rewritten: reads verbs_input.txt, searches
/search/?q= for slug, parses table by row labels
requirements.txt — add genanki, beautifulsoup4, lxml
run.py — full orchestration pipeline with CLI flags
.gitignore — exclude venv/, benyehuda_index.json, audio/, output/
CLI:
python run.py --skip-scrape --skip-audio --test 20 (quick test)
python run.py --skip-scrape (full build)
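A minimal argparse sketch of the flags used in the two commands above. Only the flag names come from this message; the help texts describe assumed behaviour, and `build_parser` and `parse_args` are hypothetical names rather than the actual contents of run.py.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Build the Pealim Anki decks")
    # Assumed meaning: reuse existing CSVs instead of scraping pealim.com again.
    parser.add_argument("--skip-scrape", action="store_true",
                        help="skip the scraping step")
    # Assumed meaning: do not download audio files.
    parser.add_argument("--skip-audio", action="store_true",
                        help="skip the audio step")
    # Assumed meaning: limit the build to the first N notes for a quick test.
    parser.add_argument("--test", type=int, metavar="N", default=None,
                        help="build only N notes")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv, or sys.argv[1:] when argv is None."""
    return build_parser().parse_args(argv)


args = parse_args(["--skip-scrape", "--skip-audio", "--test", "20"])
```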
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pealim — Hebrew Vocabulary Scraper & Anki Deck Generator
Extract Hebrew vocabulary from pealim.com and automatically generate Anki flashcards with roots, parts of speech, and related words.
Features
- Dictionary Scraping — Extracts ~14,400 Hebrew words with roots and parts of speech
- Anki-Ready — Generates flashcards with Hebrew tags and shared-root grouping
- Conjugation Tables — Extracts verb conjugation forms for reference
- Respectful — Built-in delays and connection pooling
- Robust — Retry logic, error handling, and detailed logging
Installation
pip install -r requirements.txt
Usage
Extract Everything
python3 run.py
Dictionary Only
python3 pealim_extract.py
Conjugations Only
python3 conjugation_extract.py
Output Files
- pealim_dict.csv — Raw dictionary (Word, Root, Part of Speech, Word Without Nikkud)
- pealim_dict_for_anki.csv — Anki-formatted (adds shared roots and Hebrew tags)
- conjugations.csv — Verb conjugation forms
- pealim.apkg — Ready-to-import Anki deck
Configuration
Edit constants at the top of each script:
- REQUEST_DELAY — Seconds between requests (default: 1.5)
- REQUEST_TIMEOUT — Network timeout (default: 10s)
- max_pages — Limit extraction for testing
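One way a delay constant like REQUEST_DELAY could be enforced is a small throttle that sleeps only for the time still owed since the previous request. This is a sketch, not the scripts' actual code: the `Throttle` class and its injectable `clock` and `sleep` parameters are hypothetical.

```python
import time
from typing import Callable, Optional

REQUEST_DELAY = 1.5  # seconds between requests (documented default)


class Throttle:
    """Enforce a minimum gap between consecutive calls to wait()."""

    def __init__(self, delay: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # None until the first request

    def wait(self) -> float:
        """Sleep just long enough to honour the delay; return seconds slept."""
        now = self._clock()
        pause = 0.0
        if self._last is not None:
            pause = max(0.0, self.delay - (now - self._last))
            if pause:
                self._sleep(pause)
        self._last = now + pause
        return pause
```

A scraper would call `throttle.wait()` immediately before each HTTP request, so time already spent parsing the previous page counts toward the delay.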
Performance
- Full dictionary: ~10-15 minutes (608 pages × 2 requests/page + delays)
- ~14,400 words extracted
- ~960KB CSV output
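The configured delay alone sets a floor on the runtime. The rough calculation below assumes the default delay of 1.5 seconds and ignores network latency; whether the scripts pause once per page or before every request is not stated here, so both cases are computed.

```python
PAGES = 608
REQUESTS_PER_PAGE = 2
REQUEST_DELAY = 1.5  # seconds, the documented default


def delay_minutes(pauses: int, delay: float = REQUEST_DELAY) -> float:
    """Total time spent sleeping, in minutes, for a given number of pauses."""
    return pauses * delay / 60


per_page = delay_minutes(PAGES)                         # one pause per page
per_request = delay_minutes(PAGES * REQUESTS_PER_PAGE)  # one pause per request

print(f"{per_page:.1f} min if pausing once per page")        # 15.2 min
print(f"{per_request:.1f} min if pausing before every request")  # 30.4 min
```

Pausing once per page lands close to the upper end of the stated range, while pausing before every request would roughly double it.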
Data Structure
pealim_dict_for_anki.csv
| Column | Example |
|---|---|
| Word | שמור |
| Root | שמר |
| Part of Speech | Verb |
| Word Without Nikkud | שמור |
| shared roots | שומר שמירה |
| tags | שורש::שמר פעלים |
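The two derived columns could be computed along these lines. This is a sketch under assumptions: the grouping rule (all other words with the same root), the `POS_TAGS` mapping, and the function name are hypothetical, and the two noun rows are illustrative data chosen to reproduce the example row above.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical part-of-speech to Hebrew tag mapping; only the "Verb" entry
# is attested by the example row in the table.
POS_TAGS = {"Verb": "פעלים"}


def add_anki_columns(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return copies of rows with 'shared roots' and 'tags' columns added."""
    by_root: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        if row["Root"]:
            by_root[row["Root"]].append(row["Word"])

    out = []
    for row in rows:
        root = row["Root"]
        # Every other word sharing this root, in input order.
        siblings = [w for w in by_root.get(root, []) if w != row["Word"]]
        tags = []
        if root:
            tags.append(f"שורש::{root}")  # hierarchical root tag
        pos_tag = POS_TAGS.get(row["Part of Speech"])
        if pos_tag:
            tags.append(pos_tag)
        out.append({**row,
                    "shared roots": " ".join(siblings),
                    "tags": " ".join(tags)})
    return out


rows = [
    {"Word": "שמור", "Root": "שמר", "Part of Speech": "Verb"},
    {"Word": "שומר", "Root": "שמר", "Part of Speech": "Noun"},
    {"Word": "שמירה", "Root": "שמר", "Part of Speech": "Noun"},
]
result = add_anki_columns(rows)
```

Anki treats `::` inside a tag as a hierarchy separator, so a tag of this shape groups all words of one root under a common parent tag.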
conjugations.csv
Columns: present_ms, present_fs, past_1s, future_1s, infinitive, etc.
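A row with these columns can be expanded into one drill card per form. The sketch below assumes each form has its own column and that the infinitive identifies the verb; `expand_to_cards` is a hypothetical helper, and the sample row shows only four of the form columns.

```python
from typing import Dict, List, Tuple


def expand_to_cards(row: Dict[str, str],
                    key: str = "infinitive") -> List[Tuple[str, str, str]]:
    """Turn one verb row into (infinitive, form_label, form) cards.

    Empty cells are skipped so verbs with missing forms yield fewer cards.
    """
    verb = row[key]
    return [(verb, label, form)
            for label, form in row.items()
            if label != key and form]


row = {
    "infinitive": "לשמור",
    "present_ms": "שומר",
    "present_fs": "שומרת",
    "past_1s": "שמרתי",
    "future_1s": "אשמור",
}
cards = expand_to_cards(row)
```

If all 28 forms are filled in for each of the 7 verbs in verbs_input.txt, this expansion would produce 7 × 28 = 196 cards.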
Notes
- Respects pealim.com's server with configurable delays
- Uses session pooling for efficiency
- Handles network errors gracefully with retries
- All logging output goes to stdout + log file
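Retry handling of the kind described above might look like the following sketch. The attempt count, backoff factor, and the `with_retries` name are hypothetical. Catching `OSError` covers the standard library's network errors and also `requests.RequestException`, which subclasses `IOError`.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("pealim")


def with_retries(fetch: Callable[[], T], attempts: int = 3,
                 backoff: float = 2.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch(); on OSError wait backoff * 2**n seconds and try again."""
    for n in range(attempts):
        try:
            return fetch()
        except OSError as exc:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            wait = backoff * 2 ** n
            log.warning("attempt %d failed (%s); retrying in %.1fs",
                        n + 1, exc, wait)
            sleep(wait)
    raise ValueError("attempts must be at least 1")
```

A caller would wrap each request in a zero-argument function, for example `with_retries(lambda: session.get(url, timeout=REQUEST_TIMEOUT))`, where `session` and `url` are the caller's own objects.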
License
Personal use. Hebrew learning tool.