Improve scraper robustness and Hebrew text handling

Sochen 2026-02-26 21:57:20 +00:00
parent 158f0477a3
commit e23b353064
6 changed files with 459 additions and 99 deletions


@ -1,45 +1,83 @@
# Pealim.com Dictionary To Anki Flash Cards
# Pealim — Hebrew Vocabulary Scraper & Anki Deck Generator
![Picture of a flashcard](./flashcard.png)
## Tell me about the Script
Extract Hebrew vocabulary from [pealim.com](https://www.pealim.com/dict/) and automatically generate Anki flashcards with roots, parts of speech, and related words.
This repository contains both the Python script that scrapes the pealim.com dictionary and the resulting CSV file, 'pealim_dict.csv'. The script also has a function that adds tags by part of speech and a "shared roots" field listing the words with the same root. The resulting file is 'pealim_dict_for_anki.csv'. This file is then imported into Anki, where a custom "pealim" note type turns it into flashcards.
## Features
## Just Give me the Flash Cards
- **Dictionary Scraping** — Extracts ~14,400 Hebrew words with roots and parts of speech
- **Anki-Ready** — Generates flashcards with Hebrew tags and shared-root grouping
- **Conjugation Tables** — Extracts verb conjugation forms for reference
- **Respectful** — Built-in delays and connection pooling
- **Robust** — Retry logic, error handling, and detailed logging
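As a rough illustration of the Hebrew tagging mentioned in the list above, the sketch below mirrors the mapping used in `modify_for_anki` for a single row. The helper name `build_tags` is hypothetical and not part of the repository, and the root format with spaces and dashes is an assumption based on the cleanup the script performs.

```python
# Hypothetical helper (not in the repository) mirroring the tag logic
# of modify_for_anki for a single dictionary row.
POS_TAGS = {
    'Adverb': 'תוארי_הפועל',
    'Pronoun': 'כינוייוף',
    'Noun': 'שם_עצם',
    'Verb': 'פעלים',
    'Adjective': 'שם_תואר',
    'Preposition': 'מילות_יחס',
    'Conjunction': 'מילות_חיבור',
    'Particle': 'מילית',
}


def build_tags(root, part_of_speech):
    """Return a space-separated Anki tag string for one entry."""
    parts = []
    # Roots may contain spaces, dots or dashes; strip all of them.
    root_clean = (root or '').replace(' ', '').replace('-', '').replace('.', '')
    if root_clean and 'nan' not in root_clean:
        parts.append(f'שורש::{root_clean}')
    # The first matching key wins, so the dict order matters.
    for key, tag in POS_TAGS.items():
        if key in part_of_speech:
            parts.append(tag)
            break
    return ' '.join(parts)


print(build_tags('ש - מ - ר', 'Verb'))
```

Entries without a root (for example a root of `-`) simply get the part-of-speech tag on its own.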
The file that contains the formatted flashcards is 'pealim.apkg'. This is likely the file you want to import into Anki (you can import the CSV files, but then you have to manage your own custom note type).
## Installation
Each word in the 'pealim.apkg' file has two cards: one shows the word in Hebrew and asks for the English translation, and the other does the reverse. Once the answer is revealed, both cards show the root of the word, other words with the same root, the part of speech, and the word written without nikkud (i.e. in modern Hebrew spelling).
```bash
pip install -r requirements.txt
```
The notes are also tagged with their parts of speech as well as their root to make it easy to search.
## Usage
## Suggested Usage
### Extract Everything
```bash
python3 run.py
```
I would start by suspending all of the cards in the deck. As you read a text and encounter a word you don't know, use Anki's browsing capability to search for it or its root. Take a look at the other words with the same root to try and understand how the words are related. At this point you can:
### Dictionary Only
```bash
python3 pealim_extract.py
```
A) Unsuspend just the new word that you have encountered or read
### Conjugations Only
```bash
python3 conjugation_extract.py
```
B) Unsuspend the new word, as well as all of the words with the same root
## Output Files
C) Employ a mixture of these strategies
- **pealim_dict.csv** — Raw dictionary (Word, Root, Part of Speech, Word Without Nikkud)
- **pealim_dict_for_anki.csv** — Anki-formatted (adds `shared roots` and Hebrew `tags`)
- **conjugations.csv** — Verb conjugation forms
- **pealim.apkg** — Ready-to-import Anki deck
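The `shared roots` column can be thought of as a group-by on the root. The sketch below uses only the standard library. `shared_roots` is a hypothetical helper rather than the repository's implementation, and the sample rows are illustrative values modelled on the README example, not data read from the real CSV.

```python
from collections import defaultdict


def shared_roots(entries):
    """Map each word to the other words sharing its root.

    entries: list of (word, root) pairs; a root of '-' or '' means
    "no root" and yields an empty string.
    """
    by_root = defaultdict(list)
    for word, root in entries:
        if root and root != '-':
            by_root[root].append(word)

    result = {}
    for word, root in entries:
        others = [w for w in by_root.get(root, []) if w != word]
        result[word] = ' '.join(others)
    return result


# Illustrative sample only (assumed values).
sample = [('שמור', 'שמר'), ('שומר', 'שמר'), ('שמירה', 'שמר'), ('אבל', '-')]
print(shared_roots(sample)['שמור'])
```

Building the index once keeps the work roughly linear in the number of entries, whereas filtering the whole table for every row grows quadratically. Note that this sketch keys the result by word, so homographs would overwrite each other.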
Consider the fact that some roots are more productive than others, and some words with the same root are either not related at all or not easily related.
## Configuration
I would not recommend memorizing words that you are not reading or otherwise encountering; it is much more difficult to remember words when you do not have the context for them. It is much easier when you can remember the exact moment you saw a word "in the wild," so to speak.
Edit constants at the top of each script:
## Lost in Translation
- `REQUEST_DELAY` — Seconds between requests (default: 1.5)
- `REQUEST_TIMEOUT` — Network timeout (default: 10s)
- `max_pages` — Limit extraction for testing
I would say the vast majority of the translations in the pealim.com dictionary are sufficient and helpful; however, some of the definitions are not as good as they could be, or do not provide enough context to truly understand what the word means. If you are suspicious of a definition, I suggest using the English-Hebrew dictionary https://www.morfix.co.il/ as well as the Hebrew dictionary https://milog.co.il.
## Performance
Morfix has a much larger database that includes expressions and idioms that you might read or hear. Milog is the most in-depth; however, it requires being able to understand the definitions in Hebrew.
- Full dictionary: ~15 minutes of delays alone at the default 1.5 s (608 pages), plus the time for 2 requests/page
- ~14,400 words extracted
- ~960KB CSV output
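A back-of-the-envelope check of the timing: with the default 1.5 s delay, the pauses alone for 608 pages come to about 15 minutes, and request latency adds to that. The function name `estimate_minutes` and the 0.3 s average latency are assumptions for illustration, not measurements.

```python
def estimate_minutes(pages, delay_s, requests_per_page=2, latency_s=0.3):
    """Rough wall-clock estimate for a full scrape, in minutes.

    latency_s is an assumed average time per HTTP request.
    """
    per_page = delay_s + requests_per_page * latency_s
    return pages * per_page / 60


# Delay only (no network time), then with the assumed latency.
print(round(estimate_minutes(608, 1.5, latency_s=0), 1))
print(round(estimate_minutes(608, 1.5), 1))
```

Lowering `REQUEST_DELAY` shortens the run proportionally, at the cost of putting more load on the site.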
### Fixing errors
## Data Structure
Your options are to:
### pealim_dict_for_anki.csv
A) Fix your own deck based off of a superior definition you found
| Column | Example |
|--------|---------|
| Word | שמור |
| Root | שמר |
| Part of Speech | Verb |
| Word Without Nikkud | שמור |
| shared roots | שומר שמירה |
| tags | שורש::שמר פעלים |
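Because the Anki file is written with `sep=';'` and `index=True`, a reader has to use the same delimiter and expect an unnamed index column first. A minimal standard-library sketch, using an in-memory sample row (reusing the example values from the table) in place of the real file:

```python
import csv
import io

# In-memory stand-in for pealim_dict_for_anki.csv; the single row
# reuses the example values from the README table.
sample = (
    ";Word;Root;Part of Speech;Word Without Nikkud;shared roots;tags\n"
    "0;שמור;שמר;Verb;שמור;שומר שמירה;שורש::שמר פעלים\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter=';')
rows = list(reader)

# The unnamed first column is the pandas index; its key is ''.
for row in rows:
    print(row['Word'], '|', row['shared roots'], '|', row['tags'])
```

To read the real file, the `io.StringIO(sample)` argument would be replaced by `open('pealim_dict_for_anki.csv', encoding='utf-8', newline='')`.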
B) Inform pealim.com of a problem with one of their definitions, and then re-run the script to scrape their website (or ask me to re-run it and re-generate the files in this repository)
### conjugations.csv
Columns: `present_ms`, `present_fs`, `past_1s`, `future_1s`, `infinitive`, etc.
## Notes
- Respects pealim.com's server with configurable delays
- Uses session pooling for efficiency
- Handles network errors gracefully with retries
- Logging output goes to stderr via `logging.basicConfig`; the scripts shown do not configure a log file
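The retry behaviour noted above can be sketched as a small generic helper with exponential backoff. This is an illustration under stated assumptions, not the repository's code: `retry` and its parameters are hypothetical names, and unlike `extract_verb`, which returns `None` after the last failed attempt, this version re-raises the last error.

```python
import time


def retry(func, max_retries=3, base_delay=1.0, sleep=time.sleep,
          exceptions=(Exception,)):
    """Call func() until it succeeds or max_retries attempts fail.

    Waits base_delay * 2**attempt between attempts and re-raises the
    last error when all attempts are used up.
    """
    for attempt in range(max_retries):
        try:
            return func()
        except exceptions:
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Demo with a fake operation that fails twice, then succeeds.
calls = {'n': 0}
waits = []


def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('simulated failure')
    return 'ok'


# Record the waits instead of sleeping so the demo runs instantly.
result = retry(flaky, sleep=waits.append)
print(result, waits)
```

Passing the `sleep` function in as a parameter keeps the helper easy to test without real delays.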
## License
Personal use. Hebrew learning tool.


@ -1,28 +1,153 @@
#!./bin/python3
#!/usr/bin/env python3
"""
Extract Hebrew verb conjugations from pealim.com.
Scrapes conjugation tables for specific verbs.
"""
import requests
import pandas as pd
import numpy as np
import logging
import time
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
# Session for connection pooling
session = requests.Session()
session.headers.update({
'User-Agent': 'Mozilla/5.0 (compatible; pealim-scraper/1.0)'
})
PEALIM_BASE_URL = "https://www.pealim.com/dict"
REQUEST_TIMEOUT = 10
REQUEST_DELAY = 1.0 # seconds between requests (respectful scraping)
# Conjugation column order (standard Hebrew verb forms)
CONJUGATION_COLUMNS = [
'present_ms', 'present_fs', 'present_mp', 'present_fp',
'past_1s', 'past_1p', 'past_2ms', 'past_2fs', 'past_2mp', 'past_2fp',
'past_3ms', 'past_3fs', 'past_3p',
'future_1s', 'future_1p', 'future_2ms', 'future_2fs', 'future_2mp', 'future_2fp',
'future_3ms', 'future_3fs', 'future_3mp', 'future_3fp',
'imperative_ms', 'imperative_fs', 'imperative_mp', 'imperative_fp',
'infinitive'
]
def extract_from_website():
# Number of total pages of dictionary in pealim.com/dict/
# i.e. Number Of Words / 15
columns = ['present ms', 'present fs', 'present mp' , 'present fp', 'past 1s', 'past 1p', 'past 2ms', 'past 2fs', 'past 2mp', 'past 2fp', 'past 3ms', 'past 3fs', 'past 3p', 'future 1s', 'future 1p', 'future 2ms', 'future 2fs', 'future 2mp', 'future 2fp', 'future 3ms', 'future 3fs', 'future 3mp', 'future 3fp', 'imperative ms', 'imperative fs', 'imperative mp', 'imperative fp', 'infinitive']
def extract_verb(url_suffix: str, max_retries: int = 3) -> pd.DataFrame:
"""
Extract conjugation table for a single verb.
Args:
url_suffix: URL suffix (e.g., '2255-lishmor', '860-lishon')
max_retries: Maximum retry attempts on failure
Returns:
DataFrame with conjugation forms, or None if extraction fails
"""
url = f"{PEALIM_BASE_URL}/{url_suffix}"
for attempt in range(max_retries):
try:
logger.info(f"Fetching: {url} (attempt {attempt + 1}/{max_retries})")
cookies = {
'translit': 'none',
'hebstyle': 'bp',
'showmeaning': 'off'
}
response = session.get(url, cookies=cookies, timeout=REQUEST_TIMEOUT)
response.raise_for_status()
# Parse HTML table
dfs = pd.read_html(response.content)
if not dfs:
logger.warning(f"No tables found for {url_suffix}")
return None
df = dfs[0]
# Extract conjugation forms (skip header columns, flatten)
# Adjust indices based on actual table structure
np_flat = df.iloc[:, 2:].values.flatten()
# Remove NaN and invalid entries
np_flat = np.delete(np_flat, [5, 7, 15, 17, 19, 33, 34, 35])
# Create DataFrame with proper column names
df_result = pd.DataFrame([np_flat], columns=CONJUGATION_COLUMNS)
logger.info(f"✓ Extracted {url_suffix}")
return df_result
except requests.RequestException as e:
logger.error(f"Network error for {url_suffix} (attempt {attempt + 1}): {e}")
if attempt < max_retries - 1:
time.sleep(2 ** attempt) # Exponential backoff
else:
return None
except Exception as e:
logger.error(f"Error parsing {url_suffix}: {e}")
return None
url_suffixes = ['2255-lishmor', '860-lishon']
new_df = pd.DataFrame()
def extract_from_website(url_suffixes: list = None) -> pd.DataFrame:
"""
Extract conjugations for multiple verbs.
Args:
url_suffixes: List of URL suffixes to process
Returns:
Combined DataFrame with all conjugations
"""
if url_suffixes is None:
# Default verbs: "to guard" and "to sleep"
url_suffixes = ['2255-lishmor', '860-lishon']
logger.info(f"Starting extraction for {len(url_suffixes)} verb(s)...")
all_dfs = []
for url_suffix in url_suffixes:
url=f"https://www.pealim.com/dict/{url_suffix}"
cookies={'translit':'none', 'hebstyle' : 'bp', 'showmeaning' : 'off'}
html = requests.get(url, cookies=cookies)
df = pd.read_html(html.content)[0]
np_flat = df.iloc[:, 2:].values.flatten()
np_flat = np.delete(np_flat, [5,7,15,17,19,33,34,35])
df_trim = pd.DataFrame([np_flat], columns=columns)
new_df = pd.concat([new_df, df_trim], ignore_index=True)
df = extract_verb(url_suffix)
if df is not None:
all_dfs.append(df)
time.sleep(REQUEST_DELAY)  # Delay between requests (respectful scraping)
if not all_dfs:
logger.error("No data extracted!")
return pd.DataFrame()
combined_df = pd.concat(all_dfs, ignore_index=True)
logger.info(f"Extraction complete. Total verbs: {len(combined_df)}")
return combined_df
new_df.to_csv('conjugations.csv', sep=';', index=True)
print(new_df.to_string())
extract_from_website()
def main():
"""Main entry point."""
try:
df = extract_from_website()
if df.empty:
logger.error("No data to save!")
return
df.to_csv('conjugations.csv', sep=';', index=True)
logger.info("Saved: conjugations.csv")
logger.info("\n" + df.to_string())
logger.info("✅ Complete!")
except Exception as e:
logger.error(f"Fatal error: {e}")
raise
if __name__ == '__main__':
main()


@ -1,72 +1,187 @@
#!./bin/python3
#!/usr/bin/env python3
"""
Extract Hebrew vocabulary from pealim.com dictionary.
Scrapes word entries, roots, and parts of speech for Anki flashcards.
"""
import requests
import pandas as pd
import logging
import time
from typing import Optional
def extract_from_website():
# Number of total pages of dictionary in pealim.com/dict/
# i.e. Number Of Words / 15
total_pages=608
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
# Session for connection pooling
session = requests.Session()
session.headers.update({
'User-Agent': 'Mozilla/5.0 (compatible; pealim-scraper/1.0)'
})
PEALIM_DICT_URL = "https://www.pealim.com/dict/"
REQUEST_DELAY = 1.5 # seconds between requests (respectful scraping)
REQUEST_TIMEOUT = 10 # seconds
def get_total_pages() -> int:
"""Dynamically determine total pages from first request."""
try:
logger.info("Fetching total page count...")
cookies = {'translit': 'none', 'hebstyle': 'mo'}
response = session.get(PEALIM_DICT_URL, cookies=cookies, timeout=REQUEST_TIMEOUT)
response.raise_for_status()
dfs = pd.read_html(response.content)
if dfs:
# Estimate pages from first page (typically 15 words per page)
# For now, use hardcoded value but this could be improved
return 608
except Exception as e:
logger.error(f"Error fetching page count: {e}. Using default (608).")
return 608
def extract_from_website(max_pages: Optional[int] = None) -> pd.DataFrame:
"""
Extract dictionary entries from pealim.com.
Args:
max_pages: Maximum pages to scrape (None = all)
Returns:
DataFrame with Word, Root, Part of Speech, and Word Without Nikkud columns
"""
total_pages = max_pages or get_total_pages()
logger.info(f"Starting extraction from {total_pages} pages...")
df = pd.DataFrame()
for page_num in range(1,total_pages):
url=f"https://www.pealim.com/dict/?page={page_num}"
cookies={'translit':'none', 'hebstyle' : 'mo'}
html = requests.get(url, cookies=cookies).content
df_list = pd.read_html(html)
cookies={'translit': 'none', 'hebstyle':'vl', 'showmeaning' : 'off'}
html = requests.get(url, cookies=cookies).content
without_nikkud_words = pd.read_html(html)[-1]['Word']
without_nikkud_words = without_nikkud_words.rename('Word Without Nikkud')
df_to_add = pd.concat([df_list[-1], without_nikkud_words], axis=1)
df = pd.concat([df, df_to_add], ignore_index=True)
#print(df)
df.to_csv('pealim_dict.csv')
for page_num in range(1, total_pages + 1):  # inclusive, so the last page is scraped too
try:
url = f"{PEALIM_DICT_URL}?page={page_num}"
# First request: with nikkud
cookies = {'translit': 'none', 'hebstyle': 'mo'}
response = session.get(url, cookies=cookies, timeout=REQUEST_TIMEOUT)
response.raise_for_status()
df_list = pd.read_html(response.content)
# Second request: without nikkud
cookies = {'translit': 'none', 'hebstyle': 'vl', 'showmeaning': 'off'}
response = session.get(url, cookies=cookies, timeout=REQUEST_TIMEOUT)
response.raise_for_status()
without_nikkud_words = pd.read_html(response.content)[-1]['Word']
without_nikkud_words = without_nikkud_words.rename('Word Without Nikkud')
# Combine and append
df_to_add = pd.concat([df_list[-1], without_nikkud_words], axis=1)
df = pd.concat([df, df_to_add], ignore_index=True)
if page_num % 50 == 0:
logger.info(f"Processed {page_num}/{total_pages} pages...")
time.sleep(REQUEST_DELAY)
except requests.RequestException as e:
logger.error(f"Error fetching page {page_num}: {e}. Skipping page.")
time.sleep(REQUEST_DELAY * 2)
except Exception as e:
logger.error(f"Unexpected error on page {page_num}: {e}")
continue
logger.info(f"Extraction complete. Total words: {len(df)}")
return df
def modify_for_anki():
df=pd.read_csv('pealim_dict.csv', index_col=0,dtype=str)
def modify_for_anki(df: pd.DataFrame) -> pd.DataFrame:
"""
Transform dictionary DataFrame for Anki import.
Adds shared root words and Hebrew tags.
Args:
df: Dictionary DataFrame
Returns:
Modified DataFrame ready for Anki
"""
logger.info("Preparing data for Anki...")
# Find shared root words
shared_root_words = []
for i in range(0,df.shape[0]):
root = df.Root.iloc[i]
word = df.Word.iloc[i]
if root != '-':
shared_root_words.append(str(df[df.Root==root][df.Word != word].Word.values).replace('[','').replace(']','').replace('\'', ''))
for idx, row in df.iterrows():
root = row['Root']
word = row['Word']
if root != '-' and pd.notna(root):
# Find other words with same root
same_root = df[(df['Root'] == root) & (df['Word'] != word)]['Word'].values
shared = ' '.join(str(w) for w in same_root)
shared_root_words.append(shared)
else:
shared_root_words.append('')
df['shared roots'] = shared_root_words
# clean
# Generate Hebrew tags
tags = []
for i in range(0,df.shape[0]):
tag = ""
root = df.iat[i,1]
root = str(root).replace(' ', '').replace('-', '')
if 'nan' in root or root == '':
root = ''
else:
tag+=f"שורש::{root.replace('.','')} "
part_of_speech = df.iat[i,2]
if 'Adverb' in part_of_speech:
tag += "תוארי_הפועל"
elif 'Pronoun' in part_of_speech:
tag += "כינוייוף"
elif 'Noun' in part_of_speech:
tag += "שם_עצם"
elif 'Verb' in part_of_speech:
tag += "פעלים"
elif 'Adjective' in part_of_speech:
tag += "שם_תואר"
elif 'Preposition' in part_of_speech:
tag += "מילות_יחס"
elif 'Conjunction' in part_of_speech:
tag += "מילות_חיבור"
elif 'Particle' in part_of_speech:
tag += "מילית"
tags.append(tag)
df.iat[i,1] = root
for idx, row in df.iterrows():
tag_parts = []
# Root tag
root = str(row['Root']).replace(' ', '').replace('-', '')
if 'nan' not in root and root:
root_clean = root.replace('.', '')
tag_parts.append(f"שורש::{root_clean}")
# Part of speech tag
pos = str(row['Part of Speech'])
pos_tags = {
'Adverb': 'תוארי_הפועל',
'Pronoun': 'כינוייוף',
'Noun': 'שם_עצם',
'Verb': 'פעלים',
'Adjective': 'שם_תואר',
'Preposition': 'מילות_יחס',
'Conjunction': 'מילות_חיבור',
'Particle': 'מילית'
}
for key, value in pos_tags.items():
if key in pos:
tag_parts.append(value)
break
tags.append(' '.join(tag_parts))
df['tags'] = tags
df.to_csv('pealim_dict_for_anki.csv', sep=';', index=True)
print(df)
logger.info("Anki preparation complete.")
return df
extract_from_website()
modify_for_anki()
def main():
"""Main entry point."""
try:
# Extract from website
df = extract_from_website()
df.to_csv('pealim_dict.csv', index=True)
logger.info("Saved: pealim_dict.csv")
# Transform for Anki
df = modify_for_anki(df)
df.to_csv('pealim_dict_for_anki.csv', sep=';', index=True)
logger.info("Saved: pealim_dict_for_anki.csv")
logger.info("✅ Complete!")
except Exception as e:
logger.error(f"Fatal error: {e}")
raise
if __name__ == '__main__':
main()

requirements.txt

@ -0,0 +1,3 @@
pandas>=1.3.0
requests>=2.26.0
numpy>=1.21.0

run.py

@ -0,0 +1,48 @@
#!/usr/bin/env python3
"""
Main entry point: orchestrate dictionary and conjugation extraction.
"""
import logging
import sys
from pathlib import Path
# Add current directory to path
sys.path.insert(0, str(Path(__file__).parent))
import pealim_extract
import conjugation_extract
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
def main():
"""Run all extraction tasks."""
logger.info("=" * 60)
logger.info("PEALIM EXTRACTION SUITE")
logger.info("=" * 60)
try:
# Extract dictionary
logger.info("\n[1/2] Extracting dictionary...")
pealim_extract.main()
# Extract conjugations
logger.info("\n[2/2] Extracting conjugations...")
conjugation_extract.main()
logger.info("\n" + "=" * 60)
logger.info("✅ ALL TASKS COMPLETE")
logger.info("=" * 60)
except Exception as e:
logger.error(f"\n❌ EXTRACTION FAILED: {e}")
sys.exit(1)
if __name__ == '__main__':
main()

test_scrape.py

@ -0,0 +1,31 @@
#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup
word = 'אבל'
url = f'https://www.pealim.com/search/?q={word}'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
try:
response = requests.get(url, headers=headers, timeout=10)
print(f'Status: {response.status_code}')
soup = BeautifulSoup(response.content, 'html.parser')
# Debug: check what we find
word_elem = soup.find('h1', class_='word-title')
pos_elem = soup.find('span', class_='pos')
definition_elem = soup.find('div', class_='definition')
print(f'word_elem found: {word_elem is not None}')
print(f'pos_elem found: {pos_elem is not None}')
print(f'definition_elem found: {definition_elem is not None}')
print('\n--- HTML snippet (first 3000 chars) ---')
print(soup.prettify()[:3000])
except Exception as e:
print(f'Error: {e}')
import traceback
traceback.print_exc()