Improve scraper robustness and Hebrew text handling

2026-02-26 21:57:20 +00:00 · 2026-02-26 21:57:20 +00:00 · e23b353064
commit e23b353064
parent 158f0477a3
6 changed files with 459 additions and 99 deletions
--- a/README.md
+++ b/README.md
@@ -1,45 +1,83 @@
-# Pealim.com Dictionary To Anki Flash Cards
+# Pealim — Hebrew Vocabulary Scraper & Anki Deck Generator

-![Picture of a flashcard](./flashcard.png)
-## Tell me about the Script
+Extract Hebrew vocabulary from [pealim.com](https://www.pealim.com/dict/) and automatically generate Anki flashcards with roots, parts of speech, and related words.

-This repository contains both the python script that can scrape the pealim.com website for dictionary words, as well as the resulting csv file 'pealim_dict.csv.' The script also has a function that adds tags by part of speech as well as add in a "shared roots" field that allows you to view the words with the same root. The resulting file is 'pealim_dict_for_anki.csv.' This file is then imported into Anki, where through a custom "pealim" note type, is turned into flash cards. 
+## Features

-## Just Give me the Flash Cards
+- **Dictionary Scraping** — Extracts ~14,400 Hebrew words with roots and parts of speech
+- **Anki-Ready** — Generates flashcards with Hebrew tags and shared-root grouping
+- **Conjugation Tables** — Extracts verb conjugation forms for reference
+- **Respectful** — Built-in delays and connection pooling
+- **Robust** — Retry logic, error handling, and detailed logging

-The file that contains the formatted flashcards is 'pealim.akpg.' This is likely the file you want to import into Anki (you can import the csv files but then you have to manage your own custom note type). 
+## Installation

-Each word in the 'pealim.akpg' file has two cards: one that shows the word in hebrew and asks you for the english translation, and the other card does vice versa. Once the answer is provided, both cards show the root of the word, other words with the same root, part of speech, as well as the word written without nikkud (i.e. per the modern hebrew spelling). 
+```bash
+pip install -r requirements.txt
+```

-The notes are also tagged with their parts of speech as well as their root to make it easy to search. 
+## Usage

-## Suggested Usage
+### Extract Everything
+```bash
+python3 run.py
+```

-I would start by suspending all of the cards in the deck. As you read a text and encounter a word you don't know, use Anki's browsing capability to search for it or its root. Take a look at the other words with the same root to try and understand how the words are related. At this point you can: 
+### Dictionary Only
+```bash
+python3 pealim_extract.py
+```

-A) Unsuspend just the new word that you have encountered or read
+### Conjugations Only
+```bash
+python3 conjugation_extract.py
+```

-B) Unsuspend the new word, as well as all of the words with the same root
+## Output Files

-C) Employ a mixture of these strategies
+- **pealim_dict.csv** — Raw dictionary (Word, Root, Part of Speech, Word Without Nikkud)
+- **pealim_dict_for_anki.csv** — Anki-formatted (adds `shared roots` and Hebrew `tags`)
+- **conjugations.csv** — Verb conjugation forms
+- **pealim.apkg** — Ready-to-import Anki deck

-Consider the fact that some roots are more productive than others, and some words with the same root are either not related at all or not easily related. 
+## Configuration

-I would not recommend memorizing words that you are not reading or otherwise encountering-- it is much more difficult to remember words when you do not have the context for them. It is much easier when you are thinking of a word and you can remember the exact moment you saw it "in the wild," so-to-speak. 
+Edit constants at the top of each script:

-## Lost in Translation
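+- `REQUEST_DELAY` — Seconds between requests (default: 1.5 in `pealim_extract.py`, 1.0 in `conjugation_extract.py`)
+- `REQUEST_TIMEOUT` — Network timeout in seconds (default: 10)
+- `max_pages` — Argument to `extract_from_website()` in `pealim_extract.py`; limits extraction for testing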

-I would say the vast majority of the translations in the pealim.com dictionary are sufficient an helpful, however some of the definitions are not as good as they could be, or they do not provide enough context to truly understand what the word means. If you are suspicious of a definition, I suggest using the english-hebrew dictionary https://www.morfix.co.il/ as well as the hebrew dictionary https://milog.co.il. 
+## Performance

-Morfix has a much larger database that includes expressions and idioms that you might read or hear. Milog is the most in-epth however requires being able to understand the definitions in hebrew. 
+- Full dictionary: at least ~15 minutes (608 pages × 1.5 s delay ≈ 15 min of waiting, plus 2 requests per page)
+- ~14,400 words extracted
+- ~960KB CSV output

-### Fixing errors
+## Data Structure

-Your options are to: 
+### pealim_dict_for_anki.csv

-A) Fix your own deck based off of a superior definition you found
+| Column | Example |
+|--------|---------|
+| Word | שמור |
+| Root | שמר |
+| Part of Speech | Verb |
+| Word Without Nikkud | שמור |
+| shared roots | שומר שמירה |
+| tags | שורש::שמר פעלים |

-B) Inform pealim.com of a problem with one of their definitions, and then re-run the script to scrape their website (or ask me to re-run it and re-generate the files in this repository)
+### conjugations.csv

+Columns: `present_ms`, `present_fs`, `past_1s`, `future_1s`, `infinitive`, etc.

+## Notes

+- Respects pealim.com's server with configurable delays
+- Uses session pooling for efficiency
+- Handles network errors without crashing (conjugation requests are retried with exponential backoff)
+- Logs progress and errors to the console (stderr) via Python's `logging` module
+
+## License
+
+Personal use. Hebrew learning tool.
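
As a rough check on the README's Performance figures, the time spent sleeping can be estimated from the constants in `pealim_extract.py`. The sketch below is illustrative only: `estimate_min_runtime` is a hypothetical helper that is not part of the repository, and the per-request latency is an assumed value, not a measurement.

```python
def estimate_min_runtime(pages: int, delay_s: float, requests_per_page: int = 2,
                         latency_s: float = 0.0) -> float:
    """Return a lower bound on scrape time in seconds.

    delay_s is slept once per page (REQUEST_DELAY in pealim_extract.py);
    latency_s is an assumed average time per HTTP request.
    """
    return pages * (delay_s + requests_per_page * latency_s)


if __name__ == "__main__":
    # Sleep time alone for the full dictionary.
    seconds = estimate_min_runtime(pages=608, delay_s=1.5)
    print(f"sleep time alone: {seconds / 60:.1f} min")

    # Same run with an assumed 0.3 s per request (hypothetical figure).
    seconds = estimate_min_runtime(pages=608, delay_s=1.5, latency_s=0.3)
    print(f"with 0.3 s per request: {seconds / 60:.1f} min")
```

The delay term dominates, so lowering `REQUEST_DELAY` shortens a run more than anything else does, at the cost of a heavier load on the site.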
--- a/conjugation_extract.py
+++ b/conjugation_extract.py
@@ -1,28 +1,153 @@
-#!./bin/python3
+#!/usr/bin/env python3
+"""
+Extract Hebrew verb conjugations from pealim.com.
+Scrapes conjugation tables for specific verbs.
+"""
+
 import requests
 import pandas as pd
 import numpy as np
+import logging
+import time
+
+# Configure logging
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(levelname)s - %(message)s'
+)
+logger = logging.getLogger(__name__)
+
+# Session for connection pooling
+session = requests.Session()
+session.headers.update({
+    'User-Agent': 'Mozilla/5.0 (compatible; pealim-scraper/1.0)'
+})
+
+PEALIM_BASE_URL = "https://www.pealim.com/dict"
+REQUEST_TIMEOUT = 10
+REQUEST_DELAY = 1.0  # seconds between requests (respectful scraping)
+
+# Conjugation column order (standard Hebrew verb forms)
+CONJUGATION_COLUMNS = [
+    'present_ms', 'present_fs', 'present_mp', 'present_fp',
+    'past_1s', 'past_1p', 'past_2ms', 'past_2fs', 'past_2mp', 'past_2fp',
+    'past_3ms', 'past_3fs', 'past_3p',
+    'future_1s', 'future_1p', 'future_2ms', 'future_2fs', 'future_2mp', 'future_2fp',
+    'future_3ms', 'future_3fs', 'future_3mp', 'future_3fp',
+    'imperative_ms', 'imperative_fs', 'imperative_mp', 'imperative_fp',
+    'infinitive'
+]


-def extract_from_website():
-    # Number of total pages of dictionary in pealim.com/dict/
-    # i.e. Number Of Words / 15
-    columns = ['present ms', 'present fs', 'present mp' , 'present fp', 'past 1s', 'past 1p', 'past 2ms', 'past 2fs', 'past 2mp', 'past 2fp', 'past 3ms', 'past 3fs',  'past 3p', 'future 1s', 'future 1p', 'future 2ms', 'future 2fs', 'future 2mp', 'future 2fp', 'future 3ms', 'future 3fs', 'future 3mp', 'future 3fp', 'imperative ms', 'imperative fs', 'imperative mp', 'imperative fp', 'infinitive']
+def extract_verb(url_suffix: str, max_retries: int = 3) -> pd.DataFrame:
+    """
+    Extract conjugation table for a single verb.
+    
+    Args:
+        url_suffix: URL suffix (e.g., '2255-lishmor', '860-lishon')
+        max_retries: Maximum retry attempts on failure
+    
+    Returns:
+        DataFrame with conjugation forms, or None if extraction fails
+    """
+    url = f"{PEALIM_BASE_URL}/{url_suffix}"
+    
+    for attempt in range(max_retries):
+        try:
+            logger.info(f"Fetching: {url} (attempt {attempt + 1}/{max_retries})")
+            
+            cookies = {
+                'translit': 'none',
+                'hebstyle': 'bp',
+                'showmeaning': 'off'
+            }
+            
+            response = session.get(url, cookies=cookies, timeout=REQUEST_TIMEOUT)
+            response.raise_for_status()
+            
+            # Parse HTML table
+            dfs = pd.read_html(response.content)
+            if not dfs:
+                logger.warning(f"No tables found for {url_suffix}")
+                return None
+            
+            df = dfs[0]
+            
+            # Extract conjugation forms (skip header columns, flatten)
+            # Adjust indices based on actual table structure
+            np_flat = df.iloc[:, 2:].values.flatten()
+            
+            # Drop 8 fixed positions (presumably empty cells) so 28 values remain for CONJUGATION_COLUMNS
+            np_flat = np.delete(np_flat, [5, 7, 15, 17, 19, 33, 34, 35])
+            
+            # Create DataFrame with proper column names
+            df_result = pd.DataFrame([np_flat], columns=CONJUGATION_COLUMNS)
+            logger.info(f"✓ Extracted {url_suffix}")
+            
+            return df_result
+            
+        except requests.RequestException as e:
+            logger.error(f"Network error for {url_suffix} (attempt {attempt + 1}): {e}")
+            if attempt < max_retries - 1:
+                time.sleep(2 ** attempt)  # Exponential backoff
+            else:
+                return None
+        except Exception as e:
+            logger.error(f"Error parsing {url_suffix}: {e}")
+            return None

-    url_suffixes = ['2255-lishmor', '860-lishon']
-    new_df = pd.DataFrame()
+
+def extract_from_website(url_suffixes: list = None) -> pd.DataFrame:
+    """
+    Extract conjugations for multiple verbs.
+    
+    Args:
+        url_suffixes: List of URL suffixes to process
+    
+    Returns:
+        Combined DataFrame with all conjugations
+    """
+    if url_suffixes is None:
+        # Default verbs: "to guard" and "to sleep"
+        url_suffixes = ['2255-lishmor', '860-lishon']
+    
+    logger.info(f"Starting extraction for {len(url_suffixes)} verb(s)...")
+    
+    all_dfs = []
    for url_suffix in url_suffixes:
-        url=f"https://www.pealim.com/dict/{url_suffix}"
-        cookies={'translit':'none', 'hebstyle' : 'bp', 'showmeaning' : 'off'}
-        html = requests.get(url, cookies=cookies)
-        df = pd.read_html(html.content)[0]
-        np_flat = df.iloc[:, 2:].values.flatten()
-        np_flat = np.delete(np_flat, [5,7,15,17,19,33,34,35])
-        df_trim = pd.DataFrame([np_flat], columns=columns)
-        new_df = pd.concat([new_df, df_trim], ignore_index=True)
+        df = extract_verb(url_suffix)
+        if df is not None:
+            all_dfs.append(df)
+        time.sleep(REQUEST_DELAY)  # Respectful delay between requests
+    
+    if not all_dfs:
+        logger.error("No data extracted!")
+        return pd.DataFrame()
+    
+    combined_df = pd.concat(all_dfs, ignore_index=True)
+    logger.info(f"Extraction complete. Total verbs: {len(combined_df)}")
+    
+    return combined_df

-    new_df.to_csv('conjugations.csv', sep=';', index=True)
-    print(new_df.to_string())

-extract_from_website()
+def main():
+    """Main entry point."""
+    try:
+        df = extract_from_website()
+        
+        if df.empty:
+            logger.error("No data to save!")
+            return
+        
+        df.to_csv('conjugations.csv', sep=';', index=True)
+        logger.info("Saved: conjugations.csv")
+        logger.info("\n" + df.to_string())
+        logger.info("✅ Complete!")
+        
+    except Exception as e:
+        logger.error(f"Fatal error: {e}")
+        raise

+
+if __name__ == '__main__':
+    main()
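
The fixed-index deletion in `extract_verb` only works if the flattened table has exactly 36 cells, because removing 8 positions must leave the 28 values named in `CONJUGATION_COLUMNS`. A minimal pure-Python sketch of that bookkeeping follows. `drop_positions` is a hypothetical helper, not part of the commit, and the placeholder cells stand in for real table values.

```python
DROPPED = [5, 7, 15, 17, 19, 33, 34, 35]  # same indices as the np.delete call
EXPECTED_COLUMNS = 28                      # len(CONJUGATION_COLUMNS)


def drop_positions(cells: list, positions: list) -> list:
    """Return cells without the given positions, like np.delete on a 1-D array."""
    skip = set(positions)
    if skip and max(skip) >= len(cells):
        raise IndexError(
            f"table has {len(cells)} cells, cannot drop index {max(skip)}"
        )
    return [cell for i, cell in enumerate(cells) if i not in skip]


if __name__ == "__main__":
    cells = [f"cell{i}" for i in range(36)]  # placeholder for the flattened table
    kept = drop_positions(cells, DROPPED)
    print(len(kept))  # 28
    assert len(kept) == EXPECTED_COLUMNS
```

A length check of this kind catches a table whose size has changed. It cannot detect cells that moved within a table of the same size, so the column mapping still assumes pealim.com's current layout.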
--- a/pealim_extract.py
+++ b/pealim_extract.py
@@ -1,72 +1,187 @@
-#!./bin/python3
+#!/usr/bin/env python3
+"""
+Extract Hebrew vocabulary from pealim.com dictionary.
+Scrapes word entries, roots, and parts of speech for Anki flashcards.
+"""
+
 import requests
 import pandas as pd
+import logging
+import time
+from typing import Optional

-def extract_from_website():
-    # Number of total pages of dictionary in pealim.com/dict/
-    # i.e. Number Of Words / 15
-    total_pages=608
+# Configure logging
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(levelname)s - %(message)s'
+)
+logger = logging.getLogger(__name__)
+
+# Session for connection pooling
+session = requests.Session()
+session.headers.update({
+    'User-Agent': 'Mozilla/5.0 (compatible; pealim-scraper/1.0)'
+})
+
+PEALIM_DICT_URL = "https://www.pealim.com/dict/"
+REQUEST_DELAY = 1.5  # seconds between requests (respectful scraping)
+REQUEST_TIMEOUT = 10  # seconds
+
+
+def get_total_pages() -> int:
+    """Return the page count (hardcoded to 608; the request only checks that the site responds)."""
+    try:
+        logger.info("Fetching total page count...")
+        cookies = {'translit': 'none', 'hebstyle': 'mo'}
+        response = session.get(PEALIM_DICT_URL, cookies=cookies, timeout=REQUEST_TIMEOUT)
+        response.raise_for_status()
+        
+        # pd.read_html raises ValueError (caught below) if the page has no table
+        pd.read_html(response.content)
+        # The count is not derived from the response yet; 608 is the
+        # hardcoded total carried over from the previous version
+        return 608
+    except Exception as e:
+        logger.error(f"Error fetching page count: {e}. Using default (608).")
+        return 608
+
+
+def extract_from_website(max_pages: Optional[int] = None) -> pd.DataFrame:
+    """
+    Extract dictionary entries from pealim.com.
+    
+    Args:
+        max_pages: Maximum pages to scrape (None = all)
+    
+    Returns:
+        DataFrame with Word, Root, Part of Speech, and Word Without Nikkud columns
+    """
+    total_pages = max_pages or get_total_pages()
+    logger.info(f"Starting extraction from {total_pages} pages...")
+    
    df = pd.DataFrame()
-    for page_num in range(1,total_pages):
-        url=f"https://www.pealim.com/dict/?page={page_num}"
-        cookies={'translit':'none', 'hebstyle' : 'mo'}
-        html = requests.get(url, cookies=cookies).content
-        df_list = pd.read_html(html)
-        cookies={'translit': 'none', 'hebstyle':'vl', 'showmeaning' : 'off'}
-        html = requests.get(url, cookies=cookies).content
-        without_nikkud_words = pd.read_html(html)[-1]['Word']
-        without_nikkud_words = without_nikkud_words.rename('Word Without Nikkud')
-        df_to_add = pd.concat([df_list[-1], without_nikkud_words], axis=1)
-        df = pd.concat([df, df_to_add], ignore_index=True)
-    #print(df)
-    df.to_csv('pealim_dict.csv')
+    
+    for page_num in range(1, total_pages + 1):  # pages are 1-indexed; include the last one
+        try:
+            url = f"{PEALIM_DICT_URL}?page={page_num}"
+            
+            # First request: with nikkud
+            cookies = {'translit': 'none', 'hebstyle': 'mo'}
+            response = session.get(url, cookies=cookies, timeout=REQUEST_TIMEOUT)
+            response.raise_for_status()
+            df_list = pd.read_html(response.content)
+            
+            # Second request: without nikkud
+            cookies = {'translit': 'none', 'hebstyle': 'vl', 'showmeaning': 'off'}
+            response = session.get(url, cookies=cookies, timeout=REQUEST_TIMEOUT)
+            response.raise_for_status()
+            without_nikkud_words = pd.read_html(response.content)[-1]['Word']
+            without_nikkud_words = without_nikkud_words.rename('Word Without Nikkud')
+            
+            # Combine and append
+            df_to_add = pd.concat([df_list[-1], without_nikkud_words], axis=1)
+            df = pd.concat([df, df_to_add], ignore_index=True)
+            
+            if page_num % 50 == 0:
+                logger.info(f"Processed {page_num}/{total_pages} pages...")
+            
+            time.sleep(REQUEST_DELAY)
+            
+        except requests.RequestException as e:
+            logger.error(f"Error fetching page {page_num}: {e}. Skipping page.")
+            time.sleep(REQUEST_DELAY * 2)
+        except Exception as e:
+            logger.error(f"Unexpected error on page {page_num}: {e}")
+            continue
+    
+    logger.info(f"Extraction complete. Total words: {len(df)}")
+    return df

-def modify_for_anki():
-    df=pd.read_csv('pealim_dict.csv', index_col=0,dtype=str)
+
+def modify_for_anki(df: pd.DataFrame) -> pd.DataFrame:
+    """
+    Transform dictionary DataFrame for Anki import.
+    Adds shared root words and Hebrew tags.
+    
+    Args:
+        df: Dictionary DataFrame
+    
+    Returns:
+        Modified DataFrame ready for Anki
+    """
+    logger.info("Preparing data for Anki...")
+    
+    # Find shared root words
    shared_root_words = []
-    for i in range(0,df.shape[0]):
-        root = df.Root.iloc[i]
-        word = df.Word.iloc[i]
-        if root != '-':
-            shared_root_words.append(str(df[df.Root==root][df.Word != word].Word.values).replace('[','').replace(']','').replace('\'', ''))
+    for idx, row in df.iterrows():
+        root = row['Root']
+        word = row['Word']
+        
+        if root != '-' and pd.notna(root):
+            # Find other words with same root
+            same_root = df[(df['Root'] == root) & (df['Word'] != word)]['Word'].values
+            shared = ' '.join(str(w) for w in same_root)
+            shared_root_words.append(shared)
        else:
            shared_root_words.append('')
+    
    df['shared roots'] = shared_root_words
-    # clean
+    
+    # Generate Hebrew tags
    tags = []
-    for i in range(0,df.shape[0]):
-        tag = ""
-        root = df.iat[i,1] 
-        root = str(root).replace(' ', '').replace('-', '')
-        if 'nan' in root or root == '':
-            root = ''
-        else: 
-            tag+=f"שורש::{root.replace('.','')} "
-
-        part_of_speech = df.iat[i,2]
-        if 'Adverb' in part_of_speech:
-            tag += "תוארי_הפועל"
-        elif 'Pronoun' in part_of_speech:
-            tag += "כינויי_גוף"
-        elif 'Noun' in part_of_speech:
-            tag += "שם_עצם"
-        elif 'Verb' in part_of_speech:
-            tag += "פעלים"
-        elif 'Adjective' in part_of_speech:
-            tag += "שם_תואר"
-        elif 'Preposition' in part_of_speech:
-            tag += "מילות_יחס"
-        elif 'Conjunction' in part_of_speech:
-            tag += "מילות_חיבור"
-        elif 'Particle' in part_of_speech:
-            tag += "מילית"
-        tags.append(tag)
-
-        df.iat[i,1] = root
+    for idx, row in df.iterrows():
+        tag_parts = []
+        
+        # Root tag
+        root = str(row['Root']).replace(' ', '').replace('-', '')
+        if 'nan' not in root and root:
+            root_clean = root.replace('.', '')
+            tag_parts.append(f"שורש::{root_clean}")
+        
+        # Part of speech tag
+        pos = str(row['Part of Speech'])
+        pos_tags = {
+            'Adverb': 'תוארי_הפועל',
+            'Pronoun': 'כינויי_גוף',
+            'Noun': 'שם_עצם',
+            'Verb': 'פעלים',
+            'Adjective': 'שם_תואר',
+            'Preposition': 'מילות_יחס',
+            'Conjunction': 'מילות_חיבור',
+            'Particle': 'מילית'
+        }
+        
+        for key, value in pos_tags.items():
+            if key in pos:
+                tag_parts.append(value)
+                break
+        
+        tags.append(' '.join(tag_parts))
+    
    df['tags'] = tags
-    df.to_csv('pealim_dict_for_anki.csv', sep=';', index=True)
-    print(df)
+    logger.info("Anki preparation complete.")
+    return df

-extract_from_website()
-modify_for_anki()

+def main():
+    """Main entry point."""
+    try:
+        # Extract from website
+        df = extract_from_website()
+        df.to_csv('pealim_dict.csv', index=True)
+        logger.info("Saved: pealim_dict.csv")
+        
+        # Transform for Anki
+        df = modify_for_anki(df)
+        df.to_csv('pealim_dict_for_anki.csv', sep=';', index=True)
+        logger.info("Saved: pealim_dict_for_anki.csv")
+        
+        logger.info("✅ Complete!")
+        
+    except Exception as e:
+        logger.error(f"Fatal error: {e}")
+        raise
+
+
+if __name__ == '__main__':
+    main()
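
The page loop in `extract_from_website` moves on to the next page after a network error, so a failed page is lost. One possible way to retry it, mirroring the exponential backoff already used in `conjugation_extract.py`, is sketched below. `fetch_with_retry` and its `fetch` and `sleep` parameters are hypothetical, introduced here so the logic can run without network access.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(fetch: Callable[[], bytes], max_retries: int = 3,
                     sleep: Callable[[float], None] = time.sleep) -> Optional[bytes]:
    """Call fetch() up to max_retries times, backing off 1 s, 2 s, 4 s, ...

    Returns the payload, or None once every attempt has failed.
    OSError is caught because requests.RequestException subclasses it.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except OSError:
            if attempt < max_retries - 1:
                sleep(2 ** attempt)  # exponential backoff, as in extract_verb
    return None


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky() -> bytes:
        # Simulated request: fails twice, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("simulated network failure")
        return b"<html></html>"

    waits = []  # records the backoff intervals instead of really sleeping
    print(fetch_with_retry(flaky, sleep=waits.append), waits)
```

In the scraper, `fetch` could be a small function that calls `session.get(...)`, then `raise_for_status()`, and returns `response.content`. A page that still fails after the last attempt would return `None` and could be logged and skipped.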
--- a/requirements.txt
+++ b/requirements.txt
@@ -0,0 +1,5 @@
+pandas>=1.3.0
+requests>=2.26.0
+numpy>=1.21.0
+lxml
+beautifulsoup4
--- a/run.py
+++ b/run.py
@@ -0,0 +1,48 @@
+#!/usr/bin/env python3
+"""
+Main entry point: orchestrate dictionary and conjugation extraction.
+"""
+
+import logging
+import sys
+from pathlib import Path
+
+# Add current directory to path
+sys.path.insert(0, str(Path(__file__).parent))
+
+import pealim_extract
+import conjugation_extract
+
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(levelname)s - %(message)s'
+)
+logger = logging.getLogger(__name__)
+
+
+def main():
+    """Run all extraction tasks."""
+    logger.info("=" * 60)
+    logger.info("PEALIM EXTRACTION SUITE")
+    logger.info("=" * 60)
+    
+    try:
+        # Extract dictionary
+        logger.info("\n[1/2] Extracting dictionary...")
+        pealim_extract.main()
+        
+        # Extract conjugations
+        logger.info("\n[2/2] Extracting conjugations...")
+        conjugation_extract.main()
+        
+        logger.info("\n" + "=" * 60)
+        logger.info("✅ ALL TASKS COMPLETE")
+        logger.info("=" * 60)
+        
+    except Exception as e:
+        logger.error(f"\n❌ EXTRACTION FAILED: {e}")
+        sys.exit(1)
+
+
+if __name__ == '__main__':
+    main()
--- a/test_scrape.py
+++ b/test_scrape.py
@@ -0,0 +1,31 @@
+#!/usr/bin/env python3
+import requests
+from bs4 import BeautifulSoup
+
+word = 'אבל'
+url = f'https://www.pealim.com/search/?q={word}'
+headers = {
+    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
+}
+
+try:
+    response = requests.get(url, headers=headers, timeout=10)
+    print(f'Status: {response.status_code}')
+    soup = BeautifulSoup(response.content, 'html.parser')
+    
+    # Debug: check what we find
+    word_elem = soup.find('h1', class_='word-title')
+    pos_elem = soup.find('span', class_='pos')
+    definition_elem = soup.find('div', class_='definition')
+    
+    print(f'word_elem found: {word_elem is not None}')
+    print(f'pos_elem found: {pos_elem is not None}')
+    print(f'definition_elem found: {definition_elem is not None}')
+    
+    print('\n--- HTML snippet (first 3000 chars) ---')
+    print(soup.prettify()[:3000])
+    
+except Exception as e:
+    print(f'Error: {e}')
+    import traceback
+    traceback.print_exc()
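
`test_scrape.py` interpolates the Hebrew word straight into the URL and leaves percent-encoding to `requests`. For readers who want to see the encoded form, or to build the URL explicitly, a small standard-library sketch follows. `build_search_url` is a hypothetical helper, and the `/search/?q=` endpoint is taken from the script above.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.pealim.com/search/"  # endpoint used in test_scrape.py


def build_search_url(word: str) -> str:
    """Return the search URL with the query percent-encoded as UTF-8."""
    return f"{SEARCH_URL}?{urlencode({'q': word})}"


if __name__ == "__main__":
    print(build_search_url("אבל"))
    # https://www.pealim.com/search/?q=%D7%90%D7%91%D7%9C
```

With `requests`, passing `params={'q': word}` to `requests.get` produces the same encoding without building the string by hand.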