hebrew_flash_cards

Author	SHA1	Message	Date
Sochen	b2fef5aa8a	Sprint 11.1: strip_nikkud cleanup, dead code removal, test fixes Remove strip_nikkud from all pipeline files — use ktiv_male directly. Fix case-insensitive binyan matching in detail scraper (og:description uses UPPERCASE). Fix integration test slugs and test limits. Delete legacy CSVs, stale .apkg, and dead scripts from git. Add vulture to pre-commit hook. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 04:03:47 +00:00
Sochen	08fb7009d8	Sprint 11: unified JSON architecture + consolidated scraping pipeline Migrate from fragmented CSV + 10 JSON files to a single data/words.json (9,104 entries) as the unified data store. All GUIDs preserved for Anki study progress continuity. New files: - SCHEMA.yaml: authoritative schema for words.json - pealim_list_scrape.py: consolidated list page scraper → words.json - pealim_detail_scrape.py: noun/verb detail scraper → words.json - pealim_audio_download.py: audio downloader reading from words.json - scripts/migrate_to_json.py: one-time CSV→JSON migration - scripts/validate_data.py: 17 data integrity tests - scripts/check_guid_coverage.py: GUID preservation checker - scripts/repair_slugs.py: slug deduplication repair tool - tests/test_scraper_integration.py: live scraper integration tests Updated: - apkg_builder.py: reads from words.json (no more pandas) - run.py: 8-step pipeline (list scrape → frequency → examples → detail scrape → audio download → fonts → images → build) - benyehuda.py, frequency_lookup.py, image_fetch.py: TODO markers for future words.json integration Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 10:54:58 +00:00
Sochen	17f7458d19	Sprint 9: cloze cards, plurals deck, project reorg, lint tooling - Cloze card pipeline: 924 cards from 2,296 AI-vetted Hebrew book sentences - Plurals deck: 375 notes (144 irregular + 231 regular from 86 mishkal patterns) - Ktiv male forms expanded to 20,711 entries for sentence matching - Project reorg: helpers.py (deduped strip_nikkud from 10 files), scripts/ for one-off tools, tests/ with smoke tests, deleted 3 dead files - Lint tooling: pyproject.toml with ruff/vulture/bandit/pytest config, .editorconfig, fixed all 129 ruff errors (B023 closure fix, SIM103, unused vars) - validate_apkg.py: card count range check for optional cloze template - Data caches committed: vetted_sentences, ktiv_male_forms, noun_plurals, noun_slug_map, vocab_sentence_matches, epub_sentence_index Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 08:09:39 +00:00
Sochen	64a1b18951	Sprint 7: emoji/prep extraction, conjugation reduction, project rename - Item 1/2: Extract emoji and Hebrew parentheticals (prepositions) from Meaning field; display emoji with 3.5em font, prep inline after Hebrew word. Add Emoji and Prep fields to Hebrew Flash Cards model. - Item 3: Seeded RNG per verb reduces conjugation cards by ~630 (4 present forms → 1 pronoun each; past_3p → 1 gender). 1st-person forms gain gender label (זכר/נקבה). Total: 1,834 conj cards (was ~2,464). - Item 4: hebrew_extract.py uses BeautifulSoup to capture data-audio URLs from pealim.com list pages during scraping. step_audio() reads audio_url column from CSV (no longer needs audio_extract.py). - Item 5: Rename to 'Hebrew Flash Cards'. New filenames: hebrew_dict.csv, hebrew_extract.py, hebrew_vocabulary.apkg, hebrew_conjugations.apkg. Deck/model names updated throughout. Forgejo repo rename pending (sochen lacks admin rights — Nevo must do via UI). - Fix: Deduplicate entries with same Hebrew word before adding notes (eliminates GUID collisions from duplicate source CSV rows). - Bump RELEASE_TAG to v0.11. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-05 05:49:51 +00:00
Sochen	bb79725a7f	Python code cleanup (python-pro review) - Type annotations: dict|None defaults, return types, nested func annotations - Dead code: removed unused row_forms_with_audio(), duplicate _strip_nikkud defs, redundant guards, duplicate 'ism' in ABSTRACT_SUFFIXES - Exceptions: narrowed bare except to (ValueError, pd.errors.ParserError) and (json.JSONDecodeError, OSError) throughout; all raise ValueError given messages - Deduplication: extracted deduplicate() helper in _parse_table; setdefault() for dict building in benyehuda and apkg_builder; list comprehension in benyehuda - Correctness: limit=0 guard fixed (is not None); audio tag parsing uses removeprefix/removesuffix instead of magic offsets; vectorized pandas sum - Constants: BINYAN_NAMES extracted; unicodedata imports moved to top level - benyehuda load(): removed wasted cache read on force_rebuild; word-boundary regex simplified from double-negative to \w Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-04 07:33:14 +00:00
Sochen	ca7ca74a39	feat: Sprint 3 — Heebo font files, image fetch, verb validator scripts - data/fonts/: Heebo variable font TTF (Regular + Bold) for bundling in .apkg - image_fetch.py: Wikipedia/Commons image fetch for concrete nouns - validate_verb_list.py: pealim.com validator for verb input list Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 08:37:08 +00:00
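
Commit 64a1b18951 (Sprint 7, Item 3) mentions a seeded RNG per verb, which keeps the reduced set of conjugation cards stable from one build to the next. The log does not show the code, so the following is only a minimal sketch of the idea. It assumes the seed is derived from the verb's slug; `rng_for_verb`, `pick_pronoun`, and the `PRONOUNS` list are hypothetical names, not taken from the repository.

```python
import hashlib
import random

# Hypothetical pronoun labels; the project's real labels may differ.
PRONOUNS = ["ani", "ata", "at", "hu", "hi"]


def rng_for_verb(slug: str) -> random.Random:
    """Return an RNG whose seed depends only on the verb slug.

    A SHA-256 digest is used instead of the built-in hash(), because
    hash() on strings is randomized per Python process by default.
    """
    digest = hashlib.sha256(slug.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    return random.Random(seed)


def pick_pronoun(slug: str) -> str:
    """Pick one pronoun for a verb, identically on every run."""
    return rng_for_verb(slug).choice(PRONOUNS)
```

Because each verb gets its own generator, adding or removing one verb does not change the forms selected for the others, which would matter if card GUIDs depend on the selected form.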

6 commits
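
Several commits above refer to `strip_nikkud` (deduplicated into helpers.py in 17f7458d19, then removed from the pipeline in b2fef5aa8a in favor of `ktiv_male`). The helper's body is not part of this log. As a rough illustration of what such a function usually does, here is a sketch that assumes it drops the combining vowel points and cantillation marks in the Unicode range U+0591 to U+05C7 while leaving letters and punctuation alone; the actual implementation in the repository may differ.

```python
import unicodedata


def strip_nikkud(text: str) -> str:
    """Remove Hebrew points and cantillation marks from text.

    Assumption: only nonspacing marks (category "Mn") in the range
    U+0591..U+05C7 are removed, so punctuation in that range such as
    the maqaf (U+05BE) is kept.
    """
    return "".join(
        ch
        for ch in text
        if not ("\u0591" <= ch <= "\u05c7" and unicodedata.category(ch) == "Mn")
    )


# Pointed "shalom" written with escapes so the code points are explicit.
pointed = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"
print(strip_nikkud(pointed) == "\u05e9\u05dc\u05d5\u05dd")  # prints True
```

Stripping points gives the consonantal skeleton of the pointed spelling, which is not always the same as the full (ktiv male) spelling; that may be one reason the pipeline moved to stored `ktiv_male` forms.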