Commit graph

60 commits

Author SHA1 Message Date
6d2d446ed5 feat: pseudo-frequency for confusables using English word frequency
264 confusable groups where all entries shared the same Hebrew frequency
now have differentiated pseudo_frequency values based on English word
commonality (hermitdave en_50k.txt). Most common meaning keeps base
rank; less common meanings get +100 offset per position.

Examples:
- אב: "father" (en:194) → 2491, "bud" (en:2963) → 2591
- אח: "brother" (en:300) → 911, "fireplace" (en:9389) → 1011

Builder uses pseudo_frequency for sort order when available.
Confusable card definitions now sorted most-common-first.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-03 05:28:30 +00:00
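A minimal sketch of the offset rule described in 6d2d446ed5, assuming each confusable group shares one Hebrew base rank and that meanings are ordered by English rank (lower is more common). The function name, the `en_rank` dict, and the handling of meanings missing from the English list are illustrative assumptions, not the builder's actual code.

```python
def assign_pseudo_frequency(base_rank, meanings, en_rank, step=100):
    """Return {meaning: pseudo_frequency} for one confusable group.

    Hypothetical helper: the most common English meaning keeps base_rank,
    each later position adds `step`.
    """
    # Meanings absent from the English list sort last (assumption).
    ordered = sorted(meanings, key=lambda m: en_rank.get(m, float("inf")))
    return {m: base_rank + step * pos for pos, m in enumerate(ordered)}


# English ranks quoted in the commit message.
en_rank = {"father": 194, "bud": 2963, "brother": 300, "fireplace": 9389}

print(assign_pseudo_frequency(2491, ["bud", "father"], en_rank))
# {'father': 2491, 'bud': 2591}
print(assign_pseudo_frequency(911, ["fireplace", "brother"], en_rank))
# {'brother': 911, 'fireplace': 1011}
```

Both results match the examples in the commit message (2491/2591 and 911/1011).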
f978e5f39a fix: vet fallback emoji — verb gate + expanded stop list removes 852 bad matches
The fallback emoji system (keyword→Unicode char matching at build time)
was producing 1,733 matches, many with wrong-sense emoji:
- "high, tall" → ⚡ (from "high voltage")
- "to cut" → 🥩 (cut of meat)
- "city" → 🇻🇦 (Vatican flag)

Two fixes:
1. Skip fallback for verbs (meanings starting "to ") — 476 removed
2. Expand _EMOJI_STOP with 100+ polysemous/abstract keywords — 376 more

Result: 1733 → 881 fallback matches (49% reduction). The 114 from_pealim
emojis (concrete nouns like 🍎 apple, 🐪 camel) are unaffected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-03 05:17:31 +00:00
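The two gates in f978e5f39a could look roughly like the sketch below. It assumes the lookup is a plain keyword-to-emoji dict and reuses the "first 5 keywords" rule from 607fd1a3bc. `fallback_emoji`, the tiny `EMOJI_STOP` subset, and the sample lookup are hypothetical stand-ins for the real build code.

```python
import re

# Hypothetical three-keyword subset; the real _EMOJI_STOP has 100+ entries.
EMOJI_STOP = {"high", "cut", "city"}


def fallback_emoji(meaning, lookup, stop=EMOJI_STOP, max_keywords=5):
    """Return a fallback emoji for an English meaning, or None."""
    text = meaning.lower()
    # Verb gate: meanings such as "to cut" never get a fallback emoji.
    if text.startswith("to "):
        return None
    for keyword in re.findall(r"[a-z]+", text)[:max_keywords]:
        if keyword in stop:
            continue  # polysemous or abstract keyword, skip it
        if keyword in lookup:
            return lookup[keyword]
    return None


# Sample lookup for illustration only.
lookup = {"high": "⚡", "cut": "🥩", "city": "🇻🇦", "dog": "🐕"}

print(fallback_emoji("to cut", lookup))      # None (verb gate)
print(fallback_emoji("high, tall", lookup))  # None (stop list)
print(fallback_emoji("dog", lookup))         # 🐕
```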
5f617af4ba feat: vet emoji assignments — 114 visible, 3 removed, 1 fixed
Reviewed all 117 pealim-inherited emoji assignments:
- Made 114 correct assignments visible (emoji_visible: true)
- Removed: goblet (🏆 is trophy), fitness (🏋 too abstract), red (💄 is lipstick)
- Fixed: onion 🌰 → 🧅 (was chestnut emoji)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-26 23:19:43 +00:00
f3496998f5 feat: confusables show ktiv male, emoji/prep stripping fully upstream
- Confusables deck front now shows shared ktiv male form instead of
  nikkud variants joined by "/". Back still shows nikkud with definitions.
- Fixed list scraper EMOJI_RE to catch variation selectors (U+FE0F) and
  ZWJ (U+200D) — cleaned 17 entries with leftover selectors in meaning.
- Removed build-time prep extraction fallback (0 entries relied on it).
- release.py: fix keeshare field name (API_TOKEN → password).

Closes: Pealim #11 (emoji/prep upstream), Pealim #16 (confusables ktiv male)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-21 02:19:03 +00:00
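A sketch of the kind of pattern the EMOJI_RE fix in f3496998f5 describes: the character class includes the variation selector U+FE0F and the zero-width joiner U+200D, so neither is left behind once the visible emoji is removed. The code point ranges here are an assumption for illustration; the scraper's real pattern is not shown in the log.

```python
import re

# Assumed ranges: main emoji planes, misc symbols/dingbats, plus the
# invisible U+FE0F (variation selector-16) and U+200D (ZWJ).
EMOJI_RE = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]+")


def strip_emoji(meaning):
    """Remove emoji and their invisible joiners, then tidy whitespace."""
    cleaned = EMOJI_RE.sub("", meaning)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


print(strip_emoji("heart \u2764\uFE0F"))  # heart
print(strip_emoji("family \U0001F468\u200D\U0001F469\u200D\U0001F467"))  # family
```

Without U+FE0F and U+200D in the class, the first example would keep an invisible trailing selector in the meaning.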
138acb06d8 bump RELEASE_TAG to v0.20
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 14:09:19 +00:00
0a85291975 feat(v0.20): adaptive sentence difficulty scoring — 3163 scored cloze sentences
Replaces length-based sentence selection with frequency-based difficulty
scoring. Easiest sentences (most common context words) become cloze candidates.

Score range: 1-50000, median=179. Coverage: 3163/3325 cloze entries scored.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 13:35:23 +00:00
14d567a261 schema: add difficulty_score field + update spec with MIN_WORDS=3
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 13:30:13 +00:00
8b24d0fd26 Task 7: Add integration tests for frequency-based sentence scoring
Tests verify that update_words_json produces a cloze with `difficulty_score`,
that vetted sentences are sorted by difficulty, and that the easiest sentence
becomes the cloze candidate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-15 13:29:25 +00:00
272a2a080d Task 7: Replace length-based scoring with frequency-based scoring in update_words_json
- Import `sentence_difficulty.build_nikkud_map`, `score_sentence`, and `frequency_lookup`
- Build `nikkud_index`, `nikkud_map`, `freq_data` once before the per-word loop
- Replace `_score()` closure with call to `score_sentence()` (median context-word rank)
- Store `difficulty_score` in cloze dict for downstream use

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-15 13:29:22 +00:00
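The "median context-word rank" idea from 272a2a080d can be illustrated as below. This is a simplified sketch, not the module's `score_sentence()`: it skips nikkud resolution and prefix stripping, uses transliterated placeholder tokens with invented ranks, and assumes unknown words fall back to rank 50000 (the top of the score range quoted in 0a85291975).

```python
from statistics import median


def median_rank_score(tokens, target, freq, unknown_rank=50_000):
    """Median frequency rank of the words around the target (lower = easier)."""
    context = [freq.get(t, unknown_rank) for t in tokens if t != target]
    return median(context) if context else unknown_rank


# Invented ranks for transliterated placeholder tokens.
freq = {"ani": 5, "rotze": 40, "leechol": 300, "tapuach": 2000}

candidates = [
    ["ani", "rotze", "leechol", "tapuach"],
    ["tapuach", "archeologia", "leechol"],
]
# The easiest sentence (lowest score) becomes the cloze candidate.
easiest = min(candidates, key=lambda s: median_rank_score(s, "tapuach", freq))
print(easiest)  # ['ani', 'rotze', 'leechol', 'tapuach']
```

The first sentence scores 40 (median of 5, 40, 300); the second scores 25150.0 because one context word is unknown.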
fb12f806a8 feat: add sentence_difficulty module with 5-tier frequency scoring
Implements build_nikkud_map(), _resolve_token_frequency(), and
score_sentence() for v0.20 adaptive cloze sentence selection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 13:23:21 +00:00
00fba934fb feat(epub_examples): export try_strip_prefix as public alias
Exposes _try_strip_prefix under a public name so the upcoming
sentence_difficulty module can reuse Hebrew prefix stripping logic
without duplicating it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-15 13:18:55 +00:00
d2a7c9d483 feat(epub_examples): lower MIN_WORDS from 4 to 3
Hebrew is more concise than English — 3-word sentences are valid
candidates for the example sentence pool. Expands the pool for the
upcoming adaptive sentence difficulty ranking.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-15 13:18:44 +00:00
d0f4aea58d feat(frequency_lookup): add get_freq_data() for batch frequency access
Exposes the full word→rank dict for use by the upcoming sentence_difficulty
module without requiring per-word lookups.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-15 13:18:11 +00:00
b3ea086e85 v0.20 design spec + nikkud-to-ktiv-male converter
Add Academy-rules-based nikkud→ktiv male converter (91.6% accuracy
vs 77.2% for strip_nikkud) and v0.20 adaptive sentence difficulty
cloze design spec. The converter enables frequency-based sentence
scoring by properly resolving nikkud tokens to their ktiv male forms
for frequency corpus lookup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 12:57:14 +00:00
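For context on the strip_nikkud baseline mentioned in b3ea086e85: a helper of that kind only deletes vowel points, whereas ktiv male spelling often adds a vav or yod, which is why plain stripping can miss the corpus form. A minimal sketch of such a baseline, assuming it removes Unicode nonspacing marks; the project's own helper may differ.

```python
import unicodedata


def strip_nikkud(text):
    """Drop Hebrew points and cantillation marks, keep letters and maqaf."""
    # NFD splits presentation forms into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category Mn covers nikkud; maqaf (U+05BE) is punctuation and survives.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_nikkud("שָׁלוֹם"))  # שלום
```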
af186e2030 Sprint 17: homograph example dedup + plural audio + prep extraction
- Homograph collision fix: _deduplicate_confusable_examples() clears
  shared examples from less-common confusable group members (36 entries
  fixed). Keeps examples only on highest-frequency meaning.
- Plural deck audio: wired up PluralAudio field in apkg_builder.py,
  downloaded 613 plural audio files from pealim.com for all deck entries.
- Prep extraction upstream: moved Hebrew preposition parsing from build
  time into list/detail scrapers (SCHEMA.yaml prep field added).
- Validation: new no_shared_confusable_examples check in validate_data.py
- Tests: 9 new unit tests for confusable deduplication (98 total)
- Release: v0.19

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 21:51:35 +00:00
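One way the homograph collision fix in af186e2030 might work, sketched under assumptions: a group is a list of entry dicts, a lower `frequency` value means a more common meaning, and "shared" means the sentence also appears on the most common member. The function below is a hypothetical stand-in for `_deduplicate_confusable_examples()`, whose real signature is not shown.

```python
def dedupe_group_examples(group):
    """Clear examples shared with the most common member; return entries changed."""
    # Entries without a frequency are treated as least common (assumption).
    keeper = min(group, key=lambda e: e.get("frequency", float("inf")))
    kept = set(keeper["examples"])
    changed = 0
    for entry in group:
        if entry is keeper:
            continue
        unique = [s for s in entry["examples"] if s not in kept]
        if len(unique) != len(entry["examples"]):
            entry["examples"] = unique
            changed += 1
    return changed


group = [
    {"meaning": "common sense", "frequency": 100, "examples": ["s1", "s2"]},
    {"meaning": "rare sense", "frequency": 900, "examples": ["s1", "s3"]},
]
print(dedupe_group_examples(group))  # 1
print(group[1]["examples"])          # ['s3']
```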
0d92451271 Sprint 16: collapsible card details + related words table
- All secondary fields (shoresh, PoS, ktiv male, plural, related words)
  behind a "מידע נוסף" toggle button using HTML <details>/<summary>
- Conjugation back: English meaning, binyan also behind toggle
- Related words: table format with word + meaning, sorted by frequency
- Hebrew words not bold, English meanings 24px gray (#555)
- "מִילִים קְשׁוּרוֹת" sub-header with nikkud inside toggle
- "אֵיךְ אוֹמְרִים" prompt centered using hint class
- New CSS: .more-toggle, .more-header, .related-header, .rw-word, .rw-meaning
- Dark mode support for all new classes
- Bump to v0.18

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 01:34:14 +00:00
c85063ee2f Sprint 15: example sentence pipeline overhaul + corpus expansion + card improvements
- Regenerated all example sentences from scratch (deleted legacy + stale entries)
- Added .txt file support to epub_examples.py for Ben Yehuda corpus
- 7 Ben Yehuda nikkud'd children's texts + 3 new Time Tunnel EPUBs
- Maqaf-stripped construct form indexing (+68% inflected matches)
- Total: 3,598 words with examples, 3,289 with cloze (was ~2,900)
- Cloze prefix preservation (_cloze_prefix_len)
- Hebrew spoiler stripping from English meanings
- Gender field (זָכָר/נְקֵבָה) on vocab cards
- sec-table CSS layout for aligned key:value pairs
- Mishkal uses mishkal_hebrew on plural cards
- Improved mishkal extraction from pealim detail pages
- 21 new pytest tests (cloze, PoS, Hebrew stripping, gender, mishkal)
- 2 new validate_data.py tests + mishkal stats
- Colliding forms tracking (local-only)
- Release tag v0.17

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 10:44:14 +00:00
efd0745ada Sprint 14: deck template/CSS overhaul + Sprint 12 detail scrape
Template & CSS fixes (15 items from Mar 9 feedback):
- Fix conjugation front showing 3ms form instead of infinitive
- Rename conjugation model to "Hebrew Conjugation"
- Strip Hebrew parenthesized text from English meanings
- Shoresh separator: spaces → dots (א.כ.ל)
- Remove duplicate English meaning from cloze back
- Remove example sentences from vocab front/back (cloze only)
- Center-align audio buttons on all decks
- Fix parenthesis spacing: "you(feminine,singular)" → "you (feminine, singular)"
- Unify sec-key/sec-label fonts, make keys bold
- Size overhaul: bigger Hebrew (42px), meaning (34px), secondary (28px)
- Center-align related words groups
- Sort confusables by average frequency
- Plurals: show Gender (Hebrew) before Mishkal, strip emoji from meaning
- Clean duplicate quotation marks in cloze sentences

Sprint 12 carry-forward (detail scrape + EPUB):
- Adjective/preposition detail scraping in pealim_detail_scrape.py
- EPUB example matching rewrite in epub_examples.py
- Delete benyehuda.py and rebuild_sentence_matches.py (merged)
- 49 parser tests for detail scraping
- SCHEMA.yaml updates for new fields

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 07:44:47 +00:00
3b0f9defa9 feat: YAP-cleaned frequency corpus + two-tier assignment pipeline
- Add clean_frequency_corpus.py: YAP morphological analyzer removes
  prefix+word combos (e.g. בבית=ב+בית) from he_50k frequency data.
  Headwords always protected. 30,430 clean entries from 49,999 raw.
- Add assign_frequency.py: two-tier assignment with PoS-aware homograph
  handling. Tier 1 matches headwords; Tier 2 matches inflections (any rank)
  and conjugations (rank>5000 only, to avoid false positives).
  Function words claim frequency over content words in homograph groups,
  with manual overrides for 12 common dual-use words.
- frequency_lookup.py auto-prefers frequency_clean.json when available
- 6,691 entries now have frequency (was 5,974), 717 newly assigned

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 06:22:55 +00:00
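A rough sketch of the two-tier matching in 3b0f9defa9, leaving out the PoS-aware homograph handling and manual overrides. The entry layout, the function name, and the choice of the best (lowest) rank among Tier 2 matches are assumptions. Only the tier order and the rank>5000 gate for conjugations come from the commit message. Tokens are transliterated placeholders with invented ranks.

```python
def assign_rank(entry, freq, conj_min_rank=5000):
    """Return (rank, tier) for an entry, or (None, None) if nothing matches."""
    # Tier 1: the headword itself appears in the frequency data.
    if entry["headword"] in freq:
        return freq[entry["headword"]], "tier1"
    # Tier 2: inflections match at any rank ...
    ranks = [freq[f] for f in entry.get("inflections", []) if f in freq]
    # ... conjugations only above the threshold, to avoid false positives.
    ranks += [
        freq[f]
        for f in entry.get("conjugations", [])
        if f in freq and freq[f] > conj_min_rank
    ]
    return (min(ranks), "tier2") if ranks else (None, None)


freq = {"bayit": 120, "batim": 900, "achalti": 3000, "hitkalachti": 12000}

print(assign_rank({"headword": "bayit"}, freq))
# (120, 'tier1')
print(assign_rank({"headword": "x", "conjugations": ["achalti"]}, freq))
# (None, None)
```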
b8b65442cb restore epub_examples.py and rebuild_sentence_matches.py
Accidentally removed in 6c2a0f8 — these are the EPUB sentence
extraction and matching scripts used to build vetted_sentences.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 04:33:32 +00:00
04a4b52113 fix: deduplicate 66 plural GUIDs for homograph nouns
Homographs (same nikkud form, different meanings) had identical
plurals_guid values. Regenerated unique GUIDs by including meaning
in the hash. Also updated build-time fallback to use meaning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 04:12:45 +00:00
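To illustrate the collision fixed in 04a4b52113: hashing only the word gives homographs the same GUID, while adding the meaning to the hash input separates them. The hash function, field names, and placeholder entries below are assumptions; the log does not show how `plurals_guid` is actually computed.

```python
import hashlib
from collections import Counter


def plural_guid(word, meaning=None):
    """Hypothetical GUID: hash of the word, optionally plus its meaning."""
    parts = [word] if meaning is None else [word, meaning]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()[:16]


# Placeholder entries: two homographs and one unrelated word.
entries = [
    {"word": "WORD_A", "meaning": "first sense"},
    {"word": "WORD_A", "meaning": "second sense"},
    {"word": "WORD_B", "meaning": "other"},
]

old = Counter(plural_guid(e["word"]) for e in entries)
new = Counter(plural_guid(e["word"], e["meaning"]) for e in entries)
print(max(old.values()), max(new.values()))  # 2 1
```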
f6af714e22 bump RELEASE_TAG to v0.15.1
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 04:08:35 +00:00
b2fef5aa8a Sprint 11.1: strip_nikkud cleanup, dead code removal, test fixes
Remove strip_nikkud from all pipeline files — use ktiv_male directly.
Fix case-insensitive binyan matching in detail scraper (og:description
uses UPPERCASE). Fix integration test slugs and test limits. Delete
legacy CSVs, stale .apkg, and dead scripts from git. Add vulture to
pre-commit hook.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 04:03:47 +00:00
a1d970a782 fix: reorder pipeline — detail scrape immediately after list scrape
List scrape captures slugs needed by detail scrape, so they should be
adjacent. Reordered: list→detail→frequency→examples→audio→fonts→images→build

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 19:16:57 +00:00
6c2a0f8eed chore: remove legacy scraping scripts replaced by unified pipeline
Removed 11 files that are no longer called by the active pipeline:
- hebrew_extract.py (replaced by pealim_list_scrape.py)
- conjugation_extract.py (replaced by pealim_detail_scrape.py)
- scripts/scrape_noun_plurals.py, scrape_verb_ktiv.py, scrape_ktiv_male.py
  (all replaced by pealim_detail_scrape.py)
- scripts/migrate_to_json.py, repair_slugs.py (one-time migration, complete)
- epub_examples.py, rebuild_sentence_matches.py (unused utilities)
- scripts/extract_pdf_sentences.py, add_slugs.py (unused one-off scripts)

Kept: check_guid_coverage.py, validate_data.py, extract_verb_list.py,
validate_apkg.py, validate_verb_list.py, release.py (standalone utilities)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 11:08:33 +00:00
08fb7009d8 Sprint 11: unified JSON architecture + consolidated scraping pipeline
Migrate from fragmented CSV + 10 JSON files to a single data/words.json
(9,104 entries) as the unified data store. All GUIDs preserved for Anki
study progress continuity.

New files:
- SCHEMA.yaml: authoritative schema for words.json
- pealim_list_scrape.py: consolidated list page scraper → words.json
- pealim_detail_scrape.py: noun/verb detail scraper → words.json
- pealim_audio_download.py: audio downloader reading from words.json
- scripts/migrate_to_json.py: one-time CSV→JSON migration
- scripts/validate_data.py: 17 data integrity tests
- scripts/check_guid_coverage.py: GUID preservation checker
- scripts/repair_slugs.py: slug deduplication repair tool
- tests/test_scraper_integration.py: live scraper integration tests

Updated:
- apkg_builder.py: reads from words.json (no more pandas)
- run.py: 8-step pipeline (list scrape → frequency → examples →
  detail scrape → audio download → fonts → images → build)
- benyehuda.py, frequency_lookup.py, image_fetch.py: TODO markers
  for future words.json integration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 10:54:58 +00:00
2e48109d7f v0.15: PoS fix, slug-based audio, CSS cleanup, template improvements
- Fix PoS substring bug: "Pronoun" no longer matches "Noun"
- CSS: reduce sec-label/sec-key font sizes, add .definitions/.conf-entry
- Slug-based audio filenames for confusable words (no more collisions)
- Scraper captures slug from pealim.com list page links
- Confusables: RTL alignment, re-enable audio (remove all-must-have gate)
- Plurals: blue given word, gray meaning, labeled mishkal badge
- Conjugation: add "אֵיךְ אוֹמְרִים" prompt, tense prefix (בְּ),
  Prep field from HBPAREN_RE, labeled RelatedVocab
- Ben Yehuda: skip stripped fallback for confusable words
- Bump RELEASE_TAG to v0.15

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 17:50:23 +00:00
802c369365 v0.14: rescrape vocab, formatting fixes for all decks
- Full pealim.com rescrape: 9,120 words (15 new), all with audio URLs
- Plurals deck: 2:1 regular:irregular ratio (649 notes), RTL arrows, 1.6x hint text
- Conjugation deck: blue infinitive on front, plain meaning on back, nikkud labels
- Confusables deck: larger prompt text (32px), audio only when all words have it
- Validator: non-audio variants no longer false-fail on audio check
- 14 new audio files downloaded

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 09:26:41 +00:00
def2fc1aca fix: card formatting, example sentence homograph protection, plural coverage
Formatting (#5):
- Labels now display with nikkud (שֹׁרֶשׁ, חֵלֶק דִּיבּוּר, רַבִּים, etc.)
- Secondary fields below audio 1.6x bigger (20px → 32px)
- Label keys styled separately (.sec-key class, smaller/dimmer than values)
- Example sentences centered on card (margin: auto, max-width: 90%)
- Emoji only on English side (removed duplicate from Eng→Heb back)
- Broken images hidden via onerror handler

Example sentences (#6):
- Confusable words (same consonants, different nikkud) now only match
  example sentences by exact nikkud form, preventing wrong-word sentences
- Same protection applied to cloze sentence and vetted sentence lookups

Plural coverage (#3):
- Added stripped-nikkud fallback for noun plural matching
- 3,918 nouns now show plurals (was ~3,604, +314 from fallback)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 08:45:53 +00:00
5685270dfa chore: add PROJECT_NOTES.md, update .gitignore
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 08:20:31 +00:00
34bec8f4ce feat: add release.py for automated Forgejo releases
Creates git tag, Forgejo release, uploads all 12 deck variants.
Supports --dry-run, --validate, --force for re-uploading.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 08:12:53 +00:00
17f7458d19 Sprint 9: cloze cards, plurals deck, project reorg, lint tooling
- Cloze card pipeline: 924 cards from 2,296 AI-vetted Hebrew book sentences
- Plurals deck: 375 notes (144 irregular + 231 regular from 86 mishkal patterns)
- Ktiv male forms expanded to 20,711 entries for sentence matching
- Project reorg: helpers.py (deduped strip_nikkud from 10 files), scripts/ for
  one-off tools, tests/ with smoke tests, deleted 3 dead files
- Lint tooling: pyproject.toml with ruff/vulture/bandit/pytest config, .editorconfig,
  fixed all 129 ruff errors (B023 closure fix, SIM103, unused vars)
- validate_apkg.py: card count range check for optional cloze template
- Data caches committed: vetted_sentences, ktiv_male_forms, noun_plurals,
  noun_slug_map, vocab_sentence_matches, epub_sentence_index

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 08:09:39 +00:00
419e952389 feat: curated emoji denylist, vocab audio URLs in CSV
- Expanded _EMOJI_STOP from ~20 to ~80 keywords after manual review
  of all 2,261 emoji-word pairs. Removes false positives from
  polysemous words (french→🍟, water→🤽, rock→🪨, etc.)
- Emoji count: 2,261 → 1,820 (removed ~440 bad matches)
- hebrew_dict.csv now populated with audio_url from pealim.com scrape
  (8,727 words with audio URLs)
- Cached emoji_lookup.json (1,749 keywords from Unicode emoji-test.txt)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 12:29:15 +00:00
607fd1a3bc feat: emoji Unicode lookup, conj nikkud, fix summary metric
- Emoji: _load_emoji_lookup() fetches unicode.org emoji-test.txt, builds
  {keyword: emoji_char} map cached in data/emoji_lookup.json. Falls back
  to empty dict on network failure. build_all_variants() loads once and
  passes to all build_vocab_deck() calls. For each word without pealim
  emoji, tries first 5 keywords from English meaning against lookup.
- Nikkud: זכר→זָכָר, נקבה→נְקֵבָה in PRESENT_EXPANSION constants and
  build_conj_deck() 1st-person gender labels.
- Summary: conj audio file count now excludes _infinitive and _passive_
  on-disk extras never bundled in .apkg (was 2235, now shows ~1765).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 21:24:10 +00:00
3fc3a21a33 fix: load conjugations from cache when --skip-conjugations passed
Previously --skip-conjugations returned None, causing build_all_variants()
to produce near-empty conjugation decks (0.3MB font-only files). Now loads
from conjugations.json cache so all 6 release variants build correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 21:01:47 +00:00
ccd7d61efb Add 6-variant release build (4 vocab + 2 conj), bump to v0.12
- build_vocab_deck(): include_audio/include_images flags
- build_conj_deck(): include_audio flag
- build_all_variants(): builds all 6 apkg files in one call
- Variants: hebrew_vocabulary{,_audio,_images,_audio_images}.apkg
            hebrew_conjugations{,_audio}.apkg
- run.py: step_build_all() replaces step_build_vocab(); conjugation
  extraction reuses cached conjugations.json unless refreshed
- RELEASE_TAG bumped to v0.12

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 20:58:06 +00:00
62c92ffae0 Emoji/image mutual exclusion, font size increases, layout
- Emoji and image are now mutually exclusive: emoji shown if present,
  image used as fallback ({{^Emoji}}{{#Image}}...{{/Image}}{{/Emoji}})
- Emoji shown on English card front (under meaning) — both card directions
- Emoji appears directly under meaning on backs, before secondary info
- sec-label: 16px → 20px; root-info/example: 16px → 18px; related-group: 15px → 18px
- hebrew-sm font-weight:normal (prep label no longer inherits bold)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 20:51:20 +00:00
0e4b041331 Fix emoji ordering and prep font weight
- Emoji now shown above image on both back templates (was below)
- Emoji also shown on English→Hebrew card front (visual cue with meaning)
- hebrew-sm: add font-weight:normal (was inheriting bold from .hebrew parent)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 06:09:55 +00:00
64a1b18951 Sprint 7: emoji/prep extraction, conjugation reduction, project rename
- Item 1/2: Extract emoji and Hebrew parentheticals (prepositions) from
  Meaning field; display emoji with 3.5em font, prep inline after Hebrew
  word. Add Emoji and Prep fields to Hebrew Flash Cards model.
- Item 3: Seeded RNG per verb reduces conjugation cards by ~630 (4 present
  forms → 1 pronoun each; past_3p → 1 gender). 1st-person forms gain gender
  label (זכר/נקבה). Total: 1,834 conj cards (was ~2,464).
- Item 4: hebrew_extract.py uses BeautifulSoup to capture data-audio URLs
  from pealim.com list pages during scraping. step_audio() reads audio_url
  column from CSV (no longer needs audio_extract.py).
- Item 5: Rename to 'Hebrew Flash Cards'. New filenames: hebrew_dict.csv,
  hebrew_extract.py, hebrew_vocabulary.apkg, hebrew_conjugations.apkg.
  Deck/model names updated throughout. Forgejo repo rename pending (sochen
  lacks admin rights — Nevo must do via UI).
- Fix: Deduplicate entries with same Hebrew word before adding notes
  (eliminates GUID collisions from duplicate source CSV rows).
- Bump RELEASE_TAG to v0.11.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 05:49:51 +00:00
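A sketch of the per-verb seeded selection from Item 3 of 64a1b18951, assuming the seed is derived from the infinitive and form key so every rebuild picks the same pronoun. The seed format, the `present_ms` key, and the function name are illustrative assumptions.

```python
import random


def pick_pronoun(infinitive, form_key, pronouns):
    """Deterministically pick one pronoun for a verb form."""
    # A string seed gives the same sequence on every run and machine.
    rng = random.Random(f"{infinitive}|{form_key}")
    return rng.choice(pronouns)


pronouns = ["אֲנִי", "אַתָּה", "הוּא"]
first = pick_pronoun("לֶאֱכֹל", "present_ms", pronouns)
second = pick_pronoun("לֶאֱכֹל", "present_ms", pronouns)
print(first == second)  # True
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps the choice independent of any other random calls in the build.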
f8e4873349 Remove releases/ binary from git; add to .gitignore
Release artifacts (.apkg files) are distributed via Forgejo releases,
not committed to the repository tree. Also gitignore CLAUDE.md (internal).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 05:26:18 +00:00
e66020628f Fix: use stable GUIDs for Anki note matching on reimport
genanki's default GUID is computed from ALL fields, so adding audio to a
previously-empty Audio field changes the GUID — Anki can't match the old
note and skips the update.

Fix: explicitly set GUID from identity-only fields:
- Conjugation notes: guid_for(infinitive, pronoun, tense)
- Vocabulary notes:  guid_for(word)  [Hebrew word with nikkud]

With stable GUIDs, reimporting a rebuilt deck correctly updates existing
notes (audio, tags, corrected fields) without breaking study progress.

NOTE: users who imported a previous release will see new notes on first
reimport (old GUID → new GUID mismatch). They can delete the old untagged
notes via Browse → tag:v0.10 missing → delete.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 05:21:45 +00:00
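The effect described in e66020628f can be shown with a stand-in hash: a GUID built from all fields changes when audio is added, while one built from identity fields alone does not. `guid_from` below is a hashlib-based placeholder used only for illustration (the commit itself uses `guid_for`), and the note contents are made up.

```python
import hashlib


def guid_from(*values):
    """Placeholder GUID: short hash of the given field values."""
    joined = "\x1f".join(values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]


before = {"word": "אָב", "meaning": "father", "audio": ""}
after = {**before, "audio": "[sound:av.mp3]"}  # hypothetical filename

all_fields_changed = guid_from(*before.values()) != guid_from(*after.values())
identity_stable = guid_from(before["word"]) == guid_from(after["word"])
print(all_fields_changed, identity_stable)  # True True
```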
4fcc5cff60 Sprint 6: release tagging, conjugation front swap, validate_apkg.py
- Add RELEASE_TAG="v0.10" constant; tag all notes (vocab + conj) so users
  can identify which release their cards came from via Anki Browse
- Swap conjugation card front: Pronoun now above Infinitive for easier recall
- Add validate_apkg.py: comprehensive .apkg integrity checker covering ZIP
  structure, media manifest, audio format, DB schema, card counts, sound refs,
  and field content; runs on both decks
- Configure Forgejo v0.10 release with conjugation .apkg as downloadable asset
- Update releases/pealim_conjugations.apkg with tagged notes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 05:09:45 +00:00
39fb388f6c first release
2026-03-04 00:27:46 -08:00
bb79725a7f Python code cleanup (python-pro review)
- Type annotations: dict|None defaults, return types, nested func annotations
- Dead code: removed unused row_forms_with_audio(), duplicate _strip_nikkud defs,
  redundant guards, duplicate 'ism' in ABSTRACT_SUFFIXES
- Exceptions: narrowed bare except to (ValueError, pd.errors.ParserError) and
  (json.JSONDecodeError, OSError) throughout; all raise ValueError given messages
- Deduplication: extracted deduplicate() helper in _parse_table; setdefault() for
  dict building in benyehuda and apkg_builder; list comprehension in benyehuda
- Correctness: limit=0 guard fixed (is not None); audio tag parsing uses
  removeprefix/removesuffix instead of magic offsets; vectorized pandas sum
- Constants: BINYAN_NAMES extracted; unicodedata imports moved to top level
- benyehuda load(): removed wasted cache read on force_rebuild; word-boundary
  regex simplified from double-negative to \w

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-04 07:33:14 +00:00
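Two of the correctness fixes in bb79725a7f, sketched with hypothetical helper names: testing `limit is not None` so that `limit=0` is honoured rather than treated as "no limit", and parsing `[sound:...]` tags with `removeprefix`/`removesuffix` (Python 3.9+) instead of slicing by fixed offsets.

```python
def apply_limit(items, limit=None):
    """Truncate to `limit` items; None means no limit, 0 means none at all."""
    # `if limit:` would wrongly ignore limit=0.
    if limit is not None:
        return items[:limit]
    return items


def sound_filename(tag):
    """Extract the filename from an Anki-style [sound:...] tag."""
    return tag.removeprefix("[sound:").removesuffix("]")


print(apply_limit([1, 2, 3], 0))         # []
print(apply_limit([1, 2, 3]))            # [1, 2, 3]
print(sound_filename("[sound:av.mp3]"))  # av.mp3
```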
0686298610 Deprecate --skip-conjugations in favor of --only vocab
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-04 07:21:03 +00:00
6cd42b1e12 Sprint 5: dark mode CSS, alternate conjugation forms, README releases link fix
- add @media (prefers-color-scheme: dark) block to CARD_CSS covering all hardcoded colors
- _parse_table: add table_el param to parse a specific table directly
- _extract_conjugations: detect second active conjugation table; store alternate_forms
- build_conj_deck: show "primary / alternate" when alternate form exists for a key
- README: fix dead ../../releases link → git.nevo.engineer/nevo/pealim/releases

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-04 07:07:28 +00:00
372680be3c Add missing 70th verb להתקלח (to shower, Hitpa'el)
להתלקלח in the original source was a typo for להתקלח (1896-lehitkaleach),
not for להתקלקל as previously assumed — it's a completely different word.
Conjugation deck now has the correct 70 paradigm verbs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-04 06:46:07 +00:00
0db11b1aa1 Sprint 4: fix insertion order, skip infinitive cards, split past_3p, fix empty binyan
- vocab deck uses frequency insertion order (genanki.Package); conjugation deck random (_RandomOrderPackage)
- skip infinitive form_key in conjugation deck build (reference only, not a quiz target)
- PAST_3P_EXPANSION: split past_3p into separate הֵם and הֵן cards
- SECTION_BINYAN parsing: read section headers from verbs_input.txt as binyan hints
- add binyan_hint param to _extract_conjugations and _extract_passive_from_active_slug
- patch 20 cached entries with empty binyan (Pa'al, Nif'al) using section hints
- result: 2428 notes across 69 verbs, all with populated binyan

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-04 06:42:32 +00:00
58dc1b8d9b fix: correct word/verb counts in README, add missing .gitignore entries
- README: ~14,400 → ~9,100 words (actual scrape count)
- README: 71 → 69 verbs (current verb list; 2 short of Coffin & Bolozky — to investigate)
- .gitignore: add data/audio_conj/, data/image_cache.json, data/images/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-04 06:32:44 +00:00
d26e4c8ce5 feat: Sprint 3 — passive/active separation, random card order, card UX fixes
Conjugation extraction:
- Active entries now extract active forms only (no auto passive partner)
- Passive (# 3ms:) entries extract passive section only via new
  _extract_passive_from_active_slug(); search-based fallback also uses
  this path so no active forms leak into passive entries
- # slug: VERB SLUG override syntax for search-ambiguous active verbs
- # 3ms: FORM ACTIVE-SLUG syntax for passive entries with known active page
- Fixed verb spellings: בוטל (was בותל), slug overrides for תואם →
  2344-letaem, זוכה → 503-lezakot, לָשִׂים → 45-lasim, העבר → 1442-lehaavir

Card UX:
- Passive card front: shows active partner infinitive (e.g. לְבַטֵּל) with
  (סָבִיל) inline in smaller font instead of bare 3ms past form
- Removed פָּעִיל label from active cards; only passive cards carry voice label
- New cards introduced in random order (new.order=0 via _RandomOrderPackage)
- Frequency badge: words outside top 50k show "50k+" instead of blank

README: updated CLI options, output files table, pipeline list, card
descriptions to reflect Sprint 3 state

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 10:16:50 +00:00