feat: pseudo-frequency for confusables using English word frequency

264 confusable groups where all entries shared the same Hebrew frequency now have differentiated pseudo_frequency values based on English word commonality (hermitdave en_50k.txt). Most common meaning keeps base rank; less common meanings get +100 offset per position. Examples: - אב: "father" (en:194) → 2491, "bud" (en:2963) → 2591 - אח: "brother" (en:300) → 911, "fireplace" (en:9389) → 1011 Builder uses pseudo_frequency for sort order when available. Confusable card definitions now sorted most-common-first. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: vet fallback emoji — verb gate + expanded stop list removes 852 bad matches
2026-04-03 05:28:30 +00:00 · 2026-04-03 05:17:31 +00:00 · 2026-03-26 23:19:43 +00:00 · 2026-03-21 02:19:03 +00:00
6 changed files with 51132 additions and 717 deletions
--- a/apkg_builder.py
+++ b/apkg_builder.py
@ -265,6 +265,14 @@ details[open] > .more-header::before { content: "● "; }
  text-align: center;
  margin: 0.3em 0;
 }
+.plural-direction {
+  font-size: 32px;
+  color: #444;
+  text-align: center;
+  direction: rtl;
+  margin: 8px 0;
+  font-weight: bold;
+}
 .card [type="button"], .card button, .replay-button {
  display: block !important;
  margin: 4px auto !important;
@ -288,7 +296,26 @@ details[open] > .more-header::before { content: "● "; }
  .related-header { color: #999; }
  .rw-word     { color: #e0e0e0; }
  .rw-meaning  { color: #999; }
+  .plural-direction { color: #aaa; }
 }
+.nightMode .card        { color: #e8e8e8; background: #1c1c1e; }
+.nightMode .hebrew      { color: #f0f0f0; }
+.nightMode .hebrew-sm   { color: #e0e0e0; }
+.nightMode .meaning     { color: #82b0ff; }
+.nightMode .sec-label   { color: #e0e0e0; }
+.nightMode .sec-key     { color: #e0e0e0; }
+.nightMode .sec-val     { color: #e0e0e0; }
+.nightMode .conf-entry  { color: #ddd; }
+.nightMode .hint        { color: #777; }
+.nightMode .voice-label { color: #888; }
+.nightMode .example     { color: #e0e0e0; border-right-color: #555; }
+.nightMode .divider     { border-top-color: #333; }
+.nightMode .freq-badge  { color: #888; border-color: #444; }
+.nightMode .more-header { color: #bbb; background: #2a2a2e; border-color: #555; }
+.nightMode .related-header { color: #999; }
+.nightMode .rw-word     { color: #e0e0e0; }
+.nightMode .rw-meaning  { color: #999; }
+.nightMode .plural-direction { color: #aaa; }
 """

 # ──────────────────────────────────────────────────────────────────────────────
@ -422,7 +449,7 @@ CONJ_BACK = """
 <div class="hebrew">{{ConjugatedForm}}{{#Prep}} ({{Prep}}){{/Prep}}</div>
 {{#Audio}}<div>{{Audio}}</div>{{/Audio}}
 <details class="more-toggle"><summary class="more-header">מידע נוסף</summary>
-{{#Meaning}}<div class="sec-label" style="text-align:center;display:block;">{{Meaning}}</div>{{/Meaning}}
+{{#Meaning}}<div class="meaning" style="font-size:28px;">{{Meaning}}</div>{{/Meaning}}
 <div class="sec-table">
 <div class="sec-label"><span class="sec-key">שֹׁרֶשׁ:</span><span class="sec-val">{{Root}}</span></div>
 <div class="sec-label"><span class="sec-key">בִּנְיָן:</span><span class="sec-val">{{Binyan}}</span></div>
@ -693,6 +720,116 @@ _EMOJI_STOP = frozenset(
        "bar",
        "wheel",
        "horizontal",
+        # Polysemous keywords producing wrong-sense emoji (Sprint 17 audit)
+        "high",  # ⚡ high voltage, not "tall"
+        "down",  # 🫳 palm down, not "descend"
+        "off",  # 📴 phone off, not "remove"
+        "away",  # 💨 dashing away, not "depart"
+        "together",  # 🤲 palms together, not "unite"
+        "top",  # 🎩 top hat, not "upper"
+        "low",  # 🔈 low volume, not "short"
+        "flat",  # 🥿 ballet flat, not "apartment"
+        "soft",  # 🍦 soft serve, not "quiet"
+        "broken",  # 💔 broken heart, not "damaged"
+        "round",  # 📍 round pushpin, not "circular"
+        "cool",  # 🆒 COOL button, not "cold"
+        "free",  # 🆓 FREE button, not "liberated"
+        "long",  # 🪘 long drum, not "lengthy"
+        "straight",  # 📏 straight ruler, not "direct"
+        "empty",  # 🪹 empty nest, not "void"
+        "hot",  # 🥵 hot face, not "warm"
+        "cross",  # ✝️ latin cross, not "intersect"
+        "bright",  # 🔆 bright button, not "luminous"
+        "old",  # 👴 old man, not "aged"
+        "head",  # 🙂‍↔️ shaking head, not "leader"
+        # Category words that match generic emoji
+        "military",  # 🎖️ military medal for any military term
+        "sports",  # 🏅 sports medal for any sports term
+        "food",  # 😋 yummy face for any food term
+        "city",  # 🇻🇦 Vatican flag for any city
+        "china",  # 🇨🇳 China flag for "porcelain"
+        "polish",  # 💅 nail polish for "to polish/shine"
+        "aid",  # 🦻 hearing aid for "to help"
+        "office",  # 🧑‍💼 office worker for "bureau"
+        "construction",  # 🏛️ classical building, not construction
+        "cinema",  # 🎦 cinema emoji for any film term
+        "ceremony",  # 🎑 moon ceremony for any ceremony
+        "building",  # 🏛️ classical building for any structure
+        # Body parts / human features → wrong emoji
+        "arm",  # 🦾 mechanical arm for "to arm"
+        "hair",  # 👱 blond person for "hair"
+        "nose",  # 😤 steam from nose
+        "tongue",  # 😛 tongue-out face
+        "chest",  # 🪎 not a chest
+        "eyes",  # 😃 face with eyes
+        # Abstract/vague words
+        "fear",  # 😱 screaming face
+        "anger",  # 💢 anger symbol
+        "angry",  # 😠 angry face
+        "tired",  # 😫 tired face
+        "sad",  # 😥 sad face
+        "joy",  # 😂 tears of joy
+        "love",  # 💌 love letter
+        "cold",  # 🥶 cold face
+        "pile",  # 💩 pile of poo
+        "man",  # 👨 man
+        "woman",  # 👩 woman
+        "boy",  # 👦 boy
+        "girl",  # 👧 girl
+        "baby",  # 👶 baby
+        "children",  # 🚸 children crossing
+        "student",  # 🧑‍🎓 student
+        "adult",  # 🧑‍🧑‍🧒 family
+        "name",  # 📛 name badge
+        "check",  # ✅ check mark
+        "line",  # 🫥 dotted line face
+        "floor",  # 🤣 ROFL (rolling on floor)
+        "room",  # 🧖 person in steamy room
+        "bubble",  # 👁️‍🗨️ speech bubble
+        "car",  # 🚃 railway car, not automobile
+        "bullet",  # 🚅 bullet train
+        "steam",  # 😤 face with steam
+        "fly",  # 🪰 the insect, not the verb
+        "plant",  # 🪴 potted plant for all "X (plant)" entries
+        "tree",  # 🌲 evergreen for all "X (tree)" entries
+        "ball",  # ⛹️ person bouncing ball
+        "bag",  # 👝 clutch bag
+        "fight",  # 🫯 not a fight
+        "cloud",  # 🫯 not a cloud
+        "video",  # 🎮 video game, not video
+        "rescue",  # ⛑️ rescue worker helmet
+        "exchange",  # 💱 currency exchange
+        "cut",  # 🥩 cut of meat, not "to cut"
+        "key",  # 🔐 locked with key
+        "walking",  # 🚶 person walking
+        "running",  # 🏃 person running
+        "climbing",  # 🧗 person climbing
+        "speaking",  # 🗣️ speaking head
+        "playing",  # 🤽 person playing
+        "feeding",  # 👩‍🍼 person feeding
+        "shooting",  # 🌠 shooting star
+        "clapping",  # 👏 clapping hands
+        "cooking",  # 🍳 cooking emoji
+        "holding",  # 🥹 face holding back tears
+        # More wrong-sense matches from remaining audit
+        "paper",  # 🏮 red lantern for "paper"
+        "track",  # 🛤️ railroad for "track record"
+        "vertical",  # 🚦 traffic light for "vertical"
+        "speaker",  # 🔇 muted speaker for "speaker (person)"
+        "square",  # 🟥 red square for "plaza"
+        "wrapped",  # 🎁 gift for "wrapped, bound"
+        "volume",  # 🔈 speaker for "volume (book)"
+        "mobile",  # 📱 phone for "mobile, moveable"
+        "flash",  # 📸 camera flash for "to shine"
+        "identification",  # 🪪 ID card for "locating"
+        "service",  # 🐕‍🦺 service dog for "service, term"
+        "ground",  # ⛱️ umbrella on ground
+        "machine",  # 🎰 slot machine for "mechanism"
+        "liquid",  # 🫗 pouring for "liquid, drop"
+        "vehicle",  # 🚙 SUV for any vehicle mention
+        "window",  # 🪟 window pane for "window, gap"
+        "information",  # ℹ️ info symbol
+        "child",  # 🧒 child emoji
    }
 )

@ -832,9 +969,11 @@ def build_vocab_deck(
        if word_nikkud not in word_to_pos_cat:
            word_to_pos_cat[word_nikkud] = _categorize_pos(pos_raw) if pos_raw else "Other"

-    # Sort entries by frequency (null → 999999), applying limit after sort
+    # Sort entries by effective frequency (pseudo_frequency for confusables,
+    # else regular frequency; null → 999999), applying limit after sort
    def _freq_key(item: tuple[str, dict]) -> int:
-        return item[1].get("frequency") or 999_999
+        e = item[1]
+        return e.get("pseudo_frequency") or e.get("frequency") or 999_999

    sorted_entries = sorted(words.items(), key=_freq_key)
    if limit:
@ -860,7 +999,6 @@ def build_vocab_deck(
        meaning = re.sub(r"\s{2,}", " ", meaning).strip(", ;:")
        meaning = re.sub(r"(\w)\(", r"\1 (", meaning)  # space before opening paren
        meaning = re.sub(r",(\S)", r", \1", meaning)  # space after comma
-        meaning_raw = entry.get("meaning_raw", "") or ""
        slug = entry.get("slug", "") or ""
        frequency = entry.get("frequency") or 999_999
        audio_file = entry.get("audio_file", "") or ""
@ -895,25 +1033,22 @@ def build_vocab_deck(
        else:
            freq_display = "Unlisted"

-        # Emoji: use entry's emoji if emoji_visible, else fall back to emoji_lookup
+        # Emoji: use entry's emoji if emoji_visible, else fall back to emoji_lookup.
+        # Skip fallback for verbs — keyword matching on verb definitions produces
+        # wrong-sense emoji (e.g. "to cut" → 🥩, "to arm" → 🦾).
        emoji_str = ""
        if entry.get("emoji_visible") and entry.get("emoji"):
            emoji_str = entry["emoji"]
-        elif not emoji_str and emoji_lookup:
+        elif emoji_lookup and not meaning.startswith("to "):
            meaning_clean_for_emoji = EMOJI_RE.sub("", meaning).strip()
            for kw in re.sub(r"[^\w\s]", " ", meaning_clean_for_emoji.lower()).split()[:5]:
                if len(kw) > 2 and kw not in _EMOJI_STOP and kw in emoji_lookup:
                    emoji_str = emoji_lookup[kw]
                    break

-        # Extract Hebrew prepositions: prefer upstream-parsed prep field, fall back to meaning_raw scan
-        # (fallback covers entries scraped before prep was moved upstream)
+        # Hebrew prepositions — extracted upstream by list scraper
        entry_prep = entry.get("prep")
-        if entry_prep:
-            prep_str = " ".join(f"({p})" for p in entry_prep.split())
-        else:
-            preps = HBPAREN_RE.findall(meaning_raw)
-            prep_str = " ".join(f"({p})" for p in preps)
+        prep_str = " ".join(f"({p})" for p in entry_prep.split()) if entry_prep else ""

        # Audio — use audio_file from entry; for confusables it's already slug-based
        audio_tag = ""
@ -1123,25 +1258,12 @@ def build_conj_deck(
        root = ".".join(root_list)
        voice = VOICE_MAP.get(binyan, "")

-        meaning_raw = entry.get("meaning_raw", "") or ""
        meaning = entry.get("meaning", "") or ""
-        # Extract Hebrew preposition — strip from meaning, show on Hebrew side
+        # Hebrew preposition — extracted upstream by scraper
        prep_str = ""
-        conj_prep = conj.get("prep")
+        conj_prep = conj.get("prep") or entry.get("prep")
        if conj_prep:
-            # Strip any parentheses from stored prep value
            prep_str = conj_prep.strip("() ")
-        elif meaning_raw:
-            preps = HBPAREN_RE.findall(meaning_raw)
-            if preps:
-                prep_str = preps[0]
-        # Strip Hebrew prepositions from English meaning to avoid duplication
-        if prep_str:
-            meaning = HBPAREN_RE.sub("", meaning).strip()
-            # Also strip from meaning_raw patterns like "(על)"
-            meaning = re.sub(r"\(\s*" + re.escape(prep_str) + r"\s*-?\s*\)", "", meaning).strip()
-            # Clean up double spaces and trailing commas
-            meaning = re.sub(r"\s{2,}", " ", meaning).strip(", ")

        related = [(f, w, m) for f, w, m in root_words.get(root, []) if w != infinitive]
        if related:
@ -1438,9 +1560,12 @@ def build_confusables_deck(
            guid = genanki.guid_for("confusable", entry["word"].get("ktiv_male", unique_key))
        guid_to_entries.setdefault(guid, []).append(entry)

+    def _eff_freq(e: dict) -> int:
+        return e.get("pseudo_frequency") or e.get("frequency") or 999_999
+
    for guid, group_entries in sorted(
        guid_to_entries.items(),
-        key=lambda x: sum(e.get("frequency") or 999_999 for e in x[1]) / len(x[1]),
+        key=lambda x: sum(_eff_freq(e) for e in x[1]) / len(x[1]),
    ):
        if guid in seen_guids:
            continue
@ -1459,9 +1584,13 @@ def build_confusables_deck(
                unique_entries.append(e)
        if len(unique_entries) < 2:
            continue
+        # Sort by pseudo/frequency so most common meaning appears first
+        unique_entries.sort(key=_eff_freq)
+        if len(unique_entries) < 2:
+            continue

        word_no_nik = unique_entries[0]["word"].get("ktiv_male", "")
-        words_display = " / ".join(e["word"]["nikkud"] for e in unique_entries)
+        words_display = word_no_nik  # Show ktiv male (shared form) on front

        defs_parts: list[str] = []
        audio_parts: list[str] = []
@ -1530,8 +1659,8 @@ def write_conf_apkg(
 PLURAL_FRONT_SG = """
 <div class="hebrew" style="color:#1a1a8c;">{{Singular}}</div>
 {{#SingularAudio}}<div>{{SingularAudio}}</div>{{/SingularAudio}}
-<div class="sec-label">{{Meaning}}</div>
-<div class="hint" style="font-size:28px;">יָחִיד ← רַבִּים</div>
+<div class="meaning" style="font-size:28px;">{{Meaning}}</div>
+<div class="plural-direction">יָחִיד ← רַבִּים</div>
 """

 PLURAL_BACK_SG = """
@ -1547,14 +1676,14 @@ PLURAL_BACK_SG = """
 PLURAL_FRONT_PL = """
 <div class="hebrew" style="color:#1a1a8c;">{{Plural}}</div>
 {{#PluralAudio}}<div>{{PluralAudio}}</div>{{/PluralAudio}}
-<div class="hint" style="font-size:28px;">רַבִּים ← יָחִיד</div>
+<div class="plural-direction">רַבִּים ← יָחִיד</div>
 """

 PLURAL_BACK_PL = """
 {{FrontSide}}<hr>
 <div class="hebrew">{{Singular}}</div>
 {{#SingularAudio}}<div>{{SingularAudio}}</div>{{/SingularAudio}}
-<div class="sec-label" style="text-align:center;display:block;">{{Meaning}}</div>
+<div class="meaning" style="font-size:28px;">{{Meaning}}</div>
 <div class="sec-table">
 {{#Gender}}<div class="sec-label"><span class="sec-key">מִין:</span><span class="sec-val">{{Gender}}</span></div>{{/Gender}}
 {{#Mishkal}}<div class="sec-label"><span class="sec-key">מִשְׁקָל:</span><span class="sec-val">{{Mishkal}}</span></div>{{/Mishkal}}
@ -1651,9 +1780,9 @@ def build_plural_deck(
    irregular_count = len(irregulars)
    target_regular = irregular_count * 2
    mishkal_count = len(by_mishkal) or 1
-    per_mishkal = max(2, target_regular // mishkal_count)
+    # Over-sample per mishkal to compensate for small patterns, then trim
+    per_mishkal = max(3, (target_regular * 3) // (mishkal_count * 2))

-    selected: list[tuple[str, dict, dict]] = list(irregulars)
    regular_pool: list[tuple[str, dict, dict]] = []
    for _mishkal, entries in sorted(by_mishkal.items()):
        entries.sort(key=lambda e: e[1].get("frequency") or 999_999)
@ -1664,7 +1793,24 @@ def build_plural_deck(
        regular_pool.sort(key=lambda e: e[1].get("frequency") or 999_999)
        regular_pool = regular_pool[:target_regular]

-    selected.extend(regular_pool)
+    # Sort both pools by frequency, then interleave for homogeneous 2:1 regular:irregular
+    irregulars.sort(key=lambda e: e[1].get("frequency") or 999_999)
+    regular_pool.sort(key=lambda e: e[1].get("frequency") or 999_999)
+
+    # Interleave: for every 1 irregular, insert 2 regulars
+    selected: list[tuple[str, dict, dict]] = []
+    ri = 0  # regular index
+    for _ii, irr in enumerate(irregulars):
+        # Insert 2 regulars before each irregular (when available)
+        for _ in range(2):
+            if ri < len(regular_pool):
+                selected.append(regular_pool[ri])
+                ri += 1
+        selected.append(irr)
+    # Append remaining regulars
+    while ri < len(regular_pool):
+        selected.append(regular_pool[ri])
+        ri += 1

    note_count = 0
    for _unique_key, entry, noun_inflection in selected:
--- a/data/en_50k.txt
+++ b/data/en_50k.txt
--- a/data/words.json
+++ b/data/words.json
--- a/pealim_list_scrape.py
+++ b/pealim_list_scrape.py
@ -82,7 +82,7 @@ BINYAN_HEBREW: dict[str, str] = {

 # Regex for extracting emoji characters
 EMOJI_RE = re.compile(
-    r"[\U0001F300-\U0001FFFF\U00002600-\U000027BF\U0001F000-\U0001F9FF\u2600-\u26FF\u2700-\u27BF]+",
+    r"[\U0001F300-\U0001FFFF\U00002600-\U000027BF\U0001F000-\U0001F9FF\u2600-\u26FF\u2700-\u27BF\uFE0E\uFE0F\u200D]+",
    re.UNICODE,
 )

--- a/release.py
+++ b/release.py
@ -24,7 +24,7 @@ sys.path.insert(0, "/home/node/projects")
 import load_keeshare

 REPO_API = "https://git.nevo.engineer/api/v1/repos/nevo/hebrew_flash_cards"
-FORGEJO_TOKEN: str = load_keeshare.get_entry("git.nevo.engineer")["API_TOKEN"]
+FORGEJO_TOKEN: str = load_keeshare.get_entry("git.nevo.engineer")["password"]
 OUTPUT_DIR = Path(__file__).parent / "output"

 # All deck variants to include in release
--- a/scripts/assign_pseudo_frequency.py
+++ b/scripts/assign_pseudo_frequency.py
@ -0,0 +1,269 @@
+#!/usr/bin/env python3
+"""Assign pseudo-frequency to confusable groups using English word frequency.
+
+Problem: Confusable entries share the same ktiv_male and thus the same Hebrew
+frequency rank.  This script uses English frequency to differentiate them so
+Anki sorts more-common meanings first.
+
+Algorithm:
+  1. For each confusable group where all entries share the same Hebrew frequency,
+     extract the first meaningful English keyword from each entry's meaning field.
+  2. Look up English frequency rank for each keyword.
+  3. Assign pseudo_frequency: the most frequent English meaning keeps the original
+     Hebrew rank; less frequent meanings get progressively higher (worse) ranks
+     by adding an offset (100 * position in group).
+
+Usage:
+    python3 scripts/assign_pseudo_frequency.py              # assign and save
+    python3 scripts/assign_pseudo_frequency.py --dry-run    # preview only
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import logging
+import re
+from collections import defaultdict
+from pathlib import Path
+
+logger = logging.getLogger(__name__)
+
+PROJECT_ROOT = Path(__file__).parent.parent
+WORDS_JSON = PROJECT_ROOT / "data" / "words.json"
+EN_FREQ_PATH = PROJECT_ROOT / "data" / "en_50k.txt"
+
+# Words too common/vague to use as frequency signal
+_EN_STOP = frozenset(
+    {
+        "to",
+        "be",
+        "a",
+        "an",
+        "the",
+        "of",
+        "in",
+        "on",
+        "at",
+        "for",
+        "and",
+        "with",
+        "by",
+        "or",
+        "but",
+        "not",
+        "as",
+        "its",
+        "it",
+        "is",
+        "was",
+        "are",
+        "from",
+        "that",
+        "this",
+        "have",
+        "has",
+        "had",
+        "do",
+        "does",
+        "did",
+        "will",
+        "would",
+        "can",
+        "could",
+        "may",
+        "might",
+        "shall",
+        "should",
+        "must",
+        "no",
+        "yes",
+        "very",
+        "too",
+        "also",
+        "just",
+        "only",
+        "so",
+        "up",
+        "out",
+        "into",
+        "over",
+        "after",
+        "before",
+        "about",
+        "more",
+        "than",
+        "other",
+        "some",
+        "any",
+        "all",
+        "each",
+        "every",
+        "both",
+        "few",
+        "many",
+        "much",
+        "most",
+        "such",
+        "own",
+        "same",
+        "well",
+        "still",
+        "even",
+        "how",
+        "what",
+        "when",
+        "where",
+        "which",
+        "who",
+        "whom",
+        "whose",
+        "why",
+        "because",
+        "if",
+        "then",
+        "else",
+        "while",
+        "until",
+        "though",
+        "whether",
+    }
+)
+
+
+def _load_en_freq() -> dict[str, int]:
+    """Load English frequency data: word -> rank (1 = most common)."""
+    freq: dict[str, int] = {}
+    rank = 1
+    with open(EN_FREQ_PATH, encoding="utf-8") as f:
+        for line in f:
+            parts = line.strip().split()
+            if parts:
+                word = parts[0].lower()
+                if word not in freq:
+                    freq[word] = rank
+                    rank += 1
+    return freq
+
+
+def _extract_keywords(meaning: str) -> list[str]:
+    """Extract meaningful English keywords from a meaning string.
+
+    Returns list of lowercase words, filtered for stop words and short words.
+    """
+    # Strip parenthesized content, punctuation
+    cleaned = re.sub(r"\([^)]*\)", " ", meaning)
+    cleaned = re.sub(r"[^\w\s]", " ", cleaned)
+    return [w.lower() for w in cleaned.split() if len(w) > 2 and w.lower() not in _EN_STOP]
+
+
+def assign_pseudo_frequencies(
+    words: dict,
+    en_freq: dict[str, int],
+    dry_run: bool = False,
+) -> int:
+    """Assign pseudo_frequency to confusable groups. Returns count of changes."""
+
+    # Group by confusables_guid
+    groups: dict[str, list[str]] = defaultdict(list)
+    for key, entry in words.items():
+        cg = entry.get("confusables_guid")
+        if cg:
+            groups[cg].append(key)
+
+    changes = 0
+    assigned_groups = 0
+    skipped_diff = 0
+    skipped_no_en = 0
+
+    for _guid, keys in groups.items():
+        entries = [words[k] for k in keys]
+        freqs = [e.get("frequency") for e in entries]
+
+        # Skip groups that are already differentiated
+        unique_freqs = set(freqs)
+        if len(unique_freqs) > 1:
+            skipped_diff += 1
+            continue
+
+        base_freq = freqs[0]  # All same (or all None)
+
+        # Look up English frequency for each entry
+        en_ranks: list[tuple[int, str]] = []  # (en_rank, key)
+        for key, entry in zip(keys, entries, strict=True):
+            keywords = _extract_keywords(entry.get("meaning", ""))
+            en_rank = 999_999
+            for kw in keywords[:5]:
+                r = en_freq.get(kw)
+                if r is not None:
+                    en_rank = r
+                    break
+            en_ranks.append((en_rank, key))
+
+        # Sort by English frequency (lower rank = more common)
+        en_ranks.sort()
+
+        # Check if all entries have the same English rank (no signal)
+        if len({r for r, _ in en_ranks}) <= 1:
+            skipped_no_en += 1
+            continue
+
+        assigned_groups += 1
+
+        # Assign pseudo_frequency: most common gets base, others get offset
+        for position, (en_rank, key) in enumerate(en_ranks):
+            pseudo = base_freq + position * 100 if base_freq is not None else 50000 + en_rank
+
+            if not dry_run:
+                words[key]["pseudo_frequency"] = pseudo
+            changes += 1
+
+            if dry_run:
+                meaning = words[key].get("meaning", "")[:40]
+                logger.info(
+                    "  [en:%5d] pseudo=%6d  %s",
+                    en_rank,
+                    pseudo,
+                    meaning,
+                )
+
+    logger.info(
+        "Pseudo-frequency: %d groups assigned, %d already differentiated, %d no English signal",
+        assigned_groups,
+        skipped_diff,
+        skipped_no_en,
+    )
+    return changes
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Assign pseudo-frequency to confusables")
+    parser.add_argument("--dry-run", action="store_true", help="Preview without saving")
+    args = parser.parse_args()
+
+    logging.basicConfig(
+        level=logging.INFO,
+        format="%(asctime)s %(levelname)s %(message)s",
+    )
+
+    logger.info("Loading English frequency data: %s", EN_FREQ_PATH)
+    en_freq = _load_en_freq()
+    logger.info("English frequency: %d entries", len(en_freq))
+
+    with open(WORDS_JSON, encoding="utf-8") as f:
+        words: dict = json.load(f)
+
+    changes = assign_pseudo_frequencies(words, en_freq, dry_run=args.dry_run)
+
+    if args.dry_run:
+        logger.info("Dry run — %d changes would be made", changes)
+        return
+
+    with open(WORDS_JSON, "w", encoding="utf-8") as f:
+        json.dump(words, f, ensure_ascii=False, indent=2)
+
+    logger.info("Saved %d pseudo-frequency assignments to words.json", changes)
+
+
+if __name__ == "__main__":
+    main()
Author	SHA1	Message	Date
Sochen	6d2d446ed5	feat: pseudo-frequency for confusables using English word frequency 264 confusable groups where all entries shared the same Hebrew frequency now have differentiated pseudo_frequency values based on English word commonality (hermitdave en_50k.txt). Most common meaning keeps base rank; less common meanings get +100 offset per position. Examples: - אב: "father" (en:194) → 2491, "bud" (en:2963) → 2591 - אח: "brother" (en:300) → 911, "fireplace" (en:9389) → 1011 Builder uses pseudo_frequency for sort order when available. Confusable card definitions now sorted most-common-first. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-03 05:28:30 +00:00
Sochen	f978e5f39a	fix: vet fallback emoji — verb gate + expanded stop list removes 852 bad matches The fallback emoji system (keyword→Unicode char matching at build time) was producing 1,733 matches, many with wrong-sense emoji: - "high, tall" → ⚡ (from "high voltage") - "to cut" → 🥩 (cut of meat) - "city" → 🇻🇦 (Vatican flag) Two fixes: 1. Skip fallback for verbs (meanings starting "to ") — 476 removed 2. Expand _EMOJI_STOP with 100+ polysemous/abstract keywords — 376 more Result: 1733 → 881 fallback matches (49% reduction). The 114 from_pealim emojis (concrete nouns like 🍎 apple, 🐪 camel) are unaffected. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-03 05:17:31 +00:00
Sochen	5f617af4ba	feat: vet emoji assignments — 114 visible, 3 removed, 1 fixed Reviewed all 117 pealim-inherited emoji assignments: - Made 114 correct assignments visible (emoji_visible: true) - Removed: goblet (🏆 is trophy), fitness (🏋 too abstract), red (💄 is lipstick) - Fixed: onion 🌰→🧅 (was chestnut emoji) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-26 23:19:43 +00:00
Sochen	f3496998f5	feat: confusables show ktiv male, emoji/prep stripping fully upstream - Confusables deck front now shows shared ktiv male form instead of nikkud variants joined by "/". Back still shows nikkud with definitions. - Fixed list scraper EMOJI_RE to catch variation selectors (U+FE0F) and ZWJ (U+200D) — cleaned 17 entries with leftover selectors in meaning. - Removed build-time prep extraction fallback (0 entries relied on it). - release.py: fix keeshare field name (API_TOKEN → password). Closes: Pealim #11 (emoji/prep upstream), Pealim #16 (confusables ktiv male) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-21 02:19:03 +00:00