qo}NJq"
+ "cloze_guid": "D1F7qor]09"
},
- "rejected_count": 1
+ "rejected_count": 0
},
"noun_inflection": {
"singular": {
@@ -2198106,6 +2209655,11 @@
"text": "אֲנִי עֲדַיִן שָׁקוּעַ בְּחִפּוּשׂ הַהֶבְדֵּלִים בֵּין דּוֹדָה לְבֵין מִנְהָרָה",
"source": "time_tunnel_78",
"match_method": "direct"
+ },
+ {
+ "text": "כַּאֲשֶׁר שָׁטָה בֵּין אֲוָזֶיהָ גִּלְּתָה אֶת חַמְדִּי שָׁקוּעַ עַל גַּבּוֹ בְּמַיִם",
+ "source": "tree_dress",
+ "match_method": "direct"
}
],
"cloze": {
@@ -2198933,6 +2210487,16 @@
"confusables_guid": null,
"examples": {
"vetted": [
+ {
+ "text": "\" \"אִם כָּךְ, מָה עִם שְׁקִיעַת הַשֶּׁמֶשׁ שֶׁלִּי",
+ "source": "little_prince",
+ "match_method": "inflected"
+ },
+ {
+ "text": "הוּא הִצְטַעֵר עַל שְׁקִיעַת הַשֶּׁמֶשׁ שֶׁלֹּא קָרְתָה",
+ "source": "little_prince",
+ "match_method": "inflected"
+ },
{
"text": "כְּשֶׁעֲצוּבִים מְאוֹד, נָעִים לִרְאוֹת שְׁקִיעוֹת",
"source": "little_prince",
@@ -2198940,9 +2210504,9 @@
}
],
"cloze": {
- "text": "כְּשֶׁעֲצוּבִים מְאוֹד, נָעִים לִרְאוֹת שְׁקִיעוֹת",
- "cloze_word_start": 40,
- "cloze_word_end": 50,
+ "text": "\" \"אִם כָּךְ, מָה עִם שְׁקִיעַת הַשֶּׁמֶשׁ שֶׁלִּי",
+ "cloze_word_start": 22,
+ "cloze_word_end": 31,
"cloze_hint": null,
"cloze_guid": "p^aMF.j5o,D>"
+ },
+ "rejected_count": 0
+ },
"noun_inflection": {
"singular": {
"nikkud": "מִשְׂרָה",
@@ -2209156,7 +2220814,16 @@
"לִשְׂרוֹת"
],
"confusables_guid": "BTpm)jG1ua",
- "examples": null,
+ "examples": {
+ "vetted": [
+ {
+ "text": "בְּיָמִים טוֹבִים הָיָה הַדַיָג מַעֲלֶה דָגִים רַבִּים בְּרִשְׁתּוֹ, וְאָז הָיְתָה הַשִׂמְחָה שׁוֹרָה בְּבֵיתוֹ הַקָטָן",
+ "source": "shell_story",
+ "match_method": "conjugated"
+ }
+ ],
+ "rejected_count": 0
+ },
"noun_inflection": null,
"conjugation": {
"in_conjugation_deck": false,
@@ -2212161,7 +2223828,16 @@
"שָׂרוּךְ"
],
"confusables_guid": "kH<&uS{c1n",
- "examples": null,
+ "examples": {
+ "vetted": [
+ {
+ "text": "הוֹפָעָתָהּ שֶׁל צְעִירָה שֶׁלָּבְשָׁה חֻלְצָה כְּחֻלָּה שֶׁל אֵיזוֹ תְּנוּעַת נֹעַר, אֲבָל בְּלִי שְׂרוֹךְ, קָטְעָה אוֹתִי",
+ "source": "time_tunnel_76",
+ "match_method": "direct"
+ }
+ ],
+ "rejected_count": 0
+ },
"noun_inflection": {
"singular": {
"nikkud": "שְׂרוֹךְ",
@@ -2213145,20 +2224821,25 @@
"examples": {
"vetted": [
{
- "text": "אֲנִי מַבִּיט לְכִווּן יְרוּשָׁלַיִם שֶׁאֲנִי מַכִּיר, זֹאת שֶׁבַּהֹוֶה מִשְׂתָּרַעַת עַל הֶהָרִים, וְהִיא אֵינֶנָּה",
- "source": "time_tunnel_82",
+ "text": "הַגְּבָעוֹת הָאֵלֶּה מִשְׂתָּרְעוֹת בְּאֵזוֹר בֵּית שְׁעָרִים, קָרוֹב לְעֵמֶק יִזְרְעֶאל",
+ "source": "time_tunnel_81",
"match_method": "conjugated"
},
{
- "text": "כְּמוֹ שֶׁאַתָּה יָכוֹל לְהָבִין מִשְּׁמָהּ, הִיא הִשְׂתָּרְעָה בֵּין הַשְּׁאָר עַל הַשְּׁטָחִים שֶׁהַיּוֹם שַׁיָּכִים לְאוֹסְטְרִיָה וּלְהוּנְגַרְיָה",
- "source": "time_tunnel_63",
+ "text": "לְנֶגֶד עֵינָיו הִשְׂתָּרְעָה עִיר יָפָה לְהַפְלִיא",
+ "source": "gulliver",
+ "match_method": "conjugated"
+ },
+ {
+ "text": "אֲנִי מַבִּיט לְכִווּן יְרוּשָׁלַיִם שֶׁאֲנִי מַכִּיר, זֹאת שֶׁבַּהֹוֶה מִשְׂתָּרַעַת עַל הֶהָרִים, וְהִיא אֵינֶנָּה",
+ "source": "time_tunnel_82",
"match_method": "conjugated"
}
],
"cloze": {
- "text": "אֲנִי מַבִּיט לְכִווּן יְרוּשָׁלַיִם שֶׁאֲנִי מַכִּיר, זֹאת שֶׁבַּהֹוֶה מִשְׂתָּרַעַת עַל הֶהָרִים, וְהִיא אֵינֶנָּה",
- "cloze_word_start": 72,
- "cloze_word_end": 85,
+ "text": "הַגְּבָעוֹת הָאֵלֶּה מִשְׂתָּרְעוֹת בְּאֵזוֹר בֵּית שְׁעָרִים, קָרוֹב לְעֵמֶק יִזְרְעֶאל",
+ "cloze_word_start": 21,
+ "cloze_word_end": 35,
"cloze_hint": null,
"cloze_guid": "Ndo},7K-4G"
},
@@ -2213650,15 +2225331,25 @@
"examples": {
"vetted": [
{
- "text": "״הֵם בְּתוֹךְ הַמּוֹשָׁבָה, שׁוֹדְדִים אֶת הָעֲדָרִים שֶׁלָּנוּ וְשׂוֹרְפִים הַכֹּל",
- "source": "time_tunnel_70",
- "match_method": "conjugated_prefix"
+ "text": "הָרַגְלַיִם שֶׁלִּי הוֹלְכוֹת וְנִהְיוֹת כְּבֵדוֹת יוֹתֵר וְיוֹתֵר וְהָעֵינַיִם שֶׁלִּי שׂוֹרְפוֹת",
+ "source": "time_tunnel_81",
+ "match_method": "conjugated"
+ },
+ {
+ "text": "גַּם אִם הָיִיתִי רוֹצֶה לְהַגִּיד מַשֶּׁהוּ, לֹא הָיִיתִי מְסֻגָּל, כִּי הַגָּרוֹן שֶׁלִּי שׂוֹרֵף וְחָנוּק",
+ "source": "time_tunnel_76",
+ "match_method": "conjugated"
+ },
+ {
+ "text": "שׂוֹרֵף מֵהַצְּעָקוֹת וְחָנוּק מִדְּמָעוֹת",
+ "source": "time_tunnel_76",
+ "match_method": "conjugated"
}
],
"cloze": {
- "text": "״הֵם בְּתוֹךְ הַמּוֹשָׁבָה, שׁוֹדְדִים אֶת הָעֲדָרִים שֶׁלָּנוּ וְשׂוֹרְפִים הַכֹּל",
- "cloze_word_start": 64,
- "cloze_word_end": 76,
+ "text": "הָרַגְלַיִם שֶׁלִּי הוֹלְכוֹת וְנִהְיוֹת כְּבֵדוֹת יוֹתֵר וְיוֹתֵר וְהָעֵינַיִם שֶׁלִּי שׂוֹרְפוֹת",
+ "cloze_word_start": 88,
+ "cloze_word_end": 98,
"cloze_hint": null,
"cloze_guid": "BNewm3jYG-"
},
@@ -2214529,7 +2226220,23 @@
],
"confusable_group": null,
"confusables_guid": null,
- "examples": null,
+ "examples": {
+ "vetted": [
+ {
+ "text": "צִלְצְלוּ כְּמוֹ בִּשְׁעַת שְׂרֵפָה",
+ "source": "ilana",
+ "match_method": "direct"
+ }
+ ],
+ "cloze": {
+ "text": "צִלְצְלוּ כְּמוֹ בִּשְׁעַת שְׂרֵפָה",
+ "cloze_word_start": 27,
+ "cloze_word_end": 35,
+ "cloze_hint": null,
+ "cloze_guid": "yv5CgP?1CI"
+ },
+ "rejected_count": 0
+ },
"noun_inflection": {
"singular": {
"nikkud": "שְׂרֵפָה",
@@ -2216993,15 +2228700,15 @@
"examples": {
"vetted": [
{
- "text": "\"אַל תֵּלֵךְ, אֲנִי אֲמַנֶּה אוֹתְךָ לְשַׂר",
- "source": "הנסיך הקטן",
- "match_method": "inflected_prefix"
+ "text": "בְּנוֹסָף עַל כָּל זֶה, הוּא גַּם עוֹנֵב עֲנִיבָה וְנִרְאֶה כְּמוֹ שַׂר בַּמֶּמְשָׁלָה",
+ "source": "time_tunnel_81",
+ "match_method": "inflected"
}
],
"cloze": {
- "text": "\"אַל תֵּלֵךְ, אֲנִי אֲמַנֶּה אוֹתְךָ לְשַׂר",
- "cloze_word_start": 37,
- "cloze_word_end": 43,
+ "text": "בְּנוֹסָף עַל כָּל זֶה, הוּא גַּם עוֹנֵב עֲנִיבָה וְנִרְאֶה כְּמוֹ שַׂר בַּמֶּמְשָׁלָה",
+ "cloze_word_start": 67,
+ "cloze_word_end": 71,
"cloze_hint": null,
"cloze_guid": "l`Hfy.7lQh"
},
@@ -2220021,24 +2231728,14 @@
"examples": {
"vetted": [
{
- "text": "\"הוּא חָשַׁב שֶׁאֲנִי הַמְּשָׁרֶתֶת שֶׁלּוֹ,\" אָמְרָה לְעַצְמָהּ בְּעוֹדָהּ רָצָה",
- "source": "alice_wonderland",
- "match_method": "conjugated_prefix"
- },
- {
- "text": "\"אֵין שׁוּם סוּג שֶׁל תּוֹעֶלֶת בְּלִדְפּוֹק,\" אָמַר הַמְּשָׁרֵת, \"וְזֶה בִּגְלַל שְׁתֵּי סִבּוֹת",
- "source": "alice_wonderland",
- "match_method": "conjugated_prefix"
- },
- {
- "text": "\"כָּל מָה שֶׁמִּתְחַשֵּׁק לָךְ,\" אָמַר הַמְּשָׁרֵת, וְהִתְחִיל לִשְׁרוֹק",
- "source": "alice_wonderland",
- "match_method": "conjugated_prefix"
+ "text": "שְׁנַיִם מֵהֶם אֲפִלּוּ מְשָׁרְתִים אִתִּי בַּיְּחִידָה, אֲבָל הֵם קִבְּלוּ שִׁחְרוּר מֵהַפִּנּוּי",
+ "source": "time_tunnel_77",
+ "match_method": "conjugated"
}
],
"cloze": {
- "text": "\"הוּא חָשַׁב שֶׁאֲנִי הַמְּשָׁרֶתֶת שֶׁלּוֹ,\" אָמְרָה לְעַצְמָהּ בְּעוֹדָהּ רָצָה",
- "cloze_word_start": 22,
+ "text": "שְׁנַיִם מֵהֶם אֲפִלּוּ מְשָׁרְתִים אִתִּי בַּיְּחִידָה, אֲבָל הֵם קִבְּלוּ שִׁחְרוּר מֵהַפִּנּוּי",
+ "cloze_word_start": 24,
"cloze_word_end": 35,
"cloze_hint": null,
"cloze_guid": "gEu3Mv-/1P"
@@ -2221422,6 +2233119,11 @@
"text": "הָיוּ לִי מֵי שְׁתִיָּה בְּקשִׁי לִשְׁמוֹנָה יָמִים",
"source": "little_prince",
"match_method": "direct"
+ },
+ {
+ "text": "\" אֲנִי קוֹרֵא לְשָׁרוֹן וּלְאַדְוָה וּמַצְבִּיעַ עַל מִתְקַן שְׁתִיָּה, \"יֵשׁ כָּאן מַיִם",
+ "source": "time_tunnel_77",
+ "match_method": "direct"
}
],
"cloze": {
@@ -2221566,7 +2233268,23 @@
],
"confusable_group": null,
"confusables_guid": null,
- "examples": null,
+ "examples": {
+ "vetted": [
+ {
+ "text": "הֵם עָרְכוּ מִשְׁתֶּה גָדוֹל לִכְבוֹדוֹ וְשִׂמְּחוּ אֶת אוֹרְחָם",
+ "source": "gulliver",
+ "match_method": "direct"
+ }
+ ],
+ "cloze": {
+ "text": "הֵם עָרְכוּ מִשְׁתֶּה גָדוֹל לִכְבוֹדוֹ וְשִׂמְּחוּ אֶת אוֹרְחָם",
+ "cloze_word_start": 12,
+ "cloze_word_end": 21,
+ "cloze_hint": null,
+ "cloze_guid": "m4S_on7!;-"
+ },
+ "rejected_count": 0
+ },
"noun_inflection": {
"singular": {
"nikkud": "מִשְׁתֶּה",
@@ -2222119,7 +2233837,23 @@
],
"confusable_group": null,
"confusables_guid": null,
- "examples": null,
+ "examples": {
+ "vetted": [
+ {
+ "text": "מִי שָׁתַל אֶת הַגְּפָנִים הָאֵלֶּה, אִם אֵין פֹּה אַף אֶחָד",
+ "source": "time_tunnel_77",
+ "match_method": "conjugated"
+ }
+ ],
+ "cloze": {
+ "text": "מִי שָׁתַל אֶת הַגְּפָנִים הָאֵלֶּה, אִם אֵין פֹּה אַף אֶחָד",
+ "cloze_word_start": 4,
+ "cloze_word_end": 10,
+ "cloze_hint": null,
+ "cloze_guid": "z1,it.SPCa"
+ },
+ "rejected_count": 0
+ },
"noun_inflection": null,
"conjugation": {
"in_conjugation_deck": false,
@@ -2224764,7 +2236498,28 @@
],
"confusable_group": null,
"confusables_guid": null,
- "examples": null,
+ "examples": {
+ "vetted": [
+ {
+ "text": "וְאִם מַנְיָה שֻׁתָּפָה לַגְּנֵבָה, כַּנִּרְאֶה גַּם בַּעֲלָהּ יִשְׂרָאֵל שֻׁתָּף",
+ "source": "time_tunnel_81",
+ "match_method": "direct"
+ },
+ {
+ "text": "וְאִם יִשְׂרָאֵל שׁוֹחַט שֻׁתָּף לַגְּנֵבָה, כַּנִּרְאֶה גַּם לָאִרְגּוּן הַסּוֹדִי 'בַּר גִּיּוֹרָא' יֵשׁ חֵלֶק בַּגְּנֵבָה הַזּוֹ",
+ "source": "time_tunnel_81",
+ "match_method": "direct"
+ }
+ ],
+ "cloze": {
+ "text": "וְאִם מַנְיָה שֻׁתָּפָה לַגְּנֵבָה, כַּנִּרְאֶה גַּם בַּעֲלָהּ יִשְׂרָאֵל שֻׁתָּף",
+ "cloze_word_start": 74,
+ "cloze_word_end": 81,
+ "cloze_hint": null,
+ "cloze_guid": "HGTiX(%0,r"
+ },
+ "rejected_count": 0
+ },
"noun_inflection": {
"singular": {
"nikkud": "שֻׁתָּף",
@@ -2225345,7 +2237100,23 @@
],
"confusable_group": null,
"confusables_guid": null,
- "examples": null,
+ "examples": {
+ "vetted": [
+ {
+ "text": "אֶפְשָׁר לְהַגִּיעַ לְשִׁתּוּפֵי פְּעֻלָּה עִם אֲחֵרִים, אֲבָל לְהִשָּׁעֵן עֲלֵיהֶם",
+ "source": "time_tunnel_81",
+ "match_method": "inflected_prefix"
+ }
+ ],
+ "cloze": {
+ "text": "אֶפְשָׁר לְהַגִּיעַ לְשִׁתּוּפֵי פְּעֻלָּה עִם אֲחֵרִים, אֲבָל לְהִשָּׁעֵן עֲלֵיהֶם",
+ "cloze_word_start": 20,
+ "cloze_word_end": 32,
+ "cloze_hint": null,
+ "cloze_guid": "j{7SB?M21;"
+ },
+ "rejected_count": 0
+ },
"noun_inflection": {
"singular": {
"nikkud": "שִׁתּוּף",
@@ -2227391,27 +2239162,17 @@
"examples": {
"vetted": [
{
- "text": "כִּי הִיא זֹאת שֶׁהִקְשַׁבְתִּי לָהּ כְּשֶׁהִתְלוֹנְנָה אוֹ כְּשֶׁהִתְפָּאֲרָה אוֹ כְּשֶׁסְּתָם שָׁתְקָה",
- "source": "הנסיך הקטן",
- "vetted": true
- },
- {
- "text": "\" לֹא הֵבַנְתִּי אֶת תְּשׁוּבָתוֹ, אַךְ שָׁתַקְתִּי",
- "source": "הנסיך הקטן",
- "vetted": true
- },
- {
- "text": "\" אָז, אַחֲרֵי שְׁתִיקָה אֲרֻכָּה, הוֹסִיף: \"נָחַתִּי לֹא רָחוֹק מִכָּאן",
- "source": "הנסיך הקטן",
- "vetted": true
+ "text": "אֲבָל הַמַּבָּט שֶׁיּוּבַל נָעַץ בִּי שִׁתֵּק אוֹתִי",
+ "source": "time_tunnel_76",
+ "match_method": "conjugated"
}
],
"cloze": {
- "text": "כִּי הִיא זֹאת שֶׁהִקְשַׁבְתִּי לָהּ כְּשֶׁהִתְלוֹנְנָה אוֹ כְּשֶׁהִתְפָּאֲרָה אוֹ כְּשֶׁסְּתָם שָׁתְקָה",
- "cloze_word_start": null,
- "cloze_word_end": null,
+ "text": "אֲבָל הַמַּבָּט שֶׁיּוּבַל נָעַץ בִּי שִׁתֵּק אוֹתִי",
+ "cloze_word_start": 38,
+ "cloze_word_end": 45,
"cloze_hint": null,
- "cloze_guid": "gHh+/})UE$"
+ "cloze_guid": "j7XGH^.>[n"
},
"rejected_count": 0
},
@@ -2228156,8 +2239917,8 @@
"match_method": "direct"
},
{
- "text": "לָכֵן הִנַּחְתִּי עַל הַשֻּׁלְחָן הַצָּמוּד לַמִּטָּה אֶת מַכְשִׁיר הַטֶּלֶפוֹן, וְשָׁלוֹשׁ פְּעָמִים בָּדַקְתִּי שֶׁהוּא לֹא מֻשְׁתָּק",
- "source": "time_tunnel_70",
+ "text": "\" שָׁרוֹן הִשְׁתִּיקָה אוֹתִי, \"אַתָּה רוֹצֶה שֶׁיְּגַלּוּ אוֹתָנוּ",
+ "source": "time_tunnel_76",
"match_method": "conjugated"
}
],
@@ -2231288,6 +2243049,11 @@
"text": "אֵיכְשֶׁהוּ הַשְּׁנַיִם הָאֲחֵרִים תֵּאֲמוּ בֵּינֵיהֶם אֶת הַתְּנוּעוֹת",
"source": "time_tunnel_78",
"match_method": "conjugated"
+ },
+ {
+ "text": "\"בּוֹאוּ נְתָאֵם מַסְלוּלִים,\" שָׁרוֹן מִתְקָרֶבֶת אֵלַי, מִתְנַשֶּׁפֶת, וְאַדְוָה אַחֲרֶיהָ, \"נִגְרֹם לָהֶם לְהִתְפַּצֵּל",
+ "source": "time_tunnel_77",
+ "match_method": "conjugated"
}
],
"cloze": {
@@ -2240510,15 +2252276,15 @@
"examples": {
"vetted": [
{
- "text": "\"זֶה הִתְחִיל עִם הַתָּו,\" הֵשִׁיב הַכּוֹבְעָן",
- "source": "alice_wonderland",
- "match_method": "direct_prefix"
+ "text": "בַּחֹשֶךְ לֹא יָכֹלְתִּי לִרְאוֹת אֶת תָּוֵי פָּנָיו, אֲבָל הַצְּלָלִית שֶׁלּוֹ הָיְתָה צְלָלִית שֶׁל גֶּבֶר",
+ "source": "time_tunnel_75",
+ "match_method": "inflected"
}
],
"cloze": {
- "text": "\"זֶה הִתְחִיל עִם הַתָּו,\" הֵשִׁיב הַכּוֹבְעָן",
- "cloze_word_start": 18,
- "cloze_word_end": 24,
+ "text": "בַּחֹשֶךְ לֹא יָכֹלְתִּי לִרְאוֹת אֶת תָּוֵי פָּנָיו, אֲבָל הַצְּלָלִית שֶׁלּוֹ הָיְתָה צְלָלִית שֶׁל גֶּבֶר",
+ "cloze_word_start": 38,
+ "cloze_word_end": 44,
"cloze_hint": null,
"cloze_guid": "G?)E!|lYQ;"
},
@@ -2242570,25 +2254336,25 @@
"examples": {
"vetted": [
{
- "text": "\" אֲנִי רוֹצֶה לְהַגִּיד לָאִשָּׁה שֶׁכְּאֵבִים לֹא חוֹלְפִים תּוֹךְ שְׁנִיּוֹת",
- "source": "מנהרת הזמן 82",
+ "text": "\"אוּלַי נְשַׁנֶּה נוֹשֵׂא,\" קָטַע אוֹתָם אַרְנַב־הָאָבִיב, תּוֹךְ כְּדֵי פִּהוּק",
+ "source": "alice_wonderland",
"match_method": "inflected"
},
{
- "text": "תּוֹךְ שְׁנִיּוֹת כְּבָר קָשֶׁה לְהַבְחִין בָּהֶם, כִּי הַקָּהָל הָרַב מַסְתִּיר אוֹתָם",
- "source": "מנהרת הזמן 82",
+ "text": "\" צָעַק הַגְּרִיפוֹן, תּוֹךְ נִתּוּר בָּאֲוִיר",
+ "source": "alice_wonderland",
"match_method": "inflected"
},
{
- "text": "\" אֲנִי צוֹעֵק אֶל תּוֹךְ הַבּוֹר, \"אַתְּ שָׁם",
- "source": "מנהרת הזמן 82",
+ "text": "״ שָׁרוֹן לָחֲשָׁה לִי תּוֹךְ כְּדֵי הֲלִיכָה",
+ "source": "time_tunnel_63",
"match_method": "inflected"
}
],
"cloze": {
- "text": "\" אֲנִי רוֹצֶה לְהַגִּיד לָאִשָּׁה שֶׁכְּאֵבִים לֹא חוֹלְפִים תּוֹךְ שְׁנִיּוֹת",
- "cloze_word_start": 62,
- "cloze_word_end": 68,
+ "text": "\"אוּלַי נְשַׁנֶּה נוֹשֵׂא,\" קָטַע אוֹתָם אַרְנַב־הָאָבִיב, תּוֹךְ כְּדֵי פִּהוּק",
+ "cloze_word_start": 59,
+ "cloze_word_end": 65,
"cloze_hint": null,
"cloze_guid": "jZs.V(,["
},
@@ -2243208,20 +2254974,15 @@
"examples": {
"vetted": [
{
- "text": "רוֹאִי הוּא הַבָּא בַּתּוֹר, נִיב צוֹעֵד אַחֲרֵי רוֹאִי, אַחֲרָיו יְהוּדִית וַאֲנִי הַמְּאַסֵּף",
- "source": "time_tunnel_silver_train",
- "match_method": "direct_prefix"
- },
- {
- "text": "בָּרֶנְס נִצֵּל כָּל הִזְדַּמְּנוּת לַעֲצוֹר, וּלְקַצֵּר אֶת הַתּוֹר שֶׁלּוֹ וּלְהַאֲרִיךְ אֶת הַתּוֹר שֶׁל שְׁוַרְץ",
- "source": "time_tunnel_63",
- "match_method": "direct_prefix"
+ "text": "\"כָּל הַשַּׁיָּרָה הַזֹּאת הִיא תּוֹר לָאוֹנִיָּה",
+ "source": "time_tunnel_76",
+ "match_method": "direct"
}
],
"cloze": {
- "text": "רוֹאִי הוּא הַבָּא בַּתּוֹר, נִיב צוֹעֵד אַחֲרֵי רוֹאִי, אַחֲרָיו יְהוּדִית וַאֲנִי הַמְּאַסֵּף",
- "cloze_word_start": 19,
- "cloze_word_end": 27,
+ "text": "\"כָּל הַשַּׁיָּרָה הַזֹּאת הִיא תּוֹר לָאוֹנִיָּה",
+ "cloze_word_start": 32,
+ "cloze_word_end": 37,
"cloze_hint": null,
"cloze_guid": "s;r(reiIcY"
},
@@ -2243411,7 +2255172,28 @@
"shared_roots": [],
"confusable_group": null,
"confusables_guid": null,
- "examples": null,
+ "examples": {
+ "vetted": [
+ {
+ "text": "“הַאִם גַם הֵם אוֹכְלִים עָלִים שֶׁל תּוּת",
+ "source": "silkworms",
+ "match_method": "direct"
+ },
+ {
+ "text": "“הֲלֹא יֵשׁ לָכֶם עֵץ תּוּת בֶּחָצֵר, וְלֹא יִהְיוּ לְךָ דְאָגוֹת שֶׁל חִפּוּשׁ אַחַר מָזוֹן בִּשְׁבִילָן”",
+ "source": "silkworms",
+ "match_method": "direct"
+ }
+ ],
+ "cloze": {
+ "text": "“הַאִם גַם הֵם אוֹכְלִים עָלִים שֶׁל תּוּת",
+ "cloze_word_start": 37,
+ "cloze_word_end": 42,
+ "cloze_hint": null,
+ "cloze_guid": "CD61Qt^j1V"
+ },
+ "rejected_count": 0
+ },
"noun_inflection": {
"singular": {
"nikkud": "תּוּת",
@@ -2243477,6 +2255259,16 @@
"text": "״אַל תִּתְפַּלְּאוּ אִם בְּעוֹד כַּמָּה דַּקּוֹת תִּהְיֶה כָּאן הַפְגָּזָה שֶׁל תּוֹתָחִים",
"source": "time_tunnel_63",
"match_method": "inflected"
+ },
+ {
+ "text": "\"זֶה לֹא הָיָה תּוֹתָח,\" שָׁרוֹן מְדַיֶּקֶת, \"הַמִּנְהָרָה יָרְתָה אוֹתְךָ",
+ "source": "time_tunnel_77",
+ "match_method": "direct"
+ },
+ {
+ "text": "\" \"תּוֹתָח יָרָה אוֹתִי הַחוּצָה",
+ "source": "time_tunnel_77",
+ "match_method": "direct"
}
],
"cloze": {
@@ -2247821,7 +2259613,7 @@
"vetted": [
{
"text": "וְגַם הַמִּגְדָּל שֶׁנִּמְצָא בְּמִתְחַם כְּנֵסִיַּת אוֹגוּסְטָה וִיקְטוֹרְיָה, אֵינֶנּוּ",
- "source": "מנהרת הזמן 82",
+ "source": "time_tunnel_82",
"match_method": "inflected_prefix"
}
],
@@ -2251404,6 +2263196,11 @@
"text": "נִזְכַּרְתִּי שֶׁהוּא הֶחְזִיק תִּיק בַּיָּד",
"source": "time_tunnel_63",
"match_method": "direct"
+ },
+ {
+ "text": "הִיא צִיְּרָה צִיּוּרִים וְשָׁלְחָה וּכְרִיכָה יָפֶה לְסֵפֶר רָקְמָה וְשָׁלְחָה, וְגַם תִּיק קָטָן תָּפְרָה לוֹ בְּעַצְמָהּ",
+ "source": "ilana",
+ "match_method": "direct"
}
],
"cloze": {
@@ -2252386,7 +2264183,23 @@
],
"confusable_group": null,
"confusables_guid": null,
- "examples": null,
+ "examples": {
+ "vetted": [
+ {
+ "text": "הַמְּהוּמוֹת וְהַשְּׁבִיתוֹת אֵינָן מַפְרִיעוֹת לַתַּיָּרִים",
+ "source": "time_tunnel_77",
+ "match_method": "inflected_prefix"
+ }
+ ],
+ "cloze": {
+ "text": "הַמְּהוּמוֹת וְהַשְּׁבִיתוֹת אֵינָן מַפְרִיעוֹת לַתַּיָּרִים",
+ "cloze_word_start": 48,
+ "cloze_word_end": 60,
+ "cloze_hint": null,
+ "cloze_guid": "uZGFqQkrsO"
+ },
+ "rejected_count": 0
+ },
"noun_inflection": {
"singular": {
"nikkud": "תַּיָּר",
@@ -2253113,6 +2264926,11 @@
"text": "יֵשׁ לָהּ תָּכְנִיּוֹת מִשֶּׁלָּהּ, וְהִיא לֹא מִתְחַשֶּׁבֶת בַּלִּמּוּדִים",
"source": "time_tunnel_82",
"match_method": "inflected"
+ },
+ {
+ "text": "\" \"לְמִנְהֶרֶת־הַזְּמַן יֵשׁ כַּנִּרְאֶה תָּכְנִית בִּשְׁבִילֵנוּ,\" שָׁרוֹן הִסְבִּירָה לוֹ",
+ "source": "time_tunnel_76",
+ "match_method": "direct"
}
],
"cloze": {
@@ -2255853,7 +2267671,23 @@
],
"confusable_group": null,
"confusables_guid": null,
- "examples": null,
+ "examples": {
+ "vetted": [
+ {
+ "text": "הֵם הָיוּ עֲסוּקִים כָּל הַזְּמַן בְּתִּכְנוּנֵי בְּרִיחָה אוֹ בַּחֲפִירַת מִנְהֲרוֹת בְּרִיחָה",
+ "source": "time_tunnel_silver_train",
+ "match_method": "inflected_prefix"
+ }
+ ],
+ "cloze": {
+ "text": "הֵם הָיוּ עֲסוּקִים כָּל הַזְּמַן בְּתִּכְנוּנֵי בְּרִיחָה אוֹ בַּחֲפִירַת מִנְהֲרוֹת בְּרִיחָה",
+ "cloze_word_start": 34,
+ "cloze_word_end": 48,
+ "cloze_hint": null,
+ "cloze_guid": "nllq{anbbh"
+ },
+ "rejected_count": 0
+ },
"noun_inflection": {
"singular": {
"nikkud": "תִּכְנוּן",
@@ -2258500,6 +2270334,11 @@
"text": "וְשׁוּב רַגְלַי בָּאֲוִיר, וְגַם הַחֵלֶק הַקִּדְמִי שֶׁל גּוּפִי תָּלוּי בָּאֲוִיר",
"source": "time_tunnel_82",
"match_method": "direct"
+ },
+ {
+ "text": "רַק וִילוֹן אֶחָד, קָרוּעַ, תָּלוּי עַל בְּלִימָה וּמִתְנוֹפֵף",
+ "source": "time_tunnel_77",
+ "match_method": "direct"
}
],
"cloze": {
@@ -2259432,6 +2271271,11 @@
"text": "״ \"אֲנִי תּוֹלֵשׁ מִמֵּךְ אֶת הַפֵּאָה,״ הַשּׁוֹטֵר הִסְבִּיר",
"source": "time_tunnel_silver_train",
"match_method": "conjugated"
+ },
+ {
+ "text": "אַדְוָה לֹא מַמְתִּינָה עַד שֶׁאָשִׁיב, הִיא תּוֹלֶשֶׁת אֶשְׁכּוֹל וּמְבִיאָה לִי",
+ "source": "time_tunnel_77",
+ "match_method": "conjugated"
}
],
"cloze": {
@@ -2259876,6 +2271720,11 @@
"text": "אֲבָל לַמְרוֹת מַאֲמַצָּיו, וְלַמְרוֹת צַעֲקוֹתֶיהָ שֶׁל יְהוּדִית, לֹא נִתְלְשָׁה שׁוּם פֵּאָה",
"source": "time_tunnel_silver_train",
"match_method": "conjugated"
+ },
+ {
+ "text": "נִדְמֶה לִי שֶׁאִם לֹא אֶשְׁכַּב, הָרֹאשׁ שֶׁלִּי פָּשׁוּט יִתָּלֵשׁ וְיִפֹּל לְאָחוֹר",
+ "source": "time_tunnel_81",
+ "match_method": "conjugated"
}
],
"cloze": {
@@ -2261120,7 +2272969,33 @@
],
"confusable_group": null,
"confusables_guid": null,
- "examples": null,
+ "examples": {
+ "vetted": [
+ {
+ "text": "\" אֲנִי פּוֹנֶה אֶל הַבָּחוּר הַמְּתֻלְתָּל, מְנַסֶּה לְהַרְוִיחַ זְמַן וְלִמְצוֹא שֵׁם",
+ "source": "time_tunnel_76",
+ "match_method": "direct_prefix"
+ },
+ {
+ "text": "\" הַבָּחוּר הַמְּתֻלְתָּל מְדַפְדֵּף בַּעֲרֵמַת הַדַּפִּים עַד שֶׁהוּא מַגִּיעַ אֶל הָאוֹת פ'",
+ "source": "time_tunnel_76",
+ "match_method": "direct_prefix"
+ },
+ {
+ "text": "אֲבָל לֹא, הַמְּתֻלְתָּל לֹא זִהָה אוֹתִי",
+ "source": "time_tunnel_76",
+ "match_method": "direct_prefix"
+ }
+ ],
+ "cloze": {
+ "text": "\" אֲנִי פּוֹנֶה אֶל הַבָּחוּר הַמְּתֻלְתָּל, מְנַסֶּה לְהַרְוִיחַ זְמַן וְלִמְצוֹא שֵׁם",
+ "cloze_word_start": 30,
+ "cloze_word_end": 43,
+ "cloze_hint": null,
+ "cloze_guid": "z[ZHnouAZ="
+ },
+ "rejected_count": 0
+ },
"noun_inflection": null,
"conjugation": null,
"adjective_inflection": {
@@ -2265891,7 +2277766,28 @@
],
"confusable_group": null,
"confusables_guid": null,
- "examples": null,
+ "examples": {
+ "vetted": [
+ {
+ "text": "\" אֲנִי מִתַּמֵּם, \"מָה מוּזָר בְּזֶה שֶׁהַבַּיִת רֵיק",
+ "source": "time_tunnel_77",
+ "match_method": "conjugated"
+ },
+ {
+ "text": "״ יִשְׂרָאֵל שׁוֹחַט מִתַּמֵּם",
+ "source": "time_tunnel_81",
+ "match_method": "conjugated"
+ }
+ ],
+ "cloze": {
+ "text": "\" אֲנִי מִתַּמֵּם, \"מָה מוּזָר בְּזֶה שֶׁהַבַּיִת רֵיק",
+ "cloze_word_start": 8,
+ "cloze_word_end": 17,
+ "cloze_hint": null,
+ "cloze_guid": "o%uu#|xHRb"
+ },
+ "rejected_count": 0
+ },
"noun_inflection": null,
"conjugation": {
"in_conjugation_deck": false,
@@ -2266317,7 +2278213,23 @@
],
"confusable_group": null,
"confusables_guid": null,
- "examples": null,
+ "examples": {
+ "vetted": [
+ {
+ "text": "1 אִישׁ תָּם הָיָה רַבִּי יוֹסֵי וְלֺא יָדַע אֶת נֶפֶשׁ עִזָּיו, וְהוּא אָמַר בִּדְאָגָה: — בָּא הַקֵּץ",
+ "source": "kid_on_mountain",
+ "match_method": "direct"
+ }
+ ],
+ "cloze": {
+ "text": "1 אִישׁ תָּם הָיָה רַבִּי יוֹסֵי וְלֺא יָדַע אֶת נֶפֶשׁ עִזָּיו, וְהוּא אָמַר בִּדְאָגָה: — בָּא הַקֵּץ",
+ "cloze_word_start": 8,
+ "cloze_word_end": 12,
+ "cloze_hint": null,
+ "cloze_guid": "unF6#!fo%?"
+ },
+ "rejected_count": 0
+ },
"noun_inflection": null,
"conjugation": null,
"adjective_inflection": {
diff --git a/epub_examples.py b/epub_examples.py
index 043c144..4e14e88 100644
--- a/epub_examples.py
+++ b/epub_examples.py
@@ -29,7 +29,7 @@ WORDS_JSON = DATA_DIR / "words.json"
 # Book metadata: filename -> display name
 def _discover_epubs() -> dict[str, str]:
-    """Auto-discover all .epub files in EPUB_DIR, returning {filepath: display_name}."""
+    """Auto-discover all .epub and .txt files in EPUB_DIR, returning {filepath: display_name}."""
     if not EPUB_DIR.exists():
         return {}
     books: dict[str, str] = {}
@@ -50,6 +50,9 @@ def _discover_epubs() -> dict[str, str]:
         else:
             name = stem_stripped[:40]
         books[str(path)] = name
+    # Also discover plain-text files (e.g. Ben Yehuda downloads)
+    for path in sorted(EPUB_DIR.glob("*.txt")):
+        books[str(path)] = path.stem
     return books
@@ -196,6 +199,20 @@ def extract_sentences_from_epub(epub_path: Path, book_name: str) -> list[dict]:
     return _split_into_sentences(full_text, book_name)
+def extract_sentences_from_text(text_path: Path, book_name: str) -> list[dict]:
+    """Extract sentences from a plain-text file (e.g. Ben Yehuda downloads).
+
+    Args:
+        text_path: Path to the .txt file.
+        book_name: Human-readable book name used as the ``source`` field.
+
+    Returns:
+        List of ``{"text": str, "source": str}`` dicts.
+    """
+    full_text = text_path.read_text(encoding="utf-8")
+    return _split_into_sentences(full_text, book_name)
+
+
# ── Sentence splitting ───────────────────────────────────────────
# Hebrew sentence terminators: period, exclamation, question mark, sof pasuk
@@ -480,7 +497,12 @@ def _build_nikkud_index(words: dict) -> dict[str, list[tuple[str, str]]]:
         for field in ("singular", "plural", "construct_singular", "construct_plural"):
             sub = noun.get(field) or {}
-            _add(sub.get("nikkud"), unique_key, "inflected")
+            form = sub.get("nikkud")
+            _add(form, unique_key, "inflected")
+            # Index construct forms without maqaf too — modern text often
+            # writes smichut as two space-separated words without maqaf
+            if form and form.endswith("־"):
+                _add(form[:-1], unique_key, "inflected")
         pronominal = noun.get("pronominal_suffixes") or {}
         for _person, sub in pronominal.items():
@@ -720,7 +742,10 @@ def run(words: dict) -> dict:
     for filepath, book_name in _discover_epubs().items():
         path = Path(filepath)
-        sentences = extract_sentences_from_epub(path, book_name)
+        if path.suffix == ".txt":
+            sentences = extract_sentences_from_text(path, book_name)
+        else:
+            sentences = extract_sentences_from_epub(path, book_name)
         book_counts[book_name] = len(sentences)
         all_sentences.extend(sentences)
         logger.info(f" {book_name}: {len(sentences)} sentences")
diff --git a/pealim_detail_scrape.py b/pealim_detail_scrape.py
index 36730ba..e3c83ea 100644
--- a/pealim_detail_scrape.py
+++ b/pealim_detail_scrape.py
@@ -459,15 +459,29 @@ def _parse_noun_gender_mishkal(soup: BeautifulSoup) -> tuple[str, str]:
"""
     Extract (gender, mishkal) from the PoS section of the detail page.
     Returns ("masculine"|"feminine"|"", mishkal_english|"").
+
+    Pealim HTML structure:
+        <p>Noun – <a href="...nm=..."><i>ketel</i> pattern</a>, masculine</p>
+    The mishkal is in the <i> tag (k-notation, e.g. "ketel") or the nm= URL param (q-notation).
+    Some nouns have no mishkal link: <p>Noun – masculine</p>
     """
     gender = ""
     mishkal = ""
-    # Try various selectors that pealim uses for PoS info
-    pos_section = soup.find("div", class_="pos") or soup.find("p", class_="pos")
+    # Find the PoS tag: on pealim detail pages it's a bare <p> like
+    # "Noun – ketel pattern, masculine" or "Adjective – katul pattern"
+    pos_section = None
+    for p in soup.find_all("p"):
+        text = p.get_text(" ", strip=True)
+        if re.match(r"^(Noun|Adjective)\b", text):
+            pos_section = p
+            break
+
+    # Fall back to older selectors (div.pos, p.pos, div.page-header)
     if not pos_section:
-        # Look for it in the page header area
-        pos_section = soup.find("div", class_="page-header")
+        pos_section = (
+            soup.find("div", class_="pos") or soup.find("p", class_="pos") or soup.find("div", class_="page-header")
+        )
     if pos_section:
         text = pos_section.get_text(" ", strip=True)
@@ -476,13 +490,21 @@ def _parse_noun_gender_mishkal(soup: BeautifulSoup) -> tuple[str, str]:
             if raw in text.lower():
                 gender = canonical
                 break
-        # Mishkal detection: look for CaCaC-style patterns
-        mishkal_match = re.search(r"\b([A-Z][a-zA-Z\']+)\b", text)
-        if mishkal_match:
-            candidate = mishkal_match.group(1)
-            # Validate: mishkal names contain uppercase letters in CaCaC pattern
-            if re.match(r"^[A-Za-z\']+$", candidate) and any(c.isupper() for c in candidate):
-                mishkal = candidate
+
+        # Mishkal detection: extract from <a href="...nm=..."><i>YYYY</i> pattern</a>
+        # Nouns use nm= param, adjectives use am= param
+        mishkal_link = pos_section.find("a", href=re.compile(r"[na]m="))
+        if mishkal_link:
+            # Prefer <i> tag text (k-notation, matches _MISHKAL_HEBREW_Q after k→q)
+            i_tag = mishkal_link.find("i")
+            if i_tag:
+                mishkal = i_tag.get_text(strip=True)
+            else:
+                # Fall back to the nm=/am= URL parameter (already q-notation)
+                href = mishkal_link.get("href", "")
+                nm_match = re.search(r"[na]m=([a-zA-Z']+)", href)
+                if nm_match:
+                    mishkal = nm_match.group(1)
     # Also check the og:description or breadcrumbs for gender
     if not gender:
diff --git a/scripts/validate_data.py b/scripts/validate_data.py
index 5ce760d..9b348ae 100644
--- a/scripts/validate_data.py
+++ b/scripts/validate_data.py
@@ -685,6 +685,61 @@ def test_no_stripped_form_sentence_collisions(data: dict[str, Any]) -> None:
_pass(name)
+def test_no_hebrew_in_meaning(data: dict[str, Any]) -> None:
+    """English meanings must not contain bare Hebrew text (spoils the card)."""
+    name = "no_hebrew_in_meaning"
+    errors: list[str] = []
+    hebrew_re = re.compile(r"[\u05D0-\u05EA]")
+
+    for key, entry in data.items():
+        meaning = entry.get("meaning") or ""
+        # Apply same cleaning pipeline as apkg_builder
+        cleaned = re.sub(r"[\u0590-\u05FF][\u0590-\u05FF\u0591-\u05C7\s\-]*", "", meaning)
+        cleaned = re.sub(r"\s{2,}", " ", cleaned).strip(", ;:")
+        if hebrew_re.search(cleaned):
+            errors.append(f"[{key}] meaning still contains Hebrew after cleaning: {cleaned!r}")
+
+    if errors:
+        _fail(name, errors[:20] if not _verbose else errors)
+        if len(errors) > 20 and not _verbose:
+            print(f" ... ({len(errors) - 20} more; use --verbose)")
+    else:
+        _pass(name)
+
+
+def test_mishkal_consistency(data: dict[str, Any]) -> None:
+    """mishkal_hebrew must match mishkal via _mishkal_to_hebrew conversion."""
+    name = "mishkal_consistency"
+    errors: list[str] = []
+
+    try:
+        from pealim_detail_scrape import _mishkal_to_hebrew
+    except ImportError:
+        _warn(name, ["Could not import _mishkal_to_hebrew — skipping"])
+        return
+
+    for key, entry in data.items():
+        for infl_key in ("noun_inflection", "adjective_inflection"):
+            infl = entry.get(infl_key)
+            if not infl:
+                continue
+            mishkal_eng = infl.get("mishkal") or ""
+            mishkal_heb = infl.get("mishkal_hebrew") or ""
+            if mishkal_eng and mishkal_heb:
+                expected = _mishkal_to_hebrew(mishkal_eng) or ""
+                if expected and expected != mishkal_heb:
+                    errors.append(f"[{key}] {infl_key}: {mishkal_eng}→{mishkal_heb} (expected {expected})")
+            if mishkal_heb and not mishkal_eng:
+                errors.append(f"[{key}] {infl_key}: has mishkal_hebrew but no mishkal")
+
+    if errors:
+        _fail(name, errors[:20] if not _verbose else errors)
+        if len(errors) > 20 and not _verbose:
+            print(f" ... ({len(errors) - 20} more; use --verbose)")
+    else:
+        _pass(name)
+
+
# ---------------------------------------------------------------------------
# Stats summary
# ---------------------------------------------------------------------------
@@ -702,6 +757,11 @@ def print_stats(data: dict[str, Any]) -> None:
     with_guid = sum(1 for e in data.values() if e.get("vocab_legacy_guid"))
     in_confusable = sum(1 for e in data.values() if e.get("confusable_group"))
     with_shared_roots = sum(1 for e in data.values() if e.get("shared_roots"))
+    with_mishkal = sum(
+        1
+        for e in data.values()
+        if (e.get("noun_inflection") or {}).get("mishkal") or (e.get("adjective_inflection") or {}).get("mishkal")
+    )
     print()
     print("Stats Summary")
@@ -709,6 +769,7 @@ def print_stats(data: dict[str, Any]) -> None:
print(f" Total entries: {total:>6}")
print(f" With conjugation data: {with_conj:>6}")
print(f" With noun_inflection: {with_noun_inf:>6}")
+ print(f" With mishkal: {with_mishkal:>6}")
print(f" With vetted examples: {with_vetted:>6}")
print(f" With cloze examples: {with_cloze:>6}")
print(f" With images: {with_image:>6}")
@@ -740,6 +801,8 @@ ALL_TESTS: dict[str, Any] = {
"conjugation_form_guids": test_conjugation_form_guids,
"conjugation_person_codes": test_conjugation_person_codes,
"no_stripped_form_sentence_collisions": test_no_stripped_form_sentence_collisions,
+ "no_hebrew_in_meaning": test_no_hebrew_in_meaning,
+ "mishkal_consistency": test_mishkal_consistency,
}
diff --git a/tests/test_apkg_builder.py b/tests/test_apkg_builder.py
new file mode 100644
index 0000000..9a18dbb
--- /dev/null
+++ b/tests/test_apkg_builder.py
@@ -0,0 +1,246 @@
+"""Unit tests for apkg_builder — Sprint 15 learnings.
+
+Tests cover: cloze prefix preservation, Hebrew spoiler stripping from English
+meanings, PoS exact matching, gender field population, and mishkal data integrity.
+"""
+
+import json
+import re
+import sys
+from pathlib import Path
+
+import pytest
+
+# Ensure project root is on path
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+
+from apkg_builder import _categorize_pos, _cloze_prefix_len
+
+# ---------------------------------------------------------------------------
+# Cloze prefix preservation
+# ---------------------------------------------------------------------------
+
+
+class TestClozePrefix:
+    """_cloze_prefix_len must detect Hebrew prefix letters before the word."""
+
+    def test_single_prefix_bet(self):
+        # בַּתּוֹר = bet + patach + tor
+        assert _cloze_prefix_len("בַּתּוֹר", "תּוֹר") > 0
+
+    def test_single_prefix_lamed(self):
+        # לַמֶּלֶךְ = lamed + patach + melech
+        assert _cloze_prefix_len("לַמֶּלֶךְ", "מֶּלֶךְ") > 0
+
+    def test_two_consonant_prefix(self):
+        # שֶׁבַּתּוֹר = shin + bet + tor (two prefix letters)
+        token = "שֶׁבַּתּוֹר"
+        word = "תּוֹר"
+        prefix_len = _cloze_prefix_len(token, word)
+        assert prefix_len > 0
+        assert token[prefix_len:].startswith(word)
+
+    def test_no_prefix_direct_match(self):
+        # Word appears at start — no prefix
+        assert _cloze_prefix_len("תּוֹר", "תּוֹר") == 0
+
+    def test_empty_inputs(self):
+        assert _cloze_prefix_len("", "תּוֹר") == 0
+        assert _cloze_prefix_len("בַּתּוֹר", "") == 0
+        assert _cloze_prefix_len("", "") == 0
+
+    def test_non_prefix_letter_returns_zero(self):
+        # If the "prefix" chars aren't valid prefix letters, return 0
+        # 'ת' is not in _PREFIX_LETTERS (בהוכלמש)
+        assert _cloze_prefix_len("תַּתּוֹר", "תּוֹר") == 0
+
+    def test_prefix_preserves_nikkud(self):
+        # Verify that prefix_len includes nikkud marks
+        token = "בַּתּוֹר"
+        word = "תּוֹר"
+        prefix_len = _cloze_prefix_len(token, word)
+        prefix = token[:prefix_len]
+        # Prefix should contain at least bet + nikkud mark(s)
+        base_letters = [c for c in prefix if "\u05d0" <= c <= "\u05ea"]
+        assert base_letters == ["ב"]
+
+
+# ---------------------------------------------------------------------------
+# PoS exact matching (no substring collisions)
+# ---------------------------------------------------------------------------
+
+
+class TestCategorizePos:
+    """_categorize_pos must not let 'Pronoun' match 'Noun'."""
+
+    def test_noun_exact(self):
+        assert _categorize_pos("Noun") == "Noun"
+
+    def test_pronoun_is_other(self):
+        assert _categorize_pos("Pronoun") == "Other"
+
+    def test_verb_exact(self):
+        assert _categorize_pos("Verb") == "Verb"
+
+    def test_noun_with_dash(self):
+        assert _categorize_pos("Noun – masculine") == "Noun"
+
+    def test_adjective(self):
+        assert _categorize_pos("Adjective") == "Adjective"
+
+    def test_conjunction_is_other(self):
+        assert _categorize_pos("Conjunction") == "Other"
+
+
+# ---------------------------------------------------------------------------
+# Hebrew spoiler stripping from English meanings
+# ---------------------------------------------------------------------------
+
+
+class TestHebrewSpoilerStripping:
+    """English meanings must not contain Hebrew text (spoils the card)."""
+
+    # Same pattern as the one in apkg_builder.py
+    HEBREW_STRIP_RE = re.compile(r"[\u0590-\u05FF][\u0590-\u05FF\u0591-\u05C7\s\-]*")
+
+    @classmethod
+    def _strip_hebrew(cls, meaning: str) -> str:
+        """Replicate the meaning cleaning pipeline from build_vocab_deck."""
+        meaning = cls.HEBREW_STRIP_RE.sub("", meaning)
+        meaning = re.sub(r"[;:]\s*—", " —", meaning)
+        meaning = re.sub(r";\s*:", ";", meaning)
+        return re.sub(r"\s{2,}", " ", meaning).strip(", ;:")
+
+    def test_pure_english_unchanged(self):
+        assert self._strip_hebrew("to eat, to consume") == "to eat, to consume"
+
+    def test_hebrew_word_removed(self):
+        result = self._strip_hebrew("to eat; אכל")
+        assert "אכל" not in result
+
+    def test_hebrew_with_nikkud_removed(self):
+        result = self._strip_hebrew("tall; גָּבוֹהַּ")
+        assert "גָּבוֹהַּ" not in result
+        assert "tall" in result
+
+    def test_no_residual_hebrew_in_real_data(self):
+        """Scan actual words.json: no meaning should contain Hebrew after stripping."""
+        words_path = Path(__file__).resolve().parent.parent / "data" / "words.json"
+        if not words_path.exists():
+            pytest.skip("words.json not available")
+
+        with open(words_path, encoding="utf-8") as f:
+            words = json.load(f)
+
+        # Any Hebrew letter (alef through tav) left after stripping is a spoiler
+        hebrew_re = re.compile(r"[\u05D0-\u05EA]")
+        spoilers = []
+        for key, entry in words.items():
+            meaning = entry.get("meaning") or ""
+            cleaned = self._strip_hebrew(meaning)
+            if hebrew_re.search(cleaned):
+                spoilers.append(f"{key}: {cleaned!r}")
+
+        assert not spoilers, f"Hebrew found in {len(spoilers)} meanings after stripping: {spoilers[:5]}"
+
+
+# ---------------------------------------------------------------------------
+# Gender field for nouns (words.json data integrity)
+# ---------------------------------------------------------------------------
+
+
+class TestGenderDataIntegrity:
+    """Nouns with noun_inflection should have gender populated."""
+
+    @pytest.fixture()
+    def words(self):
+        words_path = Path(__file__).resolve().parent.parent / "data" / "words.json"
+        if not words_path.exists():
+            pytest.skip("words.json not available")
+        with open(words_path, encoding="utf-8") as f:
+            return json.load(f)
+
+    def test_nouns_have_gender(self, words):
+        """Nouns with noun_inflection should have a valid gender."""
+        missing = []
+        for key, entry in words.items():
+            pos = entry.get("pos") or ""
+            ni = entry.get("noun_inflection")
+            if pos.startswith("Noun") and ni:
+                gender = ni.get("gender") or ""
+                if gender not in ("masculine", "feminine", "masculine and feminine"):
+                    missing.append(f"{key}: gender={gender!r}")
+
+        # Tolerate fewer than 7% missing (loan words, compound words, etc.)
+        noun_count = sum(
+            1 for e in words.values() if (e.get("pos") or "").startswith("Noun") and e.get("noun_inflection")
+        )
+        if noun_count > 0:
+            pct_missing = len(missing) / noun_count
+            assert pct_missing < 0.07, f"{len(missing)}/{noun_count} nouns missing gender: {missing[:10]}"
+
+
+# ---------------------------------------------------------------------------
+# Mishkal data integrity
+# ---------------------------------------------------------------------------
+
+
+class TestMishkalIntegrity:
+    """Validate mishkal data consistency in words.json."""
+
+    @pytest.fixture()
+    def words(self):
+        words_path = Path(__file__).resolve().parent.parent / "data" / "words.json"
+        if not words_path.exists():
+            pytest.skip("words.json not available")
+        with open(words_path, encoding="utf-8") as f:
+            return json.load(f)
+
+    def test_mishkal_hebrew_matches_english(self, words):
+        """If mishkal and mishkal_hebrew are both set, they should correspond via _mishkal_to_hebrew."""
+        from pealim_detail_scrape import _mishkal_to_hebrew
+
+        mismatches = []
+        for key, entry in words.items():
+            for infl_key in ("noun_inflection", "adjective_inflection"):
+                infl = entry.get(infl_key)
+                if not infl:
+                    continue
+                mishkal_eng = infl.get("mishkal") or ""
+                mishkal_heb = infl.get("mishkal_hebrew") or ""
+                if mishkal_eng and mishkal_heb:
+                    expected = _mishkal_to_hebrew(mishkal_eng) or ""
+                    if expected and expected != mishkal_heb:
+                        mismatches.append(f"{key}: {mishkal_eng}→{mishkal_heb} (expected {expected})")
+
+        assert not mismatches, f"{len(mismatches)} mishkal mismatches: {mismatches[:10]}"
+
+    def test_mishkal_hebrew_is_hebrew(self, words):
+        """mishkal_hebrew must contain Hebrew characters."""
+        hebrew_re = re.compile(r"[\u05D0-\u05EA]")
+        bad = []
+        for key, entry in words.items():
+            for infl_key in ("noun_inflection", "adjective_inflection"):
+                infl = entry.get(infl_key)
+                if not infl:
+                    continue
+                mishkal_heb = infl.get("mishkal_hebrew") or ""
+                if mishkal_heb and not hebrew_re.search(mishkal_heb):
+                    bad.append(f"{key}: mishkal_hebrew={mishkal_heb!r}")
+
+        assert not bad, f"{len(bad)} non-Hebrew mishkal_hebrew values: {bad[:10]}"
+
+    def test_no_orphaned_mishkal(self, words):
+        """If mishkal_hebrew is set, mishkal (English) must also be set."""
+        orphans = []
+        for key, entry in words.items():
+            for infl_key in ("noun_inflection", "adjective_inflection"):
+                infl = entry.get(infl_key)
+                if not infl:
+                    continue
+                mishkal_heb = infl.get("mishkal_hebrew") or ""
+                mishkal_eng = infl.get("mishkal") or ""
+                if mishkal_heb and not mishkal_eng:
+                    orphans.append(f"{key}: has mishkal_hebrew but no mishkal")
+
+        assert not orphans, f"{len(orphans)} orphaned mishkal_hebrew: {orphans[:10]}"