moj-2024/lab/06_Biblioteki_stat_LM.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h1> Modelowanie Języka</h1>\n",
"<h2> 6. Biblioteki do statystycznych modeli językowych [ćwiczenia]</h2> "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### KENLM\n",
"\n",
"W praktyce korzysta się z gotowych bibliotek do statystycznych modeli językowych. Najbardziej popularną biblioteką jest KENLM ( https://kheafield.com/papers/avenue/kenlm.pdf ). Repozytorium znajduje się https://github.com/kpu/kenlm a dokumentacja https://kheafield.com/code/kenlm/\n",
"\n",
"Na komputerach wydziałowych nie powinno być problemu ze skompilowaniem biblioteki.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Instalacja\n",
"\n",
"(Zob. też dokumentacja)\n",
"\n",
" sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev\n",
" wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz\n",
" mkdir kenlm/build\n",
" cd kenlm/build\n",
" cmake ..\n",
" make -j2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Najprostszy scenariusz użycia"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"KENLM_BUILD_PATH='/home/pawel/kenlm/build' # ścieżka, w której jest zainstalowany KenLM (zob. dokumentacja - link powyżej)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2024-04-10 12:13:27-- https://wolnelektury.pl/media/book/txt/lalka-tom-pierwszy.txt\n",
"Resolving wolnelektury.pl (wolnelektury.pl)... 51.83.143.148, 2001:41d0:602:3294::\n",
"Connecting to wolnelektury.pl (wolnelektury.pl)|51.83.143.148|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 860304 (840K) [text/plain]\n",
"Saving to: lalka-tom-pierwszy.txt.1\n",
"\n",
"lalka-tom-pierwszy. 100%[===================>] 840.14K 3.59MB/s in 0.2s \n",
"\n",
"2024-04-10 12:13:27 (3.59 MB/s) - lalka-tom-pierwszy.txt.1 saved [860304/860304]\n",
"\n"
]
}
],
"source": [
"!wget https://wolnelektury.pl/media/book/txt/lalka-tom-pierwszy.txt"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2024-04-10 12:13:30-- https://wolnelektury.pl/media/book/txt/lalka-tom-drugi.txt\n",
"Resolving wolnelektury.pl (wolnelektury.pl)... 51.83.143.148, 2001:41d0:602:3294::\n",
"Connecting to wolnelektury.pl (wolnelektury.pl)|51.83.143.148|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 949497 (927K) [text/plain]\n",
"Saving to: lalka-tom-drugi.txt.1\n",
"\n",
"lalka-tom-drugi.txt 100%[===================>] 927.24K 3.39MB/s in 0.3s \n",
"\n",
"2024-04-10 12:13:30 (3.39 MB/s) - lalka-tom-drugi.txt.1 saved [949497/949497]\n",
"\n"
]
}
],
"source": [
"!wget https://wolnelektury.pl/media/book/txt/lalka-tom-drugi.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### budowa modelu"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"=== 1/5 Counting and sorting n-grams ===\n",
"Reading /home/pawel/moj-2024/lab/lalka-tom-pierwszy.txt\n",
"----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"****************************************************************************************************\n",
"Unigram tokens 122871 types 33265\n",
"=== 2/5 Calculating and sorting adjusted counts ===\n",
"Chain sizes: 1:399180 2:2261987584 3:4241227008 4:6785963520\n",
"Statistics:\n",
"1 33265 D1=0.737356 D2=1.15675 D3+=1.59585\n",
"2 93948 D1=0.891914 D2=1.20314 D3+=1.44945\n",
"3 115490 D1=0.964904 D2=1.40636 D3+=1.66751\n",
"4 116433 D1=0.986444 D2=1.50367 D3+=1.9023\n",
"Memory estimate for binary LM:\n",
"type kB\n",
"probing 7800 assuming -p 1.5\n",
"probing 9157 assuming -r models -p 1.5\n",
"trie 3902 without quantization\n",
"trie 2378 assuming -q 8 -b 8 quantization \n",
"trie 3649 assuming -a 22 array pointer compression\n",
"trie 2125 assuming -a 22 -q 8 -b 8 array pointer compression and quantization\n",
"=== 3/5 Calculating and sorting initial probabilities ===\n",
"Chain sizes: 1:399180 2:1503168 3:2309800 4:2794392\n",
"----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n",
"####################################################################################################\n",
"=== 4/5 Calculating and writing order-interpolated probabilities ===\n",
"Chain sizes: 1:399180 2:1503168 3:2309800 4:2794392\n",
"----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n",
"####################################################################################################\n",
"=== 5/5 Writing ARPA model ===\n",
"----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n",
"****************************************************************************************************\n",
"Name:lmplz\tVmPeak:13142592 kB\tVmRSS:7564 kB\tRSSMax:2623832 kB\tuser:0.28374\tsys:1.02734\tCPU:1.3111\treal:1.25256\n"
]
}
],
"source": [
"!$KENLM_BUILD_PATH/bin/lmplz -o 4 < lalka-tom-pierwszy.txt > lalka_tom_pierwszy_lm.arpa"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## plik arpa\n",
"\n",
"Powyższa komenda tworzy model językowy z wygładzaniem i zapisuje go do pliku tekstowego arpa. Parametr -o 4 odpowiada za maksymalną ilość n-gramów w modelu: 4-gramy.\n",
"\n",
"Plik arpa zawiera w sobie prawdopodobieństwa dla poszczególnych n-gramów. W zasadzie są to logarytmy prawdopodbieństw o podstawie 10.\n",
"\n",
"Podejrzyjmy plik arpa:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\\data\\\n",
"ngram 1=33265\n",
"ngram 2=93948\n",
"ngram 3=115490\n",
"ngram 4=116433\n",
"\n",
"\\1-grams:\n",
"-5.0133595\t<unk>\t0\n",
"0\t<s>\t-0.99603957\n",
"-1.4302719\t</s>\t0\n",
"-4.7287908\tBolesław\t-0.049677044\n",
"-4.9033437\tPrus\t-0.049677044\n",
"-4.9033437\tLalka\t-0.049677044\n",
"-4.9033437\tISBN\t-0.049677044\n",
"-4.9033437\t978-83-288-2673-1\t-0.049677044\n",
"-4.9033437\tTom\t-0.049677044\n",
"-3.0029354\tI\t-0.17544968\n",
"-4.9033437\tI.\t-0.049677044\n",
"-3.5526814\tJak\t-0.1410632\n",
"-3.8170912\twygląda\t-0.16308141\n",
"-4.608305\tfirma\t-0.049677044\n",
"-4.33789\tJ.\t-0.3295009\n",
"-3.9192266\tMincel\t-0.12910372\n",
"-1.624716\ti\t-0.20128249\n",
"-4.1086636\tS.\t-0.098223634\n",
"-2.6843808\tWokulski\t-0.19202113\n",
"-2.8196363\tprzez\t-0.15214005\n",
"-4.9033437\tszkło\t-0.049677044\n",
"-4.9033437\tbutelek?\t-0.049677044\n",
"-2.848008\tW\t-0.19964235\n"
]
}
],
"source": [
"!head -n 30 lalka_tom_pierwszy_lm.arpa"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Linijka to kolejno: prawdopodobieństwo (log10), n-gram, waga back-off (log10).\n",
"\n",
"Aby spradzić prawdopodobieństwo sekwencji (a także PPL modelu) należy użyć komendy query"
]
},
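{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (plain Python, not part of the KenLM API), the unigram entries can be read back and the log10 values converted to ordinary probabilities. A minimal sketch, assuming the file lalka_tom_pierwszy_lm.arpa produced above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Parse a few unigram entries: log10(p) <TAB> n-gram [<TAB> log10(backoff)]\n",
"with open('lalka_tom_pierwszy_lm.arpa') as f:\n",
"    lines = f.read().splitlines()\n",
"\n",
"start = lines.index('\\\\1-grams:') + 1  # unigram section starts right after the \\\\1-grams: header\n",
"for line in lines[start:start + 5]:\n",
"    fields = line.split('\\t')\n",
"    logprob, ngram = float(fields[0]), fields[1]\n",
"    print(f'{ngram!r}: log10 p = {logprob}, p = {10 ** logprob:.6f}')"
]
},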
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"test_str=!(head -n 17 lalka-tom-drugi.txt | tail -n 1)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"test_str = test_str[0]"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Sytuacja polityczna jest tak niepewna, że wcale by mnie nie zdziwiło, gdyby około grudnia wybuchła wojna.'"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_str"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sytuacja=0 1 -6.009399\tpolityczna=21766 1 -4.9033437\tjest=123 1 -2.6640298\ttak=231 2 -1.7683144\tniepewna,=0 1 -5.1248584\tże=122 1 -2.1651394\twcale=5123 1 -4.167491\tby=1523 1 -3.55168\tmnie=2555 2 -1.6694618\tnie=127 2 -1.4439836\tzdziwiło,=0 1 -5.2158937\tgdyby=814 1 -3.2300434\tokoło=1462 1 -3.7384818\tgrudnia=0 1 -5.123236\twybuchła=0 1 -5.0133595\twojna.=1285 1 -4.9033437\t</s>=2 2 -0.8501559\tTotal: -61.54222 OOV: 5\n",
"Perplexity including OOVs:\t4169.948113875898\n",
"Perplexity excluding OOVs:\t834.2371454470355\n",
"OOVs:\t5\n",
"Tokens:\t17\n"
]
}
],
"source": [
"!echo $test_str | $KENLM_BUILD_PATH/bin/query lalka_tom_pierwszy_lm.arpa 2> /dev/null"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zgodnie z dokumentacją polecenia query, format wyjściowy to dla każdego słowa:\n",
" \n",
"word=vocab_id ngram_length log10(p(word|context))"
]
},
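{
"cell_type": "markdown",
"metadata": {},
"source": [
"The perplexity reported by query is simply 10 raised to the negative average per-word log10 probability. A minimal check using the Total and Tokens values printed above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Perplexity = 10 ** (-total_log10 / number_of_scored_words); the Tokens count reported by query includes </s>\n",
"total_log10 = -61.54222\n",
"tokens = 17\n",
"print(10 ** (-total_log10 / tokens))  # should match 'Perplexity including OOVs' above (~4169.95)"
]
},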
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A co jeśli trochę zmienimy początek zdania?"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"test2_str = \"Lubię placki i wcale by mnie nie zdziwiło, gdyby około grudnia wybuchła wojna.\""
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Lubię=17813 1 -5.899383\tplacki=0 1 -5.0630364\ti=16 1 -1.624716\twcale=5123 2 -3.2397003\tby=1523 1 -3.6538217\tmnie=2555 2 -1.6694618\tnie=127 2 -1.4439836\tzdziwiło,=0 1 -5.2158937\tgdyby=814 1 -3.2300434\tokoło=1462 1 -3.7384818\tgrudnia=0 1 -5.123236\twybuchła=0 1 -5.0133595\twojna.=1285 1 -4.9033437\t</s>=2 2 -0.8501559\tTotal: -50.668617 OOV: 4\n",
"Perplexity including OOVs:\t4160.896818387522\n",
"Perplexity excluding OOVs:\t1060.0079770155185\n",
"OOVs:\t4\n",
"Tokens:\t14\n"
]
}
],
"source": [
"!echo $test2_str | $KENLM_BUILD_PATH/bin/query lalka_tom_pierwszy_lm.arpa 2> /dev/null"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Trochę bardziej zaawansowane użycie "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pierwsza rzecz, która rzuca się w oczy: tokeny zawierają znaki interpunkcyjne. Użyjemy zatem popularnego tokenizera i detokenizera moses z https://github.com/moses-smt/mosesdecoder\n",
" \n",
"https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### tokenizacja i lowercasing"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"TOKENIZER_SCRIPTS='/home/pawel/mosesdecoder/scripts/tokenizer'"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sytuacja polityczna jest tak niepewna, że wcale by mnie nie zdziwiło, gdyby około grudnia wybuchła wojna.\n"
]
}
],
"source": [
"!echo $test_str"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tokenizer Version 1.1\n",
"Language: en\n",
"Number of threads: 1\n",
"Sytuacja polityczna jest tak niepewna , że wcale by mnie nie zdziwiło , gdyby około grudnia wybuchła wojna .\n"
]
}
],
"source": [
"!echo $test_str | $TOKENIZER_SCRIPTS/tokenizer.perl --language pl"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"W łatwy sposób można odzyskać tekst źródłowy:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Detokenizer Version $Revision: 4134 $\n",
"Language: en\n",
"Tokenizer Version 1.1\n",
"Language: en\n",
"Number of threads: 1\n",
"Sytuacja polityczna jest tak niepewna, że wcale by mnie nie zdziwiło, gdyby około grudnia wybuchła wojna.\n"
]
}
],
"source": [
"!echo $test_str | $TOKENIZER_SCRIPTS/tokenizer.perl --language pl | $TOKENIZER_SCRIPTS/detokenizer.perl --language pl"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"W naszym przykładzie stworzymy model językowy lowercase. Można osobno wytrenować też truecaser (osobny model do przywracania wielkości liter), jeżeli jest taka potrzeba."
]
},
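{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal, untested sketch of training and applying a Moses truecaser. The recaser scripts live next to the tokenizer scripts in the mosesdecoder checkout; the path and the intermediate file name below are assumptions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"RECASER_SCRIPTS='/home/pawel/mosesdecoder/scripts/recaser' # assumed location of the Moses recaser scripts\n",
"\n",
"# tokenize (keeping the original case) to obtain a training corpus for the truecaser - file name is illustrative\n",
"!cat lalka-tom-pierwszy.txt | $TOKENIZER_SCRIPTS/tokenizer.perl --language pl > lalka-tom-pierwszy-tokenized.txt\n",
"\n",
"# train the truecasing model\n",
"!$RECASER_SCRIPTS/train-truecaser.perl --model truecase.model --corpus lalka-tom-pierwszy-tokenized.txt\n",
"\n",
"# lowercase a sentence and then restore the letter case with the truecaser\n",
"!echo $test_str | $TOKENIZER_SCRIPTS/tokenizer.perl --language pl | $TOKENIZER_SCRIPTS/lowercase.perl | $RECASER_SCRIPTS/truecase.perl --model truecase.model"
]
},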
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tokenizer Version 1.1\n",
"Language: en\n",
"Number of threads: 1\n",
"sytuacja polityczna jest tak niepewna , że wcale by mnie nie zdziwiło , gdyby około grudnia wybuchła wojna .\n"
]
}
],
"source": [
"!echo $test_str | $TOKENIZER_SCRIPTS/tokenizer.perl --language pl | $TOKENIZER_SCRIPTS/lowercase.perl"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tokenizer Version 1.1\n",
"Language: en\n",
"Number of threads: 1\n"
]
}
],
"source": [
"!cat lalka-tom-pierwszy.txt | $TOKENIZER_SCRIPTS/tokenizer.perl --language pl | $TOKENIZER_SCRIPTS/lowercase.perl > lalka-tom-pierwszy-tokenized-lowercased.txt"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tokenizer Version 1.1\n",
"Language: en\n",
"Number of threads: 1\n"
]
}
],
"source": [
"!cat lalka-tom-drugi.txt | $TOKENIZER_SCRIPTS/tokenizer.perl --language pl | $TOKENIZER_SCRIPTS/lowercase.perl > lalka-tom-drugi-tokenized-lowercased.txt"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"=== 1/5 Counting and sorting n-grams ===\n",
"Reading /home/pawel/moj-2024/lab/lalka-tom-pierwszy-tokenized-lowercased.txt\n",
"----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n",
"****************************************************************************************************\n",
"Unigram tokens 149285 types 22230\n",
"=== 2/5 Calculating and sorting adjusted counts ===\n",
"Chain sizes: 1:266760 2:2262010112 3:4241268992 4:6786030592\n",
"Statistics:\n",
"1 8857/22230 D1=0.664486 D2=1.14301 D3+=1.57055\n",
"2 14632/86142 D1=0.838336 D2=1.2415 D3+=1.40935\n",
"3 8505/128074 D1=0.931027 D2=1.29971 D3+=1.54806\n",
"4 3174/138744 D1=0.967887 D2=1.35058 D3+=1.70692\n",
"Memory estimate for binary LM:\n",
"type kB\n",
"probing 822 assuming -p 1.5\n",
"probing 993 assuming -r models -p 1.5\n",
"trie 480 without quantization\n",
"trie 343 assuming -q 8 -b 8 quantization \n",
"trie 459 assuming -a 22 array pointer compression\n",
"trie 322 assuming -a 22 -q 8 -b 8 array pointer compression and quantization\n",
"=== 3/5 Calculating and sorting initial probabilities ===\n",
"Chain sizes: 1:106284 2:234112 3:170100 4:76176\n",
"----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n",
"**##################################################################################################\n",
"=== 4/5 Calculating and writing order-interpolated probabilities ===\n",
"Chain sizes: 1:106284 2:234112 3:170100 4:76176\n",
"----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n",
"####################################################################################################\n",
"=== 5/5 Writing ARPA model ===\n",
"----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n",
"****************************************************************************************************\n",
"Name:lmplz\tVmPeak:13142612 kB\tVmRSS:7392 kB\tRSSMax:2624428 kB\tuser:0.229863\tsys:0.579255\tCPU:0.809192\treal:0.791505\n"
]
}
],
"source": [
"!$KENLM_BUILD_PATH/bin/lmplz -o 4 --prune 1 1 1 1 < lalka-tom-pierwszy-tokenized-lowercased.txt > lalka_tom_pierwszy_lm.arpa"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"test_str=!(head -n 17 lalka-tom-drugi-tokenized-lowercased.txt | tail -n 1)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"test_str=test_str[0]"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'sytuacja polityczna jest tak niepewna , że wcale by mnie nie zdziwiło , gdyby około grudnia wybuchła wojna .'"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_str"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### model binarny\n",
"\n",
"Konwertując model do postaci binarnej, inferencja będzie szybsza"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Reading lalka_tom_pierwszy_lm.arpa\n",
"----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n",
"****************************************************************************************************\n",
"SUCCESS\n"
]
}
],
"source": [
"!$KENLM_BUILD_PATH/bin/build_binary lalka_tom_pierwszy_lm.arpa lalka_tom_pierwszy_lm.binary"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This binary file contains probing hash tables.\n",
"sytuacja=0 1 -5.568051\tpolityczna=0 1 -4.4812803\tjest=91 1 -2.6271343\ttak=175 2 -1.7584295\tniepewna=0 1 -4.603079\t,=22 1 -1.2027187\tże=90 2 -1.2062931\twcale=375 1 -4.0545278\tby=995 1 -3.5268068\tmnie=1491 2 -1.6614945\tnie=94 2 -1.4855772\tzdziwiło=0 1 -4.708499\t,=22 1 -1.2027187\tgdyby=555 2 -2.4179027\tokoło=957 1 -3.7740536\tgrudnia=0 1 -4.605748\twybuchła=0 1 -4.4812803\twojna=849 1 -4.213117\t.=42 1 -1.3757544\t</s>=2 2 -0.46293145\tTotal: -59.417397 OOV: 6\n",
"Perplexity including OOVs:\t935.1253434773644\n",
"Perplexity excluding OOVs:\t162.9687064350829\n",
"OOVs:\t6\n",
"Tokens:\t20\n",
"Name:query\tVmPeak:8864 kB\tVmRSS:4504 kB\tRSSMax:5328 kB\tuser:0.002388\tsys:0\tCPU:0.0024207\treal:0.000614597\n"
]
}
],
"source": [
"!echo $test_str | $KENLM_BUILD_PATH/bin/query lalka_tom_pierwszy_lm.binary"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### sprawdzanie dokumentacji\n",
"\n",
"Najłatwiej sprawdzić wywołując bezpośrednio komendę"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Builds unpruned language models with modified Kneser-Ney smoothing.\n",
"\n",
"Please cite:\n",
"@inproceedings{Heafield-estimate,\n",
" author = {Kenneth Heafield and Ivan Pouzyrevsky and Jonathan H. Clark and Philipp Koehn},\n",
" title = {Scalable Modified {Kneser-Ney} Language Model Estimation},\n",
" year = {2013},\n",
" month = {8},\n",
" booktitle = {Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics},\n",
" address = {Sofia, Bulgaria},\n",
" url = {http://kheafield.com/professional/edinburgh/estimate\\_paper.pdf},\n",
"}\n",
"\n",
"Provide the corpus on stdin. The ARPA file will be written to stdout. Order of\n",
"the model (-o) is the only mandatory option. As this is an on-disk program,\n",
"setting the temporary file location (-T) and sorting memory (-S) is recommended.\n",
"\n",
"Memory sizes are specified like GNU sort: a number followed by a unit character.\n",
"Valid units are % for percentage of memory (supported platforms only) and (in\n",
"increasing powers of 1024): b, K, M, G, T, P, E, Z, Y. Default is K (*1024).\n",
"This machine has 16611971072 bytes of memory.\n",
"\n",
"Language model building options:\n",
" -h [ --help ] Show this help message\n",
" -o [ --order ] arg Order of the model\n",
" --interpolate_unigrams [=arg(=1)] (=1)\n",
" Interpolate the unigrams (default) as \n",
" opposed to giving lots of mass to <unk>\n",
" like SRI. If you want SRI's behavior \n",
" with a large <unk> and the old lmplz \n",
" default, use --interpolate_unigrams 0.\n",
" --skip_symbols Treat <s>, </s>, and <unk> as \n",
" whitespace instead of throwing an \n",
" exception\n",
" -T [ --temp_prefix ] arg (=/tmp/) Temporary file prefix\n",
" -S [ --memory ] arg (=80%) Sorting memory\n",
" --minimum_block arg (=8K) Minimum block size to allow\n",
" --sort_block arg (=64M) Size of IO operations for sort \n",
" (determines arity)\n",
" --block_count arg (=2) Block count (per order)\n",
" --vocab_estimate arg (=1000000) Assume this vocabulary size for \n",
" purposes of calculating memory in step \n",
" 1 (corpus count) and pre-sizing the \n",
" hash table\n",
" --vocab_pad arg (=0) If the vocabulary is smaller than this \n",
" value, pad with <unk> to reach this \n",
" size. Requires --interpolate_unigrams\n",
" --verbose_header Add a verbose header to the ARPA file \n",
" that includes information such as token\n",
" count, smoothing type, etc.\n",
" --text arg Read text from a file instead of stdin\n",
" --arpa arg Write ARPA to a file instead of stdout\n",
" --intermediate arg Write ngrams to intermediate files. \n",
" Turns off ARPA output (which can be \n",
" reactivated by --arpa file). Forces \n",
" --renumber on.\n",
" --renumber Renumber the vocabulary identifiers so \n",
" that they are monotone with the hash of\n",
" each string. This is consistent with \n",
" the ordering used by the trie data \n",
" structure.\n",
" --collapse_values Collapse probability and backoff into a\n",
" single value, q that yields the same \n",
" sentence-level probabilities. See \n",
" http://kheafield.com/professional/edinb\n",
" urgh/rest_paper.pdf for more details, \n",
" including a proof.\n",
" --prune arg Prune n-grams with count less than or \n",
" equal to the given threshold. Specify \n",
" one value for each order i.e. 0 0 1 to \n",
" prune singleton trigrams and above. \n",
" The sequence of values must be \n",
" non-decreasing and the last value \n",
" applies to any remaining orders. \n",
" Default is to not prune, which is \n",
" equivalent to --prune 0.\n",
" --limit_vocab_file arg Read allowed vocabulary separated by \n",
" whitespace. N-grams that contain \n",
" vocabulary items not in this list will \n",
" be pruned. Can be combined with --prune\n",
" arg\n",
" --discount_fallback [=arg(=0.5 1 1.5)]\n",
" The closed-form estimate for Kneser-Ney\n",
" discounts does not work without \n",
" singletons or doubletons. It can also \n",
" fail if these values are out of range. \n",
" This option falls back to \n",
" user-specified discounts when the \n",
" closed-form estimate fails. Note that \n",
" this option is generally a bad idea: \n",
" you should deduplicate your corpus \n",
" instead. However, class-based models \n",
" need custom discounts because they lack\n",
" singleton unigrams. Provide up to \n",
" three discounts (for adjusted counts 1,\n",
" 2, and 3+), which will be applied to \n",
" all orders where the closed-form \n",
" estimates fail.\n",
"\n"
]
}
],
"source": [
"!$KENLM_BUILD_PATH/bin/lmplz "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### wrapper pythonowy\n"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Defaulting to user installation because normal site-packages is not writeable\n",
"Collecting https://github.com/kpu/kenlm/archive/master.zip\n",
" Downloading https://github.com/kpu/kenlm/archive/master.zip\n",
"\u001b[2K \u001b[32m-\u001b[0m \u001b[32m553.6 kB\u001b[0m \u001b[31m851.1 kB/s\u001b[0m \u001b[33m0:00:00\u001b[0m\n",
"\u001b[?25h Installing build dependencies ... \u001b[?25ldone\n",
"\u001b[?25h Getting requirements to build wheel ... \u001b[?25ldone\n",
"\u001b[?25h Preparing metadata (pyproject.toml) ... \u001b[?25ldone\n",
"\u001b[?25hBuilding wheels for collected packages: kenlm\n",
" Building wheel for kenlm (pyproject.toml) ... \u001b[?25ldone\n",
"\u001b[?25h Created wheel for kenlm: filename=kenlm-0.2.0-cp310-cp310-linux_x86_64.whl size=3184348 sha256=c9da9a754aa07ffa26f8983ced2910a547d665006e39fd053d365b802b4135e9\n",
" Stored in directory: /tmp/pip-ephem-wheel-cache-e8zp2xqd/wheels/a5/73/ee/670fbd0cee8f6f0b21d10987cb042291e662e26e1a07026462\n",
"Successfully built kenlm\n",
"Installing collected packages: kenlm\n",
"Successfully installed kenlm-0.2.0\n"
]
}
],
"source": [
"!pip install https://github.com/kpu/kenlm/archive/master.zip"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-59.417396545410156\n"
]
}
],
"source": [
"import kenlm\n",
"model = kenlm.Model('lalka_tom_pierwszy_lm.binary')\n",
"print(model.score(test_str, bos = True, eos = True))"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(-5.568050861358643, 1, True)\n",
"(-4.481280326843262, 1, True)\n",
"(-2.627134323120117, 1, False)\n",
"(-1.7584295272827148, 2, False)\n",
"(-4.603078842163086, 1, True)\n",
"(-1.202718734741211, 1, False)\n",
"(-1.2062931060791016, 2, False)\n",
"(-4.054527759552002, 1, False)\n",
"(-3.5268068313598633, 1, False)\n",
"(-1.661494493484497, 2, False)\n",
"(-1.4855772256851196, 2, False)\n",
"(-4.708498954772949, 1, True)\n",
"(-1.202718734741211, 1, False)\n",
"(-2.417902708053589, 2, False)\n",
"(-3.7740535736083984, 1, False)\n",
"(-4.605748176574707, 1, True)\n",
"(-4.481280326843262, 1, True)\n",
"(-4.2131171226501465, 1, False)\n",
"(-1.3757543563842773, 1, False)\n",
"(-0.46293145418167114, 2, False)\n"
]
}
],
"source": [
"for i in model.full_scores(test_str):\n",
" print(i)"
]
},
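{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each tuple yielded by full_scores is (log10 probability, length of the matched n-gram, whether the word is out of vocabulary); the last tuple corresponds to </s>. A small sketch that lines the scores up with the tokens and computes the perplexity directly from the Python wrapper:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# pair each token (plus </s>) with its (log10 prob, n-gram length, OOV flag) tuple\n",
"tokens = test_str.split() + ['</s>']\n",
"for token, (logprob, ngram_length, oov) in zip(tokens, model.full_scores(test_str)):\n",
"    print(f'{token}\\tlog10 p = {logprob:.4f}\\tmatched {ngram_length}-gram\\tOOV = {oov}')\n",
"\n",
"print(model.perplexity(test_str))  # perplexity including OOVs, as reported by the query tool"
]
},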
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Zadanie \n",
"\n",
"Stworzyć model językowy za pomocą gotowej biblioteki (KenLM lub inna)\n",
"\n",
"Rozwiązanie proszę umieścić na https://gonito.csi.wmi.amu.edu.pl/challenge/challenging-america-word-gap-prediction\n",
"\n",
"Warunki zaliczenia:\n",
"- wynik widoczny na platformie zarówno dla dev i dla test\n",
"- wynik dla dev i test lepszy (niższy) niż 1024.00 (liczone przy pomocy geval)\n",
"- deadline: **24 kwietnia 2024**\n",
"- commitując rozwiązanie proszę również umieścić rozwiązanie w pliku /run.py (czyli na szczycie katalogu). Można przekonwertować jupyter do pliku python przez File → Download as → Python. Rozwiązanie nie musi być w pythonie, może być w innym języku.\n",
"- zadania wykonujemy samodzielnie\n",
"- w nazwie commita podaj nr indeksu\n",
"- w tagach podaj kenlm!\n",
"- uwaga na specjalne znaki \\\\n w pliku 'in.tsv' oraz pierwsze kolumny pliku in.tsv (które należy usunąć)\n",
"\n",
"\n",
"Punktacja:\n",
"- podstawa: 40 punktów\n",
"- dodatkowo 50 (czyli 40 + 50 = 90) punktów z najlepszy wynik\n",
"- dodatkowo 20 (czyli 40 + 20 = 60) punktów za znalezienie się w pierwszej połowie, ale poza najlepszym wynikiem"
]
},
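{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal, illustrative sketch of reading the challenge input. It assumes (this is not guaranteed by the description above) that in.tsv is tab-separated, that the left and right contexts are the last two columns, and that newlines inside the contexts are written as the literal two characters \\\\n:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical preprocessing of the challenge in.tsv: drop the leading metadata columns,\n",
"# keep only the left/right context, and turn literal \\\\n and \\\\t escapes into spaces.\n",
"def read_contexts(path):\n",
"    with open(path, encoding='utf-8') as f:\n",
"        for line in f:\n",
"            fields = line.rstrip('\\n').split('\\t')\n",
"            left, right = fields[-2], fields[-1]  # assumption: the contexts are the last two columns\n",
"            left = left.replace('\\\\n', ' ').replace('\\\\t', ' ')\n",
"            right = right.replace('\\\\n', ' ').replace('\\\\t', ' ')\n",
"            yield left, right\n",
"\n",
"# example usage (the path is illustrative):\n",
"# for left, right in read_contexts('dev-0/in.tsv'):\n",
"#     ..."
]
},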
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
"subtitle": "0.Informacje na temat przedmiotu[ćwiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}