{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Modelowanie języka – laboratoria\n", "### 10 kwietnia 2024\n", "# 6. Biblioteki do statystycznych modeli językowych" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### KENLM\n", "\n", "W praktyce korzysta się z gotowych bibliotek do statystycznych modeli językowych. Najbardziej popularną biblioteką jest KENLM ( https://kheafield.com/papers/avenue/kenlm.pdf ). Repozytorium znajduje się https://github.com/kpu/kenlm a dokumentacja https://kheafield.com/code/kenlm/\n", "\n", "Na komputerach wydziałowych nie powinno być problemu ze skompilowaniem biblioteki.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Instalacja\n", "\n", "(Zob. też dokumentacja)\n", "\n", " sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev\n", " wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz\n", " mkdir kenlm/build\n", " cd kenlm/build\n", " cmake ..\n", " make -j2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Najprostszy scenariusz użycia" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "KENLM_BUILD_PATH='/home/pawel/kenlm/build' # ścieżka, w której jest zainstalowany KenLM (zob. dokumentacja - link powyżej)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2024-04-10 12:13:27-- https://wolnelektury.pl/media/book/txt/lalka-tom-pierwszy.txt\n", "Resolving wolnelektury.pl (wolnelektury.pl)... 51.83.143.148, 2001:41d0:602:3294::\n", "Connecting to wolnelektury.pl (wolnelektury.pl)|51.83.143.148|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 860304 (840K) [text/plain]\n", "Saving to: ‘lalka-tom-pierwszy.txt.1’\n", "\n", "lalka-tom-pierwszy. 100%[===================>] 840.14K 3.59MB/s in 0.2s \n", "\n", "2024-04-10 12:13:27 (3.59 MB/s) - ‘lalka-tom-pierwszy.txt.1’ saved [860304/860304]\n", "\n" ] } ], "source": [ "!wget https://wolnelektury.pl/media/book/txt/lalka-tom-pierwszy.txt" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2024-04-10 12:13:30-- https://wolnelektury.pl/media/book/txt/lalka-tom-drugi.txt\n", "Resolving wolnelektury.pl (wolnelektury.pl)... 51.83.143.148, 2001:41d0:602:3294::\n", "Connecting to wolnelektury.pl (wolnelektury.pl)|51.83.143.148|:443... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: 949497 (927K) [text/plain]\n", "Saving to: ‘lalka-tom-drugi.txt.1’\n", "\n", "lalka-tom-drugi.txt 100%[===================>] 927.24K 3.39MB/s in 0.3s \n", "\n", "2024-04-10 12:13:30 (3.39 MB/s) - ‘lalka-tom-drugi.txt.1’ saved [949497/949497]\n", "\n" ] } ], "source": [ "!wget https://wolnelektury.pl/media/book/txt/lalka-tom-drugi.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### budowa modelu" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== 1/5 Counting and sorting n-grams ===\n", "Reading /home/pawel/moj-2024/lab/lalka-tom-pierwszy.txt\n", "----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "****************************************************************************************************\n", "Unigram tokens 122871 types 33265\n", "=== 2/5 Calculating and sorting adjusted counts ===\n", "Chain sizes: 1:399180 2:2261987584 3:4241227008 4:6785963520\n", "Statistics:\n", "1 33265 D1=0.737356 D2=1.15675 D3+=1.59585\n", "2 93948 D1=0.891914 D2=1.20314 D3+=1.44945\n", "3 115490 D1=0.964904 D2=1.40636 D3+=1.66751\n", "4 116433 D1=0.986444 D2=1.50367 D3+=1.9023\n", "Memory estimate for binary LM:\n", "type kB\n", "probing 7800 assuming -p 1.5\n", "probing 9157 assuming -r models -p 1.5\n", "trie 3902 without quantization\n", "trie 2378 assuming -q 8 -b 8 quantization \n", "trie 3649 assuming -a 22 array pointer compression\n", "trie 2125 assuming -a 22 -q 8 -b 8 array pointer compression and quantization\n", "=== 3/5 Calculating and sorting initial probabilities ===\n", "Chain sizes: 1:399180 2:1503168 3:2309800 4:2794392\n", "----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n", "####################################################################################################\n", "=== 4/5 Calculating and writing order-interpolated probabilities ===\n", "Chain sizes: 1:399180 2:1503168 3:2309800 4:2794392\n", "----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n", "####################################################################################################\n", "=== 5/5 Writing ARPA model ===\n", "----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n", "****************************************************************************************************\n", "Name:lmplz\tVmPeak:13142592 kB\tVmRSS:7564 kB\tRSSMax:2623832 kB\tuser:0.28374\tsys:1.02734\tCPU:1.3111\treal:1.25256\n" ] } ], "source": [ "!$KENLM_BUILD_PATH/bin/lmplz -o 4 < lalka-tom-pierwszy.txt > lalka_tom_pierwszy_lm.arpa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## plik arpa\n", "\n", "Powyższa komenda tworzy model językowy z wygładzaniem i zapisuje go do pliku tekstowego arpa. Parametr -o 4 odpowiada za maksymalną ilość n-gramów w modelu: 4-gramy.\n", "\n", "Plik arpa zawiera w sobie prawdopodobieństwa dla poszczególnych n-gramów. 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the arpa file:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\\data\\\n", "ngram 1=33265\n", "ngram 2=93948\n", "ngram 3=115490\n", "ngram 4=116433\n", "\n", "\\1-grams:\n", "-5.0133595\t<unk>\t0\n", "0\t<s>\t-0.99603957\n", "-1.4302719\t</s>\t0\n", "-4.7287908\tBolesław\t-0.049677044\n", "-4.9033437\tPrus\t-0.049677044\n", "-4.9033437\tLalka\t-0.049677044\n", "-4.9033437\tISBN\t-0.049677044\n", "-4.9033437\t978-83-288-2673-1\t-0.049677044\n", "-4.9033437\tTom\t-0.049677044\n", "-3.0029354\tI\t-0.17544968\n", "-4.9033437\tI.\t-0.049677044\n", "-3.5526814\tJak\t-0.1410632\n", "-3.8170912\twygląda\t-0.16308141\n", "-4.608305\tfirma\t-0.049677044\n", "-4.33789\tJ.\t-0.3295009\n", "-3.9192266\tMincel\t-0.12910372\n", "-1.624716\ti\t-0.20128249\n", "-4.1086636\tS.\t-0.098223634\n", "-2.6843808\tWokulski\t-0.19202113\n", "-2.8196363\tprzez\t-0.15214005\n", "-4.9033437\tszkło\t-0.049677044\n", "-4.9033437\tbutelek?\t-0.049677044\n", "-2.848008\tW\t-0.19964235\n" ] } ], "source": [ "!head -n 30 lalka_tom_pierwszy_lm.arpa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each line is, in order: the probability (log10), the n-gram, and the back-off weight (log10).\n", "\n", "To check the probability of a sequence (and the model's perplexity, PPL), use the query command" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "test_str=!(head -n 17 lalka-tom-drugi.txt | tail -n 1)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "test_str = test_str[0]" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Sytuacja polityczna jest tak niepewna, że wcale by mnie nie zdziwiło, gdyby około grudnia wybuchła wojna.'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_str" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Sytuacja polityczna jest tak niepewna, że wcale by mnie nie zdziwiło, gdyby około grudnia wybuchła wojna.'" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_str" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sytuacja=0 1 -6.009399\tpolityczna=21766 1 -4.9033437\tjest=123 1 -2.6640298\ttak=231 2 -1.7683144\tniepewna,=0 1 -5.1248584\tże=122 1 -2.1651394\twcale=5123 1 -4.167491\tby=1523 1 -3.55168\tmnie=2555 2 -1.6694618\tnie=127 2 -1.4439836\tzdziwiło,=0 1 -5.2158937\tgdyby=814 1 -3.2300434\tokoło=1462 1 -3.7384818\tgrudnia=0 1 -5.123236\twybuchła=0 1 -5.0133595\twojna.=1285 1 -4.9033437\t</s>=2 2 -0.8501559\tTotal: -61.54222 OOV: 5\n", "Perplexity including OOVs:\t4169.948113875898\n", "Perplexity excluding OOVs:\t834.2371454470355\n", "OOVs:\t5\n", "Tokens:\t17\n" ] } ], "source": [ "!echo $test_str | $KENLM_BUILD_PATH/bin/query lalka_tom_pierwszy_lm.arpa 2> /dev/null" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "According to the documentation of the query command, the output format for each word is:\n", " \n", "word=vocab_id ngram_length log10(p(word|context))"
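] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a sanity check, the summary printed by query can be recomputed from the per-word scores. A minimal sketch in plain Python (the numbers are copied from the query output above; the only assumption is the standard definition of perplexity as 10 raised to the negative average log10 probability):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# per-word log10 probabilities copied from the query output above (the last one is for </s>)\n", "scores = [-6.009399, -4.9033437, -2.6640298, -1.7683144, -5.1248584, -2.1651394,\n", "          -4.167491, -3.55168, -1.6694618, -1.4439836, -5.2158937, -3.2300434,\n", "          -3.7384818, -5.123236, -5.0133595, -4.9033437, -0.8501559]\n", "total = sum(scores)\n", "ppl = 10 ** (-total / len(scores))      # 17 scored tokens, </s> included\n", "print(round(total, 5), round(ppl, 2))   # about -61.54222 and 4169.95"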
] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "test2_str = \"Lubię placki i wcale by mnie nie zdziwiło, gdyby około grudnia wybuchła wojna.\"" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Lubię=17813 1 -5.899383\tplacki=0 1 -5.0630364\ti=16 1 -1.624716\twcale=5123 2 -3.2397003\tby=1523 1 -3.6538217\tmnie=2555 2 -1.6694618\tnie=127 2 -1.4439836\tzdziwiło,=0 1 -5.2158937\tgdyby=814 1 -3.2300434\tokoło=1462 1 -3.7384818\tgrudnia=0 1 -5.123236\twybuchła=0 1 -5.0133595\twojna.=1285 1 -4.9033437\t=2 2 -0.8501559\tTotal: -50.668617 OOV: 4\n", "Perplexity including OOVs:\t4160.896818387522\n", "Perplexity excluding OOVs:\t1060.0079770155185\n", "OOVs:\t4\n", "Tokens:\t14\n" ] } ], "source": [ "!echo $test2_str | $KENLM_BUILD_PATH/bin/query lalka_tom_pierwszy_lm.arpa 2> /dev/null" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Trochę bardziej zaawansowane użycie " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pierwsza rzecz, która rzuca się w oczy: tokeny zawierają znaki interpunkcyjne. Użyjemy zatem popularnego tokenizera i detokenizera moses z https://github.com/moses-smt/mosesdecoder\n", " \n", "https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### tokenizacja i lowercasing" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "TOKENIZER_SCRIPTS='/home/pawel/mosesdecoder/scripts/tokenizer'" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sytuacja polityczna jest tak niepewna, że wcale by mnie nie zdziwiło, gdyby około grudnia wybuchła wojna.\n" ] } ], "source": [ "!echo $test_str" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tokenizer Version 1.1\n", "Language: en\n", "Number of threads: 1\n", "Sytuacja polityczna jest tak niepewna , że wcale by mnie nie zdziwiło , gdyby około grudnia wybuchła wojna .\n" ] } ], "source": [ "!echo $test_str | $TOKENIZER_SCRIPTS/tokenizer.perl --language pl" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "W łatwy sposób można odzyskać tekst źródłowy:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Detokenizer Version $Revision: 4134 $\n", "Language: en\n", "Tokenizer Version 1.1\n", "Language: en\n", "Number of threads: 1\n", "Sytuacja polityczna jest tak niepewna, że wcale by mnie nie zdziwiło, gdyby około grudnia wybuchła wojna.\n" ] } ], "source": [ "!echo $test_str | $TOKENIZER_SCRIPTS/tokenizer.perl --language pl | $TOKENIZER_SCRIPTS/detokenizer.perl --language pl" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "W naszym przykładzie stworzymy model językowy lowercase. Można osobno wytrenować też truecaser (osobny model do przywracania wielkości liter), jeżeli jest taka potrzeba." 
] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tokenizer Version 1.1\n", "Language: en\n", "Number of threads: 1\n", "sytuacja polityczna jest tak niepewna , że wcale by mnie nie zdziwiło , gdyby około grudnia wybuchła wojna .\n" ] } ], "source": [ "!echo $test_str | $TOKENIZER_SCRIPTS/tokenizer.perl --language pl | $TOKENIZER_SCRIPTS/lowercase.perl" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tokenizer Version 1.1\n", "Language: en\n", "Number of threads: 1\n" ] } ], "source": [ "!cat lalka-tom-pierwszy.txt | $TOKENIZER_SCRIPTS/tokenizer.perl --language pl | $TOKENIZER_SCRIPTS/lowercase.perl > lalka-tom-pierwszy-tokenized-lowercased.txt" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tokenizer Version 1.1\n", "Language: en\n", "Number of threads: 1\n" ] } ], "source": [ "!cat lalka-tom-drugi.txt | $TOKENIZER_SCRIPTS/tokenizer.perl --language pl | $TOKENIZER_SCRIPTS/lowercase.perl > lalka-tom-drugi-tokenized-lowercased.txt" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== 1/5 Counting and sorting n-grams ===\n", "Reading /home/pawel/moj-2024/lab/lalka-tom-pierwszy-tokenized-lowercased.txt\n", "----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n", "****************************************************************************************************\n", "Unigram tokens 149285 types 22230\n", "=== 2/5 Calculating and sorting adjusted counts ===\n", "Chain sizes: 1:266760 2:2262010112 3:4241268992 4:6786030592\n", "Statistics:\n", "1 8857/22230 D1=0.664486 D2=1.14301 D3+=1.57055\n", "2 14632/86142 D1=0.838336 D2=1.2415 D3+=1.40935\n", "3 8505/128074 D1=0.931027 D2=1.29971 D3+=1.54806\n", "4 3174/138744 D1=0.967887 D2=1.35058 D3+=1.70692\n", "Memory estimate for binary LM:\n", "type kB\n", "probing 822 assuming -p 1.5\n", "probing 993 assuming -r models -p 1.5\n", "trie 480 without quantization\n", "trie 343 assuming -q 8 -b 8 quantization \n", "trie 459 assuming -a 22 array pointer compression\n", "trie 322 assuming -a 22 -q 8 -b 8 array pointer compression and quantization\n", "=== 3/5 Calculating and sorting initial probabilities ===\n", "Chain sizes: 1:106284 2:234112 3:170100 4:76176\n", "----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n", "**##################################################################################################\n", "=== 4/5 Calculating and writing order-interpolated probabilities ===\n", "Chain sizes: 1:106284 2:234112 3:170100 4:76176\n", "----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n", "####################################################################################################\n", "=== 5/5 Writing ARPA model ===\n", "----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n", "****************************************************************************************************\n", "Name:lmplz\tVmPeak:13142612 kB\tVmRSS:7392 kB\tRSSMax:2624428 kB\tuser:0.229863\tsys:0.579255\tCPU:0.809192\treal:0.791505\n" ] } ], "source": [ "!$KENLM_BUILD_PATH/bin/lmplz -o 4 --prune 1 1 1 1 < 
lalka-tom-pierwszy-tokenized-lowercased.txt > lalka_tom_pierwszy_lm.arpa" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "test_str=!(head -n 17 lalka-tom-drugi-tokenized-lowercased.txt | tail -n 1)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "test_str=test_str[0]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'sytuacja polityczna jest tak niepewna , że wcale by mnie nie zdziwiło , gdyby około grudnia wybuchła wojna .'" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_str" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Binary model\n", "\n", "Converting the model to a binary format makes inference faster" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reading lalka_tom_pierwszy_lm.arpa\n", "----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n", "****************************************************************************************************\n", "SUCCESS\n" ] } ], "source": [ "!$KENLM_BUILD_PATH/bin/build_binary lalka_tom_pierwszy_lm.arpa lalka_tom_pierwszy_lm.binary" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This binary file contains probing hash tables.\n", "sytuacja=0 1 -5.568051\tpolityczna=0 1 -4.4812803\tjest=91 1 -2.6271343\ttak=175 2 -1.7584295\tniepewna=0 1 -4.603079\t,=22 1 -1.2027187\tże=90 2 -1.2062931\twcale=375 1 -4.0545278\tby=995 1 -3.5268068\tmnie=1491 2 -1.6614945\tnie=94 2 -1.4855772\tzdziwiło=0 1 -4.708499\t,=22 1 -1.2027187\tgdyby=555 2 -2.4179027\tokoło=957 1 -3.7740536\tgrudnia=0 1 -4.605748\twybuchła=0 1 -4.4812803\twojna=849 1 -4.213117\t.=42 1 -1.3757544\t</s>=2 2 -0.46293145\tTotal: -59.417397 OOV: 6\n", "Perplexity including OOVs:\t935.1253434773644\n", "Perplexity excluding OOVs:\t162.9687064350829\n", "OOVs:\t6\n", "Tokens:\t20\n", "Name:query\tVmPeak:8864 kB\tVmRSS:4504 kB\tRSSMax:5328 kB\tuser:0.002388\tsys:0\tCPU:0.0024207\treal:0.000614597\n" ] } ], "source": [ "!echo $test_str | $KENLM_BUILD_PATH/bin/query lalka_tom_pierwszy_lm.binary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Checking the documentation\n", "\n", "The easiest way is to call the command directly" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Builds unpruned language models with modified Kneser-Ney smoothing.\n", "\n", "Please cite:\n", "@inproceedings{Heafield-estimate,\n", " author = {Kenneth Heafield and Ivan Pouzyrevsky and Jonathan H. Clark and Philipp Koehn},\n", " title = {Scalable Modified {Kneser-Ney} Language Model Estimation},\n", " year = {2013},\n", " month = {8},\n", " booktitle = {Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics},\n", " address = {Sofia, Bulgaria},\n", " url = {http://kheafield.com/professional/edinburgh/estimate\\_paper.pdf},\n", "}\n", "\n", "Provide the corpus on stdin. The ARPA file will be written to stdout. Order of\n",
"the model (-o) is the only mandatory option. As this is an on-disk program,\n", "setting the temporary file location (-T) and sorting memory (-S) is recommended.\n", "\n", "Memory sizes are specified like GNU sort: a number followed by a unit character.\n", "Valid units are % for percentage of memory (supported platforms only) and (in\n", "increasing powers of 1024): b, K, M, G, T, P, E, Z, Y. Default is K (*1024).\n", "This machine has 16611971072 bytes of memory.\n", "\n", "Language model building options:\n", " -h [ --help ] Show this help message\n", " -o [ --order ] arg Order of the model\n", " --interpolate_unigrams [=arg(=1)] (=1)\n", " Interpolate the unigrams (default) as \n", " opposed to giving lots of mass to <unk> \n", " like SRI. If you want SRI's behavior \n", " with a large <unk> and the old lmplz \n", " default, use --interpolate_unigrams 0.\n", " --skip_symbols Treat <s>, </s>, and <unk> as \n", " whitespace instead of throwing an \n", " exception\n", " -T [ --temp_prefix ] arg (=/tmp/) Temporary file prefix\n", " -S [ --memory ] arg (=80%) Sorting memory\n", " --minimum_block arg (=8K) Minimum block size to allow\n", " --sort_block arg (=64M) Size of IO operations for sort \n", " (determines arity)\n", " --block_count arg (=2) Block count (per order)\n", " --vocab_estimate arg (=1000000) Assume this vocabulary size for \n", " purposes of calculating memory in step \n", " 1 (corpus count) and pre-sizing the \n", " hash table\n", " --vocab_pad arg (=0) If the vocabulary is smaller than this \n", " value, pad with <unk> to reach this \n", " size. Requires --interpolate_unigrams\n", " --verbose_header Add a verbose header to the ARPA file \n", " that includes information such as token\n", " count, smoothing type, etc.\n", " --text arg Read text from a file instead of stdin\n", " --arpa arg Write ARPA to a file instead of stdout\n", " --intermediate arg Write ngrams to intermediate files. \n", " Turns off ARPA output (which can be \n", " reactivated by --arpa file). Forces \n", " --renumber on.\n", " --renumber Renumber the vocabulary identifiers so \n", " that they are monotone with the hash of\n", " each string. This is consistent with \n", " the ordering used by the trie data \n", " structure.\n", " --collapse_values Collapse probability and backoff into a\n", " single value, q that yields the same \n", " sentence-level probabilities. See \n", " http://kheafield.com/professional/edinb\n", " urgh/rest_paper.pdf for more details, \n", " including a proof.\n", " --prune arg Prune n-grams with count less than or \n", " equal to the given threshold. Specify \n", " one value for each order i.e. 0 0 1 to \n", " prune singleton trigrams and above. \n", " The sequence of values must be \n", " non-decreasing and the last value \n", " applies to any remaining orders. \n", " Default is to not prune, which is \n", " equivalent to --prune 0.\n", " --limit_vocab_file arg Read allowed vocabulary separated by \n", " whitespace. N-grams that contain \n", " vocabulary items not in this list will \n", " be pruned. Can be combined with --prune\n", " arg\n", " --discount_fallback [=arg(=0.5 1 1.5)]\n", " The closed-form estimate for Kneser-Ney\n", " discounts does not work without \n", " singletons or doubletons. It can also \n", " fail if these values are out of range. \n", " This option falls back to \n", " user-specified discounts when the \n", " closed-form estimate fails. Note that \n", " this option is generally a bad idea: \n", " you should deduplicate your corpus \n",
" instead. However, class-based models \n", " need custom discounts because they lack\n", " singleton unigrams. Provide up to \n", " three discounts (for adjusted counts 1,\n", " 2, and 3+), which will be applied to \n", " all orders where the closed-form \n", " estimates fail.\n", "\n" ] } ], "source": [ "!$KENLM_BUILD_PATH/bin/lmplz " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Python wrapper\n" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Defaulting to user installation because normal site-packages is not writeable\n", "Collecting https://github.com/kpu/kenlm/archive/master.zip\n", " Downloading https://github.com/kpu/kenlm/archive/master.zip\n", "\u001b[2K \u001b[32m-\u001b[0m \u001b[32m553.6 kB\u001b[0m \u001b[31m851.1 kB/s\u001b[0m \u001b[33m0:00:00\u001b[0m\n", "\u001b[?25h Installing build dependencies ... \u001b[?25ldone\n", "\u001b[?25h Getting requirements to build wheel ... \u001b[?25ldone\n", "\u001b[?25h Preparing metadata (pyproject.toml) ... \u001b[?25ldone\n", "\u001b[?25hBuilding wheels for collected packages: kenlm\n", " Building wheel for kenlm (pyproject.toml) ... \u001b[?25ldone\n", "\u001b[?25h Created wheel for kenlm: filename=kenlm-0.2.0-cp310-cp310-linux_x86_64.whl size=3184348 sha256=c9da9a754aa07ffa26f8983ced2910a547d665006e39fd053d365b802b4135e9\n", " Stored in directory: /tmp/pip-ephem-wheel-cache-e8zp2xqd/wheels/a5/73/ee/670fbd0cee8f6f0b21d10987cb042291e662e26e1a07026462\n", "Successfully built kenlm\n", "Installing collected packages: kenlm\n", "Successfully installed kenlm-0.2.0\n" ] } ], "source": [ "!pip install https://github.com/kpu/kenlm/archive/master.zip" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-59.417396545410156\n" ] } ], "source": [ "import kenlm\n", "model = kenlm.Model('lalka_tom_pierwszy_lm.binary')\n", "print(model.score(test_str, bos = True, eos = True))" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(-5.568050861358643, 1, True)\n", "(-4.481280326843262, 1, True)\n", "(-2.627134323120117, 1, False)\n", "(-1.7584295272827148, 2, False)\n", "(-4.603078842163086, 1, True)\n", "(-1.202718734741211, 1, False)\n", "(-1.2062931060791016, 2, False)\n", "(-4.054527759552002, 1, False)\n", "(-3.5268068313598633, 1, False)\n", "(-1.661494493484497, 2, False)\n", "(-1.4855772256851196, 2, False)\n", "(-4.708498954772949, 1, True)\n", "(-1.202718734741211, 1, False)\n", "(-2.417902708053589, 2, False)\n", "(-3.7740535736083984, 1, False)\n", "(-4.605748176574707, 1, True)\n", "(-4.481280326843262, 1, True)\n", "(-4.2131171226501465, 1, False)\n", "(-1.3757543563842773, 1, False)\n", "(-0.46293145418167114, 2, False)\n" ] } ], "source": [ "for i in model.full_scores(test_str):\n", " print(i)"
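] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The value returned by score() is again a base-10 log probability of the whole sentence (with BOS/EOS markers added). A minimal sketch of how it relates to the perplexity reported by query, reusing the model and test_str defined above (the wrapper also exposes model.perplexity(), which should give the same number):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# score() returns log10 p(sentence); </s> is also scored, hence the +1\n", "log10_total = model.score(test_str, bos=True, eos=True)\n", "n_scored = len(test_str.split()) + 1\n", "print(10 ** (-log10_total / n_scored))   # about 935.1, i.e. the perplexity including OOVs\n", "print(model.perplexity(test_str))        # convenience method; should print the same value"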
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Assignment \n", "\n", "Build a language model using a ready-made library (KenLM or another one)\n", "\n", "Please submit your solution at https://gonito.csi.wmi.amu.edu.pl/challenge/challenging-america-word-gap-prediction\n", "\n", "Requirements for credit:\n", "- the result is visible on the platform for both dev and test\n", "- the dev and test results are better (lower) than 1024.00 (computed with geval)\n", "- deadline: **24 April 2024**\n", "- when committing the solution, please also put it in the file /run.py (i.e. at the top of the repository). You can convert a Jupyter notebook to a Python file via File → Download as → Python. The solution does not have to be in Python; it can be in another language.\n", "- the assignment is to be done individually\n", "- put your student ID number in the commit message\n", "- add the tag kenlm!\n", "- watch out for the special \\\\n characters in the 'in.tsv' file and for the first columns of in.tsv (which need to be removed)\n", "\n", "\n", "Scoring:\n", "- base: 40 points\n", "- an additional 50 points (i.e. 40 + 50 = 90) for the best result\n", "- an additional 20 points (i.e. 40 + 20 = 60) for placing in the top half, but not achieving the best result" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "author": "Jakub Pokrywka", "email": "kubapok@wmi.amu.edu.pl", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "lang": "pl", "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "subtitle": "0.Informacje na temat przedmiotu[ćwiczenia]", "title": "Ekstrakcja informacji", "year": "2021" }, "nbformat": 4, "nbformat_minor": 4 }