475 lines
11 KiB
Plaintext
475 lines
11 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
|
|
"<div class=\"alert alert-block alert-info\">\n",
|
|
"<h1> Modelowanie Języka</h1>\n",
|
|
"<h2> 7. <i>biblioteki STM</i> [ćwiczenia]</h2> \n",
|
|
"<h3> Jakub Pokrywka (2022)</h3>\n",
|
|
"</div>\n",
|
|
"\n",
|
|
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### KENLM\n",
|
|
"\n",
|
|
"W praktyce korzysta się z gotowych bibliotek do statystycznych modeli językowych. Najbardziej popularną biblioteką jest KENLM ( https://kheafield.com/papers/avenue/kenlm.pdf ). Repozytorium znajduje się https://github.com/kpu/kenlm a dokumentacja https://kheafield.com/code/kenlm/\n",
|
|
"\n",
|
|
"Na komputerach wydziałowych nie powinno być problemu ze skompilowaniem biblioteki.\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Najprostszy scenariusz użycia"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"KENLM_BUILD_PATH='/home/kuba/kenlm/build'"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!wget https://wolnelektury.pl/media/book/txt/lalka-tom-pierwszy.txt"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!wget https://wolnelektury.pl/media/book/txt/lalka-tom-drugi.txt"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### budowa modelu"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!$KENLM_BUILD_PATH/bin/lmplz -o 4 < lalka-tom-pierwszy.txt > lalka_tom_pierwszy_lm.arpa"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## plik arpa\n",
|
|
"\n",
|
|
"Powyższa komenda tworzy model językowy z wygładzaniem i zapisuje go do pliku tekstowego arpa. Parametr -o 4 odpowiada za maksymalną ilość n-gramów w modelu: 4-gramy.\n",
|
|
"\n",
|
|
"Plik arpa zawiera w sobie prawdopodobieństwa dla poszczególnych n-gramów. W zasadzie są to logarytmy prawdopodbieństw o podstawie 10.\n",
|
|
"\n",
|
|
"Podejrzyjmy plik arpa:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!head -n 30 lalka_tom_pierwszy_lm.arpa"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Linijka to kolejno: prawdopodobieństwo (log10), n-gram, waga back-off (log10).\n",
|
|
"\n",
|
|
"Aby spradzić prawdopodobieństwo sekwencji (a także PPL modelu) należy użyć komendy query"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"test_str=!(head -n 17 lalka-tom-drugi.txt | tail -n 1)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"test_str = test_str[0]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"test_str"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"test_str"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!echo $test_str | $KENLM_BUILD_PATH/bin/query lalka_tom_pierwszy_lm.arpa 2> /dev/null"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Zgodnie z dokumentacją polecenia query, format wyjściowy to dla każdego słowa:\n",
|
|
" \n",
|
|
"word=vocab_id ngram_length log10(p(word|context))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"A co jeśli trochę zmienimy początek zdania?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"test2_str = \"Lubię placki i wcale by mnie nie zdziwiło, gdyby około grudnia wybuchła wojna.\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!echo $test2_str | $KENLM_BUILD_PATH/bin/query lalka_tom_pierwszy_lm.arpa 2> /dev/null"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"A co jeśli trochę zmienimy wejście?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Trochę bardziej zaawansowane użycie "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Pierwsza rzecz, która rzuca się w oczy: tokeny zawierają znaki interpunkcyjne. Użyjemy zatem popularnego tokenizera i detokenizera moses z https://github.com/moses-smt/mosesdecoder\n",
|
|
" \n",
|
|
"https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### tokenizacja i lowercasing"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"TOKENIZER_SCRIPTS='/home/kuba/mosesdecoder/scripts/tokenizer'"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!echo $test_str"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!echo $test_str | $TOKENIZER_SCRIPTS/tokenizer.perl --language pl"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"W łatwy sposób można odzyskać tekst źródłowy:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!echo $test_str | $TOKENIZER_SCRIPTS/tokenizer.perl --language pl | $TOKENIZER_SCRIPTS/detokenizer.perl --language pl"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"W naszym przykładzie stworzymy model językowy lowercase. Można osobno wytrenować też truecaser (osobny model do przywracania wielkości liter), jeżeli jest taka potrzeba."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!echo $test_str | $TOKENIZER_SCRIPTS/tokenizer.perl --language pl | $TOKENIZER_SCRIPTS/lowercase.perl"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!cat lalka-tom-pierwszy.txt | $TOKENIZER_SCRIPTS/tokenizer.perl --language pl | $TOKENIZER_SCRIPTS/lowercase.perl > lalka-tom-pierwszy-tokenized-lowercased.txt"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!cat lalka-tom-drugi.txt | $TOKENIZER_SCRIPTS/tokenizer.perl --language pl | $TOKENIZER_SCRIPTS/lowercase.perl > lalka-tom-drugi-tokenized-lowercased.txt"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!$KENLM_BUILD_PATH/bin/lmplz -o 4 --prune 1 1 1 1 < lalka-tom-pierwszy-tokenized-lowercased.txt > lalka_tom_pierwszy_lm.arpa"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"test_str=!(head -n 17 lalka-tom-drugi-tokenized-lowercased.txt | tail -n 1)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"test_str=test_str[0]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"test_str"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### model binarny\n",
|
|
"\n",
|
|
"Konwertując model do postaci binarnej, inferencja będzie szybsza"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!$KENLM_BUILD_PATH/bin/build_binary lalka_tom_pierwszy_lm.arpa lalka_tom_pierwszy_lm.binary"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!echo $test_str | $KENLM_BUILD_PATH/bin/query lalka_tom_pierwszy_lm.binary"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### sprawdzanie dokumentacji\n",
|
|
"\n",
|
|
"Najłatwiej sprawdzić wywołując bezpośrednio komendę"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!$KENLM_BUILD_PATH/bin/lmplz "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### wrapper pythonowy\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!pip install https://github.com/kpu/kenlm/archive/master.zip"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import kenlm\n",
|
|
"model = kenlm.Model('lalka_tom_pierwszy_lm.binary')\n",
|
|
"print(model.score(test_str, bos = True, eos = True))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"for i in model.full_scores(test_str):\n",
|
|
" print(i)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Zadanie \n",
|
|
"\n",
|
|
"Stworzyć model językowy za pomocą gotowej biblioteki (KENLM lub inna)\n",
|
|
"\n",
|
|
"Warunki zaliczenia:\n",
|
|
"- wynik widoczny na platformie zarówno dla dev i dla test\n",
|
|
"- wynik dla dev i test lepszy (niższy) niż 1024.00 (liczone przy pomocy geval)\n",
|
|
"- deadline do końca dnia 17.04\n",
|
|
"- commitując rozwiązanie proszę również umieścić rozwiązanie w pliku /run.py (czyli na szczycie katalogu). Można przekonwertować jupyter do pliku python przez File → Download as → Python. Rozwiązanie nie musi być w pythonie, może być w innym języku.\n",
|
|
"- zadania wykonujemy samodzielnie\n",
|
|
"- w nazwie commita podaj nr indeksu\n",
|
|
"- w tagach podaj kenlm!\n",
|
|
"- uwaga na specjalne znaki \\\\n w pliku 'in.tsv' oraz pierwsze kolumny pliku in.tsv (które należy usunąć)\n",
|
|
"\n",
|
|
"\n",
|
|
"Punktacja:\n",
|
|
"- podstawa: 40 punktów\n",
|
|
"- 50 punktów z najlepszy wynik z 2 grup\n",
|
|
"- 20 punktów za znalezienie się w pierwszej połowie, ale poza najlepszym wynikiem"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"author": "Jakub Pokrywka",
|
|
"email": "kubapok@wmi.amu.edu.pl",
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"lang": "pl",
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.8.3"
|
|
},
|
|
"subtitle": "0.Informacje na temat przedmiotu[ćwiczenia]",
|
|
"title": "Ekstrakcja informacji",
|
|
"year": "2021"
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 4
|
|
}
|