SystemyDialogowe-ProjektMag.../lab/08-parsing-semantyczny-uczenie(zmodyfikowany).ipynb
2022-05-31 00:23:12 +02:00

867 lines
28 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Systemy Dialogowe </h1>\n",
"<h2> 8. <i>Parsing semantyczny z wykorzystaniem technik uczenia maszynowego</i> [laboratoria]</h2> \n",
"<h3> Marek Kubis (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Parsing semantyczny z wykorzystaniem technik uczenia maszynowego\n",
"================================================================\n",
"\n",
"Wprowadzenie\n",
"------------\n",
"Problem wykrywania slotów i ich wartości w wypowiedziach użytkownika można sformułować jako zadanie\n",
"polegające na przewidywaniu dla poszczególnych słów etykiet wskazujących na to czy i do jakiego\n",
"slotu dane słowo należy.\n",
"\n",
"> chciałbym zarezerwować stolik na jutro**/day** na godzinę dwunastą**/hour** czterdzieści**/hour** pięć**/hour** na pięć**/size** osób\n",
"\n",
"Granice slotów oznacza się korzystając z wybranego schematu etykietowania.\n",
"\n",
"### Schemat IOB\n",
"\n",
"| Prefix | Znaczenie |\n",
"|:------:|:---------------------------|\n",
"| I | wnętrze slotu (inside) |\n",
"| O | poza slotem (outside) |\n",
"| B | początek slotu (beginning) |\n",
"\n",
"> chciałbym zarezerwować stolik na jutro**/B-day** na godzinę dwunastą**/B-hour** czterdzieści**/I-hour** pięć**/I-hour** na pięć**/B-size** osób\n",
"\n",
"### Schemat IOBES\n",
"\n",
"| Prefix | Znaczenie |\n",
"|:------:|:---------------------------|\n",
"| I | wnętrze slotu (inside) |\n",
"| O | poza slotem (outside) |\n",
"| B | początek slotu (beginning) |\n",
"| E | koniec slotu (ending) |\n",
"| S | pojedyncze słowo (single) |\n",
"\n",
"> chciałbym zarezerwować stolik na jutro**/S-day** na godzinę dwunastą**/B-hour** czterdzieści**/I-hour** pięć**/E-hour** na pięć**/S-size** osób\n",
"\n",
"Jeżeli dla tak sformułowanego zadania przygotujemy zbiór danych\n",
"złożony z wypowiedzi użytkownika z oznaczonymi slotami (tzw. *zbiór uczący*),\n",
"to możemy zastosować techniki (nadzorowanego) uczenia maszynowego w celu zbudowania modelu\n",
"annotującego wypowiedzi użytkownika etykietami slotów.\n",
"\n",
"Do zbudowania takiego modelu można wykorzystać między innymi:\n",
"\n",
" 1. warunkowe pola losowe (Lafferty i in.; 2001),\n",
"\n",
" 2. rekurencyjne sieci neuronowe, np. sieci LSTM (Hochreiter i Schmidhuber; 1997),\n",
"\n",
" 3. transformery (Vaswani i in., 2017).\n",
"\n",
"Przykład\n",
"--------\n",
"Skorzystamy ze zbioru danych przygotowanego przez Schustera (2019)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zbiór ten gromadzi wypowiedzi w trzech językach opisane slotami dla dwunastu ram należących do trzech dziedzin `Alarm`, `Reminder` oraz `Weather`. Dane wczytamy korzystając z biblioteki [conllu](https://pypi.org/project/conllu/)."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# text: halo\t\t\t\n",
"\n",
"# intent: hello\t\t\t\n",
"\n",
"# slots: \t\t\t\n",
"\n",
"1\thalo\thello\tNoLabel\n",
"\n",
"\t\t\t\n",
"\n",
"# text: chaciałbym pójść na premierę filmu jakie premiery są w tym tygodniu\t\t\t\n",
"\n",
"# intent: reqmore\t\t\t\n",
"\n",
"# slots: \t\t\t\n",
"\n",
"1\tchaciałbym\treqmore\tNoLabel\n",
"\n",
"2\tpójść\treqmore\tNoLabel\n",
"\n",
"3\tna\treqmore\tNoLabel\n",
"\n",
"4\tpremierę\treqmore\tNoLabel\n",
"\n",
"5\tfilmu\treqmore\tNoLabel\n",
"\n",
"6\tjakie\treqmore\tB-goal\n",
"\n",
"7\tpremiery\treqmore\tI-goal\n",
"\n"
]
}
],
"source": [
"from conllu import parse_incr\n",
"fields = ['id', 'form', 'frame', 'slot']\n",
"\n",
"def nolabel2o(line, i):\n",
" return 'O' if line[i] == 'NoLabel' else line[i]\n",
"# pathTrain = '../tasks/zad8/en/train-en.conllu'\n",
"# pathTest = '../tasks/zad8/en/test-en.conllu'\n",
"\n",
"pathTrain = '../tasks/zad8/pl/train.conllu'\n",
"pathTest = '../tasks/zad8/pl/test.conllu'\n",
"\n",
"with open(pathTrain, encoding=\"UTF-8\") as trainfile:\n",
" i=0\n",
" for line in trainfile:\n",
" print(line)\n",
" i+=1\n",
" if i==15: break \n",
" trainset = list(parse_incr(trainfile, fields=fields, field_parsers={'slot': nolabel2o}))\n",
"with open(pathTest, encoding=\"UTF-8\") as testfile:\n",
" testset = list(parse_incr(testfile, fields=fields, field_parsers={'slot': nolabel2o}))\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zobaczmy kilka przykładowych wypowiedzi z tego zbioru."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"<tbody>\n",
"<tr><td style=\"text-align: right;\">1</td><td>wybieram</td><td>inform</td><td>O </td></tr>\n",
"<tr><td style=\"text-align: right;\">2</td><td>batmana </td><td>inform</td><td>B-title</td></tr>\n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"'<table>\\n<tbody>\\n<tr><td style=\"text-align: right;\">1</td><td>wybieram</td><td>inform</td><td>O </td></tr>\\n<tr><td style=\"text-align: right;\">2</td><td>batmana </td><td>inform</td><td>B-title</td></tr>\\n</tbody>\\n</table>'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from tabulate import tabulate\n",
"tabulate(trainset[1], tablefmt='html')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"<tbody>\n",
"<tr><td style=\"text-align: right;\">1</td><td>chcę </td><td>inform</td><td>O </td></tr>\n",
"<tr><td style=\"text-align: right;\">2</td><td>zarezerwować</td><td>inform</td><td>B-goal</td></tr>\n",
"<tr><td style=\"text-align: right;\">3</td><td>bilety </td><td>inform</td><td>O </td></tr>\n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"'<table>\\n<tbody>\\n<tr><td style=\"text-align: right;\">1</td><td>chcę </td><td>inform</td><td>O </td></tr>\\n<tr><td style=\"text-align: right;\">2</td><td>zarezerwować</td><td>inform</td><td>B-goal</td></tr>\\n<tr><td style=\"text-align: right;\">3</td><td>bilety </td><td>inform</td><td>O </td></tr>\\n</tbody>\\n</table>'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tabulate(trainset[16], tablefmt='html')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"<tbody>\n",
"<tr><td style=\"text-align: right;\">1</td><td>chciałbym </td><td>inform</td><td>O</td></tr>\n",
"<tr><td style=\"text-align: right;\">2</td><td>anulować </td><td>inform</td><td>O</td></tr>\n",
"<tr><td style=\"text-align: right;\">3</td><td>rezerwację</td><td>inform</td><td>O</td></tr>\n",
"<tr><td style=\"text-align: right;\">4</td><td>biletu </td><td>inform</td><td>O</td></tr>\n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"'<table>\\n<tbody>\\n<tr><td style=\"text-align: right;\">1</td><td>chciałbym </td><td>inform</td><td>O</td></tr>\\n<tr><td style=\"text-align: right;\">2</td><td>anulować </td><td>inform</td><td>O</td></tr>\\n<tr><td style=\"text-align: right;\">3</td><td>rezerwację</td><td>inform</td><td>O</td></tr>\\n<tr><td style=\"text-align: right;\">4</td><td>biletu </td><td>inform</td><td>O</td></tr>\\n</tbody>\\n</table>'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tabulate(trainset[20], tablefmt='html')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Budując model skorzystamy z architektury opartej o rekurencyjne sieci neuronowe\n",
"zaimplementowanej w bibliotece [flair](https://github.com/flairNLP/flair) (Akbik i in. 2018)."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"from flair.data import Corpus, Sentence, Token\n",
"from flair.datasets import SentenceDataset, CSVClassificationCorpus\n",
"from flair.embeddings import StackedEmbeddings\n",
"from flair.embeddings import WordEmbeddings\n",
"from flair.embeddings import CharacterEmbeddings\n",
"from flair.embeddings import FlairEmbeddings\n",
"from flair.models import SequenceTagger\n",
"from flair.trainers import ModelTrainer\n",
"from flair.datasets import DataLoader\n",
"\n",
"# determinizacja obliczeń\n",
"import random\n",
"import torch\n",
"random.seed(42)\n",
"torch.manual_seed(42)\n",
"\n",
"if torch.cuda.is_available():\n",
" torch.cuda.manual_seed(0)\n",
" torch.cuda.manual_seed_all(0)\n",
" torch.backends.cudnn.enabled = False\n",
" torch.backends.cudnn.benchmark = False\n",
" torch.backends.cudnn.deterministic = True"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dane skonwertujemy do formatu wykorzystywanego przez `flair`, korzystając z następującej funkcji."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Corpus: 346 train + 38 dev + 32 test sentences\n",
"Dictionary with 20 tags: <unk>, O, B-interval, I-interval, B-title, B-date, I-date, B-time, B-quantity, B-area, I-area, B-goal, I-goal, I-title, I-time, I-quantity, B-seats, I-seats, <START>, <STOP>\n"
]
}
],
"source": [
"def conllu2flair(sentences, label1=None, label2=None):\n",
" fsentences = []\n",
"\n",
" for sentence in sentences:\n",
" fsentence = Sentence()\n",
"\n",
" for token in sentence:\n",
" ftoken = Token(token['form'])\n",
"\n",
" if label1:\n",
" if label2:\n",
" ftoken.add_tag(label1, token[label1] + \"/\" + token[label2])\n",
" else:\n",
" ftoken.add_tag(label1, token[label1])\n",
" \n",
" fsentence.add_token(ftoken)\n",
"\n",
" fsentences.append(fsentence)\n",
"\n",
" return SentenceDataset(fsentences)\n",
"\n",
"corpus = Corpus(train=conllu2flair(trainset, 'slot'), test=conllu2flair(testset, 'slot'))\n",
"print(corpus)\n",
"tag_dictionary = corpus.make_tag_dictionary(tag_type='slot')\n",
"print(tag_dictionary)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nasz model będzie wykorzystywał wektorowe reprezentacje słów (zob. [Word Embeddings](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_3_WORD_EMBEDDING.md))."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"embedding_types = [\n",
" WordEmbeddings('pl'),\n",
" FlairEmbeddings('polish-forward'),\n",
" FlairEmbeddings('polish-backward'),\n",
" CharacterEmbeddings(),\n",
"]\n",
"\n",
"embeddings = StackedEmbeddings(embeddings=embedding_types)\n",
"tagger = SequenceTagger(hidden_size=256, embeddings=embeddings,\n",
" tag_dictionary=tag_dictionary,\n",
" tag_type='slot', use_crf=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zobaczmy jak wygląda architektura sieci neuronowej, która będzie odpowiedzialna za przewidywanie\n",
"slotów w wypowiedziach."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"SequenceTagger(\n",
" (embeddings): StackedEmbeddings(\n",
" (list_embedding_0): WordEmbeddings('pl')\n",
" (list_embedding_1): FlairEmbeddings(\n",
" (lm): LanguageModel(\n",
" (drop): Dropout(p=0.25, inplace=False)\n",
" (encoder): Embedding(1602, 100)\n",
" (rnn): LSTM(100, 2048)\n",
" (decoder): Linear(in_features=2048, out_features=1602, bias=True)\n",
" )\n",
" )\n",
" (list_embedding_2): FlairEmbeddings(\n",
" (lm): LanguageModel(\n",
" (drop): Dropout(p=0.25, inplace=False)\n",
" (encoder): Embedding(1602, 100)\n",
" (rnn): LSTM(100, 2048)\n",
" (decoder): Linear(in_features=2048, out_features=1602, bias=True)\n",
" )\n",
" )\n",
" (list_embedding_3): CharacterEmbeddings(\n",
" (char_embedding): Embedding(275, 25)\n",
" (char_rnn): LSTM(25, 25, bidirectional=True)\n",
" )\n",
" )\n",
" (word_dropout): WordDropout(p=0.05)\n",
" (locked_dropout): LockedDropout(p=0.5)\n",
" (embedding2nn): Linear(in_features=4446, out_features=4446, bias=True)\n",
" (rnn): LSTM(4446, 256, batch_first=True, bidirectional=True)\n",
" (linear): Linear(in_features=512, out_features=78, bias=True)\n",
" (beta): 1.0\n",
" (weights): None\n",
" (weight_tensor) None\n",
")\n"
]
}
],
"source": [
"print(tagger)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wykonamy dziesięć iteracji (epok) uczenia a wynikowy model zapiszemy w katalogu `slot-model`."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [],
"source": [
"modelPath = 'slot-model/final-model.pt'\n",
"\n",
"from os.path import exists\n",
"\n",
"fileExists = exists(modelPath)\n",
"\n",
"if(not fileExists):\n",
" trainer = ModelTrainer(tagger, corpus)\n",
" trainer.train('slot-model',\n",
" learning_rate=0.1,\n",
" mini_batch_size=32,\n",
" max_epochs=10,\n",
" train_with_dev=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Jakość wyuczonego modelu możemy ocenić, korzystając z zaraportowanych powyżej metryk, tj.:\n",
"\n",
" - *tp (true positives)*\n",
"\n",
" > liczba słów oznaczonych w zbiorze testowym etykietą $e$, które model oznaczył tą etykietą\n",
"\n",
" - *fp (false positives)*\n",
"\n",
" > liczba słów nieoznaczonych w zbiorze testowym etykietą $e$, które model oznaczył tą etykietą\n",
"\n",
" - *fn (false negatives)*\n",
"\n",
" > liczba słów oznaczonych w zbiorze testowym etykietą $e$, którym model nie nadał etykiety $e$\n",
"\n",
" - *precision*\n",
"\n",
" > $$\\frac{tp}{tp + fp}$$\n",
"\n",
" - *recall*\n",
"\n",
" > $$\\frac{tp}{tp + fn}$$\n",
"\n",
" - $F_1$\n",
"\n",
" > $$\\frac{2 \\cdot precision \\cdot recall}{precision + recall}$$\n",
"\n",
" - *micro* $F_1$\n",
"\n",
" > $F_1$ w którym $tp$, $fp$ i $fn$ są liczone łącznie dla wszystkich etykiet, tj. $tp = \\sum_{e}{{tp}_e}$, $fn = \\sum_{e}{{fn}_e}$, $fp = \\sum_{e}{{fp}_e}$\n",
"\n",
" - *macro* $F_1$\n",
"\n",
" > średnia arytmetyczna z $F_1$ obliczonych dla poszczególnych etykiet z osobna.\n",
"\n",
"Wyuczony model możemy wczytać z pliku korzystając z metody `load`."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2022-05-30 22:30:48,788 loading file slot-model/final-model.pt\n"
]
}
],
"source": [
"model = SequenceTagger.load(modelPath)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wczytany model możemy wykorzystać do przewidywania slotów w wypowiedziach użytkownika, korzystając\n",
"z przedstawionej poniżej funkcji `predict`."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('co', 'O'), ('gracie', 'O'), ('obecnie', 'O')]"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def predict(model, sentence):\n",
" csentence = [{'form': word} for word in sentence]\n",
" fsentence = conllu2flair([csentence])[0]\n",
" model.predict(fsentence)\n",
" return [(token, ftoken.get_tag('slot').value) for token, ftoken in zip(sentence, fsentence)]\n",
"\n",
"predict(model, 'poprosze bilet na batman'.split())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Jak pokazuje przykład poniżej model wyuczony tylko na 100 przykładach popełnia w dosyć prostej\n",
"wypowiedzi błąd etykietując słowo `alarm` tagiem `B-weather/noun`."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"<tbody>\n",
"<tr><td>kiedy </td><td>O/reqmore</td></tr>\n",
"<tr><td>gracie</td><td>O/reqmore</td></tr>\n",
"<tr><td>film </td><td>O/reqmore</td></tr>\n",
"<tr><td>zorro </td><td>O/reqmore</td></tr>\n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"'<table>\\n<tbody>\\n<tr><td>kiedy </td><td>O/reqmore</td></tr>\\n<tr><td>gracie</td><td>O/reqmore</td></tr>\\n<tr><td>film </td><td>O/reqmore</td></tr>\\n<tr><td>zorro </td><td>O/reqmore</td></tr>\\n</tbody>\\n</table>'"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tabulate(predict(model, 'kiedy gracie film zorro'.split()), tablefmt='html')"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"KeyboardInterrupt\n",
"\n"
]
}
],
"source": [
"# evaluation\n",
"\n",
"def precision(tpScore, fpScore):\n",
" return float(tpScore) / (tpScore + fpScore)\n",
"\n",
"def recall(tpScore, fnScore):\n",
" return float(tpScore) / (tpScore + fnScore)\n",
"\n",
"def f1(precision, recall):\n",
" return 2 * precision * recall/(precision + recall)\n",
"\n",
"def eval():\n",
" tp = 0\n",
" fp = 0\n",
" fn = 0\n",
" sentences = [sentence for sentence in testset]\n",
" for sentence in sentences:\n",
" # get sentence as terms list\n",
" termsList = [w[\"form\"] for w in sentence]\n",
" # predict tags\n",
" predTags = [tag[1] for tag in predict(model, termsList)]\n",
" \n",
" # expTags = [token[\"slot\"] + \"/\" + token[\"frame\"] for token in sentence]\n",
" expTags = [token[\"slot\"] for token in sentence]\n",
"\n",
" for i in range(len(predTags)):\n",
" if (expTags[i][0] == \"O\" and expTags[i] != predTags[i]):\n",
" fp += 1\n",
" elif ((expTags[i][0] != \"O\") & (predTags[i][0] == \"O\")):\n",
" fn += 1\n",
" elif ((expTags[i][0] != \"O\") & (predTags[i] == expTags[i])):\n",
" tp += 1\n",
"\n",
" precisionScore = precision(tp, fp)\n",
" recallScore = recall(tp, fn)\n",
" f1Score = f1(precisionScore, recallScore)\n",
" print(\"stats: \")\n",
" print(\"precision: \", precisionScore)\n",
" print(\"recall: \", recallScore)\n",
" print(\"f1: \", f1Score)\n",
"\n",
"eval()\n",
"\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Literatura\n",
"----------\n",
" 1. Sebastian Schuster, Sonal Gupta, Rushin Shah, Mike Lewis, Cross-lingual Transfer Learning for Multilingual Task Oriented Dialog. NAACL-HLT (1) 2019, pp. 3795-3805\n",
" 2. John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 282289, https://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers\n",
" 3. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (November 15, 1997), 17351780, https://doi.org/10.1162/neco.1997.9.8.1735\n",
" 4. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, Attention is All you Need, NIPS 2017, pp. 5998-6008, https://arxiv.org/abs/1706.03762\n",
" 5. Alan Akbik, Duncan Blythe, Roland Vollgraf, Contextual String Embeddings for Sequence Labeling, Proceedings of the 27th International Conference on Computational Linguistics, pp. 16381649, https://www.aclweb.org/anthology/C18-1139.pdf\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Predykcja aktów mowy użytkownika"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2022-05-30 22:08:33,633 Reading data from ..\\tasks\\zad8\\pl\\dataSentence\n",
"2022-05-30 22:08:33,633 Train: ..\\tasks\\zad8\\pl\\dataSentence\\train.tsv\n",
"2022-05-30 22:08:33,634 Dev: None\n",
"2022-05-30 22:08:33,635 Test: ..\\tasks\\zad8\\pl\\dataSentence\\test.tsv\n",
"Corpus: 280 train + 31 dev + 32 test sentences\n"
]
}
],
"source": [
"def conllu2flair(sentences, label2=None):\n",
" fsentences = []\n",
"\n",
" for sentence in sentences:\n",
" fsentence = Sentence()\n",
"\n",
" for token in sentence:\n",
" ftoken = Token(token['form'])\n",
"\n",
" \n",
" if label2:\n",
" ftoken.add_tag(label2, token[label2])\n",
" \n",
" fsentence.add_token(ftoken)\n",
"\n",
" fsentences.append(fsentence)\n",
"\n",
" return SentenceDataset(fsentences)\n",
"\n",
"trainPath = \"../tasks/zad8/pl/dataSentence/train.tsv\"\n",
"testPath = \"../tasks/zad8/pl/dataSentence/test.tsv\"\n",
"dataFolder = \"../tasks/zad8/pl/dataSentence\"\n",
"column_name_map = {0: \"text\", 1: \"label_topic\"}\n",
"corpusClassification = CSVClassificationCorpus(dataFolder,\n",
" column_name_map,\n",
" skip_header=False,\n",
" delimiter='\\t',\n",
")\n",
"print(corpusClassification)"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2022-05-30 22:10:19,891 Computing label dictionary. Progress:\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 312/312 [00:04<00:00, 68.32it/s] "
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"2022-05-30 22:10:25,276 [b'inform', b'reqmore', b'hello', b'infomrm', b'reqmore inform', b'bye', b'ack', b'reqalts', b'impl-conf inform', b'help', b'request', b'affirm', b'thankyou', b'affirm inform', b'bye thankyou', b'hello inform', b'infrom', b'confirm', b'negate confirm', b'negate', b'negate ', b'deny']\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"from flair.data import Corpus\n",
"from flair.datasets import TREC_6\n",
"from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentRNNEmbeddings\n",
"from flair.models import TextClassifier\n",
"from flair.trainers import ModelTrainer\n",
"\n",
"from os.path import exists\n",
"\n",
"\n",
"# 2. create the label dictionary\n",
"label_dict = corpusClassification.make_label_dictionary()\n",
"\n",
"# 3. make a list of word embeddings\n",
"word_embeddings = [\n",
" WordEmbeddings('pl'),\n",
" FlairEmbeddings('polish-forward'),\n",
" FlairEmbeddings('polish-backward'),\n",
" CharacterEmbeddings(),\n",
"]\n",
"\n",
"# 4. initialize document embedding by passing list of word embeddings\n",
"# Can choose between many RNN types (GRU by default, to change use rnn_type parameter)\n",
"document_embeddings = DocumentRNNEmbeddings(word_embeddings, hidden_size=256)\n",
"\n",
"# 5. create the text classifier\n",
"classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)\n",
"\n",
"# 6. initialize the text classifier trainer\n",
"trainer = ModelTrainer(classifier, corpusClassification)\n",
"\n",
"modelPath = 'resources/taggers/trec/final-model.pt'\n",
"\n",
"\n",
"fileExists = exists(modelPath)\n",
"\n",
"if(not fileExists):\n",
" # 7. start the training\n",
" trainer.train('resources/taggers/trec',\n",
" learning_rate=0.1,\n",
" mini_batch_size=32,\n",
" anneal_factor=0.5,\n",
" patience=5,\n",
" max_epochs=10)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2022-05-30 22:10:47,199 loading file resources/taggers/trec/final-model.pt\n",
"[reqmore (0.5459)]\n"
]
}
],
"source": [
"classifier = TextClassifier.load(modelPath)\n",
"\n",
"# create example sentence\n",
"sentence = Sentence('Jakie filmy gracie jutro?')\n",
"\n",
"# predict class and print\n",
"classifier.predict(sentence)\n",
"\n",
"print(sentence.labels)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[inform (0.5967)]\n"
]
}
],
"source": [
"# create example sentence\n",
"sentence = Sentence('siedzenia h1 h2')\n",
"\n",
"# predict class and print\n",
"classifier.predict(sentence)\n",
"\n",
"print(sentence.labels)"
]
}
],
"metadata": {
"author": "Marek Kubis",
"email": "mkubis@amu.edu.pl",
"interpreter": {
"hash": "2f9d6cf1e3d8195079a65c851de355134a77367bcd714b1a5d498c42d3c07114"
},
"jupytext": {
"cell_metadata_filter": "-all",
"main_language": "python",
"notebook_metadata_filter": "-all"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"subtitle": "8.Parsing semantyczny z wykorzystaniem technik uczenia maszynowego[laboratoria]",
"title": "Systemy Dialogowe",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}