{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
|
||
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Dialogue Systems </h1>\n",
"<h2> 8. <i>Semantic parsing using machine learning techniques</i> [laboratory]</h2> \n",
|
||
"<h3> Marek Kubis (2021)</h3>\n",
|
||
"</div>\n",
|
||
"\n",
|
||
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"Semantic parsing using machine learning techniques\n",
"==================================================\n",
"\n",
"Introduction\n",
"------------\n",
"The problem of detecting slots and their values in user utterances can be formulated as the task of\n",
"predicting, for each word, a label that indicates whether the word belongs to a slot and, if so,\n",
"to which one.\n",
"\n",
"> chciałbym zarezerwować stolik na jutro**/day** na godzinę dwunastą**/hour** czterdzieści**/hour** pięć**/hour** na pięć**/size** osób\n",
"\n",
"(In English: *I would like to book a table for tomorrow at twelve forty-five for five people*.)\n",
"\n",
"Slot boundaries are marked using a chosen labelling scheme.\n",
"\n",
"### The IOB scheme\n",
"\n",
"| Prefix | Meaning |\n",
"|:------:|:---------------------------|\n",
"| I | inside a slot |\n",
"| O | outside any slot |\n",
"| B | beginning of a slot |\n",
"\n",
"> chciałbym zarezerwować stolik na jutro**/B-day** na godzinę dwunastą**/B-hour** czterdzieści**/I-hour** pięć**/I-hour** na pięć**/B-size** osób\n",
|
||
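"\n",
"The IOB labels above can be decoded back into slots and their values. The following is a minimal sketch in plain Python; the helper name `iob_to_slots` and its output format are assumptions made for illustration and do not come from the dataset or from any library.\n",
"\n",
"```python\n",
"def iob_to_slots(tokens, tags):\n",
"    # group IOB-tagged tokens into (slot_name, text) pairs\n",
"    slots = []\n",
"    name, words = None, []\n",
"    for token, tag in zip(tokens, tags):\n",
"        prefix, _, label = tag.partition('-')\n",
"        # an I- tag continues the open slot only if the slot name matches\n",
"        if prefix == 'I' and label == name:\n",
"            words.append(token)\n",
"            continue\n",
"        # any other tag closes the slot that is currently open\n",
"        if name is not None:\n",
"            slots.append((name, ' '.join(words)))\n",
"            name, words = None, []\n",
"        # B- opens a new slot; a stray I- is leniently treated as an opening tag too\n",
"        if prefix in ('B', 'I'):\n",
"            name, words = label, [token]\n",
"    if name is not None:\n",
"        slots.append((name, ' '.join(words)))\n",
"    return slots\n",
"\n",
"tokens = 'chciałbym zarezerwować stolik na jutro na godzinę dwunastą czterdzieści pięć na pięć osób'.split()\n",
"tags = ['O', 'O', 'O', 'O', 'B-day', 'O', 'O', 'B-hour', 'I-hour', 'I-hour', 'O', 'B-size', 'O']\n",
"print(iob_to_slots(tokens, tags))\n",
"```\n",
"\n",
"For the example utterance this yields one `day` slot, one three-word `hour` slot and one `size` slot.\n",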
"\n",
"### The IOBES scheme\n",
"\n",
"| Prefix | Meaning |\n",
"|:------:|:---------------------------|\n",
"| I | inside a slot |\n",
"| O | outside any slot |\n",
"| B | beginning of a slot |\n",
"| E | end of a slot |\n",
"| S | single-word slot |\n",
"\n",
"> chciałbym zarezerwować stolik na jutro**/S-day** na godzinę dwunastą**/B-hour** czterdzieści**/I-hour** pięć**/E-hour** na pięć**/S-size** osób\n",
|
||
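"\n",
"Both schemes encode the same slot boundaries, so IOB labels can be converted to IOBES mechanically. The sketch below shows one way to do it; the function name `iob_to_iobes` is an assumption made for illustration, and the code expects well-formed IOB input.\n",
"\n",
"```python\n",
"def iob_to_iobes(tags):\n",
"    # convert a sequence of IOB tags into IOBES tags\n",
"    iobes = []\n",
"    for i, tag in enumerate(tags):\n",
"        if tag == 'O':\n",
"            iobes.append(tag)\n",
"            continue\n",
"        prefix, label = tag.split('-', 1)\n",
"        next_tag = tags[i + 1] if i + 1 < len(tags) else 'O'\n",
"        # the slot continues only if the next tag is I- with the same slot name\n",
"        continues = next_tag == 'I-' + label\n",
"        if prefix == 'B':\n",
"            iobes.append(('B-' if continues else 'S-') + label)\n",
"        else:\n",
"            iobes.append(('I-' if continues else 'E-') + label)\n",
"    return iobes\n",
"\n",
"iob = ['O', 'O', 'O', 'O', 'B-day', 'O', 'O', 'B-hour', 'I-hour', 'I-hour', 'O', 'B-size', 'O']\n",
"print(iob_to_iobes(iob))\n",
"```\n",
"\n",
"Applied to the IOB labels of the example utterance, the function reproduces the IOBES labels shown above.\n",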
"\n",
"If we prepare a dataset for this task consisting of user utterances annotated with slots\n",
"(a so-called *training set*), we can apply (supervised) machine learning techniques to build a model\n",
"that annotates user utterances with slot labels.\n",
"\n",
"Such a model can be built using, among others:\n",
"\n",
" 1. conditional random fields (Lafferty et al., 2001),\n",
"\n",
" 2. recurrent neural networks, e.g. LSTM networks (Hochreiter and Schmidhuber, 1997),\n",
"\n",
" 3. transformers (Vaswani et al., 2017).\n",
"\n",
"Example\n",
"-------\n",
"We will use the dataset prepared by Schuster et al. (2019)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"This dataset collects utterances in three languages, annotated with slots for twelve frames that belong to three domains: `Alarm`, `Reminder` and `Weather`. We will load the data using the [conllu](https://pypi.org/project/conllu/) library."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"# text: halo\t\t\t\n",
|
||
"\n",
|
||
"# intent: hello\t\t\t\n",
|
||
"\n",
|
||
"# slots: \t\t\t\n",
|
||
"\n",
|
||
"1\thalo\thello\tNoLabel\n",
|
||
"\n",
|
||
"\t\t\t\n",
|
||
"\n",
|
||
"# text: chaciałbym pójść na premierę filmu jakie premiery są w tym tygodniu\t\t\t\n",
|
||
"\n",
|
||
"# intent: reqmore\t\t\t\n",
|
||
"\n",
|
||
"# slots: \t\t\t\n",
|
||
"\n",
|
||
"1\tchaciałbym\treqmore\tNoLabel\n",
|
||
"\n",
|
||
"2\tpójść\treqmore\tNoLabel\n",
|
||
"\n",
|
||
"3\tna\treqmore\tNoLabel\n",
|
||
"\n",
|
||
"4\tpremierę\treqmore\tNoLabel\n",
|
||
"\n",
|
||
"5\tfilmu\treqmore\tNoLabel\n",
|
||
"\n",
|
||
"6\tjakie\treqmore\tB-goal\n",
|
||
"\n",
|
||
"7\tpremiery\treqmore\tI-goal\n",
|
||
"\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
"from conllu import parse_incr\n",
"fields = ['id', 'form', 'frame', 'slot']\n",
"\n",
"def nolabel2o(line, i):\n",
"    # map the corpus placeholder 'NoLabel' to the IOB outside tag 'O'\n",
"    return 'O' if line[i] == 'NoLabel' else line[i]\n",
"# pathTrain = '../tasks/zad8/en/train-en.conllu'\n",
"# pathTest = '../tasks/zad8/en/test-en.conllu'\n",
"\n",
"pathTrain = '../tasks/zad8/pl/train.conllu'\n",
"pathTest = '../tasks/zad8/pl/test.conllu'\n",
"\n",
"with open(pathTrain, encoding=\"UTF-8\") as trainfile:\n",
"    # preview the first 15 lines of the raw file\n",
"    for i, line in enumerate(trainfile):\n",
"        if i == 15: break\n",
"        print(line)\n",
"    # rewind, otherwise parsing would resume after the preview and drop the first sentences\n",
"    trainfile.seek(0)\n",
"    trainset = list(parse_incr(trainfile, fields=fields, field_parsers={'slot': nolabel2o}))\n",
"with open(pathTest, encoding=\"UTF-8\") as testfile:\n",
"    testset = list(parse_incr(testfile, fields=fields, field_parsers={'slot': nolabel2o}))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"Let us look at a few example utterances from this dataset."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table>\n",
|
||
"<tbody>\n",
|
||
"<tr><td style=\"text-align: right;\">1</td><td>wybieram</td><td>inform</td><td>O </td></tr>\n",
|
||
"<tr><td style=\"text-align: right;\">2</td><td>batmana </td><td>inform</td><td>B-title</td></tr>\n",
|
||
"</tbody>\n",
|
||
"</table>"
|
||
],
|
||
"text/plain": [
|
||
"'<table>\\n<tbody>\\n<tr><td style=\"text-align: right;\">1</td><td>wybieram</td><td>inform</td><td>O </td></tr>\\n<tr><td style=\"text-align: right;\">2</td><td>batmana </td><td>inform</td><td>B-title</td></tr>\\n</tbody>\\n</table>'"
|
||
]
|
||
},
|
||
"execution_count": 2,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"from tabulate import tabulate\n",
|
||
"tabulate(trainset[1], tablefmt='html')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table>\n",
|
||
"<tbody>\n",
|
||
"<tr><td style=\"text-align: right;\">1</td><td>chcę </td><td>inform</td><td>O </td></tr>\n",
|
||
"<tr><td style=\"text-align: right;\">2</td><td>zarezerwować</td><td>inform</td><td>B-goal</td></tr>\n",
|
||
"<tr><td style=\"text-align: right;\">3</td><td>bilety </td><td>inform</td><td>O </td></tr>\n",
|
||
"</tbody>\n",
|
||
"</table>"
|
||
],
|
||
"text/plain": [
|
||
"'<table>\\n<tbody>\\n<tr><td style=\"text-align: right;\">1</td><td>chcę </td><td>inform</td><td>O </td></tr>\\n<tr><td style=\"text-align: right;\">2</td><td>zarezerwować</td><td>inform</td><td>B-goal</td></tr>\\n<tr><td style=\"text-align: right;\">3</td><td>bilety </td><td>inform</td><td>O </td></tr>\\n</tbody>\\n</table>'"
|
||
]
|
||
},
|
||
"execution_count": 3,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"tabulate(trainset[16], tablefmt='html')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table>\n",
|
||
"<tbody>\n",
|
||
"<tr><td style=\"text-align: right;\">1</td><td>chciałbym </td><td>inform</td><td>O</td></tr>\n",
|
||
"<tr><td style=\"text-align: right;\">2</td><td>anulować </td><td>inform</td><td>O</td></tr>\n",
|
||
"<tr><td style=\"text-align: right;\">3</td><td>rezerwację</td><td>inform</td><td>O</td></tr>\n",
|
||
"<tr><td style=\"text-align: right;\">4</td><td>biletu </td><td>inform</td><td>O</td></tr>\n",
|
||
"</tbody>\n",
|
||
"</table>"
|
||
],
|
||
"text/plain": [
|
||
"'<table>\\n<tbody>\\n<tr><td style=\"text-align: right;\">1</td><td>chciałbym </td><td>inform</td><td>O</td></tr>\\n<tr><td style=\"text-align: right;\">2</td><td>anulować </td><td>inform</td><td>O</td></tr>\\n<tr><td style=\"text-align: right;\">3</td><td>rezerwację</td><td>inform</td><td>O</td></tr>\\n<tr><td style=\"text-align: right;\">4</td><td>biletu </td><td>inform</td><td>O</td></tr>\\n</tbody>\\n</table>'"
|
||
]
|
||
},
|
||
"execution_count": 4,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"tabulate(trainset[20], tablefmt='html')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"To build the model we will use an architecture based on recurrent neural networks,\n",
"implemented in the [flair](https://github.com/flairNLP/flair) library (Akbik et al., 2018)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"from flair.data import Corpus, Sentence, Token\n",
|
||
"from flair.datasets import SentenceDataset, CSVClassificationCorpus\n",
|
||
"from flair.embeddings import StackedEmbeddings\n",
|
||
"from flair.embeddings import WordEmbeddings\n",
|
||
"from flair.embeddings import CharacterEmbeddings\n",
|
||
"from flair.embeddings import FlairEmbeddings\n",
|
||
"from flair.models import SequenceTagger\n",
|
||
"from flair.trainers import ModelTrainer\n",
|
||
"from flair.datasets import DataLoader\n",
|
||
"import flair\n",
"# make the computations deterministic\n",
"import random\n",
"import torch\n",
"random.seed(42)\n",
"torch.manual_seed(42)\n",
"\n",
"if torch.cuda.is_available():\n",
"    torch.cuda.manual_seed(0)\n",
"    torch.cuda.manual_seed_all(0)\n",
"    torch.backends.cudnn.enabled = False\n",
"    torch.backends.cudnn.benchmark = False\n",
"    torch.backends.cudnn.deterministic = True"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'0.6.1'"
|
||
]
|
||
},
|
||
"execution_count": 5,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"flair.__version__\n",
|
||
"# Python 3.8.3 "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"We will convert the data to the format used by `flair` with the following function."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 30,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Corpus: 346 train + 38 dev + 32 test sentences\n",
|
||
"Dictionary with 20 tags: <unk>, O, B-interval, I-interval, B-title, B-date, I-date, B-time, B-quantity, B-area, I-area, B-goal, I-goal, I-title, I-time, I-quantity, B-seats, I-seats, <START>, <STOP>\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
"def conllu2flair(sentences, label1=None, label2=None):\n",
"    fsentences = []\n",
"\n",
"    for sentence in sentences:\n",
"        fsentence = Sentence()\n",
"\n",
"        for token in sentence:\n",
"            ftoken = Token(token['form'])\n",
"\n",
"            # optionally tag the token with label1, or with the combined label1/label2 value\n",
"            if label1:\n",
"                if label2:\n",
"                    ftoken.add_tag(label1, token[label1] + \"/\" + token[label2])\n",
"                else:\n",
"                    ftoken.add_tag(label1, token[label1])\n",
"\n",
"            fsentence.add_token(ftoken)\n",
"\n",
"        fsentences.append(fsentence)\n",
"\n",
"    return SentenceDataset(fsentences)\n",
"\n",
"corpus = Corpus(train=conllu2flair(trainset, 'slot'), test=conllu2flair(testset, 'slot'))\n",
"print(corpus)\n",
"tag_dictionary = corpus.make_tag_dictionary(tag_type='slot')\n",
"print(tag_dictionary)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"Our model will use vector representations of words (see [Word Embeddings](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_3_WORD_EMBEDDING.md))."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 31,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"embedding_types = [\n",
"    WordEmbeddings('pl'),\n",
"    FlairEmbeddings('polish-forward'),\n",
"    FlairEmbeddings('polish-backward'),\n",
"    CharacterEmbeddings(),\n",
"]\n",
"\n",
"embeddings = StackedEmbeddings(embeddings=embedding_types)\n",
"tagger = SequenceTagger(hidden_size=256, embeddings=embeddings,\n",
"                        tag_dictionary=tag_dictionary,\n",
"                        tag_type='slot', use_crf=True)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"Let us see what the architecture of the neural network responsible for predicting\n",
"slots in utterances looks like."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"SequenceTagger(\n",
|
||
" (embeddings): StackedEmbeddings(\n",
|
||
" (list_embedding_0): WordEmbeddings('pl')\n",
|
||
" (list_embedding_1): FlairEmbeddings(\n",
|
||
" (lm): LanguageModel(\n",
|
||
" (drop): Dropout(p=0.25, inplace=False)\n",
|
||
" (encoder): Embedding(1602, 100)\n",
|
||
" (rnn): LSTM(100, 2048)\n",
|
||
" (decoder): Linear(in_features=2048, out_features=1602, bias=True)\n",
|
||
" )\n",
|
||
" )\n",
|
||
" (list_embedding_2): FlairEmbeddings(\n",
|
||
" (lm): LanguageModel(\n",
|
||
" (drop): Dropout(p=0.25, inplace=False)\n",
|
||
" (encoder): Embedding(1602, 100)\n",
|
||
" (rnn): LSTM(100, 2048)\n",
|
||
" (decoder): Linear(in_features=2048, out_features=1602, bias=True)\n",
|
||
" )\n",
|
||
" )\n",
|
||
" (list_embedding_3): CharacterEmbeddings(\n",
|
||
" (char_embedding): Embedding(275, 25)\n",
|
||
" (char_rnn): LSTM(25, 25, bidirectional=True)\n",
|
||
" )\n",
|
||
" )\n",
|
||
" (word_dropout): WordDropout(p=0.05)\n",
|
||
" (locked_dropout): LockedDropout(p=0.5)\n",
|
||
" (embedding2nn): Linear(in_features=4446, out_features=4446, bias=True)\n",
|
||
" (rnn): LSTM(4446, 256, batch_first=True, bidirectional=True)\n",
|
||
" (linear): Linear(in_features=512, out_features=78, bias=True)\n",
|
||
" (beta): 1.0\n",
|
||
" (weights): None\n",
|
||
" (weight_tensor) None\n",
|
||
")\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(tagger)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"We will run ten training iterations (epochs) and save the resulting model in the `slot-model` directory. Training is skipped if a saved model already exists there."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 44,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
"modelPath = 'slot-model/final-model.pt'\n",
"\n",
"from os.path import exists\n",
"\n",
"fileExists = exists(modelPath)\n",
"\n",
"# train only if no previously saved model is available\n",
"if not fileExists:\n",
"    trainer = ModelTrainer(tagger, corpus)\n",
"    trainer.train('slot-model',\n",
"                  learning_rate=0.1,\n",
"                  mini_batch_size=32,\n",
"                  max_epochs=10,\n",
"                  train_with_dev=False)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"The quality of the trained model can be assessed using the metrics reported during training, i.e.:\n",
"\n",
" - *tp (true positives)*\n",
"\n",
" > the number of words labelled $e$ in the test set that the model also labelled $e$\n",
"\n",
" - *fp (false positives)*\n",
"\n",
" > the number of words not labelled $e$ in the test set that the model labelled $e$\n",
"\n",
" - *fn (false negatives)*\n",
"\n",
" > the number of words labelled $e$ in the test set to which the model did not assign the label $e$\n",
"\n",
" - *precision*\n",
"\n",
" > $$\\frac{tp}{tp + fp}$$\n",
"\n",
" - *recall*\n",
"\n",
" > $$\\frac{tp}{tp + fn}$$\n",
"\n",
" - $F_1$\n",
"\n",
" > $$\\frac{2 \\cdot precision \\cdot recall}{precision + recall}$$\n",
"\n",
" - *micro* $F_1$\n",
"\n",
" > $F_1$ in which $tp$, $fp$ and $fn$ are counted jointly over all labels, i.e. $tp = \\sum_{e}{{tp}_e}$, $fn = \\sum_{e}{{fn}_e}$, $fp = \\sum_{e}{{fp}_e}$\n",
"\n",
" - *macro* $F_1$\n",
"\n",
" > the arithmetic mean of the $F_1$ scores computed separately for each label.\n",
|
||
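"\n",
"The definitions above can be turned into code directly. The sketch below is an illustration only: the function name `f1_scores` and the toy label sequences are made up, `O` is treated as an ordinary label, and precision, recall and $F_1$ are set to zero whenever their denominator is zero.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"def f1_scores(expected, predicted):\n",
"    # per-label counts of true positives, false positives and false negatives\n",
"    tp, fp, fn = Counter(), Counter(), Counter()\n",
"    for exp, pred in zip(expected, predicted):\n",
"        if exp == pred:\n",
"            tp[exp] += 1\n",
"        else:\n",
"            fp[pred] += 1  # the model used label pred where it does not belong\n",
"            fn[exp] += 1   # the model missed an occurrence of label exp\n",
"\n",
"    def f1(tp_, fp_, fn_):\n",
"        precision = tp_ / (tp_ + fp_) if tp_ + fp_ else 0.0\n",
"        recall = tp_ / (tp_ + fn_) if tp_ + fn_ else 0.0\n",
"        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0\n",
"\n",
"    labels = sorted(set(expected) | set(predicted))\n",
"    per_label = {e: f1(tp[e], fp[e], fn[e]) for e in labels}\n",
"    # micro: pool the counts over all labels; macro: average the per-label scores\n",
"    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))\n",
"    macro = sum(per_label.values()) / len(per_label)\n",
"    return per_label, micro, macro\n",
"\n",
"expected = ['O', 'B-title', 'I-title', 'O', 'B-date']\n",
"predicted = ['O', 'B-title', 'O', 'O', 'B-date']\n",
"per_label, micro, macro = f1_scores(expected, predicted)\n",
"print(per_label, round(micro, 2), round(macro, 2))\n",
"```\n",
"\n",
"In this toy example the single missed `I-title` pulls macro $F_1$ below micro $F_1$, because every label contributes equally to the macro average regardless of how often it occurs.\n",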
"\n",
"The trained model can be loaded from a file with the `load` method."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 45,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"2022-05-30 22:30:48,788 loading file slot-model/final-model.pt\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"model = SequenceTagger.load(modelPath)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"The loaded model can be used to predict slots in user utterances with the `predict` function\n",
"shown below."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 47,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"[('poprosze', 'O'), ('bilet', 'O'), ('na', 'O'), ('batman', 'B-title')]"
|
||
]
|
||
},
|
||
"execution_count": 47,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
"def predict(model, sentence):\n",
"    # wrap the list of words in a single flair sentence and tag it in place\n",
"    csentence = [{'form': word} for word in sentence]\n",
"    fsentence = conllu2flair([csentence])[0]\n",
"    model.predict(fsentence)\n",
"    return [(token, ftoken.get_tag('slot').value) for token, ftoken in zip(sentence, fsentence)]\n",
"\n",
"predict(model, 'poprosze bilet na batman'.split())"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"As the example below shows, a model trained on only a few hundred examples makes a mistake on a fairly\n",
"simple utterance: it labels every word, including the film title `zorro`, as lying outside any slot (`O`)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 12,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<table>\n",
|
||
"<tbody>\n",
|
||
"<tr><td>kiedy </td><td>O/reqmore</td></tr>\n",
|
||
"<tr><td>gracie</td><td>O/reqmore</td></tr>\n",
|
||
"<tr><td>film </td><td>O/reqmore</td></tr>\n",
|
||
"<tr><td>zorro </td><td>O/reqmore</td></tr>\n",
|
||
"</tbody>\n",
|
||
"</table>"
|
||
],
|
||
"text/plain": [
|
||
"'<table>\\n<tbody>\\n<tr><td>kiedy </td><td>O/reqmore</td></tr>\\n<tr><td>gracie</td><td>O/reqmore</td></tr>\\n<tr><td>film </td><td>O/reqmore</td></tr>\\n<tr><td>zorro </td><td>O/reqmore</td></tr>\\n</tbody>\\n</table>'"
|
||
]
|
||
},
|
||
"execution_count": 12,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"tabulate(predict(model, 'kiedy gracie film zorro'.split()), tablefmt='html')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 43,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"\n",
|
||
"KeyboardInterrupt\n",
|
||
"\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
"# evaluation\n",
"\n",
"def precision(tpScore, fpScore):\n",
"    # return 0.0 instead of dividing by zero when there are no positive predictions\n",
"    return tpScore / (tpScore + fpScore) if tpScore + fpScore else 0.0\n",
"\n",
"def recall(tpScore, fnScore):\n",
"    return tpScore / (tpScore + fnScore) if tpScore + fnScore else 0.0\n",
"\n",
"def f1(precisionScore, recallScore):\n",
"    if precisionScore + recallScore == 0:\n",
"        return 0.0\n",
"    return 2 * precisionScore * recallScore / (precisionScore + recallScore)\n",
"\n",
"# named evaluate rather than eval, so that the Python built-in is not shadowed\n",
"def evaluate():\n",
"    tp = 0\n",
"    fp = 0\n",
"    fn = 0\n",
"    for sentence in testset:\n",
"        # get sentence as terms list\n",
"        termsList = [w[\"form\"] for w in sentence]\n",
"        # predict tags\n",
"        predTags = [tag[1] for tag in predict(model, termsList)]\n",
"\n",
"        # expTags = [token[\"slot\"] + \"/\" + token[\"frame\"] for token in sentence]\n",
"        expTags = [token[\"slot\"] for token in sentence]\n",
"\n",
"        for expTag, predTag in zip(expTags, predTags):\n",
"            if expTag == predTag:\n",
"                # a correctly predicted slot label; agreement on 'O' is not counted\n",
"                if expTag != \"O\":\n",
"                    tp += 1\n",
"            else:\n",
"                # a wrong slot label is a false positive for the predicted label ...\n",
"                if predTag != \"O\":\n",
"                    fp += 1\n",
"                # ... and a false negative for the expected label\n",
"                if expTag != \"O\":\n",
"                    fn += 1\n",
"\n",
"    precisionScore = precision(tp, fp)\n",
"    recallScore = recall(tp, fn)\n",
"    f1Score = f1(precisionScore, recallScore)\n",
"    print(\"stats: \")\n",
"    print(\"precision: \", precisionScore)\n",
"    print(\"recall: \", recallScore)\n",
"    print(\"f1: \", f1Score)\n",
"\n",
"evaluate()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"References\n",
|
||
"----------\n",
|
||
" 1. Sebastian Schuster, Sonal Gupta, Rushin Shah, Mike Lewis, Cross-lingual Transfer Learning for Multilingual Task Oriented Dialog. NAACL-HLT (1) 2019, pp. 3795-3805\n",
|
||
" 2. John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 282–289, https://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers\n",
|
||
" 3. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (November 15, 1997), 1735–1780, https://doi.org/10.1162/neco.1997.9.8.1735\n",
|
||
" 4. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, Attention is All you Need, NIPS 2017, pp. 5998-6008, https://arxiv.org/abs/1706.03762\n",
|
||
" 5. Alan Akbik, Duncan Blythe, Roland Vollgraf, Contextual String Embeddings for Sequence Labeling, Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638–1649, https://www.aclweb.org/anthology/C18-1139.pdf\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
"### Predicting user speech acts"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 36,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"2022-05-30 22:08:33,633 Reading data from ..\\tasks\\zad8\\pl\\dataSentence\n",
|
||
"2022-05-30 22:08:33,633 Train: ..\\tasks\\zad8\\pl\\dataSentence\\train.tsv\n",
|
||
"2022-05-30 22:08:33,634 Dev: None\n",
|
||
"2022-05-30 22:08:33,635 Test: ..\\tasks\\zad8\\pl\\dataSentence\\test.tsv\n",
|
||
"Corpus: 280 train + 31 dev + 32 test sentences\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
"def conllu2flair(sentences, label2=None):\n",
"    fsentences = []\n",
"\n",
"    for sentence in sentences:\n",
"        fsentence = Sentence()\n",
"\n",
"        for token in sentence:\n",
"            ftoken = Token(token['form'])\n",
"\n",
"            if label2:\n",
"                ftoken.add_tag(label2, token[label2])\n",
"\n",
"            fsentence.add_token(ftoken)\n",
"\n",
"        fsentences.append(fsentence)\n",
"\n",
"    return SentenceDataset(fsentences)\n",
"\n",
"trainPath = \"../tasks/zad8/pl/dataSentence/train.tsv\"\n",
"testPath = \"../tasks/zad8/pl/dataSentence/test.tsv\"\n",
"dataFolder = \"../tasks/zad8/pl/dataSentence\"\n",
"column_name_map = {0: \"text\", 1: \"label_topic\"}\n",
"corpusClassification = CSVClassificationCorpus(dataFolder,\n",
"                                               column_name_map,\n",
"                                               skip_header=False,\n",
"                                               delimiter='\\t',\n",
")\n",
"print(corpusClassification)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 40,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"2022-05-30 22:10:19,891 Computing label dictionary. Progress:\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"100%|██████████| 312/312 [00:04<00:00, 68.32it/s] "
|
||
]
|
||
},
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"2022-05-30 22:10:25,276 [b'inform', b'reqmore', b'hello', b'infomrm', b'reqmore inform', b'bye', b'ack', b'reqalts', b'impl-conf inform', b'help', b'request', b'affirm', b'thankyou', b'affirm inform', b'bye thankyou', b'hello inform', b'infrom', b'confirm', b'negate confirm', b'negate', b'negate ', b'deny']\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
"from flair.data import Corpus\n",
"from flair.embeddings import WordEmbeddings, FlairEmbeddings, CharacterEmbeddings, DocumentRNNEmbeddings\n",
"from flair.models import TextClassifier\n",
"from flair.trainers import ModelTrainer\n",
"\n",
"from os.path import exists\n",
"\n",
"\n",
"# 2. create the label dictionary\n",
"label_dict = corpusClassification.make_label_dictionary()\n",
"\n",
"# 3. make a list of word embeddings\n",
"word_embeddings = [\n",
"    WordEmbeddings('pl'),\n",
"    FlairEmbeddings('polish-forward'),\n",
"    FlairEmbeddings('polish-backward'),\n",
"    CharacterEmbeddings(),\n",
"]\n",
"\n",
"# 4. initialize document embedding by passing list of word embeddings\n",
"# Can choose between many RNN types (GRU by default, to change use rnn_type parameter)\n",
"document_embeddings = DocumentRNNEmbeddings(word_embeddings, hidden_size=256)\n",
"\n",
"# 5. create the text classifier\n",
"classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)\n",
"\n",
"# 6. initialize the text classifier trainer\n",
"trainer = ModelTrainer(classifier, corpusClassification)\n",
"\n",
"modelPath = 'resources/taggers/trec/final-model.pt'\n",
"\n",
"\n",
"fileExists = exists(modelPath)\n",
"\n",
"if not fileExists:\n",
"    # 7. start the training (skipped when a saved model already exists)\n",
"    trainer.train('resources/taggers/trec',\n",
"                  learning_rate=0.1,\n",
"                  mini_batch_size=32,\n",
"                  anneal_factor=0.5,\n",
"                  patience=5,\n",
"                  max_epochs=10)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 41,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"2022-05-30 22:10:47,199 loading file resources/taggers/trec/final-model.pt\n",
|
||
"[reqmore (0.5459)]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"classifier = TextClassifier.load(modelPath)\n",
|
||
"\n",
|
||
"# create example sentence\n",
|
||
"sentence = Sentence('Jakie filmy gracie jutro?')\n",
|
||
"\n",
|
||
"# predict class and print\n",
|
||
"classifier.predict(sentence)\n",
|
||
"\n",
|
||
"print(sentence.labels)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 42,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[inform (0.5967)]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# create example sentence\n",
|
||
"sentence = Sentence('siedzenia h1 h2')\n",
|
||
"\n",
|
||
"# predict class and print\n",
|
||
"classifier.predict(sentence)\n",
|
||
"\n",
|
||
"print(sentence.labels)"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"author": "Marek Kubis",
|
||
"email": "mkubis@amu.edu.pl",
|
||
"interpreter": {
|
||
"hash": "2f9d6cf1e3d8195079a65c851de355134a77367bcd714b1a5d498c42d3c07114"
|
||
},
|
||
"jupytext": {
|
||
"cell_metadata_filter": "-all",
|
||
"main_language": "python",
|
||
"notebook_metadata_filter": "-all"
|
||
},
|
||
"kernelspec": {
|
||
"display_name": "Python 3 (ipykernel)",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"lang": "pl",
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.8.3"
|
||
},
|
||
"subtitle": "8.Parsing semantyczny z wykorzystaniem technik uczenia maszynowego[laboratoria]",
|
||
"title": "Systemy Dialogowe",
|
||
"year": "2021"
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 4
|
||
}
|