{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Dialogue Systems </h1>\n",
"<h2> 8. <i>Semantic parsing using machine learning techniques</i> [lab]</h2>\n",
"<h3> Marek Kubis (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Semantic parsing using machine learning techniques\n",
"==================================================\n",
"\n",
"Introduction\n",
"------------\n",
"The problem of detecting slots and their values in user utterances can be formulated as the task\n",
"of predicting, for each word, a label that indicates whether the word belongs to a slot and,\n",
"if so, to which one.\n",
"\n",
"> I would like to book a table for tomorrow**/day** at twelve**/hour** forty**/hour** five**/hour** for five**/size** people\n",
"\n",
"Slot boundaries are marked using a chosen labelling scheme.\n",
"\n",
"### IOB scheme\n",
"\n",
"| Prefix | Meaning |\n",
"|:------:|:---------------------------|\n",
"| I | inside a slot |\n",
"| O | outside any slot |\n",
"| B | beginning of a slot |\n",
"\n",
"> I would like to book a table for tomorrow**/B-day** at twelve**/B-hour** forty**/I-hour** five**/I-hour** for five**/B-size** people\n",
"\n",
"### IOBES scheme\n",
"\n",
"| Prefix | Meaning |\n",
"|:------:|:---------------------------|\n",
"| I | inside a slot |\n",
"| O | outside any slot |\n",
"| B | beginning of a slot |\n",
"| E | end of a slot |\n",
"| S | single-word slot |\n",
"\n",
"> I would like to book a table for tomorrow**/S-day** at twelve**/B-hour** forty**/I-hour** five**/E-hour** for five**/S-size** people\n",
"\n",
"If we prepare a dataset for this task, consisting of user utterances with annotated slots\n",
"(the so-called *training set*), we can apply (supervised) machine learning techniques\n",
"to build a model that annotates user utterances with slot labels.\n",
"\n",
"Such a model can be built using, among others:\n",
"\n",
" 1. conditional random fields (Lafferty et al., 2001),\n",
"\n",
" 2. recurrent neural networks, e.g. LSTM networks (Hochreiter and Schmidhuber, 1997),\n",
"\n",
" 3. transformers (Vaswani et al., 2017).\n",
"\n",
"Example\n",
"-------\n",
"We will use the dataset prepared by Schuster et al. (2019)."
]
},
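{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before turning to the data, the relation between the two labelling schemes introduced above can be illustrated with a short conversion routine. This is only a minimal sketch: the function name `iob_to_iobes` is hypothetical and the code assumes a well-formed IOB sequence in which every slot starts with a `B-` tag.\n",
"\n",
"```python\n",
"def iob_to_iobes(tags):\n",
"    # Rewrite each IOB tag depending on whether the slot continues on the next word.\n",
"    iobes = []\n",
"    for i, tag in enumerate(tags):\n",
"        if tag == 'O':\n",
"            iobes.append(tag)\n",
"            continue\n",
"        prefix, slot = tag.split('-', 1)\n",
"        next_tag = tags[i + 1] if i + 1 < len(tags) else 'O'\n",
"        # the slot continues only if the next tag is I- with the same slot name\n",
"        continues = next_tag == 'I-' + slot\n",
"        if prefix == 'B':\n",
"            iobes.append(('B-' if continues else 'S-') + slot)\n",
"        else:  # prefix == 'I'\n",
"            iobes.append(('I-' if continues else 'E-') + slot)\n",
"    return iobes\n",
"\n",
"iob = ['B-day', 'O', 'B-hour', 'I-hour', 'I-hour', 'O', 'B-size', 'O']\n",
"print(iob_to_iobes(iob))\n",
"# ['S-day', 'O', 'B-hour', 'I-hour', 'E-hour', 'O', 'S-size', 'O']\n",
"```\n",
"\n",
"Single-word slots become `S-` tags and the last word of a longer slot becomes an `E-` tag, which matches the difference between the two annotated examples above."
]
},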
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!mkdir -p l07\n",
"%cd l07\n",
"!curl -L -C - https://fb.me/multilingual_task_oriented_data -o data.zip\n",
"!unzip data.zip\n",
"%cd .."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataset contains utterances in three languages annotated with slots for twelve frames that belong to three domains: `Alarm`, `Reminder` and `Weather`. We will load the data using the [conllu](https://pypi.org/project/conllu/) library."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from conllu import parse_incr\n",
"\n",
"# names assigned to the consecutive columns of the data files\n",
"fields = ['id', 'form', 'frame', 'slot']\n",
"\n",
"def nolabel2o(line, i):\n",
"    # map the dataset's 'NoLabel' marker to the 'O' (outside) tag\n",
"    return 'O' if line[i] == 'NoLabel' else line[i]\n",
"\n",
"with open('l07/en/train-en.conllu', encoding='utf-8') as trainfile:\n",
"    trainset = list(parse_incr(trainfile, fields=fields, field_parsers={'slot': nolabel2o}))\n",
"with open('l07/en/test-en.conllu', encoding='utf-8') as testfile:\n",
"    testset = list(parse_incr(testfile, fields=fields, field_parsers={'slot': nolabel2o}))"
]
},
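{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the loading step less opaque, the following standard-library sketch mimics what happens to each row. It is not the `conllu` implementation: the helper `read_sentences` is hypothetical, the sample rows are made up, and the code assumes tab-separated columns with blank lines between sentences.\n",
"\n",
"```python\n",
"def read_sentences(lines, fields=('id', 'form', 'frame', 'slot')):\n",
"    # Collect tokens until a blank line closes the current sentence.\n",
"    sentence = []\n",
"    for line in lines:\n",
"        line = line.strip()\n",
"        if not line:\n",
"            if sentence:\n",
"                yield sentence\n",
"                sentence = []\n",
"        elif not line.startswith('#'):\n",
"            token = dict(zip(fields, line.split('\\t')))\n",
"            # same normalisation as nolabel2o: 'NoLabel' becomes 'O'\n",
"            if token.get('slot') == 'NoLabel':\n",
"                token['slot'] = 'O'\n",
"            sentence.append(token)\n",
"    if sentence:\n",
"        yield sentence\n",
"\n",
"# made-up rows, not copied from the dataset\n",
"sample = [\n",
"    '1\\tset\\tdemo/frame\\tNoLabel',\n",
"    '2\\talarm\\tdemo/frame\\tNoLabel',\n",
"    '3\\ttomorrow\\tdemo/frame\\tB-datetime',\n",
"    '',\n",
"    '1\\thello\\tdemo/frame\\tNoLabel',\n",
"]\n",
"for sentence in read_sentences(sample):\n",
"    print([(token['form'], token['slot']) for token in sentence])\n",
"```\n",
"\n",
"Each sentence ends up as a list of dictionaries keyed by the names in `fields`, which is the shape the rest of the notebook relies on when it reads `token['form']` and `token['slot']`."
]
},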
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us look at a few example utterances from this dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from tabulate import tabulate\n",
"tabulate(trainset[0], tablefmt='html')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tabulate(trainset[1000], tablefmt='html')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tabulate(trainset[2000], tablefmt='html')"
]
},
{
"cell_type": "markdown",
"metadata": {
"lines_to_next_cell": 0
},
"source": [
"To keep the training process short enough to present in a Jupyter notebook, we will restrict the training and test sets to their first 100 examples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"trainset = trainset[:100]\n",
"testset = testset[:100]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To build the model we will use an architecture based on recurrent neural networks,\n",
"implemented in the [flair](https://github.com/flairNLP/flair) library (Akbik et al., 2018)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from flair.data import Corpus, Sentence, Token\n",
"from flair.datasets import SentenceDataset\n",
"from flair.embeddings import StackedEmbeddings\n",
"from flair.embeddings import WordEmbeddings\n",
"from flair.embeddings import CharacterEmbeddings\n",
"from flair.embeddings import FlairEmbeddings\n",
"from flair.models import SequenceTagger\n",
"from flair.trainers import ModelTrainer\n",
"\n",
"# make the computations deterministic\n",
"import random\n",
"import torch\n",
"random.seed(42)\n",
"torch.manual_seed(42)\n",
"\n",
"if torch.cuda.is_available():\n",
"    torch.cuda.manual_seed(0)\n",
"    torch.cuda.manual_seed_all(0)\n",
"    torch.backends.cudnn.enabled = False\n",
"    torch.backends.cudnn.benchmark = False\n",
"    torch.backends.cudnn.deterministic = True"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will convert the data to the format used by `flair` with the following function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def conllu2flair(sentences, label=None):\n",
"    fsentences = []\n",
"\n",
"    for sentence in sentences:\n",
"        fsentence = Sentence()\n",
"\n",
"        for token in sentence:\n",
"            ftoken = Token(token['form'])\n",
"\n",
"            if label:\n",
"                ftoken.add_tag(label, token[label])\n",
"\n",
"            fsentence.add_token(ftoken)\n",
"\n",
"        fsentences.append(fsentence)\n",
"\n",
"    return SentenceDataset(fsentences)\n",
"\n",
"corpus = Corpus(train=conllu2flair(trainset, 'slot'), test=conllu2flair(testset, 'slot'))\n",
"print(corpus)\n",
"tag_dictionary = corpus.make_tag_dictionary(tag_type='slot')\n",
"print(tag_dictionary)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our model will use vector representations of words (see [Word Embeddings](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_3_WORD_EMBEDDING.md))."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"embedding_types = [\n",
"    WordEmbeddings('en'),\n",
"    FlairEmbeddings('en-forward'),\n",
"    FlairEmbeddings('en-backward'),\n",
"    CharacterEmbeddings(),\n",
"]\n",
"\n",
"embeddings = StackedEmbeddings(embeddings=embedding_types)\n",
"tagger = SequenceTagger(hidden_size=256, embeddings=embeddings,\n",
"                        tag_dictionary=tag_dictionary,\n",
"                        tag_type='slot', use_crf=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us inspect the architecture of the neural network that will be responsible for predicting\n",
"slots in utterances."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(tagger)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will run ten training iterations (epochs) and save the resulting model in the `slot-model` directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"trainer = ModelTrainer(tagger, corpus)\n",
"trainer.train('slot-model',\n",
"              learning_rate=0.1,\n",
"              mini_batch_size=32,\n",
"              max_epochs=10,\n",
"              train_with_dev=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The quality of the trained model can be assessed using the metrics reported above, i.e.:\n",
"\n",
" - *tp (true positives)*\n",
"\n",
"   > the number of words labelled $e$ in the test set that the model also labelled $e$\n",
"\n",
" - *fp (false positives)*\n",
"\n",
"   > the number of words not labelled $e$ in the test set that the model labelled $e$\n",
"\n",
" - *fn (false negatives)*\n",
"\n",
"   > the number of words labelled $e$ in the test set to which the model did not assign the label $e$\n",
"\n",
" - *precision*\n",
"\n",
"   > $$\\frac{tp}{tp + fp}$$\n",
"\n",
" - *recall*\n",
"\n",
"   > $$\\frac{tp}{tp + fn}$$\n",
"\n",
" - $F_1$\n",
"\n",
"   > $$\\frac{2 \\cdot precision \\cdot recall}{precision + recall}$$\n",
"\n",
" - *micro* $F_1$\n",
"\n",
"   > $F_1$ in which $tp$, $fp$ and $fn$ are counted jointly over all labels, i.e. $tp = \\sum_{e}{{tp}_e}$, $fn = \\sum_{e}{{fn}_e}$, $fp = \\sum_{e}{{fp}_e}$\n",
"\n",
" - *macro* $F_1$\n",
"\n",
"   > the arithmetic mean of the $F_1$ scores computed separately for each label.\n",
"\n",
"The trained model can be loaded from a file using the `load` method."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = SequenceTagger.load('slot-model/final-model.pt')"
]
},
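{
"cell_type": "markdown",
"metadata": {},
"source": [
"Returning briefly to the evaluation metrics listed earlier, the definitions can be checked by hand on a toy example. The sketch below follows the word-level definitions literally and counts every label, including `O`; the function name `f1_report` and the toy labels are hypothetical, and the figures reported by `flair` may be computed differently (for instance over whole slots rather than single words).\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"def f1_report(gold, pred):\n",
"    # Count tp, fp and fn separately for every label e.\n",
"    tp, fp, fn = Counter(), Counter(), Counter()\n",
"    for g, p in zip(gold, pred):\n",
"        if g == p:\n",
"            tp[g] += 1\n",
"        else:\n",
"            fp[p] += 1  # the model used label p where it should not have\n",
"            fn[g] += 1  # the model missed label g\n",
"\n",
"    def f1(tp_, fp_, fn_):\n",
"        precision = tp_ / (tp_ + fp_) if tp_ + fp_ else 0.0\n",
"        recall = tp_ / (tp_ + fn_) if tp_ + fn_ else 0.0\n",
"        if precision + recall == 0:\n",
"            return 0.0\n",
"        return 2 * precision * recall / (precision + recall)\n",
"\n",
"    labels = sorted(set(gold) | set(pred))\n",
"    per_label = {e: f1(tp[e], fp[e], fn[e]) for e in labels}\n",
"    # micro: pool the counts first, then compute a single F1\n",
"    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))\n",
"    # macro: average the per-label F1 scores\n",
"    macro = sum(per_label.values()) / len(per_label)\n",
"    return per_label, micro, macro\n",
"\n",
"gold = ['O', 'B-day', 'O', 'B-hour', 'I-hour']\n",
"pred = ['O', 'B-day', 'B-day', 'B-hour', 'O']\n",
"per_label, micro, macro = f1_report(gold, pred)\n",
"print(per_label, micro, macro)\n",
"```\n",
"\n",
"In this toy case the label `I-hour` is never predicted, so its $F_1$ is zero; this lowers the macro average more than the micro average, because the macro variant weights every label equally regardless of how often it occurs."
]
},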
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loaded model can be used to predict slots in user utterances with the `predict` function\n",
"shown below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def predict(model, sentence):\n",
"    csentence = [{'form': word} for word in sentence]\n",
"    fsentence = conllu2flair([csentence])[0]\n",
"    model.predict(fsentence)\n",
"    return [(token, ftoken.get_tag('slot').value) for token, ftoken in zip(sentence, fsentence)]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As the example below shows, a model trained on only 100 examples makes a mistake on a fairly simple\n",
"utterance: it labels the word `alarm` with the tag `B-weather/noun`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tabulate(predict(model, 'change my 3 pm alarm to the next day'.split()), tablefmt='html')"
]
},
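{
"cell_type": "markdown",
"metadata": {},
"source": [
"Word-level tags such as those returned by `predict` still have to be grouped into slot values before a dialogue system can use them. A minimal sketch of that step is shown below; the helper `tags_to_slots` is hypothetical, the tags in the example are written by hand rather than produced by the model, and `I-` tags that do not continue an open slot are simply dropped.\n",
"\n",
"```python\n",
"def tags_to_slots(tagged):\n",
"    # tagged: list of (word, tag) pairs in the IOB scheme\n",
"    slots = []\n",
"    current = None\n",
"    for word, tag in tagged:\n",
"        if tag.startswith('B-'):\n",
"            # a B- tag always opens a new slot\n",
"            current = [tag[2:], [word]]\n",
"            slots.append(current)\n",
"        elif tag.startswith('I-') and current and current[0] == tag[2:]:\n",
"            # an I- tag extends the slot that is currently open\n",
"            current[1].append(word)\n",
"        else:\n",
"            # 'O' or an I- tag that does not continue the open slot\n",
"            current = None\n",
"    return [(name, ' '.join(words)) for name, words in slots]\n",
"\n",
"# hand-written tags for illustration only\n",
"tagged = [('change', 'O'), ('my', 'O'), ('3', 'B-datetime'), ('pm', 'I-datetime'),\n",
"          ('alarm', 'O'), ('to', 'O'), ('the', 'B-datetime'), ('next', 'I-datetime'),\n",
"          ('day', 'I-datetime')]\n",
"print(tags_to_slots(tagged))\n",
"# [('datetime', '3 pm'), ('datetime', 'the next day')]\n",
"```\n",
"\n",
"The result is a list rather than a dictionary because the same slot name can occur more than once in a single utterance, as it does here."
]
},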
{
"cell_type": "markdown",
"metadata": {},
"source": [
"References\n",
"----------\n",
" 1. Sebastian Schuster, Sonal Gupta, Rushin Shah, and Mike Lewis. 2019. Cross-lingual Transfer Learning for Multilingual Task Oriented Dialog. NAACL-HLT (1) 2019, pp. 3795\u20133805\n",
" 2. John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp. 282\u2013289, https://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers\n",
" 3. Sepp Hochreiter and J\u00fcrgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (November 15, 1997), pp. 1735\u20131780, https://doi.org/10.1162/neco.1997.9.8.1735\n",
" 4. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. NIPS 2017, pp. 5998\u20136008, https://arxiv.org/abs/1706.03762\n",
" 5. Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638\u20131649, https://www.aclweb.org/anthology/C18-1139.pdf\n"
]
}
],
"metadata": {
"jupytext": {
"cell_metadata_filter": "-all",
"main_language": "python",
"notebook_metadata_filter": "-all"
},
"author": "Marek Kubis",
"email": "mkubis@amu.edu.pl",
"lang": "pl",
"subtitle": "8.Parsing semantyczny z wykorzystaniem technik uczenia maszynowego[laboratoria]",
"title": "Systemy Dialogowe",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}