{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n", "
\n", "

Dialogue Systems

\n", "

8. Semantic parsing using machine learning techniques [laboratory]

\n", "

Marek Kubis (2021)

\n", "
\n", "\n", "![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Semantic parsing using machine learning techniques\n", "==================================================\n", "\n", "Introduction\n", "------------\n", "The problem of detecting slots and their values in user utterances can be formulated as the task of\n", "predicting, for each word, a label that indicates whether the word belongs to a slot and, if so, to which one.\n", "\n", "> chciałbym zarezerwować stolik na jutro**/day** na godzinę dwunastą**/hour** czterdzieści**/hour** pięć**/hour** na pięć**/size** osób\n", "\n", "(In English: *I would like to book a table for tomorrow at twelve forty-five for five people*.)\n", "\n", "Slot boundaries are marked using a chosen labelling scheme.\n", "\n", "### IOB scheme\n", "\n", "| Prefix | Meaning |\n", "|:------:|:---------------------------|\n", "| I | inside a slot |\n", "| O | outside any slot |\n", "| B | beginning of a slot |\n", "\n", "> chciałbym zarezerwować stolik na jutro**/B-day** na godzinę dwunastą**/B-hour** czterdzieści**/I-hour** pięć**/I-hour** na pięć**/B-size** osób\n", "\n", "### IOBES scheme\n", "\n", "| Prefix | Meaning |\n", "|:------:|:---------------------------|\n", "| I | inside a slot |\n", "| O | outside any slot |\n", "| B | beginning of a slot |\n", "| E | end of a slot |\n", "| S | single-word slot |\n", "\n", "> chciałbym zarezerwować stolik na jutro**/S-day** na godzinę dwunastą**/B-hour** czterdzieści**/I-hour** pięć**/E-hour** na pięć**/S-size** osób\n", "\n", "If we prepare a dataset for this task consisting of user utterances annotated with slots\n", "(the so-called *training set*),\n", "we can apply (supervised) machine learning techniques to build a model\n", "that annotates user utterances with slot labels.\n", "\n", "Such a model can be built using, among others:\n", "\n", " 1. conditional random fields (Lafferty et al., 2001),\n", "\n", " 2. recurrent neural networks, e.g. LSTM networks (Hochreiter and Schmidhuber, 1997),\n", "\n", " 3. transformers (Vaswani et al., 2017).\n", "\n", "Example\n", "-------\n", "We will use the dataset prepared by Schuster et al. (2019)." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!mkdir -p l07\n", "%cd l07\n", "!curl -L -C - https://fb.me/multilingual_task_oriented_data -o data.zip\n", "!unzip data.zip\n", "%cd .." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "This dataset contains utterances in three languages annotated with slots for twelve frames belonging to three domains: `Alarm`, `Reminder` and `Weather`. We will load the data using the [conllu](https://pypi.org/project/conllu/) library." ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from conllu import parse_incr\n", "fields = ['id', 'form', 'frame', 'slot']\n", "\n", "# map the 'NoLabel' marker used in the data files to the 'O' (outside) tag\n", "def nolabel2o(line, i):\n", "    return 'O' if line[i] == 'NoLabel' else line[i]\n", "\n", "with open('./train_data/train.conllu') as trainfile:\n", "    trainset = list(parse_incr(trainfile, fields=fields, field_parsers={'slot': nolabel2o}))\n", "with open('./train_data/test.conllu') as testfile:\n", "    testset = list(parse_incr(testfile, fields=fields, field_parsers={'slot': nolabel2o}))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let us look at a few example utterances from this dataset." ] },
{ "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "
1chciałbymO
2kupić O
3popcorn O
" ], "text/plain": [ "'\\n\\n\\n\\n\\n\\n
1chciałbymO
2kupić O
3popcorn O
'" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from tabulate import tabulate\n", "tabulate(trainset[26], tablefmt='html')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tabulate(trainset[1000], tablefmt='html')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tabulate(trainset[2000], tablefmt='html')" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "To keep the training process presentable in a Jupyter notebook, we will restrict the dataset to its first 100 examples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trainset = trainset[:100]\n", "testset = testset[:100]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will build the model using an architecture based on recurrent neural networks\n", "implemented in the [flair](https://github.com/flairNLP/flair) library (Akbik et al., 2018)." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\Adrian\\AppData\\Roaming\\Python\\Python37\\site-packages\\tqdm\\auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. 
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] } ], "source": [ "from flair.data import Corpus, Sentence, Token\n", "from flair.datasets import SentenceDataset\n", "from flair.embeddings import StackedEmbeddings\n", "from flair.embeddings import WordEmbeddings\n", "from flair.embeddings import CharacterEmbeddings\n", "from flair.embeddings import FlairEmbeddings\n", "from flair.models import SequenceTagger\n", "from flair.trainers import ModelTrainer\n", "\n", "# make the computations deterministic (fixed random seeds)\n", "import random\n", "import torch\n", "random.seed(42)\n", "torch.manual_seed(42)\n", "\n", "if torch.cuda.is_available():\n", "    torch.cuda.manual_seed(0)\n", "    torch.cuda.manual_seed_all(0)\n", "    torch.backends.cudnn.enabled = False\n", "    torch.backends.cudnn.benchmark = False\n", "    torch.backends.cudnn.deterministic = True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will convert the data to the format used by `flair` with the following function." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Corpus: 194 train + 22 dev + 33 test sentences\n", "Dictionary with 12 tags: <unk>, O, B-time, I-time, B-area, I-area, B-quantity, B-date, I-quantity, I-date, <START>, <STOP>\n" ] } ], "source": [ "def conllu2flair(sentences, label=None):\n", "    fsentences = []\n", "\n", "    for sentence in sentences:\n", "        fsentence = Sentence()\n", "\n", "        for token in sentence:\n", "            ftoken = Token(token['form'])\n", "\n", "            if label:\n", "                ftoken.add_tag(label, token[label])\n", "\n", "            fsentence.add_token(ftoken)\n", "\n", "        fsentences.append(fsentence)\n", "\n", "    return SentenceDataset(fsentences)\n", "\n", "corpus = Corpus(train=conllu2flair(trainset, 'slot'), test=conllu2flair(testset, 'slot'))\n", "print(corpus)\n", "tag_dictionary = corpus.make_tag_dictionary(tag_type='slot')\n", "print(tag_dictionary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our model will use vector representations of words (see [Word Embeddings](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_3_WORD_EMBEDDING.md))." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2022-05-17 23:51:31,428 https://flair.informatik.hu-berlin.de/resources/embeddings/token/pl-wiki-fasttext-300d-1M.vectors.npy not found in cache, downloading to C:\\Users\\Adrian\\AppData\\Local\\Temp\\tmpdtf6je0q\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 1199998928/1199998928 [00:38<00:00, 31059047.54B/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "2022-05-17 23:52:10,221 copying C:\\Users\\Adrian\\AppData\\Local\\Temp\\tmpdtf6je0q to cache at C:\\Users\\Adrian\\.flair\\embeddings\\pl-wiki-fasttext-300d-1M.vectors.npy\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "2022-05-17 23:52:11,581 removing temp file C:\\Users\\Adrian\\AppData\\Local\\Temp\\tmpdtf6je0q\n", "2022-05-17 23:52:11,834 https://flair.informatik.hu-berlin.de/resources/embeddings/token/pl-wiki-fasttext-300d-1M not found in cache, downloading to C:\\Users\\Adrian\\AppData\\Local\\Temp\\tmpncdt74ud\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 40874795/40874795 [00:01<00:00, 25496548.48B/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "2022-05-17 23:52:13,623 copying C:\\Users\\Adrian\\AppData\\Local\\Temp\\tmpncdt74ud to cache at C:\\Users\\Adrian\\.flair\\embeddings\\pl-wiki-fasttext-300d-1M\n", "2022-05-17 23:52:13,678 removing temp file C:\\Users\\Adrian\\AppData\\Local\\Temp\\tmpncdt74ud\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "2022-05-17 23:52:21,696 https://flair.informatik.hu-berlin.de/resources/embeddings/flair/lm-polish-forward-v0.2.pt not found in cache, downloading to C:\\Users\\Adrian\\AppData\\Local\\Temp\\tmp6okeka8n\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ 
"100%|██████████| 84244196/84244196 [00:02<00:00, 35143826.68B/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "2022-05-17 23:52:24,338 copying C:\\Users\\Adrian\\AppData\\Local\\Temp\\tmp6okeka8n to cache at C:\\Users\\Adrian\\.flair\\embeddings\\lm-polish-forward-v0.2.pt\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "2022-05-17 23:52:24,435 removing temp file C:\\Users\\Adrian\\AppData\\Local\\Temp\\tmp6okeka8n\n", "2022-05-17 23:52:24,857 https://flair.informatik.hu-berlin.de/resources/embeddings/flair/lm-polish-backward-v0.2.pt not found in cache, downloading to C:\\Users\\Adrian\\AppData\\Local\\Temp\\tmp_6ut1zi9\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 84244196/84244196 [00:02<00:00, 35815492.94B/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "2022-05-17 23:52:27,375 copying C:\\Users\\Adrian\\AppData\\Local\\Temp\\tmp_6ut1zi9 to cache at C:\\Users\\Adrian\\.flair\\embeddings\\lm-polish-backward-v0.2.pt\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "2022-05-17 23:52:27,460 removing temp file C:\\Users\\Adrian\\AppData\\Local\\Temp\\tmp_6ut1zi9\n" ] } ], "source": [ "embedding_types = [\n", " WordEmbeddings('pl'),\n", " FlairEmbeddings('polish-forward'),\n", " FlairEmbeddings('polish-backward'),\n", " CharacterEmbeddings(),\n", "]\n", "\n", "embeddings = StackedEmbeddings(embeddings=embedding_types)\n", "tagger = SequenceTagger(hidden_size=256, embeddings=embeddings,\n", " tag_dictionary=tag_dictionary,\n", " tag_type='slot', use_crf=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Zobaczmy jak wygląda architektura sieci neuronowej, która będzie odpowiedzialna za przewidywanie\n", "slotów w wypowiedziach." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SequenceTagger(\n", " (embeddings): StackedEmbeddings(\n", " (list_embedding_0): WordEmbeddings('pl')\n", " (list_embedding_1): FlairEmbeddings(\n", " (lm): LanguageModel(\n", " (drop): Dropout(p=0.25, inplace=False)\n", " (encoder): Embedding(1602, 100)\n", " (rnn): LSTM(100, 2048)\n", " (decoder): Linear(in_features=2048, out_features=1602, bias=True)\n", " )\n", " )\n", " (list_embedding_2): FlairEmbeddings(\n", " (lm): LanguageModel(\n", " (drop): Dropout(p=0.25, inplace=False)\n", " (encoder): Embedding(1602, 100)\n", " (rnn): LSTM(100, 2048)\n", " (decoder): Linear(in_features=2048, out_features=1602, bias=True)\n", " )\n", " )\n", " (list_embedding_3): CharacterEmbeddings(\n", " (char_embedding): Embedding(275, 25)\n", " (char_rnn): LSTM(25, 25, bidirectional=True)\n", " )\n", " )\n", " (word_dropout): WordDropout(p=0.05)\n", " (locked_dropout): LockedDropout(p=0.5)\n", " (embedding2nn): Linear(in_features=4446, out_features=4446, bias=True)\n", " (rnn): LSTM(4446, 256, batch_first=True, bidirectional=True)\n", " (linear): Linear(in_features=512, out_features=12, bias=True)\n", " (beta): 1.0\n", " (weights): None\n", " (weight_tensor) None\n", ")\n" ] } ], "source": [ "print(tagger)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will run ten training iterations (epochs) and save the resulting model in the `slot-model` directory." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2022-05-17 23:52:57,432 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:52:57,433 Model: \"SequenceTagger(\n", " (embeddings): StackedEmbeddings(\n", " (list_embedding_0): WordEmbeddings('pl')\n", " (list_embedding_1): FlairEmbeddings(\n", " (lm): LanguageModel(\n", " (drop): Dropout(p=0.25, inplace=False)\n", " (encoder): Embedding(1602, 100)\n", " (rnn): LSTM(100, 2048)\n", " (decoder): Linear(in_features=2048, out_features=1602, bias=True)\n", " )\n", " )\n", " (list_embedding_2): FlairEmbeddings(\n", " (lm): LanguageModel(\n", " (drop): Dropout(p=0.25, inplace=False)\n", " (encoder): Embedding(1602, 100)\n", " (rnn): LSTM(100, 2048)\n", " (decoder): Linear(in_features=2048, out_features=1602, bias=True)\n", " )\n", " )\n", " (list_embedding_3): CharacterEmbeddings(\n", " (char_embedding): Embedding(275, 25)\n", " (char_rnn): LSTM(25, 25, bidirectional=True)\n", " )\n", " )\n", " (word_dropout): WordDropout(p=0.05)\n", " (locked_dropout): LockedDropout(p=0.5)\n", " (embedding2nn): Linear(in_features=4446, out_features=4446, bias=True)\n", " (rnn): LSTM(4446, 256, batch_first=True, bidirectional=True)\n", " (linear): Linear(in_features=512, out_features=12, bias=True)\n", " (beta): 1.0\n", " (weights): None\n", " (weight_tensor) None\n", ")\"\n", "2022-05-17 23:52:57,434 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:52:57,435 Corpus: \"Corpus: 194 train + 22 dev + 33 test sentences\"\n", "2022-05-17 23:52:57,435 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:52:57,435 Parameters:\n", "2022-05-17 23:52:57,436 - learning_rate: \"0.1\"\n", "2022-05-17 23:52:57,437 - mini_batch_size: \"32\"\n", "2022-05-17 23:52:57,437 
- patience: \"3\"\n", "2022-05-17 23:52:57,437 - anneal_factor: \"0.5\"\n", "2022-05-17 23:52:57,438 - max_epochs: \"10\"\n", "2022-05-17 23:52:57,439 - shuffle: \"True\"\n", "2022-05-17 23:52:57,440 - train_with_dev: \"False\"\n", "2022-05-17 23:52:57,440 - batch_growth_annealing: \"False\"\n", "2022-05-17 23:52:57,441 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:52:57,441 Model training base path: \"slot-model\"\n", "2022-05-17 23:52:57,442 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:52:57,443 Device: cpu\n", "2022-05-17 23:52:57,443 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:52:57,444 Embeddings storage mode: cpu\n", "2022-05-17 23:52:57,446 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:52:59,206 epoch 1 - iter 1/7 - loss 16.77810669 - samples/sec: 18.23 - lr: 0.100000\n", "2022-05-17 23:53:01,036 epoch 1 - iter 2/7 - loss 15.17136908 - samples/sec: 17.51 - lr: 0.100000\n", "2022-05-17 23:53:02,450 epoch 1 - iter 3/7 - loss 13.45863914 - samples/sec: 22.63 - lr: 0.100000\n", "2022-05-17 23:53:04,163 epoch 1 - iter 4/7 - loss 11.81387305 - samples/sec: 18.70 - lr: 0.100000\n", "2022-05-17 23:53:06,030 epoch 1 - iter 5/7 - loss 10.41218300 - samples/sec: 17.14 - lr: 0.100000\n", "2022-05-17 23:53:07,655 epoch 1 - iter 6/7 - loss 9.20362504 - samples/sec: 19.70 - lr: 0.100000\n", "2022-05-17 23:53:07,968 epoch 1 - iter 7/7 - loss 8.10721644 - samples/sec: 102.61 - lr: 0.100000\n", "2022-05-17 23:53:07,969 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:53:07,970 EPOCH 1 done: loss 8.1072 - lr 0.1000000\n", "2022-05-17 23:53:09,606 DEV : loss 3.991352081298828 - score 0.2\n", "2022-05-17 
23:53:09,607 BAD EPOCHS (no improvement): 0\n", "saving best model\n", "2022-05-17 23:53:14,975 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:53:15,484 epoch 2 - iter 1/7 - loss 3.58558130 - samples/sec: 63.03 - lr: 0.100000\n", "2022-05-17 23:53:15,865 epoch 2 - iter 2/7 - loss 3.12797976 - samples/sec: 84.20 - lr: 0.100000\n", "2022-05-17 23:53:16,267 epoch 2 - iter 3/7 - loss 2.60615242 - samples/sec: 79.80 - lr: 0.100000\n", "2022-05-17 23:53:16,738 epoch 2 - iter 4/7 - loss 2.71958175 - samples/sec: 68.18 - lr: 0.100000\n", "2022-05-17 23:53:17,170 epoch 2 - iter 5/7 - loss 2.70331609 - samples/sec: 74.26 - lr: 0.100000\n", "2022-05-17 23:53:17,603 epoch 2 - iter 6/7 - loss 2.51522466 - samples/sec: 74.01 - lr: 0.100000\n", "2022-05-17 23:53:17,748 epoch 2 - iter 7/7 - loss 2.19215042 - samples/sec: 221.61 - lr: 0.100000\n", "2022-05-17 23:53:17,749 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:53:17,750 EPOCH 2 done: loss 2.1922 - lr 0.1000000\n", "2022-05-17 23:53:17,844 DEV : loss 3.9842920303344727 - score 0.3636\n", "2022-05-17 23:53:17,846 BAD EPOCHS (no improvement): 0\n", "saving best model\n", "2022-05-17 23:53:22,865 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:53:23,305 epoch 3 - iter 1/7 - loss 2.19582605 - samples/sec: 72.76 - lr: 0.100000\n", "2022-05-17 23:53:23,741 epoch 3 - iter 2/7 - loss 1.85529530 - samples/sec: 73.58 - lr: 0.100000\n", "2022-05-17 23:53:24,212 epoch 3 - iter 3/7 - loss 1.91948136 - samples/sec: 68.09 - lr: 0.100000\n", "2022-05-17 23:53:24,717 epoch 3 - iter 4/7 - loss 2.11527669 - samples/sec: 63.50 - lr: 0.100000\n", "2022-05-17 23:53:25,129 epoch 3 - iter 5/7 - loss 2.12587404 - samples/sec: 77.75 - lr: 0.100000\n", "2022-05-17 23:53:25,630 epoch 3 - iter 6/7 - loss 2.01592445 - samples/sec: 63.92 - 
lr: 0.100000\n", "2022-05-17 23:53:25,755 epoch 3 - iter 7/7 - loss 1.73551549 - samples/sec: 258.75 - lr: 0.100000\n", "2022-05-17 23:53:25,756 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:53:25,757 EPOCH 3 done: loss 1.7355 - lr 0.1000000\n", "2022-05-17 23:53:25,854 DEV : loss 3.3194284439086914 - score 0.3077\n", "2022-05-17 23:53:25,855 BAD EPOCHS (no improvement): 1\n", "2022-05-17 23:53:25,856 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:53:26,274 epoch 4 - iter 1/7 - loss 1.46010232 - samples/sec: 76.66 - lr: 0.100000\n", "2022-05-17 23:53:26,734 epoch 4 - iter 2/7 - loss 1.18807647 - samples/sec: 69.66 - lr: 0.100000\n", "2022-05-17 23:53:27,229 epoch 4 - iter 3/7 - loss 1.33144226 - samples/sec: 64.87 - lr: 0.100000\n", "2022-05-17 23:53:27,775 epoch 4 - iter 4/7 - loss 1.64428358 - samples/sec: 58.69 - lr: 0.100000\n", "2022-05-17 23:53:28,243 epoch 4 - iter 5/7 - loss 1.62551130 - samples/sec: 68.71 - lr: 0.100000\n", "2022-05-17 23:53:28,727 epoch 4 - iter 6/7 - loss 1.74551653 - samples/sec: 66.25 - lr: 0.100000\n", "2022-05-17 23:53:28,856 epoch 4 - iter 7/7 - loss 1.53921426 - samples/sec: 248.73 - lr: 0.100000\n", "2022-05-17 23:53:28,857 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:53:28,858 EPOCH 4 done: loss 1.5392 - lr 0.1000000\n", "2022-05-17 23:53:28,962 DEV : loss 2.8986825942993164 - score 0.2857\n", "2022-05-17 23:53:28,963 BAD EPOCHS (no improvement): 2\n", "2022-05-17 23:53:28,965 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:53:29,417 epoch 5 - iter 1/7 - loss 1.72827125 - samples/sec: 70.90 - lr: 0.100000\n", "2022-05-17 23:53:29,902 epoch 5 - iter 2/7 - loss 1.51951337 - samples/sec: 66.07 - lr: 0.100000\n", "2022-05-17 23:53:30,355 epoch 5 
- iter 3/7 - loss 1.55555471 - samples/sec: 70.83 - lr: 0.100000\n", "2022-05-17 23:53:30,840 epoch 5 - iter 4/7 - loss 1.31492138 - samples/sec: 66.16 - lr: 0.100000\n", "2022-05-17 23:53:31,257 epoch 5 - iter 5/7 - loss 1.46497860 - samples/sec: 76.92 - lr: 0.100000\n", "2022-05-17 23:53:31,768 epoch 5 - iter 6/7 - loss 1.60987592 - samples/sec: 62.75 - lr: 0.100000\n", "2022-05-17 23:53:31,929 epoch 5 - iter 7/7 - loss 2.72113044 - samples/sec: 200.53 - lr: 0.100000\n", "2022-05-17 23:53:31,930 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:53:31,931 EPOCH 5 done: loss 2.7211 - lr 0.1000000\n", "2022-05-17 23:53:32,024 DEV : loss 2.766446590423584 - score 0.3077\n", "2022-05-17 23:53:32,025 BAD EPOCHS (no improvement): 3\n", "2022-05-17 23:53:32,026 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:53:32,475 epoch 6 - iter 1/7 - loss 1.68398678 - samples/sec: 71.62 - lr: 0.100000\n", "2022-05-17 23:53:32,971 epoch 6 - iter 2/7 - loss 1.67541099 - samples/sec: 64.62 - lr: 0.100000\n", "2022-05-17 23:53:33,400 epoch 6 - iter 3/7 - loss 1.58060956 - samples/sec: 74.78 - lr: 0.100000\n", "2022-05-17 23:53:33,878 epoch 6 - iter 4/7 - loss 1.55456299 - samples/sec: 66.92 - lr: 0.100000\n", "2022-05-17 23:53:34,278 epoch 6 - iter 5/7 - loss 1.50003145 - samples/sec: 80.28 - lr: 0.100000\n", "2022-05-17 23:53:34,813 epoch 6 - iter 6/7 - loss 1.46878848 - samples/sec: 60.04 - lr: 0.100000\n", "2022-05-17 23:53:34,951 epoch 6 - iter 7/7 - loss 1.66172016 - samples/sec: 233.22 - lr: 0.100000\n", "2022-05-17 23:53:34,952 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:53:34,952 EPOCH 6 done: loss 1.6617 - lr 0.1000000\n", "2022-05-17 23:53:35,040 DEV : loss 2.2595832347869873 - score 0.2857\n", "Epoch 6: reducing learning rate of group 0 to 5.0000e-02.\n", 
"2022-05-17 23:53:35,041 BAD EPOCHS (no improvement): 4\n", "2022-05-17 23:53:35,043 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:53:35,461 epoch 7 - iter 1/7 - loss 1.14667833 - samples/sec: 76.93 - lr: 0.050000\n", "2022-05-17 23:53:35,976 epoch 7 - iter 2/7 - loss 1.11618459 - samples/sec: 62.22 - lr: 0.050000\n", "2022-05-17 23:53:36,416 epoch 7 - iter 3/7 - loss 1.24378494 - samples/sec: 72.88 - lr: 0.050000\n", "2022-05-17 23:53:36,880 epoch 7 - iter 4/7 - loss 1.31663331 - samples/sec: 69.14 - lr: 0.050000\n", "2022-05-17 23:53:37,298 epoch 7 - iter 5/7 - loss 1.39581544 - samples/sec: 76.75 - lr: 0.050000\n", "2022-05-17 23:53:37,714 epoch 7 - iter 6/7 - loss 1.34690581 - samples/sec: 77.09 - lr: 0.050000\n", "2022-05-17 23:53:37,860 epoch 7 - iter 7/7 - loss 1.46004195 - samples/sec: 220.36 - lr: 0.050000\n", "2022-05-17 23:53:37,861 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:53:37,861 EPOCH 7 done: loss 1.4600 - lr 0.0500000\n", "2022-05-17 23:53:37,954 DEV : loss 2.200728416442871 - score 0.2857\n", "2022-05-17 23:53:37,955 BAD EPOCHS (no improvement): 1\n", "2022-05-17 23:53:37,956 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:53:38,423 epoch 8 - iter 1/7 - loss 1.14459288 - samples/sec: 68.83 - lr: 0.050000\n", "2022-05-17 23:53:38,805 epoch 8 - iter 2/7 - loss 0.95714736 - samples/sec: 83.88 - lr: 0.050000\n", "2022-05-17 23:53:39,302 epoch 8 - iter 3/7 - loss 1.17704646 - samples/sec: 64.42 - lr: 0.050000\n", "2022-05-17 23:53:39,781 epoch 8 - iter 4/7 - loss 1.29963121 - samples/sec: 66.92 - lr: 0.050000\n", "2022-05-17 23:53:40,256 epoch 8 - iter 5/7 - loss 1.34262223 - samples/sec: 67.59 - lr: 0.050000\n", "2022-05-17 23:53:40,704 epoch 8 - iter 6/7 - loss 1.33356750 - samples/sec: 71.53 - lr: 0.050000\n", "2022-05-17 
23:53:40,846 epoch 8 - iter 7/7 - loss 1.20113390 - samples/sec: 226.59 - lr: 0.050000\n", "2022-05-17 23:53:40,847 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:53:40,848 EPOCH 8 done: loss 1.2011 - lr 0.0500000\n", "2022-05-17 23:53:40,941 DEV : loss 2.4227261543273926 - score 0.2857\n", "2022-05-17 23:53:40,942 BAD EPOCHS (no improvement): 2\n", "2022-05-17 23:53:40,943 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:53:41,389 epoch 9 - iter 1/7 - loss 1.12297106 - samples/sec: 71.73 - lr: 0.050000\n", "2022-05-17 23:53:41,800 epoch 9 - iter 2/7 - loss 0.92356640 - samples/sec: 78.01 - lr: 0.050000\n", "2022-05-17 23:53:42,249 epoch 9 - iter 3/7 - loss 1.02407436 - samples/sec: 71.37 - lr: 0.050000\n", "2022-05-17 23:53:42,667 epoch 9 - iter 4/7 - loss 1.04805315 - samples/sec: 76.71 - lr: 0.050000\n", "2022-05-17 23:53:43,215 epoch 9 - iter 5/7 - loss 1.33371143 - samples/sec: 58.59 - lr: 0.050000\n", "2022-05-17 23:53:43,661 epoch 9 - iter 6/7 - loss 1.27829826 - samples/sec: 71.89 - lr: 0.050000\n", "2022-05-17 23:53:43,796 epoch 9 - iter 7/7 - loss 1.10260926 - samples/sec: 240.25 - lr: 0.050000\n", "2022-05-17 23:53:43,797 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:53:43,798 EPOCH 9 done: loss 1.1026 - lr 0.0500000\n", "2022-05-17 23:53:43,895 DEV : loss 2.1707162857055664 - score 0.3077\n", "2022-05-17 23:53:43,896 BAD EPOCHS (no improvement): 3\n", "2022-05-17 23:53:43,903 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:53:44,338 epoch 10 - iter 1/7 - loss 1.34320462 - samples/sec: 73.74 - lr: 0.050000\n", "2022-05-17 23:53:44,808 epoch 10 - iter 2/7 - loss 0.96772069 - samples/sec: 68.25 - lr: 0.050000\n", "2022-05-17 23:53:45,207 epoch 10 - iter 3/7 - loss 
1.06257542 - samples/sec: 80.34 - lr: 0.050000\n", "2022-05-17 23:53:45,729 epoch 10 - iter 4/7 - loss 0.92318819 - samples/sec: 61.50 - lr: 0.050000\n", "2022-05-17 23:53:46,202 epoch 10 - iter 5/7 - loss 1.08295707 - samples/sec: 67.82 - lr: 0.050000\n", "2022-05-17 23:53:46,707 epoch 10 - iter 6/7 - loss 1.18012399 - samples/sec: 63.49 - lr: 0.050000\n", "2022-05-17 23:53:46,841 epoch 10 - iter 7/7 - loss 1.01267667 - samples/sec: 239.34 - lr: 0.050000\n", "2022-05-17 23:53:46,842 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:53:46,842 EPOCH 10 done: loss 1.0127 - lr 0.0500000\n", "2022-05-17 23:53:46,942 DEV : loss 1.9863343238830566 - score 0.3077\n", "Epoch 10: reducing learning rate of group 0 to 2.5000e-02.\n", "2022-05-17 23:53:46,943 BAD EPOCHS (no improvement): 4\n", "2022-05-17 23:53:51,951 ----------------------------------------------------------------------------------------------------\n", "2022-05-17 23:53:51,952 Testing using best model ...\n", "2022-05-17 23:53:51,953 loading file slot-model\\best-model.pt\n", "2022-05-17 23:53:57,745 0.8000\t0.2667\t0.4000\n", "2022-05-17 23:53:57,746 \n", "Results:\n", "- F1-score (micro) 0.4000\n", "- F1-score (macro) 0.2424\n", "\n", "By class:\n", "date tp: 0 - fp: 0 - fn: 4 - precision: 0.0000 - recall: 0.0000 - f1-score: 0.0000\n", "quantity tp: 4 - fp: 1 - fn: 2 - precision: 0.8000 - recall: 0.6667 - f1-score: 0.7273\n", "time tp: 0 - fp: 0 - fn: 5 - precision: 0.0000 - recall: 0.0000 - f1-score: 0.0000\n", "2022-05-17 23:53:57,747 ----------------------------------------------------------------------------------------------------\n" ] }, { "data": { "text/plain": [ "{'test_score': 0.4,\n", " 'dev_score_history': [0.2,\n", " 0.36363636363636365,\n", " 0.3076923076923077,\n", " 0.28571428571428575,\n", " 0.3076923076923077,\n", " 0.28571428571428575,\n", " 0.28571428571428575,\n", " 0.28571428571428575,\n", " 0.3076923076923077,\n", 
" 0.3076923076923077],\n", " 'train_loss_history': [8.107216443334307,\n", " 2.19215042250497,\n", " 1.735515492303031,\n", " 1.5392142619405473,\n", " 2.721130439213344,\n", " 1.6617201566696167,\n", " 1.460041948727199,\n", " 1.2011338983263289,\n", " 1.1026092597416468,\n", " 1.012676673276084],\n", " 'dev_loss_history': [3.991352081298828,\n", " 3.9842920303344727,\n", " 3.3194284439086914,\n", " 2.8986825942993164,\n", " 2.766446590423584,\n", " 2.2595832347869873,\n", " 2.200728416442871,\n", " 2.4227261543273926,\n", " 2.1707162857055664,\n", " 1.9863343238830566]}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trainer = ModelTrainer(tagger, corpus)\n", "trainer.train('slot-model',\n", " learning_rate=0.1,\n", " mini_batch_size=32,\n", " max_epochs=10,\n", " train_with_dev=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Jakość wyuczonego modelu możemy ocenić, korzystając z zaraportowanych powyżej metryk, tj.:\n", "\n", " - *tp (true positives)*\n", "\n", " > liczba słów oznaczonych w zbiorze testowym etykietą $e$, które model oznaczył tą etykietą\n", "\n", " - *fp (false positives)*\n", "\n", " > liczba słów nieoznaczonych w zbiorze testowym etykietą $e$, które model oznaczył tą etykietą\n", "\n", " - *fn (false negatives)*\n", "\n", " > liczba słów oznaczonych w zbiorze testowym etykietą $e$, którym model nie nadał etykiety $e$\n", "\n", " - *precision*\n", "\n", " > $$\\frac{tp}{tp + fp}$$\n", "\n", " - *recall*\n", "\n", " > $$\\frac{tp}{tp + fn}$$\n", "\n", " - $F_1$\n", "\n", " > $$\\frac{2 \\cdot precision \\cdot recall}{precision + recall}$$\n", "\n", " - *micro* $F_1$\n", "\n", " > $F_1$ w którym $tp$, $fp$ i $fn$ są liczone łącznie dla wszystkich etykiet, tj. 
$tp = \\sum_{e}{{tp}_e}$, $fn = \\sum_{e}{{fn}_e}$, $fp = \\sum_{e}{{fp}_e}$\n", "\n", " - *macro* $F_1$\n", "\n", " > the arithmetic mean of the $F_1$ scores computed separately for each label.\n", "\n", "The trained model can be loaded from a file using the `load` method." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2022-05-17 23:57:03,014 loading file slot-model/final-model.pt\n" ] } ], "source": [ "model = SequenceTagger.load('slot-model/final-model.pt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The loaded model can be used to predict slots in user utterances with the `predict` function\n", "shown below." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "def predict(model, sentence):\n", "    # wrap the words in the dict format expected by conllu2flair\n", "    csentence = [{'form': word} for word in sentence]\n", "    fsentence = conllu2flair([csentence])[0]\n", "    # tag the sentence in place, then pair each word with its predicted slot label\n", "    model.predict(fsentence)\n", "    return [(token, ftoken.get_tag('slot').value) for token, ftoken in zip(sentence, fsentence)]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the example below shows, a model trained on such a small sample makes a mistake even in a fairly simple\n", "utterance: it labels `19:30` with the `B-quantity` tag, although the word denotes a time." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
chciałbym O
zarezerwowaćO
2 B-quantity
bilety O
na O
batman O
na O
19:30 B-quantity
na O
środku O
z O
tyłun O
po O
prawej O
i O
po O
lewej O
nie O
chce O
z O
przodu O
" ], "text/plain": [ "'\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n
chciałbym O
zarezerwowaćO
2 B-quantity
bilety O
na O
batman O
na O
19:30 B-quantity
na O
środku O
z O
tyłun O
po O
prawej O
i O
po O
lewej O
nie O
chce O
z O
przodu O
'" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tabulate(predict(model, 'chciałbym zarezerwować 2 bilety na batman na 19:30 na środku z tyłun po prawej i po lewej nie chce z przodu'.split()), tablefmt='html')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "References\n", "----------\n", " 1. Sebastian Schuster, Sonal Gupta, Rushin Shah, Mike Lewis, Cross-lingual Transfer Learning for Multilingual Task Oriented Dialog. NAACL-HLT (1) 2019, pp. 3795-3805\n", " 2. John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 282–289, https://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers\n", " 3. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (November 15, 1997), 1735–1780, https://doi.org/10.1162/neco.1997.9.8.1735\n", " 4. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, Attention is All you Need, NIPS 2017, pp. 5998-6008, https://arxiv.org/abs/1706.03762\n", " 5. Alan Akbik, Duncan Blythe, Roland Vollgraf, Contextual String Embeddings for Sequence Labeling, Proceedings of the 27th International Conference on Computational Linguistics, pp. 
1638–1649, https://www.aclweb.org/anthology/C18-1139.pdf\n" ] } ], "metadata": { "author": "Marek Kubis", "email": "mkubis@amu.edu.pl", "jupytext": { "cell_metadata_filter": "-all", "main_language": "python", "notebook_metadata_filter": "-all" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "lang": "pl", "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.13" }, "subtitle": "8.Parsing semantyczny z wykorzystaniem technik uczenia maszynowego[laboratoria]", "title": "Systemy Dialogowe", "year": "2021" }, "nbformat": 4, "nbformat_minor": 4 }