System_Dialogowy_Janet/07-parsing-semantyczny-uczenie.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Parsing semantyczny z wykorzystaniem technik uczenia maszynowego\n",
    "================================================================\n",
    "\n",
    "Wprowadzenie\n",
    "------------\n",
    "Problem wykrywania slotów i ich wartości w wypowiedziach użytkownika można sformułować jako zadanie\n",
    "polegające na przewidywaniu dla poszczególnych słów etykiet wskazujących na to czy i do jakiego\n",
    "slotu dane słowo należy.\n",
    "\n",
 chciałbym zarezerwować">
    "> I would like to book a table for tomorrow**/day** at twelve**/hour** forty**/hour** five**/hour** for five**/size** people\n",
    "\n",
    "Slot boundaries are marked using a chosen labelling scheme.\n",
    "\n",
    "### IOB scheme\n",
    "\n",
    "| Prefix | Meaning             |\n",
    "|:------:|:--------------------|\n",
    "| I      | inside a slot       |\n",
    "| O      | outside any slot    |\n",
    "| B      | beginning of a slot |\n",
    "\n",
    "> I would like to book a table for tomorrow**/B-day** at twelve**/B-hour** forty**/I-hour** five**/I-hour** for five**/B-size** people\n",
    "\n",
    "### Schemat IOBES\n",
    "\n",
    "| Prefix | Znaczenie                  |\n",
    "|:------:|:---------------------------|\n",
    "| I      | wnętrze slotu (inside)     |\n",
    "| O      | poza slotem (outside)      |\n",
    "| B      | początek slotu (beginning) |\n",
    "| E      | koniec slotu (ending)      |\n",
    "| S      | pojedyncze słowo (single)  |\n",
    "\n",
    "> chciałbym zarezerwować stolik na jutro**/S-day** na godzinę dwunastą**/B-hour** czterdzieści**/I-hour** pięć**/E-hour** na pięć**/S-size** osób\n",
    "\n",
    "Jeżeli dla tak sformułowanego zadania przygotujemy zbiór danych\n",
    "złożony z wypowiedzi użytkownika z oznaczonymi slotami (tzw. *zbiór uczący*),\n",
    "to możemy zastosować techniki (nadzorowanego) uczenia maszynowego w celu zbudowania modelu\n",
    "annotującego wypowiedzi użytkownika etykietami slotów.\n",
    "\n",
    "Do zbudowania takiego modelu można wykorzystać między innymi:\n",
    "\n",
    " 1. warunkowe pola losowe (Lafferty i in.; 2001),\n",
    "\n",
    " 2. rekurencyjne sieci neuronowe, np. sieci LSTM (Hochreiter i Schmidhuber; 1997),\n",
    "\n",
    " 3. transformery (Vaswani i in., 2017).\n",
    "\n",
    "Przykład\n",
    "--------\n",
    "Skorzystamy ze zbioru danych przygotowanego przez Schustera (2019)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "C:\\Users\\Ania\\Desktop\\System_Dialogowy_Janet\\l07\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "A subdirectory or file -p already exists.\n",
      "Error occurred while processing: -p.\n",
      "A subdirectory or file l07 already exists.\n",
      "Error occurred while processing: l07.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "C:\\Users\\Ania\\Desktop\\System_Dialogowy_Janet\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "** Resuming transfer from byte position 8923190\n",
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "\n",
      "  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\n",
      "  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\n",
      "\n",
      "  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0\n",
      "100    49  100    49    0     0     56      0 --:--:-- --:--:-- --:--:--   742\n"
     ]
    }
   ],
   "source": [
    "!mkdir -p l07\n",
    "%cd l07\n",
    "!curl -L -C -  https://fb.me/multilingual_task_oriented_data  -o data.zip\n",
    "%cd .."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Zbiór ten gromadzi wypowiedzi w trzech językach opisane slotami dla dwunastu ram należących do trzech dziedzin `Alarm`, `Reminder` oraz `Weather`. Dane wczytamy korzystając z biblioteki [conllu](https://pypi.org/project/conllu/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: conllu in c:\\programdata\\anaconda3\\lib\\site-packages (4.4)\n"
     ]
    }
   ],
   "source": [
    "!pip3 install conllu\n",
    "import codecs\n",
    "from conllu import parse_incr\n",
    "fields = ['id', 'form', 'frame', 'slot']\n",
    "\n",
    "def nolabel2o(line, i):\n",
    "    return 'O' if line[i] == 'NoLabel' else line[i]\n",
    "\n",
    "with open('Janet_test.conllu', encoding='utf-8') as trainfile:\n",
    "    trainset = list(parse_incr(trainfile, fields=fields, field_parsers={'slot': nolabel2o}))\n",
    "with open('Janet_test.conllu', encoding='utf-8') as testfile:\n",
    "    testset = list(parse_incr(testfile, fields=fields, field_parsers={'slot': nolabel2o}))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Zobaczmy kilka przykładowych wypowiedzi z tego zbioru."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<tbody>\n",
       "<tr><td style=\"text-align: right;\">1</td><td>hej</td><td>greeting</td><td>O</td></tr>\n",
       "</tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "'<table>\\n<tbody>\\n<tr><td style=\"text-align: right;\">1</td><td>hej</td><td>greeting</td><td>O</td></tr>\\n</tbody>\\n</table>'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "!pip3 install tabulate\n",
    "from tabulate import tabulate\n",
    "tabulate(trainset[0], tablefmt='html')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0
   },
   "source": [
    "Na potrzeby prezentacji procesu uczenia w jupyterowym notatniku zawęzimy zbiór danych do początkowych przykładów."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Budując model skorzystamy z architektury opartej o rekurencyjne sieci neuronowe\n",
    "zaimplementowanej w bibliotece [flair](https://github.com/flairNLP/flair) (Akbik i in. 2018)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: flair in c:\\programdata\\anaconda3\\lib\\site-packages (0.8.0.post1)\n",
      "Requirement already satisfied: huggingface-hub in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (0.0.8)\n",
      "Requirement already satisfied: mpld3==0.3 in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (0.3)\n",
      "Requirement already satisfied: langdetect in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (1.0.9)\n",
      "Requirement already satisfied: hyperopt>=0.1.1 in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (0.2.5)\n",
      "Requirement already satisfied: tabulate in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (0.8.9)\n",
      "Requirement already satisfied: torch<=1.7.1,>=1.5.0 in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (1.7.1)\n",
      "Requirement already satisfied: matplotlib>=2.2.3 in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (3.3.2)\n",
      "Requirement already satisfied: sentencepiece==0.1.95 in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (0.1.95)\n",
      "Requirement already satisfied: numpy<1.20.0 in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (1.19.2)\n",
      "Requirement already satisfied: sqlitedict>=1.6.0 in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (1.7.0)\n",
      "Requirement already satisfied: lxml in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (4.6.1)\n",
      "Requirement already satisfied: python-dateutil>=2.6.1 in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (2.8.1)\n",
      "Requirement already satisfied: gensim<=3.8.3,>=3.4.0 in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (3.8.3)\n",
      "Requirement already satisfied: deprecated>=1.2.4 in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (1.2.12)\n",
      "Requirement already satisfied: gdown==3.12.2 in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (3.12.2)\n",
      "Requirement already satisfied: janome in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (0.4.1)\n",
      "Requirement already satisfied: konoha<5.0.0,>=4.0.0 in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (4.6.4)\n",
      "Requirement already satisfied: bpemb>=0.3.2 in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (0.3.3)\n",
      "Requirement already satisfied: tqdm>=4.26.0 in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (4.50.2)\n",
      "Requirement already satisfied: segtok>=1.5.7 in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (1.5.10)\n",
      "Requirement already satisfied: transformers>=4.0.0 in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (4.6.0)\n",
      "Requirement already satisfied: regex in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (2020.10.15)\n",
      "Requirement already satisfied: scikit-learn>=0.21.3 in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (0.23.2)\n",
      "Requirement already satisfied: ftfy in c:\\programdata\\anaconda3\\lib\\site-packages (from flair) (6.0.1)\n",
      "Requirement already satisfied: filelock in c:\\programdata\\anaconda3\\lib\\site-packages (from huggingface-hub->flair) (3.0.12)\n",
      "Requirement already satisfied: requests in c:\\programdata\\anaconda3\\lib\\site-packages (from huggingface-hub->flair) (2.24.0)\n",
      "Requirement already satisfied: six in c:\\programdata\\anaconda3\\lib\\site-packages (from langdetect->flair) (1.15.0)\n",
      "Requirement already satisfied: future in c:\\programdata\\anaconda3\\lib\\site-packages (from hyperopt>=0.1.1->flair) (0.18.2)\n",
      "Requirement already satisfied: scipy in c:\\programdata\\anaconda3\\lib\\site-packages (from hyperopt>=0.1.1->flair) (1.5.2)\n",
      "Requirement already satisfied: cloudpickle in c:\\programdata\\anaconda3\\lib\\site-packages (from hyperopt>=0.1.1->flair) (1.6.0)\n",
      "Requirement already satisfied: networkx>=2.2 in c:\\programdata\\anaconda3\\lib\\site-packages (from hyperopt>=0.1.1->flair) (2.5)\n",
      "Requirement already satisfied: typing-extensions in c:\\programdata\\anaconda3\\lib\\site-packages (from torch<=1.7.1,>=1.5.0->flair) (3.7.4.3)\n",
      "Requirement already satisfied: certifi>=2020.06.20 in c:\\programdata\\anaconda3\\lib\\site-packages (from matplotlib>=2.2.3->flair) (2020.6.20)\n",
      "Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in c:\\programdata\\anaconda3\\lib\\site-packages (from matplotlib>=2.2.3->flair) (2.4.7)\n",
      "Requirement already satisfied: pillow>=6.2.0 in c:\\programdata\\anaconda3\\lib\\site-packages (from matplotlib>=2.2.3->flair) (8.0.1)\n",
      "Requirement already satisfied: kiwisolver>=1.0.1 in c:\\programdata\\anaconda3\\lib\\site-packages (from matplotlib>=2.2.3->flair) (1.3.0)\n",
      "Requirement already satisfied: cycler>=0.10 in c:\\programdata\\anaconda3\\lib\\site-packages (from matplotlib>=2.2.3->flair) (0.10.0)\n",
      "Requirement already satisfied: Cython==0.29.14 in c:\\programdata\\anaconda3\\lib\\site-packages (from gensim<=3.8.3,>=3.4.0->flair) (0.29.14)\n",
      "Requirement already satisfied: smart-open>=1.8.1 in c:\\programdata\\anaconda3\\lib\\site-packages (from gensim<=3.8.3,>=3.4.0->flair) (5.0.0)\n",
      "Requirement already satisfied: wrapt<2,>=1.10 in c:\\users\\ania\\appdata\\roaming\\python\\python38\\site-packages (from deprecated>=1.2.4->flair) (1.12.1)\n",
      "Requirement already satisfied: importlib-metadata<4.0.0,>=3.7.0 in c:\\programdata\\anaconda3\\lib\\site-packages (from konoha<5.0.0,>=4.0.0->flair) (3.10.1)\n",
      "Requirement already satisfied: overrides<4.0.0,>=3.0.0 in c:\\programdata\\anaconda3\\lib\\site-packages (from konoha<5.0.0,>=4.0.0->flair) (3.1.0)\n",
      "Requirement already satisfied: packaging in c:\\programdata\\anaconda3\\lib\\site-packages (from transformers>=4.0.0->flair) (20.4)\n",
      "Requirement already satisfied: tokenizers<0.11,>=0.10.1 in c:\\programdata\\anaconda3\\lib\\site-packages (from transformers>=4.0.0->flair) (0.10.2)\n",
      "Requirement already satisfied: sacremoses in c:\\programdata\\anaconda3\\lib\\site-packages (from transformers>=4.0.0->flair) (0.0.45)\n",
      "Requirement already satisfied: joblib>=0.11 in c:\\programdata\\anaconda3\\lib\\site-packages (from scikit-learn>=0.21.3->flair) (0.17.0)\n",
      "Requirement already satisfied: threadpoolctl>=2.0.0 in c:\\programdata\\anaconda3\\lib\\site-packages (from scikit-learn>=0.21.3->flair) (2.1.0)\n",
      "Requirement already satisfied: wcwidth in c:\\programdata\\anaconda3\\lib\\site-packages (from ftfy->flair) (0.2.5)\n",
      "Requirement already satisfied: idna<3,>=2.5 in c:\\programdata\\anaconda3\\lib\\site-packages (from requests->huggingface-hub->flair) (2.10)\n",
      "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in c:\\programdata\\anaconda3\\lib\\site-packages (from requests->huggingface-hub->flair) (1.25.11)\n",
      "Requirement already satisfied: chardet<4,>=3.0.2 in c:\\programdata\\anaconda3\\lib\\site-packages (from requests->huggingface-hub->flair) (3.0.4)\n",
      "Requirement already satisfied: decorator>=4.3.0 in c:\\programdata\\anaconda3\\lib\\site-packages (from networkx>=2.2->hyperopt>=0.1.1->flair) (4.4.2)\n",
      "Requirement already satisfied: zipp>=0.5 in c:\\programdata\\anaconda3\\lib\\site-packages (from importlib-metadata<4.0.0,>=3.7.0->konoha<5.0.0,>=4.0.0->flair) (3.4.0)\n",
      "Requirement already satisfied: click in c:\\programdata\\anaconda3\\lib\\site-packages (from sacremoses->transformers>=4.0.0->flair) (7.1.2)\n"
     ]
    }
   ],
   "source": [
    "!pip3 install flair"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: torch in c:\\programdata\\anaconda3\\lib\\site-packages (1.7.1)\n",
      "Requirement already satisfied: typing-extensions in c:\\programdata\\anaconda3\\lib\\site-packages (from torch) (3.7.4.3)\n",
      "Requirement already satisfied: numpy in c:\\programdata\\anaconda3\\lib\\site-packages (from torch) (1.19.2)\n"
     ]
    }
   ],
   "source": [
    "from flair.data import Corpus, Sentence, Token\n",
    "from flair.datasets import SentenceDataset\n",
    "from flair.embeddings import StackedEmbeddings\n",
    "from flair.embeddings import WordEmbeddings\n",
    "from flair.embeddings import CharacterEmbeddings\n",
    "from flair.embeddings import FlairEmbeddings\n",
    "from flair.models import SequenceTagger\n",
    "from flair.trainers import ModelTrainer\n",
    "\n",
    "!pip3 install torch\n",
    "# determinizacja obliczeń\n",
    "import random\n",
    "import torch\n",
    "random.seed(42)\n",
    "torch.manual_seed(42)\n",
    "\n",
    "if torch.cuda.is_available():\n",
    "    torch.cuda.manual_seed(0)\n",
    "    torch.cuda.manual_seed_all(0)\n",
    "    torch.backends.cudnn.enabled = False\n",
    "    torch.backends.cudnn.benchmark = False\n",
    "    torch.backends.cudnn.deterministic = True"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Dane skonwertujemy do formatu wykorzystywanego przez `flair`, korzystając z następującej funkcji."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Corpus: 36 train + 4 dev + 40 test sentences\n",
      "Dictionary with 13 tags: <unk>, O, B-appoinment/doctor, I-appoinment/doctor, B-datetime, I-datetime, B-login/id, B-appointment/type, I-appointment/type, B-prescription/type, B-login/password, <START>, <STOP>\n"
     ]
    }
   ],
   "source": [
    "def conllu2flair(sentences, label=None):\n",
    "    fsentences = []\n",
    "\n",
    "    for sentence in sentences:\n",
    "        fsentence = Sentence()\n",
    "\n",
    "        for token in sentence:\n",
    "            ftoken = Token(token['form'])\n",
    "\n",
    "            if label:\n",
    "                ftoken.add_tag(label, token[label])\n",
    "\n",
    "            fsentence.add_token(ftoken)\n",
    "\n",
    "        fsentences.append(fsentence)\n",
    "\n",
    "    return SentenceDataset(fsentences)\n",
    "\n",
    "corpus = Corpus(train=conllu2flair(trainset, 'slot'), test=conllu2flair(testset, 'slot'))\n",
    "print(corpus)\n",
    "tag_dictionary = corpus.make_tag_dictionary(tag_type='slot')\n",
    "print(tag_dictionary)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Nasz model będzie wykorzystywał wektorowe reprezentacje słów (zob. [Word Embeddings](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_3_WORD_EMBEDDING.md))."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "embedding_types = [\n",
    "    WordEmbeddings('pl'),\n",
    "    FlairEmbeddings('pl-forward'),\n",
    "    FlairEmbeddings('pl-backward'),\n",
    "    CharacterEmbeddings(),\n",
    "]\n",
    "\n",
    "embeddings = StackedEmbeddings(embeddings=embedding_types)\n",
    "tagger = SequenceTagger(hidden_size=256, embeddings=embeddings,\n",
    "                        tag_dictionary=tag_dictionary,\n",
    "                        tag_type='slot', use_crf=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Zobaczmy jak wygląda architektura sieci neuronowej, która będzie odpowiedzialna za przewidywanie\n",
    "slotów w wypowiedziach."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "SequenceTagger(\n",
      "  (embeddings): StackedEmbeddings(\n",
      "    (list_embedding_0): WordEmbeddings('pl')\n",
      "    (list_embedding_1): FlairEmbeddings(\n",
      "      (lm): LanguageModel(\n",
      "        (drop): Dropout(p=0.25, inplace=False)\n",
      "        (encoder): Embedding(1602, 100)\n",
      "        (rnn): LSTM(100, 2048)\n",
      "        (decoder): Linear(in_features=2048, out_features=1602, bias=True)\n",
      "      )\n",
      "    )\n",
      "    (list_embedding_2): FlairEmbeddings(\n",
      "      (lm): LanguageModel(\n",
      "        (drop): Dropout(p=0.25, inplace=False)\n",
      "        (encoder): Embedding(1602, 100)\n",
      "        (rnn): LSTM(100, 2048)\n",
      "        (decoder): Linear(in_features=2048, out_features=1602, bias=True)\n",
      "      )\n",
      "    )\n",
      "    (list_embedding_3): CharacterEmbeddings(\n",
      "      (char_embedding): Embedding(275, 25)\n",
      "      (char_rnn): LSTM(25, 25, bidirectional=True)\n",
      "    )\n",
      "  )\n",
      "  (word_dropout): WordDropout(p=0.05)\n",
      "  (locked_dropout): LockedDropout(p=0.5)\n",
      "  (embedding2nn): Linear(in_features=4446, out_features=4446, bias=True)\n",
      "  (rnn): LSTM(4446, 256, batch_first=True, bidirectional=True)\n",
      "  (linear): Linear(in_features=512, out_features=13, bias=True)\n",
      "  (beta): 1.0\n",
      "  (weights): None\n",
      "  (weight_tensor) None\n",
      ")\n"
     ]
    }
   ],
   "source": [
    "print(tagger)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Wykonamy dziesięć iteracji (epok) uczenia a wynikowy model zapiszemy w katalogu `slot-model`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2021-05-16 11:40:14,273 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:40:14,274 Model: \"SequenceTagger(\n",
      "  (embeddings): StackedEmbeddings(\n",
      "    (list_embedding_0): WordEmbeddings('pl')\n",
      "    (list_embedding_1): FlairEmbeddings(\n",
      "      (lm): LanguageModel(\n",
      "        (drop): Dropout(p=0.25, inplace=False)\n",
      "        (encoder): Embedding(1602, 100)\n",
      "        (rnn): LSTM(100, 2048)\n",
      "        (decoder): Linear(in_features=2048, out_features=1602, bias=True)\n",
      "      )\n",
      "    )\n",
      "    (list_embedding_2): FlairEmbeddings(\n",
      "      (lm): LanguageModel(\n",
      "        (drop): Dropout(p=0.25, inplace=False)\n",
      "        (encoder): Embedding(1602, 100)\n",
      "        (rnn): LSTM(100, 2048)\n",
      "        (decoder): Linear(in_features=2048, out_features=1602, bias=True)\n",
      "      )\n",
      "    )\n",
      "    (list_embedding_3): CharacterEmbeddings(\n",
      "      (char_embedding): Embedding(275, 25)\n",
      "      (char_rnn): LSTM(25, 25, bidirectional=True)\n",
      "    )\n",
      "  )\n",
      "  (word_dropout): WordDropout(p=0.05)\n",
      "  (locked_dropout): LockedDropout(p=0.5)\n",
      "  (embedding2nn): Linear(in_features=4446, out_features=4446, bias=True)\n",
      "  (rnn): LSTM(4446, 256, batch_first=True, bidirectional=True)\n",
      "  (linear): Linear(in_features=512, out_features=13, bias=True)\n",
      "  (beta): 1.0\n",
      "  (weights): None\n",
      "  (weight_tensor) None\n",
      ")\"\n",
      "2021-05-16 11:40:14,275 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:40:14,277 Corpus: \"Corpus: 36 train + 4 dev + 40 test sentences\"\n",
      "2021-05-16 11:40:14,277 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:40:14,278 Parameters:\n",
      "2021-05-16 11:40:14,279  - learning_rate: \"0.1\"\n",
      "2021-05-16 11:40:14,280  - mini_batch_size: \"32\"\n",
      "2021-05-16 11:40:14,280  - patience: \"3\"\n",
      "2021-05-16 11:40:14,281  - anneal_factor: \"0.5\"\n",
      "2021-05-16 11:40:14,282  - max_epochs: \"10\"\n",
      "2021-05-16 11:40:14,283  - shuffle: \"True\"\n",
      "2021-05-16 11:40:14,285  - train_with_dev: \"False\"\n",
      "2021-05-16 11:40:14,286  - batch_growth_annealing: \"False\"\n",
      "2021-05-16 11:40:14,287 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:40:14,288 Model training base path: \"slot-model\"\n",
      "2021-05-16 11:40:14,288 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:40:14,289 Device: cpu\n",
      "2021-05-16 11:40:14,290 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:40:14,292 Embeddings storage mode: cpu\n",
      "2021-05-16 11:40:14,295 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:40:18,737 epoch 1 - iter 1/2 - loss 13.17695141 - samples/sec: 7.21 - lr: 0.100000\n",
      "2021-05-16 11:40:19,989 epoch 1 - iter 2/2 - loss 11.51309586 - samples/sec: 25.57 - lr: 0.100000\n",
      "2021-05-16 11:40:19,989 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:40:19,989 EPOCH 1 done: loss 11.5131 - lr 0.1000000\n",
      "2021-05-16 11:40:20,670 DEV : loss 5.320306777954102 - score 0.0\n",
      "2021-05-16 11:40:20,671 BAD EPOCHS (no improvement): 0\n",
      "saving best model\n",
      "2021-05-16 11:40:30,073 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:40:30,802 epoch 2 - iter 1/2 - loss 8.20096970 - samples/sec: 45.04 - lr: 0.100000\n",
      "2021-05-16 11:40:31,005 epoch 2 - iter 2/2 - loss 5.87843704 - samples/sec: 157.40 - lr: 0.100000\n",
      "2021-05-16 11:40:31,006 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:40:31,008 EPOCH 2 done: loss 5.8784 - lr 0.1000000\n",
      "2021-05-16 11:40:31,020 DEV : loss 2.201185703277588 - score 0.0\n",
      "2021-05-16 11:40:31,038 BAD EPOCHS (no improvement): 0\n",
      "saving best model\n",
      "2021-05-16 11:40:40,878 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:40:41,800 epoch 3 - iter 1/2 - loss 3.59802794 - samples/sec: 34.83 - lr: 0.100000\n",
      "2021-05-16 11:40:42,230 epoch 3 - iter 2/2 - loss 7.24588382 - samples/sec: 74.64 - lr: 0.100000\n",
      "2021-05-16 11:40:42,231 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:40:42,233 EPOCH 3 done: loss 7.2459 - lr 0.1000000\n",
      "2021-05-16 11:40:42,290 DEV : loss 2.3815672397613525 - score 0.0\n",
      "2021-05-16 11:40:42,295 BAD EPOCHS (no improvement): 1\n",
      "2021-05-16 11:40:42,300 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:40:43,662 epoch 4 - iter 1/2 - loss 4.05115032 - samples/sec: 23.57 - lr: 0.100000\n",
      "2021-05-16 11:40:44,013 epoch 4 - iter 2/2 - loss 3.16846037 - samples/sec: 91.53 - lr: 0.100000\n",
      "2021-05-16 11:40:44,015 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:40:44,018 EPOCH 4 done: loss 3.1685 - lr 0.1000000\n",
      "2021-05-16 11:40:44,072 DEV : loss 1.7660648822784424 - score 0.0\n",
      "2021-05-16 11:40:44,075 BAD EPOCHS (no improvement): 0\n",
      "saving best model\n",
      "2021-05-16 11:40:53,620 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:40:54,419 epoch 5 - iter 1/2 - loss 3.52825356 - samples/sec: 40.10 - lr: 0.100000\n",
      "2021-05-16 11:40:54,594 epoch 5 - iter 2/2 - loss 3.12245941 - samples/sec: 183.91 - lr: 0.100000\n",
      "2021-05-16 11:40:54,595 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:40:54,596 EPOCH 5 done: loss 3.1225 - lr 0.1000000\n",
      "2021-05-16 11:40:54,624 DEV : loss 1.8835055828094482 - score 0.0\n",
      "2021-05-16 11:40:54,626 BAD EPOCHS (no improvement): 1\n",
      "2021-05-16 11:40:54,627 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:40:55,393 epoch 6 - iter 1/2 - loss 2.84318709 - samples/sec: 41.88 - lr: 0.100000\n",
      "2021-05-16 11:40:55,648 epoch 6 - iter 2/2 - loss 4.79819477 - samples/sec: 125.98 - lr: 0.100000\n",
      "2021-05-16 11:40:55,649 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:40:55,650 EPOCH 6 done: loss 4.7982 - lr 0.1000000\n",
      "2021-05-16 11:40:55,675 DEV : loss 1.9106686115264893 - score 0.0\n",
      "2021-05-16 11:40:55,677 BAD EPOCHS (no improvement): 2\n",
      "2021-05-16 11:40:55,678 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:40:56,467 epoch 7 - iter 1/2 - loss 3.35292196 - samples/sec: 40.66 - lr: 0.100000\n",
      "2021-05-16 11:40:56,661 epoch 7 - iter 2/2 - loss 1.90253919 - samples/sec: 165.80 - lr: 0.100000\n",
      "2021-05-16 11:40:56,662 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:40:56,663 EPOCH 7 done: loss 1.9025 - lr 0.1000000\n",
      "2021-05-16 11:40:56,689 DEV : loss 1.5785303115844727 - score 0.0\n",
      "2021-05-16 11:40:56,691 BAD EPOCHS (no improvement): 0\n",
      "saving best model\n",
      "2021-05-16 11:41:09,226 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:41:10,375 epoch 8 - iter 1/2 - loss 3.24992299 - samples/sec: 27.87 - lr: 0.100000\n",
      "2021-05-16 11:41:10,744 epoch 8 - iter 2/2 - loss 3.30123496 - samples/sec: 87.17 - lr: 0.100000\n",
      "2021-05-16 11:41:10,745 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:41:10,746 EPOCH 8 done: loss 3.3012 - lr 0.1000000\n",
      "2021-05-16 11:41:10,798 DEV : loss 1.590420126914978 - score 0.0\n",
      "2021-05-16 11:41:10,802 BAD EPOCHS (no improvement): 1\n",
      "2021-05-16 11:41:10,807 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:41:12,175 epoch 9 - iter 1/2 - loss 2.74546242 - samples/sec: 23.41 - lr: 0.100000\n",
      "2021-05-16 11:41:12,515 epoch 9 - iter 2/2 - loss 2.34704965 - samples/sec: 94.40 - lr: 0.100000\n",
      "2021-05-16 11:41:12,518 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:41:12,520 EPOCH 9 done: loss 2.3470 - lr 0.1000000\n",
      "2021-05-16 11:41:12,573 DEV : loss 1.6068150997161865 - score 0.0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2021-05-16 11:41:12,575 BAD EPOCHS (no improvement): 2\n",
      "2021-05-16 11:41:12,577 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:41:13,690 epoch 10 - iter 1/2 - loss 2.63941884 - samples/sec: 28.79 - lr: 0.100000\n",
      "2021-05-16 11:41:13,878 epoch 10 - iter 2/2 - loss 2.18226165 - samples/sec: 171.12 - lr: 0.100000\n",
      "2021-05-16 11:41:13,879 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:41:13,880 EPOCH 10 done: loss 2.1823 - lr 0.1000000\n",
      "2021-05-16 11:41:13,906 DEV : loss 1.458857536315918 - score 0.0\n",
      "2021-05-16 11:41:13,907 BAD EPOCHS (no improvement): 0\n",
      "saving best model\n",
      "2021-05-16 11:41:33,558 ----------------------------------------------------------------------------------------------------\n",
      "2021-05-16 11:41:33,559 Testing using best model ...\n",
      "2021-05-16 11:41:33,560 loading file slot-model\\best-model.pt\n",
      "2021-05-16 11:41:45,502 0.1765\t0.1667\t0.1714\n",
      "2021-05-16 11:41:45,503 \n",
      "Results:\n",
      "- F1-score (micro) 0.1714\n",
      "- F1-score (macro) 0.1161\n",
      "\n",
      "By class:\n",
      "appoinment/doctor tp: 1 - fp: 9 - fn: 5 - precision: 0.1000 - recall: 0.1667 - f1-score: 0.1250\n",
      "appointment/type tp: 0 - fp: 0 - fn: 2 - precision: 0.0000 - recall: 0.0000 - f1-score: 0.0000\n",
      "datetime   tp: 0 - fp: 1 - fn: 3 - precision: 0.0000 - recall: 0.0000 - f1-score: 0.0000\n",
      "login/id   tp: 2 - fp: 2 - fn: 1 - precision: 0.5000 - recall: 0.6667 - f1-score: 0.5714\n",
      "login/password tp: 0 - fp: 0 - fn: 3 - precision: 0.0000 - recall: 0.0000 - f1-score: 0.0000\n",
      "prescription/type tp: 0 - fp: 2 - fn: 1 - precision: 0.0000 - recall: 0.0000 - f1-score: 0.0000\n",
      "2021-05-16 11:41:45,503 ----------------------------------------------------------------------------------------------------\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'test_score': 0.17142857142857143,\n",
       " 'dev_score_history': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],\n",
       " 'train_loss_history': [11.51309585571289,\n",
       "  5.878437042236328,\n",
       "  7.245883822441101,\n",
       "  3.1684603691101074,\n",
       "  3.1224594116210938,\n",
       "  4.798194766044617,\n",
       "  1.9025391936302185,\n",
       "  3.3012349605560303,\n",
       "  2.347049653530121,\n",
       "  2.182261645793915],\n",
       " 'dev_loss_history': [5.320306777954102,\n",
       "  2.201185703277588,\n",
       "  2.3815672397613525,\n",
       "  1.7660648822784424,\n",
       "  1.8835055828094482,\n",
       "  1.9106686115264893,\n",
       "  1.5785303115844727,\n",
       "  1.590420126914978,\n",
       "  1.6068150997161865,\n",
       "  1.458857536315918]}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainer = ModelTrainer(tagger, corpus)\n",
    "trainer.train('slot-model',\n",
    "              learning_rate=0.1,\n",
    "              mini_batch_size=32,\n",
    "              max_epochs=10,\n",
    "              train_with_dev=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Jakość wyuczonego modelu możemy ocenić, korzystając z zaraportowanych powyżej metryk, tj.:\n",
    "\n",
    " - *tp (true positives)*\n",
    "\n",
    "   > liczba słów oznaczonych w zbiorze testowym etykietą $e$, które model oznaczył tą etykietą\n",
    "\n",
    " - *fp (false positives)*\n",
    "\n",
    "   > liczba słów nieoznaczonych w zbiorze testowym etykietą $e$, które model oznaczył tą etykietą\n",
    "\n",
    " - *fn (false negatives)*\n",
    "\n",
    "   > liczba słów oznaczonych w zbiorze testowym etykietą $e$, którym model nie nadał etykiety $e$\n",
    "\n",
    " - *precision*\n",
    "\n",
    "   > $$\\frac{tp}{tp + fp}$$\n",
    "\n",
    " - *recall*\n",
    "\n",
    "   > $$\\frac{tp}{tp + fn}$$\n",
    "\n",
    " - $F_1$\n",
    "\n",
    "   > $$\\frac{2 \\cdot precision \\cdot recall}{precision + recall}$$\n",
    "\n",
    " - *micro* $F_1$\n",
    "\n",
    "   > $F_1$ w którym $tp$, $fp$ i $fn$ są liczone łącznie dla wszystkich etykiet, tj. $tp = \\sum_{e}{{tp}_e}$, $fn = \\sum_{e}{{fn}_e}$, $fp = \\sum_{e}{{fp}_e}$\n",
    "\n",
    " - *macro* $F_1$\n",
    "\n",
    "   > średnia arytmetyczna z $F_1$ obliczonych dla poszczególnych etykiet z osobna.\n",
    "\n",
    "Wyuczony model możemy wczytać z pliku korzystając z metody `load`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2021-05-16 11:41:45,529 loading file slot-model/final-model.pt\n"
     ]
    }
   ],
   "source": [
    "model = SequenceTagger.load('slot-model/final-model.pt')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Wczytany model możemy wykorzystać do przewidywania slotów w wypowiedziach użytkownika, korzystając\n",
    "z przedstawionej poniżej funkcji `predict`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "def predict(model, sentence):\n",
    "    csentence = [{'form': word} for word in sentence]\n",
    "    fsentence = conllu2flair([csentence])[0]\n",
    "    model.predict(fsentence)\n",
    "    return [(token, ftoken.get_tag('slot').value) for token, ftoken in zip(sentence, fsentence)]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Jak pokazuje przykład poniżej model wyuczony tylko na 100 przykładach popełnia w dosyć prostej\n",
    "wypowiedzi błąd etykietując słowo `alarm` tagiem `B-weather/noun`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<tbody>\n",
       "<tr><td>doktor        </td><td>I-appoinment/doctor</td></tr>\n",
       "<tr><td>lekarza       </td><td>B-appoinment/doctor</td></tr>\n",
       "<tr><td>rodzinnego    </td><td>O                  </td></tr>\n",
       "<tr><td>najlepiej     </td><td>O                  </td></tr>\n",
       "<tr><td>dzisiaj       </td><td>O                  </td></tr>\n",
       "<tr><td>w             </td><td>O                  </td></tr>\n",
       "<tr><td>godzinach     </td><td>O                  </td></tr>\n",
       "<tr><td>popołudniowych</td><td>O                  </td></tr>\n",
       "<tr><td>dziś          </td><td>O                  </td></tr>\n",
       "</tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "'<table>\\n<tbody>\\n<tr><td>doktor        </td><td>I-appoinment/doctor</td></tr>\\n<tr><td>lekarza       </td><td>B-appoinment/doctor</td></tr>\\n<tr><td>rodzinnego    </td><td>O                  </td></tr>\\n<tr><td>najlepiej     </td><td>O                  </td></tr>\\n<tr><td>dzisiaj       </td><td>O                  </td></tr>\\n<tr><td>w             </td><td>O                  </td></tr>\\n<tr><td>godzinach     </td><td>O                  </td></tr>\\n<tr><td>popołudniowych</td><td>O                  </td></tr>\\n<tr><td>dziś          </td><td>O                  </td></tr>\\n</tbody>\\n</table>'"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tabulate(predict(model, 'doktor lekarza rodzinnego najlepiej dzisiaj w godzinach popołudniowych dziś '.split()), tablefmt='html')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Literatura\n",
    "----------\n",
    " 1. Sebastian Schuster, Sonal Gupta, Rushin Shah, Mike Lewis, Cross-lingual Transfer Learning for Multilingual Task Oriented Dialog. NAACL-HLT (1) 2019, pp. 3795-3805\n",
    " 2. John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 282–289, https://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers\n",
    " 3. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (November 15, 1997), 1735–1780, https://doi.org/10.1162/neco.1997.9.8.1735\n",
    " 4. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, Attention is All you Need, NIPS 2017, pp. 5998-6008, https://arxiv.org/abs/1706.03762\n",
    " 5. Alan Akbik, Duncan Blythe, Roland Vollgraf, Contextual String Embeddings for Sequence Labeling, Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638–1649, https://www.aclweb.org/anthology/C18-1139.pdf\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "jupytext": {
   "cell_metadata_filter": "-all",
   "main_language": "python",
   "notebook_metadata_filter": "-all"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}