{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "68bc3d74",
   "metadata": {},
   "source": [
    "Semantic parsing with machine learning techniques\n",
    "=================================================\n",
    "\n",
    "Introduction\n",
    "------------\n",
    "The problem of detecting slots and their values in user utterances can be formulated as the task\n",
    "of predicting, for each word, a label that indicates whether the word belongs to a slot and,\n",
    "if so, to which one.\n",
    "\n",
    "<pre>I would like to book a table for tomorrow<b>/day</b> at twelve<b>/hour</b> forty<b>/hour</b> five<b>/hour</b> for five<b>/size</b> people</pre>\n",
    "\n",
    "Slot boundaries are marked using a chosen tagging scheme.\n",
    "\n",
    "### IOB scheme\n",
    "\n",
    "| Prefix | Meaning                         |\n",
    "|:------:|:--------------------------------|\n",
    "| I      | inside a slot (inside)          |\n",
    "| O      | outside any slot (outside)      |\n",
    "| B      | beginning of a slot (beginning) |\n",
    "\n",
    "<pre>I would like to book a table for tomorrow<b>/B-day</b> at twelve<b>/B-hour</b> forty<b>/I-hour</b> five<b>/I-hour</b> for five<b>/B-size</b> people</pre>\n",
    "\n",
    "### IOBES scheme\n",
    "\n",
    "| Prefix | Meaning                         |\n",
    "|:------:|:--------------------------------|\n",
    "| I      | inside a slot (inside)          |\n",
    "| O      | outside any slot (outside)      |\n",
    "| B      | beginning of a slot (beginning) |\n",
    "| E      | end of a slot (ending)          |\n",
    "| S      | single-word slot (single)       |\n",
    "\n",
    "<pre>I would like to book a table for tomorrow<b>/S-day</b> at twelve<b>/B-hour</b> forty<b>/I-hour</b> five<b>/E-hour</b> for five<b>/S-size</b> people</pre>\n",
    "\n",
    "If we prepare a dataset for this task, consisting of user utterances\n",
    "with annotated slots (the so-called *training set*),\n",
    "we can apply (supervised) machine learning techniques to build a model\n",
    "that annotates user utterances with slot labels.\n",
    "\n",
    "Such a model can be built using, among others:\n",
    "\n",
    " 1. conditional random fields (Lafferty et al., 2001),\n",
    "\n",
    " 2. recurrent neural networks, e.g. LSTM networks (Hochreiter and Schmidhuber, 1997),\n",
    "\n",
    " 3. transformers (Vaswani et al., 2017).\n",
    "\n",
    "Example\n",
    "-------\n",
    "We will use the dataset prepared by Schuster et al. (2019)."
   ]
  },
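  {
   "cell_type": "markdown",
   "id": "iobes-sketch-md",
   "metadata": {},
   "source": [
    "As a brief aside before the data is downloaded: the relationship between the IOB and IOBES schemes described in the introduction can be illustrated with a minimal sketch in plain Python. The helper `iob_to_iobes` and the shortened tag sequence are hypothetical and serve only as an illustration; they are not part of the dataset or of any library used in this notebook. The sketch assumes a well-formed IOB input in which every slot starts with a `B-` tag."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "iobes-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "def iob_to_iobes(tags):\n",
    "    # hypothetical helper: rewrite IOB tags (B-x, I-x, O) as IOBES tags\n",
    "    iobes = []\n",
    "    for i, tag in enumerate(tags):\n",
    "        if tag == 'O':\n",
    "            iobes.append(tag)\n",
    "            continue\n",
    "        prefix, slot = tag.split('-', 1)\n",
    "        # the slot continues only if the next tag is I- with the same slot name\n",
    "        next_tag = tags[i + 1] if i + 1 < len(tags) else 'O'\n",
    "        continues = next_tag == f'I-{slot}'\n",
    "        if prefix == 'B':\n",
    "            # a B- tag with no continuation is a single-word slot\n",
    "            iobes.append(f'B-{slot}' if continues else f'S-{slot}')\n",
    "        else:\n",
    "            # an I- tag with no continuation is the last word of its slot\n",
    "            iobes.append(f'I-{slot}' if continues else f'E-{slot}')\n",
    "    return iobes\n",
    "\n",
    "# shortened tag sequence modelled on the introductory example\n",
    "tags = ['O', 'B-day', 'O', 'B-hour', 'I-hour', 'I-hour', 'O', 'B-size', 'O']\n",
    "assert iob_to_iobes(tags) == ['O', 'S-day', 'O', 'B-hour', 'I-hour', 'E-hour', 'O', 'S-size', 'O']\n",
    "print(iob_to_iobes(tags))"
   ]
  },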
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8cca8cd1",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p l07\n",
    "%cd l07\n",
    "!curl -L -C - https://fb.me/multilingual_task_oriented_data -o data.zip\n",
    "!unzip data.zip\n",
    "%cd .."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56d91f6c",
   "metadata": {},
   "source": [
    "This dataset contains utterances in three languages, annotated with slots for twelve frames that belong to three domains: `Alarm`, `Reminder`, and `Weather`. We will load the data using the [conllu](https://pypi.org/project/conllu/) library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18b9a032",
   "metadata": {},
   "outputs": [],
   "source": [
    "from conllu import parse_incr\n",
    "\n",
    "# names assigned to the columns of the dataset files\n",
    "fields = ['id', 'form', 'frame', 'slot']\n",
    "\n",
    "def nolabel2o(line, i):\n",
    "    # map the dataset's 'NoLabel' marker to the 'O' (outside) tag\n",
    "    return 'O' if line[i] == 'NoLabel' else line[i]\n",
    "\n",
    "with open('l07/en/train-en.conllu', encoding='utf-8') as trainfile:\n",
    "    trainset = list(parse_incr(trainfile, fields=fields, field_parsers={'slot': nolabel2o}))\n",
    "with open('l07/en/test-en.conllu', encoding='utf-8') as testfile:\n",
    "    testset = list(parse_incr(testfile, fields=fields, field_parsers={'slot': nolabel2o}))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7477593e",
   "metadata": {},
   "source": [
    "Let us look at a few sample utterances from this dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2799ad2",
   "metadata": {},
   "outputs": [],
   "source": [
    "from tabulate import tabulate\n",
    "tabulate(trainset[0], tablefmt='html')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ba2c2706",
   "metadata": {},
   "outputs": [],
   "source": [
    "tabulate(trainset[1000], tablefmt='html')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5c9db18",
   "metadata": {},
   "outputs": [],
   "source": [
    "tabulate(trainset[2000], tablefmt='html')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0f35074d",
   "metadata": {
    "lines_to_next_cell": 0
   },
   "source": [
    "To keep the training process short enough to demonstrate in a Jupyter notebook, we will restrict each split of the dataset to its first 300 examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f735ca85",
   "metadata": {},
   "outputs": [],
   "source": [
    "trainset = trainset[:300]\n",
    "testset = testset[:300]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66284486",
   "metadata": {},
   "source": [
    "To build the model we will use an architecture based on recurrent neural networks,\n",
    "implemented in the [flair](https://github.com/flairNLP/flair) library (Akbik et al., 2018)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f3e30f81",
   "metadata": {},
   "outputs": [],
   "source": [
    "from flair.data import Corpus, Sentence, Token\n",
    "from flair.datasets import FlairDatapointDataset\n",
    "from flair.embeddings import StackedEmbeddings\n",
    "from flair.embeddings import WordEmbeddings\n",
    "from flair.embeddings import CharacterEmbeddings\n",
    "from flair.embeddings import FlairEmbeddings\n",
    "from flair.models import SequenceTagger\n",
    "from flair.trainers import ModelTrainer\n",
    "\n",
    "# make the computations deterministic\n",
    "import random\n",
    "import torch\n",
    "random.seed(42)\n",
    "torch.manual_seed(42)\n",
    "\n",
    "if torch.cuda.is_available():\n",
    "    torch.cuda.manual_seed(0)\n",
    "    torch.cuda.manual_seed_all(0)\n",
    "    torch.backends.cudnn.enabled = False\n",
    "    torch.backends.cudnn.benchmark = False\n",
    "    torch.backends.cudnn.deterministic = True"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1a33987",
   "metadata": {},
   "source": [
    "We will convert the data to the format used by `flair` with the following function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f3c47593",
   "metadata": {},
   "outputs": [],
   "source": [
    "def conllu2flair(sentences, label=None):\n",
    "    fsentences = []\n",
    "\n",
    "    for sentence in sentences:\n",
    "        fsentence = Sentence(' '.join(token['form'] for token in sentence), use_tokenizer=False)\n",
    "        start_idx = None\n",
    "        end_idx = None\n",
    "        tag = None\n",
    "\n",
    "        if label:\n",
    "            for idx, (token, ftoken) in enumerate(zip(sentence, fsentence)):\n",
    "                if token[label].startswith('B-'):\n",
    "                    # close a span that is directly followed by a new one\n",
    "                    if start_idx is not None:\n",
    "                        fsentence[start_idx:end_idx+1].add_label(label, tag)\n",
    "                    start_idx = idx\n",
    "                    end_idx = idx\n",
    "                    tag = token[label][2:]\n",
    "                elif token[label].startswith('I-'):\n",
    "                    end_idx = idx\n",
    "                elif token[label] == 'O':\n",
    "                    if start_idx is not None:\n",
    "                        fsentence[start_idx:end_idx+1].add_label(label, tag)\n",
    "                        start_idx = None\n",
    "                        end_idx = None\n",
    "                        tag = None\n",
    "\n",
    "            # close a span that reaches the end of the sentence\n",
    "            if start_idx is not None:\n",
    "                fsentence[start_idx:end_idx+1].add_label(label, tag)\n",
    "\n",
    "        fsentences.append(fsentence)\n",
    "\n",
    "    return FlairDatapointDataset(fsentences)\n",
    "\n",
    "corpus = Corpus(train=conllu2flair(trainset, 'slot'), test=conllu2flair(testset, 'slot'))\n",
    "print(corpus)\n",
    "tag_dictionary = corpus.make_label_dictionary(label_type='slot')\n",
    "print(tag_dictionary)"
   ]
  },
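  {
   "cell_type": "markdown",
   "id": "iob-spans-sketch-md",
   "metadata": {},
   "source": [
    "The grouping of `B-`/`I-` tags into labeled spans, which `conllu2flair` performs while building `flair` sentences, can also be sketched without `flair`. The helper `iob_spans` and the slot names in the sample are hypothetical and only illustrate the idea; the function returns `(start, end, slot)` triples with inclusive token indices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "iob-spans-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "def iob_spans(tags):\n",
    "    # hypothetical helper: collect (start, end, slot) triples from IOB tags\n",
    "    spans = []\n",
    "    start, slot = None, None\n",
    "    for idx, tag in enumerate(tags):\n",
    "        # an I- tag extends the open span only when the slot name matches\n",
    "        if tag.startswith('I-') and start is not None and tag[2:] == slot:\n",
    "            continue\n",
    "        # any other tag closes the span that is currently open\n",
    "        if start is not None:\n",
    "            spans.append((start, idx - 1, slot))\n",
    "            start, slot = None, None\n",
    "        if tag.startswith('B-'):\n",
    "            start, slot = idx, tag[2:]\n",
    "    # close a span that reaches the end of the sequence\n",
    "    if start is not None:\n",
    "        spans.append((start, len(tags) - 1, slot))\n",
    "    return spans\n",
    "\n",
    "# illustrative tags; the slot names are assumptions, not taken from the dataset\n",
    "sample = ['O', 'B-datetime', 'I-datetime', 'B-location', 'O']\n",
    "assert iob_spans(sample) == [(1, 2, 'datetime'), (3, 3, 'location')]\n",
    "print(iob_spans(sample))"
   ]
  },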
  {
   "cell_type": "markdown",
   "id": "0ed59fb2",
   "metadata": {},
   "source": [
    "Our model will use vector representations of words (see [Word Embeddings](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_EMBEDDINGS_OVERVIEW.md))."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "408cf961",
   "metadata": {},
   "outputs": [],
   "source": [
    "embedding_types = [\n",
    "    WordEmbeddings('en'),\n",
    "    FlairEmbeddings('en-forward'),\n",
    "    FlairEmbeddings('en-backward'),\n",
    "    CharacterEmbeddings(),\n",
    "]\n",
    "\n",
    "embeddings = StackedEmbeddings(embeddings=embedding_types)\n",
    "tagger = SequenceTagger(hidden_size=256, embeddings=embeddings,\n",
    "                        tag_dictionary=tag_dictionary,\n",
    "                        tag_type='slot', use_crf=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab634218",
   "metadata": {},
   "source": [
    "Let us inspect the architecture of the neural network that will be responsible for predicting\n",
    "slots in utterances."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "04d0bbf3",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(tagger)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e0da880",
   "metadata": {},
   "source": [
    "We will run ten training iterations (epochs) and save the resulting model in the `slot-model` directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0fd2b573",
   "metadata": {},
   "outputs": [],
   "source": [
    "trainer = ModelTrainer(tagger, corpus)\n",
    "trainer.train('slot-model',\n",
    "              learning_rate=0.1,\n",
    "              mini_batch_size=32,\n",
    "              max_epochs=10,\n",
    "              train_with_dev=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bcd0c303",
   "metadata": {},
   "source": [
    "The quality of the trained model can be assessed using the metrics reported above, i.e.:\n",
    "\n",
    " - *tp (true positives)*\n",
    "\n",
    "   > the number of words labeled $e$ in the test set that the model also labeled $e$\n",
    "\n",
    " - *fp (false positives)*\n",
    "\n",
    "   > the number of words not labeled $e$ in the test set that the model labeled $e$\n",
    "\n",
    " - *fn (false negatives)*\n",
    "\n",
    "   > the number of words labeled $e$ in the test set to which the model did not assign the label $e$\n",
    "\n",
    " - *precision*\n",
    "\n",
    "   > $$\\frac{tp}{tp + fp}$$\n",
    "\n",
    " - *recall*\n",
    "\n",
    "   > $$\\frac{tp}{tp + fn}$$\n",
    "\n",
    " - $F_1$\n",
    "\n",
    "   > $$\\frac{2 \\cdot precision \\cdot recall}{precision + recall}$$\n",
    "\n",
    " - *micro* $F_1$\n",
    "\n",
    "   > $F_1$ in which $tp$, $fp$ and $fn$ are counted jointly over all labels, i.e. $tp = \\sum_{e}{{tp}_e}$, $fn = \\sum_{e}{{fn}_e}$, $fp = \\sum_{e}{{fp}_e}$\n",
    "\n",
    " - *macro* $F_1$\n",
    "\n",
    "   > the arithmetic mean of the $F_1$ scores computed separately for each label.\n",
    "\n",
    "The trained model can be loaded from a file using the `load` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d12596c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "model = SequenceTagger.load('slot-model/final-model.pt')"
   ]
  },
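  {
   "cell_type": "markdown",
   "id": "f1-sketch-md",
   "metadata": {},
   "source": [
    "Returning briefly to the evaluation metrics defined earlier, the difference between micro- and macro-averaged $F_1$ can be checked on a toy example. The per-label counts below are hypothetical and are not results produced by the model trained in this notebook; the helper `prf` is likewise introduced only for this sketch. Because the frequent label dominates the pooled counts, micro $F_1$ comes out higher than macro $F_1$ here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "def prf(tp, fp, fn):\n",
    "    # precision, recall and F1 from raw counts; 0.0 when a ratio is undefined\n",
    "    precision = tp / (tp + fp) if tp + fp else 0.0\n",
    "    recall = tp / (tp + fn) if tp + fn else 0.0\n",
    "    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0\n",
    "    return precision, recall, f1\n",
    "\n",
    "# hypothetical per-label counts (tp, fp, fn), not taken from the trained model\n",
    "counts = {'datetime': (8, 2, 2), 'location': (1, 1, 3)}\n",
    "\n",
    "# macro F1: average the per-label F1 scores\n",
    "macro_f1 = sum(prf(*c)[2] for c in counts.values()) / len(counts)\n",
    "\n",
    "# micro F1: pool the counts over all labels first, then compute F1 once\n",
    "tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))\n",
    "micro_f1 = prf(tp, fp, fn)[2]\n",
    "\n",
    "print(f'macro F1 = {macro_f1:.3f}, micro F1 = {micro_f1:.3f}')"
   ]
  },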
  {
   "cell_type": "markdown",
   "id": "a97dd603",
   "metadata": {},
   "source": [
    "The loaded model can be used to predict slots in user utterances with\n",
    "the `predict` function shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "87c310cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "def predict(model, sentence):\n",
    "    # wrap the words in token dicts, initially all tagged 'O'\n",
    "    csentence = [{'form': word, 'slot': 'O'} for word in sentence]\n",
    "    fsentence = conllu2flair([csentence])[0]\n",
    "    model.predict(fsentence)\n",
    "\n",
    "    # write the predicted spans back as IOB tags\n",
    "    for span in fsentence.get_spans('slot'):\n",
    "        tag = span.get_label('slot').value\n",
    "        csentence[span.tokens[0].idx - 1]['slot'] = f'B-{tag}'\n",
    "\n",
    "        for token in span.tokens[1:]:\n",
    "            csentence[token.idx - 1]['slot'] = f'I-{tag}'\n",
    "\n",
    "    return csentence\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "97043331",
   "metadata": {},
   "outputs": [],
   "source": [
    "tabulate(predict(model, 'set alarm for 20 minutes'.split()), tablefmt='html')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29856a8a",
   "metadata": {},
   "outputs": [],
   "source": [
    "tabulate(predict(model, 'change my 3 pm alarm to the next day'.split()), tablefmt='html')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21b00302",
   "metadata": {},
   "source": [
    "References\n",
    "----------\n",
    " 1. Sebastian Schuster, Sonal Gupta, Rushin Shah, Mike Lewis. Cross-lingual Transfer Learning for Multilingual Task Oriented Dialog. NAACL-HLT (1) 2019, pp. 3795-3805.\n",
    " 2. John D. Lafferty, Andrew McCallum, Fernando C. N. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01), Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001, pp. 282-289, https://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers\n",
    " 3. Sepp Hochreiter, Jürgen Schmidhuber. Long Short-Term Memory. Neural Comput. 9, 8 (November 15, 1997), pp. 1735-1780, https://doi.org/10.1162/neco.1997.9.8.1735\n",
    " 4. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Attention is All you Need. NIPS 2017, pp. 5998-6008, https://arxiv.org/abs/1706.03762\n",
    " 5. Alan Akbik, Duncan Blythe, Roland Vollgraf. Contextual String Embeddings for Sequence Labeling. Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 1638-1649, https://www.aclweb.org/anthology/C18-1139.pdf\n"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "cell_metadata_filter": "-all",
   "main_language": "python",
   "notebook_metadata_filter": "-all"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}