Lab. 11
This commit is contained in:
parent
9b538831ae
commit
65219c12dc
337
lab/11_Ensemble_modeli.ipynb
Normal file
337
lab/11_Ensemble_modeli.ipynb
Normal file
@ -0,0 +1,337 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Modelowanie języka – laboratoria\n",
|
||||
"### 29 maja 2024\n",
|
||||
"# 11. *Ensemble* modeli"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"W jaki sposób można polepszyć wynik predykcji dla zadania uczenia maszynowego?"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Metody zbiorcze (*ensemble*) polegają na agregacji predykcji różnych modeli, aby uzyskać wynik lepszy niż dla każdego z modeli z osobna."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 32,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"mkdir: cannot create directory ‘dev-0-ireland-news’: File exists\r\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"!mkdir dev-0-ireland-news"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Mamy wyzwanie:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"https://gonito.net/challenge-all-submissions/ireland-news-headlines-word-gap"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"https://github.com/kubapok/ireland-news-word-gap/tree/0c6557c8a3cd6d8c77f64618850b2ae82c19476a"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 33,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"--2022-05-13 13:23:05-- https://github.com/kubapok/ireland-news-word-gap/raw/11c72875023c5c01c9d0c0ca39d72c90c840aeb3/dev-0/out.tsv\n",
|
||||
"Resolving github.com (github.com)... 140.82.121.4\n",
|
||||
"Connecting to github.com (github.com)|140.82.121.4|:443... connected.\n",
|
||||
"HTTP request sent, awaiting response... 302 Found\n",
|
||||
"Location: https://raw.githubusercontent.com/kubapok/ireland-news-word-gap/11c72875023c5c01c9d0c0ca39d72c90c840aeb3/dev-0/out.tsv [following]\n",
|
||||
"--2022-05-13 13:23:06-- https://raw.githubusercontent.com/kubapok/ireland-news-word-gap/11c72875023c5c01c9d0c0ca39d72c90c840aeb3/dev-0/out.tsv\n",
|
||||
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...\n",
|
||||
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n",
|
||||
"HTTP request sent, awaiting response... 200 OK\n",
|
||||
"Length: 63249692 (60M) [text/plain]\n",
|
||||
"Saving to: ‘out.tsv’\n",
|
||||
"\n",
|
||||
"out.tsv 100%[===================>] 60,32M 26,7MB/s in 2,3s \n",
|
||||
"\n",
|
||||
"2022-05-13 13:23:08 (26,7 MB/s) - ‘out.tsv’ saved [63249692/63249692]\n",
|
||||
"\n",
|
||||
"--2022-05-13 13:23:09-- https://github.com/kubapok/ireland-news-word-gap/raw/0c6557c8a3cd6d8c77f64618850b2ae82c19476a/dev-0/out.tsv\n",
|
||||
"Resolving github.com (github.com)... 140.82.121.4\n",
|
||||
"Connecting to github.com (github.com)|140.82.121.4|:443... connected.\n",
|
||||
"HTTP request sent, awaiting response... 302 Found\n",
|
||||
"Location: https://raw.githubusercontent.com/kubapok/ireland-news-word-gap/0c6557c8a3cd6d8c77f64618850b2ae82c19476a/dev-0/out.tsv [following]\n",
|
||||
"--2022-05-13 13:23:09-- https://raw.githubusercontent.com/kubapok/ireland-news-word-gap/0c6557c8a3cd6d8c77f64618850b2ae82c19476a/dev-0/out.tsv\n",
|
||||
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...\n",
|
||||
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n",
|
||||
"HTTP request sent, awaiting response... 200 OK\n",
|
||||
"Length: 63271863 (60M) [text/plain]\n",
|
||||
"Saving to: ‘out.tsv’\n",
|
||||
"\n",
|
||||
"out.tsv 100%[===================>] 60,34M 45,1MB/s in 1,3s \n",
|
||||
"\n",
|
||||
"2022-05-13 13:23:10 (45,1 MB/s) - ‘out.tsv’ saved [63271863/63271863]\n",
|
||||
"\n",
|
||||
"--2022-05-13 13:23:11-- https://git.wmi.amu.edu.pl/kubapok/ireland-news-word-gap-prediction/raw/branch/master/dev-0/expected.tsv\n",
|
||||
"Resolving git.wmi.amu.edu.pl (git.wmi.amu.edu.pl)... 150.254.78.40\n",
|
||||
"Connecting to git.wmi.amu.edu.pl (git.wmi.amu.edu.pl)|150.254.78.40|:443... connected.\n",
|
||||
"HTTP request sent, awaiting response... 200 OK\n",
|
||||
"Length: 866583 (846K) [text/plain]\n",
|
||||
"Saving to: ‘expected.tsv.1’\n",
|
||||
"\n",
|
||||
"expected.tsv.1 100%[===================>] 846,27K 1,91MB/s in 0,4s \n",
|
||||
"\n",
|
||||
"2022-05-13 13:23:11 (1,91 MB/s) - ‘expected.tsv.1’ saved [866583/866583]\n",
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"!wget https://github.com/kubapok/ireland-news-word-gap/raw/11c72875023c5c01c9d0c0ca39d72c90c840aeb3/dev-0/out.tsv\n",
|
||||
"!mv out.tsv ./dev-0/out-solution1.tsv\n",
|
||||
"!wget https://github.com/kubapok/ireland-news-word-gap/raw/0c6557c8a3cd6d8c77f64618850b2ae82c19476a/dev-0/out.tsv\n",
|
||||
"!mv out.tsv ./dev-0/out-solution2.tsv\n",
|
||||
"! ( cd dev-0 ; wget https://git.wmi.amu.edu.pl/kubapok/ireland-news-word-gap-prediction/raw/branch/master/dev-0/expected.tsv)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 34,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"--2022-05-13 13:23:12-- https://gonito.net/get/bin/geval\n",
|
||||
"Resolving gonito.net (gonito.net)... 150.254.78.126\n",
|
||||
"Connecting to gonito.net (gonito.net)|150.254.78.126|:443... connected.\n",
|
||||
"HTTP request sent, awaiting response... 200 OK\n",
|
||||
"Length: 12860136 (12M) [application/octet-stream]\n",
|
||||
"Saving to: ‘geval.1’\n",
|
||||
"\n",
|
||||
"geval.1 100%[===================>] 12,26M 2,67MB/s in 4,1s \n",
|
||||
"\n",
|
||||
"2022-05-13 13:23:16 (2,97 MB/s) - ‘geval.1’ saved [12860136/12860136]\n",
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"!wget https://gonito.net/get/bin/geval\n",
|
||||
"!chmod u+x geval"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 35,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"35.05218788086649\r\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"!./geval --metric PerplexityHashed -o ./dev-0/out-solution1.tsv -e dev-0/expected.tsv"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 36,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"33.47429048442195\r\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"!./geval --metric PerplexityHashed -o ./dev-0/out-solution2.tsv -e dev-0/expected.tsv"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 37,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"with open('./dev-0/out-solution1.tsv') as s1, open('./dev-0/out-solution2.tsv') as s2, open('./dev-0/out-merge.tsv','w') as f_merge:\n",
|
||||
" for l1, l2 in zip(s1, s2):\n",
|
||||
" dir1 = {''.join(x.split(':')[:-1]): float(x.split(':')[-1]) for x in l1.rstrip().split(' ')}\n",
|
||||
" dir2 = {''.join(x.split(':')[:-1]): float(x.split(':')[-1]) for x in l2.rstrip().split(' ')}\n",
|
||||
" newdir = dict()\n",
|
||||
" for k in dir1.keys() | dir2.keys():\n",
|
||||
" newdir[k] = dir1[k] if k in dir1 else 0.0\n",
|
||||
" newdir[k] += dir2[k] if k in dir2 else 0.0\n",
|
||||
" newdir[k] /= 2\n",
|
||||
" merge_line = ' '.join([k + ':' + str(v) for k,v in newdir.items()]) + '\\n'\n",
|
||||
" f_merge.write(merge_line)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 38,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"29.054162509715063\r\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"!./geval --metric PerplexityHashed -o ./dev-0/out-merge.tsv -e dev-0/expected.tsv"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Jakie modele można składać ze sobą w metodach *ensemble*?\n",
|
||||
"\n",
|
||||
"- kilka dobrych niezależnych modeli \n",
|
||||
"- kilka modeli wyuczonych dla różnego ziarna losowości (*seed*)\n",
|
||||
"- kilka ostatnich checkpointów"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"W jaki sposób można składać różne modele?\n",
|
||||
"\n",
|
||||
"- średnia ważona\n",
|
||||
"- średnia geometryczna\n",
|
||||
"- inna średnia\n",
|
||||
"- głosowanie klas (dla zadań klasyfikacyjnych)\n",
|
||||
"- uczenie osobnego prostego modelu, którego zadanie to składanie modeli (np. regresja liniowa)\n",
|
||||
"- jednoczesne uczenie kilku modeli ze wspólnym backpropagation"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Jakie są wady metod *ensemble*?\n",
|
||||
"\n",
|
||||
"- wyższy stopień skomplikowania modelu\n",
|
||||
"- dłuższy czas inferencji\n",
|
||||
"- zużycie większych zasobów komputera\n",
|
||||
"- gorsza interpretowalność"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Nie zawsze warto robić *ensemble*:\n",
|
||||
"- w zastosowaniach komercyjnych, jeżeli mamy ograniczenia czasu lub zasobów, model jest ciężki lub wynik ewaluacji modelu nie jest bardzo ważny,\n",
|
||||
"- w zastosowaniach akademickich/naukowych lub kiedy chcemy porównać kilka różnych metod, wtedy składanie modeli może nam zaburzyć obraz zagadnienia."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"W praktyce, jeżeli startujemy w konkursie uczenia maszynowego, zawsze warto korzystać z *ensemble*."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Warto mieć na uwadze, że niektóre metody z założenia są ensemblami. Np. las losowy albo boostowane drzewa decyzyjne."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Zadanie\n",
|
||||
"\n",
|
||||
"Stworzyć *ensemble* dwóch sieci LSTM dla *Challenging America word gap prediction* (https://gonito.csi.wmi.amu.edu.pl/challenge/challenging-america-word-gap-prediction):\n",
|
||||
"- jedna sieć powinna działać do przodu (czyli jak zwyczajna sieć rekurencyjna),\n",
|
||||
"- druga sieć powinna działać do tyłu.\n",
|
||||
"\n",
|
||||
"Przykład:\n",
|
||||
"- przykładowy tekst: `\"ala\" \"ma\" \"kota\" \"MASK\" \"2\" \"psy\" \"i\" \"chomika\"`\n",
|
||||
"- sieć do przodu: `\"ala\" \"ma\" \"kota\"`\n",
|
||||
"- sieć do tyłu: `\"chomika\" \"i\" \"psy\" \"2\"`\n",
|
||||
" \n",
|
||||
"Zrobienie sieci odwrotnej („do tyłu”) jest bardzo proste. Wystarczy odwrócić kolejność słów, nie ma potrzeby ingerować w architekturę modeli.\n",
|
||||
"\n",
|
||||
"Metodą agregacji zastosowaną w *ensemble*’u w najprostszej wersji może być np. średnia arytmetyczna. Można też spróbować innych sposobów, np. jednoczesnego uczenia obu sieci.\n",
|
||||
"\n",
|
||||
"Punktacja: **50 punktów**\n",
|
||||
"\n",
|
||||
"Deadline: **12 czerwca 2024** przed zajęciami"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"author": "Jakub Pokrywka",
|
||||
"email": "kubapok@wmi.amu.edu.pl",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"lang": "pl",
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.3"
|
||||
},
|
||||
"subtitle": "0.Informacje na temat przedmiotu[ćwiczenia]",
|
||||
"title": "Ekstrakcja informacji",
|
||||
"year": "2021"
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
Loading…
Reference in New Issue
Block a user