From 65219c12dcd9f4947fd1547bc05dfaa6029399e7 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Pawe=C5=82=20Sk=C3=B3rzewski?= Date: Wed, 29 May 2024 09:37:16 +0200 Subject: [PATCH] Lab. 11 --- lab/11_Ensemble_modeli.ipynb | 337 +++++++++++++++++++++++++++++++++++ 1 file changed, 337 insertions(+) create mode 100644 lab/11_Ensemble_modeli.ipynb diff --git a/lab/11_Ensemble_modeli.ipynb b/lab/11_Ensemble_modeli.ipynb new file mode 100644 index 0000000..a1efb07 --- /dev/null +++ b/lab/11_Ensemble_modeli.ipynb @@ -0,0 +1,337 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Language modeling: labs\n", + "### 29 May 2024\n", + "# 11. Model *ensembles*" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "How can we improve prediction quality on a machine learning task?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ensemble methods aggregate the predictions of several models to obtain a result better than that of any individual model."
+ ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "mkdir: cannot create directory ‘dev-0-ireland-news’: File exists\r\n" + ] + } + ], + "source": [ + "!mkdir dev-0-ireland-news" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will work with the following challenge:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "https://gonito.net/challenge-all-submissions/ireland-news-headlines-word-gap" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "https://github.com/kubapok/ireland-news-word-gap/tree/0c6557c8a3cd6d8c77f64618850b2ae82c19476a" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--2022-05-13 13:23:05-- https://github.com/kubapok/ireland-news-word-gap/raw/11c72875023c5c01c9d0c0ca39d72c90c840aeb3/dev-0/out.tsv\n", + "Resolving github.com (github.com)... 140.82.121.4\n", + "Connecting to github.com (github.com)|140.82.121.4|:443... connected.\n", + "HTTP request sent, awaiting response... 302 Found\n", + "Location: https://raw.githubusercontent.com/kubapok/ireland-news-word-gap/11c72875023c5c01c9d0c0ca39d72c90c840aeb3/dev-0/out.tsv [following]\n", + "--2022-05-13 13:23:06-- https://raw.githubusercontent.com/kubapok/ireland-news-word-gap/11c72875023c5c01c9d0c0ca39d72c90c840aeb3/dev-0/out.tsv\n", + "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...\n", + "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n", + "HTTP request sent, awaiting response...
200 OK\n", + "Length: 63249692 (60M) [text/plain]\n", + "Saving to: ‘out.tsv’\n", + "\n", + "out.tsv 100%[===================>] 60,32M 26,7MB/s in 2,3s \n", + "\n", + "2022-05-13 13:23:08 (26,7 MB/s) - ‘out.tsv’ saved [63249692/63249692]\n", + "\n", + "--2022-05-13 13:23:09-- https://github.com/kubapok/ireland-news-word-gap/raw/0c6557c8a3cd6d8c77f64618850b2ae82c19476a/dev-0/out.tsv\n", + "Resolving github.com (github.com)... 140.82.121.4\n", + "Connecting to github.com (github.com)|140.82.121.4|:443... connected.\n", + "HTTP request sent, awaiting response... 302 Found\n", + "Location: https://raw.githubusercontent.com/kubapok/ireland-news-word-gap/0c6557c8a3cd6d8c77f64618850b2ae82c19476a/dev-0/out.tsv [following]\n", + "--2022-05-13 13:23:09-- https://raw.githubusercontent.com/kubapok/ireland-news-word-gap/0c6557c8a3cd6d8c77f64618850b2ae82c19476a/dev-0/out.tsv\n", + "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...\n", + "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n", + "HTTP request sent, awaiting response... 200 OK\n", + "Length: 63271863 (60M) [text/plain]\n", + "Saving to: ‘out.tsv’\n", + "\n", + "out.tsv 100%[===================>] 60,34M 45,1MB/s in 1,3s \n", + "\n", + "2022-05-13 13:23:10 (45,1 MB/s) - ‘out.tsv’ saved [63271863/63271863]\n", + "\n", + "--2022-05-13 13:23:11-- https://git.wmi.amu.edu.pl/kubapok/ireland-news-word-gap-prediction/raw/branch/master/dev-0/expected.tsv\n", + "Resolving git.wmi.amu.edu.pl (git.wmi.amu.edu.pl)... 150.254.78.40\n", + "Connecting to git.wmi.amu.edu.pl (git.wmi.amu.edu.pl)|150.254.78.40|:443... connected.\n", + "HTTP request sent, awaiting response... 
200 OK\n", + "Length: 866583 (846K) [text/plain]\n", + "Saving to: ‘expected.tsv.1’\n", + "\n", + "expected.tsv.1 100%[===================>] 846,27K 1,91MB/s in 0,4s \n", + "\n", + "2022-05-13 13:23:11 (1,91 MB/s) - ‘expected.tsv.1’ saved [866583/866583]\n", + "\n" + ] + } + ], + "source": [ + "!wget https://github.com/kubapok/ireland-news-word-gap/raw/11c72875023c5c01c9d0c0ca39d72c90c840aeb3/dev-0/out.tsv\n", + "!mv out.tsv ./dev-0/out-solution1.tsv\n", + "!wget https://github.com/kubapok/ireland-news-word-gap/raw/0c6557c8a3cd6d8c77f64618850b2ae82c19476a/dev-0/out.tsv\n", + "!mv out.tsv ./dev-0/out-solution2.tsv\n", + "! ( cd dev-0 ; wget https://git.wmi.amu.edu.pl/kubapok/ireland-news-word-gap-prediction/raw/branch/master/dev-0/expected.tsv)" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--2022-05-13 13:23:12-- https://gonito.net/get/bin/geval\n", + "Resolving gonito.net (gonito.net)... 150.254.78.126\n", + "Connecting to gonito.net (gonito.net)|150.254.78.126|:443... connected.\n", + "HTTP request sent, awaiting response... 
200 OK\n", + "Length: 12860136 (12M) [application/octet-stream]\n", + "Saving to: ‘geval.1’\n", + "\n", + "geval.1 100%[===================>] 12,26M 2,67MB/s in 4,1s \n", + "\n", + "2022-05-13 13:23:16 (2,97 MB/s) - ‘geval.1’ saved [12860136/12860136]\n", + "\n" + ] + } + ], + "source": [ + "!wget https://gonito.net/get/bin/geval\n", + "!chmod u+x geval" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "35.05218788086649\r\n" + ] + } + ], + "source": [ + "!./geval --metric PerplexityHashed -o ./dev-0/out-solution1.tsv -e dev-0/expected.tsv" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "33.47429048442195\r\n" + ] + } + ], + "source": [ + "!./geval --metric PerplexityHashed -o ./dev-0/out-solution2.tsv -e dev-0/expected.tsv" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [], + "source": [ + "with open('./dev-0/out-solution1.tsv') as s1, open('./dev-0/out-solution2.tsv') as s2, open('./dev-0/out-merge.tsv','w') as f_merge:\n", + "    for l1, l2 in zip(s1, s2):  # corresponding prediction lines of both solutions\n", + "        dir1 = {x.rsplit(':', 1)[0]: float(x.rsplit(':', 1)[1]) for x in l1.rstrip().split(' ')}  # split on the last ':' only\n", + "        dir2 = {x.rsplit(':', 1)[0]: float(x.rsplit(':', 1)[1]) for x in l2.rstrip().split(' ')}\n", + "        newdir = dict()\n", + "        for k in dir1.keys() | dir2.keys():\n", + "            newdir[k] = dir1[k] if k in dir1 else 0.0\n", + "            newdir[k] += dir2[k] if k in dir2 else 0.0\n", + "            newdir[k] /= 2  # arithmetic mean of the two probabilities\n", + "        merge_line = ' '.join([k + ':' + str(v) for k,v in newdir.items()]) + '\\n'\n", + "        f_merge.write(merge_line)" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "29.054162509715063\r\n" + ] + } + ], + "source": [ + "!./geval --metric PerplexityHashed
-o ./dev-0/out-merge.tsv -e dev-0/expected.tsv" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Which models can be combined in *ensemble* methods?\n", + "\n", + "- several good, independent models\n", + "- several models trained with different random seeds\n", + "- the last few checkpoints" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "How can different models be combined?\n", + "\n", + "- weighted average\n", + "- geometric mean\n", + "- another kind of mean\n", + "- class voting (for classification tasks)\n", + "- training a separate simple model whose task is to combine the models (e.g. linear regression)\n", + "- joint training of several models with shared backpropagation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What are the drawbacks of *ensemble* methods?\n", + "\n", + "- higher model complexity\n", + "- longer inference time\n", + "- higher consumption of computing resources\n", + "- poorer interpretability" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "An *ensemble* is not always worthwhile:\n", + "- in commercial applications, if time or resources are limited, the model is heavy, or the evaluation score is not critical,\n", + "- in academic/research settings, or when we want to compare several different methods, because combining models can distort the picture." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In practice, when taking part in a machine learning competition, it is always worth using an *ensemble*." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Keep in mind that some methods are ensembles by design, e.g. random forests or boosted decision trees."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Assignment\n", + "\n", + "Build an *ensemble* of two LSTM networks for *Challenging America word gap prediction* (https://gonito.csi.wmi.amu.edu.pl/challenge/challenging-america-word-gap-prediction):\n", + "- one network should run forward (i.e. like an ordinary recurrent network),\n", + "- the other network should run backward.\n", + "\n", + "Example:\n", + "- sample text: `\"ala\" \"ma\" \"kota\" \"MASK\" \"2\" \"psy\" \"i\" \"chomika\"`\n", + "- forward network: `\"ala\" \"ma\" \"kota\"`\n", + "- backward network: `\"chomika\" \"i\" \"psy\" \"2\"`\n", + " \n", + "Building the reverse (“backward”) network is very simple: just reverse the word order; there is no need to change the model architecture.\n", + "\n", + "In the simplest version, the aggregation method used in the *ensemble* can be, e.g., the arithmetic mean. You can also try other approaches, e.g. training both networks jointly.\n", + "\n", + "Score: **50 points**\n", + "\n", + "Deadline: **12 June 2024**, before class" + ] + } + ], + "metadata": { + "author": "Jakub Pokrywka", + "email": "kubapok@wmi.amu.edu.pl", + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "lang": "pl", + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.3" + }, + "subtitle": "0.Informacje na temat przedmiotu[ćwiczenia]", + "title": "Ekstrakcja informacji", + "year": "2021" + }, + "nbformat": 4, + "nbformat_minor": 4 +}