{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Language modeling labs\n", "### 29 May 2024\n", "# 11. Model *ensembles*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How can the prediction results for a machine learning task be improved?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Ensemble* methods aggregate the predictions of several models in order to obtain a result better than that of any of the models on its own." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "!mkdir -p dev-0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We work with the following challenge:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "https://gonito.net/challenge-all-submissions/ireland-news-headlines-word-gap" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "https://github.com/kubapok/ireland-news-word-gap/tree/0c6557c8a3cd6d8c77f64618850b2ae82c19476a" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2022-05-13 13:23:05-- https://github.com/kubapok/ireland-news-word-gap/raw/11c72875023c5c01c9d0c0ca39d72c90c840aeb3/dev-0/out.tsv\n", "Resolving github.com (github.com)... 140.82.121.4\n", "Connecting to github.com (github.com)|140.82.121.4|:443... connected.\n", "HTTP request sent, awaiting response... 302 Found\n", "Location: https://raw.githubusercontent.com/kubapok/ireland-news-word-gap/11c72875023c5c01c9d0c0ca39d72c90c840aeb3/dev-0/out.tsv [following]\n", "--2022-05-13 13:23:06-- https://raw.githubusercontent.com/kubapok/ireland-news-word-gap/11c72875023c5c01c9d0c0ca39d72c90c840aeb3/dev-0/out.tsv\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 
185.199.110.133, 185.199.111.133, 185.199.108.133, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 63249692 (60M) [text/plain]\n", "Saving to: ‘out.tsv’\n", "\n", "out.tsv 100%[===================>] 60,32M 26,7MB/s in 2,3s \n", "\n", "2022-05-13 13:23:08 (26,7 MB/s) - ‘out.tsv’ saved [63249692/63249692]\n", "\n", "--2022-05-13 13:23:09-- https://github.com/kubapok/ireland-news-word-gap/raw/0c6557c8a3cd6d8c77f64618850b2ae82c19476a/dev-0/out.tsv\n", "Resolving github.com (github.com)... 140.82.121.4\n", "Connecting to github.com (github.com)|140.82.121.4|:443... connected.\n", "HTTP request sent, awaiting response... 302 Found\n", "Location: https://raw.githubusercontent.com/kubapok/ireland-news-word-gap/0c6557c8a3cd6d8c77f64618850b2ae82c19476a/dev-0/out.tsv [following]\n", "--2022-05-13 13:23:09-- https://raw.githubusercontent.com/kubapok/ireland-news-word-gap/0c6557c8a3cd6d8c77f64618850b2ae82c19476a/dev-0/out.tsv\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 63271863 (60M) [text/plain]\n", "Saving to: ‘out.tsv’\n", "\n", "out.tsv 100%[===================>] 60,34M 45,1MB/s in 1,3s \n", "\n", "2022-05-13 13:23:10 (45,1 MB/s) - ‘out.tsv’ saved [63271863/63271863]\n", "\n", "--2022-05-13 13:23:11-- https://git.wmi.amu.edu.pl/kubapok/ireland-news-word-gap-prediction/raw/branch/master/dev-0/expected.tsv\n", "Resolving git.wmi.amu.edu.pl (git.wmi.amu.edu.pl)... 150.254.78.40\n", "Connecting to git.wmi.amu.edu.pl (git.wmi.amu.edu.pl)|150.254.78.40|:443... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: 866583 (846K) [text/plain]\n", "Saving to: ‘expected.tsv.1’\n", "\n", "expected.tsv.1 100%[===================>] 846,27K 1,91MB/s in 0,4s \n", "\n", "2022-05-13 13:23:11 (1,91 MB/s) - ‘expected.tsv.1’ saved [866583/866583]\n", "\n" ] } ], "source": [ "!wget https://github.com/kubapok/ireland-news-word-gap/raw/11c72875023c5c01c9d0c0ca39d72c90c840aeb3/dev-0/out.tsv\n", "!mv out.tsv ./dev-0/out-solution1.tsv\n", "!wget https://github.com/kubapok/ireland-news-word-gap/raw/0c6557c8a3cd6d8c77f64618850b2ae82c19476a/dev-0/out.tsv\n", "!mv out.tsv ./dev-0/out-solution2.tsv\n", "! ( cd dev-0 ; wget https://git.wmi.amu.edu.pl/kubapok/ireland-news-word-gap-prediction/raw/branch/master/dev-0/expected.tsv)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2022-05-13 13:23:12-- https://gonito.net/get/bin/geval\n", "Resolving gonito.net (gonito.net)... 150.254.78.126\n", "Connecting to gonito.net (gonito.net)|150.254.78.126|:443... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: 12860136 (12M) [application/octet-stream]\n", "Saving to: ‘geval.1’\n", "\n", "geval.1 100%[===================>] 12,26M 2,67MB/s in 4,1s \n", "\n", "2022-05-13 13:23:16 (2,97 MB/s) - ‘geval.1’ saved [12860136/12860136]\n", "\n" ] } ], "source": [ "!wget https://gonito.net/get/bin/geval\n", "!chmod u+x geval" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "35.05218788086649\r\n" ] } ], "source": [ "!./geval --metric PerplexityHashed -o ./dev-0/out-solution1.tsv -e dev-0/expected.tsv" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "33.47429048442195\r\n" ] } ], "source": [ "!./geval --metric PerplexityHashed -o ./dev-0/out-solution2.tsv -e dev-0/expected.tsv" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "# Average the word-probability distributions of the two solutions, line by line.\n", "with open('./dev-0/out-solution1.tsv') as s1, open('./dev-0/out-solution2.tsv') as s2, open('./dev-0/out-merge.tsv', 'w') as f_merge:\n", "    for l1, l2 in zip(s1, s2):\n", "        # Each line holds space-separated `word:probability` entries; split on the last colon only.\n", "        dir1 = {x.rsplit(':', 1)[0]: float(x.rsplit(':', 1)[1]) for x in l1.rstrip().split(' ')}\n", "        dir2 = {x.rsplit(':', 1)[0]: float(x.rsplit(':', 1)[1]) for x in l2.rstrip().split(' ')}\n", "        newdir = dict()\n", "        for k in dir1.keys() | dir2.keys():\n", "            # A word missing from one distribution contributes probability 0.0.\n", "            newdir[k] = (dir1.get(k, 0.0) + dir2.get(k, 0.0)) / 2\n", "        merge_line = ' '.join([k + ':' + str(v) for k, v in newdir.items()]) + '\\n'\n", "        f_merge.write(merge_line)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "29.054162509715063\r\n" ] } ], "source": [ "!./geval --metric PerplexityHashed -o ./dev-0/out-merge.tsv -e dev-0/expected.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which models can be combined in an 
*ensemble*?\n", "\n", "- several good, independent models\n", "- several models trained with different random seeds\n", "- the last few checkpoints" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How can different models be combined?\n", "\n", "- weighted average\n", "- geometric mean\n", "- another kind of average\n", "- class voting (for classification tasks)\n", "- training a separate simple model whose task is to combine the models (e.g. linear regression)\n", "- training several models jointly with shared backpropagation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What are the drawbacks of *ensemble* methods?\n", "\n", "- higher model complexity\n", "- longer inference time\n", "- higher consumption of computing resources\n", "- worse interpretability" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An *ensemble* is not always worthwhile:\n", "- in commercial applications, when time or resources are limited, the model is heavy, or the evaluation score is not critical,\n", "- in academic/research applications, or when we want to compare several different methods, because combining models can blur the picture." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In practice, when entering a machine learning competition, it is always worth using an *ensemble*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Keep in mind that some methods are ensembles by design, e.g. random forests or boosted decision trees." 
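] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Two of the aggregation methods listed above, the weighted average and the geometric mean, can be sketched on toy data. The cell below is only a minimal sketch, not the method behind the challenge results: the function names `weighted_mean` and `geometric_mean`, the weight `w1`, the floor `eps`, and the example probabilities are all illustrative assumptions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import math\n", "\n", "def weighted_mean(p1, p2, w1=0.5):\n", "    # Linear interpolation of two word -> probability dicts (w1 is the weight of p1).\n", "    words = p1.keys() | p2.keys()\n", "    return {w: w1 * p1.get(w, 0.0) + (1 - w1) * p2.get(w, 0.0) for w in words}\n", "\n", "def geometric_mean(p1, p2, eps=1e-9):\n", "    # Geometric mean of two distributions, renormalized so that it sums to 1.\n", "    # eps is a hypothetical floor for words that one of the models does not list.\n", "    words = p1.keys() | p2.keys()\n", "    raw = {w: math.sqrt(max(p1.get(w, 0.0), eps) * max(p2.get(w, 0.0), eps)) for w in words}\n", "    total = sum(raw.values())\n", "    return {w: v / total for w, v in raw.items()}\n", "\n", "# Toy distributions with made-up numbers.\n", "p1 = {'the': 0.5, 'a': 0.3, 'of': 0.2}\n", "p2 = {'the': 0.4, 'of': 0.2, 'in': 0.4}\n", "print(weighted_mean(p1, p2, w1=0.7))\n", "print(geometric_mean(p1, p2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `w1=0.5` the weighted mean reduces to the plain arithmetic mean. The geometric mean strongly penalizes words that only one model considers likely, which is why the sketch applies the `eps` floor instead of letting such words drop to zero." 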
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assignment\n", "\n", "Build an *ensemble* of two LSTM networks for *Challenging America word gap prediction* (https://gonito.csi.wmi.amu.edu.pl/challenge/challenging-america-word-gap-prediction):\n", "- one network should run forward (i.e. like an ordinary recurrent network),\n", "- the other network should run backward.\n", "\n", "Example:\n", "- sample text: `\"ala\" \"ma\" \"kota\" \"MASK\" \"2\" \"psy\" \"i\" \"chomika\"`\n", "- forward network: `\"ala\" \"ma\" \"kota\"`\n", "- backward network: `\"chomika\" \"i\" \"psy\" \"2\"`\n", " \n", "Building the reversed (\"backward\") network is very simple: it is enough to reverse the word order, and there is no need to change the model architecture.\n", "\n", "In the simplest version, the aggregation method used in the *ensemble* can be, for example, the arithmetic mean. You can also try other approaches, e.g. training both networks jointly.\n", "\n", "Score: **50 points**\n", "\n", "Deadline: **12 June 2024**, before class" ] } ], "metadata": { "author": "Jakub Pokrywka", "email": "kubapok@wmi.amu.edu.pl", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "lang": "pl", "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" }, "subtitle": "0.Informacje na temat przedmiotu[ćwiczenia]", "title": "Ekstrakcja informacji", "year": "2021" }, "nbformat": 4, "nbformat_minor": 4 }