Filip Gralinski 2022-07-06 08:43:40 +02:00
parent 59b20b3de5
commit 00d84daae3
9 changed files with 98 additions and 410 deletions

@ -1,5 +1,20 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Language Modeling</h1>\n",
"<h2> 07. <i>Smoothing in n-gram language models</i> [lecture]</h2> \n",
"<h3> Filip Graliński (2022)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -54,7 +69,7 @@
"Consider an example with 3 colors (we know that the urn may contain\n",
"yellow, green, and red balls, i.e. $m=3$) and 4 draws ($T=4$):\n",
"\n",
"![img](./05_Wygladzanie/urna.drawio.png)\n",
"![img](./07_Wygladzanie/urna.drawio.png)\n",
"\n",
"If we estimated the probabilities naively, we would conclude\n",
"that the probability of drawing a red ball is 3/4, a yellow one 1/4,\n",
@ -168,7 +183,7 @@
"$k_i$ is the number of times the $i$-th vocabulary word occurred in the training set,\n",
"and $T$ is the length of the training set.\n",
"\n",
"![img](./05_Wygladzanie/urna-wyrazy.drawio.png)\n",
"![img](./07_Wygladzanie/urna-wyrazy.drawio.png)\n",
"\n",
"Thus, with +1 smoothing we estimate\n",
"the probability of a word $w$ as follows:\n",
@ -303,113 +318,11 @@
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['<s>',\n",
" 'lubisz',\n",
" 'curry',\n",
" ',',\n",
" 'prawda',\n",
" '?',\n",
" '</s>',\n",
" '<s>',\n",
" 'nałożę',\n",
" 'ci',\n",
" 'więcej',\n",
" '.',\n",
" '</s>',\n",
" '<s>',\n",
" 'hey',\n",
" '!',\n",
" '</s>',\n",
" '<s>',\n",
" 'smakuje',\n",
" 'ci',\n",
" '?',\n",
" '</s>',\n",
" '<s>',\n",
" 'hey',\n",
" ',',\n",
" 'brzydalu',\n",
" '.',\n",
" '</s>',\n",
" '<s>',\n",
" 'spójrz',\n",
" 'na',\n",
" 'nią',\n",
" '.',\n",
" '</s>',\n",
" '<s>',\n",
" '-',\n",
" 'wariatka',\n",
" '.',\n",
" '</s>',\n",
" '<s>',\n",
" '-',\n",
" 'zadałam',\n",
" 'ci',\n",
" 'pytanie',\n",
" '!',\n",
" '</s>',\n",
" '<s>',\n",
" 'no',\n",
" ',',\n",
" 'tak',\n",
" 'lepiej',\n",
" '!',\n",
" '</s>',\n",
" '<s>',\n",
" '-',\n",
" 'wygląda',\n",
" 'dobrze',\n",
" '!',\n",
" '</s>',\n",
" '<s>',\n",
" '-',\n",
" 'tak',\n",
" 'lepiej',\n",
" '!',\n",
" '</s>',\n",
" '<s>',\n",
" 'pasuje',\n",
" 'jej',\n",
" '.',\n",
" '</s>',\n",
" '<s>',\n",
" '-',\n",
" 'hey',\n",
" '.',\n",
" '</s>',\n",
" '<s>',\n",
" '-',\n",
" 'co',\n",
" 'do',\n",
" '...?',\n",
" '</s>',\n",
" '<s>',\n",
" 'co',\n",
" 'do',\n",
" 'cholery',\n",
" 'robisz',\n",
" '?',\n",
" '</s>',\n",
" '<s>',\n",
" 'zejdź',\n",
" 'mi',\n",
" 'z',\n",
" 'oczu',\n",
" ',',\n",
" 'zdziro',\n",
" '.',\n",
" '</s>',\n",
" '<s>',\n",
" 'przestań',\n",
" 'dokuczać']"
"name": "stdout",
"output_type": "stream",
"text": [
"['<s>', 'lubisz', 'curry', ',', 'prawda', '?', '</s>', '<s>', 'nałożę', 'ci', 'więcej', '.', '</s>', '<s>', 'hey', '!', '</s>', '<s>', 'smakuje', 'ci', '?', '</s>', '<s>', 'hey', ',', 'brzydalu', '.', '</s>', '<s>', 'spójrz', 'na', 'nią', '.', '</s>', '<s>', '-', 'wariatka', '.', '</s>', '<s>', '-', 'zadałam', 'ci', 'pytanie', '!', '</s>', '<s>', 'no', ',', 'tak', 'lepiej', '!', '</s>', '<s>', '-', 'wygląda', 'dobrze', '!', '</s>', '<s>', '-', 'tak', 'lepiej', '!', '</s>', '<s>', 'pasuje', 'jej', '.', '</s>', '<s>', '-', 'hey', '.', '</s>', '<s>', '-', 'co', 'do', '...?', '</s>', '<s>', 'co', 'do', 'cholery', 'robisz', '?', '</s>', '<s>', 'zejdź', 'mi', 'z', 'oczu', ',', 'zdziro', '.', '</s>', '<s>', 'przestań', 'dokuczać']"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
@ -448,7 +361,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
@ -459,18 +372,15 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"name": "stdout",
"output_type": "stream",
"text": [
"48113"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
@ -479,7 +389,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
@ -518,18 +428,15 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"name": "stdout",
"output_type": "stream",
"text": [
"926594"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
@ -553,112 +460,13 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>liczba tokenów</th>\n",
" <th>średnia częstość w części B</th>\n",
" <th>estymacje +1</th>\n",
" <th>estymacje +0.01</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>388334</td>\n",
" <td>1.900495</td>\n",
" <td>0.993586</td>\n",
" <td>0.009999</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>403870</td>\n",
" <td>0.592770</td>\n",
" <td>1.987172</td>\n",
" <td>1.009935</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>117529</td>\n",
" <td>1.565809</td>\n",
" <td>2.980759</td>\n",
" <td>2.009870</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>62800</td>\n",
" <td>2.514268</td>\n",
" <td>3.974345</td>\n",
" <td>3.009806</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>40856</td>\n",
" <td>3.504944</td>\n",
" <td>4.967931</td>\n",
" <td>4.009741</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>29443</td>\n",
" <td>4.454098</td>\n",
" <td>5.961517</td>\n",
" <td>5.009677</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>22709</td>\n",
" <td>5.232023</td>\n",
" <td>6.955103</td>\n",
" <td>6.009612</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>18255</td>\n",
" <td>6.157929</td>\n",
" <td>7.948689</td>\n",
" <td>7.009548</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>15076</td>\n",
" <td>7.308039</td>\n",
" <td>8.942276</td>\n",
" <td>8.009483</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>12859</td>\n",
" <td>8.045649</td>\n",
" <td>9.935862</td>\n",
" <td>9.009418</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"name": "stdout",
"output_type": "stream",
"text": [
"liczba tokenów średnia częstość w części B estymacje +1 estymacje +0.01\n",
"0 388334 1.900495 0.993586 0.009999\n",
"1 403870 0.592770 1.987172 1.009935\n",
@ -671,10 +479,6 @@
"8 15076 7.308039 8.942276 8.009483\n",
"9 12859 8.045649 9.935862 9.009418"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
@ -716,112 +520,13 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>liczba tokenów</th>\n",
" <th>średnia częstość w części B</th>\n",
" <th>estymacje +1</th>\n",
" <th>Good-Turing</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>388334</td>\n",
" <td>1.900495</td>\n",
" <td>0.993586</td>\n",
" <td>1.040007</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>403870</td>\n",
" <td>0.592770</td>\n",
" <td>1.987172</td>\n",
" <td>0.582014</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>117529</td>\n",
" <td>1.565809</td>\n",
" <td>2.980759</td>\n",
" <td>1.603009</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>62800</td>\n",
" <td>2.514268</td>\n",
" <td>3.974345</td>\n",
" <td>2.602293</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>40856</td>\n",
" <td>3.504944</td>\n",
" <td>4.967931</td>\n",
" <td>3.603265</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>29443</td>\n",
" <td>4.454098</td>\n",
" <td>5.961517</td>\n",
" <td>4.627721</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>22709</td>\n",
" <td>5.232023</td>\n",
" <td>6.955103</td>\n",
" <td>5.627064</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>18255</td>\n",
" <td>6.157929</td>\n",
" <td>7.948689</td>\n",
" <td>6.606847</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>15076</td>\n",
" <td>7.308039</td>\n",
" <td>8.942276</td>\n",
" <td>7.676506</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>12859</td>\n",
" <td>8.045649</td>\n",
" <td>9.935862</td>\n",
" <td>8.557431</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"name": "stdout",
"output_type": "stream",
"text": [
"liczba tokenów średnia częstość w części B estymacje +1 Good-Turing\n",
"0 388334 1.900495 0.993586 1.040007\n",
"1 403870 0.592770 1.987172 0.582014\n",
@ -834,10 +539,6 @@
"8 15076 7.308039 8.942276 7.676506\n",
"9 12859 8.045649 9.935862 8.557431"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
@ -1008,18 +709,15 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"name": "stdout",
"output_type": "stream",
"text": [
"[('k', 'o', 't'), ('o', 't', 'e'), ('t', 'e', 'k')]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
@ -1036,7 +734,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
@ -1048,23 +746,13 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"321"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"outputs": [],
"source": [
"len(histories['jork'])\n",
"len(histories['zielony'])"
"len(histories['zielony'])\n",
"histories['jork']"
]
},
{
@ -1112,7 +800,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"![img](./05_Wygladzanie/size-perplexity.gif \"Perplexity for different test set sizes\")\n",
"![img](./07_Wygladzanie/size-perplexity.gif \"Perplexity for different test set sizes\")\n",
"\n"
]
},
@ -1128,7 +816,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"![img](./05_Wygladzanie/size-perplexity2.gif \"Perplexity for different training set sizes\")\n",
"![img](./07_Wygladzanie/size-perplexity2.gif \"Perplexity for different training set sizes\")\n",
"\n"
]
},
@ -1144,7 +832,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"![img](./05_Wygladzanie/order-perplexity.gif \"Perplexity for different model orders\")\n",
"![img](./07_Wygladzanie/order-perplexity.gif \"Perplexity for different model orders\")\n",
"\n"
]
}
@ -1165,7 +853,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.2"
"version": "3.10.5"
},
"org": null
},

@ -25,7 +25,7 @@ $$p_i = \frac{k_i}{T}.$$
Consider an example with 3 colors (we know that the urn may contain
yellow, green, and red balls, i.e. $m=3$) and 4 draws ($T=4$):
[[./05_Wygladzanie/urna.drawio.png]]
[[./07_Wygladzanie/urna.drawio.png]]
If we estimated the probabilities naively, we would conclude
that the probability of drawing a red ball is 3/4, a yellow one 1/4,
@ -85,7 +85,7 @@ drawing balls from an urn: $m$ is the number of all words (i.e. the vocabulary size
$k_i$ is the number of times the $i$-th vocabulary word occurred in the training set,
and $T$ is the length of the training set.
[[./05_Wygladzanie/urna-wyrazy.drawio.png]]
[[./07_Wygladzanie/urna-wyrazy.drawio.png]]
Thus, with +1 smoothing we estimate
the probability of a word $w$ as follows:
@ -173,7 +173,7 @@ Let's create a generator that reads words from a file and additionally:
- adds special tokens at the beginning and end of each sentence (~<s>~ and ~</s>~).
#+BEGIN_SRC python :session mysession :exports both :results raw drawer
#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
from itertools import islice
import regex as re
import sys
@ -200,7 +200,7 @@ Let's create a generator that reads words from a file and additionally:
Let's see how many times, on average, the second half of the corpus contains
words that occurred a given number of times in the first half.
#+BEGIN_SRC python :session mysession :exports both :results raw drawer
#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
from collections import Counter
counterA = Counter(get_words_from_file('opensubtitlesA.pl.txt'))
@ -210,7 +210,7 @@ words that occurred a given number of times in the first half.
:results:
:end:
#+BEGIN_SRC python :session mysession :exports both :results raw drawer
#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
counterA['taki']
#+END_SRC
@ -219,7 +219,7 @@ counterA['taki']
48113
:end:
#+BEGIN_SRC python :session mysession :exports both :results raw drawer
#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
max_r = 10
buckets = {}
@ -251,7 +251,7 @@ counterA['taki']
Let's now compute what number of occurrences we would expect if we used smoothing with +1 or +0.01.
(Note: we return the number of occurrences rather than the relative frequency, hence we multiply by the size of the whole corpus.)
#+BEGIN_SRC python :session mysession :exports both :results raw drawer
#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
def plus_alpha_smoothing(alpha, m, t, k):
return t*(k + alpha)/(t + alpha * m)
@ -275,7 +275,7 @@ Let's now compute what number of occurrences we would expect if we used smooth
926594
:end:
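As a rough illustration of the `plus_alpha_smoothing` function, the sketch below applies +1 smoothing to the urn example ($m=3$, $T=4$). The helper `plus_alpha_probability` and the `counts` dictionary are assumptions introduced only for this example; they are not part of the lecture code.

```python
from fractions import Fraction


def plus_alpha_probability(alpha, m, t, k):
    """Smoothed probability of an outcome seen k times in t draws, with m possible outcomes."""
    return (k + alpha) / (t + alpha * m)


def plus_alpha_smoothing(alpha, m, t, k):
    """Expected count, as in the lecture: the smoothed probability scaled by t."""
    return t * plus_alpha_probability(alpha, m, t, k)


# Urn example from the lecture: m=3 colours, T=4 draws (3 red, 1 yellow);
# green was never drawn, so its raw count is assumed to be 0.
counts = {"red": 3, "yellow": 1, "green": 0}
m, t = len(counts), sum(counts.values())

# Fractions keep the arithmetic exact for this tiny example.
probs = {colour: plus_alpha_probability(Fraction(1), m, t, k) for colour, k in counts.items()}

for colour, p in probs.items():
    print(colour, p, float(plus_alpha_smoothing(1, m, t, counts[colour])))
```

With +1 smoothing the unseen green ball gets probability 1/7 instead of 0, and the three probabilities still sum to 1.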
#+BEGIN_SRC python :session mysession :exports both :results raw drawer
#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
import pandas as pd
pd.DataFrame(data, columns=["liczba tokenów", "średnia częstość w części B", "estymacje +1", "estymacje +0.01"])
@ -309,7 +309,7 @@ $$p(w) = \frac{\# w + 1}{|C|}\frac{N_{r+1}}{N_r}.$$
**** Example
#+BEGIN_SRC python :session mysession :exports both :results raw drawer
#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
good_turing_counts = [(ix+1)*nb_of_types[ix+1]/nb_of_types[ix] for ix in range(0, max_r)]
data2 = list(zip(nb_of_types, empirical_counts, plus_one_counts, good_turing_counts))
@ -415,7 +415,7 @@ In the Kneser-Ney method, we estimate as follows the probability u
$$P(w) = \frac{N_{1+}(\bullet w)}{\sum_{w_j} N_{1+}(\bullet w_j)}.$$
#+BEGIN_SRC python :session mysession :exports both :results raw drawer
#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
def ngrams(iter, size):
ngram = []
for item in iter:
@ -433,7 +433,7 @@ $$P(w) = \frac{N_{1+}(\bullet w)}{\sum_{w_j} N_{1+}(\bullet w_j)}.$$
:end:
#+BEGIN_SRC python :session mysession :exports both :results raw drawer
#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
histories = { }
for prev_token, token in ngrams(get_words_from_file('opensubtitlesA.pl.txt'), 2):
histories.setdefault(token, set())
@ -444,7 +444,7 @@ $$P(w) = \frac{N_{1+}(\bullet w)}{\sum_{w_j} N_{1+}(\bullet w_j)}.$$
:results:
:end:
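A minimal sketch of how the `histories` dictionary relates to the formula $P(w) = N_{1+}(\bullet w) / \sum_{w_j} N_{1+}(\bullet w_j)$: the number of distinct left contexts of each word is divided by the total over all words. The function name `continuation_probabilities` and the toy token list are made up for illustration; the lecture itself builds `histories` from `opensubtitlesA.pl.txt`.

```python
def continuation_probabilities(tokens):
    """Map each token w to N1+(. w) / sum_j N1+(. w_j)."""
    histories = {}
    # Collect the set of distinct tokens that directly precede each token.
    for prev_token, token in zip(tokens, tokens[1:]):
        histories.setdefault(token, set()).add(prev_token)
    # Normalise by the total number of distinct (previous, current) pairs.
    total = sum(len(prevs) for prevs in histories.values())
    return {token: len(prevs) / total for token, prevs in histories.items()}


# Made-up toy corpus: 'jork' and 'zielony' both occur 3 times,
# but 'jork' always follows 'nowy', while 'zielony' follows 3 different tokens.
tokens = "nowy jork nowy jork nowy jork zielony dom zielony las zielony".split()
probs = continuation_probabilities(tokens)
print(probs["jork"], probs["zielony"])
```

In this toy corpus both words have the same raw frequency, yet the estimate for 'jork' is 1/7 and for 'zielony' 3/7, because only the number of distinct preceding words counts.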
#+BEGIN_SRC python :session mysession :exports both :results raw drawer
#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
len(histories['jork'])
len(histories['zielony'])
histories['jork']
@ -472,15 +472,15 @@ Kneser-Ney combined with *pruning* of the n-gram dictionary (all
**** Change in perplexity as the test set grows
#+CAPTION: Perplexity for different test set sizes
[[./05_Wygladzanie/size-perplexity.gif]]
[[./07_Wygladzanie/size-perplexity.gif]]
**** Change in perplexity as the training set grows
#+CAPTION: Perplexity for different training set sizes
[[./05_Wygladzanie/size-perplexity2.gif]]
[[./07_Wygladzanie/size-perplexity2.gif]]
**** Change in perplexity as the model order increases
#+CAPTION: Perplexity for different model orders
[[./05_Wygladzanie/order-perplexity.gif]]
[[./07_Wygladzanie/order-perplexity.gif]]
