Fix 07
@@ -1,5 +1,20 @@
 {
  "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
+    "<div class=\"alert alert-block alert-info\">\n",
+    "<h1> Modelowanie języka</h1>\n",
+    "<h2> 07. <i>Wygładzanie w n-gramowych modelach języka</i> [wykład]</h2> \n",
+    "<h3> Filip Graliński (2022)</h3>\n",
+    "</div>\n",
+    "\n",
+    "![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)\n",
+    "\n"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -54,7 +69,7 @@
    "Rozpatrzmy przykład z 3 kolorami (wiemy, że w urnie mogą być kule\n",
    "żółte, zielone i czerwone, tj. $m=3$) i 4 losowaniami ($T=4$):\n",
    "\n",
-    "![img](./05_Wygladzanie/urna.drawio.png)\n",
+    "![img](./07_Wygladzanie/urna.drawio.png)\n",
    "\n",
    "Gdybyśmy w prosty sposób oszacowali prawdopodobieństwa, doszlibyśmy do\n",
    "wniosku, że prawdopodobieństwo wylosowania kuli czerwonej wynosi 3/4, żółtej — 1/4,\n",
@@ -168,7 +183,7 @@
    "$k_i$ to ile razy w zbiorze uczącym pojawił się $i$-ty wyraz słownika,\n",
    "$T$ — długość zbioru uczącego.\n",
    "\n",
-    "![img](./05_Wygladzanie/urna-wyrazy.drawio.png)\n",
+    "![img](./07_Wygladzanie/urna-wyrazy.drawio.png)\n",
    "\n",
    "A zatem przy użyciu wygładzania +1 w następujący sposób estymować\n",
    "będziemy prawdopodobieństwo słowa $w$:\n",
@@ -303,113 +318,11 @@
    "metadata": {},
    "outputs": [
     {
-     "data": {
-      "text/plain": [
-       "['<s>',\n",
-       " 'lubisz',\n",
-       " 'curry',\n",
-       " ',',\n",
-       " 'prawda',\n",
-       " '?',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " 'nałożę',\n",
-       " 'ci',\n",
-       " 'więcej',\n",
-       " '.',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " 'hey',\n",
-       " '!',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " 'smakuje',\n",
-       " 'ci',\n",
-       " '?',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " 'hey',\n",
-       " ',',\n",
-       " 'brzydalu',\n",
-       " '.',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " 'spójrz',\n",
-       " 'na',\n",
-       " 'nią',\n",
-       " '.',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " '-',\n",
-       " 'wariatka',\n",
-       " '.',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " '-',\n",
-       " 'zadałam',\n",
-       " 'ci',\n",
-       " 'pytanie',\n",
-       " '!',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " 'no',\n",
-       " ',',\n",
-       " 'tak',\n",
-       " 'lepiej',\n",
-       " '!',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " '-',\n",
-       " 'wygląda',\n",
-       " 'dobrze',\n",
-       " '!',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " '-',\n",
-       " 'tak',\n",
-       " 'lepiej',\n",
-       " '!',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " 'pasuje',\n",
-       " 'jej',\n",
-       " '.',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " '-',\n",
-       " 'hey',\n",
-       " '.',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " '-',\n",
-       " 'co',\n",
-       " 'do',\n",
-       " '...?',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " 'co',\n",
-       " 'do',\n",
-       " 'cholery',\n",
-       " 'robisz',\n",
-       " '?',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " 'zejdź',\n",
-       " 'mi',\n",
-       " 'z',\n",
-       " 'oczu',\n",
-       " ',',\n",
-       " 'zdziro',\n",
-       " '.',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " 'przestań',\n",
-       " 'dokuczać']"
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "['<s>', 'lubisz', 'curry', ',', 'prawda', '?', '</s>', '<s>', 'nałożę', 'ci', 'więcej', '.', '</s>', '<s>', 'hey', '!', '</s>', '<s>', 'smakuje', 'ci', '?', '</s>', '<s>', 'hey', ',', 'brzydalu', '.', '</s>', '<s>', 'spójrz', 'na', 'nią', '.', '</s>', '<s>', '-', 'wariatka', '.', '</s>', '<s>', '-', 'zadałam', 'ci', 'pytanie', '!', '</s>', '<s>', 'no', ',', 'tak', 'lepiej', '!', '</s>', '<s>', '-', 'wygląda', 'dobrze', '!', '</s>', '<s>', '-', 'tak', 'lepiej', '!', '</s>', '<s>', 'pasuje', 'jej', '.', '</s>', '<s>', '-', 'hey', '.', '</s>', '<s>', '-', 'co', 'do', '...?', '</s>', '<s>', 'co', 'do', 'cholery', 'robisz', '?', '</s>', '<s>', 'zejdź', 'mi', 'z', 'oczu', ',', 'zdziro', '.', '</s>', '<s>', 'przestań', 'dokuczać']"
      ]
-     },
-     "execution_count": 1,
-     "metadata": {},
-     "output_type": "execute_result"
     }
    ],
    "source": [
@@ -448,7 +361,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -459,18 +372,15 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [
     {
-     "data": {
-      "text/plain": [
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
       "48113"
      ]
-     },
-     "execution_count": 3,
-     "metadata": {},
-     "output_type": "execute_result"
     }
    ],
    "source": [
@@ -479,7 +389,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -518,18 +428,15 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [
     {
-     "data": {
-      "text/plain": [
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
       "926594"
      ]
-     },
-     "execution_count": 5,
-     "metadata": {},
-     "output_type": "execute_result"
     }
    ],
    "source": [
@@ -553,113 +460,14 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [
     {
-     "data": {
-      "text/html": [
-       "<div>\n",
-       "<style scoped>\n",
-       " .dataframe tbody tr th:only-of-type {\n",
-       " vertical-align: middle;\n",
-       " }\n",
-       "\n",
-       " .dataframe tbody tr th {\n",
-       " vertical-align: top;\n",
-       " }\n",
-       "\n",
-       " .dataframe thead th {\n",
-       " text-align: right;\n",
-       " }\n",
-       "</style>\n",
-       "<table border=\"1\" class=\"dataframe\">\n",
-       " <thead>\n",
-       " <tr style=\"text-align: right;\">\n",
-       " <th></th>\n",
-       " <th>liczba tokenów</th>\n",
-       " <th>średnia częstość w części B</th>\n",
-       " <th>estymacje +1</th>\n",
-       " <th>estymacje +0.01</th>\n",
-       " </tr>\n",
-       " </thead>\n",
-       " <tbody>\n",
-       " <tr>\n",
-       " <th>0</th>\n",
-       " <td>388334</td>\n",
-       " <td>1.900495</td>\n",
-       " <td>0.993586</td>\n",
-       " <td>0.009999</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>1</th>\n",
-       " <td>403870</td>\n",
-       " <td>0.592770</td>\n",
-       " <td>1.987172</td>\n",
-       " <td>1.009935</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>2</th>\n",
-       " <td>117529</td>\n",
-       " <td>1.565809</td>\n",
-       " <td>2.980759</td>\n",
-       " <td>2.009870</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>3</th>\n",
-       " <td>62800</td>\n",
-       " <td>2.514268</td>\n",
-       " <td>3.974345</td>\n",
-       " <td>3.009806</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>4</th>\n",
-       " <td>40856</td>\n",
-       " <td>3.504944</td>\n",
-       " <td>4.967931</td>\n",
-       " <td>4.009741</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>5</th>\n",
-       " <td>29443</td>\n",
-       " <td>4.454098</td>\n",
-       " <td>5.961517</td>\n",
-       " <td>5.009677</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>6</th>\n",
-       " <td>22709</td>\n",
-       " <td>5.232023</td>\n",
-       " <td>6.955103</td>\n",
-       " <td>6.009612</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>7</th>\n",
-       " <td>18255</td>\n",
-       " <td>6.157929</td>\n",
-       " <td>7.948689</td>\n",
-       " <td>7.009548</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>8</th>\n",
-       " <td>15076</td>\n",
-       " <td>7.308039</td>\n",
-       " <td>8.942276</td>\n",
-       " <td>8.009483</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>9</th>\n",
-       " <td>12859</td>\n",
-       " <td>8.045649</td>\n",
-       " <td>9.935862</td>\n",
-       " <td>9.009418</td>\n",
-       " </tr>\n",
-       " </tbody>\n",
-       "</table>\n",
-       "</div>"
-      ],
-      "text/plain": [
-       " liczba tokenów średnia częstość w części B estymacje +1 estymacje +0.01\n",
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "liczba tokenów średnia częstość w części B estymacje +1 estymacje +0.01\n",
       "0 388334 1.900495 0.993586 0.009999\n",
       "1 403870 0.592770 1.987172 1.009935\n",
       "2 117529 1.565809 2.980759 2.009870\n",
@@ -671,10 +479,6 @@
       "8 15076 7.308039 8.942276 8.009483\n",
       "9 12859 8.045649 9.935862 9.009418"
      ]
-    },
-    "execution_count": 6,
-    "metadata": {},
-    "output_type": "execute_result"
     }
    ],
    "source": [
@@ -716,113 +520,14 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [
     {
-     "data": {
-      "text/html": [
-       "<div>\n",
-       "<style scoped>\n",
-       " .dataframe tbody tr th:only-of-type {\n",
-       " vertical-align: middle;\n",
-       " }\n",
-       "\n",
-       " .dataframe tbody tr th {\n",
-       " vertical-align: top;\n",
-       " }\n",
-       "\n",
-       " .dataframe thead th {\n",
-       " text-align: right;\n",
-       " }\n",
-       "</style>\n",
-       "<table border=\"1\" class=\"dataframe\">\n",
-       " <thead>\n",
-       " <tr style=\"text-align: right;\">\n",
-       " <th></th>\n",
-       " <th>liczba tokenów</th>\n",
-       " <th>średnia częstość w części B</th>\n",
-       " <th>estymacje +1</th>\n",
-       " <th>Good-Turing</th>\n",
-       " </tr>\n",
-       " </thead>\n",
-       " <tbody>\n",
-       " <tr>\n",
-       " <th>0</th>\n",
-       " <td>388334</td>\n",
-       " <td>1.900495</td>\n",
-       " <td>0.993586</td>\n",
-       " <td>1.040007</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>1</th>\n",
-       " <td>403870</td>\n",
-       " <td>0.592770</td>\n",
-       " <td>1.987172</td>\n",
-       " <td>0.582014</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>2</th>\n",
-       " <td>117529</td>\n",
-       " <td>1.565809</td>\n",
-       " <td>2.980759</td>\n",
-       " <td>1.603009</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>3</th>\n",
-       " <td>62800</td>\n",
-       " <td>2.514268</td>\n",
-       " <td>3.974345</td>\n",
-       " <td>2.602293</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>4</th>\n",
-       " <td>40856</td>\n",
-       " <td>3.504944</td>\n",
-       " <td>4.967931</td>\n",
-       " <td>3.603265</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>5</th>\n",
-       " <td>29443</td>\n",
-       " <td>4.454098</td>\n",
-       " <td>5.961517</td>\n",
-       " <td>4.627721</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>6</th>\n",
-       " <td>22709</td>\n",
-       " <td>5.232023</td>\n",
-       " <td>6.955103</td>\n",
-       " <td>5.627064</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>7</th>\n",
-       " <td>18255</td>\n",
-       " <td>6.157929</td>\n",
-       " <td>7.948689</td>\n",
-       " <td>6.606847</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>8</th>\n",
-       " <td>15076</td>\n",
-       " <td>7.308039</td>\n",
-       " <td>8.942276</td>\n",
-       " <td>7.676506</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>9</th>\n",
-       " <td>12859</td>\n",
-       " <td>8.045649</td>\n",
-       " <td>9.935862</td>\n",
-       " <td>8.557431</td>\n",
-       " </tr>\n",
-       " </tbody>\n",
-       "</table>\n",
-       "</div>"
-      ],
-      "text/plain": [
-       " liczba tokenów średnia częstość w części B estymacje +1 Good-Turing\n",
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "liczba tokenów średnia częstość w części B estymacje +1 Good-Turing\n",
       "0 388334 1.900495 0.993586 1.040007\n",
       "1 403870 0.592770 1.987172 0.582014\n",
       "2 117529 1.565809 2.980759 1.603009\n",
@@ -834,10 +539,6 @@
       "8 15076 7.308039 8.942276 7.676506\n",
       "9 12859 8.045649 9.935862 8.557431"
      ]
-    },
-    "execution_count": 7,
-    "metadata": {},
-    "output_type": "execute_result"
     }
    ],
    "source": [
@@ -1008,18 +709,15 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [
     {
-     "data": {
-      "text/plain": [
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
       "[('k', 'o', 't'), ('o', 't', 'e'), ('t', 'e', 'k')]"
      ]
-     },
-     "execution_count": 8,
-     "metadata": {},
-     "output_type": "execute_result"
     }
    ],
    "source": [
@@ -1036,7 +734,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -1048,23 +746,13 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 12,
+   "execution_count": 1,
    "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "321"
-      ]
-     },
-     "execution_count": 12,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
+   "outputs": [],
    "source": [
     "len(histories['jork'])\n",
-    "len(histories['zielony'])"
+    "len(histories['zielony'])\n",
+    "histories['jork']"
    ]
   },
   {
@@ -1112,7 +800,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "![img](./05_Wygladzanie/size-perplexity.gif \"Perplexity dla różnych rozmiarów zbioru testowego\")\n",
+    "![img](./07_Wygladzanie/size-perplexity.gif \"Perplexity dla różnych rozmiarów zbioru testowego\")\n",
    "\n"
   ]
  },
@@ -1128,7 +816,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "![img](./05_Wygladzanie/size-perplexity2.gif \"Perplexity dla różnych rozmiarów zbioru uczącego\")\n",
+    "![img](./07_Wygladzanie/size-perplexity2.gif \"Perplexity dla różnych rozmiarów zbioru uczącego\")\n",
    "\n"
   ]
  },
@@ -1144,7 +832,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "![img](./05_Wygladzanie/order-perplexity.gif \"Perplexity dla różnych wartości rządu modelu\")\n",
+    "![img](./07_Wygladzanie/order-perplexity.gif \"Perplexity dla różnych wartości rządu modelu\")\n",
    "\n"
   ]
  }
@@ -1165,7 +853,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.10.2"
+   "version": "3.10.5"
  },
  "org": null
 },
@@ -25,7 +25,7 @@ $$p_i = \frac{k_i}{T}.$$
 Rozpatrzmy przykład z 3 kolorami (wiemy, że w urnie mogą być kule
 żółte, zielone i czerwone, tj. $m=3$) i 4 losowaniami ($T=4$):
 
-[[./05_Wygladzanie/urna.drawio.png]]
+[[./07_Wygladzanie/urna.drawio.png]]
 
 Gdybyśmy w prosty sposób oszacowali prawdopodobieństwa, doszlibyśmy do
 wniosku, że prawdopodobieństwo wylosowania kuli czerwonej wynosi 3/4, żółtej — 1/4,
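
For a quick numeric check of the +1 estimates discussed in this hunk, here is a minimal sketch (only the counts m=3, T=4 and the per-colour tallies come from the lecture; the snippet itself is illustrative):

#+BEGIN_SRC python
# +1 (Laplace) smoothing for the urn example: m = 3 colours, T = 4 draws.
# Raw relative frequencies give red 3/4, yellow 1/4 and green 0; the
# smoothed estimate (k + 1) / (T + m) keeps every colour possible.
counts = {'czerwony': 3, 'żółty': 1, 'zielony': 0}
m = len(counts)           # number of colours
T = sum(counts.values())  # number of draws

for colour, k in counts.items():
    print(colour, (k + 1) / (T + m))
# czerwony 0.5714..., żółty 0.2857..., zielony 0.1428...
#+END_SRC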
@@ -85,7 +85,7 @@ losowania kul z urny: $m$ to liczba wszystkich wyrazów (czyli rozmiar słownika
 $k_i$ to ile razy w zbiorze uczącym pojawił się $i$-ty wyraz słownika,
 $T$ — długość zbioru uczącego.
 
-[[./05_Wygladzanie/urna-wyrazy.drawio.png]]
+[[./07_Wygladzanie/urna-wyrazy.drawio.png]]
 
 A zatem przy użyciu wygładzania +1 w następujący sposób estymować
 będziemy prawdopodobieństwo słowa $w$:
@@ -173,7 +173,7 @@ Stwórzmy generator, który będzie wczytywał słowa z pliku, dodatkowo:
 - dodamy specjalne tokeny na początek i koniec zdania (~<s>~ i ~</s>~).
 
 
-#+BEGIN_SRC python :session mysession :exports both :results raw drawer
+#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
 from itertools import islice
 import regex as re
 import sys
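
The generator's body falls outside this hunk's context lines; a minimal sketch of such a tokenizer is given below (the regex and the one-sentence-per-line assumption are illustrative, not necessarily the notebook's exact code):

#+BEGIN_SRC python
# Hedged sketch of a get_words_from_file generator: lowercase each line
# (assumed to hold one sentence), yield word and punctuation tokens, and
# wrap the sentence in <s> ... </s> boundary markers.
import regex as re

def get_words_from_file(file_name):
    with open(file_name, 'r', encoding='utf-8') as fh:
        for line in fh:
            yield '<s>'
            for m in re.finditer(r'[\p{L}0-9]+|\p{P}+', line.lower()):
                yield m.group(0)
            yield '</s>'
#+END_SRC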
@@ -200,7 +200,7 @@ Stwórzmy generator, który będzie wczytywał słowa z pliku, dodatkowo:
 Zobaczmy, ile razy, średnio w drugiej połówce korpusu występują
 wyrazy, które w pierwszej wystąpiły określoną liczbę razy.
 
-#+BEGIN_SRC python :session mysession :exports both :results raw drawer
+#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
 from collections import Counter
 
 counterA = Counter(get_words_from_file('opensubtitlesA.pl.txt'))
@@ -210,7 +210,7 @@ wyrazy, które w pierwszej wystąpiły określoną liczbę razy.
 :results:
 :end:
 
-#+BEGIN_SRC python :session mysession :exports both :results raw drawer
+#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
 counterA['taki']
 #+END_SRC
 
@@ -219,7 +219,7 @@ counterA['taki']
 48113
 :end:
 
-#+BEGIN_SRC python :session mysession :exports both :results raw drawer
+#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
 max_r = 10
 
 buckets = {}
@@ -251,7 +251,7 @@
 Policzmy teraz jakiej liczby wystąpień byśmy oczekiwali, gdyby użyć wygładzania +1 bądź +0.01.
 (Uwaga: zwracamy liczbę wystąpień, a nie względną częstość, stąd przemnażamy przez rozmiar całego korpusu).
 
-#+BEGIN_SRC python :session mysession :exports both :results raw drawer
+#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
 def plus_alpha_smoothing(alpha, m, t, k):
     return t*(k + alpha)/(t + alpha * m)
 
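
Note that plus_alpha_smoothing returns an expected number of occurrences rather than a probability, matching the parenthetical remark above. A usage sketch (the m, t, k figures below are toy values, not the notebook's):

#+BEGIN_SRC python
# Expected count of a word seen k times: t * (k + alpha) / (t + alpha * m),
# with vocabulary size m and corpus length t. Toy numbers for illustration.
def plus_alpha_smoothing(alpha, m, t, k):   # as defined in the hunk above
    return t*(k + alpha)/(t + alpha * m)

print(plus_alpha_smoothing(1, 1000, 10000, 0))     # unseen word, +1    -> ~0.91
print(plus_alpha_smoothing(0.01, 1000, 10000, 0))  # unseen word, +0.01 -> ~0.01
print(plus_alpha_smoothing(1, 1000, 10000, 5))     # k = 5, +1          -> ~5.45
#+END_SRC

As in the tables above, +1 pulls counts strongly toward the uniform distribution, while +0.01 changes them only slightly.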
@@ -275,7 +275,7 @@ Policzmy teraz jakiej liczby wystąpień byśmy oczekiwali, gdyby użyć wygład
 926594
 :end:
 
-#+BEGIN_SRC python :session mysession :exports both :results raw drawer
+#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
 import pandas as pd
 
 pd.DataFrame(data, columns=["liczba tokenów", "średnia częstość w części B", "estymacje +1", "estymacje +0.01"])
@@ -309,7 +309,7 @@ $$p(w) = \frac{\# w + 1}{|C|}\frac{N_{r+1}}{N_r}.$$
 
 **** Przykład
 
-#+BEGIN_SRC python :session mysession :exports both :results raw drawer
+#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
 good_turing_counts = [(ix+1)*nb_of_types[ix+1]/nb_of_types[ix] for ix in range(0, max_r)]
 
 data2 = list(zip(nb_of_types, empirical_counts, plus_one_counts, good_turing_counts))
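
The list comprehension above implements the Good-Turing re-estimated count $r^* = (r+1) \frac{N_{r+1}}{N_r}$; a self-contained toy run (the frequency-of-frequencies values below are made up for illustration):

#+BEGIN_SRC python
# Good-Turing: a word observed r times gets the adjusted count
# r* = (r + 1) * N_{r+1} / N_r, where N_r is the number of word types
# occurring exactly r times. Toy values, for illustration only.
nb_of_types_toy = [100, 50, 20, 10, 5]   # N_0 .. N_4

good_turing_toy = [(r + 1) * nb_of_types_toy[r + 1] / nb_of_types_toy[r]
                   for r in range(len(nb_of_types_toy) - 1)]
print(good_turing_toy)   # [0.5, 0.8, 1.5, 2.0]
#+END_SRC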
@@ -415,7 +415,7 @@ W metodzie Knesera-Neya w następujący sposób estymujemy prawdopodobieństwo u
 
 $$P(w) = \frac{N_{1+}(\bullet w)}{\sum_{w_j} N_{1+}(\bullet w_j)}.$$
 
-#+BEGIN_SRC python :session mysession :exports both :results raw drawer
+#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
 def ngrams(iter, size):
     ngram = []
     for item in iter:
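
The context lines cut the function off after its first statements; a complete sliding-window version, consistent with the [('k', 'o', 't'), ('o', 't', 'e'), ('t', 'e', 'k')] output shown earlier, could look like this (a reconstruction, not necessarily the notebook's verbatim body):

#+BEGIN_SRC python
# Sliding-window n-gram generator: fill a window of `size` items, yield it
# as a tuple each time it is full, then drop the oldest item.
def ngrams(iter, size):
    ngram = []
    for item in iter:
        ngram.append(item)
        if len(ngram) == size:
            yield tuple(ngram)
            ngram = ngram[1:]

list(ngrams('kotek', 3))   # [('k', 'o', 't'), ('o', 't', 'e'), ('t', 'e', 'k')]
#+END_SRC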
@@ -433,7 +433,7 @@ $$P(w) = \frac{N_{1+}(\bullet w)}{\sum_{w_j} N_{1+}(\bullet w_j)}.$$
 :end:
 
 
-#+BEGIN_SRC python :session mysession :exports both :results raw drawer
+#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
 histories = { }
 for prev_token, token in ngrams(get_words_from_file('opensubtitlesA.pl.txt'), 2):
     histories.setdefault(token, set())
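
Since histories maps each token to the set of its distinct left-hand neighbours, $N_{1+}(\bullet w)$ is simply len(histories[w]), and the Kneser-Ney unigram estimate above can be sketched as follows (an illustrative helper, not part of the notebook):

#+BEGIN_SRC python
# Kneser-Ney unigram probability: the number of distinct one-word contexts
# in which w occurs (continuation count), normalised over the vocabulary.
def kneser_ney_unigram(w, histories):
    denominator = sum(len(h) for h in histories.values())
    return len(histories[w]) / denominator

# A word such as 'jork' gets a low estimate if it occurs almost exclusively
# after a single word ('nowy'), no matter how frequent it is on its own.
#+END_SRC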
@@ -444,7 +444,7 @@
 :results:
 :end:
 
-#+BEGIN_SRC python :session mysession :exports both :results raw drawer
+#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
 len(histories['jork'])
 len(histories['zielony'])
 histories['jork']
@@ -472,15 +472,15 @@ Knesera-Neya połączone z *przycinaniem* słownika n-gramów (wszystkie
 **** Zmiana perplexity przy zwiększaniu zbioru testowego
 
 #+CAPTION: Perplexity dla różnych rozmiarów zbioru testowego
-[[./05_Wygladzanie/size-perplexity.gif]]
+[[./07_Wygladzanie/size-perplexity.gif]]
 
 
 **** Zmiana perplexity przy zwiększaniu zbioru uczącego
 
 #+CAPTION: Perplexity dla różnych rozmiarów zbioru uczącego
-[[./05_Wygladzanie/size-perplexity2.gif]]
+[[./07_Wygladzanie/size-perplexity2.gif]]
 
 **** Zmiana perplexity przy zwiększaniu rządu modelu
 
 #+CAPTION: Perplexity dla różnych wartości rządu modelu
-[[./05_Wygladzanie/order-perplexity.gif]]
+[[./07_Wygladzanie/order-perplexity.gif]]
(Five binary images accompany this change, unchanged in content — before and after sizes identical: 4.1 KiB, 4.5 KiB, 4.8 KiB, 24 KiB, 17 KiB.)