Fix 07
@@ -1,5 +1,20 @@
 {
  "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
+    "<div class=\"alert alert-block alert-info\">\n",
+    "<h1> Modelowanie języka</h1>\n",
+    "<h2> 07. <i>Wygładzanie w n-gramowych modelach języka</i> [wykład]</h2> \n",
+    "<h3> Filip Graliński (2022)</h3>\n",
+    "</div>\n",
+    "\n",
+    "![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)\n",
+    "\n"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -54,7 +69,7 @@
     "Rozpatrzmy przykład z 3 kolorami (wiemy, że w urnie mogą być kule\n",
     "żółte, zielone i czerwone, tj. $m=3$) i 4 losowaniami ($T=4$):\n",
     "\n",
-    "![img](./05_Wygladzanie/urna.drawio.png)\n",
+    "![img](./07_Wygladzanie/urna.drawio.png)\n",
     "\n",
     "Gdybyśmy w prosty sposób oszacowali prawdopodobieństwa, doszlibyśmy do\n",
     "wniosku, że prawdopodobieństwo wylosowania kuli czerwonej wynosi 3/4, żółtej — 1/4,\n",
@@ -168,7 +183,7 @@
     "$k_i$ to ile razy w zbiorze uczącym pojawił się $i$-ty wyraz słownika,\n",
     "$T$ — długość zbioru uczącego.\n",
     "\n",
-    "![img](./05_Wygladzanie/urna-wyrazy.drawio.png)\n",
+    "![img](./07_Wygladzanie/urna-wyrazy.drawio.png)\n",
     "\n",
     "A zatem przy użyciu wygładzania +1 w następujący sposób estymować\n",
     "będziemy prawdopodobieństwo słowa $w$:\n",
@@ -303,113 +318,11 @@
    "metadata": {},
    "outputs": [
     {
-     "data": {
-      "text/plain": [
-       "['<s>',\n",
-       " 'lubisz',\n",
-       " 'curry',\n",
-       " ',',\n",
-       " 'prawda',\n",
-       " '?',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " 'nałożę',\n",
-       " 'ci',\n",
-       " 'więcej',\n",
-       " '.',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " 'hey',\n",
-       " '!',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " 'smakuje',\n",
-       " 'ci',\n",
-       " '?',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " 'hey',\n",
-       " ',',\n",
-       " 'brzydalu',\n",
-       " '.',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " 'spójrz',\n",
-       " 'na',\n",
-       " 'nią',\n",
-       " '.',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " '-',\n",
-       " 'wariatka',\n",
-       " '.',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " '-',\n",
-       " 'zadałam',\n",
-       " 'ci',\n",
-       " 'pytanie',\n",
-       " '!',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " 'no',\n",
-       " ',',\n",
-       " 'tak',\n",
-       " 'lepiej',\n",
-       " '!',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " '-',\n",
-       " 'wygląda',\n",
-       " 'dobrze',\n",
-       " '!',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " '-',\n",
-       " 'tak',\n",
-       " 'lepiej',\n",
-       " '!',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " 'pasuje',\n",
-       " 'jej',\n",
-       " '.',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " '-',\n",
-       " 'hey',\n",
-       " '.',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " '-',\n",
-       " 'co',\n",
-       " 'do',\n",
-       " '...?',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " 'co',\n",
-       " 'do',\n",
-       " 'cholery',\n",
-       " 'robisz',\n",
-       " '?',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " 'zejdź',\n",
-       " 'mi',\n",
-       " 'z',\n",
-       " 'oczu',\n",
-       " ',',\n",
-       " 'zdziro',\n",
-       " '.',\n",
-       " '</s>',\n",
-       " '<s>',\n",
-       " 'przestań',\n",
-       " 'dokuczać']"
-      ]
-     },
-     "execution_count": 1,
-     "metadata": {},
-     "output_type": "execute_result"
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "['<s>', 'lubisz', 'curry', ',', 'prawda', '?', '</s>', '<s>', 'nałożę', 'ci', 'więcej', '.', '</s>', '<s>', 'hey', '!', '</s>', '<s>', 'smakuje', 'ci', '?', '</s>', '<s>', 'hey', ',', 'brzydalu', '.', '</s>', '<s>', 'spójrz', 'na', 'nią', '.', '</s>', '<s>', '-', 'wariatka', '.', '</s>', '<s>', '-', 'zadałam', 'ci', 'pytanie', '!', '</s>', '<s>', 'no', ',', 'tak', 'lepiej', '!', '</s>', '<s>', '-', 'wygląda', 'dobrze', '!', '</s>', '<s>', '-', 'tak', 'lepiej', '!', '</s>', '<s>', 'pasuje', 'jej', '.', '</s>', '<s>', '-', 'hey', '.', '</s>', '<s>', '-', 'co', 'do', '...?', '</s>', '<s>', 'co', 'do', 'cholery', 'robisz', '?', '</s>', '<s>', 'zejdź', 'mi', 'z', 'oczu', ',', 'zdziro', '.', '</s>', '<s>', 'przestań', 'dokuczać']"
+     ]
     }
    ],
    "source": [
@@ -448,7 +361,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -459,18 +372,15 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [
     {
-     "data": {
-      "text/plain": [
-       "48113"
-      ]
-     },
-     "execution_count": 3,
-     "metadata": {},
-     "output_type": "execute_result"
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "48113"
+     ]
     }
    ],
    "source": [
@@ -479,7 +389,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -518,18 +428,15 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [
     {
-     "data": {
-      "text/plain": [
-       "926594"
-      ]
-     },
-     "execution_count": 5,
-     "metadata": {},
-     "output_type": "execute_result"
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "926594"
+     ]
     }
    ],
    "source": [
@@ -553,128 +460,25 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [
     {
-     "data": {
-      "text/html": [
-       "<div>\n",
-       "<style scoped>\n",
-       " .dataframe tbody tr th:only-of-type {\n",
-       " vertical-align: middle;\n",
-       " }\n",
-       "\n",
-       " .dataframe tbody tr th {\n",
-       " vertical-align: top;\n",
-       " }\n",
-       "\n",
-       " .dataframe thead th {\n",
-       " text-align: right;\n",
-       " }\n",
-       "</style>\n",
-       "<table border=\"1\" class=\"dataframe\">\n",
-       " <thead>\n",
-       " <tr style=\"text-align: right;\">\n",
-       " <th></th>\n",
-       " <th>liczba tokenów</th>\n",
-       " <th>średnia częstość w części B</th>\n",
-       " <th>estymacje +1</th>\n",
-       " <th>estymacje +0.01</th>\n",
-       " </tr>\n",
-       " </thead>\n",
-       " <tbody>\n",
-       " <tr>\n",
-       " <th>0</th>\n",
-       " <td>388334</td>\n",
-       " <td>1.900495</td>\n",
-       " <td>0.993586</td>\n",
-       " <td>0.009999</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>1</th>\n",
-       " <td>403870</td>\n",
-       " <td>0.592770</td>\n",
-       " <td>1.987172</td>\n",
-       " <td>1.009935</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>2</th>\n",
-       " <td>117529</td>\n",
-       " <td>1.565809</td>\n",
-       " <td>2.980759</td>\n",
-       " <td>2.009870</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>3</th>\n",
-       " <td>62800</td>\n",
-       " <td>2.514268</td>\n",
-       " <td>3.974345</td>\n",
-       " <td>3.009806</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>4</th>\n",
-       " <td>40856</td>\n",
-       " <td>3.504944</td>\n",
-       " <td>4.967931</td>\n",
-       " <td>4.009741</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>5</th>\n",
-       " <td>29443</td>\n",
-       " <td>4.454098</td>\n",
-       " <td>5.961517</td>\n",
-       " <td>5.009677</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>6</th>\n",
-       " <td>22709</td>\n",
-       " <td>5.232023</td>\n",
-       " <td>6.955103</td>\n",
-       " <td>6.009612</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>7</th>\n",
-       " <td>18255</td>\n",
-       " <td>6.157929</td>\n",
-       " <td>7.948689</td>\n",
-       " <td>7.009548</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>8</th>\n",
-       " <td>15076</td>\n",
-       " <td>7.308039</td>\n",
-       " <td>8.942276</td>\n",
-       " <td>8.009483</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>9</th>\n",
-       " <td>12859</td>\n",
-       " <td>8.045649</td>\n",
-       " <td>9.935862</td>\n",
-       " <td>9.009418</td>\n",
-       " </tr>\n",
-       " </tbody>\n",
-       "</table>\n",
-       "</div>"
-      ],
-      "text/plain": [
-       " liczba tokenów średnia częstość w części B estymacje +1 estymacje +0.01\n",
-       "0 388334 1.900495 0.993586 0.009999\n",
-       "1 403870 0.592770 1.987172 1.009935\n",
-       "2 117529 1.565809 2.980759 2.009870\n",
-       "3 62800 2.514268 3.974345 3.009806\n",
-       "4 40856 3.504944 4.967931 4.009741\n",
-       "5 29443 4.454098 5.961517 5.009677\n",
-       "6 22709 5.232023 6.955103 6.009612\n",
-       "7 18255 6.157929 7.948689 7.009548\n",
-       "8 15076 7.308039 8.942276 8.009483\n",
-       "9 12859 8.045649 9.935862 9.009418"
-      ]
-     },
-     "execution_count": 6,
-     "metadata": {},
-     "output_type": "execute_result"
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "liczba tokenów średnia częstość w części B estymacje +1 estymacje +0.01\n",
+      "0 388334 1.900495 0.993586 0.009999\n",
+      "1 403870 0.592770 1.987172 1.009935\n",
+      "2 117529 1.565809 2.980759 2.009870\n",
+      "3 62800 2.514268 3.974345 3.009806\n",
+      "4 40856 3.504944 4.967931 4.009741\n",
+      "5 29443 4.454098 5.961517 5.009677\n",
+      "6 22709 5.232023 6.955103 6.009612\n",
+      "7 18255 6.157929 7.948689 7.009548\n",
+      "8 15076 7.308039 8.942276 8.009483\n",
+      "9 12859 8.045649 9.935862 9.009418"
+     ]
    }
    ],
    "source": [
@@ -716,128 +520,25 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [
     {
-     "data": {
-      "text/html": [
-       "<div>\n",
-       "<style scoped>\n",
-       " .dataframe tbody tr th:only-of-type {\n",
-       " vertical-align: middle;\n",
-       " }\n",
-       "\n",
-       " .dataframe tbody tr th {\n",
-       " vertical-align: top;\n",
-       " }\n",
-       "\n",
-       " .dataframe thead th {\n",
-       " text-align: right;\n",
-       " }\n",
-       "</style>\n",
-       "<table border=\"1\" class=\"dataframe\">\n",
-       " <thead>\n",
-       " <tr style=\"text-align: right;\">\n",
-       " <th></th>\n",
-       " <th>liczba tokenów</th>\n",
-       " <th>średnia częstość w części B</th>\n",
-       " <th>estymacje +1</th>\n",
-       " <th>Good-Turing</th>\n",
-       " </tr>\n",
-       " </thead>\n",
-       " <tbody>\n",
-       " <tr>\n",
-       " <th>0</th>\n",
-       " <td>388334</td>\n",
-       " <td>1.900495</td>\n",
-       " <td>0.993586</td>\n",
-       " <td>1.040007</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>1</th>\n",
-       " <td>403870</td>\n",
-       " <td>0.592770</td>\n",
-       " <td>1.987172</td>\n",
-       " <td>0.582014</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>2</th>\n",
-       " <td>117529</td>\n",
-       " <td>1.565809</td>\n",
-       " <td>2.980759</td>\n",
-       " <td>1.603009</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>3</th>\n",
-       " <td>62800</td>\n",
-       " <td>2.514268</td>\n",
-       " <td>3.974345</td>\n",
-       " <td>2.602293</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>4</th>\n",
-       " <td>40856</td>\n",
-       " <td>3.504944</td>\n",
-       " <td>4.967931</td>\n",
-       " <td>3.603265</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>5</th>\n",
-       " <td>29443</td>\n",
-       " <td>4.454098</td>\n",
-       " <td>5.961517</td>\n",
-       " <td>4.627721</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>6</th>\n",
-       " <td>22709</td>\n",
-       " <td>5.232023</td>\n",
-       " <td>6.955103</td>\n",
-       " <td>5.627064</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>7</th>\n",
-       " <td>18255</td>\n",
-       " <td>6.157929</td>\n",
-       " <td>7.948689</td>\n",
-       " <td>6.606847</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>8</th>\n",
-       " <td>15076</td>\n",
-       " <td>7.308039</td>\n",
-       " <td>8.942276</td>\n",
-       " <td>7.676506</td>\n",
-       " </tr>\n",
-       " <tr>\n",
-       " <th>9</th>\n",
-       " <td>12859</td>\n",
-       " <td>8.045649</td>\n",
-       " <td>9.935862</td>\n",
-       " <td>8.557431</td>\n",
-       " </tr>\n",
-       " </tbody>\n",
-       "</table>\n",
-       "</div>"
-      ],
-      "text/plain": [
-       " liczba tokenów średnia częstość w części B estymacje +1 Good-Turing\n",
-       "0 388334 1.900495 0.993586 1.040007\n",
-       "1 403870 0.592770 1.987172 0.582014\n",
-       "2 117529 1.565809 2.980759 1.603009\n",
-       "3 62800 2.514268 3.974345 2.602293\n",
-       "4 40856 3.504944 4.967931 3.603265\n",
-       "5 29443 4.454098 5.961517 4.627721\n",
-       "6 22709 5.232023 6.955103 5.627064\n",
-       "7 18255 6.157929 7.948689 6.606847\n",
-       "8 15076 7.308039 8.942276 7.676506\n",
-       "9 12859 8.045649 9.935862 8.557431"
-      ]
-     },
-     "execution_count": 7,
-     "metadata": {},
-     "output_type": "execute_result"
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "liczba tokenów średnia częstość w części B estymacje +1 Good-Turing\n",
+      "0 388334 1.900495 0.993586 1.040007\n",
+      "1 403870 0.592770 1.987172 0.582014\n",
+      "2 117529 1.565809 2.980759 1.603009\n",
+      "3 62800 2.514268 3.974345 2.602293\n",
+      "4 40856 3.504944 4.967931 3.603265\n",
+      "5 29443 4.454098 5.961517 4.627721\n",
+      "6 22709 5.232023 6.955103 5.627064\n",
+      "7 18255 6.157929 7.948689 6.606847\n",
+      "8 15076 7.308039 8.942276 7.676506\n",
+      "9 12859 8.045649 9.935862 8.557431"
+     ]
    }
    ],
    "source": [
@@ -1008,18 +709,15 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [
     {
-     "data": {
-      "text/plain": [
-       "[('k', 'o', 't'), ('o', 't', 'e'), ('t', 'e', 'k')]"
-      ]
-     },
-     "execution_count": 8,
-     "metadata": {},
-     "output_type": "execute_result"
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[('k', 'o', 't'), ('o', 't', 'e'), ('t', 'e', 'k')]"
+     ]
    }
    ],
    "source": [
@@ -1036,7 +734,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -1048,23 +746,13 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 12,
+   "execution_count": 1,
    "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "321"
-      ]
-     },
-     "execution_count": 12,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
+   "outputs": [],
    "source": [
     "len(histories['jork'])\n",
-    "len(histories['zielony'])"
+    "len(histories['zielony'])\n",
+    "histories['jork']"
    ]
   },
   {
@@ -1112,7 +800,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "![img](./05_Wygladzanie/size-perplexity.gif \"Perplexity dla różnych rozmiarów zbioru testowego\")\n",
+   "![img](./07_Wygladzanie/size-perplexity.gif \"Perplexity dla różnych rozmiarów zbioru testowego\")\n",
    "\n"
   ]
  },
@@ -1128,7 +816,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "![img](./05_Wygladzanie/size-perplexity2.gif \"Perplexity dla różnych rozmiarów zbioru uczącego\")\n",
+   "![img](./07_Wygladzanie/size-perplexity2.gif \"Perplexity dla różnych rozmiarów zbioru uczącego\")\n",
    "\n"
  ]
 },
@@ -1144,7 +832,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "![img](./05_Wygladzanie/order-perplexity.gif \"Perplexity dla różnych wartości rządu modelu\")\n",
+   "![img](./07_Wygladzanie/order-perplexity.gif \"Perplexity dla różnych wartości rządu modelu\")\n",
    "\n"
   ]
  }
@@ -1165,7 +853,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.2"
+   "version": "3.10.5"
   },
   "org": null
  },
@@ -25,7 +25,7 @@ $$p_i = \frac{k_i}{T}.$$
 Rozpatrzmy przykład z 3 kolorami (wiemy, że w urnie mogą być kule
 żółte, zielone i czerwone, tj. $m=3$) i 4 losowaniami ($T=4$):
 
-[[./05_Wygladzanie/urna.drawio.png]]
+[[./07_Wygladzanie/urna.drawio.png]]
 
 Gdybyśmy w prosty sposób oszacowali prawdopodobieństwa, doszlibyśmy do
 wniosku, że prawdopodobieństwo wylosowania kuli czerwonej wynosi 3/4, żółtej — 1/4,
@@ -85,7 +85,7 @@ losowania kul z urny: $m$ to liczba wszystkich wyrazów (czyli rozmiar słownika
 $k_i$ to ile razy w zbiorze uczącym pojawił się $i$-ty wyraz słownika,
 $T$ — długość zbioru uczącego.
 
-[[./05_Wygladzanie/urna-wyrazy.drawio.png]]
+[[./07_Wygladzanie/urna-wyrazy.drawio.png]]
 
 A zatem przy użyciu wygładzania +1 w następujący sposób estymować
 będziemy prawdopodobieństwo słowa $w$:
@@ -173,7 +173,7 @@ Stwórzmy generator, który będzie wczytywał słowa z pliku, dodatkowo:
 - dodamy specjalne tokeny na początek i koniec zdania (~<s>~ i ~</s>~).
 
 
-#+BEGIN_SRC python :session mysession :exports both :results raw drawer
+#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
 from itertools import islice
 import regex as re
 import sys
@@ -200,7 +200,7 @@ Stwórzmy generator, który będzie wczytywał słowa z pliku, dodatkowo:
 Zobaczmy, ile razy, średnio w drugiej połówce korpusu występują
 wyrazy, które w pierwszej wystąpiły określoną liczbę razy.
 
-#+BEGIN_SRC python :session mysession :exports both :results raw drawer
+#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
 from collections import Counter
 
 counterA = Counter(get_words_from_file('opensubtitlesA.pl.txt'))
@@ -210,7 +210,7 @@ wyrazy, które w pierwszej wystąpiły określoną liczbę razy.
 :results:
 :end:
 
-#+BEGIN_SRC python :session mysession :exports both :results raw drawer
+#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
 counterA['taki']
 #+END_SRC
 
@@ -219,7 +219,7 @@ counterA['taki']
 48113
 :end:
 
-#+BEGIN_SRC python :session mysession :exports both :results raw drawer
+#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
 max_r = 10
 
 buckets = {}
@@ -251,7 +251,7 @@ counterA['taki']
 Policzmy teraz jakiej liczby wystąpień byśmy oczekiwali, gdyby użyć wygładzania +1 bądź +0.01.
 (Uwaga: zwracamy liczbę wystąpień, a nie względną częstość, stąd przemnażamy przez rozmiar całego korpusu).
 
-#+BEGIN_SRC python :session mysession :exports both :results raw drawer
+#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
 def plus_alpha_smoothing(alpha, m, t, k):
     return t*(k + alpha)/(t + alpha * m)
 
@@ -275,7 +275,7 @@ Policzmy teraz jakiej liczby wystąpień byśmy oczekiwali, gdyby użyć wygład
 926594
 :end:
 
-#+BEGIN_SRC python :session mysession :exports both :results raw drawer
+#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
 import pandas as pd
 
 pd.DataFrame(data, columns=["liczba tokenów", "średnia częstość w części B", "estymacje +1", "estymacje +0.01"])
@@ -309,7 +309,7 @@ $$p(w) = \frac{\# w + 1}{|C|}\frac{N_{r+1}}{N_r}.$$
 
 **** Przykład
 
-#+BEGIN_SRC python :session mysession :exports both :results raw drawer
+#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
 good_turing_counts = [(ix+1)*nb_of_types[ix+1]/nb_of_types[ix] for ix in range(0, max_r)]
 
 data2 = list(zip(nb_of_types, empirical_counts, plus_one_counts, good_turing_counts))
@@ -415,7 +415,7 @@ W metodzie Knesera-Neya w następujący sposób estymujemy prawdopodobieństwo u
 
 $$P(w) = \frac{N_{1+}(\bullet w)}{\sum_{w_j} N_{1+}(\bullet w_j)}.$$
 
-#+BEGIN_SRC python :session mysession :exports both :results raw drawer
+#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
 def ngrams(iter, size):
     ngram = []
     for item in iter:
@@ -433,7 +433,7 @@ $$P(w) = \frac{N_{1+}(\bullet w)}{\sum_{w_j} N_{1+}(\bullet w_j)}.$$
 :end:
 
 
-#+BEGIN_SRC python :session mysession :exports both :results raw drawer
+#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
 histories = { }
 for prev_token, token in ngrams(get_words_from_file('opensubtitlesA.pl.txt'), 2):
     histories.setdefault(token, set())
@@ -444,7 +444,7 @@ $$P(w) = \frac{N_{1+}(\bullet w)}{\sum_{w_j} N_{1+}(\bullet w_j)}.$$
 :results:
 :end:
 
-#+BEGIN_SRC python :session mysession :exports both :results raw drawer
+#+BEGIN_SRC ipython :session mysession :exports both :results raw drawer
 len(histories['jork'])
 len(histories['zielony'])
 histories['jork']
@@ -472,15 +472,15 @@ Knesera-Neya połączone z *przycinaniem* słownika n-gramów (wszystkie
 **** Zmiana perplexity przy zwiększaniu zbioru testowego
 
 #+CAPTION: Perplexity dla różnych rozmiarów zbioru testowego
-[[./05_Wygladzanie/size-perplexity.gif]]
+[[./07_Wygladzanie/size-perplexity.gif]]
 
 
 **** Zmiana perplexity przy zwiększaniu zbioru uczącego
 
 #+CAPTION: Perplexity dla różnych rozmiarów zbioru uczącego
-[[./05_Wygladzanie/size-perplexity2.gif]]
+[[./07_Wygladzanie/size-perplexity2.gif]]
 
 **** Zmiana perplexity przy zwiększaniu rządu modelu
 
 #+CAPTION: Perplexity dla różnych wartości rządu modelu
-[[./05_Wygladzanie/order-perplexity.gif]]
+[[./07_Wygladzanie/order-perplexity.gif]]
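For context, the estimators whose code and outputs these hunks touch (add-α smoothing, Good-Turing adjusted counts, and Kneser-Ney continuation probabilities) can be sketched as below. This is a minimal illustration with toy numbers, not the notebook's OpenSubtitles results; `plus_alpha_smoothing` mirrors the function in the diff, while `good_turing_counts` and `continuation_probability` are hypothetical helpers wrapping the diff's inline expressions.

```python
def plus_alpha_smoothing(alpha, m, t, k):
    # Expected count under +alpha smoothing: t * (k + alpha) / (t + alpha * m),
    # where m = vocabulary size, t = corpus length, k = raw count of the word.
    return t * (k + alpha) / (t + alpha * m)

def good_turing_counts(nb_of_types, max_r):
    # Good-Turing adjusted count for frequency r: (r + 1) * N_{r+1} / N_r,
    # where nb_of_types[r] is the number of word types seen exactly r times.
    return [(r + 1) * nb_of_types[r + 1] / nb_of_types[r] for r in range(max_r)]

def continuation_probability(histories, w):
    # Kneser-Ney unigram estimate: N_{1+}(. w) / sum_j N_{1+}(. w_j),
    # i.e. the number of distinct left contexts of w, normalised over the vocabulary.
    total = sum(len(contexts) for contexts in histories.values())
    return len(histories[w]) / total

# Toy example: vocabulary of m=3 types, corpus of t=4 tokens with raw counts 3, 1, 0.
print(plus_alpha_smoothing(1, 3, 4, 3))  # 16/7, the "3 red balls" cell smoothed
print(plus_alpha_smoothing(1, 3, 4, 0))  # 4/7, an unseen type gets non-zero mass

# Toy frequency-of-frequency data: N_0=10, N_1=5, N_2=2.
print(good_turing_counts([10, 5, 2], 2))
```

Note the design point the notebook's `histories` hunk relies on: `continuation_probability` counts distinct bigram histories (a `set` per word), so a word frequent in only one fixed context (like "jork" after "nowy") scores low even if its raw count is high.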