2021-06-09 12:43:29 +02:00
|
|
|
|
{
|
2021-09-27 08:10:10 +02:00
|
|
|
|
"cells": [
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
|
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
|
|
|
|
|
"<div class=\"alert alert-block alert-info\">\n",
|
|
|
|
"<h1> Information Extraction </h1>\n",
"<h2> 12. <i>BPE Encoding</i> [lecture]</h2> \n",
|
|
|
|
|
"<h3> Filip Graliński (2021)</h3>\n",
|
|
|
|
|
"</div>\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"## Splitting into subword units\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"### The vocabulary cannot be too large…\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"If we use learnable word embeddings, we have to add them to the list of\n",
"parameters of the whole model: that is $|V|n$ weights, where $|V|$ is the\n",
"vocabulary size and $n$ is the embedding size; during training we also have to\n",
"store the gradients associated with the embeddings. GPU memory is of course\n",
"limited, so the vocabulary cannot be arbitrarily large. For a given GPU model\n",
"it is fairly easy to determine the maximum vocabulary size: this is a “hard”\n",
"constraint that we must satisfy.\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
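As a rough illustration of the memory argument above, the sketch below estimates how much memory an embedding table takes. The helper name `embedding_memory_bytes`, the vocabulary sizes, the embedding size of 300 and the 4-byte (float32) weights are assumptions chosen for illustration, not values from the lecture. The factor of 2 accounts only for one gradient per weight and ignores any optimizer state.

```python
def embedding_memory_bytes(vocab_size, emb_dim, bytes_per_weight=4, training=True):
    """Estimate the memory taken by a |V| x n embedding table (hypothetical helper)."""
    weights = vocab_size * emb_dim          # |V| * n parameters
    copies = 2 if training else 1           # weights plus their gradients when training
    return weights * bytes_per_weight * copies

# Assumed sizes: about 3 million word types vs. a 50,000-entry vocabulary,
# both with 300-dimensional float32 embeddings.
full = embedding_memory_bytes(3_000_000, 300)
small = embedding_memory_bytes(50_000, 300)
print(f"{full / 2**30:.2f} GiB vs {small / 2**30:.3f} GiB")
```

Under these assumptions the embedding table alone differs by a factor of 60 between the two vocabularies, which is why the vocabulary size is treated as a hard budget.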
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"#### Can the vocabulary really be that large?\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"How many distinct inflected forms are there in Polish? Let's look at the PoliMorf dictionary…\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": 1,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"outputs": [
|
|
|
|
|
{
|
|
|
|
|
"name": "stdout",
|
|
|
|
|
"output_type": "stream",
|
|
|
|
|
"text": [
|
|
|
|
|
"a\n",
|
|
|
|
|
"aa\n",
|
|
|
|
|
"AA\n",
|
|
|
|
|
"Aachen\n",
|
|
|
|
|
"Aalborg\n",
|
|
|
|
|
"Aalborgiem\n",
|
|
|
|
|
"Aalborgowi\n",
|
|
|
|
|
"Aalborgu\n",
|
|
|
|
|
"AAP\n",
|
|
|
|
|
"Aar\n",
|
|
|
|
|
"Aarem\n",
|
|
|
|
|
"Aarowi\n",
|
|
|
|
|
"Aaru\n",
|
|
|
|
|
"Aarze\n",
|
|
|
|
|
"Aara\n",
|
|
|
|
|
"Aarą\n",
|
|
|
|
|
"Aarę\n",
|
|
|
|
|
"Aaro\n",
|
|
|
|
|
"Aary\n",
|
|
|
|
|
"Aarze\n",
|
|
|
|
|
"uniq: błąd zapisu: Przerwany potok\n"
|
|
|
|
|
]
|
|
|
|
|
}
|
|
|
|
|
],
|
|
|
|
|
"source": [
|
|
|
|
|
"! wget -q 'http://zil.ipipan.waw.pl/PoliMorf?action=AttachFile&do=get&target=PoliMorf-0.6.7.tab.gz' -O - | zcat | cut -f 1 | uniq | head -n 20"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": 2,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"outputs": [
|
|
|
|
|
{
|
|
|
|
|
"name": "stdout",
|
|
|
|
|
"output_type": "stream",
|
|
|
|
|
"text": [
|
|
|
|
|
"3844535\n"
|
|
|
|
|
]
|
|
|
|
|
}
|
|
|
|
|
],
|
|
|
|
|
"source": [
|
|
|
|
|
"! wget -q 'http://zil.ipipan.waw.pl/PoliMorf?action=AttachFile&do=get&target=PoliMorf-0.6.7.tab.gz' -O - | zcat | cut -f 1 | sort -u | wc -l"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"**Question** In which European language will there be even more word forms than in Polish?\n",
"\n",
"In fact there are even more forms; PoliMorf of course does not exhaust the set…\n",
"\n",
"**Question** Give examples of “obvious” words that are missing from PoliMorf. How could one search for such words systematically?\n",
"\n",
"On the other hand, PoliMorf contains many odd, “artificial” words.\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": 3,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"outputs": [
|
|
|
|
|
{
|
|
|
|
|
"name": "stdout",
|
|
|
|
|
"output_type": "stream",
|
|
|
|
|
"text": [
|
|
|
|
|
"niebiałościenną\n",
|
|
|
|
|
"nieponadosobowości\n",
|
|
|
|
|
"nieknerający\n",
|
|
|
|
|
"inspektoratów\n",
|
|
|
|
|
"Korytkowskich\n",
|
|
|
|
|
"elektrostatyczności\n",
|
|
|
|
|
"Okola\n",
|
|
|
|
|
"bezsłowny\n",
|
|
|
|
|
"indygowcu\n",
|
|
|
|
|
"gadany\n",
|
|
|
|
|
"nieładowarkowościach\n",
|
|
|
|
|
"niepawężnicowate\n",
|
|
|
|
|
"Thom\n",
|
|
|
|
|
"poradlmy\n",
|
|
|
|
|
"olejący\n",
|
|
|
|
|
"Ziemianinów\n",
|
|
|
|
|
"stenotropizmami\n",
|
|
|
|
|
"wigiliowości\n",
|
|
|
|
|
"pognanej\n",
|
|
|
|
|
"niekinezyterapeutycznym\n"
|
|
|
|
|
]
|
|
|
|
|
}
|
|
|
|
|
],
|
|
|
|
|
"source": [
|
|
|
|
|
"! wget -q 'http://zil.ipipan.waw.pl/PoliMorf?action=AttachFile&do=get&target=PoliMorf-0.6.7.tab.gz' -O - | zcat | cut -f 1 | shuf -n 20"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"Alternatively, let's see how many distinct words occur in a real collection of texts. Consider\n",
"the texts collected for the task of identifying the gender of a text's author:\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": 1,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"outputs": [
|
|
|
|
|
{
|
|
|
|
|
"name": "stdout",
|
|
|
|
|
"output_type": "stream",
|
|
|
|
|
"text": [
|
|
|
|
|
"# Out[7]:"
|
|
|
|
|
]
|
|
|
|
|
}
|
|
|
|
|
],
|
|
|
|
|
"source": [
|
|
|
|
|
"! git clone --single-branch --depth 1 git://gonito.net/petite-difference-challenge2"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": 1,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"! xzcat petite-difference-challenge2/train/in.tsv.xz | perl -C -ne 'print \"$&\\n\" while/\\p{L}+/g;' | sort -u > vocab.txt"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": 4,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"outputs": [
|
|
|
|
|
{
|
|
|
|
|
"name": "stdout",
|
|
|
|
|
"output_type": "stream",
|
|
|
|
|
"text": [
|
|
|
|
|
"ˆ\n",
|
|
|
|
|
"ˇ\n",
|
|
|
|
|
"゚\n",
|
|
|
|
|
"a\n",
|
|
|
|
|
"A\n",
|
|
|
|
|
"á\n",
|
|
|
|
|
"Á\n",
|
|
|
|
|
"à\n",
|
|
|
|
|
"À\n",
|
|
|
|
|
"ă\n",
|
|
|
|
|
"Ă\n",
|
|
|
|
|
"â\n",
|
|
|
|
|
"Â\n",
|
|
|
|
|
"å\n",
|
|
|
|
|
"Å\n",
|
|
|
|
|
"ä\n",
|
|
|
|
|
"Ä\n",
|
|
|
|
|
"Ã\n",
|
|
|
|
|
"ā\n",
|
|
|
|
|
"aa\n",
|
|
|
|
|
"aA\n",
|
|
|
|
|
"Aa\n",
|
|
|
|
|
"AA\n",
|
|
|
|
|
"aĂ\n",
|
|
|
|
|
"AĂ\n",
|
|
|
|
|
"aâ\n",
|
|
|
|
|
"aÂ\n",
|
|
|
|
|
"Aâ\n",
|
|
|
|
|
"aÅ\n",
|
|
|
|
|
"aÄ\n",
|
|
|
|
|
"ª\n",
|
|
|
|
|
"aaa\n",
|
|
|
|
|
"aAa\n",
|
|
|
|
|
"Aaa\n",
|
|
|
|
|
"AaA\n",
|
|
|
|
|
"AAa\n",
|
|
|
|
|
"AAA\n",
|
|
|
|
|
"aaaa\n",
|
|
|
|
|
"aAaa\n",
|
|
|
|
|
"Aaaa\n",
|
|
|
|
|
"AaAa\n",
|
|
|
|
|
"AAaa\n",
|
|
|
|
|
"AAAa\n",
|
|
|
|
|
"AAAA\n",
|
|
|
|
|
"aaaaa\n",
|
|
|
|
|
"Aaaaa\n",
|
|
|
|
|
"AaaaA\n",
|
|
|
|
|
"AAaaa\n",
|
|
|
|
|
"AAAAA\n",
|
|
|
|
|
"aaaaaa\n"
|
|
|
|
|
]
|
|
|
|
|
}
|
|
|
|
|
],
|
|
|
|
|
"source": [
|
|
|
|
|
"! head -n 50 vocab.txt"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": 5,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"outputs": [
|
|
|
|
|
{
|
|
|
|
|
"name": "stdout",
|
|
|
|
|
"output_type": "stream",
|
|
|
|
|
"text": [
|
|
|
|
|
"2974556 vocab.txt\n"
|
|
|
|
|
]
|
|
|
|
|
}
|
|
|
|
|
],
|
|
|
|
|
"source": [
|
|
|
|
|
"! wc -l vocab.txt"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"Worse still, even if we take the whole vocabulary without any limit,\n",
"it still will not cover a sizeable share of the texts processed at inference time.\n",
"Let's see how many words from the development set are missing from the vocabulary.\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": 6,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"outputs": [
|
|
|
|
|
{
|
|
|
|
|
"name": "stdout",
|
|
|
|
|
"output_type": "stream",
|
|
|
|
|
"text": [
|
|
|
|
|
"81380\n"
|
|
|
|
|
]
|
|
|
|
|
}
|
|
|
|
|
],
|
|
|
|
|
"source": [
|
|
|
|
|
"! cat petite-difference-challenge2/dev-0/in.tsv | perl -C -ne 'print \"$&\\n\" while/\\p{L}+/g;' | sort -u | comm vocab.txt - -13 | wc -l"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"Such words are called **OOV** (*out-of-vocabulary*) words.\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
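The shell pipeline above can also be mirrored in Python. The following is a minimal sketch under stated assumptions: a word is a maximal run of letters, the pattern `[^\W\d_]+` is used as an approximation of Perl's `\p{L}+`, and the function names and toy sentences are hypothetical (they are not taken from the corpus).

```python
import re

# Runs of letters; an approximation of Perl's \p{L}+ using the standard re module.
LETTERS = re.compile(r"[^\W\d_]+")

def word_types(text):
    """Return the set of distinct words (types) found in a text."""
    return set(LETTERS.findall(text))

def oov_types(train_text, dev_text):
    """Word types seen in dev_text but absent from the training vocabulary."""
    return word_types(dev_text) - word_types(train_text)

# Toy data for illustration only.
train = "Ala ma kota, a kot ma Alę."
dev = "Ala ma psa i kota."
print(sorted(oov_types(train, dev)))
```

The set difference plays the role of `comm -13`: it keeps only the types unique to the second input.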
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"### Truncating the vocabulary\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"The simplest way to limit the vocabulary is simply to truncate it to the $N$ most frequent words.\n",
"\n",
"Let's try this on the “gender” corpus:\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": 7,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"outputs": [
|
|
|
|
|
{
|
|
|
|
|
"name": "stdout",
|
|
|
|
|
"output_type": "stream",
|
|
|
|
|
"text": [
|
|
|
|
|
"sort: błąd zapisu: 'standardowe wyjście': Przerwany potok\n",
|
|
|
|
|
"sort: błąd zapisu\n"
|
|
|
|
|
]
|
|
|
|
|
}
|
|
|
|
|
],
|
|
|
|
|
"source": [
|
|
|
|
|
"! xzcat petite-difference-challenge2/train/in.tsv.xz | perl -C -ne 'print \"$&\\n\" while/\\p{L}+/g;' | sort | uniq -c | sort -k 1rn | head -n 50000 | sort -k 2 > vocab50000.txt"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"This works better than one might expect. This way we discard only\n",
"very rare words (or words that occurred only once in the\n",
"corpus, the so-called *hapax legomena*), although there are very many such words.\n",
"\n",
"**Puzzle**: the 50000 most frequent words (1.9% of the **types**) cover what percentage of the **occurrences**?\n",
"\n",
"The normal distribution is not… normal in language: we will not come across it\n",
"when studying languages. Texts are dominated by “skewed” distributions with long,\n",
"“thin” tails.\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
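To answer questions like the puzzle above, one can compute what share of all occurrences (tokens) is covered by the N most frequent types. This is a minimal sketch, assuming a list of per-type counts. The function name `coverage` and the toy counts are hypothetical and are not results from the corpus.

```python
def coverage(freqs, n):
    """Fraction of all occurrences covered by the n most frequent types."""
    ordered = sorted(freqs, reverse=True)
    total = sum(ordered)
    return sum(ordered[:n]) / total if total else 0.0

# Toy Zipf-like counts: a few frequent types and a long tail of hapax legomena.
toy_freqs = [1000, 500, 333, 250, 200] + [1] * 95
share_of_types = 5 / len(toy_freqs)
print(f"{share_of_types:.0%} of types cover {coverage(toy_freqs, 5):.1%} of occurrences")
```

With the real data, the same function could be applied to the counts read from `freqs.txt` with `n=50000`.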
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": 1,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"! xzcat petite-difference-challenge2/train/in.tsv.xz | perl -C -ne 'print \"$&\\n\" while/\\p{L}+/g;' | sort | uniq -c | sort -k 1rn | cut -f 1 > freqs.txt"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": 9,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"outputs": [
|
|
|
|
|
{
|
|
|
|
|
"data": {
|
|
|
|
|
"text/plain": [
|
|
|
|
|
"'word-distribution.png'"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
"execution_count": 9,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"output_type": "execute_result"
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"data": {
|
|
|
|
"image/png": "[base64-encoded PNG omitted: plot of log word frequency against frequency rank]",
|
|
|
|
|
"text/plain": [
|
|
|
|
|
"<Figure size 432x288 with 1 Axes>"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
"metadata": {
|
|
|
|
|
"needs_background": "light"
|
|
|
|
|
},
|
|
|
|
|
"output_type": "display_data"
|
|
|
|
|
}
|
|
|
|
|
],
|
|
|
|
|
"source": [
|
|
|
|
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import re\n",
"from math import log\n",
"\n",
"freqs = []\n",
"\n",
"with open('freqs.txt', 'r') as fh:\n",
"    for line in fh:\n",
"        m = re.match(r'\\s*(\\d+)', line)\n",
"        if m:\n",
"            freqs.append(log(float(m.group(1))))\n",
"\n",
"plt.plot(range(len(freqs)), freqs)\n",
"fname = 'word-distribution.png'\n",
"plt.savefig(fname)\n",
"fname"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"![img](./obipy-resources/c0TrCn.png)\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"### Lemmatization\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"Lemmatization seems like a good idea, especially for languages with rich inflection:\n",
"\n",
"- we reduce the vocabulary considerably,\n",
"- inflected forms of the same word are treated identically (which seems reasonable).\n",
"\n",
"In practice, lemmatization is **not** used nowadays (in combination with\n",
"neural-network-based methods):\n",
"\n",
"- lemmatization requires linguistic knowledge (rules or a dictionary);\n",
"  creating such knowledge can be costly, and language-independent\n",
"  methods are currently preferred;\n",
"- we lose some of the information carried by the inflected form (which in particular\n",
"  cases can be unfortunate, e.g. Polish *aspiracja* “aspiration” vs. *aspiracje* “aspirations”);\n",
"- lemmatization is not a trivial problem because of ambiguities\n",
"  (*Lekarzu, lecz się sam*, “Physician, heal thyself”, where *lecz* is also the conjunction “but”);\n",
"- some ambiguities are systematic and the choice of lemma can be arbitrary,\n",
"  e.g. are *posiadanie*, *gotowanie*, *skakanie* nouns or verb forms?\n",
"  And what about *urządzenie*, *mieszkanie*?\n",
"- neural networks (or even simpler models such as Word2vec) are usually\n",
"  able to learn to reconstruct the relations between inflected forms\n",
"  (and more: incorrect forms, spelling errors, archaic forms, etc.).\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"### Going down to the character level\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"Since the word vocabulary is too large, why not go down to the level of characters?\n",
"\n",
"- a single character of the alphabet admittedly means nothing (what does *h* mean?)\n",
"\n",
"- … but the size of the input with one-hot encoding\n",
"  shrinks dramatically\n",
"\n",
"- it can work if we add a multi-layer neural\n",
"  network\n",
"\n",
"- … but it can be computationally very expensive\n",
"\n",
"Or maybe something in between characters and words?\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
|
"### BPE\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"Neither characters nor words, but something in between: subword units.\n",
"We could, for example, split the Polish word *superkomputera* (“of a supercomputer”) into two\n",
"units, *super/+/komputera*, or perhaps even three: *super/+/komputer/+/a*?\n",
"\n",
"The most popular algorithm for splitting text into subword units is BPE\n",
"(*byte-pair encoding*), inspired by data compression algorithms.\n",
"The list of units is induced automatically from text (no\n",
"knowledge of the language is needed!). Their number, however, must be fixed\n",
"in advance.\n",
"\n",
"In the initial step we mark the ends of words (tokens); we do this so\n",
"that subword units do not cross word boundaries.\n",
"\n",
"Then we perform as many iterations as the requested vocabulary\n",
"size. In each step we find the most frequent bigram (pair of adjacent units) and from\n",
"then on treat it as a single unit (we put it into a “box”).\n",
"\n",
"![img](./bpe.png)\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"#### Implementation in Python\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": 10,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"outputs": [
|
|
|
|
|
{
|
|
|
|
|
"data": {
|
|
|
|
|
"text/plain": [
|
|
|
|
|
"['e$', 'to', 'to$', 'be$', 't$', 'th', 'or', 'or$', 'no', 'not$']"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
"execution_count": 10,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"output_type": "execute_result"
|
|
|
|
|
}
|
|
|
|
|
],
|
|
|
|
|
"source": [
|
|
|
|
"from collections import Counter\n",
"\n",
"def replace_bigram(l, b, r):\n",
"    # Replace every occurrence of the pair b in list l with the single unit r (in place).\n",
"    i = 0\n",
"    while i < len(l) - 1:\n",
"        if (l[i], l[i+1]) == b:\n",
"            l[i:i+2] = [r]\n",
"        i += 1\n",
"    return l\n",
"\n",
"def learn_bpe_vocab(d, max_vocab_size):\n",
"    # Mark word ends with '$' and start from single characters.\n",
"    d = list(d.replace(' ', '$') + '$')\n",
"\n",
"    vocab = []\n",
"\n",
"    for _ in range(max_vocab_size):\n",
"        # Adjacent pairs; a unit ending in '$' closes a word, so it cannot start a pair.\n",
"        bigrams = [(d[i], d[i+1]) for i in range(len(d) - 1) if d[i][-1] != '$']\n",
"        if not bigrams:\n",
"            break  # nothing left to merge\n",
"        selected_bigram = Counter(bigrams).most_common(1)[0][0]\n",
"\n",
"        new_subword = selected_bigram[0] + selected_bigram[1]\n",
"        d = replace_bigram(d, selected_bigram, new_subword)\n",
"\n",
"        vocab.append(new_subword)\n",
"\n",
"    return vocab\n",
"\n",
"vocab1 = learn_bpe_vocab('to be or not to be that is the question', 10)\n",
"vocab1"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"The subword vocabulary can be applied to any text, e.g. to the text\n",
"on which the vocabulary was learned:\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": 11,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"outputs": [
|
|
|
|
|
{
|
|
|
|
|
"data": {
|
|
|
|
|
"text/plain": [
|
|
|
|
|
"'to$ be$ or$ not$ to$ be$ th a t$ i s $ th e$ q u e s t i o n $'"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
"execution_count": 11,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"output_type": "execute_result"
|
|
|
|
|
}
|
|
|
|
|
],
|
|
|
|
|
"source": [
|
|
|
|
"def apply_bpe_vocab(vocab, d):\n",
"    # Same preprocessing as in learning: '$' marks word ends.\n",
"    d = list(d.replace(' ', '$') + '$')\n",
"    vocab_set = set(vocab)\n",
"\n",
"    # Greedily merge any adjacent pair whose concatenation is in the vocabulary,\n",
"    # scanning left to right, and repeat until a full pass changes nothing.\n",
"    modified = True\n",
"    while modified:\n",
"        ix = 0\n",
"        modified = False\n",
"        while ix < len(d) - 1:\n",
"            bigram = d[ix] + d[ix+1]\n",
"            if bigram in vocab_set:\n",
"                d[ix:ix+2] = [bigram]\n",
"                modified = True\n",
"            else:\n",
"                ix += 1\n",
"\n",
"    return d\n",
"\n",
"' '.join(apply_bpe_vocab(vocab1, 'to be or not to be that is the question'))"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"Note that, apart from subword units, some isolated letters remain;\n",
"we usually add them to the vocabulary. (And usually the vocabulary is slightly\n",
"larger than the value given as a parameter when learning BPE: it is\n",
"larger by the single characters and special tokens such as `UNK`, `BOS`, `EOS`, `PAD`.)\n",
"\n",
"**Question**: What problem can arise when BPE is applied to text\n",
"that contains Chinese characters? How can it be dealt with?\n",
"\n",
"The subword vocabulary can be applied to any text:\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": 12,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"outputs": [
|
|
|
|
|
{
|
|
|
|
|
"data": {
|
|
|
|
|
"text/plain": [
|
|
|
|
|
"'to m $ w i l l $ be$ th e$ b e s t$'"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
"execution_count": 12,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"output_type": "execute_result"
|
|
|
|
|
}
|
|
|
|
|
],
|
|
|
|
|
"source": [
|
|
|
|
|
"' '.join(apply_bpe_vocab(vocab1, 'tom will be the best'))"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"As we can see, the BPE algorithm yields two kinds of subword units:\n",
"\n",
"- units that can be attached at the beginning of a word;\n",
"- units that end a word and, in particular, may be a whole word.\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
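Because every word-final unit carries the `$` marker, the segmentation is reversible. Below is a minimal sketch of the inverse operation. The helper name `decode_bpe` is hypothetical, and the sketch assumes the `$` end-of-word convention of the toy implementation above and an input text that does not itself contain `$`.

```python
def decode_bpe(tokens):
    """Rebuild the original text from subword tokens that use '$' as the end-of-word mark."""
    joined = "".join(tokens)                  # glue the subword units back together
    return joined.replace("$", " ").rstrip()  # each '$' was a word boundary

tokens = "to m $ w i l l $ be$ th e$ b e s t$".split()
print(decode_bpe(tokens))
```

The token list is the segmentation of the second example sentence shown above, so the call prints that sentence back.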
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"#### Off-the-shelf implementation\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"In NLP, BPE was first used for neural machine translation.\n",
"Let's use the module written by Rico Sennrich ([https://github.com/rsennrich/subword-nmt](https://github.com/rsennrich/subword-nmt)).\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": 1,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"! pip install subword-nmt"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"Let's induce a vocabulary for the training set of the author gender\n",
"identification task:\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": 1,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"! xzcat petite-difference-challenge2/train/in.tsv.xz | perl -C -ne 'print \"$&\\n\" while/\\p{L}+/g;' | python -m subword_nmt.learn_bpe -s 50000 -v > bpe_vocab.txt"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"The procedure takes a few minutes, so be patient (but the printing of bigrams will speed up!).\n",
"\n",
"    pair 0: n i -> ni (frequency 17625075)\n",
"    pair 1: i e -> ie (frequency 11471590)\n",
"    pair 2: c z -> cz (frequency 9143490)\n",
"    pair 3: ni e</w> -> nie</w> (frequency 7901783)\n",
"    pair 4: p o -> po (frequency 7790826)\n",
"    pair 5: r z -> rz (frequency 7542046)\n",
"    pair 6: s t -> st (frequency 7269069)\n",
"    pair 7: e m</w> -> em</w> (frequency 7207280)\n",
"    pair 8: d z -> dz (frequency 6860931)\n",
"    pair 9: s z -> sz (frequency 6609907)\n",
"    pair 10: r a -> ra (frequency 6601618)\n",
"    pair 11: o w -> ow (frequency 6395963)\n",
"    pair 12: i e</w> -> ie</w> (frequency 5906869)\n",
"    pair 13: n a -> na (frequency 5300380)\n",
"    pair 14: r o -> ro (frequency 5181363)\n",
"    pair 15: n a</w> -> na</w> (frequency 5125807)\n",
"    pair 16: a ł -> ał (frequency 4786696)\n",
"    pair 17: j e -> je (frequency 4599579)\n",
"    pair 18: s i -> si (frequency 4300984)\n",
"    pair 19: a l -> al (frequency 4276823)\n",
"    pair 20: t e -> te (frequency 4033344)\n",
"    pair 21: w i -> wi (frequency 3939063)\n",
"    pair 22: c h</w> -> ch</w> (frequency 3919410)\n",
"    pair 23: c h -> ch (frequency 3661410)\n",
"    pair 24: k o -> ko (frequency 3629840)\n",
"    pair 25: z a -> za (frequency 3625424)\n",
"    pair 26: t a -> ta (frequency 3570094)\n",
"    pair 27: p rz -> prz (frequency 3494551)\n",
"    pair 28: g o</w> -> go</w> (frequency 3279997)\n",
"    pair 29: a r -> ar (frequency 3081492)\n",
"    pair 30: si ę</w> -> się</w> (frequency 2973681)\n",
"    ...\n",
"    pair 49970: brz mieniu</w> -> brzmieniu</w> (frequency 483)\n",
"    pair 49971: bieżą cych</w> -> bieżących</w> (frequency 483)\n",
"    pair 49972: biegu nkę</w> -> biegunkę</w> (frequency 483)\n",
"    pair 49973: ban kowości</w> -> bankowości</w> (frequency 483)\n",
"    pair 49974: ba ku</w> -> baku</w> (frequency 483)\n",
"    pair 49975: ba cznie</w> -> bacznie</w> (frequency 483)\n",
"    pair 49976: Przypad kowo</w> -> Przypadkowo</w> (frequency 483)\n",
"    pair 49977: MA Ł -> MAŁ (frequency 483)\n",
"    pair 49978: Lep pera</w> -> Leppera</w> (frequency 483)\n",
"    pair 49979: Ko za -> Koza (frequency 483)\n",
"    pair 49980: Jak byś</w> -> Jakbyś</w> (frequency 483)\n",
"    pair 49981: Geni alne</w> -> Genialne</w> (frequency 483)\n",
"    pair 49982: Że nada</w> -> Żenada</w> (frequency 482)\n",
"    pair 49983: ń czykiem</w> -> ńczykiem</w> (frequency 482)\n",
"    pair 49984: zwie ń -> zwień (frequency 482)\n",
"    pair 49985: zost ałaś</w> -> zostałaś</w> (frequency 482)\n",
"    pair 49986: zni szczona</w> -> zniszczona</w> (frequency 482)\n",
"    pair 49987: ze stawi -> zestawi (frequency 482)\n",
"    pair 49988: za sób</w> -> zasób</w> (frequency 482)\n",
"    pair 49989: węd rówkę</w> -> wędrówkę</w> (frequency 482)\n",
"    pair 49990: wysko czyła</w> -> wyskoczyła</w> (frequency 482)\n",
"    pair 49991: wyle czenia</w> -> wyleczenia</w> (frequency 482)\n",
"    pair 49992: wychowaw cze</w> -> wychowawcze</w> (frequency 482)\n",
"    pair 49993: w t -> wt (frequency 482)\n",
"    pair 49994: un da -> unda (frequency 482)\n",
"    pair 49995: udzie lałem</w> -> udzielałem</w> (frequency 482)\n",
"    pair 49996: tę czy</w> -> tęczy</w> (frequency 482)\n",
"    pair 49997: tro sce</w> -> trosce</w> (frequency 482)\n",
"    pair 49998: słusz ności</w> -> słuszności</w> (frequency 482)\n",
"    pair 49999: su me</w> -> sume</w> (frequency 482)\n",
"\n",
"Let's now apply the induced BPE vocabulary to some real text.\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": 13,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"outputs": [
|
|
|
|
|
{
|
|
|
|
|
"name": "stdout",
|
|
|
|
|
"output_type": "stream",
|
|
|
|
|
"text": [
|
|
|
|
|
"Cier@@ piałem na straszne la@@ gi [...]"
|
|
|
|
|
]
|
|
|
|
|
}
|
|
|
|
|
],
|
|
|
|
|
"source": [
|
|
|
|
|
"! echo 'Cierpiałem na straszne lagi [...]' | perl -C -ne 'print \"$& \" while/\\p{L}+/g;' | python -m subword_nmt.apply_bpe -c bpe_vocab.txt"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
"This particular implementation uses the sequence `@@ ` to mark the end of a subword unit that does not end a word.\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
}
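Undoing this segmentation is a simple string substitution. The sketch below assumes the `@@` separator seen in the output above. The helper name `remove_bpe` is hypothetical (it is not a function of the subword-nmt package), and the regular expression is just one way to delete the markers.

```python
import re

def remove_bpe(line, separator="@@"):
    """Merge subword units back into words by deleting the separator markers."""
    sep = re.escape(separator)
    # Delete 'separator + space' inside the line and a dangling separator at its end.
    return re.sub(rf"({sep} )|({sep} ?$)", "", line)

print(remove_bpe("Cier@@ piałem na straszne la@@ gi"))
```

Applied to the segmented sentence from the cell above, this restores the original words.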
|
|
|
|
|
],
|
|
|
|
|
"metadata": {
|
|
|
|
|
"author": "Filip Graliński",
|
|
|
|
|
"email": "filipg@amu.edu.pl",
|
|
|
|
|
"kernelspec": {
|
|
|
|
|
"display_name": "Python 3 (ipykernel)",
|
|
|
|
|
"language": "python",
|
|
|
|
|
"name": "python3"
|
|
|
|
|
},
|
|
|
|
|
"lang": "pl",
|
|
|
|
|
"language_info": {
|
|
|
|
|
"codemirror_mode": {
|
|
|
|
|
"name": "ipython",
|
|
|
|
|
"version": 3
|
|
|
|
|
},
|
|
|
|
|
"file_extension": ".py",
|
|
|
|
|
"mimetype": "text/x-python",
|
|
|
|
|
"name": "python",
|
|
|
|
|
"nbconvert_exporter": "python",
|
|
|
|
|
"pygments_lexer": "ipython3",
|
|
|
|
|
"version": "3.9.6"
|
|
|
|
|
},
|
|
|
|
|
"org": null,
|
|
|
|
|
"subtitle": "12.Kodowanie BPE[wykład]",
|
|
|
|
|
"title": "Ekstrakcja informacji",
|
|
|
|
|
"year": "2021"
|
|
|
|
|
},
|
|
|
|
|
"nbformat": 4,
|
|
|
|
|
"nbformat_minor": 4
|
|
|
|
|
}
|