{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"source": [
|
|
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
|
|
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Information Extraction </h1>\n",
"<h2> 12. <i>BPE encoding</i> [lecture]</h2> \n",
|
|
"<h3> Filip Grali\u0144ski (2021)</h3>\n",
|
|
"</div>\n",
|
|
"\n",
|
|
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"## Splitting into subword units\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"### The vocabulary cannot be too large\u2026\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"If we use trainable word embeddings, we have to add them to the list of\n",
"parameters of the whole model: that is $|V|n$ weights, where $|V|$ is the\n",
"vocabulary size and $n$ is the embedding size; during training we additionally\n",
"have to store the gradients associated with the embeddings. GPU memory is of\n",
"course limited, so the vocabulary cannot be arbitrarily large. For a given GPU\n",
"model it is fairly easy to determine the maximum vocabulary size: it is a\n",
"\u201chard\u201d limit that we have to respect.\n",
|
|
"\n"
|
|
]
|
|
},
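{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the limit described above, the sketch below estimates the memory taken by the embedding table alone. The function name `embedding_memory_bytes`, the example sizes (3 million words, 512 dimensions) and the assumption of 4-byte (float32) values with one gradient value per weight are illustrative choices, not figures from this lecture; optimizer state would add more on top.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def embedding_memory_bytes(vocab_size, emb_size, bytes_per_value=4, training=True):\n",
"    # |V| * n weights in the embedding table\n",
"    n_weights = vocab_size * emb_size\n",
"    # assumption: during training we also keep one gradient value per weight\n",
"    copies = 2 if training else 1\n",
"    return n_weights * bytes_per_value * copies\n",
"\n",
"# hypothetical sizes: 3 million words, 512-dimensional embeddings\n",
"size = embedding_memory_bytes(3_000_000, 512)\n",
"print(f'{size / 2**30:.1f} GiB')"
]
},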
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"#### Can the vocabulary really be that large?\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"How many distinct inflected forms are there in Polish? Let's look at the PoliMorf dictionary\u2026\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"a\n",
|
|
"aa\n",
|
|
"AA\n",
|
|
"Aachen\n",
|
|
"Aalborg\n",
|
|
"Aalborgiem\n",
|
|
"Aalborgowi\n",
|
|
"Aalborgu\n",
|
|
"AAP\n",
|
|
"Aar\n",
|
|
"Aarem\n",
|
|
"Aarowi\n",
|
|
"Aaru\n",
|
|
"Aarze\n",
|
|
"Aara\n",
|
|
"Aar\u0105\n",
|
|
"Aar\u0119\n",
|
|
"Aaro\n",
|
|
"Aary\n",
|
|
"Aarze\n",
|
|
"uniq: b\u0142\u0105d zapisu: Przerwany potok\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"! wget -q 'http://zil.ipipan.waw.pl/PoliMorf?action=AttachFile&do=get&target=PoliMorf-0.6.7.tab.gz' -O - | zcat | cut -f 1 | uniq | head -n 20"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"3844535\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"! wget -q 'http://zil.ipipan.waw.pl/PoliMorf?action=AttachFile&do=get&target=PoliMorf-0.6.7.tab.gz' -O - | zcat | cut -f 1 | sort -u | wc -l"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"**Question** In which European language will there be even more word forms than in Polish?\n",
"\n",
"In fact there are even more forms; PoliMorf, of course, does not exhaust the set\u2026\n",
"\n",
"**Question** Give examples of \u201cobvious\u201d words that are missing from PoliMorf. How could such words be searched for systematically?\n",
"\n",
"On the other hand, PoliMorf contains many odd, \u201cartificial\u201d words.\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"niebia\u0142o\u015bcienn\u0105\n",
|
|
"nieponadosobowo\u015bci\n",
|
|
"niekneraj\u0105cy\n",
|
|
"inspektorat\u00f3w\n",
|
|
"Korytkowskich\n",
|
|
"elektrostatyczno\u015bci\n",
|
|
"Okola\n",
|
|
"bezs\u0142owny\n",
|
|
"indygowcu\n",
|
|
"gadany\n",
|
|
"nie\u0142adowarkowo\u015bciach\n",
|
|
"niepaw\u0119\u017cnicowate\n",
|
|
"Thom\n",
|
|
"poradlmy\n",
|
|
"olej\u0105cy\n",
|
|
"Ziemianin\u00f3w\n",
|
|
"stenotropizmami\n",
|
|
"wigiliowo\u015bci\n",
|
|
"pognanej\n",
|
|
"niekinezyterapeutycznym\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"! wget -q 'http://zil.ipipan.waw.pl/PoliMorf?action=AttachFile&do=get&target=PoliMorf-0.6.7.tab.gz' -O - | zcat | cut -f 1 | shuf -n 20"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"Alternatively, let's see how many distinct words occur in a real collection of texts. Consider\n",
"the texts collected for the task of identifying the gender of a text's author:\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"# Out[7]:"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"! git clone --single-branch --depth 1 git://gonito.net/petite-difference-challenge2"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"! xzcat petite-difference-challenge2/train/in.tsv.xz | perl -C -ne 'print \"$&\\n\" while/\\p{L}+/g;' | sort -u > vocab.txt"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"\u02c6\n",
|
|
"\u02c7\n",
|
|
"\uff9f\n",
|
|
"a\n",
|
|
"A\n",
|
|
"\u00e1\n",
|
|
"\u00c1\n",
|
|
"\u00e0\n",
|
|
"\u00c0\n",
|
|
"\u0103\n",
|
|
"\u0102\n",
|
|
"\u00e2\n",
|
|
"\u00c2\n",
|
|
"\u00e5\n",
|
|
"\u00c5\n",
|
|
"\u00e4\n",
|
|
"\u00c4\n",
|
|
"\u00c3\n",
|
|
"\u0101\n",
|
|
"aa\n",
|
|
"aA\n",
|
|
"Aa\n",
|
|
"AA\n",
|
|
"a\u0102\n",
|
|
"A\u0102\n",
|
|
"a\u00e2\n",
|
|
"a\u00c2\n",
|
|
"A\u00e2\n",
|
|
"a\u00c5\n",
|
|
"a\u00c4\n",
|
|
"\u00c2\u00aa\n",
|
|
"aaa\n",
|
|
"aAa\n",
|
|
"Aaa\n",
|
|
"AaA\n",
|
|
"AAa\n",
|
|
"AAA\n",
|
|
"aaaa\n",
|
|
"aAaa\n",
|
|
"Aaaa\n",
|
|
"AaAa\n",
|
|
"AAaa\n",
|
|
"AAAa\n",
|
|
"AAAA\n",
|
|
"aaaaa\n",
|
|
"Aaaaa\n",
|
|
"AaaaA\n",
|
|
"AAaaa\n",
|
|
"AAAAA\n",
|
|
"aaaaaa\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"! head -n 50 vocab.txt"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"2974556 vocab.txt\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"! wc -l vocab.txt"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"Worse still, even if we take the whole vocabulary without any limit, it will still\n",
"fail to cover a sizeable part of the texts processed at inference time.\n",
"Let's see how many words from the development set are missing from the vocabulary.\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"81380\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"! cat petite-difference-challenge2/dev-0/in.tsv | perl -C -ne 'print \"$&\\n\" while/\\p{L}+/g;' | sort -u | comm vocab.txt - -13 | wc -l"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"Such words are called **OOV** (*out-of-vocabulary*) words.\n",
|
|
"\n"
|
|
]
|
|
},
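{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same check can be sketched in plain Python on a toy example. The helper names `words` and `oov_words` and the sample sentences are made up for illustration, and the regular expression `[^\\W\\d_]+` is only an approximation of the Perl pattern `\\p{L}+` used above.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"def words(text):\n",
"    # runs of letters; an approximation of \\p{L}+ from the Perl one-liner\n",
"    return re.findall(r'[^\\W\\d_]+', text)\n",
"\n",
"def oov_words(train_text, dev_text):\n",
"    vocab = set(words(train_text))\n",
"    # word types seen in the dev text but never in the training text\n",
"    return sorted(set(words(dev_text)) - vocab)\n",
"\n",
"oov_words('to be or not to be', 'to be the best')"
]
},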
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"### Truncating the vocabulary\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"The simplest way to limit the vocabulary is to truncate it to the $N$ most frequent words.\n",
"\n",
"Let's try this on the \u201cgender\u201d corpus:\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"sort: b\u0142\u0105d zapisu: 'standardowe wyj\u015bcie': Przerwany potok\n",
|
|
"sort: b\u0142\u0105d zapisu\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"! xzcat petite-difference-challenge2/train/in.tsv.xz | perl -C -ne 'print \"$&\\n\" while/\\p{L}+/g;' | sort | uniq -c | sort -k 1rn | head -n 50000 | sort -k 2 > vocab50000.txt"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"This works better than one might expect. We discard only very rare words\n",
"(including those that occurred just once in the corpus, the so-called\n",
"*hapax legomena*), although there are very many of them.\n",
"\n",
"**Riddle**: what percentage of **occurrences** do the 50000 most frequent words (1.9% of **types**) cover?\n",
"\n",
"The normal distribution is not\u2026 normal in language: we will not come across it\n",
"when studying languages. Texts are dominated by \u201cskewed\u201d distributions with long,\n",
"\u201cthin\u201d tails.\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"! xzcat petite-difference-challenge2/train/in.tsv.xz | perl -C -ne 'print \"$&\\n\" while/\\p{L}+/g;' | sort | uniq -c | sort -k 1rn | cut -f 1 > freqs.txt"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 9,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"'word-distribution.png'"
|
|
]
|
|
},
|
|
"execution_count": 9,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
},
|
|
{
|
|
"data": {
|
|
"image/png": "\n",
|
|
"text/plain": [
|
|
"<Figure size 432x288 with 1 Axes>"
|
|
]
|
|
},
|
|
"metadata": {
|
|
"needs_background": "light"
|
|
},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"%matplotlib inline\n",
|
|
"import matplotlib.pyplot as plt\n",
|
|
"import re\n",
|
|
"from math import log\n",
|
|
"\n",
|
|
"freqs = []\n",
|
|
"\n",
|
|
"with open('freqs.txt', 'r') as fh:\n",
"    for line in fh:\n",
"        m = re.match(r'\\s*(\\d+)', line)\n",
"        if m:\n",
"            freqs.append(log(float(m.group(1))))\n",
|
|
"\n",
|
|
"plt.plot(range(len(freqs)), freqs)\n",
|
|
"fname = 'word-distribution.png'\n",
|
|
"plt.savefig(fname)\n",
|
|
"fname"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"![img](./obipy-resources/c0TrCn.png)\n",
|
|
"\n"
|
|
]
|
|
},
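{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to approach the coverage riddle is to compute it directly from frequency counts. The sketch below does this on a toy text; the function name `coverage` and the sample sentence are assumptions made for illustration, and for the real corpus the counts would have to be read from the `uniq -c` output instead.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"def coverage(counts, n):\n",
"    # share of all occurrences (tokens) covered by the n most frequent types\n",
"    total = sum(counts.values())\n",
"    covered = sum(c for _, c in counts.most_common(n))\n",
"    return covered / total\n",
"\n",
"# toy data: 10 tokens, 8 types\n",
"toy_counts = Counter('to be or not to be that is the question'.split())\n",
"coverage(toy_counts, 2)"
]
},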
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"### Lemmatization\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"Lemmatization looks like a good idea, especially for languages with rich inflection:\n",
"\n",
"- the vocabulary shrinks considerably,\n",
"- inflected forms of the same word are treated identically (which seems right).\n",
"\n",
"In practice, lemmatization is nowadays **not** used (in combination with\n",
"methods based on neural networks):\n",
"\n",
"- lemmatization requires linguistic knowledge (rules or a dictionary),\n",
"  creating such knowledge can be costly, and language-independent\n",
"  methods are currently preferred;\n",
"- we lose some of the information carried by the inflected form (which in particular\n",
"  cases can be unfortunate, e.g. *aspiracja* vs. *aspiracje*, the singular and plural of one noun, typically used in different senses);\n",
"- lemmatization is not a trivial problem because of ambiguity\n",
"  (*Lekarzu, lecz si\u0119 sam*, \u201cPhysician, heal thyself\u201d, where *lecz* can be the imperative \u201cheal\u201d or the conjunction \u201cbut\u201d);\n",
"- some ambiguities are systematic and the choice of lemma can be arbitrary,\n",
"  e.g. are *posiadanie*, *gotowanie*, *skakanie* nouns or verbs?\n",
"  And what about *urz\u0105dzenie*, *mieszkanie*?\n",
"- neural networks (and even simpler models such as Word2vec) are usually\n",
"  able to learn to reconstruct the relations between inflected forms\n",
"  (and more: incorrect forms, spelling errors, archaic forms, etc.).\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"### Going down to the character level\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"Since the word vocabulary is too large, why not go down to the level of characters?\n",
"\n",
"- a single letter of the alphabet admittedly means nothing (what does *h* mean?)\n",
"\n",
"- \u2026 but the size of the one-hot encoded input\n",
"  shrinks dramatically\n",
"\n",
"- it can work if a multi-layer neural network\n",
"  is added\n",
"\n",
"- \u2026 but it can be very expensive computationally\n",
"\n",
"Or maybe something in between characters and words?\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### BPE\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"Neither characters nor words, but something in between: subword\n",
"units. We could, for example, split the word *superkomputera* (\u201cof a supercomputer\u201d) into two\n",
"units, *super/+/komputera*, or perhaps even three: *super/+/komputer/+/a*?\n",
"\n",
"The most popular algorithm for splitting text into subword units is BPE\n",
"(*byte-pair encoding*), inspired by data compression algorithms.\n",
"The list of units is induced automatically from text (no knowledge\n",
"of the language is needed!). Their number, however, has to be fixed\n",
"in advance.\n",
"\n",
"In the initial step we mark the ends of words (tokens); we do this\n",
"so that subword units do not cross word boundaries.\n",
"\n",
"We then run as many iterations as the requested vocabulary size.\n",
"In each step we find the most frequent bigram (pair of adjacent units) and from\n",
"then on treat it as a single unit (we put it in a \u201cbox\u201d).\n",
|
|
"\n",
|
|
"![img](./bpe.png)\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"#### Implementation in Python\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 10,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"['e$', 'to', 'to$', 'be$', 't$', 'th', 'or', 'or$', 'no', 'not$']"
|
|
]
|
|
},
|
|
"execution_count": 10,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"from collections import Counter\n",
|
|
"\n",
"def replace_bigram(l, b, r):\n",
"    # replace every adjacent pair b in the list l with the merged unit r\n",
"    i = 0\n",
"    while i < len(l) - 1:\n",
"        if (l[i], l[i+1]) == b:\n",
"            l[i:i+2] = [r]\n",
"        i += 1\n",
"    return l\n",
"\n",
"def learn_bpe_vocab(d, max_vocab_size):\n",
"    # mark word ends with '$' and start from single characters\n",
"    d = list(d.replace(' ', '$') + '$')\n",
"\n",
"    vocab = []\n",
"\n",
"    for ix in range(0, max_vocab_size):\n",
"        # skip pairs whose first unit ends a word, so units never cross word boundaries\n",
"        bigrams = [(d[i], d[i+1]) for i in range(0, len(d) - 1) if d[i][-1] != '$']\n",
"        selected_bigram = Counter(bigrams).most_common(1)[0][0]\n",
"\n",
"        new_subword = selected_bigram[0] + selected_bigram[1]\n",
"        d = replace_bigram(d, selected_bigram, new_subword)\n",
"\n",
"        vocab.append(new_subword)\n",
"\n",
"    return vocab\n",
|
|
"\n",
|
|
"vocab1 = learn_bpe_vocab('to be or not to be that is the question', 10)\n",
|
|
"vocab1"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"The subword vocabulary can be applied to any text, e.g. to the text\n",
"on which the vocabulary was learned:\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 11,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"'to$ be$ or$ not$ to$ be$ th a t$ i s $ th e$ q u e s t i o n $'"
|
|
]
|
|
},
|
|
"execution_count": 11,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"def apply_bpe_vocab(vocab, d):\n",
"    d = list(d.replace(' ', '$') + '$')\n",
"    vocab_set = set(vocab)\n",
"\n",
"    modified = True\n",
"    while modified:\n",
"        ix = 0\n",
"        modified = False\n",
"        while ix < len(d) - 1:\n",
"            bigram = d[ix] + d[ix+1]\n",
"            if bigram in vocab_set:\n",
"                d[ix:ix+2] = [bigram]\n",
"                modified = True\n",
"            else:\n",
"                ix += 1\n",
"\n",
"    return d\n",
|
|
"\n",
|
|
"' '.join(apply_bpe_vocab(vocab1, 'to be or not to be that is the question'))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"Note that, besides the subword units, isolated letters remain;\n",
"we usually add them to the vocabulary. (And the vocabulary is usually a bit\n",
"larger than the value passed as the parameter when learning BPE: it is\n",
"larger by the characters and by special tokens such as `UNK`, `BOS`, `EOS`, `PAD`.)\n",
"\n",
"**Question**: What problem can arise when BPE is applied to text\n",
"containing Chinese characters? How can it be dealt with?\n",
"\n",
"The subword vocabulary can be applied to any text:\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 12,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"'to m $ w i l l $ be$ th e$ b e s t$'"
|
|
]
|
|
},
|
|
"execution_count": 12,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"' '.join(apply_bpe_vocab(vocab1, 'tom will be the best'))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"As can be seen, the BPE algorithm yields two kinds of subword units:\n",
"\n",
"- units that can be attached at the beginning of a word;\n",
"- units that form the end of a word, and in particular may be a whole word.\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"#### An off-the-shelf implementation\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"BPE was first used for neural machine translation.\n",
"Let's use the module written by Rico Sennrich ([https://github.com/rsennrich/subword-nmt](https://github.com/rsennrich/subword-nmt)).\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"! pip install subword-nmt"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"Let's induce a vocabulary for the training set of the task of identifying\n",
"the gender of a text's author:\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"! xzcat petite-difference-challenge2/train/in.tsv.xz | perl -C -ne 'print \"$&\\n\" while/\\p{L}+/g;' | python -m subword_nmt.learn_bpe -s 50000 -v > bpe_vocab.txt"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"The procedure takes a few minutes, so some patience is needed (but the printing of bigrams will speed up!).\n",
|
|
"\n",
"    pair 0: n i -> ni (frequency 17625075)\n",
"    pair 1: i e -> ie (frequency 11471590)\n",
"    pair 2: c z -> cz (frequency 9143490)\n",
"    pair 3: ni e</w> -> nie</w> (frequency 7901783)\n",
"    pair 4: p o -> po (frequency 7790826)\n",
"    pair 5: r z -> rz (frequency 7542046)\n",
"    pair 6: s t -> st (frequency 7269069)\n",
"    pair 7: e m</w> -> em</w> (frequency 7207280)\n",
"    pair 8: d z -> dz (frequency 6860931)\n",
"    pair 9: s z -> sz (frequency 6609907)\n",
"    pair 10: r a -> ra (frequency 6601618)\n",
"    pair 11: o w -> ow (frequency 6395963)\n",
"    pair 12: i e</w> -> ie</w> (frequency 5906869)\n",
"    pair 13: n a -> na (frequency 5300380)\n",
"    pair 14: r o -> ro (frequency 5181363)\n",
"    pair 15: n a</w> -> na</w> (frequency 5125807)\n",
"    pair 16: a \u0142 -> a\u0142 (frequency 4786696)\n",
"    pair 17: j e -> je (frequency 4599579)\n",
"    pair 18: s i -> si (frequency 4300984)\n",
"    pair 19: a l -> al (frequency 4276823)\n",
"    pair 20: t e -> te (frequency 4033344)\n",
"    pair 21: w i -> wi (frequency 3939063)\n",
"    pair 22: c h</w> -> ch</w> (frequency 3919410)\n",
"    pair 23: c h -> ch (frequency 3661410)\n",
"    pair 24: k o -> ko (frequency 3629840)\n",
"    pair 25: z a -> za (frequency 3625424)\n",
"    pair 26: t a -> ta (frequency 3570094)\n",
"    pair 27: p rz -> prz (frequency 3494551)\n",
"    pair 28: g o</w> -> go</w> (frequency 3279997)\n",
"    pair 29: a r -> ar (frequency 3081492)\n",
"    pair 30: si \u0119</w> -> si\u0119</w> (frequency 2973681)\n",
"    ...\n",
"    pair 49970: brz mieniu</w> -> brzmieniu</w> (frequency 483)\n",
"    pair 49971: bie\u017c\u0105 cych</w> -> bie\u017c\u0105cych</w> (frequency 483)\n",
"    pair 49972: biegu nk\u0119</w> -> biegunk\u0119</w> (frequency 483)\n",
"    pair 49973: ban kowo\u015bci</w> -> bankowo\u015bci</w> (frequency 483)\n",
"    pair 49974: ba ku</w> -> baku</w> (frequency 483)\n",
"    pair 49975: ba cznie</w> -> bacznie</w> (frequency 483)\n",
"    pair 49976: Przypad kowo</w> -> Przypadkowo</w> (frequency 483)\n",
"    pair 49977: MA \u0141 -> MA\u0141 (frequency 483)\n",
"    pair 49978: Lep pera</w> -> Leppera</w> (frequency 483)\n",
"    pair 49979: Ko za -> Koza (frequency 483)\n",
"    pair 49980: Jak by\u015b</w> -> Jakby\u015b</w> (frequency 483)\n",
"    pair 49981: Geni alne</w> -> Genialne</w> (frequency 483)\n",
"    pair 49982: \u017be nada</w> -> \u017benada</w> (frequency 482)\n",
"    pair 49983: \u0144 czykiem</w> -> \u0144czykiem</w> (frequency 482)\n",
"    pair 49984: zwie \u0144 -> zwie\u0144 (frequency 482)\n",
"    pair 49985: zost a\u0142a\u015b</w> -> zosta\u0142a\u015b</w> (frequency 482)\n",
"    pair 49986: zni szczona</w> -> zniszczona</w> (frequency 482)\n",
"    pair 49987: ze stawi -> zestawi (frequency 482)\n",
"    pair 49988: za s\u00f3b</w> -> zas\u00f3b</w> (frequency 482)\n",
"    pair 49989: w\u0119d r\u00f3wk\u0119</w> -> w\u0119dr\u00f3wk\u0119</w> (frequency 482)\n",
"    pair 49990: wysko czy\u0142a</w> -> wyskoczy\u0142a</w> (frequency 482)\n",
"    pair 49991: wyle czenia</w> -> wyleczenia</w> (frequency 482)\n",
"    pair 49992: wychowaw cze</w> -> wychowawcze</w> (frequency 482)\n",
"    pair 49993: w t -> wt (frequency 482)\n",
"    pair 49994: un da -> unda (frequency 482)\n",
"    pair 49995: udzie la\u0142em</w> -> udziela\u0142em</w> (frequency 482)\n",
"    pair 49996: t\u0119 czy</w> -> t\u0119czy</w> (frequency 482)\n",
"    pair 49997: tro sce</w> -> trosce</w> (frequency 482)\n",
"    pair 49998: s\u0142usz no\u015bci</w> -> s\u0142uszno\u015bci</w> (frequency 482)\n",
"    pair 49999: su me</w> -> sume</w> (frequency 482)\n",
|
|
"\n",
"Let's now apply the induced BPE vocabulary to some real text.\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 13,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Cier@@ pia\u0142em na straszne la@@ gi kilkana\u015bcie sekund lub d\u0142u\u017cej czarnego ekranu przy pr\u00f3bie prze\u0142\u0105@@ czenia si\u0119 uruchomienia prawie ka\u017cdej aplikacji Dodatkowo telefon mi si\u0119 wy\u0142\u0105@@ cza\u0142 czasem bez powodu sam z siebie albo rese@@ towa\u0142 Ostatnio nawet przegl\u0105darka zacz\u0119\u0142a si\u0119 cz\u0119sto zawie@@ sza\u0107 i Android proponowa\u0142 wymu@@ szone zamkni\u0119cie Do tego te problemy z po\u0142\u0105czeniem do komputera przez USB "
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"! echo 'Cierpia\u0142em na straszne lagi \u2013 kilkana\u015bcie sekund lub d\u0142u\u017cej czarnego ekranu przy pr\u00f3bie prze\u0142\u0105czenia si\u0119 / uruchomienia prawie ka\u017cdej aplikacji. Dodatkowo telefon mi si\u0119 wy\u0142\u0105cza\u0142 czasem bez powodu \u2013 sam z siebie, albo resetowa\u0142. Ostatnio nawet przegl\u0105darka zacz\u0119\u0142a si\u0119 cz\u0119sto zawiesza\u0107 i Android proponowa\u0142 wymuszone zamkni\u0119cie. Do tego te problemy z po\u0142\u0105czeniem do komputera przez USB.' | perl -C -ne 'print \"$& \" while/\\p{L}+/g;' | python -m subword_nmt.apply_bpe -c bpe_vocab.txt"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"This particular implementation uses the sequence `@@ ` to mark the end of a subword unit that is not the end of a word.\n",
|
|
"\n"
|
|
]
|
|
}
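,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get back the original words, the `@@ ` markers can simply be removed. The sketch below mirrors the `sed` substitution suggested in the subword-nmt README; the function name `undo_bpe` is a made-up helper for illustration, not part of the library.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"def undo_bpe(segmented):\n",
"    # drop '@@ ' inside the line and a trailing '@@' at its very end\n",
"    return re.sub(r'(@@ )|(@@ ?$)', '', segmented)\n",
"\n",
"undo_bpe('Cier@@ pia\u0142em na straszne la@@ gi')"
]
}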
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.9.2"
|
|
},
|
|
"org": null,
|
|
"author": "Filip Grali\u0144ski",
|
|
"email": "filipg@amu.edu.pl",
|
|
"lang": "pl",
"subtitle": "12. BPE encoding [lecture]",
"title": "Information Extraction",
|
|
"year": "2021"
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 4
|
|
} |