Jakub Pokrywka 2022-02-23 15:12:56 +01:00
parent 225a0414fd
commit c8128a9e08
6 changed files with 7909 additions and 195 deletions


@ -12,7 +12,15 @@
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"NR_INDEKSU = 375985"
]
},
@ -725,6 +733,13 @@
"- następnie wygeneruj z notebooka PDF (File → Download As → PDF via Latex).\n",
"- notebook z kodem oraz PDF zamieść w zakładce zadań w MS TEAMS"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {


@ -35,6 +35,15 @@
"## Literatura\n",
"Polecana literatura do przedmiotu:\n",
"\n",
"- Philipp Koehn. \"Neural Machine Translation\". 2020. (darmowa- https://www.cambridge.org/core/books/neural-machine-translation/7AAA628F88ADD64124EA008C425C0197)\n",
"- https://web.stanford.edu/~jurafsky/slp3/3.pdf\n",
"- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. North American Association for Computational Linguistics (NAACL).\n",
"- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research vol 21, number 140, pages 1-67.\n",
"- Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya. 2019. Language Models are Unsupervised Multitask Learners\n",
"- https://jalammar.github.io/illustrated-transformer/\n",
"- https://www.youtube.com/watch?v=-9evrZnBorM&ab_channel=YannicKilcher\n",
"- https://www.youtube.com/watch?v=u1_qMdb0kYU&ab_channel=YannicKilcher\n",
"\n",
"\n",
"\n",
"## Zaliczenie\n",


@ -17,7 +17,7 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
@ -52,33 +52,13 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"c = '⨃'"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"10755"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ord(c)"
]
},
{
"cell_type": "code",
"execution_count": 3,
@ -87,7 +67,7 @@
{
"data": {
"text/plain": [
"'⨃'"
"10755"
]
},
"execution_count": 3,
@ -96,7 +76,7 @@
}
],
"source": [
"chr(10755)"
"ord(c)"
]
},
{
@ -107,7 +87,7 @@
{
"data": {
"text/plain": [
"0"
"'⨃'"
]
},
"execution_count": 4,
@ -116,15 +96,7 @@
}
],
"source": [
"10755 - 2* 16**3 - 10* 16**2 - 0 * 16**1 - 3* 16**0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$10755_{10} = 2* 16^3 + 10* 16^2 + 0 * 16^1 + 3* 16^0 =$ U+2A03 \n",
"\n"
"chr(10755)"
]
},
{
@ -143,6 +115,34 @@
"output_type": "execute_result"
}
],
"source": [
"10755 - 2* 16**3 - 10* 16**2 - 0 * 16**1 - 3* 16**0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$10755_{10} = 2* 16^3 + 10* 16^2 + 0 * 16^1 + 3* 16^0 =$ U+2A03 \n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"10755 - 1*2**13 - 0*2**12 - 1*2**11 - 0*2**10 - 1*2**9 -0*2**8 -0*2**7-0*2**6-0*2**5-0*2**4-0*2**3-0*2**2-0*2**1 - 1*2**1 - 1*2**0"
]
@ -156,7 +156,7 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 7,
"metadata": {},
"outputs": [
{
@ -165,7 +165,7 @@
"14"
]
},
"execution_count": 6,
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
@ -176,7 +176,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 8,
"metadata": {},
"outputs": [
{
@ -185,7 +185,7 @@
"'0010101000000011'"
]
},
"execution_count": 7,
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
@ -210,7 +210,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 9,
"metadata": {},
"outputs": [
{
@ -219,7 +219,7 @@
"'11100010 10101000 10000011'"
]
},
"execution_count": 8,
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
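The cells in the hunks above break code point 10755 (U+2A03) into hexadecimal and binary digits and then show its UTF-8 bytes. A minimal sketch of the same steps in one place follows; the helper name `utf8_bits` is hypothetical, and the manual bit layout shown applies only to code points in the three-byte UTF-8 range (U+0800 to U+FFFF).

```python
def utf8_bits(ch: str) -> str:
    """Return the UTF-8 encoding of one character as space-separated octets."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))


c = chr(0x2A03)                    # the character used in the notebook
code_point = ord(c)                # decimal code point
hex_form = f"U+{code_point:04X}"   # conventional Unicode notation
bits16 = f"{code_point:016b}"      # code point padded to 16 bits

# Manual three-byte UTF-8 layout: 1110xxxx 10xxxxxx 10xxxxxx,
# where the x positions are filled with the 16 code point bits in order.
manual = " ".join([
    "1110" + bits16[:4],
    "10" + bits16[4:10],
    "10" + bits16[10:],
])

print(code_point, hex_form, bits16)
print(manual)
print(utf8_bits(c))
```

The manual layout and the result of `str.encode` agree, which is a quick way to check the bit pattern worked out by hand.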
@ -230,7 +230,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
@ -239,7 +239,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 11,
"metadata": {},
"outputs": [
{
@ -256,7 +256,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 12,
"metadata": {},
"outputs": [
{
@ -273,7 +273,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 13,
"metadata": {},
"outputs": [
{
@ -282,7 +282,7 @@
"'\\x0c'"
]
},
"execution_count": 12,
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
@ -302,19 +302,18 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 14,
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'NR_INDEKSU' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-13-ac7a6bf37d41>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mchr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mNR_INDEKSU\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0;36m100\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mNameError\u001b[0m: name 'NR_INDEKSU' is not defined"
]
"data": {
"text/plain": [
"'U'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
@ -323,18 +322,40 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 15,
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"'ϙ'"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chr(NR_INDEKSU % 1000)"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 16,
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"'\\U00012856'"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chr(NR_INDEKSU % 100000 - 123)"
]
@ -362,7 +383,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
@ -383,9 +404,21 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 18,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"00000000: 01111010 01100001 11000101 10111100 11000011 10110011 za....\r\n",
"00000006: 11000101 10000010 11000100 10000111 00100000 01100111 .... g\r\n",
"0000000c: 11000100 10011001 11000101 10011011 01101100 11000100 ....l.\r\n",
"00000012: 10000101 00100000 01101010 01100001 11000101 10111010 . ja..\r\n",
"00000018: 11000101 10000100 00001010 ...\r\n"
]
}
],
"source": [
"!xxd -b '01_materialy/polski_tekst.txt'"
]
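The `xxd -b` output above can be reproduced without leaving Python. The sketch below is an illustration, not part of the commit: `binary_dump` is a hypothetical helper that prints only the offset and bit columns (not the ASCII column), and the sample string is the text decoded from the bytes shown in that output rather than read from `01_materialy/polski_tekst.txt`.

```python
from typing import List


def binary_dump(data: bytes, width: int = 6) -> List[str]:
    """Format bytes as rows of '<hex offset>: <octets in binary>'."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        bits = " ".join(f"{byte:08b}" for byte in chunk)
        rows.append(f"{offset:08x}: {bits}")
    return rows


# Text decoded from the xxd output above (27 bytes in UTF-8).
text = "zażółć gęślą jaźń\n"
rows = binary_dump(text.encode("utf-8"))

for row in rows:
    print(row)
```

Every Polish diacritic takes two bytes here, which is why 18 characters occupy 27 bytes.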
@ -401,7 +434,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
@ -410,7 +443,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
@ -419,7 +452,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
@ -428,11 +461,22 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 22,
"metadata": {
"scrolled": false
},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"'00101100 00001010 01001010 01100001 01101011 00100000 01100010 01111001 11000101 10000010 00100000 01010011 01110100 01100101 01100110 01100101 01101011 00100000 01000010 01110101 01110010 01100011 01111010 01111001 01101101 01110101 01100011 01101000 01100001 11100010 10000000 10100110 00001010 11100010 10000000 10010100 00100000 01001010 01100001 00100000 01101110 01101001 01101011 01101111 01100111 01101111 00100000 01110011 01101001 11000100 10011001 00100000 01101110 01101001 01100101 00100000 01100010 01101111 01101010 11000100 10011001 00100001 00001010 01000011 01101000 01101111 11000100 10000111 01100010 01111001 00100000 01101110 01101001 01100101 01100100 11000101 10111010 01110111 01101001 01100101 01100100 11000101 10111010 11100010 10000000 10100110 00100000 01110100'"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tekst"
]
@ -462,7 +506,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
@ -471,7 +515,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
@ -480,7 +524,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
@ -489,11 +533,22 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 26,
"metadata": {
"scrolled": true
},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"'0x2e 0x20 0x31 0x36 0x37 0x30 0x2c 0x20 0x70 0x72 0x7a 0x65 0x64 0x20 0x75 0x70 0x61 0x64 0x6b 0x69 0x65 0x6d 0x20 0x4b 0x61 0x6d 0x69 0x65 0xc5 0x84 0x63 0x61 0x20 0x69 0x20 0x68 0x61 0x6e 0x69 0x65 0x62 0x6e 0x79 0x6d 0x69 0x20 0x75 0x6b 0xc5 0x82 0x61 0x64 0x61 0x6d 0x69 0x20 0x62 0x75 0x63 0x7a 0x61 0x63 0x6b 0x69 0x6d 0x69 0x2c 0x20 0x6b 0x74 0xc3 0xb3 0x72 0x65 0x20 0x6f 0x62 0x6f 0x77 0x69 0xc4 0x85 0x7a 0x79 0x77 0x61 0xc5 0x82'"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tekst"
]
@ -583,7 +638,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
@ -599,9 +654,20 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 28,
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"['A', 'a', 'b', 'ce', 'cef', 'Ą', 'ą', 'ż']"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sorted(przykladowa_lista)"
]
@ -615,9 +681,20 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 29,
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"['A', 'Ą', 'a', 'ą', 'b', 'ce', 'cef', 'ż']"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"['A', 'Ą', 'a', 'ą' ,'b', 'ce', 'cef', 'ż']"
]
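The two outputs above differ because `sorted` compares strings by code point, so `Ą`, `ą`, and `ż` land after all ASCII letters. One way to get the second ordering is an explicit alphabet used as a sort key. This is a sketch under assumptions: the `ALPHABET` string (uppercase letters before lowercase) and the `polish_key` helper are hypothetical, chosen only to reproduce the order shown in the last cell, and the list contents are taken from the output above.

```python
# Hypothetical ordering: all uppercase letters first, then all lowercase ones.
UPPER = "AĄBCĆDEĘFGHIJKLŁMNŃOÓPQRSŚTUVWXYZŹŻ"
ALPHABET = UPPER + UPPER.lower()
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def polish_key(word):
    """Map each character to its rank; unknown characters sort after the alphabet."""
    return [RANK.get(ch, len(ALPHABET) + ord(ch)) for ch in word]


words = ['ż', 'cef', 'ą', 'A', 'b', 'Ą', 'ce', 'a']

by_code_point = sorted(words)
by_alphabet = sorted(words, key=polish_key)

print(by_code_point)
print(by_alphabet)
```

A locale-aware alternative is `locale.strxfrm` as the key, but it depends on a Polish locale being installed on the system, so the explicit table is easier to run anywhere.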
@ -656,6 +733,13 @@
"- następnie wygeneruj z notebooka PDF (File → Download As → PDF via Latex).\n",
"- notebook z kodem oraz PDF zamieść w zakładce zadań w MS TEAMS"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {

7547
cw/02_Jezyk.ipynb Normal file

File diff suppressed because one or more lines are too long


@ -1,117 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 0. <i>Język</i> [ćwiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2022)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "code",
"execution_count": 278,
"metadata": {},
"outputs": [],
"source": [
"NR_INDEKSU = 375985"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ZNAJDŹ PRZYKŁAD TEKSTÓW Z TEJ SAMEJ DOMENY 1_000_000 słów:\n",
"- język angielski \n",
"- język polski\n",
"- język z rodziny romańskich\n",
"\n",
"Narzędzia:\n",
"- spacy\n",
"- nltk\n",
"\n",
"\n",
"\n",
"Dla każdego z języków:\n",
"- policz ilosć unikalnych słów (ze stemmingiem i bez)\n",
"- policz ilosć unikalnych znaków\n",
"- policz ilosć unikalnych zdań\n",
"- podaj ilość unikalnych \n",
"- podaj min, max, średnią oraz medianę ilości znaków w słowie\n",
"- podaj min, max, średnią oraz medianę ilości słów w zdaniu\n",
"- wygeneruj word cloud (normalnie i po usunięciu stopwordów)\n",
"- wypisz 20 najbardziej popularnych słów (normalnie i po usunięciu stopwordów)\n",
"- wypisz 20 najbardziej popularnych bigramów (normalnie i po usunięciu stopwordów)\n",
"- narysuj wykres częstotliwości słów w taki sposób żeby był maksymalnie czytelny, wypróbuj skali logarytmicznej x, y, usuwanie słów poniżej limitu wystąpień itp. \n",
"- dla próbki 10000 zdań sprawdź jak często langdetect https://pypi.org/project/langdetect/ się myli i jakie języki odgaduje \n",
"\n",
"\n",
"NAPISZ WNIOSKI"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ZADANIE\n",
"\n",
"Weź teksty w języku polskim:\n",
"- tekst prawny\n",
"- tekst z polskiego naukowy\n",
"- tekst z polskiego z powieści (wolne lektury)\n",
"- tekst z polskiego gg\n",
"- transkrypcja tekstu mówionego\n",
"\n",
"\n",
"- gunning_fog INDEX ( https://pypi.org/project/textstat/ ) \n",
"- średnia długość zdania\n",
"- narysuj na jednym wykresie te wartości\n",
"\n",
"\n",
"\n",
"\n",
"NAPISZ WNIOSKI\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"subtitle": "0.Informacje na temat przedmiotu[ćwiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}
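Several of the per-language statistics listed in the removed notebook (unique words and characters, word-length summary, most frequent words and bigrams) can be computed with the standard library alone. The sketch below is only an illustration: it uses a plain regular expression as a stand-in tokenizer instead of the spacy or nltk tools named in the task, skips stemming and stopword removal, and the names `tokenize` and `corpus_stats` are hypothetical.

```python
import re
import statistics
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def corpus_stats(text, top_n=3):
    """Collect a few of the statistics requested in the task."""
    words = tokenize(text)
    lengths = [len(w) for w in words]
    bigrams = list(zip(words, words[1:]))
    return {
        "unique_words": len(set(words)),
        "unique_chars": len(set(text)),
        # (min, max, mean, median) number of characters per word
        "word_len": (min(lengths), max(lengths),
                     statistics.mean(lengths), statistics.median(lengths)),
        "top_words": Counter(words).most_common(top_n),
        "top_bigrams": Counter(bigrams).most_common(top_n),
    }


sample = "the cat sat on the mat . the cat ran"
stats = corpus_stats(sample)
for name, value in stats.items():
    print(name, value)
```

On a real corpus of a million words the same counters still fit in memory; only the tokenizer would need to be swapped for a language-aware one.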


@ -0,0 +1,176 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 0. <i>Jezyk</i> [ćwiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2022)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "code",
"execution_count": 278,
"metadata": {},
"outputs": [],
"source": [
"NR_INDEKSU = 375985"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://web.stanford.edu/~jurafsky/slp3/3.pdf"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"class Model():\n",
" \n",
" def __init__(self, vocab_size=30_000, UNK_token= '<UNK>'):\n",
" pass\n",
" \n",
" def train(corpus:list) -> None:\n",
" pass\n",
" \n",
" def get_conditional_prob_for_word(text: list, word: str) -> float:\n",
" pass\n",
" \n",
" def get_prob_for_text(text: list) -> float:\n",
" pass\n",
" \n",
" def most_probable_next_word(text:list) -> str:\n",
" 'nie powinien zwracań nigdy <UNK>'\n",
" pass\n",
" \n",
" def high_probable_next_word(text:list) -> str:\n",
" 'nie powinien zwracań nigdy <UNK>'\n",
" pass\n",
" \n",
" def generate_text(text_beggining:list, length: int, greedy: bool) -> list:\n",
" 'nie powinien zwracań nigdy <UNK>'\n",
" pass"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"def get_ppl(text: list) -> float:\n",
" pass"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"def get_entropy(text: list) -> float:\n",
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- wybierz tekst w dowolnym języku (10_000_000 słów)\n",
"- podziel zbiór na train/test w proporcji 90/100\n",
"- stworzyć unigramowy model językowy\n",
"- stworzyć bigramowy model językowy\n",
"- stworzyć trigramowy model językowy\n",
"- wymyśl 5 krótkich zdań. Policz ich prawdopodobieństwo\n",
"- napisz włąsnoręcznie funkcję, która liczy perplexity na korpusie i policz perplexity na każdym z modeli dla train i test\n",
"- wygeneruj tekst, zaczynając od wymyślonych 5 początków. Postaraj się, żeby dla obu funkcji, a przynajmniej dla high_probable_next_word teksty były orginalne. Czy wynik będzię sie róźnił dla tekstów np.\n",
"`We sketch how LoomisWhitney follows from this: Indeed, let X be a uniformly distributed random variable with values` oraz `random variable with values`?\n",
"- stwórz model dla korpusu z ZADANIE 1 i policz perplexity dla każdego z tekstów (zrób split 90/10) dla train i test\n",
"\n",
"- klasyfikacja za pomocą modelu językowego\n",
"- wygładzanie metodą laplace'a"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### START ZADANIA"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### KONIEC ZADANIA"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- znajdź duży zbiór danych dla klasyfikacji binarnej, wytrenuj osobne modele dla każdej z klas i użyj dla klasyfikacji. Warunkiem zaliczenia jest uzyskanie wyniku większego niż baseline (zwracanie zawsze bardziej licznej klasy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## WYKONANIE ZADAŃ\n",
"Zgodnie z instrukcją 01_Kodowanie_tekstu.ipynb"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Teoria informacji"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Wygładzanie modeli językowych"
]
}
],
"metadata": {
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"subtitle": "0.Informacje na temat przedmiotu[ćwiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}
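The new notebook leaves the `Model` methods, `get_ppl`, and Laplace smoothing as exercises. As a rough orientation only, here is one possible sketch of a bigram model with add-one smoothing and a perplexity function. It is not the expected solution: the class `BigramModel` and its method names are hypothetical, it takes a single previous token rather than a full text list, and it ignores the `vocab_size` limit from the skeleton.

```python
import math
from collections import Counter


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing (illustrative)."""

    def __init__(self, unk_token="<UNK>"):
        self.unk = unk_token
        self.vocab = {unk_token}
        self.contexts = Counter()  # how often each token starts a bigram
        self.bigrams = Counter()   # counts of (previous, next) pairs

    def train(self, corpus):
        self.vocab = set(corpus) | {self.unk}
        self.contexts = Counter(corpus[:-1])
        self.bigrams = Counter(zip(corpus, corpus[1:]))

    def _known(self, token):
        """Replace out-of-vocabulary tokens with the unknown token."""
        return token if token in self.vocab else self.unk

    def cond_prob(self, prev, word):
        """P(word | prev) = (count(prev, word) + 1) / (count(prev) + |V|)."""
        prev, word = self._known(prev), self._known(word)
        return (self.bigrams[(prev, word)] + 1) / (self.contexts[prev] + len(self.vocab))

    def most_probable_next_word(self, prev):
        """Pick the most frequent continuation, never the unknown token."""
        prev = self._known(prev)
        candidates = sorted(self.vocab - {self.unk})
        return max(candidates, key=lambda w: self.bigrams[(prev, w)])

    def perplexity(self, text):
        """2 ** (average negative log2 probability per predicted token)."""
        pairs = list(zip(text, text[1:]))
        log_sum = sum(math.log2(self.cond_prob(p, w)) for p, w in pairs)
        return 2 ** (-log_sum / len(pairs))


model = BigramModel()
model.train("a b a b a c".split())
print(model.cond_prob("a", "b"))
print(model.most_probable_next_word("a"))
print(model.perplexity(["a", "b"]))
```

Counting contexts over `corpus[:-1]` keeps the smoothed probabilities for each previous token summing to one, and the exponent in `perplexity` is the per-token cross-entropy in bits, which ties the perplexity and entropy stubs together.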