reformat
parent 0f34dcdeb4
commit 3c0223d434
@ -2,14 +2,12 @@
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
|
||||
"<div class=\"alert alert-block alert-info\">\n",
|
||||
"<h1> Ekstrakcja informacji </h1>\n",
|
||||
"<h2> 0. <i>Course information</i> [exercises]</h2> \n",
|
||||
"<h2> 0. <i>Course information</i> [exercises]</h2> \n",
|
||||
"<h3> Jakub Pokrywka (2021)</h3>\n",
|
||||
"</div>\n",
|
||||
"\n",
|
||||
@ -20,74 +18,70 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# General information"
|
||||
"# General information"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Contacting the instructor\n",
|
||||
"## Contacting the instructor\n",
|
||||
"\n",
|
||||
"instructor: Jakub Pokrywka, MSc Eng.\n",
|
||||
"instructor: Jakub Pokrywka, MSc Eng.\n",
|
||||
"\n",
|
||||
"The best way to reach me is via MS Teams, either in the channel group (general matters) or by private message. I reply every 2-3 days. You can also arrange a call during office hours (Tue 12:00-13:00) or at another time.\n",
|
||||
"The best way to reach me is via MS Teams, either in the channel group (general matters) or by private message. I reply every 2-3 days. You can also arrange a call during office hours (Tue 12:00-13:00) or at another time.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"## Literature\n",
|
||||
"Recommended reading for the course:\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"- https://www.manning.com/books/relevant-search#toc (free) I recommend at least skimming it.\n",
|
||||
"- Marie-Francine Moens. 2006. Information Extraction: Algorithms and Prospects in a Retrieval Context. Springer. (recommended less, it is somewhat outdated)\n",
|
||||
"- https://www.manning.com/books/relevant-search#toc (free) I recommend at least skimming it.\n",
|
||||
"- Marie-Francine Moens. 2006. Information Extraction: Algorithms and Prospects in a Retrieval Context. Springer. (recommended less, it is somewhat outdated)\n",
|
||||
"- Alex Graves. 2012. Supervised sequence labelling. Studies in Computational Intelligence, vol 385. Springer. Berlin, Heidelberg. \n",
|
||||
"\n",
|
||||
"- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. North American Association for Computational Linguistics (NAACL). \n",
|
||||
"\n",
|
||||
"- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research vol 21, number 140, pages 1-67. \n",
|
||||
"\n",
|
||||
"- Filip Grali\u0144ski, Tomasz Stanis\u0142awek, Anna Wr\u00f3blewska, Dawid Lipi\u0144ski, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, Przemys\u0142aw Biecek. 2020. Kleister: A novel task for information extraction involving long documents with complex layout. URL https://arxiv.org/abs/2003.02356 \n",
|
||||
"- Filip Graliński, Tomasz Stanisławek, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, Przemysław Biecek. 2020. Kleister: A novel task for information extraction involving long documents with complex layout. URL https://arxiv.org/abs/2003.02356 \n",
|
||||
"\n",
|
||||
"- \u0141ukasz Garncarek, Rafa\u0142 Powalski, Tomasz Stanis\u0142awek, Bartosz Topolski, Piotr Halama, Filip Grali\u0144ski. 2020. LAMBERT: Layout-Aware (Language) Modeling using BERT. URL https://arxiv.org/pdf/2002.08087 \n",
|
||||
"- Łukasz Garncarek, Rafał Powalski, Tomasz Stanisławek, Bartosz Topolski, Piotr Halama, Filip Graliński. 2020. LAMBERT: Layout-Aware (Language) Modeling using BERT. URL https://arxiv.org/pdf/2002.08087 \n",
|
||||
"\n",
|
||||
"## Passing the course\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"At least 600 points will be available.\n",
|
||||
"At least 600 points will be available.\n",
|
||||
"\n",
|
||||
"Grade:\n",
|
||||
"\n",
|
||||
"- -299 \u2014 2\n",
|
||||
"- -299 — 2\n",
|
||||
"\n",
|
||||
"- 300-349 \u2014 3\n",
|
||||
"- 300-349 — 3\n",
|
||||
"\n",
|
||||
"- 350-399 \u2014 3+\n",
|
||||
"- 350-399 — 3+\n",
|
||||
"\n",
|
||||
"- 400-449 \u2014 4\n",
|
||||
"- 400-449 — 4\n",
|
||||
"\n",
|
||||
"- 450-499 \u2014 4+\n",
|
||||
"- 450-499 — 4+\n",
|
||||
"\n",
|
||||
"- 500- \u2014 5\n",
|
||||
"- 500- — 5\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"**To pass the course you must attend the labs. The maximum number of absences is 3. I will check attendance through the MS Teams panel, that is, I will check whether you have joined the class call. Anyone absent more than 3 times will not pass the course.** \n"
|
||||
"**To pass the course you must attend the labs. The maximum number of absences is 3. I will check attendance through the MS Teams panel, that is, I will check whether you have joined the class call. Anyone absent more than 3 times will not pass the course.** \n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"author": "Jakub Pokrywka",
|
||||
"email": "kubapok@wmi.amu.edu.pl",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"lang": "pl",
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
@ -100,10 +94,7 @@
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.3"
|
||||
},
|
||||
"author": "Jakub Pokrywka",
|
||||
"email": "kubapok@wmi.amu.edu.pl",
|
||||
"lang": "pl",
|
||||
"subtitle": "0.Course information[exercises]",
|
||||
"subtitle": "0.Course information[exercises]",
|
||||
"title": "Ekstrakcja informacji",
|
||||
"year": "2021"
|
||||
},
|
||||
|
@ -276,13 +276,6 @@
|
||||
"67. [Instytut Techniki Górniczej - wycinki](http://www.komag.gliwice.pl/archiwum/historia-komag)\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
|
@ -2,14 +2,12 @@
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
|
||||
"<div class=\"alert alert-block alert-info\">\n",
|
||||
"<h1> Ekstrakcja informacji </h1>\n",
|
||||
"<h2> 3. <i>tfidf (1)</i> [exercises]</h2> \n",
|
||||
"<h2> 3. <i>tfidf (1)</i> [exercises]</h2> \n",
|
||||
"<h3> Jakub Pokrywka (2021)</h3>\n",
|
||||
"</div>\n",
|
||||
"\n",
|
||||
@ -20,9 +18,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Class 2\n",
|
||||
"# Class 2\n",
|
||||
"\n",
|
||||
"In this class you can earn 5 points for each valuable contribution to the discussion. One person can earn at most 15 points in this session."
|
||||
"In this class you can earn 5 points for each valuable contribution to the discussion. One person can earn at most 15 points in this session."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -39,7 +37,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## document collection"
|
||||
"## document collection"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -48,11 +46,11 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"documents = ['Ala lubi zwierz\u0119ta i ma kota oraz psa!',\n",
|
||||
" 'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!',\n",
|
||||
" 'I Jan je\u017adzi na rowerze.',\n",
|
||||
" '2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym',\n",
|
||||
" 'Tomek lubi psy, ma psa i je\u017adzi na motorze i rowerze.',\n",
|
||||
"documents = ['Ala lubi zwierzęta i ma kota oraz psa!',\n",
|
||||
" 'Ola lubi zwierzęta oraz ma kota a także chomika!',\n",
|
||||
" 'I Jan jeździ na rowerze.',\n",
|
||||
" '2 wojna światowa była wielkim konfliktem zbrojnym',\n",
|
||||
" 'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.',\n",
|
||||
" ]"
|
||||
]
|
||||
},
|
||||
@ -61,11 +59,11 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### WHAT DO WE WANT?\n",
|
||||
"- we want to turn the texts into a set of words\n",
|
||||
"- we want to turn the texts into a set of words\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"### QUESTION\n",
|
||||
"- can we tokenize the text with, e.g., documents.split(' ')? What problems would then arise?"
|
||||
"- can we tokenize the text with, e.g., documents.split(' ')? What problems would then arise?"
|
||||
]
|
||||
},
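The tokenization question above can be tried out directly. The sketch below is only an illustration, not the notebook's own solution: `naive_tokenize` and `tokenize` are hypothetical helper names, and lowercasing plus a `\w+` regex is one assumed way to normalize the text. Note that `split` is a string method, so it has to be applied to each document rather than to the `documents` list itself.

```python
import re

# Two of the sample sentences from the document collection.
documents = [
    'Ala lubi zwierzęta i ma kota oraz psa!',
    'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.',
]


def naive_tokenize(text):
    # Split on single spaces only; case and punctuation are left untouched.
    return text.split(' ')


def tokenize(text):
    # Lowercase the text, then keep runs of word characters.
    # For str patterns in Python 3, \w also matches letters such as 'ę' or 'ź'.
    return re.findall(r'\w+', text.lower())


for doc in documents:
    print(naive_tokenize(doc))
    print(tokenize(doc))
```

Splitting on spaces leaves `'Ala'` capitalized and turns `'psa!'` and `'psy,'` into tokens distinct from `'psa'` and `'psy'`; lowercasing and matching word characters removes both problems for this small collection.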
|
||||
{
|
||||
@ -107,7 +105,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'ala lubi zwierz\u0119ta i ma kota oraz psa'"
|
||||
"'ala lubi zwierzęta i ma kota oraz psa'"
|
||||
]
|
||||
},
|
||||
"execution_count": 5,
|
||||
@ -144,7 +142,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"['ala', 'lubi', 'zwierz\u0119ta', 'i', 'ma', 'kota', 'oraz', 'psa']"
|
||||
"['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa']"
|
||||
]
|
||||
},
|
||||
"execution_count": 7,
|
||||
@ -173,11 +171,11 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"['ala lubi zwierz\u0119ta i ma kota oraz psa',\n",
|
||||
" 'ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika',\n",
|
||||
" 'i jan je\u017adzi na rowerze',\n",
|
||||
" '2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym',\n",
|
||||
" 'tomek lubi psy ma psa i je\u017adzi na motorze i rowerze']"
|
||||
"['ala lubi zwierzęta i ma kota oraz psa',\n",
|
||||
" 'ola lubi zwierzęta oraz ma kota a także chomika',\n",
|
||||
" 'i jan jeździ na rowerze',\n",
|
||||
" '2 wojna światowa była wielkim konfliktem zbrojnym',\n",
|
||||
" 'tomek lubi psy ma psa i jeździ na motorze i rowerze']"
|
||||
]
|
||||
},
|
||||
"execution_count": 9,
|
||||
@ -206,17 +204,17 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[['ala', 'lubi', 'zwierz\u0119ta', 'i', 'ma', 'kota', 'oraz', 'psa'],\n",
|
||||
" ['ola', 'lubi', 'zwierz\u0119ta', 'oraz', 'ma', 'kota', 'a', 'tak\u017ce', 'chomika'],\n",
|
||||
" ['i', 'jan', 'je\u017adzi', 'na', 'rowerze'],\n",
|
||||
" ['2', 'wojna', '\u015bwiatowa', 'by\u0142a', 'wielkim', 'konfliktem', 'zbrojnym'],\n",
|
||||
"[['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa'],\n",
|
||||
" ['ola', 'lubi', 'zwierzęta', 'oraz', 'ma', 'kota', 'a', 'także', 'chomika'],\n",
|
||||
" ['i', 'jan', 'jeździ', 'na', 'rowerze'],\n",
|
||||
" ['2', 'wojna', 'światowa', 'była', 'wielkim', 'konfliktem', 'zbrojnym'],\n",
|
||||
" ['tomek',\n",
|
||||
" 'lubi',\n",
|
||||
" 'psy',\n",
|
||||
" 'ma',\n",
|
||||
" 'psa',\n",
|
||||
" 'i',\n",
|
||||
" 'je\u017adzi',\n",
|
||||
" 'jeździ',\n",
|
||||
" 'na',\n",
|
||||
" 'motorze',\n",
|
||||
" 'i',\n",
|
||||
@ -237,8 +235,8 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## QUESTIONS\n",
|
||||
"- what is the next step towards building TF or TF-IDF vectors?\n",
|
||||
"- what size will a TF or TF-IDF vector be?\n"
|
||||
"- what is the next step towards building TF or TF-IDF vectors?\n",
|
||||
"- what size will a TF or TF-IDF vector be?\n"
|
||||
]
|
||||
},
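One possible next step is to collect every distinct token into a sorted vocabulary; a TF or TF-IDF vector then has one position per vocabulary entry. The sketch below assumes the regex tokenizer shown in the comments (a hypothetical helper, not necessarily the notebook's own code) and the five sample sentences.

```python
import re

documents = [
    'Ala lubi zwierzęta i ma kota oraz psa!',
    'Ola lubi zwierzęta oraz ma kota a także chomika!',
    'I Jan jeździ na rowerze.',
    '2 wojna światowa była wielkim konfliktem zbrojnym',
    'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.',
]


def tokenize(text):
    # Assumed preprocessing: lowercase and keep runs of word characters.
    return re.findall(r'\w+', text.lower())


tokenized = [tokenize(doc) for doc in documents]

# Every distinct token, sorted so that each word gets a stable index.
vocabulary = sorted({word for doc in tokenized for word in doc})

# The vector size equals the vocabulary size.
print(len(vocabulary))
print(vocabulary)
```

For these five sentences the vocabulary holds 26 distinct tokens, so under these assumptions every document vector would have length 26.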
|
||||
{
|
||||
@ -265,11 +263,11 @@
|
||||
"['2',\n",
|
||||
" 'a',\n",
|
||||
" 'ala',\n",
|
||||
" 'by\u0142a',\n",
|
||||
" 'była',\n",
|
||||
" 'chomika',\n",
|
||||
" 'i',\n",
|
||||
" 'jan',\n",
|
||||
" 'je\u017adzi',\n",
|
||||
" 'jeździ',\n",
|
||||
" 'konfliktem',\n",
|
||||
" 'kota',\n",
|
||||
" 'lubi',\n",
|
||||
@ -281,13 +279,13 @@
|
||||
" 'psa',\n",
|
||||
" 'psy',\n",
|
||||
" 'rowerze',\n",
|
||||
" 'tak\u017ce',\n",
|
||||
" 'także',\n",
|
||||
" 'tomek',\n",
|
||||
" 'wielkim',\n",
|
||||
" 'wojna',\n",
|
||||
" 'zbrojnym',\n",
|
||||
" 'zwierz\u0119ta',\n",
|
||||
" '\u015bwiatowa']"
|
||||
" 'zwierzęta',\n",
|
||||
" 'światowa']"
|
||||
]
|
||||
},
|
||||
"execution_count": 13,
|
||||
@ -305,14 +303,14 @@
|
||||
"source": [
|
||||
"## QUESTIONS\n",
|
||||
"\n",
|
||||
"what will the word \"jak\" look like in the TF vector representation?"
|
||||
"what will the word \"jak\" look like in the TF vector representation?"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### TASK 1 create a function word_to_index(word:str); it should return a one-hot vector as a numpy array"
|
||||
"### TASK 1 create a function word_to_index(word:str); it should return a one-hot vector as a numpy array"
|
||||
]
|
||||
},
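A minimal sketch of the `word_to_index` exercise above, under two loudly stated assumptions: a toy five-word vocabulary is used instead of the full one, and a plain Python list stands in for the numpy array the exercise asks for, so that the example has no dependencies (`np.array(vector)` would convert it). Returning an all-zero vector for an unknown word is also an assumption, not something the notebook prescribes.

```python
# Toy vocabulary (hypothetical subset of the real one), kept sorted.
vocabulary = ['ala', 'kota', 'lubi', 'ma', 'psa']

# Map each word to its position once, so lookups are O(1).
index_of = {word: position for position, word in enumerate(vocabulary)}


def word_to_index(word):
    # Start from a vector of zeros with one slot per vocabulary word.
    vector = [0] * len(vocabulary)
    position = index_of.get(word)
    # Out-of-vocabulary words leave the vector all zeros (an assumption).
    if position is not None:
        vector[position] = 1
    return vector


print(word_to_index('kota'))
print(word_to_index('jak'))
```

With this convention a word that never occurred in the collection, such as `'jak'`, has no dedicated dimension and therefore contributes nothing to a TF vector.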
|
||||
{
|
||||
@ -350,7 +348,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### TASK 2 WRITE A FUNCTION that takes a list of words and converts it into a TF vector\n"
|
||||
"### TASK 2 WRITE A FUNCTION that takes a list of words and converts it into a TF vector\n"
|
||||
]
|
||||
},
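The TF exercise above could be sketched as follows. This is an illustration only: `tf_vector` is a hypothetical name, the vocabulary is a toy subset, raw counts are assumed as the term-frequency definition, and a plain list is returned where the notebook would likely use a numpy array.

```python
from collections import Counter

# Toy vocabulary (hypothetical subset), kept sorted.
vocabulary = ['i', 'kota', 'lubi', 'ma', 'psa', 'rowerze']


def tf_vector(words):
    # Count how often each token occurs in the document.
    counts = Counter(words)
    # A Counter returns 0 for missing keys, so absent words get a 0 entry;
    # tokens outside the vocabulary are simply ignored.
    return [counts[word] for word in vocabulary]


tokens = ['tomek', 'lubi', 'psy', 'ma', 'psa', 'i',
          'jeździ', 'na', 'motorze', 'i', 'rowerze']
print(tf_vector(tokens))
```

The word `'i'` occurs twice in this token list, so its entry is 2, while `'kota'` does not occur and stays at 0.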
|
||||
{
|
||||
@ -481,7 +479,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### TASK 3 Write a function similarity that returns the cosine similarity between two documents in vectorized form"
|
||||
"### TASK 3 Write a function similarity that returns the cosine similarity between two documents in vectorized form"
|
||||
]
|
||||
},
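For the `similarity` exercise above, cosine similarity is the dot product of the two vectors divided by the product of their Euclidean norms. The sketch below uses only the standard library and plain sequences of numbers; returning 0.0 when either vector is all zeros is an assumption made here to avoid a division by zero.

```python
import math


def similarity(vec_a, vec_b):
    # Dot product of the two vectors.
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    # Euclidean norms of each vector.
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    # Assumed convention: an all-zero vector is similar to nothing.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(similarity([1, 1, 0], [1, 0, 0]))
print(similarity([1, 0], [0, 1]))
```

Because the result depends only on the angle between the vectors, a document and a copy of it repeated twice would still score 1 against each other.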
|
||||
{
|
||||
@ -502,7 +500,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Ala lubi zwierz\u0119ta i ma kota oraz psa!'"
|
||||
"'Ala lubi zwierzęta i ma kota oraz psa!'"
|
||||
]
|
||||
},
|
||||
"execution_count": 25,
|
||||
@ -543,7 +541,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!'"
|
||||
"'Ola lubi zwierzęta oraz ma kota a także chomika!'"
|
||||
]
|
||||
},
|
||||
"execution_count": 27,
|
||||
@ -656,7 +654,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Ala lubi zwierz\u0119ta i ma kota oraz psa!'"
|
||||
"'Ala lubi zwierzęta i ma kota oraz psa!'"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
@ -674,7 +672,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!'"
|
||||
"'Ola lubi zwierzęta oraz ma kota a także chomika!'"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
@ -692,7 +690,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'I Jan je\u017adzi na rowerze.'"
|
||||
"'I Jan jeździ na rowerze.'"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
@ -710,7 +708,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym'"
|
||||
"'2 wojna światowa była wielkim konfliktem zbrojnym'"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
@ -728,7 +726,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Tomek lubi psy, ma psa i je\u017adzi na motorze i rowerze.'"
|
||||
"'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
@ -745,7 +743,7 @@
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# this is how 2 words are handled\n",
|
||||
"# this is how 2 words are handled\n",
|
||||
"query = 'psa kota'\n",
|
||||
"for i in range(len(documents)):\n",
|
||||
" display(documents[i])\n",
|
||||
@ -760,7 +758,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Ala lubi zwierz\u0119ta i ma kota oraz psa!'"
|
||||
"'Ala lubi zwierzęta i ma kota oraz psa!'"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
@ -778,7 +776,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!'"
|
||||
"'Ola lubi zwierzęta oraz ma kota a także chomika!'"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
@ -796,7 +794,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'I Jan je\u017adzi na rowerze.'"
|
||||
"'I Jan jeździ na rowerze.'"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
@ -814,7 +812,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym'"
|
||||
"'2 wojna światowa była wielkim konfliktem zbrojnym'"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
@ -832,7 +830,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Tomek lubi psy, ma psa i je\u017adzi na motorze i rowerze.'"
|
||||
"'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
@ -866,7 +864,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Ala lubi zwierz\u0119ta i ma kota oraz psa!'"
|
||||
"'Ala lubi zwierzęta i ma kota oraz psa!'"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
@ -884,7 +882,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!'"
|
||||
"'Ola lubi zwierzęta oraz ma kota a także chomika!'"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
@ -902,7 +900,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'I Jan je\u017adzi na rowerze.'"
|
||||
"'I Jan jeździ na rowerze.'"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
@ -920,7 +918,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym'"
|
||||
"'2 wojna światowa była wielkim konfliktem zbrojnym'"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
@ -938,7 +936,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Tomek lubi psy, ma psa i je\u017adzi na motorze i rowerze.'"
|
||||
"'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
@ -955,7 +953,7 @@
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# that is why we need term frequency \u2192 more occurrences mean a better-matching document\n",
|
||||
"# that is why we need term frequency → more occurrences mean a better-matching document\n",
|
||||
"query = 'i'\n",
|
||||
"for i in range(len(documents)):\n",
|
||||
" display(documents[i])\n",
|
||||
@ -970,7 +968,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Ala lubi zwierz\u0119ta i ma kota oraz psa!'"
|
||||
"'Ala lubi zwierzęta i ma kota oraz psa!'"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
@ -988,7 +986,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Ola lubi zwierz\u0119ta oraz ma kota a tak\u017ce chomika!'"
|
||||
"'Ola lubi zwierzęta oraz ma kota a także chomika!'"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
@ -1006,7 +1004,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'I Jan je\u017adzi na rowerze.'"
|
||||
"'I Jan jeździ na rowerze.'"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
@ -1024,7 +1022,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'2 wojna \u015bwiatowa by\u0142a wielkim konfliktem zbrojnym'"
|
||||
"'2 wojna światowa była wielkim konfliktem zbrojnym'"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
@ -1042,7 +1040,7 @@
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'Tomek lubi psy, ma psa i je\u017adzi na motorze i rowerze.'"
|
||||
"'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.'"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
@ -1059,7 +1057,7 @@
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# that is why we need IDF - so that more important words get a larger weight\n",
|
||||
"# that is why we need IDF - so that more important words get a larger weight\n",
|
||||
"query = 'i chomika'\n",
|
||||
"for i in range(len(documents)):\n",
|
||||
" display(documents[i])\n",
|
||||
@ -1070,32 +1068,28 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### TASK 4 WRITE IDF to change the weights from TF to TF-IDF \n",
|
||||
"### TASK 4 WRITE IDF to change the weights from TF to TF-IDF \n",
|
||||
"\n",
|
||||
"Please use the version without any normalization\n",
|
||||
"Please use the version without any normalization\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"$idf_i = \\Large\\frac{|D|}{|\\{d : t_i \\in d \\}|}$\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"$|D|$ - the number of documents in the corpus\n",
|
||||
"$|\\{d : t_i \\in d \\}|$ - the number of documents in the corpus in which the given term occurs at least once"
|
||||
"$|D|$ - the number of documents in the corpus\n",
|
||||
"$|\\{d : t_i \\in d \\}|$ - the number of documents in the corpus in which the given term occurs at least once"
|
||||
]
|
||||
},
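The IDF formula above, the plain ratio of corpus size to document frequency with no logarithm or other normalization, can be sketched as below. The tokenizer, the helper name `tfidf_vector`, and the use of plain lists and dicts are assumptions made for illustration; many textbook definitions additionally take a logarithm of this ratio, which is deliberately left out here.

```python
import re
from collections import Counter

documents = [
    'Ala lubi zwierzęta i ma kota oraz psa!',
    'Ola lubi zwierzęta oraz ma kota a także chomika!',
    'I Jan jeździ na rowerze.',
    '2 wojna światowa była wielkim konfliktem zbrojnym',
    'Tomek lubi psy, ma psa i jeździ na motorze i rowerze.',
]


def tokenize(text):
    # Assumed preprocessing: lowercase and keep runs of word characters.
    return re.findall(r'\w+', text.lower())


tokenized = [tokenize(doc) for doc in documents]
vocabulary = sorted({word for doc in tokenized for word in doc})

# Document frequency: in how many documents each term occurs at least once.
document_frequency = Counter(word for doc in tokenized for word in set(doc))

# idf_i = |D| / |{d : t_i in d}|, without any normalization.
idf = {word: len(documents) / document_frequency[word] for word in vocabulary}


def tfidf_vector(words):
    # Raw term counts, each scaled by the IDF of its term.
    counts = Counter(words)
    return [counts[word] * idf[word] for word in vocabulary]


print(idf['i'], idf['chomika'])
```

Here `'chomika'` occurs in one of the five documents and `'i'` in three, so `'chomika'` gets weight 5.0 and `'i'` about 1.67; a query such as `'i chomika'` is then dominated by the rarer word.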
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"author": "Jakub Pokrywka",
|
||||
"email": "kubapok@wmi.amu.edu.pl",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"lang": "pl",
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
@ -1108,10 +1102,7 @@
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.3"
|
||||
},
|
||||
"author": "Jakub Pokrywka",
|
||||
"email": "kubapok@wmi.amu.edu.pl",
|
||||
"lang": "pl",
|
||||
"subtitle": "3.tfidf (1)[exercises]",
|
||||
"subtitle": "3.tfidf (1)[exercises]",
|
||||
"title": "Ekstrakcja informacji",
|
||||
"year": "2021"
|
||||
},
|
||||
|
@ -2,14 +2,12 @@
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
|
||||
"<div class=\"alert alert-block alert-info\">\n",
|
||||
"<h1> Ekstrakcja informacji </h1>\n",
|
||||
"<h2> 3. <i>tfidf (1)</i> [exercises]</h2> \n",
|
||||
"<h2> 3. <i>tfidf (1)</i> [exercises]</h2> \n",
|
||||
"<h3> Jakub Pokrywka (2021)</h3>\n",
|
||||
"</div>\n",
|
||||
"\n",
|
||||
@ -62,11 +60,14 @@
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"author": "Jakub Pokrywka",
|
||||
"email": "kubapok@wmi.amu.edu.pl",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"lang": "pl",
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
@ -79,10 +80,7 @@
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.3"
|
||||
},
|
||||
"author": "Jakub Pokrywka",
|
||||
"email": "kubapok@wmi.amu.edu.pl",
|
||||
"lang": "pl",
|
||||
"subtitle": "3.tfidf (1)[exercises]",
|
||||
"subtitle": "3.tfidf (1)[exercises]",
|
||||
"title": "Ekstrakcja informacji",
|
||||
"year": "2021"
|
||||
},
|
||||
|
@ -2,14 +2,12 @@
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
|
||||
"<div class=\"alert alert-block alert-info\">\n",
|
||||
"<h1> Ekstrakcja informacji </h1>\n",
|
||||
"<h2> 3. <i>tfidf (2)</i> [exercises]</h2> \n",
|
||||
"<h2> 3. <i>tfidf (2)</i> [exercises]</h2> \n",
|
||||
"<h3> Jakub Pokrywka (2021)</h3>\n",
|
||||
"</div>\n",
|
||||
"\n",
|
||||
@ -22,7 +20,7 @@
|
||||
"source": [
|
||||
"# Class 2\n",
|
||||
"\n",
|
||||
"Useful materials:\n",
|
||||
"Useful materials:\n",
|
||||
"\n",
|
||||
"https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html\n",
|
||||
"\n",
|
||||
@ -55,7 +53,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Dataset"
|
||||
"## Dataset"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -223,14 +221,14 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### what are the problems with this approach?\n"
|
||||
"### what are the problems with this approach?\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## TF-IDF and cosine distance - off-the-shelf libraries"
|
||||
"## TF-IDF and cosine distance - off-the-shelf libraries"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -450,217 +448,11 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 21,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"scrolled": false
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"From: ray@netcom.com (Ray Fischer)\n",
|
||||
"Subject: Re: x86 ~= 680x0 ?? (How do they compare?)\n",
|
||||
"Organization: Netcom. San Jose, California\n",
|
||||
"Distribution: usa\n",
|
||||
"Lines: 36\n",
|
||||
"\n",
|
||||
"dhk@ubbpc.uucp (Dave Kitabjian) writes ...\n",
|
||||
">I'm sure Intel and Motorola are competing neck-and-neck for \n",
|
||||
">crunch-power, but for a given clock speed, how do we rank the\n",
|
||||
">following (from 1st to 6th):\n",
|
||||
"> 486\t\t68040\n",
|
||||
"> 386\t\t68030\n",
|
||||
"> 286\t\t68020\n",
|
||||
"\n",
|
||||
"040 486 030 386 020 286\n",
|
||||
"\n",
|
||||
">While you're at it, where will the following fit into the list:\n",
|
||||
"> 68060\n",
|
||||
"> Pentium\n",
|
||||
"> PowerPC\n",
|
||||
"\n",
|
||||
"060 fastest, then Pentium, with the first versions of the PowerPC\n",
|
||||
"somewhere in the vicinity.\n",
|
||||
"\n",
|
||||
">And about clock speed: Does doubling the clock speed double the\n",
|
||||
">overall processor speed? And fill in the __'s below:\n",
|
||||
"> 68030 @ __ MHz = 68040 @ __ MHz\n",
|
||||
"\n",
|
||||
"No. Computer speed is only partly dependent of processor/clock speed.\n",
|
||||
"Memory system speed play a large role as does video system speed and\n",
|
||||
"I/O speed. As processor clock rates go up, the speed of the memory\n",
|
||||
"system becomes the greatest factor in the overall system speed. If\n",
|
||||
"you have a 50MHz processor, it can be reading another word from memory\n",
|
||||
"every 20ns. Sure, you can put all 20ns memory in your computer, but\n",
|
||||
"it will cost 10 times as much as the slower 80ns SIMMs.\n",
|
||||
"\n",
|
||||
"And roughly, the 68040 is twice as fast at a given clock\n",
|
||||
"speed as is the 68030.\n",
|
||||
"\n",
|
||||
"-- \n",
|
||||
"Ray Fischer \"Convictions are more dangerous enemies of truth\n",
|
||||
"ray@netcom.com than lies.\" -- Friedrich Nietzsche\n",
|
||||
"\n",
|
||||
"0.4778416465020907\n",
|
||||
"----------------------------------------------------------------------------------------------------\n",
|
||||
"----------------------------------------------------------------------------------------------------\n",
|
||||
"----------------------------------------------------------------------------------------------------\n",
|
||||
"From: rvenkate@ux4.cso.uiuc.edu (Ravikuma Venkateswar)\n",
|
||||
"Subject: Re: x86 ~= 680x0 ?? (How do they compare?)\n",
|
||||
"Distribution: usa\n",
|
||||
"Organization: University of Illinois at Urbana\n",
|
||||
"Lines: 59\n",
|
||||
"\n",
|
||||
"ray@netcom.com (Ray Fischer) writes:\n",
|
||||
"\n",
|
||||
">dhk@ubbpc.uucp (Dave Kitabjian) writes ...\n",
|
||||
">>I'm sure Intel and Motorola are competing neck-and-neck for \n",
|
||||
">>crunch-power, but for a given clock speed, how do we rank the\n",
|
||||
">>following (from 1st to 6th):\n",
|
||||
">> 486\t\t68040\n",
|
||||
">> 386\t\t68030\n",
|
||||
">> 286\t\t68020\n",
|
||||
"\n",
|
||||
">040 486 030 386 020 286\n",
|
||||
"\n",
|
||||
"How about some numbers here? Some kind of benchmark?\n",
|
||||
"If you want, let me start it - 486DX2-66 - 32 SPECint92, 16 SPECfp92 .\n",
|
||||
"\n",
|
||||
">>While you're at it, where will the following fit into the list:\n",
|
||||
">> 68060\n",
|
||||
">> Pentium\n",
|
||||
">> PowerPC\n",
|
||||
"\n",
|
||||
">060 fastest, then Pentium, with the first versions of the PowerPC\n",
|
||||
">somewhere in the vicinity.\n",
|
||||
"\n",
|
||||
"Numbers? Pentium @66MHz - 65 SPECint92, 57 SPECfp92 .\n",
|
||||
"\t PowerPC @66MHz - 50 SPECint92, 80 SPECfp92 . (Note this is the 601)\n",
|
||||
" (Alpha @150MHz - 74 SPECint92,126 SPECfp92 - just for comparison)\n",
|
||||
"\n",
|
||||
">>And about clock speed: Does doubling the clock speed double the\n",
|
||||
">>overall processor speed? And fill in the __'s below:\n",
|
||||
">> 68030 @ __ MHz = 68040 @ __ MHz\n",
|
||||
"\n",
|
||||
">No. Computer speed is only partly dependent of processor/clock speed.\n",
|
||||
">Memory system speed play a large role as does video system speed and\n",
|
||||
">I/O speed. As processor clock rates go up, the speed of the memory\n",
|
||||
">system becomes the greatest factor in the overall system speed. If\n",
|
||||
">you have a 50MHz processor, it can be reading another word from memory\n",
|
||||
">every 20ns. Sure, you can put all 20ns memory in your computer, but\n",
|
||||
">it will cost 10 times as much as the slower 80ns SIMMs.\n",
|
||||
"\n",
|
||||
"Not in a clock-doubled system. There isn't a doubling in performance, but\n",
|
||||
"it _is_ quite significant. Maybe about a 70% increase in performance.\n",
|
||||
"\n",
|
||||
"Besides, for 0 wait state performance, you'd need a cache anyway. I mean,\n",
|
||||
"who uses a processor that runs at the speed of 80ns SIMMs? Note that this\n",
|
||||
"memory speed corresponds to a clock speed of 12.5 MHz.\n",
|
||||
"\n",
|
||||
">And roughly, the 68040 is twice as fast at a given clock\n",
|
||||
">speed as is the 68030.\n",
|
||||
"\n",
|
||||
"Numbers?\n",
|
||||
"\n",
|
||||
">-- \n",
|
||||
">Ray Fischer \"Convictions are more dangerous enemies of truth\n",
|
||||
">ray@netcom.com than lies.\" -- Friedrich Nietzsche\n",
|
||||
"-- \n",
|
||||
"Ravikumar Venkateswar\n",
|
||||
"rvenkate@uiuc.edu\n",
|
||||
"\n",
|
||||
"A pun is a no' blessed form of whit.\n",
|
||||
"\n",
|
||||
"0.44292082969477664\n",
|
||||
"----------------------------------------------------------------------------------------------------\n",
|
||||
"----------------------------------------------------------------------------------------------------\n",
|
||||
"----------------------------------------------------------------------------------------------------\n",
|
||||
"From: ray@netcom.com (Ray Fischer)\n",
|
||||
"Subject: Re: x86 ~= 680x0 ?? (How do they compare?)\n",
|
||||
"Organization: Netcom. San Jose, California\n",
|
||||
"Distribution: usa\n",
|
||||
"Lines: 30\n",
|
||||
"\n",
|
||||
"rvenkate@ux4.cso.uiuc.edu (Ravikuma Venkateswar) writes ...\n",
|
||||
">ray@netcom.com (Ray Fischer) writes:\n",
|
||||
">>040 486 030 386 020 286\n",
|
||||
">\n",
|
||||
">How about some numbers here? Some kind of benchmark?\n",
|
||||
"\n",
|
||||
"Benchmarks are for marketing dweebs and CPU envy. OK, if it will make\n",
|
||||
"you happy, the 486 is faster than the 040. BFD. Both architectures\n",
|
||||
"are nearing then end of their lifetimes. And especially with the x86\n",
|
||||
"architecture: good riddance.\n",
|
||||
"\n",
|
||||
">Besides, for 0 wait state performance, you'd need a cache anyway. I mean,\n",
|
||||
">who uses a processor that runs at the speed of 80ns SIMMs? Note that this\n",
|
||||
">memory speed corresponds to a clock speed of 12.5 MHz.\n",
|
||||
"\n",
|
||||
"The point being the processor speed is only one of many aspects of a\n",
|
||||
"computers performance. Clock speed, processor, memory speed, CPU\n",
|
||||
"architecture, I/O systems, even the application program all contribute \n",
|
||||
"to the overall system performance.\n",
|
||||
"\n",
|
||||
">>And roughly, the 68040 is twice as fast at a given clock\n",
|
||||
">>speed as is the 68030.\n",
|
||||
">\n",
|
||||
">Numbers?\n",
|
||||
"\n",
|
||||
"Look them up yourself.\n",
|
||||
"\n",
|
||||
"-- \n",
|
||||
"Ray Fischer \"Convictions are more dangerous enemies of truth\n",
|
||||
"ray@netcom.com than lies.\" -- Friedrich Nietzsche\n",
|
||||
"\n",
|
||||
"0.3491800997095306\n",
|
||||
"----------------------------------------------------------------------------------------------------\n",
|
||||
"----------------------------------------------------------------------------------------------------\n",
|
||||
"----------------------------------------------------------------------------------------------------\n",
|
||||
"From: mb4008@cehp11 (Morgan J Bullard)\n",
|
||||
"Subject: Re: speeding up windows\n",
|
||||
"Keywords: speed\n",
|
||||
"Organization: University of Illinois at Urbana\n",
|
||||
"Lines: 30\n",
|
||||
"\n",
|
||||
"djserian@flash.LakeheadU.Ca (Reincarnation of Elvis) writes:\n",
|
||||
"\n",
|
||||
">I have a 386/33 with 8 megs of memory\n",
|
||||
"\n",
|
||||
">I have noticed that lately when I use programs like WpfW or Corel Draw\n",
|
||||
">my computer \"boggs\" down and becomes really sluggish!\n",
|
||||
"\n",
|
||||
">What can I do to increase performance? What should I turn on or off\n",
|
||||
"\n",
|
||||
">Will not loading wallpapers or stuff like that help when it comes to\n",
|
||||
">the running speed of windows and the programs that run under it?\n",
|
||||
"\n",
|
||||
">Thanx in advance\n",
|
||||
"\n",
|
||||
">Derek\n",
|
||||
"\n",
|
||||
"1) make sure your hard drive is defragmented. This will speed up more than \n",
|
||||
" just windows BTW. Use something like Norton's or PC Tools.\n",
|
||||
"2) I _think_ that leaving the wall paper out will use less RAM and therefore\n",
|
||||
" will speed up your machine but I could very will be wrong on this.\n",
|
||||
"There's a good chance you've already done this but if not it may speed things\n",
|
||||
"up. good luck\n",
|
||||
"\t\t\t\tMorgan Bullard mb4008@coewl.cen.uiuc.edu\n",
|
||||
"\t\t\t\t\t or mjbb@uxa.cso.uiuc.edu\n",
|
||||
"\n",
|
||||
">--\n",
|
||||
">$_ /|$Derek J.P. Serianni $ E-Mail : djserian@flash.lakeheadu.ca $ \n",
|
||||
">$\\'o.O' $Sociologist $ It's 106 miles to Chicago,we've got a full tank$\n",
|
||||
">$=(___)=$Lakehead University $ of gas, half a pack of cigarettes,it's dark,and$\n",
|
||||
">$ U $Thunder Bay, Ontario$ we're wearing sunglasses. -Elwood Blues $ \n",
|
||||
"\n",
|
||||
"0.26949927393886913\n",
|
||||
"----------------------------------------------------------------------------------------------------\n",
|
||||
"----------------------------------------------------------------------------------------------------\n",
|
||||
"----------------------------------------------------------------------------------------------------\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"for i in range (1,5):\n",
|
||||
" print(newsgroups[similarities.argsort()[0][-i]])\n",
|
||||
@ -677,19 +469,19 @@
|
||||
"## Zadanie domowe\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"- Wybra\u0107 zbi\u00f3r tekstowy, kt\u00f3ry ma conajmniej 10000 dokument\u00f3w (inny ni\u017c w tym przyk\u0142adzie).\n",
|
||||
"- Na jego podstawie stworzy\u0107 wyszukiwark\u0119 bazuj\u0105c\u0105 na OKAPI BM25, tzn. system kt\u00f3ry dla podanej frazy podaje kilka (5-10) posortowanych najbardziej pasuj\u0105cych dokument\u00f3w razem ze scorami. Nale\u017cy wypisywa\u0107 te\u017c ilo\u015b\u0107 zwracanych dokument\u00f3w, czyli takich z niezerowym scorem. Mo\u017cna korzysta\u0107 z gotowych bibliotek do wektoryzacji dokument\u00f3w, nale\u017cy jednak samemu zaimplementowa\u0107 OKAPI BM25. \n",
|
||||
"- Znale\u017a\u0107 fraz\u0119 (query), dla kt\u00f3rej wynik nie jest satysfakcjonuj\u0105cy.\n",
|
||||
"- Poprawi\u0107 wyszukiwark\u0119 (np. poprzez zmian\u0119 preprocessingu tekstu, wektoryzer, zmian\u0119 parametr\u00f3w algorytmu rankuj\u0105cego lub sam algorytm) tak, \u017ceby zwraca\u0142a satysfakcjonuj\u0105ce wyniki dla poprzedniej frazy. Nale\u017cy zrobi\u0107 inn\u0105 zmian\u0119 ni\u017c w tym przyk\u0142adzie, tylko wymy\u015bli\u0107 co\u015b w\u0142asnego.\n",
|
||||
"- prezentowa\u0107 prac\u0119 na nast\u0119pnych zaj\u0119ciach (14.04) odpowiadaj\u0105c na pytania:\n",
|
||||
" - jak wygl\u0105da zbi\u00f3r i system wyszukiwania przed zmianami\n",
|
||||
" - dla jakiej frazy wyniki s\u0105 niesatysfakcjonuj\u0105ce (pokaza\u0107 wyniki)\n",
|
||||
" - jakie zmiany zosta\u0142y naniesione\n",
|
||||
" - jak wygl\u0105daj\u0105 wyniki wyszukiwania po zmianach\n",
|
||||
" - jak zmiany wp\u0142yn\u0119\u0142y na wyniki (1-2 zdania)\n",
|
||||
"- Wybrać zbiór tekstowy, który ma conajmniej 10000 dokumentów (inny niż w tym przykładzie).\n",
|
||||
"- Na jego podstawie stworzyć wyszukiwarkę bazującą na OKAPI BM25, tzn. system który dla podanej frazy podaje kilka (5-10) posortowanych najbardziej pasujących dokumentów razem ze scorami. Należy wypisywać też ilość zwracanych dokumentów, czyli takich z niezerowym scorem. Można korzystać z gotowych bibliotek do wektoryzacji dokumentów, należy jednak samemu zaimplementować OKAPI BM25. \n",
|
||||
"- Znaleźć frazę (query), dla której wynik nie jest satysfakcjonujący.\n",
|
||||
"- Poprawić wyszukiwarkę (np. poprzez zmianę preprocessingu tekstu, wektoryzer, zmianę parametrów algorytmu rankującego lub sam algorytm) tak, żeby zwracała satysfakcjonujące wyniki dla poprzedniej frazy. Należy zrobić inną zmianę niż w tym przykładzie, tylko wymyślić coś własnego.\n",
|
||||
"- prezentować pracę na następnych zajęciach (14.04) odpowiadając na pytania:\n",
|
||||
" - jak wygląda zbiór i system wyszukiwania przed zmianami\n",
|
||||
" - dla jakiej frazy wyniki są niesatysfakcjonujące (pokazać wyniki)\n",
|
||||
" - jakie zmiany zostały naniesione\n",
|
||||
" - jak wyglądają wyniki wyszukiwania po zmianach\n",
|
||||
" - jak zmiany wpłynęły na wyniki (1-2 zdania)\n",
|
||||
" \n",
|
||||
"Prezentacja powinna by\u0107 maksymalnie prosta i trwa\u0107 maksymalnie 2-3 minuty.\n",
|
||||
"punkt\u00f3w do zdobycia: 60\n"
|
||||
"Prezentacja powinna być maksymalnie prosta i trwać maksymalnie 2-3 minuty.\n",
|
||||
"punktów do zdobycia: 60\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
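The homework above asks for a hand-written Okapi BM25 ranker that reports the number of documents with a non-zero score and the top few hits with their scores. Below is a minimal sketch of that idea, not the notebook's own code. Assumptions: lowercase whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant `ln(1 + (N - df + 0.5) / (df + 0.5))`. The names `tokenize`, `bm25_scores`, and `search` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; real preprocessing may differ.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            if tf[term] == 0:
                continue
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1.0 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


def search(query, docs, top_k=5):
    """Return (count of documents with non-zero score, top_k (index, score) pairs)."""
    scores = bm25_scores(query, docs)
    hits = [(i, s) for i, s in enumerate(scores) if s > 0.0]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return len(hits), hits[:top_k]


# Toy collection (hypothetical); the homework requires at least 10,000 documents.
docs = [
    "the 486 is faster than the 040",
    "defragment your hard drive to speed up windows",
    "memory speed and clock speed both matter",
]
n_hits, ranked = search("clock speed", docs)
print(n_hits, ranked)
```

Changing `k1` (term-frequency saturation) or `b` (length normalization) is one of the parameter adjustments the task mentions; swapping `tokenize` for a stemming or stop-word-aware version would be a preprocessing change.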
@ -701,11 +493,14 @@
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"author": "Jakub Pokrywka",
|
||||
"email": "kubapok@wmi.amu.edu.pl",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"lang": "pl",
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
@ -718,10 +513,7 @@
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.3"
|
||||
},
|
||||
"author": "Jakub Pokrywka",
|
||||
"email": "kubapok@wmi.amu.edu.pl",
|
||||
"lang": "pl",
|
||||
"subtitle": "3.tfidf (2)[\u0107wiczenia]",
|
||||
"subtitle": "3.tfidf (2)[ćwiczenia]",
|
||||
"title": "Ekstrakcja informacji",
|
||||
"year": "2021"
|
||||
},
|
||||
|
@ -85,13 +85,6 @@
|
||||
" * proszę zaznaczyć w MS TEAMS, że Państwo zrobili zadanie w assigments\n",
|
||||
" * zdawanie zadania będzie na zajęciach. Proszę przygotować prezentację do 5 minut"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
|
@ -2,14 +2,12 @@
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
|
||||
"<div class=\"alert alert-block alert-info\">\n",
|
||||
"<h1> Ekstrakcja informacji </h1>\n",
|
||||
"<h2> 6. <i>Klasyfikacja</i> [\u0107wiczenia]</h2> \n",
|
||||
"<h2> 6. <i>Klasyfikacja</i> [ćwiczenia]</h2> \n",
|
||||
"<h3> Jakub Pokrywka (2021)</h3>\n",
|
||||
"</div>\n",
|
||||
"\n",
|
||||
@ -20,14 +18,14 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Zaj\u0119cia klasyfikacja"
|
||||
"# Zajęcia klasyfikacja"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Zbi\u00f3r kleister"
|
||||
"## Zbiór kleister"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -56,7 +54,7 @@
|
||||
"source": [
|
||||
"### Pytanie\n",
|
||||
"\n",
|
||||
"Czy jurysdykcja musi by\u0107 zapisana explicite w umowie?"
|
||||
"Czy jurysdykcja musi być zapisana explicite w umowie?"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -163,7 +161,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Czy wszystkie stany musz\u0105 wyst\u0119powa\u0107 w zbiorze trenuj\u0105cym w zbiorze kleister?\n",
|
||||
"### Czy wszystkie stany muszą występować w zbiorze trenującym w zbiorze kleister?\n",
|
||||
"\n",
|
||||
"https://en.wikipedia.org/wiki/U.S._state\n",
|
||||
"\n",
|
||||
@ -269,104 +267,11 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"['New_York',\n",
|
||||
" 'New_York',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Massachusetts',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Washington',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'New_Jersey',\n",
|
||||
" 'New_York',\n",
|
||||
" 'NONE',\n",
|
||||
" 'NONE',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'New_York',\n",
|
||||
" 'Massachusetts',\n",
|
||||
" 'Minnesota',\n",
|
||||
" 'California',\n",
|
||||
" 'New_York',\n",
|
||||
" 'California',\n",
|
||||
" 'Iowa',\n",
|
||||
" 'California',\n",
|
||||
" 'Virginia',\n",
|
||||
" 'North_Carolina',\n",
|
||||
" 'Arizona',\n",
|
||||
" 'Indiana',\n",
|
||||
" 'New_Jersey',\n",
|
||||
" 'California',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Georgia',\n",
|
||||
" 'New_York',\n",
|
||||
" 'New_York',\n",
|
||||
" 'California',\n",
|
||||
" 'Minnesota',\n",
|
||||
" 'California',\n",
|
||||
" 'Kentucky',\n",
|
||||
" 'Minnesota',\n",
|
||||
" 'Ohio',\n",
|
||||
" 'Michigan',\n",
|
||||
" 'California',\n",
|
||||
" 'Minnesota',\n",
|
||||
" 'California',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Illinois',\n",
|
||||
" 'Minnesota',\n",
|
||||
" 'Texas',\n",
|
||||
" 'New_Jersey',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Washington',\n",
|
||||
" 'NONE',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Oregon',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Massachusetts',\n",
|
||||
" 'California',\n",
|
||||
" 'NONE',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Illinois',\n",
|
||||
" 'Idaho',\n",
|
||||
" 'Washington',\n",
|
||||
" 'New_York',\n",
|
||||
" 'New_York',\n",
|
||||
" 'California',\n",
|
||||
" 'Utah',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Washington',\n",
|
||||
" 'Virginia',\n",
|
||||
" 'New_York',\n",
|
||||
" 'New_York',\n",
|
||||
" 'Illinois',\n",
|
||||
" 'California',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'NONE',\n",
|
||||
" 'Texas',\n",
|
||||
" 'California',\n",
|
||||
" 'Washington',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Washington',\n",
|
||||
" 'New_York',\n",
|
||||
" 'Washington',\n",
|
||||
" 'Illinois']"
|
||||
]
|
||||
},
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dev_expected_jurisdiction"
|
||||
]
|
||||
@ -416,7 +321,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Co je\u017celi nazwy klas nie wyst\u0119puj\u0105 explicite w zbiorach?"
|
||||
"### Co jeżeli nazwy klas nie występują explicite w zbiorach?"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -473,7 +378,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Sprytne podej\u015bcie do klasyfikacji tekstu? Naiwny bayess"
|
||||
"### Sprytne podejście do klasyfikacji tekstu? Naiwny bayess"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -879,14 +784,14 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# listing dla get_prob2, s\u0142owo 'god'"
|
||||
"# listing dla get_prob2, słowo 'god'"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## za\u0142o\u017cenie naiwnego bayesa"
|
||||
"## założenie naiwnego bayesa"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -900,7 +805,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**przy za\u0142o\u017ceniu o niezale\u017cno\u015bci zmiennych losowych $word1$, $word2$, $word3$**:\n",
|
||||
"**przy założeniu o niezależności zmiennych losowych $word1$, $word2$, $word3$**:\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"$P(word1, word2, word3|class) = P(word1|class)* P(word2|class) * P(word3|class)$"
|
||||
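The independence assumption above is what lets a classifier score a multi-word document by multiplying per-word likelihoods. The sketch below illustrates that on a toy corpus; it is not the notebook's `get_prob3`. Assumptions: add-one (Laplace) smoothing and log-space arithmetic, neither of which is stated in the notebook, and the hypothetical names `train` and `log_posteriors`.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Collect class counts and per-class word counts from (label, tokens) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def log_posteriors(tokens, class_counts, word_counts, vocab):
    """Unnormalized log P(class | tokens) under the naive independence assumption."""
    total_docs = sum(class_counts.values())
    result = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        log_p = math.log(class_counts[label] / total_docs)  # log P(class)
        for word in tokens:
            # Add-one smoothing (assumption) keeps unseen words from zeroing the product.
            p_word = (word_counts[label][word] + 1) / (total_words + len(vocab))
            # The product of P(word|class) terms becomes a sum of logs.
            log_p += math.log(p_word)
        result[label] = log_p
    return result


# Toy training data (hypothetical), loosely echoing the newsgroup topics.
training = [
    ("guns", ["i", "love", "guns"]),
    ("guns", ["guns", "and", "rifles"]),
    ("religion", ["is", "there", "life", "after", "death"]),
    ("religion", ["god", "and", "life"]),
]
model = train(training)
scores = log_posteriors(["love", "guns"], *model)
best = max(scores, key=scores.get)
print(best, scores)
```

Working in log space avoids numeric underflow when a document has many words; the class with the largest log score is the prediction.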
@ -920,18 +825,18 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## zadania domowe naiwny bayes1 r\u0119cznie"
|
||||
"## zadania domowe naiwny bayes1 ręcznie"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- analogicznie zaimplementowa\u0107 funkcj\u0119 get_prob3(index, document_tokenized), argument document_tokenized ma by\u0107 zbiorem s\u0142\u00f3w dokumentu. funkcja ma by\u0107 naiwnym klasyfikatorem bayesowskim (w przypadku wielu s\u0142\u00f3w)\n",
|
||||
"- odpali\u0107 powy\u017cszy listing prawdopodobie\u0144stw z funkcj\u0105 get_prob3 dla dokument\u00f3w: {'i','love','guns'} oraz {'is','there','life','after'\n",
|
||||
"- analogicznie zaimplementować funkcję get_prob3(index, document_tokenized), argument document_tokenized ma być zbiorem słów dokumentu. funkcja ma być naiwnym klasyfikatorem bayesowskim (w przypadku wielu słów)\n",
|
||||
"- odpalić powyższy listing prawdopodobieństw z funkcją get_prob3 dla dokumentów: {'i','love','guns'} oraz {'is','there','life','after'\n",
|
||||
",'death'}\n",
|
||||
"- zadanie prosz\u0119 zrobi\u0107 w jupyterze, wygenerowa\u0107 pdf (kod + wyniki odpalenia) i umie\u015bci\u0107 go jako zadanie w teams\n",
|
||||
"- termin 12.05, punkt\u00f3w: 40\n"
|
||||
"- zadanie proszę zrobić w jupyterze, wygenerować pdf (kod + wyniki odpalenia) i umieścić go jako zadanie w teams\n",
|
||||
"- termin 12.05, punktów: 40\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -946,23 +851,26 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- wybra\u0107 jedno z poni\u017cszych repozytori\u00f3w i je sforkowa\u0107:\n",
|
||||
"- wybrać jedno z poniższych repozytoriów i je sforkować:\n",
|
||||
" - https://git.wmi.amu.edu.pl/kubapok/paranormal-or-skeptic-ISI-public\n",
|
||||
" - https://git.wmi.amu.edu.pl/kubapok/sport-text-classification-ball-ISI-public\n",
|
||||
"- stworzy\u0107 klasyfikator bazuj\u0105cy na naiwnym bayessie (mo\u017ce by\u0107 gotowa biblioteka), mo\u017ce te\u017c korzysta\u0107 z gotowych implementacji tfidf\n",
|
||||
"- stworzy\u0107 predykcje w plikach dev-0/out.tsv oraz test-A/out.tsv\n",
|
||||
"- wynik accuracy sprawdzony za pomoc\u0105 narz\u0119dzia geval (patrz poprzednie zadanie) powinien wynosi\u0107 conajmniej 0.67\n",
|
||||
"- prosz\u0119 umie\u015bci\u0107 predykcj\u0119 oraz skrypty generuj\u0105ce (w postaci tekstowej a nie jupyter) w repo, a w MS TEAMS umie\u015bci\u0107 link do swojego repo\n",
|
||||
"termin 12.05, 40 punkt\u00f3w\n"
|
||||
"- stworzyć klasyfikator bazujący na naiwnym bayessie (może być gotowa biblioteka), może też korzystać z gotowych implementacji tfidf\n",
|
||||
"- stworzyć predykcje w plikach dev-0/out.tsv oraz test-A/out.tsv\n",
|
||||
"- wynik accuracy sprawdzony za pomocą narzędzia geval (patrz poprzednie zadanie) powinien wynosić conajmniej 0.67\n",
|
||||
"- proszę umieścić predykcję oraz skrypty generujące (w postaci tekstowej a nie jupyter) w repo, a w MS TEAMS umieścić link do swojego repo\n",
|
||||
"termin 12.05, 40 punktów\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"author": "Jakub Pokrywka",
|
||||
"email": "kubapok@wmi.amu.edu.pl",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"lang": "pl",
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
@ -975,10 +883,7 @@
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.3"
|
||||
},
|
||||
"author": "Jakub Pokrywka",
|
||||
"email": "kubapok@wmi.amu.edu.pl",
|
||||
"lang": "pl",
|
||||
"subtitle": "6.Klasyfikacja[\u0107wiczenia]",
|
||||
"subtitle": "6.Klasyfikacja[ćwiczenia]",
|
||||
"title": "Ekstrakcja informacji",
|
||||
"year": "2021"
|
||||
},
|
||||
|
@ -2,14 +2,12 @@
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
|
||||
"<div class=\"alert alert-block alert-info\">\n",
|
||||
"<h1> Ekstrakcja informacji </h1>\n",
|
||||
"<h2> 6. <i>Klasyfikacja</i> [\u0107wiczenia]</h2> \n",
|
||||
"<h2> 6. <i>Klasyfikacja</i> [ćwiczenia]</h2> \n",
|
||||
"<h3> Jakub Pokrywka (2021)</h3>\n",
|
||||
"</div>\n",
|
||||
"\n",
|
||||
@ -20,14 +18,14 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Zaj\u0119cia klasyfikacja"
|
||||
"# Zajęcia klasyfikacja"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Zbi\u00f3r kleister"
|
||||
"## Zbiór kleister"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -56,7 +54,7 @@
|
||||
"source": [
|
||||
"### Pytanie\n",
|
||||
"\n",
|
||||
"Czy jurysdykcja musi by\u0107 zapisana explicite w umowie?"
|
||||
"Czy jurysdykcja musi być zapisana explicite w umowie?"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -163,7 +161,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Czy wszystkie stany musz\u0105 wyst\u0119powa\u0107 w zbiorze trenuj\u0105cym w zbiorze kleister?\n",
|
||||
"### Czy wszystkie stany muszą występować w zbiorze trenującym w zbiorze kleister?\n",
|
||||
"\n",
|
||||
"https://en.wikipedia.org/wiki/U.S._state\n",
|
||||
"\n",
|
||||
@ -269,104 +267,11 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"['New_York',\n",
|
||||
" 'New_York',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Massachusetts',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Washington',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'New_Jersey',\n",
|
||||
" 'New_York',\n",
|
||||
" 'NONE',\n",
|
||||
" 'NONE',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'New_York',\n",
|
||||
" 'Massachusetts',\n",
|
||||
" 'Minnesota',\n",
|
||||
" 'California',\n",
|
||||
" 'New_York',\n",
|
||||
" 'California',\n",
|
||||
" 'Iowa',\n",
|
||||
" 'California',\n",
|
||||
" 'Virginia',\n",
|
||||
" 'North_Carolina',\n",
|
||||
" 'Arizona',\n",
|
||||
" 'Indiana',\n",
|
||||
" 'New_Jersey',\n",
|
||||
" 'California',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Georgia',\n",
|
||||
" 'New_York',\n",
|
||||
" 'New_York',\n",
|
||||
" 'California',\n",
|
||||
" 'Minnesota',\n",
|
||||
" 'California',\n",
|
||||
" 'Kentucky',\n",
|
||||
" 'Minnesota',\n",
|
||||
" 'Ohio',\n",
|
||||
" 'Michigan',\n",
|
||||
" 'California',\n",
|
||||
" 'Minnesota',\n",
|
||||
" 'California',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Illinois',\n",
|
||||
" 'Minnesota',\n",
|
||||
" 'Texas',\n",
|
||||
" 'New_Jersey',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Washington',\n",
|
||||
" 'NONE',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Oregon',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Massachusetts',\n",
|
||||
" 'California',\n",
|
||||
" 'NONE',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Illinois',\n",
|
||||
" 'Idaho',\n",
|
||||
" 'Washington',\n",
|
||||
" 'New_York',\n",
|
||||
" 'New_York',\n",
|
||||
" 'California',\n",
|
||||
" 'Utah',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Washington',\n",
|
||||
" 'Virginia',\n",
|
||||
" 'New_York',\n",
|
||||
" 'New_York',\n",
|
||||
" 'Illinois',\n",
|
||||
" 'California',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'NONE',\n",
|
||||
" 'Texas',\n",
|
||||
" 'California',\n",
|
||||
" 'Washington',\n",
|
||||
" 'Delaware',\n",
|
||||
" 'Washington',\n",
|
||||
" 'New_York',\n",
|
||||
" 'Washington',\n",
|
||||
" 'Illinois']"
|
||||
]
|
||||
},
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dev_expected_jurisdiction"
|
||||
]
|
||||
@ -416,7 +321,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Co je\u017celi nazwy klas nie wyst\u0119puj\u0105 explicite w zbiorach?"
|
||||
"### Co jeżeli nazwy klas nie występują explicite w zbiorach?"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -473,23 +378,14 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Sprytne podej\u015bcie do klasyfikacji tekstu? Naiwny bayess"
|
||||
"### Sprytne podejście do klasyfikacji tekstu? Naiwny bayess"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/home/kuba/anaconda3/lib/python3.8/site-packages/gensim/similarities/__init__.py:15: UserWarning: The gensim.similarities.levenshtein submodule is disabled, because the optional Levenshtein package <https://pypi.org/project/python-Levenshtein/> is unavailable. Install Levenhstein (e.g. `pip install python-Levenshtein`) to suppress this warning.\n",
|
||||
" warnings.warn(msg)\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from sklearn.datasets import fetch_20newsgroups\n",
|
||||
"# https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html\n",
|
||||
@ -1033,7 +929,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## za\u0142o\u017cenie naiwnego bayesa"
|
||||
"## założenie naiwnego bayesa"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1047,7 +943,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**przy za\u0142o\u017ceniu o niezale\u017cno\u015bci zmiennych losowych $word1$, $word2$, $word3$**:\n",
|
||||
"**przy założeniu o niezależności zmiennych losowych $word1$, $word2$, $word3$**:\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"$P(word1, word2, word3|class) = P(word1|class)* P(word2|class) * P(word3|class)$"
|
||||
@ -1067,18 +963,18 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## zadania domowe naiwny bayes1 r\u0119cznie"
|
||||
"## zadania domowe naiwny bayes1 ręcznie"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- analogicznie zaimplementowa\u0107 funkcj\u0119 get_prob3(index, document_tokenized), argument document_tokenized ma by\u0107 zbiorem s\u0142\u00f3w dokumentu. funkcja ma by\u0107 naiwnym klasyfikatorem bayesowskim (w przypadku wielu s\u0142\u00f3w)\n",
|
||||
"- odpali\u0107 powy\u017cszy listing prawdopodobie\u0144stw z funkcj\u0105 get_prob3 dla dokument\u00f3w: {'i','love','guns'} oraz {'is','there','life','after'\n",
|
||||
"- analogicznie zaimplementować funkcję get_prob3(index, document_tokenized), argument document_tokenized ma być zbiorem słów dokumentu. funkcja ma być naiwnym klasyfikatorem bayesowskim (w przypadku wielu słów)\n",
|
||||
"- odpalić powyższy listing prawdopodobieństw z funkcją get_prob3 dla dokumentów: {'i','love','guns'} oraz {'is','there','life','after'\n",
|
||||
",'death'}\n",
|
||||
"- zadanie prosz\u0119 zrobi\u0107 w jupyterze, wygenerowa\u0107 pdf (kod + wyniki odpalenia) i umie\u015bci\u0107 go jako zadanie w teams\n",
|
||||
"- termin 12.05, punkt\u00f3w: 40\n"
|
||||
"- zadanie proszę zrobić w jupyterze, wygenerować pdf (kod + wyniki odpalenia) i umieścić go jako zadanie w teams\n",
|
||||
"- termin 12.05, punktów: 40\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1092,23 +988,26 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"- wybra\u0107 jedno z poni\u017cszych repozytori\u00f3w i je sforkowa\u0107:\n",
|
||||
"- wybrać jedno z poniższych repozytoriów i je sforkować:\n",
|
||||
" - https://git.wmi.amu.edu.pl/kubapok/paranormal-or-skeptic-ISI-public\n",
|
||||
" - https://git.wmi.amu.edu.pl/kubapok/sport-text-classification-ball-ISI-public\n",
|
||||
"- stworzy\u0107 klasyfikator bazuj\u0105cy na naiwnym bayessie (mo\u017ce by\u0107 gotowa biblioteka), mo\u017ce te\u017c korzysta\u0107 z gotowych implementacji tfidf\n",
|
||||
"- stworzy\u0107 predykcje w plikach dev-0/out.tsv oraz test-A/out.tsv\n",
|
||||
"- wynik accuracy sprawdzony za pomoc\u0105 narz\u0119dzia geval (patrz poprzednie zadanie) powinien wynosi\u0107 conajmniej 0.67\n",
|
||||
"- prosz\u0119 umie\u015bci\u0107 predykcj\u0119 oraz skrypty generuj\u0105ce (w postaci tekstowej a nie jupyter) w repo, a w MS TEAMS umie\u015bci\u0107 link do swojego repo\n",
|
||||
"termin 12.05, 40 punkt\u00f3w\n"
|
||||
"- stworzyć klasyfikator bazujący na naiwnym bayessie (może być gotowa biblioteka), może też korzystać z gotowych implementacji tfidf\n",
|
||||
"- stworzyć predykcje w plikach dev-0/out.tsv oraz test-A/out.tsv\n",
|
||||
"- wynik accuracy sprawdzony za pomocą narzędzia geval (patrz poprzednie zadanie) powinien wynosić conajmniej 0.67\n",
|
||||
"- proszę umieścić predykcję oraz skrypty generujące (w postaci tekstowej a nie jupyter) w repo, a w MS TEAMS umieścić link do swojego repo\n",
|
||||
"termin 12.05, 40 punktów\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"author": "Jakub Pokrywka",
|
||||
"email": "kubapok@wmi.amu.edu.pl",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"lang": "pl",
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
@ -1121,10 +1020,7 @@
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.3"
|
||||
},
|
||||
"author": "Jakub Pokrywka",
|
||||
"email": "kubapok@wmi.amu.edu.pl",
|
||||
"lang": "pl",
|
||||
"subtitle": "6.Klasyfikacja[\u0107wiczenia]",
|
||||
"subtitle": "6.Klasyfikacja[ćwiczenia]",
|
||||
"title": "Ekstrakcja informacji",
|
||||
"year": "2021"
|
||||
},
|
||||
|
@ -30,20 +30,11 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages/gensim/similarities/__init__.py:15: UserWarning: The gensim.similarities.levenshtein submodule is disabled, because the optional Levenshtein package <https://pypi.org/project/python-Levenshtein/> is unavailable. Install Levenhstein (e.g. `pip install python-Levenshtein`) to suppress this warning.\n",
|
||||
" warnings.warn(msg)\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import numpy as np\n",
|
||||
"import gensim\n",
|
||||
|
@ -30,20 +30,11 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages/gensim/similarities/__init__.py:15: UserWarning: The gensim.similarities.levenshtein submodule is disabled, because the optional Levenshtein package <https://pypi.org/project/python-Levenshtein/> is unavailable. Install Levenhstein (e.g. `pip install python-Levenshtein`) to suppress this warning.\n",
|
||||
" warnings.warn(msg)\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import numpy as np\n",
|
||||
"import gensim\n",
|
||||
|
File diff suppressed because one or more lines are too long
@ -23,18 +23,9 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages/gensim/similarities/__init__.py:15: UserWarning: The gensim.similarities.levenshtein submodule is disabled, because the optional Levenshtein package <https://pypi.org/project/python-Levenshtein/> is unavailable. Install Levenhstein (e.g. `pip install python-Levenshtein`) to suppress this warning.\n",
|
||||
" warnings.warn(msg)\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import numpy as np\n",
|
||||
"import gensim\n",
|
||||
@ -60,19 +51,11 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"scrolled": false
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Reusing dataset conll2003 (/home/kuba/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/40e7cb6bcc374f7c349c83acd1e9352a4f09474eb691f64f364ee62eb65d0ca6)\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"dataset = load_dataset(\"conll2003\")"
|
||||
]
|
||||
@ -432,227 +415,11 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 26,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
"scrolled": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "75184b632ce54ae690b3444778f44651",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
"HBox(children=(FloatProgress(value=0.0, max=14041.0), HTML(value='')))"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "74a55a414fa948a3b251b89f780564d0",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
"HBox(children=(FloatProgress(value=0.0, max=3250.0), HTML(value='')))"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"(0.5068524970963996, 0.5072649075903755, 0.5070586184860281)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "0c9c580076fb4ec48b7ea2f300878594",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
"HBox(children=(FloatProgress(value=0.0, max=14041.0), HTML(value='')))"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "0be8681c67f64aca95ce5d3c44f10538",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
"HBox(children=(FloatProgress(value=0.0, max=3250.0), HTML(value='')))"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"(0.653649243957614, 0.6381494827385795, 0.6458063757205035)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "2dec403004bb4ae298bc73553ea3f4bc",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
"HBox(children=(FloatProgress(value=0.0, max=14041.0), HTML(value='')))"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "eebed0407ba343e29cf8c2d607f631dc",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
"HBox(children=(FloatProgress(value=0.0, max=3250.0), HTML(value='')))"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"(0.7140486069946651, 0.7001046146693014, 0.7070078647728607)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "70792f22eea343c8916bcfcf9215c298",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
"HBox(children=(FloatProgress(value=0.0, max=14041.0), HTML(value='')))"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "5d400bf1b656433ba2091cf750ec2d78",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
"HBox(children=(FloatProgress(value=0.0, max=3250.0), HTML(value='')))"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"(0.756327964151629, 0.725909566430315, 0.7408066429418744)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "604c4fa13c03435d81bf68be37977d74",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
"HBox(children=(FloatProgress(value=0.0, max=14041.0), HTML(value='')))"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "2f78871f366f4fd1b7de6c4be5303906",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
"text/plain": [
|
||||
"HBox(children=(FloatProgress(value=0.0, max=3250.0), HTML(value='')))"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"(0.7963248522230789, 0.7203301174009067, 0.7564235581324383)\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"for i in range(NUM_EPOCHS):\n",
|
||||
" lstm.train()\n",
|
||||
|
File diff suppressed because it is too large
File diff suppressed because one or more lines are too long
@ -2,14 +2,12 @@
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
|
||||
"<div class=\"alert alert-block alert-info\">\n",
|
||||
"<h1> Ekstrakcja informacji </h1>\n",
|
||||
"<h2> 14. <i>Ekstrakcja informacji seq2seq</i> [\u0107wiczenia]</h2> \n",
|
||||
"<h2> 14. <i>Ekstrakcja informacji seq2seq</i> [ćwiczenia]</h2> \n",
|
||||
"<h3> Jakub Pokrywka (2021)</h3>\n",
|
||||
"</div>\n",
|
||||
"\n",
|
||||
@ -21,9 +19,9 @@
|
||||
"metadata": {},
|
||||
"source": [
"### SIMILARITY SEARCH\n",
"1. Install faiss and work through the tutorial: https://github.com/facebookresearch/faiss\n",
"2. Load the article texts from BBC News Train.csv\n",
"3. Use one of the transformers (you can use the sentence-transformers library) to create document embeddings\n",
"4. Load the embeddings into the faiss database\n",
"5. Search for the query 'consumer electronics market'"
]
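The search step in the task list above (query 'consumer electronics market') can be illustrated with a minimal sketch of what an exact nearest-neighbour lookup computes. This is not the notebook's code: the hunks that build the index are not visible in this diff, so the sketch assumes an exact (flat) L2 search and replaces faiss with plain Python. The names `squared_l2`, `search`, `toy_embeddings` and `toy_query` are hypothetical; `D` and `I` only mirror the variable names that appear in the notebook cells.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(
    index_vectors: Sequence[Sequence[float]],
    query: Sequence[float],
    k: int,
) -> Tuple[List[float], List[int]]:
    """Brute-force k-nearest-neighbour search (hypothetical stand-in for a flat L2 index).

    Returns (distances, ids), both ordered from the closest vector to the farthest.
    """
    # Score every stored vector against the query, then keep the k smallest distances.
    scored = sorted(
        (squared_l2(vec, query), doc_id) for doc_id, vec in enumerate(index_vectors)
    )
    top = scored[:k]
    distances = [dist for dist, _ in top]
    ids = [doc_id for _, doc_id in top]
    return distances, ids


# Toy 2-dimensional "embeddings" (made up for illustration; real sentence
# embeddings have hundreds of dimensions and come from a transformer model).
toy_embeddings = [
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 2.0],
    [3.0, 3.0],
]
toy_query = [0.9, 0.1]

D, I = search(toy_embeddings, toy_query, k=2)
print("ids:", I)
print("distances:", D)
```

In the notebook, the ids returned by the search are then used to look up the matching article, as in `DOCUMENTS[1363]`. A brute-force scan like this one is only practical for small collections; a library index is the natural choice once the number of documents grows.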
|
||||
@ -37,7 +35,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 25,
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -50,65 +48,20 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 26,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Requirement already satisfied: sentence-transformers in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (1.2.0)\n",
|
||||
"Requirement already satisfied: sentencepiece in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (0.1.91)\n",
|
||||
"Requirement already satisfied: torchvision in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (0.6.0)\n",
|
||||
"Requirement already satisfied: scipy in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (1.4.1)\n",
|
||||
"Requirement already satisfied: torch>=1.6.0 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (1.8.1)\n",
|
||||
"Requirement already satisfied: tqdm in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (4.48.2)\n",
|
||||
"Requirement already satisfied: scikit-learn in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (0.23.2)\n",
|
||||
"Requirement already satisfied: nltk in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (3.5)\n",
|
||||
"Requirement already satisfied: transformers<5.0.0,>=3.1.0 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (4.4.2)\n",
|
||||
"Requirement already satisfied: numpy in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (1.20.3)\n",
|
||||
"Requirement already satisfied: pillow>=4.1.1 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from torchvision->sentence-transformers) (8.0.1)\n",
|
||||
"Requirement already satisfied: typing-extensions in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from torch>=1.6.0->sentence-transformers) (3.7.4.3)\n",
|
||||
"Requirement already satisfied: joblib>=0.11 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from scikit-learn->sentence-transformers) (0.16.0)\n",
|
||||
"Requirement already satisfied: threadpoolctl>=2.0.0 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from scikit-learn->sentence-transformers) (2.1.0)\n",
|
||||
"Requirement already satisfied: click in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from nltk->sentence-transformers) (7.1.2)\n",
|
||||
"Requirement already satisfied: regex in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from nltk->sentence-transformers) (2020.7.14)\n",
|
||||
"Requirement already satisfied: sacremoses in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from transformers<5.0.0,>=3.1.0->sentence-transformers) (0.0.43)\n",
|
||||
"Requirement already satisfied: packaging in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from transformers<5.0.0,>=3.1.0->sentence-transformers) (20.4)\n",
|
||||
"Requirement already satisfied: filelock in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from transformers<5.0.0,>=3.1.0->sentence-transformers) (3.0.12)\n",
|
||||
"Requirement already satisfied: tokenizers<0.11,>=0.10.1 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from transformers<5.0.0,>=3.1.0->sentence-transformers) (0.10.1)\n",
|
||||
"Requirement already satisfied: requests in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from transformers<5.0.0,>=3.1.0->sentence-transformers) (2.24.0)\n",
|
||||
"Requirement already satisfied: six in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sacremoses->transformers<5.0.0,>=3.1.0->sentence-transformers) (1.15.0)\n",
|
||||
"Requirement already satisfied: pyparsing>=2.0.2 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from packaging->transformers<5.0.0,>=3.1.0->sentence-transformers) (2.4.7)\n",
|
||||
"Requirement already satisfied: certifi>=2017.4.17 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from requests->transformers<5.0.0,>=3.1.0->sentence-transformers) (2020.6.20)\n",
|
||||
"Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from requests->transformers<5.0.0,>=3.1.0->sentence-transformers) (1.25.10)\n",
|
||||
"Requirement already satisfied: idna<3,>=2.5 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from requests->transformers<5.0.0,>=3.1.0->sentence-transformers) (2.10)\n",
|
||||
"Requirement already satisfied: chardet<4,>=3.0.2 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from requests->transformers<5.0.0,>=3.1.0->sentence-transformers) (3.0.4)\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!pip install sentence-transformers"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 27,
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[[-0.07142266 -0.07716199 -0.03047761 ... 0.01356028 -0.04016104\n",
|
||||
" -0.02446149]\n",
|
||||
" [-0.06508802 -0.06923407 -0.03735013 ... 0.01013562 -0.04027328\n",
|
||||
" -0.02171571]]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from sentence_transformers import SentenceTransformer\n",
|
||||
"sentences = [\"Hello World\", \"Hallo Welt\"]\n",
|
||||
@ -120,7 +73,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 28,
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
@ -131,7 +84,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 29,
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -140,7 +93,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 30,
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -149,7 +102,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 31,
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -158,7 +111,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 32,
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -167,7 +120,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 33,
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -176,7 +129,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 34,
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -185,7 +138,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 35,
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -194,7 +147,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 36,
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -203,72 +156,41 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 37,
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"array([[1363, 1371, 898, 744, 292]])"
|
||||
]
|
||||
},
|
||||
"execution_count": 37,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"I"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 38,
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"array([[1.3110979, 1.4027181, 1.4045265, 1.4421673, 1.4421673]],\n",
|
||||
" dtype=float32)"
|
||||
]
|
||||
},
|
||||
"execution_count": 38,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"D"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 39,
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'internet boom for gift shopping cyberspace is becoming a very popular destination for christmas shoppers. forecasts predict that british people will spend \u00a34bn buying gifts online during the festive season an increase of 64% on 2003. surveys also show that the average amount that people are spending is rising as is the range of goods that they are happy to buy online. savvy shoppers are also using the net to find the hot presents that are all but sold out in high street stores. almost half of the uk population now shop online according to figures collected by the interactive media in retail group which represents web retailers. about 85% of this group 18m people expect to do a lot of their christmas gift buying online this year reports the industry group. on average each shopper will spend \u00a3220 and britons lead europe in their affection for online shopping. almost a third of all the money spent online this christmas will come out of british wallets and purses compared to 29% from german shoppers and only 4% from italian gift buyers. james roper director of the imrg said shoppers were now much happier to buy so-called big ticket items such as lcd television sets and digital cameras. mr roper added that many retailers were working hard to reassure consumers that online shopping was safe and that goods ordered as presents would arrive in time for christmas. he advised consumers to give shops a little more time than usual to fulfil orders given that online buying is proving so popular. a survey by hostway suggests that many men prefer to shop online to avoid the embarrassment of buying some types of presents such as lingerie for wives and girlfriends. much of this online shopping is likely to be done during work time according to research carried out by security firm saint bernard software. the research reveals that up to two working days will be lost by staff who do their shopping via their work computer. 
worst offenders will be those in the 18-35 age bracket suggests the research who will spend up to five hours per week in december browsing and buying at online shops. iggy fanlo chief revenue officer at shopping.com said that the growing numbers of people using broadband was driving interest in online shopping. when you consider narrowband and broadband the conversion to sale is two times higher he said. higher speeds meant that everything happened much faster he said which let people spend time browsing and finding out about products before they buy. the behaviour of online shoppers was also changing he said. the single biggest reason people went online before this year was price he said. the number one reason now is convenience. very few consumers click on the lowest price he said. they are looking for good prices and merchant reliability. consumer comments and reviews were also proving popular with shoppers keen to find out who had the most reliable customer service. data collected by ebay suggests that some smart shoppers are getting round the shortages of hot presents by buying them direct through the auction site. according to ebay uk there are now more than 150 robosapiens remote control robots for sale via the site. the robosapiens toy is almost impossible to find in online and offline stores. similarly many shoppers are turning to ebay to help them get hold of the hard-to-find slimline playstation 2 which many retailers are only selling as part of an expensive bundle. the high demand for the playstation 2 has meant that prices for it are being driven up. in shops the ps2 is supposed to sell for \u00a3104.99. in some ebay uk auctions the price has risen to more than double this figure. many people are also using ebay to get hold of gadgets not even released in this country. the portable version of the playstation has only just gone on sale in japan yet some enterprising ebay users are selling the device to uk gadget fans.'"
|
||||
]
|
||||
},
|
||||
"execution_count": 39,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"DOCUMENTS[1363]"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"author": "Jakub Pokrywka",
|
||||
"email": "kubapok@wmi.amu.edu.pl",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"lang": "pl",
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
@ -281,10 +203,7 @@
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.3"
|
||||
},
|
||||
"author": "Jakub Pokrywka",
|
||||
"email": "kubapok@wmi.amu.edu.pl",
|
||||
"lang": "pl",
|
||||
"subtitle": "14.Ekstrakcja informacji seq2seq[\u0107wiczenia]",
|
||||
"subtitle": "14.Ekstrakcja informacji seq2seq[ćwiczenia]",
|
||||
"title": "Ekstrakcja informacji",
|
||||
"year": "2021"
|
||||
},
|
||||
|