add metadata to cw

This commit is contained in:
Jakub Pokrywka 2021-09-27 12:34:44 +02:00
parent 1836dc18c1
commit ad34aaeae0
26 changed files with 15211 additions and 16807 deletions

View File

@ -1,90 +1,112 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Informacje ogólne"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Kontakt z prowadzącym\n",
"\n",
"prowadzący: mgr inż. Jakub Pokrywka\n",
"\n",
"Najlepiej kontaktowąć się ze mną przez MS TEAMS na grupie kanału (ogólne sprawy) lub w prywatnych wiadomościach. Odpisuję co 2-3 dni. Można też umówić się na zdzwonko w godzinach dyżuru (wt 12.00-13.00) lub umówić się w innym terminie.\n",
"\n",
"\n",
"## Literatura\n",
"Polecana literatura do przedmiotu:\n",
"\n",
"\n",
"- https://www.manning.com/books/relevant-search#toc (darmowa) Polecam chociaż przejrzeć.\n",
"- Marie-Francine Moens. 2006. Information Extraction: Algorithms and Prospects in a Retrieval Context. Springer. (polecam mniej, jest trochę nieaktualna)\n",
"- Alex Graves. 2012. Supervised sequence labelling. Studies in Computational Intelligence, vol 385. Springer. Berlin, Heidelberg. \n",
"\n",
"- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. North American Association for Computational Linguistics (NAACL). \n",
"\n",
"- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research vol 21, number 140, pages 1-67. \n",
"\n",
"- Flip Graliński, Tomasz Stanisławek, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, Przemysław Biecek. 2020. Kleister: A novel task for information extraction involving long documents with complex layout. URL https://arxiv.org/abs/2003.02356 \n",
"\n",
"- Łukasz Garncarek, Rafał Powalski, Tomasz Stanisławek, Bartosz Topolski, Piotr Halama, Filip Graliński. 2020. LAMBERT: Layout-Aware (Language) Modeling using BERT. URL https://arxiv.org/pdf/2002.08087 \n",
"\n",
"## Zaliczenie\n",
"\n",
"\n",
"\n",
"Do zdobycia będzie conajmniej 600 punktów.\n",
"\n",
"Ocena:\n",
"\n",
"- -299 — 2\n",
"\n",
"- 300-349 — 3\n",
"\n",
"- 350-399 — 3+\n",
"\n",
"- 400-449 — 4\n",
"\n",
"- 450—499 — 4+\n",
"\n",
"- 500- — 5\n",
"\n",
"\n",
"**Żeby zaliczyć przedmiot należy pojawiać się na laboratoriach. Maksymalna liczba nieobecności to 3. Obecność będę sprawdzał poprzez panel MS TEAMS, czyli będę sprawdzał czy ktoś jest wdzwoniony na ćwiczenia. Jeżeli kogoś nie będzie więcej niż 3 razy, to nie będzie miał zaliczonego przedmiotu** \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 0. <i>Informacje na temat przedmiotu</i> [\u0107wiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Informacje og\u00f3lne"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Kontakt z prowadz\u0105cym\n",
"\n",
"prowadz\u0105cy: mgr in\u017c. Jakub Pokrywka\n",
"\n",
"Najlepiej kontaktow\u0105\u0107 si\u0119 ze mn\u0105 przez MS TEAMS na grupie kana\u0142u (og\u00f3lne sprawy) lub w prywatnych wiadomo\u015bciach. Odpisuj\u0119 co 2-3 dni. Mo\u017cna te\u017c um\u00f3wi\u0107 si\u0119 na zdzwonko w godzinach dy\u017curu (wt 12.00-13.00) lub um\u00f3wi\u0107 si\u0119 w innym terminie.\n",
"\n",
"\n",
"## Literatura\n",
"Polecana literatura do przedmiotu:\n",
"\n",
"\n",
"- https://www.manning.com/books/relevant-search#toc (darmowa) Polecam chocia\u017c przejrze\u0107.\n",
"- Marie-Francine Moens. 2006. Information Extraction: Algorithms and Prospects in a Retrieval Context. Springer. (polecam mniej, jest troch\u0119 nieaktualna)\n",
"- Alex Graves. 2012. Supervised sequence labelling. Studies in Computational Intelligence, vol 385. Springer. Berlin, Heidelberg. \n",
"\n",
"- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. North American Association for Computational Linguistics (NAACL). \n",
"\n",
"- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research vol 21, number 140, pages 1-67. \n",
"\n",
"- Flip Grali\u0144ski, Tomasz Stanis\u0142awek, Anna Wr\u00f3blewska, Dawid Lipi\u0144ski, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, Przemys\u0142aw Biecek. 2020. Kleister: A novel task for information extraction involving long documents with complex layout. URL https://arxiv.org/abs/2003.02356 \n",
"\n",
"- \u0141ukasz Garncarek, Rafa\u0142 Powalski, Tomasz Stanis\u0142awek, Bartosz Topolski, Piotr Halama, Filip Grali\u0144ski. 2020. LAMBERT: Layout-Aware (Language) Modeling using BERT. URL https://arxiv.org/pdf/2002.08087 \n",
"\n",
"## Zaliczenie\n",
"\n",
"\n",
"\n",
"Do zdobycia b\u0119dzie conajmniej 600 punkt\u00f3w.\n",
"\n",
"Ocena:\n",
"\n",
"- -299 \u2014 2\n",
"\n",
"- 300-349 \u2014 3\n",
"\n",
"- 350-399 \u2014 3+\n",
"\n",
"- 400-449 \u2014 4\n",
"\n",
"- 450\u2014499 \u2014 4+\n",
"\n",
"- 500- \u2014 5\n",
"\n",
"\n",
"**\u017beby zaliczy\u0107 przedmiot nale\u017cy pojawia\u0107 si\u0119 na laboratoriach. Maksymalna liczba nieobecno\u015bci to 3. Obecno\u015b\u0107 b\u0119d\u0119 sprawdza\u0142 poprzez panel MS TEAMS, czyli b\u0119d\u0119 sprawdza\u0142 czy kto\u015b jest wdzwoniony na \u0107wiczenia. Je\u017celi kogo\u015b nie b\u0119dzie wi\u0119cej ni\u017c 3 razy, to nie b\u0119dzie mia\u0142 zaliczonego przedmiotu** \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"lang": "pl",
"subtitle": "0.Informacje na temat przedmiotu[\u0107wiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -1,181 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Opracować w języku Haskell wyspecjalizowanego robota pobierającego dane z konkretnego serwisu.\n",
"\n",
"Punkty: 80 (domyślnie - niektóre zadanie są trudniejsze, wówczas podaję osobno liczbę punktów)\n",
"\n",
"Ogólne zasady:\n",
"\n",
"* pobieramy informacje (metadane) o plikach PDF, DjVU, JPG itp, ale nie same pliki,\n",
"* nie pobierajmy całego serwisu, tylko tyle, ile trzeba, by pobrać metadane o interesujących nas zasobach,\n",
"* interesują nas tylko teksty polskie, jeśli nie jest to trudne, należy odfiltrować publikacje obcojęzyczne,\n",
"* staramy się ustalać datę z możliwie dużą dokładnością.\n",
"\n",
"Sposób pracy:\n",
"\n",
"0. Pobrać Haskell Stack\n",
"\n",
"~~~\n",
"curl -sSL https://get.haskellstack.org/ | sh -s - -d ~/bin\n",
"~~~\n",
"\n",
"Na fizycznych komputerach wydziałowych są błędnie ustawione prawa dostępu na dyskach sieciowych, Haskell Stack musi działać na fizycznym dysku:\n",
"\n",
"~~~\n",
"rm -rf /mnt/poligon/.stack\n",
"mkdir /mnt/poligon/.stack\n",
"mv ~/.stack ~/.stack-bak # gdyby już był... proszę się nie przejmować błędem\n",
"ln -s /mnt/poligon/.stack ~/.stack\n",
"~~~\n",
"\n",
"1. Pobrać repozytorium:\n",
"\n",
"~~~\n",
"git clone https://git.wmi.amu.edu.pl/filipg/twilight-library.git\n",
"~~~\n",
"\n",
"2. Wypchnąć na początek do swojego repozytorium (trzeba sobie najpierw założyć to repozytorium na <https://git.wmi.amu.edu.pl>)\n",
"\n",
"~~~\n",
"cd twilight-library\n",
"git remote set-url origin git@git.wmi.amu.edu.pl:YOURID/twilight-library\n",
"git push origin master\n",
"git remote add mother git://gonito.net/twilight-library\n",
"~~~\n",
"\n",
"3. Zobacz, czy przykładowy robot dla strony z „Alamanachem Muszyny” działa:\n",
"\n",
"~~~\n",
"~/bin/stack install # może trwać długo za pierwszym razem\n",
"~/bin/stack exec almanachmuszyny\n",
"~~~\n",
"\n",
"\n",
"W razie problemów z instalacją:\n",
"\n",
"~~~\n",
"sudo apt install libpcre3 libpcre3-dev\n",
"~~~\n",
"\n",
"3. Opracuj swojego robota wzorując się na pliku `almanachmuszyny.hs`.\n",
" (Ale dodaj swój plik, nie zmieniaj `almanachmuszyny.hs`!)\n",
"\n",
"4. Dopisz specyfikację swojego robota do `shadow-library.cabal`.\n",
"\n",
"5. Pracuj nad swoim robotem, uruchamiaj go w następujący sposób:\n",
"\n",
"~~~\n",
"~/bin/stack install\n",
"~/bin/stack exec mojrobot\n",
"~~~\n",
"\n",
"(Tzn. nie nazywaj go „mojrobot”, tylko użyj jakieś sensownej nazwy.)\n",
"\n",
"6. Jeśli publikacja (np. pojedynczy numer gazety) składa się z wielu plików, powinien zostać wygenerowany jeden\n",
"rekord, w `finalUrl` powinny znaleźć się URL do poszczególnych stron (np. plików JPR) oddzielone ` // `.\n",
"\n",
"7. Po zakończeniu prac prześlij mejla do prowadzącego zajęcia z URL-em do swojego repozytorium.\n",
"\n",
"Lista serwisów do wyboru (na każdy serwis 1 osoba):\n",
"\n",
"1. [Teksty Drugie](http://tekstydrugie.pl)\n",
"2. [Archiwum Inspektora Pracy](https://www.pip.gov.pl/pl/inspektor-pracy/66546,archiwum-inspektora-pracy-.html)\n",
"3. [Medycyna Weterynaryjna](http://www.medycynawet.edu.pl/archives) — również historyczne zasoby od 1945 roku, **120 punktów**\n",
"4. [Polskie Towarzystwo Botaniczne](https://pbsociety.org.pl/default/dzialalnosc-wydawnicza/) — wszystkie dostępne zdigitalizowane publikacje!, **130 punktow**\n",
"5. [Wieści Pepowa](http://archiwum2019.pepowo.pl/news/c-10/gazeta) — nie pominąć strony nr 2 z wynikami, **110 punktów**\n",
"6. [Czasopismo Kosmos](http://kosmos.icm.edu.pl/)\n",
"7. [Czasopismo Wszechświat](http://www.ptpk.org/archiwum.html)\n",
"8. [Czasopisma polonijne we Francji](https://argonnaute.parisnanterre.fr/ark:/14707/a011403267917yQQFAS) — najlepiej w postaci PDF-ów, jak np. [https://argonnaute.parisnanterre.fr/medias/customer_3/periodique/immi_pol_lotmz1_pdf/BDIC_GFP_2929_1945_039.pdf](), **220 punktów**\n",
"9. [Muzeum Sztuki — czasopisma](https://zasoby.msl.org.pl/mobjects/show), **220 punktów**, publikacje, teksty, czasopisma, wycinki\n",
"10. [Wiadomości Urzędu Patentowego](https://grab.uprp.pl/sites/Wydawnictwa/WydawnictwaArchiwum/WydawnictwaArchiwum/Forms/AllItems.aspx)\n",
"11. [Czas, czasopismo polonijne](https://digitalcollections.lib.umanitoba.ca/islandora/object/uofm:2222545), **140 punktów** S.G.\n",
"12. [Stenogramy Okrągłego Stołu](http://okragly-stol.pl/stenogramy/), **110 punktów**\n",
"13. [Nasze Popowice](https://smpopowice.pl/index.php/numery-archiwalne)\n",
"14. [Czasopisma entomologiczne](http://pte.au.poznan.pl/)\n",
"15. [Wiadomości matematyczne](https://wydawnictwa.ptm.org.pl/index.php/wiadomosci-matematyczne/issue/archive?issuesPage=2), **120 punktow**\n",
"16. [Alkoholizm i Narkomania](http://www.ain.ipin.edu.pl/archiwum-starsze.html)\n",
"17. [Czasopismo Etyka](https://etyka.uw.edu.pl/tag/etyka-562018/), O.K.\n",
"18. [Skup makulatury](https://chomikuj.pl/skup.makulatury.prl), **250 punktów**\n",
"19. [Hermes](https://chomikuj.pl/hermes50-1) i https://chomikuj.pl/hermes50-2, **250 punktów**\n",
"20. [E-dziennik Województwa Mazowieckiego](https://edziennik.mazowieckie.pl/actbymonths) **150 punktów**\n",
"21. [Czasopismo Węgiel Brunatny](http://www.ppwb.org.pl/wegiel_brunatny)\n",
"22. [Gazeta GUM](https://gazeta.gumed.edu.pl/61323.html)\n",
"23. [Nowiny Andrychowskie](https://radioandrychow.pl/nowiny/)\n",
"24. [Kawęczyniak](http://bip.kaweczyn.pl/kaweczyn/pl/dla-mieszkanca/publikacje/archiwalne-numery-kaweczyniaka-rok-1995-2005/kaweczyniaki-rok-1997.html)\n",
"25. [Zbór Chrześcijański w Bielawia](http://zborbielawa.pl/archiwum/)\n",
"26. [Gazeta Rytwiańska](http://www.rytwiany.com.pl/index.php?sid=5)\n",
"27. [Nasze Popowice](https://smpopowice.pl/gazeta/2005_12_nasze-popowice-nr_01.pdf)\n",
"28. [Echo Chełmka](http://moksir.chelmek.pl/o-nas/echo-chelmka)\n",
"29. [Głos Świdnika](http://s.bibliotekaswidnik.pl/index.php/archwium/116-glos-swidnika) **100 punktów**\n",
"30. [Aneks](https://aneks.kulturaliberalna.pl/archiwum-aneksu/) **90 punktów**\n",
"31. [Teatr Lalel](http://polunima.pl/teatr-lalek)\n",
"32. [Biuletyn Bezpieczna Chemia](https://www.pipc.org.pl/publikacje/biuletyn-bezpieczna-chemia)\n",
"33. [Głos Maszynisty](https://zzm.org.pl/glos-maszynisty/)\n",
"34. [Kultura Paryska](https://www.kulturaparyska.com/pl/index), całe archiwum z książkami i innymi czasopismami, **180 punktów**\n",
"35. [Gazeta Fabryczna - Kraśnik](https://80lat.flt.krasnik.pl/index.php/gazeta-fabryczna/) **120 punktów**\n",
"36. [Artykuły o Jujutsu](http://www.kobudo.pl/artykuly_jujutsu.html)\n",
"37. [Wycinki o Taekwon-Do](https://www2.pztkd.lublin.pl/archpras.html#z1996)\n",
"38. [Materiały o kolejnictwie](https://enkol.pl/Strona_g%C5%82%C3%B3wna) **180 punktów**\n",
"39. [Centralny Instytut Ochrony Pracy](http://archiwum.ciop.pl/), znaleźć wszystkie publikacje typu <http://archiwum.ciop.pl/44938>, wymaga trochę sprytu **130 punktów**\n",
"40. [Biblioteka Sejmowa - Zasoby Cyfrowe](https://biblioteka.sejm.gov.pl/zasoby_cyfrowe/), **200 punktów**\n",
"41. [Elektronika Praktyczna](https://ep.com.pl/archiwum), te numery, które dostępne w otwarty sposób, np. rok 1993\n",
"42. [Litewska Akademia Nauk](http://www.mab.lt/), tylko materiały w jęz. polskim, takie jak np.\n",
" <https://elibrary.mab.lt/handle/1/840>, **170 punktów**\n",
"43. [Litewska Biblioteka Cyfrowa](https://www.epaveldas.lt), wyłuskać tylko materiały w jęz. polskim, **190 punktów**\n",
"44. [Czasopisma Geologiczne](https://geojournals.pgi.gov.pl), **120 punktów**\n",
"45. [Czasopisma PTTK](https://www.czasopisma.centralnabibliotekapttk.pl/index.php?i3), **120 punktów**\n",
"46. [Czasopisma Polskiego Towarzystwa Dendrologicznego](https://www.ptd.pl/?page_id=7), **100 punktów**\n",
"47. [Kilka przedwojennych książek](https://dziemiela.com/documents.htm)\n",
"48. [Historia polskiej informatyki](http://klio.spit.iq.pl/a4-wyroby-polskiej-informatyki/a4-2-sprzet/) - wyjątkowo bez datowania\n",
"49. [Zeszyty Formacyjne Katolickiego Stowarzyszenia „Civitas Christania”](http://podkarpacki.civitaschristiana.pl/formacja/zeszyty-formacyjne/), tylko niektóre pliki można zdatować\n",
"50. [Józef Piłsudski Institute of America](https://archiwa.pilsudski.org/) - **220 punktów**\n",
"51. [Prasa podziemna — Częstochowa](http://www.podziemie.com.pl), również ulotki i inne materiały skanowane - **180 punktów**\n",
"52. [Tajemnica Atari](http://krap.pl/mirrorz/atari/horror.mirage.com.pl/pixel/), plik ZIP z DjVu\n",
"\n",
"\n",
"### F.A.Q.\n",
"\n",
"**P: Nie działają strony z protokołem https, co zrobić?**\n",
"\n",
"O: Trzeba użyć modułu opartego na bibliotece curl. Paczka Ubuntu została zainstalowana na komputerach wydziałowych. Na\n",
"swoim komputerze możemy zainstalować paczkę libcurl4-openssl-dev, a\n",
"następnie można sobie ściągnąć wersję twilight-library opartą na libcurl:\n",
"\n",
" git fetch git://gonito.net/twilight-library withcurl\n",
" git merge FETCH_HEAD\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -1,5 +1,19 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 1. <i>Wyszukiwarki wprowadzenie</i> [ćwiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -234,11 +248,14 @@
}
],
"metadata": {
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
@ -249,8 +266,11 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
"version": "3.8.3"
},
"subtitle": "1.Wyszukiwarki wprowadzenie[ćwiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4

View File

@ -1,5 +1,19 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 2. <i>Wyszukiwarki roboty</i> [ćwiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -272,11 +286,14 @@
}
],
"metadata": {
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
@ -287,8 +304,11 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
"version": "3.8.3"
},
"subtitle": "2.Wyszukiwarki roboty[ćwiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4

File diff suppressed because it is too large Load Diff

View File

@ -1,69 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"def word_to_index(word):\n",
" vec = np.zeros(len(vocabulary))\n",
" if word in vocabulary:\n",
" idx = vocabulary.index(word)\n",
" vec[idx] = 1\n",
" else:\n",
" vec[-1] = 1\n",
" return vec"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def tf(document):\n",
" document_vector = None\n",
" for word in document:\n",
" if document_vector is None:\n",
" document_vector = word_to_index(word)\n",
" else:\n",
" document_vector += word_to_index(word)\n",
" return document_vector"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def similarity(query, document):\n",
" numerator = np.sum(query * document)\n",
" denominator = np.sqrt(np.sum(query*query)) * np.sqrt(np.sum(document*document)) \n",
" return numerator / denominator"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -1,708 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Zajecia 2\n",
"\n",
"Przydatne materiały:\n",
"\n",
"https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html\n",
"\n",
"https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Importy"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import sklearn.metrics\n",
"\n",
"from sklearn.datasets import fetch_20newsgroups\n",
"\n",
"from sklearn.feature_extraction.text import TfidfVectorizer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Zbiór danych"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"newsgroups = fetch_20newsgroups()['data']"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"11314"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(newsgroups)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"From: lerxst@wam.umd.edu (where's my thing)\n",
"Subject: WHAT car is this!?\n",
"Nntp-Posting-Host: rac3.wam.umd.edu\n",
"Organization: University of Maryland, College Park\n",
"Lines: 15\n",
"\n",
" I was wondering if anyone out there could enlighten me on this car I saw\n",
"the other day. It was a 2-door sports car, looked to be from the late 60s/\n",
"early 70s. It was called a Bricklin. The doors were really small. In addition,\n",
"the front bumper was separate from the rest of the body. This is \n",
"all I know. If anyone can tellme a model name, engine specs, years\n",
"of production, where this car is made, history, or whatever info you\n",
"have on this funky looking car, please e-mail.\n",
"\n",
"Thanks,\n",
"- IL\n",
" ---- brought to you by your neighborhood Lerxst ----\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
}
],
"source": [
"print(newsgroups[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Naiwne przeszukiwanie"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"all_documents = list() \n",
"for document in newsgroups:\n",
" if 'car' in document:\n",
" all_documents.append(document)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"From: lerxst@wam.umd.edu (where's my thing)\n",
"Subject: WHAT car is this!?\n",
"Nntp-Posting-Host: rac3.wam.umd.edu\n",
"Organization: University of Maryland, College Park\n",
"Lines: 15\n",
"\n",
" I was wondering if anyone out there could enlighten me on this car I saw\n",
"the other day. It was a 2-door sports car, looked to be from the late 60s/\n",
"early 70s. It was called a Bricklin. The doors were really small. In addition,\n",
"the front bumper was separate from the rest of the body. This is \n",
"all I know. If anyone can tellme a model name, engine specs, years\n",
"of production, where this car is made, history, or whatever info you\n",
"have on this funky looking car, please e-mail.\n",
"\n",
"Thanks,\n",
"- IL\n",
" ---- brought to you by your neighborhood Lerxst ----\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
}
],
"source": [
"print(all_documents[0])"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"From: guykuo@carson.u.washington.edu (Guy Kuo)\n",
"Subject: SI Clock Poll - Final Call\n",
"Summary: Final call for SI clock reports\n",
"Keywords: SI,acceleration,clock,upgrade\n",
"Article-I.D.: shelley.1qvfo9INNc3s\n",
"Organization: University of Washington\n",
"Lines: 11\n",
"NNTP-Posting-Host: carson.u.washington.edu\n",
"\n",
"A fair number of brave souls who upgraded their SI clock oscillator have\n",
"shared their experiences for this poll. Please send a brief message detailing\n",
"your experiences with the procedure. Top speed attained, CPU rated speed,\n",
"add on cards and adapters, heat sinks, hour of usage per day, floppy disk\n",
"functionality with 800 and 1.4 m floppies are especially requested.\n",
"\n",
"I will be summarizing in the next two days, so please add to the network\n",
"knowledge base if you have done the clock upgrade and haven't answered this\n",
"poll. Thanks.\n",
"\n",
"Guy Kuo <guykuo@u.washington.edu>\n",
"\n"
]
}
],
"source": [
"print(all_documents[1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### jakie są problemy z takim podejściem?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TFIDF i odległość cosinusowa- gotowe biblioteki"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"vectorizer = TfidfVectorizer()\n",
"#vectorizer = TfidfVectorizer(use_idf = False, ngram_range=(1,2))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"document_vectors = vectorizer.fit_transform(newsgroups)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<11314x130107 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 1787565 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"document_vectors"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<1x130107 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 89 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"document_vectors[0]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"matrix([[0., 0., 0., ..., 0., 0., 0.]])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"document_vectors[0].todense()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"matrix([[0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.]])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"document_vectors[0:4].todense()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"query_str = 'speed'\n",
"#query_str = 'speed car'\n",
"#query_str = 'spider man'"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"query_vector = vectorizer.transform([query_str])"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<11314x130107 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 1787565 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"document_vectors"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<1x130107 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 1 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_vector"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"similarities = sklearn.metrics.pairwise.cosine_similarity(query_vector,document_vectors)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.26949927, 0.3491801 , 0.44292083, 0.47784165])"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.sort(similarities)[0][-4:]"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([4517, 5509, 2116, 9921])"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"similarities.argsort()[0][-4:]"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"From: ray@netcom.com (Ray Fischer)\n",
"Subject: Re: x86 ~= 680x0 ?? (How do they compare?)\n",
"Organization: Netcom. San Jose, California\n",
"Distribution: usa\n",
"Lines: 36\n",
"\n",
"dhk@ubbpc.uucp (Dave Kitabjian) writes ...\n",
">I'm sure Intel and Motorola are competing neck-and-neck for \n",
">crunch-power, but for a given clock speed, how do we rank the\n",
">following (from 1st to 6th):\n",
"> 486\t\t68040\n",
"> 386\t\t68030\n",
"> 286\t\t68020\n",
"\n",
"040 486 030 386 020 286\n",
"\n",
">While you're at it, where will the following fit into the list:\n",
"> 68060\n",
"> Pentium\n",
"> PowerPC\n",
"\n",
"060 fastest, then Pentium, with the first versions of the PowerPC\n",
"somewhere in the vicinity.\n",
"\n",
">And about clock speed: Does doubling the clock speed double the\n",
">overall processor speed? And fill in the __'s below:\n",
"> 68030 @ __ MHz = 68040 @ __ MHz\n",
"\n",
"No. Computer speed is only partly dependent of processor/clock speed.\n",
"Memory system speed play a large role as does video system speed and\n",
"I/O speed. As processor clock rates go up, the speed of the memory\n",
"system becomes the greatest factor in the overall system speed. If\n",
"you have a 50MHz processor, it can be reading another word from memory\n",
"every 20ns. Sure, you can put all 20ns memory in your computer, but\n",
"it will cost 10 times as much as the slower 80ns SIMMs.\n",
"\n",
"And roughly, the 68040 is twice as fast at a given clock\n",
"speed as is the 68030.\n",
"\n",
"-- \n",
"Ray Fischer \"Convictions are more dangerous enemies of truth\n",
"ray@netcom.com than lies.\" -- Friedrich Nietzsche\n",
"\n",
"0.4778416465020907\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"From: rvenkate@ux4.cso.uiuc.edu (Ravikuma Venkateswar)\n",
"Subject: Re: x86 ~= 680x0 ?? (How do they compare?)\n",
"Distribution: usa\n",
"Organization: University of Illinois at Urbana\n",
"Lines: 59\n",
"\n",
"ray@netcom.com (Ray Fischer) writes:\n",
"\n",
">dhk@ubbpc.uucp (Dave Kitabjian) writes ...\n",
">>I'm sure Intel and Motorola are competing neck-and-neck for \n",
">>crunch-power, but for a given clock speed, how do we rank the\n",
">>following (from 1st to 6th):\n",
">> 486\t\t68040\n",
">> 386\t\t68030\n",
">> 286\t\t68020\n",
"\n",
">040 486 030 386 020 286\n",
"\n",
"How about some numbers here? Some kind of benchmark?\n",
"If you want, let me start it - 486DX2-66 - 32 SPECint92, 16 SPECfp92 .\n",
"\n",
">>While you're at it, where will the following fit into the list:\n",
">> 68060\n",
">> Pentium\n",
">> PowerPC\n",
"\n",
">060 fastest, then Pentium, with the first versions of the PowerPC\n",
">somewhere in the vicinity.\n",
"\n",
"Numbers? Pentium @66MHz - 65 SPECint92, 57 SPECfp92 .\n",
"\t PowerPC @66MHz - 50 SPECint92, 80 SPECfp92 . (Note this is the 601)\n",
" (Alpha @150MHz - 74 SPECint92,126 SPECfp92 - just for comparison)\n",
"\n",
">>And about clock speed: Does doubling the clock speed double the\n",
">>overall processor speed? And fill in the __'s below:\n",
">> 68030 @ __ MHz = 68040 @ __ MHz\n",
"\n",
">No. Computer speed is only partly dependent of processor/clock speed.\n",
">Memory system speed play a large role as does video system speed and\n",
">I/O speed. As processor clock rates go up, the speed of the memory\n",
">system becomes the greatest factor in the overall system speed. If\n",
">you have a 50MHz processor, it can be reading another word from memory\n",
">every 20ns. Sure, you can put all 20ns memory in your computer, but\n",
">it will cost 10 times as much as the slower 80ns SIMMs.\n",
"\n",
"Not in a clock-doubled system. There isn't a doubling in performance, but\n",
"it _is_ quite significant. Maybe about a 70% increase in performance.\n",
"\n",
"Besides, for 0 wait state performance, you'd need a cache anyway. I mean,\n",
"who uses a processor that runs at the speed of 80ns SIMMs? Note that this\n",
"memory speed corresponds to a clock speed of 12.5 MHz.\n",
"\n",
">And roughly, the 68040 is twice as fast at a given clock\n",
">speed as is the 68030.\n",
"\n",
"Numbers?\n",
"\n",
">-- \n",
">Ray Fischer \"Convictions are more dangerous enemies of truth\n",
">ray@netcom.com than lies.\" -- Friedrich Nietzsche\n",
"-- \n",
"Ravikumar Venkateswar\n",
"rvenkate@uiuc.edu\n",
"\n",
"A pun is a no' blessed form of whit.\n",
"\n",
"0.44292082969477664\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"From: ray@netcom.com (Ray Fischer)\n",
"Subject: Re: x86 ~= 680x0 ?? (How do they compare?)\n",
"Organization: Netcom. San Jose, California\n",
"Distribution: usa\n",
"Lines: 30\n",
"\n",
"rvenkate@ux4.cso.uiuc.edu (Ravikuma Venkateswar) writes ...\n",
">ray@netcom.com (Ray Fischer) writes:\n",
">>040 486 030 386 020 286\n",
">\n",
">How about some numbers here? Some kind of benchmark?\n",
"\n",
"Benchmarks are for marketing dweebs and CPU envy. OK, if it will make\n",
"you happy, the 486 is faster than the 040. BFD. Both architectures\n",
"are nearing then end of their lifetimes. And especially with the x86\n",
"architecture: good riddance.\n",
"\n",
">Besides, for 0 wait state performance, you'd need a cache anyway. I mean,\n",
">who uses a processor that runs at the speed of 80ns SIMMs? Note that this\n",
">memory speed corresponds to a clock speed of 12.5 MHz.\n",
"\n",
"The point being the processor speed is only one of many aspects of a\n",
"computers performance. Clock speed, processor, memory speed, CPU\n",
"architecture, I/O systems, even the application program all contribute \n",
"to the overall system performance.\n",
"\n",
">>And roughly, the 68040 is twice as fast at a given clock\n",
">>speed as is the 68030.\n",
">\n",
">Numbers?\n",
"\n",
"Look them up yourself.\n",
"\n",
"-- \n",
"Ray Fischer \"Convictions are more dangerous enemies of truth\n",
"ray@netcom.com than lies.\" -- Friedrich Nietzsche\n",
"\n",
"0.3491800997095306\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"From: mb4008@cehp11 (Morgan J Bullard)\n",
"Subject: Re: speeding up windows\n",
"Keywords: speed\n",
"Organization: University of Illinois at Urbana\n",
"Lines: 30\n",
"\n",
"djserian@flash.LakeheadU.Ca (Reincarnation of Elvis) writes:\n",
"\n",
">I have a 386/33 with 8 megs of memory\n",
"\n",
">I have noticed that lately when I use programs like WpfW or Corel Draw\n",
">my computer \"boggs\" down and becomes really sluggish!\n",
"\n",
">What can I do to increase performance? What should I turn on or off\n",
"\n",
">Will not loading wallpapers or stuff like that help when it comes to\n",
">the running speed of windows and the programs that run under it?\n",
"\n",
">Thanx in advance\n",
"\n",
">Derek\n",
"\n",
"1) make sure your hard drive is defragmented. This will speed up more than \n",
" just windows BTW. Use something like Norton's or PC Tools.\n",
"2) I _think_ that leaving the wall paper out will use less RAM and therefore\n",
" will speed up your machine but I could very will be wrong on this.\n",
"There's a good chance you've already done this but if not it may speed things\n",
"up. good luck\n",
"\t\t\t\tMorgan Bullard mb4008@coewl.cen.uiuc.edu\n",
"\t\t\t\t\t or mjbb@uxa.cso.uiuc.edu\n",
"\n",
">--\n",
">$_ /|$Derek J.P. Serianni $ E-Mail : djserian@flash.lakeheadu.ca $ \n",
">$\\'o.O' $Sociologist $ It's 106 miles to Chicago,we've got a full tank$\n",
">$=(___)=$Lakehead University $ of gas, half a pack of cigarettes,it's dark,and$\n",
">$ U $Thunder Bay, Ontario$ we're wearing sunglasses. -Elwood Blues $ \n",
"\n",
"0.26949927393886913\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n"
]
}
],
"source": [
"for i in range (1,5):\n",
" print(newsgroups[similarities.argsort()[0][-i]])\n",
" print(np.sort(similarities)[0,-i])\n",
" print('-'*100)\n",
" print('-'*100)\n",
" print('-'*100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Zadanie domowe\n",
"\n",
"\n",
"- Wybrać zbiór tekstowy, który ma conajmniej 10000 dokumentów (inny niż w tym przykładzie).\n",
"- Na jego podstawie stworzyć wyszukiwarkę bazującą na OKAPI BM25, tzn. system który dla podanej frazy podaje kilka (5-10) posortowanych najbardziej pasujących dokumentów razem ze scorami. Należy wypisywać też ilość zwracanych dokumentów, czyli takich z niezerowym scorem. Można korzystać z gotowych bibliotek do wektoryzacji dokumentów, należy jednak samemu zaimplementować OKAPI BM25. \n",
"- Znaleźć frazę (query), dla której wynik nie jest satysfakcjonujący.\n",
"- Poprawić wyszukiwarkę (np. poprzez zmianę preprocessingu tekstu, wektoryzer, zmianę parametrów algorytmu rankującego lub sam algorytm) tak, żeby zwracała satysfakcjonujące wyniki dla poprzedniej frazy. Należy zrobić inną zmianę niż w tym przykładzie, tylko wymyślić coś własnego.\n",
"- prezentować pracę na następnych zajęciach (14.04) odpowiadając na pytania:\n",
" - jak wygląda zbiór i system wyszukiwania przed zmianami\n",
" - dla jakiej frazy wyniki są niesatysfakcjonujące (pokazać wyniki)\n",
" - jakie zmiany zostały naniesione\n",
" - jak wyglądają wyniki wyszukiwania po zmianach\n",
" - jak zmiany wpłynęły na wyniki (1-2 zdania)\n",
" \n",
"Prezentacja powinna być maksymalnie prosta i trwać maksymalnie 2-3 minuty.\n",
"punktów do zdobycia: 60\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -1,5 +1,19 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 4. <i>Wyszukiwarki</i> [ćwiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -81,11 +95,14 @@
}
],
"metadata": {
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
@ -96,8 +113,11 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
"version": "3.8.3"
},
"subtitle": "4.wyszukiwarki[ćwiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4

View File

@ -1,5 +1,19 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 5. <i>Ekstrakcja informacji z dokumentów</i> [ćwiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -213,11 +227,14 @@
}
],
"metadata": {
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
@ -229,7 +246,10 @@
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"subtitle": "5.ekEtrakcja informacji z dokumentCCow[ćwiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@ -1,5 +1,19 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 7. <i>Regresja liniowa</i> [ćwiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1046,11 +1060,14 @@
}
],
"metadata": {
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
@ -1061,8 +1078,11 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
"version": "3.8.3"
},
"subtitle": "7.Regresja liniowa[ćwiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4

View File

@ -1,5 +1,19 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 7. <i>Regresja liniowa</i> [ćwiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1354,11 +1368,14 @@
}
],
"metadata": {
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
@ -1369,8 +1386,11 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
"version": "3.8.3"
},
"subtitle": "7.Regresja liniowa[ćwiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4

View File

@ -1,5 +1,19 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 8. <i>Regresja logistyczna</i> [ćwiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1024,11 +1038,14 @@
}
],
"metadata": {
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
@ -1039,8 +1056,11 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
"version": "3.8.3"
},
"subtitle": "8.Regresja logistyczna[ćwiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4

View File

@ -1,5 +1,19 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 8. <i>Regresja logistyczna</i> [ćwiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -1216,11 +1230,14 @@
}
],
"metadata": {
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
@ -1231,8 +1248,11 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
"version": "3.8.3"
},
"subtitle": "8.Regresja logistyczna[ćwiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4

View File

@ -1,5 +1,19 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 9. <i>Sequence labeling</i> [ćwiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -12357,11 +12371,14 @@
}
],
"metadata": {
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
@ -12372,8 +12389,11 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
"version": "3.8.3"
},
"subtitle": "9.Sequence labeling[ćwiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4

View File

@ -1,5 +1,19 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 9. <i>Sequence labeling</i> [ćwiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -908,11 +922,14 @@
}
],
"metadata": {
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
@ -923,8 +940,11 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
"version": "3.8.3"
},
"subtitle": "9.Sequence labeling[ćwiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4

View File

@ -1,5 +1,19 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 10. <i>CRF</i> [ćwiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -404,11 +418,14 @@
}
],
"metadata": {
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
@ -419,8 +436,11 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
"version": "3.8.3"
},
"subtitle": "10.CRF[ćwiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4

View File

@ -1,5 +1,19 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 11. <i>NER RNN</i> [ćwiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -805,11 +819,14 @@
}
],
"metadata": {
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
@ -820,8 +837,11 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
"version": "3.8.3"
},
"subtitle": "11.NER RNN[ćwiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@ -1,271 +1,293 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### SIMILARITY SEARCH\n",
"1. zainstaluj faiss i zrób tutorial: https://github.com/facebookresearch/faiss\n",
"2. wczytaj treści artykułów z BBC News Train.csv\n",
"3. Użyj któregoś z transformerów (możesz użyć biblioteki sentence-transformers) do stworzenia embeddingów dokumentów\n",
"4. wczytaj embeddingi do bazy danych faiss\n",
"5. wyszukaj query 'consumer electronics market'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://www.kaggle.com/avishi/bbc-news-train-data"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import pickle\n",
"import numpy as np\n",
"import faiss\n",
"from sklearn.metrics import ndcg_score, dcg_score, average_precision_score"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: sentence-transformers in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (1.2.0)\n",
"Requirement already satisfied: sentencepiece in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (0.1.91)\n",
"Requirement already satisfied: torchvision in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (0.6.0)\n",
"Requirement already satisfied: scipy in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (1.4.1)\n",
"Requirement already satisfied: torch>=1.6.0 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (1.8.1)\n",
"Requirement already satisfied: tqdm in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (4.48.2)\n",
"Requirement already satisfied: scikit-learn in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (0.23.2)\n",
"Requirement already satisfied: nltk in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (3.5)\n",
"Requirement already satisfied: transformers<5.0.0,>=3.1.0 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (4.4.2)\n",
"Requirement already satisfied: numpy in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (1.20.3)\n",
"Requirement already satisfied: pillow>=4.1.1 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from torchvision->sentence-transformers) (8.0.1)\n",
"Requirement already satisfied: typing-extensions in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from torch>=1.6.0->sentence-transformers) (3.7.4.3)\n",
"Requirement already satisfied: joblib>=0.11 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from scikit-learn->sentence-transformers) (0.16.0)\n",
"Requirement already satisfied: threadpoolctl>=2.0.0 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from scikit-learn->sentence-transformers) (2.1.0)\n",
"Requirement already satisfied: click in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from nltk->sentence-transformers) (7.1.2)\n",
"Requirement already satisfied: regex in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from nltk->sentence-transformers) (2020.7.14)\n",
"Requirement already satisfied: sacremoses in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from transformers<5.0.0,>=3.1.0->sentence-transformers) (0.0.43)\n",
"Requirement already satisfied: packaging in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from transformers<5.0.0,>=3.1.0->sentence-transformers) (20.4)\n",
"Requirement already satisfied: filelock in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from transformers<5.0.0,>=3.1.0->sentence-transformers) (3.0.12)\n",
"Requirement already satisfied: tokenizers<0.11,>=0.10.1 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from transformers<5.0.0,>=3.1.0->sentence-transformers) (0.10.1)\n",
"Requirement already satisfied: requests in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from transformers<5.0.0,>=3.1.0->sentence-transformers) (2.24.0)\n",
"Requirement already satisfied: six in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sacremoses->transformers<5.0.0,>=3.1.0->sentence-transformers) (1.15.0)\n",
"Requirement already satisfied: pyparsing>=2.0.2 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from packaging->transformers<5.0.0,>=3.1.0->sentence-transformers) (2.4.7)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from requests->transformers<5.0.0,>=3.1.0->sentence-transformers) (2020.6.20)\n",
"Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from requests->transformers<5.0.0,>=3.1.0->sentence-transformers) (1.25.10)\n",
"Requirement already satisfied: idna<3,>=2.5 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from requests->transformers<5.0.0,>=3.1.0->sentence-transformers) (2.10)\n",
"Requirement already satisfied: chardet<4,>=3.0.2 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from requests->transformers<5.0.0,>=3.1.0->sentence-transformers) (3.0.4)\n"
]
}
],
"source": [
"!pip install sentence-transformers"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[-0.07142266 -0.07716199 -0.03047761 ... 0.01356028 -0.04016104\n",
" -0.02446149]\n",
" [-0.06508802 -0.06923407 -0.03735013 ... 0.01013562 -0.04027328\n",
" -0.02171571]]\n"
]
}
],
"source": [
"from sentence_transformers import SentenceTransformer\n",
"sentences = [\"Hello World\", \"Hallo Welt\"]\n",
"\n",
"model = SentenceTransformer('LaBSE')\n",
"embeddings = model.encode(sentences)\n",
"print(embeddings)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"r = pd.read_csv('BBC News Train.csv')"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"DOCUMENTS = list(r.Text)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"embeddings = model.encode(DOCUMENTS)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"embeddings = model.encode(list(r.Text))"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"QUERY_STR = 'consumer electronics market'"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"query = model.encode([QUERY_STR])"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"index = faiss.IndexFlatL2(embeddings.shape[1]) "
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"index.add(np.ascontiguousarray(embeddings))"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"D, I = index.search(query, 5) "
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1363, 1371, 898, 744, 292]])"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"I"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1.3110979, 1.4027181, 1.4045265, 1.4421673, 1.4421673]],\n",
" dtype=float32)"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"D"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'internet boom for gift shopping cyberspace is becoming a very popular destination for christmas shoppers. forecasts predict that british people will spend £4bn buying gifts online during the festive season an increase of 64% on 2003. surveys also show that the average amount that people are spending is rising as is the range of goods that they are happy to buy online. savvy shoppers are also using the net to find the hot presents that are all but sold out in high street stores. almost half of the uk population now shop online according to figures collected by the interactive media in retail group which represents web retailers. about 85% of this group 18m people expect to do a lot of their christmas gift buying online this year reports the industry group. on average each shopper will spend £220 and britons lead europe in their affection for online shopping. almost a third of all the money spent online this christmas will come out of british wallets and purses compared to 29% from german shoppers and only 4% from italian gift buyers. james roper director of the imrg said shoppers were now much happier to buy so-called big ticket items such as lcd television sets and digital cameras. mr roper added that many retailers were working hard to reassure consumers that online shopping was safe and that goods ordered as presents would arrive in time for christmas. he advised consumers to give shops a little more time than usual to fulfil orders given that online buying is proving so popular. a survey by hostway suggests that many men prefer to shop online to avoid the embarrassment of buying some types of presents such as lingerie for wives and girlfriends. much of this online shopping is likely to be done during work time according to research carried out by security firm saint bernard software. the research reveals that up to two working days will be lost by staff who do their shopping via their work computer. worst offenders will be those in the 18-35 age bracket suggests the research who will spend up to five hours per week in december browsing and buying at online shops. iggy fanlo chief revenue officer at shopping.com said that the growing numbers of people using broadband was driving interest in online shopping. when you consider narrowband and broadband the conversion to sale is two times higher he said. higher speeds meant that everything happened much faster he said which let people spend time browsing and finding out about products before they buy. the behaviour of online shoppers was also changing he said. the single biggest reason people went online before this year was price he said. the number one reason now is convenience. very few consumers click on the lowest price he said. they are looking for good prices and merchant reliability. consumer comments and reviews were also proving popular with shoppers keen to find out who had the most reliable customer service. data collected by ebay suggests that some smart shoppers are getting round the shortages of hot presents by buying them direct through the auction site. according to ebay uk there are now more than 150 robosapiens remote control robots for sale via the site. the robosapiens toy is almost impossible to find in online and offline stores. similarly many shoppers are turning to ebay to help them get hold of the hard-to-find slimline playstation 2 which many retailers are only selling as part of an expensive bundle. the high demand for the playstation 2 has meant that prices for it are being driven up. in shops the ps2 is supposed to sell for £104.99. in some ebay uk auctions the price has risen to more than double this figure. many people are also using ebay to get hold of gadgets not even released in this country. the portable version of the playstation has only just gone on sale in japan yet some enterprising ebay users are selling the device to uk gadget fans.'"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DOCUMENTS[1363]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 14. <i>Ekstrakcja informacji seq2seq</i> [\u0107wiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### SIMILARITY SEARCH\n",
"1. zainstaluj faiss i zr\u00f3b tutorial: https://github.com/facebookresearch/faiss\n",
"2. wczytaj tre\u015bci artyku\u0142\u00f3w z BBC News Train.csv\n",
"3. U\u017cyj kt\u00f3rego\u015b z transformer\u00f3w (mo\u017cesz u\u017cy\u0107 biblioteki sentence-transformers) do stworzenia embedding\u00f3w dokument\u00f3w\n",
"4. wczytaj embeddingi do bazy danych faiss\n",
"5. wyszukaj query 'consumer electronics market'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://www.kaggle.com/avishi/bbc-news-train-data"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import pickle\n",
"import numpy as np\n",
"import faiss\n",
"from sklearn.metrics import ndcg_score, dcg_score, average_precision_score"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: sentence-transformers in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (1.2.0)\n",
"Requirement already satisfied: sentencepiece in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (0.1.91)\n",
"Requirement already satisfied: torchvision in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (0.6.0)\n",
"Requirement already satisfied: scipy in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (1.4.1)\n",
"Requirement already satisfied: torch>=1.6.0 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (1.8.1)\n",
"Requirement already satisfied: tqdm in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (4.48.2)\n",
"Requirement already satisfied: scikit-learn in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (0.23.2)\n",
"Requirement already satisfied: nltk in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (3.5)\n",
"Requirement already satisfied: transformers<5.0.0,>=3.1.0 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (4.4.2)\n",
"Requirement already satisfied: numpy in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (1.20.3)\n",
"Requirement already satisfied: pillow>=4.1.1 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from torchvision->sentence-transformers) (8.0.1)\n",
"Requirement already satisfied: typing-extensions in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from torch>=1.6.0->sentence-transformers) (3.7.4.3)\n",
"Requirement already satisfied: joblib>=0.11 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from scikit-learn->sentence-transformers) (0.16.0)\n",
"Requirement already satisfied: threadpoolctl>=2.0.0 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from scikit-learn->sentence-transformers) (2.1.0)\n",
"Requirement already satisfied: click in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from nltk->sentence-transformers) (7.1.2)\n",
"Requirement already satisfied: regex in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from nltk->sentence-transformers) (2020.7.14)\n",
"Requirement already satisfied: sacremoses in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from transformers<5.0.0,>=3.1.0->sentence-transformers) (0.0.43)\n",
"Requirement already satisfied: packaging in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from transformers<5.0.0,>=3.1.0->sentence-transformers) (20.4)\n",
"Requirement already satisfied: filelock in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from transformers<5.0.0,>=3.1.0->sentence-transformers) (3.0.12)\n",
"Requirement already satisfied: tokenizers<0.11,>=0.10.1 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from transformers<5.0.0,>=3.1.0->sentence-transformers) (0.10.1)\n",
"Requirement already satisfied: requests in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from transformers<5.0.0,>=3.1.0->sentence-transformers) (2.24.0)\n",
"Requirement already satisfied: six in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sacremoses->transformers<5.0.0,>=3.1.0->sentence-transformers) (1.15.0)\n",
"Requirement already satisfied: pyparsing>=2.0.2 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from packaging->transformers<5.0.0,>=3.1.0->sentence-transformers) (2.4.7)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from requests->transformers<5.0.0,>=3.1.0->sentence-transformers) (2020.6.20)\n",
"Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from requests->transformers<5.0.0,>=3.1.0->sentence-transformers) (1.25.10)\n",
"Requirement already satisfied: idna<3,>=2.5 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from requests->transformers<5.0.0,>=3.1.0->sentence-transformers) (2.10)\n",
"Requirement already satisfied: chardet<4,>=3.0.2 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from requests->transformers<5.0.0,>=3.1.0->sentence-transformers) (3.0.4)\n"
]
}
],
"source": [
"!pip install sentence-transformers"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[-0.07142266 -0.07716199 -0.03047761 ... 0.01356028 -0.04016104\n",
" -0.02446149]\n",
" [-0.06508802 -0.06923407 -0.03735013 ... 0.01013562 -0.04027328\n",
" -0.02171571]]\n"
]
}
],
"source": [
"from sentence_transformers import SentenceTransformer\n",
"sentences = [\"Hello World\", \"Hallo Welt\"]\n",
"\n",
"model = SentenceTransformer('LaBSE')\n",
"embeddings = model.encode(sentences)\n",
"print(embeddings)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"r = pd.read_csv('BBC News Train.csv')"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"DOCUMENTS = list(r.Text)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"embeddings = model.encode(DOCUMENTS)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"embeddings = model.encode(list(r.Text))"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"QUERY_STR = 'consumer electronics market'"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"query = model.encode([QUERY_STR])"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"index = faiss.IndexFlatL2(embeddings.shape[1]) "
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"index.add(np.ascontiguousarray(embeddings))"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"D, I = index.search(query, 5) "
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1363, 1371, 898, 744, 292]])"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"I"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1.3110979, 1.4027181, 1.4045265, 1.4421673, 1.4421673]],\n",
" dtype=float32)"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"D"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'internet boom for gift shopping cyberspace is becoming a very popular destination for christmas shoppers. forecasts predict that british people will spend \u00a34bn buying gifts online during the festive season an increase of 64% on 2003. surveys also show that the average amount that people are spending is rising as is the range of goods that they are happy to buy online. savvy shoppers are also using the net to find the hot presents that are all but sold out in high street stores. almost half of the uk population now shop online according to figures collected by the interactive media in retail group which represents web retailers. about 85% of this group 18m people expect to do a lot of their christmas gift buying online this year reports the industry group. on average each shopper will spend \u00a3220 and britons lead europe in their affection for online shopping. almost a third of all the money spent online this christmas will come out of british wallets and purses compared to 29% from german shoppers and only 4% from italian gift buyers. james roper director of the imrg said shoppers were now much happier to buy so-called big ticket items such as lcd television sets and digital cameras. mr roper added that many retailers were working hard to reassure consumers that online shopping was safe and that goods ordered as presents would arrive in time for christmas. he advised consumers to give shops a little more time than usual to fulfil orders given that online buying is proving so popular. a survey by hostway suggests that many men prefer to shop online to avoid the embarrassment of buying some types of presents such as lingerie for wives and girlfriends. much of this online shopping is likely to be done during work time according to research carried out by security firm saint bernard software. the research reveals that up to two working days will be lost by staff who do their shopping via their work computer. worst offenders will be those in the 18-35 age bracket suggests the research who will spend up to five hours per week in december browsing and buying at online shops. iggy fanlo chief revenue officer at shopping.com said that the growing numbers of people using broadband was driving interest in online shopping. when you consider narrowband and broadband the conversion to sale is two times higher he said. higher speeds meant that everything happened much faster he said which let people spend time browsing and finding out about products before they buy. the behaviour of online shoppers was also changing he said. the single biggest reason people went online before this year was price he said. the number one reason now is convenience. very few consumers click on the lowest price he said. they are looking for good prices and merchant reliability. consumer comments and reviews were also proving popular with shoppers keen to find out who had the most reliable customer service. data collected by ebay suggests that some smart shoppers are getting round the shortages of hot presents by buying them direct through the auction site. according to ebay uk there are now more than 150 robosapiens remote control robots for sale via the site. the robosapiens toy is almost impossible to find in online and offline stores. similarly many shoppers are turning to ebay to help them get hold of the hard-to-find slimline playstation 2 which many retailers are only selling as part of an expensive bundle. the high demand for the playstation 2 has meant that prices for it are being driven up. in shops the ps2 is supposed to sell for \u00a3104.99. in some ebay uk auctions the price has risen to more than double this figure. many people are also using ebay to get hold of gadgets not even released in this country. the portable version of the playstation has only just gone on sale in japan yet some enterprising ebay users are selling the device to uk gadget fans.'"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DOCUMENTS[1363]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"lang": "pl",
"subtitle": "14.Ekstrakcja informacji seq2seq[\u0107wiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}

File diff suppressed because one or more lines are too long