This commit is contained in:
Jakub Pokrywka 2021-10-05 15:04:58 +02:00
parent 0f34dcdeb4
commit 3c0223d434
19 changed files with 7039 additions and 28796 deletions

View File

@ -1,112 +1,103 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 0. <i>Informacje na temat przedmiotu</i> [\u0107wiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Informacje og\u00f3lne"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Kontakt z prowadz\u0105cym\n",
"\n",
"prowadz\u0105cy: mgr in\u017c. Jakub Pokrywka\n",
"\n",
"Najlepiej kontaktow\u0105\u0107 si\u0119 ze mn\u0105 przez MS TEAMS na grupie kana\u0142u (og\u00f3lne sprawy) lub w prywatnych wiadomo\u015bciach. Odpisuj\u0119 co 2-3 dni. Mo\u017cna te\u017c um\u00f3wi\u0107 si\u0119 na zdzwonko w godzinach dy\u017curu (wt 12.00-13.00) lub um\u00f3wi\u0107 si\u0119 w innym terminie.\n",
"\n",
"\n",
"## Literatura\n",
"Polecana literatura do przedmiotu:\n",
"\n",
"\n",
"- https://www.manning.com/books/relevant-search#toc (darmowa) Polecam chocia\u017c przejrze\u0107.\n",
"- Marie-Francine Moens. 2006. Information Extraction: Algorithms and Prospects in a Retrieval Context. Springer. (polecam mniej, jest troch\u0119 nieaktualna)\n",
"- Alex Graves. 2012. Supervised sequence labelling. Studies in Computational Intelligence, vol 385. Springer. Berlin, Heidelberg. \n",
"\n",
"- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. North American Association for Computational Linguistics (NAACL). \n",
"\n",
"- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research vol 21, number 140, pages 1-67. \n",
"\n",
"- Flip Grali\u0144ski, Tomasz Stanis\u0142awek, Anna Wr\u00f3blewska, Dawid Lipi\u0144ski, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, Przemys\u0142aw Biecek. 2020. Kleister: A novel task for information extraction involving long documents with complex layout. URL https://arxiv.org/abs/2003.02356 \n",
"\n",
"- \u0141ukasz Garncarek, Rafa\u0142 Powalski, Tomasz Stanis\u0142awek, Bartosz Topolski, Piotr Halama, Filip Grali\u0144ski. 2020. LAMBERT: Layout-Aware (Language) Modeling using BERT. URL https://arxiv.org/pdf/2002.08087 \n",
"\n",
"## Zaliczenie\n",
"\n",
"\n",
"\n",
"Do zdobycia b\u0119dzie conajmniej 600 punkt\u00f3w.\n",
"\n",
"Ocena:\n",
"\n",
"- -299 \u2014 2\n",
"\n",
"- 300-349 \u2014 3\n",
"\n",
"- 350-399 \u2014 3+\n",
"\n",
"- 400-449 \u2014 4\n",
"\n",
"- 450\u2014499 \u2014 4+\n",
"\n",
"- 500- \u2014 5\n",
"\n",
"\n",
"**\u017beby zaliczy\u0107 przedmiot nale\u017cy pojawia\u0107 si\u0119 na laboratoriach. Maksymalna liczba nieobecno\u015bci to 3. Obecno\u015b\u0107 b\u0119d\u0119 sprawdza\u0142 poprzez panel MS TEAMS, czyli b\u0119d\u0119 sprawdza\u0142 czy kto\u015b jest wdzwoniony na \u0107wiczenia. Je\u017celi kogo\u015b nie b\u0119dzie wi\u0119cej ni\u017c 3 razy, to nie b\u0119dzie mia\u0142 zaliczonego przedmiotu** \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"lang": "pl",
"subtitle": "0.Informacje na temat przedmiotu[\u0107wiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 0. <i>Informacje na temat przedmiotu</i> [ćwiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Informacje ogólne"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Kontakt z prowadzącym\n",
"\n",
"prowadzący: mgr inż. Jakub Pokrywka\n",
"\n",
"Najlepiej kontaktowąć się ze mną przez MS TEAMS na grupie kanału (ogólne sprawy) lub w prywatnych wiadomościach. Odpisuję co 2-3 dni. Można też umówić się na zdzwonko w godzinach dyżuru (wt 12.00-13.00) lub umówić się w innym terminie.\n",
"\n",
"\n",
"## Literatura\n",
"Polecana literatura do przedmiotu:\n",
"\n",
"\n",
"- https://www.manning.com/books/relevant-search#toc (darmowa) Polecam chociaż przejrzeć.\n",
"- Marie-Francine Moens. 2006. Information Extraction: Algorithms and Prospects in a Retrieval Context. Springer. (polecam mniej, jest trochę nieaktualna)\n",
"- Alex Graves. 2012. Supervised sequence labelling. Studies in Computational Intelligence, vol 385. Springer. Berlin, Heidelberg. \n",
"\n",
"- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. North American Association for Computational Linguistics (NAACL). \n",
"\n",
"- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research vol 21, number 140, pages 1-67. \n",
"\n",
"- Flip Graliński, Tomasz Stanisławek, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, Przemysław Biecek. 2020. Kleister: A novel task for information extraction involving long documents with complex layout. URL https://arxiv.org/abs/2003.02356 \n",
"\n",
"- Łukasz Garncarek, Rafał Powalski, Tomasz Stanisławek, Bartosz Topolski, Piotr Halama, Filip Graliński. 2020. LAMBERT: Layout-Aware (Language) Modeling using BERT. URL https://arxiv.org/pdf/2002.08087 \n",
"\n",
"## Zaliczenie\n",
"\n",
"\n",
"\n",
"Do zdobycia będzie conajmniej 600 punktów.\n",
"\n",
"Ocena:\n",
"\n",
"- -299 — 2\n",
"\n",
"- 300-349 — 3\n",
"\n",
"- 350-399 — 3+\n",
"\n",
"- 400-449 — 4\n",
"\n",
"- 450—499 — 4+\n",
"\n",
"- 500- — 5\n",
"\n",
"\n",
"**Żeby zaliczyć przedmiot należy pojawiać się na laboratoriach. Maksymalna liczba nieobecności to 3. Obecność będę sprawdzał poprzez panel MS TEAMS, czyli będę sprawdzał czy ktoś jest wdzwoniony na ćwiczenia. Jeżeli kogoś nie będzie więcej niż 3 razy, to nie będzie miał zaliczonego przedmiotu** \n"
]
}
],
"metadata": {
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"subtitle": "0.Informacje na temat przedmiotu[ćwiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -276,13 +276,6 @@
"67. [Instytut Techniki Górniczej - wycinki](http://www.komag.gliwice.pl/archiwum/historia-komag)\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {

File diff suppressed because it is too large Load Diff

View File

@ -1,91 +1,89 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 3. <i>tfidf (1)</i> [\u0107wiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"def word_to_index(word):\n",
" vec = np.zeros(len(vocabulary))\n",
" if word in vocabulary:\n",
" idx = vocabulary.index(word)\n",
" vec[idx] = 1\n",
" else:\n",
" vec[-1] = 1\n",
" return vec"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def tf(document):\n",
" document_vector = None\n",
" for word in document:\n",
" if document_vector is None:\n",
" document_vector = word_to_index(word)\n",
" else:\n",
" document_vector += word_to_index(word)\n",
" return document_vector"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def similarity(query, document):\n",
" numerator = np.sum(query * document)\n",
" denominator = np.sqrt(np.sum(query*query)) * np.sqrt(np.sum(document*document)) \n",
" return numerator / denominator"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"lang": "pl",
"subtitle": "3.tfidf (1)[\u0107wiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 3. <i>tfidf (1)</i> [ćwiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"def word_to_index(word):\n",
" vec = np.zeros(len(vocabulary))\n",
" if word in vocabulary:\n",
" idx = vocabulary.index(word)\n",
" vec[idx] = 1\n",
" else:\n",
" vec[-1] = 1\n",
" return vec"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def tf(document):\n",
" document_vector = None\n",
" for word in document:\n",
" if document_vector is None:\n",
" document_vector = word_to_index(word)\n",
" else:\n",
" document_vector += word_to_index(word)\n",
" return document_vector"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def similarity(query, document):\n",
" numerator = np.sum(query * document)\n",
" denominator = np.sqrt(np.sum(query*query)) * np.sqrt(np.sum(document*document)) \n",
" return numerator / denominator"
]
}
],
"metadata": {
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"subtitle": "3.tfidf (1)[ćwiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}

File diff suppressed because it is too large Load Diff

View File

@ -85,13 +85,6 @@
" * proszę zaznaczyć w MS TEAMS, że Państwo zrobili zadanie w assigments\n",
" * zdawanie zadania będzie na zajęciach. Proszę przygotować prezentację do 5 minut"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@ -30,20 +30,11 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages/gensim/similarities/__init__.py:15: UserWarning: The gensim.similarities.levenshtein submodule is disabled, because the optional Levenshtein package <https://pypi.org/project/python-Levenshtein/> is unavailable. Install Levenhstein (e.g. `pip install python-Levenshtein`) to suppress this warning.\n",
" warnings.warn(msg)\n"
]
}
],
"outputs": [],
"source": [
"import numpy as np\n",
"import gensim\n",

View File

@ -30,20 +30,11 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages/gensim/similarities/__init__.py:15: UserWarning: The gensim.similarities.levenshtein submodule is disabled, because the optional Levenshtein package <https://pypi.org/project/python-Levenshtein/> is unavailable. Install Levenhstein (e.g. `pip install python-Levenshtein`) to suppress this warning.\n",
" warnings.warn(msg)\n"
]
}
],
"outputs": [],
"source": [
"import numpy as np\n",
"import gensim\n",

File diff suppressed because one or more lines are too long

View File

@ -23,18 +23,9 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages/gensim/similarities/__init__.py:15: UserWarning: The gensim.similarities.levenshtein submodule is disabled, because the optional Levenshtein package <https://pypi.org/project/python-Levenshtein/> is unavailable. Install Levenhstein (e.g. `pip install python-Levenshtein`) to suppress this warning.\n",
" warnings.warn(msg)\n"
]
}
],
"outputs": [],
"source": [
"import numpy as np\n",
"import gensim\n",
@ -60,19 +51,11 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"metadata": {
"scrolled": false
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Reusing dataset conll2003 (/home/kuba/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/40e7cb6bcc374f7c349c83acd1e9352a4f09474eb691f64f364ee62eb65d0ca6)\n"
]
}
],
"outputs": [],
"source": [
"dataset = load_dataset(\"conll2003\")"
]
@ -432,227 +415,11 @@
},
{
"cell_type": "code",
"execution_count": 26,
"execution_count": null,
"metadata": {
"scrolled": true
"scrolled": false
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "75184b632ce54ae690b3444778f44651",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=14041.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "74a55a414fa948a3b251b89f780564d0",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=3250.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"(0.5068524970963996, 0.5072649075903755, 0.5070586184860281)\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "0c9c580076fb4ec48b7ea2f300878594",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=14041.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "0be8681c67f64aca95ce5d3c44f10538",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=3250.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"(0.653649243957614, 0.6381494827385795, 0.6458063757205035)\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "2dec403004bb4ae298bc73553ea3f4bc",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=14041.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "eebed0407ba343e29cf8c2d607f631dc",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=3250.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"(0.7140486069946651, 0.7001046146693014, 0.7070078647728607)\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "70792f22eea343c8916bcfcf9215c298",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=14041.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "5d400bf1b656433ba2091cf750ec2d78",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=3250.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"(0.756327964151629, 0.725909566430315, 0.7408066429418744)\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "604c4fa13c03435d81bf68be37977d74",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=14041.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "2f78871f366f4fd1b7de6c4be5303906",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=3250.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"(0.7963248522230789, 0.7203301174009067, 0.7564235581324383)\n"
]
}
],
"outputs": [],
"source": [
"for i in range(NUM_EPOCHS):\n",
" lstm.train()\n",

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@ -1,293 +1,212 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 14. <i>Ekstrakcja informacji seq2seq</i> [\u0107wiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### SIMILARITY SEARCH\n",
"1. zainstaluj faiss i zr\u00f3b tutorial: https://github.com/facebookresearch/faiss\n",
"2. wczytaj tre\u015bci artyku\u0142\u00f3w z BBC News Train.csv\n",
"3. U\u017cyj kt\u00f3rego\u015b z transformer\u00f3w (mo\u017cesz u\u017cy\u0107 biblioteki sentence-transformers) do stworzenia embedding\u00f3w dokument\u00f3w\n",
"4. wczytaj embeddingi do bazy danych faiss\n",
"5. wyszukaj query 'consumer electronics market'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://www.kaggle.com/avishi/bbc-news-train-data"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import pickle\n",
"import numpy as np\n",
"import faiss\n",
"from sklearn.metrics import ndcg_score, dcg_score, average_precision_score"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: sentence-transformers in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (1.2.0)\n",
"Requirement already satisfied: sentencepiece in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (0.1.91)\n",
"Requirement already satisfied: torchvision in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (0.6.0)\n",
"Requirement already satisfied: scipy in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (1.4.1)\n",
"Requirement already satisfied: torch>=1.6.0 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (1.8.1)\n",
"Requirement already satisfied: tqdm in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (4.48.2)\n",
"Requirement already satisfied: scikit-learn in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (0.23.2)\n",
"Requirement already satisfied: nltk in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (3.5)\n",
"Requirement already satisfied: transformers<5.0.0,>=3.1.0 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (4.4.2)\n",
"Requirement already satisfied: numpy in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sentence-transformers) (1.20.3)\n",
"Requirement already satisfied: pillow>=4.1.1 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from torchvision->sentence-transformers) (8.0.1)\n",
"Requirement already satisfied: typing-extensions in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from torch>=1.6.0->sentence-transformers) (3.7.4.3)\n",
"Requirement already satisfied: joblib>=0.11 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from scikit-learn->sentence-transformers) (0.16.0)\n",
"Requirement already satisfied: threadpoolctl>=2.0.0 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from scikit-learn->sentence-transformers) (2.1.0)\n",
"Requirement already satisfied: click in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from nltk->sentence-transformers) (7.1.2)\n",
"Requirement already satisfied: regex in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from nltk->sentence-transformers) (2020.7.14)\n",
"Requirement already satisfied: sacremoses in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from transformers<5.0.0,>=3.1.0->sentence-transformers) (0.0.43)\n",
"Requirement already satisfied: packaging in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from transformers<5.0.0,>=3.1.0->sentence-transformers) (20.4)\n",
"Requirement already satisfied: filelock in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from transformers<5.0.0,>=3.1.0->sentence-transformers) (3.0.12)\n",
"Requirement already satisfied: tokenizers<0.11,>=0.10.1 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from transformers<5.0.0,>=3.1.0->sentence-transformers) (0.10.1)\n",
"Requirement already satisfied: requests in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from transformers<5.0.0,>=3.1.0->sentence-transformers) (2.24.0)\n",
"Requirement already satisfied: six in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from sacremoses->transformers<5.0.0,>=3.1.0->sentence-transformers) (1.15.0)\n",
"Requirement already satisfied: pyparsing>=2.0.2 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from packaging->transformers<5.0.0,>=3.1.0->sentence-transformers) (2.4.7)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from requests->transformers<5.0.0,>=3.1.0->sentence-transformers) (2020.6.20)\n",
"Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from requests->transformers<5.0.0,>=3.1.0->sentence-transformers) (1.25.10)\n",
"Requirement already satisfied: idna<3,>=2.5 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from requests->transformers<5.0.0,>=3.1.0->sentence-transformers) (2.10)\n",
"Requirement already satisfied: chardet<4,>=3.0.2 in /media/kuba/ssdsam/anaconda3/lib/python3.8/site-packages (from requests->transformers<5.0.0,>=3.1.0->sentence-transformers) (3.0.4)\n"
]
}
],
"source": [
"!pip install sentence-transformers"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[-0.07142266 -0.07716199 -0.03047761 ... 0.01356028 -0.04016104\n",
" -0.02446149]\n",
" [-0.06508802 -0.06923407 -0.03735013 ... 0.01013562 -0.04027328\n",
" -0.02171571]]\n"
]
}
],
"source": [
"from sentence_transformers import SentenceTransformer\n",
"sentences = [\"Hello World\", \"Hallo Welt\"]\n",
"\n",
"model = SentenceTransformer('LaBSE')\n",
"embeddings = model.encode(sentences)\n",
"print(embeddings)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"r = pd.read_csv('BBC News Train.csv')"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"DOCUMENTS = list(r.Text)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"embeddings = model.encode(DOCUMENTS)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"embeddings = model.encode(list(r.Text))"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"QUERY_STR = 'consumer electronics market'"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"query = model.encode([QUERY_STR])"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"index = faiss.IndexFlatL2(embeddings.shape[1]) "
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"index.add(np.ascontiguousarray(embeddings))"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"D, I = index.search(query, 5) "
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1363, 1371, 898, 744, 292]])"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"I"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1.3110979, 1.4027181, 1.4045265, 1.4421673, 1.4421673]],\n",
" dtype=float32)"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"D"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'internet boom for gift shopping cyberspace is becoming a very popular destination for christmas shoppers. forecasts predict that british people will spend \u00a34bn buying gifts online during the festive season an increase of 64% on 2003. surveys also show that the average amount that people are spending is rising as is the range of goods that they are happy to buy online. savvy shoppers are also using the net to find the hot presents that are all but sold out in high street stores. almost half of the uk population now shop online according to figures collected by the interactive media in retail group which represents web retailers. about 85% of this group 18m people expect to do a lot of their christmas gift buying online this year reports the industry group. on average each shopper will spend \u00a3220 and britons lead europe in their affection for online shopping. almost a third of all the money spent online this christmas will come out of british wallets and purses compared to 29% from german shoppers and only 4% from italian gift buyers. james roper director of the imrg said shoppers were now much happier to buy so-called big ticket items such as lcd television sets and digital cameras. mr roper added that many retailers were working hard to reassure consumers that online shopping was safe and that goods ordered as presents would arrive in time for christmas. he advised consumers to give shops a little more time than usual to fulfil orders given that online buying is proving so popular. a survey by hostway suggests that many men prefer to shop online to avoid the embarrassment of buying some types of presents such as lingerie for wives and girlfriends. much of this online shopping is likely to be done during work time according to research carried out by security firm saint bernard software. the research reveals that up to two working days will be lost by staff who do their shopping via their work computer. worst offenders will be those in the 18-35 age bracket suggests the research who will spend up to five hours per week in december browsing and buying at online shops. iggy fanlo chief revenue officer at shopping.com said that the growing numbers of people using broadband was driving interest in online shopping. when you consider narrowband and broadband the conversion to sale is two times higher he said. higher speeds meant that everything happened much faster he said which let people spend time browsing and finding out about products before they buy. the behaviour of online shoppers was also changing he said. the single biggest reason people went online before this year was price he said. the number one reason now is convenience. very few consumers click on the lowest price he said. they are looking for good prices and merchant reliability. consumer comments and reviews were also proving popular with shoppers keen to find out who had the most reliable customer service. data collected by ebay suggests that some smart shoppers are getting round the shortages of hot presents by buying them direct through the auction site. according to ebay uk there are now more than 150 robosapiens remote control robots for sale via the site. the robosapiens toy is almost impossible to find in online and offline stores. similarly many shoppers are turning to ebay to help them get hold of the hard-to-find slimline playstation 2 which many retailers are only selling as part of an expensive bundle. the high demand for the playstation 2 has meant that prices for it are being driven up. in shops the ps2 is supposed to sell for \u00a3104.99. in some ebay uk auctions the price has risen to more than double this figure. many people are also using ebay to get hold of gadgets not even released in this country. the portable version of the playstation has only just gone on sale in japan yet some enterprising ebay users are selling the device to uk gadget fans.'"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DOCUMENTS[1363]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"lang": "pl",
"subtitle": "14.Ekstrakcja informacji seq2seq[\u0107wiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 14. <i>Ekstrakcja informacji seq2seq</i> [ćwiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### SIMILARITY SEARCH\n",
"1. zainstaluj faiss i zrób tutorial: https://github.com/facebookresearch/faiss\n",
"2. wczytaj treści artykułów z BBC News Train.csv\n",
"3. Użyj któregoś z transformerów (możesz użyć biblioteki sentence-transformers) do stworzenia embeddingów dokumentów\n",
"4. wczytaj embeddingi do bazy danych faiss\n",
"5. wyszukaj query 'consumer electronics market'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://www.kaggle.com/avishi/bbc-news-train-data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import pickle\n",
"import numpy as np\n",
"import faiss\n",
"from sklearn.metrics import ndcg_score, dcg_score, average_precision_score"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"!pip install sentence-transformers"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sentence_transformers import SentenceTransformer\n",
"sentences = [\"Hello World\", \"Hallo Welt\"]\n",
"\n",
"model = SentenceTransformer('LaBSE')\n",
"embeddings = model.encode(sentences)\n",
"print(embeddings)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"r = pd.read_csv('BBC News Train.csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"DOCUMENTS = list(r.Text)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"embeddings = model.encode(DOCUMENTS)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"embeddings = model.encode(list(r.Text))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"QUERY_STR = 'consumer electronics market'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = model.encode([QUERY_STR])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"index = faiss.IndexFlatL2(embeddings.shape[1]) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"index.add(np.ascontiguousarray(embeddings))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"D, I = index.search(query, 5) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"I"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"D"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"DOCUMENTS[1363]"
]
}
],
"metadata": {
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"subtitle": "14.Ekstrakcja informacji seq2seq[ćwiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}

File diff suppressed because one or more lines are too long