KWT-2024/lab/lab_04-05.ipynb

2024-04-13 08:20:53 +02:00
{
"cells": [
{
"cell_type": "markdown",
"id": "going-morocco",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Computer-Assisted Translation </h1>\n",
"<h2> 4, 5. <i>Topic classification (terminology, continued)</i> [lab]</h2>\n",
"<h3>Rafał Jaworski (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"id": "expanded-entrance",
"metadata": {},
"source": [
"# Computer-Assisted Translation"
]
},
{
"cell_type": "markdown",
"id": "atlantic-greenhouse",
"metadata": {},
"source": [
"# Classes 4 and 5 - topic classification (terminology, continued)"
]
},
{
"cell_type": "markdown",
"id": "colored-nothing",
"metadata": {},
"source": [
"In the previous class we built our own terminology extractor. We also discussed how important it is to extract specialized terms. Today we look at how to pull out of a text the terms that really are specialized."
]
},
{
"cell_type": "markdown",
"id": "economic-pontiac",
"metadata": {},
"source": [
"Why might our solution so far fail to meet this condition? Let's do the following exercise:"
]
},
{
"cell_type": "markdown",
"id": "sealed-november",
"metadata": {},
"source": [
"### Exercise 1: Collect an English-language corpus of at least 100 documents, each containing at least 100 sentences. Use the site https://opus.nlpl.eu/. Ideally the documents should come from different domains (e.g. European Union law, programming manuals, medicine). Save the downloaded corpus on your local disk; do not attach it to this notebook."
]
},
2024-04-14 18:45:52 +02:00
{
"cell_type": "code",
"execution_count": 1,
"id": "ced0f120",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"100\n"
]
}
],
"source": [
"import os\n",
"\n",
"documents_files = os.listdir(\"./data/corpus\")\n",
"documents_files = [d for d in documents_files if d.endswith(\".txt\")]\n",
"documents_files = sorted(documents_files)\n",
"\n",
"documents = []\n",
"for document_file in documents_files:\n",
"    with open(f\"./data/corpus/{document_file}\", \"r\", encoding=\"utf-8\") as f:\n",
"        document_text = f.read()\n",
"\n",
"    # Limit text to 100 lines\n",
"    document_text = \"\\n\".join(document_text.split(\"\\n\")[:100])\n",
"\n",
"    documents.append(document_text)\n",
"\n",
"print(len(documents))"
]
},
{
"cell_type": "markdown",
"id": "closed-steel",
"metadata": {},
"source": [
"Such a corpus lets us observe what happens when terminology extraction relies on the frequency criterion alone. To run the experiment, we need the extractor from the previous class."
]
},
{
"cell_type": "markdown",
"id": "environmental-thread",
"metadata": {},
"source": [
"### Exercise 2: Run the terminology extractor (noun detector) from the previous class on each document separately. For each document, print the 5 most frequent nouns as the extractor's result. Include the output of the command in the notebook."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "honest-assessment",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1/100] ['project', 'victims', 'support', 'visit', 'mediation']\n",
"[2/100] ['exhibition', 'cooperation', 'year', 'meeting', 'films']\n",
"[3/100] ['exhibition', 'cooperation', 'year', 'meeting', 'films']\n",
"[4/100] ['solution', 'occupation', 'settlement', 'problem', 'resolutions']\n",
"[5/100] ['residence', 'citizens', 'permit', 'security', 'citizen']\n",
"[6/100] ['residence', 'citizens', 'permit', 'security', 'citizen']\n",
"[7/100] ['support', 'measures', 'countries', 'farmers', 'member']\n",
"[8/100] ['data', 'services', 'infrastructure', 'development', 'project']\n",
"[9/100] ['data', 'services', 'infrastructure', 'development', 'project']\n",
"[10/100] ['photographs', 'service', 'scans', 'materials', 'films']\n",
"[11/100] ['photographs', 'service', 'scans', 'materials', 'films']\n",
"[12/100] ['insurance', 'ZUS', 'contributions', 'benefits', 'administration']\n",
"[13/100] ['project', 'archaeology', 'research', 'conservation', 'history']\n",
"[14/100] ['project', 'archaeology', 'research', 'conservation', 'history']\n",
"[15/100] ['cases', '%', 'coronavirus', 'countries', 'disease']\n",
"[16/100] ['%', 'year', 'case', 'cases', 'coronavirus']\n",
"[17/100] ['ship', 'tug', 'speed', 'accident', 'course']\n",
"[18/100] ['ship', 'tug', 'speed', 'accident', 'course']\n",
"[19/100] ['work', 'scientists', 'research', 'science', 'telomerase']\n",
"[20/100] ['work', 'scientists', 'research', 'science', 'telomerase']\n",
"[21/100] ['film', 'media', 'part', 'time', 'efforts']\n",
"[22/100] ['film', 'media', 'part', 'time', 'efforts']\n",
"[23/100] ['insurance', 'ZUS', 'contributions', 'benefits', 'administration']\n",
"[24/100] ['use', 'care', 'stewardship', 'resistance', 'antibiotics']\n",
"[25/100] ['services', 'administration', 'state', 'information', 'e']\n",
"[26/100] ['services', 'administration', 'state', 'information', 'e']\n",
"[27/100] ['coronavirus', 'research', 'measures', 'outbreak', 'member']\n",
"[28/100] ['residence', 'card', 'foreigner', 'work', 'permit']\n",
"[29/100] ['security', 'e', 'threats', 'policy', 'gas']\n",
"[30/100] ['security', 'e', 'threats', 'policy', 'gas']\n",
"[31/100] ['paper', '15th', 'reader', 'file', 'date']\n",
"[32/100] ['paper', '15th', 'reader', 'file', 'date']\n",
"[33/100] ['costs', 'implementation', 'management', 'tasks', 'expenditures']\n",
"[34/100] ['food', 'cooperation', 'products', 'market', 'agri']\n",
"[35/100] ['costs', 'implementation', 'management', 'tasks', 'expenditures']\n",
"[36/100] ['costs', 'implementation', 'management', 'tasks', 'expenditures']\n",
"[37/100] ['artist', 'work', 'painting', 'paintings', 'time']\n",
"[38/100] ['artist', 'work', 'painting', 'paintings', 'time']\n",
"[39/100] ['Home', '»', 'rights', 'representatives', 'discrimination']\n",
"[40/100] ['Home', '»', 'rights', 'representatives', 'discrimination']\n",
"[41/100] ['command', 'documentation', 'alias', 'files', 'directory']\n",
"[42/100] ['water', 'basis', 'land', 'status', 'item']\n",
"[43/100] ['water', 'basis', 'land', 'status', 'item']\n",
"[44/100] ['%', 'contract', 'contracts', '.', 'No']\n",
"[45/100] ['food', 'cooperation', 'products', 'market', 'agri']\n",
"[46/100] ['%', 'contract', 'contracts', '.', 'No']\n",
"[47/100] ['market', 'level', 'services', 'age', 'companies']\n",
"[48/100] ['market', 'level', 'services', 'age', 'companies']\n",
"[49/100] ['projects', 'innovation', 'R&D', 'development', 'companies']\n",
"[50/100] ['projects', 'innovation', 'R&D', 'development', 'companies']\n",
"[51/100] ['contracts', 'contract', '%', 'item', 'procedures']\n",
"[52/100] ['contracts', 'contract', '%', 'item', 'procedures']\n",
"[53/100] ['room', 'A', 'office', 'information', 'B']\n",
"[54/100] ['room', 'A', 'office', 'information', 'B']\n",
"[55/100] ['advantage', 'production', 'country', 'countries', 'goods']\n",
"[56/100] ['measles', 'vaccine', 'disease', 'person', 'people']\n",
"[57/100] ['advantage', 'production', 'country', 'countries', 'goods']\n",
"[58/100] ['card', 'residence', 'permission', 'business', 'stamp']\n",
"[59/100] ['card', 'residence', 'permission', 'business', 'stamp']\n",
"[60/100] ['w', '%', 'gospodarczego', 'polityki', 'publicznych']\n",
"[61/100] ['system', 'banks', 'stability', 'risk', 'sector']\n",
"[62/100] ['camps', 'people', 'concentration', 'policy', 'resistance']\n",
"[63/100] ['camps', 'people', 'concentration', 'policy', 'resistance']\n",
"[64/100] ['safety', 'aviation', 'management', 'requirements', 'entity']\n",
"[65/100] ['safety', 'aviation', 'management', 'requirements', 'entity']\n",
"[66/100] ['research', 'call', 'philosophy', 'information', 'project']\n",
"[67/100] ['vaccination', 'pertussis', 'cancer', 'risk', 'disease']\n",
"[68/100] ['research', 'call', 'philosophy', 'information', 'project']\n",
"[69/100] ['energy', 'gas', '%', 'oil', 'countries']\n",
"[70/100] ['energy', 'gas', '%', 'oil', 'countries']\n",
"[71/100] ['cooperation', 'meeting', 'talks', 'forces', 'defence']\n",
"[72/100] ['project', 'education', 'information', 'coronavirus', 'funding']\n",
"[73/100] ['food', 'education', 'project', 'measures', 'assistance']\n",
"[74/100] ['infection', 'disease', 'symptoms', 'fever', 'humans']\n",
"[75/100] ['energy', 'audit', 'costs', 'use', 'management']\n",
"[76/100] ['countries', '%', 'development', 'benefits', 'funds']\n",
"[77/100] ['years', 'minister', 'year', 'rector', 'persons']\n",
"[78/100] ['water', 'food', 'fish', 'times', 'year']\n",
"[79/100] ['land', 'water', 'population', 'data', 'age']\n",
"[80/100] ['land', 'water', 'population', 'data', 'age']\n",
"[81/100] ['market', 'labour', 'crisis', 'unemployment', 'countries']\n",
"[82/100] ['market', 'labour', 'crisis', 'unemployment', 'countries']\n",
"[83/100] ['accelerator', 'research', '-', 'operation', 'model']\n",
"[84/100] ['accelerator', 'research', '-', 'operation', 'model']\n",
"[85/100] ['energy', 'policy', 'power', 'development', 'objectives']\n",
"[86/100] ['priest', 'hand', 'country', 'wedding', 'church']\n",
"[87/100] ['eggs', 'breakfast', 'food', 'products', 'meat']\n",
"[88/100] ['eggs', 'breakfast', 'food', 'products', 'meat']\n",
"[89/100] ['water', 'fish', 'times', 'food', 'year']\n",
"[90/100] ['honey', 'production', 'bread', 'time', 'taste']\n",
"[91/100] ['honey', 'production', 'bread', 'time', 'taste']\n",
"[92/100] ['data', 'job', 'portal', 'vacancies', 'Decision']\n",
"[93/100] ['data', 'job', 'portal', 'vacancies', 'Decision']\n",
"[94/100] ['food', 'quality', 'products', 'apples', 'farmers']\n",
"[95/100] ['food', 'quality', 'products', 'apples', 'farmers']\n",
"[96/100] ['visa', 'activities', 'child', 'B-1', 'institution']\n",
"[97/100] ['visa', 'activities', 'child', 'B-1', 'institution']\n",
"[98/100] ['-', 'co', 'preparations', 'operation', 'preparation']\n",
"[99/100] ['-', 'co', 'preparations', 'operation', 'preparation']\n",
"[100/100] ['project', 'victims', 'support', 'visit', 'mediation']\n"
]
}
],
"source": [
"import spacy\n",
"from collections import Counter\n",
"\n",
"nlp = spacy.load(\"en_core_web_sm\")\n",
"\n",
"def get_nouns(text):\n",
"    # Keep the surface form of every token that spaCy tags as a common noun.\n",
"    doc = nlp(text)\n",
"    return [token.text for token in doc if token.pos_ == 'NOUN']\n",
"\n",
"def get_top_nouns(nouns, n=5):\n",
"    # Rank nouns by raw frequency within a single document.\n",
"    return [noun for noun, _ in Counter(nouns).most_common(n)]\n",
"\n",
"top_nouns = []\n",
"for i, document in enumerate(documents):\n",
"    nouns = get_nouns(document)\n",
"    top_nouns.append(get_top_nouns(nouns, 5))\n",
"    print(f\"[{i+1}/{len(documents)}] {top_nouns[-1]}\")\n",
"\n",
"with open(\"./data/top_nouns.txt\", \"w\", encoding=\"utf-8\") as f:\n",
"    for nouns in top_nouns:\n",
"        f.write(\" \".join(nouns) + \"\\n\")"
]
},
{
"cell_type": "markdown",
"id": "desirable-appeal",
"metadata": {},
"source": [
"Are results obtained this way always specialized terms? Unfortunately, the output may contain nouns that are simply frequent in the language and not necessarily characteristic of the texts we are processing (in the output above, for example, `year`, `time` and `work`). To get better extraction results, we need more refined methods."
]
},
{
"cell_type": "markdown",
"id": "accomplished-interview",
"metadata": {},
"source": [
"One such method is a technique known from the discipline of Information Retrieval called TF-IDF. Its name comes from **T**erm **F**requency **I**nverse **D**ocument **F**requency. With this method, we compute a TF-IDF score for every term we find and then sort the results in descending order of that score."
]
},
{
"cell_type": "markdown",
"id": "green-september",
"metadata": {},
"source": [
"How do we compute the TF-IDF score? What is TF, and what is IDF?"
]
},
{
"cell_type": "markdown",
"id": "several-gardening",
"metadata": {},
"source": [
"Let's start with TF, because we already know this factor. It is simply the frequency of the term in the text we are processing. The idea of TF-IDF centers on the second factor, IDF. The word *inverse* means that this factor is inverted, i.e. it goes into the denominator. TF-IDF is therefore in essence:\n",
"$\\frac{TF}{DF}$"
]
},
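The ratio above is the simplest form of the idea. A common variant, and the one the solution to Exercise 3 later in this notebook computes, damps the inverse with a logarithm. The symbols below are notation assumed only for this sketch: $N$ is the number of documents in the corpus, $\mathrm{tf}(t, d)$ is the relative frequency of term $t$ in document $d$, and $\mathrm{df}(t)$ is the number of documents containing $t$.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}
```

With this form, a term that occurs in every document gets $\log(N/N) = 0$, so it falls to the bottom of the ranking no matter how often it occurs.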
{
"cell_type": "markdown",
"id": "collect-brooklyn",
"metadata": {},
"source": [
"So what is document frequency? It is the number of documents in which a given term occurs. Documents here are the units into which the corpus we work on is divided (exactly like the corpus from Exercise 1)."
]
},
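The definition of document frequency can be sketched on a toy example. The corpus below is hypothetical and already tokenized, and the function name `document_frequency` is chosen only for this illustration; a real corpus would first go through a tokenizer or the noun extractor.

```python
from collections import Counter

# Toy corpus (hypothetical): each document is already a list of tokens.
toy_corpus = [
    ["contract", "year", "court", "contract"],
    ["vaccine", "year", "disease"],
    ["command", "file", "year", "directory"],
]


def document_frequency(corpus):
    """Return a Counter mapping each term to the number of documents containing it."""
    df = Counter()
    for tokens in corpus:
        # set() ensures a term is counted once per document,
        # no matter how many times it occurs there.
        df.update(set(tokens))
    return df


df = document_frequency(toy_corpus)
print(df["year"])      # 3: occurs in all three documents
print(df["contract"])  # 1: occurs in one document, although twice within it
```

Counting over `set(tokens)` is what separates document frequency from plain term frequency: repetitions inside a single document do not raise the count.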
{
"cell_type": "markdown",
"id": "urban-spray",
"metadata": {},
"source": [
"Let's think about what this factor means. Remember that our task is to extract terms from only one document at a time. We do, however, have many other documents at our disposal, containing many other words and terms. The TF-IDF value is higher the more often a term occurs in the document we are extracting from. It decreases, however, if the word occurs in many different documents. Popular words will therefore have a high DF and a low TF-IDF. The highest TF-IDF values go to terms that are frequent in the document being processed but do not occur anywhere else."
]
},
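The reasoning above can be checked on a small worked example. This is a minimal sketch under stated assumptions: the corpus is hypothetical and pre-tokenized, the helper name `tfidf_scores` is invented for illustration, and the log-scaled variant of IDF is used instead of the plain ratio.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): three tokenized "documents" from different domains.
toy_corpus = [
    ["contract", "year", "court", "contract", "year"],
    ["vaccine", "year", "disease", "vaccine"],
    ["command", "file", "year", "directory"],
]


def tfidf_scores(tokens, corpus):
    """Score each distinct term of one document against the whole corpus.

    Assumes `tokens` is itself one of the documents in `corpus`,
    so every term has a document frequency of at least 1.
    """
    counts = Counter(tokens)
    scores = {}
    for term, count in counts.items():
        tf = count / len(tokens)                      # relative frequency in this document
        df = sum(1 for doc in corpus if term in doc)  # number of documents containing the term
        scores[term] = tf * math.log(len(corpus) / df)
    return scores


scores = tfidf_scores(toy_corpus[0], toy_corpus)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['contract', 'court', 'year']
```

In the first document `contract` and `year` are equally frequent, but `year` occurs in every document, so its score is zero and it ends up last, while `contract` ranks first.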
{
"cell_type": "markdown",
"id": "worse-throat",
"metadata": {},
"source": [
"### Exercise 3: Implement the TF-IDF factor and use it to extract terminology from the corpus from Exercise 1. Do the results differ from those obtained with TF alone?"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "published-speaking",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1/100] ['approval', 'total', 'lawyers', 'priorities', 'judges']\n",
"[2/100] ['agriculture', 'support', 'guests', 'offers', 'author']\n",
"[3/100] ['agriculture', 'support', 'guests', 'offers', 'author']\n",
"[4/100] ['homeland', 'invasion', 'address', 'prisoners', 'sources']\n",
"[5/100] ['identity', 'positions', 'elaboration', 'issues', 'terms']\n",
"[6/100] ['identity', 'positions', 'elaboration', 'issues', 'terms']\n",
"[7/100] ['distancing', 'lenders', 'mechanism', 'check', 'part']\n",
"[8/100] ['IT', 'Realization', 'Services', 'resolutions', 'bases']\n",
"[9/100] ['IT', 'Realization', 'Services', 'resolutions', 'bases']\n",
"[10/100] ['occupation', 'scans', 'browser', 'Service', 'processes']\n",
"[11/100] ['occupation', 'scans', 'browser', 'Service', 'processes']\n",
"[12/100] ['am', 'war', 'month', 'Insurance', 'centralisation']\n",
"[13/100] ['conservation', 'zu', 'provisions', 'basin', 'record']\n",
"[14/100] ['conservation', 'zu', 'provisions', 'basin', 'record']\n",
"[15/100] ['culture', 'city', 'abscesses', 'aeronautics', 'disruptors']\n",
"[16/100] ['infection', 'Recommendations', 'man', 'evening', 'occurrence']\n",
"[17/100] ['course', 'hull', 'STATE', 'classifier', 'certificate']\n",
"[18/100] ['course', 'hull', 'STATE', 'classifier', 'certificate']\n",
"[19/100] ['cooling', 'work', 'culture', 'part', 'laboratory']\n",
"[20/100] ['cooling', 'work', 'culture', 'part', 'laboratory']\n",
"[21/100] ['culture', 'reverse', 'advisor', 'documentary', 'service']\n",
"[22/100] ['culture', 'reverse', 'advisor', 'documentary', 'service']\n",
"[23/100] ['am', 'war', 'month', 'Insurance', 'centralisation']\n",
"[24/100] ['pressure', 'ability', 'entry', 'prescribers', 'costs']\n",
"[25/100] ['economies', 'management', 'role', 'disk', 'stakeholders']\n",
"[26/100] ['economies', 'management', 'role', 'disk', 'stakeholders']\n",
"[27/100] ['traders', 'fears', 'carriers', 'illness', 'distancing']\n",
"[28/100] ['activity', 'employment', 'foreigners', 'Visa', 'graduate']\n",
"[29/100] ['defense', 'forecast', 'quarter', 'factors', 'opportunity']\n",
"[30/100] ['defense', 'forecast', 'quarter', 'factors', 'opportunity']\n",
"[31/100] ['case', 'author', 'screen', 'announcement', 'typefaces']\n",
"[32/100] ['case', 'author', 'screen', 'announcement', 'typefaces']\n",
"[33/100] ['revenue', 'office', 'premises', 'o', 'proposals']\n",
"[34/100] ['storage', 'completion', 'efforts', 'Meeting', 'crisis']\n",
"[35/100] ['office', 'Types', 'premises', 'protection', 'days']\n",
"[36/100] ['revenue', 'office', 'premises', 'o', 'proposals']\n",
"[37/100] ['pictures', 'splashing', 'dobrze', 'viewer', 'culture']\n",
"[38/100] ['pictures', 'splashing', 'dobrze', 'viewer', 'culture']\n",
"[39/100] ['creation', 'origin', 'discrimination', 'interest', 'institutions']\n",
"[40/100] ['creation', 'origin', 'discrimination', 'interest', 'institutions']\n",
"[41/100] ['names', 'contexts', 'calculator', 'program', 'descriptor']\n",
"[42/100] ['periods', 'standards', 'total', 'name', 'property']\n",
"[43/100] ['periods', 'standards', 'total', 'name', 'property']\n",
"[44/100] ['Art', 'days', 'liability', 'authorities', 'services']\n",
"[45/100] ['storage', 'completion', 'efforts', 'Meeting', 'crisis']\n",
"[46/100] ['Art', 'days', 'liability', 'authorities', 'services']\n",
"[47/100] ['skills', 'provision', 'country', 'economies', 'science']\n",
"[48/100] ['skills', 'provision', 'country', 'economies', 'science']\n",
"[49/100] ['Project', 'possibilities', 'cancer', 'members', 'therapies']\n",
"[50/100] ['Project', 'possibilities', 'cancer', 'members', 'therapies']\n",
"[51/100] ['price', 'auction', 'actions', 'telecommunications', 'appointment']\n",
"[52/100] ['price', 'auction', 'actions', 'telecommunications', 'appointment']\n",
"[53/100] ['records', 'coffee', 'authorisation', 'line', 'times']\n",
"[54/100] ['records', 'coffee', 'authorisation', 'line', 'times']\n",
"[55/100] ['example', 'manner', 'source', 'essence', 'identification']\n",
"[56/100] ['defences', 'vaccines', 'days', 'spread', 'body']\n",
"[57/100] ['example', 'manner', 'source', 'essence', 'identification']\n",
"[58/100] ['servants', 'employees', 'Possession', 'insurance', 'examinations']\n",
"[59/100] ['servants', 'employees', 'Possession', 'insurance', 'examinations']\n",
"[60/100] ['systemowe', 'dopiero', 'system', 'latach', 'popytem']\n",
"[61/100] ['efficiency', 'problems', 'uncertainty', 'improvement', 'Risk']\n",
"[62/100] ['uprising', 'borders', 'rights', 'security', 'campaign']\n",
"[63/100] ['uprising', 'borders', 'rights', 'security', 'campaign']\n",
"[64/100] ['part', 'audits', 'Responsibilities', 'services', 'authority']\n",
"[65/100] ['protection', 'competence', 'version', 'occurrence', 'requisition']\n",
"[66/100] ['Requirements', 'members', 'methodology', 'data', 'database']\n",
"[67/100] ['whoop', 'substitute', 'cause', 'exposure', 'course']\n",
"[68/100] ['Requirements', 'members', 'methodology', 'data', 'database']\n",
"[69/100] ['erent', 'decisions', 'SOURCES', 'spectrum', 'economies']\n",
"[70/100] ['erent', 'decisions', 'SOURCES', 'spectrum', 'economies']\n",
"[71/100] ['invitation', 'effects', 'help', 'armament', 'round']\n",
"[72/100] ['area', 'teaching', 'tax', 'time', 'travel']\n",
"[73/100] ['time', 'Recommendation', 'participants', 'guarantees', 'work']\n",
"[74/100] ['toxin', 'mechanisms', 'attacks', 'Babies', 'therapies']\n",
"[75/100] ['production', 'replacement', 'control', 'SMEs', 'audit']\n",
"[76/100] ['significance', 'net', 'ground', 'participants', 'levels']\n",
"[77/100] ['functioning', 'consultation', 'interest', 'expert', 'procedures']\n",
"[78/100] ['thing', 'mercury', 'eggs', 'municipality', 'lunch']\n",
"[79/100] ['agriculture', 'R', 'result', 'development', 'prices']\n",
"[80/100] ['agriculture', 'R', 'result', 'development', 'prices']\n",
"[81/100] ['reflection', 'basis', 'sources', 'points', 'results']\n",
"[82/100] ['reflection', 'basis', 'sources', 'points', 'results']\n",
"[83/100] ['leaders', 'reach', 'author', 'features', 'publications']\n",
"[84/100] ['leaders', 'reach', 'author', 'features', 'publications']\n",
"[85/100] ['consumption', 'Improvement', 'bodies', 'level', 'need']\n",
"[86/100] ['money', 'delirium', 'advice', 'house', 'couple']\n",
"[87/100] ['work', 'thanks', 'BEgINNINg', 'range', 'funds']\n",
"[88/100] ['work', 'thanks', 'BEgINNINg', 'range', 'funds']\n",
"[89/100] ['option', 'eggs', 'dinner', 'wine', 'quantities']\n",
"[90/100] ['seeds', 'mead', 'event', 'maples', 'approach']\n",
"[91/100] ['seeds', 'mead', 'event', 'maples', 'approach']\n",
"[92/100] ['case', 'complaints', 'consultation', 'Employers', 'actions']\n",
"[93/100] ['case', 'complaints', 'consultation', 'Employers', 'actions']\n",
"[94/100] ['activity', 'fruit', 'indications', 'zation', 'rice']\n",
"[95/100] ['activity', 'fruit', 'indications', 'zation', 'rice']\n",
"[96/100] ['building', 'work', 'premises', 'Food', 'child']\n",
"[97/100] ['building', 'work', 'premises', 'Food', 'child']\n",
"[98/100] ['virtue', 'works', 'culture', 'sectors', 'others']\n",
"[99/100] ['virtue', 'works', 'culture', 'sectors', 'others']\n",
"[100/100] ['approval', 'total', 'lawyers', 'priorities', 'judges']\n"
]
}
],
"source": [
"import math\n",
"\n",
"\n",
"def tfidf_extract():\n",
"    def tf(word, document):\n",
"        # Relative frequency of the word among the document's tokens.\n",
"        return document.count(word) / len(document)\n",
"\n",
"    def idf(word, documents):\n",
"        # Log-scaled inverse of the number of documents containing the word.\n",
"        num_documents_with_word = sum(1 for document in documents if word in document)\n",
"        if num_documents_with_word == 0:\n",
"            return 0\n",
"        return math.log(len(documents) / num_documents_with_word)\n",
"\n",
"    def tfidf(word, document, documents):\n",
"        tf_output = tf(word, document)\n",
"        idf_output = idf(word, documents)\n",
"        return tf_output * idf_output\n",
"\n",
"    split_documents = [document.split() for document in documents]\n",
"    top_special_nouns = []\n",
"    for i, document in enumerate(split_documents):\n",
"        nouns = get_nouns(\" \".join(document))\n",
"        # Score each distinct noun once, then rank by score (ties broken alphabetically).\n",
"        # Converting the sorted list to a set, as before, would discard the ranking.\n",
"        scores = {noun: tfidf(noun, document, split_documents) for noun in set(nouns)}\n",
"        ranked = sorted(scores, key=lambda noun: (-scores[noun], noun))\n",
"        top_special_nouns.append(ranked[:5])\n",
"        print(f\"[{i+1}/{len(documents)}] {top_special_nouns[-1]}\")\n",
"\n",
"    with open(\"./data/top_nouns_tfidf.txt\", \"w\", encoding=\"utf-8\") as f:\n",
"        for nouns in top_special_nouns:\n",
"            f.write(\" \".join(nouns) + \"\\n\")\n",
"\n",
"tfidf_extract()"
]
},
{
"cell_type": "markdown",
"id": "handmade-square",
"metadata": {},
"source": [
"We can now extract terms from documents in a better way. Let's try something more spectacular: generate a so-called word cloud from a text using the WordCloud library, for an article from BBC News (https://www.bbc.com/news/world-europe-56530714):"
]
},
{
"cell_type": "markdown",
"id": "solar-particular",
"metadata": {},
"source": [
"`pip install wordcloud`"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "monetary-wages",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<PIL.Image.Image image mode=RGB size=400x200>"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from wordcloud import WordCloud\n",
"text = \"\"\"\"This is where it happened,\" says Felipe Luis Codesal, opening the gate to a three-hectare field on his farm in Zamora, north-west Spain.\n",
"\n",
"One night last November, a pack of wolves got through the fence surrounding the field and attacked Mr Codesal's sheep, many of which were pregnant. When he arrived the next morning, he found 11 animals had been killed. Over the following days, he says, another 36 sheep died from injuries sustained in that attack and miscarriages it triggered.\n",
"\n",
"Mr Codesal fears that such attacks will become even more commonplace if a proposed change to laws protecting the Iberian wolf comes into force.\n",
"\n",
"The leftist coalition government plans to prevent the Iberian wolf from being hunted anywhere by categorising it as an endangered species. The reform is yet to be implemented and could see changes.\n",
"\n",
"Iberian wolves from the Iberian Wolf Centre in Robledo de Sanabria on February 21, 2020 in Zamora, Spain\n",
"image captionSpain has Europe's biggest wolf population: These Iberian wolves are kept at Zamora's Iberian wolf centre\n",
"\"It's like in a nightclub when there's a fire,\" says Mr Codesal of the wolf attack. \"There's a stampede and people get trodden on and hurt. This is the same.\"\n",
"\n",
"He was not entitled to any compensation and estimates that the financial losses he suffered from this incident totalled around €12-14,000\n",
"\n",
"\"It's not even about the money,\" he says. \"It's emotional, because the animals are part of my family.\"\n",
"\n",
"A 'historic' change?\n",
"The region of Castilla y León is the habitat for most of Spain's wolves. Figures gathered by the local government showed that they killed 3,774 sheep and cows in the region in 2019.\n",
"\n",
"Felipe Luis Codesal's farm is just north of the Duero river, which marks a natural border between north-west Spain and the rest of the country. Until now, it has been legal to hunt wolves north of the Duero, under a strict quota system, because that is where they are most prevalent.\n",
"\n",
"South of the river they have been protected.\n",
"\n",
"Conservationist groups have welcomed the government plan. When it was unveiled in February, the Ecologistas en Acción organisation hailed it as a \"historic day\".\n",
"\n",
"But Mr Codesal, who is a member of the UPA association of smallholder farmers, warns the reform will ruin livestock owners by allowing the wolf population to spiral out of control and roam uncontrolled. The UPA is unconvinced by measures included in the plan to subsidise the installation of fences and the use of guard dogs in livestock farming areas.\n",
"\n",
"Biggest wolf numbers in Europe\n",
"The Iberian wolf was close to being wiped out in the middle of the 20th Century. But it enjoyed a resurgence on the back of new hunting regulations introduced in the 1970s and the migration of Spaniards away from rural areas also encouraged its spread down from the north-western corner of the country.\n",
"\n",
"In recent years, wolves have moved into areas such as the Guadarrama mountains north of Madrid and near the city of Ávila, to the west of the capital.\n",
"\n",
"There are now some 2,500 Iberian wolves: around 2,000 are in Spain - the largest wolf population in western Europe - and the rest in Portugal.\n",
"\"\"\"\n",
"\n",
"wordcloud = WordCloud(background_color=\"white\", max_words=5000, contour_width=3, contour_color='steelblue')\n",
"wordcloud.generate(text)\n",
"wordcloud.to_image()"
]
},
{
"cell_type": "markdown",
"id": "third-eagle",
"metadata": {},
"source": [
"### Exercise 4: Generate a word cloud for the entire corpus from Exercise 1."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "electrical-disposition",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<PIL.Image.Image image mode=RGB size=400x200>"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wordcloud = WordCloud(background_color=\"white\", max_words=5000, contour_width=3, contour_color='steelblue')\n",
"wordcloud.generate(\" \".join(documents))\n",
"wordcloud.to_image()"
]
},
{
"cell_type": "markdown",
"id": "objective-possession",
"metadata": {},
"source": [
"Let us consider one more question: how can these terms be grouped by domain? This task is known as topic classification (topic modeling), and it can be carried out automatically. The method used here is LDA (**L**atent **D**irichlet **A**llocation). It is named after the Dirichlet distribution, which in turn bears the name of the 19th-century German mathematician Peter Gustav Lejeune Dirichlet."
]
},
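{
"cell_type": "markdown",
"id": "lda-generative-sketch",
"metadata": {},
"source": [
"As a rough orientation, the generative model that LDA assumes can be sketched as follows. This is the commonly cited textbook formulation (Blei, Ng and Jordan, 2003), not something derived in this notebook. The symbols $K$ (number of topics), $D$ (number of documents) and the Dirichlet hyperparameters $\\alpha$ and $\\beta$ are notation assumed here purely for illustration.\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
"\\varphi_k &\\sim \\mathrm{Dirichlet}(\\beta), & k &= 1, \\dots, K \\\\\n",
"\\theta_d &\\sim \\mathrm{Dirichlet}(\\alpha), & d &= 1, \\dots, D \\\\\n",
"z_{d,n} &\\sim \\mathrm{Categorical}(\\theta_d) \\\\\n",
"w_{d,n} &\\sim \\mathrm{Categorical}(\\varphi_{z_{d,n}})\n",
"\\end{aligned}\n",
"$$\n",
"\n",
"Here $\\theta_d$ is the topic mixture of document $d$, $\\varphi_k$ is the word distribution of topic $k$, $z_{d,n}$ is the topic assigned to the $n$-th word of document $d$, and $w_{d,n}$ is the word actually observed. Fitting the model works in the opposite direction: given only the words, it estimates $\\theta$ and $\\varphi$, and the highest-probability words of each $\\varphi_k$ are what a topic listing typically prints."
]
},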
{
"cell_type": "markdown",
"id": "intellectual-gothic",
"metadata": {},
"source": [
"### Exercise 5: Work through the tutorial available at https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0. Paste the results into the notebook."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "76e8308a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['lab_06-07.ipynb', 'lab_01.ipynb', 'lab_03.ipynb', 'lab_09-10.ipynb', 'lab_02.ipynb', 'lab_13-14.ipynb', 'img', 'lab_11.ipynb', 'lab_08.ipynb', '.gitignore', 'lab_04-05.ipynb', 'lab_15.ipynb', 'lab_12.ipynb', 'data']\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>year</th>\n",
" <th>title</th>\n",
" <th>event_type</th>\n",
" <th>pdf_name</th>\n",
" <th>abstract</th>\n",
" <th>paper_text</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>1987</td>\n",
" <td>Self-Organization of Associative Database and ...</td>\n",
" <td>NaN</td>\n",
" <td>1-self-organization-of-associative-database-an...</td>\n",
" <td>Abstract Missing</td>\n",
" <td>767\\n\\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>10</td>\n",
" <td>1987</td>\n",
" <td>A Mean Field Theory of Layer IV of Visual Cort...</td>\n",
" <td>NaN</td>\n",
" <td>10-a-mean-field-theory-of-layer-iv-of-visual-c...</td>\n",
" <td>Abstract Missing</td>\n",
" <td>683\\n\\nA MEAN FIELD THEORY OF LAYER IV OF VISU...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>100</td>\n",
" <td>1988</td>\n",
" <td>Storing Covariance by the Associative Long-Ter...</td>\n",
" <td>NaN</td>\n",
" <td>100-storing-covariance-by-the-associative-long...</td>\n",
" <td>Abstract Missing</td>\n",
" <td>394\\n\\nSTORING COVARIANCE BY THE ASSOCIATIVE\\n...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1000</td>\n",
" <td>1994</td>\n",
" <td>Bayesian Query Construction for Neural Network...</td>\n",
" <td>NaN</td>\n",
" <td>1000-bayesian-query-construction-for-neural-ne...</td>\n",
" <td>Abstract Missing</td>\n",
" <td>Bayesian Query Construction for Neural\\nNetwor...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1001</td>\n",
" <td>1994</td>\n",
" <td>Neural Network Ensembles, Cross Validation, an...</td>\n",
" <td>NaN</td>\n",
" <td>1001-neural-network-ensembles-cross-validation...</td>\n",
" <td>Abstract Missing</td>\n",
" <td>Neural Network Ensembles, Cross\\nValidation, a...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id year title event_type \\\n",
"0 1 1987 Self-Organization of Associative Database and ... NaN \n",
"1 10 1987 A Mean Field Theory of Layer IV of Visual Cort... NaN \n",
"2 100 1988 Storing Covariance by the Associative Long-Ter... NaN \n",
"3 1000 1994 Bayesian Query Construction for Neural Network... NaN \n",
"4 1001 1994 Neural Network Ensembles, Cross Validation, an... NaN \n",
"\n",
" pdf_name abstract \\\n",
"0 1-self-organization-of-associative-database-an... Abstract Missing \n",
"1 10-a-mean-field-theory-of-layer-iv-of-visual-c... Abstract Missing \n",
"2 100-storing-covariance-by-the-associative-long... Abstract Missing \n",
"3 1000-bayesian-query-construction-for-neural-ne... Abstract Missing \n",
"4 1001-neural-network-ensembles-cross-validation... Abstract Missing \n",
"\n",
" paper_text \n",
"0 767\\n\\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA... \n",
"1 683\\n\\nA MEAN FIELD THEORY OF LAYER IV OF VISU... \n",
"2 394\\n\\nSTORING COVARIANCE BY THE ASSOCIATIVE\\n... \n",
"3 Bayesian Query Construction for Neural\\nNetwor... \n",
"4 Neural Network Ensembles, Cross\\nValidation, a... "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Importing modules\n",
"import pandas as pd\n",
"import os\n",
"# Read data into papers\n",
"print(os.listdir(\".\"))\n",
"papers = pd.read_csv('./data/NIPS Papers/papers.csv')\n",
"# Print head\n",
"papers.head()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "e0be0994",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>title</th>\n",
" <th>abstract</th>\n",
" <th>paper_text</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>4172</th>\n",
" <td>2012</td>\n",
" <td>Learning Manifolds with K-Means and K-Flats</td>\n",
" <td>We study the problem of estimating a manifold ...</td>\n",
" <td>Learning Manifolds with K-Means and K-Flats\\n\\...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3535</th>\n",
" <td>2011</td>\n",
" <td>Unifying Framework for Fast Learning Rate of N...</td>\n",
" <td>In this paper, we give a new generalization er...</td>\n",
" <td>Unifying Framework for Fast Learning Rate of\\n...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6206</th>\n",
" <td>2017</td>\n",
" <td>Hunt For The Unique, Stable, Sparse And Fast F...</td>\n",
" <td>For the purpose of learning on graphs, we hunt...</td>\n",
" <td>Hunt For The Unique, Stable, Sparse And Fast\\n...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>1994</td>\n",
" <td>Multidimensional Scaling and Data Clustering</td>\n",
" <td>Abstract Missing</td>\n",
" <td>Multidimensional Scaling and Data Clustering\\n...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5119</th>\n",
" <td>2015</td>\n",
" <td>Convolutional Neural Networks with Intra-Layer...</td>\n",
" <td>Scene labeling is a challenging computer visio...</td>\n",
" <td>Convolutional Neural Networks with Intra-layer...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" year title \\\n",
"4172 2012 Learning Manifolds with K-Means and K-Flats \n",
"3535 2011 Unifying Framework for Fast Learning Rate of N... \n",
"6206 2017 Hunt For The Unique, Stable, Sparse And Fast F... \n",
"11 1994 Multidimensional Scaling and Data Clustering \n",
"5119 2015 Convolutional Neural Networks with Intra-Layer... \n",
"\n",
" abstract \\\n",
"4172 We study the problem of estimating a manifold ... \n",
"3535 In this paper, we give a new generalization er... \n",
"6206 For the purpose of learning on graphs, we hunt... \n",
"11 Abstract Missing \n",
"5119 Scene labeling is a challenging computer visio... \n",
"\n",
" paper_text \n",
"4172 Learning Manifolds with K-Means and K-Flats\\n\\... \n",
"3535 Unifying Framework for Fast Learning Rate of\\n... \n",
"6206 Hunt For The Unique, Stable, Sparse And Fast\\n... \n",
"11 Multidimensional Scaling and Data Clustering\\n... \n",
"5119 Convolutional Neural Networks with Intra-layer... "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Drop the metadata columns that are not needed and keep a random sample of 100 papers\n",
"papers = papers.drop(columns=['id', 'event_type', 'pdf_name']).sample(100)\n",
"# Print out the first rows of papers\n",
"papers.head()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "181076d0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4172 learning manifolds with k-means and k-flats\\n\\...\n",
"3535 unifying framework for fast learning rate of\\n...\n",
"6206 hunt for the unique stable sparse and fast\\nfe...\n",
"11 multidimensional scaling and data clustering\\n...\n",
"5119 convolutional neural networks with intra-layer...\n",
"Name: paper_text_processed, dtype: object"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Load the regular expression library\n",
"import re\n",
"# Remove punctuation (a raw string avoids an invalid-escape warning for the pattern)\n",
"papers['paper_text_processed'] = \\\n",
"papers['paper_text'].map(lambda x: re.sub(r'[,.!?]', '', x))\n",
"# Convert the paper texts to lowercase\n",
"papers['paper_text_processed'] = \\\n",
"papers['paper_text_processed'].map(lambda x: x.lower())\n",
"# Print out the first rows of papers\n",
"papers['paper_text_processed'].head()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "f0fa1187",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<PIL.Image.Image image mode=RGB size=400x200>"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Import the wordcloud library\n",
"from wordcloud import WordCloud\n",
"# Join the processed paper texts into one long string.\n",
"long_string = ','.join(list(papers['paper_text_processed'].values))\n",
"# Create a WordCloud object\n",
"wordcloud = WordCloud(background_color=\"white\", max_words=5000, contour_width=3, contour_color='steelblue')\n",
"# Generate a word cloud\n",
"wordcloud.generate(long_string)\n",
"# Visualize the word cloud\n",
"wordcloud.to_image()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "8366f3b2",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package stopwords to\n",
"[nltk_data] /Users/patrykbart/nltk_data...\n",
"[nltk_data] Package stopwords is already up-to-date!\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"['learning', 'manifolds', 'means', 'flats', 'guillermo', 'canas', 'tomaso', 'poggio', 'lorenzo', 'rosasco', 'laboratory', 'computational', 'statistical', 'learning', 'mit', 'iit', 'cbcl', 'mcgovern', 'institute', 'massachusetts', 'institute', 'technology', 'guilledc', 'mitedu', 'tp', 'aimitedu', 'lrosasco', 'mitedu', 'abstract', 'study']\n"
]
}
],
"source": [
"from gensim.utils import simple_preprocess\n",
"import nltk\n",
"nltk.download('stopwords')\n",
"from nltk.corpus import stopwords\n",
"stop_words = stopwords.words('english')\n",
"stop_words.extend(['from', 'subject', 're', 'edu', 'use'])\n",
"def sent_to_words(sentences):\n",
"    for sentence in sentences:\n",
"        # deacc=True strips accent marks; the tokenization itself drops punctuation\n",
"        yield simple_preprocess(str(sentence), deacc=True)\n",
"def remove_stopwords(texts):\n",
"    return [[word for word in simple_preprocess(str(doc))\n",
"             if word not in stop_words] for doc in texts]\n",
"data = papers.paper_text_processed.values.tolist()\n",
"data_words = list(sent_to_words(data))\n",
"# remove stop words\n",
"data_words = remove_stopwords(data_words)\n",
"print(data_words[:1][0][:30])"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "65913a29",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(0, 3), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 8), (7, 1), (8, 2), (9, 3), (10, 1), (11, 1), (12, 2), (13, 1), (14, 1), (15, 1), (16, 3), (17, 1), (18, 1), (19, 12), (20, 7), (21, 1), (22, 1), (23, 2), (24, 4), (25, 2), (26, 1), (27, 1), (28, 1), (29, 1)]\n"
]
}
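],
"source": [
"import gensim.corpora as corpora\n",
"\n",
"# Build the dictionary that maps each unique token to an integer id\n",
"id2word = corpora.Dictionary(data_words)\n",
"# Tokenized documents used to build the corpus\n",
"texts = data_words\n",
"# Convert each document to a bag of words: a list of (token_id, count) pairs\n",
"corpus = [id2word.doc2bow(text) for text in texts]\n",
"# Preview the first 30 (token_id, count) pairs of the first document\n",
"print(corpus[0][:30])"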
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "d60d43b9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(0,\n",
" '0.006*\"learning\" + 0.005*\"model\" + 0.005*\"data\" + 0.004*\"function\" + '\n",
" '0.004*\"set\" + 0.004*\"using\" + 0.004*\"number\" + 0.004*\"neural\" + 0.004*\"one\" '\n",
" '+ 0.003*\"error\"'),\n",
" (1,\n",
" '0.008*\"learning\" + 0.006*\"data\" + 0.005*\"model\" + 0.005*\"set\" + '\n",
" '0.004*\"algorithm\" + 0.004*\"time\" + 0.004*\"one\" + 0.004*\"two\" + 0.003*\"used\" '\n",
" '+ 0.003*\"figure\"'),\n",
" (2,\n",
" '0.007*\"data\" + 0.005*\"model\" + 0.005*\"set\" + 0.005*\"learning\" + 0.004*\"one\" '\n",
" '+ 0.004*\"algorithm\" + 0.004*\"time\" + 0.003*\"using\" + 0.003*\"figure\" + '\n",
" '0.003*\"training\"'),\n",
" (3,\n",
" '0.006*\"data\" + 0.005*\"model\" + 0.004*\"learning\" + 0.004*\"two\" + '\n",
" '0.004*\"algorithm\" + 0.004*\"using\" + 0.004*\"function\" + 0.004*\"set\" + '\n",
" '0.003*\"number\" + 0.003*\"given\"'),\n",
" (4,\n",
" '0.006*\"learning\" + 0.005*\"data\" + 0.005*\"model\" + 0.005*\"set\" + '\n",
" '0.004*\"algorithm\" + 0.004*\"time\" + 0.004*\"using\" + 0.004*\"two\" + '\n",
" '0.004*\"function\" + 0.003*\"one\"'),\n",
" (5,\n",
" '0.008*\"learning\" + 0.006*\"data\" + 0.005*\"algorithm\" + 0.004*\"model\" + '\n",
" '0.004*\"two\" + 0.004*\"function\" + 0.004*\"number\" + 0.003*\"figure\" + '\n",
" '0.003*\"time\" + 0.003*\"set\"'),\n",
" (6,\n",
" '0.007*\"learning\" + 0.006*\"model\" + 0.005*\"data\" + 0.005*\"algorithm\" + '\n",
" '0.004*\"function\" + 0.004*\"set\" + 0.003*\"time\" + 0.003*\"one\" + 0.003*\"based\" '\n",
" '+ 0.003*\"number\"'),\n",
" (7,\n",
" '0.007*\"learning\" + 0.005*\"set\" + 0.005*\"data\" + 0.005*\"model\" + '\n",
" '0.004*\"algorithm\" + 0.004*\"function\" + 0.004*\"using\" + 0.004*\"number\" + '\n",
" '0.004*\"log\" + 0.004*\"figure\"'),\n",
" (8,\n",
" '0.005*\"learning\" + 0.005*\"set\" + 0.005*\"algorithm\" + 0.004*\"model\" + '\n",
" '0.004*\"function\" + 0.004*\"data\" + 0.004*\"one\" + 0.004*\"time\" + '\n",
" '0.003*\"using\" + 0.003*\"given\"'),\n",
" (9,\n",
" '0.007*\"data\" + 0.006*\"model\" + 0.005*\"learning\" + 0.005*\"algorithm\" + '\n",
" '0.004*\"two\" + 0.003*\"number\" + 0.003*\"time\" + 0.003*\"set\" + '\n",
" '0.003*\"function\" + 0.003*\"used\"')]\n"
]
}
],
"source": [
"from pprint import pprint\n",
"\n",
"from gensim.models import LdaMulticore\n",
"\n",
"# Number of topics\n",
"num_topics = 10\n",
"# Build the LDA model (no random_state is set, so the topics differ between runs)\n",
"lda_model = LdaMulticore(corpus=corpus,\n",
"                         id2word=id2word,\n",
"                         num_topics=num_topics)\n",
"# Print the top keywords of each of the 10 topics\n",
"pprint(lda_model.print_topics())\n",
"# Per-document topic distributions\n",
"doc_lda = lda_model[corpus]\n",
"\n",
"# Save the topics to a text file (assumes the ./data directory already exists)\n",
"with open(\"./data/lda_topics.txt\", \"w\") as f:\n",
"    for topic in lda_model.print_topics():\n",
"        f.write(f\"{topic}\\n\")"
]
}
],
"metadata": {
"author": "Rafał Jaworski",
"email": "rjawor@amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
},
"subtitle": "4,5. Topic classification (terminology, continued)",
"title": "Computer-Assisted Translation",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 5
}