
Computer-Aided Translation

4-5. Topic classification (terminology, continued) [lab]

Rafał Jaworski (2021)


Computer-Aided Translation

Classes 4 and 5 - topic classification (terminology, continued)

In the previous classes we developed our own terminology extractor. We also discussed how important the extraction of specialist terms is. Today we will look at how to extract from a text those terms that really are specialist ones.

Why might our solution so far fail to meet this requirement? Let's do the following exercise:

Exercise 1: Gather an English-language corpus of at least 100 documents, each containing at least 100 sentences. Use the site https://opus.nlpl.eu/. Ideally the documents should come from different domains (e.g. European Union law, programming manuals, medicine). Save the downloaded corpus on your local disk; do not attach it to this notebook.

import os

documents_files = os.listdir("./data/corpus")
documents_files = [d for d in documents_files if d.endswith(".txt")]
documents_files = sorted(documents_files)

documents = []
for document_file in documents_files:
    with open(f"./data/corpus/{document_file}", "r") as f:
        document_text = f.read()

        # Limit text to 100 lines
        document_text = "\n".join(document_text.split("\n")[:100])

        documents.append(document_text)

print(len(documents))
100

Such a corpus will let us observe what happens when we use only the frequency criterion for terminology extraction. To carry out the experiment, we need to run the extractor from the previous classes.

Exercise 2: Run the terminology extractor (the noun detector) from the previous classes on each document separately. In each case, output the 5 most frequent nouns as the extractor's result. Include the output in the notebook.

import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

def get_nouns(text):
    # All tokens tagged by spaCy as nouns
    doc = nlp(text)
    return [token.text for token in doc if token.pos_ == 'NOUN']

def get_top_nouns(nouns, n=5):
    # The n most frequent nouns
    return [noun for noun, _ in Counter(nouns).most_common(n)]

top_nouns = []
for i, document in enumerate(documents):
    nouns = get_nouns(document)
    top_nouns.append(get_top_nouns(nouns, 5))
    print(f"[{i+1}/{len(documents)}] {top_nouns[-1]}")

with open("./data/top_nouns.txt", "w") as f:
    for nouns in top_nouns:
        f.write(" ".join(nouns) + "\n")
[1/100] ['project', 'victims', 'support', 'visit', 'mediation']
[2/100] ['exhibition', 'cooperation', 'year', 'meeting', 'films']
[3/100] ['exhibition', 'cooperation', 'year', 'meeting', 'films']
[4/100] ['solution', 'occupation', 'settlement', 'problem', 'resolutions']
[5/100] ['residence', 'citizens', 'permit', 'security', 'citizen']
[6/100] ['residence', 'citizens', 'permit', 'security', 'citizen']
[7/100] ['support', 'measures', 'countries', 'farmers', 'member']
[8/100] ['data', 'services', 'infrastructure', 'development', 'project']
[9/100] ['data', 'services', 'infrastructure', 'development', 'project']
[10/100] ['photographs', 'service', 'scans', 'materials', 'films']
[11/100] ['photographs', 'service', 'scans', 'materials', 'films']
[12/100] ['insurance', 'ZUS', 'contributions', 'benefits', 'administration']
[13/100] ['project', 'archaeology', 'research', 'conservation', 'history']
[14/100] ['project', 'archaeology', 'research', 'conservation', 'history']
[15/100] ['cases', '%', 'coronavirus', 'countries', 'disease']
[16/100] ['%', 'year', 'case', 'cases', 'coronavirus']
[17/100] ['ship', 'tug', 'speed', 'accident', 'course']
[18/100] ['ship', 'tug', 'speed', 'accident', 'course']
[19/100] ['work', 'scientists', 'research', 'science', 'telomerase']
[20/100] ['work', 'scientists', 'research', 'science', 'telomerase']
[21/100] ['film', 'media', 'part', 'time', 'efforts']
[22/100] ['film', 'media', 'part', 'time', 'efforts']
[23/100] ['insurance', 'ZUS', 'contributions', 'benefits', 'administration']
[24/100] ['use', 'care', 'stewardship', 'resistance', 'antibiotics']
[25/100] ['services', 'administration', 'state', 'information', 'e']
[26/100] ['services', 'administration', 'state', 'information', 'e']
[27/100] ['coronavirus', 'research', 'measures', 'outbreak', 'member']
[28/100] ['residence', 'card', 'foreigner', 'work', 'permit']
[29/100] ['security', 'e', 'threats', 'policy', 'gas']
[30/100] ['security', 'e', 'threats', 'policy', 'gas']
[31/100] ['paper', '15th', 'reader', 'file', 'date']
[32/100] ['paper', '15th', 'reader', 'file', 'date']
[33/100] ['costs', 'implementation', 'management', 'tasks', 'expenditures']
[34/100] ['food', 'cooperation', 'products', 'market', 'agri']
[35/100] ['costs', 'implementation', 'management', 'tasks', 'expenditures']
[36/100] ['costs', 'implementation', 'management', 'tasks', 'expenditures']
[37/100] ['artist', 'work', 'painting', 'paintings', 'time']
[38/100] ['artist', 'work', 'painting', 'paintings', 'time']
[39/100] ['Home', '»', 'rights', 'representatives', 'discrimination']
[40/100] ['Home', '»', 'rights', 'representatives', 'discrimination']
[41/100] ['command', 'documentation', 'alias', 'files', 'directory']
[42/100] ['water', 'basis', 'land', 'status', 'item']
[43/100] ['water', 'basis', 'land', 'status', 'item']
[44/100] ['%', 'contract', 'contracts', '.', 'No']
[45/100] ['food', 'cooperation', 'products', 'market', 'agri']
[46/100] ['%', 'contract', 'contracts', '.', 'No']
[47/100] ['market', 'level', 'services', 'age', 'companies']
[48/100] ['market', 'level', 'services', 'age', 'companies']
[49/100] ['projects', 'innovation', 'R&D', 'development', 'companies']
[50/100] ['projects', 'innovation', 'R&D', 'development', 'companies']
[51/100] ['contracts', 'contract', '%', 'item', 'procedures']
[52/100] ['contracts', 'contract', '%', 'item', 'procedures']
[53/100] ['room', 'A', 'office', 'information', 'B']
[54/100] ['room', 'A', 'office', 'information', 'B']
[55/100] ['advantage', 'production', 'country', 'countries', 'goods']
[56/100] ['measles', 'vaccine', 'disease', 'person', 'people']
[57/100] ['advantage', 'production', 'country', 'countries', 'goods']
[58/100] ['card', 'residence', 'permission', 'business', 'stamp']
[59/100] ['card', 'residence', 'permission', 'business', 'stamp']
[60/100] ['w', '%', 'gospodarczego', 'polityki', 'publicznych']
[61/100] ['system', 'banks', 'stability', 'risk', 'sector']
[62/100] ['camps', 'people', 'concentration', 'policy', 'resistance']
[63/100] ['camps', 'people', 'concentration', 'policy', 'resistance']
[64/100] ['safety', 'aviation', 'management', 'requirements', 'entity']
[65/100] ['safety', 'aviation', 'management', 'requirements', 'entity']
[66/100] ['research', 'call', 'philosophy', 'information', 'project']
[67/100] ['vaccination', 'pertussis', 'cancer', 'risk', 'disease']
[68/100] ['research', 'call', 'philosophy', 'information', 'project']
[69/100] ['energy', 'gas', '%', 'oil', 'countries']
[70/100] ['energy', 'gas', '%', 'oil', 'countries']
[71/100] ['cooperation', 'meeting', 'talks', 'forces', 'defence']
[72/100] ['project', 'education', 'information', 'coronavirus', 'funding']
[73/100] ['food', 'education', 'project', 'measures', 'assistance']
[74/100] ['infection', 'disease', 'symptoms', 'fever', 'humans']
[75/100] ['energy', 'audit', 'costs', 'use', 'management']
[76/100] ['countries', '%', 'development', 'benefits', 'funds']
[77/100] ['years', 'minister', 'year', 'rector', 'persons']
[78/100] ['water', 'food', 'fish', 'times', 'year']
[79/100] ['land', 'water', 'population', 'data', 'age']
[80/100] ['land', 'water', 'population', 'data', 'age']
[81/100] ['market', 'labour', 'crisis', 'unemployment', 'countries']
[82/100] ['market', 'labour', 'crisis', 'unemployment', 'countries']
[83/100] ['accelerator', 'research', '-', 'operation', 'model']
[84/100] ['accelerator', 'research', '-', 'operation', 'model']
[85/100] ['energy', 'policy', 'power', 'development', 'objectives']
[86/100] ['priest', 'hand', 'country', 'wedding', 'church']
[87/100] ['eggs', 'breakfast', 'food', 'products', 'meat']
[88/100] ['eggs', 'breakfast', 'food', 'products', 'meat']
[89/100] ['water', 'fish', 'times', 'food', 'year']
[90/100] ['honey', 'production', 'bread', 'time', 'taste']
[91/100] ['honey', 'production', 'bread', 'time', 'taste']
[92/100] ['data', 'job', 'portal', 'vacancies', 'Decision']
[93/100] ['data', 'job', 'portal', 'vacancies', 'Decision']
[94/100] ['food', 'quality', 'products', 'apples', 'farmers']
[95/100] ['food', 'quality', 'products', 'apples', 'farmers']
[96/100] ['visa', 'activities', 'child', 'B-1', 'institution']
[97/100] ['visa', 'activities', 'child', 'B-1', 'institution']
[98/100] ['-', 'co', 'preparations', 'operation', 'preparation']
[99/100] ['-', 'co', 'preparations', 'operation', 'preparation']
[100/100] ['project', 'victims', 'support', 'visit', 'mediation']

Are the results obtained this way always specialist terms? Unfortunately, the results may include nouns that are simply frequent in the language and not necessarily characteristic of the texts we are processing. To get better extraction results, we need to apply more sophisticated methods.

One such method is a technique known from the field of Information Retrieval called TF-IDF. The name stands for Term Frequency - Inverse Document Frequency. With this method, for every term we find we should compute its TF-IDF score and then sort the results in descending order of that score.

How do we compute the TF-IDF score? What is TF, and what is IDF?

Let's start with TF, since we already know this factor. It is simply the frequency of the term in the text we are processing. The real idea of TF-IDF lies in the second factor - IDF. The word inverse means that this factor is inverted, i.e. it ends up in the denominator. TF-IDF is therefore essentially: $\frac{TF}{DF}$

So what is document frequency? It is the number of documents in which a given term occurs. Documents here are understood as the units into which the corpus we are working on is divided (exactly like the corpus from Exercise 1).

Let's think about what this factor means. Remember that our task is to extract terms from only one document at a time. We do, however, have many other documents at our disposal, containing many other words and terms. The TF-IDF value is higher the more often a term occurs in the document we are extracting from. It decreases, though, when the word occurs in many different documents. Common words will therefore have a high DF and a low TF-IDF, while the highest TF-IDF values go to terms that are frequent in the document we are processing but hardly occur anywhere else.
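
For example, suppose a term occurs 4 times in a 200-token document and appears in 5 of the 100 documents in the corpus. Then $TF = \frac{4}{200} = 0.02$ and, with the logarithmic variant of IDF used in the code below, $IDF = \ln\frac{100}{5} \approx 3.0$, so $TF\text{-}IDF \approx 0.06$. A generic noun that appears in, say, 95 of the 100 documents gets $IDF = \ln\frac{100}{95} \approx 0.05$, so even a locally frequent word is pushed far down the ranking.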

Exercise 3: Implement the TF-IDF score and use it to extract terminology from the corpus from Exercise 1. Do the results differ from those obtained with TF alone? (A quick comparison sketch follows after the output below.)

import math


def tfidf_extract():
    def tf(word, document):
        # Relative frequency of the word in one (tokenized) document
        return document.count(word) / len(document)

    def idf(word, documents):
        # Logarithmic inverse document frequency
        num_documents_with_word = sum(1 for document in documents if word in document)
        if num_documents_with_word == 0:
            return 0
        return math.log(len(documents) / num_documents_with_word)

    def tfidf(word, document, documents):
        return tf(word, document) * idf(word, documents)

    split_documents = [document.split() for document in documents]
    top_special_nouns = []
    for i, document in enumerate(split_documents):
        nouns = get_nouns(" ".join(document))
        # Score each distinct noun once, then keep the 5 with the highest TF-IDF
        scored = [(noun, tfidf(noun, document, split_documents)) for noun in set(nouns)]
        scored = sorted(scored, key=lambda x: x[1], reverse=True)
        top_special_nouns.append([noun for noun, _ in scored[:5]])
        print(f"[{i+1}/{len(documents)}] {top_special_nouns[-1]}")

    with open("./data/top_nouns_tfidf.txt", "w") as f:
        for nouns in top_special_nouns:
            f.write(" ".join(nouns) + "\n")

tfidf_extract()
[1/100] ['approval', 'total', 'lawyers', 'priorities', 'judges']
[2/100] ['agriculture', 'support', 'guests', 'offers', 'author']
[3/100] ['agriculture', 'support', 'guests', 'offers', 'author']
[4/100] ['homeland', 'invasion', 'address', 'prisoners', 'sources']
[5/100] ['identity', 'positions', 'elaboration', 'issues', 'terms']
[6/100] ['identity', 'positions', 'elaboration', 'issues', 'terms']
[7/100] ['distancing', 'lenders', 'mechanism', 'check', 'part']
[8/100] ['IT', 'Realization', 'Services', 'resolutions', 'bases']
[9/100] ['IT', 'Realization', 'Services', 'resolutions', 'bases']
[10/100] ['occupation', 'scans', 'browser', 'Service', 'processes']
[11/100] ['occupation', 'scans', 'browser', 'Service', 'processes']
[12/100] ['am', 'war', 'month', 'Insurance', 'centralisation']
[13/100] ['conservation', 'zu', 'provisions', 'basin', 'record']
[14/100] ['conservation', 'zu', 'provisions', 'basin', 'record']
[15/100] ['culture', 'city', 'abscesses', 'aeronautics', 'disruptors']
[16/100] ['infection', 'Recommendations', 'man', 'evening', 'occurrence']
[17/100] ['course', 'hull', 'STATE', 'classifier', 'certificate']
[18/100] ['course', 'hull', 'STATE', 'classifier', 'certificate']
[19/100] ['cooling', 'work', 'culture', 'part', 'laboratory']
[20/100] ['cooling', 'work', 'culture', 'part', 'laboratory']
[21/100] ['culture', 'reverse', 'advisor', 'documentary', 'service']
[22/100] ['culture', 'reverse', 'advisor', 'documentary', 'service']
[23/100] ['am', 'war', 'month', 'Insurance', 'centralisation']
[24/100] ['pressure', 'ability', 'entry', 'prescribers', 'costs']
[25/100] ['economies', 'management', 'role', 'disk', 'stakeholders']
[26/100] ['economies', 'management', 'role', 'disk', 'stakeholders']
[27/100] ['traders', 'fears', 'carriers', 'illness', 'distancing']
[28/100] ['activity', 'employment', 'foreigners', 'Visa', 'graduate']
[29/100] ['defense', 'forecast', 'quarter', 'factors', 'opportunity']
[30/100] ['defense', 'forecast', 'quarter', 'factors', 'opportunity']
[31/100] ['case', 'author', 'screen', 'announcement', 'typefaces']
[32/100] ['case', 'author', 'screen', 'announcement', 'typefaces']
[33/100] ['revenue', 'office', 'premises', 'o', 'proposals']
[34/100] ['storage', 'completion', 'efforts', 'Meeting', 'crisis']
[35/100] ['office', 'Types', 'premises', 'protection', 'days']
[36/100] ['revenue', 'office', 'premises', 'o', 'proposals']
[37/100] ['pictures', 'splashing', 'dobrze', 'viewer', 'culture']
[38/100] ['pictures', 'splashing', 'dobrze', 'viewer', 'culture']
[39/100] ['creation', 'origin', 'discrimination', 'interest', 'institutions']
[40/100] ['creation', 'origin', 'discrimination', 'interest', 'institutions']
[41/100] ['names', 'contexts', 'calculator', 'program', 'descriptor']
[42/100] ['periods', 'standards', 'total', 'name', 'property']
[43/100] ['periods', 'standards', 'total', 'name', 'property']
[44/100] ['Art', 'days', 'liability', 'authorities', 'services']
[45/100] ['storage', 'completion', 'efforts', 'Meeting', 'crisis']
[46/100] ['Art', 'days', 'liability', 'authorities', 'services']
[47/100] ['skills', 'provision', 'country', 'economies', 'science']
[48/100] ['skills', 'provision', 'country', 'economies', 'science']
[49/100] ['Project', 'possibilities', 'cancer', 'members', 'therapies']
[50/100] ['Project', 'possibilities', 'cancer', 'members', 'therapies']
[51/100] ['price', 'auction', 'actions', 'telecommunications', 'appointment']
[52/100] ['price', 'auction', 'actions', 'telecommunications', 'appointment']
[53/100] ['records', 'coffee', 'authorisation', 'line', 'times']
[54/100] ['records', 'coffee', 'authorisation', 'line', 'times']
[55/100] ['example', 'manner', 'source', 'essence', 'identification']
[56/100] ['defences', 'vaccines', 'days', 'spread', 'body']
[57/100] ['example', 'manner', 'source', 'essence', 'identification']
[58/100] ['servants', 'employees', 'Possession', 'insurance', 'examinations']
[59/100] ['servants', 'employees', 'Possession', 'insurance', 'examinations']
[60/100] ['systemowe', 'dopiero', 'system', 'latach', 'popytem']
[61/100] ['efficiency', 'problems', 'uncertainty', 'improvement', 'Risk']
[62/100] ['uprising', 'borders', 'rights', 'security', 'campaign']
[63/100] ['uprising', 'borders', 'rights', 'security', 'campaign']
[64/100] ['part', 'audits', 'Responsibilities', 'services', 'authority']
[65/100] ['protection', 'competence', 'version', 'occurrence', 'requisition']
[66/100] ['Requirements', 'members', 'methodology', 'data', 'database']
[67/100] ['whoop', 'substitute', 'cause', 'exposure', 'course']
[68/100] ['Requirements', 'members', 'methodology', 'data', 'database']
[69/100] ['erent', 'decisions', 'SOURCES', 'spectrum', 'economies']
[70/100] ['erent', 'decisions', 'SOURCES', 'spectrum', 'economies']
[71/100] ['invitation', 'effects', 'help', 'armament', 'round']
[72/100] ['area', 'teaching', 'tax', 'time', 'travel']
[73/100] ['time', 'Recommendation', 'participants', 'guarantees', 'work']
[74/100] ['toxin', 'mechanisms', 'attacks', 'Babies', 'therapies']
[75/100] ['production', 'replacement', 'control', 'SMEs', 'audit']
[76/100] ['significance', 'net', 'ground', 'participants', 'levels']
[77/100] ['functioning', 'consultation', 'interest', 'expert', 'procedures']
[78/100] ['thing', 'mercury', 'eggs', 'municipality', 'lunch']
[79/100] ['agriculture', 'R', 'result', 'development', 'prices']
[80/100] ['agriculture', 'R', 'result', 'development', 'prices']
[81/100] ['reflection', 'basis', 'sources', 'points', 'results']
[82/100] ['reflection', 'basis', 'sources', 'points', 'results']
[83/100] ['leaders', 'reach', 'author', 'features', 'publications']
[84/100] ['leaders', 'reach', 'author', 'features', 'publications']
[85/100] ['consumption', 'Improvement', 'bodies', 'level', 'need']
[86/100] ['money', 'delirium', 'advice', 'house', 'couple']
[87/100] ['work', 'thanks', 'BEgINNINg', 'range', 'funds']
[88/100] ['work', 'thanks', 'BEgINNINg', 'range', 'funds']
[89/100] ['option', 'eggs', 'dinner', 'wine', 'quantities']
[90/100] ['seeds', 'mead', 'event', 'maples', 'approach']
[91/100] ['seeds', 'mead', 'event', 'maples', 'approach']
[92/100] ['case', 'complaints', 'consultation', 'Employers', 'actions']
[93/100] ['case', 'complaints', 'consultation', 'Employers', 'actions']
[94/100] ['activity', 'fruit', 'indications', 'zation', 'rice']
[95/100] ['activity', 'fruit', 'indications', 'zation', 'rice']
[96/100] ['building', 'work', 'premises', 'Food', 'child']
[97/100] ['building', 'work', 'premises', 'Food', 'child']
[98/100] ['virtue', 'works', 'culture', 'sectors', 'others']
[99/100] ['virtue', 'works', 'culture', 'sectors', 'others']
[100/100] ['approval', 'total', 'lawyers', 'priorities', 'judges']
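
To answer the question from Exercise 3, we can roughly check how much the two rankings overlap. Below is a minimal sketch, assuming the two files written above (./data/top_nouns.txt and ./data/top_nouns_tfidf.txt) are present:

with open("./data/top_nouns.txt") as f:
    tf_lists = [line.split() for line in f]
with open("./data/top_nouns_tfidf.txt") as f:
    tfidf_lists = [line.split() for line in f]

# For each document, count how many of the 5 TF-only nouns survive in the TF-IDF top 5
overlaps = [len(set(a) & set(b)) for a, b in zip(tf_lists, tfidf_lists)]
print(f"Average overlap (out of 5): {sum(overlaps) / len(overlaps):.2f}")

If TF-IDF does its job, the overlap should be low: generic nouns such as "year", "time" or "project" dominate the TF-only lists, but their high document frequency suppresses them in the TF-IDF ranking.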

We can now extract terms from documents in a better way. Let's try something more spectacular - let's generate a so-called word cloud from a text using the WordCloud library, for a BBC News article (https://www.bbc.com/news/world-europe-56530714):

sudo pip install wordcloud

from wordcloud import WordCloud
text = """"This is where it happened," says Felipe Luis Codesal, opening the gate to a three-hectare field on his farm in Zamora, north-west Spain.

One night last November, a pack of wolves got through the fence surrounding the field and attacked Mr Codesal's sheep, many of which were pregnant. When he arrived the next morning, he found 11 animals had been killed. Over the following days, he says, another 36 sheep died from injuries sustained in that attack and miscarriages it triggered.

Mr Codesal fears that such attacks will become even more commonplace if a proposed change to laws protecting the Iberian wolf comes into force.

The leftist coalition government plans to prevent the Iberian wolf from being hunted anywhere by categorising it as an endangered species. The reform is yet to be implemented and could see changes.

Iberian wolves from the Iberian Wolf Centre in Robledo de Sanabria on February 21, 2020 in Zamora, Spain
image captionSpain has Europe's biggest wolf population: These Iberian wolves are kept at Zamora's Iberian wolf centre
"It's like in a nightclub when there's a fire," says Mr Codesal of the wolf attack. "There's a stampede and people get trodden on and hurt. This is the same."

He was not entitled to any compensation and estimates that the financial losses he suffered from this incident totalled around €12-14,000

"It's not even about the money," he says. "It's emotional, because the animals are part of my family."

A 'historic' change?
The region of Castilla y León is the habitat for most of Spain's wolves. Figures gathered by the local government showed that they killed 3,774 sheep and cows in the region in 2019.

Felipe Luis Codesal's farm is just north of the Duero river, which marks a natural border between north-west Spain and the rest of the country. Until now, it has been legal to hunt wolves north of the Duero, under a strict quota system, because that is where they are most prevalent.

South of the river they have been protected.

Conservationist groups have welcomed the government plan. When it was unveiled in February, the Ecologistas en Acción organisation hailed it as a "historic day".

But Mr Codesal, who is a member of the UPA association of smallholder farmers, warns the reform will ruin livestock owners by allowing the wolf population to spiral out of control and roam uncontrolled. The UPA is unconvinced by measures included in the plan to subsidise the installation of fences and the use of guard dogs in livestock farming areas.

Biggest wolf numbers in Europe
The Iberian wolf was close to being wiped out in the middle of the 20th Century. But it enjoyed a resurgence on the back of new hunting regulations introduced in the 1970s and the migration of Spaniards away from rural areas also encouraged its spread down from the north-western corner of the country.

In recent years, wolves have moved into areas such as the Guadarrama mountains north of Madrid and near the city of Ávila, to the west of the capital.

There are now some 2,500 Iberian wolves: around 2,000 are in Spain - the largest wolf population in western Europe - and the rest in Portugal.
"""

wordcloud = WordCloud(background_color="white", max_words=5000, contour_width=3, contour_color='steelblue')
wordcloud.generate(text)
wordcloud.to_image()

Exercise 4: Generate a word cloud for the whole corpus from Exercise 1.

wordcloud = WordCloud(background_color="white", max_words=5000, contour_width=3, contour_color='steelblue')
wordcloud.generate(" ".join(documents))
wordcloud.to_image()

Let's consider one more issue - how can we group these terms by domain? This task is called topic classification. And thanks to a certain 19th-century German mathematician, the process can be carried out automatically. The mathematician's name was Peter Gustav Lejeune Dirichlet, and the classification method is called LDA (Latent Dirichlet Allocation).
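
Before working through the full tutorial in Exercise 5, here is a minimal sketch of how LDA could be applied directly to our corpus from Exercise 1 (assuming gensim is installed, as in the tutorial below; the topic count and the bare-bones preprocessing are arbitrary choices for illustration):

from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.utils import simple_preprocess

# Bare-bones preprocessing: tokenize and lowercase each Exercise-1 document
tokenized = [simple_preprocess(doc, deacc=True) for doc in documents]

# Bag-of-words representation of the corpus
dictionary = Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

# Train a small LDA model; 5 topics is an arbitrary illustrative choice
lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=5, passes=10, random_state=42)
for topic_id, topic in lda.print_topics(num_words=8):
    print(topic_id, topic)

Each topic is a weighted list of words, and each document gets a distribution over topics, so documents from different domains (EU law, programming manuals, medicine) should end up dominated by different topics - which is exactly the thematic grouping we are after.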

Exercise 5: Work through the tutorial available at https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0. Paste the results into the notebook.

# Importing modules
import pandas as pd
import os
# Read data into papers
print(os.listdir("."))
papers = pd.read_csv('./data/NIPS Papers/papers.csv')
# Print head
papers.head()
['lab_06-07.ipynb', 'lab_01.ipynb', 'lab_03.ipynb', 'lab_09-10.ipynb', 'lab_02.ipynb', 'lab_13-14.ipynb', 'img', 'lab_11.ipynb', 'lab_08.ipynb', '.gitignore', 'lab_04-05.ipynb', 'lab_15.ipynb', 'lab_12.ipynb', 'data']
id year title event_type pdf_name abstract paper_text
0 1 1987 Self-Organization of Associative Database and ... NaN 1-self-organization-of-associative-database-an... Abstract Missing 767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA...
1 10 1987 A Mean Field Theory of Layer IV of Visual Cort... NaN 10-a-mean-field-theory-of-layer-iv-of-visual-c... Abstract Missing 683\n\nA MEAN FIELD THEORY OF LAYER IV OF VISU...
2 100 1988 Storing Covariance by the Associative Long-Ter... NaN 100-storing-covariance-by-the-associative-long... Abstract Missing 394\n\nSTORING COVARIANCE BY THE ASSOCIATIVE\n...
3 1000 1994 Bayesian Query Construction for Neural Network... NaN 1000-bayesian-query-construction-for-neural-ne... Abstract Missing Bayesian Query Construction for Neural\nNetwor...
4 1001 1994 Neural Network Ensembles, Cross Validation, an... NaN 1001-neural-network-ensembles-cross-validation... Abstract Missing Neural Network Ensembles, Cross\nValidation, a...
# Remove the columns
papers = papers.drop(columns=['id', 'event_type', 'pdf_name'], axis=1).sample(100)
# Print out the first rows of papers
papers.head()
year title abstract paper_text
4172 2012 Learning Manifolds with K-Means and K-Flats We study the problem of estimating a manifold ... Learning Manifolds with K-Means and K-Flats\n\...
3535 2011 Unifying Framework for Fast Learning Rate of N... In this paper, we give a new generalization er... Unifying Framework for Fast Learning Rate of\n...
6206 2017 Hunt For The Unique, Stable, Sparse And Fast F... For the purpose of learning on graphs, we hunt... Hunt For The Unique, Stable, Sparse And Fast\n...
11 1994 Multidimensional Scaling and Data Clustering Abstract Missing Multidimensional Scaling and Data Clustering\n...
5119 2015 Convolutional Neural Networks with Intra-Layer... Scene labeling is a challenging computer visio... Convolutional Neural Networks with Intra-layer...
# Load the regular expression library
import re
# Remove punctuation
papers['paper_text_processed'] = \
papers['paper_text'].map(lambda x: re.sub(r'[,\.!?]', '', x))
# Convert the titles to lowercase
papers['paper_text_processed'] = \
papers['paper_text_processed'].map(lambda x: x.lower())
# Print out the first rows of papers
papers['paper_text_processed'].head()
4172    learning manifolds with k-means and k-flats\n\...
3535    unifying framework for fast learning rate of\n...
6206    hunt for the unique stable sparse and fast\nfe...
11      multidimensional scaling and data clustering\n...
5119    convolutional neural networks with intra-layer...
Name: paper_text_processed, dtype: object
# Import the wordcloud library
from wordcloud import WordCloud
# Join the different processed titles together.
long_string = ','.join(list(papers['paper_text_processed'].values))
# Create a WordCloud object
wordcloud = WordCloud(background_color="white", max_words=5000, contour_width=3, contour_color='steelblue')
# Generate a word cloud
wordcloud.generate(long_string)
# Visualize the word cloud
wordcloud.to_image()
from gensim.utils import simple_preprocess
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True removes punctuations
        yield(simple_preprocess(str(sentence), deacc=True))
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) 
             if word not in stop_words] for doc in texts]
data = papers.paper_text_processed.values.tolist()
data_words = list(sent_to_words(data))
# remove stop words
data_words = remove_stopwords(data_words)
print(data_words[:1][0][:30])
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/patrykbart/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
['learning', 'manifolds', 'means', 'flats', 'guillermo', 'canas', 'tomaso', 'poggio', 'lorenzo', 'rosasco', 'laboratory', 'computational', 'statistical', 'learning', 'mit', 'iit', 'cbcl', 'mcgovern', 'institute', 'massachusetts', 'institute', 'technology', 'guilledc', 'mitedu', 'tp', 'aimitedu', 'lrosasco', 'mitedu', 'abstract', 'study']
import gensim.corpora as corpora
# Create Dictionary
id2word = corpora.Dictionary(data_words)
# Create Corpus
texts = data_words
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
# View
print(corpus[:1][0][:30])
[(0, 3), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 8), (7, 1), (8, 2), (9, 3), (10, 1), (11, 1), (12, 2), (13, 1), (14, 1), (15, 1), (16, 3), (17, 1), (18, 1), (19, 12), (20, 7), (21, 1), (22, 1), (23, 2), (24, 4), (25, 2), (26, 1), (27, 1), (28, 1), (29, 1)]
from gensim.models import LdaMulticore

from pprint import pprint
# number of topics
num_topics = 10
# Build LDA model
lda_model = LdaMulticore(corpus=corpus,
                         id2word=id2word,
                         num_topics=num_topics)
# Print the Keyword in the 10 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

# Save to txt file
with open("./data/lda_topics.txt", "w") as f:
    for topic in lda_model.print_topics():
        f.write(f"{topic}\n")
[(0,
  '0.006*"learning" + 0.005*"model" + 0.005*"data" + 0.004*"function" + '
  '0.004*"set" + 0.004*"using" + 0.004*"number" + 0.004*"neural" + 0.004*"one" '
  '+ 0.003*"error"'),
 (1,
  '0.008*"learning" + 0.006*"data" + 0.005*"model" + 0.005*"set" + '
  '0.004*"algorithm" + 0.004*"time" + 0.004*"one" + 0.004*"two" + 0.003*"used" '
  '+ 0.003*"figure"'),
 (2,
  '0.007*"data" + 0.005*"model" + 0.005*"set" + 0.005*"learning" + 0.004*"one" '
  '+ 0.004*"algorithm" + 0.004*"time" + 0.003*"using" + 0.003*"figure" + '
  '0.003*"training"'),
 (3,
  '0.006*"data" + 0.005*"model" + 0.004*"learning" + 0.004*"two" + '
  '0.004*"algorithm" + 0.004*"using" + 0.004*"function" + 0.004*"set" + '
  '0.003*"number" + 0.003*"given"'),
 (4,
  '0.006*"learning" + 0.005*"data" + 0.005*"model" + 0.005*"set" + '
  '0.004*"algorithm" + 0.004*"time" + 0.004*"using" + 0.004*"two" + '
  '0.004*"function" + 0.003*"one"'),
 (5,
  '0.008*"learning" + 0.006*"data" + 0.005*"algorithm" + 0.004*"model" + '
  '0.004*"two" + 0.004*"function" + 0.004*"number" + 0.003*"figure" + '
  '0.003*"time" + 0.003*"set"'),
 (6,
  '0.007*"learning" + 0.006*"model" + 0.005*"data" + 0.005*"algorithm" + '
  '0.004*"function" + 0.004*"set" + 0.003*"time" + 0.003*"one" + 0.003*"based" '
  '+ 0.003*"number"'),
 (7,
  '0.007*"learning" + 0.005*"set" + 0.005*"data" + 0.005*"model" + '
  '0.004*"algorithm" + 0.004*"function" + 0.004*"using" + 0.004*"number" + '
  '0.004*"log" + 0.004*"figure"'),
 (8,
  '0.005*"learning" + 0.005*"set" + 0.005*"algorithm" + 0.004*"model" + '
  '0.004*"function" + 0.004*"data" + 0.004*"one" + 0.004*"time" + '
  '0.003*"using" + 0.003*"given"'),
 (9,
  '0.007*"data" + 0.006*"model" + 0.005*"learning" + 0.005*"algorithm" + '
  '0.004*"two" + 0.003*"number" + 0.003*"time" + 0.003*"set" + '
  '0.003*"function" + 0.003*"used"')]