Information Extraction
6. Classification [exercises]
Jakub Pokrywka (2021)
Class: classification
The Kleister dataset
import pathlib
from collections import Counter
from sklearn.metrics import accuracy_score
KLEISTER_PATH = pathlib.Path('/home/kuba/Syncthing/przedmioty/2020-02/IE/applica/kleister-nda')
Question
Does the jurisdiction have to be stated explicitly in the contract?
def get_expected_jurisdiction(filepath):
    dataset_expected_jurisdiction = []
    with open(filepath, 'r') as train_expected_file:
        for line in train_expected_file:
            key_values = line.rstrip('\n').split(' ')
            jurisdiction = None
            for key_value in key_values:
                key, value = key_value.split('=')
                if key == 'jurisdiction':
                    jurisdiction = value
            if jurisdiction is None:
                jurisdiction = 'NONE'
            dataset_expected_jurisdiction.append(jurisdiction)
    return dataset_expected_jurisdiction
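To make the expected.tsv format concrete, here is a minimal sketch that parses single lines in the same key=value style. The helper name parse_expected_line and both sample lines are made up for illustration; they are not taken from the Kleister data. Unlike the function above, the sketch returns the first matching key rather than the last one.

```python
def parse_expected_line(line, wanted_key='jurisdiction', default='NONE'):
    """Return the value of `wanted_key` from a 'key=value key=value' line."""
    for key_value in line.rstrip('\n').split(' '):
        # partition never raises, even if a token contains no '='
        key, _, value = key_value.partition('=')
        if key == wanted_key:
            return value
    return default


# Hypothetical sample lines (NOT taken from the dataset)
with_jurisdiction = 'effective_date=2001-01-01 jurisdiction=New_York term=2_years\n'
without_jurisdiction = 'effective_date=2001-01-01 term=2_years\n'

print(parse_expected_line(with_jurisdiction))
print(parse_expected_line(without_jurisdiction))
```

The second call illustrates why 'NONE' can show up as a label: a contract may simply carry no jurisdiction annotation.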
train_expected_jurisdiction = get_expected_jurisdiction(KLEISTER_PATH/'train'/'expected.tsv')
dev_expected_jurisdiction = get_expected_jurisdiction(KLEISTER_PATH/'dev-0'/'expected.tsv')
len(train_expected_jurisdiction)
254
'NONE' in train_expected_jurisdiction
False
len(set(train_expected_jurisdiction))
31
Do all states have to appear in the training set of the Kleister dataset?
https://en.wikipedia.org/wiki/U.S._state
What is the baseline?
train_counter = Counter(train_expected_jurisdiction)
train_counter.most_common(100)
[('New_York', 43), ('Delaware', 39), ('California', 32), ('Massachusetts', 15), ('Texas', 13), ('Illinois', 10), ('Oregon', 9), ('Florida', 9), ('Pennsylvania', 9), ('Missouri', 9), ('Ohio', 8), ('New_Jersey', 7), ('Georgia', 6), ('Indiana', 5), ('Nevada', 5), ('Colorado', 4), ('Virginia', 4), ('Washington', 4), ('Michigan', 3), ('Minnesota', 3), ('Connecticut', 2), ('Wisconsin', 2), ('Maine', 2), ('North_Carolina', 2), ('Kansas', 2), ('Utah', 2), ('Iowa', 1), ('Idaho', 1), ('South_Dakota', 1), ('South_Carolina', 1), ('Rhode_Island', 1)]
most_common_answer = train_counter.most_common(100)[0][0]
most_common_answer
'New_York'
dev_predictions_jurisdiction = [most_common_answer] * len(dev_expected_jurisdiction)
dev_expected_jurisdiction
['New_York', 'New_York', 'Delaware', 'Massachusetts', 'Delaware', 'Washington', 'Delaware', 'New_Jersey', 'New_York', 'NONE', 'NONE', 'Delaware', 'Delaware', 'Delaware', 'New_York', 'Massachusetts', 'Minnesota', 'California', 'New_York', 'California', 'Iowa', 'California', 'Virginia', 'North_Carolina', 'Arizona', 'Indiana', 'New_Jersey', 'California', 'Delaware', 'Georgia', 'New_York', 'New_York', 'California', 'Minnesota', 'California', 'Kentucky', 'Minnesota', 'Ohio', 'Michigan', 'California', 'Minnesota', 'California', 'Delaware', 'Illinois', 'Minnesota', 'Texas', 'New_Jersey', 'Delaware', 'Washington', 'NONE', 'Delaware', 'Oregon', 'Delaware', 'Delaware', 'Delaware', 'Massachusetts', 'California', 'NONE', 'Delaware', 'Illinois', 'Idaho', 'Washington', 'New_York', 'New_York', 'California', 'Utah', 'Delaware', 'Washington', 'Virginia', 'New_York', 'New_York', 'Illinois', 'California', 'Delaware', 'NONE', 'Texas', 'California', 'Washington', 'Delaware', 'Washington', 'New_York', 'Washington', 'Illinois']
counter = 0
for pred, exp in zip(dev_predictions_jurisdiction, dev_expected_jurisdiction):
    if pred == exp:
        counter += 1
print('accuracy: ', counter/len(dev_predictions_jurisdiction))
accuracy: 0.14457831325301204
accuracy_score(dev_expected_jurisdiction, dev_predictions_jurisdiction)  # y_true first, then y_pred
0.14457831325301204
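The majority-class baseline used above can be wrapped in one small function. This is only a sketch: the function name majority_baseline_accuracy and the toy labels are made up for illustration and are not part of the Kleister data.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, dev_labels):
    """Predict the most frequent training label for every dev item."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in dev_labels if label == most_common_label)
    return most_common_label, hits / len(dev_labels)


# Toy labels, made up for illustration
toy_train = ['New_York', 'New_York', 'Delaware', 'California']
toy_dev = ['New_York', 'Delaware', 'NONE', 'New_York']

label, accuracy = majority_baseline_accuracy(toy_train, toy_dev)
print(label, accuracy)
```

Any trained classifier should beat this number; if it does not, the model has learned nothing beyond the class prior.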
What if the class names do not appear explicitly in the datasets?
https://git.wmi.amu.edu.pl/kubapok/paranormal-or-skeptic-ISI-public
https://git.wmi.amu.edu.pl/kubapok/sport-text-classification-ball-ISI-public
SPORT_PATH='/home/kuba/Syncthing/przedmioty/2020-02/ISI/zajecia6_klasyfikacja/repos/sport-text-classification-ball'
SPORT_TRAIN=$SPORT_PATH/train/train.tsv.gz
SPORT_DEV_EXP=$SPORT_PATH/dev-0/expected.tsv
What is the baseline for sport-text-classification-ball?
zcat $SPORT_TRAIN | awk '{print $1}' | wc -l
zcat $SPORT_TRAIN | awk '{print $1}' | grep 1 | wc -l
cat $SPORT_DEV_EXP | wc -l
grep 1 $SPORT_DEV_EXP | wc -l
A smarter approach to text classification? Naive Bayes
from sklearn.datasets import fetch_20newsgroups
# https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import sklearn.metrics
import gensim
newsgroups = fetch_20newsgroups()
newsgroups_text = newsgroups['data']
newsgroups_text_tokenized = [list(set(gensim.utils.tokenize(x, lowercase = True))) for x in newsgroups_text]
print(newsgroups_text[0])
From: lerxst@wam.umd.edu (where's my thing) Subject: WHAT car is this!? Nntp-Posting-Host: rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: 15 I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail. Thanks, - IL ---- brought to you by your neighborhood Lerxst ----
print(newsgroups_text_tokenized[0])
['lerxst', 'on', 'be', 'name', 'brought', 'late', 'front', 'umd', 'bumper', 'door', 'there', 'subject', 'day', 'early', 'history', 'me', 'neighborhood', 'university', 'mail', 'doors', 'by', 'funky', 'if', 'engine', 'know', 'years', 'maryland', 'your', 'rest', 'is', 'info', 'body', 'have', 'tellme', 'out', 'anyone', 'small', 'wam', 'il', 'organization', 'thanks', 'park', 'made', 'whatever', 'other', 'specs', 'wondering', 'lines', 'from', 'was', 'a', 'what', 'the', 's', 'or', 'please', 'all', 'rac', 'i', 'looked', 'really', 'edu', 'where', 'to', 'e', 'my', 'it', 'car', 'addition', 'can', 'of', 'production', 'in', 'saw', 'separate', 'you', 'thing', 'posting', 'bricklin', 'could', 'enlighten', 'nntp', 'model', 'were', 'host', 'looking', 'this', 'college', 'sports', 'called']
Y = newsgroups['target']
Y
array([7, 4, 4, ..., 3, 1, 8])
Y_names = newsgroups['target_names']
Y_names
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
Y_names[16]
'talk.politics.guns'
$P('talk.politics.guns' | 'gun')= ?$
$P(A|B) * P(B) = P(B|A) * P(A)$
$P(A|B) = \frac{P(B|A) * P(A)}{P(B)}$
$P('talk.politics.guns' | 'gun') * P('gun') = P('gun'|'talk.politics.guns') * P('talk.politics.guns')$
$P('talk.politics.guns' | 'gun') = \frac{P('gun'|'talk.politics.guns') * P('talk.politics.guns')}{P('gun')}$
$p1 = P('gun'|'talk.politics.guns')$
$p2 = P('talk.politics.guns')$
$p3 = P('gun')$
computing $p1 = P('gun'|'talk.politics.guns')$
talk_politics_guns = [x for x,y in zip(newsgroups_text_tokenized,Y) if y == 16]
len(talk_politics_guns)
546
len([x for x in talk_politics_guns if 'gun' in x])
253
p1 = len([x for x in talk_politics_guns if 'gun' in x]) / len(talk_politics_guns)
p1
0.4633699633699634
computing $p2 = P('talk.politics.guns')$
p2 = len(talk_politics_guns) / len(Y)
p2
0.048258794414000356
computing $p3 = P('gun')$
p3 = len([x for x in newsgroups_text_tokenized if 'gun' in x]) / len(Y)
p3
0.03270284603146544
finally
(p1 * p2) / p3
0.6837837837837839
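As a sanity check, the Bayes product p1 * p2 / p3 should equal the direct conditional frequency: the number of documents of the class that contain the word, divided by the number of all documents that contain the word. Here that is 253 divided by the number of documents containing 'gun'. The sketch below checks this identity on a tiny corpus; the corpus, the labels and the function name bayes_vs_direct are made up for illustration, and the function assumes the class is non-empty and the word occurs at least once.

```python
def bayes_vs_direct(docs, labels, target_label, word):
    """Compute P(class | word) via Bayes' rule and by direct counting."""
    n = len(docs)
    in_class = [d for d, y in zip(docs, labels) if y == target_label]
    class_with_word = sum(word in d for d in in_class)
    all_with_word = sum(word in d for d in docs)

    p1 = class_with_word / len(in_class)  # P(word | class)
    p2 = len(in_class) / n                # P(class)
    p3 = all_with_word / n                # P(word)

    via_bayes = p1 * p2 / p3
    direct = class_with_word / all_with_word
    return via_bayes, direct


# Toy corpus: each document is a set of tokens (made up for illustration)
toy_docs = [{'gun', 'law'}, {'gun', 'shop'}, {'car', 'engine'},
            {'gun', 'car'}, {'god', 'faith'}]
toy_labels = ['guns', 'guns', 'autos', 'autos', 'religion']

via_bayes, direct = bayes_vs_direct(toy_docs, toy_labels, 'guns', 'gun')
print(via_bayes, direct)
```

For a single word the two routes agree, so Bayes' rule adds nothing yet; it becomes useful once several words are combined under an independence assumption.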
def get_prob(index):
    # documents belonging to the class with the given index
    talks_topic = [x for x, y in zip(newsgroups_text_tokenized, Y) if y == index]
    if len(talks_topic) == 0:
        return 0.0
    p1 = len([x for x in talks_topic if 'gun' in x]) / len(talks_topic)  # P('gun' | class)
    p2 = len(talks_topic) / len(Y)  # P(class)
    p3 = len([x for x in newsgroups_text_tokenized if 'gun' in x]) / len(Y)  # P('gun')
    if p3 == 0:
        return 0.0
    else:
        return (p1 * p2) / p3
probs = []
for i in range(len(Y_names)):
    prob = get_prob(i)
    probs.append(prob)
    print("%.5f" % prob, '\t\t', Y_names[i])
print("%.5f" % sum(probs), '\t\tsum')
0.01622    alt.atheism
0.00000    comp.graphics
0.00541    comp.os.ms-windows.misc
0.01892    comp.sys.ibm.pc.hardware
0.00270    comp.sys.mac.hardware
0.00000    comp.windows.x
0.01351    misc.forsale
0.04054    rec.autos
0.01892    rec.motorcycles
0.00270    rec.sport.baseball
0.00541    rec.sport.hockey
0.03784    sci.crypt
0.02973    sci.electronics
0.00541    sci.med
0.01622    sci.space
0.00270    soc.religion.christian
0.68378    talk.politics.guns
0.04595    talk.politics.mideast
0.03784    talk.politics.misc
0.01622    talk.religion.misc
1.00000    sum
def get_prob2(index, word):
    # documents belonging to the class with the given index
    talks_topic = [x for x, y in zip(newsgroups_text_tokenized, Y) if y == index]
    if len(talks_topic) == 0:
        return 0.0
    p1 = len([x for x in talks_topic if word in x]) / len(talks_topic)  # P(word | class)
    p2 = len(talks_topic) / len(Y)  # P(class)
    p3 = len([x for x in newsgroups_text_tokenized if word in x]) / len(Y)  # P(word)
    if p3 == 0:
        return 0.0
    else:
        return (p1 * p2) / p3
probs = []
for i in range(len(Y_names)):
    prob = get_prob2(i, 'god')
    probs.append(prob)
    print("%.5f" % prob, '\t\t', Y_names[i])
print("%.5f" % sum(probs), '\t\tsum')
0.20874    alt.atheism
0.00850    comp.graphics
0.00364    comp.os.ms-windows.misc
0.00850    comp.sys.ibm.pc.hardware
0.00243    comp.sys.mac.hardware
0.00485    comp.windows.x
0.00607    misc.forsale
0.01092    rec.autos
0.02063    rec.motorcycles
0.01456    rec.sport.baseball
0.01092    rec.sport.hockey
0.00485    sci.crypt
0.00364    sci.electronics
0.00364    sci.med
0.01092    sci.space
0.41748    soc.religion.christian
0.03398    talk.politics.guns
0.02791    talk.politics.mideast
0.02549    talk.politics.misc
0.17233    talk.religion.misc
1.00000    sum
The naive Bayes assumption
$P(class | word1, word2, word3) = \frac{P(word1, word2, word3|class) * P(class)}{P(word1, word2, word3)}$
assuming the random variables $word1$, $word2$, $word3$ are conditionally independent given the class:
$P(word1, word2, word3|class) = P(word1|class)* P(word2|class) * P(word3|class)$
finally:
$P(class | word1, word2, word3) = \frac{P(word1|class)* P(word2|class) * P(word3|class) * P(class)}{\sum_k{P(word1|class_k)* P(word2|class_k) * P(word3|class_k) * P(class_k)}}$
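The final formula can be sketched on a tiny corpus as follows. Everything here is illustrative: the function name naive_bayes_posteriors, the toy documents and the labels are made up. The sketch applies no smoothing, so a word never seen in a class zeroes out that class, exactly as in the formula.

```python
from math import prod


def naive_bayes_posteriors(docs, labels, query_words):
    """Return {class: P(class | query_words)} under the naive Bayes assumption."""
    classes = sorted(set(labels))
    scores = {}
    for c in classes:
        in_class = [d for d, y in zip(docs, labels) if y == c]
        prior = len(in_class) / len(docs)                  # P(class)
        likelihood = prod(
            sum(w in d for d in in_class) / len(in_class)  # P(word | class)
            for w in query_words
        )
        scores[c] = likelihood * prior                     # numerator of the formula
    total = sum(scores.values())                           # denominator: sum over classes
    if total == 0:
        return {c: 0.0 for c in classes}
    return {c: s / total for c, s in scores.items()}


# Toy corpus: each document is a set of tokens (made up for illustration)
toy_docs = [{'i', 'love', 'guns'}, {'guns', 'law'},
            {'i', 'love', 'cars'}, {'cars', 'guns'}]
toy_labels = ['guns', 'guns', 'autos', 'autos']

posteriors = naive_bayes_posteriors(toy_docs, toy_labels, {'love', 'guns'})
print(posteriors)
```

With many words the product of small probabilities can underflow, so practical implementations usually sum log-probabilities instead; that variant is left out here to stay close to the formula.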
Homework: naive Bayes 1, by hand
- analogously, implement the function get_prob3(index, document_tokenized); the argument document_tokenized should be the set of words in the document. The function should be a naive Bayes classifier (for the multi-word case)
- run the probability listing above with get_prob3 for the documents {'i','love','guns'} and {'is','there','life','after','death'}
- do the assignment in Jupyter, generate a PDF (code + execution results) and submit it as an assignment in Teams
- deadline: 12.05, points: 40
Homework: naive Bayes 2, ready-made library
- choose one of the repositories below and fork it:
- build a classifier based on naive Bayes (a ready-made library is fine); it may also use ready-made TF-IDF implementations
- write the predictions to the files dev-0/out.tsv and test-A/out.tsv
- the accuracy measured with the geval tool (see the previous assignment) should be at least 0.67
- put the predictions and the scripts that generate them (as plain text, not Jupyter) in the repo, and post a link to your repo in MS Teams; deadline: 12.05, points: 40