aitech-eks-pub/cw/06_klasyfikacja.ipynb


Class: classification

The Kleister dataset

import pathlib
from collections import Counter
from sklearn.metrics import *
KLEISTER_PATH = pathlib.Path('/home/kuba/Syncthing/przedmioty/2020-02/IE/applica/kleister-nda')

Question

Does the jurisdiction have to be stated explicitly in the contract?

def get_expected_jurisdiction(filepath):
    """Read one jurisdiction label per line of an expected.tsv file ('NONE' if absent)."""
    dataset_expected_jurisdiction = []
    with open(filepath, 'r') as expected_file:
        for line in expected_file:
            # each line holds space-separated key=value pairs
            key_values = line.rstrip('\n').split(' ')
            jurisdiction = None
            for key_value in key_values:
                # split on the first '=' only, so values containing '=' stay intact
                key, value = key_value.split('=', 1)
                if key == 'jurisdiction':
                    jurisdiction = value
            if jurisdiction is None:
                jurisdiction = 'NONE'
            dataset_expected_jurisdiction.append(jurisdiction)
    return dataset_expected_jurisdiction
train_expected_jurisdiction = get_expected_jurisdiction(KLEISTER_PATH/'train'/'expected.tsv')
dev_expected_jurisdiction = get_expected_jurisdiction(KLEISTER_PATH/'dev-0'/'expected.tsv')
len(train_expected_jurisdiction)
254
'NONE' in train_expected_jurisdiction
False
len(set(train_expected_jurisdiction))
31

Do all states have to appear in the training set of the Kleister dataset?

https://en.wikipedia.org/wiki/U.S._state
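
One way to probe this question is to compare the label sets of the training and development splits. The sketch below uses small made-up label lists; `unseen_labels` is a hypothetical helper, not part of the notebook. In the notebook itself the same call could presumably be made with `train_expected_jurisdiction` and `dev_expected_jurisdiction`.

```python
def unseen_labels(train_labels, eval_labels):
    """Return labels that occur in eval_labels but never in train_labels."""
    return sorted(set(eval_labels) - set(train_labels))

# Toy label lists, invented for illustration only
toy_train = ['New_York', 'Delaware', 'California', 'Delaware']
toy_dev = ['Delaware', 'Arizona', 'NONE', 'New_York', 'Kentucky']

missing = unseen_labels(toy_train, toy_dev)
print(missing)
```

Any label reported here cannot be predicted by a classifier that only outputs classes seen during training.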

What is the baseline?

train_counter = Counter(train_expected_jurisdiction)
train_counter.most_common(100)
[('New_York', 43),
 ('Delaware', 39),
 ('California', 32),
 ('Massachusetts', 15),
 ('Texas', 13),
 ('Illinois', 10),
 ('Oregon', 9),
 ('Florida', 9),
 ('Pennsylvania', 9),
 ('Missouri', 9),
 ('Ohio', 8),
 ('New_Jersey', 7),
 ('Georgia', 6),
 ('Indiana', 5),
 ('Nevada', 5),
 ('Colorado', 4),
 ('Virginia', 4),
 ('Washington', 4),
 ('Michigan', 3),
 ('Minnesota', 3),
 ('Connecticut', 2),
 ('Wisconsin', 2),
 ('Maine', 2),
 ('North_Carolina', 2),
 ('Kansas', 2),
 ('Utah', 2),
 ('Iowa', 1),
 ('Idaho', 1),
 ('South_Dakota', 1),
 ('South_Carolina', 1),
 ('Rhode_Island', 1)]
most_common_answer = train_counter.most_common(100)[0][0]
most_common_answer
'New_York'
dev_predictions_jurisdiction = [most_common_answer] * len(dev_expected_jurisdiction)
dev_expected_jurisdiction
['New_York',
 'New_York',
 'Delaware',
 'Massachusetts',
 'Delaware',
 'Washington',
 'Delaware',
 'New_Jersey',
 'New_York',
 'NONE',
 'NONE',
 'Delaware',
 'Delaware',
 'Delaware',
 'New_York',
 'Massachusetts',
 'Minnesota',
 'California',
 'New_York',
 'California',
 'Iowa',
 'California',
 'Virginia',
 'North_Carolina',
 'Arizona',
 'Indiana',
 'New_Jersey',
 'California',
 'Delaware',
 'Georgia',
 'New_York',
 'New_York',
 'California',
 'Minnesota',
 'California',
 'Kentucky',
 'Minnesota',
 'Ohio',
 'Michigan',
 'California',
 'Minnesota',
 'California',
 'Delaware',
 'Illinois',
 'Minnesota',
 'Texas',
 'New_Jersey',
 'Delaware',
 'Washington',
 'NONE',
 'Delaware',
 'Oregon',
 'Delaware',
 'Delaware',
 'Delaware',
 'Massachusetts',
 'California',
 'NONE',
 'Delaware',
 'Illinois',
 'Idaho',
 'Washington',
 'New_York',
 'New_York',
 'California',
 'Utah',
 'Delaware',
 'Washington',
 'Virginia',
 'New_York',
 'New_York',
 'Illinois',
 'California',
 'Delaware',
 'NONE',
 'Texas',
 'California',
 'Washington',
 'Delaware',
 'Washington',
 'New_York',
 'Washington',
 'Illinois']
counter = 0 
for pred, exp in zip(dev_predictions_jurisdiction, dev_expected_jurisdiction):
    if pred == exp:
        counter +=1
print('accuracy: ', counter/len(dev_predictions_jurisdiction))
accuracy:  0.14457831325301204
accuracy_score(dev_expected_jurisdiction, dev_predictions_jurisdiction)
0.14457831325301204

What if the class names do not appear explicitly in the datasets?

SPORT_PATH='/home/kuba/Syncthing/przedmioty/2020-02/ISI/zajecia6_klasyfikacja/repos/sport-text-classification-ball'

SPORT_TRAIN=$SPORT_PATH/train/train.tsv.gz

SPORT_DEV_EXP=$SPORT_PATH/dev-0/expected.tsv

What is the baseline for sport-text-classification-ball?

zcat $SPORT_TRAIN | awk '{print $1}' | wc -l

zcat $SPORT_TRAIN | awk '{print $1}' | grep 1 | wc -l

cat $SPORT_DEV_EXP | wc -l

grep 1 $SPORT_DEV_EXP | wc -l
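
The counts above are enough to derive a majority-class baseline by hand. As a rough Python equivalent, the sketch below picks the most frequent training label and scores it on the development labels. The label lists are invented, and `majority_baseline_accuracy` is a hypothetical helper name; reading the real files is left out.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, dev_labels):
    """Predict the most frequent training label everywhere; return (label, accuracy)."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in dev_labels if label == majority_label)
    return majority_label, hits / len(dev_labels)

# Toy binary labels, invented for illustration only
toy_train = ['1', '0', '0', '1', '0']
toy_dev = ['0', '1', '0', '0']

label, acc = majority_baseline_accuracy(toy_train, toy_dev)
print(label, acc)
```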

A smart approach to text classification? Naive Bayes

from sklearn.datasets import fetch_20newsgroups
# https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import sklearn.metrics
import gensim
/home/kuba/anaconda3/lib/python3.8/site-packages/gensim/similarities/__init__.py:15: UserWarning: The gensim.similarities.levenshtein submodule is disabled, because the optional Levenshtein package <https://pypi.org/project/python-Levenshtein/> is unavailable. Install Levenhstein (e.g. `pip install python-Levenshtein`) to suppress this warning.
  warnings.warn(msg)
newsgroups = fetch_20newsgroups()
newsgroups_text = newsgroups['data']
newsgroups_text_tokenized = [list(set(gensim.utils.tokenize(x, lowercase = True))) for x in newsgroups_text]
print(newsgroups_text[0])
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





print(newsgroups_text_tokenized[0])
['where', 'name', 'looked', 'to', 'have', 'out', 'on', 'by', 'park', 'what', 'from', 'host', 'doors', 'day', 'be', 'organization', 'e', 'front', 'in', 'it', 'history', 'brought', 'know', 'addition', 'il', 'of', 'lines', 'i', 'your', 'bumper', 'there', 'please', 'me', 'separate', 'is', 'tellme', 'can', 'could', 'called', 'specs', 'college', 'this', 'thanks', 'looking', 'if', 'production', 'sports', 'lerxst', 'whatever', 'anyone', 'enlighten', 'saw', 'all', 'small', 'you', 'wam', 'mail', 'rest', 's', 'late', 'rac', 'funky', 'edu', 'info', 'the', 'wondering', 'years', 'door', 'posting', 'car', 'made', 'or', 'maryland', 'subject', 'bricklin', 'was', 'model', 'thing', 'university', 'engine', 'nntp', 'other', 'really', 'neighborhood', 'early', 'a', 'umd', 'my', 'body', 'were']
Y = newsgroups['target']
Y
array([7, 4, 4, ..., 3, 1, 8])
Y_names = newsgroups['target_names']
Y_names
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
Y_names[16]
'talk.politics.guns'

$P('talk.politics.guns' | 'gun')= ?$

$P(A|B) * P(B) = P(B|A) * P(A)$

$P(A|B) = \frac{P(B|A) * P(A)}{P(B)}$

$P('talk.politics.guns' | 'gun') * P('gun') = P('gun'|'talk.politics.guns') * P('talk.politics.guns')$

$P('talk.politics.guns' | 'gun') = \frac{P('gun'|'talk.politics.guns') * P('talk.politics.guns')}{P('gun')}$

$p1 = P('gun'|'talk.politics.guns')$

$p2 = P('talk.politics.guns')$

$p3 = P('gun')$
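
To make the three quantities concrete, here is a minimal sketch on a made-up five-document corpus (not the 20 newsgroups data). The class names, the documents, and the helper `posterior` are all assumptions introduced for illustration.

```python
# Toy corpus: (set of tokens, class label); invented data
toy_docs = [
    ({'gun', 'law'}, 'guns'),
    ({'gun', 'rifle'}, 'guns'),
    ({'car', 'gun'}, 'autos'),
    ({'car', 'engine'}, 'autos'),
    ({'car', 'wheel'}, 'autos'),
]

def posterior(word, cls, docs):
    """Estimate P(cls | word) from document counts via Bayes' rule."""
    in_class = [tokens for tokens, label in docs if label == cls]
    if not in_class:
        return 0.0
    p1 = sum(word in tokens for tokens in in_class) / len(in_class)  # P(word | cls)
    p2 = len(in_class) / len(docs)                                   # P(cls)
    p3 = sum(word in tokens for tokens, _ in docs) / len(docs)       # P(word)
    if p3 == 0:
        return 0.0
    return p1 * p2 / p3

result = posterior('gun', 'guns', toy_docs)
print(result)
```

Here two of the three documents containing 'gun' belong to the 'guns' class, so the estimate is 2/3, and the posteriors over both classes sum to 1.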

computing $p1 = P('gun'|'talk.politics.guns')$

# to be done on your own

computing $p2 = P('talk.politics.guns')$

# to be done on your own

computing $p3 = P('gun')$

# to be done on your own

finally

(p1 * p2) / p3
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-31-447f586cc09f> in <module>
----> 1 (p1 * p2) / p3

NameError: name 'p1' is not defined
def get_prob(index):
    # documents (deduplicated token lists) belonging to the class with the given index
    talks_topic = [x for x, y in zip(newsgroups_text_tokenized, Y) if y == index]

    if len(talks_topic) == 0:
        return 0.0
    p1 = len([x for x in talks_topic if 'gun' in x]) / len(talks_topic)  # P('gun' | class)
    p2 = len(talks_topic) / len(Y)  # P(class)
    p3 = len([x for x in newsgroups_text_tokenized if 'gun' in x]) / len(Y)  # P('gun')

    if p3 == 0:
        return 0.0
    return (p1 * p2) / p3
probs = []
for i in range(len(Y_names)):
    probs.append(get_prob(i))
    print("%.5f" %   get_prob(i),'\t\t', Y_names[i])
    
print("%.5f" % sum(probs), '\t\tsuma',)
0.01622 		 alt.atheism
0.00000 		 comp.graphics
0.00541 		 comp.os.ms-windows.misc
0.01892 		 comp.sys.ibm.pc.hardware
0.00270 		 comp.sys.mac.hardware
0.00000 		 comp.windows.x
0.01351 		 misc.forsale
0.04054 		 rec.autos
0.01892 		 rec.motorcycles
0.00270 		 rec.sport.baseball
0.00541 		 rec.sport.hockey
0.03784 		 sci.crypt
0.02973 		 sci.electronics
0.00541 		 sci.med
0.01622 		 sci.space
0.00270 		 soc.religion.christian
0.68378 		 talk.politics.guns
0.04595 		 talk.politics.mideast
0.03784 		 talk.politics.misc
0.01622 		 talk.religion.misc
1.00000 		suma

Exercise to do on your own

def get_prob2(index, word):
    pass
# listing for get_prob2, word 'god'

The naive Bayes assumption

$P(class | word1, word2, word3) = \frac{P(word1, word2, word3|class) * P(class)}{P(word1, word2, word3)}$

assuming the random variables $word1$, $word2$, $word3$ are conditionally independent given the class:

$P(word1, word2, word3|class) = P(word1|class)* P(word2|class) * P(word3|class)$

finally:

$P(class | word1, word2, word3) = \frac{P(word1|class)* P(word2|class) * P(word3|class) * P(class)}{\sum_k{P(word1|class_k)* P(word2|class_k) * P(word3|class_k) * P(class_k)}}$
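
The final formula can be sketched on a made-up corpus as follows. The data and the helper `naive_bayes_posteriors` are assumptions for illustration, not the notebook's code. No smoothing is applied, so a word that never occurs in a class drives that class's score to zero.

```python
from math import prod

# Toy corpus: (set of tokens, class label); invented data
toy_docs = [
    ({'gun', 'law'}, 'guns'),
    ({'gun', 'car'}, 'guns'),
    ({'car', 'gun'}, 'autos'),
    ({'car', 'engine'}, 'autos'),
    ({'car', 'wheel'}, 'autos'),
]

def naive_bayes_posteriors(words, docs):
    """Return {class: P(class | words)} under the naive independence assumption."""
    classes = sorted({label for _, label in docs})
    scores = {}
    for cls in classes:
        in_class = [tokens for tokens, label in docs if label == cls]
        prior = len(in_class) / len(docs)  # P(class)
        # product of per-word estimates P(word | class)
        likelihood = prod(
            sum(w in tokens for tokens in in_class) / len(in_class) for w in words
        )
        scores[cls] = likelihood * prior
    total = sum(scores.values())  # the denominator: sum over all classes
    if total == 0:
        return {cls: 0.0 for cls in classes}
    return {cls: score / total for cls, score in scores.items()}

posteriors = naive_bayes_posteriors({'gun', 'car'}, toy_docs)
print(posteriors)
```

Normalizing by the sum over classes avoids estimating $P(word1, word2, word3)$ directly, which matches the denominator of the formula above.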

Homework: naive Bayes 1, by hand

  • analogously, implement a function get_prob3(index, document_tokenized), where the argument document_tokenized is the set of words in the document. The function should be a naive Bayes classifier (for the case of many words)
  • run the probability listing above with get_prob3 for the documents: {'i','love','guns'} and {'is','there','life','after' ,'death'}
  • please do the assignment in Jupyter, generate a PDF (code + execution results) and submit it as an assignment in Teams
  • deadline: 12.05, points: 40

Homework: naive Bayes 2, off-the-shelf library