aitech-eks-pub/cw/06_klasyfikacja_ODPOWIEDZI.ipynb

Classification lab

The Kleister dataset

import pathlib
from collections import Counter
from sklearn.metrics import accuracy_score
KLEISTER_PATH = pathlib.Path('/home/kuba/Syncthing/przedmioty/2020-02/IE/applica/kleister-nda')

Question

Does the jurisdiction have to be stated explicitly in the contract?

def get_expected_jurisdiction(filepath):
    """Read an expected.tsv file and extract the jurisdiction label of each document."""
    dataset_expected_jurisdiction = []
    with open(filepath, 'r') as train_expected_file:
        for line in train_expected_file:
            # each line is a space-separated list of key=value pairs
            key_values = line.rstrip('\n').split(' ')
            jurisdiction = None
            for key_value in key_values:
                # split on the first '=' only, in case the value itself contains '='
                key, value = key_value.split('=', 1)
                if key == 'jurisdiction':
                    jurisdiction = value
            # documents without a jurisdiction key get an explicit NONE label
            if jurisdiction is None:
                jurisdiction = 'NONE'
            dataset_expected_jurisdiction.append(jurisdiction)
    return dataset_expected_jurisdiction
train_expected_jurisdiction = get_expected_jurisdiction(KLEISTER_PATH/'train'/'expected.tsv')
dev_expected_jurisdiction = get_expected_jurisdiction(KLEISTER_PATH/'dev-0'/'expected.tsv')
len(train_expected_jurisdiction)
254
'NONE' in train_expected_jurisdiction
False
len(set(train_expected_jurisdiction))
31

Do all the states have to appear in the training part of the Kleister dataset?

https://en.wikipedia.org/wiki/U.S._state
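
A quick check against the lists above (a minimal sketch; judging from the printed outputs, the answer is no: 'Arizona', 'Kentucky' and 'NONE' occur in dev-0 but never in train):

# dev-0 labels that never appear in the training set
sorted(set(dev_expected_jurisdiction) - set(train_expected_jurisdiction))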

What is the baseline?

train_counter = Counter(train_expected_jurisdiction)
train_counter.most_common(100)
[('New_York', 43),
 ('Delaware', 39),
 ('California', 32),
 ('Massachusetts', 15),
 ('Texas', 13),
 ('Illinois', 10),
 ('Oregon', 9),
 ('Florida', 9),
 ('Pennsylvania', 9),
 ('Missouri', 9),
 ('Ohio', 8),
 ('New_Jersey', 7),
 ('Georgia', 6),
 ('Indiana', 5),
 ('Nevada', 5),
 ('Colorado', 4),
 ('Virginia', 4),
 ('Washington', 4),
 ('Michigan', 3),
 ('Minnesota', 3),
 ('Connecticut', 2),
 ('Wisconsin', 2),
 ('Maine', 2),
 ('North_Carolina', 2),
 ('Kansas', 2),
 ('Utah', 2),
 ('Iowa', 1),
 ('Idaho', 1),
 ('South_Dakota', 1),
 ('South_Carolina', 1),
 ('Rhode_Island', 1)]
most_common_answer = train_counter.most_common(1)[0][0]
most_common_answer
'New_York'
dev_predictions_jurisdiction = [most_common_answer] * len(dev_expected_jurisdiction)
dev_expected_jurisdiction
['New_York',
 'New_York',
 'Delaware',
 'Massachusetts',
 'Delaware',
 'Washington',
 'Delaware',
 'New_Jersey',
 'New_York',
 'NONE',
 'NONE',
 'Delaware',
 'Delaware',
 'Delaware',
 'New_York',
 'Massachusetts',
 'Minnesota',
 'California',
 'New_York',
 'California',
 'Iowa',
 'California',
 'Virginia',
 'North_Carolina',
 'Arizona',
 'Indiana',
 'New_Jersey',
 'California',
 'Delaware',
 'Georgia',
 'New_York',
 'New_York',
 'California',
 'Minnesota',
 'California',
 'Kentucky',
 'Minnesota',
 'Ohio',
 'Michigan',
 'California',
 'Minnesota',
 'California',
 'Delaware',
 'Illinois',
 'Minnesota',
 'Texas',
 'New_Jersey',
 'Delaware',
 'Washington',
 'NONE',
 'Delaware',
 'Oregon',
 'Delaware',
 'Delaware',
 'Delaware',
 'Massachusetts',
 'California',
 'NONE',
 'Delaware',
 'Illinois',
 'Idaho',
 'Washington',
 'New_York',
 'New_York',
 'California',
 'Utah',
 'Delaware',
 'Washington',
 'Virginia',
 'New_York',
 'New_York',
 'Illinois',
 'California',
 'Delaware',
 'NONE',
 'Texas',
 'California',
 'Washington',
 'Delaware',
 'Washington',
 'New_York',
 'Washington',
 'Illinois']
counter = 0
for pred, exp in zip(dev_predictions_jurisdiction, dev_expected_jurisdiction):
    if pred == exp:
        counter += 1
print('accuracy: ', counter/len(dev_predictions_jurisdiction))
accuracy:  0.14457831325301204
# scikit-learn's convention is accuracy_score(y_true, y_pred); for accuracy the order does not change the result
accuracy_score(dev_expected_jurisdiction, dev_predictions_jurisdiction)
0.14457831325301204
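
The same baseline can also be obtained with scikit-learn's DummyClassifier; a minimal sketch (the features are ignored by the 'most_frequent' strategy, so a zero matrix serves as a placeholder):

from sklearn.dummy import DummyClassifier
import numpy as np

dummy = DummyClassifier(strategy='most_frequent')
# the zero features carry no information; only the label distribution matters here
dummy.fit(np.zeros((len(train_expected_jurisdiction), 1)), train_expected_jurisdiction)
dummy.score(np.zeros((len(dev_expected_jurisdiction), 1)), dev_expected_jurisdiction)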

What if the class names do not occur explicitly in the datasets?

SPORT_PATH='/home/kuba/Syncthing/przedmioty/2020-02/ISI/zajecia6_klasyfikacja/repos/sport-text-classification-ball'

SPORT_TRAIN=$SPORT_PATH/train/train.tsv.gz

SPORT_DEV_EXP=$SPORT_PATH/dev-0/expected.tsv

What is the baseline for sport-text-classification-ball?

# total number of training examples
zcat $SPORT_TRAIN | awk '{print $1}' | wc -l

# number of training examples with label 1 (match the whole line, not any '1')
zcat $SPORT_TRAIN | awk '{print $1}' | grep -x 1 | wc -l

# total number of dev examples
cat $SPORT_DEV_EXP | wc -l

# number of dev examples with label 1
grep -x 1 $SPORT_DEV_EXP | wc -l
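
The majority-class baseline accuracy follows directly from these counts. A minimal Python sketch of the same computation (assuming, as the awk call above suggests, that the label is the first whitespace-separated field of each training line, and that expected.tsv holds one label per line):

import gzip
from collections import Counter

# same location as the shell variable above
SPORT_PATH = '/home/kuba/Syncthing/przedmioty/2020-02/ISI/zajecia6_klasyfikacja/repos/sport-text-classification-ball'

with gzip.open(SPORT_PATH + '/train/train.tsv.gz', 'rt') as f:
    train_labels = [line.split()[0] for line in f]
majority_label = Counter(train_labels).most_common(1)[0][0]

with open(SPORT_PATH + '/dev-0/expected.tsv') as f:
    dev_labels = [line.strip() for line in f]

print('baseline accuracy:', sum(label == majority_label for label in dev_labels) / len(dev_labels))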

A clever approach to text classification? Naive Bayes

from sklearn.datasets import fetch_20newsgroups
# https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import sklearn.metrics
import gensim
/home/kuba/anaconda3/lib/python3.8/site-packages/gensim/similarities/__init__.py:15: UserWarning: The gensim.similarities.levenshtein submodule is disabled, because the optional Levenshtein package <https://pypi.org/project/python-Levenshtein/> is unavailable. Install Levenhstein (e.g. `pip install python-Levenshtein`) to suppress this warning.
  warnings.warn(msg)
newsgroups = fetch_20newsgroups()
newsgroups_text = newsgroups['data']
# each document becomes a deduplicated list of lowercased tokens (word presence, not counts)
newsgroups_text_tokenized = [list(set(gensim.utils.tokenize(x, lowercase=True))) for x in newsgroups_text]
print(newsgroups_text[0])
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





print(newsgroups_text_tokenized[0])
['lerxst', 'on', 'be', 'name', 'brought', 'late', 'front', 'umd', 'bumper', 'door', 'there', 'subject', 'day', 'early', 'history', 'me', 'neighborhood', 'university', 'mail', 'doors', 'by', 'funky', 'if', 'engine', 'know', 'years', 'maryland', 'your', 'rest', 'is', 'info', 'body', 'have', 'tellme', 'out', 'anyone', 'small', 'wam', 'il', 'organization', 'thanks', 'park', 'made', 'whatever', 'other', 'specs', 'wondering', 'lines', 'from', 'was', 'a', 'what', 'the', 's', 'or', 'please', 'all', 'rac', 'i', 'looked', 'really', 'edu', 'where', 'to', 'e', 'my', 'it', 'car', 'addition', 'can', 'of', 'production', 'in', 'saw', 'separate', 'you', 'thing', 'posting', 'bricklin', 'could', 'enlighten', 'nntp', 'model', 'were', 'host', 'looking', 'this', 'college', 'sports', 'called']
Y = newsgroups['target']
Y
array([7, 4, 4, ..., 3, 1, 8])
Y_names = newsgroups['target_names']
Y_names
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
Y_names[16]
'talk.politics.guns'
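
As a point of reference, the same idea is available off the shelf in scikit-learn; a minimal sketch, not part of the original derivation (BernoulliNB matches the set-of-words representation used above, since each word counts at most once per document). The rest of the notebook rebuilds this computation from first principles.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# binary=True records word presence/absence, mirroring the token sets above
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(newsgroups_text)
clf = BernoulliNB()
clf.fit(X, Y)
clf.score(X, Y)  # training accuracy, only a sanity check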

$P('talk.politics.guns' | 'gun') = ?$

$P(A|B) * P(B) = P(B|A) * P(A)$

$P(A|B) = \frac{P(B|A) * P(A)}{P(B)}$

$P('talk.politics.guns' | 'gun') * P('gun') = P('gun'|'talk.politics.guns') * P('talk.politics.guns')$

$P('talk.politics.guns' | 'gun') = \frac{P('gun'|'talk.politics.guns') * P('talk.politics.guns')}{P('gun')}$

$p1 = P('gun'|'talk.politics.guns')$

$p2 = P('talk.politics.guns')$

$p3 = P('gun')$

computing $p1 = P('gun'|'talk.politics.guns')$

talk_politics_guns = [x for x,y in zip(newsgroups_text_tokenized,Y) if y == 16]
len(talk_politics_guns)
546
len([x for x in talk_politics_guns if 'gun' in x])
253
p1 = len([x for x in talk_politics_guns if 'gun' in x]) / len(talk_politics_guns)
p1
0.4633699633699634

computing $p2 = P('talk.politics.guns')$

p2 = len(talk_politics_guns) / len(Y)
p2
0.048258794414000356

computing $p3 = P('gun')$

p3 = len([x for x in newsgroups_text_tokenized if 'gun' in x]) / len(Y)
p3
0.03270284603146544

finally

(p1 * p2) / p3
0.6837837837837839
def get_prob(index):
    # documents belonging to the class with the given index
    talks_topic = [x for x, y in zip(newsgroups_text_tokenized, Y) if y == index]

    if len(talks_topic) == 0:
        return 0.0
    # p1 = P('gun' | class), p2 = P(class), p3 = P('gun')
    p1 = len([x for x in talks_topic if 'gun' in x]) / len(talks_topic)
    p2 = len(talks_topic) / len(Y)
    p3 = len([x for x in newsgroups_text_tokenized if 'gun' in x]) / len(Y)

    if p3 == 0:
        return 0.0
    else:
        return (p1 * p2) / p3
probs = []
for i in range(len(Y_names)):
    probs.append(get_prob(i))
    print("%.5f" % probs[-1], '\t\t', Y_names[i])

print("%.5f" % sum(probs), '\t\tsum')
0.01622 		 alt.atheism
0.00000 		 comp.graphics
0.00541 		 comp.os.ms-windows.misc
0.01892 		 comp.sys.ibm.pc.hardware
0.00270 		 comp.sys.mac.hardware
0.00000 		 comp.windows.x
0.01351 		 misc.forsale
0.04054 		 rec.autos
0.01892 		 rec.motorcycles
0.00270 		 rec.sport.baseball
0.00541 		 rec.sport.hockey
0.03784 		 sci.crypt
0.02973 		 sci.electronics
0.00541 		 sci.med
0.01622 		 sci.space
0.00270 		 soc.religion.christian
0.68378 		 talk.politics.guns
0.04595 		 talk.politics.mideast
0.03784 		 talk.politics.misc
0.01622 		 talk.religion.misc
1.00000 		sum
def get_prob2(index, word):
    # documents belonging to the class with the given index
    talks_topic = [x for x, y in zip(newsgroups_text_tokenized, Y) if y == index]

    if len(talks_topic) == 0:
        return 0.0
    # p1 = P(word | class), p2 = P(class), p3 = P(word)
    p1 = len([x for x in talks_topic if word in x]) / len(talks_topic)
    p2 = len(talks_topic) / len(Y)
    p3 = len([x for x in newsgroups_text_tokenized if word in x]) / len(Y)

    if p3 == 0:
        return 0.0
    else:
        return (p1 * p2) / p3
probs = []
for i in range(len(Y_names)):
    probs.append(get_prob2(i, 'god'))
    print("%.5f" % probs[-1], '\t\t', Y_names[i])

print("%.5f" % sum(probs), '\t\tsum')
0.20874 		 alt.atheism
0.00850 		 comp.graphics
0.00364 		 comp.os.ms-windows.misc
0.00850 		 comp.sys.ibm.pc.hardware
0.00243 		 comp.sys.mac.hardware
0.00485 		 comp.windows.x
0.00607 		 misc.forsale
0.01092 		 rec.autos
0.02063 		 rec.motorcycles
0.01456 		 rec.sport.baseball
0.01092 		 rec.sport.hockey
0.00485 		 sci.crypt
0.00364 		 sci.electronics
0.00364 		 sci.med
0.01092 		 sci.space
0.41748 		 soc.religion.christian
0.03398 		 talk.politics.guns
0.02791 		 talk.politics.mideast
0.02549 		 talk.politics.misc
0.17233 		 talk.religion.misc
1.00000 		sum

the naive Bayes assumption

$P(class | word1, word2, word3) = \frac{P(word1, word2, word3|class) * P(class)}{P(word1, word2, word3)}$

assuming conditional independence of the random variables $word1$, $word2$, $word3$ given the class:

$P(word1, word2, word3|class) = P(word1|class)* P(word2|class) * P(word3|class)$

finally:

$P(class | word1, word2, word3) = \frac{P(word1|class)* P(word2|class) * P(word3|class) * P(class)}{\sum_k{P(word1|class_k)* P(word2|class_k) * P(word3|class_k) * P(class_k)}}$
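
A rough sketch of how this formula maps onto the counting style used above (the helper names are hypothetical, this is not the reference get_prob3; in practice one also adds smoothing and works with log-probabilities, since here a single unseen word zeroes out a class):

def class_word_prob(index, word):
    # estimate P(word | class) by document counts, as in get_prob2
    talks_topic = [x for x, y in zip(newsgroups_text_tokenized, Y) if y == index]
    if not talks_topic:
        return 0.0
    return len([x for x in talks_topic if word in x]) / len(talks_topic)

def class_prior(index):
    # estimate P(class) as its share of the training documents
    return sum(1 for y in Y if y == index) / len(Y)

def posterior(index, document_tokenized):
    # numerator and denominator of the formula above
    numerator = class_prior(index) * np.prod([class_word_prob(index, w) for w in document_tokenized])
    denominator = sum(
        class_prior(k) * np.prod([class_word_prob(k, w) for w in document_tokenized])
        for k in range(len(Y_names))
    )
    return numerator / denominator if denominator > 0 else 0.0

# e.g. posterior(16, {'i', 'love', 'guns'})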

homework: naive Bayes 1

  • analogously, implement a function get_prob3(index, document_tokenized), where the argument document_tokenized is the set of words of a document; the function should be a naive Bayes classifier (for the multi-word case)
  • run the probability listing above with get_prob3 for the documents: {'i','love','guns'} and {'is','there','life','after','death'}
  • do the assignment in Jupyter, generate a PDF (code + execution results) and submit it as an assignment in Teams
  • deadline: 12.05, points: 40
