aitech-eks-pub-22/cw/06_klasyfikacja.ipynb
Jakub Pokrywka 81d0d0928a add 05 06
2022-04-20 09:56:20 +02:00



Information Extraction

6. Classification [exercises]

Jakub Pokrywka (2021)


Class: classification

The Kleister dataset

import pathlib
from collections import Counter
from sklearn.metrics import accuracy_score
KLEISTER_PATH = pathlib.Path('/home/kuba/kleister-nda')

Question

Does the jurisdiction have to be stated explicitly in the agreement?

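def get_expected_jurisdiction(filepath):
    """Read an expected.tsv file and return one jurisdiction label per line."""
    dataset_expected_jurisdiction = []
    with open(filepath, 'r') as expected_file:
        for line in expected_file:
            # each line is a space-separated list of key=value pairs
            key_values = line.rstrip('\n').split(' ')
            jurisdiction = None
            for key_value in key_values:
                # split on the first '=' only, in case a value contains '='
                key, value = key_value.split('=', 1)
                if key == 'jurisdiction':
                    jurisdiction = value
            # documents without a jurisdiction field get the label 'NONE'
            if jurisdiction is None:
                jurisdiction = 'NONE'
            dataset_expected_jurisdiction.append(jurisdiction)
    return dataset_expected_jurisdiction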
train_expected_jurisdiction = get_expected_jurisdiction(KLEISTER_PATH/'train'/'expected.tsv')
dev_expected_jurisdiction = get_expected_jurisdiction(KLEISTER_PATH/'dev-0'/'expected.tsv')
len(train_expected_jurisdiction)
254
'NONE' in train_expected_jurisdiction
False
len(set(train_expected_jurisdiction))
31
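
The parser above assumes that every line of expected.tsv is a space-separated list of key=value pairs. A minimal sketch of that logic on single lines is shown below; the sample lines and the helper name `parse_expected_line` are made up for illustration and are not rows from the dataset.

```python
def parse_expected_line(line):
    """Return the jurisdiction found in one expected.tsv line, or 'NONE'."""
    jurisdiction = 'NONE'
    for key_value in line.rstrip('\n').split(' '):
        # partition splits on the first '=' only
        key, _, value = key_value.partition('=')
        if key == 'jurisdiction':
            jurisdiction = value
    return jurisdiction

# Hypothetical sample lines (not real dataset rows)
with_jurisdiction = 'effective_date=2001-01-01 jurisdiction=New_York term=2_years\n'
without_jurisdiction = 'effective_date=2001-01-01 term=2_years\n'

print(parse_expected_line(with_jurisdiction))
print(parse_expected_line(without_jurisdiction))
```

A line with no jurisdiction key maps to the label 'NONE', which is how a document that does not state its jurisdiction would show up in the label list.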

Do all states have to appear in the training set of the Kleister dataset?

https://en.wikipedia.org/wiki/U.S._state
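
One way to probe this question is to compare the label sets of the two splits: a classifier that can only output labels seen in training can never be right on a dev example whose label is absent from train. The sketch below uses short invented label lists and a hypothetical helper name; with the real data one would pass `train_expected_jurisdiction` and `dev_expected_jurisdiction` instead.

```python
def unseen_label_report(train_labels, dev_labels):
    """Return (labels that occur only in dev, best accuracy reachable on dev
    by a classifier restricted to labels seen in train)."""
    seen = set(train_labels)
    unseen = sorted(set(dev_labels) - seen)
    reachable = sum(1 for label in dev_labels if label in seen)
    return unseen, reachable / len(dev_labels)

# Toy lists, invented for illustration
toy_train = ['New_York', 'Delaware', 'New_York', 'California']
toy_dev = ['New_York', 'Arizona', 'Delaware', 'NONE']

unseen, upper_bound = unseen_label_report(toy_train, toy_dev)
print(unseen)
print(upper_bound)
```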

What is the baseline?

train_counter = Counter(train_expected_jurisdiction)
train_counter.most_common(100)
[('New_York', 43),
 ('Delaware', 39),
 ('California', 32),
 ('Massachusetts', 15),
 ('Texas', 13),
 ('Illinois', 10),
 ('Oregon', 9),
 ('Florida', 9),
 ('Pennsylvania', 9),
 ('Missouri', 9),
 ('Ohio', 8),
 ('New_Jersey', 7),
 ('Georgia', 6),
 ('Indiana', 5),
 ('Nevada', 5),
 ('Colorado', 4),
 ('Virginia', 4),
 ('Washington', 4),
 ('Michigan', 3),
 ('Minnesota', 3),
 ('Connecticut', 2),
 ('Wisconsin', 2),
 ('Maine', 2),
 ('North_Carolina', 2),
 ('Kansas', 2),
 ('Utah', 2),
 ('Iowa', 1),
 ('Idaho', 1),
 ('South_Dakota', 1),
 ('South_Carolina', 1),
 ('Rhode_Island', 1)]
most_common_answer = train_counter.most_common(100)[0][0]
most_common_answer
'New_York'
dev_predictions_jurisdiction = [most_common_answer] * len(dev_expected_jurisdiction)
dev_expected_jurisdiction
['New_York',
 'New_York',
 'Delaware',
 'Massachusetts',
 'Delaware',
 'Washington',
 'Delaware',
 'New_Jersey',
 'New_York',
 'NONE',
 'NONE',
 'Delaware',
 'Delaware',
 'Delaware',
 'New_York',
 'Massachusetts',
 'Minnesota',
 'California',
 'New_York',
 'California',
 'Iowa',
 'California',
 'Virginia',
 'North_Carolina',
 'Arizona',
 'Indiana',
 'New_Jersey',
 'California',
 'Delaware',
 'Georgia',
 'New_York',
 'New_York',
 'California',
 'Minnesota',
 'California',
 'Kentucky',
 'Minnesota',
 'Ohio',
 'Michigan',
 'California',
 'Minnesota',
 'California',
 'Delaware',
 'Illinois',
 'Minnesota',
 'Texas',
 'New_Jersey',
 'Delaware',
 'Washington',
 'NONE',
 'Delaware',
 'Oregon',
 'Delaware',
 'Delaware',
 'Delaware',
 'Massachusetts',
 'California',
 'NONE',
 'Delaware',
 'Illinois',
 'Idaho',
 'Washington',
 'New_York',
 'New_York',
 'California',
 'Utah',
 'Delaware',
 'Washington',
 'Virginia',
 'New_York',
 'New_York',
 'Illinois',
 'California',
 'Delaware',
 'NONE',
 'Texas',
 'California',
 'Washington',
 'Delaware',
 'Washington',
 'New_York',
 'Washington',
 'Illinois']
counter = 0 
for pred, exp in zip(dev_predictions_jurisdiction, dev_expected_jurisdiction):
    if pred == exp:
        counter +=1
print('accuracy: ', counter/len(dev_predictions_jurisdiction))
accuracy:  0.14457831325301204
accuracy_score(dev_expected_jurisdiction, dev_predictions_jurisdiction)
0.14457831325301204

What if the class names do not appear explicitly in the datasets?

SPORT_PATH='/home/kuba/Syncthing/przedmioty/2020-02/ISI/zajecia6_klasyfikacja/repos/sport-text-classification-ball'

SPORT_TRAIN=$SPORT_PATH/train/train.tsv.gz

SPORT_DEV_EXP=$SPORT_PATH/dev-0/expected.tsv

What is the baseline for sport classification ball?

zcat $SPORT_TRAIN | awk '{print $1}' | wc -l

zcat $SPORT_TRAIN | awk '{print $1}' | grep 1 | wc -l

cat $SPORT_DEV_EXP | wc -l

grep 1 $SPORT_DEV_EXP | wc -l
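
The shell pipelines above count all examples and those labelled 1; the share of the larger class is the majority-class baseline. A rough Python equivalent is sketched below. It assumes the label is the first whitespace-separated field of each line and takes the values 0 or 1, as the `awk '{print $1}'` and `grep 1` calls suggest; the function name and the tiny demo file are hypothetical.

```python
import gzip
import os
import tempfile
from collections import Counter

def majority_baseline(path):
    """Return (most frequent label, its share) for a gzipped TSV
    whose first field is the class label."""
    counts = Counter()
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            fields = line.split(None, 1)  # like awk: split on whitespace
            if fields:                    # skip empty lines
                counts[fields[0]] += 1
    label, count = counts.most_common(1)[0]
    return label, count / sum(counts.values())

# Demo on a throwaway file with invented rows
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'train.tsv.gz')
    with gzip.open(demo_path, 'wt', encoding='utf-8') as f:
        f.write('1\tsome text about football\n')
        f.write('0\tsome text about tennis\n')
        f.write('1\tanother text about volleyball\n')
    label, share = majority_baseline(demo_path)

print(label, share)
```

Unlike `grep 1`, which matches any line containing the character 1, this version compares whole label values.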

A clever approach to text classification? Naive Bayes

from sklearn.datasets import fetch_20newsgroups
# https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import sklearn.metrics
import gensim
newsgroups = fetch_20newsgroups()
newsgroups_text = newsgroups['data']
newsgroups_text_tokenized = [list(set(gensim.utils.tokenize(x, lowercase = True))) for x in newsgroups_text]
print(newsgroups_text[0])
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





print(newsgroups_text_tokenized[0])
['s', 'day', 'was', 'it', 'know', 'is', 'where', 'nntp', 'on', 'body', 'i', 'my', 'il', 'wam', 'maryland', 'model', 'history', 'could', 'really', 'host', 'all', 'subject', 'wondering', 'brought', 'umd', 'edu', 'posting', 'funky', 'bumper', 'rac', 'saw', 'the', 'lines', 'what', 'doors', 'enlighten', 'early', 'out', 'thanks', 'bricklin', 'lerxst', 'front', 'were', 'production', 'other', 'neighborhood', 'late', 'please', 'to', 'rest', 'university', 'park', 'addition', 'can', 'by', 'car', 'whatever', 'tellme', 'anyone', 'sports', 'organization', 'me', 'mail', 'be', 'e', 'if', 'looking', 'years', 'door', 'in', 'separate', 'have', 'there', 'made', 'specs', 'thing', 'engine', 'info', 'you', 'of', 'college', 'small', 'or', 'your', 'called', 'name', 'from', 'a', 'this', 'looked']
Y = newsgroups['target']
Y
array([7, 4, 4, ..., 3, 1, 8])
Y_names = newsgroups['target_names']
Y_names
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
Y_names[16]
'talk.politics.guns'

$P('talk.politics.guns' | 'gun')= ?$

$P(A|B) * P(B) = P(A) * P(B|A)$

$P(A|B) = \frac{P(A) * P(B|A)}{P(B)}$

$P('talk.politics.guns' | 'gun') * P('gun') = P('gun'|'talk.politics.guns') * P('talk.politics.guns')$

$P('talk.politics.guns' | 'gun') = \frac{P('gun'|'talk.politics.guns') * P('talk.politics.guns')}{P('gun')}$

$p1 = P('gun'|'talk.politics.guns')$

$p2 = P('talk.politics.guns')$

$p3 = P('gun')$
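
To make the formula concrete, here is a small worked example with invented counts (they are not taken from the 20 newsgroups data). It also shows that the result of Bayes' rule equals the direct ratio of class documents among all documents containing the word.

```python
# Invented counts, for illustration only
n_docs = 1000            # all documents
n_class = 50             # documents in the class
n_class_and_word = 40    # documents in the class that contain the word
n_word = 80              # documents (of any class) that contain the word

p1 = n_class_and_word / n_class     # P(word | class)
p2 = n_class / n_docs               # P(class)
p3 = n_word / n_docs                # P(word)

posterior = (p1 * p2) / p3          # P(class | word) by Bayes' rule
direct = n_class_and_word / n_word  # the same quantity counted directly

print(posterior, direct)
```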

computing $p1 = P('gun'|'talk.politics.guns')$

# to be done on your own

computing $p2 = P('talk.politics.guns')$

# to be done on your own

computing $p3 = P('gun')$

# to be done on your own

finally

(p1 * p2) / p3
def get_prob(index):
    # documents (as token lists) that belong to the class `index`
    talks_topic = [x for x, y in zip(newsgroups_text_tokenized, Y) if y == index]

    if len(talks_topic) == 0:
        return 0.0
    # p1 = P('gun' | class), p2 = P(class), p3 = P('gun')
    p1 = len([x for x in talks_topic if 'gun' in x]) / len(talks_topic)
    p2 = len(talks_topic) / len(Y)
    p3 = len([x for x in newsgroups_text_tokenized if 'gun' in x]) / len(Y)

    if p3 == 0:
        return 0.0
    return (p1 * p2) / p3
probs = []
for i in range(len(Y_names)):
    prob = get_prob(i)
    probs.append(prob)
    print("%.5f" % prob, '\t\t', Y_names[i])

print("%.5f" % sum(probs), '\t\tsum')

Exercise (to be done on your own)

def get_prob2(index, word):
    pass
# listing for get_prob2, word 'god'

The naive Bayes assumption

$P(class | word1, word2, word3) = \frac{P(word1, word2, word3|class) * P(class)}{P(word1, word2, word3)}$

assuming the random variables $word1$, $word2$, $word3$ are independent given the class:

$P(word1, word2, word3|class) = P(word1|class)* P(word2|class) * P(word3|class)$

finally, expanding the denominator $P(word1, word2, word3)$ as a sum over all classes $class_k$:

$P(class | word1, word2, word3) = \frac{P(word1|class)* P(word2|class) * P(word3|class) * P(class)}{\sum_k{P(word1|class_k)* P(word2|class_k) * P(word3|class_k) * P(class_k)}}$
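
The final formula can be checked on a tiny hand-made example. The sketch below plugs invented per-class word probabilities and priors into it for two hypothetical classes; it only illustrates the arithmetic and is not an estimator fitted to the newsgroups data.

```python
import math

# Invented numbers for two hypothetical classes
priors = {'guns': 0.5, 'religion': 0.5}            # P(class)
word_given_class = {                               # P(word | class)
    'guns':     {'i': 0.9, 'love': 0.2, 'guns': 0.6},
    'religion': {'i': 0.9, 'love': 0.4, 'guns': 0.1},
}

def naive_bayes_posteriors(words, priors, word_given_class):
    """P(class | words) under the naive independence assumption."""
    # numerator of the formula, computed for every class_k
    scores = {
        c: priors[c] * math.prod(word_given_class[c][w] for w in words)
        for c in priors
    }
    total = sum(scores.values())  # denominator: sum over all classes
    return {c: s / total for c, s in scores.items()}

posteriors = naive_bayes_posteriors(['i', 'love', 'guns'], priors, word_given_class)
for name, p in posteriors.items():
    print("%.5f" % p, name)
```

Because the denominator is the sum of the numerators over all classes, the posteriors add up to 1. If every class scored 0, the division would fail, so code working on real counts needs a guard similar to the one in get_prob.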

Homework: naive Bayes 1, by hand

  • analogously, implement the function get_prob3(index, document_tokenized); the argument document_tokenized is the set of words in the document. The function should be a naive Bayes classifier (for the multi-word case)
  • run the probability listing above with get_prob3 for the documents {'i','love','guns'} and {'is','there','life','after','death'}
  • please do the assignment in Jupyter, generate a PDF (code + run results), and submit it as an assignment in Teams
  • deadline: 10.05, points: 40

Homework: naive Bayes 2, using a ready-made library