{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 6. <i>Klasyfikacja</i> [\u0107wiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Zaj\u0119cia klasyfikacja"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Zbi\u00f3r kleister"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pathlib\n",
"from collections import Counter\n",
"from sklearn.metrics import *"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"KLEISTER_PATH = pathlib.Path('/home/kuba/Syncthing/przedmioty/2020-02/IE/applica/kleister-nda')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pytanie\n",
"\n",
"Czy jurysdykcja musi by\u0107 zapisana explicite w umowie?"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def get_expected_jurisdiction(filepath):\n",
" dataset_expected_jurisdiction = []\n",
" with open(filepath,'r') as train_expected_file:\n",
" for line in train_expected_file:\n",
" key_values = line.rstrip('\\n').split(' ')\n",
" jurisdiction = None\n",
" for key_value in key_values:\n",
" key, value = key_value.split('=')\n",
" if key == 'jurisdiction':\n",
" jurisdiction = value\n",
" if jurisdiction is None:\n",
" jurisdiction = 'NONE'\n",
" dataset_expected_jurisdiction.append(jurisdiction)\n",
" return dataset_expected_jurisdiction"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"train_expected_jurisdiction = get_expected_jurisdiction(KLEISTER_PATH/'train'/'expected.tsv')"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"dev_expected_jurisdiction = get_expected_jurisdiction(KLEISTER_PATH/'dev-0'/'expected.tsv')"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"254"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(train_expected_jurisdiction)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"'NONE' in train_expected_jurisdiction"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"31"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(set(train_expected_jurisdiction))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Czy wszystkie stany musz\u0105 wyst\u0119powa\u0107 w zbiorze trenuj\u0105cym w zbiorze kleister?\n",
"\n",
"https://en.wikipedia.org/wiki/U.S._state\n",
"\n",
"### Jaki jest baseline?"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"train_counter = Counter(train_expected_jurisdiction)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('New_York', 43),\n",
" ('Delaware', 39),\n",
" ('California', 32),\n",
" ('Massachusetts', 15),\n",
" ('Texas', 13),\n",
" ('Illinois', 10),\n",
" ('Oregon', 9),\n",
" ('Florida', 9),\n",
" ('Pennsylvania', 9),\n",
" ('Missouri', 9),\n",
" ('Ohio', 8),\n",
" ('New_Jersey', 7),\n",
" ('Georgia', 6),\n",
" ('Indiana', 5),\n",
" ('Nevada', 5),\n",
" ('Colorado', 4),\n",
" ('Virginia', 4),\n",
" ('Washington', 4),\n",
" ('Michigan', 3),\n",
" ('Minnesota', 3),\n",
" ('Connecticut', 2),\n",
" ('Wisconsin', 2),\n",
" ('Maine', 2),\n",
" ('North_Carolina', 2),\n",
" ('Kansas', 2),\n",
" ('Utah', 2),\n",
" ('Iowa', 1),\n",
" ('Idaho', 1),\n",
" ('South_Dakota', 1),\n",
" ('South_Carolina', 1),\n",
" ('Rhode_Island', 1)]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_counter.most_common(100)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"most_common_answer = train_counter.most_common(100)[0][0]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'New_York'"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"most_common_answer"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"dev_predictions_jurisdiction = [most_common_answer] * len(dev_expected_jurisdiction)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"['New_York',\n",
" 'New_York',\n",
" 'Delaware',\n",
" 'Massachusetts',\n",
" 'Delaware',\n",
" 'Washington',\n",
" 'Delaware',\n",
" 'New_Jersey',\n",
" 'New_York',\n",
" 'NONE',\n",
" 'NONE',\n",
" 'Delaware',\n",
" 'Delaware',\n",
" 'Delaware',\n",
" 'New_York',\n",
" 'Massachusetts',\n",
" 'Minnesota',\n",
" 'California',\n",
" 'New_York',\n",
" 'California',\n",
" 'Iowa',\n",
" 'California',\n",
" 'Virginia',\n",
" 'North_Carolina',\n",
" 'Arizona',\n",
" 'Indiana',\n",
" 'New_Jersey',\n",
" 'California',\n",
" 'Delaware',\n",
" 'Georgia',\n",
" 'New_York',\n",
" 'New_York',\n",
" 'California',\n",
" 'Minnesota',\n",
" 'California',\n",
" 'Kentucky',\n",
" 'Minnesota',\n",
" 'Ohio',\n",
" 'Michigan',\n",
" 'California',\n",
" 'Minnesota',\n",
" 'California',\n",
" 'Delaware',\n",
" 'Illinois',\n",
" 'Minnesota',\n",
" 'Texas',\n",
" 'New_Jersey',\n",
" 'Delaware',\n",
" 'Washington',\n",
" 'NONE',\n",
" 'Delaware',\n",
" 'Oregon',\n",
" 'Delaware',\n",
" 'Delaware',\n",
" 'Delaware',\n",
" 'Massachusetts',\n",
" 'California',\n",
" 'NONE',\n",
" 'Delaware',\n",
" 'Illinois',\n",
" 'Idaho',\n",
" 'Washington',\n",
" 'New_York',\n",
" 'New_York',\n",
" 'California',\n",
" 'Utah',\n",
" 'Delaware',\n",
" 'Washington',\n",
" 'Virginia',\n",
" 'New_York',\n",
" 'New_York',\n",
" 'Illinois',\n",
" 'California',\n",
" 'Delaware',\n",
" 'NONE',\n",
" 'Texas',\n",
" 'California',\n",
" 'Washington',\n",
" 'Delaware',\n",
" 'Washington',\n",
" 'New_York',\n",
" 'Washington',\n",
" 'Illinois']"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dev_expected_jurisdiction"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"accuracy: 0.14457831325301204\n"
]
}
],
"source": [
"counter = 0 \n",
"for pred, exp in zip(dev_predictions_jurisdiction, dev_expected_jurisdiction):\n",
" if pred == exp:\n",
" counter +=1\n",
"print('accuracy: ', counter/len(dev_predictions_jurisdiction))"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.14457831325301204"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"accuracy_score(dev_predictions_jurisdiction, dev_expected_jurisdiction)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Co je\u017celi nazwy klas nie wyst\u0119puj\u0105 explicite w zbiorach?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://git.wmi.amu.edu.pl/kubapok/paranormal-or-skeptic-ISI-public\n",
" \n",
"https://git.wmi.amu.edu.pl/kubapok/sport-text-classification-ball-ISI-public"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"SPORT_PATH='/home/kuba/Syncthing/przedmioty/2020-02/ISI/zajecia6_klasyfikacja/repos/sport-text-classification-ball'\n",
"\n",
"SPORT_TRAIN=$SPORT_PATH/train/train.tsv.gz\n",
" \n",
"SPORT_DEV_EXP=$SPORT_PATH/dev-0/expected.tsv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### jaki jest baseline dla sport classification ball?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"zcat $SPORT_TRAIN | awk '{print $1}' | wc -l"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"zcat $SPORT_TRAIN | awk '{print $1}' | grep 1 | wc -l"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"cat $SPORT_DEV_EXP | wc -l\n",
"\n",
"grep 1 $SPORT_DEV_EXP | wc -l"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sprytne podej\u015bcie do klasyfikacji tekstu? Naiwny bayess"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/kuba/anaconda3/lib/python3.8/site-packages/gensim/similarities/__init__.py:15: UserWarning: The gensim.similarities.levenshtein submodule is disabled, because the optional Levenshtein package <https://pypi.org/project/python-Levenshtein/> is unavailable. Install Levenhstein (e.g. `pip install python-Levenshtein`) to suppress this warning.\n",
" warnings.warn(msg)\n"
]
}
],
"source": [
"from sklearn.datasets import fetch_20newsgroups\n",
"# https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html\n",
"\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"import numpy as np\n",
"import sklearn.metrics\n",
"import gensim"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"newsgroups = fetch_20newsgroups()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"newsgroups_text = newsgroups['data']"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"newsgroups_text_tokenized = [list(set(gensim.utils.tokenize(x, lowercase = True))) for x in newsgroups_text]"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"From: lerxst@wam.umd.edu (where's my thing)\n",
"Subject: WHAT car is this!?\n",
"Nntp-Posting-Host: rac3.wam.umd.edu\n",
"Organization: University of Maryland, College Park\n",
"Lines: 15\n",
"\n",
" I was wondering if anyone out there could enlighten me on this car I saw\n",
"the other day. It was a 2-door sports car, looked to be from the late 60s/\n",
"early 70s. It was called a Bricklin. The doors were really small. In addition,\n",
"the front bumper was separate from the rest of the body. This is \n",
"all I know. If anyone can tellme a model name, engine specs, years\n",
"of production, where this car is made, history, or whatever info you\n",
"have on this funky looking car, please e-mail.\n",
"\n",
"Thanks,\n",
"- IL\n",
" ---- brought to you by your neighborhood Lerxst ----\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
}
],
"source": [
"print(newsgroups_text[0])"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['lerxst', 'on', 'be', 'name', 'brought', 'late', 'front', 'umd', 'bumper', 'door', 'there', 'subject', 'day', 'early', 'history', 'me', 'neighborhood', 'university', 'mail', 'doors', 'by', 'funky', 'if', 'engine', 'know', 'years', 'maryland', 'your', 'rest', 'is', 'info', 'body', 'have', 'tellme', 'out', 'anyone', 'small', 'wam', 'il', 'organization', 'thanks', 'park', 'made', 'whatever', 'other', 'specs', 'wondering', 'lines', 'from', 'was', 'a', 'what', 'the', 's', 'or', 'please', 'all', 'rac', 'i', 'looked', 'really', 'edu', 'where', 'to', 'e', 'my', 'it', 'car', 'addition', 'can', 'of', 'production', 'in', 'saw', 'separate', 'you', 'thing', 'posting', 'bricklin', 'could', 'enlighten', 'nntp', 'model', 'were', 'host', 'looking', 'this', 'college', 'sports', 'called']\n"
]
}
],
"source": [
"print(newsgroups_text_tokenized[0])"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"Y = newsgroups['target']"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([7, 4, 4, ..., 3, 1, 8])"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Y"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"Y_names = newsgroups['target_names']"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['alt.atheism',\n",
" 'comp.graphics',\n",
" 'comp.os.ms-windows.misc',\n",
" 'comp.sys.ibm.pc.hardware',\n",
" 'comp.sys.mac.hardware',\n",
" 'comp.windows.x',\n",
" 'misc.forsale',\n",
" 'rec.autos',\n",
" 'rec.motorcycles',\n",
" 'rec.sport.baseball',\n",
" 'rec.sport.hockey',\n",
" 'sci.crypt',\n",
" 'sci.electronics',\n",
" 'sci.med',\n",
" 'sci.space',\n",
" 'soc.religion.christian',\n",
" 'talk.politics.guns',\n",
" 'talk.politics.mideast',\n",
" 'talk.politics.misc',\n",
" 'talk.religion.misc']"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Y_names"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'talk.politics.guns'"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Y_names[16]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$P('talk.politics.guns' | 'gun')= ?$ \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"$P(A|B) * P(A) = P(B) * P(B|A)$\n",
"\n",
"$P(A|B) = \\frac{P(B) * P(B|A)}{P(A)}$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$P('talk.politics.guns' | 'gun') * P('gun') = P('gun'|'talk.politics.guns') * P('talk.politics.guns')$\n",
"\n",
"\n",
"$P('talk.politics.guns' | 'gun') = \\frac{P('gun'|'talk.politics.guns') * P('talk.politics.guns')}{P('gun')}$\n",
"\n",
"\n",
"$p1 = P('gun'|'talk.politics.guns')$\n",
"\n",
"\n",
"$p2 = P('talk.politics.guns')$\n",
"\n",
"\n",
"$p3 = P('gun')$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## obliczanie $p1 = P('gun'|'talk.politics.guns')$"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"talk_politics_guns = [x for x,y in zip(newsgroups_text_tokenized,Y) if y == 16]"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"546"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(talk_politics_guns)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"253"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len([x for x in talk_politics_guns if 'gun' in x])"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"p1 = len([x for x in talk_politics_guns if 'gun' in x]) / len(talk_politics_guns)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.4633699633699634"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"p1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## obliczanie $p2 = P('talk.politics.guns')$\n"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"p2 = len(talk_politics_guns) / len(Y)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.048258794414000356"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"p2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## obliczanie $p3 = P('gun')$"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"p3 = len([x for x in newsgroups_text_tokenized if 'gun' in x]) / len(Y)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.03270284603146544"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"p3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ostatecznie"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.6837837837837839"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(p1 * p2) / p3"
]
},
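{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (a worked example added for illustration, using the values of $p1$, $p2$ and $p3$ printed above, rounded to five decimal places):\n",
"\n",
"$$P('talk.politics.guns' | 'gun') = \\frac{p1 * p2}{p3} \\approx \\frac{0.46337 * 0.04826}{0.03270} \\approx 0.684$$\n",
"\n",
"Note that $p1 * p2$ simplifies to the number of talk.politics.guns documents containing 'gun' divided by the total number of documents, so the result is simply the fraction of all documents containing 'gun' that belong to talk.politics.guns."
]
},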
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"def get_prob(index ):\n",
" talks_topic = [x for x,y in zip(newsgroups_text_tokenized,Y) if y == index]\n",
"\n",
" len([x for x in talks_topic if 'gun' in x])\n",
"\n",
" if len(talks_topic) == 0:\n",
" return 0.0\n",
" p1 = len([x for x in talks_topic if 'gun' in x]) / len(talks_topic)\n",
" p2 = len(talks_topic) / len(Y)\n",
" p3 = len([x for x in newsgroups_text_tokenized if 'gun' in x]) / len(Y)\n",
"\n",
" if p3 == 0:\n",
" return 0.0\n",
" else: \n",
" return (p1 * p2)/ p3\n"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.01622 \t\t alt.atheism\n",
"0.00000 \t\t comp.graphics\n",
"0.00541 \t\t comp.os.ms-windows.misc\n",
"0.01892 \t\t comp.sys.ibm.pc.hardware\n",
"0.00270 \t\t comp.sys.mac.hardware\n",
"0.00000 \t\t comp.windows.x\n",
"0.01351 \t\t misc.forsale\n",
"0.04054 \t\t rec.autos\n",
"0.01892 \t\t rec.motorcycles\n",
"0.00270 \t\t rec.sport.baseball\n",
"0.00541 \t\t rec.sport.hockey\n",
"0.03784 \t\t sci.crypt\n",
"0.02973 \t\t sci.electronics\n",
"0.00541 \t\t sci.med\n",
"0.01622 \t\t sci.space\n",
"0.00270 \t\t soc.religion.christian\n",
"0.68378 \t\t talk.politics.guns\n",
"0.04595 \t\t talk.politics.mideast\n",
"0.03784 \t\t talk.politics.misc\n",
"0.01622 \t\t talk.religion.misc\n",
"1.00000 \t\tsuma\n"
]
}
],
"source": [
"probs = []\n",
"for i in range(len(Y_names)):\n",
" probs.append(get_prob(i))\n",
" print(\"%.5f\" % get_prob(i),'\\t\\t', Y_names[i])\n",
" \n",
"print(\"%.5f\" % sum(probs), '\\t\\tsuma',)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"def get_prob2(index, word ):\n",
" talks_topic = [x for x,y in zip(newsgroups_text_tokenized,Y) if y == index]\n",
"\n",
" len([x for x in talks_topic if word in x])\n",
"\n",
" if len(talks_topic) == 0:\n",
" return 0.0\n",
" p1 = len([x for x in talks_topic if word in x]) / len(talks_topic)\n",
" p2 = len(talks_topic) / len(Y)\n",
" p3 = len([x for x in newsgroups_text_tokenized if word in x]) / len(Y)\n",
"\n",
" if p3 == 0:\n",
" return 0.0\n",
" else: \n",
" return (p1 * p2)/ p3\n"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.20874 \t\t alt.atheism\n",
"0.00850 \t\t comp.graphics\n",
"0.00364 \t\t comp.os.ms-windows.misc\n",
"0.00850 \t\t comp.sys.ibm.pc.hardware\n",
"0.00243 \t\t comp.sys.mac.hardware\n",
"0.00485 \t\t comp.windows.x\n",
"0.00607 \t\t misc.forsale\n",
"0.01092 \t\t rec.autos\n",
"0.02063 \t\t rec.motorcycles\n",
"0.01456 \t\t rec.sport.baseball\n",
"0.01092 \t\t rec.sport.hockey\n",
"0.00485 \t\t sci.crypt\n",
"0.00364 \t\t sci.electronics\n",
"0.00364 \t\t sci.med\n",
"0.01092 \t\t sci.space\n",
"0.41748 \t\t soc.religion.christian\n",
"0.03398 \t\t talk.politics.guns\n",
"0.02791 \t\t talk.politics.mideast\n",
"0.02549 \t\t talk.politics.misc\n",
"0.17233 \t\t talk.religion.misc\n",
"1.00000 \t\tsuma\n"
]
}
],
"source": [
"probs = []\n",
"for i in range(len(Y_names)):\n",
" probs.append(get_prob2(i,'god'))\n",
" print(\"%.5f\" % get_prob2(i,'god'),'\\t\\t', Y_names[i])\n",
" \n",
"print(\"%.5f\" % sum(probs), '\\t\\tsuma',)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## za\u0142o\u017cenie naiwnego bayesa"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$P(class | word1, word2, word3) = \\frac{P(word1, word2, word3|class) * P(class)}{P(word1, word2, word3)}$\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**przy za\u0142o\u017ceniu o niezale\u017cno\u015bci zmiennych losowych $word1$, $word2$, $word3$**:\n",
"\n",
"\n",
"$P(word1, word2, word3|class) = P(word1|class)* P(word2|class) * P(word3|class)$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**ostatecznie:**\n",
"\n",
"\n",
"$P(class | word1, word2, word3) = \\frac{P(word1|class)* P(word2|class) * P(word3|class) * P(class)}{\\sum_k{P(word1|class_k)* P(word2|class_k) * P(word3|class_k) * P(class_k)}}$\n"
]
},
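{
"cell_type": "markdown",
"metadata": {},
"source": [
"The denominator in the formula above can be read as the law of total probability applied to $P(word1, word2, word3)$, followed by the same independence assumption (a short derivation added for clarity, where $class_k$ ranges over all classes):\n",
"\n",
"$$P(word1, word2, word3) = \\sum_k{P(word1, word2, word3|class_k) * P(class_k)} = \\sum_k{P(word1|class_k) * P(word2|class_k) * P(word3|class_k) * P(class_k)}$$"
]
},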
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## zadania domowe naiwny bayes1 r\u0119cznie"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- analogicznie zaimplementowa\u0107 funkcj\u0119 get_prob3(index, document_tokenized), argument document_tokenized ma by\u0107 zbiorem s\u0142\u00f3w dokumentu. funkcja ma by\u0107 naiwnym klasyfikatorem bayesowskim (w przypadku wielu s\u0142\u00f3w)\n",
"- odpali\u0107 powy\u017cszy listing prawdopodobie\u0144stw z funkcj\u0105 get_prob3 dla dokument\u00f3w: {'i','love','guns'} oraz {'is','there','life','after'\n",
",'death'}\n",
"- zadanie prosz\u0119 zrobi\u0107 w jupyterze, wygenerowa\u0107 pdf (kod + wyniki odpalenia) i umie\u015bci\u0107 go jako zadanie w teams\n",
"- termin 12.05, punkt\u00f3w: 40\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## zadania domowe naiwny bayes2 gotowa biblioteka"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- wybra\u0107 jedno z poni\u017cszych repozytori\u00f3w i je sforkowa\u0107:\n",
" - https://git.wmi.amu.edu.pl/kubapok/paranormal-or-skeptic-ISI-public\n",
" - https://git.wmi.amu.edu.pl/kubapok/sport-text-classification-ball-ISI-public\n",
"- stworzy\u0107 klasyfikator bazuj\u0105cy na naiwnym bayessie (mo\u017ce by\u0107 gotowa biblioteka), mo\u017ce te\u017c korzysta\u0107 z gotowych implementacji tfidf\n",
"- stworzy\u0107 predykcje w plikach dev-0/out.tsv oraz test-A/out.tsv\n",
"- wynik accuracy sprawdzony za pomoc\u0105 narz\u0119dzia geval (patrz poprzednie zadanie) powinien wynosi\u0107 conajmniej 0.67\n",
"- prosz\u0119 umie\u015bci\u0107 predykcj\u0119 oraz skrypty generuj\u0105ce (w postaci tekstowej a nie jupyter) w repo, a w MS TEAMS umie\u015bci\u0107 link do swojego repo\n",
"termin 12.05, 40 punkt\u00f3w\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"lang": "pl",
"subtitle": "6.Klasyfikacja[\u0107wiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}