first commit

Rafał Sobański 2021-06-29 12:08:21 +02:00
commit 3d15bdda38
21 changed files with 326324 additions and 0 deletions

README.md Normal file

@ -0,0 +1,39 @@
Skeptic vs paranormal subreddits
================================
Classify a Reddit post as coming either from the Skeptic subreddit or from one of the
"paranormal" subreddits (Paranormal, UFOs, TheTruthIsHere, Ghosts,
Glitch-in-the-Matrix, conspiracytheories).

The output label is the probability that the post comes from a paranormal subreddit.
Sources
-------
Data taken from <https://archive.org/details/2015_reddit_comments_corpus>.
## Project goal
Predict whether a given Reddit post comes from a "skeptic" subreddit or from one of the "paranormal" subreddits.
## Data
The data come from the "Skeptic vs paranormal subreddits" challenge on the gonito.net platform (link: https://gonito.net/challenge/paranormal-or-skeptic).
The dataset is split into 289579 training examples and 5272 test examples.
## Models
The project compares three models:
* Linear regression.
* Logistic regression using the lbfgs solver.
* SGD classifier.
## Evaluation
The models were evaluated with accuracy, precision, recall and F1-score. The results are shown in the table below:

| Model | Accuracy | Precision | Recall | F1-score |
| :---: | :---: | :---: | :---: | :---: |
| Linear regression | 0.7083 | 0.6513 | 0.3783 | 0.4786 |
| Logistic regression | 0.7123 | 0.6382 | 0.4319 | 0.5152 |
| SGD classifier | 0.7191 | 0.6224 | 0.5247 | 0.5694 |
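
As a minimal sketch of how the four metrics in the table are defined, the snippet below computes them from binary labels. It assumes that label 1 (paranormal) is the positive class; the function name `binary_metrics` and the toy labels are illustrative and are not part of the project.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision asks how many of the posts predicted as paranormal really are paranormal, while recall asks how many of the paranormal posts were found.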
## Conclusions
The SGD classifier achieved the best results. Note that precision decreased as the other metrics increased: linear regression had the highest precision despite the weakest accuracy, recall and F1, while the SGD classifier had the lowest precision despite the best results on the other metrics (especially recall and F1).
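
The F1 values can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The sketch below recomputes F1 for each row of the evaluation table; the `reported` dictionary copies those figures, and the names `reported` and `f1_score` are illustrative.

```python
# Precision, recall and F1 copied from the evaluation table.
reported = {
    "Linear regression": (0.6513, 0.3783, 0.4786),
    "Logistic regression": (0.6382, 0.4319, 0.5152),
    "SGD classifier": (0.6224, 0.5247, 0.5694),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for name, (precision, recall, f1_reported) in reported.items():
    f1 = f1_score(precision, recall)
    print(f"{name}: recomputed F1 = {f1:.4f}, reported F1 = {f1_reported:.4f}")
```

The harmonic mean is pulled toward the smaller of its two inputs, and recall is the smaller one for all three models. This is why the rise in recall from 0.3783 to 0.5247 lifts F1 even though precision falls from 0.6513 to 0.6224.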

BIN
Raport.pdf Normal file

Binary file not shown.

@ -0,0 +1,5 @@
Likelihood 0.0000
Accuracy 0.7191
F1.0 0.5694
Precision 0.6224
Recall 0.5247

@ -0,0 +1 @@
--metric Likelihood --metric Accuracy --metric F1 --metric F0:N<Precision> --metric F9999999:N<Recall> --precision 4 --in-header in-header.tsv --out-header out-header.tsv
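
The `F0` and `F9999999` entries in the options above rely on the general F-beta score, which reduces to precision at beta = 0 and approaches recall as beta grows. The `N<...>` suffix appears to set the display name, which matches the `Precision` and `Recall` rows in the result files. A minimal sketch of that relationship follows. It uses the textbook F-beta formula, not geval's actual implementation, and the function name `f_beta` is illustrative.

```python
def f_beta(precision, recall, beta):
    """Textbook F-beta score: weighted harmonic mean of precision and recall."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Precision and recall of the SGD classifier, as reported in this commit.
precision, recall = 0.6224, 0.5247

print(f"{f_beta(precision, recall, 0):.4f}")        # beta = 0 gives precision
print(f"{f_beta(precision, recall, 1):.4f}")        # beta = 1 gives the usual F1
print(f"{f_beta(precision, recall, 9999999):.4f}")  # very large beta approaches recall
```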

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

Binary file not shown.

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

BIN
paranormal-or-skeptic/geval Executable file

Binary file not shown.

@ -0,0 +1 @@
PostText Timestamp

@ -0,0 +1,5 @@
Likelihood 0.0000
Accuracy 0.7083
F1.0 0.4786
Precision 0.6513
Recall 0.3783

@ -0,0 +1,5 @@
Likelihood 0.0000
Accuracy 0.7123
F1.0 0.5152
Precision 0.6382
Recall 0.4319

@ -0,0 +1,144 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 25,
"id": "e25d0d30",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.linear_model import SGDClassifier\n",
"import gensim.downloader as gensim\n",
"from nltk.tokenize import word_tokenize"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "38f915e1",
"metadata": {},
"outputs": [],
"source": [
"# quoting=3 (csv.QUOTE_NONE) keeps quote characters in the posts as plain text.\n",
"# on_bad_lines needs pandas >= 1.3; malformed rows are skipped with a warning.\n",
"x_train = pd.read_table('train/in.tsv', sep='\\t', header=None, on_bad_lines='warn', quoting=3)\n",
"y_train = pd.read_table('train/expected.tsv', sep='\\t', header=None, quoting=3)\n",
"y_train = y_train[0]\n",
"x_dev = pd.read_table('dev-0/in.tsv', sep='\\t', header=None, quoting=3)\n",
"x_test = pd.read_table('test-A/in.tsv', sep='\\t', header=None, quoting=3)\n",
"\n",
"x_train = x_train[0].str.lower()\n",
"x_dev = x_dev[0].str.lower()\n",
"x_test = x_test[0].str.lower()\n",
"\n",
"x_train = [word_tokenize(x) for x in x_train]\n",
"x_dev = [word_tokenize(x) for x in x_dev]\n",
"x_test = [word_tokenize(x) for x in x_test]\n",
"\n",
"word2vec = gensim.load('glove-wiki-gigaword-50')\n",
"\n",
"def document_vector(doc):\n",
"    # Average the GloVe vectors of the known tokens; fall back to a zero vector\n",
"    return np.mean([word2vec[word] for word in doc if word in word2vec] or [np.zeros(50)], axis=0)\n",
"\n",
"x_train = [document_vector(doc) for doc in x_train]\n",
"x_dev = [document_vector(doc) for doc in x_dev]\n",
"x_test = [document_vector(doc) for doc in x_test]"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "6cdbf2b6",
"metadata": {},
"outputs": [],
"source": [
"# Linear Regression\n",
"\n",
"model = LinearRegression()\n",
"model.fit(x_train, y_train)\n",
"\n",
"y_dev = model.predict(x_dev)\n",
"y_test = model.predict(x_test)\n",
" \n",
"Y_dev = pd.DataFrame({'label':y_dev})\n",
"Y_test = pd.DataFrame({'label':y_test})\n",
"\n",
"# Clip the regression outputs to [0, 1] so they can be read as probabilities\n",
"Y_dev['label'] = Y_dev['label'].clip(lower=0, upper=1)\n",
"Y_test['label'] = Y_test['label'].clip(lower=0, upper=1)\n",
"\n",
"Y_dev.to_csv(r'dev-0/linear_out.tsv', sep='\\t', index=False, header=False)\n",
"Y_test.to_csv(r'test-A/linear_out.tsv', sep='\\t', index=False, header=False)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "4125bba6",
"metadata": {},
"outputs": [],
"source": [
"# Logistic Regression\n",
"\n",
"model = LogisticRegression(solver='lbfgs', max_iter=100000)\n",
"model.fit(x_train, y_train)\n",
"\n",
"y_dev = model.predict(x_dev)\n",
"y_test = model.predict(x_test)\n",
" \n",
"Y_dev = pd.DataFrame({'label':y_dev})\n",
"Y_test = pd.DataFrame({'label':y_test})\n",
"\n",
"Y_dev.to_csv(r'dev-0/logistic_out.tsv', sep='\\t', index=False, header=False)\n",
"Y_test.to_csv(r'test-A/logistic_out.tsv', sep='\\t', index=False, header=False)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "c515393b",
"metadata": {},
"outputs": [],
"source": [
"# SGDClassifier\n",
"\n",
"model = SGDClassifier(max_iter=100000)\n",
"model.fit(x_train, y_train)\n",
"\n",
"y_dev = model.predict(x_dev)\n",
"y_test = model.predict(x_test)\n",
" \n",
"Y_dev = pd.DataFrame({'label':y_dev})\n",
"Y_test = pd.DataFrame({'label':y_test})\n",
"\n",
"Y_dev.to_csv(r'dev-0/SGD_out.tsv', sep='\\t', index=False, header=False)\n",
"Y_test.to_csv(r'test-A/SGD_out.tsv', sep='\\t', index=False, header=False)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -0,0 +1 @@
Label

File diff suppressed because it is too large Load Diff

Binary file not shown.

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

Binary file not shown.