ISI-14 LogisticRegression, sklearn

ksanu 2020-04-06 17:54:13 +02:00
commit f47b11484a
13 changed files with 305343 additions and 0 deletions

10
.gitignore vendored Normal file

@@ -0,0 +1,10 @@
in.tsv
model.pkl
*~
*.swp
*.bak
*.pyc
*.o
.DS_Store
.token
.idea

13
README.md Normal file

@@ -0,0 +1,13 @@
Skeptic vs paranormal subreddits
================================

Classify a Reddit post as coming either from the Skeptic subreddit or from one of the
"paranormal" subreddits (Paranormal, UFOs, TheTruthIsHere, Ghosts,
Glitch-in-the-Matrix, conspiracytheories).

The output label is 0 (for skeptic) or 1 (for paranormal).

Sources
-------

Data taken from <https://archive.org/details/2015_reddit_comments_corpus>.

1
config.txt Normal file

@@ -0,0 +1 @@
--metric Accuracy --metric F1 --metric F0:N<Precision> --metric F9999999:N<Recall> --precision 4 --in-header in-header.tsv --out-header out-header.tsv
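The metric list above appears to rely on the F-beta family: `F0` weights recall out entirely and so reduces to precision, while a very large beta such as `9999999` is numerically indistinguishable from recall (the `N<...>` part looks like a display name). The following is a minimal sketch of that relationship, not the evaluation tool's own code; the function names `accuracy`, `precision_recall` and `f_beta`, and the sample labels, are illustrative assumptions.

```python
def accuracy(expected, predicted):
    """Fraction of positions where the predicted label equals the expected one."""
    pairs = list(zip(expected, predicted))
    return sum(1 for e, p in pairs if e == p) / len(pairs)


def precision_recall(expected, predicted, positive=1):
    """Precision and recall for the positive class (1 = paranormal here)."""
    pairs = list(zip(expected, predicted))
    tp = sum(1 for e, p in pairs if e == positive and p == positive)
    fp = sum(1 for e, p in pairs if e != positive and p == positive)
    fn = sum(1 for e, p in pairs if e == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(precision, recall, beta):
    """Weighted harmonic mean: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


if __name__ == "__main__":
    # Made-up labels, for illustration only.
    expected = [1, 0, 1, 1, 0, 0, 1]
    predicted = [1, 1, 1, 0, 0, 0, 0]
    p, r = precision_recall(expected, predicted)
    print("Accuracy", round(accuracy(expected, predicted), 4))
    print("F1", round(f_beta(p, r, 1), 4))
    print("F0 (precision)", round(f_beta(p, r, 0), 4))
    print("F9999999 (recall)", round(f_beta(p, r, 9999999), 4))
```

Rounding to four decimal places mirrors the `--precision 4` option in the config line.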

5272
dev-0/expected.tsv Normal file

File diff suppressed because it is too large

BIN
dev-0/in.tsv.xz Normal file

Binary file not shown.

5272
dev-0/out.tsv Normal file

File diff suppressed because it is too large

1
in-header.tsv Normal file

@@ -0,0 +1 @@
PostText Timestamp

1
out-header.tsv Normal file

@@ -0,0 +1 @@
Label

42
solution.py Normal file

@@ -0,0 +1,42 @@
import csv

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

count_vect = CountVectorizer()

# Load data. The in.tsv files are expected next to the committed in.tsv.xz archives.
# QUOTE_NONE keeps quote characters inside posts as literal text.
train = pd.read_csv("train/in.tsv", delimiter="\t", header=None, names=["text", "date"], quoting=csv.QUOTE_NONE)
texts = train["text"]
# Read the labels as a 1-D Series; a one-column DataFrame makes sklearn warn about the shape of y.
y = pd.read_csv("train/expected.tsv", header=None, names=["label"])["label"]

# Train: bag-of-words counts fed to logistic regression.
X_train_counts = count_vect.fit_transform(texts)
clf = LogisticRegression().fit(X_train_counts, y)
# Sanity checks: the first post, and matching numbers of texts and labels.
print(texts[0])
print(len(texts))
print(len(y))

# Predict, reusing the vocabulary fitted on the training set.
dev0 = pd.read_csv("dev-0/in.tsv", delimiter="\t", header=None, names=["text", "date"], quoting=csv.QUOTE_NONE)["text"]
testA = pd.read_csv("test-A/in.tsv", delimiter="\t", header=None, names=["text", "date"], quoting=csv.QUOTE_NONE)["text"]
dev0_new_counts = count_vect.transform(dev0)
testA_new_counts = count_vect.transform(testA)
predicted_dev0 = clf.predict(dev0_new_counts)
predicted_testA = clf.predict(testA_new_counts)
print(len(dev0))
print(len(predicted_dev0))

# Write one predicted label per line.
with open("dev-0/out.tsv", "w") as out1:
    for line in predicted_dev0:
        out1.write(str(line))
        out1.write("\n")
with open("test-A/out.tsv", "w") as out2:
    for line in predicted_testA:
        out2.write(str(line))
        out2.write("\n")
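solution.py reads plain `in.tsv` files, while the commit only contains the `in.tsv.xz` archives (and `.gitignore` excludes `in.tsv`), so the inputs presumably have to be decompressed by hand first. As a hedged alternative, the sketch below reads the text column straight from either form using only the standard library. The helper name `read_posts` is hypothetical and not part of the commit, and it assumes the post text is the first tab-separated column, as in-header.tsv suggests.

```python
import csv
import lzma


def read_posts(path):
    """Return the first column (post text) of a TSV file, plain or .xz-compressed."""
    # Pick the opener from the file extension; both accept the same text-mode arguments.
    opener = lzma.open if path.endswith(".xz") else open
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE matches solution.py: quote characters in posts stay literal.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row[0] for row in reader if row]


if __name__ == "__main__":
    # Path taken from the commit's file list; adjust as needed.
    posts = read_posts("dev-0/in.tsv.xz")
    print(len(posts))
```

The returned list can be passed to `count_vect.transform` in place of the pandas column, since `CountVectorizer` accepts any iterable of strings.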

BIN
test-A/in.tsv.xz Normal file

Binary file not shown.

5152
test-A/out.tsv Normal file

File diff suppressed because it is too large

289579
train/expected.tsv Normal file

File diff suppressed because it is too large

BIN
train/in.tsv.xz Normal file

Binary file not shown.