linear model

s444417 committed 2022-05-17 20:16:24 +02:00
commit ca70a52e9f
16 changed files with 123962 additions and 0 deletions

142
README.md Normal file

@@ -0,0 +1,142 @@
RetroC2 temporal classification challenge
=========================================
Guess the publication year of a Polish text.
This is the second (larger and improved) edition of the challenge, see
[http://gonito.net/challenge/retroc](http://gonito.net/challenge/retroc) for the first edition.
Example
-------
For instance, you are expected to guess the publication year of this 500-word
text:
> Gazet, a tam o osobie zamformuie się. Uwiadomienie. Stosownie do
> dodatku gazety W. Xiestwa Poznańskiego Nru 74. Ig15. niźey podpisany
> odwoluiąc się, w którey wszelkie pełnomocnictwa komukolwiek priez
> niego dane, od daty teyźe gazety za nieważne mieć chce, dziś więc
> potwierdza to, kassuiac i umarzaiąc pełnomocnictw» Ur. Podgurskiemu
> przez siebie uczyni o n e, w P o z n a n i u dnia 3. Mala 1816 r.
> Psirohońshi. Odmienienie mieszkania. Donoszę Szanowney Publiczności,
> iż mieszkanie moie z Dominikjńskiey ulicy przeniosłem na Szeroką
> ulicę do JP. Fi asa pod Nr 114 na pierwszym piętrze, i handel zboża
> nadal prowadzić będę. Poznań dnia 6. Maia 1816. Meyer Marcuse. III
> 111---»-- Do przedania. Kamienica w rynku podNrcra 62, o trzech
> piętrach, wraz z zabudowaniami, w bardzo dobrym znayduiąca się
> stanie, do szynku i przyimowania gości urządzona, iest zwolney ręki
> do przedania. Dokładnieyszą wiadomość powziąść można u właściciela.
> Do przedania. Dom za Świętym Marcinem pod N rem 42. z browarnią,
> staynią, studnią i wielkićm podwórzem, niemniey kilkanaście szachtów
> kamieni, iest na dniu 24m Czerwca r. b. z wolney .ręki do
> sprzedania. Każdy ochotę mający kupna, o kondycyach sprzedaży
> dowiedzieć się mole tu w Voznaniu w rynku pod N rem 57. u S tanisław
> a PoweIskiego. Do przedania. Na mocy w Prześwietnym Sądzie Pokoiti
> Powiatu tuteyszego pomiędzy Szl. Henrykiem Eichbaum, właścicielem
> młyna papierni w MuchodzU 5 A 7 II nie, Powiatu Międzyrzeckiego, a
> Szl. Wilhelmem Ferdynandem Naukę, Kredy torem pryncypalnym z młyna
> wodnego w Muchodzinie, na dniu 29m miesiąca Marca roku bieżąesgo
> itawartey i w ley n.ierze do podpisanego uczynionego wniosku,
> zesunie młyn papiernia, wraz do tego należącemi gruntami, w wsi
> Muchodzinie w Powiecie Międzyrzeckim leżąca, według urzedowey na
> dniu I I Kwietnia roku bieżącego zdziałaney taxy, na summe. 2246
> Tal. 12 dgr, oszacowana, w drodze lieytacyi public-zney więcey
> daiącemu za gotowa Ziraz zjpłatę, i wypełnieniem kondycyi kupna,
> sprzedana; do którey to sprzedaży termin pierwszy do publikacyi
> kondycyi kupna 1 przedsunowczego przysądzenia, na żądanie
> Iineressentow, na dzieli 12. miesiąca Czerwca roku bieżącego
> w.kascelląryi Urzędnika podpisanego o godzinie iotey przed południem
> wyznaczonym zostaie.- Wzywa się więc ninśeyszem Publiczność kupna
> tego ochotę maiącą, oraz wszelcy Kredyiorowie e x q u o c u n q u e
> jur e d o młyna tego papierni twierdzić prawa sobie mogący, aby w
> terminie wzwyż wyrażonym osobiście lub przez prawnie umocowanych
> Pełnomocników stawilisię; pierwsi swe licyta, drudzy zaś swe realne
> pretensye do protokółu podali, a nay więcey licytuiącemu nie«
> ruchomości powyż wymienioney zprzyległościami przygotowawcze
> przysądzenie nastąpi. Kredytprowie zaś "Z swerni pretensyami do
> nieruchomości tey za prekludowanych, a to sub prejudicio perpetui
> silentii uważani zostaną. Zbiór obiaśnień i kondycyi kupna przeyrzeć
> każdy interessuiący może u podpisanego. Międzyrzecz dnia 20.
> Kwietnia i816\\ Ur ząd P isars t wa Aktowego Powiatu
> Międzyrzeckiego. M. GądkowskL Do przedania. Podaie się do publiczney
> wiadomości, iż podpisany Komornik Sądowy Powiatu Krobskiego,
> zatradowane inwentarze, to iest: konie, woły, krowy, owce i t. d. i
> porządki gospodarskie, Wmu Kamieńskiemu, Possessocpwi dóbr
> Sobiałkowskich, za kaucyą na zabezpieczenie inwentarzy gruntowych,
> do massy konkursowey JOO.XiazatSujftoivsftjcji należących, w wsi
> Sobiałkowie pod Rawiczem
(Yes, there might be a lot of OCR noise there!)
The perfect answer for this text is 1816.37021856342 (a year with a
fraction representing a specific day, May 15th, 1816 in this
example). You may return non-integer numbers in general: for instance,
if you are sure that the text was published in 1977 but have no
idea on which day, the optimal decision is to return 1977.5.
The metric is root mean squared error (RMSE), i.e. the square root of the
mean squared difference between predicted and expected values.
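For reference, a fractional year of this kind can be computed as in the
following minimal sketch (assuming a simple day-of-year convention; it
reproduces the value above to within rounding, though the organisers' exact
convention may differ slightly):

```python
from datetime import date

def year_fraction(d: date) -> float:
    # year plus the fraction of the year elapsed at the midpoint of the given day
    start = date(d.year, 1, 1)
    days_in_year = (date(d.year + 1, 1, 1) - start).days  # 365 or 366
    return d.year + ((d - start).days + 0.5) / days_in_year

print(year_fraction(date(1816, 5, 15)))  # ~1816.370219, close to the value above
```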
Directory structure
-------------------
* `README.md` — this file
* `config.txt` — GEval configuration file
* `train/` — directory with training data
* `train/train.tsv.xz` — train set (compressed with xz, not gzip!)
* `train/meta.tsv.xz` — metadata (do **not** use in training)
* `dev-0/` — directory with dev (test) data from the same sources as the train set
* `dev-0/in.tsv` — input text for the dev set
* `dev-0/expected.tsv` — expected data for the dev set (publication years)
* `dev-0/meta.tsv.xz` — metadata (do **not** use while testing)
* `dev-1/` — directory with dev (test) data from different source than the train set
* `dev-1/in.tsv` — input text for the dev set
* `dev-1/expected.tsv` — expected data for the dev set (publication years)
* `dev-1/meta.tsv.xz` — metadata (do **not** use while testing)
* `test-A/` — directory with test data
* `test-A/in.tsv` — input text for the test set
* `test-A/expected.tsv` — expected data for the test set (hidden)
* `test-A/meta.tsv.xz` — hidden metadata
Structure of data sets
----------------------
The dev and test sets are balanced for years (or at least an attempt was
made to balance them; for some years there was not enough material).
The `dev-0` dataset was created using the same sources as the train set, while
`dev-1` and `test-A` were generated from sources different from those of
`dev-0` (and from each other), so `dev-0` is likely to be
easier than `dev-1`.
Metadata files are given for reference; do not use them for training.
Format of the train set
-----------------------
The format of the train set is different from that of the test sets. It
contains more information, which you are free to exploit.
The TAB-separated columns are as follows (a loading sketch is given after the list):
* beginning of the period in which a text is known to have been published, given as a year
  with a possible fraction (note that various time granularities occur in this
  data set: daily, monthly, yearly, etc.),
* end of the period in which a text is known to have been published,
* normalised title,
* symbol of the source (usually a Polish digital library),
* ~500-word-long text snippet.
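The train set can be loaded directly with pandas, which decompresses `.xz`
files transparently. A minimal sketch (the file has no header row, and the
column names below are made up for illustration):

```python
import pandas as pd

train = pd.read_csv(
    "train/train.tsv.xz", sep="\t", header=None,
    names=["year_from", "year_to", "title", "source", "text"],  # illustrative names
)
# a single regression target can be taken as the midpoint of the known period
train["year"] = (train["year_from"] + train["year_to"]) / 2
```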
Format of the test sets
-----------------------
The input file is just a list of ~500-word-long text snippets, each
given on a separate line.
The `expected.tsv` file is a list of publication years (with fractions).
Format of the output files
--------------------------
For each input line, the publication year should be given (in the same
format as in the `expected.tsv` files). The output files should be named `out.tsv`.
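For example, predictions can be written as follows (a minimal sketch; `pred`
stands for any sequence of predicted fractional years):

```python
with open("dev-0/out.tsv", "w") as f:
    for y in pred:          # `pred` is a hypothetical sequence of predictions
        f.write(f"{y}\n")   # one prediction per line
```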

1
config.txt Normal file

@@ -0,0 +1 @@
--metric RMSE --precision 4
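
With this configuration, the dev sets can be scored locally with the GEval tool, e.g. (a sketch, assuming GEval is installed and `-t` selects the test directory):

```
geval -t dev-0
```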

20000
dev-0/expected.tsv Normal file

File diff suppressed because it is too large

20000
dev-0/in.tsv Normal file

File diff suppressed because one or more lines are too long

BIN
dev-0/meta.tsv.xz Normal file

Binary file not shown.

19998
dev-0/out.tsv Normal file

File diff suppressed because it is too large

11563
dev-1/expected.tsv Normal file

File diff suppressed because it is too large

11563
dev-1/in.tsv Normal file

File diff suppressed because one or more lines are too long

BIN
dev-1/meta.tsv.xz Normal file

Binary file not shown.

11562
dev-1/out.tsv Normal file

File diff suppressed because it is too large

570
run.ipynb Normal file

@@ -0,0 +1,570 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [],
"source": [
"import lzma\n",
"import sys\n",
"from io import StringIO\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"import pandas as pd\n",
"import numpy\n",
"\n",
"pathX = \"./train/train.tsv.xz\"\n",
"# pathX = \"./train/in.tsv\"\n",
"# pathY = \"./train/meta.tsv.xz\"\n",
"nrows = 100000"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [],
"source": [
"# data = lzma.open(pathX, mode='rt', encoding='utf-8').read()\n",
"# stringIO = StringIO(data)\n",
"# df = pd.read_csv(stringIO, sep=\"\\t\", header=None)\n",
"df = pd.read_csv(pathX, sep='\\t', nrows=nrows, header=None)\n",
"# df = df.drop(df.columns, axis=1)\n",
"# topics = pd.read_csv(pathY, sep='\\t', nrows=nrows, header=None)"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"100000\n"
]
}
],
"source": [
"print(len(df.index))\n",
"\n",
"# print(len(topics.index))\n"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
"def mergeTexts(a, b, c):\n",
" return str(a) + \" \" + str(b) + \" \" + str(c)"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [],
"source": [
"def getMean(a, b):\n",
" return ((a + b)/2)"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [],
"source": [
"df[\"year\"] = df.apply(lambda x: getMean(x[0], x[1]), axis = 1)\n",
"df[\"text\"] = df.apply(lambda x: x[4], axis = 1)"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>text</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>63552</th>\n",
" <td>2013.212329</td>\n",
" <td>dnia 10 października 2012r., znak (...)/, Zakł...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>89500</th>\n",
" <td>2013.656164</td>\n",
" <td>postępowania, skarżąca wniosła, jak na wstępie...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>94039</th>\n",
" <td>1925.015068</td>\n",
" <td>dzieją się ciągle awantury, nieustają podejrze...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>62566</th>\n",
" <td>2012.348361</td>\n",
" <td>Samodzielnego Publicznego Zespołu Opieki Zdrow...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>87553</th>\n",
" <td>1975.494521</td>\n",
" <td>doprowadzających przeładowywane produkty od st...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" year text\n",
"63552 2013.212329 dnia 10 października 2012r., znak (...)/, Zakł...\n",
"89500 2013.656164 postępowania, skarżąca wniosła, jak na wstępie...\n",
"94039 1925.015068 dzieją się ciągle awantury, nieustają podejrze...\n",
"62566 2012.348361 Samodzielnego Publicznego Zespołu Opieki Zdrow...\n",
"87553 1975.494521 doprowadzających przeładowywane produkty od st..."
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = df.drop(columns = [0,1,2,3,4], axis=1)\n",
"df.sample(5)"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>85466</th>\n",
" <td>pokoJenie dOInu Burbonów, zaprzestalo we Frano...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>416</th>\n",
" <td>non. Jakiekolwiek próby odbudowy takich instyt...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36354</th>\n",
" <td>000 ludzi, a który był iście tryumf.ilnym. trw...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>95566</th>\n",
" <td>do robienia lodów. Ogrodzenia do kląbów w wiel...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>58632</th>\n",
" <td>ftcitnciu&gt;a4ii. Dzień już dal nogę i mrugali g...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" text\n",
"85466 pokoJenie dOInu Burbonów, zaprzestalo we Frano...\n",
"416 non. Jakiekolwiek próby odbudowy takich instyt...\n",
"36354 000 ludzi, a który był iście tryumf.ilnym. trw...\n",
"95566 do robienia lodów. Ogrodzenia do kląbów w wiel...\n",
"58632 ftcitnciu>a4ii. Dzień już dal nogę i mrugali g..."
]
},
"execution_count": 91,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"topics = df.pop('year')\n",
"df.sample(5)"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"22843 1991.500000\n",
"12830 1937.157534\n",
"63119 1919.500000\n",
"77638 2010.130137\n",
"5577 1934.768493\n",
"Name: year, dtype: float64"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"topics.sample(5)"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['00', '000', '0000', ..., 'תורהט', 'תותיירב', 'תשדוקמ'],\n",
" dtype=object)"
]
},
"execution_count": 93,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectorizer = TfidfVectorizer(lowercase=True, stop_words=['polish'])\n",
"X = vectorizer.fit_transform(df.to_numpy().ravel())\n",
"vectorizer.get_feature_names_out()\n"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [],
"source": [
"# vectorizer.transform(\"Ala ma kotka\".lower().split())"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [],
"source": [
"# df = df.reset_index()"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [],
"source": [
"tfidfVector = vectorizer.transform(df[\"text\"])\n",
"\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [],
"source": [
"# from sklearn.model_selection import train_test_split\n",
"# from sklearn.naive_bayes import GaussianNB\n",
"# \n",
"# gnb = GaussianNB()\n",
"# gnb.fit(tfidfVector.todense(), topics)"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"\n",
"reg = LinearRegression().fit(tfidfVector, topics)\n"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [],
"source": [
"testXPath = \"./dev-0/in.tsv\"\n",
"testYPath = \"./dev-0/expected.tsv\"\n",
"\n",
"testX = pd.read_csv(testXPath, sep='\\t', nrows=19998, header=None)\n",
"\n",
"testY = pd.read_csv(testYPath, sep='\\t', nrows=19998, header=None)\n"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>9194</th>\n",
" <td>że w moich oczach umizgasz się do niej. Rzeczy...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3959</th>\n",
" <td>reką dopoty, dopóki nic wyiaanię !Prawiü Bye m...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19210</th>\n",
" <td>końcach klapy wentylacyjnej i zapobiegają odch...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6080</th>\n",
" <td>lat cię- owe g\\\\\\\\ allowne walki w dziennikars...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18845</th>\n",
" <td>elektr,yczne] ny, poszanowania ładu i po- roc1...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0\n",
"9194 że w moich oczach umizgasz się do niej. Rzeczy...\n",
"3959 reką dopoty, dopóki nic wyiaanię !Prawiü Bye m...\n",
"19210 końcach klapy wentylacyjnej i zapobiegają odch...\n",
"6080 lat cię- owe g\\\\\\\\ allowne walki w dziennikars...\n",
"18845 elektr,yczne] ny, poszanowania ładu i po- roc1..."
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"testX.sample(5)"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>3849</th>\n",
" <td>1956.476776</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0\n",
"3849 1956.476776"
]
},
"execution_count": 108,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"testY.sample()\n"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [],
"source": [
"testXtfidfVector = vectorizer.transform(testX[0])\n"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-0.3101240322770993"
]
},
"execution_count": 110,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reg.score(testXtfidfVector, testY[0])\n"
]
},
{
"cell_type": "code",
"execution_count": 115,
"metadata": {},
"outputs": [],
"source": [
"testXPath = \"./dev-1/in.tsv\"\n",
"testYPath = \"./dev-1/out.tsv\"\n",
"\n",
"testX = pd.read_csv(testXPath, sep='\\t', nrows=nrows, header=None)\n",
"\n",
"# testY = pd.read_csv(testYPath, sep='\\t', nrows=nrows, header=None)\n",
"testXtfidfVector = vectorizer.transform(testX[0])\n"
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1967.93839413 1941.34584207 1967.15515902 ... 1935.84850316 1940.56249116\n",
" 1958.40938993]\n"
]
}
],
"source": [
"pred = reg.predict(testXtfidfVector)\n",
"print(pred)\n",
"\n",
"import csv\n",
"with open(testYPath, 'w', newline='') as f_output:\n",
" tsv_output = csv.writer(f_output, delimiter='\\n')\n",
" tsv_output.writerow(pred)"
]
}
],
"metadata": {
"interpreter": {
"hash": "369f2c481f4da34e4445cda3fffd2e751bd1c4d706f27375911949ba6bb62e1c"
},
"kernelspec": {
"display_name": "Python 3.10.4 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}

124
run.py Normal file

@@ -0,0 +1,124 @@
# %%
import lzma
import sys
from io import StringIO
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy
pathX = "./train/train.tsv.xz"
# pathX = "./train/in.tsv"
# pathY = "./train/meta.tsv.xz"
nrows = 100000
# %%
# data = lzma.open(pathX, mode='rt', encoding='utf-8').read()
# stringIO = StringIO(data)
# df = pd.read_csv(stringIO, sep="\t", header=None)
df = pd.read_csv(pathX, sep='\t', nrows=nrows, header=None)
# df = df.drop(df.columns, axis=1)
# topics = pd.read_csv(pathY, sep='\t', nrows=nrows, header=None)
# %%
print(len(df.index))
# print(len(topics.index))
# %%
def mergeTexts(a, b, c):
    return str(a) + " " + str(b) + " " + str(c)
# %%
def getMean(a, b):
    return (a + b) / 2
# %%
df["year"] = df.apply(lambda x: getMean(x[0], x[1]), axis = 1)
df["text"] = df.apply(lambda x: x[4], axis = 1)
# %%
df = df.drop(columns = [0,1,2,3,4], axis=1)
df.sample(5)
# %%
topics = df.pop('year')
df.sample(5)
# %%
topics.sample(5)
# %%
# NOTE: stop_words expects a list of actual stop words; ['polish'] removes only
# the literal token 'polish' (scikit-learn has no built-in Polish stop-word list)
vectorizer = TfidfVectorizer(lowercase=True, stop_words=['polish'])
X = vectorizer.fit_transform(df.to_numpy().ravel())
vectorizer.get_feature_names_out()
# %%
# vectorizer.transform("Ala ma kotka".lower().split())
# %%
# df = df.reset_index()
# %%
tfidfVector = vectorizer.transform(df["text"])
# %%
# from sklearn.model_selection import train_test_split
# from sklearn.naive_bayes import GaussianNB
#
# gnb = GaussianNB()
# gnb.fit(tfidfVector.todense(), topics)
# %%
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(tfidfVector, topics)
# %%
testXPath = "./dev-0/in.tsv"
testYPath = "./dev-0/expected.tsv"
# NOTE: dev-0 has 20000 lines; nrows=19998 silently drops the last two examples
testX = pd.read_csv(testXPath, sep='\t', nrows=19998, header=None)
testY = pd.read_csv(testYPath, sep='\t', nrows=19998, header=None)
# %%
testX.sample(5)
# %%
testY.sample()
# %%
testXtfidfVector = vectorizer.transform(testX[0])
# %%
# score() returns R^2, not the challenge's RMSE; a negative value means worse than predicting the mean
reg.score(testXtfidfVector, testY[0])
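# %%
# The challenge metric is RMSE, so it is worth computing it directly as well
# (a minimal sketch; the R^2 from score() above is not the challenge metric)
from sklearn.metrics import mean_squared_error
rmse = numpy.sqrt(mean_squared_error(testY[0], reg.predict(testXtfidfVector)))
print(rmse)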
# %%
testXPath = "./dev-1/in.tsv"
testYPath = "./dev-1/out.tsv"
testX = pd.read_csv(testXPath, sep='\t', nrows=nrows, header=None)
# testY = pd.read_csv(testYPath, sep='\t', nrows=nrows, header=None)
testXtfidfVector = vectorizer.transform(testX[0])
# %%
pred = reg.predict(testXtfidfVector)
print(pred)
import csv
# delimiter='\n' makes writerow put one prediction per line
with open(testYPath, 'w', newline='') as f_output:
    tsv_output = csv.writer(f_output, delimiter='\n')
    tsv_output.writerow(pred)
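# %%
# The commit also contains test-A/out.tsv; a sketch of how it could be produced
# (assumes test-A/in.tsv has the same one-snippet-per-line format as the dev sets)
testA = pd.read_csv("./test-A/in.tsv", sep='\t', nrows=nrows, header=None)
predA = reg.predict(vectorizer.transform(testA[0]))
with open("./test-A/out.tsv", 'w', newline='') as f_output:
    csv.writer(f_output, delimiter='\n').writerow(predA)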

14220
test-A/in.tsv Normal file

File diff suppressed because one or more lines are too long

14219
test-A/out.tsv Normal file

File diff suppressed because it is too large

BIN
train/meta.tsv.xz Normal file

Binary file not shown.

BIN
train/train.tsv.xz Normal file

Binary file not shown.