Ireland news headlines

Dataset source and thanks

Predict the headline category given the headline text and year. Start date: 1996-01-01. End date: 2019-12-31.

Dataset taken from https://www.kaggle.com/therohk/ireland-historical-news on 19.06.2020. Special thanks to Rohit Kulkarni who created it.

You may find the whole dataset (including the test dataset) at the link above. The dataset at the link may be updated. Please do not incorporate any data from this Kaggle dataset (or others) into your submission in this Gonito challenge.

Context (from https://www.kaggle.com/therohk/ireland-historical-news)

This news dataset is a composition of 1.48 million headlines posted by the Irish Times operating within Ireland.

Created over 160 years ago, the agency can provide a long-term bird's-eye view of the happenings in Europe.

Challenge creation

Year is normalized as follows:

'''
days_in_year = 366 if is_leap else 365
normalized = d.year + ((day_of_year - 1) / days_in_year)
'''
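The normalization formula leaves `is_leap`, `d` and `day_of_year` undefined. A minimal runnable sketch follows, assuming `d` is a `datetime.date` and that leap years follow the Gregorian rule. The function name `normalize_year` is hypothetical and not taken from the challenge scripts.

```python
import calendar
from datetime import date


def normalize_year(d: date) -> float:
    """Map a date to a fractional year (hypothetical helper name)."""
    # Assumption: leap years follow the Gregorian calendar rule.
    days_in_year = 366 if calendar.isleap(d.year) else 365
    # tm_yday is the 1-based day index within the year (1 January -> 1).
    day_of_year = d.timetuple().tm_yday
    return d.year + (day_of_year - 1) / days_in_year


# 1 January has offset 0, so it maps to the bare year.
print(normalize_year(date(1996, 1, 1)))  # 1996.0
# The last day of a year maps to a value just below the next year.
print(normalize_year(date(2019, 12, 31)))
```

With this definition every value stays inside the half-open interval from the year to the next year, so the integer part always recovers the calendar year.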

The train, dev, and test split is 80%, 10%, 10%, assigned at random.
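The exact splitting procedure is not given here. The following is only a sketch of one way to produce a random 80/10/10 split. The function name `split_dataset` and the `seed` parameter are assumptions made for illustration.

```python
import random


def split_dataset(items, seed=0):
    """Shuffle a copy of items and cut it into 80% / 10% / 10% parts.

    Hypothetical helper: the seed and the rounding rule are assumptions.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, reproducible for a given seed
    n_train = len(shuffled) * 8 // 10      # integer arithmetic avoids float rounding
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]      # the remainder goes to test
    return train, dev, test


train, dev, test = split_dataset(range(100))
print(len(train), len(dev), len(test))  # 80 10 10
```

A purely random split like this does nothing to keep near-duplicate items in the same part.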

Note that there are very similar headlines in the data.

I made no effort to prevent one such headline from ending up in the train set and a similar one in the test set.

I used only the first-level category in the classification task. E.g., the label is "world" instead of "world.us" as in the original dataset.
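As an illustration of this label reduction, here is a small sketch that assumes category levels are separated by a dot, as in the "world.us" example. The function name `first_category` is hypothetical.

```python
def first_category(label: str) -> str:
    """Return the top-level part of a dotted category label (hypothetical helper)."""
    # Assumption: levels are separated by ".", as in "world.us".
    return label.split(".", 1)[0]


print(first_category("world.us"))  # world
print(first_category("world"))     # world
```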