# Million news headlines
# Dataset source and thanks
Predict the date of a headline.

Start date: 2003-02-19; end date: 2019-12-31.

The dataset was taken from https://www.kaggle.com/therohk/million-headlines on 19.06.2020.
Special thanks to Rohit Kulkarni, who created it.

You can find the whole dataset (including the test set) at the link above.
The dataset at the link may have been updated since then.

Please do not incorporate any data from this Kaggle dataset (or others) into your submission to this Gonito challenge.
## Context (from https://www.kaggle.com/therohk/million-headlines )
This dataset contains news headlines published over a period of seventeen years.
They are sourced from the reputable Australian news source ABC (Australian Broadcasting Corporation).

Agency site: http://www.abc.net.au
# Challenge creation
The date of each headline is normalized to a fractional year as follows:
```
days_in_year = 366 if is_leap else 365
normalized = d.year + ((day_of_year - 1) / days_in_year)
```
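
As a runnable illustration of the formula above, here is a minimal Python sketch. The function name `normalize_date` and the use of `datetime.date` and `calendar.isleap` are assumptions made for this example; the snippet above does not show how `is_leap` and `day_of_year` were actually computed.

```python
import calendar
from datetime import date


def normalize_date(d: date) -> float:
    """Map a calendar date to a fractional year (hypothetical helper)."""
    # A leap year has 366 days, any other year has 365.
    days_in_year = 366 if calendar.isleap(d.year) else 365
    # tm_yday is 1-based: 1 January is day 1.
    day_of_year = d.timetuple().tm_yday
    # Subtract 1 so that 1 January maps exactly to the year itself.
    return d.year + (day_of_year - 1) / days_in_year
```

For example, the first date in the dataset, 2003-02-19, is day 50 of a non-leap year, so it maps to 2003 + 49/365, which is roughly 2003.134.
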
The data is split randomly into train, dev, and test sets in the proportions 80%, 10%, 10%.

Note that the data contains very similar headlines, e.g.:
20191219,charles massy and regenerative agriculture drought
20191219,charles massy merino drought sale

I made no effort to prevent one such headline from ending up in the train set and another in the test set.
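
A random 80/10/10 split like the one described above could be sketched as follows. This is only an illustration: the function name `split_rows`, the fixed seed, and the rounding of the split sizes are assumptions, not the procedure actually used to build this challenge. Like the original split, it does nothing to keep near-duplicate headlines in the same subset.

```python
import random


def split_rows(rows, seed=0):
    """Shuffle rows and cut them into 80% train, 10% dev, 10% test.

    Hypothetical helper; the seed and rounding are assumptions.
    """
    rows = list(rows)  # copy so the caller's data is not reordered
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * 8 // 10
    n_dev = len(rows) // 10
    train = rows[:n_train]
    dev = rows[n_train:n_train + n_dev]
    test = rows[n_train + n_dev:]  # whatever remains, about 10%
    return train, dev, test
```

Because rows are assigned independently of their content, two headlines about the same story on the same day can land on opposite sides of the split.
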