# Million news headlines
# Dataset source and thanks

Predict the publication date of a headline.
Start date: 2003-02-19; end date: 2019-12-31.

The dataset was taken from https://www.kaggle.com/therohk/million-headlines on 19.06.2020.
Special thanks to Rohit Kulkarni, who created it.

You may find the whole dataset (including the test set) at the link above.
The dataset at the link may be updated.
Please do not incorporate any data from this Kaggle dataset (or others) into your submission to this Gonito challenge.

## Context (from https://www.kaggle.com/therohk/million-headlines)

This dataset contains news headlines published over a period of seventeen years.

Sourced from the reputable Australian news source ABC (Australian Broadcasting Corporation).

Agency site: http://www.abc.net.au

# Challenge creation

The date is normalized to a fractional year as follows:

```
days_in_year = 366 if is_leap else 365
normalized = d.year + ((day_of_year-1) / days_in_year)
```
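
The normalization above can be written as a small self-contained function. This is only a sketch: the name `normalize_date` is hypothetical, and it assumes `d` is a `datetime.date`, `day_of_year` counts from 1 on January 1st, and the raw dates use the `YYYYMMDD` format seen in the sample rows of this README.

```python
import calendar
from datetime import date, datetime


def normalize_date(d: date) -> float:
    """Map a calendar date to a fractional year (hypothetical helper)."""
    days_in_year = 366 if calendar.isleap(d.year) else 365
    day_of_year = d.timetuple().tm_yday  # 1 for January 1st
    return d.year + (day_of_year - 1) / days_in_year


# January 1st maps exactly onto the integer year.
print(normalize_date(date(2020, 1, 1)))

# Parsing a raw date string (assumed YYYYMMDD) before normalizing it.
raw = "20191219"
print(normalize_date(datetime.strptime(raw, "%Y%m%d").date()))
```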
The data was split randomly into train, dev, and test sets in an 80%/10%/10% ratio.
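
A minimal sketch of such a random split is shown here for illustration. It is not the script used to build this challenge: the function name `split_indices` and the fixed seed are assumptions.

```python
import random


def split_indices(n, seed=0):
    """Shuffle range(n) and cut it into 80% train, 10% dev, 10% test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = n * 8 // 10
    n_dev = n // 10
    train = indices[:n_train]
    dev = indices[n_train:n_train + n_dev]
    test = indices[n_train + n_dev:]  # whatever remains, roughly 10%
    return train, dev, test


train, dev, test = split_indices(1000)
print(len(train), len(dev), len(test))
```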

Note that the data contains very similar headlines, e.g.:

20191219,charles massy and regenerative agriculture drought
20191219,charles massy merino drought sale

I made no effort to prevent one such headline from ending up in the train set and the other in the test set.
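
To get a rough idea of how much such overlap exists between two sets, one could compare headlines by token overlap. The sketch below is an assumption-laden illustration, not part of the challenge tooling: `jaccard`, `near_duplicates`, and the threshold of 0.3 are all hypothetical choices, and the brute-force comparison is only practical on small samples.

```python
def jaccard(a, b):
    """Token-set overlap of two headlines (0.0 = disjoint, 1.0 = same tokens)."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def near_duplicates(train_headlines, test_headlines, threshold=0.3):
    """Return (train, test) headline pairs whose overlap reaches the threshold."""
    pairs = []
    for t in test_headlines:
        for s in train_headlines:  # brute force: O(len(train) * len(test))
            if jaccard(s, t) >= threshold:
                pairs.append((s, t))
    return pairs


train_sample = ["charles massy and regenerative agriculture drought"]
test_sample = ["charles massy merino drought sale"]
print(near_duplicates(train_sample, test_sample))
```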