Guess a word in a gap in historic texts
=======================================

Give a probability distribution for a word in a gap in a corpus of
Polish historic texts spanning 1814-2013. This is a challenge for
(temporal) language models.

The metric is log-loss calculated on 10-bit fingerprints generated
with the MurmurHash3 hash function.
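
To make the fingerprint idea concrete, here is a minimal sketch of mapping a word to one of 1024 buckets. It is not GEval's code: `murmur3_32` is a pure-Python version of the 32-bit x86 variant of MurmurHash3, and the seed (0), the UTF-8 encoding, and the choice of the low 10 bits are all assumptions, since this README does not specify them.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3, x86 32-bit variant, returned as an unsigned integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    mask = 0xFFFFFFFF
    h = seed & mask
    n = len(data)
    body_end = n & ~3  # length rounded down to a multiple of 4

    def scramble(k: int) -> int:
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask  # rotate left by 15
        return (k * c2) & mask

    # Process the body in 4-byte little-endian blocks.
    for i in range(0, body_end, 4):
        h ^= scramble(int.from_bytes(data[i:i + 4], "little"))
        h = ((h << 13) | (h >> 19)) & mask  # rotate left by 13
        h = (h * 5 + 0xE6546B64) & mask

    # Process the remaining 1-3 bytes, if any.
    tail = data[body_end:]
    if tail:
        h ^= scramble(int.from_bytes(tail, "little"))

    # Final avalanche mix.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


def fingerprint(word: str, bits: int = 10, seed: int = 0) -> int:
    """Assumed scheme: keep the low `bits` bits of the hash of the UTF-8 word."""
    return murmur3_32(word.encode("utf-8"), seed) & ((1 << bits) - 1)
```

With 10 bits there are only 1024 possible fingerprints, so distinct words can collide; under this scheme a prediction is scored by the probability mass landing on the fingerprint of the expected word.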

Example
-------

For instance, you are expected to guess the word in the
gap marked with **???????** here:

> od każde ; aztuki blldła i trzod1l
> chlewnej , oczne opłatll , pr.re1DlIź ' zo jqce dóchody z chowu t '
> Uch z1Diet ' zt1t . Nic wtęc dzfWM { 10 , że to betonowe .btarniki
> na obOrnik i gtlo ; ówkę wtlposożone jest każde holendenkie go- '
> POdant1Do ' OIM i le holender , ka wid fl4ldll dzil to Iwiecie do
> fI4 ; bard % iej CZ ) llto i e- ' tetuczni.e ut , ztłmGnt / ch . 1V
> ft4 , zvm wojew6clzt1Die , mestett / , zbiornikami fUl / 7łłOjówkę
> dt / sponuje 2clled1DiC ! 15-20 proc. ogółu f1O , podorlltw chlop ,
> kich , pozo .. talllch odchodU zwierzęce ob6r i chlew ni w znacznej
> częlci spluwajq do ' tawów , jezior , rzek , strumieni , bqdź wodami
> powle , zchniowlft ' ni 1Dproit do , tud ni , z których pijq wodę
> rolnict / i ich , odzint / . Wiei zatruwa lię IC1ma i truje
> lrod01Disko 1D ktÓł ' 1 / m wllpólnu t1 / jemt / . CZ1l nu
> fl4leźalobt / więc ' ięgnq do dolwiadczeń HoIQndił1 Nie pro
> pCntujem1 / konfukatt / QO ' POd4 " t1D lecz kar1 / 1D1Imłerzo tte
> niedbaluchom bcI , dzo bU .tę przt / daltl . Koszt zbi01 ' nika na
> gnojówkę nie je , t wtlloki i każd1 / , ol nik moźe 110 zbudowa wlc1
> ! nt / mi , iłami . Karzu : łć będzie miar podwóJnq : c ' l / .tq
> wodę z włalnej , tudni i Ipore i ! olci gnojówki cennego na- 1DOZU ,
> 1D1 / wożonego w beczkowozach i razlewaneQo lICI pola. sztucznych
> dla rolnictwa zwięk s się o 30 proc. a produkcja środków ochrony
> roślin trzykrotnie . W przC < ' hodżeniu na przemy słowe metody w
> produkcji roś ! innej I zWierzęcej rolnictwo NRD posiada już pewne
> doświadczenia , kt6re dostarezają np. ferma opasowa b ) ' dła w
> Ferdlnandshof , posiadająca 21 tys. stanowisk , kombinat tuczu w
> Eberswalde o ro < , znej produkeji ] 0 tys. ton mi sa , tuczarnia
> trzody chlewnej W ' 08 podarstwie PDftstwowym DI @ bensee z 12 tvs.
> st , nowi ! ; k sze r. wle ] kirh ferm kurt " ch , gos podantw m ] e
> ( ' znych itp . Prolet intensyfikacji I prze ksztakania produkcji **???????**
> jest wszechstronny i obejmuje wszystkie dziedziny : ! ycia wsi ,
> jednym z podstawowych jed ak warunków Jest wkład przem ) ' słu w
> postaci dopływu przemysłowych środków produkcji . Obrazują to nast
> pują ce dane : wartość poc1st3wowyeh środk6w produkcji na jedn o
> członka lpółdzlelnl produk < ' yjnej Wynosiła dotych < , za s '
> rednio 38 tys. marek , pod < , zas gdy w centrum agrochemicznym 150
> do 200 tys. marek , a w zakładach tuczu przemysłowelto , zaletnie od
> ro dzaju zwier t 300-800 tys. marek . Traktory w kooperacji W :
> r.ruł. WD-ł ' m wojewOdztWi1e licz ! > . rolników , któr ' y łl ! c
> ąli ai4 : w ItUłi.uOIIobOwe zesp < > I ) ' , Wypo ; iyozaJI ! z t >
> .maszynowych ItCXeoll : n ) I.n1 , c ' ycł1 ClilanJJU lub cale
> zeIIiawy S ( u ' Z4 : u . W ub. rOku \ \ all ; llcłl & rup
> kooperuJllcyc.h. w zalulaie IneOhanil.ZlicJI było zaJedWte Iu.qll
> .iaJ.e. ol > e < : n : e je . Ich vo .. ad 1110 I & ku.plaJą one
> aokołtl $ lIję t ł = . ) I ! -ł-ko -ł \ \ l , CbowAldnl , k
> l1m.kim i .w , i < lwlnaklm . Zalltą t J torń : ly ut inla zes w6w m
> ...... ' > owych J & t prZ8 < le wlzYs " iUnl 0 .. - < 1.r. ; ej " '
> ' ' ' ' I Inlczna ebpJ.oa \ \ aCJe clllll.llikuw , & .owi .. kaidJ "
> nlch pracuj « w DUl.u ioapociar atwach . Rolnic ' u.sta : 1aJI ! JrI
> , Ięuty bą koleJoojó pracy m.nyn. .1 ' 0 r.lalnym tpj , ac . , nlu
> .1- IUJortYZóicJl , obLi < .ozoneJ od warlo- > ci Sprzętu w momencie
> WypO yczanoi . , il ' Upa utytłGoWnl- ! t < > w ataJe a1ę
> w.dclclelem ma- .zyn. & zy lDu .. yn < > wI kółek ruwn : et zyskuJ "
> na tej transakcji , & .owiem nie mUA , tl I.rOtlZcitYć o obelUłi4 :
> wypożyczonych clllil1ikuw oru o k  serwacJę I ! ' onty m ... y "
> , Drui4 , r6wnie-l cennll ! oc w " Pol ! , . aeT kóh , j ( z ' listą
> . " za ... nlle mechN1izac ; 1 II ! wypqj , , v " zab.i. IprZlltu
> w.warzyazllcęlC do C : IIIIUk6w , s \ \ .anowiącyClb WIUr : OK
> roJników . Takie wypntycza1n1e otwarto w w1ęJ \ \ ! o & O 601
> .mbeemów z ich " .IUC lI & orzYltalo np. w 1 & 1 » . foKU okolo U.
> roJnikÓW. ct )

(Yes, there is a lot of OCR noise here!)

The period in which this text was published is also given
(1973.1397-1973.1424; dates are expressed as years with an optional fractional part).

(And the correct expected word here is *rolnej*.)
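
The fractional-year dates used above can be converted to and from calendar dates roughly as follows. This is a sketch under an assumption: the fraction is taken to be the number of elapsed days divided by the number of days in that year. The README does not state the exact convention or rounding, so recovered dates should be treated as approximate.

```python
from datetime import date, timedelta


def days_in_year(year: int) -> int:
    """Return 365 or 366 depending on whether `year` is a leap year."""
    return (date(year + 1, 1, 1) - date(year, 1, 1)).days


def date_to_fractional_year(d: date) -> float:
    """Assumed convention: year + (days elapsed since 1 January) / (days in year)."""
    elapsed = (d - date(d.year, 1, 1)).days
    return d.year + elapsed / days_in_year(d.year)


def fractional_year_to_date(value: float) -> date:
    """Approximate inverse of date_to_fractional_year (rounds down to a whole day)."""
    year = int(value)
    offset = int((value - year) * days_in_year(year))
    return date(year, 1, 1) + timedelta(days=offset)
```

Under this assumption the example period 1973.1397-1973.1424 falls in late February 1973 and is about one day wide, which would correspond to a daily newspaper issue.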

Directory structure
-------------------

* `README.md` — this file
* `config.txt` — GEval configuration file
* `train/` — directory with training data
* `train/train.tsv.xz` — train set (compressed with xz, not gzip!)
* `train/meta.tsv.xz` — metadata (do **not** use in training)
* `dev-0/` — directory with dev (test) data from the same sources as the train set
* `dev-0/in.tsv` — input data for the dev set
* `dev-0/expected.tsv` — expected (reference) data for the dev set
* `dev-0/meta.tsv.xz` — metadata (do **not** use while testing)
* `dev-1/` — directory with dev (test) data from a different source than the train set
* `dev-1/in.tsv` — input data for the dev set
* `dev-1/expected.tsv` — expected (reference) data for the dev set
* `dev-1/meta.tsv.xz` — metadata (do **not** use while testing)
* `test-A/` — directory with test data
* `test-A/in.tsv` — input data for the test set
* `test-A/expected.tsv` — expected (reference) data for the test set (hidden)
* `test-A/meta.tsv.xz` — hidden metadata

Structure of data sets
----------------------

The dev and test sets are balanced across years (or at least balancing
was attempted; for some years there was not enough material).

The `dev-0` dataset was created using the same sources as the train set, but
`dev-1` and `test-A` were generated using sources different from
`dev-0` (and different from each other), so `dev-0` is likely to be
easier than `dev-1`.

Metadata files are given for reference only; do not use them for training.

Format of the train set
-----------------------

TAB-separated columns:

* beginning of the period in which a text is known to have been published, given as a year
  with a possible fraction (note that various time granularities occur in this
  data set: daily, monthly, yearly, etc.),
* end of the period in which a text is known to have been published,
* normalised title,
* symbol of the source (usually a Polish digital library),
* text snippet of about 500 words.

Texts (in the train set as well as in the test sets) were tokenised
with the `tokenizer.perl` script from the Moses MT system.
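
As an illustration of the format above, the following sketch streams the xz-compressed train set with Python's standard `lzma` module. The `TrainRecord` class and its field names are hypothetical, chosen here only to mirror the five columns; the default path follows the directory structure listed earlier.

```python
import lzma
from dataclasses import dataclass
from typing import Iterator


@dataclass
class TrainRecord:
    period_start: float  # fractional year
    period_end: float    # fractional year
    title: str           # normalised title
    source: str          # symbol of the source
    text: str            # tokenised text snippet


def parse_train_line(line: str) -> TrainRecord:
    """Split one TAB-separated line into its five columns."""
    start, end, title, source, text = line.rstrip("\n").split("\t", 4)
    return TrainRecord(float(start), float(end), title, source, text)


def read_train(path: str = "train/train.tsv.xz") -> Iterator[TrainRecord]:
    """Yield records one by one without decompressing the file to disk."""
    # newline="\n" keeps stray carriage returns in noisy OCR text
    # from being treated as line breaks.
    with lzma.open(path, "rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            yield parse_train_line(line)
```

Streaming line by line avoids holding the whole corpus in memory, which may matter given the size of the snippets.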

Format of the test sets
-----------------------

The input file consists of the following columns:

* beginning of the period in which a text is known to be published,
* end of the period in which a text is known to be published,
* left context (before the gap),
* right context (after the gap).

The `expected.tsv` file is just a list of words to be filled in the gaps.
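
A dev set can be read by pairing the lines of `in.tsv` with those of `expected.tsv`. The sketch below assumes both files have the same number of lines in the same order; `GapItem` and the function names are hypothetical helpers, not part of the challenge tooling.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass
class GapItem:
    period_start: float  # fractional year
    period_end: float    # fractional year
    left_context: str    # text before the gap
    right_context: str   # text after the gap


def parse_input_line(line: str) -> GapItem:
    """Split one TAB-separated line of in.tsv into its four columns."""
    start, end, left, right = line.rstrip("\n").split("\t", 3)
    return GapItem(float(start), float(end), left, right)


def read_dev_set(directory: str = "dev-0") -> Iterator[Tuple[GapItem, str]]:
    """Yield (item, expected word) pairs for a directory such as dev-0 or dev-1."""
    with open(f"{directory}/in.tsv", encoding="utf-8", newline="\n") as inputs, \
         open(f"{directory}/expected.tsv", encoding="utf-8", newline="\n") as expected:
        for in_line, exp_line in zip(inputs, expected):
            yield parse_input_line(in_line), exp_line.rstrip("\n")
```

For `test-A` only `parse_input_line` applies, because the expected words are hidden.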

Format of the output files
--------------------------

For each input line, a probability distribution for words in a gap
must be given:

    word1:logprob1 word2:logprob2 ... wordN:logprobN :logprob0

where *logprobi* is the logarithm of the probability for *wordi* and
*logprob0* is the logarithm of the probability mass for all the other
words (it will be spread between all 1024 fingerprint values). If the
respective probabilities do not sum up to 1, they will be normalised with
softmax.

Note: the separator here is space, not TAB!
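
The output format can be produced and read back with a few lines of Python. This is a minimal sketch, not the evaluator's implementation: it assumes natural logarithms (the README does not name the base) and that words contain no spaces, and the helper names are hypothetical.

```python
import math
from typing import Dict


def format_distribution(word_probs: Dict[str, float], rest_prob: float) -> str:
    """Build one output line: `word:logprob` fields, then `:logprob` for the rest."""
    if rest_prob <= 0 or any(p <= 0 for p in word_probs.values()):
        raise ValueError("all probabilities must be positive to take logarithms")
    fields = [f"{word}:{math.log(p):.4f}" for word, p in word_probs.items()]
    fields.append(f":{math.log(rest_prob):.4f}")
    return " ".join(fields)  # separator is a space, not TAB


def parse_distribution(line: str) -> Dict[str, float]:
    """Parse an output line and normalise with softmax; key "" is the rest mass."""
    logprobs = {}
    for field in line.split(" "):
        word, _, value = field.rpartition(":")  # split at the last colon
        logprobs[word] = float(value)
    top = max(logprobs.values())  # subtract the maximum for numerical stability
    total = sum(math.exp(v - top) for v in logprobs.values())
    return {word: math.exp(v - top) / total for word, v in logprobs.items()}
```

For example, `format_distribution({"rolnej": 0.6, "rolniczej": 0.3}, 0.1)` yields `rolnej:-0.5108 rolniczej:-1.2040 :-2.3026`. Keeping a non-zero rest mass is a sensible default, because an expected word that receives no probability at all would make the log-loss unbounded.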