diff --git a/README.md b/README.md
new file mode 100644
index 0000000..1c6cb12
--- /dev/null
+++ b/README.md
@@ -0,0 +1,57 @@
+
+Guess a word in a gap
+=======================================
+
+Give a probability distribution for a word in a gap in the corpus of
+the book "Lalka". This is a challenge for
+language models.
+
+The metric is log-loss calculated on 10-bit fingerprints generated
+with the MurmurHash3 hash function.
+
+Example
+-------
+
+For instance, you are expected to guess the word in the
+gap marked with token here:
+
+> Wokulski słuchał go uważnie do Pruszkowa .
+> Za Pruszkowem zmęczony i jednostajny głos pana
+> Tomasza zaczął go męczyć . Za to coraz
+> wyraźniej wpadała mu w ucho rozmowa panny
+> Izabeli ze , prowadzona po angielsku .
+> Usłyszał nawet kilka zdań , które go zainteresowały ,
+> i zadał sobie pytanie : czy nie należałoby ostrzec
+> ich , że on rozumie po angielsku ?
+
+
+(And the correct expected word here is *rolnej*.)
+
+Directory structure
+-------------------
+
+* `README.md` — this file
+* `config.txt` — GEval configuration file
+* `train/` — directory with training data
+* `dev-0/` — directory with dev (test) data
+* `dev-0/in.tsv` — input data for the dev set
+* `dev-0/expected.tsv` — expected (reference) data for the dev set
+* `test-A/` — directory with test data
+* `test-A/in.tsv` — input data for the test set
+* `test-A/expected.tsv` — expected (reference) data for the test set (hidden)
+
+
+Format of the output files
+--------------------------
+
+For each input line, a probability distribution for words in a gap
+must be given:
+
+    word1:logprob1 word2:logprob2 ... wordN:logprobN :logprob0
+
+where *logprobi* is the logarithm of the probability for *wordi* and
+*logprob0* is the logarithm of the probability mass for all the other
+words (it will be spread between all 1024 fingerprint values). If the
+respective probabilities do not sum up to 1, they will be normalised with
+softmax.
+
diff --git a/config.txt b/config.txt
new file mode 100644
index 0000000..43db952
--- /dev/null
+++ b/config.txt
@@ -0,0 +1 @@
+--metric LogLossHashed10 --precision 8
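
Note on the output format described in the README: the sketch below shows one way a submission line could be assembled from a probability distribution. It is not part of the patch and not taken from GEval. The function name `format_prediction`, the use of natural logarithms, and the choice to normalise the probabilities before taking logs are all assumptions made for illustration.

```python
import math


def format_prediction(word_probs, rest_prob):
    """Build one output line: 'word1:logprob1 ... wordN:logprobN :logprob0'.

    word_probs: dict mapping a candidate word to its probability.
    rest_prob:  probability mass reserved for all other words.
    Assumption: natural logarithms are used; the README does not state the base.
    """
    if rest_prob <= 0.0 or any(p <= 0.0 for p in word_probs.values()):
        raise ValueError("all probabilities must be positive")
    # Normalise here (an illustrative choice) so the line already sums to 1
    # and does not rely on the evaluator's own normalisation.
    total = sum(word_probs.values()) + rest_prob
    fields = [f"{word}:{math.log(p / total):.4f}" for word, p in word_probs.items()]
    # The trailing field has an empty word and carries the remaining mass.
    fields.append(f":{math.log(rest_prob / total):.4f}")
    return " ".join(fields)


line = format_prediction({"się": 0.5, "nie": 0.25}, 0.25)
print(line)  # się:-0.6931 nie:-1.3863 :-1.3863
```

Each line written this way corresponds to one line of `in.tsv`. Words containing spaces or colons would need extra handling that this sketch does not attempt.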