Guess a word in a gap
Give a probability distribution for the word missing from a gap in the text of the novel "Lalka". This is a challenge for language models.
The metric is log-loss calculated on 10-bit fingerprints generated with the MurmurHash3 hash function.
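To make the metric more concrete, here is a minimal sketch of how a log-loss over 10-bit fingerprints could be computed for a single gap. Several details are assumptions rather than confirmed GEval behaviour: the 32-bit MurmurHash3 variant, seed 0, UTF-8 encoding, taking the lowest 10 bits of the hash, spreading the leftover mass uniformly, and using the natural logarithm. The helper names (`murmur3_32`, `fingerprint`, `gap_log_loss`) are hypothetical.

```python
import math

MASK32 = 0xFFFFFFFF
NUM_FINGERPRINTS = 1 << 10  # 10-bit fingerprints -> 1024 possible values


def _rotl32(x, r):
    """Rotate a 32-bit integer left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data, seed=0):
    """Pure-Python MurmurHash3 (x86, 32-bit variant) of a bytes object."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n = len(data)
    body_len = n - (n % 4)
    # Mix in the input four bytes at a time (little-endian blocks).
    for i in range(0, body_len, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    # Mix in the remaining 1-3 bytes, if any.
    tail = data[body_len:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
    # Finalisation (avalanche) step.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def fingerprint(word, seed=0):
    # ASSUMPTION: UTF-8 bytes, seed 0, lowest 10 bits of the 32-bit hash.
    return murmur3_32(word.encode("utf-8"), seed) % NUM_FINGERPRINTS


def gap_log_loss(word_probs, rest_prob, expected_word):
    """Log-loss for one gap, given probabilities that already sum to 1."""
    # ASSUMPTION: the leftover mass is spread uniformly over all fingerprints.
    mass = [rest_prob / NUM_FINGERPRINTS] * NUM_FINGERPRINTS
    for word, prob in word_probs.items():
        mass[fingerprint(word)] += prob
    # ASSUMPTION: natural logarithm.
    return -math.log(mass[fingerprint(expected_word)])
```

Under these assumptions, a submission that puts all of its mass on "other words" (for example `gap_log_loss({}, 1.0, "rolnej")`) scores `log(1024)` on every gap, which is a useful baseline to beat. Because distinct words can share a fingerprint, a wrong guess occasionally earns credit by collision.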
Example
For instance, you are expected to guess the word in the gap marked with token here:
Wokulski słuchał go uważnie do Pruszkowa . Za Pruszkowem zmęczony i jednostajny głos pana Tomasza zaczął go męczyć . Za to coraz wyraźniej wpadała mu w ucho rozmowa panny Izabeli ze , prowadzona po angielsku . Usłyszał nawet kilka zdań , które go zainteresowały , i zadał sobie pytanie : czy nie należałoby ostrzec ich , że on rozumie po angielsku ?
(And the correct expected word here is rolnej.)
Directory structure
README.md - this file
config.txt - GEval configuration file
train/ - directory with training data
dev-0/ - directory with dev (test) data
dev-0/in.tsv - input data for the dev set
dev-0/expected.tsv - expected (reference) data for the dev set
test-A/ - directory with test data
test-A/in.tsv - input data for the test set
test-A/expected.tsv - expected (reference) data for the test set (hidden)
Format of the output files
For each input line, a probability distribution for words in a gap must be given:
word1:logprob1 word2:logprob2 ... wordN:logprobN :logprob0
where logprob_i is the logarithm of the probability for word_i and logprob0 is the logarithm of the probability mass for all the other words (it will be spread across all 1024 fingerprint values). If the respective probabilities do not sum to 1, they will be normalised with softmax.
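The format above can be illustrated with a short sketch that builds one output line and shows what softmax normalisation over log-probabilities amounts to. The function names (`format_prediction`, `normalise_logprobs`) are hypothetical, and the use of the natural logarithm and four decimal places are assumptions, not requirements stated by the challenge.

```python
import math


def normalise_logprobs(logprobs):
    """Softmax over log-probabilities: exp(lp_i) / sum_j exp(lp_j)."""
    top = max(logprobs)  # subtract the maximum for numerical stability
    weights = [math.exp(lp - top) for lp in logprobs]
    total = sum(weights)
    return [w / total for w in weights]


def format_prediction(word_probs, rest_prob):
    """Build one line: 'word1:logprob1 ... wordN:logprobN :logprob0'."""
    # ASSUMPTION: natural logarithm, four decimal places.
    fields = [f"{word}:{math.log(p):.4f}" for word, p in word_probs.items()]
    # The final field has an empty word and carries the leftover mass.
    fields.append(f":{math.log(rest_prob):.4f}")
    return " ".join(fields)


line = format_prediction({"rolnej": 0.5, "domu": 0.25}, 0.25)
print(line)  # rolnej:-0.6931 domu:-1.3863 :-1.3863
```

Applying softmax to log-probabilities simply rescales the underlying probabilities so that they sum to 1. For example, `normalise_logprobs([math.log(2), math.log(2)])` yields `[0.5, 0.5]`.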