Guess a word in a gap
=======================================

Give a probability distribution for a word in a gap in a corpus of the
book "Lalka". This is a challenge for language models.

The metric is log-loss calculated on 10-bit fingerprints generated
with the MurmurHash3 hash function.
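
For intuition, the sketch below shows how a word could be mapped to one of
the 1024 fingerprint values and how the resulting log-loss could be computed.
It assumes the `mmh3` Python package with a 32-bit hash, seed 0 and UTF-8
encoding; these details, and the exact aggregation performed by the GEval
evaluator, are assumptions here rather than the official implementation.

```python
# Sketch only: assumes 32-bit MurmurHash3 from the `mmh3` package,
# seed 0 and UTF-8 bytes; the exact variant/seed used by the GEval
# evaluator may differ, so treat this as an approximation.
import math

import mmh3

N_BUCKETS = 1024  # 10-bit fingerprints


def fingerprint(word: str) -> int:
    """Map a word to one of the 1024 fingerprint values."""
    return mmh3.hash(word.encode("utf-8"), 0) & (N_BUCKETS - 1)


def log_loss(guesses: dict, rest_logprob: float, true_word: str) -> float:
    """Negative log of the probability mass landing in the true word's bucket."""
    target = fingerprint(true_word)
    # Mass from explicit guesses whose fingerprint matches the target ...
    mass = sum(math.exp(lp) for w, lp in guesses.items()
               if fingerprint(w) == target)
    # ... plus a uniform 1/1024 share of the remaining mass.
    mass += math.exp(rest_logprob) / N_BUCKETS
    return -math.log(mass)


print(log_loss({"rolnej": math.log(0.3), "angielskiej": math.log(0.2)},
               math.log(0.5), "rolnej"))
```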

Example
-------

For instance, you are expected to guess the word in the
gap marked with the `<MASK>` token here:

> Wokulski słuchał go uważnie do Pruszkowa .
> Za Pruszkowem zmęczony i jednostajny głos pana
> Tomasza zaczął go męczyć . Za to coraz
> wyraźniej wpadała mu w ucho rozmowa panny
> Izabeli ze <MASK> , prowadzona po angielsku .
> Usłyszał nawet kilka zdań , które go zainteresowały ,
> i zadał sobie pytanie : czy nie należałoby ostrzec
> ich , że on rozumie po angielsku ?

(And the correct expected word here is *rolnej*.)

Directory structure
-------------------

* `README.md` — this file
* `config.txt` — GEval configuration file
* `train/` — directory with training data
* `dev-0/` — directory with dev (test) data
* `dev-0/in.tsv` — input data for the dev set
* `dev-0/expected.tsv` — expected (reference) data for the dev set
* `test-A/` — directory with test data
* `test-A/in.tsv` — input data for the test set
* `test-A/expected.tsv` — expected (reference) data for the test set (hidden)

Format of the output files
--------------------------

For each input line, a probability distribution for the word in the gap
must be given:

    word1:logprob1 word2:logprob2 ... wordN:logprobN :logprob0

where *logprobi* is the logarithm of the probability for *wordi* and
*logprob0* is the logarithm of the probability mass for all the other
words (it will be spread between all 1024 fingerprint values). If the
respective probabilities do not sum up to 1, they will be normalised with
softmax.
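
As an illustration, here is a minimal sketch of producing one output line
from a dictionary of word probabilities. The helper name `format_line` and
the example words and probabilities are made up for this sketch, and natural
logarithms are assumed; the evaluation itself is configured in `config.txt`.

```python
import math


def format_line(probs: dict) -> str:
    """Format {word: probability} as 'word1:logprob1 ... :logprob0'."""
    # Whatever mass is not assigned to explicit guesses goes into the
    # final ':logprob0' entry (guarded against rounding to zero).
    rest = max(1e-9, 1.0 - sum(probs.values()))
    parts = [f"{word}:{math.log(p):.6f}" for word, p in probs.items()]
    parts.append(f":{math.log(rest):.6f}")
    return " ".join(parts)


# Two explicit guesses covering half of the probability mass:
print(format_line({"rolnej": 0.4, "angielskiej": 0.1}))
# -> rolnej:-0.916291 angielskiej:-2.302585 :-0.693147
```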