Guess a word in a gap
=======================================

Give a probability distribution for the word in a gap in a corpus built
from the book "Lalka". This is a challenge for language models.

The metric is log-loss calculated on 10-bit fingerprints generated
with the MurmurHash3 hash function.
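
The scoring idea can be sketched in Python as follows. This is only an illustration under stated assumptions, not GEval's actual code: it assumes the 32-bit x86 variant of MurmurHash3 with seed 0 over the UTF-8 bytes of a word, the low 10 bits of the hash as the fingerprint, and natural logarithms. The helper names `murmur3_32`, `fingerprint` and `log_loss` are made up for this example.

```python
import math
from typing import Dict

MASK32 = 0xFFFFFFFF
N_FINGERPRINTS = 1 << 10  # 10-bit fingerprints -> 1024 possible values


def _mix_k(k: int) -> int:
    """Scramble one 32-bit chunk (MurmurHash3 x86_32 constants)."""
    k = (k * 0xCC9E2D51) & MASK32
    k = ((k << 15) | (k >> 17)) & MASK32  # rotate left by 15
    return (k * 0x1B873593) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python MurmurHash3, 32-bit x86 variant."""
    h = seed & MASK32
    n_blocks = len(data) // 4
    for i in range(n_blocks):
        h ^= _mix_k(int.from_bytes(data[4 * i:4 * i + 4], "little"))
        h = ((h << 13) | (h >> 19)) & MASK32  # rotate left by 13
        h = (h * 5 + 0xE6546B64) & MASK32
    tail = data[4 * n_blocks:]
    if tail:
        h ^= _mix_k(int.from_bytes(tail, "little"))
    # Finalisation: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def fingerprint(word: str) -> int:
    # Assumption: UTF-8 bytes, seed 0, low 10 bits of the 32-bit hash.
    return murmur3_32(word.encode("utf-8")) & (N_FINGERPRINTS - 1)


def log_loss(word_probs: Dict[str, float], rest: float, expected: str) -> float:
    """Negative log of the probability mass on the expected word's fingerprint."""
    # The leftover mass is spread evenly over all fingerprint values.
    per_fp = [rest / N_FINGERPRINTS] * N_FINGERPRINTS
    for word, p in word_probs.items():
        per_fp[fingerprint(word)] += p
    return -math.log(per_fp[fingerprint(expected)])
```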

Example
-------

For instance, you are expected to guess the word in the
gap marked with the `<MASK>` token here:

> Wokulski słuchał go uważnie do Pruszkowa .
> Za Pruszkowem zmęczony i jednostajny głos pana
> Tomasza zaczął go męczyć . Za to coraz
> wyraźniej wpadała mu w ucho rozmowa panny
> Izabeli ze `<MASK>` , prowadzona po angielsku .
> Usłyszał nawet kilka zdań , które go zainteresowały ,
> i zadał sobie pytanie : czy nie należałoby ostrzec
> ich , że on rozumie po angielsku ?

(The expected word here is *rolnej*.)

Directory structure
-------------------

* `README.md` — this file
* `config.txt` — GEval configuration file
* `train/` — directory with training data
* `dev-0/` — directory with dev (test) data
* `dev-0/in.tsv` — input data for the dev set
* `dev-0/expected.tsv` — expected (reference) data for the dev set
* `test-A/` — directory with test data
* `test-A/in.tsv` — input data for the test set
* `test-A/expected.tsv` — expected (reference) data for the test set (hidden)

Format of the output files
--------------------------

For each input line, a probability distribution for the words in the gap
must be given:

    word1:logprob1 word2:logprob2 ... wordN:logprobN :logprob0

where *logprobi* is the logarithm of the probability for *wordi* and
*logprob0* is the logarithm of the probability mass for all the other
words (it will be spread between all 1024 fingerprint values). If the
respective probabilities do not sum up to 1, they will be normalised with
softmax.
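
As a rough illustration of this format, the sketch below parses one output line and turns it into probabilities. It is not part of the challenge tooling. The function names `parse_line` and `to_probabilities` are hypothetical, and the code assumes natural logarithms, whitespace-separated entries, and that the softmax is taken over the log-probabilities themselves (which amounts to rescaling the probabilities so they sum to 1). Treating a missing `:logprob0` entry as zero leftover mass is also an assumption.

```python
import math
from typing import Dict, Tuple


def parse_line(line: str) -> Tuple[Dict[str, float], float]:
    """Split one output line into {word: logprob} and the rest-mass logprob."""
    word_logprobs: Dict[str, float] = {}
    rest_logprob = -math.inf  # assumption: no ":logprob0" entry means no leftover mass
    for token in line.split():
        # Split on the LAST colon so that a word containing ":" still parses.
        word, _, value = token.rpartition(":")
        if word:
            word_logprobs[word] = float(value)
        else:
            rest_logprob = float(value)
    return word_logprobs, rest_logprob


def to_probabilities(
    word_logprobs: Dict[str, float], rest_logprob: float
) -> Tuple[Dict[str, float], float]:
    """Softmax over the log-probabilities; a no-op if they already sum to 1."""
    logps = list(word_logprobs.values()) + [rest_logprob]
    top = max(logps)  # subtract the maximum for numerical stability
    weights = [math.exp(lp - top) for lp in logps]
    total = sum(weights)
    probs = [w / total for w in weights]
    # The last entry is the leftover mass; the others follow the word order.
    return dict(zip(word_logprobs, probs)), probs[-1]
```

For example, `to_probabilities(*parse_line("panny:0 pana:0 :0"))` gives each of the two words and the leftover mass a probability of one third, because the three equal scores do not sum to 1 and are therefore rescaled.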