forked from kubapok/lalka-lm
commit
7d61ed9133
57
README.md
Normal file
@ -0,0 +1,57 @@
Guess a word in a gap
=======================================

Give a probability distribution for a word in a gap in a corpus based
on the book "Lalka". This is a challenge for language models.

The metric is log-loss calculated on 10-bit fingerprints generated
with the MurmurHash3 hash function.
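
To make the metric more concrete, here is a minimal sketch of a log-loss computed over 10-bit fingerprint bins. It is an illustration under several assumptions, not GEval's actual implementation: `fingerprint` uses CRC32 from the standard library as a stand-in (it is NOT MurmurHash3), the natural logarithm and a plain mean over lines are assumed, and the names `to_bins` and `hashed_log_loss` are hypothetical.

```python
import math
import zlib

BITS = 10
BINS = 1 << BITS  # 1024 possible fingerprint values


def fingerprint(word):
    """Map a word to a 10-bit bin.

    Assumption: CRC32 is a stand-in so the sketch needs no third-party
    package. The real metric uses MurmurHash3.
    """
    return zlib.crc32(word.encode("utf-8")) & (BINS - 1)


def to_bins(word_probs, rest_mass):
    """Fold word probabilities into bins; spread the rest mass evenly."""
    bins = [rest_mass / BINS] * BINS
    for word, prob in word_probs.items():
        bins[fingerprint(word)] += prob
    return bins


def hashed_log_loss(predictions, expected_words):
    """Mean negative log-probability of the bin of each expected word.

    predictions: list of (word_probs, rest_mass) pairs, one per input line.
    """
    total = 0.0
    for (word_probs, rest_mass), gold in zip(predictions, expected_words):
        bins = to_bins(word_probs, rest_mass)
        total -= math.log(bins[fingerprint(gold)])
    return total / len(expected_words)
```

Under these assumptions, a prediction that names no words and leaves the whole mass to the remainder scores ln 1024 ≈ 6.93, so any useful model should score below that.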

Example
-------

For instance, you are expected to guess the word in the
gap marked with the `<MASK>` token here:

> Wokulski słuchał go uważnie do Pruszkowa .
> Za Pruszkowem zmęczony i jednostajny głos pana
> Tomasza zaczął go męczyć . Za to coraz
> wyraźniej wpadała mu w ucho rozmowa panny
> Izabeli ze <MASK> , prowadzona po angielsku .
> Usłyszał nawet kilka zdań , które go zainteresowały ,
> i zadał sobie pytanie : czy nie należałoby ostrzec
> ich , że on rozumie po angielsku ?

(And the correct expected word here is *rolnej*.)

Directory structure
-------------------

* `README.md` — this file
* `config.txt` — GEval configuration file
* `train/` — directory with training data
* `dev-0/` — directory with dev (test) data
* `dev-0/in.tsv` — input data for the dev set
* `dev-0/expected.tsv` — expected (reference) data for the dev set
* `test-A/` — directory with test data
* `test-A/in.tsv` — input data for the test set
* `test-A/expected.tsv` — expected (reference) data for the test set (hidden)

Format of the output files
--------------------------

For each input line, a probability distribution for words in a gap
must be given:

    word1:logprob1 word2:logprob2 ... wordN:logprobN :logprob0

where *logprobi* is the logarithm of the probability for *wordi* and
*logprob0* is the logarithm of the probability mass for all the other
words (it will be spread between all 1024 fingerprint values). If the
respective probabilities do not sum to 1, they will be normalised with
softmax.
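
The output format can be sketched in a few lines of Python. This is only an illustration: the helper names `format_prediction` and `softmax_normalise` are hypothetical, the natural logarithm is an assumption (the base is not stated here), and the exact way the evaluator applies softmax may differ.

```python
import math


def format_prediction(word_probs, rest_mass):
    """Render one output line: 'word1:logprob1 ... wordN:logprobN :logprob0'.

    Assumption: natural logarithms, printed with four decimal places.
    """
    if rest_mass <= 0.0:
        raise ValueError("leave some probability mass for unlisted words")
    fields = [
        f"{word}:{math.log(prob):.4f}"
        for word, prob in word_probs.items()
        if prob > 0.0
    ]
    # The entry with an empty word carries the mass of all other words.
    fields.append(f":{math.log(rest_mass):.4f}")
    return " ".join(fields)


def softmax_normalise(logprobs):
    """Turn log-probabilities into probabilities that sum to 1."""
    top = max(logprobs)  # subtract the maximum for numerical stability
    weights = [math.exp(lp - top) for lp in logprobs]
    total = sum(weights)
    return [w / total for w in weights]
```

For example, `format_prediction({"a": 0.5}, 0.5)` returns `a:-0.6931 :-0.6931`. Applying softmax to log-probabilities whose probabilities already sum to 1 returns those same probabilities, so well-formed lines are unaffected by the normalisation step in this sketch.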
1
config.txt
Normal file
@ -0,0 +1 @@
--metric LogLossHashed10 --precision 8