forked from kubapok/lalka-lm
commit 7d61ed9133
57
README.md
Normal file

Guess a word in a gap
=======================================

Give a probability distribution for a word in a gap in a corpus of
the book "Lalka". This is a challenge for
language models.

The metric is log-loss calculated on 10-bit fingerprints generated
with the MurmurHash3 hash function.
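
To make the metric more concrete, here is a minimal Python sketch of log-loss computed over hashed fingerprints. It is not GEval's implementation. The pure-Python `murmur3_32` follows the public 32-bit MurmurHash3 algorithm, but the seed (0), the UTF-8 encoding, the use of the lowest 10 bits as the fingerprint, and the even spreading of the leftover mass are all assumptions made for illustration. The function names and the sample words are hypothetical.

```python
import math
from typing import Dict

MASK32 = 0xFFFFFFFF
BINS = 1 << 10  # 10-bit fingerprints give 1024 possible values


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """32-bit MurmurHash3 (x86 variant), returned as an unsigned integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    full = len(data) - len(data) % 4  # number of bytes in complete 4-byte blocks
    for i in range(0, len(data), 4):
        # Full blocks and the 1-3 byte tail share the same key mixing.
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        if i < full:  # only full blocks get the extra mixing of h
            h = _rotl32(h, 13)
            h = (h * 5 + 0xE6546B64) & MASK32
    h ^= len(data)
    # Final avalanche step.
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def fingerprint(word: str) -> int:
    """Assumed fingerprint: lowest 10 bits of the seed-0 hash of the UTF-8 bytes."""
    return murmur3_32(word.encode("utf-8")) % BINS


def hashed_log_loss(candidates: Dict[str, float], rest: float, expected: str) -> float:
    """Negative log of the probability mass in the expected word's fingerprint bin."""
    bins = [rest / BINS] * BINS  # assumption: leftover mass is spread evenly
    for word, prob in candidates.items():
        bins[fingerprint(word)] += prob
    return -math.log(bins[fingerprint(expected)])


# Made-up distribution for one gap; the probabilities are illustrative only.
loss = hashed_log_loss({"nim": 0.5, "panem": 0.25}, rest=0.25, expected="nim")
print(loss)
```

Because words are compared through their fingerprints, two different words that collide on the same 10-bit value share one bin, so a guess can score well without matching the expected word exactly. Note that `rest` must be positive here, otherwise an empty bin would yield an infinite loss.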

Example
-------

For instance, you are expected to guess the word in the
gap marked with the `<MASK>` token here:

> Wokulski słuchał go uważnie do Pruszkowa .
> Za Pruszkowem zmęczony i jednostajny głos pana
> Tomasza zaczął go męczyć . Za to coraz
> wyraźniej wpadała mu w ucho rozmowa panny
> Izabeli ze <MASK> , prowadzona po angielsku .
> Usłyszał nawet kilka zdań , które go zainteresowały ,
> i zadał sobie pytanie : czy nie należałoby ostrzec
> ich , że on rozumie po angielsku ?

(The correct word here is *Starskim*.)

Directory structure
-------------------

* `README.md` — this file
* `config.txt` — GEval configuration file
* `train/` — directory with training data
* `dev-0/` — directory with dev (test) data
* `dev-0/in.tsv` — input data for the dev set
* `dev-0/expected.tsv` — expected (reference) data for the dev set
* `test-A` — directory with test data
* `test-A/in.tsv` — input data for the test set
* `test-A/expected.tsv` — expected (reference) data for the test set (hidden)

Format of the output files
--------------------------

For each input line, a probability distribution for the word in the gap
must be given:

    word1:logprob1 word2:logprob2 ... wordN:logprobN :logprob0

where *logprobi* is the logarithm of the probability of *wordi* and
*logprob0* is the logarithm of the probability mass for all the other
words (it will be spread across all 1024 fingerprint values). If the
respective probabilities do not sum up to 1, they will be normalised with
softmax.
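
The output format and the softmax fallback described above can be sketched in Python as follows. The helper names `format_line`, `parse_line` and `softmax_normalise` are hypothetical, the natural logarithm is an assumption (the base is not stated above), the sample words are made up, and this is not GEval's own parser.

```python
import math
from typing import Dict


def format_line(probs: Dict[str, float], rest: float) -> str:
    """Render one output line; the entry with an empty word carries the rest mass."""
    items = [f"{word}:{math.log(p):.6f}" for word, p in probs.items()]
    items.append(f":{math.log(rest):.6f}")
    return " ".join(items)


def parse_line(line: str) -> Dict[str, float]:
    """Map each word to its log-probability; the empty key "" holds logprob0."""
    parsed = {}
    for token in line.split():
        # Split on the last colon so the rest-mass token ":x" yields an empty word.
        word, _, logprob = token.rpartition(":")
        parsed[word] = float(logprob)
    return parsed


def softmax_normalise(logprobs: Dict[str, float]) -> Dict[str, float]:
    """Exponentiate and rescale so that the probabilities sum to 1."""
    top = max(logprobs.values())  # subtracting the maximum avoids overflow
    exps = {word: math.exp(lp - top) for word, lp in logprobs.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}


line = format_line({"nim": 0.5, "panem": 0.25}, rest=0.25)
print(line)
print(softmax_normalise(parse_line(line)))
```

Applying softmax to log-probabilities amounts to dividing each probability by their sum, so a line whose probabilities already sum to 1 is left essentially unchanged. All probabilities passed to `format_line`, including `rest`, must be positive, since the logarithm of zero is undefined.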
1
config.txt
Normal file
--metric LogLossHashed10 --precision 8