Merge branch 'master' of https://git.wmi.amu.edu.pl/s426289/lalka-lm

2021-06-24 00:28:18 +02:00
commit 7d61ed9133
parent 6170338a40 fdf03fd960
2 changed files with 58 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,57 @@
+
+Guess a word in a gap 
+=======================================
+
+Give a probability distribution for a word in a gap in a corpus built
+from the book "Lalka". This is a challenge for
+language models.
+
+The metric is log-loss calculated on 10-bit fingerprints generated
+with the MurmurHash3 hash function.
+
+Example
+-------
+
+For instance, you are expected to guess the word in the
+gap marked with the <MASK> token here:
+
+> Wokulski słuchał go uważnie do Pruszkowa .
+> Za Pruszkowem zmęczony i jednostajny głos pana
+> Tomasza zaczął go męczyć . Za to coraz
+> wyraźniej wpadała mu w ucho rozmowa panny
+> Izabeli ze <MASK> , prowadzona po angielsku .
+> Usłyszał nawet kilka zdań , które go zainteresowały ,
+> i zadał sobie pytanie : czy nie należałoby ostrzec
+> ich , że on rozumie po angielsku ?
+
+
+(The expected word here is *Starskim*.)
+
+Directory structure
+-------------------
+
+* `README.md` — this file
+* `config.txt` — GEval configuration file
+* `train/` — directory with training data
+* `dev-0/` — directory with dev (test) data
+* `dev-0/in.tsv` — input data for the dev set
+* `dev-0/expected.tsv` — expected (reference) data for the dev set
+* `test-A/` — directory with test data
+* `test-A/in.tsv` — input data for the test set
+* `test-A/expected.tsv` — expected (reference) data for the test set (hidden)
+
+
+Format of the output files
+--------------------------
+
+For each input line, a probability distribution for words in a gap
+must be given:
+
+    word1:logprob1 word2:logprob2 ... wordN:logprobN :logprob0
+
+where *logprobi* is the logarithm of the probability for *wordi* and
+*logprob0* is the logarithm of the probability mass for all the other
+words (it will be spread between all 1024 fingerprint values). If the
+respective probabilities do not sum up to 1, they will be normalised with
+softmax.
+
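The output format described in the README above can be illustrated with a minimal Python sketch. It is not part of the commit. The helper names `format_line` and `parse_line` are hypothetical, the example words are made up, and the use of the natural logarithm is an assumption, since the README only says "logarithm".

```python
import math


def format_line(probs, rest_mass):
    """Build one output line: word:logprob pairs, then :logprob0 for the rest.

    probs     -- dict mapping each candidate word to its probability
    rest_mass -- probability mass left for all other words (must be > 0)
    Natural logarithms are an assumption, not confirmed by the README.
    """
    if rest_mass <= 0.0:
        raise ValueError("rest_mass must be positive so its logarithm exists")
    total = sum(probs.values()) + rest_mass
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ValueError(f"probabilities sum to {total}, expected 1")
    parts = [f"{word}:{math.log(p):.4f}" for word, p in probs.items()]
    # The trailing entry has an empty word and carries the leftover mass.
    parts.append(f":{math.log(rest_mass):.4f}")
    return " ".join(parts)


def parse_line(line):
    """Read a line back into probabilities; the key "" holds the rest mass."""
    probs = {}
    for item in line.split():
        # Split on the last colon so a word containing ':' stays intact.
        word, _, logprob = item.rpartition(":")
        probs[word] = math.exp(float(logprob))
    return probs


if __name__ == "__main__":
    line = format_line({"Starskim": 0.6, "nim": 0.3}, rest_mass=0.1)
    print(line)
    print(parse_line(line))
```

Checking that the probabilities sum to 1 before writing the line avoids relying on the softmax normalisation mentioned in the README, whose exact behaviour is not specified there.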
--- a/config.txt
+++ b/config.txt
@@ -0,0 +1 @@
+--metric LogLossHashed10 --precision 8