VanillaHellen e04a09c93e Merge branch 'master' of git.wmi.amu.edu.pl:s416067/lalka-lm		2021-07-06 08:49:37 +02:00
dev-0	fix	2021-07-06 08:48:10 +02:00
test-A	fix	2021-07-06 08:48:10 +02:00
train	fix	2021-07-06 08:48:10 +02:00
.gitignore	out fix	2021-07-05 00:06:22 +02:00
config.txt	out fix	2021-07-05 00:06:22 +02:00
geval	program to check	2021-06-29 18:04:35 +02:00
guessword.py	fix	2021-07-06 08:48:10 +02:00
README.md	out fix	2021-07-05 00:06:22 +02:00

README.md

Guess a word in a gap

Give a probability distribution for the word in a gap in a corpus built from the book "Lalka". This is a challenge for language models.

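The metric is log-loss calculated on 10-bit fingerprints generated with the MurmurHash3 hash function. In other words, each word is mapped to one of 1024 (2^10) fingerprint values, and the loss is computed over those values rather than over the words themselves.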
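To make the metric more concrete, here is a minimal sketch of how a hashed log-loss of this kind could be computed. It is not GEval's actual code. The 32-bit MurmurHash3 variant, the seed of 0, the UTF-8 encoding, taking the lowest 10 bits, and the natural logarithm are all assumptions made for illustration, and `fingerprint` and `hashed_log_loss` are hypothetical helper names.

```python
import math

MASK32 = 0xFFFFFFFF
FINGERPRINT_BITS = 10                  # "10-bit fingerprints" from the README
NUM_BUCKETS = 1 << FINGERPRINT_BITS    # 1024 fingerprint values


def _rotl32(x, r):
    """Rotate a 32-bit integer left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data, seed=0):
    """Pure-Python MurmurHash3 (x86, 32-bit variant) of a bytes object."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n_blocks = len(data) // 4
    for i in range(n_blocks):
        # Body: mix each little-endian 4-byte block into the state.
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = _rotl32((k * c1) & MASK32, 15)
        k = (k * c2) & MASK32
        h = _rotl32(h ^ k, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    tail = data[4 * n_blocks:]
    if tail:
        # Tail: the remaining 1 to 3 bytes.
        k = int.from_bytes(tail, "little")
        k = _rotl32((k * c1) & MASK32, 15)
        k = (k * c2) & MASK32
        h ^= k
    # Finalisation (avalanche).
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def fingerprint(word, seed=0):
    """Assumption: UTF-8 bytes, seed 0, lowest 10 bits of the hash."""
    return murmur3_32(word.encode("utf-8"), seed) & (NUM_BUCKETS - 1)


def hashed_log_loss(word_probs, rest_prob, expected_word):
    """Negative natural log of the mass on the expected word's fingerprint.

    word_probs maps words to probabilities; rest_prob is the mass left for
    all other words, shared equally between the buckets here.
    """
    buckets = [rest_prob / NUM_BUCKETS] * NUM_BUCKETS
    for word, p in word_probs.items():
        buckets[fingerprint(word)] += p
    return -math.log(buckets[fingerprint(expected_word)])


if __name__ == "__main__":
    # Hypothetical guess: half of the mass on one word, half on the rest.
    print(hashed_log_loss({"rolnej": 0.5}, 0.5, "rolnej"))
```

Under these assumptions, words that collide on the same fingerprint pool their probability, so a guess can be rewarded for a different word that happens to share the expected word's fingerprint.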

Example

For instance, you are expected to guess the word in the gap marked with a token here:

Wokulski słuchał go uważnie do Pruszkowa . Za Pruszkowem zmęczony i jednostajny głos pana Tomasza zaczął go męczyć . Za to coraz wyraźniej wpadała mu w ucho rozmowa panny Izabeli ze , prowadzona po angielsku . Usłyszał nawet kilka zdań , które go zainteresowały , i zadał sobie pytanie : czy nie należałoby ostrzec ich , że on rozumie po angielsku ?

(And the correct expected word here is rolnej.)

Directory structure

README.md — this file
config.txt — GEval configuration file
train/ — directory with training data
dev-0/ — directory with dev (test) data
dev-0/in.tsv — input data for the dev set
dev-0/expected.tsv — expected (reference) data for the dev set
test-A — directory with test data
test-A/in.tsv — input data for the test set
test-A/expected.tsv — expected (reference) data for the test set (hidden)

Format of the output files

For each input line, a probability distribution for words in a gap must be given:

word1:logprob1 word2:logprob2 ... wordN:logprobN :logprob0

where logprobi is the logarithm of the probability of wordi (for i = 1, ..., N) and logprob0 is the logarithm of the probability mass assigned to all other words (this mass is spread across all 1024 fingerprint values). If the respective probabilities do not sum to 1, they are normalised with softmax.
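The output format can be illustrated with a short sketch that writes a line, reads it back, and applies a softmax to the log-probabilities. The helper names `format_line`, `parse_line` and `normalise` are hypothetical, natural logarithms are an assumption, and GEval's own parsing and normalisation may differ in detail.

```python
import math


def format_line(word_probs, rest_prob):
    """Render 'word1:logprob1 ... wordN:logprobN :logprob0'.

    All probabilities must be strictly positive, since log(0) is undefined.
    """
    parts = [f"{word}:{math.log(p):.4f}" for word, p in word_probs.items()]
    parts.append(f":{math.log(rest_prob):.4f}")
    return " ".join(parts)


def parse_line(line):
    """Split an output line into ({word: logprob}, rest_logprob or None)."""
    word_logprobs = {}
    rest_logprob = None
    for token in line.split():
        # Split on the last colon, so the empty word marks the rest mass.
        word, _, value = token.rpartition(":")
        if word == "":
            rest_logprob = float(value)
        else:
            word_logprobs[word] = float(value)
    return word_logprobs, rest_logprob


def normalise(word_logprobs, rest_logprob):
    """Softmax over the log-probabilities.

    If the probabilities already sum to 1, this leaves them unchanged.
    """
    logs = list(word_logprobs.values())
    if rest_logprob is not None:
        logs.append(rest_logprob)
    if not logs:
        raise ValueError("empty distribution")
    top = max(logs)  # subtract the maximum for numerical stability
    log_z = top + math.log(sum(math.exp(v - top) for v in logs))
    probs = {w: math.exp(v - log_z) for w, v in word_logprobs.items()}
    rest = 0.0 if rest_logprob is None else math.exp(rest_logprob - log_z)
    return probs, rest


if __name__ == "__main__":
    # Hypothetical guess for one gap: two candidate words plus the rest mass.
    line = format_line({"rolnej": 0.6, "pana": 0.1}, 0.3)
    print(line)
    print(normalise(*parse_line(line)))
```

Because softmax over log-probabilities is equivalent to dividing each probability by their sum, a line whose probabilities are merely proportional to the intended distribution is normalised to the same result under this sketch.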