lalka-lm_RNN/README.md
2021-05-24 12:34:02 +02:00

Guess a word in a gap

Give a probability distribution for the word missing from a gap in a corpus built from the novel "Lalka". This is a challenge for language models.

The metric is log-loss calculated on 10-bit fingerprints generated with the MurmurHash3 hash function.
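As a rough illustration of how such a metric can be computed, the sketch below hashes a word into one of 1024 buckets and takes the negative log of the probability assigned to the expected word's bucket. Several details are assumptions, not taken from this README or from GEval: the 32-bit MurmurHash3 variant, seed 0, UTF-8 encoding, keeping the low 10 bits of the hash, and the natural logarithm. The helper names `murmur3_32`, `fingerprint` and `log_loss` are hypothetical.

```python
import math
from typing import Sequence

MASK = 0xFFFFFFFF


def _rotl(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python MurmurHash3 (x86, 32-bit variant)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK
    nblocks = len(data) // 4
    # Body: process the input in 4-byte little-endian blocks.
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & MASK
        k = _rotl(k, 15)
        k = (k * c2) & MASK
        h ^= k
        h = _rotl(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK
    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[4 * nblocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK
        k = _rotl(k, 15)
        k = (k * c2) & MASK
        h ^= k
    # Finalisation mix.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK
    h ^= h >> 16
    return h


def fingerprint(word: str, bits: int = 10) -> int:
    """Assumed fingerprint: the low `bits` bits of the hash (seed 0)."""
    return murmur3_32(word.encode("utf-8")) & ((1 << bits) - 1)


def log_loss(bucket_probs: Sequence[float], expected_word: str) -> float:
    """Negative natural log of the mass in the expected word's bucket."""
    return -math.log(bucket_probs[fingerprint(expected_word)])


# A uniform guess over all 1024 buckets scores log(1024), about 6.93,
# whatever the expected word is.
uniform = [1.0 / 1024] * 1024
print(round(log_loss(uniform, "Wokulski"), 2))
```

Scoring fingerprints instead of words means that two words sharing a bucket are indistinguishable to the metric, so a model only has to place mass in the right bucket.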

Example

For instance, you are expected to guess the word in the gap marked with a token here:

Wokulski słuchał go uważnie do Pruszkowa . Za Pruszkowem zmęczony i jednostajny głos pana Tomasza zaczął go męczyć . Za to coraz wyraźniej wpadała mu w ucho rozmowa panny Izabeli ze , prowadzona po angielsku . Usłyszał nawet kilka zdań , które go zainteresowały , i zadał sobie pytanie : czy nie należałoby ostrzec ich , że on rozumie po angielsku ?

(And the correct expected word here is Starskim.)

Directory structure

  • README.md — this file
  • config.txt — GEval configuration file
  • train/ — directory with training data
  • dev-0/ — directory with dev (test) data
  • dev-0/in.tsv — input data for the dev set
  • dev-0/expected.tsv — expected (reference) data for the dev set
  • test-A/ — directory with test data
  • test-A/in.tsv — input data for the test set
  • test-A/expected.tsv — expected (reference) data for the test set (hidden)

Format of the output files

For each input line, a probability distribution for words in a gap must be given:

word1:logprob1 word2:logprob2 ... wordN:logprobN :logprob0

where logprobi is the logarithm of the probability of wordi (for i = 1, ..., N) and logprob0 is the logarithm of the probability mass left for all other words (this mass is spread across all 1024 fingerprint values). If the respective probabilities do not sum to 1, they are normalised with softmax.
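The format above can be produced and read back with a few lines of Python. This is a minimal sketch under stated assumptions: logarithms are natural, softmax is applied to the log-probabilities, and the candidate words and their probabilities are made up for illustration. The helper names `format_line` and `parse_line` are hypothetical and not part of GEval.

```python
import math
from typing import Dict


def format_line(word_probs: Dict[str, float]) -> str:
    """Build 'word1:logprob1 ... wordN:logprobN :logprob0' from probabilities.

    The mass not given to any listed word goes into the final ':logprob0'.
    """
    rest = 1.0 - sum(word_probs.values())
    if rest <= 0.0:
        raise ValueError("leave some probability mass for unlisted words")
    parts = ["{}:{:.4f}".format(w, math.log(p)) for w, p in word_probs.items()]
    parts.append(":{:.4f}".format(math.log(rest)))
    return " ".join(parts)


def parse_line(line: str) -> Dict[str, float]:
    """Read a line back into probabilities; the key '' holds the rest mass."""
    logprobs = {}
    for item in line.split():
        # Split on the last colon, so the bare ':logprob0' entry gets key ''.
        word, _, value = item.rpartition(":")
        logprobs[word] = float(value)
    probs = {w: math.exp(lp) for w, lp in logprobs.items()}
    total = sum(probs.values())
    if not math.isclose(total, 1.0, abs_tol=1e-3):
        # Softmax over log-probabilities: exp(lp_i) / sum_j exp(lp_j).
        probs = {w: p / total for w, p in probs.items()}
    return probs


# Hypothetical guesses for one gap.
line = format_line({"Starskim": 0.6, "swoim": 0.1})
print(line)  # Starskim:-0.5108 swoim:-2.3026 :-1.2040
```

Keeping a non-zero rest mass avoids an infinite loss when the expected word is not among the listed candidates.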