improve sample challenge for LogLossHashed

Filip Gralinski 2018-05-15 08:14:52 +02:00
parent 06fd093349
commit 9fc4beaba1


@@ -113,6 +113,35 @@ The metric is average log-loss calculated for 10-bit hashes.
Train file is just a text file (one utterance per line).
In an input file, left and right contexts (TAB-separated) are given.
In an expected file, the word to be guessed is given.

Format of the output files
--------------------------

For each input line, a probability distribution for words in a gap
must be given:

    word1:logprob1 word2:logprob2 ... wordN:logprobN :logprob0

where *logprobi* is the logarithm of the probability for *wordi* and
*logprob0* is the logarithm of the probability mass for all the other
words (it will be spread between all 1024 fingerprint values). If the
respective probabilities do not sum up to 1, they will be normalised with
softmax.
Note: the separator here is space, not TAB!
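As an illustration, one output line could be produced as in the sketch below. The function name `format_output_line`, the sample words and the four-decimal rounding are hypothetical, and natural logarithms are an assumption (the text above only says "logarithm").

```python
import math


def format_output_line(word_probs, rest_prob):
    """Render one output line: word:logprob pairs, then :logprob0.

    word_probs maps each listed word to its probability; rest_prob is the
    probability mass left for all the other words. Both must be positive.
    Natural logarithms are an assumption made for this sketch.
    """
    entries = [f"{word}:{math.log(p):.4f}" for word, p in word_probs.items()]
    # The final entry has an empty word and carries the remaining mass.
    entries.append(f":{math.log(rest_prob):.4f}")
    # Entries are separated by single spaces, not TABs.
    return " ".join(entries)


line = format_output_line({"the": 0.5, "a": 0.25}, rest_prob=0.25)
print(line)
```

The words themselves should contain neither spaces nor colons, since both act as separators in this format.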
### Probs
Probabilities can be given instead of logprobs:

* if **all** values look like probs and **at least one value** is positive, we treat
  the values as probs rather than logprobs (a single value of 0.0 is treated
  as a logprob, i.e. probability 1.0!);
* if their sum is greater than 1.0, we normalize simply by dividing by the sum;
* if the sum is smaller than 1.0 and there is no entry for all the other words,
  we add such an entry for the missing probability mass;
* if the sum is smaller than 1.0 and there is an entry for all the other words,
  we normalize by dividing by the sum.
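
The rules above can be sketched roughly as follows. This is not the evaluator's actual code: the names `looks_like_probs` and `normalize_probs` are hypothetical, the entry for all the other words is represented by an empty word, and "looks like a prob" is assumed to mean a value between 0.0 and 1.0.

```python
def looks_like_probs(values):
    """Assumed reading of the first rule: every value lies in [0, 1]
    and at least one of them is positive."""
    return all(0.0 <= v <= 1.0 for v in values) and any(v > 0.0 for v in values)


def normalize_probs(entries):
    """Normalize a list of (word, prob) pairs following the rules above.

    The entry for all the other words is the pair whose word is "".
    """
    if not looks_like_probs([p for _, p in entries]):
        raise ValueError("values would be treated as logprobs, not probs")
    total = sum(p for _, p in entries)
    has_rest = any(word == "" for word, _ in entries)
    if total < 1.0 and not has_rest:
        # The missing probability mass goes to a new entry for all the other words.
        return entries + [("", 1.0 - total)]
    # Sum above 1.0, or below 1.0 with a rest entry: divide by the sum.
    return [(word, p / total) for word, p in entries]


print(normalize_probs([("the", 0.5), ("a", 0.25)]))
print(normalize_probs([("the", 0.75), ("a", 0.75)]))
print(normalize_probs([("the", 0.25), ("", 0.25)]))
```

Note that `looks_like_probs([0.0])` is false in this sketch, which matches the remark that a single value of 0.0 is read as a logprob.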
|] ++ (commonReadmeMDContents testName)
readmeMDContents CharMatch testName = [i|