From 9fc4beaba169eb9d443f17ea245861dbf1d06565 Mon Sep 17 00:00:00 2001
From: Filip Gralinski
Date: Tue, 15 May 2018 08:14:52 +0200
Subject: [PATCH] improve sample challenge for LogLossHashed

---
 src/GEval/CreateChallenge.hs | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/src/GEval/CreateChallenge.hs b/src/GEval/CreateChallenge.hs
index b0e53a5..3b31b3f 100644
--- a/src/GEval/CreateChallenge.hs
+++ b/src/GEval/CreateChallenge.hs
@@ -113,6 +113,35 @@ The metric is average log-loss calculated for 10-bit hashes.
 Train file is just a text file (one utterance per line).
 In an input file, left and right contexts (TAB-separated) are given.
 In an expected file, the word to be guessed is given.
+
+Format of the output files
+--------------------------
+
+For each input line, a probability distribution for the words in the gap
+must be given:
+
+    word1:logprob1 word2:logprob2 ... wordN:logprobN :logprob0
+
+where *logprobi* is the logarithm of the probability for *wordi* and
+*logprob0* is the logarithm of the probability mass for all the other
+words (it will be spread over all 1024 fingerprint values). If the
+respective probabilities do not sum up to 1, they will be normalised with
+softmax.
+
+Note: the separator here is a space, not a TAB!
+
+### Probs
+
+Probabilities can be given instead of logprobs:
+
+ * if **all** values look like probabilities and **at least one** value is positive,
+   we treat the values as probabilities rather than logprobs (a single value of 0.0
+   is treated as a logprob, i.e. probability 1.0!);
+ * if their sum is greater than 1.0, then we normalise simply by dividing by the sum;
+ * if the sum is smaller than 1.0 and there is no entry for all the other words,
+   we add such an entry for the missing probability mass;
+ * if the sum is smaller than 1.0 and there is an entry for all the other words,
+   we normalise by dividing by the sum.
 |] ++ (commonReadmeMDContents testName)
 
 readmeMDContents CharMatch testName = [i|
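
To make the output-file rules in the patch above concrete, here is a minimal Haskell sketch of how a single output line in this format could be parsed and turned into a distribution that sums up to 1. It is not GEval's actual implementation; the module and function names are hypothetical, and the heuristics simply follow the wording of the README fragment.

    -- Hypothetical sketch: parse one output line and normalise it
    -- according to the rules described in the README fragment above.
    module NormaliseLine where

    -- | One entry of the distribution: Just a word, or Nothing for the
    -- ":logprob0" entry covering all the other words.
    type Entry = (Maybe String, Double)

    -- | Split a line on spaces into "word:value" pairs; an empty word
    -- part stands for the probability mass of all the other words.
    parseLine :: String -> [Entry]
    parseLine = map parsePair . words
      where
        parsePair tok =
          let (w, rest) = break (== ':') tok
          in (if null w then Nothing else Just w, read (drop 1 rest))

    -- | Heuristic from the README: if all values lie in [0, 1] and at
    -- least one of them is positive, treat them as probabilities;
    -- otherwise treat them as logprobs.
    looksLikeProbs :: [Double] -> Bool
    looksLikeProbs vals = all (\v -> v >= 0.0 && v <= 1.0) vals && any (> 0.0) vals

    -- | Turn a parsed line into a proper probability distribution.
    normalise :: [Entry] -> [Entry]
    normalise entries
      | looksLikeProbs (map snd entries) = fixProbs entries
      | otherwise                        = softmax entries
      where
        -- Logprobs that do not sum up to 1 are normalised with softmax
        -- (for already-normalised logprobs this leaves them unchanged).
        softmax es =
          let z = sum [ exp v | (_, v) <- es ]
          in [ (w, exp v / z) | (w, v) <- es ]

        -- Probabilities: divide by the sum, or add the missing mass as
        -- an entry for all the other words, as in the list above.
        fixProbs es
          | s > 1.0                = divided
          | s < 1.0 && not hasRest = es ++ [(Nothing, 1.0 - s)]
          | s < 1.0 && hasRest     = divided
          | otherwise              = es
          where
            s       = sum (map snd es)
            hasRest = any ((== Nothing) . fst) es
            divided = [ (w, v / s) | (w, v) <- es ]

Under these assumptions, `normalise (parseLine "the:0.7 a:0.2")` would add a `:0.1` entry for the missing probability mass, while `normalise (parseLine "the:-0.7 a:-1.6 :-2.3")` would treat the values as logprobs and softmax them so that they sum up to 1.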