Merge remote-tracking branch 'origin/master' into improvement/#25-issue-when-build-under-linux
# Conflicts:
#	README.md
commit 3528fcbfca
@@ -1,4 +1,13 @@
 
+## 1.22.1.0
+
+* Add "Mean/" meta-metric (for the time being working only with MultiLabel-F-measure)
+* Add :S flag
+
+## 1.22.0.0
+
+* Add SegmentAccuracy
+
 ## 1.21.0.0
 
 * Add Probabilistic-MultiLabel-F-measure
README.md (64 lines changed)
@@ -2,7 +2,7 @@
 
 GEval is a Haskell library and a stand-alone tool for evaluating the
 results of solutions to machine learning challenges as defined in the
-[Gonito](https://gonito.net) platform. Also could be used outside the
+[Gonito](https://gonito.net) platform. Also, could be used outside the
 context of Gonito.net challenges, assuming the test data is given in
 simple TSV (tab-separated values) files.
@@ -14,6 +14,29 @@ The official repository is `git://gonito.net/geval`, browsable at
 
 ## Installing
 
+### The easy way: just download the fully static GEval binary
+
+(Assuming you have a 64-bit Linux.)
+
+    wget https://gonito.net/get/bin/geval
+    chmod u+x geval
+    ./geval --help
+
+#### On Windows
+
+For Windows, you should use Windows PowerShell.
+
+    wget https://gonito.net/get/bin/geval
+
+Next, go to the folder where you downloaded `geval` and right-click the `geval` file.
+Go to `Properties` and, in the `Security` section, grant full access to the folder.
+
+Or use `icacls "folder path to geval" /grant USER:<username>`
+
+This is a fully static binary; it should work on any 64-bit Linux or 64-bit Windows.
+
+### Build from scratch
+
 You need [Haskell Stack](https://github.com/commercialhaskell/stack).
 You could install Stack with your package manager or with:
@@ -36,6 +59,8 @@ order to run `geval` you need to either add `$HOME/.local/bin` to
 
     PATH="$HOME/.local/bin" geval ...
 
+On Windows, add a new global variable named `geval` whose value is the same path as above.
+
 ### Troubleshooting
 
 If you see a message like this:
@@ -64,15 +89,32 @@ In case the `lzma` package is not installed on your Linux, you need to run (assu
 
     sudo apt-get install pkg-config liblzma-dev libpq-dev libpcre3-dev libcairo2-dev libbz2-dev
 
-### Plan B — just download the GEval binary
-
-(Assuming you have a 64-bit Linux.)
-
-    wget https://gonito.net/get/bin/geval
-    chmod u+x geval
-    ./geval --help
-
-This is a fully static binary, it should work on any 64-bit Linux.
+#### Windows issues
+
+If you see this message on Windows when executing the `stack test` command:
+
+    In the dependencies for geval-1.21.1.0:
+        unix needed, but the stack configuration has no specified version
+    In the dependencies for lzma-0.0.0.3:
+        lzma-clib needed, but the stack configuration has no specified version
+
+you should replace `unix` with `unix-compat` in the `geval.cabal` file,
+because the `unix` package is not supported on Windows.
+
+You should also add `lzma-clib-5.2.2` and `unix-compat-0.5.2` to the `extra-deps` section of the `stack.yaml` file.
+
+If you see a message about missing pkg-config on Windows, download two packages from
+http://ftp.gnome.org/pub/gnome/binaries/win32/dependencies/:
+
+- pkg-config (the newest version)
+- gettext-runtime (the newest version)
+
+Extract the `pkg-config.exe` file into a directory on the Windows PATH.
+Extract the `intl.dll` file from gettext-runtime.
+
+You should also download the glib package from http://ftp.gnome.org/pub/gnome/binaries/win32/glib/2.28
+and extract the `libglib-2.0-0.dll` file.
+
+Put all these files in, for example, the `C:\MinGW\bin` directory.
 ## Quick tour
 
@@ -189,7 +231,7 @@ But why were double quotes so problematic in German-English
 translation?! Well, look at the second-worst feature — `''`
 in the _output_! Oops, it seems like a very stupid mistake with
 post-processing was done and no double quote was correctly generated,
-which decreased the score a little bit for each sentence in which the
+which decreased the score a little for each sentence in which the
 quote was expected.
 
 When I fixed this simple bug, the BLEU metric increased from 0.27358
@@ -502,9 +544,9 @@ submitted. The suggested way to do this is as follows:
 
    `test-A/expected.tsv` added. This branch should be accessible by
    Gonito platform, but should be kept “hidden” for regular users (or
    at least they should be kindly asked not to peek there). It is
-   recommended (though not obligatory) that this branch contain all
+   recommended (though not obligatory) that this branch contains all
    the source codes and data used to generate the train/dev/test sets.
-   (Use [git-annex](https://git-annex.branchable.com/) if you have really big files there.)
+   (Use [git-annex](https://git-annex.branchable.com/) if you have huge files there.)
 
    Branch (1) should be the parent of the branch (2), for instance, the
    repo (for the toy “planets” challenge) could be created as follows:
 
@@ -567,7 +609,7 @@ be nice and commit also your source codes.
 
    git push mine master
 
 Then let Gonito pull them and evaluate your results, either manually clicking
-"submit" at the Gonito web site or using `--submit` option (see below).
+"submit" at the Gonito website or using `--submit` option (see below).
 
 ### Submitting a solution to a Gonito platform with GEval
@@ -1,5 +1,5 @@
 name: geval
-version: 1.21.1.0
+version: 1.22.1.0
 synopsis: Machine learning evaluation tools
 description: Please see README.md
 homepage: http://github.com/name/project
@@ -4,11 +4,12 @@
 module GEval.Annotation
   (parseAnnotations, Annotation(..),
    parseObtainedAnnotations, ObtainedAnnotation(..),
-   matchScore, intSetParser)
+   matchScore, intSetParser, segmentAccuracy, parseSegmentAnnotations)
   where
 
 import qualified Data.IntSet as IS
 import qualified Data.Text as T
+import Data.Set (intersection, fromList)
 
 import Data.Attoparsec.Text
 import Data.Attoparsec.Combinator
 
@@ -17,11 +18,12 @@ import GEval.Common (sepByWhitespaces, (/.))
 import GEval.Probability
 import Data.Char
 import Data.Maybe (fromMaybe)
+import Data.Either (partitionEithers)
 
 import GEval.PrecisionRecall(weightedMaxMatching)
 
 data Annotation = Annotation T.Text IS.IntSet
-  deriving (Eq, Show)
+  deriving (Eq, Show, Ord)
 
 data ObtainedAnnotation = ObtainedAnnotation Annotation Double
   deriving (Eq, Show)
@@ -52,6 +54,36 @@ obtainedAnnotationParser = do
 parseAnnotations :: T.Text -> Either String [Annotation]
 parseAnnotations t = parseOnly (annotationsParser <* endOfInput) t
 
+parseSegmentAnnotations :: T.Text -> Either String [Annotation]
+parseSegmentAnnotations t = case parseAnnotationsWithColons t of
+  Left m -> Left m
+  Right annotations -> if areSegmentsDisjoint annotations
+                       then (Right annotations)
+                       else (Left "Overlapping segments")
+
+areSegmentsDisjoint :: [Annotation] -> Bool
+areSegmentsDisjoint = areIntSetsDisjoint . map (\(Annotation _ s) -> s)
+
+areIntSetsDisjoint :: [IS.IntSet] -> Bool
+areIntSetsDisjoint ss = snd $ foldr step (IS.empty, True) ss
+  where step _ w@(_, False) = w
+        step s (u, True) = (s `IS.union` u, s `IS.disjoint` u)
+
+-- unfortunately, attoparsec does not seem to back-track properly,
+-- so we need a special function if labels can contain colons
+parseAnnotationsWithColons :: T.Text -> Either String [Annotation]
+parseAnnotationsWithColons t = case partitionEithers (map parseAnnotationWithColons $ T.words t) of
+  ([], annotations) -> Right annotations
+  ((firstProblem:_), _) -> Left firstProblem
+
+parseAnnotationWithColons :: T.Text -> Either String Annotation
+parseAnnotationWithColons t = if T.null label
+  then Left "Colon expected"
+  else case parseOnly (intSetParser <* endOfInput) position of
+    Left m -> Left m
+    Right s -> Right (Annotation (T.init label) s)
+  where (label, position) = T.breakOnEnd ":" t
+
 annotationsParser :: Parser [Annotation]
 annotationsParser = sepByWhitespaces annotationParser
 
@@ -70,3 +102,7 @@ intervalParser = do
   startIx <- decimal
   endIx <- (string "-" *> decimal <|> pure startIx)
   pure $ IS.fromList [startIx..endIx]
+
+segmentAccuracy :: [Annotation] -> [Annotation] -> Double
+segmentAccuracy expected output = (fromIntegral $ length matched) / (fromIntegral $ length expected)
+  where matched = (fromList expected) `intersection` (fromList output)
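Outside the library, the segment-accuracy logic added above can be sketched as a self-contained snippet. This is an illustration, not GEval's own code: the `Seg` type and the `seg` helper are hypothetical stand-ins for GEval's `Annotation` machinery (it assumes `containers` with `IS.disjoint`, i.e. 0.5.11 or newer).

```haskell
import qualified Data.IntSet as IS
import qualified Data.Set as S

-- Stand-in for GEval's Annotation: a label plus the positions it covers.
data Seg = Seg String IS.IntSet deriving (Eq, Ord, Show)

-- Build a segment from a label and an inclusive position span.
seg :: String -> Int -> Int -> Seg
seg label from to = Seg label (IS.fromList [from .. to])

-- Fraction of expected segments that occur verbatim in the output.
segmentAccuracy :: [Seg] -> [Seg] -> Double
segmentAccuracy expected output =
  fromIntegral (S.size matched) / fromIntegral (length expected)
  where matched = S.fromList expected `S.intersection` S.fromList output

-- The segments of one item must not overlap position-wise.
areSegmentsDisjoint :: [Seg] -> Bool
areSegmentsDisjoint = go IS.empty . map (\(Seg _ s) -> s)
  where go _ [] = True
        go u (s : ss) = s `IS.disjoint` u && go (s `IS.union` u) ss
```

For the toy challenge in this commit, comparing the expected `N:1-6 N:8-10 V:12-13 A:15-17` against the output `N:1-4 V:5-6 N:8-10 V:12-13 A:15-17` matches 3 of 4 expected segments, i.e. 0.75 for that item.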
@@ -492,6 +492,23 @@ gevalCoreOnSources CharMatch inputLineSource = helper inputLineSource
 gevalCoreOnSources (LogLossHashed nbOfBits) _ = helperLogLossHashed nbOfBits id
 gevalCoreOnSources (LikelihoodHashed nbOfBits) _ = helperLogLossHashed nbOfBits logLossToLikehood
 
+gevalCoreOnSources (Mean (MultiLabelFMeasure beta)) _
+  = gevalCoreWithoutInputOnItemTargets (Right . intoWords)
+                                       (Right . getWords)
+                                       ((fMeasureOnCounts beta) . (getCounts (==)))
+                                       averageC
+                                       id
+                                       noGraph
+  where
+    -- repeated as below, as it will be refactored into dependent types soon anyway
+    getWords (RawItemTarget t) = Prelude.map unpack $ selectByStandardThreshold $ parseIntoProbList t
+    getWords (PartiallyParsedItemTarget ts) = Prelude.map unpack ts
+    intoWords (RawItemTarget t) = Prelude.map unpack $ Data.Text.words t
+    intoWords (PartiallyParsedItemTarget ts) = Prelude.map unpack ts
+
+gevalCoreOnSources (Mean _) _ = error $ "Mean/ meta-metric defined only for MultiLabel-F1 for the time being"
+
 -- only MultiLabel-F1 handled for JSONs for the time being...
 gevalCoreOnSources (MultiLabelFMeasure beta) _ = gevalCoreWithoutInputOnItemTargets (Right . intoWords)
                                                                                    (Right . getWords)
 
@@ -706,6 +723,13 @@ gevalCoreOnSources TokenAccuracy _ = gevalCoreWithoutInput intoTokens
       | otherwise = (h, t + 1)
     hitsAndTotalsAgg = CC.foldl (\(h1, t1) (h2, t2) -> (h1 + h2, t1 + t2)) (0, 0)
 
+gevalCoreOnSources SegmentAccuracy _ = gevalCoreWithoutInput parseSegmentAnnotations
+                                                             parseSegmentAnnotations
+                                                             (uncurry segmentAccuracy)
+                                                             averageC
+                                                             id
+                                                             noGraph
+
 gevalCoreOnSources MultiLabelLogLoss _ = gevalCoreWithoutInput intoWords
                                                                (Right . parseIntoProbList)
                                                                (uncurry countLogLossOnProbList)
@@ -55,6 +55,7 @@ createFile filePath contents = do
   writeFile filePath contents
 
 readmeMDContents :: Metric -> String -> String
+readmeMDContents (Mean metric) testName = readmeMDContents metric testName
 readmeMDContents GLEU testName = readmeMDContents BLEU testName
 readmeMDContents BLEU testName = [i|
 GEval sample machine translation challenge
 
@@ -297,6 +298,19 @@ in the expected file (but not in the output file).
 
 |] ++ (commonReadmeMDContents testName)
 
+readmeMDContents SegmentAccuracy testName = [i|
+Segment a sentence and tag with POS tags
+========================================
+
+This is a sample, toy challenge for SegmentAccuracy.
+
+For each sentence, give a sequence of POS tags, each one with
+its position (1-indexed). For instance, `N:1-10` means a noun
+starting from the beginning (the first character) up to the tenth
+character (inclusive).
+
+|] ++ (commonReadmeMDContents testName)
+
 readmeMDContents (ProbabilisticMultiLabelFMeasure beta) testName = readmeMDContents (MultiLabelFMeasure beta) testName
 readmeMDContents (MultiLabelFMeasure beta) testName = [i|
 Tag names and their component
@@ -400,6 +414,7 @@ configContents schemes precision testName = unwords (Prelude.map (\scheme -> ("-
 precisionOpt (Just p) = " --precision " ++ (show p)
 
 trainContents :: Metric -> String
+trainContents (Mean metric) = trainContents metric
 trainContents GLEU = trainContents BLEU
 trainContents BLEU = [hereLit|alussa loi jumala taivaan ja maan he mea hanga na te atua i te timatanga te rangi me te whenua
 ja maa oli autio ja tyhjä , ja pimeys oli syvyyden päällä a kahore he ahua o te whenua , i takoto kau ; he pouri ano a runga i te mata o te hohonu
 
@@ -473,6 +488,9 @@ B-firstname/JOHN I-surname/VON I-surname/NEUMANN John von Nueman
 trainContents TokenAccuracy = [hereLit|* V N I like cats
 * * V * N I can see the rainbow
 |]
+trainContents SegmentAccuracy = [hereLit|Art:1-3 N:5-11 V:12-13 A:15-19 The student's smart
+N:1-6 N:8-10 V:12-13 A:15-18 Mary's dog is nice
+|]
 trainContents (ProbabilisticMultiLabelFMeasure beta) = trainContents (MultiLabelFMeasure beta)
 trainContents (MultiLabelFMeasure _) = [hereLit|I know Mr John Smith person/3,4,5 first-name/4 surname/5
 Steven bloody Brown person/1,3 first-name/1 surname/3
@@ -494,6 +512,7 @@ trainContents _ = [hereLit|0.06 0.39 0 0.206
 |]
 
 devInContents :: Metric -> String
+devInContents (Mean metric) = devInContents metric
 devInContents GLEU = devInContents BLEU
 devInContents BLEU = [hereLit|ja jumala sanoi : " tulkoon valkeus " , ja valkeus tuli
 ja jumala näki , että valkeus oli hyvä ; ja jumala erotti valkeuden pimeydestä
 
@@ -540,6 +559,9 @@ Mr Jan Kowalski
 devInContents TokenAccuracy = [hereLit|The cats on the mat
 Ala has a cat
 |]
+devInContents SegmentAccuracy = [hereLit|John is smart
+Mary's intelligent
+|]
 devInContents (ProbabilisticMultiLabelFMeasure beta) = devInContents (MultiLabelFMeasure beta)
 devInContents (MultiLabelFMeasure _) = [hereLit|Jan Kowalski is here
 I see him
@@ -558,6 +580,7 @@ devInContents _ = [hereLit|0.72 0 0.007
 |]
 
 devExpectedContents :: Metric -> String
+devExpectedContents (Mean metric) = devExpectedContents metric
 devExpectedContents GLEU = devExpectedContents BLEU
 devExpectedContents BLEU = [hereLit|a ka ki te atua , kia marama : na ka marama
 a ka kite te atua i te marama , he pai : a ka wehea e te atua te marama i te pouri
 
@@ -604,6 +627,9 @@ O B-firstname/JAN B-surname/KOWALSKI
 devExpectedContents TokenAccuracy = [hereLit|* N * * N
 N V * N
 |]
+devExpectedContents SegmentAccuracy = [hereLit|N:1-4 V:6-7 A:9-13
+N:1-4 V:6-7 A:9-19
+|]
 devExpectedContents (ProbabilisticMultiLabelFMeasure beta) = devExpectedContents (MultiLabelFMeasure beta)
 devExpectedContents (MultiLabelFMeasure _) = [hereLit|person/1,2 first-name/1 surname/2
@@ -624,7 +650,9 @@ devExpectedContents _ = [hereLit|0.82
 |]
 
 testInContents :: Metric -> String
-testInContents GLEU = testInContents BLEU
+testInContents (Mean metric) = testInContents metric
+testInContents GLEU = [hereLit|Alice has a black
+|]
 testInContents BLEU = [hereLit|ja jumala kutsui valkeuden päiväksi , ja pimeyden hän kutsui yöksi
 ja tuli ehtoo , ja tuli aamu , ensimmäinen päivä
 |]
 
@@ -672,6 +700,9 @@ No name here
 testInContents TokenAccuracy = [hereLit|I have cats
 I know
 |]
+testInContents SegmentAccuracy = [hereLit|Mary's cat is old
+John is young
+|]
 testInContents (ProbabilisticMultiLabelFMeasure beta) = testInContents (MultiLabelFMeasure beta)
 testInContents (MultiLabelFMeasure _) = [hereLit|John bloody Smith
 Nobody is there
@@ -690,7 +721,7 @@ testInContents _ = [hereLit|0.72 0 0.007
 |]
 
 testExpectedContents :: Metric -> String
-testExpectedContents GLEU = testExpectedContents BLEU
+testExpectedContents (Mean metric) = testExpectedContents metric
 testExpectedContents BLEU = [hereLit|na ka huaina e te atua te marama ko te awatea , a ko te pouri i huaina e ia ko te po
 a ko te ahiahi , ko te ata , he ra kotahi
 |]
 
@@ -738,6 +769,9 @@ O O O
 testExpectedContents TokenAccuracy = [hereLit|* V N
 * V
 |]
+testExpectedContents SegmentAccuracy = [hereLit|N:1-6 N:8-10 V:12-13 A:15-17
+N:1-4 V:6-7 A:9-13
+|]
 testExpectedContents (ProbabilisticMultiLabelFMeasure beta) = testExpectedContents (MultiLabelFMeasure beta)
 testExpectedContents (MultiLabelFMeasure _) = [hereLit|person/1,3 first-name/1 surname/3
@@ -753,10 +787,13 @@ bar:1/50,50,1000,1000
 testExpectedContents ClippEU = [hereLit|3/0,0,100,100/10
 1/10,10,1000,1000/10
 |]
+testExpectedContents GLEU = [hereLit|Alice has a black cat
+|]
 testExpectedContents _ = [hereLit|0.11
 17.2
 |]
 
 gitignoreContents :: String
 gitignoreContents = [hereLit|
 *~
@@ -6,8 +6,8 @@ import GEval.Metric
 
 import Text.Regex.PCRE.Heavy
 import Text.Regex.PCRE.Light.Base (Regex(..))
-import Data.Text (Text(..), concat, toLower, toUpper, pack, unpack)
-import Data.List (intercalate, break)
+import Data.Text (Text(..), concat, toLower, toUpper, pack, unpack, words, unwords)
+import Data.List (intercalate, break, sort)
 import Data.Either
 import Data.Maybe (fromMaybe)
 import qualified Data.ByteString.UTF8 as BSU
 
@@ -16,7 +16,7 @@ import qualified Data.ByteString.UTF8 as BSU
 data EvaluationScheme = EvaluationScheme Metric [PreprocessingOperation]
   deriving (Eq)
 
-data PreprocessingOperation = RegexpMatch Regex | LowerCasing | UpperCasing | SetName Text
+data PreprocessingOperation = RegexpMatch Regex | LowerCasing | UpperCasing | Sorting | SetName Text
   deriving (Eq)
 
 leftParameterBracket :: Char
 
@@ -39,6 +39,8 @@ readOps ('l':theRest) = (LowerCasing:ops, theRest')
 readOps ('u':theRest) = (UpperCasing:ops, theRest')
   where (ops, theRest') = readOps theRest
 readOps ('m':theRest) = handleParametrizedOp (RegexpMatch . (fromRight undefined) . ((flip compileM) []) . BSU.fromString) theRest
+readOps ('S':theRest) = (Sorting:ops, theRest')
+  where (ops, theRest') = readOps theRest
 readOps ('N':theRest) = handleParametrizedOp (SetName . pack) theRest
 readOps s = ([], s)
 
@@ -70,6 +72,7 @@ instance Show PreprocessingOperation where
   show (RegexpMatch (Regex _ regexp)) = parametrizedOperation "m" (BSU.toString regexp)
   show LowerCasing = "l"
   show UpperCasing = "u"
+  show Sorting = "S"
   show (SetName t) = parametrizedOperation "N" (unpack t)
 
 parametrizedOperation :: String -> String -> String
 
@@ -82,4 +85,5 @@ applyPreprocessingOperation :: PreprocessingOperation -> Text -> Text
 applyPreprocessingOperation (RegexpMatch regex) = Data.Text.concat . (map fst) . (scan regex)
 applyPreprocessingOperation LowerCasing = toLower
 applyPreprocessingOperation UpperCasing = toUpper
+applyPreprocessingOperation Sorting = Data.Text.unwords . sort . Data.Text.words
 applyPreprocessingOperation (SetName _) = id
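The effect of the new `:S` flag can be reproduced in isolation. This is a sketch of the same transformation; `applySorting` is a hypothetical name (the library applies it via `applyPreprocessingOperation Sorting`):

```haskell
import Data.List (sort)
import qualified Data.Text as T

-- What the :S (Sorting) flag does to each line before comparison:
-- split on whitespace, sort the tokens, re-join with single spaces.
applySorting :: T.Text -> T.Text
applySorting = T.unwords . sort . T.words
```

Under `--metric Accuracy:S`, the output line `foo baz bar` and the expected line `bar baz foo` both normalize to `bar baz foo`, so they count as a hit even though the token order differs.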
@@ -26,9 +26,14 @@ import Data.Attoparsec.Text (parseOnly)
 
 data Metric = RMSE | MSE | Pearson | Spearman | BLEU | GLEU | WER | Accuracy | ClippEU
   | FMeasure Double | MacroFMeasure Double | NMI
   | LogLossHashed Word32 | CharMatch | MAP | LogLoss | Likelihood
-  | BIOF1 | BIOF1Labels | TokenAccuracy | LikelihoodHashed Word32 | MAE | SMAPE | MultiLabelFMeasure Double
+  | BIOF1 | BIOF1Labels | TokenAccuracy | SegmentAccuracy | LikelihoodHashed Word32 | MAE | SMAPE | MultiLabelFMeasure Double
   | MultiLabelLogLoss | MultiLabelLikelihood
-  | SoftFMeasure Double | ProbabilisticMultiLabelFMeasure Double | ProbabilisticSoftFMeasure Double | Soft2DFMeasure Double
+  | SoftFMeasure Double | ProbabilisticMultiLabelFMeasure Double
+  | ProbabilisticSoftFMeasure Double | Soft2DFMeasure Double
+  -- it would be better to avoid infinite recursion here;
+  -- `Mean (Mean BLEU)` is not useful, but as it would mean
+  -- a larger refactor, we will postpone this
+  | Mean Metric
   deriving (Eq)
 
 instance Show Metric where
@@ -67,13 +72,18 @@ instance Show Metric where
   show BIOF1 = "BIO-F1"
   show BIOF1Labels = "BIO-F1-Labels"
   show TokenAccuracy = "TokenAccuracy"
+  show SegmentAccuracy = "SegmentAccuracy"
   show MAE = "MAE"
   show SMAPE = "SMAPE"
   show (MultiLabelFMeasure beta) = "MultiLabel-F" ++ (show beta)
   show MultiLabelLogLoss = "MultiLabel-Logloss"
   show MultiLabelLikelihood = "MultiLabel-Likelihood"
+  show (Mean metric) = "Mean/" ++ (show metric)
 
 instance Read Metric where
+  readsPrec p ('M':'e':'a':'n':'/':theRest) = case readsPrec p theRest of
+    [(metric, theRest)] -> [(Mean metric, theRest)]
+    _ -> []
   readsPrec _ ('R':'M':'S':'E':theRest) = [(RMSE, theRest)]
   readsPrec _ ('M':'S':'E':theRest) = [(MSE, theRest)]
   readsPrec _ ('P':'e':'a':'r':'s':'o':'n':theRest) = [(Pearson, theRest)]
 
@@ -118,6 +128,7 @@ instance Read Metric where
   readsPrec _ ('B':'I':'O':'-':'F':'1':'-':'L':'a':'b':'e':'l':'s':theRest) = [(BIOF1Labels, theRest)]
   readsPrec _ ('B':'I':'O':'-':'F':'1':theRest) = [(BIOF1, theRest)]
   readsPrec _ ('T':'o':'k':'e':'n':'A':'c':'c':'u':'r':'a':'c':'y':theRest) = [(TokenAccuracy, theRest)]
+  readsPrec _ ('S':'e':'g':'m':'e':'n':'t':'A':'c':'c':'u':'r':'a':'c':'y':theRest) = [(SegmentAccuracy, theRest)]
   readsPrec _ ('M':'A':'E':theRest) = [(MAE, theRest)]
   readsPrec _ ('S':'M':'A':'P':'E':theRest) = [(SMAPE, theRest)]
   readsPrec _ ('M':'u':'l':'t':'i':'L':'a':'b':'e':'l':'-':'L':'o':'g':'L':'o':'s':'s':theRest) = [(MultiLabelLogLoss, theRest)]
@@ -154,11 +165,13 @@ getMetricOrdering Likelihood = TheHigherTheBetter
 getMetricOrdering BIOF1 = TheHigherTheBetter
 getMetricOrdering BIOF1Labels = TheHigherTheBetter
 getMetricOrdering TokenAccuracy = TheHigherTheBetter
+getMetricOrdering SegmentAccuracy = TheHigherTheBetter
 getMetricOrdering MAE = TheLowerTheBetter
 getMetricOrdering SMAPE = TheLowerTheBetter
 getMetricOrdering (MultiLabelFMeasure _) = TheHigherTheBetter
 getMetricOrdering MultiLabelLogLoss = TheLowerTheBetter
 getMetricOrdering MultiLabelLikelihood = TheHigherTheBetter
+getMetricOrdering (Mean metric) = getMetricOrdering metric
 
 bestPossibleValue :: Metric -> MetricValue
 bestPossibleValue metric = case getMetricOrdering metric of
 
@@ -166,18 +179,21 @@ bestPossibleValue metric = case getMetricOrdering metric of
   TheHigherTheBetter -> 1.0
 
 fixedNumberOfColumnsInExpected :: Metric -> Bool
+fixedNumberOfColumnsInExpected (Mean metric) = fixedNumberOfColumnsInExpected metric
 fixedNumberOfColumnsInExpected MAP = False
 fixedNumberOfColumnsInExpected BLEU = False
 fixedNumberOfColumnsInExpected GLEU = False
 fixedNumberOfColumnsInExpected _ = True
 
 fixedNumberOfColumnsInInput :: Metric -> Bool
+fixedNumberOfColumnsInInput (Mean metric) = fixedNumberOfColumnsInInput metric
 fixedNumberOfColumnsInInput (SoftFMeasure _) = False
 fixedNumberOfColumnsInInput (ProbabilisticSoftFMeasure _) = False
 fixedNumberOfColumnsInInput (Soft2DFMeasure _) = False
 fixedNumberOfColumnsInInput _ = True
 
 perfectOutLineFromExpectedLine :: Metric -> Text -> Text
+perfectOutLineFromExpectedLine (Mean metric) t = perfectOutLineFromExpectedLine metric t
 perfectOutLineFromExpectedLine (LogLossHashed _) t = t <> ":1.0"
 perfectOutLineFromExpectedLine (LikelihoodHashed _) t = t <> ":1.0"
 perfectOutLineFromExpectedLine BLEU t = getFirstColumn t
@@ -48,6 +48,7 @@ listOfAvailableMetrics = [RMSE,
   MultiLabelFMeasure 1.0,
   MultiLabelFMeasure 2.0,
   MultiLabelFMeasure 0.25,
+  Mean (MultiLabelFMeasure 1.0),
   ProbabilisticMultiLabelFMeasure 1.0,
   ProbabilisticMultiLabelFMeasure 2.0,
   ProbabilisticMultiLabelFMeasure 0.25,
 
@@ -63,6 +64,7 @@ listOfAvailableMetrics = [RMSE,
   BIOF1,
   BIOF1Labels,
   TokenAccuracy,
+  SegmentAccuracy,
   SoftFMeasure 1.0,
   SoftFMeasure 2.0,
   SoftFMeasure 0.25,
 
@@ -93,6 +95,8 @@ isMetricDescribed :: Metric -> Bool
 isMetricDescribed (SoftFMeasure _) = True
 isMetricDescribed (Soft2DFMeasure _) = True
 isMetricDescribed (ProbabilisticMultiLabelFMeasure _) = True
+isMetricDescribed GLEU = True
+isMetricDescribed SegmentAccuracy = True
 isMetricDescribed _ = False
 
 getEvaluationSchemeDescription :: EvaluationScheme -> String
@@ -118,8 +122,26 @@ where calibration measures the quality of probabilities (how well they are calib
 if we have 10 items with probability 0.5 and 5 of them are correct, then the calibration
 is perfect.
 |]
 
+getMetricDescription GLEU =
+  [i|For the GLEU score, we record all sub-sequences of
+1, 2, 3 or 4 tokens in the output and target sequences (n-grams). We then
+compute a recall, which is the ratio of the number of matching n-grams
+to the number of total n-grams in the target (ground truth) sequence,
+and a precision, which is the ratio of the number of matching n-grams
+to the number of total n-grams in the generated output sequence. The
+GLEU score is then simply the minimum of recall and precision. The GLEU
+score's range is always between 0 (no matches) and 1 (all match) and
+it is symmetrical when switching output and target. According to
+the article, the GLEU score correlates quite well with the BLEU
+metric on the corpus level but does not have its drawbacks for our
+per-sentence reward objective.
+See: https://arxiv.org/pdf/1609.08144.pdf
+|]
+getMetricDescription SegmentAccuracy =
+  [i|Accuracy counted for segments, i.e. labels with positions.
+The percentage of labels in the ground truth retrieved in the actual output is returned.
+Accuracy is calculated separately for each item and then averaged.
+|]
 
 outContents :: Metric -> String
 outContents (SoftFMeasure _) = [hereLit|inwords:1-4
@@ -132,6 +154,11 @@ outContents (ProbabilisticMultiLabelFMeasure _) = [hereLit|first-name/1:0.8 surn
 surname/1:0.4
 first-name/3:0.9
 |]
+outContents GLEU = [hereLit|Alice has a black
+|]
+outContents SegmentAccuracy = [hereLit|N:1-4 V:5-6 N:8-10 V:12-13 A:15-17
+N:1-4 V:6-7 A:9-13
+|]
 
 expectedScore :: EvaluationScheme -> MetricValue
 expectedScore (EvaluationScheme (SoftFMeasure beta) [])
@@ -146,6 +173,10 @@ expectedScore (EvaluationScheme (ProbabilisticMultiLabelFMeasure beta) [])
   = let precision = 0.6569596940847289
         recall = 0.675
     in weightedHarmonicMean beta precision recall
+expectedScore (EvaluationScheme GLEU [])
+  = 0.7142857142857143
+expectedScore (EvaluationScheme SegmentAccuracy [])
+  = 0.875
 
 helpMetricParameterMetricsList :: String
 helpMetricParameterMetricsList = intercalate ", " $ map (\s -> (show s) ++ (case extraInfo s of
@@ -194,7 +225,15 @@ the form LABEL:PAGE/X0,Y0,X1,Y1 where LABEL is any label, page is the page numbe
 formatDescription (ProbabilisticMultiLabelFMeasure _) = [hereLit|In each line a number of labels (entities) can be given. A label probability
 can be provided with a colon (e.g. "foo:0.7"). By default, 1.0 is assumed.
 |]
+formatDescription GLEU = [hereLit|In each line there is a space-separated sequence of words.
+|]
+formatDescription SegmentAccuracy = [hereLit|Labels can be any strings (without spaces), whereas the position is a list of
+1-based indexes or spans separated by commas (spans are inclusive
+ranges, e.g. "10-14"). For instance, "foo:bar:2,4-7,10" is the
+label "foo:bar" for positions 2, 4, 5, 6, 7 and 10. Note that no
+overlapping segments can be returned (evaluation will fail in
+such a case).
+|]
 
 scoreExplanation :: EvaluationScheme -> Maybe String
 scoreExplanation (EvaluationScheme (SoftFMeasure _) [])
 scoreExplanation :: EvaluationScheme -> Maybe String
 scoreExplanation (EvaluationScheme (SoftFMeasure _) [])
 
@@ -206,6 +245,17 @@ As far as the second item is concerned, the total area that covered by the outpu
 Hence, recall is 247500/902500=0.274 and precision - 247500/(20000+912000+240000)=0.211. Therefore, the F-score
 for the second item is 0.238 and the F-score for the whole set is (0 + 0.238)/2 = 0.119.|]
 scoreExplanation (EvaluationScheme (ProbabilisticMultiLabelFMeasure _) []) = Nothing
+scoreExplanation (EvaluationScheme GLEU [])
+  = Just [hereLit|To find the GLEU score, we first count the number of tp (true positives), fp (false positives) and fn (false negatives).
+We have 4 matching unigrams ("Alice", "has", "a", "black"), 3 bigrams ("Alice has", "has a", "a black"), 2 trigrams ("Alice has a", "has a black") and 1 tetragram ("Alice has a black"),
+so tp=10. We have no fp, therefore fp=0. There are 4 fn ("cat", "black cat", "a black cat", "has a black cat").
+Now we have to calculate precision and recall:
+precision is tp / (tp+fp) = 10/(10+0) = 1,
+recall is tp / (tp+fn) = 10 / (10+4) = 10/14 =~ 0.71428...
+The GLEU score is min(precision, recall) = 0.71428.|]
+scoreExplanation (EvaluationScheme SegmentAccuracy [])
+  = Just [hereLit|Out of the 4 segments in the expected output for the first item, 3 were retrieved correctly (accuracy is 3/4=0.75).
+The second item was retrieved perfectly (accuracy is 1.0). Hence, the average is (0.75+1.0)/2=0.875.|]
 
 pasteLines :: String -> String -> String
 pasteLines a b = printf "%-35s %s\n" a b
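The GLEU computation described above (the minimum of n-gram precision and recall over 1- to 4-grams) can be sketched as a stand-alone function. This is a simplified illustration, not the code GEval uses:

```haskell
import qualified Data.Map.Strict as M

-- Multiset of all 1..4-grams of a token sequence.
ngramCounts :: Ord a => [a] -> M.Map [a] Int
ngramCounts ws = M.fromListWith (+)
  [ (take n (drop i ws), 1)
  | n <- [1 .. 4], i <- [0 .. length ws - n] ]

-- GLEU: clipped n-gram matches over output n-grams (precision)
-- and over reference n-grams (recall); the score is their minimum.
gleu :: Ord a => [a] -> [a] -> Double
gleu reference output = min precision recall
  where refCounts = ngramCounts reference
        outCounts = ngramCounts output
        matches = sum $ M.elems $ M.intersectionWith min refCounts outCounts
        total = fromIntegral . sum . M.elems
        precision = fromIntegral matches / total outCounts
        recall = fromIntegral matches / total refCounts
```

For the worked example in the score explanation, `gleu (words "Alice has a black cat") (words "Alice has a black")` gives min(10/10, 10/14), i.e. the 0.7142857142857143 recorded in `expectedScore`.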
test/Spec.hs (13 lines changed)
@@ -127,6 +127,8 @@ main = hspec $ do
       runGEvalTest "accuracy-simple" `shouldReturnAlmost` 0.6
     it "with probs" $
       runGEvalTest "accuracy-probs" `shouldReturnAlmost` 0.4
+    it "sorted" $
+      runGEvalTest "accuracy-on-sorted" `shouldReturnAlmost` 0.75
   describe "F-measure" $ do
     it "simple example" $
       runGEvalTest "f-measure-simple" `shouldReturnAlmost` 0.57142857
 
@@ -146,6 +148,9 @@ main = hspec $ do
   describe "TokenAccuracy" $ do
     it "simple example" $ do
       runGEvalTest "token-accuracy-simple" `shouldReturnAlmost` 0.5
+  describe "SegmentAccuracy" $ do
+    it "simple test" $ do
+      runGEvalTest "segment-accuracy-simple" `shouldReturnAlmost` 0.4444444
   describe "precision count" $ do
     it "simple test" $ do
       precisionCount [["Alice", "has", "a", "cat"]] ["Ala", "has", "cat"] `shouldBe` 2
 
@@ -323,6 +328,9 @@ main = hspec $ do
       runGEvalTest "multilabel-f1-with-probs" `shouldReturnAlmost` 0.615384615384615
     it "labels given with probs and numbers" $ do
       runGEvalTest "multilabel-f1-with-probs-and-numbers" `shouldReturnAlmost` 0.6666666666666
+  describe "Mean/MultiLabel-F" $ do
+    it "simple" $ do
+      runGEvalTest "mean-multilabel-f1-simple" `shouldReturnAlmost` 0.5
   describe "MultiLabel-Likelihood" $ do
     it "simple" $ do
       runGEvalTest "multilabel-likelihood-simple" `shouldReturnAlmost` 0.115829218528827
 
@@ -342,6 +350,11 @@ main = hspec $ do
     it "just parse" $ do
       parseAnnotations "foo:3,7-10 baz:4-6" `shouldBe` Right [Annotation "foo" (IS.fromList [3,7,8,9,10]),
                                                               Annotation "baz" (IS.fromList [4,5,6])]
+    it "just parse with colons" $ do
+      parseSegmentAnnotations "foo:x:3,7-10 baz:4-6" `shouldBe` Right [Annotation "foo:x" (IS.fromList [3,7,8,9,10]),
+                                                                      Annotation "baz" (IS.fromList [4,5,6])]
+    it "detect overlapping segments" $ do
+      parseSegmentAnnotations "foo:x:3,7-10 baz:2-6" `shouldBe` Left "Overlapping segments"
     it "just parse 2" $ do
       parseAnnotations "inwords:1-3 indigits:5" `shouldBe` Right [Annotation "inwords" (IS.fromList [1,2,3]),
                                                                   Annotation "indigits" (IS.fromList [5])]
@@ -0,0 +1,4 @@
+foo baz bar
+
+xyz aaa
+2 a:1 3
test/accuracy-on-sorted/accuracy-on-sorted/config.txt (new file)
@@ -0,0 +1 @@
+--metric Accuracy:S
@@ -0,0 +1,4 @@
+bar baz foo
+
+xyz
+a:1 2 3
@@ -0,0 +1,4 @@
+foo bar baz
+uuu
+foo bar baz
+qqq aaa
@@ -0,0 +1 @@
+--metric Mean/MultiLabel-F1
@@ -0,0 +1,4 @@
+foo bar baz
+
+foo
+qqq qqq
@@ -0,0 +1,3 @@
+foo:0 baq:1-2 baz:3
+aaa:0-1
+xyz:0 bbb:x:1
@@ -0,0 +1 @@
+--metric SegmentAccuracy
@@ -0,0 +1,3 @@
+foo:0 bar:1-2 baz:3
+aaa:0-2
+xyz:0 bbb:x:1 ccc:x:2