Update README

2020-02-18 21:01:39 +01:00 · 2020-02-18 21:01:39 +01:00 · 561f568437
commit 561f568437
parent ccd6d919da
1 changed files with 86 additions and 26 deletions
--- a/README.md
+++ b/README.md
@@ -499,17 +499,13 @@ have the following structure:
  non-standard test subdirectory)
 * `train/` — subdirectory with training data (if training data are
  supplied for a given Gonito challenge at all)
-* `train/train.tsv` — the usual name of the training data file (this
-  name is not required and could be more than one file), the first
-  column is the target (predicted) value, the other columns represent
-  features, no header is assumed
+* `train/in.tsv` — the input data for the training set
+* `train/expected.tsv` — the target values
 * `dev-0/` — subdirectory with a development set (a sample test set,
  which won't be used for the final evaluation)
 * `dev-0/in.tsv` — input data (the same format as `train/train.tsv`,
  but without the first column)
-* `dev-0/expected.tsv` — values to be guessed (note that `paste
-  dev-0/expected.tsv dev-0/in.tsv` should give the same format as
-  `train/train.tsv`)
+* `dev-0/expected.tsv` — values to be guessed
 * `dev-1/`, `dev-2`, ... — other dev sets (if supplied)
 * `test-A/` — subdirectory with the test set
 * `test-A/in.tsv` — test input (the same format as `dev-0/in.tsv`)
@@ -523,45 +519,109 @@ have the following structure:

 You can use `geval` to initiate a [Gonito](https://gonito.net) challenge:

-    geval --init --expected-directory my-challenge
+    geval --init --expected-directory my-challenge --metric RMSE

 (This will generate a sample toy challenge about guessing planet masses).

-A metric (other than the default `RMSE` — root-mean-square error) can
+Of course, any other metric can
 be given to generate another type of toy challenge:

    geval --init --expected-directory my-machine-translation-challenge --metric BLEU

 ### Preparing a Git repository

-[Gonito](https://gonito.net) platform expects a Git repository with a challenge to be
-submitted. The suggested way to do this is as follows:
+The [Gonito](https://gonito.net) platform expects a Git repository with a
+challenge to be submitted. The suggested way to do this is
+presented as a [Makefile](https://en.wikipedia.org/wiki/Makefile), but
+you could use any other scripting language, and the commands
+should be clear if you know Bash and some basic facts about Makefiles:

-1. Prepare a branch with all the files _without_
-   `test-A/expected.tsv`. This branch will be cloned by people taking
+* a Makefile consists of rules; each rule specifies how to build a _target_ out of _dependencies_ using
+  shell commands
+* `$@` is the (first) target, whereas `$<` — the first dependency
+* the indentation should be done with TABs, not spaces!
+
+```
+SHELL=/bin/bash
+
+# do not delete intermediate files
+.SECONDARY:
+
+# the directory where the challenge will be created
+output_directory=...
+
+# let's define which files are necessary, other files will be created if needed
+all: $(output_directory)/train/in.tsv.xz $(output_directory)/train/expected.tsv \
+     $(output_directory)/dev-0/in.tsv.xz $(output_directory)/dev-0/expected.tsv \
+     $(output_directory)/test-A/in.tsv.xz $(output_directory)/test-A/expected.tsv
+
+$(output_directory)/config.txt:
+    mkdir -p $(output_directory)
+    geval --init --expected-directory $(output_directory) --metric MAIN_METRIC --metric AUXILIARY_METRIC --precision N --gonito-host https://some.gonito.host.net
+    # `geval --init` will generate a toy challenge for a given metric(s)
+    # ... but we remove the `in/expected.tsv` files just in case
+    # (we will overwrite this with our data anyway)
+    rm -f $(output_directory)/{train,dev-0,test-A}/{in,expected}.tsv
+
+# a "total" TSV containing all the data, we'll split it later
+all-data.tsv.xz: some-other-files
+    # the data are generated using your script, let's say prepare.py and
+    # some other files (of course, it depends on your task);
+    # the file will be compressed with xz
+    ./prepare.py some-other-files | xz > $@
+
+# and now the challenge files, note that they will depend on config.txt so that
+# the challenge skeleton is generated first
+
+# The best way to split data into train, dev-0 and test-A sets is to do it in a random
+# but _stable_ manner: the set to which an item is assigned should depend on the MD5 sum
+# of some field in the input data (a field unlikely to change). Let's assume
+# that you created a script `filter.py` that takes as an argument a regular expression to be applied
+# to the MD5 sum (written in hexadecimal format).
+
+$(output_directory)/train/in.tsv.xz $(output_directory)/train/expected.tsv: all-data.tsv.xz filter.py $(output_directory)/config.txt
+    # 1. xzcat for decompression
+    # 2. ./filter.py will select 14/16=7/8 of items in a stable random manner
+    # 3. tee >(...) is Bash magic to fork the output into two streams
+    # 4. cut will select the columns
+    # 5. xz will compress it back
+    xzcat $< | ./filter.py '[0-9abcd]$' | tee >(cut -f 1 > $(output_directory)/train/expected.tsv) | cut -f 2- | xz > $@
+
+$(output_directory)/dev-0/in.tsv.xz $(output_directory)/dev-0/expected.tsv: all-data.tsv.xz filter.py $(output_directory)/config.txt
+    # 1/16 of items for dev-0 set
+    xzcat $< | ./filter.py 'e$' | tee >(cut -f 1 > $(output_directory)/dev-0/expected.tsv) | cut -f 2- | xz > $@
+
+$(output_directory)/test-A/in.tsv.xz $(output_directory)/test-A/expected.tsv: all-data.tsv.xz filter.py $(output_directory)/config.txt
+    # the other 1/16 of items for the test-A set
+    xzcat $< | ./filter.py 'f$' | tee >(cut -f 1 > $(output_directory)/test-A/expected.tsv) | cut -f 2- | xz > $@
+
+# wiping out the challenge, if you are desperate
+clean:
+    rm -rf $(output_directory)
+```
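
The Makefile above relies on a `filter.py` script that the README only describes. A minimal sketch of what such a script might look like follows. It assumes the stable key is the second column of each TSV line (the first column being the target value that `cut -f 1` extracts); the column choice, `KEY_COLUMN`, and the function names are hypothetical, not part of GEval.

```python
#!/usr/bin/env python3
"""Hypothetical filter.py: keep TSV lines whose key-field MD5 matches a regex."""
import hashlib
import re
import sys

# Assumption: the second column (index 1) is a stable field usable as a key.
KEY_COLUMN = 1


def md5_hex(text):
    """Return the MD5 digest of text as 32 lowercase hexadecimal characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def select(lines, pattern, key_column=KEY_COLUMN):
    """Yield the lines whose key-field MD5 (in hex) matches the pattern."""
    regex = re.compile(pattern)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Fall back to the whole line if the key column is missing.
        key = fields[key_column] if key_column < len(fields) else line
        if regex.search(md5_hex(key)):
            yield line


# Run as a filter only when a regex is given on the command line,
# e.g. `xzcat all-data.tsv.xz | ./filter.py 'e$'`.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.stdout.writelines(select(sys.stdin, sys.argv[1]))
```

Because the last hexadecimal digit of an MD5 sum is roughly uniform over 16 values, the patterns `'[0-9abcd]$'`, `'e$'` and `'f$'` split the data into disjoint parts of about 14/16, 1/16 and 1/16, and an item stays in the same set as long as its key field does not change.
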
+
+Now let's do the Git part. We will:
+
+1. prepare a branch (say `master`) with all the files _without_
+   `test-A/expected.tsv`; this branch will be cloned by people taking
   up the challenge.
-2. Prepare a separate branch (or even a repo) with
-   `test-A/expected.tsv` added. This branch should be accessible by
+2. prepare a separate branch (or even a separate repo; here we'll use the branch `dont-peek`) with
+   `test-A/expected.tsv` added; this branch should be accessible by
   Gonito platform, but should be kept “hidden” for regular users (or
-   at least they should be kindly asked not to peek there). It is
-   recommended (though not obligatory) that this branch contains all
-   the source codes and data used to generate the train/dev/test sets.
-   (Use [git-annex](https://git-annex.branchable.com/) if you have huge files there.)
+   at least they should be kindly asked not to peek there).

 Branch (1) should be the parent of the branch (2), for instance, the
 repo (for the toy “planets” challenge) could be created as follows:

-    geval --init --expected-directory planets
-    cd planets
+    cd planets  # output_directory in the Makefile above
    git init
-    git add .gitignore config.txt README.md train/train.tsv dev-0/{in,expected}.tsv test-A/in.tsv
+    git add .gitignore config.txt README.md {train,dev-0}/{in.tsv.xz,expected.tsv} test-A/in.tsv.xz
    git commit -m 'init challenge'
-    git remote add origin ssh://gitolite@gonito.net/filipg/planets
+    git remote add origin ssh://gitolite@gonito.net/planets # some repo you have access to
    git push origin master
-    git branch dont-peek
-    git checkout dont-peek
+    git checkout -b dont-peek
    git add test-A/expected.tsv
-    git commit -m 'with expected results'
+    git commit -m 'hidden data'
    git push origin dont-peek
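
Before committing, it may be worth checking that every set has as many input lines as expected values, since the two files are written by separate branches of the pipeline. The following is only a sketch under the layout used in this section (`in.tsv.xz` plus `expected.tsv` in `train`, `dev-0` and `test-A`); the function names are hypothetical and this check is not part of GEval.

```python
import lzma
from pathlib import Path


def count_lines(path):
    """Count the lines of a plain-text or xz-compressed file."""
    opener = lzma.open if path.suffix == ".xz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_split(challenge_dir, split):
    """Return (input line count, expected line count) for one set."""
    split_dir = Path(challenge_dir) / split
    return (count_lines(split_dir / "in.tsv.xz"),
            count_lines(split_dir / "expected.tsv"))


def mismatched_splits(challenge_dir, splits=("train", "dev-0", "test-A")):
    """List the sets whose input and expected files differ in length."""
    return [split for split in splits
            if len(set(check_split(challenge_dir, split))) != 1]
```

Calling `mismatched_splits(".")` from inside the challenge directory returns an empty list when all three sets are consistent.
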

 ## Taking up a Gonito challenge