Update README

2020-02-18 21:01:39 +01:00 · 2020-02-18 21:01:39 +01:00 · 561f568437
commit 561f568437
parent ccd6d919da
1 changed file with 86 additions and 26 deletions
--- a/README.md
+++ b/README.md
@@ -499,17 +499,13 @@ have the following structure:
  non-standard test subdirectory)
 * `train/` — subdirectory with training data (if training data are
  supplied for a given Gonito challenge at all)
-* `train/train.tsv` — the usual name of the training data file (this
+* `train/in.tsv` — the input data for the training set
-  name is not required and could be more than one file), the first
+* `train/expected.tsv` — the target values
  column is the target (predicted) value, the other columns represent
  features, no header is assumed
 * `dev-0/` — subdirectory with a development set (a sample test set,
  which won't be used for the final evaluation)
 * `dev-0/in.tsv` — input data (the same format as `train/train.tsv`,
  but without the first column)
-* `dev-0/expected.tsv` — values to be guessed (note that `paste
+* `dev-0/expected.tsv` — values to be guessed
  dev-0/expected.tsv dev-0/in.tsv` should give the same format as
  `train/train.tsv`)
 * `dev-1/`, `dev-2`, ... — other dev sets (if supplied)
 * `test-A/` — subdirectory with the test set
 * `test-A/in.tsv` — test input (the same format as `dev-0/in.tsv`)
@@ -523,45 +519,109 @@ have the following structure:
 You can use `geval` to initiate a [Gonito](https://gonito.net) challenge:
-    geval --init --expected-directory my-challenge
+    geval --init --expected-directory my-challenge --metric RMSE
 (This will generate a sample toy challenge about guessing planet masses).
-A metric (other than the default `RMSE` — root-mean-square error) can
+Of course, any other metric can
 be given to generate another type of toy challenge:
    geval --init --expected-directory my-machine-translation-challenge --metric BLEU
 ### Preparing a Git repository
-[Gonito](https://gonito.net) platform expects a Git repository with a challenge to be
+The [Gonito](https://gonito.net) platform expects a Git repository with a
-submitted. The suggested way to do this is as follows:
+challenge to be submitted. The suggested way to do this will be
 presented as a [Makefile](https://en.wikipedia.org/wiki/Makefile), but
 of course you could use any other scripting language and the commands
 should be clear if you know Bash and some basic facts about Makefiles:
-1. Prepare a branch with all the files _without_
+* a Makefile consists of rules; each rule specifies how to build a _target_ out of its _dependencies_ using
-   `test-A/expected.tsv`. This branch will be cloned by people taking
+  shell commands
 * `$@` is the (first) target, whereas `$<` — the first dependency
 * the indentation should be done with TABs, not spaces!
 ```
 SHELL=/bin/bash
 # do not delete intermediate files
 .SECONDARY:
 # the directory where the challenge will be created
 output_directory=...
 # let's define which files are necessary, other files will be created if needed
 all: $(output_directory)/train/in.tsv.xz $(output_directory)/train/expected.tsv \
     $(output_directory)/dev-0/in.tsv.xz $(output_directory)/dev-0/expected.tsv \
     $(output_directory)/test-A/in.tsv.xz $(output_directory)/test-A/expected.tsv
 $(output_directory)/config.txt:
    mkdir -p $(output_directory)
    geval --init --expected-directory $(output_directory) --metric MAIN_METRIC --metric AUXILIARY_METRIC --precision N --gonito-host https://some.gonito.host.net
    # `geval --init` will generate a toy challenge for the given metric(s)
    # ... but we remove the `in.tsv`/`expected.tsv` files just in case
    # (we will overwrite them with our data anyway)
    rm -f $(output_directory)/{train,dev-0,test-A}/{in,expected}.tsv
 # a "total" TSV containing all the data, we'll split it later
 all-data.tsv.xz: some-other-files
    # the data are generated using your script, let's say prepare.py and
    # some other files (of course, it depends on your task);
    # the file will be compressed with xz
    ./prepare.py some-other-files | xz > $@
 # and now the challenge files, note that they will depend on config.txt so that
 # the challenge skeleton is generated first
 # The best way to split the data into the train, dev-0 and test-A sets is to do it in a random
 # but _stable_ manner: the set to which an item is assigned should depend on the MD5 sum
 # of some field in the input data (a field unlikely to change). Let's assume
 # that you have created a script `filter.py` that takes a regular expression as its argument and keeps
 # the items whose MD5 sum (written in hexadecimal format) matches it.
 $(output_directory)/train/in.tsv.xz $(output_directory)/train/expected.tsv: all-data.tsv.xz filter.py $(output_directory)/config.txt
    # 1. xzcat for decompression
    # 2. ./filter.py will select 14/16=7/8 of items in a stable random manner
    # 3. tee >(...) is Bash magic to fork the output into two streams
    # 4. cut will select the columns
    # 5. xz will compress it back
    xzcat $< | ./filter.py '[0-9abcd]$' | tee >(cut -f 1 > $(output_directory)/train/expected.tsv) | cut -f 2- | xz > $@
 $(output_directory)/dev-0/in.tsv.xz $(output_directory)/dev-0/expected.tsv: all-data.tsv.xz filter.py $(output_directory)/config.txt
    # 1/16 of items for dev-0 set
    xzcat $< | ./filter.py 'e$' | tee >(cut -f 1 > $(output_directory)/dev-0/expected.tsv) | cut -f 2- | xz > $@
 $(output_directory)/test-A/in.tsv.xz $(output_directory)/test-A/expected.tsv: all-data.tsv.xz filter.py $(output_directory)/config.txt
    # (the other) 1/16 of items for the test-A set
    xzcat $< | ./filter.py 'f$' | tee >(cut -f 1 > $(output_directory)/test-A/expected.tsv) | cut -f 2- | xz > $@
 # wiping out the challenge, if you are desperate
 clean:
    rm -rf $(output_directory)
 ```
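The Makefile above relies on a `filter.py` script that is not shown. A minimal sketch follows, assuming that the regular expression is the only command-line argument and that the key field is the second tab-separated column (the first input column, right after the target value). The names `KEY_FIELD`, `md5_hex` and `keep` are hypothetical and chosen only for illustration; they are not part of GEval.

```python
#!/usr/bin/env python3
"""Hypothetical filter.py: keep lines whose key field has an MD5 sum matching a regex."""
import hashlib
import re
import sys

# Assumption: the key is the second tab-separated column (index 1).
# Pick whichever field is least likely to change in your own data.
KEY_FIELD = 1


def md5_hex(text):
    """Return the MD5 sum of `text` as a lowercase hexadecimal string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def keep(line, pattern, key_field=KEY_FIELD):
    """Decide whether a line is selected, based only on the hash of its key field."""
    fields = line.rstrip("\n").split("\t")
    # Fall back to the whole line if the key column is missing.
    key = fields[key_field] if len(fields) > key_field else line
    return pattern.search(md5_hex(key)) is not None


def main(argv):
    pattern = re.compile(argv[1])
    for line in sys.stdin:
        if keep(line, pattern):
            sys.stdout.write(line)


# Run as a filter only when a regex is given on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv)
```

Because the decision depends only on the hash of the key field, an item stays in the same set when the data are regenerated, and the three patterns used in the Makefile (`[0-9abcd]$`, `e$`, `f$`) select disjoint subsets that together cover all items.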
 Now let's do the Git part. We will:
 1. prepare a branch (say `master`) with all the files _without_
   `test-A/expected.tsv`; this branch will be cloned by people taking
   up the challenge.
-2. Prepare a separate branch (or even a repo) with
+2. prepare a separate branch (or even a separate repo; we'll use the branch `dont-peek`) with
-   `test-A/expected.tsv` added. This branch should be accessible by
+   `test-A/expected.tsv` added; this branch should be accessible by
   the Gonito platform, but should be kept “hidden” from regular users (or
-   at least they should be kindly asked not to peek there). It is
+   at least they should be kindly asked not to peek there).
   recommended (though not obligatory) that this branch contains all
   the source code and data used to generate the train/dev/test sets.
   (Use [git-annex](https://git-annex.branchable.com/) if you have huge files there.)
 Branch (1) should be the parent of the branch (2), for instance, the
 repo (for the toy “planets” challenge) could be created as follows:
-    geval --init --expected-directory planets
+    # planets is the output_directory in the Makefile above
    cd planets
    git init
-    git add .gitignore config.txt README.md train/train.tsv dev-0/{in,expected}.tsv test-A/in.tsv
+    git add .gitignore config.txt README.md {train,dev-0}/{in.tsv.xz,expected.tsv} test-A/in.tsv.xz
    git commit -m 'init challenge'
-    git remote add origin ssh://gitolite@gonito.net/filipg/planets
+    git remote add origin ssh://gitolite@gonito.net/planets # some repo you have access to
    git push origin master
-    git branch dont-peek
+    git checkout -b dont-peek
    git checkout dont-peek
    git add test-A/expected.tsv
-    git commit -m 'with expected results'
+    git commit -m 'hidden data'
    git push origin dont-peek
 ## Taking up a Gonito challenge