Merge branch 'master' into headers2

Filip Gralinski 2020-02-22 12:15:04 +01:00
commit 08be1e1f5b

127
README.md

@@ -499,17 +499,12 @@ have the following structure:
   non-standard test subdirectory)
 * `train/` — subdirectory with training data (if training data are
   supplied for a given Gonito challenge at all)
-* `train/train.tsv` — the usual name of the training data file (this
-  name is not required and could be more than one file), the first
-  column is the target (predicted) value, the other columns represent
-  features, no header is assumed
+* `train/in.tsv` — the input data for the training set
+* `train/expected.tsv` — the target values
 * `dev-0/` — subdirectory with a development set (a sample test set,
   which won't be used for the final evaluation)
-* `dev-0/in.tsv` — input data (the same format as `train/train.tsv`,
-  but without the first column)
-* `dev-0/expected.tsv` — values to be guessed (note that `paste
-  dev-0/expected.tsv dev-0/in.tsv` should give the same format as
-  `train/train.tsv`)
+* `dev-0/in.tsv` — input data
+* `dev-0/expected.tsv` — values to be guessed
 * `dev-1/`, `dev-2`, ... — other dev sets (if supplied)
 * `test-A/` — subdirectory with the test set
 * `test-A/in.tsv` — test input (the same format as `dev-0/in.tsv`)
@@ -523,45 +518,121 @@ have the following structure:
 You can use `geval` to initiate a [Gonito](https://gonito.net) challenge:

-    geval --init --expected-directory my-challenge
+    geval --init --expected-directory my-challenge --metric RMSE

 (This will generate a sample toy challenge about guessing planet masses).

-A metric (other than the default `RMSE` — root-mean-square error) can
+Of course, any other metric can
 be given to generate another type of toy challenge:

     geval --init --expected-directory my-machine-translation-challenge --metric BLEU

 ### Preparing a Git repository
-[Gonito](https://gonito.net) platform expects a Git repository with a challenge to be
-submitted. The suggested way to do this is as follows:
+[Gonito](https://gonito.net) platform expects a Git repository with a
+challenge to be submitted. The suggested way to do this will be
+presented as a [Makefile](https://en.wikipedia.org/wiki/Makefile), but
+of course you could use any other scripting language and the commands
+should be clear if you know Bash and some basic facts about Makefiles:

-1. Prepare a branch with all the files _without_
-   `test-A/expected.tsv`. This branch will be cloned by people taking
+* a Makefile consists of rules, each rule specifies how to build a _target_ out of _dependencies_ using
+  shell commands
+* `$@` is the (first) target, whereas `$<` — the first dependency
+* the indentation should be done with TABs, not spaces!
```
SHELL=/bin/bash

# do not delete intermediate files
.SECONDARY:

# the directory where the challenge will be created
output_directory=...

# let's define which files are necessary, other files will be created if needed;
# we'll compress the input files with xz and leave `expected.tsv` files uncompressed
# (but you could decide otherwise)
all: $(output_directory)/train/in.tsv.xz $(output_directory)/train/expected.tsv \
     $(output_directory)/dev-0/in.tsv.xz $(output_directory)/dev-0/expected.tsv \
     $(output_directory)/test-A/in.tsv.xz $(output_directory)/test-A/expected.tsv \
     $(output_directory)/README.md
	# always validate the challenge
	geval --validate --expected-directory $(output_directory)

# we need to replace the default README.md, we assume that it
# is kept as challenge-readme.md in the repo with this Makefile;
# note that the title from README.md will be taken as the title of the challenge
# and the first paragraph as a short description
$(output_directory)/README.md: challenge-readme.md
	cp $< $@

$(output_directory)/config.txt:
	mkdir -p $(output_directory)
	geval --init --expected-directory $(output_directory) --metric MAIN_METRIC --metric AUXILIARY_METRIC --precision N --gonito-host https://some.gonito.host.net
	# `geval --init` will generate a toy challenge for the given metric(s)
	# ... but we remove the `in/expected.tsv` files just in case
	# (we will overwrite them with our data anyway)
	rm -f $(output_directory)/{train,dev-0,test-A}/{in,expected}.tsv

# a "total" TSV containing all the data, we'll split it later
all-data.tsv.xz: prepare.py some-other-files
	# the data are generated using your script, let's say prepare.py and
	# some other files (of course, it depends on your task);
	# the file will be compressed with xz
	./prepare.py some-other-files | xz > $@

# and now the challenge files, note that they depend on
# $(output_directory)/config.txt so that the challenge skeleton is generated first

# The best way to split data into train, dev-0 and test-A sets is to do it in a random,
# but _stable_ manner: the set to which an item is assigned should depend on the MD5 sum
# of some field in the input data (a field unlikely to change). Let's assume
# that you created a script `filter.py` that takes as an argument a regular expression
# that will be applied to the MD5 sum (written in the hexadecimal format).

$(output_directory)/train/in.tsv.xz $(output_directory)/train/expected.tsv: all-data.tsv.xz filter.py $(output_directory)/config.txt
	# 1. xzcat for decompression
	# 2. ./filter.py will select 14/16=7/8 of items in a stable random manner
	# 3. tee >(...) is Bash magic to fork the output into two streams
	# 4. cut will select the columns
	# 5. xz will compress it back
	xzcat $< | ./filter.py '[0-9abcd]$' | tee >(cut -f 1 > $(output_directory)/train/expected.tsv) | cut -f 2- | xz > $@

$(output_directory)/dev-0/in.tsv.xz $(output_directory)/dev-0/expected.tsv: all-data.tsv.xz filter.py $(output_directory)/config.txt
	# 1/16 of items goes to dev-0 set
	xzcat $< | ./filter.py 'e$' | tee >(cut -f 1 > $(output_directory)/dev-0/expected.tsv) | cut -f 2- | xz > $@

$(output_directory)/test-A/in.tsv.xz $(output_directory)/test-A/expected.tsv: all-data.tsv.xz filter.py $(output_directory)/config.txt
	# (other) 1/16 of items goes to test-A set
	xzcat $< | ./filter.py 'f$' | tee >(cut -f 1 > $(output_directory)/test-A/expected.tsv) | cut -f 2- | xz > $@

# wiping out the challenge, if you are desperate
clean:
	rm -rf $(output_directory)
```
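
The Makefile above only assumes that `filter.py` exists and takes a regular expression applied to the hexadecimal MD5 sum of some stable field. A minimal sketch of such a script might look as follows. The choice of the hashed field (here the second tab-separated column, i.e. the first input column after the target value) and the helper name `keep` are assumptions made for illustration, not part of GEval.

```python
#!/usr/bin/env python3
"""Hypothetical filter.py: pass through TSV lines whose MD5 digest matches a regex."""
import hashlib
import re
import sys


def keep(line, pattern, field=1):
    """Return True if the hex MD5 of the chosen column matches `pattern`.

    `field` is the 0-based column to hash. Column 1 (the first input
    column) is an assumption; pick a field that is unlikely to change.
    Lines are assumed to have at least `field + 1` columns.
    """
    columns = line.rstrip("\n").split("\t")
    digest = hashlib.md5(columns[field].encode("utf-8")).hexdigest()
    return re.search(pattern, digest) is not None


def main(pattern):
    # Copy matching lines from standard input to standard output unchanged.
    for line in sys.stdin:
        if keep(line, pattern):
            sys.stdout.write(line)


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Because the last hexadecimal digit of the digest is one of 16 characters, the patterns `[0-9abcd]$`, `e$` and `f$` used in the Makefile are mutually exclusive and together cover every item, so each line lands in exactly one of the three sets, and in the same one on every run.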
+Now let's do the git stuff, we will:
+
+1. prepare a branch (say `master`) with all the files _without_
+   `test-A/expected.tsv`, this branch will be cloned by people taking
    up the challenge.
-2. Prepare a separate branch (or even a repo) with
-   `test-A/expected.tsv` added. This branch should be accessible by
+2. prepare a separate branch (or could be a repo, we'll use the branch `dont-peek`) with
+   `test-A/expected.tsv` added; this branch should be accessible by
    Gonito platform, but should be kept “hidden” for regular users (or
-   at least they should be kindly asked not to peek there). It is
-   recommended (though not obligatory) that this branch contains all
-   the source codes and data used to generate the train/dev/test sets.
-   (Use [git-annex](https://git-annex.branchable.com/) if you have huge files there.)
+   at least they should be kindly asked not to peek there).

 Branch (1) should be the parent of the branch (2), for instance, the
 repo (for the toy “planets” challenge) could be created as follows:

-    geval --init --expected-directory planets
-    cd planets
+    cd planets # output_directory in the Makefile above
     git init
-    git add .gitignore config.txt README.md train/train.tsv dev-0/{in,expected}.tsv test-A/in.tsv
+    git add .gitignore config.txt README.md {train,dev-0}/{in.tsv.xz,expected.tsv} test-A/in.tsv.xz
     git commit -m 'init challenge'
-    git remote add origin ssh://gitolite@gonito.net/filipg/planets
+    git remote add origin ssh://gitolite@gonito.net/planets # some repo you have access to
     git push origin master
-    git branch dont-peek
-    git checkout dont-peek
+    git checkout -b dont-peek
     git add test-A/expected.tsv
-    git commit -m 'with expected results'
+    git commit -m 'hidden data'
     git push origin dont-peek

 ## Taking up a Gonito challenge