Update README

This commit is contained in:
Filip Gralinski 2020-02-18 21:01:39 +01:00
parent ccd6d919da
commit 561f568437
1 changed files with 86 additions and 26 deletions

112
README.md
View File

@ -499,17 +499,13 @@ have the following structure:
non-standard test subdirectory)
* `train/` — subdirectory with training data (if training data are
supplied for a given Gonito challenge at all)
* `train/train.tsv` — the usual name of the training data file (this
name is not required and could be more than one file), the first
column is the target (predicted) value, the other columns represent
features, no header is assumed
* `train/in.tsv` — the input data for the training set
* `train/expected.tsv` — the target values
* `dev-0/` — subdirectory with a development set (a sample test set,
which won't be used for the final evaluation)
* `dev-0/in.tsv` — input data (the same format as `train/train.tsv`,
but without the first column)
* `dev-0/expected.tsv` — values to be guessed (note that `paste
dev-0/expected.tsv dev-0/in.tsv` should give the same format as
`train/train.tsv`)
* `dev-0/expected.tsv` — values to be guessed
* `dev-1/`, `dev-2`, ... — other dev sets (if supplied)
* `test-A/` — subdirectory with the test set
* `test-A/in.tsv` — test input (the same format as `dev-0/in.tsv`)
@ -523,45 +519,109 @@ have the following structure:
You can use `geval` to initiate a [Gonito](https://gonito.net) challenge:
geval --init --expected-directory my-challenge
geval --init --expected-directory my-challenge --metric RMSE
(This will generate a sample toy challenge about guessing planet masses).
A metric (other than the default `RMSE` — root-mean-square error) can
Of course, any other metric can
be given to generate another type of toy challenge:
geval --init --expected-directory my-machine-translation-challenge --metric BLEU
### Preparing a Git repository
[Gonito](https://gonito.net) platform expects a Git repository with a challenge to be
submitted. The suggested way to do this is as follows:
[Gonito](https://gonito.net) platform expects a Git repository with a
challenge to be submitted. The suggested way to do this will be
presented as a [Makefile](https://en.wikipedia.org/wiki/Makefile), but
of course you could use any other scripting language and the commands
should be clear if you know Bash and some basic facts about Makefile:
1. Prepare a branch with all the files _without_
`test-A/expected.tsv`. This branch will be cloned by people taking
* a Makefile consists of rules, each rule specify how to build a _target_ out _dependencies_ using
shell commands
* `$@` is the (first) target, whereas `$<` — the first dependency
* the indentation should be done with TABs, not spaces!
```
SHELL=/bin/bash
# no not delete intermediate files
.SECONDARY:
# the directory where the challenge will be created
output_directory=...
# let's define which files are necessary, other files will be created if needed
all: $(output_directory)/train/in.tsv.xz $(output_directory)/train/expected.tsv \
$(output_directory)/dev-0/in.tsv.xz $(output_directory)/dev-0/expected.tsv \
$(output_directory)/test-A/in.tsv.xz $(output_directory)/test-A/expected.tsv
$(output_directory)/config.txt:
mkdir -p $(output_directory)
geval --init --expected-directory $(output_directory) --metric MAIN_METRIC --metric AUXILIARY_METRIC --precision N --gonito-host https://some.gonito.host.net
# `geval --init` will generate a toy challenge for a given metric(s)
# ... but we remove the `in/expected.tsv` files just in case
# (we will overwrite this with our data anyway)
rm -f $(output_directory)/{train,dev-0,test-A}/{in,expected}.tsv
# a "total" TSV containing all the data, we'll split it later
all-data.tsv.xz: some-other-files
# the data are generated using your script, let's say prepare.py and
# some other files (of course, it depends on your task);
# the file will be compressed with xz
./prepare.py some-other-files | xz > $@
# and now the challenge files, note that they will depend on config.txt so that
# the challenge skeleton is generated first
# The best way to split data into train, dev-0 and test-A set is to do it in a random,
# but _stable_ manner, the set into which an item is assigned should depend on the MD5 sum
# of some field in the input data (a field unlikely to change). Let's assume
# that you created a script `filter.py` that takes as an argument a regular expression that will be applied
# for the MD5 sum (written in the hexadecimal format).
$(output_directory)/train/in.tsv.xz $(output_directory)/train/expected.tsv: all-data.tsv.xz filter.py config.txt
# 1. xzcat for decompression
# 2. ./filter.py will select 14/16=7/8 of items in a stable random manner
# 3. tee >(...) is Bash magic to fork the ouptut into two streams
# 4. cut will select the columns
# 5. xz will compress it back
xzcat $< | ./filter.py '[0-9abcd]$' | tee >(cut -f 1 > $(output_directory)/train/expected.tsv) | cut -f 2- | xz > $@
$(output_directory)/dev-0/in.tsv.xz $(output_directory)/dev-0/expected.tsv: all-data.tsv.xz filter.py config.txt
# 1/16 of items for dev-0 set
xzcat $< | ./filter.py 'e$' | tee >(cut -f 1 > $(output_directory)/dev-0/expected.tsv) | cut -f 2- | xz > $@
$(output_directory)/test-A/in.tsv.xz $(output_directory)/test-A/expected.tsv: all-data.tsv.xz filter.py config.txt
# ( other)1/16 of items for test-A set
xzcat $< | ./filter.py 'f$' | tee >(cut -f 1 > $(output_directory)/test-A/expected.tsv) | cut -f 2- | xz > $@
# wiping out the challenge, if you are desperate
clean:
rm -rf $(output_directory)
```
Now let's do the git stuff, we will:
1. prepare a branch (say `master`) with all the files _without_
`test-A/expected.tsv`, this branch will be cloned by people taking
up the challenge.
2. Prepare a separate branch (or even a repo) with
`test-A/expected.tsv` added. This branch should be accessible by
2. prepare a separate branch (or could be a repo, we'll use the branch `dont-peek`) with
`test-A/expected.tsv` added; this branch should be accessible by
Gonito platform, but should be kept “hidden” for regular users (or
at least they should be kindly asked not to peek there). It is
recommended (though not obligatory) that this branch contains all
the source codes and data used to generate the train/dev/test sets.
(Use [git-annex](https://git-annex.branchable.com/) if you have huge files there.)
at least they should be kindly asked not to peek there).
Branch (1) should be the parent of the branch (2), for instance, the
repo (for the toy “planets” challenge) could be created as follows:
geval --init --expected-directory planets
cd planets
cd planets # output_directory in the Makefile above
git init
git add .gitignore config.txt README.md train/train.tsv dev-0/{in,expected}.tsv test-A/in.tsv
git add .gitignore config.txt README.md {train,dev-0}/{in.tsv.xz,expected.tsv} test-A/in.tsv.xz
git commit -m 'init challenge'
git remote add origin ssh://gitolite@gonito.net/filipg/planets
git remote add origin ssh://gitolite@gonito.net/planets # some repo you have access
git push origin master
git branch dont-peek
git checkout dont-peek
git checkout -b dont-peek
git add test-A/expected.tsv
git commit -m 'with expected results'
git commit -m 'hidden data'
git push origin dont-peek
## Taking up a Gonito challenge