Merge branch 'master' into headers2
This commit is contained in:
commit
08be1e1f5b
127
README.md
@@ -499,17 +499,12 @@ have the following structure:
   non-standard test subdirectory)
 * `train/` — subdirectory with training data (if training data are
   supplied for a given Gonito challenge at all)
-* `train/train.tsv` — the usual name of the training data file (this
-  name is not required and could be more than one file), the first
-  column is the target (predicted) value, the other columns represent
-  features, no header is assumed
+* `train/in.tsv` — the input data for the training set
+* `train/expected.tsv` — the target values
 * `dev-0/` — subdirectory with a development set (a sample test set,
   which won't be used for the final evaluation)
-* `dev-0/in.tsv` — input data (the same format as `train/train.tsv`,
-  but without the first column)
-* `dev-0/expected.tsv` — values to be guessed (note that `paste
-  dev-0/expected.tsv dev-0/in.tsv` should give the same format as
-  `train/train.tsv`)
+* `dev-0/in.tsv` — input data
+* `dev-0/expected.tsv` — values to be guessed
 * `dev-1/`, `dev-2`, ... — other dev sets (if supplied)
 * `test-A/` — subdirectory with the test set
 * `test-A/in.tsv` — test input (the same format as `dev-0/in.tsv`)
@@ -523,45 +518,121 @@ have the following structure:
 
 You can use `geval` to initiate a [Gonito](https://gonito.net) challenge:
 
-    geval --init --expected-directory my-challenge
+    geval --init --expected-directory my-challenge --metric RMSE
 
 (This will generate a sample toy challenge about guessing planet masses).
 
-A metric (other than the default `RMSE` — root-mean-square error) can
+Of course, any other metric can
 be given to generate another type of toy challenge:
 
     geval --init --expected-directory my-machine-translation-challenge --metric BLEU
 
 ### Preparing a Git repository
 
-[Gonito](https://gonito.net) platform expects a Git repository with a challenge to be
-submitted. The suggested way to do this is as follows:
+[Gonito](https://gonito.net) platform expects a Git repository with a
+challenge to be submitted. The suggested way to do this will be
+presented as a [Makefile](https://en.wikipedia.org/wiki/Makefile), but
+of course you could use any other scripting language and the commands
+should be clear if you know Bash and some basic facts about Makefiles:
 
-1. Prepare a branch with all the files _without_
-   `test-A/expected.tsv`. This branch will be cloned by people taking
+* a Makefile consists of rules, each rule specifies how to build a _target_ out of _dependencies_ using
+  shell commands
+* `$@` is the (first) target, whereas `$<` — the first dependency
+* the indentation should be done with TABs, not spaces!
+
+```
+SHELL=/bin/bash
+
+# do not delete intermediate files
+.SECONDARY:
+
+# the directory where the challenge will be created
+output_directory=...
+
+# let's define which files are necessary, other files will be created if needed;
+# we'll compress the input files with xz and leave `expected.tsv` files uncompressed
+# (but you could decide otherwise)
+all: $(output_directory)/train/in.tsv.xz $(output_directory)/train/expected.tsv \
+     $(output_directory)/dev-0/in.tsv.xz $(output_directory)/dev-0/expected.tsv \
+     $(output_directory)/test-A/in.tsv.xz $(output_directory)/test-A/expected.tsv \
+     $(output_directory)/README.md
+	# always validate the challenge
+	geval --validate --expected-directory $(output_directory)
+
+# we need to replace the default README.md, we assume that it
+# is kept as challenge-readme.md in the repo with this Makefile;
+# note that the title from README.md will be taken as the title of the challenge
+# and the first paragraph — as a short description
+$(output_directory)/README.md: challenge-readme.md
+	cp $< $@
+
+$(output_directory)/config.txt:
+	mkdir -p $(output_directory)
+	geval --init --expected-directory $(output_directory) --metric MAIN_METRIC --metric AUXILIARY_METRIC --precision N --gonito-host https://some.gonito.host.net
+	# `geval --init` will generate a toy challenge for a given metric(s)
+	# ... but we remove the `in/expected.tsv` files just in case
+	# (we will overwrite this with our data anyway)
+	rm -f $(output_directory)/{train,dev-0,test-A}/{in,expected}.tsv
+
+# a "total" TSV containing all the data, we'll split it later
+all-data.tsv.xz: prepare.py some-other-files
+	# the data are generated using your script, let's say prepare.py and
+	# some other files (of course, it depends on your task);
+	# the file will be compressed with xz
+	./prepare.py some-other-files | xz > $@
+
+# and now the challenge files, note that they will depend on config.txt so that
+# the challenge skeleton is generated first
+
+# The best way to split data into train, dev-0 and test-A set is to do it in a random,
+# but _stable_ manner, the set into which an item is assigned should depend on the MD5 sum
+# of some field in the input data (a field unlikely to change). Let's assume
+# that you created a script `filter.py` that takes as an argument a regular expression that will be applied
+# to the MD5 sum (written in the hexadecimal format).
+
+$(output_directory)/train/in.tsv.xz $(output_directory)/train/expected.tsv: all-data.tsv.xz filter.py $(output_directory)/config.txt
+	# 1. xzcat for decompression
+	# 2. ./filter.py will select 14/16=7/8 of items in a stable random manner
+	# 3. tee >(...) is Bash magic to fork the output into two streams
+	# 4. cut will select the columns
+	# 5. xz will compress it back
+	xzcat $< | ./filter.py '[0-9abcd]$' | tee >(cut -f 1 > $(output_directory)/train/expected.tsv) | cut -f 2- | xz > $@
+
+$(output_directory)/dev-0/in.tsv.xz $(output_directory)/dev-0/expected.tsv: all-data.tsv.xz filter.py $(output_directory)/config.txt
+	# 1/16 of items goes to dev-0 set
+	xzcat $< | ./filter.py 'e$' | tee >(cut -f 1 > $(output_directory)/dev-0/expected.tsv) | cut -f 2- | xz > $@
+
+$(output_directory)/test-A/in.tsv.xz $(output_directory)/test-A/expected.tsv: all-data.tsv.xz filter.py $(output_directory)/config.txt
+	# (other) 1/16 of items goes to test-A set
+	xzcat $< | ./filter.py 'f$' | tee >(cut -f 1 > $(output_directory)/test-A/expected.tsv) | cut -f 2- | xz > $@
+
+# wiping out the challenge, if you are desperate
+clean:
+	rm -rf $(output_directory)
+```
+
+Now let's do the git stuff, we will:
+
+1. prepare a branch (say `master`) with all the files _without_
+   `test-A/expected.tsv`, this branch will be cloned by people taking
    up the challenge.
-2. Prepare a separate branch (or even a repo) with
-   `test-A/expected.tsv` added. This branch should be accessible by
+2. prepare a separate branch (or could be a repo, we'll use the branch `dont-peek`) with
+   `test-A/expected.tsv` added; this branch should be accessible by
    Gonito platform, but should be kept “hidden” for regular users (or
-   at least they should be kindly asked not to peek there). It is
-   recommended (though not obligatory) that this branch contains all
-   the source codes and data used to generate the train/dev/test sets.
-   (Use [git-annex](https://git-annex.branchable.com/) if you have huge files there.)
+   at least they should be kindly asked not to peek there).
 
 Branch (1) should be the parent of the branch (2), for instance, the
 repo (for the toy “planets” challenge) could be created as follows:
 
     geval --init --expected-directory planets
-    cd planets
+    cd planets # output_directory in the Makefile above
     git init
-    git add .gitignore config.txt README.md train/train.tsv dev-0/{in,expected}.tsv test-A/in.tsv
+    git add .gitignore config.txt README.md {train,dev-0}/{in.tsv.xz,expected.tsv} test-A/in.tsv.xz
     git commit -m 'init challenge'
-    git remote add origin ssh://gitolite@gonito.net/filipg/planets
+    git remote add origin ssh://gitolite@gonito.net/planets # some repo you have access to
     git push origin master
-    git branch dont-peek
-    git checkout dont-peek
+    git checkout -b dont-peek
     git add test-A/expected.tsv
-    git commit -m 'with expected results'
+    git commit -m 'hidden data'
     git push origin dont-peek
 
 ## Taking up a Gonito challenge
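
The Makefile added in this commit calls a `filter.py` script that the diff itself does not include. The following is only a minimal sketch of what such a script might look like, assuming that the MD5 sum is computed over the second TSV column (the first input field) and that every line has at least two columns. The names `KEY_COLUMN`, `md5_hex` and `filter_lines` are illustrative and are not part of geval.

```python
#!/usr/bin/env python3
"""Hypothetical filter.py: keep TSV lines whose key field hashes to a matching MD5 digest."""
import hashlib
import re
import sys

# Assumption: column 1 (the first input field, right after the target value)
# is unlikely to change, so it serves as the hashing key. Adjust for your data.
KEY_COLUMN = 1


def md5_hex(text):
    """Return the MD5 sum of `text` as a lowercase hexadecimal string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def filter_lines(lines, pattern, key_column=KEY_COLUMN):
    """Yield the lines whose key field has an MD5 hex digest matching `pattern`."""
    regex = re.compile(pattern)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if regex.search(md5_hex(fields[key_column])):
            yield line


# Act as a command-line filter only when exactly one pattern is given,
# e.g. `xzcat all-data.tsv.xz | ./filter.py 'e$'`.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.stdout.writelines(filter_lines(sys.stdin, sys.argv[1]))
```

Because the three patterns used in the Makefile (`[0-9abcd]$`, `e$` and `f$`) cover all sixteen hexadecimal digits without overlap, each item lands in exactly one of train, dev-0 and test-A, and it stays there for as long as its key field is unchanged.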