diff --git a/README.md b/README.md
index b83795c..f587ca7 100644
--- a/README.md
+++ b/README.md
@@ -1018,6 +1018,269 @@ Note that using `--submit` option for the main instance at
 repositories are configured there in such a way that an evaluation
 is triggered with each push anyway.
 
+## Reproducibility guidelines
+
+GEval is about evaluation; all you actually need to supply are the
+`out.tsv` files. Remember, GEval (and the associated evaluation platform
+Gonito) is not going to _run_ your submission; it just evaluates the
+_output_ of your solution by comparing it against the gold standard,
+i.e. the `expected.tsv` files.
+
+Nevertheless, it would be nice to have some _standards_ for organizing
+your code and models so that it would be easy for other people (and
+you yourself a month later) to reproduce your results. Here I lay out
+some guidelines or standards for this. Conformance to the
+guidelines is not checked by GEval/Gonito (though it may be at some
+point in the future).
+
+### The file structure
+
+Here is the recommended file structure of your submission:
+
+* `dev-?/out.tsv`, `test-?/out.tsv` — files required by GEval/Gonito
+  for the actual evaluation;
+
+* `gonito.yaml` — metadata for Gonito;
+
+* `predict.sh` — this script should read items from standard input in the
+  same format as in `in.tsv` files for a given challenge and
+  print the results on standard output in the same
+  format as in `out.tsv` files
+  - in fact, `out.tsv` should be generated with `predict.sh`,
+  - `predict.sh` must print exactly the same number of lines as it read from the input,
+  - `predict.sh` should accept any number of items, including a single item, in other
+    words `echo '...' | ./predict.sh` should work,
+  - `predict.sh` should use models stored in `models/` (generated by `train.sh`),
+  - `predict.sh` can invoke further scripts in `code/`;
+
+* `train.sh` — this script should train a machine-learning model using
+  the data in `train/` (and possibly using development sets in
+  `dev-?/` for fine-tuning, validation, early stopping, etc.); all the models
+  should be saved in the `models/` directory
+  - just like `predict.sh`, `train.sh` can invoke scripts in `code/` (obviously some code
+    in `code/` could be shared between `predict.sh` and `train.sh`),
+  - `train.sh` should generate `out.tsv` files (preferably by running `predict.sh`);
+
+* `code/` — source code and scripts for training and prediction should be put here;
+
+* `models/` — all the models generated by `train.sh` should be put here;
+
+* `Dockerfile` — recipe for a multi-stage build with `train` and
+  `predict` targets for building containers in which, respectively,
+  `train.sh` and `predict.sh` are guaranteed to run (more details below);
+
+* `.dockerignore` — put at least `models/*` and `train/*` here to
+  speed up building Docker containers;
+
+* `Makefile` (optional) — if you use make, please put your recipe here (not in `code/`).
+
+#### Environment variables
+
+There are some environment variables that should be handled by
+`train.sh` and `predict.sh` (where applicable):
+
+* `RANDOM_SEED` — the value of the random seed,
+* `THREADS` — number of threads/jobs/cores to be used (usually to be passed
+  to options `-j N`, `--threads N` or similar),
+* `BATCH_SIZE` (only `predict.sh`) — the value of the batch size
+  - by default, `BATCH_SIZE=1` should be assumed,
+  - if set to 1, `predict.sh` should immediately return the result for each item,
+  - if set to N > 1, `predict.sh` can read batches of N items, process the whole
+    batch and return the results for the whole batch.
+
+### Example — Classification with fastText
+
+Let's try to reproduce a sample submission conforming to the standards
+laid out above. The challenge is to [guess whether a given tweet
+expresses a positive or a negative
+sentiment](https://gonito.net/challenge/sentiment140). You're given
+the tweet text along with datestamps in two formats.
+
+The [sample solution](https://gonito.net/view-variant/7452) to this challenge, based on
+[fastText](https://fasttext.cc/), can be cloned as a git repo:
+
+```
+git clone --single-branch git://gonito.net/sentiment140 -b submission-07130
+```
+
+(The `--single-branch` option speeds up the download.)
+
+As usual, you could evaluate this solution locally on the dev set:
+
+```
+$ cd sentiment140
+$ geval -t dev-0
+79.88
+```
+
+The accuracy is nearly 80%, so it's pretty good. But now we are not
+interested in evaluating outputs; we'd like to actually _run_ the
+solution, or even reproduce training from scratch.
+
+Let's try to run the fastText classifier on the first 5 items from the
+dev-0 set:
+
+```
+$ xzcat dev-0/in.tsv.xz | head -n 5
+2009.4109589041095 20090531 @smaknews I love Santa Barbara! In fact @BCCF's next Black Tie Charity Event is in Santa Barbara on August 15th!
+2009.4054794520548 20090529 @GreenMommaSmith yeah man, I really need an exercise bike. Tris laughs when I mention it
+2009.2630136986302 20090407 Anticipating a slow empty boring summer
+2009.4164383561645 20090602 just crossed the kankakee river i need to go back soon & see my family. *tori*
+2009.4301369863015 20090607 is o tired because of my HillBilly Family and my histerical sister! Stress is not good for me, lol. Stuck at work
+
+$ xzcat dev-0/in.tsv.xz | head -n 5 | ./predict.sh
+terminate called after throwing an instance of 'std::invalid_argument'
+  what():  models/sentiment140.fasttext.bin cannot be opened for loading!
+```
+
+What went wrong!? The fastText model is pretty large (420 MB), so it
+would not be a good idea to commit it to the git repository directly.
+It was stored using git-annex instead.
+[Git-annex](https://git-annex.branchable.com/) is a neat git extension
+with which you commit only metadata and keep the actual contents
+wherever you want (directory, rsync host, S3 bucket, Dropbox, etc.).
+
+I put the model on my server; you can download it using the supplied bash script:
+
+```
+./get-annexed-files.sh models/sentiment140.fasttext.bin
+```
+
+Now it should work:
+
+```
+$ xzcat dev-0/in.tsv.xz | head -n 5 | ./predict.sh
+positive
+positive
+negative
+positive
+negative
+```
+
+Well… provided that you have fastText installed. So it's not exactly
+perfect reproducibility. Don't worry, we will solve this issue with
+Docker in a moment.
+
+What if you want to retrain the model from scratch? Then you should run
+the `train.sh` script. Let's set the random seed to some other value:
+
+```
+./get-annexed-files.sh train/in.tsv.xz
+rm models/*
+RANDOM_SEED=42 ./train.sh
+```
+
+Note that we need to download the input part of the train set first. As it
+is pretty large, I decided to store it in git-annex storage too.
+
+The evaluation results are slightly different:
+
+```
+$ geval -t dev-0
+79.86
+```
+
+This is not surprising, as a different seed was chosen (and fastText might
+not be deterministic itself).
+
+#### How did I actually upload this solution?
+
+I ran the `train.sh` script. All files except the model were added
+using the regular `git add` command:
+
+```
+git add code dev-0/out.tsv .dockerignore Dockerfile gonito.yaml predict.sh test-A/out.tsv train.sh
+```
+
+The model was added with `git annex`:
+
+```
+git annex add models/sentiment140.fasttext.bin
+```
+
+Then I committed the changes and pushed the files to the repo. Still,
+the model file had to be uploaded to the git-annex storage.
+I used a directory on a server to which I have access via SSH:
+
+```
+git annex initremote gonito type=rsync rsyncurl=gonito.vm.wmi.amu.edu.pl:/srv/http/annex encryption=none
+```
+
+I uploaded the file there:
+
+```
+git annex copy models/* --to gonito
+```
+
+The problem is that only I could download the files from this
+git-annex remote. In order to make them available to the whole world, I
+set up an HTTP server and served the files from there. The trick is to
+add the
+[httpalso](https://git-annex.branchable.com/special_remotes/httpalso/)
+special remote:
+
+```
+git annex initremote --sameas=gonito gonito-https type=httpalso url=https://gonito.vm.wmi.amu.edu.pl/annex
+```
+
+Finally, you need to synchronize the information about special remotes:
+
+```
+git annex sync -a --no-content
+```
+
+### Docker
+
+Still, the problem with reproducibility of the sample solution
+remains, as you must install the requirements: fastText (plus some Python
+modules for training). It's quite a big hassle if you consider that
+there might be a lot of different solutions, each with a different set of
+requirements.
+
+Docker containers might come in handy here. The idea is that a
+submitter should supply a Dockerfile meeting the following conditions:
+
+* it is defined as a multi-stage build;
+* at least 2 images are defined: `train` and `predict`;
+* `train` defines an environment required for training
+  - but training scripts and the data set should _not_ be included in the image,
+  - the image should be run with the solution directory mounted at `/workspace`,
+  - i.e. the following commands should run the training:
+
+```
+docker build . --target train -t foo-train
+
+docker run -v $(pwd):/workspace -it foo-train /workspace/train.sh
+```
+
+* `predict` defines a self-contained predictor
+  - in contrast to the `train` image, all the scripts and binaries needed
+    for the actual prediction should be there,
+  - … except for the models, which need to be supplied in the directory mounted
+    at `/workspace/models`, i.e. the following commands should just work:
+
+```
+docker build . --target predict -t foo-predict
+
+docker run -v $(pwd)/models:/workspace/models -i foo-predict
+```
+  - this way you can easily switch to another model without changing the base code.
+
+#### Back to the example
+
+And it works for the example given above. With one caveat: due to an
+unfortunate interaction between git-annex and Docker, you need to _unlock_ the model files
+before running the Docker container:
+
+```
+$ docker build . --target predict -t sentiment140-predict
+
+$ git annex unlock models/*
+
+$ echo -e '2021.99999\t20211231\tGEval is awesome!' | docker run -v $(pwd)/models:/workspace/models -i sentiment140-predict
+positive
+```
+
 ## `geval` options
 
 ```