gonito/README.md

139 lines
6.0 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

Gonito platform
===============
[Gonito](https://gonito.net) (pronounced _ɡɔ̃ˈɲitɔ_) is a Kaggle-like
platform for machine learning competitions (disclaimer: Gonito is
neither affiliated with nor endorsed by [Kaggle](https://www.kaggle.com)).
What's so special about Gonito:
* free & open-source (AGPL), you can use it your own, in your
company, at your university, etc.
* git-based (challenges and solutions are submitted only with git).
See the home page (and an instance of Gonito) at https://gonito.net .
Installation
------------
[Gonito](https://gonito.net) is written in [Haskell](https://www.haskell.org) and uses
[Yesod Web Framework](http://www.yesodweb.com/), but all you need is
just [the Stack tool](https://github.com/commercialhaskell/stack). See https://github.com/commercialhaskell/stack
for instruction how to install Stack on your computer.
By default, Gonito uses [Postgresql](http://www.postgresql.org/), so it needs to be installed and running at your computer.
After installing Stack:
createdb -E utf8 gonito
git clone git://gonito.net/geval
git clone git://gonito.net/gonito
cd gonito
stack setup
# before starting the build you might need some non-Haskell dependencies, e.g. in Ubuntu:
# sudo apt-get install libbz2-dev liblzma-dev libpcre3-dev libcairo-dev libfcgi-dev
stack build
stack exec yesod devel
The last command will start the Web server with Gonito (go to
http://127.0.0.1:3000 in your browser).
Gonito & git
------------
Gonito uses git in an inherent manner:
* challenges (data sets) are provided as git repositories,
* submissions are uploaded via git repositories, they are referred to with
git commit hashes.
Advantages:
* great flexibility as far as where you want to keep your challenges
and submissions (could be external, well-known services such as
GitHub or GitLab, your local git server, let's say gitolite or Gogs, or
just a disk accessible in a Gonito instance),
* even if Gonito ceases to exist, the challenges and submissions are still available
in a standard manner, provided that git repositories (be it external or local) are
accessible,
* data sets can be easily downloaded using the command line
(e.g. `git clone git://gonito.net/paranormal-or-skeptic`), without
even clicking anything in the Web browser,
* facilitates experiment repeatability and reproducibility (at worst
the system output is easily available via git)
* tools that were used to generate the output could be linked as git subrepositories
* some challenge/submission metadata are tracked in a Gonito-independent way
(within git commits),
* copying data can be avoided with git mechanisms (e.g. when the challenge is already
cloned, downloading specific submissions should be much quicker),
* large data sets and models could be stored if needed using mechanisms such as git-annex (see below).
### Commit structure
The following flow of git commits is recommended (though not required):
* the challenge without hidden data for main test sets (i.e. files such as `test-A/expected.tsv`)
should be pushed to the `master` branch
* the hidden files (`test-A/expected.tsv`) should be added in a
subsequent commit and pushed either to the `dont-peek` branch or a
`master` branch of a separate repository (if access to the hidden
data must be more strict),
* the submissions should be committed with the `master` branch as the
parent (or at least ancestor) commit and pushed to the same
repository as the challenge data (in some user-specific branch) or any other
repository (could be user-owned repositories)
* any subsequent submissions could be derived in a natural way from other git commits
(e.g. when a submission is improved, or even two approaches are merged)
* new versions of the challenge can be committed (a challenge can be updated at Gonito)
to the `master` (and `dont-peek`) branches
See also the following picture:
![Recommended commit structure](misc/commits.png)
### git-annex
In some cases, you don't want to store challenge/submissions files simply in git:
* very large data files, textual files (e.g. `train/in.tsv` even if
compressed as `train/in.tsv.xz`)
* binary training/testing data (PDF files, images, movies, recordings)
* data sensitive due to privacy/security concerns (a scenario where it's OK to store
metadata and some files in a widely accessible repository, but some files require
limited access)
* large ML models (note that Gonito does not require models for evaluation, but still
it might be a good practice to commit them along with output files and scripts)
Such cases can be handled in a natural manner using git-annex, a git
extension for handling files and their metadata without commiting
their content to the repository. The contents can be stored at a wide
range of [special
remotes](https://git-annex.branchable.com/special_remotes/), e.g. S3
buckets, WebDAV, rsync servers.
It's up to you which files are stored in git in a regular manner and
which are added with `git annex add`, but note that if a
challenge/submission file must be stored via git-annex and are required
for evaluation (e.g. `expected.tsv` files for the challenge or
`out.tsv` files for submissions), the git-annex special remote must be
given when a challenge is created or a submission is done and the
Gonito server must have access to such a special remote.
Authors
-------
* Filip Graliński
References
----------
@inproceedings{gralinski:2016:gonito,
title="{Gonito.net - Open Platform for Research Competition, Cooperation and Reproducibility}",
author={Grali{\'n}ski, Filip and Jaworski, Rafa{\l} and Borchmann, {\L}ukasz and Wierzcho{\'n}, Piotr},
booktitle="{Branco, Ant{\'o}nio and Nicoletta Calzolari and Khalid Choukri (eds.), Proceedings of the 4REAL Workshop: Workshop on Research Results Reproducibility and Resources Citation in Science and Technology of Language}",
pages={13--20},
year=2016,
url="http://4real.di.fc.ul.pt/wp-content/uploads/2016/04/4REALWorkshopProceedings.pdf"
}