Russian-Polish Opensubtitles
============================

Translate subtitles from Russian into Polish.

Sources
-------

The data set is based on the [OpenSubtitles 2016](http://opus.lingfil.uu.se/OpenSubtitles2016.php) corpus
[prepared by Pierre Lison and Jörg Tiedemann](http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf).

Metric
------

[BLEU](https://en.wikipedia.org/wiki/BLEU) is used as the evaluation metric.

Directory structure
-------------------

* `README.md` - this file
* `config.txt` - GEval configuration file
* `train/` - directory with training data
* `train/train.tsv` - train set (Russian-Polish corpus of 2M sentence pairs; the Polish sentence is in the first column, the Russian sentence in the second)
* `dev-0/` - directory with development data (20K sentence pairs)
* `dev-0/in.tsv` - input data for the dev set (Russian utterances)
* `dev-0/expected.tsv` - expected (reference) data for the dev set (Polish utterances)
* `test-A/` - directory with test data (20K sentence pairs)
* `test-A/in.tsv` - input data for the test set
* `test-A/expected.tsv` - expected data for the test set (not available in the master branch)

The Polish utterances in `{dev-0,test-A}/expected.tsv` are tokenized
(with the [Moses](http://www.statmt.org/moses/) `tokenizer.perl` script, run with the `-no-escape` option),
but the Russian utterances in all files and the Polish utterances in
`train/train.tsv` are *NOT* tokenized.

Note that some of the files have the `.tsv` extension even though they
do not contain TABs; this is only for compatibility with
the [Gonito](http://gonito.net) platform.
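
The file layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of the challenge tooling: the helper names (`parse_train_lines`, `pair_lines`, `read_split`) are hypothetical, and it assumes UTF-8 files with one utterance per line and that malformed training lines can simply be skipped.

```python
from pathlib import Path
from typing import Iterable, Iterator, List, Tuple


def parse_train_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (russian, polish) pairs from lines of train/train.tsv.

    train.tsv stores the Polish sentence in the first column and the
    Russian sentence in the second, so the columns are swapped here to
    get (source, target) order.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # assumption: lines without exactly two columns are skipped
        polish, russian = fields
        yield russian, polish


def pair_lines(src_lines: Iterable[str], ref_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair input lines (Russian) with expected lines (Polish), line by line."""
    src = [line.rstrip("\n") for line in src_lines]
    ref = [line.rstrip("\n") for line in ref_lines]
    if len(src) != len(ref):
        raise ValueError(f"line count mismatch: {len(src)} inputs vs {len(ref)} references")
    return list(zip(src, ref))


def read_split(directory: str) -> List[Tuple[str, str]]:
    """Read in.tsv and expected.tsv from a split directory such as dev-0/."""
    base = Path(directory)
    with open(base / "in.tsv", encoding="utf-8") as src, \
         open(base / "expected.tsv", encoding="utf-8") as ref:
        return pair_lines(src, ref)
```

For example, `read_split("dev-0")` would return the development pairs, and `parse_train_lines(open("train/train.tsv", encoding="utf-8"))` would stream the training pairs. Keep in mind that the Polish side of `expected.tsv` is tokenized while the Polish side of `train/train.tsv` is not, so system output may need the same tokenization before it is compared with the references.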