Russian-Polish Opensubtitles
============================

Translate subtitles from Russian into Polish.

Sources
-------

The data set is based on the
[Opensubtitles 2016](http://opus.lingfil.uu.se/OpenSubtitles2016.php)
corpus [prepared by Pierre Lison and Jörg Tiedemann](http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf).

Metric
------

[BLEU](https://en.wikipedia.org/wiki/BLEU) is used as the evaluation metric.
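
For orientation, here is a minimal sketch of corpus-level BLEU with a single reference per sentence, no smoothing and plain whitespace tokenization. The function names (`ngrams`, `corpus_bleu`) are illustrative and not part of this repository, and the implementation used for the official evaluation may differ in details, so treat the numbers it produces as approximate.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis, no smoothing.

    Both arguments are lists of strings that are split on whitespace.
    """
    matches = [0] * max_n  # clipped n-gram matches, one slot per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, one slot per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Counter intersection clips each count by its count in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, a zero match count at any order gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

The score is in the range 0 to 1; many tools report the same quantity multiplied by 100.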


Directory structure
-------------------

* `README.md` — this file
* `config.txt` — GEval configuration file
* `train/` — directory with training data
* `train/train.tsv` — training set (Russian-Polish corpus of 2M sentence pairs; the Polish sentence is in the first column and the Russian sentence in the second)
* `dev-0/` — directory with development data (20K sentence pairs)
* `dev-0/in.tsv` — input data for the dev set (Russian utterances)
* `dev-0/expected.tsv` — expected (reference) data for the dev set (Polish utterances)
* `test-A/` — directory with test data (20K sentence pairs)
* `test-A/in.tsv` — input data for the test set
* `test-A/expected.tsv` — expected data for the test set (not available in the master branch)
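
Because the column order in `train/train.tsv` (Polish first, Russian second) is the reverse of the translation direction, it is easy to swap source and target by accident. The sketch below reads the file into (Russian, Polish) pairs. The function name `read_train_pairs`, the UTF-8 encoding and the decision to skip lines that do not have exactly two columns are assumptions made for illustration, not documented properties of the data.

```python
def read_train_pairs(path="train/train.tsv"):
    """Yield (russian, polish) pairs from the two-column training file."""
    # UTF-8 is an assumption; adjust if the file uses another encoding.
    with open(path, encoding="utf-8") as f:
        for line in f:
            columns = line.rstrip("\n").split("\t")
            if len(columns) != 2:
                # Assumption: silently skip malformed lines.
                continue
            polish, russian = columns  # column 1 is Polish, column 2 is Russian
            yield russian, polish      # return in source-to-target order
```

Since the function is a generator, the 2M pairs can be streamed without loading the whole file into memory.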

The Polish utterances in `{dev-0,test-A}/expected.tsv` are tokenized
(with the [Moses](http://www.statmt.org/moses/) `tokenizer.perl` script, using the
`-no-escape` option), but the Russian utterances in all files and the
Polish utterances in `train/train.tsv` are *NOT* tokenized.

Note that some of the files have the `.tsv` extension even though they
do not actually contain TABs; this is just for compatibility with the
[Gonito](http://gonito.net) platform.
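
In practice, the single-column `in.tsv` and `expected.tsv` files can be read as plain text with one utterance per line. The sketch below pairs the two files for a given directory. It assumes that line *i* of `in.tsv` corresponds to line *i* of `expected.tsv` and that the files are UTF-8 encoded; the names `read_lines` and `read_pairs` are illustrative and not part of this repository.

```python
import os


def read_lines(path):
    """Return the lines of a one-utterance-per-line file, without newlines."""
    # UTF-8 is an assumption; adjust if the files use another encoding.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def read_pairs(directory="dev-0"):
    """Return (source, reference) pairs for a directory such as dev-0."""
    sources = read_lines(os.path.join(directory, "in.tsv"))
    references = read_lines(os.path.join(directory, "expected.tsv"))
    # Assumption: the two files are aligned line by line.
    if len(sources) != len(references):
        raise ValueError(
            f"{directory}: {len(sources)} inputs but {len(references)} references"
        )
    return list(zip(sources, references))
```

The references returned for `dev-0` are tokenized while the inputs are not, so system output generally needs the same tokenization before it is compared with them.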