Russian-Polish Opensubtitles
Translate subtitles from Russian into Polish.
Sources
The data set is based on the OpenSubtitles 2016 corpus prepared by Pierre Lison and Jörg Tiedemann.
Metric
BLEU is used as the evaluation metric.
Directory structure
- README.md: this file
- config.txt: GEval configuration file
- train/: directory with training data
- train/train.tsv: train set (Russian-Polish corpus of 2M sentence pairs; the Polish sentence is in the first column, the Russian sentence in the second)
- dev-0/: directory with development data (20K sentence pairs)
- dev-0/in.tsv: input data for the dev set (Russian utterances)
- dev-0/expected.tsv: expected (reference) data for the dev set (Polish utterances)
- test-A/: directory with test data (20K sentence pairs)
- test-A/in.tsv: input data for the test set
- test-A/expected.tsv: expected data for the test set (not available in the master branch)
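Because train/train.tsv stores the Polish sentence first and the Russian sentence second, the columns have to be swapped when the file is used for Russian-to-Polish training. The following is a minimal sketch of a loader, not part of the challenge itself; the function name `read_train_pairs` is hypothetical, and skipping lines that do not have exactly two fields is an assumption about how to handle malformed rows.

```python
from pathlib import Path
from typing import Iterator, Tuple


def read_train_pairs(path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (russian_source, polish_target) pairs from a train.tsv file.

    The file keeps Polish in column 1 and Russian in column 2, so the
    columns are swapped to match the translation direction (ru -> pl).
    """
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 2:
                # Assumption: silently skip rows without exactly two columns.
                continue
            polish, russian = fields
            yield russian, polish
```

A caller could iterate lazily, for example `for source, target in read_train_pairs(Path("train/train.tsv")): ...`, which avoids loading all 2M pairs into memory at once.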
The Polish utterances in {dev-0,test-A}/expected.tsv are tokenized (by the Moses tokenizer.perl script, with the -no-escape option), but the Russian utterances in all the files and the Polish utterances in train/train.tsv are NOT tokenized.
Note that some of the files have the .tsv extension even though they do not actually contain TABs; this is only for compatibility with the Gonito platform.
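Since in.tsv and expected.tsv are single-column files aligned line by line, a development pair can be recovered by reading both files in parallel. The sketch below assumes that alignment and raises an error when the line counts differ; the helper names `read_lines` and `read_dev_pairs` are hypothetical. It applies to dev-0 only in the master branch, because test-A/expected.tsv is not available there.

```python
from pathlib import Path
from typing import List, Tuple


def read_lines(path: Path) -> List[str]:
    """Read a one-sentence-per-line file, dropping the trailing newline."""
    with path.open(encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_dev_pairs(directory: Path) -> List[Tuple[str, str]]:
    """Return (russian_input, polish_reference) pairs for one data directory."""
    sources = read_lines(directory / "in.tsv")
    references = read_lines(directory / "expected.tsv")
    if len(sources) != len(references):
        raise ValueError(
            f"in.tsv has {len(sources)} lines, "
            f"expected.tsv has {len(references)}"
        )
    return list(zip(sources, references))
```

Keep in mind that the references are tokenized while the inputs are not, so system output generally needs the same tokenization before it is compared against expected.tsv.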