Russian-Polish Opensubtitles

Translate subtitles from Russian into Polish.

Sources

The data set is based on the Opensubtitles 2016 corpus prepared by Pierre Lison and Jörg Tiedemann.

Metric

BLEU is used as the evaluation metric.
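As a rough illustration of what the metric measures, here is a minimal corpus-level BLEU sketch. It assumes one reference per hypothesis, whitespace-separated tokens, n-grams up to length 4, and no smoothing. The function names are hypothetical, and this is not GEval's implementation, so its scores may differ from official ones; use GEval for any reported result.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Unsmoothed corpus-level BLEU with a single reference per hypothesis."""
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # n-grams produced by the system, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            # Counter intersection keeps the minimum count, i.e. clipping.
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())

    # Without smoothing, any zero precision makes the geometric mean zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Log of the brevity penalty: 0 when the output is at least as long
    # as the reference, negative otherwise.
    log_brevity = min(0.0, 1.0 - ref_len / hyp_len)
    return math.exp(log_brevity + log_precision)
```

A perfect match scores 1.0, and an output that is shorter than its reference is penalized even when every n-gram it contains is correct.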

Directory structure

README.md — this file
config.txt — GEval configuration file
train/ — directory with training data
train/train.tsv — train set (Russian-Polish corpus of 2M sentence pairs; the Polish sentence is in the first column and the Russian sentence is in the second)
dev-0/ — directory with development data (20K sentence pairs)
dev-0/in.tsv — input data for the dev set (Russian utterances)
dev-0/expected.tsv — expected (reference) data for the dev set (Polish utterances)
test-A/ — directory with test data (20K sentence pairs)
test-A/in.tsv — input data for the test set
test-A/expected.tsv — expected data for the test set (not available in the master branch)
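Because the column order in train/train.tsv (Polish first, Russian second) is the reverse of the translation direction, it is easy to swap source and target by accident. The sketch below reads the file and yields pairs in source-to-target order. The function name is hypothetical, and UTF-8 encoding and exactly two tab-separated columns per row are assumptions, not something this README states.

```python
from typing import Iterator, Tuple


def read_train_pairs(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (russian, polish) pairs from a train.tsv-style file.

    Assumes UTF-8 text with two tab-separated columns per line:
    Polish in the first column, Russian in the second.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip empty lines
            columns = line.split("\t")
            if len(columns) != 2:
                continue  # skip rows that do not have exactly two columns
            polish, russian = columns
            # Swap the order so the source language (Russian) comes first.
            yield russian, polish
```

Reading lazily with a generator avoids holding all 2M pairs in memory when only a sample is needed.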

The Polish utterances in {dev-0,test-A}/expected.tsv are tokenized (with the Moses tokenizer.perl script, using the -no-escape option), but the Russian utterances in all the files and the Polish utterances in train/train.tsv are NOT tokenized.

Note that some of the files have the .tsv extension even though they do not actually contain TABs; this is only for compatibility with the Gonito platform.
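Since in.tsv and expected.tsv hold one utterance per line and no TABs, they can be read as plain text. The sketch below pairs them up line by line. It assumes UTF-8 encoding and that line i of in.tsv corresponds to line i of expected.tsv; the function names are hypothetical.

```python
from typing import List, Tuple


def read_lines(path: str) -> List[str]:
    """Read a one-utterance-per-line file, dropping only the trailing newline."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_aligned(in_path: str, expected_path: str) -> List[Tuple[str, str]]:
    """Pair each input line with the reference line at the same position."""
    sources = read_lines(in_path)
    references = read_lines(expected_path)
    if len(sources) != len(references):
        raise ValueError(
            f"line count mismatch: {len(sources)} inputs vs {len(references)} references"
        )
    return list(zip(sources, references))
```

For example, `read_aligned("dev-0/in.tsv", "dev-0/expected.tsv")` would return the development set as (Russian, Polish) pairs, with the Polish side already tokenized.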