wiktor7245 70d1db9dfd initial commit,low bleu, but who cares 2021-01-29 13:50:33 +01:00
dev-0 initial commit,low bleu, but who cares 2021-01-29 13:32:51 +01:00
test-A initial commit,low bleu, but who cares 2021-01-29 13:32:51 +01:00
train init 2016-10-03 20:41:59 +02:00
.gitignore init 2016-10-03 20:41:59 +02:00
Gru_ru_pl.ipynb initial commit,low bleu, but who cares 2021-01-29 13:50:33 +01:00
README.md fix typo 2016-10-03 21:25:04 +02:00
config.txt init 2016-10-03 20:41:59 +02:00


Russian-Polish Opensubtitles

Translate subtitles from Russian into Polish.

Sources

The data set is based on the Opensubtitles 2016 corpus prepared by Pierre Lison and Jörg Tiedemann.

Metric

BLEU is used as the evaluation metric.
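As a rough illustration of what the metric measures, here is a simplified corpus-level BLEU sketch in Python. It assumes a single reference per sentence, whitespace-separated tokens, n-grams up to length 4, and no smoothing. It is not GEval's implementation, and the function names `ngrams` and `corpus_bleu` are hypothetical, so scores may differ from the official ones.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (single reference, no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: a hypothesis n-gram counts at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_precision)


score = corpus_bleu(["a b c d"], ["a b c d e"])
```

In this toy call every n-gram of the hypothesis matches, so the score is determined by the brevity penalty alone.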

Directory structure

  • README.md — this file
  • config.txt — GEval configuration file
  • train/ — directory with training data
  • train/train.tsv — train set (Russian-Polish corpus of 2M sentence pairs; the Polish sentence is in the first column and the Russian sentence in the second)
  • dev-0/ — directory with development data (20K sentence pairs)
  • dev-0/in.tsv — input data for the dev set (Russian utterances)
  • dev-0/expected.tsv — expected (reference) data for the dev set (Polish utterances)
  • test-A/ — directory with test data (20K sentence pairs)
  • test-A/in.tsv — input data for the test set
  • test-A/expected.tsv — expected data for the test set (not available in the master branch)
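The column order in train/train.tsv (Polish first, Russian second) is the reverse of the translation direction, so it is easy to mix up. A minimal Python sketch for reading the pairs as (source, target) follows. The helper name `read_train_pairs` and the two sample lines are hypothetical illustrations, not taken from the corpus, and skipping lines that do not split into exactly two columns is an assumption about how to handle malformed input.

```python
def read_train_pairs(lines):
    """Yield (russian, polish) pairs from train.tsv-style lines."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # assumption: silently skip malformed lines
        polish, russian = parts  # Polish is column 1, Russian is column 2
        yield russian, polish


# Made-up sample lines in the same layout as train/train.tsv.
sample = ["Dzień dobry.\tДобрый день.\n", "Dziękuję.\tСпасибо.\n"]
pairs = list(read_train_pairs(sample))
```

For the real data, the same function can be fed an open file, for example `open("train/train.tsv", encoding="utf-8")`. Plain string splitting is used instead of a CSV parser so that quotation marks in the subtitles are left untouched.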

The Polish utterances in {dev-0,test-A}/expected.tsv are tokenized (by the Moses tokenizer.perl script, with the -no-escape option), but the Russian utterances in all the files and the Polish utterances in train/train.tsv are NOT tokenized.
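Because the references are tokenized, system output presumably needs to be tokenized the same way before scoring; otherwise a word with attached punctuation will not match the reference token. The Python sketch below shows the effect with a deliberately crude stand-in tokenizer. The function `naive_tokenize` is hypothetical and is not equivalent to Moses tokenizer.perl (for instance, it splits an ellipsis into three separate dots), so the real script should be used for actual submissions.

```python
import re


def naive_tokenize(text):
    """Crude stand-in for a real tokenizer: split off basic punctuation."""
    spaced = re.sub(r"([.,!?;:])", r" \1 ", text)
    return " ".join(spaced.split())  # collapse repeated whitespace


raw = "Dzień dobry, panie."
tokenized = naive_tokenize(raw)

# Token overlap with a tokenized reference, before and after tokenization.
reference = "Dzień dobry , panie .".split()
overlap_raw = len(set(raw.split()) & set(reference))
overlap_tokenized = len(set(tokenized.split()) & set(reference))
```

In this toy example the untokenized string shares only one token with the reference, while the tokenized one shares all five.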

Note that some of the files have the .tsv extension even though they do not actually contain TABs; this is only for compatibility with the Gonito platform.