init
commit
ea840a0351
@@ -0,0 +1,2 @@
*~
temp/
@@ -0,0 +1,41 @@
Russian-Polish OpenSubtitles
============================

Translate subtitles from Russian into Polish.

Sources
-------

The data set is based on the
[OpenSubtitles 2016](http://opus.lingfil.uu.se/OpenSubtitles2016.php)
corpus [prepared by Pierre Lison and Jörg Tiedemann](http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf).

Metric
------

[BLEU](https://en.wikipedia.org/wiki/BLEU) is used as the evaluation metric.
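
To make the metric concrete, here is a minimal sketch of corpus-level BLEU in plain Python. It assumes whitespace-separated tokens, a single reference per sentence, n-grams up to order 4 and no smoothing. The names `ngrams` and `corpus_bleu` are made up for this illustration, and the scores may differ from what GEval reports, since its exact tokenization and options are not described here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU: one reference per hypothesis, no smoothing."""
    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Counter intersection clips each count by the reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

Because the reference files are tokenized, a system's output would presumably need comparable tokenization before a score computed this way is meaningful.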

Directory structure
-------------------

* `README.md` — this file
* `config.txt` — GEval configuration file
* `train/` — directory with training data
* `train/train.tsv` — train set (Russian-Polish corpus of 2M sentence pairs; the Polish sentence is in the first column, the Russian sentence in the second)
* `dev-0/` — directory with development data (20K sentence pairs)
* `dev-0/in.tsv` — input data for the dev set (Russian utterances)
* `dev-0/expected.tsv` — expected (reference) data for the dev set (Polish utterances)
* `test-A/` — directory with test data (20K sentence pairs)
* `test-A/in.tsv` — input data for the test set
* `test-A/expected.tsv` — expected data for the test set (not available in the master branch)
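
The layout above can be read with a few lines of Python. This is only a sketch under stated assumptions: the files are UTF-8, contain no quoting, and `in.tsv` and `expected.tsv` are aligned line by line. The helper names `read_train`, `read_lines` and `read_eval_set` are hypothetical.

```python
from pathlib import Path


def read_train(path):
    """Yield (russian, polish) pairs; Polish is the first column in train.tsv."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Raises ValueError if a line has no TAB separator.
            polish, russian = line.rstrip("\n").split("\t", 1)
            yield russian, polish


def read_lines(path):
    """Read a one-utterance-per-line file into a list of strings."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_eval_set(directory):
    """Pair each line of in.tsv with the same line of expected.tsv."""
    directory = Path(directory)
    sources = read_lines(directory / "in.tsv")
    references = read_lines(directory / "expected.tsv")
    if len(sources) != len(references):
        raise ValueError("in.tsv and expected.tsv differ in length")
    return list(zip(sources, references))
```

Note that `read_train` swaps the columns so that each pair comes out in source-then-target order for the Russian-to-Polish direction.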

The Polish utterances in `{dev-0,test-A}/expected.tsv` are tokenized
(with the [Moses](http://www.statmt.org/moses/) `tokenizer.perl` script and its
`-no-escape` option), but the Russian utterances in all the files and the
Polish utterances in `train/train.tsv` are *NOT* tokenized.

Note that some of the files have the `.tsv` extension even though they
do not actually contain TABs; this is just for compatibility with the
[Gonito](http://gonito.net) platform.
@@ -0,0 +1 @@
--metric BLEU --precision 4
File diff suppressed because it is too large
Binary file not shown.