TAU 28 Marian s2s, tokenizer, truecaser
This commit is contained in:
commit
f393c9cd67
45
README.md
Normal file
@@ -0,0 +1,45 @@
WMT2017 Czech-English machine translation challenge for news
=============================================================

Translate news articles from Czech into English.

This is the WMT2017 news challenge reformatted as a Gonito.net challenge;
all the data were taken from <http://www.statmt.org/wmt17/translation-task.html>.

BLEU is used as the evaluation metric.

Directory structure
-------------------

* `README.md` — this file
* `config.txt` — configuration file
* `train/` — directory with training data
* `train/commoncrawl.tsv.xz` — Common Crawl parallel corpus
* `train/news-commentary-v12.tsv.xz` — News Commentary parallel corpus
* `train/europarl-v7.tsv.xz` — Europarl parallel corpus
* `dev-0/` — directory with dev (test) data (Newstest 2013)
* `dev-0/in.tsv` — Czech input text for the dev set
* `dev-0/expected.tsv` — English reference translations for the dev set
* `test-A/` — directory with test data
* `test-A/in.tsv` — Czech input data for the test set (WMT2017 test set)
* `test-A/expected.tsv` — English reference translations for the test set
Training sets
-------------

All training sets were compressed with xz; use `xzcat` to decompress:

    $ xzcat train/*.tsv.xz | ...

Pairs where the Czech or English side is empty were removed from the
training sets.
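The filtering described above can be reproduced directly with Python's standard library; a minimal sketch (the `read_pairs` helper is mine, not part of this repo) that streams an xz-compressed TSV and skips pairs with an empty side:

```python
import lzma

def read_pairs(path):
    """Yield (cs, en) pairs from an xz-compressed TSV,
    skipping lines where either side is empty."""
    with lzma.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 2 and fields[0] and fields[1]:
                yield fields[0], fields[1]

# Example usage:
# pairs = list(read_pairs("train/europarl-v7.tsv.xz"))
```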
Test sets
---------

Reference English translations in the dev and test sets are not tokenised.

Monolingual data
----------------

Monolingual data was not included here.
1
config.txt
Normal file
@@ -0,0 +1 @@
--metric BLEU --metric GLEU --tokenizer 13a --precision 4
3000
dev-0/expected.tsv
Normal file
File diff suppressed because it is too large
3000
dev-0/in.tc.cs
Normal file
File diff suppressed because it is too large
3000
dev-0/in.tc.cs.output
Normal file
File diff suppressed because it is too large
3000
dev-0/in.tok.tsv
Normal file
File diff suppressed because it is too large
3000
dev-0/in.tsv
Normal file
File diff suppressed because it is too large
3000
dev-0/out.tsv
Normal file
File diff suppressed because it is too large
48
preprocess-data_dev-0_test-A.sh
Executable file
@@ -0,0 +1,48 @@
#!/bin/bash -v

# This script preprocesses the dev and test inputs (tokenization and
# truecasing), adapted from the Marian sample preprocessing script.
# For application to a different language pair, change the source and target
# suffixes, optionally the number of BPE operations, and the file names
# (currently, dev-0/in.tsv and test-A/in.tsv are being processed).

# Suffix of source language files
SRC=cs
# Suffix of target language files
TRG=en

# Number of merge operations. Network vocabulary should be slightly larger (to
# include characters), or smaller if the operations are learned on the joint
# vocabulary
bpe_operations=85000

# path to moses decoder: https://github.com/moses-smt/mosesdecoder
mosesdecoder=/home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts

# path to subword segmentation scripts: https://github.com/rsennrich/subword-nmt
subword_nmt=/home/jakub/TAU/TAU_24/marian-dev/examples/tools/subword-nmt

# tokenize
cat dev-0/in.tsv \
    | $mosesdecoder/scripts/tokenizer/tokenizer.perl -a -l $SRC > dev-0/in.tok.tsv

cat test-A/in.tsv \
    | $mosesdecoder/scripts/tokenizer/tokenizer.perl -a -l $SRC > test-A/in.tok.tsv

# apply truecaser (dev/test files)
for prefix in dev-0/in test-A/in
do
    $mosesdecoder/scripts/recaser/truecase.perl -model preprocess-train/model/tc.$SRC < $prefix.tok.tsv > $prefix.tc.$SRC
done
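The truecasing step above maps each token to its preferred surface casing as learned from the training corpus. A toy stdlib sketch of the idea (this is not the Moses model format; `train_truecaser` and `truecase` are hypothetical helper names, and only the sentence-initial token is recased here):

```python
from collections import Counter, defaultdict

def train_truecaser(sentences):
    """Count surface forms per lowercased token; sentence-initial tokens
    are skipped, since their casing is dictated by position."""
    forms = defaultdict(Counter)
    for sent in sentences:
        for tok in sent.split()[1:]:
            forms[tok.lower()][tok] += 1
    return {low: c.most_common(1)[0][0] for low, c in forms.items()}

def truecase(sentence, model):
    """Recase the sentence-initial token to its most frequent form."""
    toks = sentence.split()
    if toks:
        toks[0] = model.get(toks[0].lower(), toks[0])
    return " ".join(toks)
```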
1720
preprocess-train/model/train.log
Normal file
File diff suppressed because it is too large
36
preprocess-train/model/valid.log
Normal file
@@ -0,0 +1,36 @@
[2020-01-24 03:53:38] [valid] Ep. 1 : Up. 10000 : cross-entropy : 108.847 : new best
[2020-01-24 03:56:42] [valid] Ep. 1 : Up. 10000 : translation : 7.52 : new best
[2020-01-24 05:33:26] [valid] Ep. 2 : Up. 20000 : cross-entropy : 94.6385 : new best
[2020-01-24 05:36:21] [valid] Ep. 2 : Up. 20000 : translation : 12.28 : new best
[2020-01-24 07:13:19] [valid] Ep. 3 : Up. 30000 : cross-entropy : 89.3 : new best
[2020-01-24 07:16:05] [valid] Ep. 3 : Up. 30000 : translation : 14.42 : new best
[2020-01-24 08:52:52] [valid] Ep. 4 : Up. 40000 : cross-entropy : 86.3268 : new best
[2020-01-24 08:55:27] [valid] Ep. 4 : Up. 40000 : translation : 15.55 : new best
[2020-01-24 12:16:53] [valid] Ep. 5 : Up. 50000 : cross-entropy : 84.0588 : new best
[2020-01-24 12:19:29] [valid] Ep. 5 : Up. 50000 : translation : 15.5 : stalled 1 times (last best: 15.55)
[2020-01-24 13:58:10] [valid] Ep. 5 : Up. 60000 : cross-entropy : 82.5176 : new best
[2020-01-24 14:01:02] [valid] Ep. 5 : Up. 60000 : translation : 16.08 : new best
[2020-01-24 14:07:51] [valid] Ep. 6 : Up. 60621 : cross-entropy : 82.5103 : new best
[2020-01-24 14:10:42] [valid] Ep. 6 : Up. 60621 : translation : 15.87 : stalled 1 times (last best: 16.08)
[2020-01-24 16:26:57] [valid] Ep. 6 : Up. 70000 : cross-entropy : 82.2896 : new best
[2020-01-24 16:29:33] [valid] Ep. 6 : Up. 70000 : translation : 16.03 : stalled 2 times (last best: 16.08)
[2020-01-24 18:09:49] [valid] Ep. 7 : Up. 80000 : cross-entropy : 82.905 : stalled 1 times (last best: 82.2896)
[2020-01-24 18:12:31] [valid] Ep. 7 : Up. 80000 : translation : 16.49 : new best
[2020-01-24 19:53:05] [valid] Ep. 8 : Up. 90000 : cross-entropy : 83.1733 : stalled 2 times (last best: 82.2896)
[2020-01-24 19:55:48] [valid] Ep. 8 : Up. 90000 : translation : 16.18 : stalled 1 times (last best: 16.49)
[2020-01-24 21:36:56] [valid] Ep. 9 : Up. 100000 : cross-entropy : 84.6166 : stalled 3 times (last best: 82.2896)
[2020-01-24 21:39:23] [valid] Ep. 9 : Up. 100000 : translation : 16.42 : stalled 2 times (last best: 16.49)
[2020-01-24 23:16:42] [valid] Ep. 10 : Up. 110000 : cross-entropy : 85.5421 : stalled 4 times (last best: 82.2896)
[2020-01-24 23:19:12] [valid] Ep. 10 : Up. 110000 : translation : 16.23 : stalled 3 times (last best: 16.49)
[2020-01-25 00:57:49] [valid] Ep. 10 : Up. 120000 : cross-entropy : 86.5355 : stalled 5 times (last best: 82.2896)
[2020-01-25 01:00:23] [valid] Ep. 10 : Up. 120000 : translation : 15.92 : stalled 4 times (last best: 16.49)
[2020-01-25 02:37:11] [valid] Ep. 11 : Up. 130000 : cross-entropy : 88.0175 : stalled 6 times (last best: 82.2896)
[2020-01-25 02:39:36] [valid] Ep. 11 : Up. 130000 : translation : 15.98 : stalled 5 times (last best: 16.49)
[2020-01-25 04:16:32] [valid] Ep. 12 : Up. 140000 : cross-entropy : 89.7736 : stalled 7 times (last best: 82.2896)
[2020-01-25 04:18:58] [valid] Ep. 12 : Up. 140000 : translation : 15.72 : stalled 6 times (last best: 16.49)
[2020-01-25 05:55:59] [valid] Ep. 13 : Up. 150000 : cross-entropy : 91.4267 : stalled 8 times (last best: 82.2896)
[2020-01-25 05:58:28] [valid] Ep. 13 : Up. 150000 : translation : 15.43 : stalled 7 times (last best: 16.49)
[2020-01-25 07:35:33] [valid] Ep. 14 : Up. 160000 : cross-entropy : 93.3989 : stalled 9 times (last best: 82.2896)
[2020-01-25 07:38:02] [valid] Ep. 14 : Up. 160000 : translation : 15.24 : stalled 8 times (last best: 16.49)
[2020-01-25 09:15:09] [valid] Ep. 15 : Up. 170000 : cross-entropy : 94.6314 : stalled 10 times (last best: 82.2896)
[2020-01-25 09:17:37] [valid] Ep. 15 : Up. 170000 : translation : 15.18 : stalled 9 times (last best: 16.49)
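The "stalled N times" counters in this log feed Marian's early stopping: the training script passes `--early-stopping 10`, so training stops once a validation metric has failed to improve 10 times in a row. A minimal sketch of that stopping rule (my own illustration, assuming higher scores are better):

```python
def should_stop(scores, patience=10):
    """Return True if `patience` consecutive validations failed to
    improve on the best score seen so far (higher is better)."""
    best = float("-inf")
    stalled = 0
    for s in scores:
        if s > best:
            best = s
            stalled = 0
        else:
            stalled += 1
            if stalled >= patience:
                return True
    return False
```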
62
preprocess-train/preprocess-data.sh
Executable file
@@ -0,0 +1,62 @@
#!/bin/bash -v

# This script preprocesses the training corpus (tokenization, cleaning,
# and truecasing), adapted from the Marian sample preprocessing script.
# For application to a different language pair, change the source and target
# suffixes, optionally the number of BPE operations, and the file names
# (currently, data/europarl and data/valideuroparl are being processed).

# Suffix of source language files
SRC=cs
# Suffix of target language files
TRG=en

# Number of merge operations. Network vocabulary should be slightly larger (to
# include characters), or smaller if the operations are learned on the joint
# vocabulary
bpe_operations=85000

# path to moses decoder: https://github.com/moses-smt/mosesdecoder
mosesdecoder=/home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts

# path to subword segmentation scripts: https://github.com/rsennrich/subword-nmt
subword_nmt=/home/jakub/TAU/TAU_24/marian-dev/examples/tools/subword-nmt

# tokenize
for prefix in europarl valideuroparl
do
    cat data/$prefix.$SRC \
        | $mosesdecoder/scripts/tokenizer/tokenizer.perl -a -l $SRC > data/$prefix.tok.$SRC

    cat data/$prefix.$TRG \
        | $mosesdecoder/scripts/tokenizer/tokenizer.perl -a -l $TRG > data/$prefix.tok.$TRG
done

# clean empty and long sentences, and sentences with high source-target ratio (training corpus only)
$mosesdecoder/scripts/training/clean-corpus-n.perl data/europarl.tok $SRC $TRG data/europarl.tok.clean 1 80

# train truecaser
$mosesdecoder/scripts/recaser/train-truecaser.perl -corpus data/europarl.tok.clean.$SRC -model model/tc.$SRC
$mosesdecoder/scripts/recaser/train-truecaser.perl -corpus data/europarl.tok.clean.$TRG -model model/tc.$TRG

# apply truecaser (cleaned training corpus)
for prefix in europarl
do
    $mosesdecoder/scripts/recaser/truecase.perl -model model/tc.$SRC < data/$prefix.tok.clean.$SRC > data/$prefix.tc.$SRC
    $mosesdecoder/scripts/recaser/truecase.perl -model model/tc.$TRG < data/$prefix.tok.clean.$TRG > data/$prefix.tc.$TRG
done

# apply truecaser (dev/test files)
for prefix in valideuroparl
do
    $mosesdecoder/scripts/recaser/truecase.perl -model model/tc.$SRC < data/$prefix.tok.$SRC > data/$prefix.tc.$SRC
    $mosesdecoder/scripts/recaser/truecase.perl -model model/tc.$TRG < data/$prefix.tok.$TRG > data/$prefix.tc.$TRG
done
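The script above defines `bpe_operations` (carried over from the Marian sample) but never applies BPE. For reference, BPE learns its merge operations by repeatedly joining the most frequent adjacent symbol pair in the vocabulary; a toy stdlib sketch of that learning loop (my own illustration, not subword-nmt's actual implementation, which also handles end-of-word markers):

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dict.
    Each word is treated as a sequence of single-character symbols."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for sym, freq in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        merged = {}
        for sym, freq in vocab.items():
            out, i = [], 0
            while i < len(sym):
                if i < len(sym) - 1 and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1])
                    i += 2
                else:
                    out.append(sym[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges
```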
45
preprocess-train/scripts/average.py
Executable file
@@ -0,0 +1,45 @@
#!/usr/bin/env python

from __future__ import print_function

import os
import sys
import argparse

import numpy as np

# Parse arguments.
parser = argparse.ArgumentParser()
parser.add_argument('-m', '--model', nargs='+', required=True,
                    help="Models to average")
parser.add_argument('-o', '--output', required=True,
                    help="Output path")
args = parser.parse_args()

# *average* holds the accumulated parameter matrices.
average = dict()
# No. of models.
n = len(args.model)

for filename in args.model:
    print("Loading {}".format(filename))
    with open(filename, "rb") as mfile:
        # Loads matrices from the model file.
        m = np.load(mfile)
        for k in m:
            if k != "history_errs":
                # Initialize the key.
                if k not in average:
                    average[k] = m[k]
                # Add to the appropriate value.
                elif average[k].shape == m[k].shape and "special" not in k:
                    average[k] += m[k]

# Actual averaging.
for k in average:
    if "special" not in k:
        average[k] /= n

# Save averaged model to file.
print("Saving to {}".format(args.output))
np.savez(args.output, **average)
20
preprocess-train/scripts/remove-diacritics.py
Executable file
@@ -0,0 +1,20 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Author: Barry Haddow
# Distributed under MIT license

# Remove Romanian diacritics. Assumes s-comma and t-comma are normalised

import io
import sys
istream = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
ostream = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

for line in istream:
    line = line.replace("\u0218", "S").replace("\u0219", "s")  # s-comma
    line = line.replace("\u021a", "T").replace("\u021b", "t")  # t-comma
    line = line.replace("\u0102", "A").replace("\u0103", "a")  # A-breve
    line = line.replace("\u00C2", "A").replace("\u00E2", "a")  # A-circumflex
    line = line.replace("\u00CE", "I").replace("\u00EE", "i")  # I-circumflex
    ostream.write(line)
8
preprocess-train/scripts/validate.sh
Executable file
@@ -0,0 +1,8 @@
#!/bin/bash

# Postprocess translations (undo BPE, detruecase, detokenize) and print the BLEU score.
cat $1 \
    | sed 's/\@\@ //g' \
    | /home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts/scripts/recaser/detruecase.perl 2> /dev/null \
    | /home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts/scripts/tokenizer/detokenizer.perl -l en 2>/dev/null \
    | /home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts/scripts/generic/multi-bleu-detok.perl preprocess-train/data/valideuroparl.en \
    | sed -r 's/BLEU = ([0-9.]+),.*/\1/'
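The `sed 's/\@\@ //g'` step above undoes BPE segmentation, whose non-final subword units are marked with a trailing `@@`. The same operation in Python, as a quick check of what the pipeline does (the `undo_bpe` helper is my own name):

```python
import re

def undo_bpe(line):
    """Rejoin BPE subword units marked with a trailing '@@'."""
    return re.sub(r"@@ ", "", line)
```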
6
scores_geval.txt
Normal file
@@ -0,0 +1,6 @@
BLEU 0.1547
GLEU 0.2063
BLEU 0.1547
GLEU 0.2063
BLEU 0.1592
GLEU 0.2100
3005
test-A/in.tc.cs
Normal file
File diff suppressed because it is too large
0
test-A/in.tc.cs.output
Normal file
3005
test-A/in.tok.tsv
Normal file
File diff suppressed because it is too large
3005
test-A/in.tsv
Normal file
File diff suppressed because it is too large
0
test-A/out.tsv
Normal file
83
train-and-translate.sh
Executable file
@@ -0,0 +1,83 @@
#!/bin/bash -v

MARIAN=/home/jakub/TAU/TAU_24/marian-dev/build

# if we are in WSL, we need to add '.exe' to the tool names
if [ -e "/bin/wslpath" ]
then
    EXT=.exe
fi

MARIAN_TRAIN=$MARIAN/marian$EXT
MARIAN_DECODER=$MARIAN/marian-decoder$EXT
MARIAN_VOCAB=$MARIAN/marian-vocab$EXT
MARIAN_SCORER=$MARIAN/marian-scorer$EXT

# set chosen gpus
GPUS=0
if [ $# -ne 0 ]
then
    GPUS=$@
fi
echo Using GPUs: $GPUS

if [ ! -e $MARIAN_TRAIN ]
then
    echo "marian is not installed in $MARIAN, you need to compile the toolkit first"
    exit 1
fi

# create common vocabulary
if [ ! -e "preprocess-train/model/vocab.cs.yml" ] || [ ! -e "preprocess-train/model/vocab.en.yml" ]
then
    cat preprocess-train/data/europarl.tc.cs preprocess-train/data/valideuroparl.tc.cs | $MARIAN_VOCAB --max-size 50000 > preprocess-train/model/vocab.cs.yml
    cat preprocess-train/data/europarl.tc.en preprocess-train/data/valideuroparl.tc.en | $MARIAN_VOCAB --max-size 50000 > preprocess-train/model/vocab.en.yml
fi

# train model
if [ ! -e "preprocess-train/model/model.npz.best-translation.npz" ]
then
    $MARIAN_TRAIN \
        --devices $GPUS \
        --type s2s \
        --early-stopping 10 -e 15 \
        --model preprocess-train/model/model.npz \
        --train-sets preprocess-train/data/europarl.tc.cs preprocess-train/data/europarl.tc.en \
        --vocabs preprocess-train/model/vocab.cs.yml preprocess-train/model/vocab.en.yml \
        --dim-vocabs 50000 50000 \
        --mini-batch-fit -w 1024 \
        --valid-freq 10000 --save-freq 10000 --disp-freq 1000 \
        --valid-mini-batch 23 --valid-max-length 100 \
        --valid-metrics cross-entropy translation \
        --valid-sets preprocess-train/data/valideuroparl.tc.cs preprocess-train/data/valideuroparl.tc.en \
        --valid-script-path "bash ./preprocess-train/scripts/validate.sh" \
        --log preprocess-train/model/train.log --valid-log preprocess-train/model/valid.log \
        --overwrite --keep-best
fi

# translate dev set
cat dev-0/in.tc.cs \
    | $MARIAN_DECODER -c preprocess-train/model/model.npz.best-translation.npz.decoder.yml -d $GPUS -b 12 -n1 \
      --mini-batch 64 --maxi-batch 10 --maxi-batch-sort src \
    | sed 's/\@\@ //g' \
    | /home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts/scripts/recaser/detruecase.perl \
    | /home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts/scripts/tokenizer/detokenizer.perl -l en \
    > dev-0/in.tc.cs.output
cp dev-0/in.tc.cs.output dev-0/out.tsv

# translate test set
cat test-A/in.tc.cs \
    | $MARIAN_DECODER -c preprocess-train/model/model.npz.best-translation.npz.decoder.yml -d $GPUS -b 12 -n1 \
      --mini-batch 64 --maxi-batch 10 --maxi-batch-sort src \
    | sed 's/\@\@ //g' \
    | /home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts/scripts/recaser/detruecase.perl \
    | /home/jakub/TAU/TAU_24/marian-dev/examples/tools/moses-scripts/scripts/tokenizer/detokenizer.perl -l en \
    > test-A/in.tc.cs.output
cp test-A/in.tc.cs.output test-A/out.tsv

geval -t dev-0 >> scores_geval.txt
7
train/clean.rb
Normal file
@@ -0,0 +1,7 @@
# Collapse runs of tabs into a single tab.
while l = gets
  l.gsub! /\t+/, "\t"
  puts l
end
21
train/split.py
Normal file
@@ -0,0 +1,21 @@
import pandas as pd

# Split a two-column TSV into separate source and target files.
corpus = pd.read_csv('news-commentary-v12-clean.tsv', sep='\t', names=['cs', 'en'], header=None)

cs = corpus['cs']
en = corpus['en']

file1 = open("./after_split/news-commentary-v12-clean.cs", "w")
for row in cs:
    file1.write(str(row) + "\n")
file1.close()

file2 = open("./after_split/news-commentary-v12-clean.en", "w")
for row in en:
    file2.write(str(row) + "\n")
file2.close()
22
train/split.rb
Normal file
@@ -0,0 +1,22 @@
# Split tab-separated bitext from stdin into the two files given as arguments.
filename1 = ARGV.shift
filename2 = ARGV.shift
f1 = File.open(filename1, 'w')
f2 = File.open(filename2, 'w')
lines = readlines
lines.each { |line|
  a = line.split(/\t+/)
  if a.length == 2 && line.match(/^[^\t]+\t+[^\t]+$/) then
    f1.puts(a[0])
    f2.puts(a[1])
  else
    puts("@@@ERROR@@@\n#{line}")
  end
}