petite-difference-challenge2/README-INTERNAL.md

Datasets

Datasets in this submission were tokenized with Moses, using tokenizer.perl (with the pl language option) and deescape-special-chars.perl.
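
As an illustration only, the same preprocessing can be approximated in Python with sacremoses (a port of the Moses scripts); the submission itself used the Perl scripts named above:

    # Illustrative sketch: the submission used tokenizer.perl -l pl and
    # deescape-special-chars.perl; sacremoses performs roughly the same steps.
    from sacremoses import MosesTokenizer

    tokenizer = MosesTokenizer(lang="pl")

    def preprocess(line: str) -> str:
        # escape=False roughly corresponds to running deescape-special-chars.perl
        # afterwards (special characters are not left XML-escaped).
        return tokenizer.tokenize(line.strip(), return_str=True, escape=False)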

Common Crawl

Additional data (used in the RoBERTa-small-cc model) were filtered from the Deduplicated CommonCrawl Text and WikiMatrix corpora. Filtering was done with a KenLM model under very strict conditions (e.g. short sentences were discarded, and a sentence had to contain at least 80% of tokens already present in the training set).
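
A minimal sketch of this filtering step, assuming a KenLM binary model and a vocabulary file extracted from the training set (the paths and thresholds below are placeholders, not the exact values used):

    import kenlm

    lm = kenlm.Model("models/kenlm/train.binary")                 # hypothetical path
    train_vocab = set(open("data/train.vocab").read().split())    # hypothetical path

    MIN_TOKENS = 10          # drop short sentences (threshold assumed)
    MIN_KNOWN_RATIO = 0.8    # at least 80% of tokens must occur in the training set
    MAX_PERPLEXITY = 500.0   # strict language-model threshold (value assumed)

    def keep(sentence: str) -> bool:
        tokens = sentence.split()
        if len(tokens) < MIN_TOKENS:
            return False
        if sum(tok in train_vocab for tok in tokens) / len(tokens) < MIN_KNOWN_RATIO:
            return False
        return lm.perplexity(sentence) <= MAX_PERPLEXITY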

Pretrained models

Each model was trained for 5 epochs (based on the RoBERTa architecture, with a sequence length of 512 and a 60,000-token byte-level BPE vocabulary). The pretrained models are saved in the models/pretrained directory:

  • RoBERTa-small - 4 layers, embedding dimension 256
  • RoBERTa-small-cc - 4 layers, embedding dimension 256; the training set additionally includes the filtered Common Crawl data (in the same amount as the base training set), so the total corpus is twice the size of the one used for RoBERTa-small
  • RoBERTa-normal - 4 layers, embedding dimension 512
  • RoBERTa-big - 8 layers, embedding dimension 512

The run_language_modeling.py script from transformers was used for the pretraining process.
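
For reference, a configuration matching the RoBERTa-small variant might look roughly like this in the transformers API (the number of attention heads and the intermediate size are assumptions, as they are not stated above):

    from transformers import RobertaConfig, RobertaForMaskedLM

    config = RobertaConfig(
        vocab_size=60000,
        num_hidden_layers=4,
        hidden_size=256,                # embedding dimension of RoBERTa-small
        num_attention_heads=4,          # assumed; must divide hidden_size
        intermediate_size=1024,         # assumed (4 * hidden_size)
        max_position_embeddings=514,    # 512 tokens + 2 special positions (RoBERTa convention)
        type_vocab_size=1,
    )
    model = RobertaForMaskedLM(config=config)

Assuming the checkpoints in models/pretrained are saved in the standard transformers format, they can later be loaded with RobertaForMaskedLM.from_pretrained(...).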

Fine-tune (classifications)

Classification models were trained in several settings; the settings are encoded in the output file names (e.g. out-model=small,corpus=base_with_cc,seq_len=256,sliding=False,valid=dev-0.tsv):

  • model={small, normal, big} - type of model used
  • corpus={base, base_with_cc} - base = the original training set (the one used for the pretrained models), base_with_cc = the original training set plus the filtered Common Crawl data
  • seq_len={128, 256, 512} - maximum sequence length used by the classifier
  • sliding={False, True} - True = the text is split with a sliding window (equal to the sequence length) and each window is assigned the label of the original text; during evaluation and prediction, the per-window predictions are aggregated into the final prediction for each sample (see the simpletransformers options sliding_window/stride and the sketch below)
  • valid={dev-0, dev-1, dev-0-dev-1} - validation data, dev-0-dev-1 = dev-0 and dev-1 concatenated into one set

Each model should be trained in a separate directory, or cache_dir (a training-script argument) should be changed for each training run. Each model was trained for 5 epochs (the model after epoch 2 or 3 was usually the best).
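
A minimal sketch of one fine-tuning setting with simpletransformers (model=small, seq_len=256, sliding=True); the paths, the label count and the data layout are assumptions, and the actual runs were driven by train.py:

    import pandas as pd
    from simpletransformers.classification import ClassificationModel

    # Assumed layout: one example per line, label and text separated by a tab
    # (labels must be integer class ids for simpletransformers).
    train_df = pd.read_csv("train.tsv", sep="\t", names=["labels", "text"])

    model = ClassificationModel(
        "roberta",
        "models/pretrained/RoBERTa-small",   # pretrained model from the previous step
        num_labels=2,                        # assumed binary classification
        args={
            "max_seq_length": 256,                       # seq_len setting
            "num_train_epochs": 5,
            "sliding_window": True,                      # sliding=True setting
            "stride": 0.8,                               # window step as a fraction of max_seq_length
            "cache_dir": "cache/small-seq256-sliding",   # separate cache per run, as noted above
            "output_dir": "outputs/small-seq256-sliding",
        },
    )
    model.train_model(train_df)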

Used libraries (see requirements.txt):

  • torch 1.4.0
  • transformers 2.8.0
  • simpletransformers 0.24.8

Scripts

  • train.py - training script used to train the classifiers
  • eval.py - evaluation script (uses the best model)
  • to build a validation file, combine the expected labels and the inputs: paste expected.tsv in.tsv > valid.tsv
