# Datasets

Datasets in this submission were tokenized with [moses](https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer), using `tokenizer.perl` (with the `pl` language option) and `deescape-special-chars.perl` (a sketch of the pipeline is given under *Example snippets* at the end of this README).

# Common Crawl

Additional data (used for the RoBERTa-small-cc model) were filtered from [Deduplicated CommonCrawl Text](http://statmt.org/ngrams/deduped/) and [WikiMatrix](https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix). Filtering was done with a [kenlm](https://kheafield.com/code/kenlm/) model under very strict conditions (e.g. short sentences were omitted, and a sentence was kept only if 80% of its tokens occurred in the training set); see the filtering sketch under *Example snippets*.

# Pretrained models

Each model was trained for 5 epochs (based on the RoBERTa architecture, with a sequence length of 512 and a 60000-token byte-level BPE vocabulary); the pretrained models are saved in the `models/pretrained` directory:

- RoBERTa-small - 4 layers, embedding dimension 256
- RoBERTa-small-cc - 4 layers, embedding dimension 256; the training set additionally includes filtered Common Crawl data (the same amount as the base training set), so it is trained on twice as much data as RoBERTa-small
- RoBERTa-normal - 4 layers, embedding dimension 512
- RoBERTa-big - 8 layers, embedding dimension 512

Pretraining was done with the `run_language_modeling.py` script from transformers (see the configuration sketch under *Example snippets*).

# Fine-tune (classification)

Classification models were trained in a few settings; each output file name encodes its configuration, e.g. `out-model=small,corpus=base_with_cc,seq_len=256,sliding=False,valid=dev-0.tsv`:

- model={small, normal, big} - type of pretrained model used
- corpus={base, base_with_cc} - base = the actual training set (used for the pretrained models), base_with_cc = the actual training set plus filtered Common Crawl
- seq_len={128, 256, 512} - sequence length of the text given to the classifier
- sliding={False, True} - True = the text is split with a sliding window (equal to the sequence length) and each window is assigned the label of the original text; during evaluation and prediction, the predictions over a sample's windows are combined into the final prediction for that sample (read more about the `sliding_window`/`stride` options in `simpletransformers`; see also the fine-tuning sketch under *Example snippets*)
- valid={dev-0, dev-1, dev-0-dev-1} - validation data; dev-0-dev-1 = dev-0 and dev-1 concatenated into one set

Each model should be trained in a different directory, or `cache_dir` (in the training script arguments) should be changed for each training run. Each model was trained for 5 epochs (the model after 2-3 epochs was usually the best).

# Used libraries

(see `requirements.txt`):

- torch 1.4.0
- transformers 2.8.0
- simpletransformers 0.24.8

# Scripts

- `train.py` - training script used to train a classifier
- `eval.py` - evaluation script (uses the best model)
- concatenate validation sets: `paste expected.tsv in.tsv > valid.tsv`

# Branches

- [tokenized-data](https://git.wmi.amu.edu.pl/s402227/petite-difference-challenge2/src/tokenized-data) - contains tokenized data only
- [roberta](https://git.wmi.amu.edu.pl/s402227/petite-difference-challenge2/src/roberta) - contains this submission with the RoBERTa models
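# Example snippets

The snippets below are illustrative sketches only; paths, thresholds and hyperparameters that are not stated elsewhere in this README are assumptions.

The tokenization pipeline can be reproduced roughly as follows, assuming a local checkout of `mosesdecoder` (the `MOSES_SCRIPTS` path and the file names are placeholders) and that `deescape-special-chars.perl` is applied after `tokenizer.perl`:

```python
import subprocess

# Assumed location of a local mosesdecoder checkout (placeholder path).
MOSES_SCRIPTS = "mosesdecoder/scripts/tokenizer"


def tokenize_file(in_path: str, out_path: str, lang: str = "pl") -> None:
    """Tokenize with Moses tokenizer.perl, then de-escape special characters."""
    with open(in_path, "rb") as fin, open(out_path, "wb") as fout:
        # tokenizer.perl reads plain text from stdin and writes tokens to stdout.
        tokenizer = subprocess.Popen(
            ["perl", f"{MOSES_SCRIPTS}/tokenizer.perl", "-l", lang],
            stdin=fin,
            stdout=subprocess.PIPE,
        )
        # deescape-special-chars.perl turns escaped entities (e.g. &amp;)
        # back into plain characters.
        subprocess.run(
            ["perl", f"{MOSES_SCRIPTS}/deescape-special-chars.perl"],
            stdin=tokenizer.stdout,
            stdout=fout,
            check=True,
        )
        tokenizer.wait()


if __name__ == "__main__":
    tokenize_file("corpus.txt", "corpus.tok.txt")  # placeholder file names
```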
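The Common Crawl filtering can be sketched with the `kenlm` Python bindings. The language-model path, vocabulary file, minimum sentence length and perplexity cutoff below are assumptions; only the 80% token-overlap condition comes from this README:

```python
import kenlm

# Assumptions: model path, vocabulary file and thresholds are placeholders.
LM = kenlm.Model("models/kenlm/train.binary")
MIN_TOKENS = 5            # drop short sentences
MIN_VOCAB_OVERLAP = 0.8   # 80% of tokens must occur in the training set
MAX_PERPLEXITY = 500.0    # reject sentences the LM finds too unlikely

with open("train_vocab.txt", encoding="utf-8") as f:
    train_vocab = set(f.read().split())


def keep(sentence: str) -> bool:
    tokens = sentence.split()
    if len(tokens) < MIN_TOKENS:
        return False
    overlap = sum(t in train_vocab for t in tokens) / len(tokens)
    if overlap < MIN_VOCAB_OVERLAP:
        return False
    return LM.perplexity(sentence) <= MAX_PERPLEXITY


with open("commoncrawl.txt", encoding="utf-8") as src, \
     open("commoncrawl.filtered.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if keep(line.strip()):
            dst.write(line)
```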
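The pretrained variants differ only in depth and embedding dimension. Below is a minimal sketch of how the RoBERTa-small configuration could be built with transformers 2.8.0 before pretraining; the number of attention heads, intermediate size and output path are assumptions, not values taken from this repository:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# RoBERTa-small: 4 layers, embedding dimension 256, 512-token sequences,
# 60000-token byte-level BPE vocabulary.  num_attention_heads and
# intermediate_size are assumptions.
config = RobertaConfig(
    vocab_size=60000,
    max_position_embeddings=514,  # 512 tokens plus the RoBERTa position offset
    num_hidden_layers=4,
    hidden_size=256,
    num_attention_heads=4,
    intermediate_size=1024,
)

model = RobertaForMaskedLM(config=config)
model.save_pretrained("models/pretrained/roberta-small-init")  # placeholder path
```

The saved directory, together with a byte-level BPE tokenizer trained on the challenge data, can then be passed to `run_language_modeling.py` (with the `--mlm` flag) for the actual masked-language-model pretraining.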
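The fine-tuning settings map onto `simpletransformers` 0.24.8 arguments roughly as in the sketch below (small model, `seq_len=256`, sliding window on). The data paths, column layout, stride value and output/cache directory names are assumptions; only `max_seq_length`, `sliding_window`, `num_train_epochs` and the use of a separate `cache_dir` per run follow from this README:

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Placeholder paths; both files are assumed to be label<TAB>text,
# as produced by `paste expected.tsv in.tsv`.
train_df = pd.read_csv("train.tsv", sep="\t", names=["labels", "text"])
eval_df = pd.read_csv("dev-0/valid.tsv", sep="\t", names=["labels", "text"])

model_args = {
    "max_seq_length": 256,   # seq_len field of the output file name
    "sliding_window": True,  # sliding field; windows inherit the sample's label
    "stride": 0.8,           # window stride as a fraction of max_seq_length (assumption)
    "num_train_epochs": 5,
    "output_dir": "outputs/model=small,corpus=base_with_cc,seq_len=256,sliding=True",
    "cache_dir": "cache/model=small,corpus=base_with_cc,seq_len=256,sliding=True",
    "overwrite_output_dir": True,
}

model = ClassificationModel(
    "roberta",
    "models/pretrained/roberta-small",  # placeholder path to a pretrained model
    args=model_args,
)

model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(eval_df)
```

With `sliding_window` enabled, `simpletransformers` splits long texts into overlapping windows during training and aggregates the per-window predictions at evaluation and prediction time, which matches the sliding=True behaviour described above.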