"He Said She Said" classification challenge (2nd edition) - see gonito.net - pretrained models
"He Said She Said" classification challenge (2nd edition)
Give the probability that a text in Polish was written by a man.
This challenge is based on the "He Said She Said" corpus for Polish. The corpus was created by grepping the Common Crawl corpus for gender-specific first-person expressions (e.g. "zrobiłem/zrobiłam", "jestem zadowolony/zadowolona", "będę robił/robiła"). In the challenge data, such expressions were normalised into masculine forms.
Classes
0: text written by a woman
1: text written by a man
Directory structure
README.md: this file
config.txt: configuration file
train/: directory with training data
train/train.tsv.gz: train set (gzipped), the class is given in the first column, a text fragment in the second one
train/meta.tsv.gz: metadata (do not use during training)
dev-0/: directory with dev (test) data
dev-0/in.tsv: input data for the dev set (text fragments)
dev-0/expected.tsv: expected (reference) data for the dev set
dev-0/meta.tsv: metadata (not used during testing)
dev-1/: directory with extra dev (test) data
dev-1/in.tsv: input data for the extra dev set (text fragments)
dev-1/expected.tsv: expected (reference) data for the extra dev set
dev-1/meta.tsv: metadata (not used during testing)
test-A/: directory with test data
test-A/in.tsv: input data for the test set (text fragments)
test-A/expected.tsv: expected (reference) data for the test set (hidden)
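
The layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the challenge tooling: the helper names `read_train` and `read_dev` are hypothetical, and it assumes that all files are UTF-8 encoded, have no header row, and that line N of `expected.tsv` holds the class for line N of `in.tsv`.

```python
import gzip
from pathlib import Path


def read_train(path):
    """Yield (label, text) pairs from a gzipped TSV file.

    Assumes the class (0 or 1) is in the first column and the
    text fragment is in the second one, as described above.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            # Split only on the first tab so the text is kept whole.
            label, text = line.rstrip("\n").split("\t", 1)
            yield int(label), text


def read_dev(directory):
    """Yield (label, text) pairs for a dev directory.

    Assumption: in.tsv and expected.tsv are line-aligned.
    """
    directory = Path(directory)
    with open(directory / "in.tsv", encoding="utf-8") as texts, \
            open(directory / "expected.tsv", encoding="utf-8") as labels:
        for text, label in zip(texts, labels):
            yield int(label.strip()), text.rstrip("\n")
```

For example, `for label, text in read_train("train/train.tsv.gz"): ...` iterates over the training set without loading it into memory, and `list(read_dev("dev-0"))` collects the dev set as pairs.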