"He Said She Said" classification challenge (2nd edition) - see gonito.net
Karol Kaczmarek b4b30b6d63 Classification with RoBERTa (update README.md) 2020-05-02 14:32:44 +02:00
dev-0 Classification with RoBERTa 2020-05-02 14:19:08 +02:00
dev-1 Classification with RoBERTa 2020-05-02 14:19:08 +02:00
models/pretrained Classification with RoBERTa 2020-05-02 14:19:08 +02:00
test-A Classification with RoBERTa 2020-05-02 14:19:08 +02:00
train Add tokenized data 2020-05-01 08:45:59 +02:00
.gitignore init 2016-11-15 09:07:10 +01:00
README-INTERNAL.md Classification with RoBERTa (update README.md) 2020-05-02 14:32:44 +02:00
README.md Classification with RoBERTa (update README.md) 2020-05-02 14:32:44 +02:00
config.txt init 2016-11-15 09:07:10 +01:00
eval.py Classification with RoBERTa 2020-05-02 14:19:08 +02:00
requirements.txt Classification with RoBERTa 2020-05-02 14:19:08 +02:00
train.py Classification with RoBERTa 2020-05-02 14:19:08 +02:00

README.md

"He Said She Said" classification challenge (2nd edition)

See description for more details about this submission.

Guess whether a text in Polish was written by a man or a woman.

This challenge is based on the "He Said She Said" corpus for Polish. The corpus was created by grepping the Common Crawl corpus for gender-specific first-person expressions (e.g. "zrobiłem/zrobiłam", "jestem zadowolony/zadowolona", "będę robił/robiła"). In this challenge, such expressions have been normalised into masculine forms, so they no longer reveal the author's gender directly.
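To make the normalisation step concrete, here is a minimal sketch in Python. It is not the corpus authors' actual procedure, which this README does not describe; the mapping `FEMININE_TO_MASCULINE` and the function `normalise` are hypothetical names, and the mapping contains only the three example pairs quoted above.

```python
import re

# Hypothetical mapping built only from the example pairs quoted in this README.
# The real corpus pipeline is not documented here and surely covers more forms.
FEMININE_TO_MASCULINE = {
    "zrobiłam": "zrobiłem",
    "jestem zadowolona": "jestem zadowolony",
    "będę robiła": "będę robił",
}

# One alternation over all feminine forms, matched on word boundaries
# (case-sensitive, so capitalised forms are left untouched in this sketch).
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(form) for form in FEMININE_TO_MASCULINE) + r")\b"
)


def normalise(text: str) -> str:
    """Replace the known feminine first-person forms with masculine ones."""
    return _PATTERN.sub(lambda m: FEMININE_TO_MASCULINE[m.group(1)], text)
```

For instance, `normalise("Wczoraj zrobiłam obiad.")` gives `"Wczoraj zrobiłem obiad."`, while a text that already uses masculine forms is returned unchanged.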

Classes

  • F — text written by a woman
  • M — text written by a man

Directory structure

  • README.md — this file
  • config.txt — configuration file
  • train/ — directory with training data
  • train/train.tsv.gz — train set (gzipped); the class is given in the first column and a text fragment in the second
  • train/meta.tsv.gz — metadata (do not use during training)
  • dev-0/ — directory with dev (test) data
  • dev-0/in.tsv — input data for the dev set (text fragments)
  • dev-0/expected.tsv — expected (reference) data for the dev set
  • dev-0/meta.tsv — metadata (not used during testing)
  • dev-1/ — directory with extra dev (test) data
  • dev-1/in.tsv — input data for the extra dev set (text fragments)
  • dev-1/expected.tsv — expected (reference) data for the extra dev set
  • dev-1/meta.tsv — metadata (not used during testing)
  • test-A/ — directory with test data
  • test-A/in.tsv — input data for the test set (text fragments)
  • test-A/expected.tsv — expected (reference) data for the test set (hidden)
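The layout above can be read with a few lines of Python. The sketch below is not taken from `train.py` or `eval.py`; the helper names `read_train`, `read_lines`, and `accuracy` are hypothetical. It assumes that `in.tsv` and `expected.tsv` hold one item per line in matching order, and it uses plain accuracy as the score, which is an assumption since this README does not name the metric.

```python
import gzip
from typing import Iterator, List, Tuple


def read_train(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (label, text) pairs from a gzipped TSV such as train/train.tsv.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            # The class (F or M) is the first column; the rest is the text fragment.
            label, text = line.rstrip("\n").split("\t", 1)
            yield label, text


def read_lines(path: str) -> List[str]:
    """Read a plain TSV file (e.g. dev-0/in.tsv or dev-0/expected.tsv) line by line."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def accuracy(expected: List[str], predicted: List[str]) -> float:
    """Fraction of positions where the predicted class equals the expected one."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        raise ValueError("cannot score an empty set")
    hits = sum(e == p for e, p in zip(expected, predicted))
    return hits / len(expected)
```

A typical check on the dev set would compare `read_lines("dev-0/expected.tsv")` with a list of predictions produced for `read_lines("dev-0/in.tsv")`, for example a constant baseline that always answers `M`.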