
Cluster Polish urban legend texts

Cluster Polish urban legend texts the way folklorists do.

The task is to group texts into urban legend types. Note that this is an unsupervised machine learning task. The metric used is Normalized Mutual Information (NMI).
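
As a rough illustration of the metric, the sketch below computes NMI between two labelings in pure Python. It assumes normalization by the arithmetic mean of the two entropies; other normalizations exist, and the exact variant used by the official scorer is not stated here, so scores may differ slightly. The function names are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (natural log) between two labelings of the same items."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (n_xy / n) * math.log(n_xy * n / (count_a[x] * count_b[y]))
        for (x, y), n_xy in joint.items()
    )


def nmi(a, b):
    """NMI, normalized by the mean of the two entropies (an assumption)."""
    if len(a) != len(b):
        raise ValueError("labelings must have the same length")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 and h_b == 0:
        # Both labelings put everything in one cluster: treat as identical.
        return 1.0
    return mutual_information(a, b) / ((h_a + h_b) / 2)
```

NMI depends only on how items are grouped, not on the cluster names, so `nmi(["a", "a", "b", "b"], [1, 1, 0, 0])` scores a perfect match even though the labels differ.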

Bibliography

Please cite:

Roman Grundkiewicz and Filip Graliński. How to distinguish a kidney theft from a death car? Experiments in clustering urban-legend texts. In Proceedings of the RANLP 2011 Workshop on Information Extraction and Knowledge Acquisition, pages 29-36, Hissar, Bulgaria, September 2011. Association for Computational Linguistics.

See also:

Roman Grundkiewicz, Automatic identification and clustering of short narrative texts published on the Internet, master's thesis, supervisor: Filip Graliński, Faculty of Mathematics and Computer Science, Adam Mickiewicz University in Poznań, June 2011.

Filip Graliński. Tropiąc czarną wołgę w sieci. O poszukiwaniu legend miejskich w internecie. In Anna Gumkowska, editor, Tekst (w) sieci, volume 2, pages 253-261, Warszawa, 2009. Wydawnictwa Akademickie i Profesjonalne.

(See papers.bib for BibTeX entries.)

Directory structure

  • README.md — this file
  • config.txt — configuration file
  • papers.bib — BibTeX entries
  • dev-0/ — directory with development data (87 texts)
  • dev-0/in.tsv — input data for the dev set
  • dev-0/expected.tsv — expected (reference) data for the dev set
  • test-A/ — directory with test data (691 texts)
  • test-A/in.tsv — input data for the test set
  • test-A/expected.tsv — expected (reference) data for the test set (hidden)
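
The layout above can be read with a few lines of Python. This is a minimal sketch that assumes each line of `in.tsv` holds one text and each line of `expected.tsv` holds the reference cluster label for the text on the same line; that file format is an assumption, not something stated here. `load_split` is a hypothetical helper name.

```python
from pathlib import Path


def read_lines(path):
    """Read a UTF-8 file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8", newline="\n") as handle:
        return [line.rstrip("\n") for line in handle]


def load_split(directory):
    """Return (texts, labels) for a data directory such as dev-0/ or test-A/.

    labels is None when expected.tsv is absent (e.g. a hidden test set).
    Assumes one item per line in both files.
    """
    directory = Path(directory)
    texts = read_lines(directory / "in.tsv")
    expected_path = directory / "expected.tsv"
    labels = read_lines(expected_path) if expected_path.exists() else None
    if labels is not None and len(labels) != len(texts):
        raise ValueError("in.tsv and expected.tsv have different line counts")
    return texts, labels
```

For example, `texts, labels = load_split("dev-0")` would load the dev set, while `load_split("test-A")` returns `None` for the labels when the reference file is not available.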