Cluster Polish urban legend texts
=================================

Cluster Polish urban legend texts the way folklorists do.

The task is to group texts by urban legend type. Note that this is
an unsupervised machine learning task. The evaluation metric is
Normalized Mutual Information (NMI).
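
As a rough illustration of the metric, here is a minimal sketch of NMI in
plain Python. It assumes normalization by the arithmetic mean of the two
entropies; the official evaluation may normalize differently, so treat the
numbers as indicative only. The function names (`entropy`,
`mutual_information`, `nmi`) are hypothetical and not part of this
repository.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(pred, gold):
    """Mutual information between two labelings of the same items."""
    n = len(pred)
    joint = Counter(zip(pred, gold))
    pred_counts = Counter(pred)
    gold_counts = Counter(gold)
    mi = 0.0
    for (p, g), c in joint.items():
        mi += (c / n) * log(c * n / (pred_counts[p] * gold_counts[g]))
    return mi


def nmi(pred, gold):
    """NMI, normalized by the arithmetic mean of the entropies (an assumption)."""
    if len(pred) != len(gold):
        raise ValueError("both labelings must cover the same items")
    h_pred, h_gold = entropy(pred), entropy(gold)
    if h_pred == 0 and h_gold == 0:
        # Both labelings put everything in one cluster: treat as identical.
        return 1.0
    return mutual_information(pred, gold) / ((h_pred + h_gold) / 2)
```

Cluster identifiers are arbitrary: only the grouping matters, so
`nmi([0, 0, 1, 1], ["a", "a", "b", "b"])` scores a perfect match even
though the label names differ.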

Bibliography
------------

Please cite:

Roman Grundkiewicz and Filip Graliński. How to distinguish a kidney
theft from a death car? Experiments in clustering urban-legend
texts. In *Proceedings of the RANLP 2011 Workshop on Information
Extraction and Knowledge Acquisition*, pages 29-36, Hissar, Bulgaria,
September 2011. Association for Computational Linguistics.

See also:

Roman Grundkiewicz. Automatic identification and clustering of short
narrative texts published on the Internet. Master's thesis, supervisor:
Filip Graliński, Faculty of Mathematics and Computer Science, Adam
Mickiewicz University in Poznań, June 2011.

Filip Graliński. Tropiąc czarną wołgę w sieci. O poszukiwaniu legend
miejskich w internecie. In Anna Gumkowska, editor, *Tekst (w) sieci*,
volume 2, pages 253-261, Warszawa, 2009. Wydawnictwa Akademickie i
Profesjonalne.

(See `papers.bib` for BibTeX entries.)

Directory structure
-------------------

* `README.md` — this file
* `config.txt` — configuration file
* `papers.bib` — BibTeX entries
* `dev-0/` — directory with dev (test) data (87 texts)
* `dev-0/in.tsv` — input data for the dev set
* `dev-0/expected.tsv` — expected (reference) data for the dev set
* `test-A/` — directory with test data (691 texts)
* `test-A/in.tsv` — input data for the test set
* `test-A/expected.tsv` — expected (reference) data for the test set (hidden)
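
The layout above could be read with a small helper like the one below. This
is only a sketch under assumptions that this README does not confirm: that
`in.tsv` holds one text per line, and that `expected.tsv` holds the
reference cluster label for the text on the same line. `load_split` and
`read_lines` are hypothetical names, not part of the repository.

```python
from pathlib import Path


def read_lines(path):
    """Read a UTF-8 file as a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_split(directory):
    """Return (texts, labels) for a split such as dev-0/ or test-A/.

    Assumes one text per line in in.tsv and one label per line in
    expected.tsv. labels is None when expected.tsv is absent (hidden).
    """
    directory = Path(directory)
    texts = read_lines(directory / "in.tsv")
    expected_path = directory / "expected.tsv"
    if not expected_path.exists():
        return texts, None
    labels = read_lines(expected_path)
    if len(labels) != len(texts):
        raise ValueError("in.tsv and expected.tsv differ in line count")
    return texts, labels
```

For example, `texts, labels = load_split("dev-0")` would return the dev
texts with their reference labels, while `load_split("test-A")` would
return `None` for the labels as long as the expected file stays hidden.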