Cluster Polish urban legend texts
=================================

Cluster Polish urban legend texts the way folklorists do.

The task is to group texts into urban legend types. This is an
unsupervised machine learning task, evaluated with Normalized
Mutual Information (NMI).
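
As a rough illustration of the metric, the sketch below computes NMI between a reference labeling and a predicted clustering in plain Python. It assumes normalization by the arithmetic mean of the two entropies, which is one common variant; the evaluation tool used for this challenge may normalize differently. The function name `nmi` and the example labels are hypothetical.

```python
import math
from collections import Counter


def nmi(expected, predicted):
    """NMI of two labelings, normalized by the arithmetic mean of their entropies.

    The choice of normalization is an assumption, not taken from this README.
    """
    if len(expected) != len(predicted):
        raise ValueError("labelings must have the same length")
    n = len(expected)
    if n == 0:
        raise ValueError("labelings must not be empty")

    count_e = Counter(expected)
    count_p = Counter(predicted)
    count_joint = Counter(zip(expected, predicted))

    def entropy(counts):
        # Shannon entropy (natural log) of a cluster-size distribution.
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # Mutual information computed from the joint cluster counts.
    mi = 0.0
    for (e, p), c in count_joint.items():
        mi += (c / n) * math.log(c * n / (count_e[e] * count_p[p]))

    mean_entropy = (entropy(count_e) + entropy(count_p)) / 2
    if mean_entropy == 0.0:
        # Both labelings put every text into a single cluster.
        return 1.0
    return mi / mean_entropy


# Hypothetical legend-type labels and cluster ids, for illustration only.
reference = ["kidney_theft", "kidney_theft", "death_car", "death_car"]
clusters = [0, 0, 1, 0]
print(round(nmi(reference, clusters), 3))
```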

Bibliography
------------

Please cite:

Roman Grundkiewicz and Filip Graliński. How to distinguish a kidney
theft from a death car? Experiments in clustering urban-legend
texts. In *Proceedings of the RANLP 2011 Workshop on Information
Extraction and Knowledge Acquisition*, pages 29-36, Hissar, Bulgaria,
September 2011. Association for Computational Linguistics.

See also:

Roman Grundkiewicz. Automatic identification and clustering of short
narrative texts published on the Internet. Master's thesis, supervisor:
Filip Graliński, Faculty of Mathematics and Computer Science, Adam
Mickiewicz University in Poznań, June 2011.

Filip Graliński. Tropiąc czarną wołgę w sieci. O poszukiwaniu legend
miejskich w internecie. In Anna Gumkowska, editor, *Tekst (w) sieci*,
volume 2, pages 253-261, Warszawa, 2009. Wydawnictwa Akademickie i
Profesjonalne.

(See `papers.bib` for BibTeX entries.)

Directory structure
-------------------

* `README.md` — this file
* `config.txt` — configuration file
* `papers.bib` — BibTeX entries
* `dev-0/` — directory with dev (test) data (87 texts)
* `dev-0/in.tsv` — input data for the dev set
* `dev-0/expected.tsv` — expected (reference) data for the dev set
* `test-A/` — directory with test data (691 texts)
* `test-A/in.tsv` — input data for the test set
* `test-A/expected.tsv` — expected (reference) data for the test set (hidden)
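
A minimal sketch of reading this layout in Python follows. It assumes that each line of `in.tsv` holds one text and that `expected.tsv` holds the reference cluster label for the same line; this README does not spell out the file format, so treat both as assumptions. The helper names `read_lines` and `load_split` are hypothetical.

```python
from pathlib import Path


def read_lines(path):
    """Return the lines of a UTF-8 text file without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def load_split(root, split):
    """Load the texts and, if available, the reference labels of one split.

    Assumes one text per line in in.tsv and one label per line in
    expected.tsv, aligned by line number (an assumption, see above).
    """
    split_dir = Path(root) / split
    texts = read_lines(split_dir / "in.tsv")

    expected_path = split_dir / "expected.tsv"
    if not expected_path.exists():
        # The reference file is hidden for the test set.
        return texts, None

    labels = read_lines(expected_path)
    if len(labels) != len(texts):
        raise ValueError("in.tsv and expected.tsv differ in length")
    return texts, labels
```

Under these assumptions, `load_split(".", "dev-0")` would return the dev texts together with their reference labels, while `load_split(".", "test-A")` would return `None` for the labels when `expected.tsv` is not present.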