2024-11-19 12:47:34 +01:00

Dariah character names disambiguation challenge

This challenge is based on the classic Polish novel "Lalka" (English: "The Doll") by Bolesław Prus. In the fragments of the novel included in the dataset, all the different names that refer to the same character are annotated with a common character label. For example, the main character, Stanisław Wokulski, is always annotated as "WOKULSKI", whether the text refers to him as "Wokulski", "Stach", "S. Wokulski", or by another form. In the same manner, all mentions of Izabela Łęcka are annotated as "LECKA", all mentions of Ignacy Rzecki as "RZECKI", and so on. The training split of the dataset consists of 3336 annotated sentences from the novel.

The goal of the challenge is twofold. First, the spans of text containing character names have to be identified. Second, each span has to be classified with the appropriate character label.

Proper name disambiguation is an important step in search tasks in computational literary studies. Literary figures can be referenced in a text in various ways, and individual references differ in morphological features, register, adopted terminology, and context of use, among other things. A proper name disambiguation module makes it possible to detect proper names referring to persons and to indicate which of them refer to the same individual.

Metrics: precision, recall, F-score
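
As an illustration of how these metrics relate to each other, here is a minimal sketch that scores predicted spans against gold spans. It assumes span-level exact matching (a prediction counts as correct only if its boundaries and character label both match), which this README does not specify; the official evaluation may count matches differently. The function name and the `(start, end, label)` span representation are hypothetical, and the example spans are made up for illustration.

```python
from typing import Iterable, Tuple

# Hypothetical span representation: (start token index, end token index
# exclusive, character label). Not taken from the challenge tooling.
Span = Tuple[int, int, str]


def precision_recall_f1(
    gold: Iterable[Span], predicted: Iterable[Span]
) -> Tuple[float, float, float]:
    """Score predicted spans against gold spans using exact matching.

    Assumption: a predicted span is a true positive only if its start,
    end, and label are all identical to a gold span.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)

    # Guard against division by zero when either side is empty.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: one span is correct, one has a wrong boundary,
# and one is spurious.
gold_spans = [(0, 1, "WOKULSKI"), (3, 5, "LECKA")]
predicted_spans = [(0, 1, "WOKULSKI"), (3, 4, "LECKA"), (5, 6, "RZECKI")]
p, r, f = precision_recall_f1(gold_spans, predicted_spans)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

To score a whole split rather than a single sentence, each span would also need a sentence identifier so that spans from different sentences cannot collide.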

Labels: 53 named entity classes (one label per unique character), encoded with BIO positional tags: a B- prefix marks the first token of a named entity, an I- prefix marks each following token inside it, and O marks tokens that do not belong to any entity
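
To make the tagging scheme concrete, the sketch below converts a sequence of per-token tags into labelled spans. The tag strings follow the B-/I-/O convention described above, joined to the character label with a hyphen (for example `B-WOKULSKI`); the exact on-disk format of the dataset is not described here, so treat that spelling as an assumption. The function name is hypothetical, and the example sentence and its tags are made up for illustration, not taken from the dataset.

```python
from typing import List, Tuple

# Hypothetical span representation: (start token index, end token index
# exclusive, character label).
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Collect (start, end, label) spans from a list of BIO tags.

    Lenient by design: an I- tag that does not continue an open span
    with the same label is treated as the start of a new span.
    """
    spans: List[Span] = []
    start, label = None, None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        prefix, _, name = tag.partition("-")
        if prefix == "I" and label is not None and name == label:
            continue  # still inside the current span
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        if prefix in ("B", "I") and name:
            start, label = i, name
    return spans


# Made-up example sentence (assumed tokenization and tags).
tokens = ["Stach", "spotkał", "pannę", "Izabelę", "Łęcką", "."]
tags = ["B-WOKULSKI", "O", "O", "B-LECKA", "I-LECKA", "O"]
for start, end, label in bio_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

Both subtasks of the challenge are visible in the output: the span boundaries identify where a character is mentioned, and the label says which character it is.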

Dataset author: Dr Marek Kubis (marek.kubis@amu.edu.pl)

Challenge authors: https://csi.amu.edu.pl/zespoly/zespol-lingwistyki-diachronicznej

License: CC BY-NC 4.0