HANOI challenge

This challenge is based on the HANOI corpus, described in detail below. It is a binary classification task: the aim is to classify interpreter notes as written by either a trainee or a professional. The training split consists of 988 examples in the form of scans of interpreter notes: 786 made by professionals and 202 by trainees (university students taking an interpreting course).
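The class counts above mean the training split is imbalanced, which matters when reading accuracy scores. As a rough sanity check (a minimal sketch, not part of any official challenge tooling), the snippet below computes the accuracy of always predicting the majority class, using only the counts stated above. The names `TRAIN_COUNTS` and `majority_baseline` are illustrative assumptions, and the class distribution of the test split is not stated here, so the figure applies to the training split only.

```python
# Class counts for the training split, as stated in the challenge description.
TRAIN_COUNTS = {"pro": 786, "trainee": 202}


def majority_baseline(counts):
    """Return (label, accuracy) for always predicting the most frequent class."""
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, counts[label] / total


label, acc = majority_baseline(TRAIN_COUNTS)
print(f"total training examples: {sum(TRAIN_COUNTS.values())}")
print(f"always predicting '{label}' scores {acc:.3f} accuracy on the training split")
```

On the training split, a model therefore needs to score clearly above roughly 0.80 accuracy before it can be said to have learned anything beyond the class prior.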

HANOI, or Handwritten Notation of Interpreters, is a corpus of handwritten notes for consecutive interpreting, collected from professional interpreters and interpreting students. It is the only resource of its kind in the world.

Interpreting is the act of translating spoken language. Professional interpreters are needed, for example, to translate discussions between international guests speaking their native languages at a conference. There are several types of interpreting, one of which is consecutive interpreting. In this case, the interpreter waits for the speaker to finish the whole speech before starting to interpret. Such speeches can last up to 20 minutes, so interpreters rely on handwritten notes to convey the content of the original speech accurately. The interpreter listens to the source language and, at the same time, notes down selected content in order to remember it and later recreate it in the target language.

There are rules for note-taking. The writing should be sparse and diagonal, using abbreviations, acronyms, and symbols. Interpreters often take notes in two or more languages at the same time. The resulting specialized multilingual text, the so-called semi-product of interpreting, serves a unique function: supporting short-term memory during interpretation. Developing note-taking skills for interpreting is a process that starts at university with a course in notation and continues throughout an interpreter's entire career. Every interpreter's notation style is different, and it is virtually impossible to read someone else's notes.

The notes of consecutive interpreters constitute a unique type of handwritten text, quite unlike the notes people use for everyday tasks, school, and work. Interestingly, the notes of professional interpreters and of those who are new to the skill are also different. The notation of interpreting trainees is more reminiscent of 'traditional' notes: there are grammatically correct sentences and multi-syllable words, pages are densely written, and there are no symbols, abbreviations, or distinctive lines that would divide the speech into separate ideas.

(Description adapted from https://hanoi.amu.edu.pl/)

Taking into account the above-outlined unique characteristics of interpreter notes as well as the differences between the ones created by trainees versus the ones made by professionals, an interesting question arises: could a machine learning model reliably identify the interpreting experience of the author of a note across several hundred examples? Take part in the challenge and prove that the answer can be 'yes'!

Metric: accuracy

Labels: trainee, pro
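For reference, accuracy is the fraction of examples whose predicted label matches the gold label. The following is a minimal sketch of that computation for the two labels listed above. The function name `accuracy`, the `LABELS` constant, and the toy lists are assumptions made for illustration; the challenge's actual evaluation tooling and file formats are not described here.

```python
# The two labels used in the challenge.
LABELS = ("trainee", "pro")


def accuracy(gold, predicted):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty list")
    for value in list(gold) + list(predicted):
        if value not in LABELS:
            raise ValueError(f"unknown label: {value!r}")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy example (made-up labels, not taken from the dataset).
gold = ["pro", "pro", "trainee", "pro"]
predicted = ["pro", "trainee", "trainee", "pro"]
print(accuracy(gold, predicted))  # 3 of 4 match -> 0.75
```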

Dataset authors: https://hanoi.amu.edu.pl/

Challenge authors: https://csi.amu.edu.pl/zespoly/zespol-lingwistyki-diachronicznej

License: CC BY-NC 4.0

HANOI is part of the Digital Research Infrastructure for the Humanities and Arts DARIAH-PL, funded from the Intelligent Development Operational Programme, Polish National Centre for Research and Development, ID: POIR.04.02.00-00-D006/20.