## Punctuation restoration from read text

Restore punctuation marks from ASR outputs.

## Motivation

Speech transcripts generated by Automatic Speech Recognition (ASR)
systems typically do not contain any punctuation or capitalization. In
longer stretches of automatically recognized speech, the lack of
punctuation reduces the clarity of the output text \[1\]. The primary
purpose of punctuation restoration (PR) and capitalization restoration
(CR) as a distinct natural language processing (NLP) task is to improve
the legibility of ASR-generated text, and possibly of other types of text
without punctuation. Aside from their intrinsic value, PR and CR may
improve the performance of other NLP tasks such as Named Entity
Recognition (NER), part-of-speech (POS) tagging, semantic parsing, and
spoken dialog segmentation \[2, 3\].

Useful as it is, PR is hard to evaluate systematically on transcripts of
conversational language, mainly because punctuation rules can be
ambiguous even for originally written texts, and the very nature of
naturally occurring spoken language makes it difficult to identify clear
phrase and sentence boundaries \[4, 5\]. Given these requirements and
limitations, a PR task based on a redistributable corpus of read speech
was suggested. The 1200 texts included in this collection (totaling over
240,000 words) were selected from two distinct sources: WikiNews and
WikiTalks. Punctuation found in these sources should be approached with
some reservation when used for evaluation: these are original texts and
may contain user-induced errors and bias. The texts were read out by
over a hundred different speakers. The original texts with punctuation
were force-aligned with the recordings and used as the ideal ASR output.

The goal of the task is to provide a solution for restoring punctuation
in the test set collated for this task. The test set consists of
time-aligned ASR transcriptions of read texts from the two sources.
Participants are encouraged to use both text-based and speech-derived
features to identify punctuation symbols (e.g. the multimodal framework
of \[6\]). In addition, the train set is accompanied by reference text
corpora of WikiNews and WikiTalks data that can be used for training and
fine-tuning punctuation models.

## Task description

The purpose of this task is to restore punctuation in ASR output for
texts read out loud.

![](media/image1.png){width="6.267716535433071in"
height="7.402777777777778in"}

## Dataset -- WikiPunct

WikiPunct is a crowdsourced text and audio data set of Polish Wikipedia
pages read out loud by Polish speakers. The dataset is divided into two
parts: conversational (WikiTalks) and informational (WikiNews). Over a
hundred people were involved in the production of the audio component.
The total length of the audio data is almost thirty-six hours,
including the test set. Steps were taken to balance the male-to-female
ratio.

WikiPunct has over thirty-two thousand texts and 1200 audio files, one
thousand in the training set and two hundred in the test set. Each text
comes with a transcript of automatically recognized speech and a
force-aligned version of the text. The data format and evaluation
metrics are described below in the respective sections.

### Statistics

- **Text:**
    - over thirty-two thousand texts; WikiNews ca. 15,000, WikiTalks
      ca. 17,000;
- **Audio:**
    - Selection procedure:
        - randomly selected WikiNews texts (80%, i.e. 800 entries for
          the training set) with a word count above 150 and below 300;
        - randomly selected WikiTalks texts (20%) with a word count
          above 150 and below 300 and at least one question mark
    - Data set split:
        - Training data: 1000 recordings
        - Test data: at 274 recordings
    - Speakers:
        - Polish male: 51 speakers, 16.7 hours of speech
        - Polish female: 54 speakers, 19 hours of speech

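The selection filters listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `source` labels, and whitespace-based word counting are hypothetical and are not taken from the organisers' tooling.

```python
import random
from typing import List


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(text.split())


def is_candidate(text: str, source: str) -> bool:
    """Apply the stated filters: more than 150 and fewer than 300 words,
    plus at least one question mark for WikiTalks texts.

    `source` is a hypothetical label: "wikinews" or "wikitalks".
    """
    if not 150 < word_count(text) < 300:
        return False
    if source == "wikitalks":
        return "?" in text
    return True


def select(texts: List[str], source: str, k: int, seed: int = 0) -> List[str]:
    """Randomly pick up to k texts that pass the filters."""
    pool = [t for t in texts if is_candidate(t, source)]
    return random.Random(seed).sample(pool, min(k, len(pool)))
```

With the proportions given above, one would call `select(news, "wikinews", 800)` and `select(talks, "wikitalks", 200)` for the training recordings.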
### Punctuation for raw text

                         **symbol**   **mean**   **median**   **max**   **sum**     **included**
  ---------------------- ------------ ---------- ------------ --------- ----------- --------------
  **fullstop**           .            12.44      7.0          1129.0    404 378     yes
  **comma**              ,            10.97      5.0          1283.0    356 678     yes
  **question_mark**      ?            0.83       0.0          130.0     26 879      yes
  **exclamation_mark**   !            0.22       0.0          55.0      7 164       yes
  **hyphen**             \-           2.64       1.0          363.0     81 190      yes
  **colon**              :            1.49       0.0          202.0     44 995      yes
  **ellipsis**           \...         0.27       0.0          60.0      8 882       yes
  **semicolon**          ;            0.13       0.0          51.0      4 270       no
  **quote**              "            3.64       0.0          346.0     116 874     no
  **words**                           169.50     89.0         17252.0   5 452 032   \-

The dataset is divided into two parts: conversational (WikiTalks) and
informational (WikiNews).

### Part 1. WikiTalks

Data scraped from Polish Wikipedia Talk pages. Talk pages, also known as
discussion pages, are administrative pages with editorial details and
discussions for Wikipedia articles. Talk pages were scraped from the
web using a list of article titles shared alongside Wikipedia dump
archives.

Wikipedia Talk pages serve as conversational data. Here, users
communicate with each other by writing comments. Vocabulary and
punctuation errors are expected. This data set covers 20% of the spoken
data.

Examples:

- **wikitalks001948:** Cóż za bzdury tu powypisywane! Fra Diavolo
  starał się nie dopuścić do upadku Republiki Partenopejskiej? Kto
  to wymyślił?! Człowiek ten był jednym z najżarliwszych wrogów
  francuskiej okupacji, a za zasługi w wypędzeniu Francuzów został
  mianowany pułkownikiem w królewskiej armii z prawdziwie królewską
  pensją. Bez niego wyzwolenie, nazywać to tak czy też nie,
  północnej części królestwa byłoby dużo trudniejsze, bo dysponował
  siłą kilku tysięcy sprawnych w boju i umiejętnie wziętych w karby
  rzezimieszków. Toteż armia Burbonów nie pokonywała go, jak to się
  twierdzi w artykule, lecz ściśle współpracowała. Redaktorów
  zachęcam do jak najszybszej korekty artykułu, bo aktualnie jest
  obrazą dla ambicji Wikipedii. 91.199.250.17

- **wikitalks008902:** Stare wątki w dyskusji przeniosłem do archiwum.
  Od prawie roku dyskusja w nich nie była kontynuowana. Sławek
  Borewicz

### Part 2. WikiNews

**Wikinews** is a free-content news wiki and a project of the Wikimedia
Foundation. The site works through collaborative journalism. The data
was scraped directly from the Wikinews dump archive. The overall text
quality is high, but vocabulary and punctuation errors may occur. This
data set covers 80% of the spoken data.

Example:

- **wikinews222361:** Misja STS-127 promu kosmicznego Endeavour do
  Międzynarodowej Stacji Kosmicznej została przełożona ze względu na
  wyciek wodoru. Podczas procesu napełniania zewnętrznego zbiornika
  paliwem, część ciekłego wodoru przemieniła się w gaz i przedostała
  się do systemu odpowietrzania. System ten jest używany do
  bezpiecznego odprowadzania nadmiaru wodoru z platformy startowej
  39A do Centrum Lotów Kosmicznych imienia Johna F. Kennedy\'ego.
  Początek misji miał mieć miejsce dzisiaj, o godzinie 13:17. Ze
  względu jednak na awarię, najbliższa możliwa data startu
  wahadłowca to środa 17 czerwca, jednak na ten dzień NASA na
  Przylądku Canaveral zaplanowana wystrzelenie sondy kosmicznej
  Lunar Reconnaissance Orbiter. Misja może być zatem opóźniona do 20
  czerwca, który jest ostatnią możliwą datą startu w tym miesiącu. W
  niedzielę odbędzie się spotkanie specjalistów NASA, na którym
  zostanie ustalona nowa data startu i dalszy plan misji STS-127.

## Data format

Input is a TSV file with two columns:

1. Text ID (to be used when handling forced-aligned transcriptions and WAV files if needed)
2. Input text, in lower-case letters and without punctuation marks

The output should have the same number of lines as the input file; each
line should contain the corresponding text with punctuation marks.

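The input and output conventions described above can be handled with a short Python sketch. The helper names and the toy `add_final_stop` function are hypothetical; a real system would replace the latter with an actual punctuation model. The sketch assumes UTF-8 files with one tab per line separating the ID from the text.

```python
from typing import Callable, List, Tuple


def read_input(path: str) -> List[Tuple[str, str]]:
    """Read (text_id, text) pairs from a two-column TSV file."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Split on the first tab only, so the text column stays intact.
            text_id, _, text = line.rstrip("\n").partition("\t")
            rows.append((text_id, text))
    return rows


def write_output(path: str, rows: List[Tuple[str, str]],
                 punctuate: Callable[[str], str]) -> None:
    """Write one punctuated text per input line, preserving the line count."""
    with open(path, "w", encoding="utf-8") as handle:
        for _text_id, text in rows:
            handle.write(punctuate(text) + "\n")


def add_final_stop(text: str) -> str:
    """Toy stand-in for a model: append a full stop to non-empty text."""
    return text + "." if text else text
```

Because every input row produces exactly one output line, the requirement that both files have the same number of lines holds by construction.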
### Forced-aligned transcriptions

We use force-aligned transcriptions of the original texts to approximate
ASR output. Files in the *.clntmstmp* format contain a forced alignment
of the original text with the audio recording read out by a group of
volunteers. The files may contain errors resulting from incorrect
reading of the text (skipping fragments, adding words missing from the
original text) and alignment errors resulting from the configuration of
the alignment tool for text and audio files. The configuration targeted
Polish; names from foreign languages may be poorly recognised, with the
word duration equal to zero (start and end timestamps are equal). Data
is given in the following format:

> **(timestamp_start,timestamp_end) word**
>
> **...**
>
> **\</s\>**

where **\</s\>** is a symbol of the end of recognition.

Example:

> (990,1200) Rosja
>
> (1230,1500) zaczyna
>
> (1590,1950) powracać
>
> (1980,2040) do
>
> (2070,2400) praktyk
>
> (2430,2490) z
>
> (2520,2760) czasów
>
> (2820,3090) zimnej
>
> (3180,3180) wojny.
>
> (3960,4290) Rosjanie
>
> (4380,4770) wznowili
>
> (4860,5070) bowiem
>
> (5100,5160) na
>
> (5220,5430) stałe
>
> (5520,5670) loty
>
> (5760,6030) swoich
>
> (6120,6600) bombowców
>
> (6630,7230) strategicznych
>
> (7350,7530) poza
>
> (7590,7890) granice
>
> (8010,8010) kraju.
>
> (8880,9300) Prezydent
>
> (9360,9810) Władimir
>
> (9930,10200) Putin
>
> (10650,10650) wyjaśnił,
>
> (10830,10920) iż
>
> (10980,11130) jest
>
> (11160,11190) to
>
> (11220,11520) odpowiedź
>
> (11550,11640) na
>
> (11670,12120) zagrożenie
>
> (12240,12300) ze
>
> (12330,12570) strony
>
> (12660,12870) innych
>
> (13140,13140) państw.
>
> \</s\>

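A forced-alignment file in this format can be parsed with a small Python sketch. It assumes that the raw files hold one entry per line and a plain `</s>` end marker (the backslashes above being Markdown escaping only). The names `AlignedWord`, `parse_alignment` and `pauses_after` are hypothetical, and timestamps are kept as bare integers because their unit is not stated here.

```python
import re
from typing import List, NamedTuple


class AlignedWord(NamedTuple):
    start: int
    end: int
    word: str


# One entry per line: "(start,end) word"
LINE_RE = re.compile(r"^\((\d+),(\d+)\)\s+(\S+)$")


def parse_alignment(text: str) -> List[AlignedWord]:
    """Parse entries up to the </s> end-of-recognition marker."""
    words = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "</s>":
            break
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"unexpected line: {line!r}")
        words.append(AlignedWord(int(match[1]), int(match[2]), match[3]))
    return words


def pauses_after(words: List[AlignedWord]) -> List[int]:
    """Gap between each word's end and the next word's start (0 for the last)."""
    gaps = [nxt.start - cur.end for cur, nxt in zip(words, words[1:])]
    return gaps + [0] if words else []
```

In the example above, the gap after `wojny.` (3180 to 3960) is far longer than the gap after `zimnej` (3090 to 3180), which illustrates why pause length may serve as a speech-derived feature. Entries with equal start and end timestamps, such as `wojny.` itself, are the zero-duration words mentioned earlier.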
## Evaluation procedure

Baseline results will be provided in the final evaluation.

### Punctuation

During the task the following punctuation marks will be evaluated:

  **Punctuation mark**     **symbol**
  ------------------------ ------------
  fullstop                 .
  comma                    ,
  question mark            ?
  exclamation mark         !
  hyphen                   \-
  colon                    :
  ellipsis                 \...
  blank (no punctuation)

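One common way to work with these classes is to assign each word the mark that follows it, with "blank" for no mark. The sketch below shows such a conversion. It is an assumption about how a participant might prepare training labels, not a description of the official tokenisation; the names `MARKS`, `BLANK` and `to_labels` are hypothetical.

```python
from typing import List, Tuple

# Marks evaluated in the task; "..." is listed first so it wins over ".".
MARKS = ("...", ".", ",", "?", "!", "-", ":")
BLANK = ""  # label for "no punctuation"


def to_labels(text: str) -> List[Tuple[str, str]]:
    """Turn punctuated text into (lower-cased word, following mark) pairs."""
    pairs: List[Tuple[str, str]] = []
    for token in text.split():
        if token in MARKS:
            # A free-standing mark (e.g. " - ") labels the previous word.
            if pairs:
                pairs[-1] = (pairs[-1][0], token)
            continue
        label = BLANK
        for mark in MARKS:
            if token.endswith(mark):
                token, label = token[: -len(mark)], mark
                break
        pairs.append((token.lower(), label))
    return pairs
```

Only the last mark of a token is used as its label, so a combination such as `?!` is not handled specially; a fuller implementation would need an explicit rule for such cases.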
### Submission format

Results are to be submitted as JSON files in a format matching the
input data. Each result file will be tested against the gold standard
annotations kept in the file with the matching text ID in its name.

#### Example result directory structure

The **poleval_text.test.tar.gz** data set has the following structure:

+---------------------------+
| **test/**                 |
|                           |
| **json/**                 |
|                           |
| **wikitalks109264.json**  |
|                           |
| **wikitalks0017548.json** |
|                           |
| **wikitalks0017518.json** |
|                           |
| **wikitalks0017499.json** |
|                           |
| **...**                   |
|                           |
| **csv/**                  |
|                           |
| **wikitalks109264.csv**   |
|                           |
| **wikitalks0017548.csv**  |
|                           |
| **wikitalks0017518.csv**  |
|                           |
| **wikitalks0017499.csv**  |
|                           |
| **...**                   |
+---------------------------+

This is the directory structure for **poleval_wav.test.tar.gz**:

+--------------------------------+
| **poleval_final_dataset_wav/** |
|                                |
| **test/**                      |
|                                |
| **wikitalks109264.wav**        |
|                                |
| **wikitalks0017548.wav**       |
|                                |
| **wikitalks0017518.wav**       |
|                                |
| **wikitalks0017499.wav**       |
|                                |
| **...**                        |
+--------------------------------+

Here is the directory structure for **poleval_fa.test.tar.gz**:

+--------------------------------+
| **poleval_final_dataset/**     |
|                                |
| **test/**                      |
|                                |
| **wikitalks109264.clntmstmp**  |
|                                |
| **wikitalks0017548.clntmstmp** |
|                                |
| **wikitalks0017518.clntmstmp** |
|                                |
| **wikitalks0017499.clntmstmp** |
|                                |
| **...**                        |
+--------------------------------+

The correct submission format should be:

+---------------------------+
| **system_response/**      |
|                           |
| **wikitalks109264.json**  |
|                           |
| **wikitalks0017548.json** |
|                           |
| **wikitalks0017518.json** |
|                           |
| **wikitalks0017499.json** |
|                           |
| **...**                   |
+---------------------------+

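Before submitting, it may help to confirm that the response directory holds one JSON file for every test file. The following Python sketch does this using only the standard library; the function name `missing_responses` is hypothetical, and the directory names in the usage line simply follow the listings above.

```python
from pathlib import Path
from typing import List


def missing_responses(test_json_dir: str, response_dir: str) -> List[str]:
    """Return names of test JSON files that have no counterpart in the response."""
    expected = {p.name for p in Path(test_json_dir).glob("*.json")}
    submitted = {p.name for p in Path(response_dir).glob("*.json")}
    return sorted(expected - submitted)
```

For example, `missing_responses("test/json", "system_response")` returns an empty list when every test text has a matching result file.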
#### Schema validation

Use *jsonschema* (<https://pypi.org/project/jsonschema/>) to run a sanity
check against the file with the results:

  ------------------------------------------------
  **\$ jsonschema -i result.json result.schema**
  ------------------------------------------------

For multiple files:

  -----------------------------------------------------------------------
  **\$ ls result/\*.json \| xargs -I{} jsonschema -i {} result.schema**
  -----------------------------------------------------------------------

### Evaluation Script

The evaluation script will be provided on the task's page.

  ------------------------------------------------------------
  **\$ python3 evaluate.py gold_directory system_directory**
  ------------------------------------------------------------

### Metrics

Final results are evaluated in terms of precision, recall, and F1 scores
for predicting each punctuation mark separately. Submissions are
compared with respect to the weighted average of F1 scores for each
punctuation sign.

##### Per-document score:

$Precision = \frac{TP}{TP + FP};\quad Recall = \frac{TP}{TP + FN};\quad F_{1} = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$

##### Global score per punctuation sign *p*:

$P_{p} = avg_{\text{micro}}Precision(p) = \frac{TP}{TP + FP};\quad R_{p} = avg_{\text{micro}}Recall(p) = \frac{TP}{TP + FN};\quad avg_{\text{micro}}F_{1}(p) = 2 \cdot \frac{P_{p} \cdot R_{p}}{P_{p} + R_{p}}$

Here $TP$, $FP$ and $FN$ are the counts for sign *p* pooled over all
documents.

The final scoring metric is calculated as the weighted average of the
global scores per punctuation sign:

$\frac{1}{N}\sum_{p} support(p) \cdot avg_{\text{micro}}F_{1}(p)$

where $support(p)$ is the number of occurrences of sign *p* in the gold
standard and $N$ is the sum of $support(p)$ over all signs.

We invite participants to discuss the evaluation metrics, taking into
account factors such as:

- ASR and forced-alignment errors,
- inconsistencies among annotators,
- the impact of only slight displacement of punctuation,
- assigning different weights to different types of errors.

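The weighted-average F1 described in this section can be sketched in Python as follows. This is not the official evaluation script. It assumes that per-sign counts of true positives, false positives and false negatives have already been pooled over all documents, that support equals TP + FN, and that undefined ratios are scored as zero. Whether the blank class enters the average is not specified here. The names `prf` and `weighted_f1` are hypothetical.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (TP, FP, FN) pooled over all documents


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1; undefined ratios are returned as 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def weighted_f1(counts: Dict[str, Counts]) -> float:
    """Support-weighted average of per-sign F1 scores."""
    total_support = sum(tp + fn for tp, _fp, fn in counts.values())
    if total_support == 0:
        return 0.0
    score = 0.0
    for tp, fp, fn in counts.values():
        support = tp + fn  # occurrences of the sign in the gold standard
        score += support * prf(tp, fp, fn)[2]
    return score / total_support
```

For instance, with counts `{".": (8, 2, 2), ",": (3, 1, 3)}` the per-sign F1 scores are 0.8 and 0.6 with supports 10 and 6, so the weighted average is (10 · 0.8 + 6 · 0.6) / 16 = 0.725.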
### Downloads

Data can be downloaded from Google Drive. Below is a list of file names
along with a description of what they contain.

- poleval_fa.train.tar.gz - archive containing the forced alignment of
  the original text with the audio file
- poleval_text.train.tar.gz - archive containing the original text, in
  the provided JSON format and CSV, corresponding to the audio files
- poleval_text.rest.tar.gz - archive containing the original text, in
  the provided JSON format and CSV, for which no audio files were
  provided
- poleval_wav.train.tar.gz - archive containing the audio files

### References

1. Yi, J., Tao, J., Bai, Y., Tian, Z., & Fan, C. (2020). Adversarial
   transfer learning for punctuation restoration. *arXiv preprint
   arXiv:2004.00248*.

2. Nguyen, Thai Binh, et al. \"Improving Vietnamese Named Entity
   Recognition from Speech Using Word Capitalization and Punctuation
   Recovery Models.\" *Proc. Interspeech 2020* (2020): 4263-4267.

3. Hlubík, Pavel, et al. \"Inserting Punctuation to ASR Output in a
   Real-Time Production Environment.\" *International Conference on
   Text, Speech, and Dialogue*. Springer, Cham, 2020.

4. Sirts, Kairit, and Kairit Peekman. \"Evaluating Sentence
   Segmentation and Word Tokenization Systems on Estonian Web
   Texts.\" *Human Language Technologies--The Baltic Perspective:
   Proceedings of the Ninth International Conference Baltic HLT
   2020*. Vol. 328. IOS Press, 2020.

5. Wang, Xueyujie. \"Analysis of Sentence Boundary of the Host\'s
   Spoken Language Based on Semantic Orientation Pointwise Mutual
   Information Algorithm.\" *2020 12th International Conference on
   Measuring Technology and Mechatronics Automation (ICMTMA)*.
   IEEE, 2020.

6. Sunkara, Monica, et al. \"Multimodal Semi-supervised Learning
   Framework for Punctuation Prediction in Conversational Speech.\"
   *arXiv preprint arXiv:2008.00702* (2020).