# OCR challenge for index cards

The goal of this task is to post-process the output of the Tesseract
OCR engine. Alternatively, it can be treated as an OCR task, as the images
are also available.

The data set is based on the index cards from [Korpus Frazeologiczny
Języka Polskiego](http://leksykografia.amu.edu.pl/).

## Metrics

The task will be evaluated using the following metrics:

* CharMatch (main metric) — measures the deviation between the
  output, the input (as obtained from Tesseract OCR) and the expected output;
  see
  <https://re-research.pl/ltc-2017-iayko-jassem-gralinski-obrebski.pdf>,
  page 4 (CharMatch penalizes unwanted changes more than WER/CER).
* WER (Word Error Rate) — the equivalent of CER for words (the number of words inserted, substituted
  and deleted, divided by the total number of words in the reference).
* CER (Character Error Rate) — the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance)
  between the reference text and the OCR engine output, divided by the total number of characters in the reference text.

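
As a rough illustration of the CER and WER definitions above, the following sketch computes both with a plain dynamic-programming Levenshtein distance. The function names (`levenshtein`, `cer`, `wer`) and the whitespace tokenization are assumptions made for this example. GEval's own implementation may differ in details such as normalization, so official scores should come from GEval.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character edits divided by the number of reference characters."""
    # max(..., 1) avoids division by zero for an empty reference (an assumption of this sketch).
    return levenshtein(reference, hypothesis) / max(len(reference), 1)


def wer(reference, hypothesis):
    """Word edits divided by the number of reference words (split on whitespace)."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)
```

Both functions expect decoded text, i.e. with real line breaks rather than the escaped one-line form stored in the TSV files.
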
## Evaluation

You can carry out the evaluation with [GEval](https://gitlab.com/filipg/geval)
once you have generated `out.tsv` files (in the same format as the `expected.tsv` files):

```
wget https://gonito.net/get/bin/geval
chmod u+x geval
./geval -t dev-0
```

## Directory structure

* `README.md` — this file
* `config.txt` — GEval configuration file
* `images/` — images to be processed, referenced in TSV files
* `in-header.tsv` — one-line TSV file with column names for the input data
* `out-header.tsv` — one-line TSV file with column names for the output data
* `train/` — directory with hand-annotated gold-standard OCR training data
* `train/in.tsv` — input data for the train set
* `train/expected.tsv` — expected (reference) data for the train set
* `dev-0/` — directory with hand-annotated gold-standard OCR development data
* `dev-0/in.tsv` — input data for the dev set
* `dev-0/expected.tsv` — expected (reference) data for the dev set
* `test-A/` — directory with hand-annotated gold-standard OCR test data
* `test-A/in.tsv` — input data for the test set
* `test-A/expected.tsv` — expected (reference) data for the test set (hidden)

Note that we mean TSV, *not* CSV files. In particular, double quotes
are not considered special characters here! For example, set
`quoting` to `QUOTE_NONE` when using the Python `csv` module:

```python

import csv

# The files are UTF-8 encoded; newline='' is what the csv module recommends for file objects.
with open('file.tsv', 'r', encoding='utf-8', newline='') as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t', quoting=csv.QUOTE_NONE)
    for item in reader:
        pass  # item is a list of column values for one line
```

### Downloading image files

Image files are stored using [git-annex](https://git-annex.branchable.com).
If you need them, install git-annex and run `./get-annexed-files.sh`.

## Format of the test sets

The input file (`in.tsv`) consists of 4 TAB-separated columns:

* the name of the test image (MD5 digest of the binary content with the `.png` extension); these
  files are in the `images/` directory,
* the ISO 639-3 language code of the source document (always `pol`),
* the pixel depth of the image (always 400),
* the output from Tesseract OCR to be corrected.

(The 2nd and 3rd fields are kept for compatibility with other OCR challenges;
in this challenge they are always `pol` and `400`, respectively.)

Each entry in `expected.tsv` contains the text recognized from the test image.

Line breaks (newline characters) and backslashes are replaced with `\n` and
`\\` respectively, so you should decode them, using for example this
Python code:

    import re

    def decode_text(t):
        # Decode in a single pass, so that an escaped backslash followed by "n" is not turned into a newline.
        return re.sub(r'\\(.)', lambda m: {'n': '\n', '\\': '\\'}.get(m.group(1), m.group(0)), t)

Each line in `expected.tsv` corresponds to the line with the same number in
`in.tsv`.

All the files are UTF-8 encoded.

Note that the `out.tsv` and `expected.tsv` files have the `.tsv` extension
only for consistency. In fact, they are plain text files with one entry per line.

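
To make the line-by-line correspondence concrete, here is a minimal sketch that reads `in.tsv` and `expected.tsv` in parallel and yields one record per index card. The names `Card` and `read_cards` are hypothetical, and the sketch assumes that the fourth column of `in.tsv` is escaped in the same way as `expected.tsv`.

```python
import csv
import re
from typing import Iterator, NamedTuple

_ESCAPES = {'n': '\n', '\\': '\\'}


class Card(NamedTuple):
    image: str      # file name inside images/
    language: str   # always "pol" in this challenge
    depth: str      # always "400" in this challenge
    ocr_text: str   # decoded Tesseract output
    reference: str  # decoded gold-standard text


def decode_text(t: str) -> str:
    # Single pass over the escapes; unknown escapes are left untouched.
    return re.sub(r'\\(.)', lambda m: _ESCAPES.get(m.group(1), m.group(0)), t)


def read_cards(in_path: str, expected_path: str) -> Iterator[Card]:
    with open(in_path, encoding='utf-8', newline='') as fin, \
         open(expected_path, encoding='utf-8') as fexp:
        reader = csv.reader(fin, delimiter='\t', quoting=csv.QUOTE_NONE)
        # Line N of in.tsv belongs with line N of expected.tsv.
        for row, expected_line in zip(reader, fexp):
            image, language, depth, ocr_text = row
            yield Card(image, language, depth,
                       decode_text(ocr_text),
                       decode_text(expected_line.rstrip('\n')))
```

For example, `for card in read_cards('dev-0/in.tsv', 'dev-0/expected.tsv'): ...` iterates over the dev set. This does not work for `test-A`, because its `expected.tsv` is hidden.
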
### Submission format

Each entry in `expected.tsv` contains the entire text to be
recognized, collapsed into one line. To achieve the best possible
results, format the submitted `out.tsv` in the same way, i.e.
don't forget to encode backslashes and line breaks:

    def encode_text(t):
        # Backslashes must be escaped first, otherwise the backslash in "\n" would be doubled.
        return t.replace('\\', '\\\\').replace('\n', '\\n')

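
Putting the pieces together, the following sketch shows one possible way to produce a submission file: it reads `in.tsv`, applies a correction function to the decoded Tesseract text, and writes one encoded line per input line. The names `write_submission` and `correct` are hypothetical, and the sketch assumes the fourth column of `in.tsv` uses the same escaping as `expected.tsv`. With the default identity `correct`, the output is just the uncorrected OCR text, which can serve as a baseline.

```python
import csv
import re

_ESCAPES = {'n': '\n', '\\': '\\'}


def decode_text(t):
    # Single-pass decoding of the "\n" and "\\" escapes.
    return re.sub(r'\\(.)', lambda m: _ESCAPES.get(m.group(1), m.group(0)), t)


def encode_text(t):
    # Escape backslashes first, then line breaks.
    return t.replace('\\', '\\\\').replace('\n', '\\n')


def write_submission(in_path, out_path, correct=lambda text: text):
    """Write one encoded output line for every line of the input file."""
    with open(in_path, encoding='utf-8', newline='') as fin, \
         open(out_path, 'w', encoding='utf-8', newline='\n') as fout:
        reader = csv.reader(fin, delimiter='\t', quoting=csv.QUOTE_NONE)
        for row in reader:
            ocr_text = decode_text(row[3])  # 4th column: Tesseract output
            fout.write(encode_text(correct(ocr_text)) + '\n')
```

Calling `write_submission('dev-0/in.tsv', 'dev-0/out.tsv')` and then running `./geval -t dev-0` should give the scores of the unmodified Tesseract output. A real system would pass its own `correct` function.
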