# OCR challenge for index cards
The goal of this task is to post-process the output of the Tesseract OCR engine. Alternatively, it can be treated as an OCR task in its own right, as the images are also available.
The data set is based on index cards from the Korpus Frazeologiczny Języka Polskiego (the Phraseological Corpus of the Polish Language).
## Metrics

The task will be evaluated using the following metrics:

- CharMatch (the main metric) — measures the deviation between the output, the input (as obtained from Tesseract OCR) and the expected output; see https://re-research.pl/ltc-2017-iayko-jassem-gralinski-obrebski.pdf, page 4 (CharMatch penalizes unwanted changes more heavily than WER/CER).
- WER (Word Error Rate) — the word-level equivalent of CER (the number of words inserted, substituted and deleted, divided by the total number of words).
- CER (Character Error Rate) — the Levenshtein distance between the reference text and the OCR engine output, divided by the total number of characters (see the sketch below).
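For orientation, here is a minimal, unofficial sketch of CER and WER (the challenge itself is scored by GEval, so treat this only as a sanity check); `reference` and `hypothesis` are assumed to be plain, already-decoded strings:

```
def levenshtein(a, b):
    # classic dynamic-programming edit distance over characters or tokens
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (x != y)))  # substitution
        previous = current
    return previous[-1]

def cer(reference, hypothesis):
    # character-level edit distance divided by the reference length
    return levenshtein(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    # the same computation on whitespace-separated words
    return levenshtein(reference.split(), hypothesis.split()) / len(reference.split())
```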
## Evaluation

You can carry out the evaluation with GEval once you have generated `out.tsv` files (in the same format as the `expected.tsv` files):
```
wget https://gonito.net/get/bin/geval
chmod u+x geval
./geval -t dev-0
```
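The `-t` option selects the directory against which your `out.tsv` is scored (here `dev-0/`); the same command can be run for any directory that has a visible `expected.tsv` file.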
## Directory structure

- `README.md` — this file
- `config.txt` — GEval configuration file
- `images/` — images to be processed, referenced in the TSV files
- `in-header.tsv` — one-line TSV file with column names for the input data
- `out-header.tsv` — one-line TSV file with column names for the output data
- `train/` — directory with hand-annotated gold-standard OCR training data
- `train/in.tsv` — input data for the train set
- `train/expected.tsv` — expected (reference) data for the train set
- `dev-0/` — directory with hand-annotated gold-standard OCR development data
- `dev-0/in.tsv` — input data for the dev set
- `dev-0/expected.tsv` — expected (reference) data for the dev set
- `test-A/` — directory with hand-annotated gold-standard OCR test data
- `test-A/in.tsv` — input data for the test set
- `test-A/expected.tsv` — expected (reference) data for the test set (hidden)
Note that these are TSV, not CSV files. In particular, double quotes are not considered special characters here! Therefore, set `quoting` to `QUOTE_NONE` in the Python `csv` module:
```
import csv

with open('file.tsv', 'r') as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t', quoting=csv.QUOTE_NONE)
    for item in reader:
        pass
```
## Downloading image files

Image files are kept using git-annex. If you need them, install git-annex and run `./get-annexed-files.sh`.
## Format of the test sets

The input file (`in.tsv`) consists of 4 TAB-separated columns:

- the name of the test image (the MD5 digest of its binary content, with the `.png` extension); these files are in the `images/` directory,
- the ISO-639-3 language code of the source document (always `pol`),
- the pixel depth of the image (always 400),
- the output from Tesseract OCR to be corrected.

(The 2nd and 3rd fields are kept only for compatibility with other OCR challenges; in this challenge they are always `pol` and `400`, respectively.)
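As an illustration, here is a minimal sketch of reading those four columns (the file path and function name are placeholders; if your copy of the data is xz-compressed, open the file with `lzma.open` instead):

```
import csv

def read_input(path='dev-0/in.tsv'):
    # yields (image file name, language code, pixel depth, Tesseract output)
    with open(path, 'r', encoding='utf-8', newline='') as tsvfile:
        reader = csv.reader(tsvfile, delimiter='\t', quoting=csv.QUOTE_NONE)
        for image_name, lang, depth, ocr_text in reader:
            yield image_name, lang, int(depth), ocr_text
```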
Each entry in `expected.tsv` contains the reference text for the corresponding test image. Carriage returns (CR) and backslashes are replaced with `\n` and `\\`, respectively, so you should decode them, using for example this Python code:
```
def decode_text(t):
    return t.replace('\\n', '\n').replace('\\\\', '\\')
```
Each line in `expected.tsv.xz` corresponds to the respective line in `in.tsv.xz`. All the files are UTF-8 encoded.
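Putting the two together, a short sketch (the file path is again a placeholder) that loads the reference texts aligned with the input lines:

```
def read_expected(path='dev-0/expected.tsv'):
    # one decoded reference text per line, in the same order as in.tsv
    with open(path, 'r', encoding='utf-8') as f:
        return [decode_text(line.rstrip('\n')) for line in f]
```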
Note that the `out.tsv` and `expected.tsv` files have the `.tsv` extension only for consistency; actually, they are just plain text files.
## Submission format

Each entry in `expected.tsv` contains the entire text to be recognized, compressed to one line. In order to achieve the best possible results, you should format your submitted `out.tsv` in a similar way, i.e. do not forget to encode backslashes and carriage returns:
```
def encode_text(t):
    return t.replace('\\', '\\\\').replace('\n', '\\n')
```
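For instance, a hypothetical helper that writes one encoded prediction per input line:

```
def write_output(texts, path='out.tsv'):
    # texts: corrected texts, one per input line, in the same order as in.tsv
    with open(path, 'w', encoding='utf-8') as f:
        for t in texts:
            f.write(encode_text(t) + '\n')
```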