Optical Character Recognition

Warning

This section is very much work in progress!

Optical character recognition (OCR) is the process of extracting text from document images. An OCR engine usually takes an image as input and produces text as output. OCR is a well-known computer vision problem that is often considered solved. In practice, however, obtaining good results can still be challenging, depending on the quality of the data.

OCR output

Figure: Output from an OCR engine for a text line image.

Tesseract

Tesseract is an open-source OCR engine that uses a bidirectional LSTM and language models to perform OCR on images of any type of document. Tesseract provides various options for processing an image with text, such as the language of the text or, if the image contains multilingual text, the order of the languages. As of now, Tesseract cannot perform OCR on handwritten text. Processing an image with Tesseract follows a pipeline-like process: the first step is binarization and noise removal, then text lines are detected by performing document layout analysis on the binary image, and finally the text lines are passed to the bidirectional LSTM for text recognition.
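The language options described above can be sketched with the Tesseract command-line tool, where `-l` takes one or more language codes joined with `+` and `stdout` as the output base prints the recognized text. This is a minimal sketch that assumes the `tesseract` binary and the requested language data are installed; the helper names `build_tesseract_command` and `run_tesseract` and the file name `page.png` are hypothetical.

```python
import subprocess


def build_tesseract_command(image_path, languages=("eng",), psm=None):
    """Build the argument list for a Tesseract CLI call (hypothetical helper).

    languages: language codes in priority order, joined with '+' for -l.
    psm: optional page segmentation mode passed through --psm.
    """
    cmd = ["tesseract", str(image_path), "stdout", "-l", "+".join(languages)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd


def run_tesseract(image_path, languages=("eng",), psm=None):
    """Run Tesseract and return the recognized text.

    Assumes the tesseract binary is on PATH; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        build_tesseract_command(image_path, languages, psm),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Show the command for a page that is mostly German with some English.
print(" ".join(build_tesseract_command("page.png", ("deu", "eng"))))
```

Keeping the command construction separate from the call makes the language order easy to inspect before anything is executed.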

OCR preprocessing

Image preprocessing is an important step when performing OCR on document images. An image can have different properties that affect the quality of the OCR output. Depending on how it was acquired, each image might need different preprocessing steps to improve its quality and achieve optimal OCR results. The following preprocessing steps are recommended before performing OCR on an image.

Rotation

An OCR engine expects the text in the image to follow the reading order of the language, which is left to right and top to bottom for any Latin-script language. The orientation of the image makes a significant difference in text prediction, since the OCR engine relies on the respective language model and word dictionaries to predict the text. The orientation of the image also has a significant effect on the detection of text regions and text lines.
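As an illustration of correcting the orientation, the sketch below rotates an image in steps of 90 degrees. It assumes the image is stored as a list of pixel rows and that the required rotation angle is already known; the function names are hypothetical, and estimating the angle or correcting small skew is not covered.

```python
def rotate_90_clockwise(img):
    """Rotate a pixel grid (list of rows) by 90 degrees clockwise."""
    # Reversing the row order and transposing yields a clockwise rotation.
    return [list(row) for row in zip(*img[::-1])]


def rotate(img, degrees):
    """Rotate a pixel grid clockwise by a multiple of 90 degrees.

    Negative values rotate counterclockwise. Returns a new grid.
    """
    if degrees % 90 != 0:
        raise ValueError("only multiples of 90 degrees are supported")
    out = [list(row) for row in img]
    for _ in range((degrees // 90) % 4):
        out = rotate_90_clockwise(out)
    return out


# A page that was scanned upside down is turned upright with a 180 degree rotation.
upside_down = [[4, 3], [2, 1]]
print(rotate(upside_down, 180))
```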

Binarization

Binarization is the process of converting a color or grayscale image to a single-channel image with only two pixel values, 0 and 1. Binarization sets the background of the image to one color, either white or black, and the foreground elements to the other. The objective is to separate foreground from background elements; for document images, this means separating the text from the page background.
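One common way to pick the cut between foreground and background is Otsu's method, which chooses the threshold that maximizes the variance between the two resulting pixel classes. The sketch below is only an illustration of that technique, not necessarily what a given OCR engine does internally. It assumes an 8-bit grayscale image stored as a list of rows with dark text on a light page; the function names are hypothetical.

```python
def otsu_threshold(gray):
    """Return the threshold in 0..255 that maximizes between-class variance."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    n_low, sum_low = 0, 0  # pixel count and intensity sum of the class <= t
    for t in range(256):
        n_low += hist[t]
        sum_low += t * hist[t]
        n_high = total - n_low
        if n_low == 0:
            continue  # no pixels at or below t yet
        if n_high == 0:
            break  # all pixels are at or below t
        mean_low = sum_low / n_low
        mean_high = (sum_all - sum_low) / n_high
        var_between = n_low * n_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray):
    """Map pixels above the Otsu threshold to 1 (page) and the rest to 0 (text)."""
    t = otsu_threshold(gray)
    return [[1 if value > t else 0 for value in row] for row in gray]


# Dark text pixels (20, 30) on a light page (200..220).
page = [[200, 220, 30], [20, 210, 200]]
print(binarize(page))
```

A single global threshold works best on evenly lit pages; images with shadows or uneven lighting may need a locally adaptive method instead.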

Noise removal

Binarizing a document image does not necessarily give noise-free results. Less effective binarization methods can leave noise in the resulting document image. Several types of noise can be present in a binary image, such as:

  • Non-typewritten text
  • Bleed-through text
  • Salt and pepper noise
  • Broken or connected characters
  • Marginal noise
  • Other artifact noise
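Of the noise types listed above, salt and pepper noise is the simplest to illustrate. A median filter replaces each pixel with the median of its neighborhood, which removes isolated black or white pixels. The following is a minimal sketch assuming a binary image stored as a list of rows of 0s and 1s; the function name is hypothetical, and the other noise types need different techniques.

```python
from statistics import median_low


def median_filter_3x3(img):
    """Apply a 3x3 median filter to a binary image given as a list of rows.

    Border pixels use only the neighbors that lie inside the image.
    """
    height, width = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(height):
        for x in range(width):
            window = [
                img[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            # median_low always returns a value that occurs in the window.
            out[y][x] = median_low(window)
    return out


# A white page (1) with a single isolated black speck (0) in the middle.
noisy = [[1] * 5 for _ in range(5)]
noisy[2][2] = 0
cleaned = median_filter_3x3(noisy)
print(cleaned[2][2])
```

Note that a median filter can also erode thin strokes, so the window size should stay small relative to the character stroke width.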

Text error rate

The text recognized from a given input image is not always correct, for several reasons. For example, an OCR engine can sometimes fail to identify some text regions during document layout analysis because of overlapping text components or non-text components such as artifacts or images.

Ground truth text, transcribed manually from the corresponding document image, is used to calculate the word error rate and the character error rate of the recognized text. The error rate is evaluated by computing the edit distance between the extracted text and the ground truth text. The edit distance is the minimum number of changes, i.e., insertions (i), substitutions (s), and deletions (d), needed to turn the extracted text into the ground truth. Dividing it by the number of words (n_w) or characters (n_c) in the ground truth gives the word or character error rate; the subscripts w and c indicate whether the edits are counted over words or characters.

\[Word\ Error\ Rate = \frac{i_w + s_w + d_w}{n_w}\]
\[Character\ Error\ Rate = \frac{i_c + s_c + d_c}{n_c}\]
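The two formulas can be sketched with a standard Levenshtein edit distance, applied once to word sequences and once to character sequences. This is a minimal sketch; the function names are hypothetical, words are assumed to be separated by whitespace, and no normalization (case, punctuation) is applied.

```python
def edit_distance(hyp, ref):
    """Minimum number of insertions, substitutions, and deletions
    needed to turn the sequence hyp into the sequence ref."""
    prev = list(range(len(ref) + 1))  # distances from an empty hyp prefix
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(
                prev[j] + 1,         # drop h from the extracted text
                curr[j - 1] + 1,     # add r to the extracted text
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def error_rate(hyp, ref):
    """Edit distance divided by the length of the ground truth sequence."""
    if len(ref) == 0:
        raise ValueError("ground truth must not be empty")
    return edit_distance(hyp, ref) / len(ref)


def word_error_rate(extracted, truth):
    return error_rate(extracted.split(), truth.split())


def character_error_rate(extracted, truth):
    return error_rate(list(extracted), list(truth))


truth = "the quick brown fox"
extracted = "the quick brovvn fox"  # 'w' misread as 'vv'
print(word_error_rate(extracted, truth))       # 1 wrong word out of 4
print(character_error_rate(extracted, truth))  # 2 edits over 19 characters
```

Because the edit count is divided by the ground truth length, both rates can exceed 1 when the extracted text contains many spurious insertions.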