Text recognition, or Optical Character Recognition (OCR), is the electronic or mechanical conversion of typed, handwritten, or printed text into machine-encoded text, for example from a scanned document or a photo of a document. Tesseract is an open-source OCR engine originally developed by HP, released as open source in 2005, and sponsored by Google since 2006.
Tesseract Implementation in Python
For our implementation, we are going to use a Python wrapper for Tesseract known as Pytesseract. There are other Tesseract wrappers available for different languages, which can be found here. We can install Pytesseract with pip install pytesseract; note that the Tesseract engine itself must be installed separately.
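A minimal setup sketch is shown below. It assumes the Tesseract binary is already installed on the system; the explicit tesseract_cmd path is only needed if the executable is not on your PATH.

```python
# Minimal setup sketch. Pytesseract is only a wrapper, so the Tesseract
# engine itself must be installed separately (e.g. via apt or the Windows installer).
#   pip install pytesseract opencv-python
import pytesseract

# If Tesseract is not on the PATH, point the wrapper at the executable explicitly
# (the path below is just an example; adjust it for your system).
# pytesseract.pytesseract.tesseract_cmd = r"/usr/bin/tesseract"

# Quick sanity check that the wrapper can find the engine.
print(pytesseract.get_tesseract_version())
```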
We can start by reading the image and converting it into a NumPy array using the OpenCV function imread, and we can check whether the image was read correctly using the matplotlib function imshow, as shown below.
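A short sketch of these two steps, assuming a placeholder input file named sample.png:

```python
import cv2
from matplotlib import pyplot as plt

# Read the image into a NumPy array (OpenCV uses BGR channel order by default).
# "sample.png" is just a placeholder filename for this sketch.
image = cv2.imread("sample.png")

# Display the image to confirm it was read correctly; convert BGR -> RGB
# so the colours render properly in matplotlib.
plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
plt.axis("off")
plt.show()
```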
We have read the image, and we now have it as an array of pixel values. Next, we process it with Tesseract using the image_to_string function, which runs Google's pre-trained Tesseract models. We can use Pytesseract with the default configuration or with our own configuration.
For the default configuration, we just need to pass the image array to the function, as shown below.
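A minimal sketch of the default call, again assuming the placeholder file sample.png:

```python
import cv2
import pytesseract

# Run Tesseract with its default configuration: English language,
# default OCR engine mode and page segmentation.
image = cv2.imread("sample.png")  # placeholder filename
text = pytesseract.image_to_string(image)
print(text)
```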
For our own custom configuration, we can set the input language, the OEM (OCR engine mode), and the page segmentation mode.
- Language: We can set our Tesseract model to detect a single language or multiple languages. We set the language with the -l argument, for example -l eng for English.
- OEM (OCR Engine Mode): We can choose which OCR engine Tesseract uses with this configuration. In Tesseract 4, there are 4 modes available:
- 0 Legacy Engine
- 1 Neural nets LSTM engine only.
- 2 Legacy + LSTM engines.
- 3 Default, based on what is available.
We can try different OEM configurations using the --oem 3 argument.
- Page Segmentation: We can adjust the page segmentation mode (PSM) to match our text layout for better results. In Tesseract 4, the following page segmentation modes are available:
- 0 Orientation and script detection (OSD) only.
- 1 Automatic page segmentation with OSD.
- 2 Automatic page segmentation, but no OSD, or OCR.
- 3 Fully automatic page segmentation, but no OSD. (Default)
- 4 Assume a single column of text of variable sizes.
- 5 Assume a single uniform block of vertically aligned text.
- 6 Assume a single uniform block of text.
- 7 Treat the image as a single text line.
- 8 Treat the image as a single word.
- 9 Treat the image as a single word in a circle.
- 10 Treat the image as a single character.
- 11 Sparse text. Find as much text as possible in no particular order.
- 12 Sparse text with OSD.
- 13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
We can try different page segmentation modes using the --psm 3 argument.
The following example demonstrates setting custom configurations for our model.
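Below is a sketch of a custom configuration, assuming the placeholder file sample.png; the specific values (English, OEM 3, PSM 6) are only illustrative choices.

```python
import cv2
import pytesseract

image = cv2.imread("sample.png")  # placeholder filename

# Custom configuration: English language, OEM 3 (default engine selection)
# and PSM 6 (assume a single uniform block of text).
custom_config = r"-l eng --oem 3 --psm 6"
text = pytesseract.image_to_string(image, config=custom_config)
print(text)
```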
Results of the Model
We ran our model with the default configuration and obtained the following results.
Input Image
Output text:
We can achieve better results with better pre-processing, but the model also has limitations; in particular, training it on our own dataset for better results is not straightforward.
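As a simple illustration of pre-processing, the hedged sketch below converts the placeholder image to grayscale and applies Otsu thresholding before OCR, which often helps with noisy scans; more elaborate pipelines (deskewing, denoising) are of course possible.

```python
import cv2
import pytesseract

# Simple pre-processing sketch: grayscale conversion followed by Otsu
# thresholding, then OCR on the binarised image.
image = cv2.imread("sample.png")  # placeholder filename
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

print(pytesseract.image_to_string(binary))
```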
Pros and Cons of Tesseract
Pros
● Support for more than 100 languages
● Very easy to use (see the manual page, not built-in help)
● Works great with 300 DPI files
● Open source, with wrappers available for many programming languages
Cons
● Rudimentary image processing
● Poor results with tilted or skewed text
● Poor results with overly sharp or bright images
● Poor results with stylized or handwritten text