Introduction to Python Pytesseract Package (original) (raw)

Last Updated : 20 Sep, 2024

In the age of digital transformation, extracting information from images and scanned documents has become a crucial task.

Optical Character Recognition (OCR) is a technology that enables this by converting different types of documents, such as scanned paper documents, PDF files, or images taken by a digital camera, into editable and searchable data. The Pytesseract module, a Python wrapper for Google's Tesseract-OCR Engine, is one of the most popular tools for this purpose.

What is Pytesseract?

Pytesseract is an OCR tool for Python, which enables developers to convert images containing text into string formats that can be processed further. It is essentially a Python binding for Tesseract, which is one of the most accurate open-source OCR engines available today.

Steps to Download and Configure Tesseract-OCR

1. **Download and Install Tesseract-OCR

Tesseract is a free and open-source OCR (Optical Character Recognition) engine.

For Windows:

Go to the official Tesseract GitHub release page: **https://github.com/UB-Mannheim/tesseract/wiki
Download the Windows installer (tesseract-ocr-setup.exe) from the releases section.
Run the installer and complete the installation process.

Screenshot-2024-09-20-093327

Download Tesseract-OCR

For macOS:

We can install Tesseract via Homebrew:

brew install tesseract

For Linux (Ubuntu/Debian):

Install Tesseract using the package manager:

sudo apt update
sudo apt install tesseract-ocr

2. **Add Tesseract to the Environment Variables

For Windows:

Open the **Start Menu, search for **Environment Variables, and select **Edit the system environment variables.
In the **System Properties window, click the **Environment Variables button.
Under **System variables, find and select the Path variable, then click **Edit.
Click **New and add the path to the tesseract.exe file (usually located in C:\Program Files\Tesseract-OCR\**).
Click **OK to save changes.

For macOS and Linux:

Open a terminal and add the Tesseract path to the shell configuration file (.bashrc, .zshrc, or .bash_profile):

export PATH="/usr/local/bin/tesseract:$PATH"

Save the file and run the following command to apply changes:

source ~/.bashrc # or source ~/.zshrc

3. **Verify Tesseract Installation

After adding Tesseract to our environment variables, open a terminal (or Command Prompt on Windows) and type:

tesseract --version

Screenshot-2024-09-20-092838

check tesseract version

**4. Install Pytesseract:

To use Tesseract with **Python, we also need to install the pytesseract package, which acts as a Python wrapper for Tesseract.

Let's install **pytesseract using pip:

pip install pytesseract

Verify the installation:

python -c "import pytesseract; print(pytesseract.get_tesseract_version())"

Screenshot-2024-09-20-093208

Check pytesseract version

Usage of Pytesseract With Practical Examples

In the first and second example, we have used the first image and for the third example we have used the second image.

Example 1: Basic Text Extraction from an Image

**Explanation : This simple example shows how to open an image and use Pytesseract to extract the text. The image_to_string method converts the image into text.

Python `

from PIL import Image import pytesseract

Load an image

img = Image.open('image.png')

Extract text from image

text = pytesseract.image_to_string(img)

print(text)

**Output

Reading text from image using pytessact

Example 2 : Extracting Text from a Specific Area in an Image

**Explanation: Sometimes, we may need to extract text from a specific area of the image. This example shows how to crop a region of interest (ROI) from the image and extract text from that region.

Python `

from PIL import Image import pytesseract

Load an image

img = Image.open('image.png')

Define the region of interest (left, upper, right, lower)

roi = img.crop((50, 50, 300, 300))

Extract text from the cropped region

text = pytesseract.image_to_string(roi)

print(text)

**Output:

Screenshot-2024-09-20-094106

Extracting Text from a Specific part of the image

Example 3: Text Extraction from a Noisy Image with Pre-processing

**Explanation : In this example, the image is first converted to grayscale to simplify the data. A median filter is then applied to reduce noise, which is especially helpful for images with a lot of background noise. After that, the contrast of the image is enhanced to make the text more distinguishable from the background. Finally, the pre-processed image is passed to Pytesseract for text extraction.

Python `

from PIL import Image, ImageFilter, ImageEnhance import pytesseract

Load an image with noise or low contrast

img = Image.open('noisy_image.jpg')

Convert the image to grayscale

img = img.convert('L')

Apply a median filter to reduce noise

img = img.filter(ImageFilter.MedianFilter())

Enhance the image contrast

enhancer = ImageEnhance.Contrast(img) img = enhancer.enhance(2)

Extract text from the pre-processed image

text = pytesseract.image_to_string(img)

print(text)

**Output

Reading Text from a noisy image using pytesseract

Advantages of Pytesseract Module

**Accuracy: Pytesseract is based on Tesseract-OCR, which is known for its high accuracy in text extraction, especially for printed documents.
**Language Support: It supports over 100 languages, making it versatile for various applications worldwide.
**Open Source: Both Pytesseract and Tesseract-OCR are open-source, allowing for free usage and modification according to project needs.
**Ease of Use: With simple integration into Python projects, Pytesseract provides an easy way to implement OCR functionality.

Limitations of Pytesseract Module

**Performance: Pytesseract can be slow when processing large documents or images with a lot of text, making it less suitable for real-time applications.
**Complex Layouts: It may struggle with images that have complex layouts, such as those containing tables or multiple columns of text.
**Accuracy with Handwriting: While it's quite effective with printed text, its accuracy decreases significantly with handwritten text.
**Image Quality Dependence: The quality of the extracted text heavily depends on the quality of the input image. Low-resolution or noisy images may lead to incorrect text extraction.

When to Use

**Digitizing Printed Documents: If we need to convert printed text from scanned documents into editable text.
**Extracting Text from Images: Useful in scenarios where we have images containing text that we need to process further.
**Multi-language Text Extraction: When our application requires OCR in different languages, Pytesseract’s extensive language support comes in handy.

When Not to Use

**Real-Time Processing: For applications requiring real-time text extraction, Pytesseract might not be the best choice due to its performance limitations.
**Complex Layouts or Handwritten Text: If our documents have complex layouts or handwritten text, we may need a more sophisticated OCR solution or additional pre-processing steps.
**High-Resolution Image Requirement: If the images are of low quality or resolution, the results may be inaccurate, making Pytesseract less effective.

Conclusion

Pytesseract is a powerful and accessible tool for anyone looking to incorporate OCR functionality into their Python projects. While it has its limitations, particularly with handwritten text and complex layouts, it excels in extracting text from images and printed documents with high accuracy. Whether we're working on digitizing documents or extracting text from images in multiple languages, Pytesseract provides a simple and effective solution.