Data Extraction from Images through OCR

Optical Character Recognition by Open Source OCR Tool Tesseract: A Case Study

Optical character recognition (OCR) is used to convert printed text into editable text, and it is a useful and popular method in a variety of applications. The accuracy of OCR can depend on the text preprocessing and segmentation algorithms. Sometimes it is difficult to retrieve text from an image because of differences in size, style, and orientation, a complex image background, and so on. We begin this paper with an introduction to the OCR method, then discuss the history of the open source OCR tool Tesseract, its architecture, and the experimental results of OCR performed by Tesseract on different kinds of images. We conclude with a comparative study of this tool and a commercial OCR tool, Transym OCR, using vehicle number plates as input. We tried to extract the vehicle number from each plate using Tesseract and Transym, and compared the tools on various parameters.

An overview of Tesseract OCR Engine

Optical character recognition is the machine replication of human reading. It can be described as the mechanical or electronic conversion of scanned images, where the images can contain handwritten, typewritten, or printed text. This paper presents Google’s open source optical character recognition software, Tesseract. We give an overview of the algorithms used in the various stages of the Tesseract pipeline. In particular, we focus on the aspects that are novel or at least unusual in Tesseract compared to other OCR engines.

Optical Character Recognition Using Tesseract

International Journal for Research in Applied Science and Engineering Technology (IJRASET), 2022

Optical Character Recognition (OCR) is a process or technology in which text within a digital image is recognized. With the rapid pace of technology, people want quick, handy, and reliable tools that can fulfil their daily needs. With this motto, we analyzed the existing tools and built this Android app, which provides a seamless experience (no ads and easy to use) and great accuracy. The main objective of this project is to allow automatic extraction of the information that a user wants from a paper document so that it can be used wherever it is needed. In this project, the app uses Tesseract as the OCR engine and a deep learning model to classify the letters and display the text to the user. Tesseract adds a new neural network (LSTM) based OCR engine that is focused on line recognition, while still supporting the legacy Tesseract OCR engine, which works by recognizing character patterns.

Offline optical character recognition (OCR) method: An effective method for scanned documents

22nd International Conference on Computer and Information Technology (ICCIT) (Publisher: IEEE), 2019

Optical Character Recognition (OCR) is a major computer vision task in which the characters in an image are detected and recognized by comparing them to training set images. Detecting characters is one of the more perplexing tasks in computer vision, because the input image is often misaligned or noisy. This paper presents a complete OCR system that works for English characters, mostly in the Calibri font. The system first corrects the skew of the input image if it is not correctly aligned, and then reduces noise. The result is passed through line and character segmentation, and the segments are passed to the recognition module, which recognizes the characters. In experiments on a set of 50 images, the average achievement is 92%, and 98% for the Calibri font. Moreover, the developed technique is computationally efficient and requires less time than other optical character recognition systems.
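The line and character segmentation stage mentioned in this abstract can be illustrated with a projection-profile approach. This is a minimal sketch under stated assumptions, not the paper's implementation: it assumes the image is already deskewed, denoised, and binarized to 0 (background) and 1 (ink), and the function names `segment_lines` and `segment_characters` are hypothetical.

```python
def segment_lines(image, min_ink=1):
    """Split a binary image into text lines using a horizontal projection.

    image: list of rows, each a sequence of 0 (background) / 1 (ink).
    Returns (start_row, end_row) pairs with end_row exclusive.
    Assumption: a row belongs to a line if it holds at least min_ink ink pixels.
    """
    lines = []
    start = None
    for y, row in enumerate(image):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a new line begins here
        elif not has_ink and start is not None:
            lines.append((start, y))       # a blank row closes the line
            start = None
    if start is not None:                  # line runs to the bottom edge
        lines.append((start, len(image)))
    return lines


def segment_characters(image, line):
    """Split one text line into characters using a vertical projection.

    Transposes the band of rows for the line, so columns become rows,
    and reuses segment_lines to find runs of inked columns.
    """
    top, bottom = line
    band = image[top:bottom]
    columns = list(zip(*band))
    return segment_lines(columns)


page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]
lines = segment_lines(page)
chars = [segment_characters(page, line) for line in lines]
```

Projection profiles only separate glyphs that have a blank row or column between them, so touching or overlapping characters would need a more elaborate method.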

Optical Character Recognition (OCR) System

Today there is a growing demand for software systems that recognize characters when information is scanned from paper documents, since a large number of newspapers and books on different subjects exist only in printed form. There is a huge demand for "storing the information available in these paper documents on a computer storage disk and then later reusing this information through a search process". One simple way to store the information in these paper documents on a computer system is to scan the documents and store them as IMAGES. But to reuse this information, it is very difficult to read the individual contents and to search the contents of these documents line by line and word by word. The reason for this difficulty is that the font characteristics of the characters in paper documents differ from the fonts of the characters in the computer system. As a result, the computer is unable to recognize the characters while reading them. This concept of storing the contents of paper documents in computer storage and then reading and searching the content is called DOCUMENT PROCESSING. Sometimes this document processing must handle information in languages other than English. For this document processing we need a software system called a CHARACTER RECOGNITION SYSTEM. This process is also called DOCUMENT IMAGE ANALYSIS (DIA).

A Detailed study and recent research on OCR

International Journal of Computer Science and Information Security (IJCSIS), Vol. 19, No. 2, February 2021

This paper provides a complete overview of OCR. Optical character recognition is the ability of a computer to collect and decipher handwritten input from documents, photos, or other devices. Over the years, many researchers have paid attention to this topic and proposed many methods to address it. This paper provides a historical view and a summary of the research that has been done in this field.

IMPROVING THE EFFICIENCY OF TESSERACT OCR ENGINE

This project investigates the principles of optical character recognition used in the Tesseract OCR engine and techniques to improve its efficiency and runtime. Optical character recognition (OCR) has been used to convert printed text into editable text in various applications and on a variety of devices such as scanners, computers, and tablets. Mobile devices are now taking over from computers in every domain, yet OCR remains a field they have not fully conquered.

Text Extraction from Images Using OCR

International Journal for Research in Applied Science and Engineering Technology (IJRASET), 2020

Nowadays, there is an enormous demand for storing the information available on paper, such as books or newspapers. One existing way to store this information is to scan the desired text, but it is then stored as an image, which does not help further processing. For instance, scanned text images cannot be read word by word or line by line, and the text in them cannot be reused unless we retype the whole content ourselves. Detecting text in documents where it is embedded in complex colored document images is a very challenging problem. There are many potential users who want to extract text from images, archive documents, and so on. For this reason, users need Optical Character Recognition (OCR). It aims at detecting the textual regions of a document and separating them from the graphics portion. Getting information directly from application forms in this way saves a lot of time.

A Survey of OCR Applications

International Journal of Machine Learning and Computing, 2012

Optical Character Recognition, or OCR, is the electronic translation of images of handwritten, typewritten, or printed text into machine-encoded text. It is widely used to recognize and search text in electronic documents or to publish the text on a website. The paper presents a survey of applications of OCR in different fields and further presents experiments for three important applications: CAPTCHA, Institutional Repository, and Optical Music Character Recognition. We make use of an enhanced image segmentation algorithm based on histogram equalization using genetic algorithms for optical character recognition. The paper will act as a good literature survey for researchers starting to work in the field of optical character recognition.
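For readers unfamiliar with histogram equalization, the building block named in this abstract, the following is a minimal sketch of the plain textbook version. It is not the paper's genetic-algorithm-based enhancement, which the abstract does not describe in detail. The function name `equalize_histogram` and the flat list-of-pixels input are assumptions made for illustration.

```python
def equalize_histogram(pixels, levels=256):
    """Plain histogram equalization of a flat list of grayscale values.

    pixels: integers in the range [0, levels - 1].
    Each value is remapped through the cumulative distribution of the image
    so that the output spreads across the full intensity range.
    """
    if not pixels:
        return []

    # Count how often each gray level occurs.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution: cdf[v] = number of pixels with value <= v.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    total = len(pixels)
    cdf_min = next(c for c in cdf if c > 0)   # cdf at the darkest level present
    denom = total - cdf_min
    if denom == 0:
        return list(pixels)                   # single gray level: nothing to spread

    # Integer arithmetic with round-half-up to avoid float rounding surprises.
    return [((cdf[p] - cdf_min) * (levels - 1) + denom // 2) // denom
            for p in pixels]


low_contrast = [10, 20, 30, 40]
stretched = equalize_histogram(low_contrast)
```

In this small example the four dark gray levels are spread evenly over the full 0 to 255 range, which is the contrast gain that makes a later thresholding or segmentation step easier.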

OCR as a Service: An Experimental Evaluation of Google Docs OCR, Tesseract, ABBYY FineReader, and Transym

Optical character recognition (OCR), as a classic machine learning challenge, has been a longstanding topic in a variety of applications in the healthcare, education, insurance, and legal industries, where it converts different types of electronic documents, such as scanned documents, digital images, and PDF files, into fully editable and searchable text data. The rapid generation of digital images on a daily basis makes OCR an imperative and foundational tool for data analysis. With the help of OCR systems, we have been able to save a reasonable amount of effort in creating, processing, and saving electronic documents and adapting them to different purposes. A set of different OCR platforms is now available which, aside from lending theoretical contributions to other practical fields, has demonstrated successful applications in real-world problems. In this work, several qualitative and quantitative experimental evaluations have been performed using four well-known OCR services: Google Docs OCR, Tesseract, ABBYY FineReader, and Transym. We analyze the accuracy and reliability of the OCR packages on a dataset of 1227 images from 15 different categories. Furthermore, we review the state-of-the-art OCR applications in healthcare informatics. The present evaluation is expected to advance OCR research, providing new insights and considerations for the research area, and to help researchers determine which service is best suited for accurate and efficient optical character recognition.
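One common way to quantify the accuracy of an OCR package is to compare its output with a ground-truth transcription using edit distance. The abstract does not state which metric the authors used, so the sketch below is an assumption offered for illustration. The function names and the sample number-plate strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))            # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference, recognized):
    """Character-level accuracy in [0, 1]: 1 minus the character error rate,
    clamped at 0 when the OCR output contains more errors than reference characters."""
    if not reference:
        return 1.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 1.0 - errors / len(reference))


# Hypothetical ground truth and OCR output where "B" was misread as "8".
truth = "MH12AB1234"
ocr_output = "MH12A81234"
accuracy = character_accuracy(truth, ocr_output)
```

Averaging this score over every image in a dataset gives one number per OCR service, which allows the services to be ranked on the same inputs.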