Text Extraction from Tamil and Hindi Document Images using Open Source Optical Character Recognition tools
Related papers
2014 9th International Conference on Industrial and Information Systems (ICIIS), 2014
Optical Character Recognition (OCR) deals with the automated recognition of characters in digital images. OCR refers to the process by which scanned images are electronically processed and converted into an editable document. Handwritten and printed texts are the primary research areas of OCR. Many OCR systems are commercially available for English and Arabic characters, but there is still no recognition system for Tamil that yields a high recognition rate, even when the scanned images are of high quality. The general framework of a Tamil OCR in the literature involves preprocessing, line segmentation, word segmentation, character segmentation, feature extraction and recognition of characters. OCR for printed Tamil documents is challenging for several reasons: one line may contain different font styles, and documents may include pictures, multiple columns, touching adjacent characters, broken characters, low print quality and complex layouts. Furthermore, compared with the 26 letters of the English alphabet, Tamil has 247 letters, which makes recognition more difficult. A few freely available OCRs for Tamil achieve a moderate recognition rate, but performance comparisons of these OCRs on a benchmark dataset are not available. In this paper we compare OCRs for printed Tamil texts on four different types of documents: books, magazines, newspapers and pamphlets. Furthermore, we propose a post-processing error correction technique for the tested OCRs which reduces the overall mean error rate by nearly 10% on those four categories.
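The abstract above reports a mean error rate but does not say how it is measured. One common choice, assumed here purely for illustration, is the character error rate: the edit distance between the OCR output and a reference transcription, divided by the reference length. The function names below are hypothetical, and the sketch counts Unicode code points, so a Tamil letter made of a consonant plus a vowel sign counts as more than one unit.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between ref and hyp, computed row by row."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalised by reference length (in code points)."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Averaging this value over the pages of each document category would give one possible mean error rate per category; whether the paper uses this exact definition is not stated.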
Optical Character Recognition (OCR) is the electronic translation of images of handwritten, typewritten or printed text into machine-encoded text. OCR is a very important task in pattern recognition. Character recognition for foreign languages, especially English, has been extensively studied by many researchers, but because of the complexity of Indian languages such as Hindi, Punjabi, Telugu and Malayalam, research on them is very limited and constrained. This paper presents research work related to all Indian languages; various approaches to character recognition and some of its applications are also discussed. The aim of this paper is to provide an overview of the research going on in Indian script OCR systems. This survey was felt necessary because OCR for Indian scripts is still a challenging task. Hence, a brief introduction to the general OCR and typical steps in the development of an OCR are give...
Recognition of Hindi Character Using OCR-Technology: A Review
International Journal of Advanced Trends in Computer Science and Engineering , 2023
Character recognition is a technique that enables the transformation of various kinds of scanned papers into an editable, readable, and searchable format. In the last two decades, several researchers and technologists have worked continuously in this field to improve accuracy. Character recognition is classified into printed character recognition, handwritten character recognition, and recognition of characters written in images. Character recognition is a major area of research in the field of pattern recognition. This paper presents an overview of Hindi character recognition using the optical character recognition (OCR) technique. We survey some major research breakthroughs in character recognition, especially for Hindi characters. This article aims to provide deeper insight for researchers and technologists working in the field of Hindi character recognition.
A Complete OCR for Printed Tamil Text
Proc. Tamil Internet 2000 (TI 2000)
A multi-font, multi-size Optical Character Recognizer (OCR) of Tamil Script is developed. The input image to the system is binary and is assumed to contain only text. The skew angle of the document is estimated using a combination of Hough transform and Principal Component Analysis. A multi-rate-signal-processing based algorithm is devised to achieve distortion-free rotation of the binary image during skew correction. Text segmentation is noise-tolerant. The statistics of the line height and the character gap are used to segment the text lines and the words. The images of the words are subjected to morphological closing followed by connected component-based segmentation to separate out the individual symbols. Each segmented symbol is resized to a pre-fixed size and thinned before it is fed to the classifier. A three-level, tree-structured classifier for Tamil script is designed. The net classification accuracy is 99.0%.
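The abstract above segments text lines using line-height and gap statistics. A much simpler stand-in, not the authors' algorithm, is a horizontal projection profile: count the ink pixels in each row and treat each run of non-empty rows as one text line. The sketch below assumes a deskewed binary image given as a list of rows with 1 for ink and 0 for background; `segment_lines` and `min_height` are hypothetical names introduced for illustration.

```python
from typing import List, Tuple


def segment_lines(image: List[List[int]],
                  min_height: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges of text lines; bottom is exclusive."""
    profile = [sum(row) for row in image]  # ink pixels per row
    lines: List[Tuple[int, int]] = []
    start = None  # first row of the run currently being scanned
    for y, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = y  # a run of inked rows begins
        elif ink == 0 and start is not None:
            if y - start >= min_height:  # drop runs too short to be text
                lines.append((start, y))
            start = None
    # Close a run that reaches the bottom edge of the image.
    if start is not None and len(profile) - start >= min_height:
        lines.append((start, len(profile)))
    return lines
```

Applying the same idea to column sums within each line gives a rough word segmentation. The `min_height` filter is only a crude form of noise tolerance; the noise-tolerant segmentation described in the abstract relies on measured line-height and character-gap statistics instead.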
Review on OCR for Handwritten Indian Scripts Character Recognition
Natural language processing and pattern recognition have been successfully applied to Optical Character Recognition (OCR). Character recognition is an important area in pattern recognition. Character recognition can be printed or handwritten. Handwritten character recognition can be offline or online. Many researchers have worked on handwritten character recognition over the last few years. Compared with non-Indian scripts, research on OCR of handwritten Indian scripts has not reached the same level of maturity. A large number of systems are available for handwritten character recognition of non-Indian scripts, but no complete OCR system is available for recognition of handwritten text in any Indian script in general. A few attempts have been made at the recognition of Devanagari, Bangla, Tamil, Oriya and Gurmukhi handwritten scripts. In this paper, we present a survey of OCR for these most popular Indian scripts.
The comparative analysis of Marathi OCR softwares
Marathi is one of the oldest and richest languages, with a body of classic literature. A lot of reference work has been done in Marathi, and it has a rich tradition of lexicons and encyclopedias, so the major works in this language are worth archiving. Until now this has been done as JPEGs or PDFs, but from a researcher's point of view these have many limitations because they are not searchable or editable. Here OCR (Optical Character Recognition) plays a vital role. OCR for localised languages is a comparatively recent technological development. In this paper the researcher assesses the effectiveness and accuracy of OCR results using various software packages.
Optical Character Recognition for Hindi
2018
Assistant Professor, Department of CSE, Assam down town University, Assam, India. Abstract: Optical Character Recognition is a system that translates images of handwritten or printed text into machine-editable form. Devanagari script is used in many Indian languages such as Hindi, Nepali, Marathi and Sindhi. This script forms the foundation of languages like Hindi, which is an official language of India and the most widely spoken language in the country. At present, there is a huge demand for “storing the information in digital format available in paper documents and then later reusing this information by searching process”. In this paper we propose a new method for recognition of printed Hindi characters in Devanagari script. In this project different pre-processing operations like features extraction, segmentations and classification have bee...
A Complete Tamil Optical Character Recognition System
5th International Workshop, DAS 2002, Proceedings, 2002
Document Image processing and Optical Character Recognition (OCR) have been a frontline research area in the field of human-machine interface for the last few decades. Recognition of Indian language characters has been a topic of interest for quite some time. The earlier contributions were reported in [1] and [2]. A more recent work is reported in [3] and [9]. The need for efficient and robust algorithms and systems for recognition is being felt in India, especially in the post and telegraph department where OCR can assist the staff in sorting mail. Character recognition can also form a part in applications like intelligent scanning machines, text to speech converters, and automatic language-to-language translators.
A SURVEY OF OCR ALGORITHMS FOR TAMIL HANDWRITTEN CHARACTER RECOGNITION
IAEME PUBLICATION, 2014
Recognition of Tamil handwritten characters has been one of the active areas of research in the field of Tamil OCR. Tamil OCR has many potential applications, such as bank cheque processing, automated data entry and digitization of Tamil literature. In this paper, two OCR algorithms are used to recognize Tamil handwritten characters and their recognition rates are compared.
System for OCR of Printed Telugu Text in Complicated Layouts and Backgrounds
2016
Most of the work reported in the literature for Optical Character Recognition (OCR) assumes the background to be clean or white and works for one or two fonts. Real documents can range from simple plain backgrounds to complex, unevenly illuminated backgrounds. OCR of such documents is further complicated by text written in a variety of fonts and sizes. In this paper, an OCR system is proposed for printed Telugu text written in complicated layouts and backgrounds. The proposed system has been tested on a variety of images taken from different newspapers, old books and synthetic images on textured backgrounds. It works on several unknown fonts even in the presence of complicated backgrounds, although the database consists of only four fonts. The recognition accuracies obtained are 98% on average. Thus the proposed approach for Telugu OCR performs well on a larger variety of images than previous attempts, which were more restrictive and domain specific.