A Complete Tamil Optical Character Recognition System

An optical character recognition system for printed Telugu text

Pattern Analysis and Applications, 2004

Telugu is one of the oldest and most popular languages of India, spoken by more than 66 million people, especially in South India. Not much work has been reported on the development of optical character recognition (OCR) systems for Telugu text, so it remains an area of current research. Some characters in Telugu are made up of more than one connected symbol. Compound characters are written by associating modifiers with consonants, resulting in a huge number of possible combinations, running into hundreds of thousands. A compound character may contain one or more connected symbols. Therefore, systems developed for documents in other scripts, such as Roman, cannot be used directly for Telugu. The individual connected portions of a character or a compound character are defined as basic symbols in this paper and treated as the unit of recognition. The algorithms designed exploit special characteristics of Telugu script to process document images efficiently. The algorithms have been implemented to create a Telugu OCR system for printed text (TOSP). The output of TOSP is in phonetic English that can be transliterated to generate editable Telugu text. A special feature of TOSP is that it is designed to handle a large variety of sizes and multiple fonts while still providing a raw OCR accuracy of nearly 98%. The phonetic English representation can also be used to develop a Telugu text-to-speech system; work is in progress in this regard.
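The idea of treating each connected portion of a character as a basic symbol can be illustrated with connected-component labelling on a binary image. The following is a minimal sketch and not the TOSP implementation: the function name `connected_components`, the list-of-lists image format, and the choice of 8-connectivity are all assumptions made for illustration.

```python
from collections import deque


def connected_components(image):
    """Group foreground pixels (truthy values) into 8-connected components.

    `image` is a list of rows of 0/1 values. Returns a list of components,
    each a list of (row, col) pixel coordinates. Hypothetical helper.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if not image[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            component = []
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(component)
    return components


# Toy image with two separate blobs, standing in for two basic symbols.
toy = [
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
symbols = connected_components(toy)
```

Each returned component could then be cropped and passed to a classifier as one unit of recognition; how TOSP actually extracts and orders its basic symbols is not specified in the abstract.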

Text Extraction from Tamil and Hindi Document Images using Open Source Optical Character Recognition tools

Optical Character Recognition (OCR) is a technique used to extract text from document images and convert it into an editable text format. This kind of information retrieval is called recognition-based retrieval; the extracted text can then be edited, searched, and stored more efficiently. OCR is used in many applications, such as libraries, organizations, bank cheques, number plate recognition, historical book analysis, and many others. Various OCR tools are available for converting document images in different languages. The primary objective of this work is to compare the performance of three different OCR tools for extracting text from Tamil and Hindi document images.

Optical Character Recognition for Hindi

2018

Assistant Professor, Department of CSE, Assam down town University, Assam, India

Abstract: Optical Character Recognition is a system that translates images of handwritten or printed text into machine-editable form. Devanagari script is used in many Indian languages, such as Hindi, Nepali, Marathi, and Sindhi. This script forms the foundation of languages like Hindi, which is the national and most widely spoken language in India. In the current scenario, there is a huge demand for “storing the information in digital format available in paper documents and then later reusing this information by searching process”. In this paper we propose a new method for recognition of printed Hindi characters in Devanagari script. In this project different pre-processing operations like features extraction, segmentations and classification have bee...

A Complete OCR for Printed Tamil Text

Proc. Tamil Internet 2000 (TI 2000)

A multi-font, multi-size Optical Character Recognizer (OCR) of Tamil Script is developed. The input image to the system is binary and is assumed to contain only text. The skew angle of the document is estimated using a combination of Hough transform and Principal Component Analysis. A multi-rate-signal-processing based algorithm is devised to achieve distortion-free rotation of the binary image during skew correction. Text segmentation is noise-tolerant. The statistics of the line height and the character gap are used to segment the text lines and the words. The images of the words are subjected to morphological closing followed by connected component-based segmentation to separate out the individual symbols. Each segmented symbol is resized to a pre-fixed size and thinned before it is fed to the classifier. A three-level, tree-structured classifier for Tamil script is designed. The net classification accuracy is 99.0%.
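The PCA part of the skew estimation step can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the paper combines the Hough transform with PCA, whereas this sketch uses PCA alone, and the function name `pca_skew_angle` and the (x, y) pixel-list input are assumptions.

```python
import math


def pca_skew_angle(pixels):
    """Estimate the dominant orientation, in degrees, of foreground pixels.

    `pixels` is a sequence of (x, y) coordinates. The angle is that of the
    leading eigenvector of the 2x2 covariance matrix, measured from the
    x-axis. Hypothetical helper for illustration.
    """
    n = len(pixels)
    if n < 2:
        raise ValueError("need at least two foreground pixels")
    mean_x = sum(x for x, _ in pixels) / n
    mean_y = sum(y for _, y in pixels) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pixels) / n
    syy = sum((y - mean_y) ** 2 for _, y in pixels) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pixels) / n
    # Closed form for the principal-axis angle of a 2x2 symmetric matrix.
    return math.degrees(0.5 * math.atan2(2.0 * sxy, sxx - syy))


# Synthetic "text line": points on a straight line tilted by 5 degrees.
slope = math.tan(math.radians(5.0))
line = [(x, x * slope) for x in range(200)]
angle = pca_skew_angle(line)
```

On a real page the foreground pixels form many text lines rather than one, so a PCA estimate alone can be biased by page layout; that may be one reason to pair it with a Hough-based estimate, although the abstract does not say how the two are combined.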

OCR in Indian Languages

Optical Character Recognition (OCR) is the electronic translation of images of handwritten, typewritten, or printed text into machine-encoded text. OCR is a very important task in Pattern Recognition. Character recognition for foreign languages, especially English, has been extensively studied by many researchers, but due to the complexity of Indian languages such as Hindi, Punjabi, Telugu, and Malayalam, the research work on them is very limited and constrained. This paper presents the research work related to all Indian languages; various approaches to character recognition, along with some applications of character recognition, are also discussed. The aim of this paper is to provide an overview of the research going on in Indian script OCR systems. This survey was felt necessary because research on OCRs for Indian scripts is still a challenging task. Hence, a brief introduction to the general OCR and typical steps in the development of an OCR are give...

Embedded Optical Character Recognition On Tamil Text Image

2015

Optical character recognition is used to digitize and reproduce texts that were produced with non-computerized systems. Digitizing texts also helps reduce storage space. Editing and reprinting text documents that were printed on paper is time-consuming and labour-intensive. Optical character recognition is also useful for visually impaired people who cannot read text documents but need access to their content. This paper presents the methodology of a camera-based assistive device that people can use to read Tamil text documents. The framework implements an image capturing technique in an embedded system based on a Raspberry Pi board.

An Efficient Multi Lingual Optical Character Recognition System for Indian Languages Through Use of Bharati Script

2018

Optical character recognition plays a critical part in interpreting videos and documents. Document-specific issues such as low image quality, distortions, composite backgrounds, and noise, and language-specific issues such as cursive connectivity among the characters, make OCR challenging and error-prone for Indian languages. The language-specific challenges can be overcome by computing script-based features, which can achieve better accuracy. However, computing script-based invariant features and patterns is computationally complex and error-prone. Against this background, we put forward a Bharati script (www.bharatiscript.com) based OCR system in which the inherent drawbacks of Indian scripts, e.g. Hindi, Tamil, and Telugu, are eliminated. The proposed OCR model has been tested on a synthetic dataset of documents in Bharati script (in which Hindi scripts are converted to Bharati script). Thorough experimental analysis with varied levels of noise confirms the promising results of character r...

Recognition of Hindi Character Using OCR-Technology: A Review

International Journal of Advanced Trends in Computer Science and Engineering, 2023

Character recognition is a technique that enables the transformation of various kinds of scanned papers into an editable, readable, and searchable format. Over the last two decades, researchers and technologists have been working continuously in this field to improve accuracy. Character recognition is classified into recognition of printed characters, handwritten characters, and characters written in images. It is a major area of research in the field of pattern recognition. This paper presents an overview of Hindi character recognition using the optical character recognition (OCR) technique. We surveyed some major research breakthroughs in character recognition, especially for Hindi characters. This article aims to provide deeper insight for researchers and technologists working in the field of Hindi character recognition.

A simple and efficient optical character recognition system for basic symbols in printed Kannada text

Sadhana-academy Proceedings in Engineering Sciences, 2007

Optical Character Recognition (OCR) systems have been effectively developed for the recognition of printed characters of non-Indian languages. Efforts are under way to develop efficient OCR systems for Indian languages, especially for Kannada, a popular South Indian language. We present in this paper an OCR system developed for the recognition of basic characters (vowels and consonants) in printed Kannada text, which can handle different font sizes and font types. Hu’s invariant moments and Zernike moments, which have been increasingly used in pattern recognition, are used in our system to extract the features of printed Kannada characters. Neural classifiers have been effectively used for the classification of characters based on moment features. An encouraging recognition rate of 96.8% has been obtained. The system methodology can be extended to the recognition of other South Indian languages, especially Telugu.
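To make the moment-based features more concrete, the sketch below computes the first two of Hu's seven invariant moments from a binary glyph image. It is an illustration under assumptions, not the paper's feature extractor: the function name `hu_first_two` and the list-of-lists image format are hypothetical, and the remaining Hu invariants and the Zernike moments are omitted.

```python
def hu_first_two(image):
    """Return (phi1, phi2), the first two Hu invariant moments.

    `image` is a list of rows of 0/1 values. The moments are built from
    normalized central moments, so they do not change when the glyph is
    translated or rotated. Hypothetical helper for illustration.
    """
    pixels = [(x, y)
              for y, row in enumerate(image)
              for x, value in enumerate(row) if value]
    m00 = len(pixels)
    if m00 == 0:
        raise ValueError("image has no foreground pixels")
    cx = sum(x for x, _ in pixels) / m00
    cy = sum(y for _, y in pixels) / m00

    def eta(p, q):
        # Central moment mu_pq, normalized by m00 ** (1 + (p + q) / 2).
        mu = sum((x - cx) ** p * (y - cy) ** q for x, y in pixels)
        return mu / m00 ** (1 + (p + q) / 2)

    e20, e02, e11 = eta(2, 0), eta(0, 2), eta(1, 1)
    phi1 = e20 + e02
    phi2 = (e20 - e02) ** 2 + 4 * e11 ** 2
    return phi1, phi2


# An L-shaped toy glyph and the same glyph rotated by 90 degrees.
glyph = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
]
rotated = [list(row) for row in zip(*glyph[::-1])]
features = hu_first_two(glyph)
features_rotated = hu_first_two(rotated)
```

Feature vectors of this kind would then be fed to a classifier; the abstract mentions neural classifiers but does not give their architecture, so none is sketched here.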

A performance comparison and post-processing error correction technique to OCRs for printed Tamil texts

2014 9th International Conference on Industrial and Information Systems (ICIIS), 2014

Optical Character Recognition (OCR) deals with the automated recognition of characters in digital images. OCR refers to the process by which scanned images are electronically processed and converted to an editable document. Handwritten and printed texts are the primary research areas of OCR. Many OCR systems are commercially available for English and Arabic characters, but there is still no recognition system that yields a high recognition rate even when the scanned images are of high quality. The general framework of a Tamil OCR in the literature involves preprocessing, line segmentation, word segmentation, character segmentation, feature extraction, and recognition of characters. OCR for printed Tamil documents poses challenges owing to several factors: a single line may contain different font styles, and documents may include pictures, multiple columns, touching adjacent characters, broken characters, low print quality, and complex layouts. Furthermore, whereas English has 26 letters, Tamil has 247 characters, which makes recognition more difficult. A few freely available OCRs for Tamil offer a moderate recognition rate, but performance comparisons of such OCRs on a benchmark dataset are not available. In this paper we compare OCRs for printed Tamil texts on four different types of documents: books, magazines, newspapers, and pamphlets. Furthermore, we propose a post-processing error correction technique for the tested OCRs that reduces the overall mean error rate by nearly 10% across those four categories.
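One common way to quantify an OCR error rate when comparing tools is the character error rate, i.e. the edit distance between the OCR output and the ground truth divided by the ground-truth length. The abstract does not state how its error rate is defined, so the sketch below is an assumed metric, and the function names `edit_distance` and `character_error_rate` are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_item != hyp_item),  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Toy example: one wrong character out of four.
rate = character_error_rate("abcd", "abed")
```

Python strings iterate over Unicode code points, so a Tamil consonant followed by a vowel sign counts as two units here. Scoring by grapheme cluster instead would require segmenting both strings first and passing the resulting lists to `edit_distance`.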