A Survey on Character Recognitions in Tamil Scripts using OCR

Tamil Gnani - an OCR on Windows

Proc. Tamil Internet 2001 (TI 2001)

A complete working model of an Optical Character Recognizer for Tamil script has been developed. The system works in a multi-font and multi-size scenario. The input to the system is a scanned or digitized document and the output is in TAM code. The basic techniques were presented at Tamilnet 2000 [1]. We have now added recognition of other symbols, such as punctuation marks and numerals. Further, the entire scheme, from scanning to obtaining the TAM codes, has been implemented on the Visual C++ platform. The product is designed to run on Windows 95 and 98. The current overall recognition rate is around 98%.

An Overview of OCR Research in Indian Scripts

This paper gives an overview of the ongoing research in optical character recognition (OCR) systems for Indian language scripts. A survey was felt necessary because work on developing OCRs for Indian scripts is very promising but still at an emerging stage. The aim of this paper is to provide a starting point for researchers entering this field. Peculiarities of Indian scripts, the present status of OCRs for Indian scripts, the techniques used in them, recognition accuracies, and the resources available are discussed in detail. The examples given in this paper are based on the authors' work on developing a character recognition system for Telugu, a south Indian language.

A SURVEY OF OCR ALGORITHMS FOR TAMIL HANDWRITTEN CHARACTER RECOGNITION

IAEME PUBLICATION, 2014

Recognition of Tamil handwritten characters has been one of the active areas of research in the field of Tamil OCR. Tamil OCR has many potential applications, such as bank cheque processing, automated data entry, and digitization of Tamil literature. In this paper, two OCR algorithms used to recognize Tamil handwritten characters are compared in terms of their recognition rates.

A Complete OCR for Printed Tamil Text

Proc. Tamil Internet 2000 (TI 2000)

A multi-font, multi-size Optical Character Recognizer (OCR) of Tamil Script is developed. The input image to the system is binary and is assumed to contain only text. The skew angle of the document is estimated using a combination of Hough transform and Principal Component Analysis. A multi-rate-signal-processing based algorithm is devised to achieve distortion-free rotation of the binary image during skew correction. Text segmentation is noise-tolerant. The statistics of the line height and the character gap are used to segment the text lines and the words. The images of the words are subjected to morphological closing followed by connected component-based segmentation to separate out the individual symbols. Each segmented symbol is resized to a pre-fixed size and thinned before it is fed to the classifier. A three-level, tree-structured classifier for Tamil script is designed. The net classification accuracy is 99.0%.
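The connected component-based segmentation step mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the word image is a small binary grid (1 = ink, 0 = background), uses 8-connectivity, and omits the morphological closing, resizing and thinning stages. The function name `connected_components` and the sample grid are hypothetical.

```python
from collections import deque


def connected_components(image):
    """Return bounding boxes (top, left, bottom, right) of 8-connected
    foreground blobs, ordered left to right (illustrative sketch only)."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    # Order candidate symbols by their left edge (reading order in a word).
    boxes.sort(key=lambda box: box[1])
    return boxes


# Hypothetical 3x7 word image containing three separate blobs.
word = [
    [1, 1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 0, 1],
    [1, 1, 0, 1, 1, 0, 1],
]
symbols = connected_components(word)
print(symbols)
```

Each bounding box would then be cropped and passed on to the later stages; in the system described above, closing is applied beforehand, presumably so that strokes belonging to one symbol merge into a single component.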

A Survey on Malayalam OCR modules

2016

People start learning to read and write during the early stages of education. As the years pass, they acquire good reading and writing skills, and reading any kind of printed or handwritten character is rarely difficult for them. Computers, however, may have difficulty deciphering printed characters in different fonts and styles, as well as handwritten characters. Malayalam OCR is a complex task owing to the various character scripts available and, more importantly, the differences in the ways the characters are written. The dimensions are never the same and, unlike English characters, may never map onto a square grid. This survey paper describes different Malayalam OCR modules and their techniques for identifying and recognizing old Malayalam script and converting it to the new Malayalam script.

OCR in Indian scripts: A survey

2005

India is a multilingual country. A significantly large number of scripts are used to represent these languages. A desire of vision researchers is to develop an integrated optical character recognition (OCR) system that will be able to process all such scripts. Such a development, if realized, will not only enable faster flow of information across the country, but also have a profound effect on its scientific and economic development. Courageous endeavours have been successfully made towards the development of systems capable of recognizing machine-printed or handwritten characters and/or numerals. However, most Indian scripts do not have an integrated OCR system. Further, the development of a unified system capable of processing all Indian scripts is still a dream. This article presents a survey of the current literature on the development of OCR systems for Indian scripts. Reviewing the basis of and the motivation towards the development of OCR systems, the article analyzes the various methodologies employed in general-purpose pattern recognition systems. A critical analysis of the work towards OCR systems in Indian languages, with pointers towards possible future work, is also presented.

A Survey On OCR For Telugu Language

International Journal of Scientific & Technology Research, 2019

Text in an image file is not in an editable format on a computer. Optical Character Recognition (OCR) is the process of understanding the text in an image, either printed or handwritten, and creating a file containing that text that can be edited on the computer. OCR for the English language is well developed. At present there is a need for OCR for Indian languages in order to preserve historical documents, which are written mostly in Indian languages, to organize books in libraries, and to process application forms. OCR for the Telugu language is difficult because a single character may be formed by a consonant or a single vowel, or by a combination of vowels and consonants that forms a compound character. This paper presents a survey of the methodologies used so far in OCR systems for the Telugu language.