An optical character recognition system for printed Telugu text

International Journal of Scientific & Technology Research, 2019

Text in an image file is not editable on a computer. Optical Character Recognition (OCR) is the process of recognizing printed or handwritten text in an image and producing a file in which that text can be edited. OCR for English is well developed. There is now a need for OCR for Indian languages: to preserve historical documents, most of which are written in Indian languages, to organize library books, and to process application forms. OCR for Telugu is difficult because a single character may be one consonant or vowel, or a compound character formed from a combination of vowels and consonants. This paper presents a survey of the methodologies used so far in OCR systems for Telugu.

Optical Character Recognition (OCR) for Telugu: Database, Algorithm and Application

2018 25th IEEE International Conference on Image Processing (ICIP), 2018

Telugu is a Dravidian language spoken by more than 80 million people worldwide. The optical character recognition (OCR) of the Telugu script has wide ranging applications including education, health-care, administration etc. The beautiful Telugu script however is very different from Germanic scripts like English and German. This makes the use of transfer learning of Germanic OCR solutions to Telugu a non-trivial task. To address the challenge of OCR for Telugu, we make three contributions in this work: (i) a database of Telugu characters, (ii) a deep learning based OCR algorithm, and (iii) a client-server solution for the online deployment of the algorithm. For the benefit of the Telugu people and the research community, our code has been made freely available at https://gayamtrishal.github.io/OCR_Telugu.github.io/.

OCR of Printed Telugu Text with High Recognition Accuracies

Lecture Notes in Computer Science, 2006

Telugu is one of the oldest and popular languages of India spoken by more than 66 million people especially in South India. Development of Optical Character Recognition systems for Telugu text is an area of current research. OCR of Indian scripts is much more complicated than the OCR of Roman script because of the use of huge number of combinations of characters and modifiers. Basic Symbols are identified as the unit of recognition in Telugu script. Edge Histograms are used for a feature based recognition scheme for these basic symbols. During recognition, it is observed that, in many cases, the recognizer incorrectly outputs a very similar looking symbol. Special logic and algorithms are developed using simple structural features for improving recognition accuracies considerably without too much additional computational effort. It is shown that recognition accuracies of 98.5 % can be achieved on laser quality prints with such a procedure.
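The abstract above does not say how its edge histograms are computed, so the following is only a minimal sketch of one generic variant: a normalized histogram of gradient orientations over a binary glyph image. The function name `edge_orientation_histogram`, the choice of four bins and the central-difference gradient are all assumptions made for illustration, not the paper's method.

```python
import math

def edge_orientation_histogram(image, bins=4):
    """Normalized histogram of edge orientations in a binary glyph image.

    `image` is a list of equal-length rows of 0/1 values (hypothetical input format).
    Bin k collects gradient directions near k * (180 / bins) degrees, so with the
    default four bins, bin 0 holds horizontal gradients (vertical stroke boundaries)
    and bin 2 holds vertical gradients (horizontal stroke boundaries).
    """
    height, width = len(image), len(image[0])
    counts = [0] * bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Central differences approximate the intensity gradient.
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            if gx == 0 and gy == 0:
                continue  # flat region, no edge here
            # Fold the direction into [0, pi) so opposite gradients share a bin.
            angle = math.atan2(gy, gx) % math.pi
            counts[int(round(angle / (math.pi / bins))) % bins] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * bins

# Two toy glyphs: a vertical bar and its transpose, a horizontal bar.
vertical = [[0, 0, 1, 0, 0] for _ in range(5)]
horizontal = [list(col) for col in zip(*vertical)]
vertical_hist = edge_orientation_histogram(vertical)
horizontal_hist = edge_orientation_histogram(horizontal)
```

In a feature-based recognizer, a vector like this would be compared against stored vectors for each basic symbol. Symbols that look alike tend to produce similar histograms, which is consistent with the confusions the abstract reports and with its use of additional structural checks.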

System for OCR of Printed Telugu Text in Complicated Layouts and Backgrounds

2016

Most of the work reported in the literature for Optical Character recognition (OCR) assumes the background to be clean or white and works for one or two fonts. Real documents can range from simple plain backgrounds to complex uneven illuminated backgrounds. OCR of such documents is further complicated due to text written in a variety of fonts and sizes. In this paper, an OCR system is proposed for OCR of printed Telugu text written in complicated layouts and backgrounds. The proposed system has been tested on a variety of images taken from different newspapers, old books and synthetic images on textured backgrounds. It works on several unknown fonts even in the presence of complicated backgrounds although the database consists of only four fonts. The recognition accuracies obtained are 98% on average. Thus the proposed approach for Telugu OCR performs well on a larger variety of images than the previous attempts which were more restrictive and domain specific.

A Complete Tamil Optical Character Recognition System

5th International Workshop, DAS 2002, Proceedings, 2002

Document Image processing and Optical Character Recognition (OCR) have been a frontline research area in the field of human-machine interface for the last few decades. Recognition of Indian language characters has been a topic of interest for quite some time. The earlier contributions were reported in [1] and [2]. A more recent work is reported in [3] and [9]. The need for efficient and robust algorithms and systems for recognition is being felt in India, especially in the post and telegraph department where OCR can assist the staff in sorting mail. Character recognition can also form a part in applications like intelligent scanning machines, text to speech converters, and automatic language-to-language translators.

Optical Character Recognition for Hindi

2018

Assistant Professor, Department of CSE, Assam down town University, Assam, India

Abstract: Optical Character Recognition is a system that translates images of handwritten or printed text into machine-editable form. Devanagari script is used in many Indian languages such as Hindi, Nepali, Marathi and Sindhi. This script forms the foundation of Hindi, which is the national and most widely spoken language in India. At present there is a huge demand for “storing the information in digital format available in paper documents and then later reusing this information by searching process”. In this paper we propose a new method for recognition of printed Hindi characters in Devanagari script. In this project different pre-processing operations like features extraction, segmentations and classification have bee...

A Complete OCR for Printed Tamil Text

Proc. Tamil Internet 2000 (TI 2000)

A multi-font, multi-size Optical Character Recognizer (OCR) of Tamil Script is developed. The input image to the system is binary and is assumed to contain only text. The skew angle of the document is estimated using a combination of Hough transform and Principal Component Analysis. A multi-rate-signal-processing based algorithm is devised to achieve distortion-free rotation of the binary image during skew correction. Text segmentation is noise-tolerant. The statistics of the line height and the character gap are used to segment the text lines and the words. The images of the words are subjected to morphological closing followed by connected component-based segmentation to separate out the individual symbols. Each segmented symbol is resized to a pre-fixed size and thinned before it is fed to the classifier. A three-level, tree-structured classifier for Tamil script is designed. The net classification accuracy is 99.0%.
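The skew-estimation step described above combines a Hough transform with Principal Component Analysis. As a rough illustration of the PCA half only, the sketch below estimates the dominant orientation of a set of foreground pixel coordinates from their 2x2 covariance matrix. The function name `estimate_skew_degrees` and the input format are assumptions; the abstract does not give the paper's actual procedure, and the Hough step is not shown.

```python
import math

def estimate_skew_degrees(pixels):
    """Estimate the dominant orientation of foreground pixels via PCA.

    `pixels` is an iterable of (x, y) coordinates of text pixels.
    Returns the angle, in degrees, between the x-axis and the principal axis.
    """
    pts = list(pixels)
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two pixels")
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    # Entries of the 2x2 covariance matrix of the coordinates.
    sxx = sum((x - mean_x) ** 2 for x, _ in pts) / n
    syy = sum((y - mean_y) ** 2 for _, y in pts) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts) / n
    # Closed-form orientation of the leading eigenvector of a symmetric 2x2 matrix.
    return math.degrees(0.5 * math.atan2(2.0 * sxy, sxx - syy))

# Synthetic "text line": points on a straight line tilted by 5 degrees.
slope = math.tan(math.radians(5.0))
line = [(x, slope * x) for x in range(200)]
angle = estimate_skew_degrees(line)
```

On a real page, the coordinates would come from the binary image's text pixels, and the image would then be rotated by the negative of the estimated angle before line and word segmentation.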

A comparative Study of Classification Algorithm for Printed Telugu Character Recognition

International Journal of Electronics Communication and Computer Engineering, 2012

Optical character recognition (OCR) refers to a process whereby printed documents are transformed into ASCII files for the purpose of compact storage, editing, fast retrieval, and other file manipulations through the use of a computer. Optical Character Recognition plays a key role in the field of digital image processing, pattern recognition and The motivation for development from then on, was the possible applications within the business world.

A simple and efficient optical character recognition system for basic symbols in printed Kannada text

Sadhana-academy Proceedings in Engineering Sciences, 2007

Optical Character Recognition (OCR) systems have been effectively developed for the recognition of printed characters of non-Indian languages. Efforts are under way to develop efficient OCR systems for Indian languages, especially for Kannada, a popular South Indian language. We present in this paper an OCR system developed for the recognition of basic characters (vowels and consonants) in printed Kannada text, which can handle different font sizes and font types. Hu’s invariant moments and Zernike moments, which have been progressively used in pattern recognition, are used in our system to extract the features of printed Kannada characters. Neural classifiers have been effectively used for the classification of characters based on moment features. An encouraging recognition rate of 96.8% has been obtained. The system methodology can be extended to the recognition of other South Indian languages, especially Telugu.
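To make the moment features mentioned above concrete, here is a minimal sketch that computes the first two of Hu's seven invariants for a binary glyph, using the standard definitions in terms of normalized central moments. The function name `hu_first_two` and the list-of-rows image format are assumptions for illustration. The Zernike moments and the neural classifier from the abstract are not shown.

```python
def hu_first_two(image):
    """Return (phi1, phi2), the first two Hu moment invariants of a binary image.

    `image` is a list of rows; nonzero entries are foreground pixels.
    """
    pts = [(x, y) for y, row in enumerate(image) for x, v in enumerate(row) if v]
    if not pts:
        raise ValueError("image has no foreground pixels")
    m00 = float(len(pts))
    # Centroid; subtracting it makes the moments translation invariant.
    cx = sum(x for x, _ in pts) / m00
    cy = sum(y for _, y in pts) / m00

    def eta(p, q):
        # Central moment mu_pq, normalized by m00^(1 + (p + q) / 2).
        mu = sum((x - cx) ** p * (y - cy) ** q for x, y in pts)
        return mu / m00 ** (1 + (p + q) / 2)

    e20, e02, e11 = eta(2, 0), eta(0, 2), eta(1, 1)
    return e20 + e02, (e20 - e02) ** 2 + 4 * e11 ** 2

# A small L-shaped glyph, a translated copy, and a copy rotated by 90 degrees.
glyph = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
]
shifted = [[0] * 5] + [[0, 0] + row for row in glyph]
rotated = [list(r) for r in zip(*glyph[::-1])]
phi = hu_first_two(glyph)
phi_shifted = hu_first_two(shifted)
phi_rotated = hu_first_two(rotated)
```

The three results agree up to floating-point error, which is the property that makes such moments useful as features: the same character gives the same values wherever it sits in its bounding box and, for these two invariants, under rotation.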

Recognition of Hindi Character Using OCR-Technology: A Review

International Journal of Advanced Trends in Computer Science and Engineering, 2023

Character recognition is a technique that transforms various kinds of scanned papers into an editable, readable, and searchable format. Over the last two decades, researchers and technologists have worked continuously in this field to improve accuracy. Character recognition is classified into printed, handwritten, and characters written at image recognition. It is a major area of research in the field of pattern recognition. This paper presents an overview of Hindi character recognition using the optical character recognition (OCR) technique. We surveyed some major research breakthroughs in character recognition, especially for Hindi characters. This article aims to provide deeper insight for researchers and technologists working in the field of Hindi character recognition.
