Optical Character Recognition for printed Tamil text using Unicode

OCR Related Technology Methods

International Journal of Advanced Trends in Computer Science and Engineering, 2020

Character recognition has emerged as a vital technology in the era of the fourth industrial revolution and is developing into a core technology needed in various fields. It is performed by extracting characters from an image and then recognizing the extracted characters. The technology has been developed continuously and, with the progress of the fourth industrial revolution, is now used as a core technology in many places. This paper introduces the technology associated with character recognition and the programs used for character recognition.

Machine recognition of printed Odiya text

M S Thesis, 2001

Automatic recognition of characters by a machine is one of the challenging problems in Artificial Intelligence. The motivation for the design of such a machine comes from the human visual system (HVS). The HVS is endowed with astonishing versatility and constitutes the ultimate physical (albeit neural) realization of a pattern recognition system whose performance is not affected by geometric transformations of patterns, such as characters of various styles and sizes. The prime goal of the design of such a machine is to replace the HVS in practical applications involving repetitive, monotonous tasks such as mass digitization of printed manuscripts and the processing of letters and mail in postal services, job applications and banking papers. Most research endeavors and commercial software packages focus on the Roman script. In the case of Indian scripts, the problem of automatic recognition is still a topic of considerable interest. In this thesis, an attempt to develop an integrated Optical Character Recognition (OCR) system for printed Odiya script is presented. The task of automatic recognition of documents has the following major subtasks: digitization, preprocessing, segmentation, feature extraction and classification. In this thesis, a novel binarization technique based on windows of variable width is developed and implemented. The width of the window is selected based upon the local statistics of the image. Skew in the document is detected with the help of a two-level precise skew detection algorithm, employing the Hough transform and statistical properties of the image. The task of segmenting individual lines from the text is accomplished employing horizontal projection vectors, while that of separating words from lines is done with the help of vertical projection vectors. The segmented words are then subjected to connected component analysis to obtain the basic characters and associated matras.
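The projection-based line and word segmentation described above can be illustrated with a small sketch. This is not the thesis's implementation: it assumes the page is already binarized and deskewed, represented as a list of rows with 1 for ink and 0 for background, and the names `projection`, `runs`, `segment_lines`, `segment_words` and the `min_gap` parameter are hypothetical.

```python
def projection(binary, axis):
    """Count ink pixels per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) for row in binary]
    return [sum(col) for col in zip(*binary)]


def runs(profile, min_gap=1):
    """Return (start, end) pairs, end exclusive, of non-zero runs in a profile.

    Runs separated by fewer than min_gap empty positions are merged, so that
    small gaps between characters are not mistaken for gaps between words.
    """
    spans, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))

    merged = []
    for s, e in spans:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


def segment_lines(binary):
    """Text lines are runs of non-empty rows in the horizontal projection."""
    return runs(projection(binary, axis=0))


def segment_words(line_rows, min_gap=2):
    """Words are runs of non-empty columns in the vertical projection of a line."""
    return runs(projection(line_rows, axis=1), min_gap=min_gap)


page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
]
lines = segment_lines(page)                  # [(1, 3), (4, 5)]
top, bottom = lines[0]
words = segment_words(page[top:bottom])      # [(0, 3), (5, 7)]
```

In practice the gap threshold would be derived from the document itself, for example from the statistics of the observed gaps, rather than fixed by hand.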
Identifying and extracting the right features with minimal error is one of the most important tasks in automatic recognition of documents. The ability of various types of features to discriminate Odiya characters is analyzed, and the features that exhibit better discriminating capabilities are chosen for use in the recognition phase. Of the tested features, it was found that the projection profiles of the characters yielded better discrimination. Apart from these features, some heuristic-based features are also employed in the final classification phase. An important requirement of pattern classifiers is their robustness to noise in the input patterns. In an attempt to design a robust classifier, various classification techniques reported in the literature are tried. These include the nearest neighbor, k-NN and modified k-NN classifiers. Apart from these classical pattern classification techniques, modern techniques involving Support Vector Machines (SVMs) are also employed.
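As a rough illustration of classifying characters from projection-profile features with k-NN, consider the sketch below. It assumes size-normalized binary glyphs, squared Euclidean distance and simple majority voting; the thesis's actual feature set, its heuristic features, the modified k-NN variant and the SVM classifier are not reproduced here, and the names `profile_features` and `knn_classify` are hypothetical.

```python
from collections import Counter


def profile_features(glyph):
    """Concatenate the row and column ink counts of a binary glyph."""
    rows = [sum(r) for r in glyph]
    cols = [sum(c) for c in zip(*glyph)]
    return rows + cols


def knn_classify(sample, training, k=3):
    """Label a feature vector by majority vote among its k nearest neighbours.

    training is a list of (feature_vector, label) pairs; distance is squared
    Euclidean. Ties in the vote go to the label of the nearer neighbour.
    """
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(sample, feats)), label)
        for feats, label in training
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy 3 x 3 glyphs standing in for real character images.
vbar = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
hbar = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]
training = [(profile_features(vbar), "vbar"), (profile_features(hbar), "hbar")]

noisy = [[0, 1, 0], [0, 1, 0], [0, 1, 1]]   # vertical bar with one stray pixel
label = knn_classify(profile_features(noisy), training, k=1)   # "vbar"
```

The stray pixel changes only two feature values by one each, which is why the noisy glyph stays much closer to the vertical-bar prototype than to the horizontal one.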

Optical Character Recognition (OCR) System

There is growing demand for software systems that recognize characters when information is scanned from paper documents, since large numbers of newspapers and books on different subjects exist only in printed form. In particular, there is great demand for storing the information in these paper documents on computer storage and later reusing it through search. One simple way to store this information is to scan the documents and store them as images. Reusing the information is then difficult, however, because the contents cannot easily be read and searched line by line and word by word. The reason is that the font characteristics of the characters in paper documents differ from those of the fonts in the computer system, so the computer is unable to recognize the characters while reading them. Storing the contents of paper documents in computer storage and then reading and searching that content is called document processing. Sometimes document processing must handle languages other than English. This requires a software system called a character recognition system. The process is also called Document Image Analysis (DIA).

The mechanical or electronic conversion of printed or handwritten text into machine-encoded text is known as optical character recognition (OCR). OCR is the art of recognizing characters by computer from optically scanned...

2019

Published by: Blue Eyes Intelligence Engineering & Sciences Publication. Retrieval Number: F8614088619/2019©BEIESP. DOI: 10.35940/ijeat.F8614.088619

Abstract: Optical character recognition (OCR) is a technique for recognizing characters from optically scanned and digitized pages. OCR plays an important role in Indian script research. The official language of the state of Odisha is Odia. OCR faces great difficulty in recognizing the Odia language because of its similarly shaped characters, their complex nature, the complicated way in which they combine to form compound characters, the use of matras, and so on. Each character and numeral is passed through several modules, such as binarization, noise removal, segmentation, line segmentation, word segmentation, skeletonization, deskewing, thinning and thickening. The input image is normalized to a 50 x 50 two-dimensional image. A hidden Markov model (HMM) is a stochastic process that has been used in various applications, for example speech recognition, handwriting recognition, Gesture rec...
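The abstract states that each input image is brought to a fixed 50 x 50 size but does not say how. One simple possibility, shown here purely as an assumption, is nearest-neighbour sampling of a binary image stored as a list of rows; the function name `resize_nearest` is hypothetical.

```python
def resize_nearest(img, size=50):
    """Resample a 2D image to size x size by nearest-neighbour lookup.

    Each output pixel (y, x) copies the source pixel whose row and column
    index is the output index scaled by the source/target size ratio.
    """
    h, w = len(img), len(img[0])
    return [
        [img[y * h // size][x * w // size] for x in range(size)]
        for y in range(size)
    ]


glyph = [[1, 0],
         [0, 1]]
normalized = resize_nearest(glyph)        # 50 rows of 50 pixels each
small = resize_nearest(glyph, size=4)
# small == [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
```

Nearest-neighbour sampling keeps the image strictly binary, which suits later steps such as thinning; an interpolating resize would need a re-thresholding step afterwards.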

SEVERAL METHODS OF FEATURE EXTRACTION TO HELP IN OPTICAL CHARACTER RECOGNITION

International Journal of Students’ Research in Technology & Management, 2017

An Optical Character Recognition (OCR) system consists of three broad steps: preprocessing, feature extraction and classification. Feature extraction methods yield the feature vectors on which the classification of a test pattern is based. The paper proposes some methods of feature extraction that may go a long way towards recognizing a Bengali numeral or character. The Pixel Ex-OR method uses a digital gating (Ex-OR) technique to extract the information in an image: two successive elements of a row in the image matrix are Ex-ORed, and the output is again Ex-ORed with the next element. Alphabetical coding encodes a binary character image by means of letters of the English alphabet. Directional features obtain gradient information using Sobel masks to make the position of strokes in an image clear. The features are derived in eight standard directions, and these eight feature vectors are then merged into four sets of features to reduce system complexity, which saves considerable processing time. These features will help in developing a Bengali numeral recognition system.
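The chained Ex-OR operation described above can be sketched as a running XOR along each row. The abstract does not say whether the feature is the final value of the chain or the whole sequence of intermediate outputs, so the sketch below exposes both; the names `row_xor_chain` and `xor_features` are hypothetical, and rows are assumed to be non-empty lists of 0/1 pixels.

```python
from itertools import accumulate
from operator import xor


def row_xor_chain(row):
    """Running XOR along a row.

    Entry 0 is the first pixel itself; entry i (i >= 1) is the previous
    output XORed with pixel i, as in the chained gating described above.
    """
    return list(accumulate(row, xor))


def xor_features(image):
    """One value per row: the final output of that row's XOR chain.

    This equals the parity of the number of ink pixels in the row.
    """
    return [row_xor_chain(row)[-1] for row in image]


image = [
    [1, 0, 1, 1],
    [1, 1, 0, 0],
]
chain = row_xor_chain(image[0])    # [1, 1, 0, 1]
features = xor_features(image)     # [1, 0]
```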

OCR in Indian Languages

Optical Character Recognition, or OCR, is the electronic conversion of images of handwritten, typewritten or printed text into machine-encoded text. OCR is a very important task in pattern recognition. Character recognition for foreign languages, especially English, has been studied extensively by many researchers, but because of the complexity of Indian languages such as Hindi, Punjabi, Telugu and Malayalam, research on them is very limited and constrained. This paper presents the research work related to all Indian languages; various approaches to character recognition, along with some applications of character recognition, are also discussed. The aim of this paper is to provide an overview of the research going on in Indian script OCR systems. This survey was felt necessary because research on OCR for Indian scripts is still a challenging task. Hence, a brief introduction to the general OCR and typical steps in the development of an OCR are give...

A Complete OCR for Printed Tamil Text

Proc. Tamil Internet 2000 (TI 2000)

A multi-font, multi-size Optical Character Recognizer (OCR) of Tamil Script is developed. The input image to the system is binary and is assumed to contain only text. The skew angle of the document is estimated using a combination of Hough transform and Principal Component Analysis. A multi-rate-signal-processing based algorithm is devised to achieve distortion-free rotation of the binary image during skew correction. Text segmentation is noise-tolerant. The statistics of the line height and the character gap are used to segment the text lines and the words. The images of the words are subjected to morphological closing followed by connected component-based segmentation to separate out the individual symbols. Each segmented symbol is resized to a pre-fixed size and thinned before it is fed to the classifier. A three-level, tree-structured classifier for Tamil script is designed. The net classification accuracy is 99.0%.
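The symbol-separation step above (morphological closing followed by connected-component segmentation) can be sketched as follows. This is an illustration under assumptions, not the paper's algorithm: it uses a 3 x 3 structuring element, 8-connectivity, treats pixels outside the image as ink during erosion so that borders are not eaten away, and the names `dilate`, `erode`, `close_image` and `connected_components` are hypothetical.

```python
from collections import deque

OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def _neighbours(img, y, x):
    """Values of the in-bounds pixels in the 3 x 3 window centred on (y, x)."""
    h, w = len(img), len(img[0])
    return [img[y + dy][x + dx] for dy, dx in OFFSETS
            if 0 <= y + dy < h and 0 <= x + dx < w]


def dilate(img):
    """A pixel becomes ink if any pixel in its 3 x 3 window is ink."""
    return [[int(any(_neighbours(img, y, x))) for x in range(len(img[0]))]
            for y in range(len(img))]


def erode(img):
    """A pixel stays ink only if every in-bounds window pixel is ink."""
    return [[int(all(_neighbours(img, y, x))) for x in range(len(img[0]))]
            for y in range(len(img))]


def close_image(img):
    """Morphological closing: dilation followed by erosion."""
    return erode(dilate(img))


def connected_components(img):
    """Bounding boxes (top, left, bottom, right), inclusive, of 8-connected blobs."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not img[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:  # breadth-first flood fill of one blob
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy, dx in OFFSETS:
                    ny, nx = cy + dy, cx + dx
                    if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Two strokes one pixel apart (a broken symbol) and a separate stroke further right.
word = [
    [1, 0, 1, 0, 0, 0, 1],
    [1, 0, 1, 0, 0, 0, 1],
    [1, 0, 1, 0, 0, 0, 1],
]
raw_boxes = connected_components(word)                 # three components
symbols = connected_components(close_image(word))      # [(0, 0, 2, 2), (0, 6, 2, 6)]
```

Closing bridges the one-pixel break so the first two strokes are reported as a single symbol, while the wider gap before the last stroke is preserved.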

An optical character recognition system for printed Telugu text

Pattern Analysis and Applications, 2004

Telugu is one of the oldest and popular languages of India, spoken by more than 66 million people, especially in South India. Not much work has been reported on the development of optical character recognition (OCR) systems for Telugu text. Therefore, it is an area of current research. Some characters in Telugu are made up of more than one connected symbol. Compound characters are written by associating modifiers with consonants, resulting in a huge number of possible combinations, running into hundreds of thousands. A compound character may contain one or more connected symbols. Therefore, systems developed for documents of other scripts, like Roman, cannot be used directly for the Telugu language. The individual connected portions of a character or a compound character are defined as basic symbols in this paper and treated as a unit of recognition. The algorithms designed exploit special characteristics of Telugu script for processing the document images efficiently. The algorithms have been implemented to create a Telugu OCR system for printed text (TOSP). The output of TOSP is in phonetic English that can be transliterated to generate editable Telugu text. A special feature of TOSP is that it is designed to handle a large variety of sizes and multiple fonts, and still provides raw OCR accuracy of nearly 98%. The phonetic English representation can be also used to develop a Telugu text-to-speech system; work is in progress in this regard.

Character Segmentation Scheme for OCR System

International Journal of Computer Vision and Image Processing, 2011

Automatic Optical Character Recognizers (OCR) for machine-printed text are highly desirable for a multitude of modern IT applications, including digital library software. However, state-of-the-art OCR systems cannot handle Myanmar script, as the language poses many challenges for document understanding. Therefore, the authors design an Optical Character Recognition System for Myanmar Printed Document (OCRMPD), with several proposed techniques that can automatically recognize Myanmar printed text from document images. To obtain a more accurate system, the authors propose a method for isolating character images that uses not only projection methods but also structural analysis for wrongly segmented characters. To demonstrate the effectiveness of the segmentation technique, the authors apply a new hybrid feature extraction method and choose an SVM classifier for recognition of the character images. The proposed algorithms have been tested on a variety of Myanmar printed...