A Novel Approach to OCR using Image Recognition based Classification for Ancient Tamil Inscriptions in Temples

Recognition of Ancient Tamil Characters from Epigraphical inscriptions using Raspberry Pi based Tesseract OCR

International Journal of Scientific Research in Computer Science, Engineering and Information Technology, 2021

Optical Character Recognition (OCR) is the process of identifying printed text using photoelectric devices and computer software. Here it converts text inscribed on stone into a machine-encoded format. OCR is widely used in machine learning applications such as cognitive computing, machine translation, text-to-speech conversion and text mining, and in research fields such as character recognition, artificial intelligence and computer vision. In this research, the inscribed characters are processed on a Raspberry Pi device, which recognizes them using an Artificial Neural Network. The work focuses on recognizing ancient Tamil characters inscribed on stones, belonging to the 9th and 12th centuries, and mapping them to modern Tamil characters. The input image is converted to grayscale and enhanced using adaptive thresholding. The output image is then thinned, which reduces the character strokes to a minimal pixel width. The characters are classified using an Artificial Neural Network architecture, and the classified characters are mapped to modern Tamil characters using Unicode. The Artificial Neural Network has an input layer, a hidden layer of 15 neurons and an output layer of 1 neuron to classify the characters. The accuracy of the constructed system for the recognition of epigraphical inscriptions is calculated. The above process is carried out in the Raspbian environment using Python.
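The grayscale conversion and adaptive thresholding steps named in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `to_grayscale` and `adaptive_threshold`, the BT.601 luminance weights, the mean-based local threshold, and the `window` and `offset` parameters are all assumptions chosen for the example.

```python
def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) tuples to luminance values.

    Uses the ITU-R BT.601 weights; the abstract does not say which
    conversion the authors applied, so this is an assumption.
    """
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


def adaptive_threshold(gray, window=3, offset=0.0):
    """Mark a pixel as foreground (1) when it is darker than the mean of
    its local window minus `offset`; otherwise background (0)."""
    height, width = len(gray), len(gray[0])
    half = window // 2
    binary = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the neighbourhood, clipped at the image border.
            neighbours = [
                gray[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            local_mean = sum(neighbours) / len(neighbours)
            binary[y][x] = 1 if gray[y][x] < local_mean - offset else 0
    return binary


# Hypothetical input: a dark vertical stroke on a bright background.
stroke = [[200, 50, 200],
          [200, 50, 200],
          [200, 50, 200]]
mask = adaptive_threshold(stroke)
```

Because the threshold is computed per neighbourhood rather than once for the whole image, unevenly lit stone surfaces are less likely to swamp the strokes than with a single global threshold.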

IRJET- Convolution Neural Network based Ancient Tamil Character Recognition from Epigraphical Inscriptions

IRJET, 2020

Tamil is one of the oldest languages in the world with several million speakers in the Southern part of Tamil Nadu. Recognition of ancient Tamil characters is one of the challenging tasks faced by epigraphers in the field of archaeology. More information is revealed by recognizing the characters inscribed on stones. Using OCR techniques, ancient Tamil characters from between the 9th and 12th centuries are recognized. Optical Character Recognition (OCR) is the process of converting input images of text into a machine-editable format. The important steps in OCR are pre-processing, segmentation and recognition. Deep Learning (DL) has been used in image classification, object tracking, face recognition, scene labeling, text detection, etc. The Convolution Neural Network (CNN) is the most commonly used model in Deep Learning and has demonstrated high performance on image classification. In the present study, we trained an 18-layer CNN for a 73-class character recognition problem. This CNN architecture is trained for feature extraction from the samples using the ReLU activation function. A CNN can automatically learn a unique set of features directly from the images in a hierarchical manner. Using our framework, we obtained the segmentation rate and recognition rate by mapping the ancient Tamil characters to modern Tamil characters.
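To make the convolution-plus-ReLU building block mentioned above concrete, here is a minimal sketch of a single feature-map computation. It is not the 18-layer network from the abstract: the function names `relu` and `conv2d_relu`, the hand-picked edge kernel, and the tiny input are hypothetical, and a trained CNN would learn its kernels rather than have them written by hand.

```python
def relu(value):
    """Rectified linear unit: negative responses are clipped to zero."""
    return max(0.0, value)


def conv2d_relu(image, kernel):
    """Valid (no padding, stride 1) 2-D cross-correlation followed by ReLU."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    feature_map = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            # Weighted sum of the patch whose top-left corner is (x, y).
            total = sum(
                image[y + j][x + i] * kernel[j][i]
                for j in range(kernel_h)
                for i in range(kernel_w)
            )
            row.append(relu(total))
        feature_map.append(row)
    return feature_map


# Hypothetical input: background (0) on the left, stroke (1) on the right.
patch = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
left_to_right_edge = [[-1, 1]]  # responds where intensity rises left to right
edges = conv2d_relu(patch, left_to_right_edge)
```

Stacking many such layers, each with learned kernels, is what lets the network build up features hierarchically from edges to character parts.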

Recognizing ancient Sinhala Inscription Characters using Neural Network Technologies

Recognizing ancient Sinhala inscription characters enables archaeologists to reveal historical events in ancient Sri Lanka. Currently, this is done by archaeology experts with great effort. The inefficiency of this manual procedure will negatively affect future research in the field of archaeology. This research involves developing an application with Optical Character Recognition (OCR) functionality to recognize ancient Sinhala inscriptions.

12th Century Ancient Tamil Character Recognition From Temple Wall Inscriptions

i-manager's Journal on Embedded Systems

Recognizing ancient characters is complicated in any language, and ancient Tamil characters differ in written format, intensity, scale, style, and orientation from person to person. Research on the recognition of ancient Tamil languages and scripts is comparatively scarce relative to other languages, a result of the lack of resources such as Tamil text databases and dictionaries. Ancient Tamil character recognition poses a greater technical challenge than other languages because of the similarity and complexity of the characters, which are composed of circles, holes, loops and curves. Hence ancient Tamil recognition requires more research to reach the ultimate goal of machine simulation of human reading. In this paper, we have made an attempt to recognize ancient Tamil characters using SIFT features and present a new and efficient approach based on a bag-of-keypoints representation. A collection of SIFT features is first extracted from local patches of the pre-processed images, and these features are then quantized by the K-means algorithm to form the bag-of-keypoints representation of the original images. These fixed-length feature vectors are used to classify the characters. The recognition system consists of the following activities: digitization, pre-processing, feature extraction and classification. This system achieves a maximum recognition accuracy of 84% using SIFT features.
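The quantization step described above, where a variable number of local descriptors becomes one fixed-length vector, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: the helper names are hypothetical, the descriptors are 2-dimensional toy values (real SIFT descriptors have 128 dimensions), and K-means is initialised with the first `k` descriptors purely to keep the example deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_centroid(descriptor, centroids):
    """Index of the visual word (centroid) closest to the descriptor."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(descriptor, centroids[k]))


def kmeans(descriptors, k, iterations=10):
    """Plain Lloyd's algorithm, seeded with the first k descriptors."""
    centroids = [list(d) for d in descriptors[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest_centroid(d, centroids)].append(d)
        for index, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[index] = [sum(col) / len(members)
                                    for col in zip(*members)]
    return centroids


def bag_of_keypoints(descriptors, centroids):
    """Fixed-length, L1-normalised histogram of visual-word counts."""
    counts = [0] * len(centroids)
    for d in descriptors:
        counts[nearest_centroid(d, centroids)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(centroids)


# Toy descriptors standing in for SIFT features of one character image.
toy_descriptors = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0)]
vocabulary = kmeans(toy_descriptors, k=2)
histogram = bag_of_keypoints(toy_descriptors, vocabulary)
```

Whatever the number of keypoints found in an image, the histogram always has `k` entries, which is what allows a standard classifier to consume it.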

Ancient text recognition: a review

Artificial Intelligence Review, 2020

Optical character recognition (OCR) is an important research area in the field of pattern recognition. A lot of research has been done on OCR in the last 60 years. There is a large volume of paper-based data in various libraries and offices. Also, there is a wealth of knowledge in the form of ancient text documents. It is a challenge to maintain and search this paper-based data. In many places, efforts are being made to digitize this data. Paper-based documents are scanned to digitize the data, but scanned data is in pictorial form. It cannot be recognized by computers, because computers understand standard alphanumeric characters only as ASCII or some other code. Therefore, alphanumeric information must be retrieved from scanned images. An optical character recognition system allows us to convert a document into electronic text, which can then be edited, searched, and so on. An OCR system is the machine replication of human reading and has been the subject of intensive research for more than six decades. This paper presents a comprehensive survey of the work done in the various phases of an OCR system, with special focus on OCR for ancient text documents. This paper will help novice researchers by providing a comprehensive study of the various phases, namely segmentation, feature extraction and classification techniques, required for an OCR system, especially for ancient documents. It has been observed that only limited work has been done on the recognition of ancient documents, especially for the Devanagari script. This article also presents future directions for upcoming researchers in the field of ancient text recognition.

Optical Character Recognition (OCR) for Telugu: Database, Algorithm and Application

2018 25th IEEE International Conference on Image Processing (ICIP), 2018

Telugu is a Dravidian language spoken by more than 80 million people worldwide. The optical character recognition (OCR) of the Telugu script has wide ranging applications including education, health-care, administration etc. The beautiful Telugu script however is very different from Germanic scripts like English and German. This makes the use of transfer learning of Germanic OCR solutions to Telugu a non-trivial task. To address the challenge of OCR for Telugu, we make three contributions in this work: (i) a database of Telugu characters, (ii) a deep learning based OCR algorithm, and (iii) a client server solution for the online deployment of the algorithm. For the benefit of the Telugu people and the research community, our code has been made freely available at https://gayamtrishal.github.io/OCR Telugu.github.io/.

Recognition of Handwritten Roman Script Using Tesseract Open source OCR Engine

Computing Research Repository, 2010

The objective of the paper is to recognize handwritten samples of Roman numerals using the Tesseract open source Optical Character Recognition (OCR) engine. Tesseract is trained with data samples from different persons to generate one user-independent language model, representing the handwritten Roman digit-set. The system is trained with 1226 digit samples collected from the different users. The performance is tested on two different datasets, one consisting of samples collected from the known users (those who prepared the training data samples) and the other consisting of handwritten data samples of unknown users. The overall recognition accuracy is obtained as 92.1% and 86.59% on these test datasets respectively.

Optical Character Recognition for printed Tamil text using Unicode

Journal of Zhejiang University SCIENCE, 2005

Optical Character Recognition (OCR) refers to the process of converting printed Tamil text documents into software translated Unicode Tamil Text. The printed documents available in the form of books, papers, magazines, etc. are scanned using standard scanners which produce an image of the scanned document. As part of the preprocessing phase the image file is checked for skewing. If the image is skewed, it is corrected by a simple rotation technique in the appropriate direction. Then the image is passed through a noise elimination phase and is binarized. The preprocessed image is segmented using an algorithm which decomposes the scanned text into paragraphs using a special space detection technique, then the paragraphs into lines using vertical histograms, the lines into words using horizontal histograms, and the words into character image glyphs using horizontal histograms. Each image glyph comprises 32x32 pixels. Thus a database of character image glyphs is created out of the segmentation phase. Then all the image glyphs are considered for recognition using Unicode mapping. Each image glyph is passed through various routines which extract the features of the glyph. The features considered for classification are the character height, character width, the number of horizontal lines (long and short), the number of vertical lines (long and short), the horizontally oriented curves, the vertically oriented curves, the number of circles, the number of slope lines, the image centroid and special dots. The glyphs are then ready for classification based on these features. The extracted features are passed to a Support Vector Machine (SVM), where the characters are classified by a supervised learning algorithm. These classes are mapped onto Unicode for recognition. Then the text is reconstructed using Unicode fonts.
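The histogram-based segmentation described above can be illustrated with projection profiles: ink is summed along one axis, and runs of non-zero sums mark text while zero gaps mark separators. The sketch below is an assumption-laden simplification, not the paper's algorithm: the names `runs_of_ink`, `segment_lines` and `segment_columns` are hypothetical, any zero-ink gap is treated as a separator, and the paragraph detection, word-versus-character gap thresholds and 32x32 glyph resizing are left out.

```python
def runs_of_ink(profile):
    """Return (start, end) pairs, end exclusive, for each run of non-zero values."""
    runs, start = [], None
    for index, value in enumerate(profile):
        if value and start is None:
            start = index                     # a run of ink begins
        elif not value and start is not None:
            runs.append((start, index))       # the run ended at a blank
            start = None
    if start is not None:
        runs.append((start, len(profile)))    # run reaches the border
    return runs


def segment_lines(binary):
    """Split a binarized page (1 = ink) into text lines via row sums."""
    row_profile = [sum(row) for row in binary]
    return runs_of_ink(row_profile)


def segment_columns(binary, top, bottom):
    """Split one text line (rows top..bottom-1) into glyph spans via column sums."""
    width = len(binary[0])
    column_profile = [sum(binary[y][x] for y in range(top, bottom))
                      for x in range(width)]
    return runs_of_ink(column_profile)


# Hypothetical binarized page with two text lines.
page = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
]
lines = segment_lines(page)
glyphs_per_line = [segment_columns(page, top, bottom) for top, bottom in lines]
```

In practice a minimum gap width is usually required before a split is accepted, so that small breaks inside a character do not produce spurious glyphs.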

A Complete Tamil Optical Character Recognition System

5th International Workshop, DAS 2002, Proceedings, 2002

Document Image processing and Optical Character Recognition (OCR) have been a frontline research area in the field of human-machine interface for the last few decades. Recognition of Indian language characters has been a topic of interest for quite some time. The earlier contributions were reported in [1] and [2]. A more recent work is reported in [3] and [9]. The need for efficient and robust algorithms and systems for recognition is being felt in India, especially in the post and telegraph department where OCR can assist the staff in sorting mail. Character recognition can also form a part in applications like intelligent scanning machines, text to speech converters, and automatic language-to-language translators.

A Complete OCR for Printed Tamil Text

Proc. Tamil Internet 2000 (TI 2000)

A multi-font, multi-size Optical Character Recognizer (OCR) of Tamil Script is developed. The input image to the system is binary and is assumed to contain only text. The skew angle of the document is estimated using a combination of Hough transform and Principal Component Analysis. A multi-rate-signal-processing based algorithm is devised to achieve distortion-free rotation of the binary image during skew correction. Text segmentation is noise-tolerant. The statistics of the line height and the character gap are used to segment the text lines and the words. The images of the words are subjected to morphological closing followed by connected component-based segmentation to separate out the individual symbols. Each segmented symbol is resized to a pre-fixed size and thinned before it is fed to the classifier. A three-level, tree-structured classifier for Tamil script is designed. The net classification accuracy is 99.0%.
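As a rough illustration of the Principal Component Analysis half of the skew estimation mentioned above, the dominant orientation of the foreground pixels can be read off the 2x2 covariance matrix of their coordinates. This is only a sketch under assumptions: the abstract does not give the authors' formulation or how PCA is combined with the Hough transform, the function name `pca_skew_angle` is hypothetical, and the angle is measured in image coordinates, where y grows downward.

```python
import math


def pca_skew_angle(binary):
    """Angle in degrees of the principal axis of the foreground pixels.

    The angle is that of the covariance eigenvector with the largest
    eigenvalue, relative to the x axis, in the range (-90, 90].
    """
    points = [(x, y)
              for y, row in enumerate(binary)
              for x, value in enumerate(row) if value]
    if len(points) < 2:
        return 0.0  # orientation is undefined for fewer than two pixels
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / n
    var_y = sum((y - mean_y) ** 2 for _, y in points) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Closed form for the principal direction of a symmetric 2x2 matrix.
    return math.degrees(0.5 * math.atan2(2.0 * cov_xy, var_x - var_y))


# Hypothetical inputs: a level text line and a diagonal stroke.
level = [[0, 0, 0, 0],
         [1, 1, 1, 1],
         [0, 0, 0, 0]]
diagonal = [[1, 0, 0, 0],
            [0, 1, 0, 0],
            [0, 0, 1, 0],
            [0, 0, 0, 1]]
level_angle = pca_skew_angle(level)
diagonal_angle = pca_skew_angle(diagonal)
```

Rotating the image by the negative of the estimated angle would then level the text; the abstract notes that the authors use a multi-rate-signal-processing based rotation for that step, which this sketch does not attempt.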