OCR for CJK Classical Texts -- Preliminary Examination
Related papers
Character Segmentation Scheme for OCR System
International Journal of Computer Vision and Image Processing, 2011
Automatic optical character recognizers (OCR) for machine-printed text are highly desirable for a multitude of modern IT applications, including digital library software. However, state-of-the-art OCR systems cannot handle Myanmar script, as the language poses many challenges for document understanding. Therefore, the authors design an Optical Character Recognition System for Myanmar Printed Document (OCRMPD), with several proposed techniques that can automatically recognize Myanmar printed text from document images. To obtain a more accurate system, the authors propose a method that isolates character images using not only projection methods but also structural analysis for wrongly segmented characters. To demonstrate the effectiveness of the segmentation technique, the authors follow a new hybrid feature extraction method and choose the SVM classifier for recognition of the character image. The proposed algorithms have been tested on a variety of Myanmar printed...
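The projection step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a binarized text-line image stored as lists of 0/1 values and splits it wherever a column contains no ink. The function names and the toy image are hypothetical, and the structural analysis the authors add for wrongly segmented characters is not reproduced here.

```python
def vertical_projection(image):
    """Count foreground (1) pixels in each column of a binary image."""
    if not image:
        return []
    width = len(image[0])
    return [sum(row[x] for row in image) for x in range(width)]


def segment_columns(image):
    """Return (start, end) column ranges, end exclusive, split at empty columns."""
    profile = vertical_projection(image)
    segments = []
    start = None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                      # first inked column of a new segment
        elif count == 0 and start is not None:
            segments.append((start, x))    # empty column closes the segment
            start = None
    if start is not None:
        segments.append((start, len(profile)))
    return segments


# Toy text line (an assumption for illustration): three blobs separated by gaps.
line_image = [
    [1, 1, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 0, 1, 1],
]
print(segment_columns(line_image))
```

Pure projection fails when characters touch or overlap horizontally, which is presumably why the abstract pairs it with structural analysis.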
A New Character Segmentation Approach for Off-Line Cursive Handwritten Words
Procedia Computer Science, 2013
Character segmentation is the most crucial step for any OCR (Optical Character Recognition) system. The choice of segmentation algorithm is the key factor in determining the accuracy of an OCR system. If characters are segmented well, the recognition accuracy will also be high. Segmentation of words into characters becomes very difficult due to the cursive and unconstrained nature of handwritten script. This paper proposes a new vertical segmentation algorithm in which the segmentation points are located after thinning the word image to get a stroke width of a single pixel. Knowledge of the shape and geometry of English characters is used in the segmentation process to detect ligatures. The proposed segmentation approach is tested on a local benchmark database and achieves high segmentation accuracy.
Thinning Chinese, Korean, Japanese and Thai script for segmentation-free OCRs
International Journal of Scientific Research in Computer Science, Engineering and Information Technology, 2024
Searching the internet for the keyword OCR will return a thousand research papers on optical character recognition. These papers range across Latin, Cyrillic, Devanagari, Korean, Japanese, Chinese and Arabic scripts. Sindhi and many other languages extend the Arabic script, in which the base characters are the same and the other characters are adopted in a similar way. Many languages already have OCR systems, but some still require them. The paper is organized into sections: an introduction, followed by Sindhi language characteristics, an explanation of OCR approaches and methods, and a final section with the conclusion and future work. OCR is a set of complex steps that converts image text to editable text. Skeletonization, or shrinking a word or character body, is a method that helps to recognize text more easily. Different languages pose various challenges and are hard to recognize; skeletonization or thinning produces a new image that can be easier to recognize. The connected elements are found with this approach. Custom-built software has been developed to interface with the generalized thinning algorithm so that Chinese, Japanese, Korean and Thai scripts can be tested. The output of this algorithm is the final image used for further OCR processing. Although the intention was to create algorithms for segmentation-free OCRs, the study results and the software can also be used for segmentation-based algorithms. The generalized algorithm shows an accuracy of more than 95% for the four scripts tested.
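The abstract above does not specify its generalized thinning algorithm, so the following sketch substitutes the well-known Zhang-Suen method purely to show what thinning does to a stroke. It assumes a binary image stored as lists of 0/1 values with at least a one-pixel background border; the function name is hypothetical.

```python
def zhang_suen_thin(image):
    """Thin a binary image (lists of 0/1) with the Zhang-Suen algorithm.

    Border pixels are never examined, so the input is assumed to be padded
    with at least one row and column of background on every side.
    """
    img = [row[:] for row in image]
    rows, cols = len(img), len(img[0])

    def neighbours(y, x):
        # P2..P9, clockwise, starting from the pixel directly above.
        return [img[y - 1][x], img[y - 1][x + 1], img[y][x + 1], img[y + 1][x + 1],
                img[y + 1][x], img[y + 1][x - 1], img[y][x - 1], img[y - 1][x - 1]]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            for y in range(1, rows - 1):
                for x in range(1, cols - 1):
                    if img[y][x] != 1:
                        continue
                    p = neighbours(y, x)
                    b = sum(p)  # number of foreground neighbours
                    # number of 0 -> 1 transitions around the pixel
                    a = sum(1 for i in range(8) if p[i] == 0 and p[(i + 1) % 8] == 1)
                    p2, p4, p6, p8 = p[0], p[2], p[4], p[6]
                    if step == 0:
                        cond = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        cond = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if 2 <= b <= 6 and a == 1 and cond:
                        to_clear.append((y, x))
            # Deletions are applied together after each sub-iteration.
            for y, x in to_clear:
                img[y][x] = 0
            if to_clear:
                changed = True
    return img


# Toy input (an assumption): a three-pixel-thick horizontal bar with a border.
bar = [[0] * 9] + [[0] + [1] * 7 + [0] for _ in range(3)] + [[0] * 9]
for row in zhang_suen_thin(bar):
    print("".join("#" if v else "." for v in row))
```

On this toy input the thick bar reduces to a one-pixel-thick line along its middle row. Complex CJK glyphs with many junctions are a harder case, and the accuracy figure in the abstract refers to the authors' own algorithm, not to this substitute.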
A survey of methods and strategies in character segmentation
IEEE Transactions on Pattern Analysis and Machine Intelligence, 1996
Character segmentation has long been a critical area of the OCR process. The higher recognition rates for isolated characters vs. those obtained for words and connected character strings well illustrate this fact. A good part of recent progress in reading unconstrained printed and written text may be ascribed to more insightful handling of segmentation.
Segmentation of Handwritten Characters for Digitalizing Korean Historical Documents
Lecture Notes in Computer Science, 2004
Historical documents are valuable cultural heritage and sources for the study of the history, society and life of their time. The digitalization of historical documents aims to provide instant access to the archives for researchers and the public, who previously had limited access for maintenance reasons. However, most of these documents are not only handwritten in ancient Chinese characters but also have complex page layouts. As a result, it is not easy to apply a conventional OCR (optical character recognition) system to historical documents, even though OCR has received the most attention for several years as a key module in digitalization. We have been developing an OCR-based digitalization system for historical documents for years. In this paper, we propose dedicated segmentation and rejection methods for OCR of Korean historical documents. The proposed recognition-based segmentation method uses geometric features and context information with the Viterbi algorithm. The rejection method uses Mahalanobis distance and posterior probability, in particular to solve the out-of-class problem. Some promising experimental results are reported.
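Recognition-based segmentation of the kind described above is often framed as a best-path search over candidate cut points. The sketch below shows a Viterbi-style dynamic program under stated assumptions: `cuts` is a sorted list of at least one candidate position, and `segment_score` is a hypothetical stand-in for the recognizer plus geometric and context scores, returning a finite log-score for the segment between two cuts. It is not the authors' formulation, and the Mahalanobis-based rejection step is not covered.

```python
def best_segmentation(cuts, segment_score):
    """Pick the highest-scoring way to split a line at candidate cut points.

    cuts: sorted candidate positions, including the left and right ends.
    segment_score(left, right): finite score for treating [left, right) as one
    character (a hypothetical stand-in for a recognizer's log-probability).
    """
    n = len(cuts)
    best = [float("-inf")] * n   # best total score of a path ending at cut j
    back = [-1] * n              # predecessor cut index on that best path
    best[0] = 0.0
    for j in range(1, n):
        for i in range(j):
            candidate = best[i] + segment_score(cuts[i], cuts[j])
            if candidate > best[j]:
                best[j] = candidate
                back[j] = i
    # Trace the winning path back from the last cut to the first.
    segments = []
    j = n - 1
    while j > 0:
        i = back[j]
        segments.append((cuts[i], cuts[j]))
        j = i
    segments.reverse()
    return segments, best[-1]


def toy_score(left, right):
    """Toy score (an assumption): prefer segments that are 10 pixels wide."""
    return -abs((right - left) - 10)


candidate_cuts = [0, 4, 10, 20, 26, 30]
print(best_segmentation(candidate_cuts, toy_score))
```

Spurious cuts at 4 and 26 are skipped because no path through them scores as well as three segments of width 10. In a real system the score would come from the classifier, which is what makes the segmentation recognition-based.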
A Detailed study and recent research on OCR
International Journal of Computer Science and Information Security (IJCSIS), Vol. 19, No. 2, February 2021
This paper provides an overview of OCR. Optical character recognition is the ability of a computer to collect and decipher handwritten input from documents, photos or other devices. Over the years, many researchers have paid attention to this topic and proposed many methods for solving it. This paper provides a historical view and a summary of the research done in this field.
Off-line Character Segmentation Technique for Handwritten Cursive Word – A Survey
This paper presents a survey of techniques for segmenting handwritten cursive word images into individual characters. It provides a comprehensive review of segmentation rates and descriptions of test data for the approaches discussed. The techniques used to decide segmentation points are described in detail. The performances are then compared in two categories, based on the number of words and the existence of a learning system.
Segmentation of Handwritten Text in Gurmukhi Script
Character segmentation is an important preprocessing step for text recognition. The size and shape of characters generally play an important role in the process of segmentation. For any optical character recognition (OCR) system, however, the presence of touching characters in textual as well as handwritten documents drastically decreases both correct segmentation and the recognition rate. Because the size and shape of characters in handwritten documents cannot be controlled, segmenting handwritten documents is especially difficult. We tried to segment handwritten text by proposing some algorithms, which were implemented and have shown encouraging results. Algorithms have also been proposed to segment touching characters. These algorithms have shown a reasonable improvement in segmenting touching handwritten characters in Gurmukhi script.
An Efficient Skewed Line Segmentation Technique for Cursive Script OCR
2020
Segmentation of cursive text remains a challenging phase in text recognition. In OCR systems, recognition accuracy depends directly on the quality of segmentation. In cursive text OCR systems, the segmentation of handwritten Urdu text is a complex task because of the context sensitivity and diagonality of the text. This paper presents an algorithm that segments Urdu handwritten and printed text into lines and subsequently into ligatures. In the proposed technique, a pixel-counting approach is employed for modified header and baseline detection: the system first removes the skew of the text page, and then the page is converted into lines and ligatures. The algorithm is evaluated on a manually generated dataset of Urdu printed and handwritten text. The proposed algorithm is tested separately on handwritten and printed text, showing 96.7% and 98.3% line accuracy, respectively. Furthermore, the proposed line segmentation algorithm correctly extracts the...
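A pixel-counting approach to line and baseline detection can be sketched as follows. This is a simplified illustration, not the paper's modified header and baseline method: it assumes the page is already deskewed and binarized into lists of 0/1 values, splits lines at empty rows, and takes the densest row of each line as a baseline estimate. All names are hypothetical.

```python
def row_counts(page):
    """Count foreground (1) pixels in each row of a binary page image."""
    return [sum(row) for row in page]


def segment_lines(page):
    """Return (top, bottom) row ranges, bottom exclusive, split at empty rows."""
    counts = row_counts(page)
    lines = []
    top = None
    for y, count in enumerate(counts):
        if count > 0 and top is None:
            top = y
        elif count == 0 and top is not None:
            lines.append((top, y))
            top = None
    if top is not None:
        lines.append((top, len(counts)))
    return lines


def baseline_row(page, line):
    """Estimate the baseline as the row with the most ink inside a line.

    Using the densest row is a heuristic assumed here for illustration.
    """
    top, bottom = line
    counts = row_counts(page)
    return max(range(top, bottom), key=lambda y: counts[y])


# Toy deskewed page (an assumption): two text lines separated by a blank row.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
]
for line in segment_lines(page):
    print(line, baseline_row(page, line))
```

Empty-row splitting only works once skew is removed, which matches the order of operations the abstract describes.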
A DETAILED STUDY AND ANALYSIS OF OCR USING MATLAB
This paper presents a detailed review of the field of optical character recognition. It identifies various techniques that have been proposed to realize the core of character recognition in an optical character recognition system. Numerous studies and papers already describe techniques for converting textual content from a paper document into machine-readable form. Optical character recognition is a process in which the computer automatically interprets an image of handwritten script and converts it into classified characters. This material serves as a guide and update for readers working in the character recognition area. Selecting a relevant feature extraction method is probably the single most important factor in achieving consistently high accuracy in character recognition systems. Character recognition techniques associate a symbolic identity with the image of a character. In a typical OCR system, input characters are digitized by an optical scanner. Each character is then located and segmented, and the resulting character image is fed into a pre-processor for noise reduction and normalization. Certain characteristics are then extracted from the character for classification. Feature extraction is critical, and many different techniques exist, each with its strengths and weaknesses. After classification, the identified characters are grouped to reconstruct the original symbol strings, and context may then be applied to detect and correct errors.
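The normalization, feature extraction and classification stages described above can be sketched in a few lines. The choices here are assumptions made for illustration and do not come from the paper: normalization is reduced to cropping to the bounding box, features are zoning densities on a 2 x 2 grid, and classification is nearest prototype by squared Euclidean distance. Noise reduction and contextual post-processing are omitted.

```python
def crop_to_bounding_box(glyph):
    """Normalization step: trim empty rows and columns around the character."""
    rows = [y for y, row in enumerate(glyph) if any(row)]
    cols = [x for x in range(len(glyph[0])) if any(row[x] for row in glyph)]
    if not rows:
        return glyph  # blank image: nothing to crop
    return [row[cols[0]:cols[-1] + 1] for row in glyph[rows[0]:rows[-1] + 1]]


def zoning_features(glyph, zones=2):
    """Feature extraction: ink density in each cell of a zones x zones grid."""
    height, width = len(glyph), len(glyph[0])
    features = []
    for zy in range(zones):
        for zx in range(zones):
            y0, y1 = zy * height // zones, (zy + 1) * height // zones
            x0, x1 = zx * width // zones, (zx + 1) * width // zones
            cell = [glyph[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            features.append(sum(cell) / len(cell) if cell else 0.0)
    return features


def classify(glyph, prototypes):
    """Classification: label of the prototype nearest in feature space."""
    features = zoning_features(crop_to_bounding_box(glyph))

    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, prototypes[label]))

    return min(prototypes, key=distance)


# Hypothetical 4 x 4 templates used to build the prototype feature vectors.
TEMPLATE_L = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
TEMPLATE_T = [[1, 1, 1, 1], [0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 1, 0]]
prototypes = {"L": zoning_features(TEMPLATE_L), "T": zoning_features(TEMPLATE_T)}

# A segmented character with surrounding whitespace, as a scanner might give.
scanned = [[0] * 6] + [[0] + row + [0] for row in TEMPLATE_L] + [[0] * 6]
print(classify(scanned, prototypes))
```

Swapping `zoning_features` for another extractor leaves the rest of the pipeline unchanged, which reflects the abstract's point that the feature extraction method is the main design decision.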