Recognizable units in Pashto language for OCR

Choice of recognizable units for URDU OCR

2012

There has been considerable work on Arabic OCR, but all of it is based on the Naskh style. Urdu script is based on the Arabic alphabet but is written in the Nastalique style. Nastalique makes OCR in general, and character segmentation in particular, highly challenging, so most researchers skip the character segmentation phase and adopt a higher unit of recognition. For Urdu, the next higher unit considered by researchers is the ligature, which lies between the character and the word. A ligature is a connected component of one or more characters, and an Urdu word is usually composed of 1 to 8 ligatures. A related issue is identifying all possible ligatures for recognition. For this purpose, we performed a statistical analysis of an Urdu corpus to collect and organise the Urdu ligatures. The number of unique ligatures exceeds 26,000, and recognizing so many classes is again a Herculean task, so it becomes necessary to reduce the class count and look for an alternative recognition unit. From an OCR point of view, a ligature can be further segmented into one primary connected component and zero or more secondary connected components. The primary component represents the basic shape of the ligature, while the secondary components correspond to the dots, diacritic marks and special symbols associated with it. To reduce the class count, ligatures with similar primary components are grouped together. Further statistical analysis counts the primary components and arranges them in descending order of frequency, yielding a manageable set of around 2300 recognition units that covers 99% of the Urdu corpus.
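
The coverage analysis described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `units_for_coverage`, the `key` parameter and the toy corpus are assumptions made for illustration. The sketch counts units, sorts them by descending frequency, and keeps the most frequent ones until the target share of the corpus is covered. Passing a `key` that maps a ligature to its primary component shows how grouping shrinks the class count.

```python
from collections import Counter

def units_for_coverage(tokens, target=0.99, key=lambda t: t):
    """Return the most frequent units needed to cover `target` of all tokens.

    `key` maps a token to its recognition unit (hypothetical helper: the
    identity keeps full ligatures, t[0] keeps only the primary component).
    """
    counts = Counter(key(t) for t in tokens)
    total = sum(counts.values())
    selected, covered = [], 0
    # most_common() yields units in descending order of frequency
    for unit, freq in counts.most_common():
        if covered >= target * total:
            break
        selected.append(unit)
        covered += freq
    coverage = covered / total if total else 0.0
    return selected, coverage

# Toy corpus: each token is a made-up (primary shape, dot pattern) pair.
corpus = ([("P1", "dot-above")] * 50 + [("P1", "dot-below")] * 40
          + [("P2", "")] * 9 + [("P3", "")])
by_ligature, _ = units_for_coverage(corpus, target=0.95)
by_primary, _ = units_for_coverage(corpus, target=0.95, key=lambda t: t[0])
```

On this toy corpus the same coverage needs three ligature classes but only two primary-component classes, which mirrors the reduction the abstract reports at corpus scale.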

Ligature Segmentation for Urdu OCR

2013 12th International Conference on Document Analysis and Recognition, 2013

Urdu script uses a superset of the Arabic alphabet but is written in the Nastaliq style. Nastaliq is highly cursive and context sensitive, and is written diagonally from top right to bottom left with stacking of characters, which makes it very hard to process for OCR. In addition, line and word segmentation are non-trivial tasks because lines frequently merge and words and ligatures overlap vertically. The real challenge, however, is character segmentation, so most researchers have taken the next higher unit, the ligature, as the recognition unit. A ligature is a connected component of one or more characters, including diacritic marks, and an Urdu word is usually composed of 1 to 8 ligatures. In this paper, we present a methodology for segmenting Urdu text into ligatures. A hybrid approach is employed, which uses a top-down technique for line segmentation and a bottom-up design for segmenting each line into ligatures. The various challenges encountered during ligature segmentation, such as horizontally overlapping and broken lines, merged ligatures and diacritic association, are discussed in detail.
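
The abstract does not say which top-down technique is used for line segmentation. A horizontal projection profile is one common choice and is assumed in the sketch below, where the function name `segment_lines` and the list-of-rows image format are illustrative only. The image is a binary grid with 1 for ink, and a text line is taken to be a maximal run of rows that contain ink.

```python
def segment_lines(image):
    """Split a binary image (rows of 0/1, 1 = ink) into (top, bottom) row ranges."""
    # Horizontal projection: amount of ink in each pixel row
    profile = [sum(row) for row in image]
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink and start is None:
            start = y                     # first inked row of a new line
        elif not ink and start is not None:
            lines.append((start, y - 1))  # blank row closes the current line
            start = None
    if start is not None:                 # line running to the bottom edge
        lines.append((start, len(profile) - 1))
    return lines

page = [
    [0, 0, 0],
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
    [0, 1, 0],
]
line_ranges = segment_lines(page)
```

This plain version only splits at completely blank rows, so it fails on exactly the merged and overlapping lines the abstract lists as challenges. Handling those needs additional logic that is not shown here.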

Recognition of Nastalique Urdu ligatures

Proceedings of the 4th International Workshop on Multilingual OCR, 2013

There has been considerable work on Arabic OCR, but all of it is based on the Naskh style. Urdu script is based on the Arabic alphabet but is written in the Nastalique style. Nastalique makes OCR in general, and character segmentation in particular, highly challenging, so most researchers skip the character segmentation phase and adopt a higher unit of recognition. For Urdu, the next higher unit considered by researchers is the ligature, which lies between the character and the word. A ligature is a connected component of one or more characters, and an Urdu word is usually composed of 1 to 8 ligatures. There are more than 25,000 Urdu ligatures, of which the top 4567 account for 99% coverage. From an OCR point of view, a ligature can be further segmented into one primary connected component and zero or more secondary connected components. The primary component represents the basic shape of the ligature, while the secondary components correspond to the dots, diacritic marks and special symbols associated with it. To reduce the class count, ligatures with similar primary components are grouped together. In this paper, we present a system that recognizes 9262 ligatures formed from 2190 primary and 17 secondary components. Various combinations of DCT, Gabor filter and zoning-based features with kNN, HMM and SVM classifiers have been tried, and a recognition accuracy of 98% is reported on pre-segmented ligatures.

Offline Urdu OCR using Ligature based Segmentation for Nastaliq Script

Urdu has two popular writing styles, Naskh and Nastaliq. In Arabic OCR research, ample work has been done on the Naskh writing style; Urdu, which uses the Arabic character set, is commonly written in the Nastaliq style. Because of this, Urdu OCR poses many distinct challenges, such as compactness, diagonal orientation and context-sensitive character shapes, that an OCR system must overcome to correctly recognize an Urdu text image. Owing to the compact and slanting nature of Nastaliq, existing methods for the Naskh style do not give desirable results. Therefore, in this paper we present a ligature-based segmentation OCR system for Urdu Nastaliq script. We discuss in detail the unique challenges of Urdu OCR and different feature extraction techniques for ligature recognition using SVM and kNN classifiers. The system is trained to recognize 11,000 Urdu ligatures. We achieved an overall accuracy of 90.29% when tested on Urdu text images.

A Methodology for Urdu Word Segmentation using Ligature and Word Probabilities

2011

This paper introduces a technique for word segmentation for handwritten recognition of Urdu script. Word segmentation, or word tokenization, is a primary technique for understanding sentences written in the Urdu language. Several techniques are available for word segmentation in other languages, but not much work has been done on word segmentation for Urdu Optical Character Recognition (OCR) systems. The method proposed in this paper finds the boundaries of words in a sequence of ligatures using probabilistic formulas, utilizing knowledge of the collocation of ligatures and words in the corpus. The word identification rate using this technique is 97.10%, with a 66.63% identification rate for unknown words.
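
The abstract does not give its probabilistic formulas, so the following is only a simplified stand-in: a dynamic-programming search that splits a ligature sequence into the word sequence with the highest product of word probabilities. The function name `segment_words`, the probability table, the unknown-word penalty and the Latin placeholder ligatures are all assumptions for illustration. The limit of 8 ligatures per word follows the figure quoted in the abstracts above.

```python
import math

def segment_words(ligatures, word_prob, max_len=8, unknown=1e-8):
    """Split `ligatures` into words maximising the product of word probabilities.

    `word_prob` maps a tuple of ligatures to a probability. Unseen candidates
    get an assumed penalty of `unknown` per ligature.
    """
    n = len(ligatures)
    # best[i] = (best log-probability of the first i ligatures, start of last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = tuple(ligatures[start:end])
            p = word_prob.get(word, unknown ** (end - start))
            score = best[start][0] + math.log(p)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end to recover the word boundaries
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(tuple(ligatures[start:end]))
        end = start
    return words[::-1]

probs = {("a", "b"): 0.4, ("c", "d"): 0.3, ("a",): 0.05, ("b", "c"): 0.1}
result = segment_words(["a", "b", "c", "d"], probs)
```

A model closer to the paper would also score how likely adjacent ligatures and words are to occur together, which this unigram sketch ignores.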

Line and Ligature Segmentation of Urdu Nastaleeq Text

IEEE Access, 2017

The recognition accuracy of ligature-based Urdu language optical character recognition (OCR) systems depends heavily on the accuracy of the segmentation that converts Urdu text into lines and ligatures. In general, line- and ligature-based Urdu OCR systems are more successful than character-based ones. This paper presents techniques for segmenting Urdu Nastaleeq text images into lines and subsequently into ligatures. The classical horizontal projection-based segmentation method is augmented with a curved-line-split algorithm to overcome problems such as text line split position, overlapping, merged ligatures, and ligatures crossing line split positions. The ligature segmentation algorithm extracts connected components from text lines, categorizes them into primary and secondary classes, and allocates secondary components to the primary class by examining width, height, coordinates, overlapping, centroids, and baseline information. The proposed line segmentation algorithm is tested on 47 pages with 99.17% accuracy. The proposed ligature segmentation algorithm is mainly tested on a large data set of printed Urdu text images. The proposed algorithm segmented this data set into 189 000 ligatures from 10 063 text lines having 332 000 connected components. A total of about 142 000 secondary components have been successfully allocated to more than 189 000 primary ligatures with an accuracy rate of 99.80%. Thus, both of the proposed segmentation algorithms outperform the existing algorithms employed for Urdu Nastaleeq text segmentation. Moreover, the proposed line segmentation algorithm is also tested on Arabic, for which it also extracted lines correctly.
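
The allocation step can be pictured with a small sketch that works on bounding boxes of already extracted connected components. The rules below are illustrative assumptions, not the paper's algorithm: a component counts as primary if it crosses the baseline or is at least `min_height` pixels tall, and each secondary component goes to the primary with the largest horizontal overlap, with ties broken by centroid distance. All function names are hypothetical.

```python
def x_overlap(a, b):
    """Number of pixel columns shared by boxes a and b, each (x0, y0, x1, y1)."""
    return max(0, min(a[2], b[2]) - max(a[0], b[0]) + 1)

def centre_x(box):
    return (box[0] + box[2]) / 2

def is_primary(box, baseline_y, min_height):
    # Assumed rule: primary shapes cross the baseline or are tall
    crosses_baseline = box[1] <= baseline_y <= box[3]
    tall = (box[3] - box[1] + 1) >= min_height
    return crosses_baseline or tall

def allocate_secondaries(boxes, baseline_y, min_height=10):
    """Return a list of (primary box, [secondary boxes]) pairs."""
    primaries = [b for b in boxes if is_primary(b, baseline_y, min_height)]
    secondaries = [b for b in boxes if not is_primary(b, baseline_y, min_height)]
    if not primaries:
        return []
    groups = [[] for _ in primaries]
    for s in secondaries:
        # Prefer the largest horizontal overlap, then the nearest centroid
        best = max(
            range(len(primaries)),
            key=lambda i: (x_overlap(primaries[i], s),
                           -abs(centre_x(primaries[i]) - centre_x(s))),
        )
        groups[best].append(s)
    return list(zip(primaries, groups))

components = [(0, 10, 20, 30), (30, 12, 50, 28), (5, 2, 8, 5), (35, 32, 38, 35)]
ligatures = allocate_secondaries(components, baseline_y=25)
```

In the example, the small mark above the first shape and the small mark below the second shape are each attached to the primary component they overlap horizontally.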

Robust Optical Recognition of Cursive Pashto Script Using Scale, Rotation and Location Invariant Approach

PLOS ONE, 2015

The presence of a large number of unique shapes called ligatures in cursive languages, along with variations due to scaling, orientation and location, provides one of the most challenging pattern recognition problems. Recognition of the large number of ligatures is often a complicated task in oriental languages such as Pashto, Urdu, Persian and Arabic. Research on cursive script recognition often ignores the fact that scaling, orientation, location and font variations are common in printed cursive text. Therefore, these variations are not included in image databases and in experimental evaluations. This research uncovers challenges faced by Arabic cursive script recognition in a holistic framework by considering Pashto as a test case, because the Pashto language has a larger alphabet than Arabic, Persian and Urdu. A database containing 8000 images of 1000 unique ligatures with scaling, orientation and location variations is introduced. In this article, a feature space based on the scale invariant feature transform (SIFT), along with a segmentation framework, is proposed for overcoming the above-mentioned challenges. The experimental results show a significantly improved performance of the proposed scheme over traditional feature extraction techniques such as principal component analysis (PCA).

Segmentation of Cursive Textual Images (Applied to Urdu Script)

Segmentation plays a vital role in the development of an OCR system, and segmentation of Urdu script is still an open issue. Many segmentation models for Urdu text have been proposed by the AI / machine learning community, but there is still no agreed solution. This research takes a design approach and demonstrates in practice how machine learning algorithms can be used to solve segmentation in document image processing for the Urdu language. A novel hidden Markov model (HMM) based algorithm is proposed to segment Urdu text down to the character level. Segmentation can be achieved using projection, morphology, relaxation labeling and connected components. Several character recognition techniques are available such as template matching, bounding box algorithm, artificial neural network, support vector machines, Markov process, chain codes, k-nearest neighbour, Bayes classifier and Fuzzy c-means, hit and detect algorithm, drop-fall algorithm, structure feature based algorithms, mim-max algo...

Scale and rotation invariant OCR for Pashto cursive script using MDLSTM network

2015 13th International Conference on Document Analysis and Recognition (ICDAR), 2015

Optical Character Recognition (OCR) of cursive scripts like Pashto and Urdu is difficult due to the presence of complex ligatures and connected writing styles. In this paper, we evaluate and compare different approaches for the recognition of such complex ligatures. The approaches include the Hidden Markov Model (HMM), the Long Short Term Memory (LSTM) network and the Scale Invariant Feature Transform (SIFT). The current state of the art in cursive script recognition assumes constant scale without any rotation, while real-world data contain rotation and scale variations. This research aims to evaluate the performance of sequence classifiers like HMM and LSTM and compare it with that of a descriptor-based classifier like SIFT. In addition, we assess the performance of these methods against scale and rotation variations in cursive script ligatures. Moreover, we introduce a database of 480,000 images containing 1000 unique ligatures or sub-words of Pashto. In this database, each ligature has 40 scale and 12 rotation variations. The evaluation results show a significantly improved performance of LSTM over HMM and traditional feature extraction techniques such as SIFT.

A Feature-based Approach to Segmentation of Online Persian Cursive Script

International Journal of Modeling and Optimization, 2012

In recent years, most of the research in handwriting recognition seems to have focused on the problem of on-line cursive script recognition. The cursive nature of Persian script, the existence of different handwriting styles for its alphabet, and the fact that each character is written in different forms depending on its position in a word make segmentation and recognition of Persian words a challenging task. In this paper, we propose a novel segmentation method for online Persian handwriting using some generic features of Persian letters in cursive words. In addition, some easy-to-implement techniques for extracting those features are presented. Our segmentation process is composed of two modules. The first handles preprocessing of the input data, for which we propose a normalization technique that makes the distances between consecutive input points uniform. By doing so, the input data becomes independent of writing speed and input device. The second module deals with segmentation of a word into its constituent letters. Our results from implementation of the proposed method show a total accuracy of up to 98.625%.
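
The normalization step in the first module can be sketched as arc-length resampling of a pen stroke. This is a generic technique assumed here for illustration, since the abstract does not give the authors' exact procedure, and the function name `resample` and the `step` parameter are hypothetical. The stroke is walked along its path and a new point is emitted every `step` units, so that dense slow-writing samples and sparse fast-writing samples yield the same output.

```python
import math

def resample(points, step):
    """Resample a polyline so output points are `step` apart along the path."""
    if not points:
        return []
    out = [points[0]]
    prev = points[0]
    remaining = step  # path length still to travel before the next output point
    for cur in points[1:]:
        seg = math.dist(prev, cur)
        while seg > 0 and seg >= remaining:
            # Move `remaining` units from prev towards cur and emit a point
            t = remaining / seg
            prev = (prev[0] + t * (cur[0] - prev[0]),
                    prev[1] + t * (cur[1] - prev[1]))
            out.append(prev)
            seg -= remaining
            remaining = step
        remaining -= seg  # carry the unused part of this segment forward
        prev = cur
    return out

# Unevenly sampled L-shaped stroke (coordinates are made up)
stroke = [(0.0, 0.0), (3.0, 0.0), (3.0, 3.0)]
uniform = resample(stroke, step=2.0)
```

Spacing here is measured along the stroke, so two output points on either side of a corner are closer in a straight line than `step`. Floating-point rounding can also drop a final point that falls exactly on the end of the stroke.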