A Path Planning for Line Segmentation of Handwritten Documents

Text line segmentation in handwritten document using a production system

2004

Text line segmentation in handwritten documents is an important step in document processing. We present a new text line segmentation method based on the Mumford-Shah model. The algorithm is script independent. In addition, we use morphing to remove overlaps between neighboring text lines and connect broken ones. Experimental results show the validity of our method.

A new scheme for unconstrained handwritten text-line segmentation

Pattern Recognition, 2010

Variations in inter-line gaps and skewed or curled text-lines are some of the challenging issues in segmentation of handwritten text-lines. Moreover, overlapping and touching text-lines that frequently appear in unconstrained handwritten text documents significantly increase segmentation complexities. In this paper, we propose a novel approach for unconstrained handwritten text-line segmentation. A new painting technique is employed to smear the foreground portion of the document image. The painting technique enhances the separability between the foreground and background portions enabling easy detection of text-lines. A dilation operation is employed on the foreground portion of the painted image to obtain a single component for each text-line. Thinning of the background portion of the dilated image and subsequently some trimming operations are performed to obtain a number of separating lines, called candidate line separators. By using the starting and ending points of the candidate line separators and analyzing the distances among them, related candidate line separators are connected to obtain segmented text-lines. Furthermore, the problems of overlapping and touching components are addressed using some novel techniques. We tested the proposed scheme on text-pages of English, French, German, Greek, Persian, Oriya, and Bangla and remarkable results were obtained.
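The dilation step described above can be illustrated with a minimal sketch. It assumes a binary image stored as lists of 0/1 values and a purely horizontal structuring element; the function name `dilate_horizontal` and the `radius` parameter are hypothetical, and the abstract does not state which structuring element the authors use.

```python
def dilate_horizontal(image, radius):
    """Binary dilation with a 1 x (2*radius + 1) horizontal structuring element.

    image: list of rows, each a list of 0 (background) / 1 (foreground).
    A pixel becomes foreground if any pixel within `radius` columns is foreground.
    """
    out = []
    for row in image:
        n = len(row)
        out.append([
            1 if any(row[max(0, x - radius):x + radius + 1]) else 0
            for x in range(n)
        ])
    return out


# One row containing three separate components.
row = [[1, 0, 0, 1, 0, 0, 0, 0, 1]]
print(dilate_horizontal(row, 1))  # a small radius leaves a gap
print(dilate_horizontal(row, 2))  # a larger radius merges the row into one run
```

With a sufficiently large radius, the components of one text line merge into a single run, which is the property the dilation step relies on.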

Text line and word segmentation of handwritten documents

Pattern Recognition, 2009

In this paper, we present a methodology for segmenting a handwritten document into its distinct entities, namely text lines and words. Text line segmentation is achieved by applying the Hough Transform to a subset of the connected components of the document image. A post-processing step then corrects possible false alarms, creates text lines that the Hough Transform failed to create, and finally separates vertically connected characters efficiently using a novel method. Word segmentation is treated as a two-class problem. The distances between adjacent overlapped components in a text line are calculated, and each is categorized as either an inter-word or an intra-word distance by comparison with a threshold. The performance of the proposed methodology is assessed with a consistent and concrete evaluation technique that compares the text line segmentation result, and likewise the word segmentation result, with the corresponding ground truth annotation.
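The two-class view of word segmentation can be sketched as follows. This is only an illustration under simplifying assumptions: components are reduced to their horizontal extents, the distance is the plain horizontal gap, and the threshold is supplied by the caller. The names `classify_gaps` and `split_words` are hypothetical, and the abstract does not say how the threshold is chosen.

```python
from typing import List, Tuple

Box = Tuple[int, int]  # (x_left, x_right) extent of one component


def classify_gaps(boxes: List[Box], threshold: float) -> List[str]:
    """Label the gap between each pair of horizontally adjacent components."""
    ordered = sorted(boxes)
    labels = []
    for (_, right), (left, _) in zip(ordered, ordered[1:]):
        gap = max(0, left - right)  # overlapping components get distance 0
        labels.append("inter-word" if gap > threshold else "intra-word")
    return labels


def split_words(boxes: List[Box], threshold: float) -> List[List[Box]]:
    """Group components into words by cutting at every inter-word gap."""
    ordered = sorted(boxes)
    if not ordered:
        return []
    words = [[ordered[0]]]
    for box, label in zip(ordered[1:], classify_gaps(ordered, threshold)):
        if label == "inter-word":
            words.append([box])
        else:
            words[-1].append(box)
    return words


line = [(0, 10), (12, 20), (35, 50), (52, 60)]
print(classify_gaps(line, threshold=8))
print(split_words(line, threshold=8))
```

A fixed threshold is the simplest choice; in practice it would more plausibly be derived from the gap distribution of the line or page.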

Text Line Segmentation in Images of Handwritten Historical Documents

2008 First Workshops on Image Processing Theory, Tools and Applications, 2008

This paper describes an original method to segment handwritten text lines from historical document images. After an initial preprocessing, we compute a black/white transition map to achieve a rough detection of the line regions in the image. Using this map, the corresponding line axes are extracted through a skeletonization algorithm, and the conflicts between adjacent cutting lines are resolved by heuristics. Our approach was tested on a set of digitized handwritten documents (from the PROHIST Project database) dating from the end of the 19th century onwards. The proposed method worked well even with difficult images and correctly segmented 82.18% of the lines in our database. Compared with another recent proposal for automatic line extraction on the same test images, our method improved correct segmentation by more than 38%.
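As a rough illustration of how black/white transitions can indicate line regions, the sketch below counts transitions along each row of a binary image and groups consecutive rows with many transitions. It is a simplification: the paper's transition map is presumably two-dimensional, and the names `transition_counts`, `line_regions` and the `min_transitions` parameter are hypothetical.

```python
def transition_counts(image):
    """Count 0/1 changes between horizontally adjacent pixels in each row."""
    return [sum(1 for a, b in zip(row, row[1:]) if a != b) for row in image]


def line_regions(counts, min_transitions):
    """Return inclusive (start_row, end_row) runs where counts >= min_transitions."""
    regions = []
    start = None
    for y, count in enumerate(counts):
        if count >= min_transitions and start is None:
            start = y  # a text region begins
        elif count < min_transitions and start is not None:
            regions.append((start, y - 1))  # the region ended on the previous row
            start = None
    if start is not None:
        regions.append((start, len(counts) - 1))
    return regions


image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 1],
]
counts = transition_counts(image)
print(counts)
print(line_regions(counts, min_transitions=2))
```

Rows crossing text strokes produce many transitions, while blank inter-line rows produce none, so the runs give a coarse estimate of where the text lines lie.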

Handwritten document image segmentation into text lines and words

Pattern Recognition, 2010

Two novel approaches to extract text lines and words from handwritten documents are presented. The line segmentation algorithm is based on locating the optimal succession of text and gap areas within vertical zones by applying the Viterbi algorithm. Then, a text-line separator drawing technique is applied, and finally the connected components are assigned to text lines. Word segmentation is based on a gap metric that exploits the objective function of a soft-margin linear SVM that separates successive connected components. The algorithms were tested on the benchmarking datasets of the ICDAR07 handwriting segmentation contest and outperformed the participating algorithms.
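The idea of finding an optimal succession of text and gap areas with the Viterbi algorithm can be sketched on a single vertical zone. The sketch below is an assumption-laden simplification rather than the authors' model: it takes a per-row ink density in [0, 1], uses additive costs in place of log-probabilities, and charges a fixed `switch_cost` for changing state. All names and values are hypothetical.

```python
def viterbi_text_gap(density, switch_cost=0.6):
    """Label each row of a vertical zone as "text" or "gap" at minimum total cost.

    density: per-row ink density in [0, 1] (high means likely text).
    switch_cost: penalty for changing state between consecutive rows.
    """
    TEXT, GAP = 0, 1

    def emit(state, d):
        # A text row is cheap when dense, a gap row is cheap when empty.
        return 1.0 - d if state == TEXT else d

    if not density:
        return []
    cost = [[emit(TEXT, density[0]), emit(GAP, density[0])]]
    back = []
    for d in density[1:]:
        prev = cost[-1]
        row, ptr = [], []
        for s in (TEXT, GAP):
            stay, switch = prev[s], prev[1 - s] + switch_cost
            if stay <= switch:
                row.append(stay + emit(s, d))
                ptr.append(s)
            else:
                row.append(switch + emit(s, d))
                ptr.append(1 - s)
        cost.append(row)
        back.append(ptr)
    # Backtrack from the cheaper final state.
    state = TEXT if cost[-1][TEXT] <= cost[-1][GAP] else GAP
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return ["text" if s == TEXT else "gap" for s in path]


print(viterbi_text_gap([0.0, 0.1, 0.9, 0.8, 0.3, 0.9, 0.0, 0.1]))
```

The switching penalty is what keeps the weakly inked row (density 0.3) inside the surrounding text area instead of producing a spurious one-row gap.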

Learning-Free Text Line Segmentation for Historical Handwritten Documents

Applied Sciences

We present a learning-free method for text line segmentation of historical handwritten document images. This method relies on automatic scale selection together with second derivatives of anisotropic Gaussian filters to detect the blob lines that strike through the text lines. Detected blob lines guide an energy minimization procedure to extract the text lines. Historical handwritten documents contain noise, heterogeneous text line heights, skews and touching characters among text lines. Automatic scale selection allows for automatic adaptation to the heterogeneous nature of handwritten text lines, provided the character height range is correctly estimated. In the extraction phase, the method can accurately split the touching characters among the text lines. We provide results investigating various settings and compare the model with recent learning-free and learning-based methods on the cBAD competition dataset.

Text Line Segmentation In Handwritten Documents Based On Dynamic Weights

Journal of Information Systems, Operations Management, 2013

Identification of text lines in documents, or text line segmentation, is the first step in the process called "text recognition", whose purpose is to extract the text and put it in a more understandable format. The paper proposes a seam carving algorithm as an approach to finding the text lines. This algorithm uses a new method that allocates dynamic weights to every processed pixel in the original image. With this addition, the resulting lines follow the text more accurately. The downside of this technique is the computational time overhead.
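For orientation, the classic seam carving step that such an approach builds on can be sketched as a dynamic program that finds a minimum-energy seam running from the left edge to the right edge of the image. The sketch assumes a fixed energy grid given as lists of numbers; it does not reproduce the paper's dynamic per-pixel weights, and the name `min_horizontal_seam` is hypothetical.

```python
def min_horizontal_seam(energy):
    """Return, for each column, the row index of a minimum-energy left-to-right seam.

    energy: list of rows of non-negative numbers (for example 1 for ink, 0 for paper).
    The seam may move at most one row up or down between neighbouring columns.
    """
    rows, cols = len(energy), len(energy[0])
    cost = [[energy[r][0]] + [0] * (cols - 1) for r in range(rows)]
    back = [[0] * cols for _ in range(rows)]
    for c in range(1, cols):
        for r in range(rows):
            # Cheapest predecessor among the up-to-three neighbours in column c - 1.
            candidates = [(cost[p][c - 1], p) for p in (r - 1, r, r + 1) if 0 <= p < rows]
            best_cost, best_prev = min(candidates)
            cost[r][c] = best_cost + energy[r][c]
            back[r][c] = best_prev
    # Start from the cheapest end point and follow the back-pointers.
    r = min(range(rows), key=lambda i: cost[i][cols - 1])
    seam = [r]
    for c in range(cols - 1, 0, -1):
        r = back[r][c]
        seam.append(r)
    seam.reverse()
    return seam


energy = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
print(min_horizontal_seam(energy))
```

The table is filled once per pixel, so the cost is proportional to the image size; recomputing weights for every processed pixel, as the abstract describes, would add to that, which is consistent with the overhead it mentions.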

Language Independent Text-Line Extraction Algorithm for Handwritten Documents

Text-line extraction in handwritten documents is an important step for document image understanding, and a number of algorithms have been proposed to address this problem. To overcome the limitations of existing algorithms, we develop a text-line extraction algorithm for cursive handwriting. Our method is based on connected components (CCs); however, unlike conventional methods, we analyze strokes and partition under-segmented CCs into normalized ones. Due to this normalization, the proposed method is able to estimate the states of CCs for a range of different languages and writing styles.

I. INTRODUCTION

Text-line extraction in document images is an essential step for various document image processing tasks such as layout analysis and optical character recognition (OCR). Therefore, there has been a lot of research in this area, and a number of algorithms have been proposed for the extraction of text-lines in machine-printed document images. However, text-line extraction in handwritten documents is still considered a challenging problem: the scale and orientation of characters vary spatially, inter-line distances are irregular, and characters may touch across words and/or text-lines.

Handwriting detection is the ability of a computer to receive and interpret intelligible handwritten input from a source. Handwriting recognition is comparatively difficult because different people have different handwriting styles. In optical character recognition, segmentation is a significant phase, and the accuracy of character recognition depends heavily on the accuracy of segmentation: incorrect segmentation leads to incorrect character recognition. The segmentation phase includes text line, word, and character segmentation. Text line detection and separation in digital document images is a challenging task for handwritten document analysis and character recognition. The problem is compounded if the text lines in the image are connected or overlapped. These problems are more common in handwritten documents than in printed documents because handwriting styles vary between individuals. Researchers are continuously working on these problems for different languages.

Because text-line extraction in handwritten documents is an important step for document image understanding, we develop a language-independent text-line extraction algorithm. Most conventional work, however, has focused on specific character sets; that is, conventional algorithms address the variations caused by individual writers by exploiting language-specific features. The situation is worse for Indian scripts, where most characters are connected. On the other hand, character components are placed in a one-dimensional way in cursive Latin-based and Indian scripts, allowing us to develop horizontal bottom-up clustering rules. Our method is based on CCs; however, unlike conventional methods, we analyze strokes and partition under-segmented CCs into normalized ones. Due to this normalization, the proposed method is able to estimate the states of CCs for a range of different languages and writing styles. From the estimated states, we build a cost function whose minimization yields text-lines. We develop an effective CC segmentation method: by partitioning under-segmented CCs into normalized ones, we can estimate states reliably in a variety of documents.

A graph-based approach for segmenting touching lines in historical handwritten documents

International Journal on Document Analysis and Recognition (IJDAR), 2014

Text line segmentation in handwritten documents is an important task in the recognition of historical documents. Handwritten document images contain text-lines with multiple orientations, touching and overlapping characters between consecutive text-lines and different document structures making line segmentation a difficult task. In this paper we present a new approach for handwritten text line segmentation solving the problems of touching components, curvilinear text lines and horizontally-overlapping components. The proposed algorithm formulates line segmentation as finding the central path in the area between two consecutive lines. This is solved as a graph traversal problem. A graph is constructed using the skeleton of the image. Then, a path-finding algorithm is used to find the optimum path between text lines. The proposed algorithm has been evaluated on a comprehensive dataset consisting of five databases: ICDAR2009, ICDAR2013, UMD, the George Washington and the Barcelona Marriages Database. The proposed method outperforms the state of the art considering the different types and difficulties of the benchmarking data.
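The path-finding step can be illustrated with a generic shortest-path search over a small weighted graph. This is a sketch under assumptions: it uses Dijkstra's algorithm as a stand-in, since the abstract does not name the algorithm, and the toy graph, its node names and its edge weights are all hypothetical placeholders for a graph built from the image skeleton.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm on {node: [(neighbor, weight), ...]} with non-negative weights.

    Returns (total_cost, path) or (inf, []) when the target is unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for nxt, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if target not in dist:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path


# Hypothetical graph: nodes stand for skeleton junctions between two text lines,
# and weights stand for how undesirable it is to cut along an edge.
graph = {
    "left": [("a", 1.0), ("b", 4.0)],
    "a": [("c", 2.0)],
    "b": [("right", 1.0)],
    "c": [("right", 1.5)],
}
print(shortest_path(graph, "left", "right"))
```

In such a formulation the segmentation quality depends mostly on how the edge weights are defined, for example by penalizing edges that cross ink or stray from the middle of the inter-line gap.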

Line Segmentation of Handwritten Document Using Pre-Processing Techniques

2014

Preprocessing of the document image is a very important step for handling deformations such as noise and the handwriting complexities that result in baseline skew, word skew, and character skew; accents may also be placed either above or below the text line, and parts of neighboring text lines may be connected. The paper proposes a novel preprocessing technique for handwritten documents that handles some of the deformations usually present in the document, such as touching components, overlapping components, skewed lines, and words with individual skews, and builds a proper text image with all these deformations removed. Based on an analysis of Indian script character shapes and a literature survey, it proposes a new sequence of preprocessing methods. A binarized image is sub-sampled and connected components are extracted. These components are dilated and thinned and are then given to the Hough transform for both global and local skew detection for line extraction. The word segmentation is done with...