A statistical learning approach to document image analysis

Geometric Layout Analysis Techniques for Document Image Understanding: a Review. TR 9703-09

1998

Document Image Understanding (DIU) is a research area with a wide variety of challenging applications. Researchers have worked on this topic for decades, as the scientific literature attests. The main purpose of this report is to describe the current status of DIU, with particular attention to two subprocesses: document skew angle estimation and page decomposition. Several algorithms proposed in the literature are briefly described and organized in a novel classification scheme. Some methods proposed for evaluating page decomposition algorithms are also described. The report critically discusses the current status of the field and its open problems, and offers some considerations on logical layout analysis.
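
As an illustration of the skew estimation subprocess mentioned above, here is a minimal sketch of one common family of techniques, the projection-profile search. It is not taken from the report itself. The function names, the angle range, the step size, and the sum-of-squares score are all assumptions chosen for the example, and `synthetic_lines` is a hypothetical helper that only generates test data.

```python
import math

def synthetic_lines(skew_deg, n_lines=5, spacing=20, width=200):
    """Hypothetical helper: foreground pixels of straight text lines,
    rotated by skew_deg degrees around the origin."""
    t = math.radians(skew_deg)
    points = []
    for k in range(n_lines):
        y0 = k * spacing
        for x0 in range(width):
            points.append((x0 * math.cos(t) - y0 * math.sin(t),
                           x0 * math.sin(t) + y0 * math.cos(t)))
    return points

def estimate_skew(points, max_angle=10.0, step=0.5):
    """Return the candidate angle (degrees) whose horizontal projection
    profile is most sharply peaked."""
    best_angle, best_score = 0.0, -1.0
    n_steps = int(round(2 * max_angle / step))
    for i in range(n_steps + 1):
        angle = -max_angle + i * step
        theta = math.radians(angle)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        profile = {}
        for x, y in points:
            # Row index of the point after rotating it back by the candidate angle.
            row = round(y * cos_t - x * sin_t)
            profile[row] = profile.get(row, 0) + 1
        # Assumed score: sum of squared row counts, which is largest when
        # the pixels of each text line collapse into a single row.
        score = sum(count * count for count in profile.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

if __name__ == "__main__":
    print(estimate_skew(synthetic_lines(3.0)))
```

The search is exhaustive over a small angle grid, which keeps the sketch short; the sign convention of the angle is arbitrary and depends on the image coordinate system in use.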

BINYAS: a complex document layout analysis system

Multimedia Tools and Applications, 2020

Document layout analysis (DLA) is an essential prerequisite for the development of a comprehensive document image processing and analysis system. The main purpose of DLA is to segment an input document image into its constituent, coherent regions and to identify their classes. In this paper, we propose a DLA system, named BINYAS, based on connected component (CC) and pixel analysis. We first identify the regions and then classify them into classes such as paragraph, separator, graphic, image, table, chart, and inverted text. The proposed system is evaluated on four publicly available standard datasets, namely the ICDAR 2009, 2015, 2017 and 2019 page segmentation competition datasets, and its performance is compared with many contemporary methods, including some well-known software products and deep learning based methods. Experimental results show that our method performs significantly better than state-of-the-art methods in terms of the evaluation metrics used by the research community in this domain.
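
To make the connected component (CC) idea concrete, the following is a minimal sketch of generic CC extraction on a binary image, returning one bounding box per component. It is not the BINYAS implementation: the grid representation, the choice of 8-connectivity, and the function name `connected_components` are assumptions made for the example.

```python
from collections import deque

def connected_components(grid):
    """Return bounding boxes (top, left, bottom, right), inclusive, of the
    8-connected regions of 1-valued cells, in row-major discovery order."""
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search from an unvisited foreground cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                cr, cc = queue.popleft()
                top, bottom = min(top, cr), max(bottom, cr)
                left, right = min(left, cc), max(right, cc)
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and grid[nr][nc] == 1 and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            boxes.append((top, left, bottom, right))
    return boxes

if __name__ == "__main__":
    page = [
        [1, 1, 0, 0, 0],
        [1, 1, 0, 0, 1],
        [0, 0, 0, 0, 1],
        [0, 1, 0, 0, 0],
    ]
    print(connected_components(page))
```

A layout analysis system would typically go on to group and classify such boxes by size, position, and pixel content; those later stages are outside the scope of this sketch.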

Evaluation of Geometric Layout Analysis Techniques for Document Image Analysis

International Journal of Computer Applications, 2010

Document Image Analysis (DIA) is a research area with a wide variety of challenging applications. Document analysis decomposes a document image into several consistent items that represent coherent components of the document, such as text lines, photographs, and graphics, without any knowledge of the specific format. A document image is composed of several blocks, each of which represents a coherent component of the document. One coherent component corresponds to a set of text lines with the same typeface and consistent line spacing. The geometric structure refers to the geometric relationships between the blocks. This paper describes the current status of Document Image Analysis and Understanding techniques, with particular attention to the evaluation of geometric layout analysis techniques. The text-based approach and the region-based approach are the two evaluation methods for page decomposition described in this paper.
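
As a rough illustration of what a region-based evaluation can look like, the sketch below compares predicted blocks with ground-truth blocks by rectangle overlap. This is an assumption about the general idea, not the measure defined in the paper: the intersection-over-union score, the 0.5 threshold, and the names `iou` and `matched_fraction` are all chosen for the example.

```python
def area(box):
    """Area of a box given as (left, top, right, bottom), right/bottom exclusive."""
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def matched_fraction(truth, predicted, threshold=0.5):
    """Fraction of ground-truth blocks overlapped by some predicted block
    with IoU at or above the (assumed) threshold."""
    if not truth:
        return 0.0
    hits = sum(1 for t in truth
               if any(iou(t, p) >= threshold for p in predicted))
    return hits / len(truth)

if __name__ == "__main__":
    truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
    predicted = [(0, 0, 10, 9), (100, 100, 110, 110)]
    print(matched_fraction(truth, predicted))
```

A text-based evaluation, by contrast, would compare the recognized text output rather than the block geometry, so it needs an OCR stage that this sketch does not model.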

Geometric Structure Analysis of Document Images: A Knowledge-Based Approach

IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000

Abstract: Geometric structure analysis is a prerequisite to create electronic documents from logical components extracted from document images. This paper presents a knowledge-based method for sophisticated geometric structure analysis of technical journal pages. The proposed knowledge base encodes, in the form of rules, geometric characteristics that are not only common in technical journals but also publication-specific. The method is a hybrid of top-down and bottom-up techniques and consists of two phases: region segmentation and identification. Generally, the result of the segmentation process does not have a one-to-one matching with composite layout components. Therefore, the proposed method identifies nontext objects, such as images, drawings, and tables, as well as text objects, such as text lines and equations, by splitting or grouping segmented regions into composite layout components. Experimental results with 372 images scanned from the IEEE Transactions on Pattern Analysis and Machine Intelligence show that the proposed method performed geometric structure analysis successfully on more than 99 percent of the test images, an impressive result compared with previous work.

Document Layout Analysis and Classification and Its Application in OCR

2006 10th IEEE International Enterprise Distributed Object Computing Conference Workshops (EDOCW'06), 2006

Digitization of paper-bound documents is one of the foremost commercial interests worldwide. The first step in all such applications is transforming a paper-bound document into an electronic document by scanning it and then applying OCR to the image to generate textual information. In this paper we describe our work, which acts as a pre-processing stage for an OCR application. Automatic document layout extraction and segmentation is done using the spatial configuration of various text/image segments represented as bounding boxes; this segmented layout is then analyzed with certain heuristic tests, and each segment is assigned a label (title, authors, abstract, body, header, footer, etc.). This information is then passed on to the OCR module through an XML interface, accelerating its performance by allowing it to label recognized text segments and to process only those parts of the document that contain text, which saves computation. Although the work was motivated by an application to an automated machine translation system that preserves the overall document layout, it has a number of other applications, such as information retrieval and search. This information is also being used to classify technical documents into three categories, which can be extended to any number of classes based on spatial configuration heuristics.
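
The heuristic labeling step described above could be sketched along the following lines. The abstract does not give the actual tests, so every rule and threshold here is a hypothetical placeholder chosen only to show the shape of such a classifier; `label_segment` is likewise an assumed name.

```python
def label_segment(box, page_height):
    """Assign a coarse label to a segment from its bounding box.

    box is (left, top, right, bottom) in pixels with the origin at the
    top-left corner. All thresholds below are illustrative assumptions,
    not values from the paper.
    """
    left, top, right, bottom = box
    height = bottom - top
    if bottom <= 0.08 * page_height:
        return "header"      # lies entirely in the top margin band
    if top >= 0.92 * page_height:
        return "footer"      # lies entirely in the bottom margin band
    if top < 0.25 * page_height and height >= 0.04 * page_height:
        return "title"       # tall block near the top of the page
    return "body"

if __name__ == "__main__":
    for box in [(100, 10, 900, 50), (100, 120, 900, 180),
                (100, 300, 900, 800), (100, 950, 900, 980)]:
        print(box, label_segment(box, page_height=1000))
```

A practical system would need further cues, such as font size, column position, and the order of blocks, to separate labels like authors and abstract, which position alone cannot distinguish.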

Document structure analysis based on layout and textual features

… . of International Workshop on Document …, 2000

Document image processing is a crucial process in office automation; it begins at the 'OCR' phase and faces the difficulties of document 'analysis' and 'understanding'. This paper presents a hybrid and comprehensive approach to document structure analysis. It is hybrid in the sense that it makes use of layout (geometrical) as well as textual features of a given document. These features are the basis for potential conditions, which in turn are used to express fuzzy-matched rules of an underlying rule base. Rules can be formulated based on features observed within one specific layout object, but they can also express dependencies between different layout objects. In addition to its rule-driven analysis, which allows easy adaptation to specific domains with their specific logical objects, the system contains domain-independent markup algorithms for common objects (e.g. lists).

Computer Vision Based Optical Document Layout Analysis: A Compatible Survey

International Journal of Innovative Knowledge Concepts, Vol. 7, Special Issue 1, 2019

Computer vision based document image layout analysis refers to generic algorithms and robust techniques that are applied to images of documents to obtain a computer-readable description from pixel data. A document image analysis algorithm includes Optical Character Recognition (OCR) software that recognizes characters in a scanned document. OCR makes it possible for the user to edit or search the document’s contents. In the field of document image layout analysis, two types of problems may occur: physical and logical analysis of the document. Several heuristics, grammar-based algorithms, and rule-based algorithms are applied here. In this paper we perform an in-depth survey of various layout detection methods that classify the graphical area, paragraph text area, sub-paragraph text area, etc. within a document image without using OCR software.

Adaptive Layout Analysis of Document Images

Lecture Notes in Computer Science, 2002

Layout analysis is the process of extracting a hierarchical structure describing the layout of a page. In the document processing system WISDOM++ the layout analysis is performed in two steps: firstly, the global analysis determines possible areas containing paragraphs, sections, columns, figures and tables, and secondly, the local analysis groups together blocks that possibly fall within the same area. The result of the local analysis process strongly depends on the quality of the results of the first step. In this paper we investigate the possibility of supporting the user during the correction of the results of the global analysis. This is done by allowing the user to correct the results of the global analysis and then by learning rules for layout correction from the sequence of user actions. Experimental results on a set of multi-page documents are reported.