Document processing Research Papers - Academia.edu

We consider the generic hypermedia structure of a document to be a means of representing the document that allows it to be processed into a wide variety of presentations. Representing a document in this manner requires additional specification and resources to render it into any presentation. In this paper we discuss the relationship between the generic hypermedia structure of documents

This article presents a new method for the binarization of color document images. Initially, the colors of the document image are reduced to a small number using a new color reduction technique. Specifically, this technique estimates the dominant colors and then assigns the original image colors to them so that the background and text components become uniform. Each dominant color defines a color plane in which the connected components (CCs) are extracted. Next, in each color plane, a CC filtering procedure is applied, followed by a grouping procedure. At the end of this stage, blocks of CCs are constructed, which are then redefined by obtaining the direction of connection (DOC) property for each CC. Using the DOC property, the blocks of CCs are classified as text or nontext. The identified text blocks are binarized using suitable binarization techniques, with the rest of the pixels treated as background. The final result is a binary image that always contains black characters on a white background, independently of the original colors of each text block. The proposed document binarization approach can also be used for the binarization of noisy color (or gray-scale) document images. Several experiments that confirm the effectiveness of the proposed technique are presented. © 2007 Wiley Periodicals, Inc. Int J Imaging Syst Technol, 16, 262–274, 2006
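
The first two stages described in this abstract (assigning pixels to dominant colors, then extracting connected components in each color plane) can be illustrated with a minimal sketch. It is not the authors' method: the palette is assumed to be already estimated, nearest-color assignment uses plain squared RGB distance, components use 4-connectivity, and the function names `assign_to_dominant` and `connected_components` are hypothetical.

```python
from collections import deque


def assign_to_dominant(image, palette):
    """Map every RGB pixel to the index of its nearest palette colour.

    Assumption: `palette` already holds the estimated dominant colours;
    the abstract does not say how they are estimated.
    """
    def nearest(pixel):
        return min(
            range(len(palette)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(pixel, palette[i])),
        )

    return [[nearest(pixel) for pixel in row] for row in image]


def connected_components(labels, plane):
    """Return the 4-connected components of one colour plane.

    Each component is a list of (row, col) positions whose label equals `plane`.
    """
    rows, cols = len(labels), len(labels[0])
    seen = set()
    components = []
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] != plane or (r, c) in seen:
                continue
            # Breadth-first flood fill starting from an unvisited pixel.
            queue, component = deque([(r, c)]), []
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and labels[ny][nx] == plane and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            components.append(component)
    return components
```

The later stages (CC filtering, grouping into blocks, the DOC property, and text/nontext classification) are not sketched here because the abstract does not specify them.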

Multiple expert decision combination has received much attention in recent years. This is a multi-disciplinary branch of pattern recognition which has extensive applications in numerous fields including robotic vision, artificial intelligence, document processing, office automation, human-computer interfaces, data acquisition, storage and retrieval, etc. In recent years, this application area has been extended to forensic science, including the identification of individuals using measures depending on biometrics, security and other applications. In this paper, a generalised multi-expert multi-level decision combination strategy, the serial combination approach, has been investigated from the dual viewpoints of theoretical analysis and practical implementation. Different researchers have implicitly utilised various approaches based on this concept over the years in a wide spectrum of application domains, but a comprehensive, coherent and generalised presentation of this approach from both theoretical and implementation viewpoints has not been attempted. While presenting here a unified framework for serial multiple expert decision combination, it is shown that many multi-expert approaches reported in the literature can be easily represented within the proposed framework. Detailed theoretical and practical discussions of the various performance results with these combinations, analysis of the internal processing of this approach, a case study for testing the theoretical framework, issues relating to processing overheads associated with the implementation of this approach, general comments on its applicability to various task domains and the generality of the approach in terms of reevaluating previous research have also been incorporated.
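
As a rough illustration of what a serial (cascaded) combination of experts can look like, the sketch below passes a sample through experts one after another. It is only one possible reading, not the framework from the paper: the acceptance threshold, the rule of keeping the top half of the candidate classes between levels, and the name `serial_combine` are all assumptions made for this example.

```python
def serial_combine(sample, experts, classes, threshold=0.9):
    """Run experts in sequence; stop at the first sufficiently confident one.

    Each expert is a callable (sample, candidates) -> {class: score}.
    Assumption: between levels, the candidate set is narrowed to the
    top-scoring half, so later experts solve a smaller problem.
    Returns (decided_class, index_of_deciding_expert).
    """
    if not experts:
        raise ValueError("at least one expert is required")
    candidates = list(classes)
    for level, expert in enumerate(experts):
        scores = expert(sample, candidates)
        best = max(scores, key=scores.get)
        is_last = level == len(experts) - 1
        if scores[best] >= threshold or is_last:
            return best, level
        # Pass only the most plausible classes on to the next expert.
        ranked = sorted(scores, key=scores.get, reverse=True)
        candidates = ranked[: max(1, len(ranked) // 2)]
```

In this arrangement a cheap first expert can settle easy samples early, and more expensive experts see only the harder samples with a reduced set of candidate classes.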

A simple text processing tool which allows positioning of lines within a document is presented using the formal specification language Z. Implementation details such as the use of tab characters and newline sequences are covered. The program has been implemented under the UNIX operating system. It is hoped that the use of similar techniques will become widespread in the field of software engineering.

The ACL Anthology is a digital archive of conference and journal papers in natural language processing and computational linguistics. Its primary purpose is to serve as a reference repository of research results, but we believe that it can also be an object of study and a platform for research in its own right. We describe an enriched and standardized reference corpus derived from the ACL Anthology that can be used for research in scholarly document processing. This corpus, which we call the ACL Anthology Reference Corpus ( ...

Current solutions for providing access to electronic documents while away from the office do not meet the special needs of mobile document workers. We describe "Satchel," a system that is designed specifically to support the distinctive features of mobile document work. Satchel is designed to meet the following five high-level design goals: (1) easy access to document services; (2) timely document access; (3) streamlined user interface; (4) ubiquity; and (5) compliance with security policies. Our current prototype uses a Nokia 9000 Communicator as the mobile device; it communicates with the rest of the Satchel system using wireless communications, both infrared and radio. A fundamental Satchel concept is the use of tokens, or small secure references, to represent documents on the mobile device. The mobile client only transmits small tokens over the wireless channels, leaving the wired network to transmit the contents of documents when, and only when, they are required. Another fundamental...

Malay Document Analysis and Recognition aims to extract digital Malay documents automatically. These extracted documents are presented in the form of articles, newspapers and magazines. Over the years, Malay digital documents have increased and been published on ...

Keywords: layout, functional programming. Highly customised variable-data documents make automatic layout of the resulting publication hard. Architectures for defining and processing such documents can benefit if the repertoire of layout methods available can be extended smoothly and easily to accommodate new styles of customisation. The Document Description Framework incorporates a model for declarative document layout and processing where documents are treated as functional programs. A canonical XML tree contains nodes describing layout instructions which will modify and combine their children component parts to build sections of the final presentation. Leaf components such as images, vector graphic fragments and text blocks are 'rendered' to make consistent graphical atoms. These parts are then processed by layout agents, described and parameterised by their parent nodes, which can range from simple layouts like translations, flows, encapsulations and tables through to highly com...
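
The idea of treating a document as a functional program, where parent nodes are layout functions applied to their children, can be illustrated with a small sketch. It does not reproduce the Document Description Framework or its XML format: the `Atom` record and the `leaf`, `translate` and `flow` helpers are hypothetical names, and `flow` is assumed here to mean simple left-to-right placement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """A rendered graphical atom: a named box at a position."""
    name: str
    x: float
    y: float
    width: float
    height: float


def leaf(name, width, height):
    """Leaf node: 'renders' to a single atom at the origin."""
    return lambda: [Atom(name, 0.0, 0.0, width, height)]


def translate(dx, dy, child):
    """Layout node: shift every atom produced by the child."""
    return lambda: [
        Atom(a.name, a.x + dx, a.y + dy, a.width, a.height) for a in child()
    ]


def flow(*children):
    """Layout node: place the children side by side, left to right."""
    def layout():
        placed, cursor = [], 0.0
        for child in children:
            atoms = child()
            # Horizontal extent of this child decides where the next one starts.
            extent = max((a.x + a.width for a in atoms), default=0.0)
            placed += [
                Atom(a.name, a.x + cursor, a.y, a.width, a.height) for a in atoms
            ]
            cursor += extent
        return placed
    return layout


# The document is an expression tree; calling it evaluates the layout.
document = translate(10, 5, flow(leaf("image", 40, 30), leaf("text", 100, 30)))
atoms = document()
```

Because every node is just a function from children to atoms, a new layout style can be added as one more function without changing the existing ones, which matches the extensibility argument made in the abstract.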