Page Segmentation Research Papers - Academia.edu

Text information in scanned documents becomes accessible only when extracted and interpreted by a text recognizer. For a recognizer to work successfully, it must have detailed location information about the regions of the document images that it is asked to analyze. It needs to focus on page regions that contain text, skipping non-text regions such as illustrations or photographs. However, text recognizers do not work as logical analyzers. Logical layout analysis automatically determines the function of a document text region, that is, it labels each region as a title, paragraph, caption, and so on, and is thus an essential part of a document understanding system. In the past, rule-based algorithms were used to conduct logical layout analysis on data sets of limited size. We instead focus here on supervised learning methods for logical layout analysis. We describe LABA, a system based on multiple support vector machines that performs logical Layout Analysis of scanned Book pages in Arabic. The system detects the function of a text region based on the analysis of various image features and a voting mechanism. For a baseline comparison, we implemented an older but state-of-the-art neural network method. We evaluated LABA on a data set of scanned pages from illustrated Arabic books and obtained high recall and precision values. We also found that the F-measure of LABA is higher for five of the six tested classes compared to the state-of-the-art method.
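
The multi-SVM voting idea described above can be sketched with scikit-learn. This is a minimal illustration only: the feature set, the class labels, and all parameters are assumptions, not the actual LABA configuration.

```python
# Minimal sketch of multi-SVM voting for logical region labels.
# Features and classes are illustrative assumptions, not LABA's own.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed label set for text regions.
CLASSES = ["title", "paragraph", "caption", "header", "footnote", "page_number"]

# Toy feature vectors per region (e.g. normalized position, width,
# height, mean text height) -- purely illustrative.
X = np.random.rand(300, 5)
y = np.random.randint(len(CLASSES), size=300)

# Several SVMs with different kernels vote on the region label.
voter = VotingClassifier(
    estimators=[
        ("rbf", make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))),
        ("poly", make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3))),
        ("linear", make_pipeline(StandardScaler(), SVC(kernel="linear"))),
    ],
    voting="hard",  # majority vote over the per-SVM predictions
)
voter.fit(X, y)
print(CLASSES[voter.predict(X[:1])[0]])
```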

Datasets of documents in Arabic are urgently needed to promote computer vision and natural language processing research that addresses the specifics of the language. Unfortunately, publicly available Arabic datasets are limited in size and restricted to certain document domains. This paper presents the release of BE-Arabic-9K, a dataset of more than 9,000 high-quality scanned images from over 700 Arabic books. Among these, 1,500 images have been manually segmented into regions and labeled by their functionality. BE-Arabic-9K includes book pages with a wide variety of complex layouts and page contents, making it suitable for various document layout analysis and text recognition research tasks. The paper also presents a page layout segmentation and text extraction baseline model based on a fine-tuned Faster R-CNN architecture (FFRA). This baseline model yields cross-validation results with an average accuracy of 99.4% and an F1 score of 99.1% for text versus non-text block classification on the 1,500 annotated images of BE-Arabic-9K. These results are remarkably better than those of the state-of-the-art Arabic book page segmentation system ECDP. FFRA also outperforms three other prior systems when tested on a competition benchmark dataset, making it an outstanding baseline model to challenge.
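
The FFRA baseline builds on Faster R-CNN; a minimal sketch of fine-tuning torchvision's implementation for text versus non-text block detection might look as follows. The class count and training details are assumptions, not the published configuration.

```python
# Minimal sketch: adapt torchvision's COCO-pretrained Faster R-CNN to
# text/non-text block detection. Exact FFRA training setup is unknown.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Load a pretrained detector and replace its box-classification head.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
num_classes = 3  # assumed: background + text block + non-text block
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Training would then proceed with the standard torchvision detection
# loop over the annotated pages (images plus box/label targets).
```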

Successful physical layout analysis (PLA) is a key factor in the performance of text recognizers and many other applications. PLA solutions for scanned Arabic documents are few and difficult to compare due to differences in methods, data, and evaluation metrics. To help evaluate the performance of recent Arabic PLA solutions, the ASAR 2018 Competition on Physical Layout Analysis was organized. This paper presents the results of this competition. The competition focused on analyzing layouts of scanned Arabic book pages (SAB). PLA-SAB required solutions to two tasks: page-to-block segmentation and block text/non-text classification. In this paper, we briefly describe the methods provided by the participating teams, present their results for both tasks using the BCEArabic benchmarking dataset [1], and make an open call for continued participation outside the context of ASAR 2018.
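
One common way to score the page-to-block segmentation task is to match predicted blocks against ground-truth blocks by intersection over union (IoU); a minimal sketch is below. The 0.5 threshold is an assumption, and the competition's official metric may differ.

```python
# Minimal sketch of IoU-based block matching for segmentation scoring.
# The 0.5 threshold is an assumption, not the official ASAR 2018 metric.
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def match_blocks(pred, gt, thresh=0.5):
    """Greedily count predicted blocks that overlap an unused ground-truth block."""
    matched, used = 0, set()
    for p in pred:
        best = max(((iou(p, g), i) for i, g in enumerate(gt) if i not in used),
                   default=(0.0, None))
        if best[0] >= thresh:
            matched += 1
            used.add(best[1])
    return matched  # feed into precision/recall over pred/gt counts
```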

... [8] Panče Panov, Sašo Džeroski, Larisa N. Soldatova, “OntoDM: An Ontology of Data Mining”, 2008 IEEE International Conference on Data Mining Workshops. [9] Francesco Bonchi, Carlos Castillo, “Topical Query Decomposition”, Yahoo! Research Barcelona, Spain. ...

... This work was done when Jalal Mahmud was with the CS Department at SUNY Stony Brook. His current affiliation is with IBM Almaden Research Center, 650 Harry Rd, San Jose, CA; email: jumahmud@us.ibm.com. Copyright is held by the International World Wide Web ...

The paper presents open-source generalized models for recognition and page segmentation of medieval Hebrew manuscripts in square script, intended for use on the eScriptorium platform or the kraken OCR engine, that achieve a character accuracy of more than 97% on the validation set. It also presents a dataset consisting of 202 pages from almost 100 different literary manuscripts with layout annotation (regions and lines) as well as transcription. The manuscript pages are sourced from material of different script types and geographical and chronological origins. In addition, we describe the bootstrapping procedure that enabled us to create most of the dataset automatically through text-image alignment.
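
For orientation, a minimal sketch of applying such models with kraken's Python API is shown below; the model filename is hypothetical, and the calls follow kraken's documented segmentation and recognition interface.

```python
# Minimal sketch of segmenting and recognizing a manuscript page with
# kraken. The model filename is a hypothetical placeholder.
from PIL import Image
from kraken import blla, rpred
from kraken.lib import models

im = Image.open("manuscript_page.png")  # placeholder input image

# Trainable baseline segmentation (regions and lines).
seg = blla.segment(im)

# Line-by-line recognition with a trained model.
model = models.load_any("hebrew_square.mlmodel")  # hypothetical filename
for record in rpred.rpred(model, im, seg):
    print(record.prediction)
```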

A novel page segmentation algorithm is presented in this paper. Based on the extraction of the background, it offers the benefit of being adaptive to the context of the document and insensitive to the orientation of the text blocks. It involves a two-dimensional isotropic structuring element used to characterize the white streams. This element is a disk approximated ...
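
A minimal sketch of this background-extraction idea using OpenCV is shown below. The elliptical kernel approximates the disk, and the kernel size is an assumption tied to the scan resolution, not a value from the paper.

```python
# Minimal sketch: extract the wide white streams between text blocks
# with a disk-shaped structuring element. Kernel size is an assumption.
import cv2

page = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)  # placeholder input
_, binary = cv2.threshold(page, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

# Disk approximated by an elliptical kernel; isotropy makes the result
# insensitive to the orientation of the text blocks.
disk = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (15, 15))

# Opening the white background keeps only streams wide enough to admit
# the disk, i.e. the separators between blocks; narrow inter-word and
# inter-line gaps are removed.
background = cv2.morphologyEx(binary, cv2.MORPH_OPEN, disk)
cv2.imwrite("white_streams.png", background)
```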

In this paper, a preprocessing model for handwritten Arabic text based on Voronoi diagrams (VDs) is presented and discussed. The proposed VD-based preprocessing model consists of five stages: a preparatory stage, page segmentation, thinning, baseline estimation, and slant correction. In the preparatory stage, the text image is converted via VDs into a group of geometrical forms consisting of edges and vertices that are used by the other stages of the proposed model. This stage consists of four main processes: binarization, edge extraction and contour tracking, sampling, and point-VD construction. The second stage is page segmentation based on the VD area. In the third stage, an efficient method for text structuring (that is, thinning) is presented. In the fourth stage, a novel VD-based baseline estimation method is presented. In the fifth stage, an efficient technique for slant detection and correction is proposed and discussed.
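
As a minimal sketch of the preparatory stage's point-VD construction, contours can be traced with OpenCV, sampled, and fed to SciPy's Voronoi builder; the sampling stride is an assumption.

```python
# Minimal sketch: binarize, trace contours, sample points, and build
# the point Voronoi diagram. The sampling stride k is an assumption.
import cv2
import numpy as np
from scipy.spatial import Voronoi

img = cv2.imread("handwritten.png", cv2.IMREAD_GRAYSCALE)  # placeholder
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

# Edge extraction / contour tracking, then sampling every k-th point.
contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
k = 5  # assumed sampling stride
points = np.vstack([c.reshape(-1, 2)[::k] for c in contours if len(c) > k])

# Point Voronoi diagram; its edges and vertices would feed the later
# stages (segmentation by VD area, thinning, baseline estimation).
vd = Voronoi(points)
print(len(vd.vertices), "Voronoi vertices,", len(vd.ridge_vertices), "edges")
```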

A new web content structure analysis based on visual representation is proposed in this paper. Many web applications, such as information retrieval, information extraction, and automatic page adaptation, can benefit from this structure. This paper presents an automatic top-down, tag-tree-independent approach to detecting web content structure. It simulates how a user understands web layout structure based on their visual perception. Compared to other existing techniques, our approach is independent of the underlying document representation, such as HTML, and works well even when the HTML structure differs greatly from the layout structure. Experiments show satisfactory results.
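
One simple visual cue such an analysis can exploit is whitespace between blocks in the rendered page; the sketch below detects horizontal separator bands with projection profiles. This illustrates the "visual rather than tag-tree" idea only; the paper's actual block and separator model is richer.

```python
# Minimal sketch: find horizontal whitespace separators in a rendered
# page image via projection profiles. Input filename is a placeholder.
import cv2

render = cv2.imread("rendered_page.png", cv2.IMREAD_GRAYSCALE)
_, ink = cv2.threshold(render, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

# Rows with no ink are candidate horizontal separators between blocks.
row_profile = ink.sum(axis=1)
blank = row_profile == 0

# Group consecutive blank rows into separator bands (top, bottom).
separators, start = [], None
for y, is_blank in enumerate(blank):
    if is_blank and start is None:
        start = y
    elif not is_blank and start is not None:
        separators.append((start, y - 1))
        start = None
if start is not None:
    separators.append((start, len(blank) - 1))

print(separators)  # split the page into visual blocks at these bands
```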