Schema-guided wrapper maintenance for web-data extraction
Related papers
Roadrunner: Towards automatic data extraction from large web sites
Proceedings of the international …, 2001
The paper investigates techniques for extracting data from HTML sites through the use of automatically generated wrappers. To automate wrapper generation and the data extraction process, it develops a novel technique that compares HTML pages and generates a wrapper based on their similarities and differences. Experimental results on real-life, data-intensive Web sites confirm the feasibility of the approach.
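The page-comparison idea can be pictured with a minimal sketch, which is not RoadRunner's actual matching algorithm: Python's difflib aligns the token streams of two pages, shared tokens are kept as template, and divergent spans become data slots. The tokenizer and the {#DATA#} placeholder are illustrative choices.

import difflib
import re

def tokenize(html: str) -> list[str]:
    # Split an HTML string into tag and text tokens (deliberately simplistic).
    return [t for t in re.split(r"(<[^>]+>)", html) if t.strip()]

def infer_template(page_a: str, page_b: str) -> list[str]:
    a, b = tokenize(page_a), tokenize(page_b)
    template = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
        if op == "equal":
            template.extend(a[i1:i2])      # invariant HTML becomes template
        else:
            template.append("{#DATA#}")    # divergent span becomes a data slot
    return template

if __name__ == "__main__":
    p1 = "<ul><li>Title: <b>Dune</b></li><li>Price: <b>9.99</b></li></ul>"
    p2 = "<ul><li>Title: <b>Neuromancer</b></li><li>Price: <b>12.50</b></li></ul>"
    print(" ".join(infer_template(p1, p2)))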
Automatic Extraction of Complex Web Data
PACIS 2006 Proceedings, 2006
WTM, a new wrapper induction algorithm for generating rules that describe the general layout template of a web page, is presented. WTM is designed mainly for use in a weblog crawling and indexing system. Most weblogs are maintained by content management systems and share a similar layout structure across all of their pages. In addition, they provide RSS feeds describing the latest entries, and these entries also appear in HTML format on the weblog homepage. WTM is built upon these two observations. It uses the RSS feed data to automatically label the corresponding HTML file (the weblog homepage) and induces general template rules from the labeled page. The rules can then be used to extract data from other pages with a similar layout template. WTM has been tested on a selection of weblogs and the results are satisfactory.
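A rough sketch of the RSS-based labelling step that WTM builds on (not the published induction algorithm itself) might look as follows. It assumes the third-party feedparser and BeautifulSoup libraries, matches RSS entry titles against the homepage DOM, and records the tag paths where they occur as candidate rules.

import feedparser                    # third-party: pip install feedparser
from bs4 import BeautifulSoup        # third-party: pip install beautifulsoup4

def dom_path(tag) -> str:
    # Return a simple tag path such as html/body/div/h2 for a bs4 Tag.
    parts = []
    while tag is not None and tag.name not in (None, "[document]"):
        parts.append(tag.name)
        tag = tag.parent
    return "/".join(reversed(parts))

def label_homepage(rss_url: str, homepage_html: str) -> list[str]:
    feed = feedparser.parse(rss_url)
    soup = BeautifulSoup(homepage_html, "html.parser")
    paths = []
    for entry in feed.entries:
        # Locate the homepage element whose text contains the RSS entry title.
        hit = soup.find(string=lambda s: s and entry.title in s)
        if hit is not None:
            paths.append(dom_path(hit.parent))
    return paths   # candidate "entry title" locations, e.g. html/body/div/h2/a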
Automatic Wrappers for Large Scale Web Extraction
Proceedings of The Vldb Endowment, 2011
We present a generic framework to make wrapper induction algorithms tolerant to noise in the training data. This enables us to learn wrappers in a completely unsupervised manner from automatically and cheaply obtained noisy training data, e.g., using dictionaries and regular expressions. By removing the site-level supervision that wrapper-based techniques require, we are able to perform information extraction at web scale, with accuracy unattained by existing unsupervised extraction techniques. Our system is used in production at Yahoo! and powers live applications.
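One way to picture the cheap, noisy annotation such a framework starts from is the sketch below, which is an illustration of the idea rather than the paper's framework: a dictionary-style regular expression (here a hypothetical price pattern) labels text nodes across sample pages, and a simple majority vote over generalized DOM paths filters out spurious matches. lxml is assumed.

import re
from collections import Counter
from lxml import html as lxml_html   # third-party: pip install lxml

PRICE_RE = re.compile(r"\$\d+(?:\.\d{2})?")   # cheap, noisy "price" annotator

def generalized_path(node) -> str:
    # Tag/class path without positions, so matches from different pages can
    # vote for the same template location.
    parts = []
    while node is not None and isinstance(node.tag, str):
        cls = node.get("class")
        parts.append(node.tag + (f".{cls}" if cls else ""))
        node = node.getparent()
    return "/".join(reversed(parts))

def learn_price_location(pages: list[str]) -> str | None:
    votes = Counter()
    for page in pages:
        tree = lxml_html.fromstring(page)
        for node in tree.iter():
            if isinstance(node.tag, str) and node.text and PRICE_RE.search(node.text):
                votes[generalized_path(node)] += 1
    # Majority filtering: spurious matches are outvoted by the structurally
    # consistent ones produced by the site template.
    return votes.most_common(1)[0][0] if votes else None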
A Supervised Visual Wrapper Generator for Web-Data Extraction
2003
Extracting data from Web pages using wrappers is a fundamental problem arising in a large variety of applications of vast practical interest. In this paper, we propose a novel schema-guided approach to wrapper generation. We provide a user-friendly interface that allows users to define the schema of the data to be extracted and to specify mappings from an HTML page to the target schema. Based on these mappings, the system automatically generates an extraction rule to extract data from the page. Our approach can significantly reduce the human effort involved in wrapper generation: the user never has to deal with the internal extraction rules or even be familiar with the details of HTML.
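A minimal sketch of the schema-to-page mapping idea, assuming lxml and with the schema fields and XPath expressions standing in for selections made through a visual interface, could look like this:

from lxml import html as lxml_html   # third-party: pip install lxml

SCHEMA_MAPPING = {                    # target schema attribute -> page location
    "title": "//div[@class='item']/h2/text()",
    "price": "//div[@class='item']/span[@class='price']/text()",
}

def extract(page_html: str, mapping: dict[str, str]) -> list[dict[str, str]]:
    tree = lxml_html.fromstring(page_html)
    columns = {field: tree.xpath(xpath) for field, xpath in mapping.items()}
    # Zip the per-field value lists into one record per item on the page
    # (assumes every item exposes every field, in document order).
    return [dict(zip(columns, row)) for row in zip(*columns.values())]

if __name__ == "__main__":
    page = ("<html><body>"
            "<div class='item'><h2>Dune</h2><span class='price'>9.99</span></div>"
            "<div class='item'><h2>Neuromancer</h2><span class='price'>12.50</span></div>"
            "</body></html>")
    print(extract(page, SCHEMA_MAPPING))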
Finding and Extracting Data Records from Web Pages
Journal of Signal Processing Systems, 2008
Many HTML pages are generated by software programs that query some underlying database and then fill in a template with the data. In these situations the meta-information about the data structure is lost, so automated programs cannot process the data as powerfully as information stored in databases. We propose a set of novel techniques for detecting structured records in a web page and extracting the data values that constitute them. Our method requires only one input page. It starts by identifying the data region of interest in the page. This region is then partitioned into records using a clustering method that groups similar subtrees of the page's DOM tree. Finally, the attributes of the data records are extracted using a method based on multiple string alignment. We have tested our techniques with a large number of real web sources, obtaining high precision and recall values.
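The record-partitioning step can be sketched roughly as follows, with difflib's sequence similarity standing in for the paper's clustering and multiple-string-alignment machinery; lxml and the 0.8 threshold are assumptions.

import difflib
from lxml import html as lxml_html   # third-party: pip install lxml

def tag_sequence(node) -> list[str]:
    # Flatten a subtree into its sequence of element tags.
    return [el.tag for el in node.iter() if isinstance(el.tag, str)]

def group_records(region, threshold: float = 0.8) -> list[list]:
    # Greedy single-pass grouping of the region's child subtrees by the
    # similarity of their tag sequences.
    clusters: list[list] = []
    for child in region:
        seq = tag_sequence(child)
        for cluster in clusters:
            ref = tag_sequence(cluster[0])
            if difflib.SequenceMatcher(a=ref, b=seq).ratio() >= threshold:
                cluster.append(child)
                break
        else:
            clusters.append([child])
    return clusters

if __name__ == "__main__":
    page = ("<html><body><ul id='results'>"
            "<li><a>Dune</a> <span>9.99</span></li>"
            "<li><a>Neuromancer</a> <span>12.50</span></li>"
            "<li>2 results</li>"
            "</ul></body></html>")
    region = lxml_html.fromstring(page).xpath("//ul[@id='results']")[0]
    print([len(c) for c in group_records(region)])   # the two item rows cluster together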
Web wrapper induction: a brief survey
Ai Communications, 2004
Nowadays several companies use the information available on the Web for a number of purposes. However, since most of this information is only available as HTML documents, several techniques that allow information to be automatically extracted from the Web have recently been defined. In this paper we review the main techniques and tools for extracting information available on the Web, devising a taxonomy of existing systems. In particular, we emphasize the advantages and drawbacks of the analyzed techniques from a user's point of view.
Using Clustering and Edit Distance Techniques for Automatic Web Data Extraction
Lecture Notes in Computer Science
Many web sources provide access to an underlying database containing structured data. These data can usually be accessed only in HTML form, which makes it difficult for software programs to obtain them in structured form. Nevertheless, web sources usually encode data records using a consistent template or layout, and the implicit regularities in the template can be used to automatically infer the structure and extract the data. In this paper, we propose a set of novel techniques to address this problem. While several previous works have addressed the same problem, most of them require multiple input pages, whereas our method requires only one. In addition, previous methods make assumptions about how data records are encoded into web pages that do not always hold in real websites. Finally, we have tested our techniques with a large number of real web sources and found them to be very effective.
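For illustration, the kind of edit distance such clustering builds on can be written as a textbook Levenshtein distance over tag sequences (not the paper's exact measure):

def edit_distance(a: list[str], b: list[str]) -> int:
    # Standard dynamic-programming Levenshtein distance over tag sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute x -> y
        prev = cur
    return prev[-1]

if __name__ == "__main__":
    rec1 = ["tr", "td", "a", "td", "span"]
    rec2 = ["tr", "td", "a", "td", "b", "span"]
    print(edit_distance(rec1, rec2))   # 1: close enough to fall in one cluster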
Semantic Wrappers for Semi-Structured Data Extraction
2008
In this paper, we propose an approach to extract information from HTML pages and to add semantic (XML) tags to them. Wrapping is an essential technique for automatically extracting information from Web sources. This paper describes both a general rule-based approach that can be used to automatically generate wrappers and an assistant wrapper generator called WebMantic. We also provide experimental results showing that both the rule generation process and the preprocessing task are fast and reliable.
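The tagging step can be pictured with a small sketch that re-emits extracted field values as semantically tagged XML; the element names are illustrative and do not reflect WebMantic's rule format.

import xml.etree.ElementTree as ET

def to_semantic_xml(records: list[dict[str, str]]) -> str:
    # Re-emit extracted field values as semantically tagged XML.
    root = ET.Element("items")
    for record in records:
        item = ET.SubElement(root, "item")
        for field, value in record.items():
            ET.SubElement(item, field).text = value   # e.g. <title>Dune</title>
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    print(to_semantic_xml([{"title": "Dune", "price": "9.99"}]))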
Optimizing Web Extraction Queries for Robustness
The World Wide Web organizes information in semi-structured HTML documents. For a template-based web page that contains a list of items, an information schema can be inferred and structured data can be extracted with a query, i.e. a (web) wrapper. When the user interface is updated, the document structure changes and the wrapper tends to break. Extracting structured data from multiple data records of a single web page in a robust way is the subject of this project. Current state-of-the-art methods make it possible to detect template-based data regions on a page, to extract data from a single data record in a robust way, and to repeatedly extract data from multiple records. In this thesis, we combine these three ideas into a new method that builds a robust record-level wrapper from a single user-annotated web page. We have designed, implemented, and tested the algorithm. Experimental results on a large number of web pages from multiple domains show that the proposed approach achieves high precision and runs within reasonable execution time on commodity hardware.
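One robustness tactic common in this line of work can be sketched as follows: build an XPath for an annotated node that anchors on stable attributes (id, class) rather than positional indices, so small layout changes are less likely to break the wrapper. The heuristic below is an illustrative assumption (lxml assumed), not the thesis's algorithm.

from lxml import html as lxml_html   # third-party: pip install lxml

def robust_xpath(node) -> str:
    # Walk from the annotated node towards the root, anchoring each step on
    # stable attributes instead of positional indices where possible.
    steps = []
    while node is not None and isinstance(node.tag, str):
        if node.get("id"):
            steps.append(f"//{node.tag}[@id='{node.get('id')}']")
            break                                    # an id is a strong anchor
        if node.get("class"):
            steps.append(f"{node.tag}[@class='{node.get('class')}']")
        else:
            steps.append(node.tag)
        node = node.getparent()
    path = "/".join(reversed(steps))
    return path if path.startswith("//") else "/" + path

if __name__ == "__main__":
    page = ("<html><body><div id='results'>"
            "<div class='record'><span class='price'>9.99</span></div>"
            "</div></body></html>")
    node = lxml_html.fromstring(page).xpath("//span")[0]   # the "annotation"
    print(robust_xpath(node))
    # //div[@id='results']/div[@class='record']/span[@class='price']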