Wrapping Web Information Providers by Transducer Induction

A hierarchical approach to wrapper induction

… of the third annual conference on …, 1999

With the tremendous amount of information that becomes available on the Web on a daily basis, the ability to quickly develop information agents has become a crucial problem.

Hierarchical wrapper induction for semistructured information sources

Autonomous Agents and Multi-Agent …, 2001

With the tremendous amount of information that becomes available on the Web on a daily basis, the ability to quickly develop information agents has become a crucial problem. A vital component of any Web-based information agent is a set of wrappers that can extract the relevant data from semistructured information sources. Our novel approach to wrapper induction is based on the idea of hierarchical information extraction, which turns the hard problem of extracting data from an arbitrarily complex document into a series of simpler extraction tasks. We introduce an inductive algorithm, STALKER, that generates high-accuracy extraction rules based on user-labeled training examples. Labeling the training data represents the major bottleneck in using wrapper induction techniques, and our experimental results show that STALKER requires up to two orders of magnitude fewer examples than other algorithms. Furthermore, STALKER can wrap information sources that could not be wrapped by existing inductive techniques.
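The idea of hierarchical extraction described in this abstract can be illustrated with a minimal Python sketch. It is not the STALKER algorithm, which learns its rules from labeled examples; here the rules are written by hand, and the page layout, the landmark strings, and the names `skip_to`, `extract`, and `extract_restaurant` are all hypothetical. The sketch only shows how one hard extraction problem splits into simpler ones: first the top-level fields, then the list, then the fields inside each list item.

```python
def skip_to(text, landmarks, start=0):
    """Consume each landmark in order; return the index just past the last one, or -1."""
    pos = start
    for mark in landmarks:
        found = text.find(mark, pos)
        if found == -1:
            return -1
        pos = found + len(mark)
    return pos


def extract(text, start_landmarks, end_mark):
    """Return the text between the last start landmark and the next end_mark, or None."""
    begin = skip_to(text, start_landmarks)
    if begin == -1:
        return None
    end = text.find(end_mark, begin)
    if end == -1:
        return None
    return text[begin:end]


def extract_restaurant(page):
    """Hypothetical hand-written rules for an assumed page layout (not learned)."""
    # Level 1: fields of the whole document.
    name = extract(page, ["Name:", "<b>"], "</b>")
    listing = extract(page, ["<ul>"], "</ul>")

    # Level 2: split the list into items; level 3: extract fields within each item.
    addresses = []
    if listing is not None:
        for chunk in listing.split("<li>")[1:]:
            item = chunk.split("</li>")[0]
            addresses.append({
                "street": extract(item, ["<i>"], "</i>"),
                "phone": extract(item, ["Phone:", "<tt>"], "</tt>"),
            })
    return {"name": name, "addresses": addresses}


page = (
    "<p>Name: <b>Yala</b></p><ul>"
    "<li><i>12 Pico St.</i> Phone: <tt>555-1234</tt></li>"
    "<li><i>9 Main Ave.</i> Phone: <tt>555-9876</tt></li>"
    "</ul>"
)
record = extract_restaurant(page)
print(record)
```

Each rule only needs to be correct within its own fragment (the whole page, the list, or a single item), which is what makes the individual extraction tasks simpler than a single flat rule over the full document.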

Combining Agents and Wrapper Induction for Information Gathering on Restricted Web Domains

2010

The Web is growing constantly and exponentially, which makes gathering relevant information increasingly infeasible. Existing index-based search engines ignore the context of information, which is essential for deciding on its relevance. When the search is restricted to a single web domain, a domain ontology can be used to take the related context into account, which might allow web pages belonging to that domain to be treated more intelligently. Nevertheless, the symbolic rules that exploit the domain ontology for this treatment are delicate and tedious to develop, especially for the information extraction task. This paper presents Boosted Wrapper Induction (BWI), a machine learning method for adaptive information extraction, and its use as a replacement for the symbolic approach to information extraction in AGATHE, a generic multi-agent architecture for information gathering on restricted web domains.

Web wrapper induction: a brief survey

AI Communications, 2004

Nowadays several companies use the information available on the Web for a number of purposes. However, since most of this information is only available as HTML documents, several techniques that allow information from the Web to be automatically extracted have recently been defined. In this paper we review the main techniques and tools for extracting information available on the Web, devising a taxonomy of existing systems. In particular, we emphasize the advantages and drawbacks of the analyzed techniques from a user's point of view.

Wrapper induction: Efficiency and expressiveness

Artificial Intelligence, 2000

The Internet presents numerous sources of useful information: telephone directories, product catalogs, stock quotes, event listings, etc. Recently, many systems have been built that automatically gather and manipulate such information on a user's behalf. However, these resources are usually formatted for use by people (e.g., the relevant content is embedded in HTML pages), so extracting their content is difficult. Most systems use customized wrapper procedures to perform this extraction task. Unfortunately, writing wrappers is tedious and error-prone. As an alternative, we advocate wrapper induction, a technique for automatically constructing wrappers. In this article, we describe six wrapper classes, and use a combination of empirical and analytical techniques to evaluate the computational tradeoffs among them. We first consider expressiveness: how well the classes can handle actual Internet resources, and the extent to which wrappers in one class can mimic those in another. We then turn to efficiency: we measure the number of examples and time required to learn wrappers in each class, and we compare these results to PAC models of our task and asymptotic complexity analyses of our algorithms. Summarizing our results, we find that most of our wrapper classes are reasonably useful (70% of surveyed sites can be handled in total), yet can be rapidly learned (learning usually requires just a handful of examples and a fraction of a CPU second per example).
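As a rough illustration of what "automatically constructing wrappers" from a handful of examples can mean, the Python sketch below induces a single pair of string delimiters from labeled spans and then applies them to the whole page. The abstract does not define its six wrapper classes, so this delimiter-based scheme, the sample page, and the names `induce_delimiters` and `apply_wrapper` are assumptions made for illustration only, not the article's algorithms.

```python
def common_prefix(strings):
    """Longest string that every element of `strings` starts with."""
    shortest = min(strings, key=len)
    for i, ch in enumerate(shortest):
        if any(s[i] != ch for s in strings):
            return shortest[:i]
    return shortest


def common_suffix(strings):
    """Longest string that every element of `strings` ends with."""
    return common_prefix([s[::-1] for s in strings])[::-1]


def induce_delimiters(page, spans):
    """Learn (left, right) delimiters from labeled (start, end) spans in `page`.

    Assumption: one field, bounded on each side by a fixed string.
    """
    befores = [page[:start] for start, _ in spans]
    afters = [page[end:] for _, end in spans]
    left, right = common_suffix(befores), common_prefix(afters)
    if not left or not right:
        raise ValueError("examples share no common delimiter")
    return left, right


def apply_wrapper(page, left, right):
    """Extract every substring enclosed by the learned delimiters."""
    results, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        results.append(page[start:end])
        pos = end + len(right)
    return results


# Hypothetical page; only the first two records are labeled by the "user".
page = (
    "<b>Congo</b> <i>242</i><br>"
    "<b>Belize</b> <i>501</i><br>"
    "<b>Spain</b> <i>34</i><br>"
)
labeled = []
for value in ["Congo", "Belize"]:
    start = page.index(value)
    labeled.append((start, start + len(value)))

left, right = induce_delimiters(page, labeled)
countries = apply_wrapper(page, left, right)
print(left, right, countries)
```

With too few or too similar examples, the learned delimiters can absorb record-specific text (for instance a digit shared by two labeled records), which is one reason the number of examples needed to learn a wrapper is worth measuring.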

Semi-automatic wrapper generation for internet information sources

CoopIS, 1997

To simplify the task of obtaining information from the vast number of information sources that are available on the World Wide Web (WWW), we are building tools to build information mediators for extracting and integrating data from multiple Web sources. In a mediator-based approach, wrappers are built around individual information sources and provide translation between the mediator query language and the individual source. We present an approach for semi-automatically generating wrappers for structured Internet sources. The key idea is to exploit formatting information in Web pages from the source to hypothesize the underlying structure of a page. From this structure the system generates a wrapper that facilitates querying of a source and possibly integrating it with other sources. We demonstrate the ease with which we are able to build wrappers for a number of Web sources using our implemented wrapper generation toolkit.

A new path generalization algorithm for HTML wrapper induction

Advances in Web Intelligence and Data …, 2006

Recently it was shown that Inductive Logic Programming can be successfully applied to data extraction from HTML. However, the approach suffers from two problems: high computational complexity with respect to the number of nodes of the target document and to the arity of the extracted tuples. In this note we address the first problem by proposing an efficient path generalization algorithm for learning rules to extract single information items. The presentation is supplemented with a description of a sample experiment.

Unsupervised wrapper induction using linked data

Proceedings of the seventh international conference on Knowledge capture - K-CAP '13, 2013

This work explores the usage of Linked Data for Web-scale Information Extraction and shows encouraging results on the task of Wrapper Induction. We propose a simple knowledge-based method which (i) is highly flexible with respect to different domains and (ii) does not require any training material, but exploits Linked Data as a background knowledge source to build essential learning resources. The major contribution of this work is a study of how Linked Data, an imprecise, redundant and large-scale knowledge resource, can be used to support Web-scale Information Extraction in an effective and efficient way, together with an identification of the challenges involved. We show that, for domains that are covered, Linked Data serve as a powerful knowledge resource for Information Extraction. Experiments on a publicly available dataset demonstrate that, under certain conditions, this simple unsupervised approach can achieve competitive results against some complex state-of-the-art methods that always depend on training data.