Wrapping Web Information Providers by Transducer Induction

Modern agent and mediator systems communicate with a multitude of Web information providers to better satisfy user requests. They use wrappers to extract relevant information from HTML responses and to annotate it with user-defined labels. A number of approaches use machine learning methods to induce instances of certain wrapper classes, assuming a tabular structure of HTML responses and observing the regularity of extracted fragments within the HTML structure. In this work, we propose a more general approach and consider the information extraction conducted by wrappers as a special form of transduction. We make no assumption about the structure of HTML responses and draw on advanced methods of transducer induction to develop two powerful wrapper classes, for samples with and without ambiguous translations. We test the proposed induction methods on a set of general-purpose and bibliographic data providers and report the results of the experiments.
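To make the view of extraction as transduction concrete, the following minimal sketch shows a hand-written finite-state transducer that reads a stream of HTML tokens and emits one label per token. It is an illustration only, not the induction method of this work: the transition table is written by hand rather than learned from samples, and the state names, the labels `title` and `author`, the `void` label for ignored tokens, and the sample markup are all assumptions made for the example.

```python
import re

def tokenize(html):
    # Split the page into tags and text fragments; drop whitespace-only pieces.
    parts = re.split(r"(<[^>]+>)", html)
    return [p.strip() for p in parts if p.strip()]

# Hypothetical transition table: (state, input symbol) -> (next state, output label).
# Every text fragment is abstracted to the single input symbol "TEXT".
TRANSITIONS = {
    ("start", "<li>"): ("item", "void"),
    ("item", "<b>"): ("in_title", "void"),
    ("in_title", "TEXT"): ("in_title", "title"),
    ("in_title", "</b>"): ("after_title", "void"),
    ("after_title", "TEXT"): ("after_title", "author"),
    ("after_title", "</li>"): ("start", "void"),
}

def transduce(tokens):
    # Translate the input token sequence into a label sequence of equal length.
    state = "start"
    labels = []
    for tok in tokens:
        symbol = tok if tok.startswith("<") else "TEXT"
        state, label = TRANSITIONS[(state, symbol)]  # KeyError on unseen input
        labels.append(label)
    return labels

def extract(html):
    # Keep only the tokens whose label is not "void".
    tokens = tokenize(html)
    return [(lab, tok) for tok, lab in zip(tokens, transduce(tokens)) if lab != "void"]

if __name__ == "__main__":
    print(extract("<li><b>Some Title</b> A. Author</li>"))
```

In this sketch the wrapper is fully determined by the transition table, so inducing a wrapper amounts to inferring such a table from annotated sample pages; how that inference is done, and how ambiguous translations are handled, is outside the scope of the example.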