Benjamin Habegger | INSA Lyon
Papers by Benjamin Habegger
A crucial aspect when building a system which integrates data from multiple sources is to define mappings between the schemas of the different sources. Using rewriting techniques, such mappings allow the integration system to translate queries posed over one schema into queries over the other. Building such mappings by hand is known to be difficult and labor-intensive. In this paper, we propose an approach combining relational learning and user interaction to build mappings between a database and an ontology. The proposed approach makes it possible to build mappings which can be complex queries over the source database. Furthermore, guarantees on the correctness of the mappings can be given.
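As a rough sketch of the kind of mapping the abstract describes, the snippet below pairs each ontology class with a SQL query over the source database, so that a query posed over the ontology can be rewritten into a query over the source. The schema, class names, and queries are invented for illustration and are not taken from the paper.

```python
# A minimal sketch of mappings from ontology classes to (possibly complex)
# SQL queries over a source database, with a trivial rewriting step.
# All table, column, and class names below are hypothetical.
MAPPINGS = {
    "Researcher": (
        "SELECT p.id, p.name FROM person p "
        "JOIN affiliation a ON a.person_id = p.id "
        "WHERE a.kind = 'lab'"
    ),
    "Publication": "SELECT d.id, d.title FROM document d WHERE d.reviewed = 1",
}

def rewrite(ontology_class: str) -> str:
    """Translate a query over an ontology class into a source-level SQL query."""
    try:
        return MAPPINGS[ontology_class]
    except KeyError:
        raise ValueError(f"no mapping defined for class {ontology_class!r}")

# An ontology-level request for Researcher instances becomes a concrete
# SQL query against the source schema.
print(rewrite("Researcher"))
```

Note that the mapped query can be arbitrarily complex (joins, filters), which is exactly what makes learning such mappings harder than matching attribute names one to one.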
Efficient and reliable integration of web data requires building programs called wrappers. Writing wrappers by hand is tedious and error-prone, and constant changes on the web mean that wrappers must be constantly reworked. Machine learning has proven useful, but current techniques are either limited in expressivity, require non-intuitive user interaction, or do not allow n-ary extraction. We study the use of tree patterns as an n-ary extraction language and propose an algorithm for learning such queries. It computes the most information-conservative tree pattern that is a generalization of two input trees. A notable aspect is that the approach can learn queries containing both child and descendant relationships between nodes. More importantly, it requires no labeling other than the data the user actually wants to extract. The reported experiments show the effectiveness of the approach.
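The toy sketch below illustrates the generalization step the abstract refers to: two labeled trees are merged into a common pattern, with mismatched labels replaced by a wildcard. The published algorithm computes the most information-conservative generalization and distinguishes child from descendant edges; this sketch only aligns children positionally, so it is a simplification rather than the paper's method.

```python
# Toy generalization of two labeled trees into a common tree pattern.
# "*" stands for a wildcard label; children are aligned positionally,
# which is far cruder than the paper's alignment.
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list["Node"] = field(default_factory=list)

def generalize(a: Node, b: Node) -> Node:
    # Keep the label where the trees agree, wildcard it where they differ,
    # and recurse on the paired-up children.
    label = a.label if a.label == b.label else "*"
    kids = [generalize(x, y) for x, y in zip(a.children, b.children)]
    return Node(label, kids)

t1 = Node("tr", [Node("td", [Node("b")]), Node("td")])
t2 = Node("tr", [Node("td", [Node("i")]), Node("td")])
print(generalize(t1, t2))  # tr(td(*), td): matches both input rows
```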
In this paper we briefly describe three systems: onCue, a desktop internet-access toolbar; Snip!t, a web-based bookmarking application; and ontoPIM, an ontology-based personal task-management system. These embody context issues to differing degrees, and we use them to exemplify more general issues concerning the use of contextual information in 'intelligent' interfaces. We look at issues relating to interaction and 'appropriate intelligence', at the different types of context that arise, and at architectural lessons we have learnt. We also highlight outstanding problems, in particular the need to computationally describe and communicate context where reasoning and inference are distributed.
The problem of extracting information from the Web consists in building patterns that extract specific information from the documents of a given Web source. Up to now, most existing techniques use string-based representations of documents as well as string-based patterns. Using tree representations naturally overcomes the limitations of string-based approaches. While some tree-based approaches exist, they are…
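To make the contrast with string-based patterns concrete, the fragment below uses a descendant step so the same pattern extracts a field whether or not an extra wrapper element intervenes, a variation that flat patterns over the serialized HTML handle poorly. The HTML fragment is invented for illustration.

```python
# Tree-based matching with a descendant step, using the standard library.
from xml.etree import ElementTree as ET

HTML = """<table>
  <tr><td><b>Title A</b></td><td>2004</td></tr>
  <tr><td><div><b>Title B</b></div></td><td>2005</td></tr>
</table>"""

root = ET.fromstring(HTML)
titles = []
for row in root.findall("tr"):
    first_cell = row.find("td")
    b = first_cell.find(".//b")  # descendant step: tolerates the <div> wrapper
    if b is not None:
        titles.append(b.text)
print(titles)  # ['Title A', 'Title B']
```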
Most recent research in the field of information extraction from the Web has concentrated on the task of extracting the underlying content of a set of similarly structured web pages. However, in order to build real-world web information extraction applications this is not sufficient: such applications require fully automating access to web sources, which involves more than extracting data from pages. An infrastructure is needed to query a source, retrieve the result pages, extract the results from these pages, and filter out unwanted results. In this paper we show how such an infrastructure can be set up. We propose to build a web information extraction application by decomposing it into sub-tasks and describing it in an XML-based language named WetDL. Each sub-task applies an extraction-specific operation to its input, one of these operations being the application of an extractor. By connecting such operations together, complex applications can be defined simply. We demonstrate this on real-world extraction tasks such as extracting DVD listings from Amazon.com and extracting addresses from the online telephone directory superpages.com.
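A minimal Python stand-in for this decomposition (not actual WetDL syntax): each sub-task is one operator, and connecting them yields the complete extraction application. The URL, extraction pattern, and filter predicate are hypothetical.

```python
# Sub-tasks of a web extraction application as composable operators:
# build the query, fetch the result page, extract results, filter them.
import re
import urllib.parse
import urllib.request

def build_query(keyword: str) -> str:
    return "https://example.org/search?" + urllib.parse.urlencode({"q": keyword})

def fetch(url: str) -> str:
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract(page: str) -> list[str]:
    # Stand-in extractor; in the paper this step applies a learned wrapper.
    return re.findall(r'<h2 class="result">(.*?)</h2>', page)

def keep(result: str) -> bool:
    return "DVD" in result  # drop unwanted results

def run(keyword: str) -> list[str]:
    # Connecting the operators gives the full application.
    return [r for r in extract(fetch(build_query(keyword))) if keep(r)]
```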
This paper presents a challenging project which aims to extend the current features of search and browsing engines. Different methods are integrated to meet the following requirements: (1) integration of incremental and focused dynamic crawling with meta-search; (2) freeing the user from sifting through the long list of documents returned by search engines; (3) extracting comprehensive patterns and useful knowledge from the documents; (4) visual support for browsing dynamic document collections. Finally, a new paradigm is proposed combining the mining and visualization methods used for search and exploration.
Many online data sources, such as product catalogs and online directories, are available on the web. Extracting information from such sources is a hard task since they are designed to be presented to human users. Many researchers have tackled the problem of building wrappers for such sources; the state-of-the-art approach is to use machine learning techniques based on fully labeled example pages. In this paper we propose and study an approach based on example instances, which lets the user build a wrapper from only a handful of examples over the whole source while taking structural differences into account. The patterns obtained extract the instances of the relation described by the examples and contained in the same data source.
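The sketch below is a toy version of induction from example instances alone: one example tuple is located in the page, the character contexts around its fields are kept, and the resulting pattern also matches new instances. The page snippet and context width are made up, and the paper's patterns are considerably more general than a single regular expression.

```python
# Inducing an extraction pattern from one example instance: keep the
# contexts around the example's fields and reuse them as delimiters.
import re

PAGE = "<li><b>Casablanca</b> (1942)</li><li><b>Vertigo</b> (1958)</li>"
EXAMPLES = [("Casablanca", "1942")]  # known instances of the target relation

def learn_pattern(page: str, examples: list[tuple[str, str]], ctx: int = 4) -> str:
    title, year = examples[0]
    i = page.index(title)
    j = page.index(year, i)
    left = re.escape(page[max(0, i - ctx):i])         # context before field 1
    sep = re.escape(page[i + len(title):j])           # context between fields
    right = re.escape(page[j + len(year):j + len(year) + ctx])
    return left + r"(.*?)" + sep + r"(\d{4})" + right

pattern = learn_pattern(PAGE, EXAMPLES)
print(re.findall(pattern, PAGE))  # [('Casablanca', '1942'), ('Vertigo', '1958')]
```

Nothing in the page other than the example itself had to be labeled, yet the learned pattern recovers the unseen ('Vertigo', '1958') instance.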
Extracting information from the Web is a complex task with different components which can be either generic or specific to the task: downloading a given page, following links, querying a Web-based application via an HTML form and the HTTP protocol, querying a Web service via the SOAP protocol, etc. Web services that execute such extraction tasks therefore cannot simply be hard-coded (i.e. written and compiled once and for all in a given programming language). To build flexible information extraction Web services, we need to be able to compose different sub-tasks together. We propose an XML-based language to describe information extraction Web services as compositions of existing Web services and specific functions. The usefulness of the proposed framework is demonstrated by three real-world applications. (1) Search engines: we show how to describe a task which queries Google's Web service, retrieves more information on the results by querying their respective HTTP servers, and filters them according to this information. (2) E-commerce sites: an information extraction Web service is built that gives access to an existing HTML-based e-commerce application such as Amazon. (3) Patent extraction: a last example shows how to describe an information extraction Web service which queries a Web-based application, extracts the set of result links, follows them, and extracts the needed information from the result pages. In all three applications the generated description can easily be modified and completed to further meet the user's needs and create value-added Web services.
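Because the point is that such services are described rather than hard-coded, the sketch below interprets a declarative task description (a Python dict standing in for the paper's XML) as a chain of named operators: editing the description changes the service without recompiling anything. The operator names and the task itself are hypothetical.

```python
# Interpreting a declarative task description as a chain of operators.
import re
import urllib.request

OPERATORS = {
    "fetch": lambda url: urllib.request.urlopen(url).read().decode("utf-8", "replace"),
    "extract_links": lambda page: re.findall(r'href="([^"]+)"', page),
    "keep_secure": lambda links: [l for l in links if l.startswith("https://")],
}

# Stand-in for an XML description of the composed service.
TASK = {"steps": ["fetch", "extract_links", "keep_secure"]}

def execute(task: dict, value):
    # The interpreter replaces recompilation: changing TASK changes
    # the service while the operators stay untouched.
    for name in task["steps"]:
        value = OPERATORS[name](value)
    return value
```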
Extracting information from the Web is a complex task with different components which can be either generic or specific to the task: downloading a given page, following links, querying a Web-based application via an HTML form and the HTTP protocol, querying a Web Service via the SOAP protocol, etc. Therefore building Web Services which…
Many online information sources are available on the Web. Giving machines access to such sources leads to many interesting applications, such as using web data in mediators or software agents. Up to now, most work in the field of information extraction from the web has concentrated on building wrappers, i.e. programs that reformat presentational data in HTML into a more machine-comprehensible format. While wrappers are an important part of a web information extraction application, they are not sufficient to fully access a source: an infrastructure is also needed to build queries, fetch pages, extract specific links, etc. In this paper we propose a language called WetDL for describing an information extraction task as a network of operators whose execution performs the desired extraction.
Numerous sources of data are available on the web, for instance product catalogs, online directories, and conference and event sites. Extracting information from the content of these sources is a challenging and hard task since they are heterogeneous and dynamic. This paper presents a new method for extracting wrappers and relations from the web using both page encoding and context generalization. Its starting point is a training set of instances of the relation the user wishes to extract. Multiple patterns are then extracted from the occurrences of the input instances in the data source, and the generalization of these patterns allows us to identify new instances of the relation in the same source. The main features of this method are its simplicity, genericity, and robustness in the face of the diversity of sources. Its effectiveness is shown by experimental results on different kinds of sources: search engines, shopping sites, product catalogs, paper listings, etc.