Oren Etzioni - Academia.edu

Papers by Oren Etzioni

Unsupervised named-entity extraction from the Web: An experimental study

Artificial Intelligence, 2005

The KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KNOWITALL's novel architecture and design principles, emphasizing its distinctive ability to extract information without any hand-labeled training examples. In its first major run, KNOWITALL extracted over 50,000 class instances, but suggested a challenge: How can we improve KNOWITALL's recall and extraction rate without sacrificing precision? This paper presents three distinct ways to address this challenge and evaluates their performance. Pattern Learning learns domain-specific extraction rules, which enable additional extractions. Subclass Extraction automatically identifies sub-classes in order to boost recall (e.g., "chemist" and "biologist" are identified as sub-classes of "scientist"). List Extraction locates lists of class instances, learns a "wrapper" for each list, and extracts elements of each list. Since each method bootstraps from KNOWITALL's domain-independent methods, the methods also obviate hand-labeled training examples. The paper reports on experiments, focused on building lists of named entities, that measure the relative efficacy of each method and demonstrate their synergy. In concert, our methods gave KNOWITALL a 4-fold to 8-fold increase in recall at precision of 0.90, and discovered over 10,000 cities missing from the Tipster Gazetteer.

Privacy interfaces for information management

Communications of the ACM, 1999

The designers of information management software must strike a delicate balance between protecting user privacy and facilitating the sharing of information. Since there is no universal policy appropriate for all users, designers must provide users with a means of specifying their own individual privacy policies. Each user then determines what information to conceal, what to reveal, and to whom. While information protection mechanisms abound, the user interface to such mechanisms has received scant attention.

A scalable comparison-shopping agent for the world-wide web domain

Agents '97 Conference Proceedings, copyright 1997 ACM.

Intelligent Agents on the Internet: Fact, Fiction, and Forecast

IEEE Expert / IEEE Intelligent Systems, 1995

Computer technology has dramatically enhanced our ability to generate, deliver, and store information. Unfortunately, our tools for locating, filtering, and analyzing information have not kept pace. A popular solution is intelligent agents. But what are they?

Scaling question answering to the Web

ACM Transactions on Information Systems, 2001

The wealth of information on the web makes it an attractive resource for seeking quick answers to simple, factual questions such as "who was the first American in space?" or "what is the second tallest mountain in the world?" Yet today's most advanced web search services (e.g., Google and AskJeeves) make it surprisingly tedious to locate answers to such questions. In this paper, we extend question-answering techniques, first studied in the information retrieval literature, to the web and experimentally evaluate their performance.

Tractable Closed World Reasoning with Updates

... Oren Etzioni, Keith Golden, Daniel Weld, Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195, {etzioni, kgolden, weld}@cs.washington.edu Abstract ... To see this, consider a singleton LCW query such as LCW(parent:dir(f; /kr94)). ...

Open Information Extraction from the Web

Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extract the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag new training examples. This manual labor scales linearly with the number of target relations. This paper introduces Open IE (OIE), a new extraction paradigm where the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input. The paper also introduces TEXTRUNNER, a fully implemented, highly scalable OIE system where the tuples are assigned a probability and indexed to support efficient extraction and exploration via user queries. We report on experiments over a 9,000,000 Web page corpus that compare TEXTRUNNER with KNOWITALL, a state-of-the-art Web IE system. TEXTRUNNER achieves an error reduction of 33% on a comparable set of extractions. Furthermore, in the amount of time it takes KNOWITALL to perform extraction for a handful of pre-specified relations, TEXTRUNNER extracts a far broader set of facts reflecting orders of magnitude more relations, discovered on the fly. We report statistics on TEXTRUNNER's 11,000,000 highest probability tuples, and show that they contain over 1,000,000 concrete facts and over 6,500,000 more abstract assertions.
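To make the idea of "relational tuples" that are "indexed to support ... user queries" concrete, here is a toy sketch. It is not TEXTRUNNER's method: the regular expression, the function names (extract_tuples, build_index), and the sample sentences are all assumptions made for illustration, and raw occurrence counts stand in for the probabilities the abstract mentions.

```python
import re
from collections import defaultdict

# Toy pattern (an assumption, not the paper's extractor): a capitalized
# word, one or more lowercase words forming the relation, then another
# capitalized word.
TUPLE_PATTERN = re.compile(r"([A-Z][a-z]+) ((?:[a-z]+ )+)([A-Z][a-z]+)")


def extract_tuples(sentences):
    """Make one pass over the sentences and count (arg1, relation, arg2) tuples."""
    counts = defaultdict(int)
    for sentence in sentences:
        for match in TUPLE_PATTERN.finditer(sentence):
            arg1, relation, arg2 = match.group(1), match.group(2).strip(), match.group(3)
            counts[(arg1, relation, arg2)] += 1
    return counts


def build_index(counts):
    """Index the tuples by relation phrase so they can be looked up by query."""
    index = defaultdict(list)
    for (arg1, relation, arg2), n in counts.items():
        index[relation].append((arg1, arg2, n))
    return index


# Hypothetical three-sentence corpus.
corpus = [
    "Einstein was born in Ulm.",
    "Einstein was born in Ulm in 1879.",
    "Seattle is located in Washington.",
]

counts = extract_tuples(corpus)
index = build_index(counts)
print(index["was born in"])
```

No relation names are supplied in advance; the relation phrases are whatever the pattern finds in the text, which is the property the abstract contrasts with traditional IE.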

An Approach to Planning with Incomplete Information

... paint to create green. The get-paint operator suffices to get all other colors.
Name: (MAKE-GREEN-PAINT)
Preconds: (satisfy ((have-color blue) . T)) (satisfy ((have-color yellow) . T))
Postconds: (cause ((have-color green) . T)) ...
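The MAKE-GREEN-PAINT operator quoted in this excerpt can be read as a precondition/postcondition rule over a world state. The sketch below is one possible reading, not the paper's planner: the Operator class, the dictionary state encoding, and the choice to treat a missing proposition as unsatisfied are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """Hypothetical encoding of an operator with preconditions and postconditions."""
    name: str
    preconds: tuple   # propositions that must be true before execution
    postconds: tuple  # propositions made true by execution

    def applicable(self, state):
        # Assumption: a proposition absent from the state counts as unsatisfied.
        return all(state.get(p, False) for p in self.preconds)

    def apply(self, state):
        if not self.applicable(state):
            raise ValueError(f"{self.name}: preconditions not satisfied")
        new_state = dict(state)  # leave the input state unchanged
        for p in self.postconds:
            new_state[p] = True
        return new_state


# The operator from the excerpt: blue and yellow paint yield green paint.
MAKE_GREEN_PAINT = Operator(
    name="MAKE-GREEN-PAINT",
    preconds=(("have-color", "blue"), ("have-color", "yellow")),
    postconds=(("have-color", "green"),),
)

state = {("have-color", "blue"): True, ("have-color", "yellow"): True}
after = MAKE_GREEN_PAINT.apply(state)
print(after[("have-color", "green")])
```

A planner handling incomplete information would also need to represent unknown truth values; this sketch collapses unknown and false for brevity.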

Extracting Product Features and Opinions from Reviews

Consumers are often forced to wade through many on-line reviews in order to make an informed product choice. This paper introduces OPINE, an unsupervised information-extraction system which mines reviews in order to build a model of important product features, their evaluation by reviewers, and their relative quality across products. Compared to previous work, OPINE achieves 22% higher precision (with only 3% lower recall) on the feature extraction task. OPINE's novel use of relaxation labeling for finding the semantic orientation of words in context leads to strong performance on the tasks of finding opinion phrases and their polarity.

Web-scale information extraction in KnowItAll (preliminary results)

Manually querying search engines in order to accumulate a large body of factual information is a tedious, error-prone process of piecemeal search. Search engines retrieve and rank potentially relevant documents for human perusal, but do not extract facts, assess confidence, or fuse information from multiple documents. This paper introduces KNOWITALL, a system that aims to automate the tedious process of extracting large collections of facts from the web in an autonomous, domain-independent, and scalable manner.

A softbot-based interface to the Internet

Communications of the ACM, 1994

A scalable comparison-shopping agent for the World-Wide Web

Agents '97 Conference Proceedings, copyright 1997 ACM.

The World-Wide Web: quagmire or gold mine?

Communications of the ACM, 1996

Skeptics believe the Web is too unstructured for Web mining to succeed. Indeed, data mining has been applied traditionally to databases, yet much of the information on the Web lies buried in documents designed for human consumption such as home pages or product catalogs. ...
