Oren Etzioni - Academia.edu
Papers by Oren Etzioni
Artificial Intelligence, 2005
The KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KNOWITALL's novel architecture and design principles, emphasizing its distinctive ability to extract information without any hand-labeled training examples. In its first major run, KNOWITALL extracted over 50,000 class instances, but suggested a challenge: How can we improve KNOWITALL's recall and extraction rate without sacrificing precision? This paper presents three distinct ways to address this challenge and evaluates their performance. Pattern Learning learns domain-specific extraction rules, which enable additional extractions. Subclass Extraction automatically identifies sub-classes in order to boost recall (e.g., "chemist" and "biologist" are identified as sub-classes of "scientist"). List Extraction locates lists of class instances, learns a "wrapper" for each list, and extracts elements of each list. Since each method bootstraps from KNOWITALL's domain-independent methods, the methods also obviate hand-labeled training examples. The paper reports on experiments, focused on building lists of named entities, that measure the relative efficacy of each method and demonstrate their synergy. In concert, our methods gave KNOWITALL a 4-fold to 8-fold increase in recall at precision of 0.90, and discovered over 10,000 cities missing from the Tipster Gazetteer.
Communications of The ACM, 1999
The designers of information management software must strike a delicate balance between protecting user privacy and facilitating the sharing of information. Since there is no universal policy appropriate for all users, designers must provide users with a means of specifying their own individual privacy policies. Each user then determines what information to conceal, what to reveal, and to whom. While information protection mechanisms abound, the user interface to such mechanisms has received scant attention.
ERWISE, OR TO REPUBLISH, REQUIRES A FEE AND/OR SPECIFIC PERMISSION. AGENTS '97 CONFERENCE PROCEEDINGS, COPYRIGHT 1997 ACM.
IEEE Expert / IEEE Intelligent Systems, 1995
Computer technology has dramatically enhanced our ability to generate, deliver, and store information. Unfortunately, our tools for locating, filtering, and analyzing information have not kept pace. A popular solution is intelligent agents. But what are they?
ACM Transactions on Information Systems, 2001
The wealth of information on the web makes it an attractive resource for seeking quick answers to simple, factual questions such as "who was the first American in space?" or "what is the second tallest mountain in the world?" Yet today's most advanced web search services (e.g., Google and AskJeeves) make it surprisingly tedious to locate answers to such questions. In this paper, we extend question-answering techniques, first studied in the information retrieval literature, to the web and experimentally evaluate their performance.
... Oren Etzioni, Keith Golden, Daniel Weld. Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195. {etzioni, kgolden, weld}@cs.washington.edu Abstract ... To see this, consider a singleton LCW query such as LCW(parent:dir(f; /kr94)). ...
Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extract the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag new training examples. This manual labor scales linearly with the number of target relations. This paper introduces Open IE (OIE), a new extraction paradigm where the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input. The paper also introduces TEXTRUNNER, a fully implemented, highly scalable OIE system where the tuples are assigned a probability and indexed to support efficient extraction and exploration via user queries. We report on experiments over a 9,000,000 Web page corpus that compare TEXTRUNNER with KNOWITALL, a state-of-the-art Web IE system. TEXTRUNNER achieves an error reduction of 33% on a comparable set of extractions. Furthermore, in the amount of time it takes KNOWITALL to perform extraction for a handful of pre-specified relations, TEXTRUNNER extracts a far broader set of facts reflecting orders of magnitude more relations, discovered on the fly. We report statistics on TEXTRUNNER's 11,000,000 highest probability tuples, and show that they contain over 1,000,000 concrete facts and over 6,500,000 more abstract assertions.
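The abstract above describes relational tuples that carry a probability and are indexed for user queries. As a rough illustration only, the following Python sketch shows one way such an index could look. It is not TEXTRUNNER's implementation: the names Extraction and TupleIndex, the keyword-intersection query, and the sample tuples are all assumptions made for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class Extraction(NamedTuple):
    """A relational tuple with an assigned probability (hypothetical type)."""
    arg1: str
    relation: str
    arg2: str
    probability: float


class TupleIndex:
    """Toy inverted index from lowercase terms to stored tuples."""

    def __init__(self) -> None:
        self._tuples: list[Extraction] = []
        self._by_term: dict[str, set[int]] = defaultdict(set)

    def add(self, ext: Extraction) -> None:
        # Index every whitespace-separated term of all three fields.
        idx = len(self._tuples)
        self._tuples.append(ext)
        for field in (ext.arg1, ext.relation, ext.arg2):
            for term in field.lower().split():
                self._by_term[term].add(idx)

    def query(self, text: str, min_probability: float = 0.0) -> list[Extraction]:
        # Return tuples containing all query terms, most probable first.
        terms = text.lower().split()
        if not terms:
            return []
        ids = set.intersection(*(self._by_term.get(t, set()) for t in terms))
        hits = [self._tuples[i] for i in ids
                if self._tuples[i].probability >= min_probability]
        return sorted(hits, key=lambda e: e.probability, reverse=True)


# Made-up sample data, for illustration only.
index = TupleIndex()
index.add(Extraction("Edison", "invented", "the phonograph", 0.9))
index.add(Extraction("Edison", "was born in", "Ohio", 0.6))
index.add(Extraction("Bell", "invented", "the telephone", 0.8))
results = index.query("edison", min_probability=0.5)
```

Sorting by probability lets a caller keep only the highest-confidence tuples, which mirrors the abstract's focus on the highest probability extractions.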
... paint to create green. The get-paint operator suffices to get all other colors. Name: (MAKE-GREEN-PAINT) Preconds: (satisfy ((have-color blue) . T)) (satisfy ((have-color yellow) . T)) Postconds: (cause ((have-color green) . T)) ...
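The operator fragment above lists preconditions and postconditions. A minimal Python sketch of how such an operator might be checked and applied follows. It is not the authors' planner: the Operator class and the applicable and apply_operator helpers are hypothetical, the state is modeled as a set of literals that are true, and the sketch assumes postconditions only add literals, since the excerpt does not say whether blue and yellow paint are consumed.

```python
from dataclasses import dataclass

Literal = tuple[str, str]  # e.g. ("have-color", "blue"); assumed encoding


@dataclass(frozen=True)
class Operator:
    """Hypothetical container for a named operator."""
    name: str
    preconds: frozenset[Literal]   # literals that must hold (value T)
    postconds: frozenset[Literal]  # literals caused to hold (value T)


def applicable(op: Operator, state: frozenset[Literal]) -> bool:
    # An operator is applicable when every precondition is in the state.
    return op.preconds <= state


def apply_operator(op: Operator, state: frozenset[Literal]) -> frozenset[Literal]:
    # Assumption: postconditions are purely additive.
    if not applicable(op, state):
        raise ValueError(f"{op.name}: preconditions not satisfied")
    return state | op.postconds


MAKE_GREEN_PAINT = Operator(
    name="MAKE-GREEN-PAINT",
    preconds=frozenset({("have-color", "blue"), ("have-color", "yellow")}),
    postconds=frozenset({("have-color", "green")}),
)

start = frozenset({("have-color", "blue"), ("have-color", "yellow")})
after = apply_operator(MAKE_GREEN_PAINT, start)
```

Representing states as frozen sets keeps the applicability test a single subset check and makes each resulting state immutable.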
Consumers are often forced to wade through many on-line reviews in order to make an informed product choice. This paper introduces OPINE, an unsupervised information-extraction system which mines reviews in order to build a model of important product features, their evaluation by reviewers, and their relative quality across products. Compared to previous work, OPINE achieves 22% higher precision (with only 3% lower recall) on the feature extraction task. OPINE's novel use of relaxation labeling for finding the semantic orientation of words in context leads to strong performance on the tasks of finding opinion phrases and their polarity.
Manually querying search engines in order to accumulate a large body of factual information is a tedious, error-prone process of piecemeal search. Search engines retrieve and rank potentially relevant documents for human perusal, but do not extract facts, assess confidence, or fuse information from multiple documents. This paper introduces KNOWITALL, a system that aims to automate the tedious process of extracting large collections of facts from the web in an autonomous, domain-independent, and scalable manner.
Communications of The ACM, 1994
Communications of The ACM, 1996
Skeptics believe the Web is too unstructured for Web mining to succeed. Indeed, data mining has been applied traditionally to databases, yet much of the information on the Web lies buried in documents designed for human consumption such as home pages or product catalogs. ...