A Similarity Grammatical Structures Based Method for Improving Open Information Systems

Dependency-based open information extraction

Building shallow semantic representations from text corpora is the first step toward more complex tasks such as text entailment, enrichment of knowledge bases, or question answering. Open Information Extraction (OIE) is a recent unsupervised strategy for extracting billions of basic assertions from massive corpora; these assertions can be regarded as a shallow semantic representation of those corpora. In this paper, we propose a new multilingual OIE system based on robust and fast rule-based dependency parsing. It extracts more precise assertions (verb-based triples) from text than state-of-the-art OIE systems while keeping a crucial property of those systems: scaling to Web-size document collections.

Inference Approach to Enhance a Portuguese Open Information Extraction

Open Information Extraction (Open IE) enables the extraction of facts from large quantities of text written in natural language. Almost all research has addressed English texts, and methods and techniques for other languages have been less frequent. However, languages other than English account for 48% of the content available on websites around the world. In this work, we propose a method for extracting facts in Portuguese without predetermining the types of the facts. Additionally, we increase the number of extracted facts by means of an inference approach. Our inference method consists of two mechanisms: a transitive one and a symmetric one. To the best of our knowledge, this is the first time that an inference approach has been used to extract facts from Portuguese texts. Our proposal increased the number of valid facts extracted by a Portuguese Open IE system by 36%, and the quality of its facts is comparable to that of English approaches.
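The transitive and symmetric mechanisms mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say which relations are treated as transitive or symmetric, so the two relation sets and the function name `infer` below are hypothetical, chosen only for illustration; this is not the authors' implementation.

```python
from itertools import product

# Hypothetical relation inventories (assumption: not taken from the paper).
SYMMETRIC = {"borders", "is married to"}
TRANSITIVE = {"is located in", "is part of"}


def infer(facts):
    """Return the input triples plus those derived by the two rules.

    Symmetric rule:  (a, r, b)            => (b, r, a)  if r is symmetric
    Transitive rule: (a, r, b), (b, r, c) => (a, r, c)  if r is transitive
    The rules are applied repeatedly until no new triple appears.
    """
    known = set(facts)
    while True:
        new = set()
        for a, r, b in known:
            if r in SYMMETRIC:
                new.add((b, r, a))
        for (a, r1, b), (c, r2, d) in product(known, repeat=2):
            if r1 == r2 and r1 in TRANSITIVE and b == c and a != d:
                new.add((a, r1, d))
        if new <= known:  # fixed point reached
            return known
        known |= new


facts = {
    ("Lisboa", "is located in", "Portugal"),
    ("Portugal", "is located in", "Europa"),
    ("Portugal", "borders", "Espanha"),
}
for triple in sorted(infer(facts)):
    print(triple)
```

Iterating to a fixed point keeps the sketch short; on a large fact base an indexed join over the shared argument would be a more reasonable choice than the quadratic loop.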

DptOIE: a portuguese Open Information Extraction system based on dependency analysis

2019

It is estimated that more than 80% of the information on the Web is stored in textual form. For humans, extracting useful information from the data that appears daily is difficult. To automate the process, Open Information Extraction (OIE) methods, which are capable of extracting facts from large textual bases, have been proposed. At first, most OIE methods were developed for the English language. However, other languages, such as Portuguese, have attracted special attention, since Portuguese accounts for approximately 2.5% of all content available on websites. For English, methods based on hand-crafted rules and dependency analysis have achieved good results. Nevertheless, methods based on similar approaches for Portuguese have not shown equivalent performance. We believe that the rules defined so far are generic and do not cover specific aspects of the language. For this reason, our DptOIE method defines a new set of hand-crafted rules and explores sentences thr...

Multilingual Open Information Extraction

Lecture Notes in Computer Science, 2015

Open Information Extraction (OIE) is a recent unsupervised strategy to extract great amounts of basic propositions (verb-based triples) from massive text corpora which scales to Web-size document collections. We propose a multilingual rule-based OIE method that takes as input dependency parses in the CoNLL-X format, identifies argument structures within the dependency parses, and extracts a set of basic propositions from each argument structure. Our method requires no training data and, according to experimental studies, obtains higher recall and higher precision than existing approaches relying on training data. Experiments were performed in three languages: English, Portuguese, and Spanish.
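As a rough illustration of the pipeline this abstract describes (read a CoNLL-X dependency parse, find a verb's argument structure, emit a triple), here is a minimal sketch. It assumes the standard ten tab-separated CoNLL-X columns and uses `nsubj`/`dobj`-style labels for subjects and objects; dependency labels differ between parsers, and the function names are hypothetical. It is far simpler than the rule set the paper refers to.

```python
from collections import defaultdict

# Assumed dependency labels; real parsers and treebanks use different inventories.
SUBJ_LABELS = {"nsubj", "SBJ"}
OBJ_LABELS = {"dobj", "obj", "OBJ"}


def read_conllx(block):
    """Parse one sentence in CoNLL-X format (10 tab-separated columns per token)."""
    tokens = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        tokens.append({
            "id": int(cols[0]), "form": cols[1], "cpos": cols[3],
            "head": int(cols[6]), "deprel": cols[7],
        })
    return tokens


def subtree_text(root_id, children, by_id):
    """Join the word forms of a token and all its descendants in sentence order."""
    ids, stack = [], [root_id]
    while stack:
        cur = stack.pop()
        ids.append(cur)
        stack.extend(children[cur])
    return " ".join(by_id[i]["form"] for i in sorted(ids))


def extract_triples(tokens):
    """Emit (subject, verb, object) for every verb that has both dependents."""
    by_id = {t["id"]: t for t in tokens}
    children = defaultdict(list)
    for t in tokens:
        children[t["head"]].append(t["id"])
    triples = []
    for t in tokens:
        if not t["cpos"].upper().startswith("V"):
            continue
        subjects = [c for c in children[t["id"]] if by_id[c]["deprel"] in SUBJ_LABELS]
        objects = [c for c in children[t["id"]] if by_id[c]["deprel"] in OBJ_LABELS]
        for s in subjects:
            for o in objects:
                triples.append((subtree_text(s, children, by_id), t["form"],
                                subtree_text(o, children, by_id)))
    return triples


rows = [
    ("1", "The", "the", "DET", "DT", "_", "2", "det", "_", "_"),
    ("2", "parser", "parser", "NOUN", "NN", "_", "3", "nsubj", "_", "_"),
    ("3", "extracts", "extract", "VERB", "VBZ", "_", "0", "root", "_", "_"),
    ("4", "basic", "basic", "ADJ", "JJ", "_", "5", "amod", "_", "_"),
    ("5", "propositions", "proposition", "NOUN", "NNS", "_", "3", "dobj", "_", "_"),
]
sentence = "\n".join("\t".join(r) for r in rows)
print(extract_triples(read_conllx(sentence)))
```

Because the extraction only walks an existing parse, it needs no training data, which is the property the abstract emphasizes.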

Open Information Extraction for Spanish Language based on Syntactic Constraints

Proceedings of the ACL 2014 Student Research Workshop, 2014

Open Information Extraction (Open IE) serves for the analysis of vast amounts of text by extracting assertions, or relations, in the form of tuples <argument 1; relation; argument 2>. Various approaches to Open IE have been designed to perform in a fast, unsupervised manner. All of them require language-specific information for their implementation. In this work, we introduce an approach to Open IE based on syntactic constraints over POS tag sequences, targeted at the Spanish language. We describe the rules specific to Spanish constructions and their implementation in EXTRHECH, an Open IE system for Spanish. We also discuss language-specific implementation issues. We compare EXTRHECH's performance with that of REVERB, a similar Open IE system for English, on a parallel dataset and show that the two systems perform at a very similar level. We also compare EXTRHECH's performance on a dataset of grammatically correct sentences against its performance on a dataset of random texts extracted from the Web, which differ drastically in quality from the first dataset. The latter experiment shows the robustness of EXTRHECH on Web texts.
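The idea of a syntactic constraint over POS tag sequences can be sketched as a regular expression over one-letter tag codes. The tag inventory, the pattern, and the function name below are assumptions for illustration only (a simplified, ReVerb-like "verb, optionally ending in a preposition" relation between two noun phrases); they are not the actual EXTRHECH rules, which the abstract does not spell out.

```python
import re

# One letter per coarse POS tag so a regex can scan the whole tag sequence.
# The tagset is an assumption; unknown tags map to "X" and never match.
TAG_CODE = {"VERB": "V", "ADP": "P", "NOUN": "N", "PROPN": "N",
            "DET": "D", "ADJ": "A", "ADV": "R"}

# Noun phrase: optional determiner, adjectives, nouns, adjectives
# (adjectives frequently follow the noun in Spanish).
# Relation: one or more verbs, optionally followed by words ending in a preposition.
PATTERN = re.compile(r"(D?A*N+A*)(V+(?:[NADR]*P)?)(D?A*N+A*)")


def extract(tagged):
    """Return (arg1, relation, arg2) tuples from a list of (word, tag) pairs."""
    codes = "".join(TAG_CODE.get(tag, "X") for _, tag in tagged)
    words = [word for word, _ in tagged]
    tuples = []
    for m in PATTERN.finditer(codes):
        # Each token is exactly one character in `codes`, so regex group spans
        # are also token index ranges.
        tuples.append(tuple(" ".join(words[m.start(g):m.end(g)]) for g in (1, 2, 3)))
    return tuples


sentence = [("Cervantes", "PROPN"), ("nació", "VERB"), ("en", "ADP"), ("Alcalá", "PROPN")]
print(extract(sentence))
```

Working on tag codes rather than words keeps the constraint language-specific only through the pattern itself, so adapting the sketch to another language would mean changing the regular expression and the tag mapping.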

More Informative Open Information Extraction via Simple Inference

Lecture Notes in Computer Science, 2014

Recent Open Information Extraction (OpenIE) systems utilize grammatical structure to extract facts with very high recall and good precision. In this paper, we point out that a significant fraction of the extracted facts is, however, not informative. For example, for the sentence "The ICRW is a non-profit organization headquartered in Washington", the extracted fact (a non-profit organization) (is headquartered in) (Washington) is not informative. This is a problem for semantic search applications utilizing these triples, which is hard to fix once the triple extraction is completed. We therefore propose to integrate a set of simple inference rules into the extraction process. Our evaluation shows that, even with these simple rules, the percentage of informative triples can be improved considerably and the already high recall can be improved even further. Both improvements directly increase the quality of search on these triples.

Relation Extraction With Clause-Based Open Information Extraction

2021

Information Extraction (IE) is one of the challenging tasks in natural language processing. The goal of relation extraction is to discover the relevant segments of information in large numbers of textual documents so that they can be used for structuring data. IE aims at discovering various semantic relations in natural language text and has a wide range of applications such as question answering, information retrieval, and knowledge presentation, among others. This thesis proposes approaches for relation extraction with clause-based Open Information Extraction that use linguistic knowledge to capture a variety of information, including semantic concepts, words, POS tags, shallow and full syntax, and dependency parsing in rich syntactic and semantic structures. Within the plethora of Open Information Extraction systems that focus on the use of syntactic and dependency parsing for the purpose of detecting relations, incoherent and uninformative relation extractions can still be found. The extracted...

A Review of Open Information Extraction Techniques

IJCI. International Journal of Computers and Information, 2019

Nowadays, massive amounts of data flow all the time. Approximately 20 to 30 percent of these data are text. This text is typically organized in semi-structured form, which cannot be used directly. To make use of such huge amounts of textual data, there is a need to detect, extract, and structure the information they convey in a fast and scalable manner. This can be done using Information Extraction techniques. However, information extraction is one of the main challenges in Natural Language Processing, and there are limitations to implementing it on large-scale data. Open Information Extraction (OIE) is an open-domain and relation-independent paradigm for performing information extraction in an unsupervised manner. This technique can lead to high-speed and scalable performance. A review of previous research proposals reveals that there are OIE experiments in different languages, such as English, Portuguese, Spanish, Vietnamese, Chinese, and German. This paper reviews the OIE techniques, compares their performance in some languages, and then integrates these results with the languages' complexity levels to reveal the relationship between the suitable model and the language complexity level.

Evaluating Various Linguistic Features on Semantic Relation Extraction

Extraction use different types of features to acquire semantically related terms from free text. These features may contain several kinds of linguistic knowledge: from orthographic or lexical to more complex features, like PoS tags or syntactic dependencies. In this paper we select four main types of linguistic features and evaluate their performance in a systematic way. Although combining some types of features improves the F-score of the extraction, we observed that, by adjusting the ratio of positive to negative training examples, we can build high-quality classifiers with just a single type of linguistic feature, based on generic lexico-syntactic patterns. Experiments were performed on the Portuguese version of Wikipedia.

Open Information Extraction: A Review of Baseline Techniques, Approaches, and Applications

arXiv (Cornell University), 2023

With the abundance of available online and offline text data, there arises a crucial need to extract the relations between phrases and summarize the main content of each document in a few words. For this purpose, there have been many recent studies in Open Information Extraction (OIE). OIE improves upon relation extraction techniques by analyzing relations across different domains and avoids the need to hand-label pre-specified relations in sentences. This paper surveys recent approaches to OIE and its applications in Knowledge Graphs (KG), text summarization, and Question Answering (QA). Moreover, the paper describes basic OIE methods in relation extraction. It briefly discusses the main approaches and the pros and cons of each method. Finally, it gives an overview of challenges, open issues, and future work opportunities for OIE, relation extraction, and OIE applications.