Mitigating linked data quality issues in knowledge-intense information extraction methods

Linked Data Quality Assessment: A Survey

2021

Data is of high quality if it is fit for its intended use in operations, decision-making, and planning. There is a colossal amount of linked data available on the web. However, it is difficult to understand how well linked data fits a given modeling task because of the defects present in the data. Faults that emerge in linked data spread far and wide, affecting all the services built on top of it. Addressing linked data quality deficiencies requires identifying quality problems, assessing quality, and refining the data to improve its quality. This study aims to identify existing end-to-end frameworks for the assessment and improvement of data quality. One important finding is that most of the work deals with only one of these aspects rather than a combined approach. Another finding is that most frameworks aim at solving problems related to DBpedia. Therefore, a standard scalable system is required that integrates the identification of quality issues, the assessment, and the improvement of linked data quality. This survey contributes to understanding the state of the art of data quality assessment and data quality improvement. An ontology-based solution is also proposed to build an end-to-end system that analyzes the root causes of quality violations.

Luzzu -- A Framework for Linked Data Quality Assessment

2016 IEEE Tenth International Conference on Semantic Computing (ICSC), 2016

The Web has meanwhile been complemented by a Web of Data. Examples are the Linked Open Data cloud, the RDFa and Microformats data increasingly being embedded in ordinary Web pages, and the schema.org initiative. However, the Web of Data shares many characteristics with the original Web of documents, for example, varying quality. There is a large variety of dimensions and measures of data quality. Hence, the assessment of quality in terms of fitness for use with respect to a certain use case is challenging. In this article, we present a comprehensive and extensible framework for the automatic assessment of linked data quality. Within this framework we implemented around 30 data quality metrics. A particular focus of our work is on scalability and support for the evolution of data. Regarding scalability, we follow a stream processing approach, which provides an easy interface for the integration of domain-specific quality measures. With regard to the evolution of data, we introduce data quality assessment as a stage of a holistic data life cycle.
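Luzzu itself is implemented in Java and defines its own metric interface, so the following is only a rough Python sketch of the stream-processing idea: a metric object consumes triples one at a time and reports a value at the end. The metric shown (ratio of subjects carrying an rdf:type statement), the file name, and the class design are illustrative assumptions, not Luzzu's actual API.

```python
# Minimal sketch of a streaming quality metric in the spirit of Luzzu.
# Hypothetical example; not Luzzu's actual Java metric interface.
from rdflib import Graph
from rdflib.namespace import RDF

class TypedSubjectRatio:
    """Ratio of distinct subjects that have at least one rdf:type triple."""
    def __init__(self):
        self.subjects = set()
        self.typed = set()

    def compute(self, s, p, o):
        self.subjects.add(s)
        if p == RDF.type:
            self.typed.add(s)

    def value(self):
        return len(self.typed) / len(self.subjects) if self.subjects else 0.0

metric = TypedSubjectRatio()
g = Graph()
g.parse("dataset.nt", format="nt")   # assumed local N-Triples dump
for s, p, o in g:                    # triple by triple, as in a stream
    metric.compute(s, p, o)
print("typed-subject ratio:", metric.value())
```

Keeping only counters and identifier sets per metric is what lets a single-pass, stream-oriented design of this kind scale to large dumps (a true stream processor would also avoid materialising the whole graph in memory, as rdflib does here).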

Linked Data Quality

2018

The widespread adoption of semantic web technologies such as RDF, SPARQL and OWL enables individuals to build their databases on the web, write vocabularies, and define rules to arrange and explain the relationships between data according to the Linked Data principles. As a consequence, a large amount of structured and interlinked data is being generated daily. A close examination of the quality of this data is critical, especially if important research and professional decisions depend on it. Several linked data quality metrics have been proposed, covering numerous dimensions of linked data quality such as completeness, consistency, conciseness and interlinking. In this work, we are interested in linked data quality dimensions, especially the completeness and conciseness of linked datasets. A set of experiments was conducted on a real-world dataset (DBpedia) to evaluate our proposed approaches.
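The abstract does not spell out the exact metrics, but a common way to operationalise property completeness is the fraction of class instances that carry a given property. The sketch below probes this for one example class/property pair on the public DBpedia endpoint using SPARQLWrapper; the endpoint, class and property choices are assumptions for illustration, not the paper's experimental setup.

```python
# Illustrative property-completeness probe against a SPARQL endpoint.
# Endpoint, class and property are example choices, not the paper's setup.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://dbpedia.org/sparql"

def count(query):
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return int(rows[0]["n"]["value"])

total = count("""
  SELECT (COUNT(DISTINCT ?s) AS ?n)
  WHERE { ?s a <http://dbpedia.org/ontology/City> }
""")
with_pop = count("""
  SELECT (COUNT(DISTINCT ?s) AS ?n)
  WHERE { ?s a <http://dbpedia.org/ontology/City> ;
             <http://dbpedia.org/ontology/populationTotal> ?p }
""")
print("populationTotal completeness for dbo:City:", with_pop / total)
```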

Mining and Leveraging Background Knowledge for Improving Named Entity Linking

Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics, 2018

Knowledge-rich Information Extraction (IE) methods aspire towards combining classical IE with background knowledge obtained from third-party resources. Linked Open Data repositories that encode billions of machine readable facts from sources such as Wikipedia play a pivotal role in this development. The recent growth of Linked Data adoption for Information Extraction tasks has shed light on many data quality issues in these data sources that seriously challenge their usefulness such as completeness, timeliness and semantic correctness. Information Extraction methods are, therefore, faced with problems such as name variance and type confusability. If multiple linked data sources are used in parallel, additional concerns regarding link stability and entity mappings emerge. This paper develops methods for integrating Linked Data into Named Entity Linking methods and addresses challenges in regard to mining knowledge from Linked Data, mitigating data quality issues, and adapting algorithms to leverage this knowledge. Finally, we apply these methods to Recognyze, a graph-based Named Entity Linking (NEL) system, and provide a comprehensive evaluation which compares its performance to other well-known NEL systems, demonstrating the impact of the suggested methods on its own entity linking performance.
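To make the name-variance point concrete, one typical way to mine such background knowledge is to collect surface forms for an entity from its rdfs:label and from the labels of pages redirecting to it. The snippet below is a hedged sketch of that idea against the public DBpedia endpoint; it is not Recognyze's actual implementation, and the chosen entity is only an example.

```python
# Hedged sketch: mining name variants for one entity from DBpedia labels
# and redirects (mitigates name variance); not Recognyze's actual code.
from SPARQLWrapper import SPARQLWrapper, JSON

def name_variants(entity_uri, endpoint="https://dbpedia.org/sparql"):
    sparql = SPARQLWrapper(endpoint)
    sparql.setReturnFormat(JSON)
    sparql.setQuery(f"""
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      PREFIX dbo:  <http://dbpedia.org/ontology/>
      SELECT DISTINCT ?label WHERE {{
        {{ <{entity_uri}> rdfs:label ?label }}
        UNION
        {{ ?alias dbo:wikiPageRedirects <{entity_uri}> ;
                  rdfs:label ?label }}
        FILTER (lang(?label) = "en")
      }}
    """)
    rows = sparql.query().convert()["results"]["bindings"]
    return {row["label"]["value"] for row in rows}

# Example call; the entity URI is just an illustration.
print(name_variants("http://dbpedia.org/resource/IBM"))
```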

Knowledge Obtention Combining Information Extraction Techniques with Linked Data

Today, we can find a vast amount of textual information stored in proprietary data stores. The experience of searching for information in these systems could be improved in a remarkable manner if we combined these private data stores with the information supplied by the Internet, merging both data sources to obtain new knowledge. In this paper, we propose an architecture with the goal of automatically obtaining knowledge about entities (e.g., persons, places, organizations, etc.) from a set of natural text documents, building smart data from raw data. We have tested the system in the context of the news archive of a real Media Group.
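As a purely hypothetical sketch of such an architecture (the tools, model and endpoint are assumptions, not the paper's implementation), one could detect entity mentions with an off-the-shelf NER model and attach candidate Linked Data URIs via a label lookup:

```python
# Hypothetical sketch of the raw-text -> smart-data step:
# spaCy finds mentions, a SPARQL label lookup attaches candidate URIs.
import spacy
from SPARQLWrapper import SPARQLWrapper, JSON

nlp = spacy.load("en_core_web_sm")          # assumed installed model
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)

def candidates(mention):
    sparql.setQuery(f"""
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      SELECT DISTINCT ?e WHERE {{
        ?e rdfs:label "{mention}"@en
      }} LIMIT 5
    """)
    rows = sparql.query().convert()["results"]["bindings"]
    return [row["e"]["value"] for row in rows]

doc = nlp("The European Central Bank raised rates in Frankfurt.")
for ent in doc.ents:
    if ent.label_ in {"PERSON", "ORG", "GPE"}:
        print(ent.text, "->", candidates(ent.text))
```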

Sieve: Linked Data quality assessment and fusion

2012

The Web of Linked Data grows rapidly and already contains data originating from hundreds of data sources. The quality of data from those sources is very diverse, as values may be out of date, incomplete or incorrect. Moreover, data sources may provide conflicting values for a single real-world object. In order for Linked Data applications to consume data from this global data space in an integrated fashion, a number of challenges have to be overcome. One of these challenges is to rate and to integrate data based on their quality. However, quality is a very subjective matter, and finding a canonical judgement that is suitable for each and every task is not feasible. To simplify the task of consuming high-quality data, we present Sieve, a framework for flexibly expressing quality assessment methods as well as fusion methods. Sieve is integrated into the Linked Data Integration Framework (LDIF), which handles Data Access, Schema Mapping and Identity Resolution, all crucial preliminaries for quality assessment and fusion. We demonstrate Sieve in a data integration scenario importing data from the English and Portuguese versions of DBpedia, and discuss how we increase completeness, conciseness and consistency through the use of our framework.
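Sieve expresses assessment and fusion declaratively inside LDIF, so the following Python toy is only an illustration of the underlying fusion idea: given conflicting property values from multiple sources and a per-source quality score (both invented for the example), keep the value from the highest-scored source.

```python
# Toy conflict-resolution step in the spirit of "keep best source" fusion;
# source scores and claims are invented illustrative values.
SOURCE_SCORE = {"dbpedia-en": 0.9, "dbpedia-pt": 0.7}  # hypothetical assessment output

claims = [
    # (entity, property, value, source) -- example conflicting claims
    ("dbr:Lisbon", "dbo:populationTotal", "505000", "dbpedia-en"),
    ("dbr:Lisbon", "dbo:populationTotal", "547000", "dbpedia-pt"),
]

fused = {}
for entity, prop, value, source in claims:
    key = (entity, prop)
    score = SOURCE_SCORE.get(source, 0.0)
    if key not in fused or score > fused[key][1]:
        fused[key] = (value, score)

for (entity, prop), (value, score) in fused.items():
    print(entity, prop, "=", value, f"(from source scored {score})")
```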

Linked Data Quality: Identifying and Tackling the Key Challenges

The awareness of quality issues in Linked Data is constantly rising as new datasets and applications that consume Linked Data are emerging. In this paper we summarize key problems of Linked Data quality that data consumers are facing and propose approaches to tackle these problems. The majority of challenges presented here have been collected in a Lightning Talk Session at the First Workshop on Linked Data Quality (LDQ2014).

Test-driven evaluation of linked data quality

Proceedings of the 23rd international conference on World wide web - WWW '14, 2014

Linked Open Data (LOD) comprises an unprecedented volume of structured data on the Web. However, these datasets are of varying quality, ranging from extensively curated datasets to crowd-sourced or extracted data of often relatively low quality. We present a methodology for test-driven quality assessment of Linked Data, which is inspired by test-driven software development. We argue that vocabularies, ontologies and knowledge bases should be accompanied by a number of test cases, which help to ensure a basic level of quality. We present a methodology for assessing the quality of linked data resources, based on a formalization of bad smells and data quality problems. Our formalization employs SPARQL query templates, which are instantiated into concrete quality test queries. Based on an extensive survey, we compile a comprehensive library of data quality test patterns. We perform automatic test instantiation based on schema constraints or semi-automatically enriched schemata and allow the user to generate specific test instantiations that are applicable to a schema or dataset. We provide an extensive evaluation of five LOD datasets, manual test instantiation for five schemas, and automatic test instantiations for all available schemata registered with LOV. One of the main advantages of our approach is that domain-specific semantics can be encoded in the data quality test cases, making it possible to discover data quality problems beyond conventional quality heuristics.
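A hedged sketch of the template idea follows: a generic pattern with placeholders is instantiated for a concrete property and datatype, then executed as a test query, with any returned rows counted as violations. The placeholder syntax, property and endpoint are illustrative assumptions, not the authors' actual pattern library.

```python
# Hedged sketch of test-driven quality checks: a SPARQL query template
# with placeholders is instantiated into a concrete test query.
from SPARQLWrapper import SPARQLWrapper, JSON

# Generic pattern: values of %%P%% that do not have datatype %%D%%.
PATTERN = """
  SELECT ?s ?v WHERE {
    ?s <%%P%%> ?v .
    FILTER (datatype(?v) != <%%D%%>)
  } LIMIT 100
"""

def instantiate(pattern, prop, dtype):
    return pattern.replace("%%P%%", prop).replace("%%D%%", dtype)

def run_test(endpoint, query):
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["results"]["bindings"]

test = instantiate(PATTERN,
                   "http://dbpedia.org/ontology/populationTotal",
                   "http://www.w3.org/2001/XMLSchema#nonNegativeInteger")
failures = run_test("https://dbpedia.org/sparql", test)
print(f"{len(failures)} violations found")
```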

"Linked data as background knowledge for information extraction on the web" by Ziqi Zhang, Anna Lisa Gentile and Isabelle Augenstein, with Martin Vesely as coordinator

ACM SIGWEB Newsletter, 2014

Information Extraction (IE) is the technique for transforming textual data into a structured representation that can be understood by machines. It is a crucial technique for enabling the Semantic Web and has seen increasing interest in recent years. This article reports recent progress in the LODIE project (Linked Open Data for Information Extraction), aimed at advancing Web IE to a new frontier by exploiting widely available, semantically annotated Linked Open Data as background knowledge. We cover the topics of wrapper induction, IE from semi-structured content such as tables and lists, and IE from free text. We describe new challenges in this research and the methods proposed to address them, together with summaries of recent evaluations showing encouraging results.