Evaluating Data Quality for Integration of Data Sources

Determining the Quality of Product Data Integration

Lecture Notes in Computer Science, 2015

To meet customer demands, companies must manage numerous variants and versions of their products. Since product-related data (e.g., requirements specifications, geometric models, source code, or test cases) are usually scattered over a large number of heterogeneous, autonomous information systems, their integration becomes crucial when developing complex products on the one hand and aiming at reduced development costs on the other. In general, product data are created in different stages of the product development process. Furthermore, they should be integrated in a complete and consistent way at certain milestones during product development (e.g., prototype construction). Usually, this data integration process is accomplished manually, which is both costly and error-prone. Instead, semi-automated product data integration is required that meets the data quality requirements of the various stages of product development. In turn, this necessitates close monitoring of the progress of the data integration process based on proper metrics. Contemporary approaches focus solely on metrics assessing schema integration and do not measure the quality and progress of data integration. This paper elicits fundamental requirements relevant in this context. Based on them, we develop appropriate metrics for measuring product data quality and apply them in a case study we conducted at an automotive original equipment manufacturer.

Anatomy of data integration

Journal of Biomedical Informatics, 2007

Producing reliable information is the ultimate goal of data processing. The ocean of data created by advances in science and technology calls for the integration of data coming from heterogeneous sources that are diverse in their purposes, business rules, underlying models and enabling technologies. Reference models, the Semantic Web, standards, ontologies, and other technologies enable fast and efficient merging of heterogeneous data, while the reliability of the produced information is largely defined by how well the data represent reality. In this paper we initiate a framework for assessing the informational value of data that includes data dimensions; aligning data quality with business practices; identifying authoritative sources and integration keys; merging models; and uniting updates of varying frequency and overlapping or gapped data sets.

A framework for quality evaluation in data integration systems

irisa.fr

Ensuring and maximizing the quality and integrity of information is a crucial process for today's enterprise information systems (EIS). It requires a clear understanding of the interdependencies between the dimensions characterizing the quality of data (QoD), the quality of the conceptual data model (QoM) of the database, which is the keystone of the EIS, and the quality of data management and integration processes (QoP). The improvement of one quality dimension (such as data accuracy or model expressiveness) may have negative consequences on other quality dimensions (e.g., freshness or completeness of data). In this paper we briefly present a framework, called QUADRIS, for adopting a quality improvement strategy on one or many dimensions of QoD or QoM while considering the collateral effects on the other interdependent quality dimensions. We also present the scenarios of our ongoing validations on a CRM EIS.

Data Integration Schema Analysis: An Approach With Information Quality

Iq, 2007

Integrated access to distributed data is an important problem faced in many scientific and commercial applications. A data integration system provides a unified view for users to submit queries over multiple autonomous data sources. The queries are processed over a global schema that offers an integrated view of the data sources. Much work has been done on query processing and on choosing plans under cost criteria. However, much less is known about incorporating Information Quality analysis into data integration systems, particularly in the integrated schema. In this work we present an approach to Information Quality analysis of schemas in data integration environments. We discuss the evaluation of schema quality, focusing on minimality and consistency aspects, and define some schema transformations to be applied in order to improve schema generation and, consequently, the quality of data integration query execution.

A framework for data quality evaluation in a data integration system

19º Simposio Brasileiro …, 2004

To satisfy complex user requirements, information systems need to integrate data from several, possibly autonomous, data sources. One challenge in such an environment is to provide users with data that meet their quality requirements. These requirements are difficult to satisfy because of the strong heterogeneity of the sources. In this paper we address the problem of data quality evaluation in data integration systems. We present a framework which is a first attempt to formalize the evaluation of data quality. It is based on a graph model of the data integration system which allows us to define evaluation methods and demonstrate propositions in terms of graph properties. To illustrate our approach, we also present a first experiment with the data freshness quality factor and we show how the framework is used to evaluate this factor according to different scenarios.
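As an illustration of evaluating freshness over a graph model, the following is a minimal sketch and not the framework from the paper. It assumes (hypothetically) that each source node carries the age of its data in hours, each activity node carries a processing delay, and the freshness of a node is its own cost plus the worst freshness among its inputs. All node names and numbers are invented for the example.

```python
# Hypothetical integration graph: each node lists the nodes it reads from.
# Nodes with no predecessors are data sources.
PREDECESSORS = {
    "orders_db": [],
    "crm_export": [],
    "clean_orders": ["orders_db"],
    "join_customers": ["clean_orders", "crm_export"],
    "report_query": ["join_customers"],
}

# Assumed cost per node, in hours: data age for sources,
# processing delay for activities.
COST_HOURS = {
    "orders_db": 1.0,
    "crm_export": 24.0,
    "clean_orders": 0.5,
    "join_customers": 2.0,
    "report_query": 0.0,
}


def freshness(node, predecessors, cost_hours):
    """Return the worst-case age (hours) of the data delivered at `node`.

    The graph is assumed to be acyclic.
    """
    memo = {}

    def visit(current):
        if current not in memo:
            inputs = [visit(p) for p in predecessors[current]]
            memo[current] = cost_hours[current] + (max(inputs) if inputs else 0.0)
        return memo[current]

    return visit(node)


if __name__ == "__main__":
    print(freshness("report_query", PREDECESSORS, COST_HOURS))
```

Under these assumptions the slowest path dominates: the report is only as fresh as its stalest input plus the delays accumulated along the way.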

Conceptual modeling for data integration

2009

The goal of data integration is to provide uniform access to a set of heterogeneous data sources, freeing the user from needing to know where the data are, how they are stored, and how they can be accessed. One of the outcomes of the research carried out on data integration in recent years is a clear architecture, comprising a global schema, the source schema, and the mapping between the source and the global schema.
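The three-part architecture (global schema, source schema, mapping) can be sketched in a few lines. This is a toy example under stated assumptions: the sources, column names, and the choice to define the global relation as a view over the sources are all hypothetical and are not taken from the chapter.

```python
# Hypothetical sources: each one names its columns differently.
HR_SOURCE = [
    {"emp_id": 1, "full_name": "Ada Lovelace"},
    {"emp_id": 2, "full_name": "Alan Turing"},
]
PAYROLL_SOURCE = [
    {"id": 3, "name": "Grace Hopper", "dept": "R&D"},
]
SOURCES = {"hr": HR_SOURCE, "payroll": PAYROLL_SOURCE}

# Global schema (assumed for illustration): person(id, name, department).
GLOBAL_COLUMNS = ("id", "name", "department")

# Mapping: one translation per source, from a source record to a record of
# the global schema. Attributes a source does not provide become None.
MAPPING = {
    "hr": lambda r: {"id": r["emp_id"], "name": r["full_name"], "department": None},
    "payroll": lambda r: {"id": r["id"], "name": r["name"], "department": r["dept"]},
}


def global_view(sources, mapping):
    """Build the global relation by applying each source's mapping."""
    return [
        mapping[name](record)
        for name, records in sources.items()
        for record in records
    ]


def query(predicate, sources=SOURCES, mapping=MAPPING):
    """Answer a query phrased only in terms of the global schema."""
    return [row for row in global_view(sources, mapping) if predicate(row)]


if __name__ == "__main__":
    for row in query(lambda r: r["id"] >= 2):
        print(row)
```

The caller of `query` never refers to `emp_id` or `dept`; only the mapping knows how each source stores its data.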

Coping with Data Inconsistencies in the Integration of Heterogenous Data Sources

Global Journal of Computer Science and Technology, 2023

This research examines the problem of inconsistent data when integrating information from multiple sources into a unified view. Data inconsistencies undermine the ability to provide meaningful query responses based on the integrated data. The study reviews current techniques for handling inconsistent data including domain-specific data cleaning and declarative methods that provide answers despite integrity violations. A key challenge identified is modeling data consistency and ensuring clean integrated data. Data integration systems based on a global schema must carefully map heterogeneous sources to that schema. However, dependencies in the integrated data can prevent attaining consistency due to issues like conflicting facts from different sources. The research summarizes various proposed approaches for resolving inconsistencies through data cleaning, integrity constraints, and dependency mapping techniques. However, outstanding challenges remain regarding accuracy, availability, timeliness, and other data quality restrictions of autonomous sources.
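To make the notion of conflicting facts concrete, here is a small sketch that assumes a key constraint on the integrated data: each key value should map to a single attribute value. The source names, records, and the `find_conflicts` helper are hypothetical; the code only detects violations and does not implement any of the repair or query-answering techniques the study reviews.

```python
from collections import defaultdict

# Hypothetical sources that have already been mapped to a common schema.
SOURCES = {
    "hr": [{"id": 1, "city": "Ulm"}, {"id": 3, "city": "Kiel"}],
    "crm": [{"id": 1, "city": "Bonn"}, {"id": 3, "city": "Kiel"}],
}


def find_conflicts(sources, key, attribute):
    """Return key values for which sources report different attribute values.

    The result maps each conflicting key to {source name: reported value}.
    """
    reported = defaultdict(dict)
    for name, records in sources.items():
        for record in records:
            reported[record[key]][name] = record[attribute]
    return {
        k: by_source
        for k, by_source in reported.items()
        if len(set(by_source.values())) > 1
    }


if __name__ == "__main__":
    print(find_conflicts(SOURCES, key="id", attribute="city"))
```

Keeping the source name next to each conflicting value leaves room for a later resolution policy, for example preferring a source designated as authoritative.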

Data integration: A theoretical perspective

2002

Data integration is the problem of combining data residing at different sources, and providing the user with a unified view of these data. The problem of designing data integration systems is important in current real-world applications, and is characterized by a number of issues that are interesting from a theoretical point of view. This document presents an overview of the material to be presented in a tutorial on data integration.

Subjective Information Quality in Data Integration

Advances in business strategy and competitive advantage book series, 2014

This chapter focuses on the science of human perception of information quality and describes a subset of Information Quality (IQ) dimensions, which are termed Subjective Information Quality (SIQ). These dimensions typically require a user's opinion and do not have a clear mathematical technique for finding their value. Note that most dimensions can be measured through multiple techniques, but the SIQ ones are most useful when the user's experience, opinion, or performance is accounted for. This chapter explores SIQ while considering information obtained from multiple sources, which is a common occurrence when employing visualizations to perform business or intelligence analytics. Thus, the issues addressed here are the assessment of subjective perception of quality of data shown through visual means and principles on how to estimate the subjective quality of combined information sources.

Value-Added | Can increase the value of data | The user can judge or assess the value added to the data
Objectivity | Formulas applied | User opinion
Timeliness | Can reflect how up-to-date the data is with respect to the task | User judges based on previous experience
Understandability | Can provide clear and simple data | User can understand the data easily
Concise Representation | The shortest representation is known | User judges based on previous experience
Appropriate Amount of Data | The needed amount is known | User expertise is required
Security | Against a standard metric | User's experience or performance with the data
Accessibility | Against a standard metric | Based on user's experiences
Consistent Representation | Count different representations | User's opinion
Accuracy | Formula based on known, exact value | Expert estimation when exact value not available
Completeness | Count missing values in structured sources | User's opinion for unstructured text
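Two of the countable measures named above, counting missing values for Completeness and counting different representations for Consistent Representation, can be sketched as follows. This is an illustrative reading, not the chapter's method: the helper names, the sample records, and the idea of reducing each value to a character "shape" are assumptions made for the example.

```python
# Hypothetical structured records; None or "" marks a missing value.
RECORDS = [
    {"name": "Ada", "joined": "2015-01-02"},
    {"name": "", "joined": "02/01/2015"},
    {"name": "Grace", "joined": "2014-11-30"},
]


def count_missing(records, fields):
    """Count cells that are absent, None, or an empty string."""
    return sum(
        1 for record in records for field in fields
        if record.get(field) in (None, "")
    )


def completeness(records, fields):
    """Share of filled cells; defined as 1.0 when there are no cells."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    return (total - count_missing(records, fields)) / total


def shape(value):
    """Reduce a string to its pattern: digits become 9, letters become A."""
    return "".join(
        "9" if ch.isdigit() else "A" if ch.isalpha() else ch for ch in value
    )


def representations(records, field):
    """Return the set of distinct shapes used for the non-missing values."""
    return {shape(r[field]) for r in records if r.get(field) not in (None, "")}


if __name__ == "__main__":
    print(count_missing(RECORDS, ["name", "joined"]))
    print(completeness(RECORDS, ["name", "joined"]))
    print(sorted(representations(RECORDS, "joined")))
```

A field with more than one shape, such as the two date formats in the sample, is a candidate for inconsistent representation; whether that matters for the task remains the subjective judgment the chapter describes.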