Data Integration Research Papers - Academia.edu
Data integration is the technique of merging data residing at different sources in different locations and providing users with an integrated, reconciled view of these data. Such a unified view is called a global or mediated schema; it represents the intensional level of the integrated and reconciled data. The data integration systems of interest in this paper are characterized by an architecture based on a global schema and a set of sources or source schemas. The objective of this paper is to provide a study of the theoretical aspects of data integration systems and to present a comprehensive review of the applications of data integration in various fields, including biomedicine, the environment, and social networks. It also discusses a privacy framework for protecting users' privacy with privacy views and privacy policies.
Data integration enables combining data from various data sources in a standard format. Internet of Things (IoT) applications use ontology approaches to provide a machine-understandable conceptualization of a domain. We propose a unified ontology schema approach to solve all IoT integration problems at once. The data unification layer maps data from different formats to data patterns based on the unified ontology model. This paper proposes a middleware consisting of an ontology-based approach that collects data from different devices. IoT middleware requires an additional semantic layer for cloud-based IoT platforms to build a schema for data generated from diverse sources. We tested the proposed model on real data consisting of approximately 160,000 readings from various sources in different formats such as CSV, JSON, raw text, and XML. The data were collected through the File Transfer Protocol (FTP) and generated 960,000 Resource Description Framework (RDF) triples. We evaluated the proposed approach by running different queries against SPARQL Protocol and RDF Query Language (SPARQL) endpoints on different machines to check query processing time, validation of integration, and performance of the unified ontology model. The average response times for query execution on the generated RDF triples on the three servers were approximately 0.144 seconds, 0.070 seconds, and 0.062 seconds, respectively.
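To make the unification step concrete, the following sketch (not the paper's actual middleware or ontology; the `iot` namespace and property names are invented for illustration) shows how a single reading, already parsed from CSV/JSON/XML into a dictionary, might be mapped to RDF triples with rdflib and then queried through SPARQL:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

IOT = Namespace("http://example.org/iot#")   # hypothetical namespace
g = Graph()
g.bind("iot", IOT)

# A reading already parsed from any source format (CSV, JSON, XML, raw) into a dict.
reading = {"sensor_id": "s42", "type": "temperature", "value": 21.7, "unit": "Cel"}

subject = IOT[f"reading/{reading['sensor_id']}/001"]
g.add((subject, RDF.type, IOT.Observation))
g.add((subject, IOT.observedProperty, Literal(reading["type"])))
g.add((subject, IOT.hasValue, Literal(reading["value"], datatype=XSD.double)))
g.add((subject, IOT.unit, Literal(reading["unit"])))

# Query the unified graph through SPARQL.
results = g.query("""
    PREFIX iot: <http://example.org/iot#>
    SELECT ?obs ?val WHERE {
        ?obs a iot:Observation ;
             iot:observedProperty "temperature" ;
             iot:hasValue ?val .
    }
""")
for row in results:
    print(row.obs, row.val)
```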
"Few landscapes change more rapid than the marine. Sandbanks, channels and even complete coastlines can change dramatically overnight. This is a threat not only for modern mariners, our seafaring forefathers knew this problem also all too... more
"Few landscapes change more rapid than the marine. Sandbanks, channels and even complete coastlines can change dramatically overnight. This is a threat not only for modern mariners, our seafaring forefathers knew this problem also all too well. With modern techniques we can monitor these changes and adapt our maps on a regular basis. These techniques not only provide saver shipping, they can also be used to find the wreck of unfortunate former mariners. How can this method be used to predict where wrecks can be found. And, if a wreck is found, is it possible to preserve it?
In order to get a full picture of possible wreck sites, we need to know what the underwater landscape was in various periods, and how it has changed over time.
Historical cartographic analysis can give an insight into the use, and sometimes into the morphology, of former landscapes. The problem is that it only provides qualitative information, i.e. descriptive data (map legends, interpretations, names, or remarks). Modern remote-sensing devices give purely quantitative data. In order to model changes in the landscape over time, the historical qualitative data should in some way be 'quantified' to make calculations possible. If the historical records provide quantitative data as well, they should somehow be extrapolated to be comparable with modern high-resolution data. This 'quantifying' of data can also be used for modern qualitative maps, such as soil type maps or land use maps.
This way historical data can be integrated with modern remote-sensing and survey techniques."
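As a rough illustration of the 'quantifying' step described above (the legend classes and depth ranges below are invented, not taken from any particular chart), qualitative legend entries can be mapped to approximate numeric ranges so they become comparable with modern bathymetric data:

```python
# Class-to-depth table is illustrative only; real values would come from
# source criticism of the historical chart in question.
LEGEND_TO_DEPTH_M = {
    "dry at low water": (0.0, 0.5),
    "shoal": (0.5, 2.0),
    "navigable channel": (2.0, 6.0),
    "deep water": (6.0, 20.0),
}

def quantify(legend_class: str) -> float:
    """Return the midpoint of the depth range assigned to a legend class."""
    low, high = LEGEND_TO_DEPTH_M[legend_class]
    return (low + high) / 2.0

historical_points = [("Bank A", "shoal"), ("Channel B", "navigable channel")]
for name, cls in historical_points:
    print(name, quantify(cls), "m (approx.)")
```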
We provide a novel framework based on a systematic treatment of data inconsistency and the related concept of data reliability in integrated databases. Our main contribution is the formalization of reliability assessment for historical data where redundancy and inconsistency are common. We discover data inconsistency through the analysis of relationships between existing reports in the integrated database. We present a new approach by defining properties (rules) that a good measure of reliability should satisfy. We then propose such measures and show which properties they satisfy. We also report on a simulation-based study of the introduced framework.
In the field of data science, integrating and analyzing data distributed across different data sources is necessary for decision-making and planning business strategies. Hence, record linkage is an important component of data integration and analytics for matching and linking records across various databases. Since databases contain personally identifying information and sensitive data about individuals, there is a need to protect privacy during record linkage. Thus, secure record linkage involves encoding records from multiple data sources and then performing matching on the encodings. Bloom filter encoding techniques were found to provide better privacy while performing approximate matching on erroneous records. Still, Bloom-filter-based secure record linkage techniques can suffer from re-identification attacks and may lead to an imbalance between linkage accuracy and privacy. This work includes an overview of secure record linkage and an implementation of recent attack methods on the EPPRL approach and on Basic and Balanced Bloom filters. Moreover, we provide recommendations to limit the vulnerability and re-identification of encoded identifiers in Bloom-filter-based secure record linkage.
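A minimal sketch of the Bloom filter encoding idea discussed above (not the EPPRL, Basic, or Balanced variants themselves; the filter size, number of hash functions, and q-gram scheme are illustrative assumptions): each party encodes a quasi-identifier locally and only the bit patterns are compared, so similar spellings still yield a high Dice similarity.

```python
import hashlib

def bigrams(value: str):
    """Split an identifier into character 2-grams, padded at the ends."""
    padded = f"_{value.lower()}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def bloom_encode(value: str, size: int = 1000, k: int = 10) -> set:
    """Encode a quasi-identifier into a set of Bloom-filter bit positions."""
    bits = set()
    for gram in bigrams(value):
        for i in range(k):
            digest = hashlib.sha256(f"{i}:{gram}".encode()).hexdigest()
            bits.add(int(digest, 16) % size)
    return bits

def dice(a: set, b: set) -> float:
    """Dice coefficient between two bit-position sets (approximate similarity)."""
    return 2 * len(a & b) / (len(a) + len(b))

# Two parties encode names locally and only exchange the encodings.
enc_a = bloom_encode("Catherine")
enc_b = bloom_encode("Katherine")    # a typo/variant still yields high similarity
print(round(dice(enc_a, enc_b), 2))
```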
When interoperating data sources that differ structurally and semantically, particular problems occur; for example, schema changes in the data sources will affect the integrated schema. In this paper, we propose the mediated integration architecture (MedInt), which employs mediation and wrapping techniques as the main components for the integration of heterogeneous systems. With MedInt, a mediator acts as an intermediate medium, transforming queries into sub-queries, integrating result data, and resolving conflicts. Wrappers then transform the sub-queries into specific local queries so that each local system is able to understand them.
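The following sketch illustrates the general mediator/wrapper pattern rather than the MedInt implementation itself; the wrapper classes and the naive duplicate-based conflict resolution are invented for illustration:

```python
class CsvWrapper:
    def __init__(self, rows):
        self.rows = rows                      # local "database": a list of dicts
    def query(self, field, value):
        return [r for r in self.rows if r.get(field) == value]

class SqlLikeWrapper:
    def __init__(self, rows):
        self.rows = rows
    def query(self, field, value):
        # a real wrapper would emit "SELECT * FROM t WHERE field = :value"
        return [r for r in self.rows if r.get(field) == value]

class Mediator:
    """Splits a global query into sub-queries, merges results, removes duplicates."""
    def __init__(self, wrappers):
        self.wrappers = wrappers
    def query(self, field, value):
        merged, seen = [], set()
        for wrapper in self.wrappers:         # fan out sub-queries
            for row in wrapper.query(field, value):
                key = tuple(sorted(row.items()))
                if key not in seen:           # naive conflict/duplicate resolution
                    seen.add(key)
                    merged.append(row)
        return merged

src1 = CsvWrapper([{"id": 1, "city": "Bangkok"}])
src2 = SqlLikeWrapper([{"id": 1, "city": "Bangkok"}, {"id": 2, "city": "Osaka"}])
print(Mediator([src1, src2]).query("city", "Bangkok"))
```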
Over time there have been different models of data integration to manage and analyze data. With the emergence of big data, the database community has proposed newer and better solutions to manage such disparate and large data, and changes in data storage models and massive data repositories on the web have encouraged the need for novel data integration models. In this article, we present various trends in integrating data through different models. We give a brief overview of Federated Database Systems, Data Warehouses, Mediators, and the newly proposed Polystore Systems, along with the evolution of architecture, query processing, distribution, automation, and the data models supported within those data integration models. The similarities and differences of these models are also presented, and the novelty of Polystore Systems is discussed with various examples. This article also highlights the importance of such systems for integrating large-scale heterogeneous data.
Abstract. Satellite data are used in several environmental applications, particularly in air quality supervision, climate change monitoring, and natural disaster prediction. However, remote sensing (RS) data occur in huge volumes, in near-real time, and are stored inside complex structures. We aim to prove that satellite data are big data (BD). Accordingly, we propose software that acts as an extract-transform-load (ETL) tool for satellite data preprocessing. We focused on the ingestion layer that will enable efficient RS BD integration. As a result, the developed software layer receives data continuously and removes ∼86% of the unused files. This layer also eliminates nearly 20% of erroneous datasets. Thanks to the proposed approach, we successfully reduced storage space consumption, enhanced RS data accuracy, and integrated preprocessed datasets into a Hadoop distributed file system.
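As a hedged illustration of what such an ingestion layer might do (the file extensions and validity checks below are assumptions, not the paper's actual filter rules), incoming files can be screened for usefulness and obvious errors before being pushed on to HDFS:

```python
import os
import xml.etree.ElementTree as ET

USEFUL_EXTENSIONS = {".nc", ".hdf", ".xml"}   # assumed set of downstream-relevant formats

def is_useful(path: str) -> bool:
    """Drop auxiliary files that are never used downstream."""
    return os.path.splitext(path)[1].lower() in USEFUL_EXTENSIONS

def is_valid_metadata(path: str) -> bool:
    """Reject obviously erroneous datasets, e.g. metadata files that do not parse."""
    if not path.lower().endswith(".xml"):
        return True
    try:
        ET.parse(path)
        return True
    except ET.ParseError:
        return False

def ingest(paths):
    """Yield only the files that should be pushed on to HDFS."""
    for p in paths:
        if is_useful(p) and is_valid_metadata(p):
            yield p

# usage sketch: hdfs_ready = list(ingest(os.listdir(download_dir)))
```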
There is increasing interest among organizations in advanced presentation and data analysis for public users. This paper shows how to integrate data from an enterprise data warehouse with a spatial data warehouse, publish them together to an online interactive map, and enable public users to perform analysis in a simple web interface. A Business Intelligence System for Investors is used as a case study, where data come from different sources and different levels, structured and unstructured. The approach has three phases: creating the spatial data warehouse, implementing the ETL (extract, transform, and load) procedure for data from different sources (spatial and non-spatial), and, finally, designing the interface for performing data analysis. The fact that this is a public site, where users are not known in advance and not trained, underlines the importance of usability design and a self-evident interface; investors are not willing to invest any time in learning the basics of a system. Geographic information providers ...
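A minimal sketch of the spatial/non-spatial integration step (the zone identifiers, attributes, and coordinates are invented): facts from the enterprise warehouse are joined with geometries from the spatial warehouse and serialized as GeoJSON, which a web map client can display directly:

```python
import json

# Non-spatial facts from the enterprise data warehouse (illustrative values).
facts = {"ZONE-01": {"available_lots": 12, "avg_price_eur": 35000}}

# Spatial records from the spatial data warehouse: zone id -> point geometry.
geometries = {"ZONE-01": {"type": "Point", "coordinates": [18.4131, 43.8563]}}

features = []
for zone_id, geometry in geometries.items():
    features.append({
        "type": "Feature",
        "geometry": geometry,
        "properties": {"zone": zone_id, **facts.get(zone_id, {})},
    })

geojson = {"type": "FeatureCollection", "features": features}
print(json.dumps(geojson, indent=2))   # ready to publish on an interactive web map
```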
Mixed methods is a youthful but increasingly robust methodological movement characterised by: a growing body of trans-disciplinary literature; prominent research methodologists/authorities; the emergence of mixed method specific journals, research texts, and courses; a growth in popularity amongst research funding bodies. Mixed methods is being utilised and reported within business and management fields, despite the quantitative traditions attached to certain business and management disciplines. This paper has utilised a multistrand conversion mixed model research design to undertake a retrospective content analysis of refereed papers (n = 281) from the 21st Australian and New Zealand Academy of Management (ANZAM) Conference 2007. The aim of the study is to provide a methodological map of the management research reported at the conference, and in particular the use, quality and acceptance level of mixed methods research within business and management fields. Implications for further...
The topic of data integration from external data sources or independent IT systems has recently received increasing attention in IT departments as well as at management level, in particular concerning data integration in federated database systems. An example of the latter are commercial research information systems (RIS), which regularly import, cleanse, transform, and prepare an institution's research information from a variety of databases for analysis. In addition, all of these steps must be performed at an assured level of quality. As several internal and external data sources are loaded for integration into the RIS, ensuring information quality is becoming increasingly challenging for research institutions. Before research information is transferred to a RIS, it must be checked and cleaned up. An important factor for successful and competent data integration is therefore always data quality. The removal of data errors (such as duplicates, inconsistent data, and outdated data) and the harmonization of the data structure are essential tasks of data integration using extract, transform, and load (ETL) processes: data are extracted from the source systems, transformed, and loaded into the RIS. At this point, conflicts between different data sources are controlled and resolved, and data quality issues arising during data integration are eliminated. Against this background, our paper presents the process of data transformation in the context of RIS, which gives an overview of the quality of research information in an institution's internal and external data sources during its integration into the RIS. In addition, the question of how to control and improve quality issues during the integration process in RIS will be addressed.
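To illustrate the kind of ETL-time cleansing described above (a generic pandas sketch with invented records, not the authors' RIS pipeline), duplicates from overlapping sources can be removed, keys harmonized, and incomplete rows flagged before loading:

```python
import pandas as pd

# Research information pulled from two source systems (values are invented).
source_a = pd.DataFrame([
    {"doi": "10.1000/X1", "title": "Data Quality in RIS ", "year": "2019"},
    {"doi": "10.1000/X2", "title": "ETL for Research Data", "year": None},
])
source_b = pd.DataFrame([
    {"doi": "10.1000/x1", "title": "Data quality in RIS", "year": "2019"},
])

records = pd.concat([source_a, source_b], ignore_index=True)

# Harmonize the structure: normalize keys and trim whitespace.
records["doi"] = records["doi"].str.lower().str.strip()
records["title"] = records["title"].str.strip()

# Remove duplicates introduced by overlapping sources.
records = records.drop_duplicates(subset="doi", keep="first")

# Flag incomplete rows instead of silently loading them.
records["complete"] = records["year"].notna()
print(records)
```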
ABSTRACT An avian monitoring system is developed based on data fusion of a thermal/infrared (IR) camera and marine radar. First, the data were processed separately using video/image processing and radar signal processing techniques, and features of the targets were obtained from each sensor. Data fusion of radar and IR is then implemented to obtain feature vectors of the targets. The IR camera provides the coordinate information of the targets as well as features such as the flight's straightness index, direction, and the target's heat, while the radar provides altitude information (z-coordinates), which the IR camera cannot. The data fusion of IR and radar therefore provides more detailed and reliable information about the avian targets and their activity. Data were collected near Lake Erie in Ohio during the 2011 spring and fall migration periods. Data analysis was performed in accordance with the needs of wildlife biologists.
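A minimal sketch of the fusion step (not the authors' implementation; the field names and the time-matching tolerance are assumptions): IR detections supply position, straightness, direction, and heat, the radar supplies altitude, and detections are paired by nearest timestamp to build one feature vector per target:

```python
def fuse(ir_tracks, radar_tracks, max_dt=0.5):
    """Pair IR and radar detections by closest timestamp and merge their features."""
    fused = []
    for ir in ir_tracks:
        nearest = min(radar_tracks, key=lambda r: abs(r["t"] - ir["t"]))
        if abs(nearest["t"] - ir["t"]) <= max_dt:
            fused.append({
                "t": ir["t"], "x": ir["x"], "y": ir["y"],
                "z": nearest["altitude_m"],            # only the radar sees altitude
                "straightness": ir["straightness"],
                "direction_deg": ir["direction_deg"],
                "heat": ir["heat"],
            })
    return fused

ir = [{"t": 10.0, "x": 120, "y": 45, "straightness": 0.93, "direction_deg": 212, "heat": 0.7}]
radar = [{"t": 10.2, "altitude_m": 310}]
print(fuse(ir, radar))
```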
The increasing complexity and costs of modern production processes make it necessary to plan processes virtually before they are tested and realized in real environments. Therefore, several tools facilitating the simulation of different production techniques and design domains have been developed. On the one hand, there are specialized tools simulating specific production techniques with exactness close to the real object of the simulation. On the other hand, there are simulations that cover whole production processes but in general do not achieve prediction accuracy comparable to such specialized tools. Hence, interconnecting tools is the only way forward, because otherwise the achievable prediction accuracy would be insufficient. In this chapter, a framework is presented that helps to interconnect heterogeneous simulation tools, taking into account their incompatible file formats, different data semantics, and missing data consistency.
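As a hedged sketch of the interconnection idea (the tools, units, and fields are invented, not the framework presented in the chapter), each simulation tool gets an adapter that translates its native format into a shared intermediate model with fixed semantics, so the output of one tool can feed the input of another:

```python
from dataclasses import dataclass

@dataclass
class CommonState:
    part_id: str
    temperature_c: float     # common semantics: always Celsius
    thickness_mm: float      # common semantics: always millimetres

def from_forming_tool(record: dict) -> CommonState:
    """Adapter for a (hypothetical) forming simulation that reports Kelvin and metres."""
    return CommonState(
        part_id=record["id"],
        temperature_c=record["temp_K"] - 273.15,
        thickness_mm=record["thickness_m"] * 1000.0,
    )

def to_welding_tool(state: CommonState) -> dict:
    """Adapter producing the input expected by a (hypothetical) welding simulation."""
    return {"part": state.part_id, "T0": state.temperature_c, "t_mm": state.thickness_mm}

print(to_welding_tool(from_forming_tool({"id": "P7", "temp_K": 293.15, "thickness_m": 0.0021})))
```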
In this new era of technology, information and communication technology has advanced greatly compared to past decades. The emergence of the Internet and networking brings many benefits to mankind. Technology keeps improving as more research is done, and we can see the evolution of our world from an agricultural society to an industrial society. Different kinds of digital products have been invented during this evolution: digital images, MP3s, and videos are widely used everywhere in the world. Image capture tools have evolved from the film camera to the digital camera, and film capture is no longer widely used. The digital camera is commonly used to capture pictures: captured images are converted into pixel form and saved as digital signals. Users can upload their images to a computer or to the Internet. However, uploaded images are widely spread and copied by other Internet users. This may cause a serious copyright problem because the original owner of the image cannot prove ownership. A digital rights management technique using a watermarking system is introduced in this report to address this problem. The DWT watermarking system embeds certain information into a digital image using discrete wavelet transform (DWT) decomposition. The owner can use the extraction method to demonstrate ownership if his/her images are copied or stolen. The watermarking system contains two major processes: the embedding process and the extraction process. The watermark used will be an image with meaningful text inside.
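The embedding/extraction pair described above can be illustrated with a short, non-blind DWT sketch (the Haar wavelet, embedding strength, and threshold below are assumptions, not the report's exact parameters): watermark bits are added to the approximation sub-band of a one-level DWT, and extraction compares a suspect image's sub-band against the owner's original.

```python
import numpy as np
import pywt

ALPHA = 8.0                                   # embedding strength (assumed value)

def embed(cover: np.ndarray, watermark_bits: np.ndarray) -> np.ndarray:
    """Add watermark bits to the approximation sub-band of a one-level Haar DWT."""
    cA, (cH, cV, cD) = pywt.dwt2(cover.astype(float), "haar")
    cA_marked = cA + ALPHA * watermark_bits   # watermark must match cA's shape
    return pywt.idwt2((cA_marked, (cH, cV, cD)), "haar")

def extract(suspect: np.ndarray, original: np.ndarray) -> np.ndarray:
    """Non-blind extraction: compare sub-bands of the suspect and original images."""
    cA_suspect, _ = pywt.dwt2(suspect.astype(float), "haar")
    cA_original, _ = pywt.dwt2(original.astype(float), "haar")
    return ((cA_suspect - cA_original) / ALPHA > 0.5).astype(int)

cover = np.random.randint(0, 256, (8, 8))
bits = np.random.randint(0, 2, (4, 4))        # e.g. a rasterized ownership text
marked = embed(cover, bits)
print(np.array_equal(extract(marked, cover), bits))   # True if the bits survived
```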
Abstract. With the proliferation of international standards for grid-enabled databases, the need for data loading and data mapping in a large integrated environment of heterogeneous databases highlights issues of consistency and integrity. We discuss methods for providing semi-...
This paper describes an approach introducing location intelligence using open-source software components as the solution for planning and construction of airport infrastructure. As a case study, the spatial information system of the International Airport in Sarajevo is selected. Due to the frequent construction work on new terminals and the increase of existing airport capacities, and as one of the measures for more efficient management of airport infrastructure, the development team suggested that airport management introduce location intelligence, i.e. upgrade the existing information system with a functional WebGIS solution. This solution is based on the OpenGeo architecture, which includes a set of spatial data management technologies used to create an online map and build a location intelligence infrastructure.
Cyber-Physical Systems (CPS) cover everything from M2M and Internet of Things (IoT) communications to heterogeneous data integration from multiple sources, security/privacy, and integration into cloud computing and Big Data platforms. The integration of Big Data into CPS solutions presents several challenges and opportunities. Big Data for CPS cannot be addressed with conventional solutions based on offline or batch processing alone. The interconnection with the real world, in industrial and critical environments, requires reaction in real time. Therefore, real time will be a vertical requirement from communication to Big Data analytics. Big Data for CPS requires, on the one hand, real-time stream processing for real-time control, and on the other hand, batch processing for modeling and behavior learning. This paper describes the existing solutions and the pending challenges, providing some guidelines to address them.
The International Journal of Web & Semantic Technology (IJWesT) is a quarterly open-access, peer-reviewed journal that provides an excellent international forum for sharing knowledge and results in the theory, methodology, and applications of web and semantic technology. The growth of the World Wide Web today is simply phenomenal. It continues to grow rapidly, and new technologies and applications are being developed to support end users' modern life. Semantic technologies are designed to extend the capabilities of information on the Web and in enterprise databases so that it can be networked in meaningful ways. The Semantic Web is emerging as a core discipline in Computer Science & Engineering, drawing on distributed computing, web engineering, databases, social networks, multimedia, information systems, artificial intelligence, natural language processing, soft computing, and human-computer interaction. Standards such as XML, the Resource Description Framework, and the Web Ontology Language serve as foundation technologies for advancing the adoption of semantic technologies.
A Big Data Analytics platform (BDA) with Hadoop/MapReduce technologies distributed over HBase (key-value NoSQL database storage), used to generate hospitalization metadata, was established for testing functionality and performance. Performance tests retrieved results from simulated patient records with Apache tools in Hadoop's ecosystem. At optimized iteration, Hadoop distributed file system (HDFS) ingestion with HBase exhibited sustained database integrity over hundreds of iterations; however, completing its bulk loading via MapReduce to HBase required a month. Over the generated HBase data files, the framework took a week and a month for one billion (10 TB) and three billion (30 TB) records, respectively. Apache Spark and Apache Drill showed high performance; however, inconsistencies of MapReduce limited the capacity to generate data. A hospital system based on a patient encounter-centric database was very difficult to establish because the data profiles have complex relationships. Key-value storage should be considered for healthcare when analyzing large volumes of data over simplified clinical event models.
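As a hedged illustration of the encounter-centric, key-value design the abstract recommends (the table name, column family, and row-key layout are invented, and a local HBase Thrift server is assumed), rows can be keyed by patient id plus a reversed timestamp so one patient's events cluster together and can be fetched with a single prefix scan:

```python
import happybase

# Assumes an HBase Thrift server on localhost and a pre-created table
# 'encounters' with column family 'ev' (both assumptions for illustration).
connection = happybase.Connection("localhost")
table = connection.table("encounters")

# Row key: patient id + reversed timestamp, so recent events sort first.
row_key = b"patient123|" + str(2**63 - 1696118400).encode()
table.put(row_key, {
    b"ev:type": b"admission",
    b"ev:diagnosis": b"J18.9",
    b"ev:facility": b"GH-04",
})

# Retrieve all events for one patient with a single prefix scan.
for key, data in table.scan(row_prefix=b"patient123|"):
    print(key, data)
```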
A federated database system is a software abstraction layer that makes it possible to manage a collection of component database systems as if they were a single source.
The research undertaken in this work covers a review of the literature on federated database theory and the components involved. Based on this review, a prototype is analyzed and built to demonstrate the technology. Finally, the paradigm is implemented in a case study.
The aim is to present a clear perspective on how a federated database can be implemented and the sets of techniques available for this purpose.
The case study is implemented as a federated database system for the management and integration of several heterogeneous data sources dispersed across a governmental organization, so that all of that information can be fully integrated according to certain parameters while, at the same time, allowing future data sources to be incorporated.
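A minimal sketch of the federated idea (not the prototype built in this work; the file names and schema are invented): two independent SQLite databases are attached to one connection and exposed through a single view, so a client queries one logical source:

```python
import sqlite3

# Create two independent component databases with the same local schema.
for name in ("registry_a.db", "registry_b.db"):
    db = sqlite3.connect(name)
    db.execute("CREATE TABLE IF NOT EXISTS citizens (id INTEGER, name TEXT)")
    db.execute("INSERT INTO citizens VALUES (?, ?)", (1, f"sample from {name}"))
    db.commit()
    db.close()

# The "federation layer": one connection, both sources attached, one unified view.
fed = sqlite3.connect("registry_a.db")
fed.execute("ATTACH DATABASE 'registry_b.db' AS b")
fed.execute("""
    CREATE TEMP VIEW all_citizens AS
    SELECT 'a' AS source, id, name FROM main.citizens
    UNION ALL
    SELECT 'b' AS source, id, name FROM b.citizens
""")
for row in fed.execute("SELECT * FROM all_citizens"):
    print(row)
```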
This article presents the background to and prospects for a new initiative in archaeological field survey and database integration. The Roman Hinterland Project combines data from the Tiber Valley Project, Roman Suburbium Project, and the Pontine Region Project into a single database, which the authors believe to be one of the most complete repositories of data for the hinterland of a major ancient metropolis, covering nearly 2000 years of history. The logic of combining these databases in the context of studying the Roman landscape is explained and illustrated with analyses that show their capacity to contribute to major debates in Roman economy, demography, and the longue durée of the human condition in a globalizing world.
Data integration: how to install Power Designer