Just-In-Time Modeling with DataMingler

Data Virtual Machines: Data-Driven Conceptual Modeling of Big Data Infrastructures

2020

In this paper we introduce the concept of Data Virtual Machines (DVM), a graph-based conceptual model of the data infrastructure of an organization, much like the traditional Entity-Relationship Model (ER). However, while ER uses a top-down approach, in which real-world entities and their relationships are depicted and utilized in the production of a relational representation, DVMs are based on a bottom-up approach, mapping the data infrastructure of an organization to a graph-based model. With the term “data infrastructure” we refer not only to data persistently stored in data management systems adhering to some data model, but also to generic data processing tasks that produce an output useful in decision making. For example, a Python program that “does something” and computes for each customer her probability to churn is an essential component of the organization’s data landscape and has to be made available to the user, e.g. a data scientist, in an easy to understand and intuiti...

Predicting an Optimal Virtual Data Model for Uniform Access to Large Heterogeneous Data

Data Intelligence

The growth of generated data in the industry requires new efficient big data integration approaches for uniform data access by end-users to perform better business operations. Data virtualization systems, including Ontology-Based Data Access (OBDA), query data on-the-fly against the original data sources without any prior data materialization. Existing approaches by design use a fixed model, e.g., TABULAR, as the only Virtual Data Model - a uniform schema built on-the-fly to load, transform, and join relevant data. However, other data models, such as GRAPH or DOCUMENT, are more flexible and thus can be more suitable for some common types of queries, such as join or nested queries. Those queries are hard to predict because they depend on many criteria, such as query plan, data model, data size, and operations. To address the problem of selecting the optimal virtual data model for queries on large datasets, we present a new approach that (1) builds on the principle of OBDA to query and jo...

Big Data in Smart City: Management Challenges

Applied Sciences

Smart cities use digital technologies such as cloud computing, Internet of Things, or open data in order to overcome limitations of traditional representation and exchange of geospatial data. This concept ensures a significant increase in the use of data to establish new services that contribute to better sustainable development and monitoring of all phenomena that occur in urban areas. The use of modern geoinformation technologies, such as sensors for collecting different geospatial and related data, requires adequate storage options for further data analysis. In this paper, we suggest the biG dAta sMart cIty maNagEment SyStem (GAMINESS) that is based on the Apache Spark big data framework. The model of the GAMINESS management system is based on the principles of big data modeling, which differs greatly from standard databases. This approach provides the ability to store and manage huge amounts of structured, semi-structured, and unstructured data in real time. System perfo...

Evaluation of XPath Queries Over XML Documents Using SparkSQL Framework

Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation

I would like to express sincere thanks to my supervisor Ing. Adam Šenk for his helpful advice and comments that helped me to finish this master's thesis. I would also like to thank Prof. Dr. Wolfgang Benn and Johannes Fliege from Technische Universität Chemnitz, Faculty of Computer Science, for all their help and the opportunity to work on my master's thesis abroad at the university in Chemnitz. Last but not least, I would like to express my heartfelt gratitude to my parents, all my family and friends for their support not only during the work on the thesis, but during my whole university study.

Rumble: data independence when data is in a mess

ArXiv, 2019

This paper introduces Rumble, an engine that executes JSONiq queries on large, heterogeneous and nested collections of JSON objects, leveraging the parallel capabilities of Spark so as to provide a high degree of data independence. The design is based on two key insights: (i) how to map JSONiq expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR clauses to Spark SQL on DataFrames. We have developed a working implementation of these mappings showing that JSONiq can efficiently run on Spark to query billions of objects into, at least, the TB range. The JSONiq code is concise in comparison to Spark's host languages while seamlessly supporting the nested, heterogeneous datasets that Spark SQL does not. The ability to process this kind of commonly found input is paramount for data cleaning and curation. The experimental analysis indicates that there is no excessive performance loss, occasionally even a gain, over Spark SQL for structured data, and a performa...

Querying Heterogeneous Data in an In-situ Unified Agile System

2018

Data integration provides a unified view of data by combining different data sources. In today’s multi-disciplinary and collaborative research environments, data is often produced and consumed by various means; multiple researchers in different divisions operate on the data to satisfy various research requirements, using different query processors and analysis tools. This makes data integration a crucial component of any successful data-intensive research activity. The fundamental difficulty is that data is heterogeneous not only in syntax, structure, and semantics, but also in the way it is accessed and queried. We introduce QUIS (QUery In-Situ), an agile query system equipped with a unified query language and a federated execution engine. It is capable of running queries on heterogeneous data sources in an in-situ manner. Its language provides advanced features such as virtual schemas, heterogeneous joins, and polymorphic result set representation. QUIS utilizes a federation o...

Large Scale Querying and Processing for Property Graphs

2020

Recently, large-scale graph data management, querying and processing have experienced a renaissance in several timely application domains (e.g., social networks, bibliographical networks and knowledge graphs). However, these applications still introduce new challenges with large-scale graph processing. Therefore, we have witnessed a remarkable growth in the prevalence of work on graph processing in both academia and industry. Querying and processing large graphs is an interesting and challenging task. Several centralized/distributed large-scale graph processing frameworks have been developed. However, they mainly focus on batch graph analytics. On the other hand, state-of-the-art graph databases cannot sustain efficient distributed querying of large graphs with complex queries. In particular, online large-scale graph querying engines are still limited. In this paper, we present a research plan shipped with the state-of-the-art techniques for large-scale pr...

Big data analytics on Apache Spark

International Journal of Data Science and Analytics, 2016

Apache Spark has emerged as the de facto framework for big data analytics with its advanced in-memory programming model and upper-level libraries for scalable machine learning, graph analysis, streaming and structured data processing. It is a general-purpose cluster computing framework with language-integrated APIs in Scala, Java, Python and R. As a rapidly evolving open source project, with an increasing number of contributors from both academia and industry, it is difficult for researchers to comprehend the full body of development and research behind Apache Spark, especially those who are beginners in this area. In this paper, we present a technical review on big data analytics using Apache Spark. This review focuses on the key components, abstractions and features of Apache Spark. More specifically, it shows what Apache Spark has for designing and implementing big data algorithms and pipelines for machine learning, graph analysis and stream processing. In addition, we highlight some research and development directions on Apache Spark for big data analytics.

Relaxed Context Search over Multiple Structured and Semi-structured Institutional Data

The NoSQL graph data model has been widely employed as a unified relationship-centric modeling method for various types of data. Such graph-modeled data is typically queried by means of queries expressed in native graph query languages. However, users' lack of familiarity with the formal query languages and the structure of the underlying data has called for relaxed search methods (such as keyword search). Many research efforts have studied the problem of keyword search, but in a single-database setting. Relaxed query answering becomes more challenging when querying multiple heterogeneous data sources. This paper presents a technique for relaxed query processing over multiple graph-modeled data that represents heterogeneous structured and semi-structured data sources. The proposed technique supports various forms of relaxed search (including context keyword, phrase, and proximity search). An extensive experimental evaluation on a real-world dataset demonstrates that the proposed technique is more effective than the existing keyword search methods in terms of precision, recall and f-measure.
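
For readers unfamiliar with the evaluation measures named in this abstract, the following is a minimal sketch of how precision, recall and the balanced F-measure (F1) are commonly computed for a single search query. The function name `precision_recall_f1` and the sample result identifiers are illustrative assumptions and do not come from the paper, which may use a different variant of the F-measure.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall and F1 for one query.

    retrieved: identifiers returned by the search method (hypothetical input).
    relevant:  identifiers judged relevant for the query (hypothetical input).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant results that were returned

    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0

    # F1 is the harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative data only: 4 results returned, 3 relevant, 2 in common.
p, r, f = precision_recall_f1(["a", "b", "c", "d"], ["a", "b", "e"])
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

With two of four returned results being relevant and two of three relevant results being found, precision is 1/2, recall is 2/3, and F1 is 4/7.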