Ricardo Jimenez-Peris | Universidad Politécnica de Madrid

Papers by Ricardo Jimenez-Peris

Transactional Processing for Polyglot Persistence

NoSQL data stores have emerged in recent years as a solution to provide scalability and flexibility in data modelling for operational databases. These data stores have proven better suited than relational databases for some kinds of problems. In order to scale, they relaxed the properties provided by relational databases, mainly transactions. However, transactional semantics are still needed by most applications. In this paper we describe how CoherentPaaS provides scalable holistic transactions across SQL and NoSQL data stores such as document-oriented data stores, key-value data stores, and graph databases.
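
As a rough illustration of what holistic transactions across heterogeneous stores mean in practice, the sketch below buffers writes against several stores and applies them all under a single commit timestamp. The interfaces and names are hypothetical, not the CoherentPaaS API; a real system would also validate conflicts and make the commit atomic and durable.

```java
// Hypothetical sketch (names are illustrative, not the CoherentPaaS API):
// one transaction spanning several stores, committed under a single timestamp
// handed out by a central timestamp oracle.
import java.util.ArrayList;
import java.util.List;

interface VersionedStore {
    void write(long commitTs, String key, String value);   // apply a versioned write
    String read(long snapshotTs, String key);               // read as of a snapshot
}

final class HolisticTransaction {
    private record Write(VersionedStore store, String key, String value) {}

    private final long snapshotTs;                 // snapshot used for all reads
    private final List<Write> writes = new ArrayList<>();

    HolisticTransaction(long snapshotTs) { this.snapshotTs = snapshotTs; }

    String get(VersionedStore store, String key) { return store.read(snapshotTs, key); }

    void put(VersionedStore store, String key, String value) {
        writes.add(new Write(store, key, value));  // buffer until commit
    }

    // A real system would validate conflicts and make this atomic and durable;
    // here we only show that every store sees the same commit timestamp.
    void commit(long commitTs) {
        for (Write w : writes) w.store().write(commitTs, w.key(), w.value());
    }
}
```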

PaaS-CEP - A Query Language for Complex Event Processing and Databases

Snapshot isolation for Neo4j

Extending Database Technology, 2016

NoSQL data stores are becoming more and more popular, and graph databases are one such kind of data store. Neo4j is a very popular graph database. In Neo4j, all operations that access a graph must be performed in a transaction. Transactions in Neo4j use the read-committed isolation level; higher isolation levels are not available. In this paper we present an overview of the implementation of snapshot isolation (SI) for Neo4j. SI provides stronger guarantees than read-committed and provides more concurrency than serializability.
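
A minimal sketch of the snapshot isolation commit rule that such an implementation has to enforce is shown below: a transaction reads from the snapshot taken at begin and may only commit if no concurrent committed transaction wrote the same entities after that snapshot (first-committer-wins). The class is illustrative and independent of Neo4j internals.

```java
// Minimal sketch of the snapshot-isolation commit rule (first-committer-wins),
// independent of Neo4j internals; entity ids stand in for nodes/relationships.
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

final class SnapshotIsolationValidator {
    private final AtomicLong clock = new AtomicLong();
    // last commit timestamp per written entity (node or relationship id)
    private final Map<Long, Long> lastCommit = new ConcurrentHashMap<>();

    long begin() { return clock.get(); }   // snapshot timestamp used for reads

    // Returns the commit timestamp, or -1 if a concurrent committed writer
    // touched the same entities after our snapshot (write-write conflict).
    synchronized long tryCommit(long snapshotTs, Set<Long> writtenEntities) {
        for (long id : writtenEntities) {
            Long committed = lastCommit.get(id);
            if (committed != null && committed > snapshotTs) return -1; // abort
        }
        long commitTs = clock.incrementAndGet();
        for (long id : writtenEntities) lastCommit.put(id, commitTs);
        return commitTs;
    }
}
```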

CumuloNimbo: parallel-distributed transactional processing

This paper describes a new cloud platform, CumuloNimbo, that envisions ultra-scalable transactional processing for multi-tier applications, with the goal of processing a million update transactions per second while providing the same level of consistency and transparency as traditional relational database systems. Most current approaches to attaining scalability for transactional processing in the cloud resort to sharding. Sharding is a technique in which the database is split into many different partitions (e.g. thousands) that work as separate databases sharing the original schema. Sharding is technically simple but neither syntactically nor semantically transparent. Syntactic transparency is lost because applications have to be rewritten, as individual transactions are only allowed to access one of the partitions. Semantic transparency is lost because the ACID properties previously provided by transactions over arbitrary data sets are lost. Alternatives to sharding have been proposed recently [BernRWY11, PengD10], but they are solutions for specialized data structures [BernRWY11] or are not designed for online systems that require fast response times [PengD10].
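
The transparency problem with sharding can be made concrete with a small sketch: once keys are hashed to partitions, only transactions whose keys all land on one partition can run as a single local database transaction. This is illustrative code, not part of CumuloNimbo.

```java
// Illustrative sketch of hash sharding (not CumuloNimbo code): each key maps to
// one partition, so a transaction touching keys on two partitions can no longer
// run as a single local database transaction.
final class ShardRouter {
    private final int numShards;

    ShardRouter(int numShards) { this.numShards = numShards; }

    int shardOf(String key) {
        return Math.floorMod(key.hashCode(), numShards);
    }

    // True only if every key lands on the same partition; otherwise the
    // application must be rewritten or give up cross-shard ACID guarantees.
    boolean fitsInOneShard(String... keys) {
        int first = shardOf(keys[0]);
        for (String k : keys) if (shardOf(k) != first) return false;
        return true;
    }
}
```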

CumuloNimbo: a cloud scalable multi-tier SQL database

IEEE Data(base) Engineering Bulletin, 2015

This article presents an overview of the CumuloNimbo platform. CumuloNimbo is a framework for multi-tier applications that provides scalable and fault-tolerant processing of OLTP workloads. The main novelty of CumuloNimbo is that it provides a standard SQL interface and full transactional support without resorting to sharding and without needing to know the workload in advance. Scalability is achieved by distributing request execution and transaction control across many compute nodes, while data is persisted in a distributed data store.

Elastic scalable transaction processing in LeanXcale

Information Systems, Mar 1, 2022

NUMA-aware Deployments for LeanXcale Database Appliance

In this paper we discuss NUMA awareness for the LeanXcale database appliance being developed in cooperation with Bull-Atos on the Bull Sequana, in the context of the CloudDBAppliance European project. The Bull Sequana is a large computer that in its maximum configuration can reach 896 cores and 140 TB of main memory. Scaling up in such a large computer with a deep NUMA hierarchy is very challenging. In this paper we discuss how the LeanXcale database can be deployed on NUMA architectures such as that of the Bull Sequana, and which aspects have been taken into account to maximize efficiency and to introduce the necessary flexibility in the deployment infrastructure.
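
A hedged sketch of the kind of placement logic such a deployment needs is shown below: component instances are spread round-robin across NUMA nodes, and each instance would then be pinned to its node (for example via numactl --cpunodebind/--membind) so its threads and memory stay local. The numbers and class are illustrative, not the actual LeanXcale deployment tooling.

```java
// Hypothetical sketch of NUMA-aware placement: spread component instances
// evenly across NUMA nodes so each instance keeps threads and memory local.
import java.util.ArrayList;
import java.util.List;

final class NumaPlacement {
    // Returns, for each instance, the NUMA node it should be pinned to
    // (e.g. via numactl --cpunodebind / --membind when launching the process).
    static List<Integer> assign(int instances, int numaNodes) {
        List<Integer> placement = new ArrayList<>();
        for (int i = 0; i < instances; i++) placement.add(i % numaNodes);
        return placement;
    }

    public static void main(String[] args) {
        // e.g. 32 storage-engine instances over a 16-node configuration (illustrative)
        System.out.println(assign(32, 16));
    }
}
```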

Architectural Patterns for Data Pipelines in Digital Finance and Insurance Applications

Springer eBooks, 2022

Data is the new oil that moves businesses. The value proposition of banking, financial services, and insurance organizations depends on the information they can process. Companies need to process more data in less time and in a more personalized way. However, current database technology has limitations in providing, within a single database engine, the required ingestion speed and the capacity to transform data in a usable way at the volume a corporation needs. To overcome these constraints, companies use approaches based on complex platforms that blend several technologies. This complexity increases the total cost of ownership and time to market, and creates long-run friction in adapting to new business opportunities. According to McKinsey, a midsize organization (between $5B and $10B in operating expenses) may spend around $90M-$120M to create and maintain these architectures, mainly because of architecture complexity and data fragmentation. McKinsey's advice is to simplify the way financial and insurance organizations use information, which will greatly impact the way a company does business. McKinsey also notes that a company can reduce expenses by up to 30% by simplifying the data architecture, in combination with other activities such as data infrastructure off-loading, improving engineers' productivity, and pausing expensive projects.

Simplifying and Accelerating Data Pipelines in Digital Finance and Insurance Applications

Big Data and Artificial Intelligence in Digital Finance

To process their ever-increasing massive data, financial and insurance organizations are developing and deploying data pipelines. However, state-of-the-art data management platforms have limitations in handling many complex pipelines that blend different kinds of data stores. This chapter introduces a novel Big Data database, namely the LeanXcale database, which enables the development and management of complex pipelines in a scalable fashion. Specifically, the presented database reduces data access time independently of data size and allows for efficient process parallelization. This combination of capabilities helps to reduce data pipeline complexity and the total cost of ownership of pipeline management. Moreover, it unveils new ways of generating value with new use cases that were previously not possible.

BigDataStack: A Holistic Data-Driven Stack for Big Data Applications and Operations

2018 IEEE International Congress on Big Data (BigData Congress), 2018

The new data-driven industrial revolution highlights the need for big data technologies to unlock the potential in various application domains. In this context, emerging innovative solutions exploit several underlying infrastructure and cluster management systems. However, these systems have not been designed and implemented in a "big data context"; rather, they emphasize and address the computational needs of the applications and services to be deployed. In this paper we present the architecture of a complete stack (namely BigDataStack), based on a frontrunner infrastructure management system that drives decisions according to data aspects, thus being fully scalable, runtime adaptable and high-performing to address the needs of big data operations and data-intensive applications. Furthermore, the stack goes beyond purely infrastructural elements by introducing techniques for dimensioning big data applications, modelling and analysing processes, and provisioning data-as-a-service by exploiting a seamless analytics framework.

Modular FPGA Acceleration of Data Analytics in Heterogenous Computing

2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2019

Emerging cloud applications like machine learning, AI and big data analytics require high-performance computing systems that can sustain the increased amount of data processing without consuming excessive power. Towards this end, many cloud operators have started deploying hardware accelerators, like FPGAs, to increase the performance of computationally intensive tasks, but this increases the programming complexity needed to utilize these accelerators. VINEYARD has developed an efficient framework that allows the seamless deployment and utilization of hardware accelerators in the cloud without increasing the programming complexity, while offering the flexibility of software packages. This paper presents a modular approach for the acceleration of data analytics using FPGAs. The modular approach allows the automatic development of integrated hardware designs for the acceleration of data analytics. The proposed framework shows that the data analytics modules can be used to achieve up to a 3.5x speedup compared to high-performance general-purpose processors.

Harnessing the power of DHTs to build dynamic quorums in

Recently, enterprises owning a large IT hardware and software infrastructure have started looking at peer-to-peer technologies as a means both to reduce costs and to help their technical divisions manage a huge number of devices characterized by a high level of cooperation and a relatively low churn. Obtaining complete and exclusive control of the system for maintenance or auditing purposes is a fundamental operation to implement in these enterprise infrastructures. In the context of classical distributed applications, quorum systems have been considered a major building block for implementing many paradigms, from distributed mutual exclusion to data replication management. In this paper, we explore how to architect decentralized protocols implementing quorum systems in cooperative P2P networks based on Distributed Hash Tables (DHTs). The paper introduces some design principles, for both quorum systems and the protocols using them, that boost their scalability and performance. These design principles consist of a dynamic and decentralized selection of quorums and the exposure and exploitation of internals of the DHT. As a third design principle, it is also shown how to redesign quorum systems to enable efficient decentralization.
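
The first design principle (dynamic, decentralized quorum selection) can be sketched as follows: every peer derives the same quorum positions for a resource by hashing the resource name with successive salts over the DHT key space, and then resolves each position to the responsible peer via the DHT lookup. The names and quorum construction are illustrative assumptions, not the paper's exact protocol.

```java
// Illustrative sketch of dynamic, decentralized quorum selection over a DHT:
// quorum member positions are derived by hashing the resource name with
// successive salts, so any peer can compute the same quorum without a
// central coordinator.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

final class DhtQuorum {
    // Returns quorumSize positions in the DHT key space; each position is then
    // resolved to the peer responsible for it (e.g. via the DHT's lookup).
    static List<Long> quorumPositions(String resource, int quorumSize) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-1");
            List<Long> positions = new ArrayList<>();
            for (int i = 0; i < quorumSize; i++) {
                byte[] h = sha.digest((resource + "#" + i).getBytes(StandardCharsets.UTF_8));
                long pos = 0;
                for (int b = 0; b < 8; b++) pos = (pos << 8) | (h[b] & 0xFF);
                positions.add(pos);
            }
            return positions;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```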

Cost of Fault-Tolerance on Data Stream Processing

Lecture Notes in Computer Science, 2018

Data streaming engines process data on the fly, in contrast to databases, which first store the data and then process it. In order to process the increasing amount of data produced every day, data streaming engines run on top of a distributed system. In this setting, failures will likely happen. Current distributed data streaming engines like Apache Flink provide fault tolerance. In this paper we evaluate the performance impact of Flink's fault tolerance mechanisms during regular operation (when there are no failures) on a distributed system, as well as the impact on performance when there are failures. We use the Intel HiBench benchmark suite for conducting the evaluation.
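
For context, the mechanism whose overhead such an evaluation measures is Flink's periodic checkpointing, enabled per job as sketched below (Flink DataStream API; the interval and job are arbitrary examples, not the configuration used in the paper).

```java
// Sketch of enabling periodic checkpointing on a Flink streaming job; shorter
// intervals mean faster recovery after a failure but more overhead during
// failure-free operation.
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 10 seconds with exactly-once semantics (example values).
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        env.fromElements(1, 2, 3)
           .map(x -> x * 2)
           .print();

        env.execute("checkpointed-job");
    }
}
```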

Load balancing for Key Value Data Stores

In the last decade, new scalable data stores have emerged in order to process and store the increasing amount of data that is produced every day. These data stores are inherently distributed to adapt to the increasing load and generated data. HBase is one such data store, built after Google BigTable, that stores large tables (hundreds of millions of rows) where data is stored sorted by key. A region is the unit of distribution in HBase and is a contiguous range of keys in the key space. HBase lacks a mechanism to distribute the load across region servers in an automated manner. In this paper, we present a load balancer that is able to split tables into an appropriate number of regions of appropriate sizes and distribute them across servers in order to attain a balanced load across all servers. The experimental evaluation shows that performance is improved with the proposed load balancer.
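
A simplified version of the core balancing step might look as follows: given a sorted sample of row keys, region boundaries are chosen at even quantiles so that each region receives a similar share of rows. This is an illustrative sketch, not the proposed balancer's actual code.

```java
// Illustrative sketch: derive region split keys from a sorted sample of row
// keys so each region receives a similar share of rows.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class RegionSplitter {
    // sampleKeys: a sample of row keys; numRegions: target number of regions.
    static List<String> splitPoints(List<String> sampleKeys, int numRegions) {
        List<String> sorted = new ArrayList<>(sampleKeys);
        Collections.sort(sorted);
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            // the key at each i/numRegions quantile becomes a region boundary
            splits.add(sorted.get(i * sorted.size() / numRegions));
        }
        return splits;
    }
}
```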

CrowdHEALTH: Holistic Health Records and Big Data Analytics for Health Policy Making and Personalized Health

Studies in health technology and informatics, 2017

Today's rich digital information environment is characterized by a multitude of data sources providing information that has not yet reached its full potential in eHealth. The aim of the presented approach, namely CrowdHEALTH, is to introduce a new paradigm of Holistic Health Records (HHRs) that include all health determinants. HHRs are transformed into HHR clusters capturing the clinical, social and human context of population segments and, as a result, collective knowledge for different factors. The proposed approach also seamlessly integrates big data technologies across the complete data path, providing Data as a Service (DaaS) to the health ecosystem stakeholders as well as to policy makers, towards a "health in all policies" approach. Cross-domain co-creation of policies is made feasible through a rich toolkit provided on top of the DaaS, incorporating mechanisms for causal and risk analysis and for the compilation of predictions.

Snapshot Isolation for Neo4j

NoSQL data stores are becoming more and more popular, and graph databases are one such kind of data store. In this paper we present an overview of the implementation of snapshot isolation for Neo4j, a very popular graph database.

Parallel Efficient Data Loading

Proceedings of the 8th International Conference on Data Science, Technology and Applications, 2019

In this paper we discuss how we architected and developed a parallel data loader for the LeanXcale database. The loader is characterized by its efficiency and parallelism. LeanXcale can scale up and scale out to very large configurations, and loading data in the traditional way does not exploit its full potential in terms of the loading rate it can reach. For this reason, we have created a parallel loader that can reach the maximum insertion rate LeanXcale can handle. LeanXcale also exhibits a dual interface, key-value and SQL, that has been exploited by the parallel loader. Basically, the loading leverages the key-value API, resulting in a highly efficient process that avoids the overhead of SQL processing. Finally, in order to guarantee parallelism, we have developed a data sampler that samples the data to generate a histogram of the data distribution and uses it to pre-split the regions across LeanXcale instances, guaranteeing that all instances receive an even amount of data during loading and thus that the deployment reaches its peak loading capability.
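
The parallel phase can be sketched as follows: once the sampler has produced split points, each worker thread loads one key range directly through the key-value interface, avoiding SQL processing. KvSession and its put() method are hypothetical placeholders, not the LeanXcale API.

```java
// Hypothetical sketch of the parallel loading phase: each worker loads one
// pre-split partition through a key-value interface, bypassing SQL processing.
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

interface KvSession {
    void put(String key, byte[] row);   // direct key-value insert (illustrative)
}

final class ParallelLoader {
    // partitions: rows already grouped by key range (one list per worker).
    static void load(List<List<String[]>> partitions, KvSession session) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
        for (List<String[]> partition : partitions) {
            pool.submit(() -> {
                for (String[] row : partition) {
                    session.put(row[0], String.join(",", row).getBytes()); // row[0] is the key
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```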

The CrowdHEALTH project and the Holistic Health Records: Collective Wisdom Driving Public Health Policies

Acta Informatica Medica, 2019

Introduction: With the expansion of available Information and Communication Technology (ICT) services, a plethora of data sources provide structured and unstructured data used to detect certain health conditions or indicators of disease. Data is spread across various settings, and stored and managed in different systems. Due to the lack of technology interoperability and the large amounts of health-related data, data exploitation has not reached its full potential yet. Aim: The aim of the CrowdHEALTH approach is to introduce a new paradigm of Holistic Health Records (HHRs) that include all health determinants defining health status, by using big data management mechanisms. Methods: HHRs are transformed into HHR clusters capturing the clinical, social and human context, with the aim of benefiting from the collective knowledge. The presented approach integrates big data technologies, providing Data as a Service (DaaS) to healthcare professionals and policy makers towards a "health in all policies" approach. A toolkit on top of the DaaS, providing mechanisms for causal and risk analysis and for the compilation of predictions, is developed. Results: The CrowdHEALTH platform is based on three main pillars: data and structures, health analytics, and policies. Conclusions: A holistic approach was presented for capturing all health determinants in the proposed HHRs, while creating clusters of them to exploit collective knowledge, with the aim of providing insight for different population segments according to different factors (e.g. location, occupation, medication status, emerging risks, etc.). The approach is under evaluation through different scenarios with heterogeneous data from multiple sources.

Transaction management across data stores

International Journal of High Performance Computing and Networking, 2018

The VINEYARD Approach: Versatile, Integrated, Accelerator-Based, Heterogeneous Data Centres

Lecture Notes in Computer Science, 2016

Emerging web applications like cloud computing, Big Data and social networks have created the need for powerful data centres hosting hundreds of thousands of servers. Currently, these data centres are based on general-purpose processors that provide high flexibility but lack the energy efficiency of customized accelerators. VINEYARD aims to develop an integrated platform for energy-efficient data centres based on new servers with novel, coarse-grain and fine-grain, programmable hardware accelerators. It will also build a high-level programming framework allowing end-users to seamlessly utilize these accelerators in heterogeneous computing systems by employing typical data-centre programming frameworks (e.g. MapReduce, Storm, Spark, etc.). This programming framework will further allow the hardware accelerators to be swapped in and out of the heterogeneous infrastructure so as to offer high flexibility and energy efficiency. VINEYARD will foster the expansion of the soft-IP core industry, currently limited to embedded systems, into the data-centre market. VINEYARD plans to demonstrate the advantages of its approach in three real use-cases: a) a bio-informatics application for high-accuracy brain modelling, b) two critical financial applications, and c) a big-data analysis application.
