Anna Queralt | Universitat Politecnica de Catalunya

Papers by Anna Queralt

Enabling dynamic and intelligent workflows for HPC, data analytics, and AI convergence

Future Generation Computer Systems

The evolution of High-Performance Computing (HPC) platforms enables the design and execution of progressively larger and more complex workflow applications in these systems. The complexity comes not only from the number of elements that compose the workflows but also from the type of computations they perform. While traditional HPC workflows target simulations and modelling of physical phenomena, current needs require in addition data analytics (DA) and artificial intelligence (AI) tasks. However, the development of these workflows is hampered by the lack of proper programming models and environments that support the integration of HPC, DA, and AI, as well as the lack of tools to easily deploy and execute the workflows in HPC systems. To progress in this direction, this paper presents use cases where complex workflows are required and investigates the main issues to be addressed for the HPC/DA/AI

An Elastic Software Architecture for Extreme-Scale Big Data Analytics

Technologies and Applications for Big Data Value

This chapter describes a software architecture for processing big-data analytics considering the complete compute continuum, from the edge to the cloud. The new generation of smart systems requires processing a vast amount of diverse information from distributed data sources. The software architecture presented in this chapter addresses two main challenges. On the one hand, a new elasticity concept enables smart systems to satisfy the performance requirements of extreme-scale analytics workloads. By extending the elasticity concept (known at cloud side) across the compute continuum in a fog computing environment, combined with the usage of advanced heterogeneous hardware architectures at the edge side, the capabilities of the extreme-scale analytics can significantly increase, integrating both responsive data-in-motion and latent data-at-rest analytics into a single solution. On the other hand, the software architecture also focuses on the fulfilment of the non-functional properties...

EUDAT D8.1: Report of Requirements

The aim of this deliverable is to lay out the ground for future work in WP8 by providing a clear list of additional services that should be developed as well as an initial list of high-level requirements. In this document, we present general DLC descriptions for core EUDAT communities and for individual researchers and citizen scientists. These DLC descriptions have been derived from community-specific user scenarios and were aligned using an agreed-upon list of key activities involved in DLCs. Within these descriptions, we included the planned integration of EUDAT services in the different DLC steps extracted from current and future Uptake Plans (WP4) to dissociate as much as possible existing services, requirements and new services. These descriptions provide the first attempt to associate EUDAT services to identified DLC activities and were used to list the specific needs. From these needs, we were able to identify new services and provide a list of high-level requirements ...

D8.2: Report on Technology Watch

The aim of this deliverable is to present a summary of the different technological approaches that EUDAT investigated to support the design of the prototype services and data models identified in D8.1 and implemented in task 8.2 (D8.3, data models) and task 8.4 (D8.4, service design and prototypes). This document presents the two distinct but related topics of investigation: the existing data models that could support the design of DLC models and a directive language, as well as the existing technologies to support the usage of graph-based data, workflow descriptions, directives, semantic resources and dynamic data. These descriptions are not meant to be exhaustive but reflect our current state of knowledge.

RosneT: A Block Tensor Algebra Library for Out-of-Core Quantum Computing Simulation

2021 IEEE/ACM Second International Workshop on Quantum Computing Software (QCS), 2021

With the advent of more powerful Quantum Computers, the need for larger Quantum Simulations has grown. As the amount of resources grows exponentially with the size of the target system, Tensor Networks emerge as an optimal framework with which we represent Quantum States in tensor factorizations. As the extent of a tensor network increases, so does the size of intermediate tensors, requiring HPC tools for their manipulation. Simulations of medium-sized circuits cannot fit in local memory, and solutions for distributed contraction of tensors are scarce. In this work we present RosneT, a library for distributed, out-of-core block tensor algebra. We use the PyCOMPSs programming model to transform tensor operations into a collection of tasks handled by the COMPSs runtime, targeting executions in existing and upcoming Exascale supercomputers. We report results validating our approach showing good scalability in simulations of Quantum circuits of up to 53 qubits.
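
The block-wise idea in this abstract can be illustrated with a minimal pure-Python sketch. It is not RosneT's API and does not use PyCOMPSs: it is restricted to square matrices (rank-2 tensors) whose dimension is a multiple of the block size, and every function name below is hypothetical. Each `matmul` call on a pair of blocks is the kind of independent unit of work that a task-based runtime could schedule separately.

```python
def matmul(a, b):
    """Dense product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split(m, bs):
    """Cut a square matrix into a grid of bs x bs blocks.

    Assumes len(m) is a multiple of bs.
    """
    n = len(m)
    return [[[row[j:j + bs] for row in m[i:i + bs]] for j in range(0, n, bs)]
            for i in range(0, n, bs)]


def join(blocks):
    """Reassemble a grid of blocks into a single matrix."""
    rows = []
    for block_row in blocks:
        for r in range(len(block_row[0])):
            rows.append([x for block in block_row for x in block[r]])
    return rows


def block_matmul(a_blocks, b_blocks):
    """Contract two block grids: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    g = len(a_blocks)
    out = []
    for i in range(g):
        out_row = []
        for j in range(g):
            # Each block product only needs two small blocks in memory at once.
            acc = matmul(a_blocks[i][0], b_blocks[0][j])
            for k in range(1, g):
                acc = add(acc, matmul(a_blocks[i][k], b_blocks[k][j]))
            out_row.append(acc)
        out.append(out_row)
    return out
```

Because only a few blocks are touched per step, the remaining blocks could in principle stay on disk or on other nodes, which is the intuition behind an out-of-core, distributed layout.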

D8.3: Report on Design Model and Definition of Data Directives

Based on the work described in D8.1 and D8.2, we developed initial prototypes for modeling Data Life Cycles and directives. In this document, we describe both the processes and the results of this initial implementation. We discuss the issues we faced in developing these prototypes and the technical choices we made. This deliverable provides an overview of the current status of the work. This work has been used as a basis for concrete implementations of community use-cases, described in D8.6.

Introduction to CEBDA 2018

2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2018

International Workshop on the Convergence of Extreme Scale Computing and Big Data Analysis. The deployment of extreme scale computing platforms in research and industry, coupled with the proliferation of large and distributed digital data sources, has the potential for unprecedented insights and understanding in all areas of science, engineering, business, and society in general. However, challenges related to the Big Data generated and processed by these systems remain a significant barrier in achieving this potential. Addressing these challenges requires a seamless integration of the extreme scale/high performance computing, cloud computing, storage technologies, data management, energy efficiency, and big data analytics research approaches, frameworks/technologies, and communities. The convergence and integration of exascale systems and data analysis is crucial to the future. To achieve this goal, both communities need to collectively explore and embrace emerging disruptions in architecture and hardware technologies as well as new data-driven application areas such as those enabled by the Internet of Things. Finally, educational and workforce development structures have to evolve to develop the required integrated skillsets.

Graph-based Data Integration in EUDAT Data Infrastructure

European Data Infrastructure (EUDAT) is a distributed research infrastructure offering generic data management services to the research communities. The services deal with different phases of the data life cycle; some of them are tailored to account for special needs of the individual communities or replicated to increase availability and resilience. All that leads to scattering of the large and heterogeneous data across the service landscape, limiting discoverability, openness, and data reuse. In this paper, we show how graph database technology can be leveraged to integrate the data across service boundaries. Such an integration will facilitate better cooperation among the researchers, improve searching and increase the openness of the infrastructure. We report on our work in progress, to show how better user experience and enhancement of the services can be achieved by using graph algorithms.

Keywords: Data Integration; Graph Databases; Designing for Open Data; Linked Data.

Revisiting active object stores: Bringing data locality to the limit with NVM

Future Generation Computer Systems, 2021

Object stores are widely used software stacks that achieve excellent scale-out with a well-defined interface and robust performance. However, their traditional get/put interface is unable to exploit data locality to the fullest, which keeps them from reaching peak performance. In particular, there is one way to improve data locality that has not yet achieved mainstream adoption: the active object store. Although there are some projects that have implemented the main idea of the active object store, such as Swift's Storlets or Ceph Object Classes, the scope of these implementations is limited. We believe that there is a huge potential for active object stores in the current status quo. Hyper-converged nodes are bringing more computing capabilities to storage nodes, and vice versa. The proliferation of non-volatile memory (NVM) technology is blurring the line between system memory (fast and scarce) and block devices (slow and abundant). More and more applications need to manage vast amounts of data (data analytics, Big Data, Machine Learning & AI, etc.), demanding bigger clusters and more complex computations. All these elements are potential game changers that need to be evaluated in the scope of active object stores. More specifically, having NVM devices presents additional opportunities, such as in-place execution. Being able to use the NVM from within the storage system while taking advantage of in-place execution (thanks to the byte-addressable nature of the NVM), in conjunction with the computing capabilities of hyper-converged nodes, can lead to active object stores that greatly outperform their non-active counterparts. In this article we propose an active object store software stack and evaluate it on an NVM-populated node. We will show how this setup is able to reduce execution times by 10% to more than 90% in a variety of representative application scenarios. Our discussion will focus on the active aspect of the system as well as on the implications of the memory configuration.

Workflow Environments for Advanced Cyberinfrastructure Platforms

2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), 2019

Progress in science is deeply bound to the effective use of high-performance computing infrastructures and to the efficient extraction of knowledge from vast amounts of data. Such data comes from different sources that follow a cycle composed of pre-processing steps for data curation and preparation for subsequent computing steps, and later analysis and analytics steps applied to the results. However, scientific workflows are currently fragmented into multiple components, with different processes for computing and data management, and with gaps in the viewpoints of the user profiles involved. Our vision is that future workflow environments and tools for the development of scientific workflows should follow a holistic approach, where both data and computing are integrated in a single flow built on simple, high-level interfaces. The topics of research that we propose involve novel ways to express the workflows that integrate the different data and compute processes, and dynamic runtimes to support the execution of the workflows in complex and heterogeneous computing infrastructures in an efficient way, both in terms of performance and energy. These infrastructures include highly distributed resources, from sensors and instruments, and devices in the edge, to High-Performance Computing and Cloud computing resources. This paper presents our vision to develop these workflow environments and also the steps we are currently following to achieve it.

Managing the Cloud Continuum: Lessons Learnt from a Real Fog-to-Cloud Deployment

Sensors, 2021

The wide adoption of the recently coined fog and edge computing paradigms alongside conventional cloud computing creates a novel scenario, known as the cloud continuum, where services may benefit from the overall set of resources to optimize their execution. To operate successfully, such a cloud continuum scenario demands novel management strategies, enabling a coordinated and efficient management of the entire set of resources, from the edge up to the cloud, designed in particular to address key edge characteristics, such as mobility, heterogeneity and volatility. The design of such a management framework poses many research challenges and has already promoted many initiatives worldwide at different levels. In this paper we present the results of one of these experiences driven by an EU H2020 project, focusing on the lessons learnt from a real deployment of the proposed management solution in three different industrial scenarios. We think that such a description may help unders...

Machine Learning-based Query Augmentation for SPARQL Endpoints

Proceedings of the 14th International Conference on Web Information Systems and Technologies, 2018

Linked Data repositories have become a popular source of publicly-available data. Users accessing this data through SPARQL endpoints usually launch several restrictive yet similar consecutive queries, either to find the information they need through trial-and-error or to query related resources. However, instead of executing each individual query separately, query augmentation aims at modifying the incoming queries to retrieve more data that is potentially relevant to subsequent requests. In this paper, we propose a novel approach to query augmentation for SPARQL endpoints based on machine learning. Our approach separates the structure of the query from its contents and measures two types of similarity, which are then used to predict the structure and contents of the augmented query. We test the approach on the real-world query logs of the Spanish and English DBpedia and show that our approach yields high-accuracy prediction. We also show that, by caching the results of the predicted augmented queries, we can retrieve data relevant to several subsequent queries at once, achieving a higher cache hit rate than previous approaches.

Predicting Access to Persistent Objects Through Static Code Analysis

New Trends in Databases and Information Systems, 2017

In this paper, we present a fully-automatic, high-accuracy approach to predict access to persistent objects through static code analysis of object-oriented applications. The most widely-used previous technique uses a simple heuristic to make the predictions while approaches that offer higher accuracy are based on monitoring application execution. These approaches add a non-negligible overhead to the application's execution time and/or consume a considerable amount of memory. By contrast, we demonstrate in our experimental study that our proposed approach offers better accuracy than the most common technique used to predict access to persistent objects, and makes the predictions farther in advance, without performing any analysis during application execution.

CAPre: Code-Analysis based Prefetching for Persistent Object Stores

Future Generation Computer Systems, 2019

Data prefetching aims to improve access times to data storage systems by predicting data records that are likely to be accessed by subsequent requests and retrieving them into a memory cache before they are needed. In the case of Persistent Object Stores, previous approaches to prefetching have been based on predictions made through analysis of the store's schema, which generates rigid predictions, or monitoring access patterns to the store while applications are executed, which introduces memory and/or computation overhead. In this paper, we present CAPre, a novel prefetching system for Persistent Object Stores based on static code analysis of object-oriented applications. CAPre generates the predictions at compile-time and does not introduce any overhead to the application execution. Moreover, CAPre is able to predict large amounts of objects that will be accessed in the near future, thus enabling the object store to perform parallel prefetching if the objects are distributed, in a much more aggressive way than in schema-based prediction algorithms. We integrate CAPre into a distributed Persistent Object Store and run a series of experiments that show that it can reduce the execution time of applications by 9% to over 50%, depending on the nature of the application and its persistent data model.

Proceedings of the 4th ACM MobiHoc Workshop on Experiences with the Design and Implementation of Smart Objects

Proceedings of the 4th ACM MobiHoc Workshop on Experiences with the Design and Implementation of Smart Objects - SMARTOBJECTS '18, 2018

Dataclay: A distributed data store for effective inter-player data sharing

Journal of Systems and Software, 2017

In the Big Data era, both the academic community and industry agree that a crucial point to obtain the maximum benefits from the explosive data growth is integrating information from different sources, and also combining methodologies to analyze and process it. For this reason, sharing data so that third parties can build new applications or services based on it is nowadays a trend. Although most data sharing initiatives are based on public data, the ability to reuse data generated by private companies is starting to gain importance as some of them (such as Google, Twitter, BBC or New York Times) are providing access to part of their data. However, current solutions for sharing data with third parties are not fully convenient for data owners, data consumers, or both. Therefore we present dataClay, a distributed data store designed to share data with external players in a secure and flexible way based on the concepts of identity and encapsulation. We also prove that dataClay is comparable in terms of performance with trendy NoSQL technologies while providing extra functionality, and resolves impedance mismatch issues based on the Object Oriented paradigm for data representation.

Big Data Benchmark Compendium

Lecture Notes in Computer Science, 2016

The field of Big Data and related technologies is rapidly evolving. Consequently, many benchmarks are emerging, driven by academia and industry alike. As these benchmarks are emphasizing different aspects of Big Data and, in many cases, covering different technical platforms and use cases, it is extremely difficult to keep up with the pace of benchmark creation. Also, with the combinations of large volumes of data, heterogeneous data formats and the changing processing velocity, it becomes complex to specify an architecture which best suits all application requirements. This makes the investigation and standardization of such systems very difficult. Therefore, the traditional way of specifying a standardized benchmark with pre-defined workloads, which has been in use for years in the transaction and analytical processing systems, is not trivial to employ for Big Data systems. This document provides a summary of existing benchmarks and those that are in development, gives a side-by-side comparison of their characteristics and discusses their pros and cons. The goal is to understand the current state in Big Data benchmarking and guide practitioners in their approaches and use cases.

PyCOMPSs: Parallel computational workflows in Python

The International Journal of High Performance Computing Applications, 2016

The use of the Python programming language for scientific computing has been gaining momentum in recent years. The fact that it is compact and readable and its complete set of scientific libraries are two important characteristics that favour its adoption. Nevertheless, Python still lacks a solution for easily parallelizing generic scripts on distributed infrastructures, since the current alternatives mostly require the use of APIs for message passing or are restricted to embarrassingly parallel computations. In that sense, this paper presents PyCOMPSs, a framework that facilitates the development of parallel computational workflows in Python. In this approach, the user programs her script in a sequential fashion and decorates the functions to be run as asynchronous parallel tasks. A runtime system is in charge of exploiting the inherent concurrency of the script, detecting the data dependencies between tasks and spawning them to the available resources. Furthermore, we show how t...
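
The programming style described in this abstract (a sequential script whose decorated functions become asynchronous tasks, with dependencies tracked through their arguments) can be illustrated with a small standard-library sketch. This is not the PyCOMPSs API: the `task` decorator, `wait_on`, `square`, `add`, and `main` are hypothetical names, and a local thread pool stands in for the distributed runtime.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from functools import wraps

# Stand-in for a runtime's pool of resources (an assumption, not PyCOMPSs).
_pool = ThreadPoolExecutor(max_workers=4)


def task(func):
    """Hypothetical decorator: calling func returns a Future immediately."""
    @wraps(func)
    def submit(*args):
        def run():
            # Data dependency: an argument that is a Future is the result of
            # an earlier task, so wait for it before running this one.
            resolved = [a.result() if isinstance(a, Future) else a for a in args]
            return func(*resolved)
        return _pool.submit(run)
    return submit


def wait_on(value):
    """Hypothetical synchronisation point: block until a result is available."""
    return value.result() if isinstance(value, Future) else value


@task
def square(x):
    return x * x


@task
def add(a, b):
    return a + b


def main():
    # The script reads sequentially; the four square() calls are independent
    # and may overlap, while each add() waits on the values it consumes.
    partial = [square(i) for i in range(4)]
    total = partial[0]
    for p in partial[1:]:
        total = add(total, p)
    return wait_on(total)
```

Calling `main()` sums the squares of 0 to 3. The sketch relies on tasks only depending on tasks submitted before them; a real runtime would build a dependency graph rather than block worker threads.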

Specifying Artifact-Centric Business Process Models in UML

Lecture Notes in Business Information Processing, 2015

In recent years, the artifact-centric approach to process modeling has attracted a lot of attention. One of the research lines in this area is finding a suitable way to represent the dimensions in this approach. Bearing this in mind, this paper proposes a way to specify artifact-centric business process models by means of well-known UML diagrams, from a high level of abstraction and with a technology-independent perspective. UML is a graphical language, widely used and with precise semantics.

EU-Rent as an artifact-centric process model: technical report

Business process modeling using an artifact-centric approach has raised a significant interest over the last few years. This approach is usually stated in terms of the BALSA framework which defines the four dimensions of an artifact-centric business process model: Business Artifacts, Lifecycles, Services and Associations. One of the research challenges in this area is looking for different diagrams to represent these dimensions. Bearing this in mind, this technical report shows how various UML diagrams can be used to represent all ...

Research paper thumbnail of Enabling dynamic and intelligent workflows for HPC, data analytics, and AI convergence

Future Generation Computer Systems

The evolution of High-Performance Computing (HPC) platforms enables the design and execution of p... more The evolution of High-Performance Computing (HPC) platforms enables the design and execution of progressively larger and more complex workflow applications in these systems. The complexity comes not only from the number of elements that compose the workflows but also from the type of computations they perform. While traditional HPC workflows target simulations and modelling of physical phenomena, current needs require in addition data analytics (DA) and artificial intelligence (AI) tasks. However, the development of these workflows is hampered by the lack of proper programming models and environments that support the integration of HPC, DA, and AI, as well as the lack of tools to easily deploy and execute the workflows in HPC systems. To progress in this direction, this paper presents use cases where complex workflows are required and investigates the main issues to be addressed for the HPC/DA/AI

Research paper thumbnail of An Elastic Software Architecture for Extreme-Scale Big Data Analytics

Technologies and Applications for Big Data Value

This chapter describes a software architecture for processing big-data analytics considering the ... more This chapter describes a software architecture for processing big-data analytics considering the complete compute continuum, from the edge to the cloud. The new generation of smart systems requires processing a vast amount of diverse information from distributed data sources. The software architecture presented in this chapter addresses two main challenges. On the one hand, a new elasticity concept enables smart systems to satisfy the performance requirements of extreme-scale analytics workloads. By extending the elasticity concept (known at cloud side) across the compute continuum in a fog computing environment, combined with the usage of advanced heterogeneous hardware architectures at the edge side, the capabilities of the extreme-scale analytics can significantly increase, integrating both responsive data-in-motion and latent data-at-rest analytics into a single solution. On the other hand, the software architecture also focuses on the fulfilment of the non-functional properties...

Research paper thumbnail of EUDAT D8.1: Report of Requirements

The aim of this deliverable is to lay out the ground for future work in WP8 by providing a clear ... more The aim of this deliverable is to lay out the ground for future work in WP8 by providing a clear list of additional services that should be developed as well as an initial list of high-level requirements. In this document, we are presenting general DLC descriptions for core EUDAT communities and for individual researcher and citizen scientists. These DLC descriptions have been derived from community specific user scenarios and were aligned using an agreed upon list of key activities involved in DLCs. Within these descriptions, we included the planned integration of EUDAT services in the different DLC steps extracted from current and future Uptake Plans (WP4) to dissociate as much as possible existing services, requirements and new services. These descriptions provide the first attempt to associate EUDAT services to identified DLC activities and were used to list the specific needs. From these needs, we were able to identify new services and provide a list of high-level requirements ...

Research paper thumbnail of D8.2: Report on Technology Watch

The aim of this deliverable is to present a summary of the different technological approaches tha... more The aim of this deliverable is to present a summary of the different technological approaches that EUDAT investigated to support the design of the prototype services and data models identified in D8.1 and implemented in task 8.2 (D8.3, data models) and task 8.4 (D8.4, service design and prototypes). This document presents the two distinct but related topics of investigation: the existing data models that could support the design of DLC models and a directive language, as well as the existing technologies to support the usage of graph-based data, workflow descriptions, directives, semantic resources and dynamic data. These descriptions are not meant to be exhaustive but reflect the results of our current state of knowledge.

Research paper thumbnail of RosneT: A Block Tensor Algebra Library for Out-of-Core Quantum Computing Simulation

2021 IEEE/ACM Second International Workshop on Quantum Computing Software (QCS), 2021

With the advent of more powerful Quantum Computers, the need for larger Quantum Simulations has b... more With the advent of more powerful Quantum Computers, the need for larger Quantum Simulations has boosted. As the amount of resources grows exponentially with size of the target system Tensor Networks emerge as an optimal framework with which we represent Quantum States in tensor factorizations. As the extent of a tensor network increases, so does the size of intermediate tensors requiring HPC tools for their manipulation. Simulations of medium-sized circuits cannot fit on local memory, and solutions for distributed contraction of tensors are scarce. In this work we present RosneT, a library for distributed, out-ofcore block tensor algebra. We use the PyCOMPSs programming model to transform tensor operations into a collection of tasks handled by the COMPSs runtime, targeting executions in existing and upcoming Exascale supercomputers. We report results validating our approach showing good scalability in simulations of Quantum circuits of up to 53 qubits.

Research paper thumbnail of D8.3: Report on Design Model and Definition of Data Directives

Based on the work described in D8.1 and D8.2, we developed initial prototypes for modeling Data L... more Based on the work described in D8.1 and D8.2, we developed initial prototypes for modeling Data Life Cycles and directives. In this document, we are describing both the processes and the results of this initial implementation. We will discuss the issues we faced in developing these prototypes and the technical choices we made. This deliverable provides an overview of the current status of the work. This work has been used as a basis for concrete implementations of community use-cases, described in D8.6.

Research paper thumbnail of Introduction to CEBDA 2018

2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2018

International Workshop on the Convergence of Extreme Scale Computing and Big Data Analysis The de... more International Workshop on the Convergence of Extreme Scale Computing and Big Data Analysis The deployment of extreme scale computing platforms in research and industry coupled with the proliferation of large and distributed digital data sources have the potential for unprecedented insights and understanding in all areas of science, engineering, business, and society in general. However, challenges related to the Big Data generated and processed by these systems remain a significant barrier in achieving this potential. Addressing these challenges requires a seamless integration of the extreme scale/high performance computing, cloud computing, storage technologies, data management, energy efficiency, and big data analytics research approaches, framework/technologies, and communities. The convergence and integration of exascale systems and data analysis is crucial to the future. To achieve this goal, both communities need to collectively explore and embrace emerging disruptions in architecture and hardware technologies as well as new data-driven application areas such as those enabled by the Internet of Things. Finally, educational and workforce development structures have to evolved to develop the required integrated skillsets.

Graph-based Data Integration in EUDAT Data Infrastructure

European Data Infrastructure (EUDAT) is a distributed research infrastructure offering generic data management services to the research communities. The services deal with different phases of the data life cycle; some of them are tailored to the special needs of individual communities or replicated to increase availability and resilience. As a result, large and heterogeneous data is scattered across the service landscape, limiting discoverability, openness, and data reuse. In this paper, we show how graph database technology can be leveraged to integrate the data across service boundaries. Such an integration will facilitate better cooperation among researchers, improve searching, and increase the openness of the infrastructure. We report on our work in progress to show how a better user experience and enhancement of the services can be achieved by using graph algorithms. Keywords: Data Integration; Graph Databases; Designing for Open Data; Linked Data.

Revisiting active object stores: Bringing data locality to the limit with NVM

Future Generation Computer Systems, 2021

Object stores are widely used software stacks that achieve excellent scale-out with a well-defined interface and robust performance. However, their traditional get/put interface is unable to exploit data locality to its fullest, which keeps them from reaching peak performance. In particular, there is one way to improve data locality that has not yet achieved mainstream adoption: the active object store. Although some projects have implemented the main idea of the active object store, such as Swift's Storlets or Ceph Object Classes, the scope of these implementations is limited. We believe that there is a huge potential for active object stores in the current status quo. Hyper-converged nodes are bringing more computing capabilities to storage nodes, and vice versa. The proliferation of non-volatile memory (NVM) technology is blurring the line between system memory (fast and scarce) and block devices (slow and abundant). More and more applications need to manage very large amounts of data (data analytics, Big Data, Machine Learning & AI, etc.), demanding bigger clusters and more complex computations. All these elements are potential game changers that need to be evaluated in the scope of active object stores. More specifically, having NVM devices presents additional opportunities, such as in-place execution. Being able to use the NVM from within the storage system while taking advantage of in-place execution (thanks to the byte-addressable nature of the NVM), in conjunction with the computing capabilities of hyper-converged nodes, can lead to active object stores that greatly outperform their non-active counterparts. In this article we propose an active object store software stack and evaluate it on an NVM-populated node.
We show how this setup is able to reduce execution times by between 10% and more than 90% in a variety of representative application scenarios. Our discussion focuses on the active aspect of the system as well as on the implications of the memory configuration.

Workflow Environments for Advanced Cyberinfrastructure Platforms

2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), 2019

Progress in science is deeply bound to the effective use of high-performance computing infrastructures and to the efficient extraction of knowledge from vast amounts of data. Such data comes from different sources that follow a cycle composed of pre-processing steps for data curation and preparation for subsequent computing steps, and later analysis and analytics steps applied to the results. However, scientific workflows are currently fragmented into multiple components, with different processes for computing and data management, and with gaps in the viewpoints of the user profiles involved. Our vision is that future workflow environments and tools for the development of scientific workflows should follow a holistic approach, where both data and computing are integrated in a single flow built on simple, high-level interfaces. The topics of research that we propose involve novel ways to express workflows that integrate the different data and compute processes, and dynamic runtimes that support the execution of the workflows in complex and heterogeneous computing infrastructures efficiently, both in terms of performance and energy. These infrastructures include highly distributed resources, from sensors, instruments, and devices in the edge to High-Performance Computing and Cloud computing resources. This paper presents our vision to develop these workflow environments and also the steps we are currently following to achieve it.

Managing the Cloud Continuum: Lessons Learnt from a Real Fog-to-Cloud Deployment

Sensors, 2021

The wide adoption of the recently coined fog and edge computing paradigms alongside conventional cloud computing creates a novel scenario, known as the cloud continuum, where services may benefit from the overall set of resources to optimize their execution. To operate successfully, such a cloud continuum scenario demands novel management strategies that enable coordinated and efficient management of the entire set of resources, from the edge up to the cloud, and that are designed in particular to address key edge characteristics such as mobility, heterogeneity and volatility. The design of such a management framework poses many research challenges and has already prompted many initiatives worldwide at different levels. In this paper we present the results of one of these experiences driven by an EU H2020 project, focusing on the lessons learnt from a real deployment of the proposed management solution in three different industrial scenarios. We think that such a description may help unders...

Machine Learning-based Query Augmentation for SPARQL Endpoints

Proceedings of the 14th International Conference on Web Information Systems and Technologies, 2018

Linked Data repositories have become a popular source of publicly-available data. Users accessing this data through SPARQL endpoints usually launch several restrictive yet similar consecutive queries, either to find the information they need through trial-and-error or to query related resources. However, instead of executing each individual query separately, query augmentation aims at modifying the incoming queries to retrieve more data that is potentially relevant to subsequent requests. In this paper, we propose a novel approach to query augmentation for SPARQL endpoints based on machine learning. Our approach separates the structure of the query from its contents and measures two types of similarity, which are then used to predict the structure and contents of the augmented query. We test the approach on the real-world query logs of the Spanish and English DBpedia and show that our approach yields high-accuracy prediction. We also show that, by caching the results of the predicted augmented queries, we can retrieve data relevant to several subsequent queries at once, achieving a higher cache hit rate than previous approaches.

Predicting Access to Persistent Objects Through Static Code Analysis

New Trends in Databases and Information Systems, 2017

In this paper, we present a fully-automatic, high-accuracy approach to predict access to persistent objects through static code analysis of object-oriented applications. The most widely-used previous technique uses a simple heuristic to make the predictions while approaches that offer higher accuracy are based on monitoring application execution. These approaches add a non-negligible overhead to the application's execution time and/or consume a considerable amount of memory. By contrast, we demonstrate in our experimental study that our proposed approach offers better accuracy than the most common technique used to predict access to persistent objects, and makes the predictions farther in advance, without performing any analysis during application execution.

CAPre: Code-Analysis based Prefetching for Persistent Object Stores

Future Generation Computer Systems, 2019

Data prefetching aims to improve access times to data storage systems by predicting data records that are likely to be accessed by subsequent requests and retrieving them into a memory cache before they are needed. In the case of Persistent Object Stores, previous approaches to prefetching have been based either on predictions made through analysis of the store's schema, which generates rigid predictions, or on monitoring access patterns to the store while applications are executed, which introduces memory and/or computation overhead. In this paper, we present CAPre, a novel prefetching system for Persistent Object Stores based on static code analysis of object-oriented applications. CAPre generates the predictions at compile-time and does not introduce any overhead to the application execution. Moreover, CAPre is able to predict large numbers of objects that will be accessed in the near future, thus enabling the object store to perform parallel prefetching if the objects are distributed, in a much more aggressive way than in schema-based prediction algorithms. We integrate CAPre into a distributed Persistent Object Store and run a series of experiments that show that it can reduce the execution time of applications by 9% to over 50%, depending on the nature of the application and its persistent data model.

Proceedings of the 4th ACM MobiHoc Workshop on Experiences with the Design and Implementation of Smart Objects

Proceedings of the 4th ACM MobiHoc Workshop on Experiences with the Design and Implementation of Smart Objects - SMARTOBJECTS '18, 2018

Dataclay: A distributed data store for effective inter-player data sharing

Journal of Systems and Software, 2017

In the Big Data era, both the academic community and industry agree that a crucial point to obtain the maximum benefits from the explosive data growth is integrating information from different sources, and also combining methodologies to analyze and process it. For this reason, sharing data so that third parties can build new applications or services based on it is nowadays a trend. Although most data sharing initiatives are based on public data, the ability to reuse data generated by private companies is starting to gain importance as some of them (such as Google, Twitter, BBC or New York Times) are providing access to part of their data. However, current solutions for sharing data with third parties are not fully convenient for data owners, data consumers, or both. Therefore we present dataClay, a distributed data store designed to share data with external players in a secure and flexible way based on the concepts of identity and encapsulation. We also show that dataClay is comparable in terms of performance with popular NoSQL technologies while providing extra functionality, and that it resolves impedance mismatch issues by relying on the Object Oriented paradigm for data representation.

Big Data Benchmark Compendium

Lecture Notes in Computer Science, 2016

The field of Big Data and related technologies is rapidly evolving. Consequently, many benchmarks are emerging, driven by academia and industry alike. Because these benchmarks emphasize different aspects of Big Data and, in many cases, cover different technical platforms and use cases, it is extremely difficult to keep up with the pace of benchmark creation. Moreover, with the combination of large volumes of data, heterogeneous data formats and changing processing velocity, it becomes complex to specify an architecture that best suits all application requirements. This makes the investigation and standardization of such systems very difficult. Therefore, the traditional way of specifying a standardized benchmark with pre-defined workloads, which has been in use for years in transaction and analytical processing systems, is not trivial to employ for Big Data systems. This document provides a summary of existing benchmarks and those that are in development, gives a side-by-side comparison of their characteristics and discusses their pros and cons. The goal is to understand the current state of Big Data benchmarking and guide practitioners in their approaches and use cases.

PyCOMPSs: Parallel computational workflows in Python

The International Journal of High Performance Computing Applications, 2016

The use of the Python programming language for scientific computing has been gaining momentum in recent years. Its compact, readable syntax and its complete set of scientific libraries are two important characteristics that favour its adoption. Nevertheless, Python still lacks a solution for easily parallelizing generic scripts on distributed infrastructures, since the current alternatives mostly require the use of APIs for message passing or are restricted to embarrassingly parallel computations. In that sense, this paper presents PyCOMPSs, a framework that facilitates the development of parallel computational workflows in Python. In this approach, the user programs her script in a sequential fashion and decorates the functions to be run as asynchronous parallel tasks. A runtime system is in charge of exploiting the inherent concurrency of the script, detecting the data dependencies between tasks and spawning them to the available resources. Furthermore, we show how t...

Specifying Artifact-Centric Business Process Models in UML

Lecture Notes in Business Information Processing, 2015

In recent years, the artifact-centric approach to process modeling has attracted a lot of attention. One of the research lines in this area is finding a suitable way to represent the dimensions in this approach. Bearing this in mind, this paper proposes a way to specify artifact-centric business process models by means of well-known UML diagrams, from a high level of abstraction and with a technology-independent perspective. UML is a widely used graphical language with a precise semantics.

EU-Rent as an artifact-centric process model: technical report

Business process modeling using an artifact-centric approach has attracted significant interest over the last few years. This approach is usually stated in terms of the BALSA framework, which defines the four dimensions of an artifact-centric business process model: Business Artifacts, Lifecycles, Services and Associations. One of the research challenges in this area is finding different diagrams to represent these dimensions. Bearing this in mind, this technical report shows how various UML diagrams can be used to represent all ...