Verena Kantere | Université de Genève

Papers by Verena Kantere

An efficient multi-objective genetic algorithm for cloud computing: NSGA-G

Cloud computing provides computing resources with elasticity, following a pay-as-you-go model. This raises Multi-Objective Optimization Problems (MOOPs), in particular finding Query Execution Plans (QEPs) with respect to users' preferences, such as response time, monetary cost and quality. In such a context, a MOOP may generate a Pareto-optimal front with high complexity. Pareto-dominance-based Multi-Objective Evolutionary Algorithms (MOEAs) are often used as an alternative solution, such as Non-dominated Sorting Genetic Algorithms (NSGAs), which provide better computational complexity. This paper presents NSGA-G, an NSGA based on grid partitioning that improves the complexity and quality of current NSGAs. Experiments on the DTLZ test problems, using Generational Distance (GD), Inverted Generational Distance (IGD) and Maximum Pareto Front Error, demonstrate the relevance of our solution.
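
The abstract does not give implementation details, so the following is only a minimal sketch (not the authors' code) of the grid-partitioning idea behind NSGA-G: the objective space of a non-dominated front is divided into a grid, and selection keeps one representative per occupied cell so the retained front stays diverse. Function names, the number of divisions and the random choice of representative are illustrative assumptions.

```python
import random

def grid_index(point, mins, widths, divisions):
    """Map an objective vector to its grid cell (one index per objective)."""
    return tuple(
        min(int((p - lo) / w) if w > 0 else 0, divisions - 1)
        for p, lo, w in zip(point, mins, widths)
    )

def grid_select(front, divisions=4):
    """Keep one survivor per occupied grid cell to preserve diversity.

    `front` is a list of objective vectors (minimization assumed).
    """
    dims = len(front[0])
    mins = [min(p[d] for p in front) for d in range(dims)]
    maxs = [max(p[d] for p in front) for d in range(dims)]
    widths = [(hi - lo) / divisions for lo, hi in zip(mins, maxs)]

    cells = {}
    for p in front:
        cells.setdefault(grid_index(p, mins, widths, divisions), []).append(p)

    # One random representative per cell approximates grid-based selection.
    return [random.choice(group) for group in cells.values()]

if __name__ == "__main__":
    random.seed(0)
    front = [(random.random(), random.random()) for _ in range(50)]
    print(len(grid_select(front)))   # far fewer points, spread over the front
```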

Dynamic estimation for medical data management in a cloud federation

HAL (Le Centre pour la Communication Scientifique Directe), Mar 26, 2019

Data sharing is important in the medical domain. Sharing data allows large-scale analysis over many data sources and provides more accurate results (especially in the case of rare diseases with small local datasets). Cloud federations are a major advance in sharing medical data stored on different cloud platforms, such as Amazon, Microsoft and Google Cloud, and they also enable access to the distributed data of mobile patients. The pay-as-you-go model in cloud federations raises an important issue in terms of Multi-Objective Query Processing (MOQP): finding a Query Execution Plan according to user preferences, such as response time, monetary cost and quality. However, optimizing a query in a cloud federation is complex, with increasing heterogeneity and additional variance, especially due to the wide range of communication and pricing models. In such a context, it is difficult to provide estimations accurate enough to make relevant decisions. To address this problem, we present the Dynamic Regression Algorithm (DREAM), which provides accurate estimation in a cloud federation with limited historical data. DREAM focuses on reducing the size of the historical data while maintaining estimation accuracy. The proposed algorithm is integrated in the Intelligent Resource Scheduler, a solution for heterogeneous databases, to solve MOQP in cloud federations, and is validated with preliminary experiments on a decision-support benchmark (TPC-H).
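
DREAM is not specified in the abstract beyond the idea of shrinking the historical data used for estimation while preserving accuracy. The sketch below illustrates that idea only, under the assumption of a simple linear cost model fitted by least squares; the shrinking policy, error threshold and all names are hypothetical.

```python
import numpy as np

def fit(history):
    """Ordinary least squares over (features, observed cost) pairs."""
    X = np.array([h[0] for h in history], dtype=float)
    y = np.array([h[1] for h in history], dtype=float)
    X1 = np.hstack([X, np.ones((len(X), 1))])          # add an intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def estimate(coef, features):
    return float(np.dot(coef[:-1], features) + coef[-1])

def shrink_history(history, max_error=0.15):
    """Drop the oldest observations while the refit model stays accurate.

    Accuracy is checked as mean relative error on the observations kept;
    this only illustrates 'smaller history, similar accuracy'.
    """
    kept = list(history)
    while len(kept) > 4:
        candidate = kept[1:]                            # drop the oldest point
        coef = fit(candidate)
        errs = [abs(estimate(coef, f) - y) / max(abs(y), 1e-9)
                for f, y in candidate]
        if np.mean(errs) > max_error:
            break
        kept = candidate
    return kept

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = [((s,), 2.0 * s + 5 + rng.normal(0, 0.1)) for s in range(1, 40)]
    print(len(data), "->", len(shrink_history(data)))
```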

Multi-objective query optimization in Spark SQL

HAL (Le Centre pour la Communication Scientifique Directe), Aug 22, 2022

Just-In-Time Modeling with DataMingler

DataMingler is a prototype tool that implements a novel conceptual model, the Data Virtual Machine (DVM), and can be used for agile, just-in-time modeling of data from diverse sources. The DVM provides easy-to-understand semantics and fast, flexible schema manipulations. An important and useful class of queries in analytics environments, dataframes, is defined in the context of DVMs. These queries can be expressed either visually or through a novel query language, DVM-QL. We demonstrate DataMingler's capabilities: mapping relational sources, and queries over them, into a DVM schema and augmenting it with information from semi-structured and unstructured sources. We also show how complex relational queries, as well as queries combining structured, semi-structured and unstructured sources, can be expressed easily on the DVM.
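
Neither the DVM nor DVM-QL is defined here, so the snippet below only illustrates the kind of dataframe such a tool targets: attributes coming from a relational-style source and a semi-structured source, aligned on a common key into one wide row per entity. The data, the key and the helper function are hypothetical, not DataMingler's API.

```python
from collections import defaultdict

def dataframe(keys, columns):
    """One row per key; each column contributes its value (or None) for that key."""
    return {k: {name: col.get(k) for name, col in columns.items()} for k in keys}

# Hypothetical relational-style source: total amount spent per customer id.
spent = {"c1": 120.0, "c2": 75.5}

# Hypothetical semi-structured source: free-text reviews per customer id.
reviews_raw = [
    {"customer": "c1", "text": "fast delivery"},
    {"customer": "c3", "text": "great support"},
]
reviews = defaultdict(list)
for r in reviews_raw:
    reviews[r["customer"]].append(r["text"])

df = dataframe(spent.keys() | reviews.keys(),
               {"spent": spent, "reviews": dict(reviews)})
print(df["c1"])   # {'spent': 120.0, 'reviews': ['fast delivery']}
```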

Federated Learning Performance on Early ICU Mortality Prediction with Extreme Data Distributions

Lecture Notes in Computer Science, Dec 31, 2022

Utilizing nullomers in cell-free RNA for early cancer detection

medRxiv (Cold Spring Harbor Laboratory), Jun 16, 2023

From Cloud to Serverless: MOO in the new Cloud epoch

HAL (Le Centre pour la Communication Scientifique Directe), Mar 29, 2022

Mapping Construction Compliant with Schema Semantics

Springer eBooks, 2014

A dominant characteristic of autonomous information sources is their heterogeneity, in terms of both data formats and schemas. Schema and data mappings between such sources are needed in order to query them in a uniform and systematic manner. Guiding the discovery of mappings with automatic tools is one of the fundamental unsolved challenges of data interoperability. In this work we consider the problem of discovering mappings for schemas of autonomous sources that are gradually revealed. Using an overlay of peer databases as an example setting, we present a mapping solution that discovers mappings which can be adapted to new, gradually revealed schema information. Mapping discovery is schema-centric and incorporates new semantics as they are unveiled.

Spatial Data Management in IoT Systems: Solutions and Evaluation

International Journal of Semantic Computing, Mar 1, 2021

As Internet of Things (IoT) systems gain in popularity, an increasing number of Big Data sources become available. Ranging from small sensor networks designed for household use to large, fully automated industrial environments, IoT systems create billions of measurements each second, making traditional storage and indexing solutions obsolete. While research on Big Data has focused on scalable solutions that can support the datasets produced by these systems, the focus has mainly been on managing the volume and velocity of the data rather than on providing efficient solutions for their retrieval and analysis. A key characteristic of these data, which is more often than not overlooked, is the spatial information that can be used to integrate data from multiple sources and conduct multi-dimensional analysis of the collected information. We present here the solutions currently available for the storage and indexing of spatial datasets produced by IoT systems and discuss their applicability in real-world scenarios.
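
As an illustration of the kind of spatial indexing the survey discusses (not a specific system from the paper), the sketch below buckets IoT measurements into fixed-size grid cells so that a range query only scans the cells overlapping the query window. The cell size, record layout and coordinates are arbitrary choices for the example.

```python
from collections import defaultdict

CELL = 0.01  # grid cell size in degrees; an arbitrary choice for the example

def cell_of(lat, lon):
    return (int(lat // CELL), int(lon // CELL))

class GridIndex:
    """Bucket measurements by grid cell; scan only cells overlapping a query box."""

    def __init__(self):
        self.cells = defaultdict(list)

    def insert(self, lat, lon, measurement):
        self.cells[cell_of(lat, lon)].append((lat, lon, measurement))

    def range_query(self, lat_min, lat_max, lon_min, lon_max):
        (i0, j0), (i1, j1) = cell_of(lat_min, lon_min), cell_of(lat_max, lon_max)
        for i in range(i0, i1 + 1):
            for j in range(j0, j1 + 1):
                for lat, lon, m in self.cells.get((i, j), []):
                    if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
                        yield m

idx = GridIndex()
idx.insert(46.204, 6.143, {"sensor": "s1", "temp": 21.5})
print(list(idx.range_query(46.20, 46.21, 6.14, 6.15)))
```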

Dynamic Estimation and Grid Partitioning Approach for Multi-objective Optimization Problems in Medical Cloud Federations

Visualizing and Exploring Big Datasets based on Semantic Community Detection

Extending Database Technology, 2021

Optimizing DICOM data management with NSGA-G

HAL (Le Centre pour la Communication Scientifique Directe), Mar 26, 2019

Cloud-based systems make it possible to manage ever-increasing volumes of medical data. The Digital Imaging and Communications in Medicine (DICOM) standard has been widely adopted to store and transfer medical data, using either single (row or column) or hybrid (row-column) data storage techniques. In particular, hybrid systems leverage the advantages of both techniques and can accommodate various kinds of queries, from full-record retrieval (online transaction processing) to analytical (online analytical processing) queries. Additionally, the pay-as-you-go model and elasticity of cloud computing raise an important issue regarding Multiple Objective Optimization (MOO): finding a data configuration according to user preferences such as storage space, response time, monetary cost and quality. In such a context, the considerable solution space of MOO leads to a Pareto-optimal front with high complexity. Pareto-dominance-based Multiple Objective Evolutionary Algorithms are often used as an alternative, e.g., Non-dominated Sorting Genetic Algorithms (NSGA), which provide lower computational complexity. This paper presents NSGA-G, an NSGA based on grid partitioning that improves the complexity and quality of current NSGAs and obtains efficient storage and querying of DICOM hybrid data. Experimental results on the DTLZ test problems [10] and on DICOM hybrid data demonstrate the relevance of the proposed algorithm.

Modelling Processes of Big Data Analytics

Lecture Notes in Computer Science, 2015

Analytics tasks in scientific and industrial environments are performed in some order that, as a whole, represents the rationale of a specific process on the data. The challenge of processing the data lies, beyond their mere size, in their dispersion and the variety of their formats. The data analysis may include a range of tasks to be executed on a range of query engines, created by various users such as business analysts, engineers and end-users. Depending on their role and expertise, users may need or care for a different level of abstraction with respect to the execution of individual tasks and of the overall process. Therefore, a system for Big Data analytics should enable the expression of tasks in an abstract manner, adaptable to the user's role, interest and expertise. In this work we discuss the modelling of Big Data analytics. We propose a novel representation model for analytics tasks and overall processes that encapsulates their declaration but also their execution semantics. The model allows for the definition of analytics processes at a varying level of abstraction, adaptable to the user role. Our motivation derives from real use cases.
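
The paper's representation model is not reproduced here; the sketch below merely illustrates the general shape such a model might take: tasks carrying both a declarative description and optional engine-specific execution details, composed into an ordered process that can be rendered at the level of abstraction a user needs. All class and field names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Task:
    name: str
    description: str                       # declarative view, for analysts
    engine: Optional[str] = None           # execution view, for engineers
    implementation: Optional[str] = None   # e.g., a script or query to run

@dataclass
class Process:
    name: str
    tasks: list = field(default_factory=list)

    def view(self, abstract=True):
        """Render the process at a high (declarative) or low (execution) level."""
        for i, t in enumerate(self.tasks, 1):
            if abstract or t.engine is None:
                yield f"{i}. {t.name}: {t.description}"
            else:
                yield f"{i}. {t.name} on {t.engine}: {t.implementation}"

p = Process("churn-analysis", [
    Task("clean", "remove malformed records", "Spark", "clean_records.py"),
    Task("aggregate", "monthly usage per customer", "SQL engine", "SELECT ..."),
])
print("\n".join(p.view(abstract=True)))
```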

Rethinking reinforcement learning for cloud elasticity

Cloud elasticity, i.e., the dynamic allocation of resources to applications to meet fluctuating workload demands, has been one of the greatest challenges in cloud computing. Approaches based on reinforcement learning have been proposed, but they require a large number of states in order to model complex application behavior. In this work we propose a novel reinforcement learning approach that employs adaptive state-space partitioning. The idea is to start from a single state that represents the entire environment and to partition it into finer-grained states adaptively, according to the observed workload and system behavior, following a decision-tree approach. We explore novel statistical criteria and strategies that decide both the correct parameters and the appropriate time to perform the partitioning.
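
The abstract describes the approach only at a high level; the following sketch (not the authors' algorithm) shows the core decision-tree idea: start with a single state covering the whole observation space and split a leaf along one dimension once enough experience has accumulated there. The splitting trigger and the median-split rule are placeholders for the paper's statistical criteria.

```python
class StateNode:
    """A leaf holds experience; splitting turns it into an internal node."""

    def __init__(self):
        self.samples = []      # (observation, reward) pairs seen in this state
        self.split = None      # (dimension, threshold) once partitioned
        self.children = None

    def locate(self, obs):
        if self.split is None:
            return self
        dim, thr = self.split
        return self.children[obs[dim] <= thr].locate(obs)

    def record(self, obs, reward, min_samples=20):
        leaf = self.locate(obs)
        leaf.samples.append((obs, reward))
        if len(leaf.samples) >= min_samples:     # placeholder splitting criterion
            leaf.partition()

    def partition(self):
        # Split on the dimension with the widest spread, at its median value.
        dims = len(self.samples[0][0])
        spreads = [max(o[d] for o, _ in self.samples) - min(o[d] for o, _ in self.samples)
                   for d in range(dims)]
        dim = spreads.index(max(spreads))
        values = sorted(o[dim] for o, _ in self.samples)
        thr = values[len(values) // 2]
        if thr >= values[-1]:                    # degenerate split; wait for more data
            return
        self.split = (dim, thr)
        self.children = {True: StateNode(), False: StateNode()}
        for o, r in self.samples:
            self.children[o[dim] <= thr].samples.append((o, r))
        self.samples = []

root = StateNode()
# Observations could be, e.g., (cpu_load, request_rate); rewards are placeholders.
for i in range(100):
    root.record((i % 10 / 10.0, (i * 7) % 10 / 10.0), reward=1.0)
```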

Data Virtual Machines: Enabling Data Virtualization

Lecture Notes in Computer Science, 2021

Efficient Representation of Very Large Linked Datasets as Graphs

Large linked datasets are nowadays available on many scientific topics of interest and offer invaluable knowledge. These datasets are of interest to a wide audience, including people with limited or no knowledge of the Semantic Web, who want to explore and analyse this information in a user-friendly way. Aiming to support such usage, systems have been developed that support such exploration; however, they impose many limitations, as they give users access to only a limited part of the input dataset, either by aggregating information or by exploiting data formats such as hierarchies. As more linked datasets become available and more people are interested in exploring them, it is imperative to provide a user-friendly way to access and explore diverse and very large datasets in an intuitive way, as graphs. We present here an off-line pre-processing technique, divided into three phases, that can transform any linked dataset, independently of size and characteristics, into one continuous graph in two-dimensional space. We store the spatial information of the graph, add the needed indices, and provide the graphical information through a dedicated API to support exploration of the information. Finally, we conduct an experimental analysis showing that our technique can process and represent large and diverse datasets as one continuous graph.
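
The three-phase pipeline itself is not given in the abstract; the sketch below only illustrates the storage idea described there: once every node has a 2D position, persist the coordinates and index them so that a viewport query fetches just the visible part of the graph. The schema, the example IRIs and the indexing choice are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id TEXT PRIMARY KEY, x REAL, y REAL, label TEXT);
    CREATE TABLE edges (src TEXT, dst TEXT);
    CREATE INDEX nodes_xy ON nodes (x, y);   -- supports viewport (window) queries
""")

-- the lines below continue in Python --
""" if False else None  # (comment separator removed in the actual snippet)
```

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id TEXT PRIMARY KEY, x REAL, y REAL, label TEXT);
    CREATE TABLE edges (src TEXT, dst TEXT);
    CREATE INDEX nodes_xy ON nodes (x, y);   -- supports viewport (window) queries
""")

# Hypothetical layout output: each node already has 2D coordinates.
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?)", [
    ("ex:Geneva", 12.5, 3.1, "Geneva"),
    ("ex:Switzerland", 14.0, 2.7, "Switzerland"),
])
conn.execute("INSERT INTO edges VALUES (?, ?)", ("ex:Geneva", "ex:Switzerland"))

def viewport(x0, x1, y0, y1):
    """Return only the nodes whose coordinates fall inside the requested window."""
    return conn.execute(
        "SELECT id, x, y, label FROM nodes "
        "WHERE x BETWEEN ? AND ? AND y BETWEEN ? AND ?",
        (x0, x1, y0, y1)).fetchall()

print(viewport(10, 15, 0, 5))
```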

An Economic Model for Self-Tuned Cloud Caching

Proceedings, Mar 1, 2009

Cloud computing, the new trend for service infrastructures, requires user multi-tenancy as well as minimal capital expenditure. In a cloud that serves large amounts of data that are massively collected and queried, such as scientific data, users typically pay for query services. The cloud supports caching of data in order to provide quality query services. User payments cover query execution costs and the maintenance of the cloud infrastructure, and yield cloud profit. The challenge lies in providing efficient and resource-economic query services while keeping the cloud profitable. In this work we propose an economic model for self-tuned cloud caching targeting the service of scientific data. The proposed economy is adapted to policies that encourage high-quality individual and overall query services while also bolstering the profit of the cloud. We propose a cost model that takes into account all possible query and infrastructure expenditures. The experimental study shows that the proposed solution is viable for a variety of workloads and data.
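
The paper's cost model is not reproduced in the abstract; the sketch below is only a toy accounting of the balance it describes: payments for queries on one side, infrastructure and query-execution costs on the other, with the difference as cloud profit. All prices, costs and the caching assumption are invented for the illustration.

```python
def monthly_balance(queries, price_per_query=0.05,
                    exec_cost_cache_hit=0.005, exec_cost_miss=0.02,
                    cache_cost_per_gb=0.10, cached_gb=50):
    """Toy profit calculation: revenue from queries minus execution and cache costs.

    `queries` is a list of booleans: True if the query was served from the cache.
    """
    revenue = price_per_query * len(queries)
    execution = sum(exec_cost_cache_hit if hit else exec_cost_miss for hit in queries)
    infrastructure = cache_cost_per_gb * cached_gb
    return revenue - execution - infrastructure

# A self-tuning policy could grow the cache only while the marginal profit stays positive.
workload = [True] * 8000 + [False] * 2000
print(round(monthly_balance(workload), 2))
```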

Predicting cost amortization for query services

Data Virtual Machines: Simplifying Data Sharing, Exploration & Querying in Big Data Environments

2022 IEEE International Conference on Big Data (Big Data)

Elastic management of cloud applications using adaptive reinforcement learning

2017 IEEE International Conference on Big Data (Big Data), 2017

Modern large-scale computing deployments consist of complex applications running over machine clusters. An important issue in such deployments is elasticity, i.e., the dynamic allocation of resources to applications to meet fluctuating workload demands. Threshold-based approaches are typically employed, yet they are difficult to calibrate and optimize. Approaches based on reinforcement learning (RL) have been proposed, but they require a large number of states in order to model complex application behavior. Methods that adaptively partition the state space have been proposed, but their partitioning criteria and strategies are sub-optimal. In this work we present MDP DT, a novel full-model-based reinforcement learning algorithm for elastic resource management that employs adaptive state-space partitioning. We propose two novel statistical criteria and three strategies, and we experimentally show that they correctly decide both where and when to partition, outperforming existing approaches. We evaluate MDP DT in a real large-scale cluster over variable, previously unseen workloads and show that it makes more informed decisions than static, model-free and threshold-based approaches, while requiring a minimal amount of training data. We also show experimentally that this adaptation enables MDP DT to optimize the achieved profit while being 40% cheaper than calibrated RL and threshold approaches.
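
The two statistical criteria used by MDP DT are not detailed in the abstract. As an illustration of what such a criterion can look like, the sketch below splits a state only when the rewards observed on the two sides of a candidate threshold differ markedly under Welch's t statistic; the test choice, sample-size floor and cut-off value are assumptions, not the paper's.

```python
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (ma - mb) / ((va / len(a) + vb / len(b)) ** 0.5)

def should_split(samples, dim, threshold, t_crit=2.0):
    """Decide whether splitting on `dim` at `threshold` separates reward behavior.

    `samples` is a list of (observation, reward) pairs; `t_crit` is a placeholder
    cut-off instead of a proper p-value from the t distribution.
    """
    left = [r for o, r in samples if o[dim] <= threshold]
    right = [r for o, r in samples if o[dim] > threshold]
    if len(left) < 5 or len(right) < 5:
        return False
    return abs(welch_t(left, right)) > t_crit

# Rewards behave differently above and below load 0.5, so a split is warranted.
data = [((x / 100,), (1.0 if x < 50 else 0.2) + 0.01 * (x % 3)) for x in range(100)]
print(should_split(data, dim=0, threshold=0.5))
```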
