Bogdan Nicolae | IBM Research
Papers by Bogdan Nicolae
Storage elasticity on IaaS clouds is a crucial feature in the age of data-intensive computing. However, the traditional provisioning model of leveraging virtual disks of fixed capacity and performance characteristics has limited ability to match the increasingly dynamic nature of I/O application requirements. This mismatch is particularly problematic for scientific applications that interleave periods of I/O inactivity with I/O-intensive bursts. In this context, over-provisioning for best performance during peaks leads to significant extra costs because of unnecessarily tied-up resources, while any other trade-off leads to performance loss. This paper provides a transparent solution that automatically boosts I/O bandwidth during peaks for the underlying virtual disks, effectively avoiding over-provisioning without performance loss. Our proposal relies on the idea of leveraging short-lived virtual disks with better performance characteristics (and thus more expensive) to act during peaks as a caching layer for the persistent virtual disks where the application data is stored. We show how this idea can be achieved efficiently at the block-device level, using a caching mechanism that leverages iterative behavior and learns from past experience. We demonstrate the benefits of our proposal both for microbenchmarks and for two real-life applications using large-scale experiments.
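To make the mechanism concrete, here is a minimal sketch of the peak-aware caching decision, assuming hypothetical fast_disk/slow_disk block devices and a simple request-rate threshold; the paper's actual block-level implementation and learning heuristics are more involved.

```python
import time
from collections import deque

class PeakAwareCache:
    """Route writes to a fast scratch disk during I/O bursts, otherwise
    write through to the slower persistent disk (illustrative only)."""

    def __init__(self, fast_disk, slow_disk, window_s=5.0, peak_iops=1000):
        self.fast, self.slow = fast_disk, slow_disk
        self.window_s, self.peak_iops = window_s, peak_iops
        self.recent = deque()          # timestamps of recent requests
        self.dirty = set()             # blocks currently cached on the fast disk

    def _in_peak(self):
        now = time.time()
        self.recent.append(now)
        while self.recent and now - self.recent[0] > self.window_s:
            self.recent.popleft()
        return len(self.recent) / self.window_s > self.peak_iops

    def write(self, block_id, data):
        if self._in_peak():
            self.fast.write(block_id, data)   # absorb the burst on the fast disk
            self.dirty.add(block_id)
        else:
            self.slow.write(block_id, data)

    def drain(self):
        """Flush cached blocks back to the persistent disk after the peak."""
        for block_id in list(self.dirty):
            self.slow.write(block_id, self.fast.read(block_id))
            self.dirty.discard(block_id)
```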
Storage elasticity on IaaS clouds is an important feature for data-intensive workloads: storage requirements can vary greatly during application runtime, making worst-case over-provisioning a poor choice that leads to unnecessarily tied-up storage and extra costs for the user. While the ability to adapt dynamically to storage requirements is thus attractive, how to implement it is not well understood. Current approaches simply rely on users to attach and detach virtual disks to the virtual machine (VM) instances and then manage them manually, thus greatly increasing application complexity while reducing cost efficiency. Unlike such approaches, this paper aims to provide a transparent solution that presents a unified storage space to the VM in the form of a regular POSIX file system that hides the details of attaching and detaching virtual disks by handling those actions transparently based on dynamic application requirements. The main difficulty in this context is to understand the intent of the application and regulate the available storage in order to avoid running out of space while minimizing the performance overhead of doing so. To this end, we propose a storage space prediction scheme that analyzes multiple system parameters and dynamically adapts monitoring based on the intensity of the I/O in order to get as close as possible to the real usage. We show the value of our proposal over static worst-case over-provisioning and simpler elastic schemes that rely on a reactive model to attach and detach virtual disks, using both synthetic benchmarks and real-life data-intensive applications. Our experiments demonstrate that we can reduce storage waste/cost by 30-40% with only 2-5% performance overhead.
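A rough sketch of how such a prediction scheme could look, assuming a simple linear extrapolation of recent usage and an I/O-rate-driven sampling interval (illustrative only; the paper's predictor analyzes more system parameters):

```python
class UsagePredictor:
    """Extrapolate near-future disk usage and adapt the sampling interval
    to the observed I/O intensity (illustrative sketch, not the paper's model)."""

    def __init__(self, min_interval=1.0, max_interval=30.0):
        self.samples = []                    # (timestamp, bytes_used)
        self.min_interval = min_interval
        self.max_interval = max_interval

    def record(self, ts, bytes_used):
        self.samples.append((ts, bytes_used))
        self.samples = self.samples[-10:]    # keep a short history

    def predicted_usage(self, horizon_s):
        if len(self.samples) < 2:
            return self.samples[-1][1] if self.samples else 0
        (t0, u0), (t1, u1) = self.samples[0], self.samples[-1]
        growth = (u1 - u0) / max(t1 - t0, 1e-6)     # bytes per second
        return u1 + growth * horizon_s

    def next_interval(self):
        """Sample more often when usage grows quickly, less often when idle."""
        if len(self.samples) < 2:
            return self.min_interval
        (t0, u0), (t1, u1) = self.samples[-2], self.samples[-1]
        rate = abs(u1 - u0) / max(t1 - t0, 1e-6)    # bytes per second
        if rate == 0:
            return self.max_interval
        # heuristic: aim to sample roughly every 100 MB of growth
        return max(self.min_interval, min(self.max_interval, 1e8 / rate))
```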
As the explosion of data sizes continues to push the limits of our abilities to efficiently store and process big data, next-generation big data systems face multiple challenges. One such important challenge relates to the limited scalability of I/O, a determining factor in the overall performance of big data applications. Although paradigms like MapReduce have long been used to take advantage of local disks and avoid data movements over the network as much as possible, with increasing core count per node, local storage comes under increasing I/O pressure itself and prompts the need to equip nodes with multiple disks. However, given the rising need to virtualize large datacenters in order to provide a more flexible allocation and consolidation of physical resources (transforming them into public or private/hybrid clouds), the following questions arise: is it possible to take advantage of multiple local disks at virtual machine (VM) level in order to speed up big data analytics? If so, what are the best practices to achieve a high virtualized aggregated I/O throughput? This paper aims to answer these questions in the context of I/O-intensive MapReduce workloads: it analyzes and characterizes their behavior under different virtualization scenarios in order to propose best practices for current approaches and speculate on future areas of improvement.
With increasing scale and complexity of supercomputing and cloud computing architectures, faults are becoming a frequent occurrence. For a large class of applications that run for a long time and are tightly coupled, Checkpoint-Restart (CR) is the only feasible method to survive failures. However, exploding checkpoint sizes that need to be dumped to storage pose a major scalability challenge, prompting the need to reduce the amount of checkpointing data. This paper contributes a novel collective memory-contents deduplication scheme that attempts to identify and eliminate duplicate memory pages before they are saved to storage. Unlike previous approaches that concentrate on the checkpoints of the same process, our approach identifies duplicate memory pages shared by different processes (regardless of whether they reside on the same node or on different nodes). We show both how to achieve such global deduplication in a scalable fashion and how to leverage it effectively to optimize the data layout in such a way that it minimizes I/O bottlenecks. Large-scale experiments show significant reductions of storage space consumption and performance overhead compared to several state-of-the-art approaches, both in synthetic benchmarks and for a real-life high-performance computing application.
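The core idea can be illustrated with a toy, single-node simulation of the hashing step, where pages from different ranks that have identical content are stored only once (hypothetical data layout; the real scheme performs this collectively across nodes and additionally optimizes the resulting I/O layout):

```python
import hashlib

def deduplicate(checkpoints):
    """Collective-style deduplication sketch: checkpoints maps a process rank
    to its list of memory pages (bytes). Pages with identical content are
    stored once, regardless of which process owns them."""
    store = {}                       # content hash -> page bytes (stored once)
    layout = {}                      # rank -> list of content hashes
    for rank, pages in checkpoints.items():
        layout[rank] = []
        for page in pages:
            digest = hashlib.sha256(page).hexdigest()
            store.setdefault(digest, page)      # keep one copy globally
            layout[rank].append(digest)
    return store, layout

# Example: two "processes" sharing one identical page
store, layout = deduplicate({0: [b"A" * 4096, b"B" * 4096],
                             1: [b"A" * 4096, b"C" * 4096]})
assert len(store) == 3               # 4 pages, but only 3 unique contents
```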
With increasing scale and complexity of supercomputing and cloud computing architectures, faults are becoming a frequent occurrence, which makes reliability a difficult challenge. Although for some applications it is enough to restart failed tasks, there is a large class of applications where tasks run for a long time or are tightly coupled, thus making a restart from scratch unfeasible. Checkpoint-Restart (CR), the main method to survive failures for such applications, faces additional challenges in this context: not only does it need to minimize the performance overhead on the application due to checkpointing, but it also needs to operate with scarce resources. Given the iterative nature of the targeted applications, we put forward the assumption that first-time writes to memory during asynchronous checkpointing generate the same kind of interference as they did in past iterations. Based on this assumption, we propose a novel asynchronous checkpointing approach that leverages both current and past access pattern trends in order to optimize the order in which memory pages are flushed to stable storage. Large-scale experiments show up to 60% improvement when compared to state-of-the-art checkpointing approaches, all achievable with an extra memory requirement of less than 5% of the total application memory.
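A minimal sketch of the flush-ordering idea, assuming a recorded first-write time per page from the previous iteration (a hypothetical structure; the actual approach tracks access pattern trends rather than a single timestamp):

```python
def flush_order(dirty_pages, first_touch_history):
    """Order dirty pages so that those the application touched earliest in
    past iterations are flushed first, reducing copy-on-write interference.
    first_touch_history maps page_id -> first-write time observed in the
    previous iteration (illustrative sketch)."""
    late = max(first_touch_history.values(), default=0) + 1
    # pages never seen before are assumed to be touched late
    return sorted(dirty_pages, key=lambda p: first_touch_history.get(p, late))

# Pages written early in past iterations get flushed first
order = flush_order({10, 11, 12}, {12: 0.2, 10: 1.5})
assert order == [12, 10, 11]
```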
A critical feature of IaaS cloud computing is the ability to deploy, boot, and terminate large groups of interdependent VMs very quickly, which enables users to efficiently exploit the on-demand nature and elasticity of clouds even for large-scale deployments. A common pattern in this context is multi-deployment, i.e., using the same VM image template to instantiate a large number of VMs in parallel. A difficult trade-off arises in this context: access the content of the template on demand but slowly due to I/O bottlenecks, or pre-broadcast the full contents of the template to the local storage of the hosting nodes to avoid such bottlenecks. Unlike previous approaches that are biased towards either of the extremes, we propose a scheme that augments on-demand access with a collaborative scheme in which the VMs leverage the similarity of their access patterns in order to anticipate future accesses and exchange chunks among themselves, in an attempt to reduce contention on the remote storage where the VM image template is stored. Large-scale experiments show improvements in read throughput of 30-40% compared to on-demand access schemes that perform in isolation.
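As an illustration of the collaborative read path, here is a sketch with a hypothetical peer API: try the local cache, then peers that already booted past this point, and only then the shared repository (the real protocol also anticipates future accesses rather than reacting to misses):

```python
def read_chunk(chunk_id, local_cache, peers, remote_storage):
    """Fetch a VM image chunk: local cache first, then peers, and only then
    the contended remote repository (illustrative of the collaborative
    scheme, not the actual protocol)."""
    if chunk_id in local_cache:
        return local_cache[chunk_id]
    for peer in peers:
        data = peer.try_get(chunk_id)          # hypothetical peer API
        if data is not None:
            local_cache[chunk_id] = data
            return data
    data = remote_storage.get(chunk_id)        # fallback to shared storage
    local_cache[chunk_id] = data
    return data
```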
Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running HPC applications. Given the need to provide fault tolerance, support for suspend–resume and offline migration, an efficient Checkpoint-Restart mechanism becomes paramount in this context. We propose BlobCR, a dedicated checkpoint repository that is able to take live incremental snapshots of the whole disk attached to the virtual machine (VM) instances. BlobCR aims to minimize the performance overhead of checkpointing by persisting VM disk snapshots asynchronously in the background using a low-overhead technique we call selective copy-on-write. It includes support for both application-level and process-level checkpointing, as well as support to roll back file system changes. Experiments at large scale demonstrate the benefits of our proposal both in synthetic settings and for a real-life HPC application.
As Map-Reduce emerges as a leading programming paradigm for data-intensive computing, today’s frameworks which support it still have substantial shortcomings that limit its potential scalability. In this paper we discuss several directions where there is room for such progress: they concern storage efficiency under massive data access concurrency, scheduling, volatility, and fault tolerance. We place our discussion in the perspective of the current evolution towards an increasing integration of large-scale distributed platforms (clouds, cloud federations, enterprise desktop grids, etc.). We propose an approach which aims to overcome the current limitations of existing Map-Reduce frameworks in order to achieve scalable, concurrency-optimized, fault-tolerant Map-Reduce data processing on hybrid infrastructures. This approach will be evaluated with real-life bio-informatics applications on existing Nimbus-powered cloud testbeds interconnected with desktop grids.
As data volumes increase at a high speed in more and more application fields of science, engineering, information services, etc., the challenges posed by data-intensive computing gain an increasing importance. The emergence of highly scalable infrastructures, e.g., for cloud computing and for petascale computing and beyond, introduces additional issues for which scalable data management becomes an immediate need. This paper brings several contributions. First, it proposes a set of principles for designing highly scalable distributed storage systems that are optimized for heavy data access concurrency. In particular, we highlight the potentially large benefits of using versioning in this context. Second, based on these principles, we propose a set of versioning algorithms, both for data and metadata, that enable a high throughput under concurrency. Finally, we implement and evaluate these algorithms in the BlobSeer prototype, which we integrate as a storage backend into the Hadoop MapReduce framework. We perform extensive microbenchmarks as well as experiments with real MapReduce applications: they demonstrate that applying the principles defended in our approach brings substantial benefits to data-intensive applications.
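A toy illustration of the versioning principle, where every write publishes a new immutable snapshot so that concurrent readers never block writers (greatly simplified compared to BlobSeer's distributed data/metadata algorithms):

```python
import threading

class VersionedBlob:
    """Toy versioning in the spirit of the approach: each write publishes a
    new immutable snapshot, so readers always see a consistent version and
    never block writers (distributed metadata handling omitted)."""

    def __init__(self):
        self.versions = [{}]                  # version number -> {offset: chunk}
        self.lock = threading.Lock()          # only serializes version publication

    def write(self, offset, chunk):
        with self.lock:
            snapshot = dict(self.versions[-1])    # copy-on-write of the metadata
            snapshot[offset] = chunk
            self.versions.append(snapshot)
            return len(self.versions) - 1         # the newly published version

    def read(self, version, offset):
        return self.versions[version].get(offset)

blob = VersionedBlob()
v1 = blob.write(0, b"hello")
v2 = blob.write(4096, b"world")
assert blob.read(v1, 4096) is None and blob.read(v2, 4096) == b"world"
```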
With data volumes increasing at a high rate and the emergence of highly scalable infrastructures (cloud computing, petascale computing), distributed management of data becomes a crucial issue that faces many challenges. This thesis brings several contributions in order to address such challenges. First, it proposes a set of principles for designing highly scalable distributed storage systems that are optimized for heavy data access concurrency. In particular, it highlights the potentially large benefits of using versioning in this context. Second, based on these principles, it introduces a series of distributed data and metadata management algorithms that enable a high throughput under concurrency. Third, it shows how to efficiently implement these algorithms in practice, dealing with key issues such as high-performance parallel transfers, efficient maintenance of distributed data structures, fault tolerance, etc. These results are used to build BlobSeer, an experimental prototype that is used to demonstrate both the theoretical benefits of the approach in synthetic benchmarks and its practical benefits in real-life application scenarios: as a storage backend for MapReduce applications, as a storage backend for deployment and snapshotting of virtual machine images in clouds, and as a quality-of-service-enabled data storage service for cloud applications. Extensive experiments on the Grid'5000 testbed show that BlobSeer remains scalable and sustains a high throughput even under heavy access concurrency, outperforming several state-of-the-art approaches by a large margin.
With increasing interest among mainstream users to run HPC applications, Infrastructure-as-a-Service (IaaS) cloud computing platforms represent a viable alternative to the acquisition and maintenance of expensive hardware, often out of the financial capabilities of such users. At the same time, one of the critical needs of HPC applications is efficient, scalable, and persistent storage. Unfortunately, storage options proposed by cloud providers are not standardized and typically use a different access model. In this context, the local disks on the compute nodes can be used to save large data sets, such as the data generated by Checkpoint-Restart (CR). This local storage offers high throughput and scalability, but it needs to be combined with persistency techniques, such as block replication or erasure codes. One of the main challenges that such techniques face is to minimize the overhead in performance and I/O resource utilization (i.e., storage space and bandwidth), while at the same time guaranteeing high reliability of the saved data. This paper introduces a novel persistency technique that leverages Reed-Solomon (RS) encoding to save data in a reliable fashion. Compared to traditional approaches that rely on block replication, we demonstrate about 50% higher throughput while reducing network bandwidth and storage utilization by a factor of 2 for the same targeted reliability level. This is shown both by modeling and by real-life experimentation on hundreds of nodes.
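The storage and bandwidth argument can be illustrated with a small overhead calculation, using example parameters k=8, m=2 that are not taken from the paper: replication writes r copies of every byte, while RS(k, m) splits the data into k blocks plus m parity blocks and writes only (k+m)/k bytes per byte of data while still tolerating m losses.

```python
def storage_overhead(scheme, k=None, m=None, replicas=None):
    """Raw bytes written per byte of checkpoint data.
    Replication with r copies costs r; Reed-Solomon RS(k, m) costs (k + m) / k."""
    if scheme == "replication":
        return replicas
    if scheme == "rs":
        return (k + m) / k
    raise ValueError(scheme)

# Both configurations survive the loss of up to 2 copies/blocks:
print(storage_overhead("replication", replicas=3))   # 3.0x data written
print(storage_overhead("rs", k=8, m=2))              # 1.25x data written
```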
Live migration of virtual machines (VMs) is a key feature of virtualization that is extensively leveraged in IaaS cloud environments: it is the basic building block of several important capabilities, such as load balancing, proactive fault tolerance, power management, and online maintenance. While most live migration efforts concentrate on how to transfer the memory from source to destination during the migration process, comparatively little attention has been devoted to the transfer of storage. This problem is gaining increasing importance: for performance reasons, virtual machines that run large-scale, data-intensive applications tend to rely on local storage, which poses a difficult challenge for live migration: it needs to handle storage transfer in addition to memory transfer. This paper proposes a memory-migration-independent approach that addresses this challenge. It relies on a hybrid active push / prioritized prefetch strategy, which makes it highly resilient to rapid changes of disk state exhibited by I/O-intensive workloads. At the same time, it is minimally intrusive in order to ensure a maximum of portability with a wide range of hypervisors. Large-scale experiments that involve multiple simultaneous migrations of both synthetic benchmarks and a real scientific application show improvements of up to 10x faster migration time, 10x less bandwidth consumption, and 8x less performance degradation over the state of the art.
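A sketch of the ordering logic behind a hybrid active push / prioritized prefetch strategy, with hypothetical inputs (per-block dirty counters and the set of blocks the destination has already requested); the actual system reorders dynamically as the disk state changes:

```python
def plan_transfer(blocks, dirty_log, accessed_on_destination):
    """Order disk blocks for migration: blocks the destination already asked
    for are prefetched first, the rest are actively pushed, and frequently
    re-dirtied blocks are deferred to the end to avoid retransfers
    (illustrative ordering only)."""
    def priority(block):
        if block in accessed_on_destination:
            return (0, 0)                          # prioritized prefetch
        return (1, dirty_log.get(block, 0))        # active push, hottest blocks last
    return sorted(blocks, key=priority)

plan = plan_transfer([1, 2, 3, 4],
                     dirty_log={2: 5, 3: 1},
                     accessed_on_destination={4})
assert plan == [4, 1, 3, 2]
```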
Research report, INRIA, Jan 1, 2010
Infrastructure-as-a-Service (IaaS) cloud computing has revolutionized the way we think of acquiring resources by introducing a simple change: allowing users to lease computational resources from the cloud provider's datacenter for a short time by deploying virtual machines (VMs) on those resources. This new model raises new challenges in the design and development of IaaS middleware. One of those challenges is the need to deploy a large number (hundreds or even thousands) of VM instances simultaneously. Once the VM instances are deployed, another challenge is to simultaneously take a snapshot of many images and transfer them to persistent storage to support management tasks, such as suspend-resume and migration. With datacenters growing at a fast rate and configurations becoming heterogeneous, it is important to enable efficient concurrent deployment and snapshotting that is at the same time hypervisor-independent and ensures a maximum of compatibility with different configurations. This paper addresses these challenges by proposing a virtual file system specifically optimized for virtual machine image storage. It is based on a lazy transfer scheme coupled with object versioning that demonstrates excellent performance in terms of consumed resources: execution time, network traffic, and storage space. Experiments on hundreds of nodes demonstrate performance improvements in concurrent VM deployments ranging from a factor of 2 up to 25 over the state of the art, with storage and bandwidth utilization reductions of as much as 90%, while at the same time keeping comparable snapshotting performance, which comes with the added benefit of high portability.
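The lazy transfer plus versioning idea can be sketched as follows, with a hypothetical repository interface (fetch/publish) standing in for the real backend:

```python
class LazyImage:
    """Sketch of lazy transfer with object versioning: chunks of the base
    image are fetched from the repository only on first read, and writes go
    to a local copy-on-write layer that can later be published as a new
    image version (names are illustrative, not the actual system's API)."""

    def __init__(self, repository, image_id):
        self.repo, self.image_id = repository, image_id
        self.base = {}        # chunk index -> bytes fetched so far
        self.cow = {}         # chunk index -> locally modified bytes

    def read(self, index):
        if index in self.cow:
            return self.cow[index]
        if index not in self.base:
            # lazy transfer: fetch the chunk only when it is first needed
            self.base[index] = self.repo.fetch(self.image_id, index)
        return self.base[index]

    def write(self, index, data):
        self.cow[index] = data                  # never touches the base image

    def snapshot(self):
        """Publish only the modified chunks as a new image version."""
        return self.repo.publish(self.image_id, self.cow)
```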
As Map-Reduce emerges as a leading programming paradigm for data-intensive computing, today's frameworks which support it still have substantial shortcomings that limit its potential scalability. In this paper we discuss several directions where there is room for such progress: they concern storage efficiency under massive data access concurrency, scheduling, volatility and fault-tolerance. We place our discussion in the perspective of the current evolution towards an increasing integration of large-scale distributed platforms (clouds, cloud federations, enterprise desktop grids, etc.). We propose an approach which aims to overcome the current limitations of existing Map-Reduce frameworks, in order to achieve scalable, concurrency-optimized, fault-tolerant Map-Reduce data processing on hybrid infrastructures. This approach will be evaluated with real-life bio-informatics applications on existing Nimbus-powered cloud testbeds interconnected with desktop grids.
Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running scientific applications. Given the dynamic nature of IaaS clouds and the long runtime and resource utilization of such applications, an efficient checkpoint-restart mechanism becomes paramount in this context. This paper proposes a solution to the aforementioned challenge that aims at minimizing the storage space and performance overhead of checkpoint-restart. We introduce an approach that leverages virtual machine (VM) disk-image multi-snapshotting and multi-deployment inside checkpoint-restart protocols running at guest level in order to efficiently capture and potentially roll back the complete state of the application, including file system modifications. Experiments on the G5K testbed show substantial improvement for MPI applications over existing approaches, both for the case when customized checkpointing is available at application level and the case when it needs to be handled at process level.
Parallel & Distributed …, Jan 1, 2010
Hadoop is a software framework supporting the Map-Reduce programming model. It relies on the Hadoop Distributed File System (HDFS) as its primary storage system. The efficiency of HDFS is crucial for the performance of Map-Reduce applications. We substitute the original HDFS layer of Hadoop with a new, concurrency-optimized data storage layer based on the BlobSeer data management service. Thereby, the efficiency of Hadoop is significantly improved for data-intensive Map-Reduce applications, which naturally exhibit a high degree of data access concurrency. Moreover, BlobSeer's features (built-in versioning, its support for concurrent append operations) open the possibility for Hadoop to further extend its functionalities. We report on extensive experiments conducted on the Grid'5000 testbed. The results illustrate the benefits of our approach over the original HDFS-based implementation of Hadoop.
Infrastructure-as-a-Service (IaaS) cloud computing has revolutionized the way we think of acquiring computational resources: it allows users to deploy virtual machines (VMs) at large scale and pay only for the resources that were actually used throughout the runtime of the VMs. This new model raises new challenges in the design and development of IaaS middleware: excessive storage costs associated with both user data and VM images might make the cloud less attractive, especially for users who need to manipulate huge data sets and a large number of VM images. Storage costs result not only from storage space utilization, but also from bandwidth consumption: in typical deployments, a large number of data transfers between the VMs and the persistent storage are performed, all under high performance requirements. This paper evaluates the trade-off resulting from transparently applying data compression to conserve storage space and bandwidth at the cost of a slight computational overhead. We aim at reducing the storage space and bandwidth needs with minimal impact on data access performance. Our solution builds on BlobSeer, a distributed data management service specifically designed to sustain a high throughput for concurrent accesses to huge data sequences that are distributed at large scale. Extensive experiments demonstrate that our approach achieves large reductions (at least 40%) of bandwidth and storage space utilization, while still attaining high performance levels that even surpass the original (no compression) performance levels in several data-intensive scenarios.
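A minimal illustration of the transparent compression trade-off, using Python's standard zlib and a simple "compress only when it pays off" rule (BlobSeer's actual chunk pipeline differs):

```python
import zlib

def put_chunk(store, key, data, min_ratio=0.9):
    """Transparently compress a chunk before storing it, but keep it raw when
    compression does not pay off (sketch of the trade-off, not BlobSeer code)."""
    packed = zlib.compress(data, 1)             # cheap, fast compression level
    if len(packed) < min_ratio * len(data):
        store[key] = (True, packed)
    else:
        store[key] = (False, data)              # incompressible: store as-is

def get_chunk(store, key):
    compressed, payload = store[key]
    return zlib.decompress(payload) if compressed else payload

store = {}
put_chunk(store, "c0", b"text " * 10000)        # highly compressible chunk
assert get_chunk(store, "c0") == b"text " * 10000
```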
With Infrastructure-as-a-Service (IaaS) cloud economics getting increasingly complex and dynamic, resource costs can vary greatly over short periods of time. Therefore, a critical issue is the ability to deploy, boot and terminate VMs very quickly, which enables cloud users to exploit elasticity to find the optimal trade-off between the computational needs (number of resources, usage time) and budget constraints. This paper proposes an adaptive prefetching mechanism aiming to reduce the time required to simultaneously boot a large number of VM instances on clouds from the same initial VM image (multi-deployment). Our proposal does not require any foreknowledge of the exact access pattern. It dynamically adapts to it at run time, enabling the slower instances to learn from the experience of the faster ones. Since all booting instances typically access only a small part of the virtual image along almost the same pattern, the required data can be prefetched in the background. Large-scale experiments under concurrency on hundreds of nodes show that introducing such a prefetching mechanism can achieve a speed-up of up to 35% when compared to simple on-demand fetching.
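A small sketch of the adaptive idea, assuming a shared oracle that records the chunk order observed by the fastest instances so the slower ones can prefetch along it (hypothetical names, not the actual mechanism):

```python
class AccessOracle:
    """Faster-booting instances publish the order in which they read image
    chunks; slower instances prefetch along that order ahead of their own
    reads (illustrative of the adaptive scheme, not the real protocol)."""

    def __init__(self):
        self.observed = []                 # global chunk order seen so far
        self.known = set()

    def report(self, chunk_id):
        """Called by whichever instance reads a chunk first."""
        if chunk_id not in self.known:
            self.known.add(chunk_id)
            self.observed.append(chunk_id)

    def prefetch_plan(self, already_read, lookahead=16):
        """What a slower instance should fetch in the background next."""
        pending = [c for c in self.observed if c not in already_read]
        return pending[:lookahead]

oracle = AccessOracle()
for c in [0, 1, 7, 8, 2]:                  # pattern seen by a fast instance
    oracle.report(c)
print(oracle.prefetch_plan(already_read={0, 1}))   # -> [7, 8, 2]
```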
The recent explosion in data sizes manipulated by distributed scientific applications has prompted the need to develop specialized storage systems capable of dealing with specific access patterns in a scalable fashion. In this context, a large class of applications focuses on parallel array processing: small parts of huge multi-dimensional arrays are concurrently accessed by a large number of clients, both for reading and writing. A specialized storage system that deals with such an access pattern faces several challenges at the level of data/metadata management. We introduce Pyramid, an active array-oriented storage system that addresses these challenges. Experimental evaluation demonstrates substantial scalability improvements brought by Pyramid with respect to state-of-the-art approaches in both weak and strong scaling scenarios, with gains of 100% to 150%.
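A tiny example of the array-partitioning problem such a system must solve: mapping a multi-dimensional subarray access onto the set of fixed-size chunks it touches (illustrative only; Pyramid's data/metadata scheme goes well beyond this):

```python
from itertools import product

def chunks_for_region(lo, hi, chunk_shape):
    """Which chunks of a multi-dimensional array a subarray access touches.
    lo/hi are inclusive/exclusive corner coordinates; chunks are the unit of
    distribution across storage nodes (illustrative of array partitioning)."""
    ranges = [range(l // c, (h - 1) // c + 1)
              for l, h, c in zip(lo, hi, chunk_shape)]
    return list(product(*ranges))

# A 100x100 read starting at (250, 30) on a 64x64 chunk grid touches 9 chunks
print(chunks_for_region((250, 30), (350, 130), (64, 64)))
```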
2nd IEEE International …, Jan 1, 2010
The cloud computing model aims to make large-scale data-intensive computing affordable even for users with limited financial resources, who cannot invest in the expensive infrastructures necessary to run such workloads. In this context, MapReduce is emerging as a highly scalable programming paradigm that enables high-throughput data-intensive processing as a cloud service. Its performance is highly dependent on the underlying storage service, which is responsible for efficiently supporting massively parallel data accesses by guaranteeing a high throughput under heavy access concurrency. Quality of service plays a crucial role here: the storage service needs to sustain a stable throughput for each individual access, in addition to achieving a high aggregated throughput under concurrency. In this paper we propose a technique to address this problem using component monitoring, application-side feedback, and behavior pattern analysis to automatically infer useful knowledge about the causes of poor quality of service and to provide an easy way to reason about potential improvements. We apply our proposal to BlobSeer, a representative data storage service specifically designed to achieve high aggregated throughputs, and show through extensive experimentation substantial improvements in the stability of individual data read accesses under MapReduce workloads.
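One concrete example of a per-access stability signal that such a monitoring and feedback loop could react to is the coefficient of variation of individual access throughputs (an illustrative metric, not the paper's exact analysis):

```python
from statistics import mean, pstdev

def throughput_stability(samples):
    """Coefficient of variation of individual access throughputs: lower means
    more stable per-access quality of service (illustrative metric only)."""
    mu = mean(samples)
    return pstdev(samples) / mu if mu else float("inf")

stable   = throughput_stability([98, 101, 100, 99])     # ~0.01 -> stable
unstable = throughput_stability([40, 160, 30, 170])     # ~0.65 -> unstable
print(stable, unstable)
```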