Distributed Storage System Research Papers

Elastic distributed storage systems have been increasingly studied in recent years because power consumption has become a major problem in data centers. Much progress has been made in improving the agility of resizing small- and large-scale distributed storage systems. However, most of these studies focus on metadata-based distributed storage systems. Meanwhile, emerging consistent-hashing-based distributed storage systems are considered to offer better scalability and are highly attractive. We identify challenges in achieving elasticity in consistent-hashing-based distributed storage that cannot be easily solved by the techniques used in current studies. In this paper, we propose an elastic consistent-hashing-based distributed storage system that addresses two problems. First, to allow the storage system to resize quickly, we modify the data placement algorithm using a primary-server design and achieve an equal-work data layout. Second, we propose a selective data reintegration technique to reduce the performance impact when resizing a cluster. Our experimental and trace analysis results confirm that our proposed elastic consistent hashing works effectively and provides significantly better elasticity.
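
As a rough illustration of the consistent-hashing placement this abstract builds on, the sketch below assigns each object to the first server clockwise on a hash ring (the primary) and to the next distinct servers as replicas. It is a minimal sketch: the server names, virtual-node count, and replica count are illustrative, and the paper's equal-work layout and selective reintegration are not reproduced here.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a string to a point on the hash ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy consistent-hashing ring with virtual nodes."""

    def __init__(self, servers, vnodes=64):
        # Each physical server owns `vnodes` points on the ring.
        self.ring = sorted((_hash(f"{s}#{v}"), s)
                           for s in servers for v in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def place(self, obj_key, replicas=3):
        """Return the primary server followed by the replica servers."""
        idx = bisect.bisect(self.points, _hash(obj_key)) % len(self.ring)
        chosen = []
        while len(chosen) < replicas:
            server = self.ring[idx][1]
            if server not in chosen:
                chosen.append(server)
            idx = (idx + 1) % len(self.ring)
        return chosen  # chosen[0] acts as the primary

ring = ConsistentHashRing([f"server-{i}" for i in range(8)])
print(ring.place("object-42"))
```

Because only the ring points owned by added or removed servers change, resizing moves a bounded fraction of the data, which is the property an elastic design can exploit.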

The ability to accommodate all types of distributed storage options and renewable energy sources is one of the main characteristics of the smart grid. The smart grid integrates advanced sensing technologies, control methodologies and communication technologies into current power distribution systems to deliver electricity to customers more effectively. Infrastructure for the implementation and utilization of renewable energy sources requires distributed storage systems with high power density and high energy density. Currently, some research investigates energy management and dynamic control of distributed storage systems to offer not only high power density and high energy density storage but also high efficiency and long system life. In this paper, an intelligent energy management system is proposed to meet the short-term requirements of distributed storage systems in the smart grid. The energy management of a distributed storage system is formulated as a nonlinear mixed-integer optimization problem. A hybrid algorithm combining an evolutionary algorithm with linear programming was developed to solve the problem. Simulation results show the potential of the proposed algorithm for solving the problem.
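
The abstract does not spell out the formulation, but the hybrid structure it describes, an evolutionary search over discrete commitment decisions with a linear program dispatching the committed units, can be sketched as below. The objective, constraints, and all numeric parameters are assumptions made purely for illustration and are not taken from the paper.

```python
import random
from scipy.optimize import linprog

# Toy data (assumed): per-unit discharge cost, power limit, one demand interval.
COST = [0.20, 0.35, 0.25, 0.40]   # cost per kWh discharged from each unit
P_MAX = [30.0, 50.0, 40.0, 60.0]  # maximum discharge power of each unit
DEMAND = 90.0                     # power that must be supplied in this interval
FIXED = 5.0                       # fixed cost of keeping a unit online

def dispatch_cost(active):
    """Inner LP: dispatch the committed units at minimum cost to meet demand."""
    bounds = [(0.0, P_MAX[i] if active[i] else 0.0) for i in range(len(COST))]
    # minimize sum(c_i * p_i)  subject to  sum(p_i) >= DEMAND
    res = linprog(COST, A_ub=[[-1.0] * len(COST)], b_ub=[-DEMAND],
                  bounds=bounds, method="highs")
    if not res.success:
        return float("inf")       # infeasible commitment: penalize heavily
    return res.fun + FIXED * sum(active)

def evolve(pop_size=20, generations=50):
    """Outer evolutionary loop over the binary commitment decisions."""
    n = len(COST)
    pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=dispatch_cost)
        survivors = pop[: pop_size // 2]
        children = [[bit ^ (random.random() < 0.1) for bit in random.choice(survivors)]
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    best = min(pop, key=dispatch_cost)
    return best, dispatch_cost(best)

print(evolve())
```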

Highly available cloud storage is often implemented with complex, multi-tiered distributed systems built on top of clusters of commodity servers and disk drives. Sophisticated management, load balancing and recovery techniques are needed to achieve high performance and availability amidst an abundance of failure sources that include software, hardware, network connectivity, and power issues. While there is a relative wealth of failure studies of individual components of storage systems, such as disk drives, ...

Urbanisation of a watershed, with its associated impact on the quantity and quality of stormwater runoff, has resulted in the implementation of a number of alternatives for stormwater management. The early stage of stormwater management initiatives in Malaysia concentrated on minimizing downstream flooding caused by urbanisation. The objective of the project is to implement the source control ...

We consider the problem of private information retrieval (PIR) over a distributed storage system. The storage system consists of N non-colluding databases, each storing an MDS-coded version of M messages. In the PIR problem, the user wishes to retrieve one of the available messages without revealing the message identity to any individual database. We derive the information-theoretic capacity of this problem, which is defined as the maximum number of bits of the desired message that can be privately retrieved per one bit of downloaded information. We show that the PIR capacity in this case is $C = \left(1 + \frac{K}{N} + \frac{K^2}{N^2} + \cdots + \frac{K^{M-1}}{N^{M-1}}\right)^{-1} = \left(1 + R_c + R_c^2 + \cdots + R_c^{M-1}\right)^{-1} = \frac{1 - R_c}{1 - R_c^M}$, where $R_c = K/N$ is the rate of the (N, K) code used. The capacity is a function of the code rate and the number of messages only, regardless of the explicit structure of the storage code. The result implies a fundamental tradeoff between the optimal retrieval cost and the storage cost. The result generalizes the achievability and converse results for classical PIR with replicated databases to the case of coded databases.
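
As a quick sanity check of the stated expression (an illustrative case, not a result quoted from the paper): for M = 2 messages stored with an (N, K) = (2, 1) replication code, $R_c = 1/2$, so $C = \frac{1 - 1/2}{1 - (1/2)^2} = \frac{1/2}{3/4} = \frac{2}{3}$, which agrees with the known capacity of classical PIR from N = 2 replicated databases holding M = 2 messages.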

In the face of global change, which is characterized by growing water demands and increasingly variable water supplies, the equitable sharing of water and the drought proofing of rural livelihoods will require an increasing physical capacity to store water. This is especially true for the semiarid and dry subhumid regions of sub-Saharan Africa and Asia. This paper addresses the following question: What criteria should policymakers apply in choosing between centralized storage capacity in the form of conventional large reservoirs and large interbasin water transfer schemes and decentralized and distributed storage systems in the farmers' fields and in microwatersheds and villages (tanks, microdams, and aquifers)? This exploratory paper uses an interdisciplinary framework encompassing the natural and social sciences to develop four indicators that are considered critical for understanding the biochemical, physical, economic, and sociopolitical dimensions of the scale issues underlying the research question. These are the residence time of water in a reservoir, the water provision capacity, the cost effectiveness of providing reliable access to water per beneficiary, and the equity dimension: maximizing the number of beneficiaries and compensating the losers. The procedural governance challenges associated with each indicator are dealt with separately. It is concluded that water storage and the institutional capacity to effectively administer it are recursively linked. This implies that if the scale of new storage projects gradually increases, a society can progressively learn and adapt to the increasing institutional complexity.

Performance monitoring in most distributed systems provides minimal guidance for tuning, problem diagnosis, and decision making. Stardust is a monitoring infrastructure that replaces traditional performance counters with end-to-end traces of requests and allows ...

Nowadays, distributed storage is adopted to alleviate congestion in delay-tolerant networking (DTN), but reliable transmission during the congestion period remains an issue. In this paper, we propose a multi-custodians distributed storage (MCDS) framework that includes a set of algorithms to determine when appropriate bundles should be migrated to (or retrieved from) suitable custodians, so that DTN congestion and transmission reliability can be addressed simultaneously. MCDS adopts multiple custodians to temporarily store duplicates of a migrated bundle in order to relieve DTN congestion, and therefore has more opportunities to retrieve migrated bundles once network congestion is mitigated. Two performance metrics are used to evaluate the simulation results: the goodput ratio (GR), which represents the QoS of data transmission, and the retrieved loss ratio (RLR), which reflects the reliability of transmission. We also use another distributed storage mechanism, based on a single custodian (SCDS), as a baseline for MCDS. Simulation results show that MCDS achieves better GR and RLR in almost all simulation cases. Across the various scenarios, the GR and RLR of MCDS are in the ranges of 10.6%-18.4% and 23.2%-36.8%, respectively, which are higher than those of SCDS.

With the explosion of data in applications all around us, erasure-coded storage has emerged as an attractive alternative to replication because, even with significantly lower storage overhead, it provides better reliability against data loss. Reed-Solomon codes are the most widely used erasure codes because they provide maximum reliability for a given storage overhead and are flexible in the choice of the coding parameters that determine the achievable reliability. However, reconstruction time for unavailable data becomes prohibitively long, mainly because of network bottlenecks. Some proposed solutions either use additional storage or limit the coding parameters that can be used. In this paper, we propose a novel distributed reconstruction technique, called Partial Parallel Repair (PPR), which divides the reconstruction operation into small partial operations and schedules them on multiple nodes already involved in the data reconstruction. A distributed protocol then progressively combines these partial results to reconstruct the unavailable data blocks, which reduces the network pressure. Theoretically, our technique can complete the network transfer in ⌈log2(k + 1)⌉ time, compared to the k time needed for a (k, m) Reed-Solomon code. Our experiments show that PPR reduces repair time and degraded-read time significantly. Moreover, our technique is compatible with existing erasure codes and does not require any additional storage overhead. We demonstrate this by overlaying PPR on top of two prior schemes, Local Reconstruction Code and Rotated Reed-Solomon code, to gain additional savings in reconstruction time.
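
A rough sketch of the tree-style combining described above is shown below, using XOR as a stand-in for the Galois-field linear combinations of a real (k, m) Reed-Solomon repair; the helper data is illustrative. Merging k locally computed partial results pairwise takes ⌈log2(k)⌉ rounds, and counting the final hop to the requesting node gives the ⌈log2(k + 1)⌉ bound quoted above, versus k sequential transfers to a single repair node.

```python
import math

def ppr_combine(partials):
    """Merge per-helper partial repair results pairwise, round by round.

    Each element of `partials` is the contribution one helper has already
    computed locally (here just bytes to XOR). In every round, half of the
    remaining holders send their partial result to a peer, which folds it in.
    """
    rounds = 0
    while len(partials) > 1:
        merged = [bytes(x ^ y for x, y in zip(partials[i], partials[i + 1]))
                  for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2 == 1:     # the odd holder waits for the next round
            merged.append(partials[-1])
        partials = merged
        rounds += 1
    return partials[0], rounds

# Toy data: contributions from k = 4 helpers, 8 bytes each.
helpers = [bytes([i] * 8) for i in range(1, 5)]
block, rounds = ppr_combine(helpers)
print(rounds, math.ceil(math.log2(len(helpers))))  # both print 2
```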

We address the problem of pollution attacks in coding-based distributed storage systems proposed for wireless sensor networks. In a pollution attack, the adversary maliciously alters some of the stored encoded packets, which results in the incorrect decoding of a large part of the original data upon retrieval. We propose algorithms to detect and recover from such attacks. In contrast to existing approaches to this problem, our approach is not based on adding cryptographic checksums or signatures to the encoded packets. We believe that our proposed algorithms are suitable for practical systems.

In a distributed storage system, client caches managed on the basis of small-granularity objects can provide better memory utilization than page-based caches. However, object servers, unlike page servers, must perform additional disk reads. These installation reads are required to ...

The energy costs of running computer systems are a growing concern: for large data centers, recent estimates put these costs higher than the cost of the hardware itself. As a consequence, energy efficiency has become a pervasive theme for designing, deploying, and operating computer systems. This paper evaluates the energy trade-offs brought by data deduplication in distributed storage systems. Depending on the workload, deduplication can lower the storage footprint, reduce the I/O pressure on the storage system, and reduce network traffic, at the cost of increased computational overhead. From an energy perspective, data deduplication enables a trade-off between the energy consumed for additional computation and the energy saved by lower storage and network load. The main point our experiments and model bring home is the following: while for non-energy-proportional machines performance-centric and energy-centric optimizations have break-even points that are relatively close, for the newer generation of energy-proportional machines the break-even points are significantly different. An important consequence of this difference is that, with newer systems, there are higher energy inefficiencies when the system is optimized for performance.
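
The trade-off described here can be made concrete with a back-of-the-envelope model: deduplication pays a per-byte compute (hashing) cost on all data but only pays storage and network costs for the unique bytes. All constants in the sketch below are hypothetical placeholders, not measurements from the paper.

```python
def dedup_energy(total_bytes, dedup_ratio, e_cpu, e_io, e_net):
    """Energy (J) to ingest data with chunk-hash deduplication.

    dedup_ratio is the fraction of bytes eliminated as duplicates; every
    byte is hashed, but only unique bytes are stored and transferred.
    """
    unique = total_bytes * (1.0 - dedup_ratio)
    return total_bytes * e_cpu + unique * (e_io + e_net)

def plain_energy(total_bytes, e_io, e_net):
    """Energy (J) to ingest the same data without deduplication."""
    return total_bytes * (e_io + e_net)

# Hypothetical per-byte energy costs (J/byte), chosen only to show the
# break-even behaviour.
E_CPU, E_IO, E_NET = 2e-9, 10e-9, 15e-9
DATA = 1 << 30  # 1 GiB

for ratio in (0.0, 0.05, 0.08, 0.20, 0.50):
    saving = plain_energy(DATA, E_IO, E_NET) - dedup_energy(DATA, ratio, E_CPU, E_IO, E_NET)
    print(f"dedup ratio {ratio:4.2f}: net energy saving {saving:+.3f} J")
```

In this toy model the break-even point is the deduplication ratio at which the hashing energy equals the storage and network energy saved (0.08 with the placeholder constants); how energy-proportional the hardware is shifts these per-byte constants and therefore moves the break-even point, which is the effect the abstract highlights.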

The data generated by scientific simulations and experimental facilities is beginning to revolutionize the infrastructure support needed by these applications. The on-demand aspect and flexibility of cloud computing environments make them an attractive platform for data-intensive scientific applications. However, cloud computing poses unique challenges for these applications. For example, cloud computing environments are heterogeneous, dynamic, and non-persistent, which can make reproducibility a challenge. The volume, velocity, variety, veracity, and value of data, combined with the characteristics of cloud environments, make it important to track the state of execution data and the application's entire lifetime information in order to understand and ensure reproducibility. This paper proposes and implements a state management system (FRIEDA-State) for high-throughput and data-intensive scientific applications running in cloud environments. Our design addresses the challenges of state management in cloud environments and offers various configurations. Our implementation is built on top of FRIEDA (Flexible Robust Intelligent Elastic Data Management), a data management and execution framework for cloud environments. Our experimental results on two cloud test beds (FutureGrid and Amazon) show that the proposed solution has minimal overhead (1.2 ms/operation at a scale of 64 virtual machines) and is suitable for state management in cloud environments.

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this article, we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
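
The data model described here is essentially a sparse, sorted, multidimensional map indexed by row key, column key, and timestamp. The toy in-memory analogue below is purely illustrative (it is not Bigtable's API); the row and column names echo the webtable example used in the article.

```python
from collections import defaultdict

class ToyTable:
    """In-memory analogue of the (row, column, timestamp) -> value map.

    Real Bigtable shards rows into tablets across many servers and groups
    columns into column families; this toy keeps everything in one process
    purely to illustrate the data model.
    """

    def __init__(self):
        # row key -> column key -> list of (timestamp, value), newest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, timestamp, value):
        cells = self._rows[row][column]
        cells.append((timestamp, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)

    def get(self, row, column, num_versions=1):
        """Return up to num_versions of the newest cells for (row, column)."""
        return self._rows[row][column][:num_versions]

table = ToyTable()
table.put("com.cnn.www", "contents:", 3, "<html>...</html>")
table.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
print(table.get("com.cnn.www", "contents:"))
```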

As peer-to-peer and widely distributed storage systems proliferate, the need to perform efficient erasure coding, instead of replication, is crucial to performance and efficiency. Low-Density Parity-Check (LDPC) codes have arisen as alternatives to standard erasure codes, such as Reed-Solomon codes, trading off vastly improved decoding performance for inefficiencies in the amount of data that must be acquired to perform decoding. ...

The paper deals with the optimal sizing and allocation of dispersed generation, distributed storage systems and capacitor banks. The optimization aims at minimizing the sum of the costs sustained by the distributor for power losses, network upgrading, and the reactive power service, plus the costs of storage and capacitor installation, over a planning period of several years. A hybrid procedure based on a genetic algorithm and a sequential quadratic programming-based algorithm was used. A numerical application on an 18-busbar, MV, balanced three-phase network was performed in order to show the feasibility of the proposed procedure. Distributed energy storage systems (DESSs) can be used to reduce the variability of some DG sources, to counter the voltage-rise effect, or to improve power quality in distribution networks. Moreover, optimal control of DESSs allows the operators of electrical distribution systems to improve reactive power control and, as a consequence, to reduce the overall costs (3-6). Considering an...

Modern desktop grid environments and shared computing platforms have popularized the use of contributory resources, such as desktop computers, as computing substrates for a variety of applications. However, addressing the exponentially growing storage demands of applications, especially in a contributory environment, remains a challenging research problem. In this paper, we propose a transparent distributed storage system that harnesses the storage contributed by desktop grid participants arranged in a ...