A Distributed and Flexible Platform for Large-Scale Data Storage in HPC Systems (original) (raw)
Related papers
CHAIO: Enabling HPC Applications on Data-Intensive File Systems
The computing paradigm of "HPC in the Cloud" has gained a surging interest in recent years, due to its merits of cost-efficiency, flexibility, and scalability. Cloud is designed on top of distributed file systems such as Google file system (GFS). The capability of running HPC applications on top of data-intensive file systems is a critical catalyst in promoting Clouds for HPC. However, the semantic gap between data-intensive file systems and HPC imposes numerous challenges. For example, N-1 (N to 1) is a widely used data access pattern for HPC applications such as checkpointing, but cannot perform well on data-intensive file systems.
Object Storage: Scalable Bandwidth for HPC Clusters
2004
This paper describes the Object Storage Architecture solution for cost-effective, high bandwidth storage in High Performance Computing (HPC) environments. An HPC environment requires a storage system to scale to very large sizes and performance without sacrificing cost-effectiveness nor ease of sharing and managing data. Traditional storage solutions, including disk-per-node, Storage-Area Networks (SAN), and Network-Attached Storage (NAS) implementations, fail to find a balance between performance, ease of use, and cost as the storage system scales up. In contrast, building storage systems as specialized storage clusters using commodity-off-the-shelf (COTS) components promise excellent price-performance at scale provided that binding them into a single system image and linking them to HPC compute clusters can be done without introducing bottlenecks or management complexities. While a file interface (typified by NAS systems) at each storage cluster component is too high-level to provide scalable bandwidth and simple management across large numbers of components, and a block interface (typified by SAN systems) is too low-level to avoid synchronization bottlenecks in a shared storage cluster, an object interface (typified by the inode layer of traditional file system implementations) is at the intermediate level needed for independent, highly parallel operation at each storage cluster component under centralized, but infrequently applied, control. The Object Storage Device (OSD) interface achieves this independence by storing an unordered collection of named variable-length byte arrays, called objects, and embedding extendable attributes, fine-grain capability-based access control, and encapsulated data layout and allocation into each object. With this higher-level interface, object storage clusters are capable of highly parallel data transfers between storage and compute cluster node under the infrequently applied control of the out-of-band metadata managers. Object Storage Architectures support single-system-image file systems with the traditional sharing and management features of NAS systems and the resource consolidation and scalable performance of SAN systems.
GekkoFS - A Temporary Distributed File System for HPC Applications
2018 IEEE International Conference on Cluster Computing (CLUSTER), 2018
We present GekkoFS, a temporary, highly-scalable burst buffer file system which has been specifically optimized for new access patterns of data-intensive High-Performance Computing (HPC) applications. The file system provides relaxed POSIX semantics, only offering features which are actually required by most (not all) applications. It is able to provide scalable I/O performance and reaches millions of metadata operations already for a small number of nodes, significantly outperforming the capabilities of general-purpose parallel file systems.
Supporting Large Scale Data-Intensive Computing with the FusionFS Distributed File System
State-of-the-art yet decades-old architecture of HPC storage systems has segregated compute and storage resources, bringing unprecedented inefficiencies and bottlenecks at petascale levels and beyond. This paper presents FusionFS, a new distributed file system designed from the ground up for high scalability (16K nodes) while achieving significantly higher I/O performance (2.5TB/sec). FusionFS achieves these levels of scalability and performance through complete decentralization, and the co-location of storage and compute resources. It supports POSIX-like interfaces important for ease of adoption and backwards compatibility with legacy applications. It is made reliable through data replication, and it supports both strong and weak consistency semantics. Furthermore, it supports scalable data provenance capture and querying, a much needed feature in large scale scientific computing systems towards achieving reproducible and verifiable experiments.
Future Generation Computer Systems, 2018
HPC and Big Data stacks are completely separated today. The storage layer offers opportunities for convergence, as the challenges associated with HPC and Big Data storage are similar: trading versatility for performance. This motivates a global move towards dropping file-based, POSIX-IO compliance systems. However, on HPC platforms this is made difficult by the centralized storage architecture using file-based storage. In this paper we advocate that the growing trend of equipping HPC compute nodes with local storage redistributes the cards by enabling object storage to be deployed alongside the application on the compute nodes. Such integration of application and storage not only allows fine-grained configuration of the storage system, but also improves application portability across platforms. In addition, the single-user nature of such application-specific storage obviates the need for resource-consuming storage features like permissions or file hierarchies offered by traditional file systems. In this article we propose and evaluate Blobs (Binary Large Objects) as an alternative to distributed file systems. We factually demonstrate that it offers drop-in compatibility with a variety of existing applications while improving storage throughput by up to 28%.
DiSK: A distributed shared disk cache for HPC environments
2009
Data movement within high performance environments can be a large bottleneck to the overall performance of programs. With the addition of continuous storage and usage of older data, the back end storage is becoming a larger problem than the improving network and computational nodes. This has led us to develop a Distributed Shared Disk Cache, DiSK, to reduce the dependence on these back end storage systems. With DiSK requested files will be distributed across nodes in order to reduce the amount of requests directed to the archives. DiSK has two key components. One is a Distributed Metadata Management, DIMM, scheme that allows a centralized manager to access what data is available in the system. This is accomplished through the use of a counterbased bloomfilter with locality checks in order to reduce false positives and false negatives. The second component is a method of replication called Differentiable Replication, DiR. The novelty of DiR is that the requirements of the files and capabilities of underlying nodes are taken into consideration for replication. This allows for a varying degree of replication depending on the file. This customization of DiSK yields better performance than the conventional archive system.
Scalable Storage for Data-Intensive Computing
Handbook of Data Intensive Computing, 2011
Cloud computing applications require a scalable, elastic and fault tolerant storage system. We survey how storage systems have evolved from the traditional distributed filesystems, peer-to-peer storage systems and how these ideas have been synthesized in current cloud computing storage systems. Then, we describe how metadata management can be improved for a file system built to support large scale data-intensive applications. We implement Ring File System (RFS), that uses a single hop Distributed Hash Table, to manage file metadata and a traditional client-server model for managing the actual data. Our solution does not have a single point of failure, since the metadata is replicated. The number of files that can be stored and the throughput of metadata operations scales linearly with the number of servers. We compare our solution against two open source implementations of the Google File System (GFS): HDFS and KFS and show that it performs better in terms of fault tolerance, scalability and throughput. We envision that similar ideas from peer-to-peer systems can be applied to other large scale cloud computing storage systems.
An Effective Storage Mechanism for High Performance Computing (HPC)
International Journal of Advanced Computer Science and Applications, 2015
All over the process of treating data on HPC Systems, parallel file systems play a significant role. With more and more applications, the need for high performance Input-Output is rising. Different possibilities exist: General Parallel File System, cluster file systems and virtual parallel file system (PVFS) are the most important ones. However, these parallel file systems use pattern and model access less effective such as POSIX semantics (A family of technical standards emerged from a project to standardize programming interfaces software designed to operate on variant UNIX operating system.), which forces the MPI-IO implementations to use inefficient techniques based on locks. To avoid this synchronization in these techniques, we ensure that the use of a versioning-based file system is much more effective.
Scalable I/O forwarding framework for high-performance computing systems
2009
Current leadership-class machines suffer from a significant imbalance between their computational power and their I/O bandwidth. While Moore's law ensures that the computational power of high-performance computing systems increases with every generation, the same is not true for their I/O subsystems. The scalability challenges faced by existing parallel file systems with respect to the increasing number of clients, coupled with the minimalistic compute node kernels running on these machines, call for a new I/O paradigm to meet the requirements of data-intensive scientific applications. I/O forwarding is a technique that attempts to bridge the increasing performance and scalability gap between the compute and I/O components of leadership-class machines by shipping I/O calls from compute nodes to dedicated I/O nodes. The I/O nodes perform operations on behalf of the compute nodes and can reduce file system traffic by aggregating, rescheduling, and caching I/O requests. This paper presents an open, scalable I/O forwarding framework for high-performance computing systems. We describe an I/O protocol and API for shipping function calls from compute nodes to I/O nodes, and we present a quantitative analysis of the overhead associated with I/O forwarding.
GekkoFS — A Temporary Burst Buffer File System for HPC Applications
Journal of Computer Science and Technology, 2020
Many scientific fields increasingly use High-Performance Computing (HPC) to process and analyze massive amounts of experimental data while storage systems in today's HPC environments have to cope with new access patterns. These patterns include many metadata operations, small I/O requests, or randomized file I/O, while general-purpose parallel file systems have been optimized for sequential shared access to large files. Burst buffer file systems create a separate file system that applications can use to store temporary data. They aggregate node-local storage available within the compute nodes or use dedicated SSD clusters and offer a peak bandwidth higher than that of the backend parallel file system without interfering with it. However, burst buffer file systems typically offer many features that a scientific application, running in isolation for a limited amount of time, does not require. We present GekkoFS, a temporary, highly-scalable file system which has been specifically optimized for the aforementioned use cases. GekkoFS provides relaxed POSIX semantics which only offers features which are actually required by most (not all) applications. GekkoFS is, therefore, able to provide scalable I/O performance and reaches millions of metadata operations already for a small number of nodes, significantly outperforming the capabilities of common parallel file systems.