MTC Envelope: Defining the Capability of Large Scale Computers in the Context of Parallel Scripting Applications

Design and analysis of data management in scalable parallel scripting

2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012

We seek to enable efficient large-scale parallel execution of applications in which a shared filesystem abstraction is used to couple many tasks. Such parallel scripting (many-task computing, MTC) applications suffer poor performance and utilization on large parallel computers because of the volume of filesystem I/O and a lack of appropriate optimizations in the shared filesystem. Thus, we design and implement a scalable MTC data management system that uses aggregated compute node local storage for more efficient data movement strategies. We co-design the data management system with the data-aware scheduler to enable dataflow pattern identification and automatic optimization. The framework reduces the time to solution of parallel stages of an astronomy data analysis application, Montage, by 83.2% on 512 cores; decreases the time to solution of a seismology application, CyberShake, by 7.9% on 2,048 cores; and delivers BLAST performance better than mpiBLAST at various scales up to 32,768 cores, while preserving the flexibility of the original BLAST application.

File System Workload Analysis For Large Scale Scientific Computing Applications

Parallel scientific applications require high-performance I/O support from underlying file systems. A comprehensive understanding of the expected workload is therefore essential for the design of high-performance parallel file systems. We reexamine the workload characteristics in parallel computing environments in the light of recent technology advances and new applications. We analyze application traces from a cluster with hundreds of nodes. On average, each application has only one or two typical request sizes. Large requests from several hundred kilobytes to several megabytes are very common. Although in some applications, small requests account for more than 90% of all requests, almost all of the I/O data are transferred by large requests. All of these applications show bursty access patterns. More than 65% of write requests have inter-arrival times within one millisecond in most applications. By running the same benchmark on different file models, we also find that the write throughput of using an individual output file for each node exceeds that of using a shared file for all nodes by a factor of 5. This indicates that current file systems are not well optimized for file sharing.

In-Memory Runtime File Systems for Many-Task Computing

Lecture Notes in Computer Science, 2014

Many scientific computations can be expressed as Many-Task Computing (MTC) applications. In such scenarios, application processes communicate by means of intermediate files, in particular input, temporary data generated during job execution (stored in a runtime file system), and output. In data-intensive scenarios, the temporary data is generally much larger than input and output. In a 6x6 degree Montage mosaic [3], for example, the input, output and intermediate data sizes are 3.2GB, 10.9GB and 45.5GB, respectively [6]. Thus, speeding up I/O access to temporary data is key to achieving good overall performance. General-purpose, distributed or parallel file systems such as NFS, GPFS, or PVFS provide less than desirable performance for temporary data for two reasons. First, they are typically backed by physical disks or SSDs, limiting the achievable bandwidth and latency of the file system. Second, they provide POSIX semantics, which are both too costly and unnecessarily strict for temporary data of MTC applications that are written once and read several times. Tailoring a runtime file system to this pattern can lead to significant performance improvements. Memory-based runtime file systems promise better performance. For MTC applications, such file systems are co-designed with task schedulers, aiming at data locality [6]. Here, tasks are placed onto nodes that contain the required input files, while write operations go to the node's own memory. Analyzing the communication patterns of workflows like Montage [3], however, shows that, initially, files are created by a single task. In subsequent steps, tasks combine several files, and final results are based on global data aggregation. Aiming at data locality hence leads to two significant drawbacks: (1) Local-only write operations can lead to significant storage imbalance across nodes, while local-only read operations cause file replication onto all nodes that need them, which in the worst case might exceed the memory capacity of nodes performing global data reductions. (2) Because tasks typically read more than a single input file, locality-aware task placement is difficult to achieve in the first place. To overcome these drawbacks, we designed a distributed, in-memory runtime file system called MemFS that replaces data locality by uniformly spreading file stripes across all storage nodes. Due to its striping mechanism, MemFS leverages full network bisection bandwidth, maximizing I/O performance while avoiding storage imbalance problems.
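The idea of spreading file stripes uniformly across storage nodes can be illustrated with a minimal sketch. It assumes a hash-based placement rule; the function names, the choice of SHA-256, and the stripe size parameter are illustrative assumptions, not details taken from the MemFS abstract.

```python
import hashlib
from collections import Counter


def stripe_node(path: str, stripe_index: int, num_nodes: int) -> int:
    """Map one stripe of a file to a storage node (hypothetical rule).

    Hashing the (path, stripe index) pair makes placement deterministic
    and independent of which node wrote the data.
    """
    key = f"{path}:{stripe_index}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def place_file(path: str, file_size: int, stripe_size: int, num_nodes: int) -> list[int]:
    """Return the node index chosen for every stripe of the file."""
    num_stripes = -(-file_size // stripe_size)  # ceiling division
    return [stripe_node(path, i, num_nodes) for i in range(num_stripes)]


def stripe_histogram(placement: list[int]) -> Counter:
    """Count how many stripes landed on each node."""
    return Counter(placement)


if __name__ == "__main__":
    # Assumed example values: a 4 MiB file, 64 KiB stripes, 8 storage nodes.
    placement = place_file("/scratch/mosaic.fits", 4 * 1024 * 1024, 64 * 1024, 8)
    print(sorted(stripe_histogram(placement).items()))
```

Because no stripe is tied to the node that produced it, a large file created by a single task does not fill that node's memory, and readers pull stripes from many nodes at once.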

Tuning the performance of I/O intensive parallel applications

1996

Getting good I/O performance from parallel programs is a critical problem for many application domains. In this paper, we report our experience tuning the I/O performance of four application programs from the areas of satellite-data processing and linear algebra. After tuning, three of the four applications achieve application-level I/O rates of over 100 MB/s on 16 processors. The total volume of I/O required by the programs ranged from about 75 MB to over 200 GB. We report the lessons learned in achieving high I/O performance from these applications, including the need for code restructuring, local disks on every node and knowledge of future I/O requests. We also report our experience on achieving high performance on peer-to-peer configurations. Finally, we comment on the necessity of complex I/O interfaces like collective I/O and strided requests to achieve high performance.

MTC envelope

Many scientific applications can be efficiently expressed with the parallel scripting (many-task computing, MTC) paradigm. These applications are typically composed of several stages of computation, with tasks in different stages coupled by a shared file system abstraction. However, we often see poor performance when running these applications on large scale computers due to the applications' frequency and volume of filesystem I/O and the absence of appropriate optimizations in the context of parallel scripting applications.

A User-Friendly Approach for Tuning Parallel File Operations

SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, 2014

The Lustre file system provides high aggregated I/O bandwidth and is in widespread use throughout the HPC community. Here we report on work (1) developing a model for understanding collective parallel MPI write operations on Lustre, and (2) producing a library that optimizes parallel write performance in a user-friendly way. We note that a system's default stripe count is rarely a good choice for parallel I/O, and that performance depends on a delicate balance between the number of stripes and the actual (not requested) number of collective writers. Unfortunate combinations of these parameters may degrade performance considerably. For the programmer, however, it's all about the stripe count: an informed choice of this single parameter allows MPI to assign writers in a way that achieves near-optimal performance. We offer recommendations for those who wish to tune performance manually and describe the easy-to-use T3PIO library that manages the tuning automatically.

Lessons from characterizing the input/output behavior of parallel scientific applications

Performance Evaluation, 1998

Because both processor and interprocessor communication hardware is evolving rapidly with only moderate improvements to file system performance in parallel systems, it is becoming increasingly difficult to provide sufficient input/output (I/O) performance to parallel applications. I/O hardware and file system parallelism are the key to bridging this performance gap. A prerequisite to the development of efficient parallel file systems is detailed characterization of the I/O demands of parallel applications.

HPC global file system performance analysis using a scientific-application derived benchmark

Parallel Computing, 2009

With the exponential growth of high-fidelity sensor and simulated data, the scientific community is increasingly reliant on ultrascale HPC resources to handle its data analysis requirements. However, to use such extreme computing power effectively, the I/O components must be designed in a balanced fashion, as any architectural bottleneck will quickly render the platform intolerably inefficient. To understand I/O performance of data-intensive applications in realistic computational settings, we develop a lightweight, portable benchmark called MADbench2, which is derived directly from a large-scale Cosmic Microwave Background (CMB) data analysis package. Our study represents one of the most comprehensive I/O analyses of modern parallel file systems, examining a broad range of system architectures and configurations, including Lustre on the Cray XT3, XT4, and Intel Itanium2 clusters; GPFS on IBM Power5 and AMD Opteron platforms; a BlueGene/P installation using GPFS and PVFS2 file systems; and CXFS on the SGI Altix3700. We present extensive synchronous I/O performance data comparing a number of key parameters including concurrency, POSIX versus MPI-IO, and unique- versus shared-file accesses, using both the default environment and highly tuned I/O parameters. Finally, we explore the potential of asynchronous I/O and show that only two of the nine evaluated systems benefited from MPI-2's asynchronous MPI-IO. On those systems, experimental results indicate that the computational intensity required to hide I/O effectively is already close to the practical limit of BLAS3 calculations. Overall, our study quantifies vast differences in performance and functionality of parallel file systems across state-of-the-art platforms, showing I/O rates that vary up to 75x on the examined architectures, while providing system designers and computational scientists a lightweight tool for conducting further analysis.

Filesystem Aware Scalable I/O Framework for Data-Intensive Parallel Applications

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum, 2013

The growing speed gap between CPU and memory makes I/O the main bottleneck of many industrial applications. Some applications frequently perform I/O operations on very large volumes of data, which seriously harms performance. This work is motivated by geophysical applications used for oil and gas exploration. These applications process terabyte-sized datasets in HPC facilities [6]. The datasets represent subsurface models and field recorded data. In general terms, these applications read huge amounts of data as input and write huge amounts as intermediate and final results; the underlying algorithms implement seismic imaging techniques. Traditional sequential I/O, even when coupled with advanced storage systems, cannot complete all I/O operations on such large volumes of data in an acceptable time. Parallel I/O is the general strategy to solve such problems. However, because of the dynamic nature of many of these applications, each parallel process does not know the size of the data it needs to write until its computation is done, nor can it identify the position in the file at which to write. In order to write correctly and efficiently, communication and synchronization are required among all processes to fully exploit the parallel I/O paradigm. To tackle these issues, we use a dynamic load balancing framework that is general enough for most of these applications. To reduce the expensive synchronization and communication overhead, we introduce an I/O node that only handles I/O requests and lets compute nodes perform I/O operations in parallel. Using both POSIX I/O and memory-mapping interfaces, our experiments indicate that the approach is scalable. For instance, with 16 processes, the bandwidth of parallel reading can reach the theoretical peak performance (2.5 GB/s) of the storage infrastructure. Parallel writing achieves speedups of up to 4.68x (POSIX I/O) and 7.23x (memory mapping) over the serial I/O implementation. Since most geophysical applications are I/O bound, these results positively impact overall application performance and confirm the chosen strategy as the path to follow.
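The write-position problem described above, where each process learns its output size only after computing, can be illustrated with a minimal sketch. It assumes the sizes are gathered in rank order and turned into file offsets with an exclusive prefix sum (the operation that MPI exposes as MPI_Exscan). The function names are hypothetical, and the abstract does not state that the authors' framework computes offsets this way.

```python
from itertools import accumulate


def write_offsets(sizes: list[int]) -> list[int]:
    """Exclusive prefix sum: offset of rank i is the total size of ranks 0..i-1."""
    return list(accumulate(sizes, initial=0))[:-1]


def assemble(chunks: list[bytes]) -> bytes:
    """Simulate non-overlapping parallel writes into one shared file.

    Each chunk stands for the output of one process; every write targets
    a disjoint byte range, so the writes could proceed in any order.
    """
    sizes = [len(chunk) for chunk in chunks]
    offsets = write_offsets(sizes)
    shared_file = bytearray(sum(sizes))
    for offset, chunk in zip(offsets, chunks):
        shared_file[offset:offset + len(chunk)] = chunk
    return bytes(shared_file)


if __name__ == "__main__":
    # Assumed example: three processes producing 3, 0 and 5 bytes.
    print(write_offsets([3, 0, 5]))
    print(assemble([b"abc", b"", b"defgh"]))
```

The only synchronization needed is the exchange of sizes; once every process knows its offset, the writes themselves are independent.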