Subhash Saini - Academia.edu
Papers by Subhash Saini
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1998
With the resurgence of distributed shared memory (DSM) systems based on cache-coherent Non-Uniform Memory Access (ccNUMA) architectures and the increasing disparity between memory and processor speeds, data locality overheads are becoming the greatest bottleneck in the way of realizing the potential high performance of these systems. While parallelization tools and compilers help users port their sequential applications to a DSM system, a great deal of time and effort is needed to tune the memory performance of these applications to achieve reasonable speedup. In this paper, we show that integrating cache performance modeling and tuning support within a parallelization environment can alleviate this problem. The Cache Performance Modeling and Prediction Tool (CPMP) employs trace-driven simulation techniques without the overhead of generating and managing detailed address traces. CPMP predicts the cache performance impact of source-code-level "what-if" modifications in a program to assist the user in the tuning process. CPMP is built on top of a customized version of the Computer Aided Parallelization Tools (CAPTools) environment. Finally, we demonstrate how CPMP can be applied to tune a real Computational Fluid Dynamics (CFD) application. [...] time prohibits a user from applying modeling-based approaches to tune any real code; nevertheless, trace-driven simulation approaches are considered reliable and accurate under realistic conditions [14]. Table I. Various memory subsystem performance analysis techniques, their goals, and limitations with respect to application cache performance tuning.
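To make the trace-driven idea concrete, the sketch below is ours, not CPMP itself: a tiny direct-mapped cache simulator that generates addresses on the fly from a base virtual address and loop bounds, rather than replaying a stored address trace, and scores a row-major versus column-major "what-if". The cache geometry, base address, and kernel are all assumptions for illustration.

```c
/* Minimal sketch (not the actual CPMP tool): a direct-mapped cache
 * simulator fed by addresses generated on the fly from loop bounds
 * and a base address, so no address trace is ever stored. */
#include <stdio.h>
#include <string.h>

#define LINE_SIZE 32                     /* bytes per cache line (assumed) */
#define NUM_LINES 1024                   /* 32 KB direct-mapped cache (assumed) */

static long tags[NUM_LINES];
static long hits, misses;

static void access_addr(unsigned long addr)
{
    unsigned long line = addr / LINE_SIZE;
    unsigned long set  = line % NUM_LINES;
    if (tags[set] == (long)line) hits++;
    else { tags[set] = (long)line; misses++; }
}

int main(void)
{
    enum { N = 256 };
    unsigned long base_a = 0x10000000UL; /* hypothetical base address of a[N][N] */

    /* "What-if": row-major traversal of a[N][N] of doubles */
    memset(tags, -1, sizeof tags);
    hits = misses = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            access_addr(base_a + 8UL * (i * N + j));
    printf("row-major:    hits=%ld misses=%ld\n", hits, misses);

    /* "What-if": column-major traversal of the same array */
    memset(tags, -1, sizeof tags);
    hits = misses = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            access_addr(base_a + 8UL * (i * N + j));
    printf("column-major: hits=%ld misses=%ld\n", hits, misses);
    return 0;
}
```

Comparing the two miss counts answers the "what-if" without ever modifying or rerunning the application, which is the kind of turnaround a tuning tool needs.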
Parallel and Distributed Computing Systems (ISCA), 1998
Scientists at NASA Ames Research Center have been developing computational aeroscience applications on highly parallel architectures over the past ten years. During that same period, a steady transition of hardware and system software also occurred, forcing us to expend great effort migrating and re-coding our applications. As applications and machine architectures become increasingly complex, the cost and time required for this process will become prohibitive. In this paper, we present the first set of results in our evaluation of interactive parallelization tools. In particular, we evaluate CAPTools' ability to parallelize computational aeroscience applications. CAPTools was tested on serial versions of the NAS Parallel Benchmarks and ARC3D, a computational fluid dynamics application, on two platforms: the SGI Origin 2000 and the Cray T3E. The evaluation covers performance, the amount of user interaction required, limitations, and portability. Based on these results, a discussion of the feasibility of computer-aided parallelization of aerospace applications is presented along with suggestions for future work.

High performance computers have evolved rapidly over the past decade. Although new advances in architecture have increased overall performance, they have also created limitations on programs' portability. Hand tuning applications for each new machine achieves the best performance, but at a high cost in time and effort. At NASA Ames Research Center, high performance computing hardware is constantly updated to keep pace with new technology, which translates to an average machine life span of three years. In the past, our scientists expended a very large effort on every new machine in an attempt to fully utilize its performance potential. Currently, NASA is also working on an Information Power Grid initiative to produce a computational grid that will work in concert with the computational grids being assembled at PACI [15, 16]. In anticipation of a widely distributed and heterogeneous computing environment, together with the increasing complexity of future applications, we may not be able to afford to continue our porting efforts every three years. To protect investments in code maintenance and development, the parallelization process needs to require less time and effort.

1.2 A Spectrum of Parallel Programming Alternatives. Generally speaking, four major approaches have been used to mount applications on parallel architectures: 1. parallelization by hand; 2. using semi-custom building blocks (PETSc [4], NHSE software [14]); 3. data parallel languages and parallelizing compilers (HPF [8], FORTRAN-D [1], Vienna FORTRAN [5], ...
IEEE International Conference on High Performance Computing, Data, and Analytics, 1998
Workload characterization is used for modeling and evaluating computing systems at different levels of detail. We present a workload characterization for a class of Computational Fluid Dynamics (CFD) applications that solve Partial Differential Equations (PDEs). The characterization focuses on three high performance computing platforms: the SGI Origin2000, the IBM SP-2, and a cluster of Intel Pentium Pro based PCs. We execute extensive measurement-based experiments on these platforms to gather statistics of system resource usage, which lead to a quantitative workload characterization. Our approach yields coarse-grain resource utilization behavior that is being applied to performance modeling and evaluation of distributed high performance metacomputing systems. In addition, this study enhances our understanding of the interactions between PDE solver workloads and high performance computing platforms and is useful for tuning applications belonging to this class.
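The sort of coarse-grain resource statistics such a characterization collects can be gathered with standard operating-system accounting. The sketch below is an assumed measurement harness, not the paper's; the stencil kernel merely stands in for a PDE solver.

```c
/* Illustrative sketch (assumed harness, not the paper's): sample
 * coarse-grain resource usage of a PDE-solver-like kernel with
 * getrusage(), the kind of statistic a workload characterization uses. */
#include <stdio.h>
#include <sys/resource.h>

static double grid[512][512];

static void stencil_sweep(void)          /* stand-in for a PDE solver kernel */
{
    for (int i = 1; i < 511; i++)
        for (int j = 1; j < 511; j++)
            grid[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                 grid[i][j-1] + grid[i][j+1]);
}

int main(void)
{
    for (int iter = 0; iter < 100; iter++)
        stencil_sweep();

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("user time           : %ld.%06ld s\n",
           (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
    printf("max resident set    : %ld kB\n", ru.ru_maxrss);  /* kB on Linux */
    printf("major page faults   : %ld\n", ru.ru_majflt);
    return 0;
}
```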
Program traces are used for analysis of program performance, memory utilization, and communication, as well as for program debugging. A trace contains records of execution events generated by monitoring units inserted into the program. The trace size limits the resolution of execution events and restricts the user's ability to analyze the program.
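The trade-off between event resolution and trace size follows directly from the record format: with fixed-size records, the trace grows linearly in the number of monitored events. The record layout and event codes below are illustrative assumptions, not the paper's format.

```c
/* Sketch of the idea (record layout and event codes are assumed):
 * monitoring units append fixed-size event records, so trace size
 * scales linearly with the number of events captured. */
#include <stdio.h>
#include <time.h>

struct trace_record {
    double       timestamp;    /* seconds since program start */
    unsigned int event_id;     /* e.g. ENTER/EXIT of an instrumented region */
    unsigned int location;     /* code identifying the monitoring unit */
};

enum { EV_ENTER = 1, EV_EXIT = 2 };

static FILE *trace;
static struct timespec t0;

static void emit(unsigned int ev, unsigned int loc)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    struct trace_record r = {
        (t.tv_sec - t0.tv_sec) + 1e-9 * (t.tv_nsec - t0.tv_nsec), ev, loc };
    fwrite(&r, sizeof r, 1, trace);
}

int main(void)
{
    clock_gettime(CLOCK_MONOTONIC, &t0);
    trace = fopen("events.trc", "wb");
    if (!trace) return 1;
    for (int i = 0; i < 1000; i++) {  /* 1000 monitored iterations produce */
        emit(EV_ENTER, 42);           /* 2000 records x 16 B = 32,000 bytes */
        /* ... work being traced ... */
        emit(EV_EXIT, 42);
    }
    fclose(trace);
    return 0;
}
```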
We present a cache performance modeling methodology that facilitates the tuning of uniprocessor cache performance for applications executing on shared memory multiprocessors by accurately predicting the effects of source code level modifications. Measurements on a single processor are initially used to identify parts of the code where cache utilization improvements may significantly impact overall performance. Cache simulation based on trace-driven techniques can then be carried out without gathering detailed address traces. The minimal runtime information needed for modeling the cache performance of a selected code block includes the base virtual addresses of arrays, the virtual addresses of variables, and the loop bounds for that code block; the rest of the information is obtained from the source code. We show that the cache performance predictions are as reliable as those obtained through trace-driven simulations. This technique is particularly helpful for exploring various "what-if" scenarios regarding the cache performance impact of alternative code structures. We explain and validate this methodology using a simple matrix-matrix multiplication program. We then apply it to predict and tune the cache performance of two realistic scientific applications taken from the Computational Fluid Dynamics (CFD) domain.
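The matrix-matrix multiplication validation case suggests the flavor of such a "what-if". The kernel below is our own rendering, not code from the paper: interchanging the two inner loops changes the access pattern of b[][] from column order to row order, a structural alternative a cache model can score from loop bounds and base addresses alone, without rerunning the program.

```c
/* A minimal example of the kind of "what-if" the methodology evaluates
 * (our kernel, not reproduced from the paper). */
enum { N = 512 };
static double a[N][N], b[N][N], c[N][N];

void matmul_ijk(void)             /* baseline: b[][] walked by column */
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}

void matmul_ikj(void)             /* "what-if": all arrays walked by row */
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];
}

int main(void)
{
    matmul_ijk();                 /* in practice, the model predicts the   */
    matmul_ikj();                 /* miss counts instead of running these  */
    return 0;
}
```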
We present an HPF implementation of the BT, SP, LU, FT, CG, and MG benchmarks of the NPB2.3-serial benchmark set. The implementation is based on an HPF performance model of the benchmark-specific primitive operations on distributed arrays. We present profiling and performance data on the SGI Origin 2000 and compare the results with NPB2.3. We discuss the advantages and limitations of HPF and the pghpf compiler.
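A performance model for distributed arrays rests on knowing which processor owns each element. The sketch below shows the standard HPF BLOCK distribution rule (the ownership formula follows the HPF definition; the function names and the example reference pattern are our assumptions): once ownership is computable, the compiler or model can classify each reference as local or as requiring communication.

```c
/* Sketch of the BLOCK distribution rule underlying an HPF performance
 * model: element i of an N-element array on P processors lives on
 * processor floor(i / ceil(N/P)). Names are illustrative, not an API. */
#include <stdio.h>

static int block_owner(int i, int n, int p)
{
    int blk = (n + p - 1) / p;     /* ceil(N/P) elements per processor */
    return i / blk;
}

int main(void)
{
    int n = 1000, p = 4;
    /* For a(i) = b(i+1), references near block boundaries go off-processor. */
    for (int i = 0; i < n - 1; i++)
        if (block_owner(i, n, p) != block_owner(i + 1, n, p))
            printf("i=%3d: a(i) on proc %d, b(i+1) on proc %d -> communication\n",
                   i, block_owner(i, n, p), block_owner(i + 1, n, p));
    return 0;
}
```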
Proceedings of the IEEE/ACM SC98 Conference, 1998
Porting applications to new high performance parallel and distributed computing platforms is a challenging task. Since writing parallel code by hand is time consuming and costly, porting codes would ideally be automated by using parallelization tools and compilers. In this paper, we compare the performance of three parallelization tools and compilers, based on the NAS Parallel Benchmarks and a CFD application, ARC3D, on the SGI Origin2000 multiprocessor. The tools and compilers compared are: 1) CAPTools, an interactive computer aided parallelization toolkit; 2) the Portland Group's HPF compiler; and 3) the MIPSPro FORTRAN compiler available on the Origin2000, with support for shared memory multiprocessing directives and the MP runtime library. The tools and compilers are evaluated in four areas: 1) required user interaction, 2) limitations, 3) portability, and 4) performance. Based on these results, a discussion of the feasibility of computer-aided parallelization of aerospace applications is presented along with suggestions for future work.
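For readers unfamiliar with the third approach, directive-based shared-memory parallelization annotates an ordinary loop and lets the compiler distribute iterations across processors. The example below is illustrative only, written in OpenMP syntax for portability; SGI's own directive forms on the Origin2000 were analogous but not identical.

```c
/* Illustrative shape of directive-based shared-memory parallelization
 * (OpenMP syntax assumed here; compile with the compiler's MP flag). */
#include <stdio.h>

void saxpy(int n, float alpha, const float *x, float *y)
{
    #pragma omp parallel for      /* compiler splits iterations across CPUs */
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

int main(void)
{
    float x[1000], y[1000];
    for (int i = 0; i < 1000; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(1000, 3.0f, x, y);
    printf("y[0] = %.1f\n", y[0]);   /* prints 5.0 */
    return 0;
}
```

The appeal of this approach in the comparison above is that the serial code stays intact; the cost is that the directives say nothing about data placement, which matters on a ccNUMA machine.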
… , San Jose, CA, 1999
Strengths: built on top of a high-level language, so it is easy to program, and portability follows from the HPF standard. Weaknesses: questionable performance due to the immaturity of compiler technology; a hidden performance model that is hard to track; and limited support for irregular computation.
RECON, 1998
HPF Implementation of NPB 2.3. Michael Frumkin, Haoqiang Jin, Jerry Yan. RECON 20020073510, 1998. We present an HPF implementation of BT, SP, LU, FT, CG and MG of the NPB2.3-serial benchmark set. The implementation ...
1998
On the Information Content of Program Traces. Michael Frumkin, Robert Hood, Jerry Yan, 1998. Program traces are used for analysis of program performance, memory utilization, and communications, as well as for program debugging. ...
RECON, 1998
A series of efforts have been devoted to investigating methods of porting and parallelizing applications quickly and efficiently for new architectures, such as the SGI Origin 2000 and Cray T3E. This report presents the parallelization of a CFD application, ARC3D, using the ...
2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID'02)
Resource management is an important component of a grid computing infrastructure. The scalability and adaptability of such systems are two key challenges that must be addressed. In this work an agent-based resource management system, ARMS, is implemented for grid computing. ARMS utilises the performance prediction techniques of the PACE toolkit to provide quantitative data regarding the performance of complex applications running on a local grid resource. At the meta-level, a hierarchy of homogeneous agents is used to provide a scalable and adaptable abstraction of the system architecture. Each agent is able to cooperate with other agents and thereby provide service advertisement and discovery for the scheduling of applications that need to utilise grid resources. A case study with corresponding experimental results is included to demonstrate the efficiency of the resource management and scheduling system.
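As a toy rendering of that hierarchy (the data layout and function names below are ours, not the ARMS API), each agent fronts one grid resource, holds a predicted completion time for a request, and escalates the request up the hierarchy when an agent above it advertises a better prediction.

```c
/* Toy sketch of hierarchical agent-based scheduling (assumed structure,
 * not the ARMS implementation): pick the agent in the chain whose
 * resource advertises the best PACE-style predicted completion time. */
#include <stdio.h>

struct agent {
    const char   *resource;        /* local grid resource this agent fronts */
    double        predicted_time;  /* predicted completion time (seconds)   */
    struct agent *parent;          /* next level of the hierarchy           */
};

static const struct agent *dispatch(const struct agent *a)
{
    const struct agent *best = a;  /* accept locally unless a higher-level  */
    for (const struct agent *p = a->parent; p; p = p->parent)
        if (p->predicted_time < best->predicted_time)   /* agent is faster  */
            best = p;
    return best;
}

int main(void)
{
    struct agent root = { "cluster-A",     120.0, NULL  };
    struct agent leaf = { "workstation-B", 300.0, &root };
    printf("schedule on %s\n", dispatch(&leaf)->resource);  /* cluster-A */
    return 0;
}
```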
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2009