Vivek Kale - Academia.edu
Papers by Vivek Kale
Parallel Computing Architectures and APIs, 2019
Parallel Computing Architectures and APIs, 2019
Lecture Notes in Computer Science, 2005
The small performance variation within each node of a cloud computing infrastructure (i.e., a cloud) can be a fundamental impediment to the scalability of a high-performance application. This performance variation (referred to as jitter) particularly impacts the overall performance of scientific workloads running on a cloud. Studies show that the primary source of performance variation is disk I/O and the underlying communication network [1]. In this paper, we explore opportunities to improve the performance of high-performance applications running on emerging cloud platforms. Our contributions are (1) the quantification and assessment of the performance variation of data-intensive scientific workloads on a small set of homogeneous nodes running Hadoop and (2) the development of an improved Hadoop scheduler that can improve the performance (and potentially the scalability) of these applications by leveraging the intrinsic performance variation of the system. Using our enhanced scheduler for data-intensive scientific workloads, we obtain more than a 21% performance gain over the default Hadoop scheduler.
Lecture Notes in Computer Science, 2015
2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
Computational bioinformatics and biomedical applications frequently contain heterogeneously sized units of work, or tasks, for instance due to variability in the sizes of biological sequences and molecules. Variable-sized workloads lead to load imbalances in parallel implementations, which detract from efficiency and performance. Many modern computing resources now have multiple graphics processing units (GPUs) per computer for acceleration, and these GPUs must be used efficiently by balancing workloads across them. OpenMP is a portable, directive-based parallel programming API used ubiquitously in bioscience applications to program CPUs; recently, the use of OpenMP directives for GPU acceleration has become possible. Here, motivated by experiences with imbalanced loads in GPU-accelerated bioinformatics applications, we address the load balancing problem using OpenMP task-to-GPU scheduling combined with OpenMP GPU offloading for multiply heterogeneous workloads – loads with both variable input sizes and, simultaneously, variable convergence rates for algorithms with a stochastic component – scheduled across multiple GPUs. We aim to develop strategies that are easy to use, have low overheads, and can be incorporated incrementally into existing programs that already use OpenMP for CPU-based threading, so that such programs can take advantage of multi-GPU computers. We test different combinations of input-size variability and convergence-rate variability, and characterize the effects of these scenarios on the performance of scheduling strategies across multiple GPUs with OpenMP. We present several dynamic scheduling solutions for different parallel patterns, explore optimizations, and provide publicly available example computational kernels to make these strategies easy to use in programs. This work will enable application developers to efficiently and easily use multiple GPUs for the imbalanced workloads found in bioinformatics and biomedical applications.
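The abstract does not include the scheduling code itself; the following is a minimal sketch, assuming a simple thread-per-GPU arrangement, of how variable-sized work units might be spread dynamically across the GPUs of a node with OpenMP offloading. The names (run_unit_on_gpu, NUNITS, data) and the placeholder kernel are illustrative assumptions, not the paper's kernels.

```c
#include <omp.h>
#include <stdio.h>

#define NUNITS 64
#define N      1024

/* Placeholder kernel: offloads one work unit to the given device. */
static void run_unit_on_gpu(int unit, int dev, double *data, int n)
{
    #pragma omp target device(dev) map(tofrom: data[0:n])
    for (int i = 0; i < n; i++)
        data[i] += (double)unit;          /* stand-in for real per-unit work */
}

int main(void)
{
    static double data[NUNITS][N] = {{0}};
    int ngpus = omp_get_num_devices();
    if (ngpus == 0) ngpus = 1;            /* no GPUs: target falls back to the host device */

    /* One host thread per GPU; dynamic scheduling lets a device that
       finishes its (variable-sized) units early pick up more work. */
    #pragma omp parallel for schedule(dynamic, 1) num_threads(ngpus)
    for (int unit = 0; unit < NUNITS; unit++) {
        int dev = omp_get_thread_num();   /* thread id doubles as device id */
        run_unit_on_gpu(unit, dev, data[unit], N);
    }

    printf("done: data[0][0] = %f\n", data[0][0]);
    return 0;
}
```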
OpenMP: Portable Multi-Level Parallelism on Modern Systems, 2020
Many modern supercomputers, such as ORNL's Summit, LLNL's Sierra, and LBL's upcoming Perlmutter, offer or will offer multiple (e.g., 4 to 8) GPUs per node for running computational science and engineering applications. One should expect an application to achieve speedup using multiple GPUs on a node over a single GPU of the node, particularly an application that is embarrassingly parallel and load imbalanced, such as AutoDock, QMCPACK, or DMRG++. OpenMP is a popular model used to run applications on the heterogeneous devices of a node, and OpenMP 5.x provides rich features for tasking and GPU offloading. However, OpenMP does not provide significant support for running application code on multiple GPUs efficiently, in particular for the aforementioned applications. We provide different OpenMP task-to-GPU scheduling strategies that help distribute an application's work across the GPUs of a node for efficient parallel GPU execution. Our solution uses OpenMP's taskloop construct to generate OpenMP tasks containing target regions for OpenMP threads, and then has OpenMP threads assign those tasks to GPUs on a node through a schedule specified by the application programmer. We analyze the performance of our solution using a small benchmark code representative of the aforementioned applications. Our solution improves performance over a standard baseline assignment of tasks to GPUs by up to 57.2%. Further, based on our results, we suggest OpenMP extensions that could help an application programmer run an application on multiple GPUs per node efficiently.
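As a companion illustration of the taskloop-plus-target pattern described above, here is a hedged sketch in C: taskloop generates one task per work item, and the executing thread assigns each task to a device via a simple round-robin choice. The round-robin choice is only a stand-in for the programmer-specified schedules the paper evaluates, and the array names and sizes are assumptions.

```c
#include <omp.h>
#include <stdio.h>

#define NTASKS 64
#define N      1024

int main(void)
{
    static double a[NTASKS][N] = {{0}};
    int ngpus = omp_get_num_devices();
    if (ngpus == 0) ngpus = 1;            /* no GPUs: run on the host device */

    #pragma omp parallel
    #pragma omp single
    {
        /* taskloop turns each iteration into an OpenMP task containing a
           target region; the executing thread assigns the task to a device
           chosen here by a simple round-robin schedule. */
        #pragma omp taskloop grainsize(1)
        for (int t = 0; t < NTASKS; t++) {
            int dev = t % ngpus;          /* stand-in for the programmer-chosen schedule */
            #pragma omp target device(dev) map(tofrom: a[t][0:N])
            for (int i = 0; i < N; i++)
                a[t][i] += 1.0;
        }
    }

    printf("a[0][0] = %f\n", a[0][0]);
    return 0;
}
```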
Load imbalance is the major source of performance degradation in computationally intensive applications that frequently consist of parallel loops. Efficient scheduling of parallel loops can improve the performance of such programs. OpenMP is the de facto standard for parallel programming on shared-memory systems. The current OpenMP specification provides only three choices for loop scheduling, which are insufficient in scenarios with irregular loops, system-induced interference, or both. Therefore, this work augments the LLVM implementation of the OpenMP runtime library with eleven state-of-the-art scheduling techniques plus three new, ready-to-use ones. We tested the existing and added loop scheduling strategies on several applications from the NAS, SPEC OMP 2012, and CORAL-2 benchmark suites. The experimental results show that each newly implemented scheduling technique outperforms the others in certain application and system configurations. We measured performance gains of up to 6...
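The extended scheduling techniques live inside a modified LLVM OpenMP runtime, so application code only needs the standard hook shown in the sketch below: schedule(runtime) plus the OMP_SCHEDULE environment variable. The assumption that an extended runtime exposes additional technique names through the same variable is mine, not a statement from the abstract; the loop body is a placeholder.

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N];

    /* schedule(runtime) defers the scheduling decision to the runtime,
       which reads it from OMP_SCHEDULE (e.g. OMP_SCHEDULE="dynamic,64").
       An extended runtime could accept additional technique names through
       the same variable; that part is an assumption, not standard OpenMP. */
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < N; i++)
        x[i] = (i % 97) * 0.5;            /* stand-in for an irregular loop body */

    printf("x[N-1] = %f\n", x[N - 1]);
    return 0;
}
```

A run can then select a schedule without recompiling, e.g. OMP_SCHEDULE="guided,8" ./a.out.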
Parallel Computing, 2021
As recent enhancements to the OpenMP specification become available in its implementations, there is a need to share the results of experimentation in order to better understand OpenMP implementations' behavior in practice, to identify pitfalls, and to learn how the implementations can be effectively deployed in scientific codes. We report on experiences gained and practices adopted when using OpenMP to port a variety of ECP applications, mini-apps, and libraries based on different computational motifs to accelerator-based, leadership-class, high-performance supercomputer systems at the United States Department of Energy. Additionally, we identify important challenges and open problems related to the deployment of OpenMP. Through our report of experiences, we find that OpenMP implementations are successful on current supercomputing platforms and that OpenMP is a promising programming model for applications to be run on emerging and future platforms with accelerated nodes.
OpenMP: Conquering the Full Hardware Spectrum, 2019
Parallel loops are an important part of OpenMP programs. Efficient scheduling of parallel loops can improve the performance of such programs. The current OpenMP specification offers only three options for loop scheduling, which are insufficient in certain instances. Given the large number of other possible scheduling strategies, standardizing each of them is infeasible. A more viable approach is to extend the OpenMP standard to allow a user to define loop scheduling strategies within her application. This approach will enable standard-compliant, application-specific scheduling. This work analyzes the principal components required for user-defined scheduling and proposes two competing interfaces as candidates for the OpenMP standard. We conceptually compare the two proposed interfaces with respect to the three host languages of OpenMP, i.e., C, C++, and Fortran. These interfaces serve the OpenMP community as a basis for discussion and for prototype implementations supporting user-defined scheduling in an OpenMP library.
OpenMP: Enabling Massive Node-Level Parallelism, 2021
This paper reports on experiences gained and practices adopted when using the latest features of OpenMP to port a variety of HPC applications and mini-apps based on different computational motifs (BerkeleyGW, WDMApp/XGC, GAMESS, GESTS, and GridMini) to accelerator-based, leadership-class, high-performance supercomputer systems at the Department of Energy. As recent enhancements to OpenMP become available in implementations, there is a need to share the results of experimentation with them in order to better understand their behavior in practice, to identify pitfalls, and to learn how they can be effectively deployed in scientific codes. Additionally, we identify best practices from these experiences that we can share with the rest of the OpenMP community.
Lecture Notes in Computer Science, 2012
Hybrid parallel programming with MPI for internode communication in conjunction with a shared-memory programming model to manage intranode parallelism has become a dominant approach to scalable parallel programming. While this model provides a great deal of flexibility and performance potential, it saddles programmers with the complexity of using two parallel programming systems in the same application. We introduce an MPI-integrated shared-memory programming model that is incorporated into MPI through a small extension to the one-sided communication interface. We discuss the integration of this interface with the upcoming MPI 3.0 one-sided semantics and describe solutions for providing portable and efficient data sharing, atomic operations, and memory consistency. We describe an implementation of the new interface in the MPICH2 and Open MPI implementations and demonstrate an average performance improvement of 40% in the communication component of a five-point stencil solver.
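For context, the interface described here fed into the MPI-3 shared-memory window extension. The sketch below is a minimal, hedged illustration using the standard MPI-3 calls (MPI_Comm_split_type, MPI_Win_allocate_shared, MPI_Win_shared_query); the variable names and the toy exchange of a single double are illustrative only and are not taken from the paper.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Communicator containing only the ranks that share this node's memory. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Each rank contributes one double to a node-local shared window. */
    MPI_Win win;
    double *my_base;
    MPI_Win_allocate_shared(sizeof(double), sizeof(double),
                            MPI_INFO_NULL, node_comm, &my_base, &win);

    /* Query the base address of rank 0's segment for direct load/store access. */
    MPI_Aint seg_size;
    int disp_unit;
    double *rank0_base;
    MPI_Win_shared_query(win, 0, &seg_size, &disp_unit, &rank0_base);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
    my_base[0] = (double)node_rank;      /* ordinary store into shared memory */
    MPI_Win_sync(win);                   /* make the store visible ...        */
    MPI_Barrier(node_comm);              /* ... and order it across ranks     */
    MPI_Win_sync(win);
    printf("rank %d sees rank 0's value %.0f\n", node_rank, rank0_base[0]);
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```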
2013 IEEE 27th International Symposium on Parallel and Distributed Processing, 2013
The lattice Boltzmann method is increasingly important in facilitating large-scale fluid dynamics simulations. To date, these simulations have been built on discretized velocity models of up to 27 neighbors. Recent work has shown that higher-order approximations of the continuum Boltzmann equation enable not only recovery of the Navier-Stokes hydrodynamics, but also simulations for a wider range of Knudsen numbers, which is especially important in micro- and nanoscale flows. These higher-order models significantly affect both the communication and computational complexity of the application. We present a performance study of the higher-order models compared to the traditional ones on both the IBM Blue Gene/P and Blue Gene/Q architectures. We study the trade-offs of several optimization methods, such as the use of deep halo-level ghost cells, which, alongside hybrid programming models, reduce the impact of the extended models and enable efficient modeling of extreme regimes of computational fluid dynamics.
Proceedings of the 2010 Workshop on Parallel Programming Patterns, 2010
The NAS parallel benchmarks, originally developed by NASA for evaluating the performance of its high-performance computers, are regarded as one of the most widely used benchmark suites for side-by-side comparisons of high-performance machines. However, even though the NAS parallel benchmarks have grown tremendously over the last two decades, their documentation has lagged behind because of rapid changes and additions to the collection of benchmark codes, driven primarily by rapid innovation in parallel architectures. Consequently, the learning curve for beginning graduate students, researchers, or software systems engineers picking up these benchmarks is typically steep. In this paper, we document and assess the NAS parallel benchmark suite by identifying parallel patterns within the NAS benchmark codes. We believe that such documentation of the benchmarks will allow researchers, as well as those in industry, to understand, use, and modify these codes more effectively.
Proceedings of the 3rd International Workshop on Multicore Software Engineering, 2010
Measures for Performance Evaluation of Multicore: Benchmarks are used in high-performance scientific computing research and industry for side-by-side comparison of two or more machines.
2011 18th International Conference on High Performance Computing, 2011
Recent studies have shown that operating system (OS) interference, popularly called OS noise, can be a significant problem as we scale to a large number of processors. One solution for mitigating noise is to turn off certain OS services on the machine. However, this is typically infeasible because full-scale OS services may be required for some applications; furthermore, it is not a choice that an end user can make. Thus, we need an application-level solution. Building upon previous work that demonstrated the utility of within-node lightweight load balancing, we discuss the technique of weighted micro-scheduling and provide insights based on experimentation on two different machines with very different noise signatures. Through careful enumeration of the search space of scheduler parameters, we allow our weighted micro-scheduler to be dynamic, adaptive, and tunable for a specific application running on a specific architecture. By doing this, we show how scientific applications can run efficiently on a very large number of processors, even in the presence of noise.
Computing, 2013
Hybrid parallel programming with the message passing interface (MPI) for internode communication in conjunction with a shared-memory programming model to manage intranode parallelism has become a dominant approach to scalable parallel programming. While this model provides a great deal of flexibility and performance potential, it saddles programmers with the complexity of utilizing two parallel programming systems in the same application.
Lecture Notes in Computer Science, 2010
Application performance can degrade significantly due to node-local load imbalances during application execution on a large number of SMP nodes. These imbalances can arise from the machine, the operating system, or the application itself. Although dynamic load balancing within a node can mitigate imbalances, such load balancing is challenging because of its impact on data movement and synchronization overhead. We developed a series of scheduling strategies that mitigate imbalances without incurring high overhead. Our strategies provide performance gains for various HPC codes and perform better than widely known scheduling strategies such as OpenMP guided scheduling. Our scheme and methodology allow applications to scale to next-generation clusters of SMPs with minimal application programmer intervention. We expect these techniques to be increasingly useful for future machines approaching exascale.
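The abstract does not name the specific strategies. One common low-overhead pattern in this space, shown below purely as an assumption about the general approach, splits a loop into a statically scheduled portion and a dynamically scheduled remainder so that only the tail pays dynamic-scheduling overhead; the helper names (hybrid_loop, do_iter) and the 80/20 split are hypothetical.

```c
#include <omp.h>
#include <stdio.h>
#include <math.h>

#define N 100000

static double result[N];

/* Work function with per-iteration variability (a stand-in for real work). */
static void do_iter(int i)
{
    int reps = 1 + (i % 50);                 /* uneven cost across iterations */
    double v = (double)i;
    for (int r = 0; r < reps; r++)
        v = sqrt(v + 1.0);
    result[i] = v;
}

/* Hybrid split: the first static_fraction of the loop is statically
   scheduled (no scheduling overhead); the remainder is dynamically
   scheduled so that idle threads can absorb load imbalance and noise. */
static void hybrid_loop(double static_fraction)
{
    int split = (int)(static_fraction * N);

    #pragma omp parallel
    {
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < split; i++)
            do_iter(i);

        /* Threads that finish their static share early start here at once. */
        #pragma omp for schedule(dynamic, 16)
        for (int i = split; i < N; i++)
            do_iter(i);
    }
}

int main(void)
{
    hybrid_loop(0.8);                        /* e.g. 80% static, 20% dynamic */
    printf("result[N-1] = %f\n", result[N - 1]);
    return 0;
}
```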