Kevin Barker - Academia.edu
Papers by Kevin Barker
The Journal of Supercomputing, 2013
Abstract: As the march to exascale computing gains momentum, the energy consumption of supercomputers has emerged as the critical roadblock. While architectural innovations are imperative in achieving computing of this scale, it is largely dependent on the systems ...
Proceedings of the 10th Workshop on Workflows in Support of Large-Scale Science - WORKS '15, 2015
2014 Energy Efficient Supercomputing Workshop, 2014
computer.org
Organizers: Dimitris Nikolopoulos, FORTH-ICS and University of Crete, Greece; Cal Ribbens, Virginia Tech, USA ... Program Committee: Eduard Ayguade, Barcelona Supercomputing Center, Spain; Chris Baker, Oak Ridge National Laboratory, USA; Kevin Barker, Pacific Northwest National Laboratory, USA; Filip Blagojevic, Hitachi Global Storage, USA; Carter Edwards, Sandia National Laboratories, USA; Narayan Ganesan, University of Delaware, USA; Mike Heroux, Sandia National Laboratories, USA; Ananth Kalyanaraman, Washington State University, USA; Sven ...
Currently there is large architectural diversity in high performance computing systems. They include "commodity" cluster systems that optimize per-node performance for small jobs, massively parallel processors (MPPs) that optimize aggregate performance for large jobs, and accelerated systems that optimize both per-node and aggregate performance but only for applications custom-designed to take advantage of such systems. Because of these dissimilarities, meaningful comparisons of achievable performance are not straightforward. In this work we utilize a methodology that combines both empirical analysis and performance modeling to compare clusters (represented by a 4,352-core IB cluster), MPPs (represented by a 147,456-core BG/P), and accelerated systems (represented by the 129,600-core Roadrunner) across a workload of four applications. Strengths of our approach include the ability to compare architectures (as opposed to specific implementations of an architecture), to attribute each application's performance bottlenecks to characteristics unique to each system, and to explore performance scenarios in advance of their availability for measurement. Our analysis illustrates that application performance is essentially unrelated to relative peak performance, but that application performance can be both predicted and explained using modeling.
Parallel Processing Letters, 2009
In this paper, we present a methodology for profiling parallel applications executing on the family of architectures commonly referred to as the "Cell" processor. Specifically, we examine Cell-centric MPI programs on hybrid clusters containing multiple Opteron and IBM PowerXCell 8i processors per node, such as those used in the petascale Roadrunner system. We analyze the performance of our approach on a PlayStation 3 console based on the Cell Broadband Engine (CBE) as well as an IBM BladeCenter QS22 based on the PowerXCell 8i. Our implementation incurs less than 0.5% overhead and 0.3 µs per profiler call for a typical molecular dynamics code on the Cell BE while efficiently utilizing the limited local store of the Cell's SPE cores. Our worst-case overhead analysis on the PowerXCell 8i costs 3.2 µs per profiler call while using only two 5 KiB buffers. We demonstrate the use of our profiler on a cluster of hybrid nodes running a suite of scientific applications. Our analyses of inter-SPE communication (across the entire cluster) and function call patterns provide valuable information that can be used to optimize application performance. In fact, the event data structure described in Section 3.1 contains enough data to provide finer details on message-passing events. For example, inter-SPE and/or SPE-to-PPE communications can be analyzed in finer detail: function use, duration, type of message-passing activity, message size, type of data being sent and/or received, count of a given data type, and source/destination can all be analyzed to provide more insight into the program flow.
Because we can extract a point-to-point communication matrix from an application execution, it is also possible to automatically identify the communication pattern by measuring the degree of match between the point-to-point communication matrix and predefined communication templates for regularly occurring communication patterns in scientific applications.
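The matching step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the template shapes, the cosine-similarity score, and all function names are assumptions.

```python
import numpy as np

def ring_template(p):
    """P x P template for a 1-D ring: each rank talks to its two neighbors."""
    t = np.zeros((p, p))
    for r in range(p):
        t[r, (r - 1) % p] = 1.0
        t[r, (r + 1) % p] = 1.0
    return t

def all_to_all_template(p):
    """P x P template where every rank talks to every other rank."""
    t = np.ones((p, p))
    np.fill_diagonal(t, 0.0)
    return t

def match_score(measured, template):
    """Cosine similarity between flattened matrices: 1.0 = perfect match."""
    m, t = measured.ravel(), template.ravel()
    return float(np.dot(m, t) / (np.linalg.norm(m) * np.linalg.norm(t)))

def classify(measured, templates):
    """Return the best-matching template name and all scores."""
    scores = {name: match_score(measured, t) for name, t in templates.items()}
    return max(scores, key=scores.get), scores

# Synthetic measurement: 4 KiB exchanged along a ring of 8 ranks.
p = 8
measured = ring_template(p) * 4096.0
templates = {"ring": ring_template(p), "all-to-all": all_to_all_template(p)}
best, scores = classify(measured, templates)
print(best)  # -> ring
```

Because cosine similarity ignores overall scale, the same template matches regardless of message sizes, which is the property one wants when classifying patterns rather than volumes.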
... ASC Roadrunner Petascale Hybrid System. Darren J. Kerbyson, Scott Pakin, Mike Lang, Jose Sancho, Kei Davis, Kevin Barker, and Josh Peraza ... 1.026 PF sustained on the Linpack benchmark (Kistler, Gunnels, Benton, Brokenshire). First #1 InfiniBand machine ...
Proceedings of the 20 Years of Beowulf Workshop in Honor of Thomas Sterling's 65th Birthday - Beowulf '14, 2015
This paper presents a parallel programming environment for mesh generation. Our approach is based on overdecomposition. The programming environment supports: low-latency one-sided communication, global address space in the context of data/object mobility, automatic message forwarding, and dynamic load balancing. These are the minimum requirements for developing efficient adaptive parallel mesh generation codes on distributed memory machines and clusters of workstations and PCs. Performance data from a 3-dimensional advancing front mesh generation code designed for crack propagation simulations suggest that the flexibility and general nature of our parallel programming environment does not cause undue overhead.
Lecture Notes in Computer Science, 2007
Optical Circuit Switching (OCS) is a promising technology for future large-scale high performance computing networks. It is currently widely used in telecommunication networks and offers all-optical data paths between nodes in a system. Traffic passing through these paths is subject only to the propagation delay through optical fibers and optical/electrical conversions on the sending and receiving ends. High communication bandwidths within
In this work we present an initial performance evaluation of Intel's latest, second-generation quad-core processor, Nehalem, and provide a comparison to the first-generation AMD and Intel quad-core processors Barcelona and Tigerton. Nehalem is the first Intel processor to implement a NUMA architecture incorporating QuickPath Interconnect for interconnecting processors within a node, and the first to incorporate an integrated memory controller. We evaluate the suitability of these processors in quad-socket compute nodes as building blocks for large-scale scientific computing clusters. Our analysis of intra-processor and intra-node scalability of microbenchmarks, and a range of large-scale scientific applications, indicates that quad-core processors can deliver an improvement in performance of up to 4x over a single core depending on the workload being processed. However, scalability can be less when considering a full node. We show that Nehalem outperforms Barcelona on memory-intensi...
Third International Symposium on Parallel and Distributed Computing/Third International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Networks, 2004
In the last few years, research advances in dynamic scheduling at the application and runtime system levels have contributed to improving the performance of scientific applications in heterogeneous environments. This paper presents the design and implementation of a library resulting from an integrated approach to dynamic load balancing. This approach combines the advantages of optimizing data migration via novel dynamic loop scheduling strategies with the advances in object migration mechanisms of parallel runtime systems. The performance improvements obtained by this library have been investigated through its use in two scientific applications: N-body simulations and the profiling of automatic quadrature routines. The experimental results underscore the significance of using such an integrated approach, as well as the benefits of using the library, especially in cluster applications characterized by irregular and unpredictable behavior.
Lecture Notes in Computer Science, 1998
2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, 2008
We demonstrate the outstanding performance and scalability of the VPIC kinetic plasma modeling code on the heterogeneous IBM Roadrunner supercomputer at Los Alamos National Laboratory. VPIC is a three-dimensional, relativistic, electromagnetic, particle-in-cell (PIC) code that self-consistently evolves a kinetic plasma. VPIC simulations of laser plasma interaction were conducted at unprecedented fidelity and scale, up to 1.0 × 10^12 particles, on as
Supercomputing Conference, 2005
The interconnect plays a key role in both the cost and performance of large-scale HPC systems. The cost of future high-bandwidth electronic interconnects mushrooms due to expensive optical transceivers needed between electronic switches. We describe a potentially cheaper and more power-efficient approach to building high-performance interconnects. Through empirical analysis of HPC applications, we find that the bulk of inter-processor communication
Parallel and Distributed Computing Systems, 2005
In this work we present a detailed analytical performance model of the large-scale parallel application HYCOM, the Hybrid Coordinate Ocean Model. The performance model is developed from analyzing the activities contained within the code. It is parameterized in terms of both application characteristics (including problem sizes, time-steps, and application phases) and system characteristics (including single processor performance, computational node size,
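An analytic model of the kind described, parameterized in application terms (grid size, layers, time steps) and system terms (per-cell compute rate, network latency and bandwidth), might be sketched as below. All parameter values, the decomposition scheme, and the function name are illustrative assumptions, not the published HYCOM model.

```python
import math

def ocean_model_time(grid_x, grid_y, layers, procs, timesteps,
                     t_cell=1e-7,      # seconds of compute per grid cell (assumed)
                     latency=2e-6,     # network latency per message, seconds (assumed)
                     bandwidth=1e9):   # network bandwidth, bytes/second (assumed)
    """Hypothetical analytic runtime model: compute time scales with the
    per-processor subdomain, communication time with the halo exchange."""
    px = py = int(math.sqrt(procs))  # assume a square processor decomposition
    cells_per_proc = (grid_x / px) * (grid_y / py) * layers
    t_compute = cells_per_proc * t_cell
    # Halo exchange with 4 neighbors: boundary cells * layers * 8 bytes/value.
    halo_bytes = 2 * (grid_x / px + grid_y / py) * layers * 8
    t_comm = 4 * (latency + halo_bytes / bandwidth)
    return timesteps * (t_compute + t_comm)

# More processors shrink each subdomain, so the modeled runtime drops:
t4 = ocean_model_time(1000, 1000, 20, procs=4, timesteps=100)
t16 = ocean_model_time(1000, 1000, 20, procs=16, timesteps=100)
```

The value of such a model is that scaling scenarios (more processors, larger grids, faster networks) can be explored before the configuration is available for measurement, which is exactly the use the abstract describes.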
Proceedings of the 2003 ACM/IEEE conference on Supercomputing - SC '03, 2003
We present an evaluation of a flexible framework and runtime software system for the dynamic load balancing of asynchronous, highly adaptive, and irregular applications. These applications, which include parallel unstructured and adaptive mesh refinement, serve as building blocks for a large class of scientific applications. Extensive study has led to the development of solutions to the dynamic load balancing problem for loosely synchronous and computation-intensive programs; however, these methods are not suitable for asynchronous and highly adaptive applications. We evaluate a new software framework which includes support for an Active Messages style communication mechanism, a global name space, transparent object migration, and preemptive decision making. Our results from both a 3-dimensional parallel advancing front mesh generation program and a synthetic microbenchmark indicate that this new framework outperforms two existing general-purpose, well-known, and widely used software systems for the dynamic load balancing of adaptive and irregular parallel applications.