Dan Stanzione - Academia.edu
Papers by Dan Stanzione
PLOS Computational Biology, Feb 7, 2024
Texas Advanced Computing Center (TACC)
The Stampede 1 supercomputer was a tremendous success as an XSEDE resource, providing more than eight million successful computational simulations and data analysis jobs to more than ten thousand users. In addition, Stampede 1 introduced new technology that began to move users towards many-core processors. As Stampede 1 reaches the end of its production life, it is being replaced in phases by a new supercomputer, Stampede 2, that will not only take up much of the original system's workload, but continue the bridge to technologies on the path to exascale computing. This paper provides a brief summary of the experiences of Stampede 1, and details the design and architecture of Stampede 2. Early results are presented from a subset of Intel Knights Landing nodes that are bridging between the two systems.
Plant and Animal Genome XX Conference (January 14-18, 2012), Jan 16, 2012
Debugging is difficult; debugging parallel programs at large scale is particularly so. Interactive debugging tools continue to improve in ways that mitigate the difficulties, and the best such systems will continue to be mission critical. Such tools have their limitations, however. They are often unable to operate across many thousands of cores. Even when they do function correctly, mining and …
CRC Press eBooks, May 8, 2019
InfiniBand has emerged as a new high bandwidth, low latency standard for high performance computing, but as a technology, is still focused on Layer 2 switching. Standards have not yet been defined for InfiniBand Layer 3 Routing, required for additional scalability, distance reach, security, and fault tolerance and isolation. The meeting will consist of: Product Leads from InfiniBand vendors discussing the unique …
Power consumption of ICT facilities and data centers has grown, and this has led to a need to improve the energy efficiency of these facilities. DC power distribution systems employing 380VDC as the supply voltage are one promising approach to this problem for countries around the world developing and deploying commercial services. We demonstrated a 380VDC power distribution system interconnected with a solar power generation system in Texas, USA. The purpose of this demonstration was to show that a 380VDC power supply system saves more energy than an AC power supply system, and to show how much carbon dioxide emissions can be reduced by integrating a solar power generation system. The demonstration resulted in an approximately 17% energy reduction compared with an AC power supply system having the same level of reliability. An evaluation using Datacenter Performance Per Energy (DPPE) as a performance index of data center efficiency was also carried out. The results showed that Power Usage Effectiveness (PUE), one of the sub-metrics of DPPE, improved with the 380VDC power supply system compared with the AC power supply system.
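To make the PUE comparison concrete: PUE is total facility energy divided by IT equipment energy, so any reduction in distribution, conversion, or cooling losses lowers it. The numbers below are a minimal sketch with hypothetical values chosen only to illustrate the calculation; the abstract reports roughly 17% overall savings but does not give the underlying measurements.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    return total_facility_kwh / it_equipment_kwh

# Hypothetical measurements for illustration only (not from the paper):
it_load = 100.0             # energy consumed by IT equipment (kWh)
ac_total = 160.0            # energy drawn by the whole facility on AC distribution (kWh)
dc_total = ac_total * 0.83  # assume the ~17% facility-level saving applies to the total draw

print(f"AC PUE = {pue(ac_total, it_load):.2f}")   # 1.60
print(f"DC PUE = {pue(dc_total, it_load):.2f}")   # ~1.33
```

With the IT load held fixed, a facility-level energy saving shows up directly as a lower PUE, which is why PUE is a natural sub-metric for comparing the two distribution systems.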
Proceedings of SPIE, Oct 8, 1998
The PCIT method is an important technique for detecting interactions between networks. The PCIT algorithm has been used in the biological context to infer complex regulatory mechanisms and interactions in genetic networks, in genome-wide association studies, and in other similar problems. In this work, the PCIT algorithm is re-implemented with exemplary parallel, vector, I/O, memory and instruction optimizations for today's multi- and many-core architectures. The evolution and performance of the new code targets the processor architectures of the Stampede supercomputer, but will also benefit other architectures. The Stampede system consists of an Intel Xeon E5 processor base system with an innovative component comprised of Intel Xeon Phi coprocessors. Optimized results and an analysis are presented for both the Xeon and the Xeon Phi.
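For context on the kernel being optimized: PCIT's cost is dominated by computing first-order partial correlations over all gene triplets. The sketch below shows that kernel vectorized with NumPy for a single conditioning gene; the function name, the synthetic data, and the omission of PCIT's data-driven significance threshold are simplifications for illustration, not the paper's implementation.

```python
import numpy as np

def partial_correlations(expr):
    """First-order partial correlations, the computational kernel of PCIT (sketch).

    expr: (genes x samples) expression matrix. Returns the direct correlation
    matrix r and, for one illustrative conditioning gene z, the matrix of
    partial correlations r_xy.z. The full algorithm repeats this for every z
    and applies a data-driven tolerance to decide which edges survive; that
    thresholding step is omitted here.
    """
    r = np.corrcoef(expr)                  # direct correlations r_xy
    z = 0                                  # illustrative choice of conditioning gene
    rz = r[:, z]
    # r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)),
    # computed for all (x, y) at once; clip avoids division by zero at x == z.
    denom = np.sqrt(np.clip(np.outer(1.0 - rz**2, 1.0 - rz**2), 1e-12, None))
    p = (r - np.outer(rz, rz)) / denom
    return r, p

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    expr = rng.standard_normal((50, 200))  # 50 genes, 200 samples (synthetic)
    r, p = partial_correlations(expr)
    print(r.shape, p.shape)
```

Because the same formula is evaluated for every (x, y, z) combination, the computation is naturally data-parallel, which is what makes it a good target for the vector and many-core optimizations described in the paper.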
At many university, government, and corporate facilities, it is increasingly common for multiple compute clusters to exist in a relatively small geographic area. These clusters represent a significant investment, but effectively leveraging this investment across clusters is a challenge. Dynamic Virtual Clustering has been shown to be an effective way to increase utilization, decrease job turnaround time, and increase workload throughput in a multi-cluster environment on a small geographic scale. Dynamic Virtual Clustering is a system for flexibly and seamlessly deploying virtual machines in a single- or multi-cluster environment. The amount of time required to deploy virtual machines may be prohibitively large, especially when the jobs designated to run inside the virtual machines are short-lived. In this paper we examine the overhead of deploying virtual machine images, and present an implementation of image caching as a way to reduce this overhead.

I. INTRODUCTION

At many university, government, and corporate facilities, it is increasingly common for multiple compute clusters to exist in a relatively small geographic area. These clusters represent a significant investment, but effectively leveraging this investment across clusters is a challenge. Dynamic Virtual Clustering (DVC) has been shown to be an effective way to increase utilization, decrease job turnaround time, and increase workload throughput in a multi-cluster environment on a small geographic scale. DVC is a system for flexibly and seamlessly deploying virtual machines (VMs) across a single- or multi-cluster environment. DVC tightly integrates VM technology with the cluster's resource management and scheduling software to allow jobs to run on any cluster in any software environment while effectively sandboxing users and applications from the host system. DVC uses VMs in a cluster environment by staging images to compute nodes and booting the VMs on those nodes. However, the amount of time required to stage and boot VMs may be prohibitively large, especially when the jobs designated to run inside the VMs are short-lived. In this paper we examine the overhead of staging and booting, consider issues associated with caching VM images, and present an implementation of image caching as a way to reduce this overhead. The basic implementation and analysis of caching presented here will later be used to create intelligent scheduling algorithms and heuristics that use cache information to reduce overhead due to VM use. Section II examines DVC, virtualization, and resource management in cluster environments with respect to virtual machines. Section III examines the initial implementation of VM creation, details image caching as a way to reduce the overhead of staging and booting VM images, and enumerates situations where caching can cause unexpected and incorrect behavior.
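The core idea of image caching in this setting is simple: keep a node-local copy of each VM image and skip the staging copy when the same image is requested again. The sketch below illustrates that pattern; the cache directory, key scheme, and function name are assumptions for illustration, not the DVC implementation described in the paper.

```python
import hashlib
import shutil
from pathlib import Path

# Hypothetical node-local cache location; the real system's layout is not
# described in the abstract.
CACHE_DIR = Path("/var/cache/vm-images")

def stage_image(image_repo_path: str) -> Path:
    """Return a node-local copy of a VM image, copying from the shared
    repository only on a cache miss. Reusing a cached image avoids paying
    the staging cost for every short-lived job that uses the same image."""
    src = Path(image_repo_path)
    st = src.stat()
    # Key the cache on name, size, and mtime (a cheap stat) so an updated
    # image in the repository invalidates the stale cached copy without
    # reading the whole file just to decide hit or miss.
    key = hashlib.sha256(
        f"{src.name}:{st.st_size}:{st.st_mtime_ns}".encode()).hexdigest()[:16]
    cached = CACHE_DIR / f"{src.stem}-{key}{src.suffix}"
    if cached.exists():
        return cached                      # cache hit: no staging cost
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, cached)              # cache miss: pay the copy once
    return cached
```

A stat-based key is cheap but can be fooled if an image is rewritten without changing size or timestamp; that kind of staleness is exactly the "unexpected and incorrect behavior" the paper's Section III enumerates.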
Springer eBooks, 2006
As larger and larger commodity clusters for high performance computing proliferate at research institutions around the world, challenges in maintaining effective use of these systems also continue to increase. Among the many challenges are maintaining the appropriate …
This paper provides an overview of the GDBase framework for offline parallel debuggers. The framework was designed to become the basis of debugging tools which scale successfully on systems with tens to hundreds of thousands of cores. With several systems coming online at more than 50,000 cores in the past year, debuggers which can run at these scales are …
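The defining idea of an offline debugger is to decouple data collection from interaction: per-rank debug events are recorded during the run and mined afterwards, rather than driving thousands of interactive debugger sessions. The sketch below illustrates that pattern with a SQLite event store; the schema, table names, and event format are invented for illustration and are not GDBase's actual design.

```python
import sqlite3

# Illustrative event store for offline debugging: each rank appends records
# during the run; analysis queries happen after the job finishes. The table
# layout and fields are hypothetical, not taken from GDBase.
def open_store(path="debug_events.db"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS events (
                    rank INTEGER, timestamp REAL,
                    kind TEXT, location TEXT, detail TEXT)""")
    return db

def record(db, rank, timestamp, kind, location, detail=""):
    """Append one debug event (e.g. a fault or breakpoint hit) for a rank."""
    db.execute("INSERT INTO events VALUES (?, ?, ?, ?, ?)",
               (rank, timestamp, kind, location, detail))
    db.commit()

def top_fault_sites(db, limit=10):
    """Offline analysis: source locations with the most faults across all ranks."""
    return db.execute("""SELECT location, COUNT(*) AS n FROM events
                         WHERE kind = 'fault'
                         GROUP BY location ORDER BY n DESC LIMIT ?""",
                      (limit,)).fetchall()
```

Aggregating after the fact scales with storage and query capacity rather than with the number of simultaneous interactive sessions, which is the property that matters at tens of thousands of cores.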
The growth in the capacity and capability of NAND Flash based storage systems has changed the face of data-oriented computational systems. These systems have become both more capable and more flexible in how they are used. With these changes come both increased potential and increased user complexity. While many systems attempt to hide this complexity through the addition of more layers of storage caches, the design of the Wrangler system went a different route, choosing instead to build a simple yet flexible web-based interface that allows users to easily configure this complex data computing system based on their service and software needs. This allows users to work in the environments best suited to their workflows while optimally utilizing the system's high-performance, high-capacity storage. The interface also allows users to schedule long-term periods of reserved capacity, "data campaigns", for projects. Finally, the system has been designed to support data storage and sharing capabilities that enable these key aspects of data research. We discuss the capabilities with respect to three existing workflows on the system to highlight the diversity and flexibility this environment provides to data researchers.
In the past, reconfigurable computing has not been an option for accelerating scientific algorithms (which require complex floating-point operations) and other similar applications due to limited FPGA density. However, the rapid increase of FPGA densities over the past several years has altered this situation. The central goal of the Reconfigurable Computing Application Development Environment (RCADE) is to capitalize on these …
IEEE Computer, Nov 1, 2011