Jeffrey Vetter | Oak Ridge National Laboratory

Papers by Jeffrey Vetter

EqualWrites: Reducing Intra-set Write Variations for Enhancing Lifetime of Non-volatile Caches

Driven by increasing core counts and the bandwidth wall, the size of last-level caches (LLCs) has greatly increased; hence, researchers have explored non-volatile memories (NVMs), which provide high density and consume little leakage power. Since NVMs have low write endurance and existing cache management policies are unaware of write variation, effective wear-leveling techniques are required to achieve reasonable cache lifetimes with NVMs. We present EqualWrites, a technique for mitigating intra-set write variation. It records the number of writes to each block and changes the cache-block location of a hot data item, redirecting future writes to a cold block to achieve wear-leveling. Simulation experiments were performed using an x86-64 simulator with benchmarks from SPEC06 and the HPC (high-performance computing) field. For single-, dual- and quad-core system configurations, EqualWrites improves cache lifetime by 6.31X, 8.74X and 10.54X, respectively. Its implementation overhead is very small, and it provides a larger lifetime improvement than three other intra-set wear-leveling techniques and a cache replacement policy.
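
The wear-leveling idea described above — count writes per block and relocate a hot data item to a cold block — can be sketched in a few lines. This is an illustrative toy model only, not the paper's actual mechanism; the class name, threshold, and swap policy are all invented for exposition.

```python
# Toy model of intra-set wear-leveling in the spirit of EqualWrites.
# Per-way write counters track wear; once a way's counter crosses a
# threshold, the hot item is swapped into the least-written ("cold")
# way so that future writes to it land there instead.

class CacheSet:
    def __init__(self, num_ways, threshold=4):
        self.data = [None] * num_ways    # item held by each physical way
        self.writes = [0] * num_ways     # lifetime writes per physical way
        self.threshold = threshold

    def write(self, way, value):
        """Write `value` to `way`; return the way where the item now lives."""
        self.writes[way] += 1
        self.data[way] = value
        if self.writes[way] >= self.threshold:
            way = self._move_to_cold(way)
        return way

    def _move_to_cold(self, hot_way):
        cold_way = min(range(len(self.writes)), key=self.writes.__getitem__)
        if cold_way != hot_way:
            # Swap contents; the wear counters stay with the physical ways.
            self.data[hot_way], self.data[cold_way] = \
                self.data[cold_way], self.data[hot_way]
        return cold_way
```

With a threshold of 2 and a single hot item in a 4-way set, eight consecutive writes end up spread evenly (two per way) instead of all landing on one way.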

Improving energy efficiency of Embedded DRAM Caches for High-end Computing Systems

ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), Jun 21, 2014

With increasing system core counts, the size of the last-level cache (LLC) has increased, and since SRAM consumes high leakage power, LLC power consumption is becoming a significant fraction of total processor power. To address this, researchers have used embedded DRAM (eDRAM) LLCs, which consume little leakage power. However, eDRAM caches consume a significant amount of refresh energy. In this paper, we propose ESTEEM, an energy saving technique for embedded DRAM caches. ESTEEM uses dynamic cache reconfiguration to turn off a portion of the cache, saving both leakage and refresh energy. It logically divides the cache sets into multiple modules and turns off a possibly different number of ways in each module. Microarchitectural simulations confirm that ESTEEM improves performance and energy efficiency and provides better results than a recently proposed eDRAM cache energy saving technique, namely Refrint. For single- and dual-core simulations, the average energy saving in the memory subsystem (LLC + main memory) with ESTEEM is 25.8% and 32.6%, respectively, and the average weighted speedups are 1.09X and 1.22X, respectively. Additional experiments confirm that ESTEEM works well for a wide range of system parameters.

FlexiWay: A Cache Energy Saving Technique Using Fine-grained Cache Reconfiguration

31st IEEE International Conference on Computer Design (ICCD), Oct 7, 2013

Recent trends of CMOS scaling and the use of large last-level caches (LLCs) have led to a significant increase in the leakage energy consumption of LLCs; hence, managing their energy consumption has become extremely important in modern processor design. Conventional cache energy saving techniques require offline profiling or provide only a coarse granularity of cache allocation. We present FlexiWay, a cache energy saving technique that uses dynamic cache reconfiguration. FlexiWay logically divides the cache sets into multiple (e.g., 16) modules and dynamically turns off a suitable, possibly different number of cache ways in each module. FlexiWay has very small implementation overhead and provides fine-grained cache allocation even with caches of typical associativity, e.g., an 8-way cache. Microarchitectural simulations were performed using an x86-64 simulator and workloads from the SPEC2006 suite, and FlexiWay was compared with two conventional energy saving techniques. The results show that FlexiWay provides the largest energy saving and incurs only a small loss in performance. For single-, dual- and quad-core systems, the average energy savings using FlexiWay are 26.2%, 25.7% and 22.4%, respectively.
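
A minimal sketch of the module idea — all names and numbers invented, not the authors' implementation — shows how per-module way counts give finer-grained resizing than one global setting:

```python
# Module-granularity way shutdown in the spirit of FlexiWay: sets are
# grouped into modules, and each module can have a different number of
# ways powered off, so cache capacity shrinks in finer steps than a
# global "turn off k ways everywhere" scheme.

NUM_SETS, NUM_WAYS, NUM_MODULES = 64, 8, 16

# ways_off[m] = ways currently gated off in module m; a real controller
# would choose these values from runtime miss/energy counters.
ways_off = [0] * NUM_MODULES

def module_of(set_index):
    # Consecutive sets map to the same module (4 sets per module here).
    return set_index * NUM_MODULES // NUM_SETS

def active_ways(set_index):
    return NUM_WAYS - ways_off[module_of(set_index)]

# Example: gate two ways in module 0 and one way in module 5.
ways_off[0], ways_off[5] = 2, 1
```

Sets in module 0 now behave as 6-way, sets in module 5 as 7-way, and the rest remain 8-way, so effective capacity is tunable in module-sized steps.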

A Survey of Software Techniques for Using Non-Volatile Memories for Storage and Main Memory Systems

Non-volatile memory (NVM) devices, such as Flash, phase change RAM, spin transfer torque RAM, and resistive RAM, offer several advantages and challenges when compared to conventional memory technologies such as DRAM and magnetic hard disk drives (HDDs). In this paper, we present a survey of software techniques that have been proposed to exploit the advantages and mitigate the disadvantages of NVMs when used for designing memory systems, in particular secondary storage (e.g., solid-state drives) and main memory. We classify these software techniques along several dimensions to highlight their similarities and differences. Given that NVMs are growing in popularity, we believe that this survey will motivate further research in the field of software technology for NVMs.

EqualChance: Addressing Intra-set Write Variation to Increase Lifetime of Non-volatile Caches

To address the limitations of SRAM, such as high leakage and low density, researchers have explored the use of non-volatile memory (NVM) devices, such as ReRAM (resistive RAM) and STT-RAM (spin transfer torque RAM), for designing on-chip caches. A crucial limitation of NVMs, however, is that their write endurance is low, and the large intra-set write variation introduced by existing cache management policies may further exacerbate this problem, reducing cache lifetime significantly. We present EqualChance, a technique to increase cache lifetime by reducing intra-set write variation. EqualChance works by periodically changing the physical cache-block location of a write-intensive data item within a set to achieve wear-leveling. Simulations using workloads from the SPEC CPU2006 suite and the HPC (high-performance computing) field show that EqualChance improves cache lifetime by 4.29X. Its implementation overhead is small, and it incurs very little performance and energy loss.

A Survey of CPU-GPU Heterogeneous Computing Techniques

As both CPUs and GPUs become employed in a wide range of applications, it has been acknowledged that both of these processing units (PUs) have unique features and strengths; hence, CPU-GPU collaboration is inevitable for achieving high-performance computing. This has motivated a significant amount of research on heterogeneous computing techniques, along with the design of CPU-GPU fused chips and petascale heterogeneous supercomputers. In this paper, we survey heterogeneous computing techniques (HCTs), such as workload partitioning, that enable utilizing both CPUs and GPUs to improve performance and/or energy efficiency. We review heterogeneous computing approaches at the runtime, algorithm, programming, compiler, and application levels. Further, we review both discrete and fused CPU-GPU systems and discuss benchmark suites designed for evaluating heterogeneous computing systems (HCSs). We believe that this paper will provide researchers with insights into the working and scope of applications of HCTs and motivate them to further harness the computational powers of CPUs and GPUs to achieve the goal of exascale performance.

A Survey Of Architectural Approaches for Data Compression in Cache and Main Memory Systems

As the number of cores on a chip increases and key applications become even more data-intensive, memory systems in modern processors have to deal with increasingly large amounts of data. In the face of such challenges, data compression presents itself as a promising approach to increase effective memory system capacity while also providing performance and energy advantages. This paper presents a survey of techniques for using compression in cache and main memory systems. It also classifies the techniques based on key parameters to highlight their similarities and differences. It discusses compression in CPUs and GPUs, conventional and non-volatile memory (NVM) systems, and 2D and 3D memory systems. We hope that this survey will help researchers gain insight into the potential role of compression in the memory components of future extreme-scale systems.

Exploring Design Space of 3D NVM and eDRAM Caches Using DESTINY Tool

ORNL Technical Report number ORNL/TM-2014/636, 2014

[Download source code from: https://code.ornl.gov/3d_cache_modeling_tool/destiny] To enable the design of large caches, novel memory technologies (such as non-volatile memory) and novel fabrication approaches (e.g., 3D stacking) have been explored. Existing modeling tools, however, cover only a few memory technologies, CMOS technology nodes, and fabrication approaches. We present DESTINY, a tool for modeling 3D (and 2D) cache designs using SRAM, embedded DRAM (eDRAM), spin transfer torque RAM (STT-RAM), resistive RAM (ReRAM), and phase change RAM (PCM). DESTINY is very useful for performing design-space exploration across several dimensions, such as optimizing for a target (e.g., latency, area, or energy-delay product) for a given memory technology, or choosing the most suitable memory technology or fabrication method (i.e., 2D vs. 3D) for a given optimization target. DESTINY has been validated against several cache prototypes. We believe that DESTINY will boost studies of next-generation memory architectures used in systems ranging from mobile devices to extreme-scale supercomputers.

A Survey Of Architectural Approaches for Managing Embedded DRAM and Non-volatile On-chip Caches

Recent trends of CMOS scaling and the increasing number of on-chip cores have led to a large increase in the size of on-chip caches. Since SRAM has low density and consumes a large amount of leakage power, its use in designing on-chip caches has become more challenging. To address this issue, researchers are exploring the use of several emerging memory technologies, such as embedded DRAM, spin transfer torque RAM, resistive RAM, phase change RAM, and domain wall memory. In this paper, we survey the architectural approaches proposed for designing memory systems and, specifically, caches with these emerging memory technologies. To highlight their similarities and differences, we present a classification of these technologies and architectural approaches based on their key characteristics. We also briefly summarize the challenges in using these technologies for architecting caches. We believe that this survey will help readers gain insights into emerging memory device technologies and their potential use in designing future computing systems.

Dynamic software testing of MPI applications with umpire

Supercomputing Conference, 2000

As evidenced by the popularity of MPI (Message Passing Interface), message passing is an effective programming technique for managing coarse-grained concurrency on distributed computers. Unfortunately, debugging message-passing applications can be difficult. Software complexity, data races, and scheduling dependencies can make programming errors challenging to locate with manual, interactive debugging techniques. This article describes Umpire, a new tool for detecting programming errors…

An annotated bibliography of interactive program steering

SIGPLAN Notices, 1994

This annotated bibliography reviews current research in dynamic and interactive program steering. In particular, we review systems-related research addressing dynamic program steering, raising issues in operating and language systems; mechanisms and algorithms for dynamic program adaptation; program monitoring and the associated data storage techniques; and the design of dynamically steerable or adaptable programs. We define program steering as the capacity to control the execution of…

Scalable Software Transactional Memory for Global Address Space Architectures

This paper presents the challenges encountered in, and potential solutions to, designing scalable Software Transactional Memory (STM) for large-scale distributed memory systems with thousands of nodes. We introduce Global Transactional Memory (GTM), a generalized and scalable STM design supporting a dynamic programming model based on thread-level parallelism, Single Process Multiple Data (SPMD) parallelism, and remote procedure call…

Accuracy and performance of graphics processors: A Quantum Monte Carlo application case study

Evaluating high-performance computers

Concurrency and Computation: Practice and Experience, 2005

From interactive applications to distributed laboratories

Performance Technologies for Peta-Scale Systems: A White Paper Prepared by the Performance Evaluation Research Center

Future-looking high-end computing initiatives will deploy powerful, large-scale computing platforms that leverage novel component technologies for superior node performance in advanced system architectures with tens or even hundreds of thousands of nodes. Recent advances in performance tools and modeling methodologies suggest that it is feasible to acquire such systems intelligently and achieve excellent performance, while also significantly reducing the…

Exascale Hardware Architectures Working Group

Group Lead: Scott Hemmert. Participants and Contributors: Jim Ang, Brian Carnes, Patrick Chiang, Doug Doerfler, Sudip Dosanjh, Parks Fields, Ken Koch, Jim Laros, Matt Leininger, John Noe, Terri Quinn, Josep Torrellas, Jeff Vetter, Cheryl Wampler, Andy White…

Performance engineering: Understanding and improving the performance of large-scale codes

Performance evaluation of high-speed interconnects using dense communication patterns

Characterization of Scientific Workloads on Systems with Multi-Core Processors

2006 IEEE International Symposium on Workload Characterization, 2006

The submitted manuscript has been authored by a contractor of the US Government under Contract No. DE-AC05-00OR22725. Accordingly, the US Government retains a non-exclusive, royalty-free license to publish or reproduce the published form of this…

WriteSmoothing: Improving Lifetime of Non-volatile Caches Using Intra-set Wear-leveling

Driven by increasing core counts and the bandwidth wall, the size of last-level caches (LLCs) has greatly increased. Since SRAM consumes high leakage power, researchers have explored the use of non-volatile memories (NVMs) for designing caches, as they provide high density and consume little leakage power. However, since NVMs have low write endurance and existing cache management policies are unaware of write variation, effective wear-leveling techniques are required to achieve reasonable cache lifetimes with NVMs. We present WriteSmoothing, a technique for mitigating intra-set write variation in NVM caches. WriteSmoothing logically divides the cache sets into multiple modules. For each module, it collectively records the number of writes each way receives across the module's sets. It then periodically makes the most frequently written ways in a module unavailable, shifting write pressure to the other ways in the module's sets. Extensive simulations show that, on average, for single- and dual-core system configurations, WriteSmoothing improves cache lifetime by 2.17X and 2.75X, respectively. Its implementation overhead is small, and it works well for a wide range of algorithm and system parameters.
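
The mechanism described above can be sketched roughly as follows. This is a hedged illustration with invented names and parameters, not the paper's implementation: writes are counted per way, aggregated over all sets of a module, and the most-written ways are periodically made unavailable so that write pressure shifts to the remaining ways.

```python
# Toy per-module wear-leveling in the spirit of WriteSmoothing.

class Module:
    def __init__(self, num_ways=8, fraction_off=0.25):
        self.way_writes = [0] * num_ways   # writes per way, summed over sets
        self.unavailable = set()
        self.n_off = int(num_ways * fraction_off)

    def record_write(self, way):
        self.way_writes[way] += 1

    def rebalance(self):
        # Mark the most-written ways unavailable for the next interval.
        ranked = sorted(range(len(self.way_writes)),
                        key=self.way_writes.__getitem__, reverse=True)
        self.unavailable = set(ranked[:self.n_off])

    def candidate_ways(self):
        # Ways the replacement policy may allocate into this interval.
        return [w for w in range(len(self.way_writes))
                if w not in self.unavailable]

# Example: ways 0 and 3 absorb most writes, then a rebalance retires them.
m = Module()
for _ in range(10):
    m.record_write(0)
for _ in range(7):
    m.record_write(3)
m.rebalance()
```

After the rebalance, new allocations avoid the two hottest ways until the next interval, spreading wear across the module.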

LastingNVCache: A Technique for Improving the Lifetime of Non-volatile Caches

The use of NVM (non-volatile memory) devices such as ReRAM (resistive RAM) and STT-RAM (spin transfer torque RAM) for designing on-chip caches holds the promise of providing a high-density, low-leakage alternative to SRAM. However, the low write endurance of NVMs, along with the write variation introduced by existing cache management schemes, may significantly limit the lifetime of NVM caches. We present LastingNVCache, a technique for improving the lifetime of NVM caches by mitigating intra-set write variation. LastingNVCache works on the key idea that by periodically flushing a frequently written data item, the block can be made to load into a cold block in the set the next time it is fetched. Through this, future writes to that data item are redirected from a hot block to a cold block, which improves cache lifetime. Microarchitectural simulations show that LastingNVCache provides 6.36X, 9.79X, and 10.94X improvements in lifetime for single-, dual- and quad-core systems, respectively. Its implementation overhead is small, and it outperforms a recently proposed technique for improving the lifetime of NVM caches.