Joel Emer - Academia.edu

Papers by Joel Emer

A comparative study of arbitration algorithms for the Alpha 21364 pipelined router

Computer Architecture News, Oct 1, 2002

Interconnection networks usually consist of a fabric of interconnected routers, which receive packets arriving at their input ports and forward them to appropriate output ports. Unfortunately, network packets moving through these routers are often delayed due to conflicting demand for resources, such as output ports or buffer space. Hence, routers typically employ arbiters that resolve conflicting resource demands to maximize the number of matches between packets waiting at input ports and free output ports. Efficient design and implementation of the arbitration algorithm is critical to maximizing network performance. This paper proposes a new arbitration algorithm called SPAA (Simple Pipelined Arbitration Algorithm), which is implemented in the Alpha 21364 processor's on-chip router pipeline. Simulation results show that SPAA significantly outperforms two earlier well-known arbitration algorithms: PIM (Parallel Iterative Matching) and WFA (the Wave-Front Arbiter implemented in the SGI Spider switch). SPAA outperforms PIM and WFA because it exhibits matching capabilities similar to theirs under realistic conditions when many output ports are busy, incurs fewer clock cycles to perform the arbitration, and can be pipelined effectively. Additionally, we propose a new prioritization policy called the Rotary Rule, which prevents severe performance degradation when the network saturates at high loads by prioritizing packets already in the network over new packets generated by caches or memory.
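
To make the arbitration problem concrete, here is a minimal Python sketch of one request-grant-accept round of PIM, one of the baselines above; the data structures and random tie-breaking are illustrative, and SPAA itself is a simpler, pipeline-friendly hardware algorithm rather than this software loop.

```python
import random

def pim_round(requests, free_outputs):
    """One request-grant-accept round of Parallel Iterative Matching (PIM).

    requests: dict mapping input port -> set of desired output ports
    free_outputs: set of output ports that are currently free
    Returns {input: output} matches made this round.
    """
    # Grant phase: each free output randomly grants one requesting input.
    grants = {}  # output -> input
    for out in free_outputs:
        requesters = [i for i, outs in requests.items() if out in outs]
        if requesters:
            grants[out] = random.choice(requesters)

    # Accept phase: each input randomly accepts one of its grants.
    matches = {}
    for inp in requests:
        granted = [out for out, i in grants.items() if i == inp]
        if granted:
            matches[inp] = random.choice(granted)
    return matches

# Example: three inputs contending for two free outputs.
print(pim_round({0: {0, 1}, 1: {0}, 2: {1}}, {0, 1}))
```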

Optimizing Compression Schemes for Parallel Sparse Tensor Algebra

2023 Data Compression Conference (DCC)

Advanced Technologies

Synthesis Lectures on Computer Architecture, 2020

Hierarchical circuit integrated cache memory

An Architectural Perspective on Soft Errors From Cosmic Radiation

Multiplying Alpha Performance

Sparseloop: An Analytical Approach To Sparse Tensor Accelerator Modeling

2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)

In recent years, many accelerators have been proposed to efficiently process sparse tensor algebra applications (e.g., sparse neural networks). However, these proposals are single points in a large and diverse design space, and the lack of systematic description and modeling support for sparse tensor accelerators keeps hardware designers from exploring that space efficiently and effectively. This paper first presents a unified taxonomy to systematically describe the diverse sparse tensor accelerator design space. Based on the proposed taxonomy, it then introduces Sparseloop, the first fast, accurate, and flexible analytical modeling framework to enable early-stage evaluation and exploration of sparse tensor accelerators. Sparseloop comprehends a large set of architecture specifications, including various dataflows and sparse acceleration features (e.g., elimination of zero-based compute). Using these specifications, Sparseloop evaluates a design's processing speed and energy efficiency, accounting for the data movement and compute incurred by the employed dataflow as well as the savings and overhead introduced by the sparse acceleration features, using stochastic density models. Across representative accelerator designs and workloads, Sparseloop achieves over 2000× faster modeling speed than cycle-level simulations, maintains relative performance trends, and achieves 0.1% to 8% average error. The paper also presents example use cases of Sparseloop in different accelerator design flows to reveal important design insights.
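
As a rough illustration of the stochastic density models mentioned above, the sketch below (my construction, not Sparseloop's API) estimates effectual compute and operand traffic under a uniform-random density assumption with a zero-skipping acceleration feature:

```python
def expected_costs(dense_macs, density_a, density_b, skip=True):
    """Toy stochastic density model in the spirit of Sparseloop.

    Under a uniform-random density model, an operand pair is
    (nonzero, nonzero) with probability density_a * density_b.
    With zero-skipping hardware, only those pairs cost a MAC.
    """
    p_effectual = density_a * density_b
    macs = dense_macs * (p_effectual if skip else 1.0)
    # Compressed operands also shrink data movement proportionally.
    a_reads = dense_macs * density_a
    b_reads = dense_macs * density_b
    return {"macs": macs, "a_reads": a_reads, "b_reads": b_reads}

# 1M dense MACs at 10% / 30% operand density: ~30k effectual MACs.
print(expected_costs(1_000_000, 0.10, 0.30))
```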

The Sparse Abstract Machine

We propose the Sparse Abstract Machine (SAM), an abstract machine model for targeting sparse tensor algebra to reconfigurable and fixed-function spatial dataflow accelerators. SAM defines a streaming dataflow abstraction with sparse primitives that encompass a large space of scheduled tensor algebra expressions. SAM dataflow graphs naturally separate tensor formats from algorithms and are expressive enough to incorporate arbitrary iteration orderings and many hardware-specific optimizations. We also present Custard, a compiler from a high-level language to SAM that demonstrates SAM's usefulness as an intermediate representation. We automatically bind from SAM to a streaming dataflow simulator. We evaluate the generality and extensibility of SAM, explore the performance space of sparse tensor algebra optimizations using SAM, and show SAM's ability to represent dataflow hardware.
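
One of SAM's core sparse primitives is stream intersection, which underlies sparse elementwise multiplication. A minimal software sketch, assuming sorted (coordinate, value) streams as the compressed representation:

```python
def intersect(stream_a, stream_b):
    """Illustrative SAM-style intersecter: merge two sorted
    (coordinate, value) streams, emitting only coordinates present
    in both -- the core primitive behind sparse elementwise multiply.
    """
    ia, ib, out = 0, 0, []
    while ia < len(stream_a) and ib < len(stream_b):
        ca, va = stream_a[ia]
        cb, vb = stream_b[ib]
        if ca == cb:
            out.append((ca, va * vb))
            ia, ib = ia + 1, ib + 1
        elif ca < cb:
            ia += 1      # coordinate missing from b: no output
        else:
            ib += 1      # coordinate missing from a: no output
    return out

print(intersect([(0, 2.0), (3, 1.5), (7, 4.0)],
                [(3, 2.0), (5, 1.0), (7, 0.5)]))  # [(3, 3.0), (7, 2.0)]
```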

Fractal

Computer Architecture News, Jun 24, 2017

Most systems that support speculative parallelization, like hardware transactional memory (HTM), do not support nested parallelism. This sacrifices substantial parallelism and precludes composing parallel algorithms. The few HTMs that do support nested parallelism focus on parallelizing at the coarsest (shallowest) levels, incurring large overheads that squander most of their potential. We present FRACTAL, a new execution model that supports unordered and timestamp-ordered nested parallelism. FRACTAL lets programmers seamlessly compose speculative parallel algorithms, and lets the architecture exploit parallelism at all levels. FRACTAL can parallelize a broader range of applications than prior speculative execution models. We design a FRACTAL implementation that extends the Swarm architecture and focuses on parallelizing at the finest (deepest) levels. Our approach sidesteps the issues of nested parallel HTMs and uncovers abundant fine-grain parallelism. As a result, FRACTAL outperforms prior speculative architectures by up to 88× at 256 cores.
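
FRACTAL's nested domains can be pictured with hierarchical timestamps: spawning into a deeper domain appends a timestamp component, and lexicographic order respects the nesting. The toy sequential scheduler below is only a conceptual illustration (FRACTAL executes such tasks speculatively in parallel in hardware):

```python
import heapq

def run(root_tasks):
    """Toy scheduler for hierarchically timestamped nested tasks: each
    task's timestamp is a tuple, spawning into a nested domain appends a
    component, and running tasks in lexicographic timestamp order
    respects the nesting."""
    heap = list(root_tasks)
    heapq.heapify(heap)
    while heap:
        ts, fn = heapq.heappop(heap)
        for child in fn(ts):          # a task may spawn child tasks
            heapq.heappush(heap, child)

def parent(ts):
    print("parent", ts)
    # Children run in a nested domain: same prefix, one extra component.
    return [(ts + (i,), leaf) for i in range(2)]

def leaf(ts):
    print("leaf", ts)
    return []

run([((0,), parent), ((1,), leaf)])   # parent (0,), leaves (0,0) (0,1) (1,)
```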

A-Port Networks

ACM Transactions on Reconfigurable Technology and Systems, Sep 1, 2009

Computer architects need to run cycle-accurate performance models of processors orders of magnitude faster. We discuss why the speedup on traditional multicores is limited, and why FPGAs represent a good vehicle for achieving a dramatic performance improvement over software models. This article introduces A-Port Networks, a simulation scheme designed to expose the fine-grained parallelism inherent in performance models and exploit it efficiently using FPGAs.
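
The essence of an A-Port is a channel with a fixed model latency that always has a token available to read, so connected modules can simulate in any order, or concurrently, without losing cycle accuracy. A minimal software sketch (the real implementation is FPGA hardware; class and method names here are illustrative):

```python
from collections import deque

class APort:
    """Sketch of an A-Port: a channel with fixed model latency,
    implemented as a FIFO pre-filled with `latency` empty tokens.
    Because a port always holds something to read, connected modules
    can run decoupled yet stay cycle-accurate."""
    def __init__(self, latency):
        self.fifo = deque([None] * latency)   # None = "no message" bubble

    def write(self, msg):      # called once per model cycle by producer
        self.fifo.append(msg)

    def read(self):            # called once per model cycle by consumer
        return self.fifo.popleft()

# A 2-cycle port: a message sent at model cycle t arrives at cycle t+2.
port = APort(latency=2)
for cycle, msg in enumerate(["a", "b", "c"]):
    port.write(msg)
    print(cycle, port.read())  # cycles 0,1 read None; cycle 2 reads "a"
```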

(FPL 2015) Scavenger

ACM Transactions on Reconfigurable Technology and Systems, Mar 22, 2017

High-level abstractions separate algorithm design from platform implementation, allowing programmers to focus on algorithms while building complex systems. This separation also gives system programmers and compilers an opportunity to optimize platform services on an application-by-application basis. In field-programmable gate arrays (FPGAs), platform-level malleability extends to the memory system: unlike general-purpose processors, in which memory hardware is fixed at design time, the capacity, associativity, and topology of FPGA memory systems may all be tuned to improve application performance. Since application kernels may explicitly use only a few memory resources, substantial memory capacity may be available to the platform for use on behalf of the user program. In this work, we present Scavenger, which utilizes spare resources to construct program-optimized memories, and we perform an initial exploration of methods for automating the construction of these application-specific memory hierarchies. Although exploiting spare resources can be beneficial, naïvely consuming all memory resources may cause frequency degradation. To relieve timing pressure in large block RAM (BRAM) structures, we provide microarchitectural techniques that trade memory latency for design frequency. We demonstrate, across a set of benchmarks, that our scalable cache microarchitecture achieves performance gains of 7% to 74% (26% geometric mean) over the baseline cache microarchitecture when scaling first-level caches to the maximum size.
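
A back-of-envelope sketch of the premise, with every parameter and the pipelining heuristic being illustrative assumptions rather than Scavenger's actual policy: block RAM left unused by the kernel can back a program-optimized cache, at the cost of extra pipeline stages to hold frequency as the structure grows.

```python
def plan_scavenged_cache(total_brams, kernel_brams, bram_kbits=18,
                         line_bytes=64, reserve=0.10):
    """Illustrative sizing of a cache built from spare block RAM.
    All numbers and the latency heuristic are assumptions for the
    sketch, not Scavenger's real algorithm."""
    spare = int((total_brams - kernel_brams) * (1 - reserve))
    capacity_bytes = spare * bram_kbits * 1024 // 8
    lines = capacity_bytes // line_bytes
    # Larger scavenged structures need extra pipeline stages to hold
    # frequency: the latency-for-MHz tradeoff the paper describes.
    extra_stages = max(0, spare.bit_length() - 5)
    return {"spare_brams": spare, "cache_bytes": capacity_bytes,
            "lines": lines, "extra_pipeline_stages": extra_stages}

print(plan_scavenged_cache(total_brams=1280, kernel_brams=200))
```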

Late-binding

Conventional load/store queues (LSQs) are an impediment both to power-efficient execution in superscalar processors and to scaling to large-window designs. In this paper, we propose techniques to improve the area and power efficiency of LSQs by allocating entries when instructions issue ("late binding"), rather than when they are dispatched. This approach enables lower occupancy and thus smaller LSQs. Efficient implementations of late-binding LSQs, however, require the entries in the LSQ to be unordered with respect to age. In this paper, we show how to provide full LSQ functionality in an unordered design with only small additional complexity and negligible performance losses. We show that late-binding, unordered LSQs work well for small-window superscalar processors, but can also be scaled effectively to large, kilo-window processors by breaking the LSQs into address-interleaved banks. To handle the increased overflows, we apply classic network flow-control techniques to the processor micronetworks, enabling low-overhead recovery mechanisms for bank overflows. We evaluate three such mechanisms: instruction replay, skid buffers, and virtual-channel buffering in the on-chip memory network. We show that for an 80-instruction window, the LSQ can be reduced to 32 entries. For a 1024-instruction window, the unordered, late-binding LSQ works well with four banks of 48 entries each. By also applying a Bloom filter, this design achieves full hardware memory disambiguation for a 1024-instruction window while requiring low average power, of 8 and 12 CAM entries per load and store access, respectively.
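
A small sketch of the Bloom-filter-assisted disambiguation step mentioned above: loads probe a filter of in-flight store addresses and search the unordered, banked store queue only on a hit. Filter size, hashing, and line granularity here are illustrative assumptions, not the paper's design:

```python
class StoreBloomFilter:
    """Sketch of Bloom-filter-assisted memory disambiguation."""
    def __init__(self, bits=256, seeds=(1, 2)):
        self.bits, self.seeds, self.vector = bits, seeds, 0

    def _positions(self, addr):
        line = addr >> 6                  # track cache-line granularity
        return [hash((line, s)) % self.bits for s in self.seeds]

    def insert(self, addr):               # on store issue
        for p in self._positions(addr):
            self.vector |= 1 << p

    def may_conflict(self, addr):         # on load issue
        # False -> definitely no in-flight store to this line,
        # so the expensive CAM search of the LSQ banks is skipped.
        return all(self.vector >> p & 1 for p in self._positions(addr))

bf = StoreBloomFilter()
bf.insert(0x1000)
print(bf.may_conflict(0x1008))  # True: same line, must search the LSQ
print(bf.may_conflict(0x9F40))  # almost surely False: skip the search
```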

A comparative study of arbitration algorithms for the Alpha 21364 pipelined router

SIGPLAN Notices, Oct 1, 2002

Interconnection networks usually consist of a fabric of interconnected routers, which receive packets arriving at their input ports and forward them to appropriate output ports. Unfortunately, network packets moving through these routers are often delayed due to conflicting demand for resources, such as output ports or buffer space. Hence, routers typically employ arbiters that resolve conflicting resource demands to maximize the number of matches between packets waiting at input ports and free output ports. Efficient design and implementation of the arbitration algorithm is critical to maximizing network performance. This paper proposes a new arbitration algorithm called SPAA (Simple Pipelined Arbitration Algorithm), which is implemented in the Alpha 21364 processor's on-chip router pipeline. Simulation results show that SPAA significantly outperforms two earlier well-known arbitration algorithms: PIM (Parallel Iterative Matching) and WFA (the Wave-Front Arbiter implemented in the SGI Spider switch). SPAA outperforms PIM and WFA because it exhibits matching capabilities similar to theirs under realistic conditions when many output ports are busy, incurs fewer clock cycles to perform the arbitration, and can be pipelined effectively. Additionally, we propose a new prioritization policy called the Rotary Rule, which prevents severe performance degradation when the network saturates at high loads by prioritizing packets already in the network over new packets generated by caches or memory.
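
The Rotary Rule's intent can be sketched in a few lines: when packets contend for an output port, those already traversing the network beat freshly injected ones, so the network drains under load instead of saturating. The age-based tie-break below is an illustrative assumption:

```python
def pick_winner(candidates):
    """Illustrative Rotary-Rule-style prioritization: in-network packets
    (hops > 0) take priority over newly injected ones; oldest wins ties.
    """
    in_network = [p for p in candidates if p["hops"] > 0]
    pool = in_network if in_network else candidates
    return max(pool, key=lambda p: p["age"])

packets = [
    {"id": "new-from-cache", "hops": 0, "age": 9},
    {"id": "in-flight",      "hops": 3, "age": 4},
]
print(pick_winner(packets)["id"])   # -> "in-flight"
```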

LoopTree: Enabling Exploration of Fused-layer Dataflow Accelerators

2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

RAELLA: Reforming the Arithmetic for Efficient, Low-Resolution, and Low-Loss Analog PIM: No Retraining Required!

Proceedings of the 50th Annual International Symposium on Computer Architecture

Processing-In-Memory (PIM) accelerators have the potential to efficiently run Deep Neural Network (DNN) inference by reducing costly data movement and by using resistive RAM (ReRAM) for efficient analog compute. Unfortunately, overall PIM accelerator efficiency is limited by energy-intensive analog-to-digital converters (ADCs). Furthermore, existing accelerators that reduce ADC cost do so by changing DNN weights or by using low-resolution ADCs that reduce output fidelity. These strategies harm DNN accuracy and/or require costly DNN retraining to compensate. To address these issues, we propose the RAELLA architecture. RAELLA adapts the architecture to each DNN; it lowers the resolution of computed analog values by encoding weights to produce near-zero analog values, adaptively slicing weights for each DNN layer, and dynamically slicing inputs through speculation and recovery. Low-resolution analog values allow RAELLA to both use efficient low-resolution ADCs and maintain accuracy without retraining, all while performing fewer ADC conversions. Compared to other low-accuracy-loss PIM accelerators, RAELLA increases energy efficiency by up to 4.9× and throughput by up to 3.3×. Compared to PIM accelerators that cause accuracy loss and retrain DNNs to recover, RAELLA achieves similar efficiency and throughput without expensive DNN retraining.
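
For background on the slicing idea, here is conventional weight bit-slicing for analog PIM, sketched in Python with unsigned weights for simplicity; RAELLA's contribution is choosing slicings adaptively per layer and encoding weights so analog sums land near zero, which this sketch does not capture:

```python
import numpy as np

def slice_weights(weights, bits_per_slice=2, total_bits=8):
    """Split each (unsigned, for simplicity) 8-bit weight into 2-bit
    slices, as if stored on separate low-resolution ReRAM columns."""
    n_slices = total_bits // bits_per_slice
    mask = (1 << bits_per_slice) - 1
    return [(weights >> (i * bits_per_slice)) & mask
            for i in range(n_slices)]

def sliced_dot(weights, inputs, bits_per_slice=2):
    # Each slice's analog dot product spans a small range (operands are
    # 0..3), so a low-resolution ADC can digitize it; digital
    # shift-and-add then reassembles the full-precision result.
    return sum((s @ inputs) << (i * bits_per_slice)
               for i, s in enumerate(slice_weights(weights, bits_per_slice)))

w = np.array([200, 17, 98], dtype=np.int64)
x = np.array([1, 2, 3], dtype=np.int64)
assert sliced_dot(w, x) == w @ x     # slicing is exact
print(sliced_dot(w, x))
```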

The Sparse Abstract Machine

Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3

We propose the Sparse Abstract Machine (SAM), an abstract machine model for targeting sparse tensor algebra to reconfigurable and fixed-function spatial dataflow accelerators. SAM defines a streaming dataflow abstraction with sparse primitives that encompass a large space of scheduled tensor algebra expressions. SAM dataflow graphs naturally separate tensor formats from algorithms and are expressive enough to incorporate arbitrary iteration orderings and many hardware-specific optimizations. We also present Custard, a compiler from a high-level language to SAM that demonstrates SAM's usefulness as an intermediate representation. We automatically bind from SAM to a streaming dataflow simulator. We evaluate the generality and extensibility of SAM, explore the performance space of sparse tensor algebra optimizations using SAM, and show SAM's ability to represent dataflow hardware.

Overview of Deep Neural Networks

Synthesis Lectures on Computer Architecture, 2020

Designing Efficient DNN Models

Asim: A Performance Model Framework (IEEE Computer)

…such machines. Asim addresses these needs by providing a framework for creating many models, instead of being a single performance model. More specifically, Asim achieves these goals through modularity and reusability. Modularity helps break down the performance-modeling problem into individual pieces that can be modeled separately, while reusability allows a software component to be used repeatedly in different contexts. Reusability increases productivity and confidence in the robustness of the software component itself. Asim provides a set of tools that can effectively manage these software components to help model writers deal with a large software base's complexity.

BASIC COMPONENTS

In Asim, the basic software component, or module, usually represents a physical component of a design, such as a cache, or captures a hardware algorithm's operation, such as the cache's replacement policy. A particular model is represented as a user-selected hierarchy of modules.
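
A minimal sketch of that modularity, with illustrative names: a physical-component module (a cache) composes with a swappable algorithm module (its replacement policy), so a new model is built by swapping modules rather than rewriting the cache:

```python
import random

class ReplacementPolicy:
    """A swappable algorithmic module, as in Asim's cache example."""
    def victim(self, last_use):           # last_use: timestamp per way
        raise NotImplementedError

class LRU(ReplacementPolicy):
    def victim(self, last_use):
        return last_use.index(min(last_use))   # least recently used way

class Rand(ReplacementPolicy):
    def victim(self, last_use):
        return random.randrange(len(last_use))

class Cache:
    """A physical-component module composed with an algorithm module.
    Swapping the policy yields a new model without touching Cache."""
    def __init__(self, ways, policy):
        self.last_use, self.policy = [0] * ways, policy

    def evict(self):
        return self.policy.victim(self.last_use)

baseline = Cache(4, LRU())    # one model configuration
variant  = Cache(4, Rand())   # same cache, different policy module
print(baseline.evict(), variant.evict())
```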

A 0.11 pJ/op, 0.32–128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Accelerator Designed with a High-Productivity VLSI Methodology

2019 IEEE Hot Chips 31 Symposium (HCS), 2019
