Jesse Beu - Academia.edu

Papers by Jesse Beu

Compressing Language Models using Doped Kronecker Products

arXiv (Cornell University), Jan 24, 2020

Kronecker Products (KP) have been used to compress IoT RNN applications by 15-38x compression factors, achieving better results than traditional compression methods. However, when KP is applied to large Natural Language Processing tasks, it leads to significant accuracy loss (approx. 26%). This paper proposes a way to recover accuracy otherwise lost when applying KP to large NLP tasks, by allowing additional degrees of freedom in the KP matrix. More formally, we propose doping, a process of adding an extremely sparse overlay matrix on top of the pre-defined KP structure. We call this compression method doped Kronecker product compression. To train these models, we present a new solution to the phenomenon of co-matrix adaptation (CMA), which uses a new regularization scheme called co-matrix dropout regularization (CMR). We present experimental results that demonstrate compression of a large language model with LSTM layers of size 25 MB by 25× with 1.4% loss in perplexity score. At 25× compression, an equivalent pruned network leads to 7.9% loss in perplexity score, while HMD and LMF lead to 15% and 27% loss in perplexity score, respectively.
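
A minimal NumPy sketch of the doped Kronecker product idea described above; the shapes, sparsity level and random values are illustrative assumptions, not the authors' configuration.

```python
# Doped Kronecker product weight: W ≈ kron(A, B) + S, with S extremely sparse.
# Shapes and the ~1% sparsity are illustrative, not the authors' settings.
import numpy as np

rng = np.random.default_rng(0)

A = rng.standard_normal((8, 8))                      # Kronecker factors forming a 64x64 weight
B = rng.standard_normal((8, 8))
S = rng.standard_normal((64, 64)) * (rng.random((64, 64)) < 0.01)   # sparse overlay

W = np.kron(A, B) + S                                # doped KP weight
x = rng.standard_normal(64)
y = W @ x                                            # forward pass through the layer

params = A.size + B.size + np.count_nonzero(S)       # vs. 4096 for a dense 64x64 matrix
print(y.shape, params, W.size)
```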

Ternary MobileNets via Per-Layer Hybrid Filter Banks

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020

The MobileNets family of computer vision neural networks has fueled tremendous progress in the design and organization of resource-efficient architectures in recent years. New applications with stringent real-time requirements on highly constrained devices require further compression of MobileNets-like compute-efficient networks. Model quantization is a widely used technique to compress and accelerate neural network inference, and prior works have quantized MobileNets to 4-6 bits, albeit with a modest to significant drop in accuracy. While quantization to sub-byte values (i.e., precision ≤ 8 bits) has been valuable, even further quantization of MobileNets to binary or ternary values is necessary to realize significant energy savings and possibly runtime speedups on specialized hardware, such as ASICs and FPGAs. Under the key observation that convolutional filters at each layer of a deep neural network may respond differently to ternary quantization, we propose a novel quantization method that generates per-layer hybrid filter banks consisting of full-precision and ternary weight filters for MobileNets. Using this proposed quantization method, we quantize a substantial portion of the weight filters of MobileNets to ternary values, resulting in 27.98% savings in energy and a 51.07% reduction in model size, while achieving comparable accuracy and no degradation in throughput on specialized hardware in comparison to the baseline full-precision MobileNets. Finally, we demonstrate the generalizability and effectiveness of hybrid filter banks to other neural network architectures.
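
A hedged sketch of what a per-layer hybrid filter bank could look like; the threshold-based ternarization and the 75% ternary split are illustrative assumptions, not the quantization recipe from the paper.

```python
# Hybrid filter bank sketch: keep some filters in full precision and quantize
# the rest to ternary values {-s, 0, +s} with a per-filter scale s.
import numpy as np

def ternarize(filt, threshold_ratio=0.7):
    """Ternary quantization of one filter: zero small weights, map the rest
    to +/- the mean magnitude of the kept weights (an illustrative scheme)."""
    t = threshold_ratio * np.abs(filt).mean()
    mask = np.abs(filt) > t
    scale = np.abs(filt[mask]).mean() if mask.any() else 0.0
    return scale * np.sign(filt) * mask

rng = np.random.default_rng(0)
filters = rng.standard_normal((32, 3, 3, 3))     # (out_ch, in_ch, kH, kW)

ternary_fraction = 0.75                          # illustrative split, not from the paper
n_ternary = int(ternary_fraction * len(filters))
hybrid = np.stack(
    [ternarize(f) if i < n_ternary else f for i, f in enumerate(filters)]
)
print(hybrid.shape, np.unique(np.sign(hybrid[:n_ternary])).size)
```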

Understanding the Impact of Dynamic Channel Pruning on Conditionally Parameterized Convolutions

Proceedings of the 2nd International Workshop on Challenges in Artificial Intelligence and Machine Learning for Internet of Things, 2020

Recent trends have shown that deep learning models have become larger and more accurate at an increased computational cost, making them difficult to deploy for latency-constrained applications. Conditional execution methods address this increase in cost while preserving the accuracy of the model by only performing the most important computation on important features. In this paper, we analyze a recent method, Feature Boosting and Suppression (FBS), which dynamically assesses which channels contain the most important input-dependent features and prunes the others based on a runtime threshold gating mechanism. FBS is able to dynamically prune convolution filters with little loss in accuracy at conservative pruning rates. However, at aggressive pruning rates FBS suffers from heavy accuracy loss, in a similar way to aggressive static pruning, due to a low number of active filters and a correspondingly narrower effective network per inference. Conditionally parameterized convolutions (CondConv) is another work in the conditional execution domain that increases performance at small computation cost through the generation of input-dependent expert filters. We discover that by substituting standard convolutional filters with input-specific filters, as described in CondConv, we enable FBS to address this accuracy loss, because CondConv expert filters are able to bolster the narrower network's reduced capacity and capture distinct features from various classes in a single composite filter as needed, making aggressive pruning appealing again. We test our FBS-pruned CondConv model on CIFAR-10 with a custom 9-layer CNN and demonstrate that we can achieve up to 47.2% savings in computational costs at iso-accuracy and a 1.01% improvement in accuracy at iso-computational cost over the state-of-the-art FBS technique.
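
A hedged NumPy sketch of how the two mechanisms could compose: CondConv-style expert mixing builds an input-specific filter, while an FBS-style saliency predictor selects which output channels to compute. All shapes, gating functions and names here are illustrative assumptions, not code from either paper.

```python
# Combined dynamic-execution sketch: CondConv expert mixing + FBS channel gating.
import numpy as np

rng = np.random.default_rng(0)
C_in, C_out, K, experts = 16, 32, 3, 4

expert_filters = rng.standard_normal((experts, C_out, C_in, K, K))
route_w = rng.standard_normal((C_in, experts))      # CondConv-style routing weights
saliency_w = rng.standard_normal((C_in, C_out))     # FBS-style channel-saliency predictor

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dynamic_layer(x, keep_ratio=0.5):
    """x: (C_in, H, W). Returns the composite filter and the channels to compute."""
    ctx = x.mean(axis=(1, 2))                        # global average pool
    mix = softmax(ctx @ route_w)                     # per-input expert weights
    filters = np.tensordot(mix, expert_filters, 1)   # input-specific composite filter
    saliency = np.maximum(ctx @ saliency_w, 0.0)     # predicted channel importance
    k = int(keep_ratio * C_out)
    active = np.argsort(saliency)[-k:]               # only these output channels run
    return filters, active

filters, active = dynamic_layer(rng.standard_normal((C_in, 8, 8)))
print(filters.shape, sorted(active)[:5])
```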

Run-Time Efficient RNN Compression for Inference on Edge Devices

2019 2nd Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), 2019

Recurrent neural networks can be large and compute-intensive, yet many applications that benefit from RNNs run on small devices with very limited compute and storage capabilities while still having run-time constraints. As a result, there is a need for compression techniques that can achieve significant compression without negatively impacting inference run-time and task accuracy. This paper explores a new compressed RNN cell implementation called Hybrid Matrix Decomposition (HMD) that achieves this dual objective. This scheme divides the weight matrix into two parts: an unconstrained upper half and a lower half composed of rank-1 blocks. This results in output features where the upper sub-vector has "richer" features while the lower sub-vector has "constrained" features. HMD can compress RNNs by a factor of 2-4× while having a faster run-time than pruning (Zhu & Gupta, 2017) and retaining more model accuracy than matrix factorization (Grachev et al., 2017). We evaluate this technique on 5 benchmarks spanning 3 different applications, illustrating its generality in the domain of edge computing.
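
A sketch, under stated assumptions, of an HMD-style weight layout: a dense upper half plus a lower half tiled with rank-1 blocks. The block size and tiling direction are illustrative choices, not taken from the paper.

```python
# HMD-style weight: dense upper half, lower half built from rank-1 blocks u v^T.
import numpy as np

rng = np.random.default_rng(0)
n = 64                                    # square weight matrix for simplicity
upper = rng.standard_normal((n // 2, n))  # unconstrained (dense) upper half

block = 16                                # width of each rank-1 block (illustrative)
blocks = []
for _ in range(0, n, block):              # tile the lower half column-wise
    u = rng.standard_normal((n // 2, 1))
    v = rng.standard_normal((1, block))
    blocks.append(u @ v)                  # each tile is rank-1
lower = np.hstack(blocks)                 # (n/2, n), piecewise rank-1

W = np.vstack([upper, lower])             # full HMD weight
x = rng.standard_normal(n)
y = W @ x                                 # "richer" upper features, "constrained" lower ones

dense_params = n * n
hmd_params = upper.size + (n // 2 + block) * (n // block)
print(y.shape, dense_params, hmd_params)  # rough compression from the lower half
```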

Doping: A technique for efficient compression of LSTM models using sparse structured additive matrices

ArXiv, 2021

Structured matrices, such as those derived from Kronecker products (KP), are effective at compressing neural networks, but can lead to unacceptable accuracy loss when applied to large models. In this paper, we propose the notion of doping: the addition of an extremely sparse matrix to a structured matrix. Doping facilitates additional degrees of freedom for a small number of parameters, allowing them to independently diverge from the fixed structure. To train LSTMs with doped structured matrices, we introduce the additional parameter matrix while slowly annealing its sparsity level. However, we find that performance degrades as we slowly sparsify the doping matrix, due to co-matrix adaptation (CMA) between the structured and the sparse matrices. We address this over-dependence on the sparse matrix using a co-matrix dropout regularization (CMR) scheme. We provide empirical evidence to show that doping, CMA and CMR are concepts generally applicable to multiple structured matrices (Kronecke...
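
A heavily hedged sketch of the co-matrix dropout idea: randomly dropping the sparse doping term during training so the structured part cannot grow over-dependent on it. The dropout placement and probability are assumptions, not the paper's training recipe.

```python
# CMR-style sketch: occasionally zero the sparse term's contribution in training.
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
S = rng.standard_normal((64, 64)) * (rng.random((64, 64)) < 0.01)   # sparse doping matrix

def doped_forward(x, training=False, p_drop=0.3):
    structured = np.kron(A, B) @ x           # structured (Kronecker) contribution
    sparse = S @ x                           # sparse doping contribution
    if training and rng.random() < p_drop:
        sparse = np.zeros_like(sparse)       # co-matrix dropout (illustrative placement)
    return structured + sparse

y_train = doped_forward(rng.standard_normal(64), training=True)
y_eval = doped_forward(rng.standard_normal(64))
print(y_train.shape, y_eval.shape)
```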

Measuring scheduling efficiency of RNNs for NLP applications

ArXiv, 2019

Recurrent neural networks (RNNs) have shown state-of-the-art results for speech recognition, natural language processing, image captioning and video summarizing applications. Many of these applications run on low-power platforms, so their energy efficiency is extremely important. We observed that cache-oblivious RNN scheduling during inference typically results in 30-50x more data transferred on and off the CPU than the application's working set size. This can potentially impact its energy efficiency. This paper presents a new metric called Data Reuse Efficiency (DRE) to gauge the RNN scheduling efficiency of a platform and shows the factors that influence the DRE value. Additionally, this paper discusses an optimization to improve reuse in RNNs and highlights the positive impact of this optimization on the total amount of memory read from or written to the memory controller (and, hence, the DRE value) during the execution of an RNN application on a mobile SoC.
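
One plausible reading of the Data Reuse Efficiency (DRE) metric, sketched below as the ratio of working-set size to bytes actually moved across the memory controller; the paper's exact definition may differ, and the numbers are illustrative.

```python
# Assumed DRE definition: working-set bytes / bytes transferred (closer to 1.0 is better).
def data_reuse_efficiency(working_set_bytes, bytes_transferred):
    """Small values mean the same data is re-fetched many times
    (cache-oblivious scheduling); 1.0 would be ideal reuse."""
    return working_set_bytes / bytes_transferred

# Example consistent with the 30-50x observation quoted in the abstract:
working_set = 2 * 1024 * 1024            # 2 MB of weights + activations (illustrative)
traffic = 40 * working_set               # measured DRAM traffic, 40x the working set
print(f"DRE = {data_reuse_efficiency(working_set, traffic):.3f}")   # ~0.025
```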

Compressing RNNs for IoT devices by 15-38x using Kronecker Products

ArXiv, 2019

Recurrent Neural Networks (RNNs) can be large and compute-intensive, making them hard to deploy on resource-constrained devices. As a result, there is a need for a compression technique that can significantly compress recurrent neural networks without negatively impacting task accuracy. This paper introduces a method to compress RNNs for resource-constrained environments using Kronecker products. We call RNNs compressed using Kronecker products Kronecker product Recurrent Neural Networks (KPRNNs). KPRNNs can compress the LSTM [22], GRU [9] and parameter-optimized FastRNN [30] layers by 15-38x with minor loss in accuracy and can act as an in-place replacement for most RNN cells in existing applications. By quantizing the Kronecker compressed networks to 8 bits, we further push the compression factor to 50x. We compare the accuracy and runtime of KPRNNs with other state-of-the-art compression techniques across 5 benchmarks spanning 3 different applications, showing its generality. A...
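
A small NumPy sketch of why Kronecker-product layers are both compact and fast: the matrix-vector product (A ⊗ B)x can be computed from the factors without materializing the full weight matrix. Shapes are illustrative.

```python
# Kronecker-product layer: store only the factors A and B, never the dense W.
import numpy as np

def kron_matvec(A, B, x):
    """Compute (A ⊗ B) @ x without forming the full Kronecker product."""
    m, n = A.shape
    p, q = B.shape
    X = x.reshape(n, q)                      # row-major reshape matches np.kron's layout
    return (A @ X @ B.T).reshape(m * p)

rng = np.random.default_rng(0)
A = rng.standard_normal((16, 16))            # 256 parameters
B = rng.standard_normal((16, 16))            # 256 parameters, vs. 65536 for dense W
x = rng.standard_normal(16 * 16)

assert np.allclose(kron_matvec(A, B, x), np.kron(A, B) @ x)
print("compression on this layer:", (256 * 256) / (A.size + B.size))   # 128x
```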

High Throughput Matrix-Matrix Multiplication between Asymmetric Bit-Width Operands

ArXiv, 2020

Matrix multiplications between asymmetric bit-width operands, especially between 8- and 4-bit operands, are likely to become a fundamental kernel of many important workloads, including neural networks and machine learning. While existing SIMD matrix multiplication instructions for symmetric bit-width operands can support operands of mixed precision by zero- or sign-extending the narrow operand to match the size of the other operands, they cannot exploit the benefit of the narrow bit-width of one of the operands. We propose a new SIMD matrix multiplication instruction that uses mixed precision on its inputs (8- and 4-bit operands) and accumulates product values into narrower 16-bit output accumulators, in turn allowing the SIMD operation at 128-bit vector width to process a greater number of data elements per instruction to improve processing throughput and memory bandwidth utilization without increasing the register read- and write-port bandwidth in CPUs. The proposed asymmetric-operand-s...
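
An emulation-only sketch of the asymmetric-precision arithmetic described above: 8-bit by 4-bit products accumulated into 16-bit accumulators. The lane layout, accumulation order and any saturation behavior of the proposed instruction are not specified here; this just illustrates the arithmetic.

```python
# Emulate int8 x int4 dot products with 16-bit accumulation.
import numpy as np

rng = np.random.default_rng(0)
a8 = rng.integers(-128, 128, size=16, dtype=np.int8)    # 16 x int8 fits in 128 bits
b4 = rng.integers(-8, 8, size=16, dtype=np.int8)        # int4 value range, stored unpacked here

# Each pairwise product fits easily in int16 (|a*b| <= 127*8 = 1016), so a
# 16-bit accumulator can absorb several products before risking overflow.
acc = np.zeros(4, dtype=np.int16)
for lane in range(4):                                    # 4 accumulators, 4 products each
    chunk = slice(4 * lane, 4 * lane + 4)
    acc[lane] += np.sum(a8[chunk].astype(np.int16) * b4[chunk].astype(np.int16),
                        dtype=np.int16)
print(acc)
```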

Design of heterogeneous coherence hierarchies using manager-client pairing

ability to complete this degree, and their understanding and compassion during my extended social absences around publication deadlines over the years. I especially want to thank my mother for instilling such a strong sense of integrity in me, which leaves its mark on every piece of research I do, and my father and stepmother for believing in me with so much enthusiasm that my victories always felt like theirs as well. I also must thank Gerald Merckel for having such a profound influence on this undecided undergraduate student so many years ago, and for being my point-of-contact to my PhD advisor and lifelong friend, Tom Conte. I could easily populate pages attempting to describe the complex dynamic that only a student and advisor know, but I'll keep it simple: Tom, thank you for helping me reach my potential. I also want to thank Eric Rotenberg, whose countless after-lecture discussions and impassioned arguments during my time at NCSU shaped much of my understanding of architecture. And to the Tinker group, my support web that kept me from sinking countless times, I owe much to you. Saurabh Sharma, whose 'AMAZING' mentorship got my feet wet on my first publication. Balaji Iyer, the most gullible lab-mate and the most impressive human encyclopedia of references I've known. Paul Bryan, my go-to statistics guru and video/board gaming bud. Jason Poovey, my fellow 'laugh riot comedian' who has an uncanny knack for knocking me off balance with his unexpected perspective. And especially Chad Rosier, my roommate and partner in crime through Swampy, SlaDir, DSDP, CaffeineSim, MCP, and countless other brave adventures in architecture; thank you for helping me through the hardest time of my life. I also want to thank the 'new guys', the Georgia Tech Tinkers. Rishiraj Bheda, Brian Railing, Phillip Vassenkov and Eric Hein: when things started to become overwhelming, you helped me pull it all back together. This wouldn't have happened without your feedback in all those meetings, all those trips to the 'Zen Garden', the Monday lunches, or the million other small things that add up to so much. And Brian, we will publish together! Last, and certainly not least, I want to thank Regina Maniquis, my inspiration to keep going when I thought I had no more to give, and the reason I've been able to push through the countless hours of late-night thinking, planning, coding and writing over these past several months. This is dedicated to you.

Compressing RNNs to Kilobyte Budget for IoT Devices Using Kronecker Products

ACM Journal on Emerging Technologies in Computing Systems, 2021

Micro-controllers (MCUs) make up most of the processors in the world, with widespread applicability from automobiles to medical devices. The Internet of Things promises to enable these resource-constrained MCUs with machine learning algorithms to provide always-on intelligence. Many Internet of Things applications consume time-series data that are naturally suitable for recurrent neural networks (RNNs) like LSTMs and GRUs. However, RNNs can be large and difficult to deploy on these devices, as they have only a few kilobytes of memory. As a result, there is a need for compression techniques that can significantly compress RNNs without negatively impacting task accuracy. This article introduces a method to compress RNNs for resource-constrained environments using the Kronecker product (KP). KPs can compress RNN layers by 16× to 38× with minimal accuracy loss. By quantizing the resulting models to 8 bits, we further push the compression factor to 50×. We compare KP with other state-of-the-art c...

Skipping RNN State Updates without Retraining the Original Model

Proceedings of the 1st Workshop on Machine Learning on Edge in Sensor Systems, 2019

Recurrent Neural Networks (RNNs) break a time-series input (or a sentence) into multiple time-steps (or words) and process it one time-step (word) at a time. However, not all of these time-steps (words) need to be processed to determine the final output accurately. Prior work has exploited this intuition by incorporating an additional predictor in front of the RNN model to prune time-steps that are not relevant. However, they jointly train the predictor and the RNN model, allowing one to learn from the mistakes of the other. In this work we present a method to skip RNN time-steps without retraining or fine-tuning the original RNN model. Using an ideal predictor, we show that even without retraining the original model, we can train a predictor to skip 45% of steps for the SST dataset and 80% of steps for the IMDB dataset without impacting the model accuracy. We show that the decision to skip is not easy by comparing against 5 different baselines based on solutions derived from domain knowledge. Finally, we present a case study about the cost and accuracy benefits of realizing such a predictor. This realistic predictor on the SST dataset is able to reduce the computation by more than 25% with at most 0.3% loss in accuracy while being 40× smaller than the original RNN model.
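
A sketch with assumed interfaces: a small predictor decides per time-step whether to run the frozen RNN cell or simply carry the hidden state forward. Both the cell and the skip rule below are stand-ins, not the models from the paper.

```python
# Skip RNN state updates without touching the original (frozen) RNN cell.
import numpy as np

rng = np.random.default_rng(0)
hidden, embed = 64, 32
Wh = rng.standard_normal((hidden, hidden)) * 0.1
Wx = rng.standard_normal((hidden, embed)) * 0.1

def rnn_cell(h, x):
    """Stand-in for the original, unmodified RNN cell."""
    return np.tanh(Wh @ h + Wx @ x)

def skip_predictor(h, x):
    """Stand-in for the small learned predictor; here a trivial energy rule."""
    return float(np.dot(x, x)) < embed

h = np.zeros(hidden)
skipped = 0
for x in rng.standard_normal((20, embed)):   # 20 time-steps of input
    if skip_predictor(h, x):
        skipped += 1                         # carry h forward: no cell computation
        continue
    h = rnn_cell(h, x)
print(f"skipped {skipped}/20 steps")
```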

Pushing the limits of RNN Compression

2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS Edition (EMC2-NIPS), 2019

Recurrent Neural Networks (RNNs) can be difficult to deploy on resource-constrained devices due to their size. As a result, there is a need for compression techniques that can significantly compress RNNs without negatively impacting task accuracy. This paper introduces a method to compress RNNs for resource-constrained environments using the Kronecker product (KP). KPs can compress RNN layers by 16-38× with minimal accuracy loss. We show that KP can beat the task accuracy achieved by other state-of-the-art compression techniques across 4 benchmarks spanning 3 different applications, while simultaneously improving inference run-time.

Efficient Winograd or Cook-Toom Convolution Kernel Implementation on Widely Used Mobile CPUs

2019 2nd Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), 2019

The Winograd or Cook-Toom class of algorithms helps to reduce the overall compute complexity of many modern deep convolutional neural networks (CNNs). Although there has been a lot of research done on model and algorithmic optimization of CNNs, little attention has been paid to the efficient implementation of these algorithms on embedded CPUs, which usually have very limited memory and low power budgets. This paper aims to fill this gap and focuses on the efficient implementation of Winograd or Cook-Toom based convolution on modern Arm Cortex-A CPUs, widely used in mobile devices today. Specifically, we demonstrate a reduction in inference latency by using a set of optimization strategies that improve the utilization of computational resources, and by effectively leveraging the ARMv8-A NEON SIMD instruction set. We evaluated our proposed region-wise multi-channel implementations on the Arm Cortex-A73 platform using several representative CNNs. The results show significant full-network performance improvements, up to 60%, over existing im2row/im2col based optimization techniques.
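
For orientation, a worked 1-D Winograd F(2,3) example showing the transform these kernels build on: two outputs of a 3-tap filter from four multiplications instead of six. The Arm NEON implementation details from the paper are not reproduced here.

```python
# Winograd F(2,3): Y = A^T [ (G g) ⊙ (B^T d) ] with the standard transform matrices.
import numpy as np

Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
G  = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], float)
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)

def winograd_f23(d, g):
    """d: 4-element input tile, g: 3-tap filter -> 2 correlation outputs."""
    return At @ ((G @ g) * (Bt @ d))     # only 4 elementwise multiplies

rng = np.random.default_rng(0)
d, g = rng.standard_normal(4), rng.standard_normal(3)
direct = np.array([g @ d[0:3], g @ d[1:4]])          # naive sliding dot product (6 multiplies)
assert np.allclose(winograd_f23(d, g), direct)
print(winograd_f23(d, g))
```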

Pushing the Envelope of Dynamic Spatial Gating technologies

Proceedings of the 2nd International Workshop on Challenges in Artificial Intelligence and Machine Learning for Internet of Things, 2020

There has been a recent surge of interest in dynamic inference technologies that can reduce the cost of inference without sacrificing the accuracy of the model. These models are based on the assumption that not all parts of the output feature map (OFM) are equally important for all inputs. The parts of the output feature map that are deemed unimportant for a certain input can be skipped entirely or computed at lower precision, leading to a reduced number of computations. In this paper we focus on one such technology that targets unimportant features in the spatial domain of the OFM, called Precision Gating (PG). PG computes most features in low precision to identify regions in the OFM where an object of interest is present, and computes a high-precision OFM for that region only. We show that PG leads to a loss in accuracy when we push the MAC reduction achieved by a PG network. We identify orthogonal dynamic optimization opportunities not exploited by PG and show that the combined technologies can achieve far better results than their individual baselines. This Hybrid Model can achieve 1.92x computation savings on a CIFAR-10 model at an accuracy of 91.35%. At similar computation savings, the PG model achieves an accuracy of 89.9%. Additionally, we show that PG leads to GEMM computations that are not hardware-aware and propose a fix that makes the PG technique CPU-friendly without losing accuracy.
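
A hedged sketch of the Precision Gating idea: a cheap low-precision pass scores each spatial position, and only positions above a threshold are recomputed in full precision. The quantizer, threshold and shapes are illustrative assumptions, not the paper's configuration.

```python
# Precision-gated convolution sketch: low-precision everywhere, high precision where salient.
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, bits=4):
    """Crude uniform quantizer used only for the cheap first pass."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) or 1.0
    return np.round(x / scale) * scale

def precision_gated_conv(x, w, threshold=0.5):
    """x: (H, W) single-channel map, w: (3, 3) filter, 'same' padding."""
    xp = np.pad(x, 1)
    H, W = x.shape
    lo = np.zeros_like(x)
    for i in range(H):                                   # cheap low-precision pass
        for j in range(W):
            lo[i, j] = np.sum(quantize(xp[i:i+3, j:j+3]) * quantize(w))
    gate = np.abs(lo) > threshold * np.abs(lo).max()     # salient spatial positions
    out = lo.copy()
    for i, j in zip(*np.nonzero(gate)):                  # expensive pass, gated region only
        out[i, j] = np.sum(xp[i:i+3, j:j+3] * w)
    return out, gate.mean()

out, frac_hi = precision_gated_conv(rng.standard_normal((16, 16)), rng.standard_normal((3, 3)))
print(f"high-precision fraction: {frac_hi:.2f}")
```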

Rank and run-time aware compression of NLP Applications

Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, 2020

Sequence-model-based NLP applications can be large. Yet, many applications that benefit from them run on small devices with very limited compute and storage capabilities, while still having run-time constraints. As a result, there is a need for a compression technique that can achieve significant compression without negatively impacting inference run-time and task accuracy. This paper proposes a new compression technique called Hybrid Matrix Factorization (HMF) that achieves this dual objective. HMF improves low-rank matrix factorization (LMF) techniques by doubling the rank of the matrix using an intelligent hybrid structure, leading to better accuracy than LMF. Further, by preserving dense matrices, it leads to faster inference run-time than pruning or structured-matrix-based compression techniques. We evaluate the impact of this technique on 5 NLP benchmarks across multiple tasks (translation, intent detection, language modeling) and show that for similar accuracy values and compression factors, HMF can achieve more than 2.32× faster inference run-time than pruning and 16.77% better accuracy than LMF.

High-speed formal verification of heterogeneous coherence hierarchies

2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), 2013

As more heterogeneous architecture solutions continue to emerge, coherence solutions tailored for these architectures will become mandatory. Coherence hierarchies will likely continue to be prevalent in future large-scale shared memory architectures. However, past experience has shown that hierarchical coherence protocol design is a non-trivial problem, especially when considering the verification effort required to guarantee correctness. While some strategies do exist for verification of homogeneous coherence hierarchies, support for reasonable verification of heterogeneous coherence hierarchies is currently unavailable. Ideally, hierarchical coherence protocols composed of 'building block' protocols should be able to take advantage of incremental verification to sidestep the state-space explosion problem which hampers any large-scale verification effort. In this work, we prove this can be accomplished through the use of the Manager-Client Pairing (MCP) framework, which provides encapsulation and permission-checking support that enables a form of state-space symmetry. When combined with an inductive proof, this ensures the validation properties of proper permission distribution and livelock/deadlock freedom are enforced by any hierarchical composition of MCP-compliant protocols. Demonstration of this methodology through the MurPhi formal verifier shows several orders of magnitude improvement in verification cost compared to full hierarchy verification.

Comparing Synthesizable HDL Design and Stream Programming

Designing Configurable, Modifiable and Reusable Components for Simulation of Multicore Systems

2012 SC Companion: High Performance Computing, Networking Storage and Analysis, 2012

A simulation system for modern multicore architectures is composed of various component models. For such a system to be useful for research purposes, modifiability is a key quality attribute. Users, when building a simulation model, need to have the capability to adjust various aspects of a component, or even replace a component with another of the same type. Software design considerations can determine whether or not a simulation system is successful in providing such capabilities. This paper presents a few design tactics that we adopt in creating configurable, modifiable, and reusable components for Manifold, our parallel simulation framework for multicore systems. The main example component is MCP-cache, a coherence cache model. The ideas behind the tactics are general enough and should be useful to designers of similar systems.

Manager-client pairing

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011

As technology continues to scale, more sophisticated coherence management is becoming a necessity. The likely solution to this problem is the use of coherence hierarchies, analogous to how cache hierarchies have helped address the memory-wall problem in the past. Previous work in the construction of large-scale coherence protocols, however, demonstrates the complexity inherent to this design space. The difficulty with hierarchical coherence protocol design is that the complexity increases exponentially with the increase in coherence states, due in turn to interactions between hierarchy tiers. Additionally, because of the large development investment, choices regarding the coherence hierarchy are often made statically, with little knowledge of how changes to the organization would affect the system. In this work, we present Manager-Client Pairing (MCP) as a unifying methodology for designing multi-tier coherence protocols by formally defining and limiting the interactions between levels within a coherence hierarchy to enable composition. Using MCP, we then implement a variety of hierarchical coherence protocol configurations for a 256-core system composed of four 64-core manycores, and provide insights into the impact that different hierarchy depth and width choices can have on system performance.

Accelerating Multi-threaded Application Simulation through Barrier-Interval Time-Parallelism

2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 2012

In the last decade, the microprocessor industry has undergone a dramatic change, ushering in the new era of multi-/manycore processors. As new designs incorporate increasing core counts, simulation technology has not matched pace, resulting in simulation times that increasingly dominate the design cycle. Complexities associated with the execution of code and communication between simulated cores have presented new obstacles for the simulation of manycore designs. Hence, many techniques developed to accelerate uniprocessor simulation cannot be easily adapted to accelerate manycore simulation. In this work, a novel time-parallel barrier-interval simulation methodology is presented to rapidly accelerate the simulation of certain classes of multi-threaded workloads. A program delineated into intervals by barriers may be accurately simulated in parallel. This approach avoids challenges originating from unknown thread progressions, since the program location of each executing thread is known. For the workloads tested, wall-clock speedups range from 1.22x to 596x, with an average of 13.94x. Furthermore, this approach allows the estimation of stable performance metrics such as cycle counts with minimal losses in accuracy (2%, on average, for all tested workloads). The proposed technique provides a fast and accurate mechanism to rapidly accelerate particular classes of manycore simulations.
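
A conceptual sketch of barrier-interval time-parallelism with no real simulator attached: barrier-delimited intervals are simulated independently in parallel and their cycle estimates summed. The interval data and CPI values are illustrative stand-ins.

```python
# Time-parallel simulation of barrier-delimited intervals.
from concurrent.futures import ProcessPoolExecutor

def simulate_interval(interval):
    """Stand-in for a detailed simulation of one barrier-to-barrier interval;
    returns an estimated cycle count for that interval."""
    instructions, cpi = interval
    return int(instructions * cpi)

if __name__ == "__main__":
    # (instruction count, assumed CPI) per barrier-delimited interval -- illustrative numbers.
    intervals = [(1_000_000, 1.2), (2_500_000, 0.9), (800_000, 1.5), (3_100_000, 1.1)]
    with ProcessPoolExecutor() as pool:
        cycles = list(pool.map(simulate_interval, intervals))   # intervals run in parallel
    print("total estimated cycles:", sum(cycles))
```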
