Joseph Gonzalez - Academia.edu
Papers by Joseph Gonzalez
arXiv (Cornell University), Jul 15, 2020
Existing approaches to federated learning suffer from a communication bottleneck as well as convergence issues due to sparse client participation. In this paper we introduce a novel algorithm, called FetchSGD, to overcome these challenges. FetchSGD compresses model updates using a Count Sketch, and then takes advantage of the mergeability of sketches to combine model updates from many workers. A key insight in the design of FetchSGD is that, because the Count Sketch is linear, momentum and error accumulation can both be carried out within the sketch. This allows the algorithm to move momentum and error accumulation from clients to the central aggregator, overcoming the challenges of sparse client participation while still achieving high compression rates and good convergence. We prove that FetchSGD has favorable convergence guarantees, and we demonstrate its empirical effectiveness by training two residual networks and a transformer model.
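To make the linearity argument concrete, here is a minimal sketch (assuming NumPy; class and function names are illustrative, not the paper's reference implementation) of a Count Sketch whose mergeability lets the aggregator keep momentum and error accumulation entirely in sketch space:

```python
import numpy as np

class CountSketch:
    """Minimal Count Sketch of a d-dimensional vector. It is linear, so sketches
    from many workers can simply be summed (merged) by the aggregator."""
    def __init__(self, dim, rows=5, cols=1 << 12, seed=0):
        rng = np.random.default_rng(seed)                 # shared seed: identical hashes everywhere
        self.dim, self.rows, self.cols = dim, rows, cols
        self.bucket = rng.integers(0, cols, size=(rows, dim))
        self.sign = rng.choice([-1.0, 1.0], size=(rows, dim))
        self.table = np.zeros((rows, cols))

    def accumulate(self, vec, scale=1.0):                 # table += scale * sketch(vec)
        for r in range(self.rows):
            np.add.at(self.table[r], self.bucket[r], scale * self.sign[r] * vec)

    def merge(self, other):                               # linearity: S(x) + S(y) = S(x + y)
        self.table += other.table

    def estimate(self):                                   # median-of-signs estimate of every coordinate
        return np.median(self.sign * self.table[np.arange(self.rows)[:, None], self.bucket], axis=0)


def aggregate_round(client_grads, S_u, S_e, dim, lr=0.1, rho=0.9, k=50):
    """One FetchSGD-style round (illustrative): clients send sketched gradients;
    momentum (S_u) and error accumulation (S_e) live in sketch space on the server."""
    S_t = CountSketch(dim)
    for g in client_grads:
        s = CountSketch(dim)
        s.accumulate(g)
        S_t.merge(s)                         # mergeability: combine all participating clients
    S_u.table = rho * S_u.table + S_t.table  # momentum carried inside the sketch
    S_e.table += lr * S_u.table              # error accumulation inside the sketch
    delta = S_e.estimate()                   # unsketch, then keep only the k heavy hitters
    mask = np.zeros(dim)
    mask[np.argsort(-np.abs(delta))[:k]] = 1.0
    delta *= mask
    S_e.accumulate(delta, scale=-1.0)        # remove what was applied from the error sketch
    return delta                             # sparse model update broadcast to clients

dim = 1000
S_u, S_e = CountSketch(dim), CountSketch(dim)
grads = [np.random.default_rng(i).normal(size=dim) for i in range(10)]
update = aggregate_round(grads, S_u, S_e, dim)
```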
arXiv (Cornell University), Apr 21, 2021
The computation demand for machine learning (ML) has grown rapidly recently, which comes with a number of costs. Estimating the energy cost helps measure its environmental impact and find greener strategies, yet it is challenging without detailed information. We calculate the energy use and carbon footprint of several recent large models (T5, Meena, GShard, Switch Transformer, and GPT-3) and refine earlier estimates for the neural architecture search that found Evolved Transformer. We highlight the following opportunities to improve energy efficiency and CO2-equivalent emissions (CO2e):
• Large but sparsely activated DNNs can consume <1/10th the energy of large, dense DNNs without sacrificing accuracy, despite using as many or even more parameters.
• Geographic location matters for ML workload scheduling, since the fraction of carbon-free energy and the resulting CO2e vary ~5X-10X, even within the same country and the same organization. We are now optimizing where and when large models are trained.
• Specific datacenter infrastructure matters, as cloud datacenters can be ~1.4-2X more energy efficient than typical datacenters, and the ML-oriented accelerators inside them can be ~2-5X more effective than off-the-shelf systems.
Remarkably, the choice of DNN, datacenter, and processor can reduce the carbon footprint by up to ~100-1000X. These large factors also make retroactive estimates of energy cost difficult. To avoid miscalculations, we believe ML papers requiring large computational resources should make energy consumption and CO2e explicit when practical. We are working to be more transparent about energy use and CO2e in our future research. To help reduce the carbon footprint of ML, we believe energy usage and CO2e should be a key metric in evaluating models, and we are collaborating with MLPerf developers to include energy usage during training and inference in this industry-standard benchmark.
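The factors above compound multiplicatively. A back-of-the-envelope sketch (all numbers below are illustrative placeholders, not figures from the paper) of the standard energy-to-CO2e accounting shows how accelerator efficiency, datacenter PUE, and grid carbon intensity stack:

```python
def co2e_tonnes(accel_hours, accel_power_kw, pue, kg_co2e_per_kwh):
    """Back-of-the-envelope training footprint: energy = device-hours x device power x
    datacenter PUE; CO2e = energy x grid carbon intensity. Inputs are assumptions."""
    energy_kwh = accel_hours * accel_power_kw * pue
    return energy_kwh * kg_co2e_per_kwh / 1000.0

# The same workload under two hypothetical setups shows how the individual factors compound.
baseline = co2e_tonnes(accel_hours=100_000, accel_power_kw=0.30, pue=1.6, kg_co2e_per_kwh=0.60)
greener  = co2e_tonnes(accel_hours=25_000,  accel_power_kw=0.25, pue=1.1, kg_co2e_per_kwh=0.08)
print(f"baseline ~{baseline:.1f} t CO2e, greener ~{greener:.1f} t CO2e, ratio ~{baseline/greener:.0f}x")
```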
arXiv (Cornell University), Feb 20, 2017
Distributed optimization algorithms are widely used in many industrial machine learning applications. However, choosing the appropriate algorithm and cluster size is often difficult for users, as the performance and convergence rate of optimization algorithms vary with the size of the cluster. In this paper we make the case for an ML-optimizer that can select the appropriate algorithm and cluster size for a given problem. To do this we propose building two models: one that captures how the system-level characteristics of computation and communication change as we increase cluster size, and another that captures how convergence rates change with cluster size. We present preliminary results from our prototype implementation, called Hemingway, and discuss some of the challenges involved in developing such a system.
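The core idea is combining the two fitted models into one objective. A minimal sketch (the fitted curves and cost numbers below are hypothetical, not Hemingway's actual models) of how such an optimizer could pick a cluster size:

```python
def pick_cluster_size(candidate_sizes, time_per_iter, iters_to_converge, cost_per_node_hour=None):
    """Combine a systems model (per-iteration time vs. cluster size) and a convergence
    model (iterations needed vs. cluster size), both assumed to be fitted elsewhere,
    and pick the cluster size minimizing wall-clock time or dollar cost."""
    best = None
    for n in candidate_sizes:
        wall_clock_h = time_per_iter(n) * iters_to_converge(n) / 3600.0
        objective = wall_clock_h if cost_per_node_hour is None else wall_clock_h * n * cost_per_node_hour
        if best is None or objective < best[1]:
            best = (n, objective)
    return best  # (cluster size, minimized hours or dollars)

# Hypothetical fitted curves: communication overhead grows with n; convergence improves sub-linearly.
best_n, hours = pick_cluster_size(
    candidate_sizes=[2, 4, 8, 16, 32, 64],
    time_per_iter=lambda n: 10.0 / n + 0.05 * n,        # seconds: compute shrinks, comms grows
    iters_to_converge=lambda n: 5000 / (n ** 0.5),      # assumed convergence-vs-size curve
)
print(best_n, hours)
```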
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021
Recent work learns contextual representations of source code by reconstructing tokens from their context. For downstream semantic understanding tasks like code clone detection, these representations should ideally capture program functionality. However, we show that the popular reconstruction-based RoBERTa model is sensitive to source code edits, even when the edits preserve semantics. We propose ContraCode: a contrastive pre-training task that learns code functionality, not form. ContraCode pre-trains a neural network to identify functionally similar variants of a program among many non-equivalent distractors. We scalably generate these variants using an automated source-to-source compiler as a form of data augmentation. Contrastive pre-training outperforms RoBERTa on an adversarial code clone detection benchmark by 39% AUROC. Surprisingly, improved adversarial robustness translates to better accuracy on natural code; ContraCode improves summarization and TypeScript type inference accuracy by 2 to 13 percentage points over competitive baselines. All source is available at https://github.com/parasj/contracode.
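A contrastive objective of this kind pulls a program's embedding toward a functionally equivalent variant and pushes it away from distractors. A minimal InfoNCE-style sketch (assuming NumPy; the embeddings here are random placeholders rather than outputs of ContraCode's encoder):

```python
import numpy as np

def info_nce_loss(anchor, positive, distractors, temperature=0.07):
    """Contrastive loss: anchor should score highest against the functionally equivalent
    positive (e.g., a compiler-generated variant) relative to non-equivalent distractors."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    a, p, d = norm(anchor), norm(positive), norm(distractors)
    logits = np.concatenate([[a @ p], d @ a]) / temperature   # positive first, then negatives
    logits -= logits.max()                                    # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())  # cross-entropy with positive at index 0

# Toy usage with random 128-d "embeddings"; real inputs would come from a code encoder.
rng = np.random.default_rng(0)
print(info_nce_loss(rng.normal(size=128), rng.normal(size=128), rng.normal(size=(10, 128))))
```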
In pursuit of graph processing performance, the systems community has largely abandoned general-purpose distributed dataflow frameworks in favor of specialized graph processing systems that provide tailored programming abstractions and accelerate the execution of iterative graph algorithms. In this paper we argue that many of the advantages of specialized graph processing systems can be recovered in a modern general-purpose distributed dataflow system. We introduce GraphX, an embedded graph processing framework built on top of Apache Spark, a widely used distributed dataflow system. GraphX presents a familiar composable graph abstraction that is sufficient to express existing graph APIs, yet can be implemented using only a few basic dataflow operators (e.g., join, map, group-by). To achieve performance parity with specialized graph systems, GraphX recasts graph-specific optimizations as distributed join optimizations and materialized view maintenance. By leveraging advances in distr...
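To illustrate the "graph computation as join/map/group-by" framing, here is one PageRank iteration expressed only with those operators over plain in-memory tables (a conceptual sketch in Python, not Spark or the GraphX API):

```python
from collections import defaultdict

def pagerank_iteration(vertices, edges, reset=0.15):
    """One PageRank step using only join / map / group-by over flat tables.
    vertices: {vertex_id: rank}; edges: list of (src, dst)."""
    out_degree = defaultdict(int)                       # group-by src to compute out-degrees
    for src, _ in edges:
        out_degree[src] += 1
    # join(edges, vertices) then map: each edge carries its source's rank contribution
    contributions = ((dst, vertices[src] / out_degree[src]) for src, dst in edges)
    summed = defaultdict(float)                         # group-by dst and sum contributions
    for dst, c in contributions:
        summed[dst] += c
    # map over vertices to apply the PageRank update
    return {v: reset + (1 - reset) * summed.get(v, 0.0) for v in vertices}

ranks = {v: 1.0 for v in "abc"}
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
for _ in range(10):
    ranks = pagerank_iteration(ranks, edges)
print(ranks)
```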
Queue, 2018
This installment of Research for Practice features a curated selection from Dan Crankshaw and Joey Gonzalez, who provide an overview of machine learning serving systems. What happens when we wish to actually deploy a machine learning model to production, and how do we serve predictions with high accuracy and high computational efficiency? Dan and Joey's selection provides a thoughtful survey of cutting-edge techniques spanning database-level integration, video processing, and prediction middleware. Given the explosion of interest in machine learning and its increasing impact on seemingly every application vertical, it's possible that systems such as these will become as commonplace as relational databases are today.
Reinforcement learning (RL) algorithms involve the deep nesting of highly irregular computation patterns, each of which typically exhibits opportunities for distributed computation. We argue for distributing RL components in a composable way by adapting algorithms for top-down hierarchical control, thereby encapsulating parallelism and resource requirements within short-running compute tasks. We demonstrate the benefits of this principle through RLlib: a library that provides scalable software primitives for RL. These primitives enable a broad range of algorithms to be implemented with high performance, scalability, and substantial code reuse. RLlib is available at this https URL.
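The top-down control pattern can be sketched with a driver that owns the training loop and fans out short-running rollout tasks. This is a generic illustration of the principle using the standard library, not RLlib's API; the rollout and update logic is a toy placeholder:

```python
import random
from concurrent.futures import ProcessPoolExecutor

def rollout(policy_params, horizon=100, seed=0):
    """Short-running task: collect one rollout under the given policy parameters and
    return a summary statistic (here just a fake episode return)."""
    rng = random.Random(seed)
    return sum(policy_params.get("bias", 0.0) + rng.random() for _ in range(horizon))

def train(num_iters=5, num_workers=4):
    """Top-down hierarchical control: the driver dispatches rollouts as short tasks,
    gathers their results, and applies a (toy) policy update; parallelism stays
    encapsulated inside each call."""
    policy = {"bias": 0.0}
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        for _ in range(num_iters):
            futures = [pool.submit(rollout, policy, 100, seed) for seed in range(num_workers)]
            returns = [f.result() for f in futures]                        # gather experience
            policy["bias"] += 0.01 * (sum(returns) / len(returns) - 50.0)  # toy policy update
    return policy

if __name__ == "__main__":
    print(train())
```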
ArXiv, 2019
Serverless cloud computing handles virtually all the system administration operations needed to make it easier for programmers to use the cloud. It provides an interface that greatly simplifies cloud programming, and represents an evolution that parallels the transition from assembly language to high-level programming languages. This paper gives a quick history of cloud computing, including an accounting of the predictions of the 2009 Berkeley View of Cloud Computing paper, explains the motivation for serverless computing, describes applications that stretch the current limits of serverless, and then lists obstacles and research opportunities required for serverless computing to fulfill its full potential. Just as the 2009 paper identified challenges for the cloud and predicted they would be addressed and that cloud use would accelerate, we predict these issues are solvable and that serverless computing will grow to dominate the future of cloud computing.
Proceedings of the 11th ACM Symposium on Cloud Computing, 2020
Serving ML prediction pipelines spanning multiple models and hardware accelerators is a key challenge in production machine learning. Optimally configuring these pipelines to meet tight end-to-end latency goals is complicated by the interaction between model batch size, the choice of hardware accelerator, and variation in the query arrival process. In this paper we introduce InferLine, a system which provisions and manages the individual stages of prediction pipelines to meet end-to-end tail latency constraints while minimizing cost. InferLine consists of a low-frequency combinatorial planner and a high-frequency auto-scaling tuner. The low-frequency planner leverages stage-wise profiling, discrete event simulation, and constrained combinatorial search to automatically select hardware type, replication, and batching parameters for each stage in the pipeline. The high-frequency tuner uses network calculus to auto-scale each stage to meet tail latency goals in response to changes in the query arrival process. We demonstrate that InferLine outperforms existing approaches by up to 7.6× in cost while achieving up to 34.5× lower latency SLO miss rate on realistic workloads, and generalizes across state-of-the-art model serving frameworks.
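The planner's job can be pictured as a search over per-stage configurations constrained by an end-to-end latency budget. A greatly simplified sketch (the profile numbers, stage names, and greedy search are illustrative stand-ins for InferLine's profiling, simulation, and constrained combinatorial search):

```python
from itertools import product

# Hypothetical per-stage profiles: (hardware, batch) -> (latency_ms, throughput_qps, $/hour per replica)
PROFILES = {
    "detector":   {("gpu", 8): (40, 400, 3.0), ("cpu", 1): (120, 30, 0.4)},
    "classifier": {("gpu", 16): (25, 900, 3.0), ("cpu", 4): (60, 120, 0.4)},
}

def plan(stages, slo_ms, arrival_qps):
    """Choose hardware/batch per stage, replicate each stage to cover the arrival rate,
    keep only plans whose end-to-end latency fits the SLO, and return the cheapest."""
    best = None
    for choice in product(*[PROFILES[s].items() for s in stages]):
        latency = sum(lat for _, (lat, _, _) in choice)
        if latency > slo_ms:
            continue                                    # violates the tail-latency budget
        cost = 0.0
        for _, (_, qps, dollars) in choice:
            replicas = -(-arrival_qps // qps)           # ceil: replicate until throughput covers load
            cost += replicas * dollars
        if best is None or cost < best[0]:
            best = (cost, {s: cfg for s, (cfg, _) in zip(stages, choice)})
    return best

print(plan(["detector", "classifier"], slo_ms=100, arrival_qps=500))
```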
2021 IEEE 17th International Conference on Automation Science and Engineering (CASE), 2021
As many robot automation applications increasingly rely on multi-core processing or deep-learning models, cloud computing is becoming an attractive and economically viable resource for systems that do not contain high computing power onboard. Despite its immense computing capacity, it is often underused by the robotics and automation community due to a lack of expertise in cloud computing and cloud-based infrastructure. Fog Robotics balances computing and data between the cloud and edge devices. We propose a software framework, FogROS, as an extension of the Robot Operating System (ROS), the de facto standard for creating robot automation applications and components. It allows researchers to deploy components of their software to the cloud with minimal effort, and correspondingly gain access to additional computing cores, GPUs, FPGAs, and TPUs, as well as pre-deployed software made available by other researchers. FogROS allows a researcher to specify which components of their software will be deployed to the cloud and to what type of computing hardware. We evaluate FogROS on 3 examples: (1) simultaneous localization and mapping (ORB-SLAM2), (2) Dexterity Network (Dex-Net) GPU-based grasp planning, and (3) multi-core motion planning using a 96-core cloud-based server. In all three examples, a component is deployed to the cloud and accelerated with a small change in system launch configuration; while incurring additional network-communication latency of 1.2 s, 0.6 s, and 0.5 s, computation speed is improved by 2.6×, 6.0×, and 34.2×, respectively. Code, videos, and supplementary material can be found at https://github.com/BerkeleyAutomation/FogROS.
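The "small change in launch configuration" amounts to declaring which components run on the robot and which are provisioned in the cloud, and on what hardware. A hypothetical placement spec (the schema, keys, and helper below are illustrative, not FogROS's actual launch format):

```python
# Hypothetical placement spec: map each ROS component to robot or cloud hardware.
PLACEMENT = {
    "camera_driver":   {"target": "robot"},
    "orb_slam2":       {"target": "cloud", "hardware": "cpu-16"},
    "dexnet_grasping": {"target": "cloud", "hardware": "gpu"},
    "motion_planner":  {"target": "cloud", "hardware": "cpu-96"},
    "gripper_control": {"target": "robot"},
}

def split_launch(placement):
    """Partition components into those launched locally on the robot and those to be
    provisioned in the cloud, which is essentially the decision the researcher expresses."""
    local = [name for name, spec in placement.items() if spec["target"] == "robot"]
    cloud = {name: spec.get("hardware", "cpu") for name, spec in placement.items()
             if spec["target"] == "cloud"}
    return local, cloud

local_nodes, cloud_nodes = split_launch(PLACEMENT)
print("on robot:", local_nodes)
print("in cloud:", cloud_nodes)
```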
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018
Neural networks rely on convolutions to aggregate spatial information. However, spatial convolutions are expensive in terms of model size and computation, both of which grow quadratically with respect to kernel size. In this paper, we present a parameter-free, FLOP-free "shift" operation as an alternative to spatial convolutions. We fuse shifts and point-wise convolutions to construct end-to-end trainable shift-based modules, with a hyperparameter characterizing the tradeoff between accuracy and efficiency. To demonstrate the operation's efficacy, we replace ResNet's 3×3 convolutions with shift-based modules for improved CIFAR10 and CIFAR100 accuracy using 60% fewer parameters; we additionally demonstrate the operation's resilience to parameter reduction on ImageNet, outperforming ResNet family members. We finally show the shift operation's applicability across domains, achieving strong performance with fewer parameters on classification, face verification, and style transfer.
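A minimal sketch of the idea (assuming NumPy; group sizes and the exact ordering of shift and point-wise convolution are simplified relative to the paper's modules): all spatial mixing comes from a parameter-free channel shift, and all learned mixing from cheap 1×1 convolutions.

```python
import numpy as np

def shift(x):
    """Parameter-free, FLOP-free spatial 'shift': each channel group is displaced by one
    pixel in one of the 3x3 directions (including the identity), implemented as rolls."""
    n, c, h, w = x.shape
    directions = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    out = np.empty_like(x)
    for g, (dy, dx) in enumerate(directions):
        lo, hi = g * c // 9, (g + 1) * c // 9
        out[:, lo:hi] = np.roll(x[:, lo:hi], shift=(dy, dx), axis=(2, 3))
    return out

def pointwise_conv(x, weight):
    """1x1 convolution: the only learned part of a shift-based module."""
    return np.einsum("nchw,oc->nohw", x, weight)

# Replace a 3x3 convolution with shift + 1x1 conv: spatial aggregation costs no
# parameters or FLOPs, and the learned mixing is purely channel-wise.
x = np.random.randn(2, 18, 8, 8)
w = np.random.randn(32, 18)
y = pointwise_conv(shift(x), w)
print(y.shape)  # (2, 32, 8, 8)
```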
2020 IEEE/ACM Symposium on Edge Computing (SEC), 2020
Cameras are deployed at scale with the purpose of searching and tracking objects of interest (e.g., a suspected person) through the camera network on live videos. Such cross-camera analytics is data- and compute-intensive, and its costs grow with the number of cameras and with time. We present Spatula, a cost-efficient system that enables scaling cross-camera analytics on edge compute boxes to large camera networks by leveraging spatial and temporal cross-camera correlations. While such correlations have been used in the computer vision community, Spatula uses them to drastically reduce communication and computation costs by pruning the search space of a query identity (e.g., ignoring frames not correlated with the query identity's current position). Spatula provides the first system substrate on which cross-camera analytics applications can be built to efficiently harness the cross-camera correlations that are abundant in large camera deployments. Spatula reduces compute load by 8.3× on an 8-camera dataset, and by 23×-86× on two datasets with hundreds of cameras (simulated from real vehicle/pedestrian traces). We have also implemented Spatula on a testbed of 5 AWS DeepLens cameras.
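The pruning idea can be shown in miniature: given the query identity's last known camera, only correlated cameras are searched. The correlation values and camera names below are made up for illustration; they stand in for the historical spatio-temporal statistics Spatula would learn.

```python
# Hypothetical historical correlations: P(identity reappears on camera j within the next
# time window | it was last seen on camera i).
CORRELATION = {
    "cam1": {"cam2": 0.62, "cam3": 0.30, "cam4": 0.02, "cam5": 0.01},
    "cam2": {"cam1": 0.55, "cam3": 0.35, "cam4": 0.04, "cam5": 0.02},
}

def cameras_to_search(last_seen_camera, threshold=0.05):
    """Only run re-identification on cameras correlated with the query's last position,
    skipping frames from uncorrelated cameras entirely."""
    peers = CORRELATION.get(last_seen_camera, {})
    return sorted((c for c, p in peers.items() if p >= threshold),
                  key=lambda c: -peers[c])

print(cameras_to_search("cam1"))  # ['cam2', 'cam3'] -- cam4/cam5 are pruned from the search
```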
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infections can cause Coronavirus Disease 2019 (COVID-19), which manifests with a range of severities from mild illness to life-threatening pneumonia and multi-organ failure. Severe COVID-19 is characterized by an inflammatory signature including high levels of inflammatory cytokines, alveolar inflammatory infiltrates, and vascular microthrombi. Here we show that severe COVID-19 patients produced a unique serologic signature, including increased IgG1 with afucosylated Fc glycans. This Fc modification on SARS-CoV-2 IgGs enhanced interactions with the activating FcγR, FcγRIIIa; when incorporated into immune complexes, Fc afucosylation enhanced production of inflammatory cytokines by monocytes, including IL-6 and TNF. These results show that disease severity in COVID-19 correlates with the presence of afucosylated IgG1, a pro-inflammatory IgG Fc modification.
Journal of Clinical Oncology, 2012
TPS1143 Background: Indibulin (Zybulin, ZIO-301) is a new, synthetic agent that inhibits tumor cell growth at the G2/M phase through destabilization of microtubule dynamics. It binds tubulin at a different site than taxanes and vinca alkaloids. Indibulin does not interact with acetylated (neuronal) tubulins and has not exhibited the neurotoxicity associated with other tubulin binders. Indibulin has potent antitumor activity in human cancer cell lines, including multidrug-, taxane-, and vinblastine-resistant lines. Norton-Simon modeling based on cell line data suggested that dose-dense (dd) administration could optimize efficacy while limiting toxicity. Methods: Eligible are patients (pts) with metastatic or unresectable locally advanced breast cancer, measurable or non-measurable disease, and any number of prior therapies. The objective of the Ph I portion is to determine the maximum tolerated dose (MTD) of indibulin when given in a dd fashion (5 days treatment, 9 days rest; 14 days total) using stan...
Communications of the ACM, 2016
This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.
Proceedings of the 2015 SIAM International Conference on Data Mining, 2015
Slow running or straggler tasks in distributed processing frameworks [1, 2] can be 6 to 8 times slower than the median task in a job on a production cluster [3], despite existing mitigation techniques. This leads to extended job completion times, inefficient use of resources, and increased costs. Recently, proactive straggler avoidance techniques [4] have explored the use of predictive models to improve task scheduling. However, to capture node and workload variability, separate models are built for every node and workload, requiring the time-consuming collection of training data and limiting the applicability to new nodes and workloads. In this work, we observe that predictors for similar nodes or workloads are likely to be similar and can share information, suggesting a multi-task learning (MTL) based approach. We generalize the MTL formulation of [5] to capture commonalities in arbitrary groups. Using our formulation to predict stragglers allows us to reduce job completion times by up to 59% over Wrangler [4]. This large reduction arises from a 7-point increase in prediction accuracy. Further, we can get equal or better accuracy than [4] using a sixth of the training data, thus bringing the training time down from 4 hours to about 40 minutes. In addition, our formulation reduces the number of parameters by grouping them into node- and workload-dependent factors. This helps us generalize to tasks with insufficient data and achieve significant gains over a naive MTL formulation [5]. (Footnote: Clusters are used for different purposes, and statistics such as the kinds of jobs submitted, their resource requirements, and the frequency at which they are submitted vary depending upon the usage. We call one such distribution of jobs a workload.)
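The grouped-factor idea can be sketched as a predictor whose weights are a sum of a shared component and node- and workload-dependent components, so sparsely observed groups borrow strength from the rest. Everything below (feature dimension, group names, random weights) is an illustrative stand-in for the paper's fitted MTL model:

```python
import numpy as np

def predict_straggler(features, w_shared, w_node, w_workload, node, workload):
    """Grouped multi-task predictor: the weights for (node, workload) are a shared
    component plus node- and workload-dependent factors; output is P(task straggles)."""
    w = w_shared + w_node[node] + w_workload[workload]
    return 1.0 / (1.0 + np.exp(-(features @ w)))

d = 6                                                       # resource-usage feature dimension
rng = np.random.default_rng(0)
w_shared = rng.normal(size=d)
w_node = {n: 0.1 * rng.normal(size=d) for n in ["node-a", "node-b"]}
w_workload = {w: 0.1 * rng.normal(size=d) for w in ["etl", "training"]}
p = predict_straggler(rng.normal(size=d), w_shared, w_node, w_workload, "node-a", "etl")
print(f"predicted straggler probability: {p:.2f}")
```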
Scalable probabilistic reasoning is the key to unlocking the full potential of the age of big data. From untangling the biological processes that govern cancer to effectively targeting products and advertisements, probabilistic reasoning is how we make sense of noisy data and turn information into understanding and action. Unfortunately, the algorithms and tools for sophisticated structured probabilistic reasoning were developed for the sequential Von Neumann architecture and have therefore been unable to scale with big data. In this thesis we propose a simple set of design principles to guide the development of new parallel and distributed algorithms and systems for scalable probabilistic reasoning. We then apply these design principles to develop a series of new algorithms for inference in probabilistic graphical models and derive theoretical tools to characterize the parallel properties of statistical inference. We implement and assess the efficiency and scalability of the new inference algorithms in the multicore and distributed settings, demonstrating the substantial gains from applying the thesis methodology to real-world probabilistic reasoning. Based on the lessons learned in statistical inference, we introduce the GraphLab parallel abstraction, which generalizes the thesis methodology and enables the rapid development of new efficient and scalable parallel and distributed algorithms for probabilistic reasoning. We demonstrate how the GraphLab abstraction can be used to rapidly develop new scalable algorithms for probabilistic reasoning and assess their performance on real-world problems in both the multicore and distributed settings. Finally, we identify a unique challenge associated with the underlying graphical structure in a wide range of probabilistic reasoning tasks. To address this challenge we introduce PowerGraph, which refines the GraphLab abstraction and achieves orders of magnitude improvements in performance relative to existing systems.
Research is a team effort and I was fortunate enough to be a part of an amazing team. I would like to thank my advisor Carlos Guestrin, who helped me focus on the important problems, guided me through the challenges of research, and taught me how to more effectively teach and communicate ideas both in writing and in presentations. In addition, Carlos gave me the opportunity to work with, learn from, and lead an exceptional team. Much of the work in this thesis was done with Yucheng Low, who taught me a lot about systems, software engineering, and how to persevere through challenging bugs and complicated and even impossible proofs. Our many long discussions shaped both the key principles in this thesis as well as their execution. In addition, Yucheng was instrumental in developing many of the systems and theoretical techniques used to evaluate the ideas in this thesis. Finally, Yucheng's exceptional skills as a world-class barista made possible many late nights of successful research. Early in my graduate work at CMU I had the opportunity to work with Andreas Krause on Gaussian process models for signal quality estimation in wireless sensor networks. Andreas showed me how to apply the scientific method to design effective experiments, isolate bugs, and understand complex processes. Around the same time I also started to work with David O'Hallaron. As I transitioned my focus to the work in this thesis, David provided early guidance on scalable algorithm and system design and research focus. In addition, David introduced me to standard techniques in scientific computing and helped me build collaborations with Intel research.