Mor Harchol-Balter's Homepage (original) (raw)

Introduction for students: Link to my upcoming IC talk slides 2021

I am interested in the performance analysis and design of computer systems, particularly distributed systems. I use analytical models to capture the important characteristics of a computer system, and then I prove theorems about these models that allow me to redesign the system to improve its performance. Here "performance" might denote response time, energy use, throughput, capacity, etc. Most of my research involves inventing new analytical techniques in the area of performance analysis, as well as new algorithms for resource allocation.

Unlike many theoretical computer scientists, my analysis is based on stochastic (probabilistic) models of computer systems. There is no "adversary" sending us worst-case inputs. By contrast, there is a stream of requests, whose arrival times and service demands come from empirically fitted distributions. These distributions might be correlated, and they often exhibit heavy tailed service demands and high variability.

I believe that many conventional wisdoms on which we base computer system designs are not well understood and sometimes false, leading to inferior designs. My research revisits very classic questions in system design. Here are a few examples of commonly-held beliefs that my research challenges:

Thousands of "load balancing" heuristics do exactly that -- they aim to balance the load among the existing hosts. But is load balancing necessarily a good thing?
Ever notice that the optimal scheduling policy, Shortest-Remaining-Processing-Time-First (SRPT), is rarely used in practice? There's a fear that the big jobs will "starve", or be treated unfairly as compared with Processor-Sharing (PS). But is this a reality?
To minimize mean waiting time in a server farm, research suggests that each job should be sent to the server at which it will experience the least wait. That seems good from the job's perspective, but is the greedy strategy best for the system as a whole?
Given a choice between a single machine with speed s, or n identical machines each with speed s/n, which would you choose? Think again ...
Migrating active jobs is generally considered too expensive. Killing jobs midway through execution and restarting them from scratch later is even worse! Says who?
Cycle stealing (using another machine's idle cycles to do your work) is a central theme in distributed systems and has been the topic of thousands of papers. But can we quantify when cycle stealing pays off, as a function of switching costs and threshold parameters?
Redundancy is the idea of making multiple copies of the same job, where each copy is dispatched to a different queue, and where one only needs to wait for the first copy to complete. When does redundancy help?

Research Specifics:

I work on the design and performance analysis of computer systems including both theory and implementation:

**Algorithmic Work:**Designing and analyzing algorithms for: resource allocation, task/job scheduling, load sharing, routing, cycle stealing, replication, power-management, multi-class/multi-server scheduling, multi-core parallel scheduling, fairness. Correlated arrival processes, and analysis under high-variability workloads. ``All-Can-Win Theorem'', demonstrating that scheduling policies which are biased towards favoring small jobs can also be preferable to large jobs.

Stochastic Analysis & Queueing Techniques: Developing new methods for stochastic analysis. Examples include: (i) Recursive Dimensionality Reduction (RDR), a technique to reduce a multi-dimensional Markov chain to a 1-dimensional chain, by using busy period transitions; (ii) Recursive Renewal Reward (RRR) and Clearing Analysis on Phases (CAP), techniques that allow one to obtain closed-form solutions for many one-dimensional infinite repeating Markov chains, including the M/M/k with setup chain; (iii) Exact analysis of Replication Systems, the first exact solution (in product-form) for replication systems, involving any number of servers, and number of classes, and any degree of job replication; (iv) SOAP Analysis of Scheduling Policies, the first exact response time analysis of a huge class of scheduling policies with no prior analysis, including the Gittins Index; (v) Optimal Scheduling for Multi-server systems, the first algorithms and analysis for optimal scheduling in the M/G/k, both with known and unknown job sizes.

Computer systems implementation: Resource management/scheduling in data centers and supercomputing centers. Implementation of QoS tail latency guarantees for flows in networked storage systems (PriorityMeister and SNC-Meister). Online scheduling of jobs with heterogeneous preferences (TetriSched). Dynamic power management in multi-tier data centers (AutoScale), adopted by Facebook. Scheduling of Memory Controllers (ATLAS). Pricing and queueing optimizations for cloud services. Kernel-level SRPT connection scheduling in Web servers (SyNC). Scheduling to cope with transient overload in Web servers.

Modeling and Workload characterization: Known for the discovery of Pareto heavy-tailed distribution of UNIX process CPU lifetimes. Statistical characterization of workloads including UNIX processes, web, OLTP, supercomputing, memory workloads, datacenter workloads, and parallel multi-core jobs.

My TENURE RESEARCH STATEMENT from May 2007:

SOME TALKS

(2024) ValueTools: Can Increasing the Hit Ratio Hurt Cache Throughput?" talk slides .
(2017) CanQueue Keynote & MIT-LIDS Keynote: Queueing With Redundant Requests: talk slides and abstract .
(2015) ISCA Tutorial & ICDCS Keynote: What Queueing Theory Teaches Us about Computer Systems Design: talk slides and abstract .
(2014) GreenMetrics Keynote: Power Management in Data Centers: Theory & Practice: talk slides and abstract .
(2009) Surprising Results on Task Assignment for High-Variability Workloads: talk slides and abstract .
(2007) ORSIS '07 Keynote, Lunteren '07 Keynote, MASCOTS '08 Keynote, SIPEW '08 Keynote: Scheduling in Server Farms: talk slides and abstract .
(2006) Analysis of Cycle Stealing and Priority Queueing in Multi-Server Systems via Dimensionality Reduction Technique: talk slides and abstract .
(2005) Scheduling Your Network Connections: the SYNC project talk slides and abstract .
(2006) Open Workload Generators versus Closed: A cautionary tale Adam's talk slides and abstract .
(2004) What Analytical Performance Modeling Teaches Us About Computer System Design talk slides and abstract .

Graduated PhD STUDENTS

Ben Berg . Assistant Professor at UNC, Chapel-Hill, Computer Science Department. Link to Ben's thesis .
Sherwin Doroudi . Assistant Professor at U. Minnesota, ISyE. Link to Sherwin's thesis . Winner 2016 William W. Cooper Doctoral Dissertation Award in Management or Management Science.
Anshul Gandhi . Associate Professor at SUNY Stony Brook, Computer Science Department. Link to Anshul's thesis . Winner SPEC Dissertation Award. Winner SIGMETRICS Rising Star Award.
Kristy Gardner . Assistant Professor at Amherst College, Computer Science Department. Link to Kristy's thesis .
Isaac Grosof . Assistant Professor at Northwestern, ISyE. Co-winner INFORMS 2022 George Nicholson Prize. Winner of 2023 SIGMETRICS Doctoral Dissertation Award. Link to Isaac's thesis .
Varun Gupta Associate Professor at University of Chicago, Booth School of Business. Link to Varun's thesis .
Takayuki Osogami Researcher at IBM-TRL. Link to Taka's thesis .
David McWherter Researcher at Google, Seattle. Link to David's thesis .
Bianca Schroeder Associate Professor at University of Toronto, Computer Science. Link to Bianca's thesis .
Ziv Scully Assistant Professor at Cornell, Operations Research and Industrial Engineering Department. Link to Ziv's thesis . Co-winner INFORMS 2022 George Nicholson Prize. Also winner 2022 SIGMETRICS Doctoral Dissertation Award.
Adam Wierman Full Professor at Caltech, Computer Science. Department Head of Computer Science. Link to Adam's thesis . Winner of 2007 SCS Distinguished Dissertation Award. Winner SIGMETRICS Rising Star Award.
Timothy Zhu Assistant Professor at Penn State University, Computer Science Department. Link to Timmy's thesis .

Current PhD STUDENTS

Arthur (Zhouzi) Li
Keerthana Gurushankar

KEYNOTE PRESENTATIONS

CONFEST 2024, QTECQS 2023, ITC 2022, MAM 2022, UEMCON 2021, DDQC 2021, QTNA 2019, YEQT 2018, 40th Dutch Queueing Colloquium 2018, MIT LIDS Student Conference 2017, CanQueue 2016, ACM SIGMETRICS 2016, ICDCS 2015, GreenMetrics 2014, MASCOTS 2008, SIPEW 2008, Lunteren Conference 2007, ORSIS 2007.

ANNUAL TALK on Applying to Ph.D. Programs in Computer Science (pdf) .

TEACHING

Fall 2025! Offered odd-numbered Fall semesters, since 1999: 15-857 Analytical Performance Modeling (graduate)
Taught every Spring since 2004: 15-259 Probability and Computing (undergraduate)
New for Fall 2022! 15-829 Performance Modeling Tools for Systems Researchers (graduate)
(2014S) 21-127 Concepts of Mathematics (undergraduate)
(Spring 2007): 15-858A Advanced Stochastic Processes (graduate)
(Spring 2000, Spring 2001, Fall 2002): 15-441 Computer Networks (undergraduate)
(2000 through present): SQUALL (Scheduling and QUeueing Around Lunchtime)

---