Lei Zhang's homepage
About me
My research interests are broadly in distributed systems. I'm currently a Research Scientist on ByteDance's AI Networking team, working on Systems for ML, with a major focus on LLM reliability. My experience spans distributed systems, cloud systems, and AI infrastructure.
- Reliability and observability: distributed tracing, programming languages (PL) for systems, toward root cause analysis
- Distributed caching: heterogeneous memory management, CDN caching, performance quantification
- Systems for ML: reliability, observability, performance analysis, RDMA-based AI infrastructure for LLMs, collective communication libraries
My research has received an ACM SIGMETRICS Kenneth C. Sevcik Outstanding Student Paper Award. I was a postdoctoral researcher at Princeton University, working with Prof. Ravi Netravali. Before that, I received my Ph.D. from Emory University in 2021, working with Prof. Ymir Vigfusson; my Master's from Georgia Tech in 2017 (where I was in the Ph.D. program and worked with Prof. Karsten Schwan); and my Bachelor's from Tsinghua University in 2015. I transferred to Emory in 2018 as a post-qualified Ph.D. student.
Experience
2023/9–current
Research Scientist, ByteDance Inc.
2022/1–2023/8
Postdoc, Princeton University
2018/5–2018/8
Ph.D. Intern, Facebook Inc.
Services
Program Committee
ACM EuroSys'26
Program Committee
USENIX ATC'25
External Program Committee
ACM SIGMETRICS'23
Program Committee
ACM SoCC'22, '23, '24
Awards
Best Student Paper Award
ACM SIGMETRICS’20
Bronze medal
24th, 25th China Mathematical Olympiad
Talks
Enhancing Runtime Reliability in LLM Training via Fine-Grained Observability
Publications
Measurement and Analysis Methods of Performance Problems in Distributed Systems
Emory University, 2021
Thesis
Toward Bandwidth-adaptive Fully-Immersive Volumetric Video Conferencing
Rajrup Ghosh, Christina Suyong Shin, Lei Zhang, Muyang Ye, Tao Jin, Harsha V. Madhyastha, Ravi Netravali, Antonio Ortega, Sanjay Rao, Anthony Rowe, Ramesh Govindan
In ACM CoNEXT 2025
Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training
Yangtao Deng*, Lei Zhang*, Qinlong Wang, Xiaoyun Zhi, Xinlei Zhang, Zhuo Jiang, Haohan Xu, Lei Wang, Zuquan Song, Gaohong Liu, Yang Bai, Shuguang Wang, W. Xiao, Jianxi Ye, Minlan Yu, Hong Xu
In ACM SOSP 2025
Paper Slides
Lumos: Lightweight Provenance-Guided Online Debugging
Jingyuan Chen, Lei Zhang, Gongqi Huang, Ravi Netravali, Amit Levy
In USENIX OSDI 2025 (Poster)
Minder: Faulty Machine Detection for Large-scale Distributed Model Training
Yangtao Deng, Xiang Shi, Zhuo Jiang, Xingjian Zhang, Lei Zhang, Zhang Zhang, Bo Li, Zuquan Song, Hang Zhu, Gaohong Liu, Fuliang Li, Shuguang Wang, Haibin Lin, Jianxi Ye, Minlan Yu
In USENIX NSDI 2025
Paper Slides Video
LatenSeer: Causal Modeling of End-to-End Latency Distributions by Harnessing Distributed Tracing
Yazhuo Zhang, Rebecca Isaacs, Yao Yue, Juncheng Yang, Lei Zhang, Ymir Vigfusson
In ACM SoCC 2023
Paper Code
The Benefit of Hindsight: Tracing Edge-Cases in Distributed Systems
Lei Zhang, Zhiqiang Xie, Vaastav Anand, Ymir Vigfusson, Jonathan Mace
In USENIX NSDI 2023
Paper Slides Code Benchmark Code Video
When is the Cache Warm? Manufacturing a Rule of Thumb
Lei Zhang, Juncheng Yang, Anna Blasiak, Mike McCall, Ymir Vigfusson
In USENIX HotCloud 2020
Paper Slides Video
Optimal Data Placement for Heterogeneous Cache, Memory, and Storage Systems
Lei Zhang, Reza Karimi, Irfan Ahmad, Ymir Vigfusson
In ACM SIGMETRICS 2020
Paper Video
Kenneth C. Sevcik Outstanding Student Paper Award
Deceptive Secret Sharing
Lei Zhang, Douglas Blough
In IEEE International Conference on Dependable Systems and Networks (DSN) 2018
Paper
Systematic Data Placement Optimization in Multi-Cloud Storage for Complex Requirements
Maomeng Su, Lei Zhang, Yongwei Wu, Kang Chen, Keqin Li
In IEEE Transactions on Computers 2016, 65(6): 1964-1977.
Paper
Under Review:
A Lightweight Telemetry System with Service Tracing for Locating Network Slowdowns
Automatic Instrumentation for Fine-grained Observability in Distributed Systems