Efficient Direct-Connect Topologies for Collective Communications
Abstract: We consider the problem of distilling efficient network topologies for collective communications. We provide an algorithmic framework for constructing direct-connect topologies optimized for the latency vs. bandwidth trade-off associated with the workload. Our approach synthesizes many different topologies and schedules for a given cluster size and degree and then identifies the appropriate topology and schedule for a given workload. Our algorithms start from small, optimal base topologies and associated communication schedules and use techniques that can be iteratively applied to derive much larger topologies and schedules. Additionally, we incorporate well-studied large-scale graph topologies into our algorithmic framework by producing efficient collective schedules for them using a novel polynomial-time algorithm. Our evaluation uses multiple testbeds and large-scale simulations to demonstrate significant performance benefits from our derived topologies and schedules.
Submission history
From: Liangyu Zhao
[v1] Mon, 7 Feb 2022 16:59:05 UTC (1,652 KB)
[v2] Tue, 30 May 2023 01:07:56 UTC (2,761 KB)
[v3] Sat, 23 Sep 2023 23:23:47 UTC (753 KB)
[v4] Tue, 26 Sep 2023 19:40:47 UTC (753 KB)
[v5] Mon, 13 May 2024 00:07:22 UTC (775 KB)