feat(amd): EP intra-node normal and low-latency kernels with mori shmem by jhchouuu · Pull Request #164 · ByteDance-Seed/Triton-distributed

Implement EP intra-node dispatch/combine kernels using mori shmem P2P (putmem_signal_warp) on AMD MI325X
Add Low Latency EP v1 (raw all-to-all) and v2 (online FP8 quant + combine with topk weighted reduce)
Fix shfl_up/shfl_down_sync implementation and golden reference calculation in test_language_extra.py
Fix mixed-bitwidth ld/st implementation and add kernel test coverage
Update mori submodule to main with JIT bitcode compilation, replacing manual hipcc/llvm-link build
Simplify build_mori_shmem.sh to use mori JIT (mori.ir.bitcode.find_bitcode())
Add AlgoBW and BusBW metrics to EP A2A benchmark output
Add CI tests for EP A2A (correctness + perf) and LL v2 (correctness + perf, M=64/128)
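
For readers unfamiliar with the AlgoBW and BusBW columns, here is a minimal sketch of how such metrics are commonly derived for an all-to-all. It assumes the nccl-tests convention, where bus bandwidth is algorithm bandwidth scaled by (n-1)/n. The function name, its parameters, and the byte-counting rule are illustrative assumptions and may differ from what the benchmark in this PR actually does.

```python
def a2a_bandwidth(bytes_per_rank: int, elapsed_s: float, world_size: int) -> tuple[float, float]:
    """Return (algbw, busbw) in GB/s for one all-to-all call.

    Hypothetical helper, not taken from this PR.
    bytes_per_rank: payload bytes each rank sends in the call (assumed definition).
    elapsed_s:      measured wall time of the call in seconds.
    world_size:     number of participating ranks (n).
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    if world_size < 1:
        raise ValueError("world_size must be at least 1")

    # Algorithm bandwidth: payload size divided by time, in GB/s (1 GB = 1e9 bytes).
    algbw = bytes_per_rank / elapsed_s / 1e9

    # Bus bandwidth (nccl-tests convention for all-to-all): 1/n of the payload
    # stays on the local rank, so only (n-1)/n of it crosses the interconnect.
    busbw = algbw * (world_size - 1) / world_size
    return algbw, busbw


if __name__ == "__main__":
    # Made-up numbers for illustration only: 1 GB per rank in 0.5 s on 8 ranks.
    algbw, busbw = a2a_bandwidth(10**9, 0.5, 8)
    print(f"AlgoBW = {algbw:.2f} GB/s, BusBW = {busbw:.2f} GB/s")
```

With EP dispatch/combine the bytes actually moved depend on token routing, so the value passed as `bytes_per_rank` should be whatever the benchmark counts as payload.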

Co-authored-by: Wu, Yutong <yutong.wu@amd.com>

AI review requested due to automatic review settings

March 27, 2026 05:53

Conflicts:

3rdparty/mori

Copilot AI review requested due to automatic review settings

April 13, 2026 02:48
