feat(amd): EP intra-node normal and low-latency kernels with mori shmem by jhchouuu · Pull Request #164 · ByteDance-Seed/Triton-distributed
- Implement EP intra-node dispatch/combine kernels using mori shmem P2P (putmem_signal_warp) on AMD MI325X
- Add Low Latency EP v1 (raw all-to-all) and v2 (online FP8 quant + combine with topk weighted reduce)
- Fix shfl_up/shfl_down_sync implementation and golden reference calculation in test_language_extra.py
- Fix mixed-bitwidth ld/st implementation and add kernel test coverage
- Update mori submodule to main with JIT bitcode compilation, replacing manual hipcc/llvm-link build
- Simplify `build_mori_shmem.sh` to use mori JIT (`mori.ir.bitcode.find_bitcode()`)
- Add AlgoBW and BusBW metrics to EP A2A benchmark output
- Add CI tests for EP A2A (correctness + perf), LL v2 (correctness + perf M=64/128)
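
For readers unfamiliar with the two bandwidth metrics, here is a minimal sketch of how AlgoBW and BusBW are commonly derived for an all-to-all. It assumes the nccl-tests convention (BusBW = AlgoBW × (n − 1) / n); the exact formula used by the benchmark in this PR is not stated here, and the function name `a2a_bandwidth` and its parameters are hypothetical.

```python
def a2a_bandwidth(bytes_per_rank: int, elapsed_s: float, world_size: int) -> tuple[float, float]:
    """Return (algo_bw, bus_bw) in GB/s for a single all-to-all call.

    Assumption: follows the nccl-tests convention, not necessarily the
    formula used in this PR's benchmark.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    if world_size < 1:
        raise ValueError("world_size must be at least 1")

    # Algorithm bandwidth: payload each rank sends, divided by wall time.
    algo_bw = bytes_per_rank / elapsed_s / 1e9

    # Bus bandwidth: discount the 1/n share of the payload that stays on
    # the local rank and never crosses the interconnect.
    bus_bw = algo_bw * (world_size - 1) / world_size
    return algo_bw, bus_bw


# Example: 8 ranks, each sending 1e9 bytes in 10 ms.
algo_bw, bus_bw = a2a_bandwidth(bytes_per_rank=1_000_000_000, elapsed_s=0.01, world_size=8)
print(f"AlgoBW={algo_bw:.2f} GB/s  BusBW={bus_bw:.2f} GB/s")
```

BusBW is the figure to compare against link peak, since it excludes the local share of the payload and so stays comparable across different rank counts.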
Co-authored-by: Wu, Yutong <yutong.wu@amd.com>
Conflicts:
3rdparty/mori
Copilot AI review requested due to automatic review settings