ROCm/mori: Modular RDMA Interface

News

Introduction

MORI (Modular RDMA Interface) is a bottom-up, modular, and composable framework for building high-performance communication applications with a strong focus on RDMA + GPU integration. Inspired by the role of MLIR in compiler infrastructure, MORI provides reusable and extensible building blocks that make it easier for developers to adopt advanced techniques such as IBGDA (InfiniBand GPUDirect Async) and GDS (GPUDirect Storage).

To help developers get started quickly, MORI also includes a suite of optimized libraries: MORI-EP (MoE dispatch & combine kernels), MORI-IO (p2p communication for KVCache transfer), and MORI-CCL (collective communication). These deliver out-of-the-box performance, with support for AMD Pensando DSC, Broadcom Thor2, and NVIDIA Mellanox ConnectX-7 NICs.

Features summary

Documentation

| Topic | Description | Guide |
|---|---|---|
| MORI-EP | Dispatch/combine API, kernel types, configuration, usage examples | EP Guide |
| MORI-SHMEM | Symmetric memory APIs, initialization, memory management | Shmem Guide |
| MORI-IR | Device bitcode integration for Triton and other GPU kernel frameworks | IR Guide |
| MORI-IO | P2P communication concepts, engine/backend/session design | IO Guide |
| MORI-VIZ | Warp-level kernel profiler with Perfetto integration | Profiler |

Benchmarks

MORI-EP

Benchmark on DeepSeek V3 model configurations:

Bandwidth (4096 tokens, 7168 hidden, top-8 experts, FP8 dispatch + BF16 combine)

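| Hardware | Kernels | Dispatch XGMI | Dispatch RDMA | Combine XGMI | Combine RDMA |
|---|---|---|---|---|---|
| MI300X + CX7 | EP8 | 307 GB/s | x | 330 GB/s | x |
| MI300X + CX7 | EP16-V1 | 171 GB/s | 52 GB/s | 219 GB/s | 67 GB/s |
| MI300X + CX7 | EP32-V1 | 103 GB/s* | 57 GB/s* | 91 GB/s* | 50 GB/s* |
| MI355X + AINIC | EP8 | 345 GB/s | x | 420 GB/s | x |
| MI355X + AINIC | EP16-V1 | 179 GB/s | 54 GB/s | 234 GB/s | 71 GB/s |
| MI355X + AINIC | EP32-V1 | 85 GB/s | 46 GB/s | 110 GB/s | 61 GB/s |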

Latency (128 tokens, 7168 hidden, top-8 experts, FP8 dispatch + BF16 combine)

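| Hardware | Kernels | Dispatch Latency | Dispatch BW | Combine Latency | Combine BW |
|---|---|---|---|---|---|
| MI300X + CX7 | EP8 | 35 us | 134 GB/s | 47 us | 204 GB/s |
| MI300X + CX7 | EP16-V1-LL | 76 us | 96 GB/s | 122 us | 121 GB/s |
| MI300X + CX7 | EP32-V1-LL | 157 us* | 48 GB/s* | 280 us* | 55 GB/s* |
| MI355X + AINIC | EP8 | 31 us | 142 GB/s | 36 us | 276 GB/s |
| MI355X + AINIC | EP16-V1-LL | 84 us | 87 GB/s | 108 us | 139 GB/s |
| MI355X + AINIC | EP32-V1-LL | 152 us | 45 GB/s | 187 us | 76 GB/s |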

* Stale data from previous kernel version; updated numbers pending re-benchmarking.
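
For a rough sense of scale, the raw activation payload in the two configurations above can be worked out from the token count, hidden size, and element width. This is only a back-of-the-envelope sketch: it assumes 1 byte per FP8 element and 2 bytes per BF16 element, and it ignores FP8 scale factors, routing metadata, and top-8 replication. It is not the formula the benchmark uses to report bandwidth, and `payload_bytes` is a hypothetical helper, not part of MORI.

```python
# Raw activation payload per rank, before any top-k replication.
# Assumptions (not taken from the benchmark source): FP8 = 1 byte/element,
# BF16 = 2 bytes/element; scale factors and metadata are ignored.
HIDDEN = 7168
BYTES_PER_ELEM = {"fp8": 1, "bf16": 2}


def payload_bytes(tokens: int, dtype: str, hidden: int = HIDDEN) -> int:
    """Bytes needed to hold `tokens` activation vectors of width `hidden`."""
    return tokens * hidden * BYTES_PER_ELEM[dtype]


if __name__ == "__main__":
    for tokens in (4096, 128):
        for phase, dtype in (("dispatch", "fp8"), ("combine", "bf16")):
            mib = payload_bytes(tokens, dtype) / 2**20
            print(f"{tokens:>5} tokens, {phase:<8} ({dtype}): {mib:.2f} MiB")
```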

MORI-IO

NOTE: This is the preview version of MORI-IO benchmark performance.

GPU Direct RDMA READ, pairwise, 128 consecutive transfers, 1 GPU, MI300X + Thor2:

+--------------------------------------------------------------------------------------------------------+
|                                            Initiator Rank 0                                            |
+-------------+-----------+----------------+---------------+---------------+--------------+--------------+
| MsgSize (B) | BatchSize | TotalSize (MB) | Max BW (GB/s) | Avg Bw (GB/s) | Min Lat (us) | Avg Lat (us) |
+-------------+-----------+----------------+---------------+---------------+--------------+--------------+
|      8      |    128    |      0.00      |      0.03     |      0.03     |    33.38     |    36.33     |
|      16     |    128    |      0.00      |      0.06     |      0.06     |    34.09     |    36.35     |
|      32     |    128    |      0.00      |      0.12     |      0.11     |    34.57     |    36.33     |
|      64     |    128    |      0.01      |      0.24     |      0.23     |    33.62     |    36.33     |
|     128     |    128    |      0.02      |      0.49     |      0.45     |    33.62     |    36.49     |
|     256     |    128    |      0.03      |      0.94     |      0.89     |    34.81     |    36.99     |
|     512     |    128    |      0.07      |      1.86     |      1.77     |    35.29     |    37.01     |
|     1024    |    128    |      0.13      |      3.84     |      3.53     |    34.09     |    37.09     |
|     2048    |    128    |      0.26      |      7.33     |      6.96     |    35.76     |    37.65     |
|     4096    |    128    |      0.52      |     12.94     |     12.46     |    40.53     |    42.09     |
|     8192    |    128    |      1.05      |     20.75     |     20.12     |    50.54     |    52.11     |
|    16384    |    128    |      2.10      |     29.03     |     28.33     |    72.24     |    74.02     |
|    32768    |    128    |      4.19      |     36.50     |     35.91     |    114.92    |    116.81    |
|    65536    |    128    |      8.39      |     41.74     |     41.39     |    200.99    |    202.70    |
|    131072   |    128    |     16.78      |     45.14     |     44.85     |    371.69    |    374.10    |
|    262144   |    128    |     33.55      |     46.93     |     46.76     |    715.02    |    717.56    |
|    524288   |    128    |     67.11      |     47.94     |     47.81     |   1399.99    |   1403.64    |
|   1048576   |    128    |     134.22     |     48.44     |     48.32     |   2770.90    |   2777.76    |
+-------------+-----------+----------------+---------------+---------------+--------------+--------------+
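
The columns in this table appear to be related by simple arithmetic: TotalSize is MsgSize times BatchSize (in decimal MB), and each bandwidth figure is that total divided by the matching latency (Max BW with Min Lat, Avg BW with Avg Lat). The sketch below reproduces a few rows under that assumption. The relationship is inferred from the printed numbers, not from the benchmark script, and the function names are illustrative.

```python
# Reproduce TotalSize and bandwidth columns from MsgSize, BatchSize, latency.
# Assumption: decimal units (1 MB = 1e6 B, 1 GB = 1e9 B), as the table suggests.

def total_size_mb(msg_size_b: int, batch_size: int) -> float:
    """Total bytes moved by one batch, in decimal megabytes."""
    return msg_size_b * batch_size / 1e6


def bandwidth_gbs(msg_size_b: int, batch_size: int, latency_us: float) -> float:
    """Bytes per microsecond scaled to GB/s (1 B/us == 1e-3 GB/s)."""
    return msg_size_b * batch_size / latency_us * 1e-3


if __name__ == "__main__":
    # (MsgSize, BatchSize, Min Lat, Avg Lat) copied from rows of the table.
    rows = [
        (4096, 128, 40.53, 42.09),
        (1048576, 128, 2770.90, 2777.76),
    ]
    for msg, batch, min_lat, avg_lat in rows:
        print(
            f"msg={msg:>8} B  total={total_size_mb(msg, batch):7.2f} MB  "
            f"max_bw={bandwidth_gbs(msg, batch, min_lat):6.2f} GB/s  "
            f"avg_bw={bandwidth_gbs(msg, batch, avg_lat):6.2f} GB/s"
        )
```

At the largest message size the link is close to saturated (about 48 GB/s), while below roughly 2 KB the latency stays near 36 us, so bandwidth scales almost linearly with message size.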

Hardware Support Matrix

GPU

| GPU | MORI-EP | MORI-IO | MORI-SHMEM |
|---|---|---|---|
| MI308X | ✅ | ✅ | ✅ |
| MI300X | ✅ | ✅ | ✅ |
| MI325X | ✅ | ✅ | ✅ |
| MI355X | ✅ | ✅ | ✅ |
| MI450X | 🚧 | 🚧 | 🚧 |

NIC

| NIC | MORI-EP | MORI-IO | MORI-SHMEM |
|---|---|---|---|
| Pollara | ✅ | ✅ | ✅ |
| CX7 | ✅ | ✅ | ✅ |
| Thor2 | ✅ | ✅ | ✅ |
| Volcano | 🚧 | 🚧 | 🚧 |

✅ Supported, 🚧 Under Development

Installation

Prerequisites

Or build the Docker image with:

cd mori && docker build -t rocm/mori:dev -f docker/Dockerfile.dev .

IBGDA NIC support (optional, for GPU-direct RDMA; auto-detected, no manual configuration needed):

| NIC | User library |
|---|---|
| AMD Pollara (AINIC) | libionic.so |
| Mellanox ConnectX | libmlx5.so (typically pre-installed) |
| Broadcom Thor2 | libbnxt_re.so |

Note: IBGDA requires vendor-specific DV (Direct Verbs) libraries. Mellanox libmlx5 is typically pre-installed with the kernel OFED stack. For Thor2 and Pollara, install the corresponding userspace library from your NIC vendor.
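
To see which of these userspace libraries the system loader can find, a quick check with Python's standard `ctypes.util.find_library` may help. This is a sketch, not MORI's own detection logic: it only consults the default loader search path, so a library installed in a non-standard location may be reported as missing even though MORI can use it. The `find_ibgda_libs` helper and the dictionary are illustrative.

```python
from ctypes.util import find_library

# Library stems taken from the table above ("libmlx5.so" -> "mlx5").
# The NIC labels are just dictionary keys for readable output.
IBGDA_LIBS = {
    "AMD Pollara (AINIC)": "ionic",
    "Mellanox ConnectX": "mlx5",
    "Broadcom Thor2": "bnxt_re",
}


def find_ibgda_libs(libs=IBGDA_LIBS):
    """Map each NIC label to the library name the loader resolves, or None."""
    return {nic: find_library(stem) for nic, stem in libs.items()}


if __name__ == "__main__":
    for nic, resolved in find_ibgda_libs().items():
        status = resolved if resolved else "not found on default loader path"
        print(f"{nic:<22} {status}")
```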

Install

MORI can be installed in three ways: from PyPI (stable), nightly pre-built wheels (latest dev), or from source.

From PyPI (stable release)

Nightly (pre-built, tested daily)

From PyPI

pip install --pre amd-mori-nightly

Or from GitHub Pages

pip install --no-index --force-reinstall --find-links https://rocm.github.io/mori/nightly/latest/ amd_mori

Browse all nightly builds: https://rocm.github.io/mori/nightly/

Note: amd-mori and amd-mori-nightly both provide the mori Python module. Do not install both at the same time; uninstall one before installing the other.
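
One way to check for this conflict is to ask `importlib.metadata` which of the two distributions is present in the current environment. The sketch below uses only the standard library and the distribution names from the note above. `installed_dists` is a hypothetical helper, not something MORI ships.

```python
from importlib import metadata

# Distribution names as given in the note above.
MORI_DISTS = ("amd-mori", "amd-mori-nightly")


def installed_dists(names=MORI_DISTS):
    """Return {distribution name: version} for each name that is installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # not installed in this environment
    return found


if __name__ == "__main__":
    found = installed_dists()
    if len(found) > 1:
        print("Conflict: both distributions are installed:", found)
    elif found:
        print("OK:", found)
    else:
        print("Neither amd-mori nor amd-mori-nightly is installed.")
```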

From source

NOTE: when building inside a venv, append --no-build-isolation to the pip command.

cd mori && pip install .

No hipcc is needed at install time: host code compiles with a standard C++ compiler. GPU kernels are JIT-compiled on first use and cached to ~/.mori/jit/. If a GPU is detected during install, kernel precompilation starts automatically in the background.

To manually precompile all kernels (e.g. in a Docker image build):

MORI_PRECOMPILE=1 python -c "import mori"
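
To confirm that precompilation actually populated the cache (for example, as a sanity step in an image build), one option is to count the files under ~/.mori/jit/. The sketch below assumes nothing about the cache layout beyond the directory path mentioned above. It walks the tree and reports file count and total size. `jit_cache_summary` is an illustrative helper.

```python
from pathlib import Path

# Default cache location as stated above; the internal layout is not assumed.
DEFAULT_JIT_CACHE = Path.home() / ".mori" / "jit"


def jit_cache_summary(cache_dir=DEFAULT_JIT_CACHE):
    """Return (file_count, total_bytes) for everything under cache_dir."""
    cache_dir = Path(cache_dir).expanduser()
    if not cache_dir.is_dir():
        return 0, 0  # cache not created yet
    files = [p for p in cache_dir.rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


if __name__ == "__main__":
    count, size = jit_cache_summary()
    print(f"{count} cached file(s), {size / 2**20:.1f} MiB in {DEFAULT_JIT_CACHE}")
```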

Verify installation

python -c "import mori; print(mori.__version__)"

Testing

Test MORI-EP (dispatch / combine)

cd /path/to/mori
export PYTHONPATH=/path/to/mori:$PYTHONPATH
python -c "import mori; print(mori.__file__)"

Test correctness (8 GPUs)

pytest tests/python/ops/test_dispatch_combine_intranode.py -q
pytest tests/python/ops/test_dispatch_combine_async_ll.py -q
pytest tests/python/ops/test_dispatch_combine_internode_v1.py -q

Benchmark performance

python tests/python/ops/bench_dispatch_combine.py

Test MORI-IO

cd /path/to/mori
export PYTHONPATH=/path/to/mori:$PYTHONPATH

Correctness tests

pytest tests/python/io/

Benchmark performance (two nodes)

export GLOO_SOCKET_IFNAME=ens14np0
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 --master_addr="10.194.129.65" --master_port=1234 \
    tests/python/io/benchmark.py --host="10.194.129.65" --enable-batch-transfer --enable-sess --buffer-size 32768 --transfer-batch-size 128
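
The command above is for node rank 0 only. A small helper can generate the matching command line for each node so the two stay in sync. This is a sketch under assumptions: it reuses only the flags shown above, varies `--node_rank` as torchrun expects, and takes `--host` as an explicit argument, because the value to use on the second node is not stated here. `benchmark_cmd` is a hypothetical helper.

```python
import shlex


def benchmark_cmd(node_rank, host, master_addr="10.194.129.65",
                  master_port=1234, nnodes=2):
    """Build the torchrun command line for one node of the MORI-IO benchmark.

    `host` is passed through to --host unchanged; which address the second
    node should use is an assumption the caller must supply.
    """
    args = [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        "--nproc_per_node=1",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        "tests/python/io/benchmark.py",
        f"--host={host}",
        "--enable-batch-transfer",
        "--enable-sess",
        "--buffer-size", "32768",
        "--transfer-batch-size", "128",
    ]
    return shlex.join(args)  # quotes arguments safely for a POSIX shell


if __name__ == "__main__":
    # Reproduces the node-0 command from the text above.
    print(benchmark_cmd(node_rank=0, host="10.194.129.65"))
```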

Test MORI-IR (Triton + shmem integration; see the IR Guide)

Basic shmem put (2 GPUs)

torchrun --nproc_per_node=2 examples/shmem/ir/test_triton_shmem.py

Allreduce (8 GPUs)

torchrun --nproc_per_node=8 examples/shmem/ir/test_triton_allreduce.py

Contribution Guide

Welcome to MORI! We appreciate your interest in contributing. Whether you're fixing bugs, adding features, improving documentation, or sharing feedback, your contributions help make MORI better for everyone.

Code Quality

MORI uses pre-commit hooks to maintain code quality. After cloning the repository:

pip install pre-commit
cd /path/to/mori
pre-commit install

Run on all files (first time)

pre-commit run --all-files

Pre-commit automatically checks code formatting, linting, and license headers, among other quality checks, on each commit. To skip the checks when necessary: git commit --no-verify