Distributed Data Parallel (DDP) Training on PyTorch with AMD GPUs (ROCm) and RCCL test hangs · Issue #1129 · ROCm/rccl
Problem Description
I have an Ubuntu 22.04 machine with two AMD MI100 GPUs installed. When I run a PyTorch training script using DDP with backend="nccl" (which under the hood should use rccl), the script hangs with GPU use at 100%, but without the expected GPU temperature buildup.
At first I thought it was related to my PyTorch installation, but when I tried the all_reduce_perf test from rccl-tests I observed the same behaviour: the test hangs with GPU use at 100%, without the expected GPU temperature buildup.
Output of all_reduce_perf:
$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:3f7f785
# Using devices
# Rank 0 Pid 21284 on deep-visionscaper3 device 0 [0000:2f:00.0] AMD Instinct MI100
# Rank 1 Pid 21284 on deep-visionscaper3 device 1 [0000:03:00.0] AMD Instinct MI100
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
Nothing happens after this initial output.
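For anyone trying to reproduce the hang, a wrapper that enforces a timeout and turns on RCCL logging can be handy. The following is only a sketch: the helper name run_with_timeout and the timeout value are hypothetical, while NCCL_DEBUG=INFO is the logging variable that RCCL inherits from NCCL.

```python
import os
import subprocess


def run_with_timeout(cmd, timeout_s, extra_env=None):
    """Run cmd and return (hung, output).

    hung is True when the command did not finish within timeout_s seconds.
    extra_env is merged into the current environment (e.g. NCCL_DEBUG=INFO).
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    try:
        proc = subprocess.run(
            cmd,
            env=env,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # interleave RCCL log lines with the test output
            text=True,
            timeout=timeout_s,
        )
        return False, proc.stdout
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before raising; keep whatever was printed.
        partial = exc.output or ""
        if isinstance(partial, bytes):  # can be bytes even when text=True
            partial = partial.decode(errors="replace")
        return True, partial
```

A call such as run_with_timeout(["./build/all_reduce_perf", "-b", "8", "-e", "128M", "-f", "2", "-g", "2"], timeout_s=120, extra_env={"NCCL_DEBUG": "INFO"}) would then return hung=True together with the partial log if the test stalls after the header, as it does here.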
Output of rocm-smi:
$ rocm-smi --alldevices -f -P -t -u -g
============================ ROCm System Management Interface ============================
====================================== Temperature =======================================
GPU[0] : Temperature (Sensor edge) (C): 63.0
GPU[0] : Temperature (Sensor junction) (C): 76.0
GPU[0] : Temperature (Sensor memory) (C): 61.0
GPU[1] : Temperature (Sensor edge) (C): 58.0
GPU[1] : Temperature (Sensor junction) (C): 70.0
GPU[1] : Temperature (Sensor memory) (C): 56.0
==========================================================================================
=============================== Current clock frequencies ================================
GPU[0] : sclk clock level: 15 (1502Mhz)
GPU[1] : sclk clock level: 15 (1502Mhz)
==========================================================================================
=================================== Current Fan Metric ===================================
GPU[0] : Not supported
GPU[1] : Not supported
==========================================================================================
=================================== Power Consumption ====================================
GPU[0] : Average Graphics Package Power (W): 114.0
GPU[1] : Average Graphics Package Power (W): 107.0
==========================================================================================
=================================== % time GPU is busy ===================================
GPU[0] : GPU use (%): 100
GPU[1] : GPU use (%): 100
==========================================================================================
================================== End of ROCm SMI Log ===================================
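To watch this symptom over time rather than eyeballing the log, the numeric lines of the rocm-smi output above can be parsed into a dictionary per GPU. This is a rough sketch that assumes the line layout shown in this log; the function names and the power threshold are hypothetical, and the "busy but low power" check is only a heuristic, not something rocm-smi reports itself.

```python
import re

# Matches numeric lines such as "GPU[0] : GPU use (%): 100".
# Lines like "GPU[0] : Not supported" or "sclk clock level: 15 (1502Mhz)" are skipped.
LINE_RE = re.compile(r"^GPU\[(\d+)\]\s*:\s*(.+?):\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_rocm_smi(text):
    """Return {gpu_index: {metric_name: value}} for purely numeric metric lines."""
    gpus = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            index = int(match.group(1))
            name = match.group(2).strip()
            gpus.setdefault(index, {})[name] = float(match.group(3))
    return gpus


def busy_but_low_power(metrics, power_threshold_w):
    """Heuristic: GPU reports ~100% use while drawing less than power_threshold_w."""
    use = metrics.get("GPU use (%)")
    power = metrics.get("Average Graphics Package Power (W)")
    if use is None or power is None:
        return False
    return use >= 99 and power < power_threshold_w
```

Feeding it the captured output of the rocm-smi command above gives, for example, a "GPU use (%)" value of 100.0 and a power value of 114.0 for GPU 0, so the pattern described in this issue can be logged periodically while the test is stuck.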
The version of rccl I have installed:
$ apt list --installed | grep rccl
rccl-dev/jammy,now 2.18.3.60002-115~22.04 amd64 [installed,automatic]
rccl/jammy,now 2.18.3.60002-115~22.04 amd64 [installed,automatic]
Am I missing anything in my installation?
Operating System
Ubuntu 22.04.4 LTS (Jammy Jellyfish)
CPU
AMD Ryzen Threadripper PRO 5975WX 32-Cores
GPU
AMD Instinct MI100
ROCm Version
ROCm 6.0.0
ROCm Component
rccl
Steps to Reproduce
Run ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2 from the rccl-tests repo. It should perform the all_reduce test without blocking, but instead it hangs after printing the table header.
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
$ /opt/rocm/bin/rocminfo --support
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen Threadripper PRO 5975WX 32-Cores
Uuid: CPU-XX
Marketing Name: AMD Ryzen Threadripper PRO 5975WX 32-Cores
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3600
BDFID: 0
Internal Node ID: 0
Compute Unit: 64
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 527960576(0x1f780a00) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 527960576(0x1f780a00) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 527960576(0x1f780a00) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx908
Uuid: GPU-6fdb95cab945d4fe
Marketing Name: AMD Instinct MI100
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 29580(0x738c)
ASIC Revision: 2(0x2)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1502
BDFID: 12032
Internal Node ID: 1
Compute Unit: 120
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 65
SDMA engine uCode:: 18
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*******
Agent 3
*******
Name: gfx908
Uuid: GPU-c088ca837a22409d
Marketing Name: AMD Instinct MI100
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 29580(0x738c)
ASIC Revision: 2(0x2)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1502
BDFID: 768
Internal Node ID: 2
Compute Unit: 120
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 65
SDMA engine uCode:: 18
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
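To match the rocminfo agents to the ranks in the rccl-tests output, the BDFID values can be decoded into PCI addresses. The sketch below assumes the usual PCI packing (bus in bits 8 to 15, device in bits 3 to 7, function in bits 0 to 2) and a PCI domain of 0; decode_bdfid is a hypothetical helper, not part of any ROCm tool.

```python
def decode_bdfid(bdfid, domain=0):
    """Convert a rocminfo BDFID integer into a domain:bus:device.function string.

    Assumes the standard PCI encoding: (bus << 8) | (device << 3) | function.
    """
    bus = (bdfid >> 8) & 0xFF
    device = (bdfid >> 3) & 0x1F
    function = bdfid & 0x7
    return f"{domain:04x}:{bus:02x}:{device:02x}.{function}"


# BDFIDs of the two MI100 agents from the rocminfo output above.
print(decode_bdfid(12032))  # 0000:2f:00.0
print(decode_bdfid(768))    # 0000:03:00.0
```

Under that assumption, Agent 2 (BDFID 12032) is the device rccl-tests lists as Rank 0 at [0000:2f:00.0], and Agent 3 (BDFID 768) is Rank 1 at [0000:03:00.0].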
Additional Information
$ echo "OS:" && cat /etc/os-release | grep -E "^(NAME=|VERSION=)";
echo "CPU: " && cat /proc/cpuinfo | grep "model name" | sort --unique;
echo "GPU:" && /opt/rocm/bin/rocminfo | grep -E "^\s*(Name|Marketing Name)";
OS:
NAME="Ubuntu"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
CPU:
model name : AMD Ryzen Threadripper PRO 5975WX 32-Cores
GPU:
Name: AMD Ryzen Threadripper PRO 5975WX 32-Cores
Marketing Name: AMD Ryzen Threadripper PRO 5975WX 32-Cores
Name: gfx908
Marketing Name: AMD Instinct MI100
Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
Name: gfx908
Marketing Name: AMD Instinct MI100
Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
$ apt show rocm-libs -a
Package: rocm-libs
Version: 6.0.2.60002-115~22.04
Priority: optional
Section: devel
Maintainer: ROCm Dev Support <rocm-dev.support@amd.com>
Installed-Size: 13.3 kB
Depends: hipblas (= 2.0.0.60002-115~22.04), hipblaslt (= 0.6.0.60002-115~22.04), hipfft (= 1.0.13.60002-115~22.04), hipsolver (= 2.0.0.60002-115~22.04), hipsparse (= 3.0.0.60002-115~22.04), hiptensor (= 1.1.0.60002-115~22.04), miopen-hip (= 3.00.0.60002-115~22.04), half (= 1.12.0.60002-115~22.04), rccl (= 2.18.3.60002-115~22.04), rocalution (= 3.0.3.60002-115~22.04), rocblas (= 4.0.0.60002-115~22.04), rocfft (= 1.0.25.60002-115~22.04), rocrand (= 3.0.0.60002-115~22.04), hiprand (= 2.10.16.60002-115~22.04), rocsolver (= 3.24.0.60002-115~22.04), rocsparse (= 3.0.2.60002-115~22.04), rocm-core (= 6.0.2.60002-115~22.04), composablekernel-dev (= 1.1.0.60002-115~22.04), hipblas-dev (= 2.0.0.60002-115~22.04), hipblaslt-dev (= 0.6.0.60002-115~22.04), hipcub-dev (= 3.0.0.60002-115~22.04), hipfft-dev (= 1.0.13.60002-115~22.04), hipsolver-dev (= 2.0.0.60002-115~22.04), hipsparse-dev (= 3.0.0.60002-115~22.04), hiptensor-dev (= 1.1.0.60002-115~22.04), miopen-hip-dev (= 3.00.0.60002-115~22.04), rccl-dev (= 2.18.3.60002-115~22.04), rocalution-dev (= 3.0.3.60002-115~22.04), rocblas-dev (= 4.0.0.60002-115~22.04), rocfft-dev (= 1.0.25.60002-115~22.04), rocprim-dev (= 3.0.0.60002-115~22.04), rocrand-dev (= 3.0.0.60002-115~22.04), hiprand-dev (= 2.10.16.60002-115~22.04), rocsolver-dev (= 3.24.0.60002-115~22.04), rocsparse-dev (= 3.0.2.60002-115~22.04), rocthrust-dev (= 3.0.0.60002-115~22.04), rocwmma-dev (= 1.3.0.60002-115~22.04)
Homepage: https://github.com/RadeonOpenCompute/ROCm
Download-Size: 1050 B
APT-Sources: https://repo.radeon.com/rocm/apt/6.0.2 jammy/main amd64 Packages
Description: Radeon Open Compute (ROCm) Runtime software stack
$ apt show rccl -a
Package: rccl
Version: 2.18.3.60002-115~22.04
Priority: optional
Section: devel
Maintainer: RCCL Maintainer <rccl-maintainer@amd.com>
Installed-Size: 520 MB
Depends: hip-rocclr (>= 3.5.0), rocm-smi-lib (>= 4.0.0), rocm-core, libc6 (>= 2.34), libgcc-s1 (>= 3.0), libstdc++6 (>= 11)
Recommends: rccl-dev (>=2.18.3.60002)
Download-Size: 13.7 MB
APT-Manual-Installed: no
APT-Sources: https://repo.radeon.com/rocm/apt/6.0.2 jammy/main amd64 Packages
Description: ROCm Communication Collectives Library
ROCm Communication Collectives Library
Optimized primitives for collective multi-GPU communication
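As a cross-check on the versions above, the package version strings can be split into the library version and the ROCm release they were built for. This sketch assumes the five-digit field encodes the ROCm release as major*10000 + minor*100 + patch, which is consistent with the 6.0.2 apt source shown here; the function name is hypothetical.

```python
import re


def split_rocm_package_version(version):
    """Split e.g. '2.18.3.60002-115~22.04' into library version, ROCm release and build.

    Assumes a five-digit ROCm code: major * 10000 + minor * 100 + patch.
    Returns None when the string does not follow that layout.
    """
    match = re.match(r"^(\d+(?:\.\d+)*)\.(\d{5})-(\d+)", version)
    if not match:
        return None
    code = int(match.group(2))
    major, minor, patch = code // 10000, (code // 100) % 100, code % 100
    return {
        "library": match.group(1),
        "rocm": f"{major}.{minor}.{patch}",
        "build": match.group(3),
    }


print(split_rocm_package_version("2.18.3.60002-115~22.04"))
```

For the rccl package this yields library version 2.18.3 built for ROCm 6.0.2, so the installed packages come from the 6.0.2 repository even though the ROCm Version field above says 6.0.0.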