Distributed Data Parallel (DDP) Training on PyTorch with AMD GPUs (ROCm) and RCCL test hangs · Issue #1129 · ROCm/rccl (original) (raw)

Problem Description

I have a Ubuntu 22.04 machine with two AMD MI100 GPUs installed. When trying to run a PyTorch training script, using DDP and backend="nccl" (which under the hood should use rccl), the script hangs with the GPU use at 100%, without the expected GPU temperature buildup.

At first I thought it was related to my PyTorch installation, but when I tried the all_reduce_perf test of rccl-tests I observed the same behaviour: the script hangs with the GPU use at 100%, without the expected GPU temperature buildup.

Output of all_reduce_perf:

$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop:3f7f785
# Using devices
#   Rank  0 Pid  21284 on deep-visionscaper3 device  0 [0000:2f:00.0] AMD Instinct MI100
#   Rank  1 Pid  21284 on deep-visionscaper3 device  1 [0000:03:00.0] AMD Instinct MI100
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)     

Nothing happens after this initial output.

Output of rock-smi:

$ rocm-smi --alldevices -f -P -t -u -g


============================ ROCm System Management Interface ============================
====================================== Temperature =======================================
GPU[0]		: Temperature (Sensor edge) (C): 63.0
GPU[0]		: Temperature (Sensor junction) (C): 76.0
GPU[0]		: Temperature (Sensor memory) (C): 61.0
GPU[1]		: Temperature (Sensor edge) (C): 58.0
GPU[1]		: Temperature (Sensor junction) (C): 70.0
GPU[1]		: Temperature (Sensor memory) (C): 56.0
==========================================================================================
=============================== Current clock frequencies ================================
GPU[0]		: sclk clock level: 15 (1502Mhz)
GPU[1]		: sclk clock level: 15 (1502Mhz)
==========================================================================================
=================================== Current Fan Metric ===================================
GPU[0]		: Not supported
GPU[1]		: Not supported
==========================================================================================
=================================== Power Consumption ====================================
GPU[0]		: Average Graphics Package Power (W): 114.0
GPU[1]		: Average Graphics Package Power (W): 107.0
==========================================================================================
=================================== % time GPU is busy ===================================
GPU[0]		: GPU use (%): 100
GPU[1]		: GPU use (%): 100
==========================================================================================
================================== End of ROCm SMI Log ===================================

The version of rccl I have installed:

$ apt list --installed | grep rccl

rccl-dev/jammy,now 2.18.3.60002-115~22.04 amd64 [installed,automatic]
rccl/jammy,now 2.18.3.60002-115~22.04 amd64 [installed,automatic]

Am I missing anything in my installation?

Operating System

Ubuntu 22.04.4 LTS (Jammy Jellyfish)

CPU

AMD Ryzen Threadripper PRO 5975WX 32-Cores

GPU

AMD Instinct MI100

ROCm Version

ROCm 6.0.0

ROCm Component

rccl

Steps to Reproduce

Run ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2 of the rccl-tests repo, it should perform the all_reduce test without blocking.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

$ /opt/rocm/bin/rocminfo --support
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen Threadripper PRO 5975WX 32-Cores
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen Threadripper PRO 5975WX 32-Cores
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3600                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            64                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    527960576(0x1f780a00) KB           
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    527960576(0x1f780a00) KB           
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    527960576(0x1f780a00) KB           
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx908                             
  Uuid:                    GPU-6fdb95cab945d4fe               
  Marketing Name:          AMD Instinct MI100                 
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      8192(0x2000) KB                    
  Chip ID:                 29580(0x738c)                      
  ASIC Revision:           2(0x2)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1502                               
  BDFID:                   12032                              
  Internal Node ID:        1                                  
  Compute Unit:            120                                
  SIMDs per CU:            4                                  
  Shader Engines:          8                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 65                                 
  SDMA engine uCode::      18                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    33538048(0x1ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    33538048(0x1ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 3                  
*******                  
  Name:                    gfx908                             
  Uuid:                    GPU-c088ca837a22409d               
  Marketing Name:          AMD Instinct MI100                 
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      8192(0x2000) KB                    
  Chip ID:                 29580(0x738c)                      
  ASIC Revision:           2(0x2)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1502                               
  BDFID:                   768                                
  Internal Node ID:        2                                  
  Compute Unit:            120                                
  SIMDs per CU:            4                                  
  Shader Engines:          8                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 65                                 
  SDMA engine uCode::      18                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    33538048(0x1ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    33538048(0x1ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

Additional Information

$ echo "OS:" && cat /etc/os-release | grep -E "^(NAME=|VERSION=)";
  echo "CPU: " && cat /proc/cpuinfo | grep "model name" | sort --unique;
  echo "GPU:" && /opt/rocm/bin/rocminfo | grep -E "^\s*(Name|Marketing Name)";
OS:
NAME="Ubuntu"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
CPU: 
model name	: AMD Ryzen Threadripper PRO 5975WX 32-Cores
GPU:
  Name:                    AMD Ryzen Threadripper PRO 5975WX 32-Cores
  Marketing Name:          AMD Ryzen Threadripper PRO 5975WX 32-Cores
  Name:                    gfx908                             
  Marketing Name:          AMD Instinct MI100                 
      Name:                    amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
  Name:                    gfx908                             
  Marketing Name:          AMD Instinct MI100                 
      Name:                    amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
$ apt show rocm-libs -a
Package: rocm-libs
Version: 6.0.2.60002-115~22.04
Priority: optional
Section: devel
Maintainer: ROCm Dev Support <rocm-dev.support@amd.com>
Installed-Size: 13.3 kB
Depends: hipblas (= 2.0.0.60002-115~22.04), hipblaslt (= 0.6.0.60002-115~22.04), hipfft (= 1.0.13.60002-115~22.04), hipsolver (= 2.0.0.60002-115~22.04), hipsparse (= 3.0.0.60002-115~22.04), hiptensor (= 1.1.0.60002-115~22.04), miopen-hip (= 3.00.0.60002-115~22.04), half (= 1.12.0.60002-115~22.04), rccl (= 2.18.3.60002-115~22.04), rocalution (= 3.0.3.60002-115~22.04), rocblas (= 4.0.0.60002-115~22.04), rocfft (= 1.0.25.60002-115~22.04), rocrand (= 3.0.0.60002-115~22.04), hiprand (= 2.10.16.60002-115~22.04), rocsolver (= 3.24.0.60002-115~22.04), rocsparse (= 3.0.2.60002-115~22.04), rocm-core (= 6.0.2.60002-115~22.04), composablekernel-dev (= 1.1.0.60002-115~22.04), hipblas-dev (= 2.0.0.60002-115~22.04), hipblaslt-dev (= 0.6.0.60002-115~22.04), hipcub-dev (= 3.0.0.60002-115~22.04), hipfft-dev (= 1.0.13.60002-115~22.04), hipsolver-dev (= 2.0.0.60002-115~22.04), hipsparse-dev (= 3.0.0.60002-115~22.04), hiptensor-dev (= 1.1.0.60002-115~22.04), miopen-hip-dev (= 3.00.0.60002-115~22.04), rccl-dev (= 2.18.3.60002-115~22.04), rocalution-dev (= 3.0.3.60002-115~22.04), rocblas-dev (= 4.0.0.60002-115~22.04), rocfft-dev (= 1.0.25.60002-115~22.04), rocprim-dev (= 3.0.0.60002-115~22.04), rocrand-dev (= 3.0.0.60002-115~22.04), hiprand-dev (= 2.10.16.60002-115~22.04), rocsolver-dev (= 3.24.0.60002-115~22.04), rocsparse-dev (= 3.0.2.60002-115~22.04), rocthrust-dev (= 3.0.0.60002-115~22.04), rocwmma-dev (= 1.3.0.60002-115~22.04)
Homepage: https://github.com/RadeonOpenCompute/ROCm
Download-Size: 1050 B
APT-Sources: https://repo.radeon.com/rocm/apt/6.0.2 jammy/main amd64 Packages
Description: Radeon Open Compute (ROCm) Runtime software stack
$ apt show rccl -a
Package: rccl
Version: 2.18.3.60002-115~22.04
Priority: optional
Section: devel
Maintainer: RCCL Maintainer <rccl-maintainer@amd.com>
Installed-Size: 520 MB
Depends: hip-rocclr (>= 3.5.0), rocm-smi-lib (>= 4.0.0), rocm-core, libc6 (>= 2.34), libgcc-s1 (>= 3.0), libstdc++6 (>= 11)
Recommends: rccl-dev (>=2.18.3.60002)
Download-Size: 13.7 MB
APT-Manual-Installed: no
APT-Sources: https://repo.radeon.com/rocm/apt/6.0.2 jammy/main amd64 Packages
Description: ROCm Communication Collectives Library
 ROCm Communication Collectives Library
 Optimized primitives for collective multi-GPU communication