nvptx (GNU libgomp)
12.2 nvptx
On the hardware side, there is the hierarchy (fine to coarse):
- thread
- warp
- thread block
- streaming multiprocessor
All OpenMP and OpenACC levels are used, i.e.
- OpenMP’s simd and OpenACC’s vector map to threads
- OpenMP’s threads (“parallel”) and OpenACC’s workers map to warps
- OpenMP’s teams and OpenACC’s gang use a threadpool with the size of the number of teams or gangs, respectively.
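For illustration, the following minimal sketch (the array size is an arbitrary choice, not a requirement of the mapping) exercises all three levels with a combined OpenMP construct:

```c
int main (void)
{
  double x[1024];

  /* teams    -> pool of team contexts, one per team/gang
     parallel -> OpenMP threads, mapped to warps
     simd     -> SIMD lanes, mapped to the threads of a warp  */
  #pragma omp target teams distribute parallel for simd map(from: x)
  for (int i = 0; i < 1024; i++)
    x[i] = 2.0 * i;

  return x[1] == 2.0 ? 0 : 1;
}
```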
The used sizes are:
- The warp_size is always 32.
- CUDA kernel launched: dim={#teams,1,1}, blocks={#threads,warp_size,1}.
- The number of teams is limited by the number of blocks the device can host simultaneously.
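How these sizes are filled in can be influenced from the source with the num_teams and thread_limit clauses. The sketch below (clause values chosen arbitrarily) requests 4 teams of at most 8 threads; per the formula above this should show up as dim={4,1,1}, blocks={8,32,1} in the kernel-launch line printed with GOMP_DEBUG=1, provided the runtime grants the requested sizes.

```c
#include <stdio.h>
#include <omp.h>

int main (void)
{
  int teams = 0, threads = 0;

  /* Request 4 teams with at most 8 OpenMP threads each and report
     back what the runtime actually granted.  */
  #pragma omp target teams num_teams(4) thread_limit(8) map(from: teams, threads)
  #pragma omp parallel
  if (omp_get_team_num () == 0 && omp_get_thread_num () == 0)
    {
      teams = omp_get_num_teams ();
      threads = omp_get_num_threads ();
    }

  printf ("teams=%d, threads per team=%d\n", teams, threads);
  return 0;
}
```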
Additional information can be obtained by setting the environment variable GOMP_DEBUG=1 (very verbose; grep for kernel.*launch to see the launch parameters).
GCC generates generic PTX ISA code, which is just-in-time compiled by CUDA; CUDA caches the JIT result in the user’s directory (see the CUDA documentation; this can be tuned by the environment variables CUDA_CACHE_{DISABLE,MAXSIZE,PATH}).
Note: While PTX ISA is generic, the -mptx= and -march= command-line options still affect the PTX ISA code that is used and, thus, the requirements on CUDA version and hardware.
Implementation remarks:
- I/O within OpenMP target regions and OpenACC compute regions is supported using the C library printf functions. Additionally, the Fortran print/write statements are supported within OpenMP target regions, but not yet within OpenACC compute regions. (See the printf sketch after this list.)
- Compiling OpenMP code that contains requires reverse_offload requires at least -march=sm_35; compiling for -march=sm_30 is not supported.
- For code containing reverse offload (i.e. target regions with device(ancestor:1)), there is a slight performance penalty for all target regions, consisting mostly of shutdown delay. Per device, reverse offload regions are processed serially such that the next reverse offload region is only executed after the previous one returned. (See the reverse-offload sketch after this list.)
- OpenMP code that has a requires directive with self_maps or unified_shared_memory runs on nvptx devices if and only if all of those devices support the pageableMemoryAccess property; otherwise, all nvptx devices are removed from the list of available devices (“host fallback”).
- The default per-warp stack size is 128 kiB; see also -msoft-stack in the GCC manual.
- Low-latency memory (omp_low_lat_mem_space) is supported when the access trait is set to cgroup and libgomp has been built for PTX ISA version 4.1 or higher (such as in GCC’s default configuration). The default pool size is 8 kiB per team, but may be adjusted at runtime by setting the environment variable GOMP_NVPTX_LOWLAT_POOL=bytes. The maximum value is limited by the available hardware, and care should be taken that the selected pool size does not unduly limit the number of teams that can run simultaneously. omp_low_lat_mem_alloc cannot be used with true low-latency memory because its definition implies the omp_atv_all trait; main graphics memory is used instead. omp_cgroup_mem_alloc, omp_pteam_mem_alloc, and omp_thread_mem_alloc all use low-latency memory as first preference and fall back to main graphics memory when the low-latency pool is exhausted. (See the allocator sketch after this list.)
- The OpenMP routines omp_target_memcpy_rect and omp_target_memcpy_rect_async and the target update directive for non-contiguous list items use the 2D and 3D memory-copy functions of the CUDA library. Higher dimensions call those functions in a loop and are therefore supported. (See the omp_target_memcpy_rect sketch after this list.)
- The unique identifier (UID), used with OpenMP’s API UID routines, consists of the ‘GPU-’ prefix followed by the 16-byte UUID as returned by the CUDA runtime library. This UUID is output in grouped lower-case hex digits; the grouping of those 32 digits is: 8 digits, hyphen, 4 digits, hyphen, 4 digits, hyphen, 4 digits, hyphen, 12 digits. This leads to a string like GPU-a8081c9e-f03e-18eb-1827-bf5ba95afa5d. The output matches the format used by nvidia-smi.
- OpenMP interop: see Foreign-Runtime Support for Nvidia GPUs.
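To illustrate the I/O remark, a minimal printf sketch for an OpenMP target region; nothing beyond the C library printf is needed:

```c
#include <stdio.h>

int main (void)
{
  #pragma omp target
  {
    /* printf works inside target regions; the text ends up on the
       host's standard output.  */
    printf ("hello from the nvptx device\n");
  }
  return 0;
}
```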
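The two reverse-offload remarks combine as in the following sketch: the requires reverse_offload directive makes -march=sm_35 (or newer) mandatory, and the nested target region with device(ancestor: 1) runs back on the host:

```c
#include <stdio.h>

/* Reverse offload has to be declared up front; compiling this
   requires at least -march=sm_35.  */
#pragma omp requires reverse_offload

int main (void)
{
  #pragma omp target
  {
    /* This nested region is executed on the host (the ancestor
       device); per device, such regions run serially.  */
    #pragma omp target device (ancestor : 1)
      printf ("running on the host, requested from the device\n");
  }
  return 0;
}
```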
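A sketch of using the low-latency pool through an allocator whose access trait is cgroup (the allocation size is an arbitrary example value; as described above, the request falls back to main graphics memory if the pool is exhausted):

```c
#include <omp.h>

int main (void)
{
  int ok = 0;

  #pragma omp target map(from: ok)
  #pragma omp teams num_teams(1)
  {
    /* An allocator in omp_low_lat_mem_space with access = cgroup is
       the configuration backed by the per-team low-latency pool.  */
    omp_alloctrait_t traits[] = { { omp_atk_access, omp_atv_cgroup } };
    omp_allocator_handle_t lowlat
      = omp_init_allocator (omp_low_lat_mem_space, 1, traits);

    int *p = omp_alloc (64 * sizeof (int), lowlat);
    if (p)
      {
        p[0] = 42;
        ok = (p[0] == 42);
        omp_free (p, lowlat);
      }
    omp_destroy_allocator (lowlat);
  }

  return ok ? 0 : 1;
}
```

The pool can be enlarged at run time, e.g. by running the program with GOMP_NVPTX_LOWLAT_POOL=16384 set in the environment.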
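For the rectangular-copy remark, the following omp_target_memcpy_rect sketch copies a 2x3 sub-block of a 4x5 host matrix to the same position in a device buffer (all shapes and offsets are arbitrary example values); a 2-dimensional copy like this maps onto CUDA’s 2D memcpy:

```c
#include <omp.h>

int main (void)
{
  enum { ROWS = 4, COLS = 5 };
  double host[ROWS][COLS];
  for (int i = 0; i < ROWS; i++)
    for (int j = 0; j < COLS; j++)
      host[i][j] = 10.0 * i + j;

  int dev = omp_get_default_device ();
  int host_dev = omp_get_initial_device ();

  /* Device-side destination with the same 4x5 shape.  */
  double *dst = omp_target_alloc (ROWS * COLS * sizeof (double), dev);
  if (dst == NULL)
    return 1;

  /* Copy the 2x3 block starting at host[1][2] into the device array.  */
  size_t volume[2]  = { 2, 3 };
  size_t offsets[2] = { 1, 2 };
  size_t dims[2]    = { ROWS, COLS };

  int ret = omp_target_memcpy_rect (dst, &host[0][0], sizeof (double),
                                    2, volume,
                                    offsets, offsets,   /* dst, src offsets */
                                    dims, dims,         /* dst, src shapes  */
                                    dev, host_dev);

  omp_target_free (dst, dev);
  return ret == 0 ? 0 : 1;
}
```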