Introduction to CUDA Programming

Last Updated : 2 Mar, 2026

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA that extends C++ to enable general-purpose computing on GPUs. It lets thousands of threads execute simultaneously, which can significantly accelerate data-parallel computations compared with sequential CPU execution.

CUDA programs are compiled with NVCC, the NVIDIA CUDA compiler driver, which compiles the GPU (device) code and hands the CPU (host) code to a standard C++ compiler.
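To make the data-parallel idea concrete, here is a minimal sketch of element-wise vector addition in which each GPU thread handles one element, checked against the equivalent sequential CPU computation. The names `vectorAdd` and `countMismatches`, the block size of 256, and the input values are illustrative assumptions, not part of the original article.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernel: each GPU thread adds exactly one pair of elements.
__global__ void vectorAdd(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n) {                                    // last block may have spare threads
        out[i] = a[i] + b[i];
    }
}

// Hypothetical helper: runs the addition on the GPU and compares it with a
// sequential CPU computation. Returns the number of mismatching elements,
// or -1 if a CUDA call fails. Assumes n > 0.
int countMismatches(int n) {
    std::vector<float> a(n), b(n), gpu(n);
    for (int i = 0; i < n; ++i) { a[i] = float(i); b[i] = 2.0f * i; }

    size_t bytes = n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dA, bytes) != cudaSuccess ||
        cudaMalloc(&dB, bytes) != cudaSuccess ||
        cudaMalloc(&dOut, bytes) != cudaSuccess) {
        cudaFree(dA); cudaFree(dB); cudaFree(dOut);  // freeing nullptr is a no-op
        return -1;
    }
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;  // assumed block size
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    vectorAdd<<<blocks, threadsPerBlock>>>(dA, dB, dOut, n);

    // This blocking copy waits for the kernel and reports any earlier error.
    cudaError_t err = cudaMemcpy(gpu.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dOut);
    if (err != cudaSuccess) return -1;

    // Sequential CPU reference: one element after another.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (gpu[i] != a[i] + b[i]) ++mismatches;
    }
    return mismatches;
}

int main() {
    printf("mismatches: %d\n", countMismatches(1024));
    return 0;
}
```

Saved as a `.cu` file (for example `vector_add.cu`, a hypothetical name), this can be built with `nvcc vector_add.cu -o vector_add`. The CPU loop visits elements one at a time, while the kernel assigns one element to each thread.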

Architecture of a CUDA GPU

The power of CUDA lies in its physical hardware organization. While a CPU is composed of a few sophisticated cores, a CUDA-capable GPU is built from an array of Streaming Multiprocessors (SMs).

CUDA GPU Hardware Architecture

The diagram above shows how the host (CPU) sends tasks to the GPU, where the Thread Execution Manager distributes them across multiple Streaming Multiprocessors (SMs) for parallel execution. Each SM accesses Global Memory as needed.
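The number of SMs and the size of Global Memory vary from one GPU to another. As a rough sketch, the CUDA runtime call `cudaGetDeviceProperties` can report them for the device in use. The helper name `smCount` and the choice of device 0 are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of SMs on the given device, or -1 on failure.
int smCount(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.multiProcessorCount;
}

int main() {
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount == 0) {
        printf("No CUDA-capable device found\n");
        return 1;
    }

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;

    printf("Device 0: %s\n", prop.name);
    printf("Streaming Multiprocessors: %d\n", smCount(0));
    printf("Global memory: %.1f GiB\n",
           prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```

The printed values depend entirely on the installed GPU, so no fixed output is shown here.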

Key Components of the Hardware Architecture

CUDA Work Distribution

CUDA organizes threads into a logical hierarchy that maps directly onto the hardware components mentioned above.
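In CUDA, this hierarchy is exposed through built-in variables: a kernel launch creates a grid of blocks, and each block contains threads. The sketch below launches 2 blocks of 4 threads and records which block and thread produced each global index. The names `recordIds` and `mappingIsConsistent` and the 2 x 4 launch shape are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread stores its own block index and thread index at its global index.
__global__ void recordIds(int* blockIds, int* threadIds) {
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index in the grid
    blockIds[g] = blockIdx.x;
    threadIds[g] = threadIdx.x;
}

// Hypothetical helper: launches the kernel and checks that global index g came
// from block g / threadsPerBlock and thread g % threadsPerBlock.
bool mappingIsConsistent(int blocks, int threadsPerBlock, bool print = false) {
    int total = blocks * threadsPerBlock;
    size_t bytes = total * sizeof(int);
    int *dBlock = nullptr, *dThread = nullptr;
    if (cudaMalloc(&dBlock, bytes) != cudaSuccess ||
        cudaMalloc(&dThread, bytes) != cudaSuccess) {
        cudaFree(dBlock); cudaFree(dThread);
        return false;
    }

    // Exactly `total` threads are launched, so no bounds check is needed.
    recordIds<<<blocks, threadsPerBlock>>>(dBlock, dThread);

    std::vector<int> hBlock(total), hThread(total);
    cudaError_t e1 = cudaMemcpy(hBlock.data(), dBlock, bytes, cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(hThread.data(), dThread, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dBlock); cudaFree(dThread);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return false;

    bool ok = true;
    for (int g = 0; g < total; ++g) {
        if (print) {
            printf("global %d -> block %d, thread %d\n", g, hBlock[g], hThread[g]);
        }
        if (hBlock[g] != g / threadsPerBlock || hThread[g] != g % threadsPerBlock) {
            ok = false;
        }
    }
    return ok;
}

int main() {
    bool ok = mappingIsConsistent(2, 4, true);
    printf("mapping consistent: %s\n", ok ? "yes" : "no");
    return ok ? 0 : 1;
}
```

The expression `blockIdx.x * blockDim.x + threadIdx.x` is the usual way to give every thread in a one-dimensional grid a unique index, which is then used to pick the data element that thread works on.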

**Note:** CUDA programs can be written and executed in Google Colab, which provides built-in NVIDIA GPUs and CUDA support.

Basic Program

This basic "Hello World" example demonstrates the interaction between the CPU and the GPU by launching a single thread to print a message.

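```cpp
%%cuda
// The %%cuda line above is a notebook cell magic for Colab;
// remove it when compiling this file directly with nvcc.
#include <stdio.h>

// __global__ marks a kernel: it runs on the GPU and is launched from the CPU.
__global__ void simpleKernel() {
    printf("Hello world\n");
}

int main() {
    // Launch the kernel with 1 block containing 1 thread.
    simpleKernel<<<1, 1>>>();

    // Wait for the GPU to finish so the kernel's output appears before exit.
    cudaDeviceSynchronize();

    return 0;
}
```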

**Output:**

Hello world

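**Explanation:**

simpleKernel is the kernel, the function that runs on the GPU. The launch configuration <<<1, 1>>> starts it with one block containing one thread. cudaDeviceSynchronize() makes the CPU wait until the GPU has finished, so the message is printed before the program exits.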

Limitations