Writing First CUDA Program

Last Updated : 14 Feb, 2026

A CUDA program is "heterogeneous": it consists of code that runs on two different processors, the Host (CPU) and the Device (NVIDIA GPU). The CUDA programming model extends the C++ language with specialized syntax for managing parallel execution. To coordinate the two sides, a standard .cu file follows a specific structural template.

Structure of CUDA Program

1. Header Files

For headers, we can add standard C/C++ headers like <stdio.h> for basic input/output (including printf) or <math.h> for mathematical functions. For lower-level control of the GPU hardware, we use:

#include <cuda.h>

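**Explanation:** The cuda.h header provides access to the CUDA Driver API for low-level device management. Runtime API functions such as cudaDeviceSynchronize() are declared in cuda_runtime.h, which nvcc includes automatically when compiling a .cu file, so the examples below do not need cuda.h.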
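To make the role of cuda.h concrete, here is a minimal sketch that uses the Driver API to list the GPUs visible to the driver. The helper name countDevices and the file name in the build comment are assumptions made for illustration; the Driver API calls themselves (cuInit, cuDeviceGetCount, cuDeviceGet, cuDeviceGetName) are standard, and the program has to be linked against the driver library with -lcuda.

```cuda
// Build (file name is an assumption): nvcc list_devices.cu -o list_devices -lcuda
#include <cuda.h>    // CUDA Driver API
#include <stdio.h>

// Hypothetical helper: returns the number of CUDA devices the driver sees,
// or -1 if the driver could not be initialized or queried.
int countDevices() {
    // cuInit must be called before any other Driver API function
    if (cuInit(0) != CUDA_SUCCESS) {
        return -1;
    }
    int count = 0;
    if (cuDeviceGetCount(&count) != CUDA_SUCCESS) {
        return -1;
    }
    return count;
}

int main() {
    int count = countDevices();
    if (count < 0) {
        printf("Could not initialize the CUDA driver\n");
        return 1;
    }
    printf("CUDA devices found: %d\n", count);

    // Print the name of each device
    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        char name[256];
        if (cuDeviceGet(&dev, i) == CUDA_SUCCESS &&
            cuDeviceGetName(name, sizeof(name), dev) == CUDA_SUCCESS) {
            printf("  device %d: %s\n", i, name);
        }
    }
    return 0;
}
```

Most introductory programs never need this level of control and can rely on the Runtime API alone.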

2. Kernel Definition (GPU Code)

The Kernel is a special function designed to run on the GPU. It contains the logic that will be executed in parallel across many threads.

```cpp
__global__ void myKernel() {
    // This code executes on the GPU
    printf("Hello from the GPU!\n");
}
```

**Explanation:**

**`__global__`:** A mandatory declaration specifier. It tells the compiler that this function is called from the CPU but must be executed on the GPU hardware.
**Execution Mapping:** Unlike a standard function, which runs once per call, a kernel is executed by many GPU threads at the same time, each running its own instance of the function body.
**Return Type:** Kernels must always have a void return type. They cannot return values via a return statement; instead, they write results directly to GPU memory.
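The last two points can be sketched together: several threads run the same kernel, and each one writes its result into GPU memory through a pointer instead of returning it. The names squareKernel and computeSquares are hypothetical, and the sketch assumes a single block, so n must not exceed the per-block thread limit (1024 on current GPUs).

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Each thread computes one square and stores it in GPU memory.
// The kernel returns void; the output travels through the pointer.
__global__ void squareKernel(int *out) {
    int i = threadIdx.x;   // index of this thread within its block
    out[i] = i * i;
}

// Hypothetical host helper: runs n threads in one block and copies the
// results into hostOut. Returns true if every CUDA call succeeded.
bool computeSquares(int *hostOut, int n) {
    int *devOut = NULL;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc(&devOut, bytes) != cudaSuccess) {
        return false;
    }

    squareKernel<<<1, n>>>(devOut);

    // cudaMemcpy waits for the kernel to finish before copying the data back
    cudaError_t err = cudaMemcpy(hostOut, devOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(devOut);
    return err == cudaSuccess;
}

int main() {
    const int n = 8;
    int result[n];
    if (!computeSquares(result, n)) {
        printf("CUDA call failed\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        printf("%d squared is %d\n", i, result[i]);
    }
    return 0;
}
```

Passing a device pointer as a kernel argument is the usual way to get data out of a kernel; the host then copies the buffer back with cudaMemcpy.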

3. Main Function (CPU Code)

The main() function is the entry point of the program that runs on the CPU. It handles the logical flow, manages memory and tells the GPU when to start working.

```cpp
int main() {
    // 1. Launch the kernel on the GPU
    myKernel<<<1, 1>>>();

    // 2. Synchronize to wait for the GPU to finish
    cudaDeviceSynchronize();

    return 0;
}
```

**Explanation:**

**Triple Chevron Syntax (<<< >>>):** This CUDA-specific syntax launches a kernel. The first value is the number of blocks and the second is the number of threads per block. For example, <<<1, 1>>> launches one block containing one thread, so exactly one thread runs in total.
**cudaDeviceSynchronize():** A kernel launch is asynchronous: control returns to the CPU immediately while the GPU works independently. This call makes the CPU wait until the GPU has completed all previously launched work, so the program does not exit before the kernel finishes.
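One way to see how the launch configuration determines the thread count is to let every thread increment a counter and read it back after synchronizing. This is only an illustrative sketch: countKernel and countLaunchedThreads are hypothetical names, and the counter is a teaching device rather than something a real program would need.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Every launched thread adds 1 to a counter in GPU memory.
__global__ void countKernel(int *counter) {
    atomicAdd(counter, 1);   // atomic, so concurrent increments are not lost
}

// Hypothetical helper: returns how many threads ran for the given launch
// configuration, or -1 if a CUDA call failed.
int countLaunchedThreads(int blocks, int threadsPerBlock) {
    int *devCounter = NULL;
    if (cudaMalloc(&devCounter, sizeof(int)) != cudaSuccess) {
        return -1;
    }
    cudaMemset(devCounter, 0, sizeof(int));

    countKernel<<<blocks, threadsPerBlock>>>(devCounter);

    // The launch returns immediately; wait here until the GPU is done
    cudaError_t err = cudaDeviceSynchronize();

    int hostCounter = -1;
    if (err == cudaSuccess) {
        err = cudaMemcpy(&hostCounter, devCounter, sizeof(int),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(devCounter);
    return err == cudaSuccess ? hostCounter : -1;
}

int main() {
    printf("<<<1, 1>>> ran %d thread(s)\n", countLaunchedThreads(1, 1));
    printf("<<<2, 4>>> ran %d thread(s)\n", countLaunchedThreads(2, 4));
    return 0;
}
```

On a working CUDA setup the second launch should report 8 threads, since the total is blocks multiplied by threads per block (2 × 4).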

This basic "Hello World" example demonstrates the interaction between the CPU and the GPU by launching a single thread to print a message.

```cpp
%%cuda
#include <stdio.h>

__global__ void simpleKernel() {
    printf("Hello world\n");
}

int main() {
    simpleKernel<<<1, 1>>>();

    cudaDeviceSynchronize();

    return 0;
}
```

**Output**

Hello world

**Explanation:**

The %%cuda magic command compiles the cell with nvcc and runs it in one step. It is a notebook cell magic (available in Jupyter or Google Colab once a CUDA extension is loaded), not part of CUDA itself, so leave it out when compiling a .cu file directly.
The `__global__` keyword marks a function that runs on the GPU (the "device") but is called from the CPU (the "host").
<<<1, 1>>> defines the execution configuration (1 block, 1 thread).
cudaDeviceSynchronize() forces the CPU to wait until the GPU has finished executing the kernel and flushed its printf buffer before the program exits.
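Outside a notebook, the same program can be compiled with nvcc from a terminal. The sketch below also checks for errors, which the minimal version skips: a failed launch would otherwise print nothing and exit silently. The file name hello.cu and the helper runHello are assumptions made for illustration.

```cuda
// Build and run from a terminal (file name is an assumption):
//   nvcc hello.cu -o hello
//   ./hello
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void simpleKernel() {
    printf("Hello world\n");
}

// Hypothetical helper: launches the kernel and returns the first error seen.
cudaError_t runHello() {
    simpleKernel<<<1, 1>>>();

    // Errors detected at launch time (for example, an invalid configuration)
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        return err;
    }

    // Errors raised while the kernel was running; this call also waits
    // for the kernel to finish and flushes the device printf buffer
    return cudaDeviceSynchronize();
}

int main() {
    cudaError_t err = runHello();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Checking both cudaGetLastError() and the return value of cudaDeviceSynchronize() is a common pattern, because launch errors and execution errors are reported at different points.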