luminary.blog
by Oz Akan

CUDA Programming: An Introduction

Getting started with CUDA programming: Hello Threads

12 min read


Back In The Day

Back in the day, I’d spend weeks planning my next computer build. Motherboard, memory, disk drive – most parts were straightforward choices. The real decisions came down to CPU and GPU. Intel was usually the safe bet for processors, unless AMD had something compelling that generation. But switching meant a new motherboard too, so it had to be worth it.

Then came the GPU decision. Always GeForce, but which model? Should I skip lunch for a few months to get the latest card, or the previous generation? I was mostly playing FIFA and some FPS games on the network with friends, dual-booting between Slackware Linux and Windows like any Linux desktop user would.

At the time, a graphics card was just the thing that made games look good. Then around 2007, NVIDIA released something called CUDA. I remember reading about it but honestly can’t recall if I did anything with it back then. CUDA, the Compute Unified Device Architecture, made these powerful cards useful for more than graphics. All those parallel cores turned out to be perfect for the kind of math-heavy work behind AI and scientific computing. Those gaming GPUs are now powering everything from training LLMs to folding proteins and predicting weather patterns.

Out of curiosity, I wanted to check CUDA out again.

I found that in the Lindholm et al., 2008 white paper 1, NVIDIA called its architecture “Tesla”. The paper says:

To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture. Its scalable parallel array of processors is massively multithreaded and programmable in C or via graphics APIs.

They dropped the name around 2018 due to the confusion with Tesla cars.

Where can I run CUDA?

A search returned the CUDA C++ Programming Guide 2. I knew I had to write C++ and needed an NVIDIA card that could run CUDA. On my Arm-based MacBook, I had no luck; NVIDIA supported the Mac back when it ran Intel CPUs, but it no longer does. So I had to find an online solution. One option was to run a virtual machine in the cloud with a GPU, but that would be extra overhead for my little experiment. Then I found LeetGPU 3, which can run C++ code on a GPU, and Google Colab 4, which runs Jupyter notebooks and supports GPUs.

For the hello world program, it turned out I had to understand some of the basics first.

CUDA Architecture

CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform and programming model that enables developers to use NVIDIA GPUs for general-purpose computing.

Key Concepts

  • Parallel Computing: Execute thousands of threads simultaneously
  • GPU vs CPU: GPUs excel at data-parallel tasks, CPUs at sequential tasks
  • Massive Parallelism: Thousands of GPU cores vs. at most around a hundred CPU cores
  • Cost Effective: Better performance per dollar for parallel workloads

Streaming Multiprocessors (SMs)

  • Each SM5 contains multiple CUDA cores
  • Executes groups of 32 threads (warps) simultaneously
  • Has its own shared memory and register file (see the query sketch below)
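
If you’re curious what these numbers look like on your own card, you can query them at runtime with cudaGetDeviceProperties. Here is a minimal sketch (any CUDA-capable GPU will do; error checking skipped):

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // properties of device 0
    printf("Device: %s\n", prop.name);
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Warp size: %d threads\n", prop.warpSize);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers per block: %d\n", prop.regsPerBlock);
    return 0;
}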

Programming Model

Host vs Device

  • Host: CPU and its memory (RAM)
  • Device: GPU and its memory (VRAM)
  • Kernel: Function that runs on the device (see the sketch after this list)
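
Putting the three together: a typical CUDA program allocates device memory, copies input from the host, launches a kernel, and copies the result back. Here is a minimal sketch of that round trip (the doubleElements kernel is made up for illustration; error checking skipped):

#include <stdio.h>
#include <cuda_runtime.h>

// Kernel: runs on the device, one thread per array element
__global__ void doubleElements(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

int main() {
    const int n = 8;
    int h_data[n] = {1, 2, 3, 4, 5, 6, 7, 8};          // host memory (RAM)

    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));              // device memory (VRAM)
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);

    doubleElements<<<1, n>>>(d_data, n);               // host launches the kernel on the device
    cudaMemcpy(h_data, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    for (int i = 0; i < n; i++) printf("%d ", h_data[i]);   // 2 4 6 8 10 12 14 16
    printf("\n");
    return 0;
}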

Thread Hierarchy

Grid
├── Block 0
│   ├── Thread 0
│   ├── Thread 1
│   └── ...
├── Block 1
│   ├── Thread 0
│   ├── Thread 1
│   └── ...
└── ...

Key Terms

  • Grid: Collection of thread blocks
  • Block: Collection of threads that can cooperate
  • Thread: Individual execution unit (see the index sketch after this list)
  • Warp: Group of 32 threads executed together
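
Inside a kernel these pieces combine: each thread can derive a unique grid-wide index from its block and thread coordinates. Here is a minimal 1D sketch with 2 blocks of 4 threads (the whoAmI kernel is made up for illustration):

#include <stdio.h>
#include <cuda_runtime.h>

// blockIdx  = which block this thread belongs to
// blockDim  = threads per block
// threadIdx = position within the block
__global__ void whoAmI() {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d -> global index %d of %d\n",
           blockIdx.x, threadIdx.x, globalId, gridDim.x * blockDim.x);
}

int main() {
    whoAmI<<<2, 4>>>();           // grid of 2 blocks, 4 threads per block
    cudaDeviceSynchronize();      // wait for the device-side printf output
    return 0;
}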

Memory Hierarchy

Memory Types (Fastest to Slowest):

  1. Registers: Private to each thread
  2. Shared Memory: Shared within a thread block
  3. Global Memory: Accessible by all threads
  4. Host Memory: CPU RAM

Memory Scope and Lifetime (a short sketch after the table shows registers, shared, and global memory in code):

Memory Type | Scope  | Lifetime
------------|--------|------------
Register    | Thread | Thread
Local       | Thread | Thread
Shared      | Block  | Block
Global      | Grid   | Application
Constant    | Grid   | Application
Texture     | Grid   | Application
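
To make the top of that table concrete, here is a small sketch that stages data from global memory into shared memory, synchronizes the block, and writes it back reversed; the per-thread index lives in a register. The reverseInBlock kernel only illustrates the different scopes, it’s not something a tiny one-block array actually needs:

#include <stdio.h>
#include <cuda_runtime.h>

#define BLOCK_SIZE 8

// Stage data in shared memory, sync the block, write it back reversed.
__global__ void reverseInBlock(int *data) {
    __shared__ int tile[BLOCK_SIZE];       // shared: visible to every thread in this block
    int t = threadIdx.x;                   // register: private to this thread
    tile[t] = data[t];                     // global -> shared
    __syncthreads();                       // wait for the whole block
    data[t] = tile[blockDim.x - 1 - t];    // shared -> global, reversed
}

int main() {
    int h_data[BLOCK_SIZE] = {0, 1, 2, 3, 4, 5, 6, 7};     // host memory
    int *d_data;
    cudaMalloc(&d_data, BLOCK_SIZE * sizeof(int));         // global (device) memory
    cudaMemcpy(d_data, h_data, BLOCK_SIZE * sizeof(int), cudaMemcpyHostToDevice);

    reverseInBlock<<<1, BLOCK_SIZE>>>(d_data);
    cudaMemcpy(h_data, d_data, BLOCK_SIZE * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    for (int i = 0; i < BLOCK_SIZE; i++) printf("%d ", h_data[i]);   // 7 6 5 4 3 2 1 0
    printf("\n");
    return 0;
}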

Hello World, Finally

On LeetGPU 3

#include <stdio.h>
#include <cuda_runtime.h>

// Kernel function (runs on GPU)
__global__ void helloFromGPU() {
    printf("Hello World from GPU thread %d!\n", threadIdx.x);
}

int main() {
    // Launch kernel with 10 threads
    helloFromGPU<<<1, 10>>>();
    // Wait for GPU to finish
    cudaDeviceSynchronize();
    return 0;
}

I got the response below:

Running NVIDIA GTX TITAN X in CYCLE ACCURATE mode...
Compiling...
Executing...
Hello World from GPU thread 0!
Hello World from GPU thread 1!
Hello World from GPU thread 2!
Hello World from GPU thread 3!
Hello World from GPU thread 4!
Hello World from GPU thread 5!
Hello World from GPU thread 6!
Hello World from GPU thread 7!
Hello World from GPU thread 8!
Hello World from GPU thread 9!
GPU Execution Time: 3.80 microseconds
Exit status: 0

I had to pick the “CYCLE ACCURATE” mode to emulate the underlying GPU architecture at a detailed, hardware-representative level. The “Functional” mode didn’t print the thread IDs properly.

Remember the grid, block, and thread concepts from above? Let’s create a CUDA grid with a 2D structure: 3 blocks along the x-axis and 3 along the y-axis (9 blocks total), and give each block 2 threads along the x-axis and 2 along the y-axis (4 threads per block, 36 threads in all).

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void printThreadInfo2D() {
    // Calculate unique thread ID from 2D block and thread indices
    int uniqueThreadId = blockIdx.y * gridDim.x * blockDim.x * blockDim.y +
                         blockIdx.x * blockDim.x * blockDim.y +
                         threadIdx.y * blockDim.x + threadIdx.x;
    printf("Thread ID: %d, blockIdx: (%d,%d), threadIdx: (%d,%d)\n",
           uniqueThreadId, blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y);
}

int main() {
    dim3 gridDim(3, 3);
    dim3 blockDim(2, 2);
    printThreadInfo2D<<<gridDim, blockDim>>>();
    cudaDeviceSynchronize();
    return 0;
}

Output:

Running NVIDIA GTX TITAN X in CYCLE ACCURATE mode...
Compiling...
Executing...
Thread ID: 0, blockIdx: (0,0), threadIdx: (0,0)
Thread ID: 1, blockIdx: (0,0), threadIdx: (1,0)
Thread ID: 2, blockIdx: (0,0), threadIdx: (0,1)
Thread ID: 3, blockIdx: (0,0), threadIdx: (1,1)
Thread ID: 4, blockIdx: (1,0), threadIdx: (0,0)
Thread ID: 5, blockIdx: (1,0), threadIdx: (1,0)
Thread ID: 6, blockIdx: (1,0), threadIdx: (0,1)
...
...
...
Thread ID: 27, blockIdx: (0,2), threadIdx: (1,1)
Thread ID: 28, blockIdx: (1,2), threadIdx: (0,0)
Thread ID: 29, blockIdx: (1,2), threadIdx: (1,0)
Thread ID: 30, blockIdx: (1,2), threadIdx: (0,1)
Thread ID: 31, blockIdx: (1,2), threadIdx: (1,1)
Thread ID: 32, blockIdx: (2,2), threadIdx: (0,0)
Thread ID: 33, blockIdx: (2,2), threadIdx: (1,0)
Thread ID: 34, blockIdx: (2,2), threadIdx: (0,1)
Thread ID: 35, blockIdx: (2,2), threadIdx: (1,1)
GPU Execution Time: 3.83 microseconds
Exit status: 0

PyCUDA

Most ML work is done with Python, and I also wanted to use Jupyter notebooks, so I searched for Python bindings. I found a few libraries; one of them was PyCUDA. It is simple to use and basically lets you write CUDA C++ code as a string and run it on the GPU from Python. (NVIDIA also maintains its own bindings, cuda-python.)

Install the library.

!pip install pycuda

Add the CUDA kernel code as a Python string passed to SourceModule.

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda.compiler import SourceModule

# CUDA kernel code as string
mod = SourceModule("""
__global__ void add_kernel(float *a, float *b, float *c, int n) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < n)
        c[idx] = a[idx] + b[idx];
}
""")

# Sample input arrays
n = 16
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c = np.zeros_like(a)

# Get kernel from module
add_func = mod.get_function("add_kernel")

# Launch kernel
add_func(
    drv.In(a), drv.In(b), drv.Out(c), np.int32(n),
    block=(n, 1, 1), grid=(1, 1)
)

print("Result:", c)

Result:

Result: [ 0.08215982 -1.1922485 -0.41126803 -0.53234756 -2.0320995 0.3066362
-0.05669016 -0.7823407 -1.2497976 1.3654544 -0.20398068 -1.0760175
-0.50371385 0.18608434 -1.2050071 0.9978307 ]

From NumPy to CuPy

NumPy is a Python library used for handling and processing large, multidimensional arrays and matrices of numerical data. CuPy does the same but its operations are executed on NVIDIA GPUs using CUDA, allowing for much faster computations for large arrays and mathematical tasks.

I decided to test the performance with a relatively simple ML problem: linear regression. Yes, a home price estimator :).

First, the NumPy version:

import numpy as np
import time
import matplotlib.pyplot as plt

def generate_house_data(n_samples=100000):
    """Generate synthetic house price dataset"""
    np.random.seed(42)

    # Features: [sqft, bedrooms, age, neighborhood_score]
    sqft = np.random.normal(2000, 500, n_samples)
    bedrooms = np.random.poisson(3, n_samples) + 1          # 1-6 bedrooms
    age = np.random.uniform(0, 50, n_samples)               # 0-50 years old
    neighborhood_score = np.random.uniform(1, 10, n_samples)  # 1-10 rating

    # True coefficients (unknown in real world)
    true_coeffs = [50000, 120, 15000, -800, 5000]  # [bias, sqft, bed, age, neighborhood]

    # Generate prices with some noise
    noise = np.random.normal(0, 10000, n_samples)
    prices = (true_coeffs[0] +
              true_coeffs[1] * sqft +
              true_coeffs[2] * bedrooms +
              true_coeffs[3] * age +
              true_coeffs[4] * neighborhood_score +
              noise)

    # Create feature matrix with bias column
    X = np.column_stack([np.ones(n_samples), sqft, bedrooms, age, neighborhood_score])
    return X, prices, true_coeffs

def linear_regression_numpy(X, y):
    """
    Solve linear regression using the normal equation with NumPy (CPU)

    Mathematical steps:
    1. Compute X^T (transpose)
    2. Compute X^T @ X (matrix multiplication)
    3. Compute (X^T @ X)^(-1) (matrix inversion)
    4. Compute X^T @ y (matrix-vector multiplication)
    5. Compute final coefficients: β = (X^T @ X)^(-1) @ X^T @ y
    """
    print("NumPy Linear Regression (CPU)")
    start_time = time.time()

    # Step 1: Compute X transpose
    XT = X.T
    print(f"X shape: {X.shape}, X^T shape: {XT.shape}")

    # Step 2: Compute X^T @ X (covariance matrix)
    XTX = XT @ X
    print(f"X^T @ X shape: {XTX.shape}")

    # Step 3: Compute X^T @ y (feature-target correlations)
    XTy = XT @ y
    print(f"X^T @ y shape: {XTy.shape}")

    # Step 4: Solve (X^T @ X) @ β = X^T @ y
    # Using np.linalg.solve is more stable than computing the inverse directly
    coefficients = np.linalg.solve(XTX, XTy)

    end_time = time.time()
    print(f"NumPy computation time: {end_time - start_time:.4f} seconds")
    return coefficients

def predict_numpy(X, coefficients):
    """Make predictions using learned coefficients"""
    return X @ coefficients

def compute_metrics(y_true, y_pred):
    """Compute R² and RMSE metrics"""
    # R-squared
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1 - (ss_res / ss_tot)
    # RMSE
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return r2, rmse

# Generate dataset
print("Generating house price dataset...")
X, y, true_coeffs = generate_house_data(n_samples=100000)
print(f"Dataset: {X.shape[0]} houses, {X.shape[1]-1} features")
print(f"True coefficients: {true_coeffs}")

# Train model with NumPy
coeffs_numpy = linear_regression_numpy(X, y)
y_pred_numpy = predict_numpy(X, coeffs_numpy)
r2_numpy, rmse_numpy = compute_metrics(y, y_pred_numpy)

print(f"\nNumPy Results:")
print(f"Learned coefficients: {coeffs_numpy}")
print(f"R² Score: {r2_numpy:.4f}")
print(f"RMSE: ${rmse_numpy:,.2f}")

Then the CuPy version:

import cupy as cp
import numpy as np
import time

def linear_regression_cupy(X, y):
    """
    Solve linear regression using the normal equation with CuPy (GPU/CUDA)

    Same mathematical steps as NumPy but executed on the GPU:
    1. Transfer data to GPU memory
    2. Perform matrix operations on the GPU
    3. Transfer results back to the CPU
    """
    print("\nCuPy Linear Regression (GPU/CUDA)")

    # Transfer data to GPU
    start_time = time.time()
    X_gpu = cp.asarray(X)
    y_gpu = cp.asarray(y)
    transfer_time = time.time()
    print(f"Data transfer to GPU: {transfer_time - start_time:.4f} seconds")

    # GPU computations
    compute_start = time.time()

    # Step 1: Compute X transpose
    XT_gpu = X_gpu.T

    # Step 2: Compute X^T @ X (matrix multiplication on GPU)
    XTX_gpu = XT_gpu @ X_gpu

    # Step 3: Compute X^T @ y
    XTy_gpu = XT_gpu @ y_gpu

    # Step 4: Solve linear system on GPU
    coefficients_gpu = cp.linalg.solve(XTX_gpu, XTy_gpu)
    compute_end = time.time()

    # Transfer results back to CPU
    coefficients = cp.asnumpy(coefficients_gpu)
    end_time = time.time()

    print(f"GPU computation time: {compute_end - compute_start:.4f} seconds")
    print(f"Total time (including transfers): {end_time - start_time:.4f} seconds")
    return coefficients

def predict_cupy(X, coefficients):
    """Make predictions using the GPU"""
    X_gpu = cp.asarray(X)
    coeffs_gpu = cp.asarray(coefficients)
    predictions_gpu = X_gpu @ coeffs_gpu
    return cp.asnumpy(predictions_gpu)

# Train model with CuPy (GPU)
coeffs_cupy = linear_regression_cupy(X, y)
y_pred_cupy = predict_cupy(X, coeffs_cupy)
r2_cupy, rmse_cupy = compute_metrics(y, y_pred_cupy)

print(f"\nCuPy Results:")
print(f"Learned coefficients: {coeffs_cupy}")
print(f"R² Score: {r2_cupy:.4f}")
print(f"RMSE: ${rmse_cupy:,.2f}")

# Verify results are identical
print(f"\nCoefficient difference (should be ~0): {np.max(np.abs(coeffs_numpy - coeffs_cupy))}")

Finally, let’s create a benchmark:

def benchmark_performance():
    """Compare NumPy vs CuPy performance across different dataset sizes"""
    sizes = [10000, 1000000, 100000000]
    numpy_times = []
    cupy_times = []

    print("\nPerformance Benchmark:")
    print("Dataset Size | NumPy Time | CuPy Time | Speedup")
    print("-" * 50)

    for size in sizes:
        # Generate data
        X_bench, y_bench, _ = generate_house_data(size)

        # NumPy timing
        start = time.time()
        _ = linear_regression_numpy(X_bench, y_bench)
        numpy_time = time.time() - start
        numpy_times.append(numpy_time)

        # CuPy timing
        start = time.time()
        _ = linear_regression_cupy(X_bench, y_bench)
        cupy_time = time.time() - start
        cupy_times.append(cupy_time)

        speedup = numpy_time / cupy_time
        print(f"{size:>11} | {numpy_time:>9.4f}s | {cupy_time:>8.4f}s | {speedup:>6.2f}x")

    return sizes, numpy_times, cupy_times

# Run benchmark
sizes, numpy_times, cupy_times = benchmark_performance()

I ran the test with 10K, 1M, and 100M records.

Performance Benchmark:
Dataset Size | NumPy Time | CuPy Time | Speedup
--------------------------------------------------
NumPy Linear Regression (CPU)
NumPy computation time: 0.0003 seconds
CuPy Linear Regression (GPU/CUDA)
Data transfer to GPU: 0.0004 seconds
GPU computation time: 0.0006 seconds
Total time (including transfers): 0.0019 seconds
10000 | 0.0003s | 0.0019s | 0.18x
NumPy Linear Regression (CPU)
NumPy computation time: 0.0157 seconds
CuPy Linear Regression (GPU/CUDA)
Data transfer to GPU: 0.0535 seconds
GPU computation time: 0.0007 seconds
Total time (including transfers): 0.1430 seconds
1000000 | 0.0157s | 0.1431s | 0.11x
NumPy Linear Regression (CPU)
NumPy computation time: 2.1988 seconds
CuPy Linear Regression (GPU/CUDA)
Data transfer to GPU: 3.1775 seconds
GPU computation time: 0.0010 seconds
Total time (including transfers): 6.5008 seconds
100000000 | 2.1990s | 6.5009s | 0.34x

Overall time isn’t better with CuPy, but that is because so much of it is spent transferring the data from the CPU to the GPU.

If we only look at the computation itself, the largest run went from 2.2 seconds on the CPU to about 0.001 seconds on the GPU.

That is remarkable.
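
If you drop down to the CUDA level, you can separate the two costs with CUDA events. A minimal sketch of the idea (the scale kernel and the array size here are arbitrary, just to show where the timers go):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 24;                               // ~16M floats
    size_t bytes = n * sizeof(float);
    float *h = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, bytes);

    cudaEvent_t start, afterCopy, afterKernel;
    cudaEventCreate(&start);
    cudaEventCreate(&afterCopy);
    cudaEventCreate(&afterKernel);

    cudaEventRecord(start);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);     // host -> device transfer
    cudaEventRecord(afterCopy);
    scale<<<(n + 255) / 256, 256>>>(d, n);               // the actual compute
    cudaEventRecord(afterKernel);
    cudaEventSynchronize(afterKernel);

    float copyMs, kernelMs;
    cudaEventElapsedTime(&copyMs, start, afterCopy);
    cudaEventElapsedTime(&kernelMs, afterCopy, afterKernel);
    printf("Transfer: %.3f ms, kernel: %.3f ms\n", copyMs, kernelMs);

    cudaFree(d);
    free(h);
    return 0;
}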

Final Notes

You may not need to delve deeply into the CUDA interface. Many data scientists and machine learning engineers use libraries such as PyTorch, which leverage CUDA behind the scenes. Similarly, game developers usually work through graphics APIs such as OpenGL or DirectX rather than programming the GPU directly. In practice, CUDA is frequently abstracted away from the developer. While understanding the basics of CUDA is relatively straightforward, the real challenge is optimizing GPU performance. To do that, it’s essential to master key concepts such as threads, blocks, and grids, along with their relationship to the GPU hardware.

If you want to learn more, here is the training site.

Footnotes

  1. https://www.cs.cmu.edu/afs/cs/academic/class/15869-f11/www/readings/lindholm08_tesla.pdf

  2. https://docs.nvidia.com/cuda/cuda-c-programming-guide/#

  3. https://www.leetgpu.com/playground

  4. https://colab.research.google.com/

  5. https://modal.com/gpu-glossary/device-hardware/cuda-device-architecture