luminary.blog
by Oz Akan

CUDA Programming: An Introduction

Getting started with CUDA programming: Hello Threads

12 min read


Back In The Day

Back in the day, I’d spend weeks planning my next computer build. Motherboard, memory, disk drive – most parts were straightforward choices. The real decisions came down to CPU and GPU. Intel was usually the safe bet for processors, unless AMD had something compelling that generation. But switching meant a new motherboard too, so it had to be worth it.

Then came the GPU decision. Always GeForce, but which model? Should I skip lunch for a few months to get the latest card, or the previous generation? I was mostly playing FIFA and some FPS games on the network with friends, dual-booting between Slackware Linux and Windows like any Linux desktop user would.

At the time, a graphics card was just the thing that made games look good. Then around 2007, NVIDIA released something called CUDA. I remember reading about it but honestly can’t recall if I did anything with it back then. CUDA, the Compute Unified Device Architecture, made these powerful cards useful for more than graphics. All those parallel cores turned out to be perfect for the kind of math-heavy work behind AI and scientific computing. Those gaming GPUs are now powering everything from training LLMs to folding proteins and predicting weather patterns.

Out of curiosity, I wanted to check CUDA out again.

I found that in the Lindholm et al., 2008 white paper 1, NVIDIA called its architecture “Tesla”. The paper says:

To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture. Its scalable parallel array of processors is massively multithreaded and programmable in C or via graphics APIs.

They dropped the name around 2018 due to the confusion with Tesla cars.

Where can I run CUDA?

A search returned the CUDA C++ Programming Guide 2. I knew I had to write C++ and needed an NVIDIA card that could run CUDA. On my Arm-based MacBook, I had no luck; NVIDIA supported the Mac back when it ran Intel CPUs, but it no longer does. So I had to find an online solution. One option was to run a virtual machine in the cloud with a GPU, but that would be extra overhead for my little experiment. Then I found LeetGPU 3, which can run C++ code on a GPU, and Google Colab 4, which runs Jupyter notebooks and supports GPUs.

For the hello world program, it turned out I had to understand some of the basics first.

CUDA Architecture

CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform and programming model that enables developers to use NVIDIA GPUs for general-purpose computing.

Key Concepts

  • Parallel Computing: Execute thousands of threads simultaneously
  • GPU vs CPU: GPUs excel at data-parallel tasks, CPUs at sequential tasks
  • Massive Parallelism: Thousands of GPU cores vs. at most around a hundred CPU cores
  • Cost Effective: Better performance per dollar for parallel workloads

Streaming Multiprocessors (SMs)

  • Each SM5 contains multiple CUDA cores
  • Executes groups of 32 threads (warps) simultaneously
  • Has its own shared memory and register file (see the query sketch below)
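
If you’re curious what these numbers look like on your own card, you can query them at runtime with cudaGetDeviceProperties. Here is a minimal sketch (any CUDA-capable GPU will do; error checking skipped):

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // properties of device 0
    printf("Device: %s\n", prop.name);
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Warp size: %d threads\n", prop.warpSize);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers per block: %d\n", prop.regsPerBlock);
    return 0;
}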

Programming Model

Host vs Device

  • Host: CPU and its memory (RAM)
  • Device: GPU and its memory (VRAM)
  • Kernel: Function that runs on the device (see the sketch after this list)
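
Putting the three together: a typical CUDA program allocates device memory, copies input from the host, launches a kernel, and copies the result back. Here is a minimal sketch of that round trip (the doubleElements kernel is made up for illustration; error checking skipped):

#include <stdio.h>
#include <cuda_runtime.h>

// Kernel: runs on the device, one thread per array element
__global__ void doubleElements(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

int main() {
    const int n = 8;
    int h_data[n] = {1, 2, 3, 4, 5, 6, 7, 8};          // host memory (RAM)

    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));              // device memory (VRAM)
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);

    doubleElements<<<1, n>>>(d_data, n);               // host launches the kernel on the device
    cudaMemcpy(h_data, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    for (int i = 0; i < n; i++) printf("%d ", h_data[i]);   // 2 4 6 8 10 12 14 16
    printf("\n");
    return 0;
}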

Thread Hierarchy

Grid
├── Block 0
│   ├── Thread 0
│   ├── Thread 1
│   └── ...
├── Block 1
│   ├── Thread 0
│   ├── Thread 1
│   └── ...
└── ...

Key Terms

  • Grid: Collection of thread blocks
  • Block: Collection of threads that can cooperate
  • Thread: Individual execution unit (see the index sketch after this list)
  • Warp: Group of 32 threads executed together
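
Inside a kernel these pieces combine: each thread can derive a unique grid-wide index from its block and thread coordinates. Here is a minimal 1D sketch with 2 blocks of 4 threads (the whoAmI kernel is made up for illustration):

#include <stdio.h>
#include <cuda_runtime.h>

// blockIdx  = which block this thread belongs to
// blockDim  = threads per block
// threadIdx = position within the block
__global__ void whoAmI() {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d -> global index %d of %d\n",
           blockIdx.x, threadIdx.x, globalId, gridDim.x * blockDim.x);
}

int main() {
    whoAmI<<<2, 4>>>();           // grid of 2 blocks, 4 threads per block
    cudaDeviceSynchronize();      // wait for the device-side printf output
    return 0;
}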

Memory Hierarchy

Memory Types (Fastest to Slowest):

  1. Registers: Private to each thread
  2. Shared Memory: Shared within a thread block
  3. Global Memory: Accessible by all threads
  4. Host Memory: CPU RAM

Memory Scope and Lifetime (a short sketch after the table shows registers, shared, and global memory in code):

Memory Type | Scope  | Lifetime
------------|--------|------------
Register    | Thread | Thread
Local       | Thread | Thread
Shared      | Block  | Block
Global      | Grid   | Application
Constant    | Grid   | Application
Texture     | Grid   | Application
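
To make the top of that table concrete, here is a small sketch that stages data from global memory into shared memory, synchronizes the block, and writes it back reversed; the per-thread index lives in a register. The reverseInBlock kernel only illustrates the different scopes, it’s not something a tiny one-block array actually needs:

#include <stdio.h>
#include <cuda_runtime.h>

#define BLOCK_SIZE 8

// Stage data in shared memory, sync the block, write it back reversed.
__global__ void reverseInBlock(int *data) {
    __shared__ int tile[BLOCK_SIZE];       // shared: visible to every thread in this block
    int t = threadIdx.x;                   // register: private to this thread
    tile[t] = data[t];                     // global -> shared
    __syncthreads();                       // wait for the whole block
    data[t] = tile[blockDim.x - 1 - t];    // shared -> global, reversed
}

int main() {
    int h_data[BLOCK_SIZE] = {0, 1, 2, 3, 4, 5, 6, 7};     // host memory
    int *d_data;
    cudaMalloc(&d_data, BLOCK_SIZE * sizeof(int));         // global (device) memory
    cudaMemcpy(d_data, h_data, BLOCK_SIZE * sizeof(int), cudaMemcpyHostToDevice);

    reverseInBlock<<<1, BLOCK_SIZE>>>(d_data);
    cudaMemcpy(h_data, d_data, BLOCK_SIZE * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    for (int i = 0; i < BLOCK_SIZE; i++) printf("%d ", h_data[i]);   // 7 6 5 4 3 2 1 0
    printf("\n");
    return 0;
}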

Hello World, Finally

On LeetGPU 3

#include <stdio.h>
#include <cuda_runtime.h>

// Kernel function (runs on GPU)
__global__ void helloFromGPU() {
    printf("Hello World from GPU thread %d!\n", threadIdx.x);
}

int main() {
    // Launch kernel with 10 threads
    helloFromGPU<<<1, 10>>>();
    // Wait for GPU to finish
    cudaDeviceSynchronize();
    return 0;
}

I got the response below:

Running NVIDIA GTX TITAN X in CYCLE ACCURATE mode...
Compiling...
Executing...
Hello World from GPU thread 0!
Hello World from GPU thread 1!
Hello World from GPU thread 2!
Hello World from GPU thread 3!
Hello World from GPU thread 4!
Hello World from GPU thread 5!
Hello World from GPU thread 6!
Hello World from GPU thread 7!
Hello World from GPU thread 8!
Hello World from GPU thread 9!
GPU Execution Time: 3.80 microseconds
Exit status: 0

I had to pick the “CYCLE ACCURATE” mode to emulate the underlying GPU architecture at a detailed, hardware-representative level. The “Functional” mode didn’t print the thread IDs properly.

Remember the grid, block, and thread concepts from above? Let’s create a CUDA grid with a 2D structure: 3 blocks along the x-axis and 3 along the y-axis (9 blocks total), and give each block 2 threads along the x-axis and 2 along the y-axis (4 threads per block, 36 threads in all).

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void printThreadInfo2D() {
    // Calculate unique thread ID from 2D block and thread indices
    int uniqueThreadId = blockIdx.y * gridDim.x * blockDim.x * blockDim.y +
                         blockIdx.x * blockDim.x * blockDim.y +
                         threadIdx.y * blockDim.x + threadIdx.x;
    printf("Thread ID: %d, blockIdx: (%d,%d), threadIdx: (%d,%d)\n",
           uniqueThreadId, blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y);
}

int main() {
    dim3 gridDim(3, 3);
    dim3 blockDim(2, 2);
    printThreadInfo2D<<<gridDim, blockDim>>>();
    cudaDeviceSynchronize();
    return 0;
}

Output:

Running NVIDIA GTX TITAN X in CYCLE ACCURATE mode...
Compiling...
Executing...
Thread ID: 0, blockIdx: (0,0), threadIdx: (0,0)
Thread ID: 1, blockIdx: (0,0), threadIdx: (1,0)
Thread ID: 2, blockIdx: (0,0), threadIdx: (0,1)
Thread ID: 3, blockIdx: (0,0), threadIdx: (1,1)
Thread ID: 4, blockIdx: (1,0), threadIdx: (0,0)
Thread ID: 5, blockIdx: (1,0), threadIdx: (1,0)
Thread ID: 6, blockIdx: (1,0), threadIdx: (0,1)
...
...
...
Thread ID: 27, blockIdx: (0,2), threadIdx: (1,1)
Thread ID: 28, blockIdx: (1,2), threadIdx: (0,0)
Thread ID: 29, blockIdx: (1,2), threadIdx: (1,0)
Thread ID: 30, blockIdx: (1,2), threadIdx: (0,1)
Thread ID: 31, blockIdx: (1,2), threadIdx: (1,1)
Thread ID: 32, blockIdx: (2,2), threadIdx: (0,0)
Thread ID: 33, blockIdx: (2,2), threadIdx: (1,0)
Thread ID: 34, blockIdx: (2,2), threadIdx: (0,1)
Thread ID: 35, blockIdx: (2,2), threadIdx: (1,1)
GPU Execution Time: 3.83 microseconds
Exit status: 0

PyCUDA

Most ML work is done with Python, and I also wanted to use Jupyter notebooks, so I searched for Python bindings. I found a few libraries; one of them was PyCUDA. It is simple to use and basically lets you write CUDA C++ code as a string and run it on the GPU from Python. (NVIDIA also maintains its own bindings, cuda-python.)

Install the library.

!pip install pycuda

Add the CUDA kernel code as a Python string passed to SourceModule.

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda.compiler import SourceModule

# CUDA kernel code as string
mod = SourceModule("""
__global__ void add_kernel(float *a, float *b, float *c, int n) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < n)
        c[idx] = a[idx] + b[idx];
}
""")

# Sample input arrays
n = 16
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c = np.zeros_like(a)

# Get kernel from module
add_func = mod.get_function("add_kernel")

# Launch kernel
add_func(
    drv.In(a), drv.In(b), drv.Out(c), np.int32(n),
    block=(n, 1, 1), grid=(1, 1)
)

print("Result:", c)

Result:

Result: [ 0.08215982 -1.1922485 -0.41126803 -0.53234756 -2.0320995 0.3066362
-0.05669016 -0.7823407 -1.2497976 1.3654544 -0.20398068 -1.0760175
-0.50371385 0.18608434 -1.2050071 0.9978307 ]

From NumPy to CuPy

NumPy is a Python library used for handling and processing large, multidimensional arrays and matrices of numerical data. CuPy does the same but its operations are executed on NVIDIA GPUs using CUDA, allowing for much faster computations for large arrays and mathematical tasks.

I decided to test the performance with a relatively simple ML problem: linear regression. Yes, a home price estimator :).

First, the NumPy version:

import numpy as np
import time
import matplotlib.pyplot as plt

def generate_house_data(n_samples=100000):
    """Generate synthetic house price dataset"""
    np.random.seed(42)

    # Features: [sqft, bedrooms, age, neighborhood_score]
    sqft = np.random.normal(2000, 500, n_samples)
    bedrooms = np.random.poisson(3, n_samples) + 1          # 1-6 bedrooms
    age = np.random.uniform(0, 50, n_samples)               # 0-50 years old
    neighborhood_score = np.random.uniform(1, 10, n_samples)  # 1-10 rating

    # True coefficients (unknown in real world)
    true_coeffs = [50000, 120, 15000, -800, 5000]  # [bias, sqft, bed, age, neighborhood]

    # Generate prices with some noise
    noise = np.random.normal(0, 10000, n_samples)
    prices = (true_coeffs[0] +
              true_coeffs[1] * sqft +
              true_coeffs[2] * bedrooms +
              true_coeffs[3] * age +
              true_coeffs[4] * neighborhood_score +
              noise)

    # Create feature matrix with bias column
    X = np.column_stack([np.ones(n_samples), sqft, bedrooms, age, neighborhood_score])
    return X, prices, true_coeffs

def linear_regression_numpy(X, y):
    """
    Solve linear regression using the normal equation with NumPy (CPU)

    Mathematical steps:
    1. Compute X^T (transpose)
    2. Compute X^T @ X (matrix multiplication)
    3. Compute (X^T @ X)^(-1) (matrix inversion)
    4. Compute X^T @ y (matrix-vector multiplication)
    5. Compute final coefficients: β = (X^T @ X)^(-1) @ X^T @ y
    """
    print("NumPy Linear Regression (CPU)")
    start_time = time.time()

    # Step 1: Compute X transpose
    XT = X.T
    print(f"X shape: {X.shape}, X^T shape: {XT.shape}")

    # Step 2: Compute X^T @ X (covariance matrix)
    XTX = XT @ X
    print(f"X^T @ X shape: {XTX.shape}")

    # Step 3: Compute X^T @ y (feature-target correlations)
    XTy = XT @ y
    print(f"X^T @ y shape: {XTy.shape}")

    # Step 4: Solve (X^T @ X) @ β = X^T @ y
    # Using np.linalg.solve is more stable than computing the inverse directly
    coefficients = np.linalg.solve(XTX, XTy)

    end_time = time.time()
    print(f"NumPy computation time: {end_time - start_time:.4f} seconds")
    return coefficients

def predict_numpy(X, coefficients):
    """Make predictions using learned coefficients"""
    return X @ coefficients

def compute_metrics(y_true, y_pred):
    """Compute R² and RMSE metrics"""
    # R-squared
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1 - (ss_res / ss_tot)
    # RMSE
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return r2, rmse

# Generate dataset
print("Generating house price dataset...")
X, y, true_coeffs = generate_house_data(n_samples=100000)
print(f"Dataset: {X.shape[0]} houses, {X.shape[1]-1} features")
print(f"True coefficients: {true_coeffs}")

# Train model with NumPy
coeffs_numpy = linear_regression_numpy(X, y)
y_pred_numpy = predict_numpy(X, coeffs_numpy)
r2_numpy, rmse_numpy = compute_metrics(y, y_pred_numpy)

print(f"\nNumPy Results:")
print(f"Learned coefficients: {coeffs_numpy}")
print(f"R² Score: {r2_numpy:.4f}")
print(f"RMSE: ${rmse_numpy:,.2f}")

Then the CuPy version:

import cupy as cp
import numpy as np
import time

def linear_regression_cupy(X, y):
    """
    Solve linear regression using the normal equation with CuPy (GPU/CUDA)

    Same mathematical steps as NumPy but executed on the GPU:
    1. Transfer data to GPU memory
    2. Perform matrix operations on the GPU
    3. Transfer results back to the CPU
    """
    print("\nCuPy Linear Regression (GPU/CUDA)")

    # Transfer data to GPU
    start_time = time.time()
    X_gpu = cp.asarray(X)
    y_gpu = cp.asarray(y)
    transfer_time = time.time()
    print(f"Data transfer to GPU: {transfer_time - start_time:.4f} seconds")

    # GPU computations
    compute_start = time.time()

    # Step 1: Compute X transpose
    XT_gpu = X_gpu.T

    # Step 2: Compute X^T @ X (matrix multiplication on GPU)
    XTX_gpu = XT_gpu @ X_gpu

    # Step 3: Compute X^T @ y
    XTy_gpu = XT_gpu @ y_gpu

    # Step 4: Solve linear system on GPU
    coefficients_gpu = cp.linalg.solve(XTX_gpu, XTy_gpu)
    compute_end = time.time()

    # Transfer results back to CPU
    coefficients = cp.asnumpy(coefficients_gpu)
    end_time = time.time()

    print(f"GPU computation time: {compute_end - compute_start:.4f} seconds")
    print(f"Total time (including transfers): {end_time - start_time:.4f} seconds")
    return coefficients

def predict_cupy(X, coefficients):
    """Make predictions using the GPU"""
    X_gpu = cp.asarray(X)
    coeffs_gpu = cp.asarray(coefficients)
    predictions_gpu = X_gpu @ coeffs_gpu
    return cp.asnumpy(predictions_gpu)

# Train model with CuPy (GPU)
coeffs_cupy = linear_regression_cupy(X, y)
y_pred_cupy = predict_cupy(X, coeffs_cupy)
r2_cupy, rmse_cupy = compute_metrics(y, y_pred_cupy)

print(f"\nCuPy Results:")
print(f"Learned coefficients: {coeffs_cupy}")
print(f"R² Score: {r2_cupy:.4f}")
print(f"RMSE: ${rmse_cupy:,.2f}")

# Verify results are identical
print(f"\nCoefficient difference (should be ~0): {np.max(np.abs(coeffs_numpy - coeffs_cupy))}")

Finally, let’s create a benchmark:

def benchmark_performance():
    """Compare NumPy vs CuPy performance across different dataset sizes"""
    sizes = [10000, 1000000, 100000000]
    numpy_times = []
    cupy_times = []

    print("\nPerformance Benchmark:")
    print("Dataset Size | NumPy Time | CuPy Time | Speedup")
    print("-" * 50)

    for size in sizes:
        # Generate data
        X_bench, y_bench, _ = generate_house_data(size)

        # NumPy timing
        start = time.time()
        _ = linear_regression_numpy(X_bench, y_bench)
        numpy_time = time.time() - start
        numpy_times.append(numpy_time)

        # CuPy timing
        start = time.time()
        _ = linear_regression_cupy(X_bench, y_bench)
        cupy_time = time.time() - start
        cupy_times.append(cupy_time)

        speedup = numpy_time / cupy_time
        print(f"{size:>11} | {numpy_time:>9.4f}s | {cupy_time:>8.4f}s | {speedup:>6.2f}x")

    return sizes, numpy_times, cupy_times

# Run benchmark
sizes, numpy_times, cupy_times = benchmark_performance()

I ran the test with 10K, 1M, and 100M records.

Performance Benchmark:
Dataset Size | NumPy Time | CuPy Time | Speedup
--------------------------------------------------
NumPy Linear Regression (CPU)
NumPy computation time: 0.0003 seconds
CuPy Linear Regression (GPU/CUDA)
Data transfer to GPU: 0.0004 seconds
GPU computation time: 0.0006 seconds
Total time (including transfers): 0.0019 seconds
10000 | 0.0003s | 0.0019s | 0.18x
NumPy Linear Regression (CPU)
NumPy computation time: 0.0157 seconds
CuPy Linear Regression (GPU/CUDA)
Data transfer to GPU: 0.0535 seconds
GPU computation time: 0.0007 seconds
Total time (including transfers): 0.1430 seconds
1000000 | 0.0157s | 0.1431s | 0.11x
NumPy Linear Regression (CPU)
NumPy computation time: 2.1988 seconds
CuPy Linear Regression (GPU/CUDA)
Data transfer to GPU: 3.1775 seconds
GPU computation time: 0.0010 seconds
Total time (including transfers): 6.5008 seconds
100000000 | 2.1990s | 6.5009s | 0.34x

Overall time isn’t better with CuPy, but that is because so much of it is spent transferring the data from the CPU to the GPU.

If we only look at the computation itself, the largest run went from 2.2 seconds on the CPU to about 0.001 seconds on the GPU.

That is remarkable.
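
If you drop down to the CUDA level, you can separate the two costs with CUDA events. A minimal sketch of the idea (the scale kernel and the array size here are arbitrary, just to show where the timers go):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 24;                               // ~16M floats
    size_t bytes = n * sizeof(float);
    float *h = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, bytes);

    cudaEvent_t start, afterCopy, afterKernel;
    cudaEventCreate(&start);
    cudaEventCreate(&afterCopy);
    cudaEventCreate(&afterKernel);

    cudaEventRecord(start);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);     // host -> device transfer
    cudaEventRecord(afterCopy);
    scale<<<(n + 255) / 256, 256>>>(d, n);               // the actual compute
    cudaEventRecord(afterKernel);
    cudaEventSynchronize(afterKernel);

    float copyMs, kernelMs;
    cudaEventElapsedTime(&copyMs, start, afterCopy);
    cudaEventElapsedTime(&kernelMs, afterCopy, afterKernel);
    printf("Transfer: %.3f ms, kernel: %.3f ms\n", copyMs, kernelMs);

    cudaFree(d);
    free(h);
    return 0;
}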

Final Notes

You may not need to delve deeply into the CUDA interface. Many data scientists and machine learning engineers use libraries such as PyTorch, which leverage CUDA behind the scenes. Similarly, game developers usually work through graphics APIs such as OpenGL or DirectX rather than programming the GPU directly. In practice, CUDA is frequently abstracted away from the developer. While understanding the basics of CUDA is relatively straightforward, the real challenge is optimizing GPU performance. To do that, it’s essential to master key concepts such as threads, blocks, and grids, along with their relationship to the GPU hardware.

If you want to learn more, here is the training site.

Footnotes

  1. https://www.cs.cmu.edu/afs/cs/academic/class/15869-f11/www/readings/lindholm08_tesla.pdf

  2. https://docs.nvidia.com/cuda/cuda-c-programming-guide/#

  3. https://www.leetgpu.com/playground

  4. https://colab.research.google.com/

  5. https://modal.com/gpu-glossary/device-hardware/cuda-device-architecture