Introduction
There is a particular kind of frustration reserved for the engineer who opens a job posting for "HPC" and discovers it is asking for PyTorch experience. There is an equal and opposite frustration reserved for the ML engineer who is handed a cluster and told to "just use MPI" without further guidance.
Both of these engineers are now, for better or worse, the same person.
This book exists because the HPC and ML communities spent the better part of a decade insisting they had nothing to do with each other, and then spent 2022–2025 building the same data center, buying the same accelerators, and filing the same support tickets. The reconciliation is still ongoing. This guide attempts to give working engineers the vocabulary and the mental models to operate in both worlds — and to stop being confused by people who act like they are different worlds.
The State of HPC in 2026
The hardware landscape has consolidated in some ways and fragmented in others.
GPU dominance is complete — at least for workloads that can be parallelized. The era of debating whether GPUs were suitable for "real" HPC is over. H100s and their successors run everything from molecular dynamics to transformer training. The question is no longer whether to use a GPU but which GPU, which memory tier, and which programming model.
The GPU monoculture has cracked. For most of the 2010s, "GPU computing" meant CUDA, and CUDA meant NVIDIA. That is no longer a safe assumption. AMD's ROCm ecosystem reached a level of practical maturity around 2024–2025 that made it genuinely deployable — not a heroic research project, but something you could put into production without a dedicated champion willing to suffer for the cause. And Apple Silicon — with its unified memory architecture and Metal compute shaders — introduced a third credible path that matters most at the workstation level but has begun influencing how people think about memory bandwidth more broadly.
Classical HPC primitives are not going away. MPI turned 30 in 2024 and shows no signs of retirement. OpenMP is still how you thread a loop when you do not want to write a GPU kernel. Vectorization remains one of the most reliable sources of free performance that engineers consistently leave on the table. The ML community's habit of treating the CPU as a staging area for GPU work ignores a lot of performance sitting in plain sight.
The infrastructure has normalized. Kubernetes-based HPC, once a punchline, is now common. Job schedulers still dominate the traditional HPC center, but the boundaries are porous. Engineers increasingly need to understand both paradigms.
What This Book Is and Is Not
This is a practical technical guide. It contains real code. It makes real recommendations. It will sometimes tell you that a tool or platform is worse than the marketing suggests, because that is useful information.
It is not a comprehensive API reference. The official documentation exists and you should use it. It is not an academic survey with 200 citations per chapter. It is also not a beginner's guide to programming; it assumes you write code for a living and have a passing familiarity with systems programming concepts.
The target reader is an engineer who:
- Has shipped software that runs on real hardware under real constraints
- Understands memory allocation, cache locality, and why pointer aliasing matters
- Wants to understand GPU computing at the model level, not just the API level
- Is comfortable reading C, C++, or Python and can tolerate the others
A Note on the Platform Landscape
Throughout this book, code examples will be provided in CUDA, HIP (ROCm's CUDA-compatible API), Metal Shading Language, C with OpenMP, and C with MPI. Where frameworks diverge, we'll say so explicitly rather than pretending a single abstraction covers everything.
Benchmarks are hard to make meaningful and easy to make misleading. Where we quote performance numbers, we quote them with hardware context and an acknowledgment that your results will differ. Take any benchmark in any HPC book — including this one — as a directional signal, not a contract.
How to Read This Book
The chapters are roughly ordered from hardware-adjacent to workload-level. The GPU chapters (Foundations) can be read in any order if you already have context on one or two of the platforms. The parallel programming models section (MPI, OpenMP, vectorization) is self-contained. The final section on ML/HPC convergence draws on everything that came before.
If you are new to GPU computing entirely, read the chapters in order.
If you have CUDA experience and want to understand ROCm or Metal, skip to the relevant chapter and refer back as needed.
If you are a traditional HPC engineer trying to understand why everyone wants to put a transformer on your cluster, start with the final two chapters and work backward.
Let us get into it.
GPU Computing Fundamentals
Before writing a single line of CUDA, HIP, or Metal, it is worth understanding what a GPU actually is and why it behaves the way it does. The programming model makes more sense once you know what the hardware is doing, and most of the performance mistakes engineers make trace back to a mismatch between their mental model of the GPU and reality.
The Wrong Mental Model
The common misconception is that a GPU is a fast CPU with many cores. It is not. A GPU is an architecture optimized for a very different workload profile, and the differences are more fundamental than clock speed or core count.
A modern high-end CPU — say, an AMD EPYC Genoa — might have 96 cores running at 3–4 GHz, each with large private caches, out-of-order execution engines, branch predictors, and all the machinery that makes single-threaded latency low. That CPU is designed to make any given thread of execution run as fast as possible, even if it means burning transistors on speculation, prefetching, and reordering instructions.
A GPU — say, an NVIDIA H100 SXM — has 16,896 CUDA cores organized into 132 Streaming Multiprocessors. Each individual core is simple. It does not speculate. It does not reorder instructions. It executes one instruction at a time in lockstep with 31 other cores in its warp. What the GPU does instead of latency reduction is latency hiding: it maintains thousands of in-flight warps simultaneously, and when one warp stalls waiting for memory, it switches instantly to another. Zero context-switch overhead. The latency is still there; the GPU just ignores it by having enough other work to do.
This is the fundamental tradeoff: throughput vs. latency.
GPU Architecture Components
Streaming Multiprocessors (SMs) / Compute Units (CUs) / GPU Cores
The top-level compute unit in NVIDIA GPUs is the Streaming Multiprocessor (SM). In AMD terminology it is a Compute Unit (CU). In Apple Silicon it is a GPU Core. The names differ; the concept is similar.
Each SM on an H100 contains:
- 128 CUDA cores (FP32 ALUs)
- 64 FP64 units (double-precision)
- 4 Tensor Cores (for matrix operations)
- 256 KB of register file
- 228 KB of shared memory / L1 cache
- Load/store units and special function units (SFUs)
The SM is the scheduling unit. It manages warps. It is the level at which occupancy — the fraction of maximum possible warps that are resident at once — is calculated.
Warps and SIMT
NVIDIA executes threads in groups of 32 called warps. AMD's equivalent is the wavefront — 64 threads wide on CDNA datacenter GPUs, while RDNA consumer GPUs support both 32- and 64-wide wavefronts. This is the Single Instruction, Multiple Thread (SIMT) execution model.
Within a warp, all 32 threads execute the same instruction at the same time. When threads diverge — taking different branches based on per-thread data — the GPU executes both branches and masks off the inactive threads in each path. This is called warp divergence, and it is the most common source of GPU performance loss that programmers have direct control over.
If you have an if/else where half the threads take each branch, you have effectively halved your throughput for that section of code. If you can reorganize your data so that threads in the same warp make the same branch decision, you can recover that performance. This is one of the reasons GPU kernels often have awkward-looking data layouts.
Memory Hierarchy
The memory hierarchy is where most GPU performance is won or lost.
Global memory (HBM on datacenter GPUs): This is your main GPU memory. H100 SXM5 has 80 GB of HBM3 with ~3.35 TB/s bandwidth. Fast by any CPU memory standard. Slow from the perspective of a compute unit sitting next to it wanting 100+ TB/s of throughput. Every kernel that runs on the GPU is, at some level, fighting to keep the math units fed faster than global memory can supply them.
L2 cache: Shared across all SMs. On H100, 50 MB. This is the critical staging area for data that multiple SMs access. If your working set fits in L2, your effective bandwidth approaches what you'd get on-chip.
Shared memory / L1 cache: This is per-SM, explicitly managed (for the shared memory portion), and fast — on the order of 19 TB/s for the H100. When you see __shared__ in CUDA code, this is where data lives. The programmer controls what goes in shared memory and when, which is both the power and the burden of the model.
Registers: The fastest storage. Per-thread. On H100, each SM has 256 KB of register file shared across all resident warps. Register pressure — using too many registers per thread — reduces occupancy because fewer warps can be resident simultaneously.
| Tier | Bandwidth | Notes |
|---|---|---|
| Registers | effectively unlimited | per-thread |
| Shared memory | ~19 TB/s | per-SM, programmer-managed |
| L1 cache | — | merged with shared memory in recent NVIDIA designs |
| L2 cache | ~12 TB/s | shared across all SMs |
| HBM (VRAM) | ~3.35 TB/s | H100 SXM5 |
| PCIe/NVLink | ~900 GB/s | NVLink 4.0, bidirectional |
| CPU DRAM | ~460 GB/s | DDR5-4800, 8-channel |
Latencies follow the inverse pattern. Registers are essentially free. Global memory accesses that miss all caches take 400–800 cycles. This is why the GPU needs thousands of in-flight warps just to keep its arithmetic units busy.
Memory Coalescing
When threads in a warp access global memory, the hardware combines individual thread accesses into as few memory transactions as possible. If 32 threads each access a contiguous 4-byte element at addresses that span a 128-byte aligned region, that is one memory transaction. If they access random addresses scattered across memory, that is 32 separate transactions — 32x the bandwidth consumption, 32x the latency exposure.
Coalesced access is not optional. It is one of the first things to check when a kernel underperforms.
// Coalesced: thread i accesses element i
float val = data[threadIdx.x + blockIdx.x * blockDim.x];
// Not coalesced: thread i accesses element i * stride (with large stride)
float val = data[(threadIdx.x + blockIdx.x * blockDim.x) * stride];
The second pattern can easily run at 1/32 of the bandwidth of the first.
The Compute Throughput Problem
Modern GPUs are computationally powerful in a way that creates a new problem: it has become very easy to be memory-bound.
The H100 delivers ~67 TFLOPS of FP64 and ~989 TFLOPS of FP16. The memory bandwidth is ~3.35 TB/s. For an operation that moves A bytes and performs N floating-point operations, the arithmetic intensity is N/A FLOP/byte. Below the hardware's ridge point — the ratio of peak compute to peak bandwidth — you are memory-bound. Above it, compute-bound.
For FP64: 67e12 / 3.35e12 ≈ 20 FLOP/byte is the crossover. A dense matrix-matrix multiply (GEMM) at large sizes achieves much higher arithmetic intensity — O(N³) flops over O(N²) data, so intensity grows as O(N) with tile size — which is why GEMM is one of the few operations that can actually saturate a GPU's compute units. A simple element-wise operation on a large array (read, compute, write) has an arithmetic intensity well under 1 FLOP/byte — deep in memory-bound territory regardless of how fast the ALUs are.
This is not unique to GPUs; the same analysis applies to CPUs. It is just more dramatic on GPUs where the compute-to-bandwidth ratio is higher.
The CPU-GPU Interface
Every GPU kernel invocation involves the CPU. The CPU launches the kernel, manages memory transfers, and handles synchronization. The latency of this interface matters.
PCIe bandwidth between CPU and GPU is the bottleneck for workloads that move data frequently across the interface. PCIe 5.0 x16 provides ~64 GB/s in each direction, compared to 3,350 GB/s of HBM bandwidth on-device. The rule is simple: data that moves across PCIe frequently should not be on the GPU.
NVLink (NVIDIA) and XGMI (AMD) provide high-bandwidth GPU-to-GPU interconnects — up to 900 GB/s for NVLink 4.0. These enable multi-GPU workloads to share memory across GPUs with substantially lower bandwidth penalty than going through host memory. On a DGX H100, 8 GPUs are connected in an all-reduce-friendly topology via NVLink, enabling collective operations (AllReduce, AllGather) that are fast enough to make distributed training practical at scale.
CUDA Unified Memory (and its equivalents in ROCm and Metal) provides a programming abstraction that hides explicit data transfers. The hardware migrates pages between CPU and GPU on demand. This is convenient for getting code working. It is not an optimization strategy; it is a tool for avoiding having to think about data movement, which carries the usual costs of not thinking about data movement.
Occupancy and Its Limits
Occupancy is the fraction of maximum warps that are simultaneously resident on an SM. High occupancy enables better latency hiding. But high occupancy is not the same thing as high performance.
The limits on occupancy are:
- Register pressure: More registers per thread → fewer threads per SM → lower occupancy
- Shared memory usage: More shared memory per block → fewer blocks per SM → lower occupancy
- Block size: warps and other resources are allocated per block, so block sizes that do not evenly divide the SM's warp capacity leave slots unused
Tools like nvcc --ptxas-options=-v (CUDA) and rocprof (ROCm) report register usage. The CUDA occupancy calculator (also available as a spreadsheet from NVIDIA) will tell you the theoretical maximum occupancy given your kernel's resource usage.
The caveat: a kernel with 100% theoretical occupancy and poor memory access patterns will underperform a kernel with 50% occupancy and well-optimized memory access. Occupancy is a prerequisite for latency hiding, not a substitute for correct behavior.
A Framework for Thinking About GPU Performance
When a GPU kernel is slow, the cause is almost always one of the following:
- Memory bandwidth saturation — the kernel is reading/writing too much data relative to computation (low arithmetic intensity). Profile with ncu or rocprof and look at memory throughput vs. peak.
- Warp divergence — threads in a warp are taking different code paths. Profile for "active warps" vs "eligible warps."
- Uncoalesced memory access — threads in a warp are not accessing contiguous memory. L2 cache hit rate will be low; memory transactions per warp will be high.
- Occupancy too low — not enough warps to hide latency. Register or shared memory usage is too high. Reduce resource usage per thread or accept that you need a different algorithmic approach.
- Kernel launch overhead — too many small kernel launches. Launch overhead is on the order of microseconds; for work measured in tens of microseconds, this matters. Fuse kernels or use CUDA graphs.
- PCIe transfer overhead — moving data between CPU and GPU too frequently. Restructure to do more work on-device between transfers.
These categories apply equally to CUDA, ROCm, and Metal — the profiling tools differ but the underlying physics does not.
The Next Chapters
The following chapters go deep on each of the three major GPU programming ecosystems: CUDA (Chapter 3), ROCm/HIP (Chapter 4), and Metal (Chapter 5). Each covers the programming model, memory management, kernel writing, and performance analysis tools specific to that platform.
If you already understand one platform well, you can read the others with the mental model from this chapter as context and note what is the same and what differs. The underlying physics is the same. The tradeoffs are the same. The syntax and toolchain are where they diverge.
The CUDA Programming Model
CUDA is nearly two decades old, controls roughly 90% of the GPU compute market, and remains the standard against which everything else is measured. Understanding CUDA is not optional for HPC in 2026 — even if you end up using ROCm or Metal, you will be reading CUDA documentation, porting CUDA code, or debugging problems that are explained in CUDA terms.
This chapter covers the CUDA programming model in depth: the execution hierarchy, memory management, synchronization, and the performance tools you need to stop guessing and start measuring.
The Execution Hierarchy
CUDA organizes computation into a three-level hierarchy: grids, blocks, and threads.
A kernel is a function that runs on the GPU. When you launch a kernel, you specify a grid of thread blocks. Each block contains threads. The threads within a block can share memory and synchronize with each other. Threads in different blocks cannot directly communicate during a kernel.
Grid (one per kernel launch)
└── Block (0, 0, 0), Block (1, 0, 0), Block (2, 0, 0), ...
└── Thread (0,0,0), Thread (1,0,0), ..., Thread (blockDim.x-1, 0, 0)
The grid and block dimensions can be up to 3D. In practice, most compute kernels use 1D or 2D layouts matching the problem structure (arrays, matrices, images).
// Kernel definition: runs on the GPU
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
c[idx] = a[idx] + b[idx];
}
}
// Kernel launch from the host (CPU)
int block_size = 256;
int grid_size = (n + block_size - 1) / block_size; // ceiling division
vector_add<<<grid_size, block_size>>>(d_a, d_b, d_c, n);
The <<<grid_size, block_size>>> syntax is the kernel launch configuration. It tells CUDA how many blocks to create and how many threads per block. Block sizes of 128, 256, or 512 are conventional starting points; optimal values depend on register pressure, shared memory usage, and the specific hardware.
Thread Indexing
blockIdx and threadIdx are built-in variables that give each thread its position in the grid. A thread's globally unique index in a 1D launch is:
int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
For 2D:
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
The bounds check (if (idx < n)) matters. Grid sizes are always rounded up to fill blocks, so the last block may contain threads with no valid work. Without the bounds check, those threads will access out-of-bounds memory.
Memory Management
CUDA operates with a host (CPU) address space and a device (GPU) address space. In the general case, these are separate. Data must be explicitly moved between them.
Basic Allocation and Transfer
float *h_a = (float*)malloc(n * sizeof(float)); // host allocation
float *d_a;
cudaMalloc(&d_a, n * sizeof(float)); // device allocation
// Host → Device
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
// Device → Host
cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
// Cleanup
cudaFree(d_a);
free(h_a);
cudaMemcpy is synchronous by default — it blocks until the transfer completes. For overlapping computation and data transfer, use cudaMemcpyAsync with streams (covered below).
Pinned Memory
Regular host memory allocated with malloc is pageable. The CUDA driver must stage transfers through a locked buffer, costing an extra copy. Pinned memory (also called page-locked memory) removes this staging copy:
float *h_a;
cudaMallocHost(&h_a, n * sizeof(float)); // pinned allocation
// ... use h_a ...
cudaFreeHost(h_a);
Pinned transfers are typically 2× faster than pageable transfers and are a prerequisite for async transfers. The downside: pinned memory is scarce. Over-allocating it degrades overall system performance because the OS can't swap it out.
Shared Memory
Shared memory is the most important tool for GPU kernel optimization. It is fast (on-chip), explicitly managed by the programmer, and scoped to a thread block.
The canonical use case is tiled matrix multiplication: instead of each thread reading from global memory independently, the block cooperatively loads a tile into shared memory, computes from the tile, then loads the next tile.
__global__ void matmul_tiled(const float* A, const float* B, float* C,
int M, int N, int K) {
const int TILE = 16;
__shared__ float As[TILE][TILE];
__shared__ float Bs[TILE][TILE];
int row = blockIdx.y * TILE + threadIdx.y;
int col = blockIdx.x * TILE + threadIdx.x;
float acc = 0.0f;
for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
// Cooperatively load tile into shared memory
if (row < M && t * TILE + threadIdx.x < K)
As[threadIdx.y][threadIdx.x] = A[row * K + t * TILE + threadIdx.x];
else
As[threadIdx.y][threadIdx.x] = 0.0f;
if (col < N && t * TILE + threadIdx.y < K)
Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
else
Bs[threadIdx.y][threadIdx.x] = 0.0f;
__syncthreads(); // Wait for all threads to finish loading
for (int k = 0; k < TILE; ++k)
acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
__syncthreads(); // Wait before overwriting shared memory
}
if (row < M && col < N)
C[row * N + col] = acc;
}
__syncthreads() is a block-level barrier. Every thread in the block must reach this point before any thread proceeds. Calling it inside a divergent branch (inside an if statement where some threads don't execute it) is undefined behavior.
Unified Memory
cudaMallocManaged allocates memory accessible from both CPU and GPU, with the driver handling page migration automatically.
float *data;
cudaMallocManaged(&data, n * sizeof(float));
// data is accessible on CPU and GPU
// driver migrates pages as needed
Use this for prototyping and correctness testing. For production code, explicit transfers with careful placement typically outperform the driver's heuristic page migration — especially for workloads with predictable access patterns.
Streams and Concurrency
A CUDA stream is a sequence of operations that execute in order on the device. Operations in different streams can overlap.
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
// These can overlap if the GPU has capacity
cudaMemcpyAsync(d_a, h_a, size, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(d_b, h_b, size, cudaMemcpyHostToDevice, stream2);
kernel_a<<<grid, block, 0, stream1>>>(d_a, d_result_a);
kernel_b<<<grid, block, 0, stream2>>>(d_b, d_result_b);
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
The classic use of streams is the copy-compute overlap pattern: while the GPU is processing batch N, stream in batch N+1. This requires pinned host memory and enough device memory to hold two batches simultaneously. The speedup is real — typically 20–40% for data-bound workloads — and the code complexity is manageable.
CUDA Graphs
For workloads consisting of many small kernels launched in a fixed pattern, kernel launch overhead accumulates. CUDA Graphs capture a set of kernel launches and their dependencies as a graph, then replay the entire graph with a single API call.
// Capture phase
cudaGraph_t graph;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
kernel_a<<<grid, block, 0, stream>>>(args_a);
kernel_b<<<grid, block, 0, stream>>>(args_b);
kernel_c<<<grid, block, 0, stream>>>(args_c);
cudaStreamEndCapture(stream, &graph);
// Instantiate (compile the graph)
cudaGraphExec_t graph_exec;
cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);
// Execute (much lower overhead than launching kernels individually)
cudaGraphLaunch(graph_exec, stream);
cudaStreamSynchronize(stream);
CUDA Graphs are essential for inference workloads where the same computation pattern repeats across many inputs. They are less useful when the computation graph changes shape (variable sequence lengths, dynamic shapes).
Synchronization Primitives
Device-wide Synchronization
cudaDeviceSynchronize() blocks the host until all GPU operations complete. Use it for correctness testing and timing. Do not use it in performance-critical paths.
Events
CUDA events are the right tool for timing GPU operations:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, stream);
kernel<<<grid, block, 0, stream>>>(args);
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop); // Wait for stop event to complete
float elapsed_ms;
cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("Kernel time: %.3f ms\n", elapsed_ms);
cudaEventDestroy(start);
cudaEventDestroy(stop);
Events are placed in the stream's command queue. cudaEventRecord is asynchronous from the host's perspective — it does not block. cudaEventSynchronize blocks until the GPU has processed that event.
Warp-Level Primitives
Modern CUDA (compute capability 7.0+, Volta and later) exposes warp-level operations that do not require shared memory for intra-warp communication.
Warp Shuffle
Threads in a warp can directly exchange registers without going through shared memory:
// Warp reduction using shuffle
__device__ float warp_reduce_sum(float val) {
for (int offset = warpSize / 2; offset > 0; offset /= 2)
val += __shfl_down_sync(0xffffffff, val, offset);
return val; // Lane 0 holds the sum of all 32 values
}
__shfl_down_sync exchanges values within a warp. The first argument is the mask of participating threads (0xffffffff = all 32). The second is the value being shared. The third is the lane offset. This is faster than a shared-memory reduction and simpler to write.
Cooperative Groups
The Cooperative Groups API provides a structured way to work with groups of threads at any granularity:
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
__global__ void reduce(float* data, float* result, int n) {
cg::thread_block block = cg::this_thread_block();
cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);
int idx = blockIdx.x * blockDim.x + threadIdx.x;
float val = (idx < n) ? data[idx] : 0.0f;
// Warp-level reduce
val = cg::reduce(warp, val, cg::plus<float>());
// One thread per warp writes to shared memory
__shared__ float warp_sums[32];
if (warp.thread_rank() == 0)
warp_sums[warp.meta_group_rank()] = val;
block.sync();
// Final reduction across warp sums (in first warp)
if (warp.meta_group_rank() == 0) {
val = (warp.thread_rank() < block.num_threads() / 32)
? warp_sums[warp.thread_rank()]
: 0.0f;
val = cg::reduce(warp, val, cg::plus<float>());
if (warp.thread_rank() == 0)
atomicAdd(result, val);
}
}
Cooperative Groups is the modern way to write reductions, scans, and other patterns that require coordination across subsets of threads. It is more composable than the older __syncthreads + shared memory approach and easier to reason about correctness.
Tensor Cores
Volta (2017) introduced Tensor Cores: specialized matrix multiplication units that operate on small matrix tiles (4×4 on Volta) in mixed precision. On H100, Tensor Cores deliver up to 3,958 TFLOPS of FP8 throughput with structured sparsity (~1,979 TFLOPS dense) — roughly 30× the FP64 throughput on the same chip even at the dense figure.
Tensor Cores are exposed through:
- WMMA API: Warp-level matrix operations, explicit control
- cuBLAS: High-level GEMM, handles Tensor Core usage automatically
- CUTLASS: Template library for efficient GEMM and related operations
For most users, cuBLAS is the right answer. CUTLASS when you need custom epilogue fusions. WMMA when you are writing a research implementation and need explicit control. Hand-written Tensor Core code that beats cuBLAS is rare and difficult.
// Let cuBLAS handle Tensor Cores
cublasHandle_t handle;
cublasCreate(&handle);
cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH); // opt in to TF32 Tensor Cores for FP32 GEMM (CUBLAS_TENSOR_OP_MATH is deprecated)
float alpha = 1.0f, beta = 0.0f;
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
N, M, K,
&alpha,
d_B, N,
d_A, K,
&beta,
d_C, N);
Error Handling
CUDA functions return cudaError_t. Ignoring return values is how bugs become hour-long debugging sessions.
#define CUDA_CHECK(call) \
do { \
cudaError_t err = (call); \
if (err != cudaSuccess) { \
fprintf(stderr, "CUDA error at %s:%d: %s\n", \
__FILE__, __LINE__, cudaGetErrorString(err)); \
exit(EXIT_FAILURE); \
} \
} while (0)
// Usage
CUDA_CHECK(cudaMalloc(&d_data, size));
CUDA_CHECK(cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice));
After kernel launches, check for errors with:
kernel<<<grid, block>>>(args);
CUDA_CHECK(cudaGetLastError()); // Check for launch configuration errors
CUDA_CHECK(cudaDeviceSynchronize()); // Check for runtime errors during execution
cudaGetLastError catches errors set during kernel launch (bad grid/block dimensions, etc.). cudaDeviceSynchronize is required to surface errors that occur during kernel execution, since kernel launches are asynchronous.
Profiling with Nsight Compute
ncu (Nsight Compute) is the authoritative profiling tool for CUDA kernels. It collects hardware performance counters per kernel.
# Profile all kernels in a binary
ncu --set full ./my_program
# Profile a specific kernel
ncu --kernel-name my_kernel --set full ./my_program
# Key metrics to examine
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum,\
sm__warps_active.avg.pct_of_peak_sustained_active \
./my_program
The three questions ncu answers:
- Is the kernel compute-bound or memory-bound? Look at sm__throughput vs l1tex__t_bytes.
- Is the memory pattern efficient? Look at L1/L2 hit rates and coalescing metrics.
- Is occupancy limiting latency hiding? Look at active warps vs theoretical maximum.
Nsight Systems (nsys) is the higher-level profiler for understanding the interaction between CPU and GPU, stream behavior, and data transfer overlap:
nsys profile --trace=cuda,nvtx ./my_program
nsys-ui report.nsys-rep # Open GUI
Both tools are free and ship with the CUDA toolkit.
The CUDA Ecosystem
CUDA's real moat is not the programming model — it is the ecosystem built on top of it.
- cuBLAS: Dense linear algebra. The gold standard for GEMM.
- cuSPARSE: Sparse matrix operations.
- cuFFT: Fast Fourier transforms.
- cuDNN: Deep learning primitives (convolutions, attention, normalization).
- NCCL: GPU-to-GPU collective communications (AllReduce, Broadcast).
- Thrust: STL-like algorithms for GPUs.
- CUB: Block- and warp-level primitives for high-performance kernel building.
This ecosystem is the primary reason teams stay on CUDA even when they have access to AMD or other hardware. Porting a workflow is not just a matter of porting CUDA kernels to HIP; it is also porting or replacing all the library dependencies that sit between your code and the hardware.
What CUDA Does Not Do Well
No honest treatment of CUDA omits the frustrations.
The toolchain is NVIDIA-specific. nvcc compiles CUDA C++. It is a compiler extension to standard C++, and the interaction between CUDA extensions and modern C++ features is occasionally surprising. Clang has CUDA support that is generally good but not perfect.
Debugging is painful. cuda-gdb exists. It works. It is not pleasant. Memory errors produce CUDA error: an illegal memory access was encountered with no indication of which access or why. Tools like compute-sanitizer (--tool memcheck) help, but GPU debugging in 2026 is still more painful than CPU debugging.
The API surface is enormous. CUDA 12 has hundreds of API functions. Knowing which abstraction to use when is not obvious from documentation. The answer is usually: use a library (cuBLAS, cuDNN, NCCL) before writing your own kernel; write CUB primitives before writing raw CUDA; write raw CUDA only when the library doesn't expose what you need.
Vendor lock-in is real. CUDA code runs on NVIDIA hardware only. The portability problem is solved by HIP (Chapter 4), which provides a CUDA-compatible API that compiles for both NVIDIA and AMD, at the cost of some NVIDIA-specific features.
Despite these rough edges, CUDA remains the most complete, best-documented, and most-supported GPU computing platform in existence. The frustrations are well-understood and generally manageable. If you are starting a new GPU compute project in 2026 and do not have specific reasons to choose otherwise, CUDA is still the reasonable default.
ROCm: AMD's Open Platform
ROCm (Radeon Open Compute) has had a difficult adolescence. For most of its life — releases 1.x through 4.x — it was technically promising and practically painful. The installation process was fragile, hardware support was narrow, the ecosystem was sparse, and asking anyone to deploy it in production required either genuine idealism about the open-source GPU future or a procurement decision that precluded NVIDIA.
ROCm 5.x (2022–2023) was where things changed. ROCm 6.x continued the maturation. By 2025, it reached a state where the question shifted from "can we make this work?" to "is this the right tool for this job?" — which is the only question worth asking about any mature tool. This chapter answers that question with specifics.
HIP: The Portability Layer
Before the ROCm platform itself, it is useful to understand HIP, because HIP is how most CUDA code migrates to AMD hardware.
HIP (Heterogeneous-compute Interface for Portability) is a C++ runtime API and kernel language that is syntactically near-identical to CUDA. A valid HIP program can compile and run on NVIDIA GPUs (via the CUDA backend) or AMD GPUs (via the ROCm backend) with minimal or no source changes. Most of the CUDA concepts map directly:
| CUDA | HIP |
|---|---|
| cudaMalloc | hipMalloc |
| cudaMemcpy | hipMemcpy |
| __global__ | __global__ |
| threadIdx.x | threadIdx.x |
| cudaDeviceSynchronize | hipDeviceSynchronize |
| nvcc | hipcc |
The mechanical translation from CUDA to HIP can be done with the hipify-perl or hipify-clang tools:
hipify-clang my_cuda_kernel.cu --cuda-gpu-arch=sm_80 -- -I/usr/local/cuda/include
This produces HIP source that compiles with hipcc. The tool handles the syntactic translation correctly in most cases. What it does not handle: CUDA-specific library calls (cuBLAS → rocBLAS, cuDNN → MIOpen), NVIDIA-specific intrinsics with no AMD equivalent, and architecture-specific tuning that worked well on NVIDIA and may not on AMD.
HIP Kernel Example
#include <hip/hip_runtime.h>
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
c[idx] = a[idx] + b[idx];
}
}
int main() {
const int n = 1 << 20;
const size_t size = n * sizeof(float);
float *d_a, *d_b, *d_c;
hipMalloc(&d_a, size);
hipMalloc(&d_b, size);
hipMalloc(&d_c, size);
// ... host allocation, initialization, memcpy ...
dim3 block(256);
dim3 grid((n + 255) / 256);
hipLaunchKernelGGL(vector_add, grid, block, 0, 0, d_a, d_b, d_c, n);
hipDeviceSynchronize();
hipFree(d_a);
hipFree(d_b);
hipFree(d_c);
}
hipLaunchKernelGGL is the HIP kernel launch syntax. The <<<...>>> syntax also works with hipcc but the explicit form is preferred for portability with strict C++ compilers.
AMD GPU Architecture
Understanding the differences between AMD and NVIDIA microarchitecture helps explain where ROCm code needs tuning beyond a mechanical CUDA port.
Compute Units and Wavefronts
AMD Compute Units (CUs) are conceptually similar to NVIDIA SMs but differ in important details. On CDNA3 (Instinct MI300 series), the wavefront size is 64 threads — double NVIDIA's warp size of 32. Consumer RDNA3 parts (RX 7000 series) default to 32-wide wave32 execution but can also run wave64.
The larger wavefront size means:
- More register file per wavefront (64 threads worth vs 32)
- Branch divergence costs more when it occurs (64 masked threads vs 32)
- Warp-shuffle-equivalent operations (the `__shfl` family in HIP) operate on 64 lanes
Code tuned for 32-wide warps may not be optimal on 64-wide wavefronts without adjustment to block sizes and reduction patterns.
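The adjustment can be made concrete with a small planning model. Assume the common block-reduction structure: intra-wave shuffles collapse each wavefront, then shared memory combines one partial per wavefront. The `reduction_plan` helper below is an illustrative sketch, not a real API:

```python
import math

def reduction_plan(block_size: int, wave_width: int) -> dict:
    """Plan a block-level tree reduction for a given wavefront width.

    Model: log2(wave_width) shuffle steps collapse each wavefront,
    then shared memory holds one partial sum per wavefront.
    """
    if block_size % wave_width != 0:
        raise ValueError("block size should be a multiple of the wavefront width")
    waves = block_size // wave_width
    return {
        "waves_per_block": waves,
        "shuffle_steps": int(math.log2(wave_width)),
        "shared_mem_slots": waves,  # one partial per wavefront
    }
```

A 256-thread block is 8 warps on NVIDIA but only 4 wavefronts on CDNA: fewer shared-memory partials, one more shuffle step, and different occupancy math. This is why block-size constants tuned on one vendor deserve re-derivation on the other.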
MI300X: The Datacenter Part
The AMD Instinct MI300X is the primary AMD competitor to the H100 in 2025–2026. Key specs:
- 192 GB of HBM3 unified memory (vs. H100's 80 GB)
- ~5.3 TB/s memory bandwidth (vs. H100's 3.35 TB/s)
- 304 compute units, ~82 TFLOPS vector FP64 (~163 TFLOPS matrix FP64)
- 1,216 matrix cores, ~1.3 PFLOPS FP16 and ~2.6 PFLOPS FP8 (dense)
The MI300X's headline advantage is memory capacity. For inference workloads where model weight size is the binding constraint — large language models, large multimodal models — fitting more of the model on a single device with no PCIe data movement is genuinely valuable. A single MI300X can hold a 70B parameter model in FP16 with room for KV cache, where an H100 requires multiple devices or quantization.
The memory bandwidth advantage (5.3 TB/s vs 3.35 TB/s) is also relevant for memory-bound operations. Inference is typically memory-bound for batch sizes where the arithmetic intensity is low; higher bandwidth directly translates to higher token throughput.
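Both claims — capacity and bandwidth — can be sanity-checked with back-of-envelope arithmetic. The model below assumes FP16 weights (2 bytes per parameter) and the standard memory-bound decode model, where every weight is streamed from HBM once per generated token; both are simplifications:

```python
def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Model-weight footprint in GB (FP16 = 2 bytes/param by default)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

def decode_tokens_per_sec(weights_gb_value: float, hbm_gbs: float) -> float:
    """Upper bound on single-stream decode throughput when every weight
    must be read from HBM once per token (memory-bound regime)."""
    return hbm_gbs / weights_gb_value
```

A 70B model is 140 GB in FP16: over an 80 GB H100's capacity, comfortably inside an MI300X's 192 GB with ~50 GB left for KV cache. The same arithmetic bounds single-stream decode at roughly bandwidth divided by model size — which is exactly why the 5.3 vs 3.35 TB/s gap shows up directly in token throughput.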
The ROCm Stack
ROCm is not just HIP. It is a full software stack:
Your Code
↓
HIP / OpenCL / OpenMP offload
↓
ROCr (ROCm Runtime)
↓
HSA (Heterogeneous System Architecture)
↓
AMDKFD (kernel driver) / AMDGPU (DRM driver)
↓
AMD GPU Hardware
The key components:
rocBLAS: AMD's BLAS implementation. Generally competitive with cuBLAS for GEMM on MI-series hardware. Some operations have parity, some do not; benchmark for your specific operation.
MIOpen: AMD's equivalent to cuDNN. Covers convolutions, attention, pooling, batch normalization. Performance varies by operation and hardware generation. The MI300X saw significant MIOpen improvements in ROCm 6.x.
rocFFT: FFT operations. Comparable to cuFFT in most scenarios.
RCCL: ROCm Collective Communications Library — the AMD equivalent of NCCL. Implements AllReduce, Broadcast, etc. over AMD's XGMI interconnect. On systems with multiple MI300X connected via Infinity Fabric, this enables multi-GPU training.
rocSPARSE: Sparse linear algebra.
Composable Kernel (CK): AMD's equivalent to CUTLASS — template-based high-performance kernels for GEMM and related operations, designed to express operations that exploit AMD-specific hardware (matrix cores, wavefront operations).
Installation and Environment
ROCm's installation story has improved substantially from its early years but remains more involved than CUDA.
# Ubuntu 22.04 / 24.04
sudo apt-get update
sudo apt-get install -y rocm   # assumes AMD's ROCm apt repository is already configured
# Verify installation
rocminfo
rocm-smi
# Check GPU visibility
/opt/rocm/bin/rocminfo | grep -A5 "Agent"
Environment variables that matter:
export ROCM_PATH=/opt/rocm
export HIP_PLATFORM=amd # or nvidia when testing HIP on CUDA
export GPU_TARGETS=gfx942 # MI300X target architecture
export HIP_VISIBLE_DEVICES=0,1 # GPU selection (analogous to CUDA_VISIBLE_DEVICES)
Docker images are the practical approach for deployments:
docker pull rocm/rocm-terminal:6.3-ubuntu22.04
docker run -it --device /dev/kfd --device /dev/dri \
--group-add video --group-add render \
rocm/rocm-terminal:6.3-ubuntu22.04
The --device /dev/kfd and --device /dev/dri flags are AMD-specific — different from NVIDIA's --gpus all. This trips up people porting Docker-based workflows from CUDA environments.
Profiling with ROCm
rocprof
rocprof is the primary profiling tool. It collects hardware performance counters and generates trace files.
# Basic counter collection
rocprof --stats ./my_program
# Specific counters
rocprof --pmc SQ_INSTS_VALU SQ_INSTS_VMEM TCC_HIT TCC_MISS ./my_program
# Output CSV with per-dispatch data
rocprof -o results.csv ./my_program
The counter names on AMD hardware are different from NVIDIA's. The concepts map but the naming does not. SQ_INSTS_VALU counts vector ALU instructions; SQ_INSTS_VMEM counts vector memory instructions. The ratio between them is the hardware-level arithmetic intensity indicator.
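Turning those counters into a number requires assumptions about what each instruction moves. A crude model — hedged heavily — assumes each VALU instruction performs one FP op per lane and each VMEM instruction moves one 4-byte word per lane; real kernels violate both (packed math, wider loads), so treat the result as a relative indicator between kernels, not an absolute FLOP/byte figure:

```python
def hw_arithmetic_intensity(valu_insts: int, vmem_insts: int,
                            lanes: int = 64,
                            flops_per_lane: int = 1,
                            bytes_per_lane: int = 4) -> float:
    """Rough FLOP/byte estimate from rocprof's SQ_INSTS_VALU and
    SQ_INSTS_VMEM counters, under the stated per-lane assumptions."""
    flops = valu_insts * lanes * flops_per_lane
    bytes_moved = vmem_insts * lanes * bytes_per_lane
    return flops / bytes_moved
```

A kernel whose estimate doubles after a tiling change really did raise its data reuse, even if the absolute value is off by a constant factor.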
Omniperf
Omniperf is AMD's newer profiling framework, designed to provide the kind of detailed per-kernel analysis that Nsight Compute provides on NVIDIA. It runs rocprof under the hood but adds normalization, roofline analysis, and a web-based visualization:
# Collect profile
omniperf profile -n my_workload -- ./my_program
# Analyze
omniperf analyze -p workloads/my_workload/MI300X_A1/
Omniperf's roofline view is particularly useful for the same reason as on NVIDIA: it immediately tells you whether a kernel is compute-bound or memory-bound, and how far from the hardware ceiling it is operating.
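The roofline model behind that view is one line of arithmetic: attainable throughput is the minimum of the compute ceiling and arithmetic intensity times the bandwidth ceiling. A minimal sketch (the peak numbers passed in the test are illustrative, H100-class values):

```python
def roofline_bound(ai_flops_per_byte: float,
                   peak_tflops: float,
                   peak_bw_tbs: float) -> tuple:
    """Attainable TFLOPS under the roofline model:
    min(compute ceiling, AI x bandwidth ceiling)."""
    bw_limited = ai_flops_per_byte * peak_bw_tbs  # TB/s x FLOP/byte = TFLOPS
    attainable = min(peak_tflops, bw_limited)
    regime = "memory-bound" if bw_limited < peak_tflops else "compute-bound"
    return attainable, regime
```

The crossover point (peak_tflops / peak_bw_tbs, the "ridge point") tells you how much data reuse a kernel needs before more FLOPS would help at all.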
Where ROCm Wins and Where It Doesn't
ROCm wins on:
Memory capacity (MI300X): No single NVIDIA GPU in the H100/H200 class matches 192 GB of on-package HBM. For inference workloads where model size is the constraint, this is a significant operational advantage.
Memory bandwidth (MI300X): 5.3 TB/s is genuinely faster than 3.35 TB/s. For memory-bound operations, this directly translates to throughput.
Price-to-performance: AMD hardware is generally available at better pricing than equivalent NVIDIA hardware, particularly in the cloud market where providers like Google, Microsoft, and AWS have deployed MI300X at scale.
Openness: ROCm is Apache-licensed. You can read the implementation, file bugs, contribute patches. This matters for research groups and organizations where the ability to modify the stack is important.
ROCm lags on:
Ecosystem depth: CUDA has nearly two decades of libraries, tutorials, Stack Overflow answers, and code examples. ROCm has fewer high-quality third-party libraries, less community knowledge, and more "works in theory, needs a patch in practice" scenarios.
PyTorch/JAX integration: Both frameworks have first-class CUDA support. ROCm support is second-class — functional, but frequently trailing by a release or two, and occasionally requiring workarounds. The gap is closing in 2025–2026 but has not closed.
Triton: OpenAI's Triton (the Python-to-GPU-kernel compiler) added ROCm support in 2023, but kernel generation quality on AMD hardware still lags the NVIDIA backend for many operations.
Small workloads and tooling polish: For production inference serving, debugging tools (roc-gdb, memory sanitizers), and integration with observability stacks, NVIDIA's tooling is more mature.
The Portability Question
The strongest case for using HIP/ROCm is hardware portability. If your code runs on both AMD and NVIDIA hardware, you have negotiating leverage with cloud providers, can take advantage of spot pricing across hardware types, and are not locked into a single vendor's GPU supply chain.
For new code, HIP costs roughly the same effort as CUDA — the APIs are nearly identical. For existing CUDA code, the hipify tools handle 80–90% of the mechanical translation; the remaining 10–20% is library porting and architecture-specific tuning.
Whether that portability is worth the ongoing cost of maintaining compatibility with two hardware targets depends on your organization's operational situation and procurement strategy. There is no universal answer. There is only the honest accounting of what portability requires versus what it provides.
Metal: Apple Silicon in the HPC Arena
Apple Silicon was not supposed to matter for HPC. It was a laptop chip. A phone chip. Something you used for consumer workloads — video editing, maybe some light ML inference. The HPC community was politely uninterested.
Then people started running benchmarks.
The M-series chips, particularly the M2 Ultra and M3 Max/Ultra, deliver memory bandwidth that competes with discrete GPUs, FP16 throughput in the same neighborhood as mid-range datacenter parts, and a unified memory architecture that eliminates the PCIe bottleneck that makes CPU-GPU data movement painful. They do this in a laptop, drawing under 100W.
The relevance for HPC is specific. Apple Silicon is not going to replace H100 clusters. It is, however, a credible workstation compute platform for development, for workloads with large working sets that benefit from unified memory, and for inference where the power envelope matters. Understanding Metal — Apple's GPU programming API — is increasingly a useful skill.
The Unified Memory Architecture
The most distinctive feature of Apple Silicon is unified memory: the CPU cores, GPU cores, and Neural Engine all share the same physical DRAM with direct cache-coherent access. There is no discrete GPU with its own VRAM pool. There is no PCIe bus between CPU and GPU.
This changes the analysis of data movement cost fundamentally.
On a conventional discrete GPU setup:
- CPU allocates data in DRAM
- Transfer to GPU VRAM over PCIe (~64 GB/s peak, often much less)
- GPU processes
- Transfer results back over PCIe
- CPU reads results from DRAM
On Apple Silicon:
- CPU allocates data in shared DRAM
- GPU accesses the same allocation directly
- No transfer cost
For workloads where the data movement is the bottleneck — which is many workloads — this eliminates a fundamental constraint. The cost you pay is bandwidth: 400 GB/s (M3 Max) or 800 GB/s (M2 Ultra) instead of 3+ TB/s on a discrete datacenter GPU. For workloads that are memory-bandwidth-bound, unified memory does not help and may hurt if the bandwidth is the constraint.
The practical implication: workloads with high data-reuse are well-served by Apple Silicon. Workloads that need raw throughput on very large matrices with high arithmetic intensity are not.
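That tradeoff can be expressed as a two-term time model. The numbers below are illustrative assumptions (PCIe 5.0 x16 at ~64 GB/s, an H100-class 60 FP32 TFLOPS discrete part, an M3 Max-class 14 TFLOPS unified part), not measurements:

```python
def discrete_gpu_time(bytes_moved: float, flops: float,
                      pcie_gbs: float = 64, gpu_tflops: float = 60) -> float:
    """Discrete GPU cost model: transfer data in and out over PCIe,
    then compute at the discrete part's throughput."""
    transfer = 2 * bytes_moved / (pcie_gbs * 1e9)  # to device and back
    compute = flops / (gpu_tflops * 1e12)
    return transfer + compute

def unified_time(flops: float, gpu_tflops: float = 14) -> float:
    """Unified-memory cost model: no transfer term, slower compute."""
    return flops / (gpu_tflops * 1e12)
```

For low-arithmetic-intensity work (1 GB moved, 1 GFLOP of compute), the PCIe terms dominate and the slower unified part wins; pile on enough compute per byte and the discrete part's raw throughput takes over. That crossover is the quantitative version of "high data-reuse workloads are well-served."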
Metal Shading Language
Metal compute shaders are written in Metal Shading Language (MSL), a C++14-derived language with GPU-specific extensions. If you have written HLSL or GLSL compute shaders, MSL will feel familiar. If your background is CUDA, it is syntactically different but conceptually similar.
// A simple vector addition kernel in MSL
#include <metal_stdlib>
using namespace metal;
kernel void vector_add(
device const float* a [[buffer(0)]],
device const float* b [[buffer(1)]],
device float* c [[buffer(2)]],
constant uint& n [[buffer(3)]],
uint index [[thread_position_in_grid]])
{
if (index < n) {
c[index] = a[index] + b[index];
}
}
The [[attribute]] syntax is how MSL decorates parameters with their binding points and semantics:
- `device`: Main GPU-accessible memory (shared with CPU in unified memory)
- `constant`: Read-only memory, typically small (uniforms, parameters)
- `threadgroup`: Shared memory within a threadgroup (analogous to CUDA shared memory)
- `[[buffer(N)]]`: Binding slot for buffer arguments
- `[[thread_position_in_grid]]`: Built-in — the thread's global index
- `[[threadgroup_position_in_grid]]`: Block index (analogous to `blockIdx`)
- `[[thread_position_in_threadgroup]]`: Thread index within block (analogous to `threadIdx`)
Threadgroup Memory (Shared Memory)
kernel void tiled_matmul(
device const float* A [[buffer(0)]],
device const float* B [[buffer(1)]],
device float* C [[buffer(2)]],
constant uint3& dims [[buffer(3)]],
threadgroup float* As [[threadgroup(0)]],
threadgroup float* Bs [[threadgroup(1)]],
uint2 thread_pos [[thread_position_in_grid]],
uint2 thread_in_group [[thread_position_in_threadgroup]],
uint2 group_pos [[threadgroup_position_in_grid]])
{
const uint TILE = 16;
uint M = dims.x, N = dims.y, K = dims.z;
uint row = thread_pos.y;
uint col = thread_pos.x;
float acc = 0.0;
for (uint t = 0; t < (K + TILE - 1) / TILE; ++t) {
// Load tile cooperatively
uint a_col = t * TILE + thread_in_group.x;
uint b_row = t * TILE + thread_in_group.y;
As[thread_in_group.y * TILE + thread_in_group.x] =
(row < M && a_col < K) ? A[row * K + a_col] : 0.0;
Bs[thread_in_group.y * TILE + thread_in_group.x] =
(b_row < K && col < N) ? B[b_row * N + col] : 0.0;
threadgroup_barrier(mem_flags::mem_threadgroup);
for (uint k = 0; k < TILE; ++k)
acc += As[thread_in_group.y * TILE + k] * Bs[k * TILE + thread_in_group.x];
threadgroup_barrier(mem_flags::mem_threadgroup);
}
if (row < M && col < N)
C[row * N + col] = acc;
}
threadgroup_barrier(mem_flags::mem_threadgroup) is the MSL equivalent of __syncthreads(). The mem_flags argument specifies which memory domains to synchronize.
The Metal API (Host Side)
The host-side Metal API is Objective-C or Swift. This is where the CUDA parallel breaks down for C/C++ shops. There is no C API for Metal; interfacing from C++ requires either Objective-C++ (.mm files) or wrapper libraries.
// Objective-C++ Metal setup
#import <Metal/Metal.h>
#import <Foundation/Foundation.h>
id<MTLDevice> device = MTLCreateSystemDefaultDevice();
id<MTLCommandQueue> queue = [device newCommandQueue];
// Load shader library
NSError* error = nil;
id<MTLLibrary> library = [device newDefaultLibrary];
id<MTLFunction> function = [library newFunctionWithName:@"vector_add"];
id<MTLComputePipelineState> pipeline =
[device newComputePipelineStateWithFunction:function error:&error];
// Allocate shared buffers (accessible by both CPU and GPU)
id<MTLBuffer> buf_a = [device newBufferWithLength:size
options:MTLResourceStorageModeShared];
id<MTLBuffer> buf_b = [device newBufferWithLength:size
options:MTLResourceStorageModeShared];
id<MTLBuffer> buf_c = [device newBufferWithLength:size
options:MTLResourceStorageModeShared];
// Write data into buf_a and buf_b via their contents pointer
float* data_a = (float*)[buf_a contents];
// ... fill data_a ...
// Encode and submit
id<MTLCommandBuffer> cmd = [queue commandBuffer];
id<MTLComputeCommandEncoder> encoder = [cmd computeCommandEncoder];
[encoder setComputePipelineState:pipeline];
[encoder setBuffer:buf_a offset:0 atIndex:0];
[encoder setBuffer:buf_b offset:0 atIndex:1];
[encoder setBuffer:buf_c offset:0 atIndex:2];
uint n_val = n;
[encoder setBytes:&n_val length:sizeof(uint) atIndex:3];
MTLSize threadgroup_size = {256, 1, 1};
MTLSize grid_size = {(n + 255) / 256 * 256, 1, 1};
[encoder dispatchThreads:grid_size threadsPerThreadgroup:threadgroup_size];
[encoder endEncoding];
[cmd commit];
[cmd waitUntilCompleted];
For C++ codebases, the practical options are:
- Metal-cpp: Apple's official C++ wrapper for the Metal API. Header-only, fairly mechanical translation from Objective-C. Use this for new C++ code.
- Objective-C++: Mix `.mm` files into your build. Messy but functional for wrapping Metal in a C++ interface.
- MLX: Apple's ML framework uses Metal internally and exposes a NumPy-like C++/Python API. Good for ML workloads; not a general GPU compute library.
Storage Modes and Memory Management
Metal's MTLResourceStorageMode controls where memory lives and how it is accessed:
| Mode | Description |
|---|---|
| MTLResourceStorageModeShared | Single allocation, accessible by CPU and GPU |
| MTLResourceStorageModePrivate | GPU-only; fastest GPU access, no CPU access |
| MTLResourceStorageModeManaged | Mirrored CPU/GPU copies; explicit synchronization required (macOS only) |
| MTLResourceStorageModeMemoryless | On-chip only; exists only during a render/compute pass |
On Apple Silicon (unified memory), Shared mode does not involve a copy — the CPU and GPU literally access the same memory. The performance difference between Shared and Private is typically small for compute workloads; Private mode can be faster for GPU-only buffers that the CPU never reads.
On Macs with discrete AMD GPUs (older Intel Macs), there is no unified memory: the GPU cannot cheaply access system RAM, so MTLResourceStorageModeManaged exists to keep mirrored CPU and GPU copies that you synchronize explicitly. This mode is effectively obsolete for Apple Silicon development.
Matrix Multiplication: MPS
Metal Performance Shaders (MPS) is Apple's library of optimized GPU compute operations. For matrix multiplication, MPSMatrixMultiplication is the relevant class:
MPSMatrixMultiplication* gemm = [[MPSMatrixMultiplication alloc]
initWithDevice:device
resultRows:M
resultColumns:N
interiorColumns:K];
// MPSMatrix descriptors
MPSMatrixDescriptor* descA = [MPSMatrixDescriptor matrixDescriptorWithRows:M
columns:K rowBytes:K*sizeof(float) dataType:MPSDataTypeFloat32];
MPSMatrixDescriptor* descB = [MPSMatrixDescriptor matrixDescriptorWithRows:K
columns:N rowBytes:N*sizeof(float) dataType:MPSDataTypeFloat32];
MPSMatrixDescriptor* descC = [MPSMatrixDescriptor matrixDescriptorWithRows:M
columns:N rowBytes:N*sizeof(float) dataType:MPSDataTypeFloat32];
MPSMatrix* matA = [[MPSMatrix alloc] initWithBuffer:buf_a descriptor:descA];
MPSMatrix* matB = [[MPSMatrix alloc] initWithBuffer:buf_b descriptor:descB];
MPSMatrix* matC = [[MPSMatrix alloc] initWithBuffer:buf_c descriptor:descC];
[gemm encodeToCommandBuffer:cmd primaryMatrix:matA
secondaryMatrix:matB resultMatrix:matC];
MPS handles float32, float16, and (on recent hardware) bfloat16. For ML workloads, this is generally the right level of abstraction — hand-written Metal kernels rarely beat Apple's tuned implementations.
Performance Characteristics
Concrete numbers for Apple M-series GPUs (as of late 2025):
| Chip | GPU Cores | FP32 TFLOPS | FP16 TFLOPS | Memory BW |
|---|---|---|---|---|
| M3 | 10 | ~3.6 | ~7.2 | 100 GB/s |
| M3 Max | 40 | ~14.2 | ~28.4 | 400 GB/s |
| M2 Ultra | 76 | ~27.2 | ~54.4 | 800 GB/s |
For comparison: an H100 SXM5 delivers ~67 TFLOPS FP64, ~989 TFLOPS FP16, at ~3.35 TB/s. The H100 wins on raw throughput by roughly 20–30×. The M2 Ultra wins on memory bandwidth per watt by a substantial margin and eliminates PCIe overhead.
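The per-watt claim is easy to check with approximate board powers — the 700 W (H100 SXM) and 100 W (M2 Ultra SoC, GPU-heavy workload) figures below are assumptions for illustration, not datasheet guarantees:

```python
def bw_per_watt(bandwidth_gbs: float, watts: float) -> float:
    """Memory bandwidth per watt, in GB/s per W."""
    return bandwidth_gbs / watts

# Approximate: H100 SXM ~3350 GB/s at ~700 W; M2 Ultra ~800 GB/s at ~100 W
h100_ratio = bw_per_watt(3350, 700)  # roughly 4.8 GB/s per watt
m2u_ratio = bw_per_watt(800, 100)    # roughly 8.0 GB/s per watt
```

Under these assumptions the M2 Ultra moves roughly 60–70% more bytes per joule, which is the whole argument for Apple Silicon in power-constrained inference.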
Where Apple Silicon is competitive:
- Inference with large models that fit in 192 GB of unified memory
- Workloads with irregular memory access patterns that benefit from cache coherence
- Development and prototyping where the developer's laptop is the compute platform
- Power-constrained deployments (edge, on-device)
Where it is not competitive:
- Training at scale (no NVLink equivalent, limited tensor core equivalent)
- Large batch dense linear algebra where raw TFLOPS dominate
- Workloads that need multi-node MPI distribution
The Developer Experience
Metal's development story is tight with Xcode. The Metal Debugger and GPU Frame Capture tools in Xcode are genuinely excellent for inspecting shader execution, visualizing buffer contents, and identifying performance bottlenecks. For GPU debugging, it is arguably the best DX of any platform.
The profiling story is also good: Instruments with the GPU counter template provides timeline views, occupancy data, and per-shader metrics. On Apple Silicon, you can profile compute kernels the same way you profile render work.
The limitation is the Apple-only toolchain. There is no Metal on Linux or Windows. Shaders written in MSL do not run anywhere else. If you need portability across GPU vendors, look at WebGPU (which has a Metal backend and runs in browsers), Vulkan compute (which has broad hardware support but not Apple), or HIP (NVIDIA/AMD only).
MLX: The Modern Python Interface
For ML-adjacent HPC work on Apple Silicon, MLX (Apple's open-source array framework) provides the most convenient interface:
import mlx.core as mx
a = mx.array([1.0, 2.0, 3.0])
b = mx.array([4.0, 5.0, 6.0])
c = a + b # Runs on GPU via Metal, lazy evaluation
mx.eval(c) # Triggers execution
# Matrix multiplication
A = mx.random.normal((1024, 512))
B = mx.random.normal((512, 1024))
C = A @ B # Uses MPS internally
mx.eval(C)
MLX uses lazy evaluation: operations build a computation graph, and mx.eval() triggers execution on the GPU. It handles memory management, precision, and device scheduling. For numerical computing that happens to need GPU acceleration on a Mac, this is the right starting point — not hand-written Metal shaders.
Verdict
Metal is worth knowing if you do serious GPU compute development on Apple hardware, ship products that target Apple Silicon, or are evaluating Apple devices for inference workloads. The programming model is coherent, the tools are polished, and the unified memory architecture creates genuine advantages for certain workload shapes.
It is not a path to CUDA-ecosystem compatibility or multi-vendor portability. It is a platform for Apple hardware specifically, and it is a good one for that narrower scope.
GPU Framework Shootout
The three previous chapters made the case for each platform on its own terms. This chapter makes them compete. The goal is a practical decision framework — not a winner-declaration, because the winner depends on your constraints — but a structured analysis of what you gain and what you give up with each choice.
The Decision Matrix
Before benchmarks, acknowledge the meta-question: are you choosing a programming model or a hardware platform? They are coupled but not identical.
- CUDA = NVIDIA hardware + CUDA programming model + full ecosystem
- HIP/ROCm = NVIDIA or AMD hardware + HIP programming model + ROCm ecosystem (AMD) or CUDA ecosystem (NVIDIA)
- Metal = Apple Silicon hardware + Metal/MSL programming model + Apple ecosystem
You can write HIP code that runs on NVIDIA hardware via the CUDA backend. You cannot run Metal on non-Apple hardware. You can write CUDA code that runs on AMD via porting tools, but it is a one-time port, not ongoing portability.
Side-by-Side: Language and Syntax
A simple reduction kernel in all three languages reveals where the models align and where they diverge.
CUDA
#include <cub/cub.cuh>
// Using CUB for a proper reduction
__global__ void reduce_sum(const float* input, float* output, int n) {
using BlockReduce = cub::BlockReduce<float, 256>;
__shared__ typename BlockReduce::TempStorage temp;
int idx = blockIdx.x * blockDim.x + threadIdx.x;
float val = (idx < n) ? input[idx] : 0.0f;
float block_sum = BlockReduce(temp).Sum(val);
if (threadIdx.x == 0)
atomicAdd(output, block_sum);
}
// Launch:
int threads = 256;
int blocks = (n + threads - 1) / threads;
float* d_output;
cudaMalloc(&d_output, sizeof(float));
cudaMemset(d_output, 0, sizeof(float));
reduce_sum<<<blocks, threads>>>(d_input, d_output, n);
cudaDeviceSynchronize();
HIP (ROCm)
#include <hip/hip_runtime.h>
#include <hipcub/hipcub.hpp>
__global__ void reduce_sum(const float* input, float* output, int n) {
using BlockReduce = hipcub::BlockReduce<float, 256>;
__shared__ typename BlockReduce::TempStorage temp;
int idx = blockIdx.x * blockDim.x + threadIdx.x;
float val = (idx < n) ? input[idx] : 0.0f;
float block_sum = BlockReduce(temp).Sum(val);
if (threadIdx.x == 0)
atomicAdd(output, block_sum);
}
// Launch — identical to CUDA
int threads = 256;
int blocks = (n + threads - 1) / threads;
float* d_output;
hipMalloc(&d_output, sizeof(float));
hipMemset(d_output, 0, sizeof(float));
hipLaunchKernelGGL(reduce_sum, dim3(blocks), dim3(threads), 0, 0,
d_input, d_output, n);
hipDeviceSynchronize();
The mechanical similarity is intentional. hipcub is a port of CUB to HIP; the interface is the same. This is the best-case scenario for HIP portability: the code is nearly identical.
Metal (MSL)
// MSL kernel — reduction using threadgroup memory
#include <metal_stdlib>
using namespace metal;
kernel void reduce_sum(
device const float* input [[buffer(0)]],
device atomic_float* output [[buffer(1)]],
constant uint& n [[buffer(2)]],
threadgroup float* shared [[threadgroup(0)]],
uint local_idx [[thread_position_in_threadgroup]],
uint global_idx [[thread_position_in_grid]],
uint group_size [[threads_per_threadgroup]])
{
shared[local_idx] = (global_idx < n) ? input[global_idx] : 0.0f;
threadgroup_barrier(mem_flags::mem_threadgroup);
for (uint stride = group_size / 2; stride > 0; stride >>= 1) {
if (local_idx < stride)
shared[local_idx] += shared[local_idx + stride];
threadgroup_barrier(mem_flags::mem_threadgroup);
}
if (local_idx == 0)
atomic_fetch_add_explicit(output, shared[0], memory_order_relaxed);
}
The Metal version is more verbose at the kernel signature level (every parameter requires attribute decoration) and uses different synchronization primitives. The reduction algorithm is the same; the idioms differ.
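The shared algorithm is easiest to see stripped of all three platforms' syntax. This Python model mirrors the MSL kernel's loop exactly: halve the active stride each step, summing pairs in "shared memory," until one value remains:

```python
def tree_reduce(values, group_size: int):
    """Model of the shared-memory tree reduction used in all three kernels.

    Each iteration corresponds to one barrier-separated step: threads
    below `stride` add the element `stride` positions away.
    """
    assert group_size & (group_size - 1) == 0, "power-of-two group assumed"
    shared = list(values[:group_size])  # stand-in for threadgroup memory
    stride = group_size // 2
    while stride > 0:
        for i in range(stride):          # the "if (local_idx < stride)" threads
            shared[i] += shared[i + stride]
        stride //= 2                     # barrier between steps on real hardware
    return shared[0]
```

The loop runs log2(group_size) times — 8 steps for a 256-thread group — which is why the barrier cost, not the additions, dominates small reductions.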
Performance Comparison
Benchmarking GPU kernels is an exercise in controlled disappointment: results vary significantly by hardware generation, problem size, precision, memory access pattern, and which version of which library you compiled against. The numbers below are indicative, not contractual.
GEMM (Matrix Multiplication)
Dense GEMM at FP32, 4096×4096 matrices, measured in TFLOPS effective throughput:
| Platform | Hardware | Library | Effective TFLOPS |
|---|---|---|---|
| CUDA | H100 SXM5 | cuBLAS 12 | ~60 TFLOPS |
| HIP | MI300X | rocBLAS 6.x | ~55 TFLOPS |
| HIP | H100 (CUDA backend) | cuBLAS | ~59 TFLOPS |
| Metal | M2 Ultra | MPS | ~12 TFLOPS |
| Metal | M3 Max | MPS | ~8 TFLOPS |
The H100/MI300X gap in GEMM has narrowed considerably in ROCm 6.x. MI300X's memory bandwidth advantage does not help large GEMMs much (they are compute-bound at this size), but its 192 GB capacity allows working at much larger problem sizes without spilling to host memory.
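"Effective TFLOPS" in the table above is the standard GEMM accounting: 2·M·N·K floating-point operations (one multiply and one add per inner-product term) divided by wall time. A small helper makes the arithmetic explicit:

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Effective TFLOPS for C = A @ B: 2*M*N*K FLOPs over wall time."""
    return 2 * m * n * k / seconds / 1e12

def efficiency(effective_tflops: float, peak_tflops: float) -> float:
    """Fraction of the hardware's peak throughput actually achieved."""
    return effective_tflops / peak_tflops
```

A 4096-cubed GEMM is ~137 GFLOPs of work, so ~60 effective TFLOPS implies roughly 2.3 ms per multiply — a quick cross-check worth doing whenever a benchmark number looks surprising.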
Memory Bandwidth (Stream Benchmark)
Sustained memory bandwidth, measured in GB/s:
| Platform | Hardware | Achieved BW | % of Peak |
|---|---|---|---|
| CUDA | H100 SXM5 | ~3.1 TB/s | ~92% |
| HIP | MI300X | ~4.9 TB/s | ~92% |
| Metal | M2 Ultra | ~720 GB/s | ~90% |
| Metal | M3 Max | ~360 GB/s | ~90% |
MI300X's bandwidth advantage is real and consistent. For memory-bound workloads (element-wise ops, reductions, gather/scatter), MI300X outperforms H100 by roughly 55–60% — in line with the 4.9 vs 3.1 TB/s achieved figures above.
Inference Throughput (LLM, Large Batch)
For transformer inference at FP16, a representative large language model (70B parameters):
| Setup | Tokens/sec (batch=32) |
|---|---|
| 1× H100 SXM5 (requires quantization or offload) | ~3,200 |
| 1× MI300X (full FP16, fits in 192 GB) | ~4,100 |
| 2× H100 NVLink (full FP16) | ~5,800 |
| M2 Ultra (full FP16, fits in 192 GB) | ~320 |
The MI300X's single-card advantage for large model inference is significant. Not needing to split across GPUs eliminates the NVLink/XGMI communication overhead and simplifies deployment. The M2 Ultra — despite also fitting the model in unified memory — is outpaced on raw throughput; it is relevant for inference where latency and power matter more than throughput.
Ecosystem Comparison
| Capability | CUDA | ROCm/HIP | Metal |
|---|---|---|---|
| PyTorch | First-class | Supported, slight lag | Via MPS backend (limited) |
| JAX | First-class | Supported | Experimental |
| TensorFlow | First-class | Supported | Via Metal plugin |
| Triton kernels | First-class | ROCm backend (growing) | No |
| BLAS library | cuBLAS | rocBLAS | MPS |
| Profiler quality | Excellent (ncu, nsys) | Good (Omniperf) | Excellent (Instruments) |
| Community knowledge | Vast | Growing | Limited (HPC) |
| Debugger | cuda-gdb, compute-sanitizer | rocgdb | Xcode Metal Debugger |
| Docker support | --gpus all | --device /dev/kfd /dev/dri | macOS only |
| CI/CD integration | Mature | Workable | macOS runners only |
The Portability Spectrum
GPU portability is a spectrum, not a binary:
Write-once, run-anywhere: Does not exist at the kernel level in 2026. WebGPU is the closest thing (Metal/Vulkan/D3D12 backends, runs in browsers), and it has restrictions that make it unsuitable for serious HPC.
Portable high-level code: Possible with frameworks that abstract hardware (PyTorch, JAX, MLX). You write in Python/Python-adjacent, the framework dispatches to the right backend. This works well for ML workloads and reasonably well for array-style numerical computing. It does not work if you need to write custom kernels.
Portable custom kernels via HIP: HIP code compiles for both NVIDIA and AMD. This is the practical portability option for kernel authors. It requires maintaining one codebase but accepting that architecture-specific optimizations (wavefront-64 vs warp-32) may need conditional compilation.
Platform-specific kernels with shared logic: The most common real-world pattern for performance-critical code. Write CUDA and HIP separately, share the algorithmic logic in header files or via an abstraction layer. More code, better per-platform performance.
Decision Guide
Choose CUDA if:
- You need maximum ecosystem compatibility (third-party libraries, tutorials, hiring)
- You are running on NVIDIA hardware and do not need AMD compatibility
- You are working on ML training at scale (NCCL, cuDNN, cuBLAS are unmatched)
- Your team already knows CUDA and the portability cost is not justified
Choose HIP/ROCm if:
- You need code that runs on both NVIDIA and AMD hardware
- Your deployment hardware is or will be AMD (MI300X, future Instinct parts)
- The MI300X memory capacity or bandwidth is the decisive factor for your workload
- You have a principled preference for open-source GPU software stacks
- You are working with cloud providers who have made AMD competitive on price
Choose Metal if:
- Your deployment target is Apple Silicon (macOS application, Apple-specific service)
- You are developing and debugging on a MacBook and want GPU acceleration there
- The power-efficiency of Apple Silicon matters for your use case
- You are working with MLX and need to write custom kernels
None of the above if:
- You can express your workload in terms of existing library operations (cuBLAS, MPS, rocBLAS). The best GPU kernel is often the one you did not write.
On the Fragmentation Tax
Every organization that uses more than one GPU platform pays a fragmentation tax. It manifests as:
- Duplicate kernel implementations that must be maintained in sync
- CI pipelines that test on multiple hardware targets
- Engineers who know platform A but not platform B, creating knowledge silos
- Library version skew where feature parity is not guaranteed across platforms
This tax is real, measurable in engineering time, and tends to grow as the codebase grows. Frameworks like PyTorch partially absorb the tax by presenting a unified API over multiple backends. But the moment you drop below the framework layer — which HPC work regularly requires — the tax reappears.
The honest answer to "which GPU platform should we use?" is often: pick one and commit, unless you have a specific, quantified reason to support multiple. The grass on the other side of the PCIe bus is sometimes genuinely greener, but moving there still costs moving expenses.
MPI: Distributed Memory Parallelism
MPI (Message Passing Interface) is the load-bearing wall of the HPC world. It is how you make a single computation span multiple nodes in a cluster. It does this through a simple, brutal model: processes do not share memory; they communicate explicitly by sending and receiving messages.
This explicitness is the source of both MPI's power and its reputation for verbosity. When a program is slow or wrong, the communication pattern is visible in the code. There is no hidden coordination happening behind an abstraction. You can look at a section of MPI code and reason about exactly which processes are talking to which other processes and what data is moving. This property is extremely useful when you are debugging a hang on 2,048 nodes at 3 AM.
MPI is also 30 years old and designed by committee. These facts are not unrelated.
The Programming Model
MPI programs use the Single Program, Multiple Data (SPMD) model: every process runs the same program, but each process has a unique rank within a communicator. The default communicator MPI_COMM_WORLD contains all processes launched by the job scheduler. The rank identifies each process uniquely within that communicator.
#include <mpi.h>
#include <stdio.h>
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // This process's rank
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // Total number of processes
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
Compile and run:
mpicc -o hello hello.c
mpirun -n 4 ./hello
Output (order not guaranteed):
Hello from rank 0 of 4
Hello from rank 2 of 4
Hello from rank 1 of 4
Hello from rank 3 of 4
The critical insight: every process executes MPI_Comm_rank independently and gets a different answer. This is how SPMD programs differentiate behavior per process — conditionals on rank.
Point-to-Point Communication
The two fundamental operations are MPI_Send and MPI_Recv.
if (rank == 0) {
    // Process 0 sends data to process 1
    int data = 42;
    MPI_Send(&data,           // buffer
             1,               // count
             MPI_INT,         // datatype
             1,               // destination rank
             0,               // tag (user-defined message label)
             MPI_COMM_WORLD); // communicator
} else if (rank == 1) {
    // Process 1 receives from process 0
    int data;
    MPI_Status status;
    MPI_Recv(&data,           // buffer
             1,               // count
             MPI_INT,         // datatype
             0,               // source rank (MPI_ANY_SOURCE to accept from any)
             0,               // tag (MPI_ANY_TAG to accept any tag)
             MPI_COMM_WORLD,
             &status);
    printf("Received: %d\n", data);
}
Send Modes and Deadlocks
MPI_Send is a blocking send that may or may not buffer the data, depending on the message size and MPI implementation. This creates a classic deadlock scenario:
// DEADLOCK: both processes try to send before receiving
MPI_Send(&data, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
MPI_Recv(&recv, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &status);
When n is large enough that MPI_Send cannot buffer the data, it blocks waiting for the receiver to post a receive. But the receiver is also blocked in MPI_Send. Deadlock.
The standard fix: use MPI_Sendrecv for simultaneous send and receive, or use non-blocking operations.
// MPI_Sendrecv: atomic send+receive, no deadlock
MPI_Sendrecv(
&send_data, n, MPI_DOUBLE, right_neighbor, 0, // send
&recv_data, n, MPI_DOUBLE, left_neighbor, 0, // receive
MPI_COMM_WORLD, &status);
Non-Blocking Communication
Non-blocking operations return immediately and allow overlapping communication with computation:
MPI_Request send_req, recv_req;
// Post non-blocking receive first (good practice)
MPI_Irecv(&recv_buf, n, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, &recv_req);
// Post non-blocking send
MPI_Isend(&send_buf, n, MPI_DOUBLE, dst, 0, MPI_COMM_WORLD, &send_req);
// Do compute work while communication is in flight
do_local_computation();
// Wait for both to complete
MPI_Wait(&send_req, MPI_STATUS_IGNORE);
MPI_Wait(&recv_req, MPI_STATUS_IGNORE);
MPI_Waitall blocks until all requests in an array complete:
MPI_Request requests[2] = {send_req, recv_req};
MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);
The key rule: never read or write the buffer of a non-blocking operation until it has completed. The buffer is "owned" by MPI during the operation.
Collective Operations
Collectives involve all processes in a communicator simultaneously. They are the expressive, efficient way to do common communication patterns.
MPI_Bcast: One-to-All
int data;
if (rank == 0) data = 99; // Only rank 0 initializes
MPI_Bcast(&data, // buffer (source on root, destination on others)
1, // count
MPI_INT, // datatype
0, // root rank
MPI_COMM_WORLD);
// After Bcast, all ranks have data == 99
MPI_Reduce: All-to-One
double local_sum = compute_local_sum();
double global_sum;
MPI_Reduce(&local_sum, // send buffer
&global_sum, // receive buffer (meaningful only on root)
1, // count
MPI_DOUBLE,
MPI_SUM, // operation
0, // root rank
MPI_COMM_WORLD);
if (rank == 0)
printf("Global sum: %f\n", global_sum);
Operations include MPI_SUM, MPI_MAX, MPI_MIN, MPI_PROD, MPI_LAND (logical and), and user-defined operations via MPI_Op_create.
MPI_Allreduce: All-to-All Reduce
Like MPI_Reduce but every process gets the result. This is the operation underlying distributed gradient aggregation in ML training:
double local_grad = compute_local_gradient();
double global_grad;
MPI_Allreduce(&local_grad, &global_grad, 1, MPI_DOUBLE,
MPI_SUM, MPI_COMM_WORLD);
// All ranks now have the sum of gradients, can update weights
update_weights(global_grad / size);
MPI_Scatter and MPI_Gather
Scatter distributes data from a root to all processes; Gather collects from all processes to a root:
// Root has data[size * chunk_size], others receive chunk_size elements
double *all_data = NULL;
if (rank == 0) all_data = malloc(size * chunk_size * sizeof(double));
double *local_data = malloc(chunk_size * sizeof(double));
MPI_Scatter(all_data, chunk_size, MPI_DOUBLE, // send
local_data, chunk_size, MPI_DOUBLE, // receive
0, MPI_COMM_WORLD);
// Each process works on local_data
process(local_data, chunk_size);
// Collect results back
MPI_Gather(local_data, chunk_size, MPI_DOUBLE, // send
all_data, chunk_size, MPI_DOUBLE, // receive
0, MPI_COMM_WORLD);
MPI_Allgather is the variant where every process receives the full gathered array. MPI_Alltoall is a personalized exchange: every process sends a distinct block to every other process.
Derived Datatypes
Sending non-contiguous data without copying to a temporary buffer uses MPI derived datatypes.
// Send every other element of an array (stride 2)
MPI_Datatype strided_type;
MPI_Type_vector(
count, // number of blocks
1, // elements per block
2, // stride (in elements)
MPI_DOUBLE,
&strided_type);
MPI_Type_commit(&strided_type);
MPI_Send(data, 1, strided_type, dst, 0, MPI_COMM_WORLD);
MPI_Type_free(&strided_type);
For structured data (C structs), MPI_Type_create_struct creates a datatype matching an arbitrary struct layout. This is often more trouble than packing fields manually; in practice, many codebases avoid derived datatypes in favor of explicit packing routines, trading MPI complexity for code they can test and debug independently.
Communicators and Topologies
MPI_COMM_WORLD covers all processes, but real applications often need subsets.
// Split processes into groups based on rank parity
int color = rank % 2; // 0 for even, 1 for odd
MPI_Comm subcomm;
MPI_Comm_split(MPI_COMM_WORLD, color, rank, &subcomm);
// Each process is now in a sub-communicator containing all ranks of its parity
int sub_rank, sub_size;
MPI_Comm_rank(subcomm, &sub_rank);
MPI_Comm_size(subcomm, &sub_size);
// Collective on the sub-communicator
MPI_Barrier(subcomm);
MPI_Comm_free(&subcomm);
Cartesian topologies are useful for structured grid computations where each process has neighbors in 2D or 3D space:
// Create a 2D Cartesian grid of processes
int dims[2] = {0, 0}; // Let MPI choose dimensions
MPI_Dims_create(size, 2, dims);
int periods[2] = {1, 1}; // Periodic boundaries (torus)
MPI_Comm cart_comm;
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart_comm);
// Find neighbors
int north, south, east, west;
MPI_Cart_shift(cart_comm, 0, 1, &north, &south); // dimension 0
MPI_Cart_shift(cart_comm, 1, 1, &west, &east); // dimension 1
// Halo exchange with neighbors
MPI_Sendrecv(south_edge, n, MPI_DOUBLE, south, 0,
north_halo, n, MPI_DOUBLE, north, 0,
cart_comm, MPI_STATUS_IGNORE);
GPU-Aware MPI
Modern MPI implementations (OpenMPI 4+, MVAPICH2-GDR, Cray MPICH) support GPU-aware communication: passing device pointers directly to MPI routines without staging through host memory.
// GPU-aware MPI: d_send_buf is a device (GPU) pointer
cudaMalloc(&d_send_buf, n * sizeof(float));
cudaMalloc(&d_recv_buf, n * sizeof(float));
// MPI sees the device pointer and uses GPUDirect RDMA
// No cudaMemcpy to host required
MPI_Sendrecv(d_send_buf, n, MPI_FLOAT, dst, 0,
d_recv_buf, n, MPI_FLOAT, src, 0,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
This requires:
- NVIDIA GPUDirect RDMA (for NVIDIA hardware)
- An MPI implementation compiled with GPU support
- InfiniBand or other RDMA-capable network
- OMPI_MCA_osc=rdma or equivalent configuration
When it works, GPU-aware MPI eliminates the device→host→network→host→device copy chain. The bandwidth and latency improvement is significant — often 3–5× on large messages.
When it does not work (driver version mismatch, unsupported hardware, misconfiguration), you get silent corruption or crashes. The debugging experience for GPU-aware MPI failures is not pleasant.
# Verify GPU-aware MPI is available
ompi_info --parseable | grep -i "MPI extensions"
# Or check for CUDA support
ompi_info | grep -i "cuda"
MPI + OpenMP Hybrid Programming
Single-node parallelism with OpenMP combined with multi-node MPI is the standard hybrid model for CPU-heavy HPC codes.
// Launch with: mpirun -n 8 --bind-to socket ./hybrid
// Each MPI process uses all cores on its socket via OpenMP
#include <mpi.h>
#include <omp.h>
int main(int argc, char** argv) {
    // MPI_THREAD_FUNNELED: only master thread calls MPI
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    double local_result = 0.0;
    #pragma omp parallel reduction(+:local_result)
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        // Each thread handles a slice of local work
        local_result += compute_slice(rank, size, tid, nthreads);
    }
    double global_result;
    MPI_Reduce(&local_result, &global_result, 1, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Finalize();
}
Thread safety levels in MPI (MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, MPI_THREAD_SERIALIZED, MPI_THREAD_MULTIPLE) must match how your code uses MPI from threads. MPI_THREAD_FUNNELED is the safest hybrid approach: only the main thread calls MPI functions; OpenMP threads handle computation. MPI_THREAD_MULTIPLE (any thread can call MPI) has overhead and is rarely necessary.
Performance Considerations
Latency vs. bandwidth: Small messages are latency-dominated; large messages are bandwidth-dominated. On InfiniBand HDR (200 Gb/s), latency is ~1–2 μs for small messages; bandwidth peaks around 20 GB/s. Structure your communication to minimize the number of small messages, even if it means aggregating messages manually.
Collective algorithm selection: MPI implementations select different algorithms for collectives based on message size and process count. An AllReduce of 4 bytes on 1024 processes uses a different algorithm than AllReduce of 1 GB. Understanding which algorithm is selected (usually via OMPI_MCA_coll_base_verbose=10 or equivalent) matters when tuning.
Non-blocking collectives (MPI-3): MPI_Iallreduce, MPI_Ibcast, etc. enable overlapping computation and collective communication, similar to non-blocking point-to-point:
MPI_Request req;
MPI_Iallreduce(&local_grad, &global_grad, n, MPI_DOUBLE,
MPI_SUM, MPI_COMM_WORLD, &req);
// Overlap with other computation
next_batch_forward_pass();
MPI_Wait(&req, MPI_STATUS_IGNORE);
// Now global_grad is ready
Process placement: Binding MPI processes to specific CPUs (--bind-to core, --bind-to socket) prevents process migration and ensures NUMA-local memory access. On NUMA systems, a process that migrates across sockets pays a 2–3× memory latency penalty. Always specify process binding in production MPI jobs.
The Modern Reality
MPI is not glamorous. The syntax is verbose, the error messages are cryptic, and debugging a deadlock on 512 nodes is an experience that makes you reconsider your career choices. It is also the only mature, widely-deployed solution for distributed-memory parallelism at scale.
The ML community has largely reinvented MPI's AllReduce under names like "distributed data parallel" and "gradient synchronization," with libraries like NCCL and RCCL doing the actual communication. NCCL is excellent for GPU collectives on NVIDIA hardware. It is not a replacement for MPI when you need general-purpose distributed-memory programming, process management across heterogeneous nodes, or communication on CPU buffers.
For new code that needs multi-node GPU distribution and fits into the ML training pattern (data parallel, gradient averaging), NCCL + PyTorch DDP or JAX pmap is likely the right abstraction. For anything that does not fit that pattern — general scientific computing, irregular communication, CPU-GPU mixed pipelines — MPI remains the right tool.
OpenMP: Shared Memory Without the Ceremony
OpenMP is how most HPC code adds parallelism to a loop without rewriting it. A pragma, a compile flag, and suddenly code that ran on one core runs on all of them. This is either elegant engineering or dangerous magic, depending on whether you understood what the loop was doing before you added the pragma.
OpenMP dates from 1997 and has accreted features steadily since. The core — parallel loops and task parallelism — is simple and useful. The periphery — SIMD directives, GPU offloading (target), memory model clauses — is increasingly complex and occasionally surprising. This chapter covers the core and the most useful features without pretending the complexity does not exist.
The Execution Model
OpenMP uses a fork-join model. The program begins as a single master thread. When it encounters a parallel region, it forks a team of threads. At the end of the parallel region, threads join back to one. Multiple parallel regions can be nested; each creates its own team.
Single thread → PARALLEL REGION → n threads → join → single thread → ...
The number of threads is controlled by:
- The OMP_NUM_THREADS environment variable
- The num_threads(n) clause on a parallel directive
- omp_set_num_threads(n) called before the parallel region
By default, OpenMP uses all available logical cores. On a 16-core/32-thread system, OMP_NUM_THREADS defaults to 32. Whether 32 threads is sensible depends on whether your workload benefits from hyperthreading — it often does not for floating-point-heavy HPC code.
The Parallel Loop
The single most common OpenMP pattern:
#include <omp.h>
// Without OpenMP: serial
for (int i = 0; i < n; i++) {
result[i] = a[i] * b[i] + c[i];
}
// With OpenMP: parallel across all threads
#pragma omp parallel for
for (int i = 0; i < n; i++) {
result[i] = a[i] * b[i] + c[i];
}
#pragma omp parallel for combines parallel (fork a team) and for (distribute loop iterations). The loop iterations are divided among threads; each thread executes its share.
The implicit assumption: iterations are independent. If iteration i depends on the result of iteration i-1, you cannot parallelize it this way. The compiler will not always catch this.
Scheduling
The schedule clause controls how iterations are assigned to threads:
// Static: divide iterations into equal chunks in a fixed pattern, decided at loop entry
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; i++) { /* ... */ }
// Dynamic: assign chunks of 16 iterations to threads as they finish
// Good for irregular workloads where some iterations take longer
#pragma omp parallel for schedule(dynamic, 16)
for (int i = 0; i < n; i++) { /* ... */ }
// Guided: decreasing chunk sizes; good for loops with decreasing work
#pragma omp parallel for schedule(guided, 8)
for (int i = 0; i < n; i++) { /* ... */ }
static scheduling has zero synchronization overhead and is cache-friendly (threads access contiguous memory). Use it for uniform workloads. dynamic adds synchronization overhead but balances load for non-uniform work; useful for tree traversal, sparse operations, and anything where iteration time varies.
Data Sharing
In a parallel region, variables are either shared (all threads see the same object) or private (each thread has its own copy).
The defaults:
- Variables declared outside the parallel region: shared
- Variables declared inside the parallel region: private
- Loop iteration variable: private (automatically)
int n = 1000;
double sum = 0.0; // shared by default — DANGER if written concurrently
#pragma omp parallel for
for (int i = 0; i < n; i++) {
sum += compute(i); // DATA RACE: multiple threads writing sum
}
The reduction clause fixes this:
double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; i++) {
sum += compute(i); // Each thread has a private copy; combined at end
}
// sum now contains the total
reduction(op:var) creates a private copy of var for each thread (initialized to the identity element for op), lets each thread accumulate independently, and combines at the end. Operations: +, *, -, min, max, &&, ||, bitwise &, |, ^.
The private, firstprivate, and lastprivate Clauses
double x = 10.0; // initialized before the parallel region
#pragma omp parallel private(x)
{
    // x is private but UNINITIALIZED — x is not copied in
    x = omp_get_thread_num() * 2.0; // fine, each thread sets its own x
}
#pragma omp parallel firstprivate(x)
{
    // x is private AND initialized to the value from before the region (10.0)
    x += omp_get_thread_num(); // starts from 10.0 + thread_id
}
double last;
#pragma omp parallel for lastprivate(last)
for (int i = 0; i < n; i++) {
    last = i; // The value from the last iteration (i = n-1) is copied out
}
Synchronization
Barrier
An implicit barrier exists at the end of every parallel for construct — threads wait until all have finished before the master continues. Explicit barriers:
#pragma omp parallel
{
    do_first_phase();
    #pragma omp barrier // All threads reach here before any proceed
    do_second_phase();
}
Critical Section
Only one thread executes at a time:
#pragma omp parallel
{
    double local_result = compute();
    #pragma omp critical
    {
        global_sum += local_result; // Serial; only one thread at a time
    }
}
Critical sections serialize execution — use them sparingly. For the common case of accumulation, prefer reduction.
Atomic Operations
For simple updates to a shared variable, atomic is faster than critical:
#pragma omp atomic
count++; // Atomic increment; no full lock needed
#pragma omp atomic update
histogram[bin]++; // Also atomic update
atomic works for simple read-modify-write operations on scalar variables. The hardware provides atomic instructions for these; it is substantially cheaper than acquiring a lock.
Tasks
OpenMP tasks (introduced in 3.0) enable parallelism that does not map neatly to loop iterations — recursive algorithms, irregular graphs, producer-consumer patterns.
#pragma omp parallel
{
    #pragma omp single // Only one thread creates tasks
    {
        for (int i = 0; i < n; i++) {
            #pragma omp task firstprivate(i)
            {
                process_item(i); // Executes in parallel as threads are available
            }
        }
    }
    // Implicit barrier at the end of single; all tasks complete before threads proceed
}
A classic recursive use: parallel merge sort.
void merge_sort(int* data, int n) {
    if (n < THRESHOLD) {
        insertion_sort(data, n);
        return;
    }
    int mid = n / 2;
    #pragma omp task shared(data)
    merge_sort(data, mid);
    #pragma omp task shared(data)
    merge_sort(data + mid, n - mid);
    #pragma omp taskwait // Wait for both halves before merging
    merge(data, mid, n - mid);
}
// Call site:
#pragma omp parallel
#pragma omp single
merge_sort(data, n);
#pragma omp taskwait blocks the current task until all child tasks complete. #pragma omp taskgroup provides finer control when you want to wait for a subset.
Thread Affinity and NUMA
On NUMA systems (multiple sockets, each with local memory), thread placement affects memory access latency significantly. A thread on socket 0 accessing memory allocated by a thread on socket 1 pays a 2–3× latency penalty.
# Control affinity with environment variables (OpenMP 4.0+)
export OMP_PROC_BIND=close # Bind threads close together (same socket)
export OMP_PROC_BIND=spread # Spread threads across sockets
export OMP_PLACES=cores # Bind to core granularity
export OMP_PLACES=sockets # Bind to socket granularity
# Or with numactl for finer control
numactl --cpunodebind=0 --membind=0 ./my_program
First-touch policy: on Linux, pages are allocated in the memory bank of the thread that first accesses them. In NUMA-aware code, initialize arrays in parallel so that each thread touches its own portion of the data, placing that portion in local memory:
// NUMA-aware initialization: each thread initializes its own data
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; i++) {
data[i] = 0.0; // Thread i/chunk touches this page → allocated locally
}
// Now the parallel computation has good locality
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; i++) {
data[i] = compute(i); // Same thread as init → local memory access
}
The schedule must match between initialization and computation. If static scheduling assigns different iterations to different threads in the two loops, you get remote memory access.
SIMD Directives
OpenMP 4.0+ includes SIMD directives for vectorizing loops at the instruction level:
#pragma omp simd
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
// Combined parallel+simd: threads each execute a SIMD loop
#pragma omp parallel for simd
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
In practice, modern compilers auto-vectorize simple loops without the simd directive. The directive is useful for loops the compiler would not vectorize by default — typically because it cannot prove the absence of aliasing.
void add(float* restrict a, float* restrict b, float* restrict c, int n) {
// 'restrict' tells the compiler pointers don't alias — enables auto-vectorization
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
}
The restrict keyword (C99; most compilers accept __restrict__ in C++) and careful loop structure often get you further than OpenMP SIMD directives. See the Vectorization chapter for more depth on this.
OpenMP Target Offload (GPU)
OpenMP 4.5+ introduced target directives for GPU offloading. The intent is to express GPU parallelism using the same pragma model as CPU parallelism.
#pragma omp target map(to:a[0:n], b[0:n]) map(from:c[0:n])
#pragma omp teams distribute parallel for simd
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
The target pragma executes the following block on the GPU. map(to:...) copies data to the device before execution; map(from:...) copies results back after. teams distribute creates the grid of thread blocks; parallel for simd handles the parallel loop and vectorization within a block.
The honest assessment: OpenMP target offload exists, works, and is supported by GCC, Clang/LLVM, and the major vendor compilers. Performance on complex kernels is typically 20–50% behind hand-tuned CUDA. The value proposition is code portability — the same code targets CPU and GPU — at the cost of performance headroom.
For GPU-specific kernels where performance is critical, write CUDA or HIP. For codes where you want GPU acceleration on some platforms and CPU execution on others, and the performance gap is acceptable, OpenMP target is a reasonable choice.
Profiling OpenMP Code
Before optimizing, measure. Tools for OpenMP profiling:
# Intel VTune (Linux, free download)
vtune -collect threading ./my_program
vtune -report summary
# LLVM XRay
clang -fxray-instrument -fxray-instruction-threshold=1 ./my_program.c -o my_program
XRAY_OPTIONS="patch_premain=true" ./my_program
llvm-xray account xray-log.* --sort=sum --top=10
# Score-P with Scalasca (cluster profiling)
scalasca -analyze mpirun -n 8 scorep-instrumented-binary
# Simple: time with OMP_NUM_THREADS variation
for t in 1 2 4 8 16; do
OMP_NUM_THREADS=$t time ./my_program
done
Amdahl's Law is your constant companion when reading these profiles. If the parallel section takes 90% of serial runtime, the theoretical maximum speedup with infinite threads is 10×. With 16 threads, the bound is about 6.4×, and in practice you might see 5–6×. If you are seeing 3×, the issue is not thread count — it is synchronization, load imbalance, or memory bandwidth saturation.
A Note on Thread Safety
Adding #pragma omp parallel for to a loop that calls external functions is dangerous if those functions are not thread-safe. Standard library functions that use global state (strtok, rand, some stdio functions) will produce incorrect results or crashes. Thread-safe alternatives exist (strtok_r, rand_r), but the burden is on the programmer to know which functions are safe.
Third-party libraries may or may not be thread-safe. BLAS implementations (OpenBLAS, MKL) are thread-safe when called from multiple threads but may internally use their own thread pools — nested parallelism. When BLAS is called from inside an OpenMP region, you often want to disable BLAS's own threading:
export OMP_NUM_THREADS=16
export OPENBLAS_NUM_THREADS=1 # Disable OpenBLAS threading
export MKL_NUM_THREADS=1 # Disable MKL threading
Otherwise you get thread oversubscription: 16 OpenMP threads each spawning 16 BLAS threads = 256 threads on 16 cores.
Summary
OpenMP's strength is low-friction parallelism for existing code. The annotation model means you can add parallelism incrementally, verify correctness at each step, and fall back to serial execution by disabling the flags.
Its limitation is the shared-memory constraint: OpenMP works within a single node. For multi-node parallelism, combine with MPI (as covered in the previous chapter). For GPU parallelism with better performance guarantees, use CUDA or HIP directly and treat OpenMP as the CPU-side complement.
The programmer's responsibility is to understand the code before adding pragmas. OpenMP does not protect you from data races — it just makes them faster.
Vectorization and SIMD
SIMD (Single Instruction, Multiple Data) is the most consistently underexploited source of performance in production HPC code. Unlike GPU parallelism — which requires rearchitecting algorithms — or OpenMP — which requires restructuring loops — vectorization often requires very little: write the code correctly, and the compiler does the work.
When it works, vectorization delivers 4–16× speedup on floating-point loops at zero programmer effort. When it does not work, engineers spend days hand-writing intrinsics and end up with code that is brittle, architecture-specific, and barely faster than the scalar version they were complaining about.
This chapter covers both paths.
What SIMD Is
Modern CPUs contain SIMD units — wide arithmetic units that operate on multiple data elements simultaneously. Instead of adding two floats, they add eight. Instead of multiplying one double-precision pair, they multiply four.
The relevant ISA extensions by platform:
| Extension | Width | Elements (float) | Platform |
|---|---|---|---|
| SSE2 | 128-bit | 4 × float32 | x86 (2000+) |
| AVX | 256-bit | 8 × float32 | x86 (2011+) |
| AVX2 | 256-bit | 8 × float32, integer | x86 (2013+) |
| AVX-512 | 512-bit | 16 × float32 | x86 (2017+, servers) |
| NEON | 128-bit | 4 × float32 | ARM (v7+) |
| SVE | Scalable | 4–64 × float32 | ARM (v8.2+) |
| VSX | 128-bit | 4 × float32 | POWER |
A Skylake-X Xeon supporting AVX-512 with two FMA units can execute 32 float32 FMA operations per cycle — 64 floating-point operations, counting the multiply and add separately. At 3.5 GHz, that is 224 GFLOPS per core — before considering multiple cores. Leaving AVX-512 unused means leaving most of that peak performance on the table.
Auto-Vectorization
The compiler is the first line of vectorization. GCC, Clang, and MSVC all auto-vectorize loops that meet certain criteria. The criteria are simple to state and surprisingly easy to violate accidentally:
- Loop count must be known or estimable at compile time (or at loop entry for runtime vectorization)
- No loop-carried dependencies (iteration N depends on result of iteration N-1)
- No aliasing between input and output buffers
- Memory access must be contiguous (stride-1)
// Auto-vectorizes cleanly
void add(float* c, const float* a, const float* b, int n) {
for (int i = 0; i < n; i++)
c[i] = a[i] + b[i];
}
Check whether the compiler vectorized with:
# GCC
gcc -O3 -march=native -fopt-info-vec-optimized add.c
# Clang
clang -O3 -march=native -Rpass=loop-vectorize add.c
# Check the assembly
objdump -d add.o | grep -E "vmovaps|vaddps|vmulps|vfmadd"
SIMD instructions in the output (ymm/zmm registers, v-prefixed instructions) confirm vectorization.
Why Auto-Vectorization Fails
Aliasing: The compiler cannot prove that a, b, and c do not overlap. If they overlap, vectorizing would be incorrect. The restrict keyword is the fix:
// Without restrict: compiler may not vectorize (assumes aliasing possible)
void add(float* c, const float* a, const float* b, int n) { ... }
// With restrict: compiler knows pointers don't alias, can vectorize
void add(float* __restrict__ c,
const float* __restrict__ a,
const float* __restrict__ b, int n) { ... }
Loop-carried dependency: A recurrence where each iteration depends on the previous:
// Cannot vectorize: c[i] depends on c[i-1]
for (int i = 1; i < n; i++)
c[i] = c[i-1] * alpha + b[i];
This is a genuine algorithmic constraint; you cannot vectorize this with standard SIMD without restructuring the algorithm (some recurrences admit vectorized reformulations, but not trivially).
Non-contiguous access: Stride-2 or gather access prevents the compiler from generating efficient vector loads:
// Stride-2 access: difficult to auto-vectorize efficiently
for (int i = 0; i < n; i += 2)
result += a[i] * b[i];
Restructure data to be contiguous where possible. Separate arrays (structure of arrays) over interleaved layouts (array of structures) is the standard recommendation for SIMD-friendly data.
Function calls: Most function calls prevent vectorization because the compiler cannot inspect their internals. Mark small functions inline or __attribute__((always_inline)).
Enabling Vectorization
Compiler Flags
# Enable AVX-512 on a server with Skylake/Cascade Lake Xeon
gcc -O3 -march=native -funroll-loops -ftree-vectorize
# Target a specific architecture
gcc -O3 -march=skylake-avx512
# Report missed vectorizations
gcc -O3 -march=native -fopt-info-vec-missed
-march=native enables all instruction set extensions of the CPU you compile on. Do not use it for binaries deployed to heterogeneous clusters; target the oldest common denominator instead (-march=znver3 for AMD Zen 3, -march=icelake-server for Intel Ice Lake servers).
Compiler Hints
// Hint that the pointer is aligned (enables aligned loads/stores)
float* a = (float*)aligned_alloc(64, n * sizeof(float));
a = (float*)__builtin_assume_aligned(a, 64); // the builtin returns the pointer; its result must be used
// Loop vectorization hints (GCC)
#pragma GCC ivdep // Ignore vector dependencies (use when safe)
#pragma GCC unroll 4 // Unroll 4 times before vectorizing
// Clang
#pragma clang loop vectorize(enable)
#pragma clang loop interleave_count(4)
Intel Intrinsics: When You Need Control
When the compiler fails to vectorize efficiently, or when you need to guarantee specific instruction sequences, intrinsics provide direct access to SIMD instructions.
AVX-512 example: computing the sum of squares of a float array.
#include <immintrin.h> // AVX-512 intrinsics
float sum_of_squares_avx512(const float* data, int n) {
__m512 acc = _mm512_setzero_ps(); // 16 × float32, initialized to 0
int i = 0;
// Main loop: 16 elements per iteration
for (; i <= n - 16; i += 16) {
__m512 v = _mm512_loadu_ps(&data[i]); // Load 16 floats
acc = _mm512_fmadd_ps(v, v, acc); // acc += v * v (FMA)
}
// Horizontal sum of the 16 accumulators
float result = _mm512_reduce_add_ps(acc);
// Scalar tail for remaining elements
for (; i < n; i++)
result += data[i] * data[i];
return result;
}
Key intrinsic naming conventions (Intel):
- _mm_ prefix: 128-bit (SSE)
- _mm256_ prefix: 256-bit (AVX/AVX2)
- _mm512_ prefix: 512-bit (AVX-512)
- _ps suffix: packed single (float32)
- _pd suffix: packed double (float64)
- _epi32 suffix: packed 32-bit integers
ARM NEON Intrinsics
#include <arm_neon.h> // ARM NEON intrinsics
float dot_product_neon(const float* a, const float* b, int n) {
float32x4_t acc = vdupq_n_f32(0.0f); // 4 × float32, all zeros
int i = 0;
for (; i <= n - 4; i += 4) {
float32x4_t va = vld1q_f32(&a[i]); // Load 4 floats
float32x4_t vb = vld1q_f32(&b[i]); // Load 4 floats
acc = vfmaq_f32(acc, va, vb); // acc += va * vb (FMA)
}
// Horizontal sum
float result = vaddvq_f32(acc);
// Scalar tail
for (; i < n; i++)
result += a[i] * b[i];
return result;
}
NEON is available on all AArch64 processors: Apple Silicon, AWS Graviton, Ampere Altra. Each instruction operates on 128-bit vectors (4 × float32) with FMA support, substantially narrower than AVX-512 but with lower power consumption and better portability in the ARM world.
ARM SVE: Scalable Vectors
SVE (Scalable Vector Extension) is ARM's answer to AVX-512, with a twist: the vector width is implementation-defined and discoverable at runtime. Code written for SVE runs on 128-bit implementations (4 × float32) and 512-bit implementations (16 × float32) without recompilation.
#include <arm_sve.h>
float sum_of_squares_sve(const float* data, int n) {
svfloat32_t acc = svdup_f32(0.0f);
svbool_t pg; // Predicate (mask)
for (int i = 0; i < n; i += svcntw()) { // svcntw() = vector length in uint32s
pg = svwhilelt_b32(i, n); // Predicate: active lanes where i+lane < n
svfloat32_t v = svld1_f32(pg, &data[i]); // Masked load
acc = svmla_f32_m(pg, acc, v, v); // acc += v * v (masked FMA)
}
return svaddv_f32(svptrue_b32(), acc); // Horizontal sum
}
SVE's predicated execution handles loop tails naturally — no separate scalar tail needed. This makes the code cleaner at the cost of a more complex programming model.
The Roofline Model for CPUs
The same arithmetic intensity analysis from the GPU fundamentals chapter applies to CPUs.
For a Cascade Lake Xeon at 3.5 GHz with AVX-512 and two FMA units:
- Peak compute: 3.5 GHz × 16 (AVX-512 lanes) × 2 (FLOPs per FMA) × 2 (FMA units per core) = 224 GFLOPS/core
- Peak bandwidth (DDR4-2933, 6-channel): ~140 GB/s per socket
Crossover (roofline knee): 224 / 140 ≈ 1.6 FLOP/byte (single-core peak against full-socket bandwidth, a best case that assumes one core can draw the entire bus)
Operations with arithmetic intensity above 1.6 are compute-bound and benefit from SIMD. Operations below 1.6 are memory-bandwidth-bound, and SIMD helps only insofar as it reduces the number of load/store instructions that touch the bus.
This is why streaming operations (memcpy, simple element-wise ops) do not see large speedups from SIMD on large arrays: they are already memory-bound, and wider SIMD just saturates the memory bus faster without changing the ceiling.
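The knee and a couple of classifications can be computed directly. A sketch (the operation intensities below are back-of-envelope FP32 counts, not measured values):

```python
def roofline_knee(peak_gflops, peak_gbs):
    """Arithmetic intensity (FLOP/byte) where the compute and memory roofs meet."""
    return peak_gflops / peak_gbs

# Cascade Lake figures from above: 224 GFLOPS/core vs 140 GB/s/socket
knee = roofline_knee(224.0, 140.0)  # 1.6 FLOP/byte

# SAXPY (y[i] = a*x[i] + y[i]): 2 FLOPs per element, 12 bytes moved (FP32)
saxpy = 2 / 12

# Dense N x N matmul: 2*N^3 FLOPs over roughly 3*N^2 * 4 bytes of traffic,
# so intensity grows linearly with N
def matmul_intensity(N):
    return (2 * N**3) / (3 * N**2 * 4)

assert saxpy < knee                    # streaming op: memory-bound
assert matmul_intensity(4096) > knee   # large matmul: compute-bound
```

SAXPY sits far below the knee no matter how it is compiled, which is the quantitative form of the point above about streaming operations.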
Practical Vectorization Workflow
- Measure first. Profile to confirm the loop is actually hot. Vectorizing a function that takes 0.1% of runtime provides 0.1% improvement in the best case.
- Check compiler output. Use -fopt-info-vec-optimized and -fopt-info-vec-missed. Read the disassembly. Confirm you are seeing ymm/zmm registers in hot loops.
- Fix obvious blockers. Add restrict, restructure data layouts, eliminate function calls in loops.
- Enable the right flags. -O3 -march=native on development hardware; target a specific -march for deployment.
- Use intrinsics only when necessary. If the compiler generates correct SIMD with decent performance, stop. Intrinsics code is harder to maintain and architecture-specific.
- Test on target hardware. Performance of SIMD code varies significantly between CPU generations. AVX-512 downclocks on some Intel parts (frequency throttling when AVX-512 is active). Measure on the hardware that matters.
AVX-512 and the Frequency Throttling Problem
A notable gotcha on Intel Skylake and Cascade Lake Xeons: executing AVX-512 instructions causes the CPU to reduce its clock frequency by 100–300 MHz. This "license 2" frequency state is designed to manage power consumption during wide SIMD execution.
The implications:
- A loop that uses AVX-512 may run slower than AVX2 if the frequency reduction offsets the wider SIMD width
- The throttle applies to the entire core for a settling period (~1 ms) after AVX-512 execution — affecting non-SIMD code that runs afterward
- This is specific to older Intel microarchitectures; Ice Lake and later have significantly reduced throttling; AMD does not throttle on AVX-512
Measurement is mandatory. Do not assume AVX-512 is always faster than AVX2 on Intel hardware without benchmarking on the target CPU.
Integration with HPC Libraries
Many HPC libraries handle SIMD internally and expose architecture-agnostic interfaces:
- FFTW: Detects and uses SSE2/AVX/AVX-512/NEON at runtime based on the CPU
- OpenBLAS/MKL: Architecture-optimized BLAS with runtime dispatch
- Eigen: C++ template library that generates SIMD code; use -DEIGEN_VECTORIZE_AVX512 for AVX-512
- Highway: Google's portable SIMD library; same code, multiple targets
Highway is worth highlighting for cross-architecture portability:
#include "hwy/highway.h"
namespace hn = hwy::HWY_NAMESPACE;
// This code compiles for SSE2, AVX2, AVX-512, NEON, SVE, WASM SIMD;
// with Highway's dynamic dispatch machinery, the right version is selected at runtime
HWY_ATTR void add(float* HWY_RESTRICT c,
const float* HWY_RESTRICT a,
const float* HWY_RESTRICT b, int n) {
const hn::ScalableTag<float> d;
const int lanes = hn::Lanes(d);
int i = 0;
for (; i + lanes <= n; i += lanes) {
auto va = hn::Load(d, a + i);
auto vb = hn::Load(d, b + i);
hn::Store(hn::Add(va, vb), d, c + i);
}
// Scalar tail
for (; i < n; i++) c[i] = a[i] + b[i];
}
Highway's abstraction works well. The generated code quality is competitive with hand-written intrinsics in most cases. If you are writing a library that needs to run efficiently across x86, ARM, and RISC-V, Highway is the right tool.
Summary
Vectorization is free performance if you write the right code and compile with the right flags. The steps in order:
- Structure data for contiguous access (SoA over AoS)
- Annotate away aliasing with restrict
- Compile with -O3 -march=native
- Verify with disassembly or compiler reports
- Reach for intrinsics only when the compiler falls short and the profiler shows it matters
The gap between scalar and fully vectorized code can be 4–16×. The engineer who ignores vectorization and instead buys a bigger GPU is wasting money that the compiler would have given them for free.
When Training Clusters Became Research Clusters
There is a moment, roughly 2022–2023, when HPC centers across the world looked at their queue systems and noticed something unusual: the jobs were getting longer, the memory requests were getting larger, and an increasing fraction of them wanted GPUs. Not just GPUs as accelerators — GPUs exclusively. The CPU allocation was one core. For the scheduler.
These were ML training jobs.
The HPC community's first instinct was often to treat this as an infestation. ML jobs do not use MPI properly. They do not use the filesystem correctly (too many small files, or one enormous checkpoint). They run for days in an undivided block rather than checkpointing frequently. They consume GPU memory in ways that prevent packing multiple jobs on one node. They are, from the perspective of a traditional HPC operator, rude.
The ML community's first instinct was to build their own infrastructure. Why deal with Slurm when you can use Kubernetes? Why request GPUs through a scheduler when you can reserve a dedicated machine?
Both instincts were wrong, or at least insufficient. By 2025, HPC operators had adapted their workflows to accommodate ML training, and ML practitioners had learned — sometimes painfully — that the HPC community's hard-won knowledge about distributed computing, fault tolerance, and IO performance was not actually useless. The reconciliation was messy and ongoing. This chapter is about the intersection that emerged.
What Makes ML Training Different from Classical HPC
Classical HPC workloads — computational fluid dynamics, molecular dynamics, climate modeling — share certain characteristics:
- Deterministic communication patterns: MPI collectives at known points in the algorithm
- Floating-point precision matters: FP64 is standard; results must be bit-reproducible
- Job duration is predictable: "This will run for approximately N hours"
- Output is data: Files written to a shared filesystem
ML training differs on all of these:
- Non-deterministic communication patterns: The gradient aggregation pattern is regular, but dynamic batching, variable sequence lengths, and data-parallel scheduling create variability
- Lower precision is often better: FP16 and BF16 train faster with equivalent convergence for many models; FP8 is in use for inference and increasingly for training
- Job duration is unpredictable: Training continues until loss converges or the researcher gets impatient, whichever comes first
- Intermediate output is enormous: Model checkpoints for a 70B parameter model in BF16 are ~140 GB; checkpointing every hour × weeks of training × multiple experiments fills filesystems
The implications for cluster operators:
- GPU allocation policies need to favor long-running exclusive reservations, not the small-job fairness typically optimized by HPC schedulers
- The IO system needs to handle massive sequential writes (checkpointing) and massive sequential reads (data loading) simultaneously
- Network topology matters differently: ML training uses AllReduce almost exclusively, which benefits from fat-tree or dragonfly topologies; HPC uses a broader variety of collectives
Data Parallelism: The Dominant Pattern
The dominant ML training pattern is data parallelism: each GPU holds a complete copy of the model; each GPU processes a different batch of data; gradients are aggregated (averaged) across GPUs after each step.
# PyTorch DDP: the standard data parallelism API
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
def train(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    model = MyModel().cuda(rank)
    model = DDP(model, device_ids=[rank])  # Wraps model with AllReduce
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    for batch in dataloader:
        optimizer.zero_grad()
        loss = model(batch)
        loss.backward()  # DDP AllReduces gradients across GPUs here
        optimizer.step()
    dist.destroy_process_group()
DDP's gradient synchronization uses NCCL's AllReduce under the hood. NCCL is highly optimized for the NVLink topology of DGX nodes; on InfiniBand clusters, it uses RDMA to avoid host memory involvement. The communication overlaps with the backward pass where possible (buckets of gradients are reduced as they become ready, not all at once at the end).
The scaling efficiency of pure data parallelism is good but not perfect. At large batch sizes, the model may not train well (generalization degrades). Communication overhead grows as a fraction of compute time. In practice, linear scaling efficiency of 85–95% across 8 GPUs drops to 70–80% at 64 GPUs and 50–70% at 512+ GPUs, depending on model size and interconnect.
Tensor Parallelism: Splitting the Model
When a model does not fit in a single GPU's memory, data parallelism is insufficient. Tensor parallelism (also called model parallelism) splits the model itself across GPUs.
The prototypical form: split a large matrix multiplication across GPUs. For a weight matrix W with shape [4096, 4096], split it column-wise across 4 GPUs, each holding [4096, 1024]. Each GPU computes its portion of the output; an AllGather reconstructs the full result.
GPU 0: W[:, 0:1024] → partial output [B, 1024]
GPU 1: W[:,1024:2048] → partial output [B, 1024]
GPU 2: W[:,2048:3072] → partial output [B, 1024]
GPU 3: W[:,3072:4096] → partial output [B, 1024]
AllGather → full output [B, 4096]
Megatron-LM (NVIDIA's large model training library) popularized this approach for transformer layers. The attention and feed-forward layers in each transformer block are split across tensor-parallel GPUs; an AllReduce (or AllGather + ReduceScatter) synchronizes after each layer.
The cost: tensor parallelism requires extremely low-latency, high-bandwidth communication. It works well within an NVLink-connected DGX node (all 8 GPUs, 900 GB/s bidirectional). It does not scale across nodes well — the per-layer communication latency becomes the bottleneck.
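The column-split algebra can be checked on one machine. A NumPy sketch (the four shards stand in for four tensor-parallel GPUs, and the concatenation stands in for the AllGather; dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, P = 8, 1024, 4  # batch, hidden dim, tensor-parallel degree
x = rng.standard_normal((B, D)).astype(np.float32)
W = rng.standard_normal((D, D)).astype(np.float32)

# Column-parallel: each of the P "GPUs" holds a [D, D/P] slice of W
shards = np.split(W, P, axis=1)
partials = [x @ w for w in shards]  # each rank computes a [B, D/P] partial output

# The AllGather concatenates the partials back into the full [B, D] output
out = np.concatenate(partials, axis=1)
assert np.allclose(out, x @ W, atol=1e-3)
```

Nothing about the math requires communication until the concatenation step, which is exactly why the AllGather latency dominates once the shards live on different nodes.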
Pipeline Parallelism: Splitting by Layer
Pipeline parallelism assigns different layers of the model to different GPUs (or groups of GPUs). GPU 0 computes layers 1–4, passes the activations to GPU 1 which computes layers 5–8, and so on.
GPU 0: layers 1-4 → activations → GPU 1: layers 5-8 → ...
Naively, this leaves GPUs idle while waiting for upstream stages to finish the forward pass. Micro-batching (GPipe, PipeDream) addresses this by splitting the batch into micro-batches that flow through the pipeline simultaneously:
Time: 1 2 3 4 5 6 7 8
GPU 0: F1 F2 F3 F4 B4 B3 B2 B1
GPU 1: F1 F2 F3 F4 B4 B3 B2
GPU 2: F1 F2 F3 F4 B4 B3
GPU 3: F1 F2 F3 F4 B4
(F=forward, B=backward, numbers=micro-batch)
Pipeline parallelism communicates activations (not gradients) between pipeline stages — the data size is typically much smaller than gradients, enabling effective use of inter-node InfiniBand.
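The idle time in a GPipe-style schedule has a closed form: with p stages and m micro-batches, the bubble occupies (p-1)/(m+p-1) of each iteration. A sketch:

```python
def bubble_fraction(p, m):
    """Idle fraction of a GPipe-style pipeline: p stages, m micro-batches."""
    return (p - 1) / (m + p - 1)

# 4 stages and 4 micro-batches, as in the schedule above: 3/7 of the time is bubble
assert abs(bubble_fraction(4, 4) - 3 / 7) < 1e-12
# Raising the micro-batch count to 32 shrinks the bubble below 10%
assert bubble_fraction(4, 32) < 0.1
```

This is why pipeline-parallel configurations push the micro-batch count as high as activation memory allows.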
3D Parallelism
Large model training at scale (GPT-4 scale, Llama-3 scale) uses all three forms of parallelism simultaneously:
- Data parallelism: across replica groups (each replica is a full copy of the model)
- Tensor parallelism: within a node (NVLink provides the bandwidth)
- Pipeline parallelism: across nodes (InfiniBand handles inter-node activation transfer)
The interplay between these dimensions is complex and the subject of ongoing research in systems papers. The key insight: match the communication pattern to the available interconnect:
- NVLink (~900 GB/s): tensor parallelism (high-bandwidth, low-latency collectives)
- InfiniBand (~400 Gb/s, ~50 GB/s): pipeline stages (moderate data, point-to-point)
- Ethernet: data parallel replicas (large messages, tolerance for some latency)
Mixed Precision Training
FP32 training is largely obsolete. The standard since 2018 is mixed precision: forward and backward passes in FP16 or BF16; master weights and optimizer state in FP32.
# PyTorch AMP (Automatic Mixed Precision)
from torch.cuda.amp import GradScaler, autocast
scaler = GradScaler()  # Handles loss scaling for FP16

for batch in dataloader:
    optimizer.zero_grad()
    with autocast():  # Forward pass in FP16
        output = model(batch)
        loss = criterion(output, targets)
    scaler.scale(loss).backward()  # Scale loss to prevent FP16 underflow
    scaler.step(optimizer)         # Unscales gradients before optimizer step
    scaler.update()                # Adjusts scale factor
BF16 vs FP16: BF16 (bfloat16, Brain Float) has the same exponent range as FP32 but lower mantissa precision. It is less susceptible to overflow/underflow than FP16, which requires gradient scaling (the GradScaler above). On hardware that supports BF16 natively (A100, H100, MI300X, recent Apple Silicon), BF16 is generally preferred for training: simpler code, equivalent convergence, no scaling overhead.
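The range difference is easy to demonstrate. A NumPy sketch (BF16 is simulated here by truncating an FP32 value to its top 16 bits, a coarser rounding than real hardware uses):

```python
import numpy as np

def to_bf16(x):
    """Simulate bfloat16 by zeroing the low 16 mantissa bits of FP32 (truncation)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

grad = np.float32(1e10)  # a large un-scaled gradient magnitude

# FP16 (5 exponent bits, max ~65504) overflows: this is why GradScaler exists
assert np.isinf(np.float16(grad))
# BF16 (8 exponent bits, same range as FP32) represents it, just less precisely
assert np.isfinite(to_bf16(grad))
assert abs(to_bf16(grad) / grad - 1) < 0.01  # worst-case ~0.4% relative error
```

The same value that forces FP16 training through a loss-scaling dance is representable in BF16 out of the box.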
FP8: Transformer Engine (NVIDIA) and equivalent frameworks enable FP8 training with dynamic scaling per tensor. Effective throughput on H100 with FP8 approaches 2× the FP16 throughput via Tensor Cores. The stability of FP8 training is an active research area; it works well for transformers with careful scaling, but it is not universally applicable.
The IO Bottleneck
ML training at scale reveals IO bottlenecks that classical HPC jobs — which often have well-characterized sequential access patterns — manage to avoid.
Data loading: Training on large datasets (Common Crawl, RedPajama, LAION) requires reading terabytes of data with minimal stalls. The pattern is sequential within a shard, random across shards. Distributed filesystems (Lustre, GPFS) handle sequential reads well; object storage (S3, GCS) handles the random-shard pattern better with parallel prefetch.
Tools like WebDataset (PyTorch) and tf.data (TensorFlow) provide async prefetching pipelines that overlap data loading with compute. On a properly configured training pipeline, the GPU never waits for data. Improperly configured, data loading is the bottleneck — a $50,000 H100 idle while waiting for slow IO.
# WebDataset: streaming from object storage or filesystem
import webdataset as wds
dataset = (
wds.WebDataset("data/train-{000000..001023}.tar")
.shuffle(10000)
.decode("pil")
.to_tuple("jpg", "json")
.map(preprocess)
.batched(64)
)
loader = wds.WebLoader(dataset, num_workers=8, prefetch_factor=4)
Checkpointing: A 70B parameter model in BF16 = ~140 GB. Checkpointing every N steps writes 140 GB to shared storage. With 64 GPUs running simultaneously and checkpointing every 1000 steps, checkpoint IO can saturate shared filesystems in ways that degrade performance for all jobs on the cluster.
Asynchronous checkpointing (writing in the background while training continues) and distributed checkpointing (each GPU writes its shard, reassembled on load) are the standard mitigations.
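A minimal sketch of the asynchronous pattern (assumptions: plain pickle stands in for torch.save, a thread stands in for the background writer; real implementations also snapshot GPU tensors to host memory before training resumes):

```python
import os
import pickle
import tempfile
import threading

def async_checkpoint(state_dict, path):
    """Snapshot synchronously, write in the background.

    The snapshot must be taken before returning so training can keep
    mutating the live state; the slow part (the write) runs off-thread.
    """
    snapshot = pickle.dumps(state_dict)  # stand-in for a device-to-host copy
    def _write():
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(snapshot)
        os.replace(tmp, path)  # atomic rename: readers never see a torn checkpoint
    t = threading.Thread(target=_write, daemon=True)
    t.start()
    return t  # join before taking the next snapshot

state = {"step": 1000, "weights": list(range(10))}
with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "ckpt.pkl")
    async_checkpoint(state, p).join()
    with open(p, "rb") as f:
        assert pickle.load(f)["step"] == 1000
```

The write-to-temp-then-rename step matters as much as the threading: a job killed mid-write must never leave a half-written file at the checkpoint path.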
Fault Tolerance at Scale
A 1024-GPU training run whose individual GPUs each fail about once every 60 days (conservative for aging hardware) will see a failure every 1.4 hours on average. Without checkpointing and restart automation, the training job dies.
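The arithmetic, assuming independent failures, is that aggregate failure rate scales linearly with component count, so per-GPU reliability that looks fine on a workstation becomes a constant interruption at cluster scale. A sketch (the 60-day per-GPU MTBF is an illustrative assumption):

```python
def cluster_mtbf_hours(gpu_mtbf_hours, n_gpus):
    """Mean time between failures for the whole job, assuming independent failures."""
    return gpu_mtbf_hours / n_gpus

# A per-GPU MTBF of ~60 days across 1024 GPUs: a failure roughly every 1.4 hours
assert abs(cluster_mtbf_hours(60 * 24, 1024) - 1.4) < 0.01
```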
Elastic training: PyTorch's torchrun (successor to torch.distributed.launch) supports membership changes, adding or removing nodes from a running training job without a full restart, implemented via torch.distributed.elastic.
# Launch with torchrun: auto-restarts on failure, supports elasticity
# --nnodes=8:16 means a minimum of 8 and a maximum of 16 nodes (elastic)
torchrun \
  --nnodes=8:16 \
  --nproc_per_node=8 \
  --max_restarts=3 \
  train.py
Checkpoint frequency and recovery time: With 140 GB checkpoints, recovery time from checkpoint includes the IO time to read the checkpoint. Fast checkpointing (NVMe local storage, then async copy to shared FS) versus slow (write directly to Lustre) can mean the difference between 5-minute recovery and 45-minute recovery.
HPC Patterns That ML Engineers Should Know
The HPC community's experience with large-scale distributed computing contains hard-won knowledge that has been independently rediscovered (sometimes at great cost) by ML practitioners.
Checkpoint-restart is not optional. HPC jobs have always checkpointed regularly; ML training is learning this lesson through hardware failures at scale.
IO is a shared resource. Writing checkpoints from 1024 GPUs simultaneously destroys performance for everyone on the cluster. Job scheduling systems have bandwidth reservation; use it. Stagger checkpoint writes if the system does not.
Profiling distributed programs is different. A 5% compute imbalance across 512 ranks that would be insignificant in isolation causes the entire job to slow by 5% at every synchronization point. Trace-based profilers (Nsight Systems, Perfetto) that show all ranks simultaneously are essential for distributed performance debugging.
Network topology is not flat. InfiniBand fat-tree topologies have more bandwidth within a spine switch than across spine switches. Co-locating communicating processes on the same rack is not always possible, but it is the right model to have in your head when looking at why AllReduce performance varies.
ML Patterns That HPC Engineers Should Know
Automatic differentiation is real. ML frameworks compute gradients automatically via the computational graph. Understanding backpropagation at the graph level explains why the backward pass costs roughly 2× the compute of the forward pass, why training memory is dominated by stored activations, and why gradient checkpointing (recomputing activations during the backward pass instead of storing them) is a memory/compute tradeoff.
The optimizer state is large. Adam maintains two moment estimates per parameter: m (first moment) and v (second moment). For a 70B parameter model in FP32, the optimizer state is 2 × 70B × 4 bytes = 560 GB. This is why ZeRO (Zero Redundancy Optimizer) — which shards optimizer state across GPUs — is essential for large model training.
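The bookkeeping can be sketched as a budget function. This extends the 560 GB optimizer-state figure above with BF16 weights and gradients plus an FP32 master copy (assumptions about the training recipe; the function name and structure are illustrative):

```python
def adam_training_bytes(params, dp_degree=1, zero_stage=0):
    """Rough per-GPU memory for mixed-precision Adam training.

    Assumes BF16 weights and gradients, plus FP32 master weights
    and Adam's two FP32 moment estimates (m and v).
    """
    weights, grads = 2 * params, 2 * params   # BF16: 2 bytes each
    opt = 4 * params + 2 * 4 * params         # FP32 master + m + v
    if zero_stage >= 1:
        opt /= dp_degree                      # ZeRO-1: shard optimizer state
    if zero_stage >= 2:
        grads /= dp_degree                    # ZeRO-2: also shard gradients
    if zero_stage >= 3:
        weights /= dp_degree                  # ZeRO-3: also shard parameters
    return weights + grads + opt

GB = 1024**3
# 70B parameters unsharded: over 1 TB, impossible on any single GPU
assert adam_training_bytes(70e9) / GB > 1000
# ZeRO-3 across 64 GPUs brings the per-GPU share under 20 GB
assert adam_training_bytes(70e9, dp_degree=64, zero_stage=3) / GB < 20
```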
Quantization is a legitimate tool, not a hack. INT8 inference of an FP16 model can deliver 1.9× throughput with minimal accuracy loss on many tasks. FP8 training is in production. This is different from low-precision arithmetic in scientific computing, where precision requirements are often non-negotiable.
The Unified Cluster
By 2026, the practical reality for most organizations running significant compute is a single GPU cluster that serves both ML training and scientific computing, managed through a unified scheduler (Slurm with GPU extensions, or Kubernetes with GPU operators).
The policies that make this work:
- Priority queues that differentiate elastic ML jobs from rigid HPC jobs
- GPU time limits that force checkpointing on long-running jobs
- Storage quotas that prevent checkpoint accumulation from consuming all available space
- Network partitioning that keeps bulk data transfer from saturating the HPC interconnect
- Fair-share scheduling that prevents one team's training run from monopolizing resources indefinitely
The tools that make this work:
- Slurm with gres (generic resource) scheduling for GPUs, TRES fair-share for accounting
- Kubernetes with NVIDIA GPU Operator or AMD ROCm Device Plugin
- MLflow or Weights & Biases for experiment tracking (the HPC equivalent of job accounting)
- Unified checkpoint storage (parallel filesystem + object storage tier) with automated retention policies
The integration is imperfect and organization-specific. There is no universal solution. There is only the recognition that the workloads converged before the infrastructure did, and the infrastructure is catching up.
Production Patterns and Workload Design
Everything in the preceding chapters was about individual tools: CUDA, ROCm, Metal, MPI, OpenMP, vectorization. This chapter is about how those tools combine into systems that run reliably at scale — the engineering decisions that separate code that works in a demo from code that runs for months in production without someone babysitting it.
The patterns here are not theoretical. They are the kind of things that get written up after incidents.
Know Your Workload's Critical Path
Before designing a parallel system, identify the critical path: the sequence of operations that determines end-to-end latency. Operations not on the critical path can be optimized indefinitely without improving overall performance.
A common mistake: optimizing GPU kernel throughput in a pipeline where the critical path is IO. A common parallel mistake: optimizing IO in a pipeline where the critical path is an AllReduce that is network-bound.
The tool for finding the critical path is a distributed trace, not a local profiler. Nsight Systems, Perfetto, or a custom timeline built on NVTX markers can show the timeline of events across all GPUs and the CPU simultaneously. A timeline view answers the question "what is the GPU waiting for and how long?" more directly than any per-kernel metric.
# NVTX markers: annotate code regions that appear in Nsight timeline
import torch.cuda.nvtx as nvtx
nvtx.range_push("data_loading")
batch = next(data_iter)
nvtx.range_pop()
nvtx.range_push("forward_pass")
with torch.cuda.amp.autocast():
    output = model(batch)
    loss = criterion(output, targets)
nvtx.range_pop()
nvtx.range_push("backward_pass")
scaler.scale(loss).backward()
nvtx.range_pop()
When you open this trace in Nsight Systems, the colored ranges appear on the timeline alongside CUDA kernel launches, memory transfers, and NCCL collectives. The gaps between GPU activity are immediately visible.
Roofline-Driven Optimization
The roofline model (introduced in the GPU Fundamentals chapter) is a practical tool for directing optimization effort. For each kernel:
- Measure actual arithmetic intensity (FLOPs / bytes transferred)
- Look up the hardware roofline (peak FLOPs / peak bandwidth)
- If below the roofline: memory-bound → optimize memory access patterns, reduce data movement, increase reuse
- If at or near the roofline: compute-bound → reduce arithmetic operations, use lower precision, or accept that you are near optimal
# Measure FLOPs and bandwidth for a PyTorch operation
import torch
from torch.profiler import profile, record_function, ProfilerActivity

A = torch.randn(4096, 4096, device="cuda")
B = torch.randn(4096, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CUDA],
             with_flops=True,
             profile_memory=True) as prof:
    with record_function("matmul"):
        result = torch.mm(A, B)

print(prof.key_averages().table(
    sort_by="cuda_time_total", row_limit=10))
The key metrics from this output are self CUDA time and FLOPs: divide them to get achieved FLOP/s, and combine with the memory figures to estimate arithmetic intensity and place the kernel relative to the hardware crossover.
Kernel Fusion
Every kernel launch has overhead (~3–10 μs on modern GPUs). For workloads composed of many small operations, this overhead accumulates. More importantly, separate kernels for adjacent operations read and write global memory between them — bandwidth that could be eliminated by fusing.
Manual fusion: Combine operations into a single kernel that reads inputs once, computes multiple operations, and writes outputs once:
// Instead of: norm_kernel → activation_kernel → scale_kernel
// Write one fused kernel:
__global__ void layer_norm_gelu_scale(
const float* input, const float* weight, const float* bias,
float* output, float scale, int n, int d)
{
int row = blockIdx.x;
int tid = threadIdx.x;
// 1. Compute mean and variance (using shared memory reduction)
__shared__ float mean_var[2];
// ... (reduction code) ...
// 2. Normalize
float val = (input[row * d + tid] - mean_var[0]) /
sqrtf(mean_var[1] + 1e-5f);
// 3. Apply learned parameters
val = val * weight[tid] + bias[tid];
// 4. GELU activation
val = 0.5f * val * (1.0f + tanhf(0.7978845608f * (val + 0.044715f * val * val * val)));
// 5. Scale
output[row * d + tid] = val * scale;
}
Automatic fusion via compilers:
# PyTorch 2.0+: torch.compile fuses ops automatically
import torch
@torch.compile
def fused_forward(x, weight, bias):
    return torch.nn.functional.gelu(
        torch.nn.functional.layer_norm(x, [x.size(-1)], weight, bias))
torch.compile with TorchInductor generates Triton kernels that fuse operations across layer boundaries, achieving results close to hand-written fused kernels for common patterns. It is the right starting point before writing custom CUDA.
Triton for custom kernels:
import triton
import triton.language as tl
@triton.jit
def fused_add_relu_kernel(
    x_ptr, y_ptr, output_ptr,
    n_elements,
    BLOCK_SIZE: tl.constexpr,
):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = tl.maximum(x + y, 0.0)  # Fused add+ReLU
    tl.store(output_ptr + offsets, output, mask=mask)
Triton compiles to PTX (NVIDIA) or AMDGPU ISA (ROCm), handles shared memory management automatically, and generates code that competes with hand-written CUDA for many memory-bound patterns.
Batching and Throughput Optimization
Static vs. dynamic batching: Fixed batch sizes allow the compiler and runtime to optimize for known shapes. Variable batch sizes (common in inference serving) require padding or bucketing.
Padding to a fixed size wastes compute on padding tokens. Bucketing — grouping sequences of similar length into the same batch — reduces waste. torch.nn.utils.rnn.pad_sequence and custom dataset samplers implement this:
import random

class BucketSampler(torch.utils.data.Sampler):
    """Groups samples by length to minimize padding waste."""
    def __init__(self, lengths, batch_size, shuffle=True):
        self.indices = sorted(range(len(lengths)), key=lambda i: lengths[i])
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        batches = [self.indices[i:i+self.batch_size]
                   for i in range(0, len(self.indices), self.batch_size)]
        if self.shuffle:
            random.shuffle(batches)
        for batch in batches:
            yield batch
Continuous batching (inference): For LLM serving, requests arrive dynamically and have variable completion lengths. Naively waiting until a full batch is assembled increases first-token latency. Continuous batching (PagedAttention, vLLM) interleaves requests at the token level — a GPU iteration processes tokens from multiple partially-complete requests simultaneously. This increases GPU utilization at the cost of implementation complexity.
Memory Management in Production
Avoiding Memory Fragmentation
GPU memory allocators are less sophisticated than CPU allocators. Fragmentation, where enough memory is free in total but no contiguous block is large enough for a given allocation, is a real failure mode.
# PyTorch memory management
torch.cuda.empty_cache() # Release unused cached blocks back to the driver, making them visible to other processes
# Check fragmentation
print(torch.cuda.memory_stats())
# Look at 'active_bytes.all.current' vs 'reserved_bytes.all.current'
# Set memory fraction to avoid OOM on shared GPUs
torch.cuda.set_per_process_memory_fraction(0.9)
For CUDA C++ applications, using a pool allocator (RMM, cub::CachingDeviceAllocator) with a pre-allocated pool avoids the overhead and fragmentation of repeated cudaMalloc/cudaFree calls.
Gradient Checkpointing
Trading compute for memory: rather than storing all activations during the forward pass, recompute them during the backward pass.
# PyTorch gradient checkpointing
from torch.utils.checkpoint import checkpoint
class TransformerLayer(torch.nn.Module):
    def forward(self, x):
        # Checkpointed: activations not stored, recomputed in backward
        return checkpoint(self._forward_impl, x, use_reentrant=False)

    def _forward_impl(self, x):
        # ... expensive forward computation ...
        return x
Gradient checkpointing typically increases compute by 33% (one extra forward pass per layer) while reducing activation memory by 60–70%. For memory-constrained training, this is a good trade.
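The 33% figure follows from the standard cost model (backward pass costs roughly 2× the forward pass in FLOPs; full recomputation adds one extra forward per checkpointed layer):

```python
def recompute_overhead(bwd_to_fwd_ratio=2.0):
    """Relative extra compute from full activation recomputation."""
    base = 1 + bwd_to_fwd_ratio       # forward + backward
    with_ckpt = 2 + bwd_to_fwd_ratio  # one extra forward during the backward pass
    return with_ckpt / base - 1

assert abs(recompute_overhead() - 1 / 3) < 1e-12  # the 33% figure above
```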
ZeRO: Distributed Optimizer State
For very large models where optimizer state (Adam: 2× model parameters in FP32) does not fit on one GPU, ZeRO shards it across GPUs:
- ZeRO-1: Shard optimizer states
- ZeRO-2: Shard optimizer states + gradients
- ZeRO-3: Shard optimizer states + gradients + parameters
# DeepSpeed ZeRO-3
import deepspeed

ds_config = {
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": 5e8,
        "stage3_prefetch_bucket_size": 5e7,
    }
}

model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    config=ds_config,
    model_parameters=model.parameters(),
)
ZeRO-3 enables training models that are 8–16× larger than a single GPU can hold, at the cost of additional AllGather/ReduceScatter communication overhead. The communication overhead is partially hidden by pipelining parameter fetches with computation.
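The per-GPU memory for each stage can be worked out from the mixed-precision accounting in the ZeRO paper: 2 bytes/parameter for FP16 weights, 2 for FP16 gradients, and 12 bytes/parameter of FP32 optimizer state (master weights plus Adam's two moments). The model size and GPU count below are illustrative:

```python
# Per-GPU memory for ZeRO stages 0-3 under standard mixed-precision training.
def zero_gb_per_gpu(num_params, num_gpus, stage):
    p, g, opt = 2 * num_params, 2 * num_params, 12 * num_params  # bytes
    if stage == 0:
        total = p + g + opt                 # fully replicated
    elif stage == 1:
        total = p + g + opt / num_gpus      # shard optimizer states
    elif stage == 2:
        total = p + (g + opt) / num_gpus    # + shard gradients
    else:
        total = (p + g + opt) / num_gpus    # + shard parameters (stage 3)
    return total / 1e9

params, gpus = 7e9, 64  # e.g. a 7B-parameter model on 64 GPUs
for s in range(4):
    print(f"ZeRO-{s}: {zero_gb_per_gpu(params, gpus, s):6.1f} GB/GPU")
```

For this example, replicated state needs 112 GB per GPU (impossible on any current single GPU), while ZeRO-3 brings it to under 2 GB — the rest of the budget then goes to activations and communication buffers.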
Monitoring and Observability
Production HPC systems need observability: what are the GPUs doing, when are they idle, why did this job crash?
GPU Utilization Monitoring
# Basic: nvidia-smi in watch mode
watch -n1 nvidia-smi
# Better: dcgm-exporter for Prometheus metrics
docker run -d --gpus all \
-p 9400:9400 \
nvcr.io/nvidia/k8s/dcgm-exporter:latest
# AMD equivalent: amdsmi
amd-smi monitor --watch 1000 --watch_time 0
Key metrics to monitor:
- SM Activity (GPU utilization): percentage of SMs active. Below 80% suggests underutilization.
- Memory Bandwidth Utilization: approaching 100% means the workload is memory-bound.
- NVLink Bandwidth: active NVLink traffic indicates inter-GPU communication.
- GPU Memory Used: approaching 100% risks OOM.
- GPU Temperature and Power Draw: for thermal and power budget awareness.
Distributed Tracing
For multi-node jobs, distributed tracing connects events across processes:
# PyTorch profiler with distributed tracing
import torch
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler('./log/profiler'),
    with_stack=True,
) as prof:
    for step, batch in enumerate(train_loader):
        train_step(batch)
        prof.step()  # advance the profiler schedule once per training step
The resulting trace in TensorBoard shows CPU and GPU timelines, memory allocation history, and kernel execution details.
The Hardware-Software Co-Design Mindset
At scale, the boundary between software optimization and hardware selection blurs. Algorithm changes that reduce communication volume matter more than network upgrades at some problem sizes. Precision reductions (FP16 → FP8) that double throughput matter more than kernel micro-optimizations.
The questions worth asking when performance is insufficient:
Is the arithmetic intensity high enough to justify the hardware? A workload with 2 FLOP/byte arithmetic intensity on an H100 (FP64 roofline knee at roughly 20 FLOP/byte) is memory-bandwidth-bound. More GPUs will not help proportionally; the bottleneck is bandwidth, not compute.
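The roofline check itself is one line of arithmetic — a sketch using approximate H100 SXM FP64 figures (~67 TFLOP/s tensor-core peak, ~3.35 TB/s HBM bandwidth; substitute your own hardware's numbers):

```python
# Roofline model: attainable performance is capped by either peak compute
# or arithmetic intensity times memory bandwidth.
def attainable_tflops(ai_flop_per_byte, peak_tflops=67.0, bw_tbps=3.35):
    return min(peak_tflops, ai_flop_per_byte * bw_tbps)

knee = 67.0 / 3.35
print(f"roofline knee: {knee:.0f} FLOP/byte")        # → roofline knee: 20 FLOP/byte
print(f"AI = 2: {attainable_tflops(2):.1f} TFLOP/s")  # → AI = 2: 6.7 TFLOP/s
```

At 2 FLOP/byte this workload reaches about 10% of peak no matter how well the kernels are tuned; only raising arithmetic intensity (or bandwidth) moves the ceiling.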
Is the communication to compute ratio acceptable? For data parallelism, the ratio is (gradient size) / (compute time). Larger batches, more computation per parameter update, and lower precision all improve this ratio.
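A rough version of that ratio can be computed directly — a sketch with illustrative numbers (achieved TFLOP/s, network speed, and the ~6 FLOPs/parameter/token transformer rule of thumb), ignoring the AllReduce algorithm factor and any compute/communication overlap:

```python
# Data-parallel communication-to-compute ratio: gradient AllReduce time
# relative to per-step compute time. Ratio > 1 means communication dominates.
def comm_compute_ratio(num_params, tokens_per_gpu, tflops=400.0, net_gbps=400.0):
    compute_s = 6 * num_params * tokens_per_gpu / (tflops * 1e12)
    comm_s = 2 * num_params / (net_gbps / 8 * 1e9)  # BF16 gradients over the wire
    return comm_s / compute_s

print(f"{comm_compute_ratio(7e9, tokens_per_gpu=4096):.2f}")
print(f"{comm_compute_ratio(7e9, tokens_per_gpu=8192):.2f}")  # larger batch: ratio halves
```

Because gradient volume is fixed per step while compute scales with tokens, doubling the per-GPU batch exactly halves the ratio — which is why the text lists larger batches first among the remedies.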
Is the IO system the bottleneck? If data loading saturates before compute, buying faster GPUs helps nothing. If checkpointing creates IO spikes that stall training, the fix is async checkpointing and staggering, not GPU upgrades.
Is the job scheduler part of the problem? Slurm's default fairshare policy can fragment large-job allocations across multiple scheduler cycles. Understanding your scheduler's topology-aware placement policies and requesting them explicitly (--constraint=gpu_h100, --switches=1@10:00:00 in Slurm) avoids jobs with bad network topology.
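A batch script putting those flags together might look like the following — a sketch, since feature tags like gpu_h100 are site-defined and vary by cluster:

```shell
#!/bin/bash
# Illustrative sbatch header for topology-aware placement.
#SBATCH --nodes=4
#SBATCH --gpus-per-node=8
#SBATCH --constraint=gpu_h100      # site-defined node feature tag
#SBATCH --switches=1@10:00:00      # all nodes under one leaf switch;
                                   # wait up to 10 minutes for such a placement
srun ./train.sh
```

The time suffix on --switches bounds how long the scheduler may delay the job hunting for a good topology; without it, a strict switch constraint can leave a large job queued indefinitely.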
Putting It Together: A Production Checklist
When deploying a new HPC or ML workload to production:
Performance baseline:
- Profile on a single GPU/node to establish baseline efficiency
- Verify the bottleneck (compute-bound vs memory-bound vs IO-bound)
- Check vectorization (CPU-bound paths)
- Confirm memory access patterns (coalesced access, shared memory usage)
Scaling:
- Measure scaling efficiency at 2, 4, 8, 16, 32 GPUs
- Identify the communication overhead fraction at each scale point
- Enable GPU-aware MPI / NCCL for GPU-GPU communication
- Verify process/thread affinity and NUMA placement
Reliability:
- Implement checkpoint-restart with automatic resume
- Set checkpoint frequency based on MTBF of cluster hardware
- Test recovery from checkpoint on a subset of training
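For setting checkpoint frequency from MTBF, the classic starting point is the Young/Daly approximation: the optimal interval is about sqrt(2 × checkpoint cost × MTBF). The job size, per-GPU MTBF, and checkpoint write time below are illustrative:

```python
# Young/Daly approximation for the optimal checkpoint interval.
import math

def optimal_checkpoint_interval_s(checkpoint_cost_s, mtbf_s):
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# 1024-GPU job: a per-GPU MTBF of ~5 years gives a job-level MTBF of ~43 hours.
gpu_mtbf_s = 5 * 365 * 24 * 3600
job_mtbf_s = gpu_mtbf_s / 1024

tau = optimal_checkpoint_interval_s(checkpoint_cost_s=60, mtbf_s=job_mtbf_s)
print(f"checkpoint every ~{tau / 60:.0f} minutes")  # → checkpoint every ~72 minutes
```

Note how the job-level MTBF shrinks linearly with GPU count: the interval that was fine at 16 GPUs is far too sparse at 1024, which is exactly the kind of silent assumption the checklist is meant to catch.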
IO:
- Profile data loading throughput; confirm GPU never starves
- Async checkpoint writes; stagger across parallel jobs
- Set storage quotas and checkpoint retention policies
Monitoring:
- GPU utilization, bandwidth utilization, temperature, memory via dcgm-exporter or equivalent
- Distributed traces for multi-node jobs
- Alerting on OOM, NaN loss, gradient explosion, hardware errors (ECC)
Reproducibility:
- Fix random seeds for deterministic training runs
- Log software versions (framework, CUDA/ROCm, MPI, libraries)
- Log hardware configuration (GPU type, count, interconnect)
- Archive checkpoints at major milestones
Production HPC is not a solved problem. Hardware fails. Libraries have bugs. Communication patterns change when you scale from 16 to 128 GPUs. The checklist does not prevent all problems; it prevents the ones that have already burned someone else.
The literature that survives long enough to become conventional wisdom — checkpoint early, profile before optimizing, co-locate communicating processes, measure your arithmetic intensity — exists because someone did not do those things and paid for it. The good news is that the community's collective understanding of these failure modes continues to improve, and the tooling for catching them earlier continues to get better.
The bad news is that the hardware keeps getting faster, which keeps raising the ceiling on how wrong you can be before it becomes a problem.