GPU Framework Shootout

The three previous chapters made the case for each platform on its own terms. This chapter makes them compete. The goal is a practical decision framework — not a winner-declaration, because the winner depends on your constraints — but a structured analysis of what you gain and what you give up with each choice.

The Decision Matrix

Before benchmarks, acknowledge the meta-question: are you choosing a programming model or a hardware platform? They are coupled but not identical.

CUDA = NVIDIA hardware + CUDA programming model + full ecosystem
HIP/ROCm = NVIDIA or AMD hardware + HIP programming model + ROCm ecosystem (AMD) or CUDA ecosystem (NVIDIA)
Metal = Apple Silicon hardware + Metal/MSL programming model + Apple ecosystem

You can write HIP code that runs on NVIDIA hardware via the CUDA backend. You cannot run Metal on non-Apple hardware. You can write CUDA code that runs on AMD via porting tools, but it is a one-time port, not ongoing portability.

Side-by-Side: Language and Syntax

A simple reduction kernel in all three languages reveals where the models align and where they diverge.

CUDA

#include <cub/cub.cuh>

// Using CUB for a proper reduction
__global__ void reduce_sum(const float* input, float* output, int n) {
    using BlockReduce = cub::BlockReduce<float, 256>;
    __shared__ typename BlockReduce::TempStorage temp;

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (idx < n) ? input[idx] : 0.0f;

    float block_sum = BlockReduce(temp).Sum(val);

    if (threadIdx.x == 0)
        atomicAdd(output, block_sum);
}

// Launch:
int threads = 256;
int blocks = (n + threads - 1) / threads;
float* d_output;
cudaMalloc(&d_output, sizeof(float));
cudaMemset(d_output, 0, sizeof(float));
reduce_sum<<<blocks, threads>>>(d_input, d_output, n);
cudaDeviceSynchronize();

HIP (ROCm)

#include <hip/hip_runtime.h>
#include <hipcub/hipcub.hpp>

__global__ void reduce_sum(const float* input, float* output, int n) {
    using BlockReduce = hipcub::BlockReduce<float, 256>;
    __shared__ typename BlockReduce::TempStorage temp;

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (idx < n) ? input[idx] : 0.0f;

    float block_sum = BlockReduce(temp).Sum(val);

    if (threadIdx.x == 0)
        atomicAdd(output, block_sum);
}

// Launch — identical to CUDA
int threads = 256;
int blocks = (n + threads - 1) / threads;
float* d_output;
hipMalloc(&d_output, sizeof(float));
hipMemset(d_output, 0, sizeof(float));
hipLaunchKernelGGL(reduce_sum, dim3(blocks), dim3(threads), 0, 0,
                    d_input, d_output, n);
hipDeviceSynchronize();

The mechanical similarity is intentional. hipcub is a port of CUB to HIP; the interface is the same. This is the best-case scenario for HIP portability: the code is nearly identical.

Metal (MSL)

// MSL kernel — reduction using threadgroup memory
#include <metal_stdlib>
using namespace metal;

kernel void reduce_sum(
    device const float* input  [[buffer(0)]],
    device atomic_float* output [[buffer(1)]],
    constant uint& n            [[buffer(2)]],
    threadgroup float* shared   [[threadgroup(0)]],
    uint local_idx              [[thread_position_in_threadgroup]],
    uint global_idx             [[thread_position_in_grid]],
    uint group_size             [[threads_per_threadgroup]])
{
    shared[local_idx] = (global_idx < n) ? input[global_idx] : 0.0f;
    threadgroup_barrier(mem_flags::mem_threadgroup);

    for (uint stride = group_size / 2; stride > 0; stride >>= 1) {
        if (local_idx < stride)
            shared[local_idx] += shared[local_idx + stride];
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }

    if (local_idx == 0)
        atomic_fetch_add_explicit(output, shared[0], memory_order_relaxed);
}

The Metal version is more verbose at the kernel signature level (every parameter requires attribute decoration) and uses different synchronization primitives. The reduction algorithm is the same; the idioms differ.

Performance Comparison

Benchmarking GPU kernels is an exercise in controlled disappointment: results vary significantly by hardware generation, problem size, precision, memory access pattern, and which version of which library you compiled against. The numbers below are indicative, not contractual.

GEMM (Matrix Multiplication)

Dense GEMM at FP32, 4096×4096 matrices, measured in TFLOPS effective throughput:

Platform	Hardware	Library	Effective TFLOPS
CUDA	H100 SXM5	cuBLAS 12	~60 TFLOPS
HIP	MI300X	rocBLAS 6.x	~55 TFLOPS
HIP	H100 (CUDA backend)	cuBLAS	~59 TFLOPS
Metal	M2 Ultra	MPS	~12 TFLOPS
Metal	M3 Max	MPS	~8 TFLOPS

The H100/MI300X gap in GEMM has narrowed considerably in ROCm 6.x. MI300X's memory bandwidth advantage does not help large GEMMs much (they are compute-bound at this size), but its 192 GB capacity allows working at much larger problem sizes without spilling to host memory.

Memory Bandwidth (Stream Benchmark)

Sustained memory bandwidth, measured in GB/s:

Platform	Hardware	Achieved BW	% of Peak
CUDA	H100 SXM5	~3.1 TB/s	~92%
HIP	MI300X	~4.9 TB/s	~92%
Metal	M2 Ultra	~720 GB/s	~90%
Metal	M3 Max	~360 GB/s	~90%

MI300X's bandwidth advantage is real and consistent. For memory-bound workloads (element-wise ops, reductions, gather/scatter), MI300X outperforms H100 by roughly 45%.

Inference Throughput (LLM, Large Batch)

For transformer inference at FP16, a representative large language model (70B parameters):

Setup	Tokens/sec (batch=32)
1× H100 SXM5 (requires quantization or offload)	~3,200
1× MI300X (full FP16, fits in 192 GB)	~4,100
2× H100 NVLink (full FP16)	~5,800
M2 Ultra (full FP16, fits in 192 GB)	~320

The MI300X's single-card advantage for large model inference is significant. Not needing to split across GPUs eliminates the NVLink/XGMI communication overhead and simplifies deployment. The M2 Ultra — despite also fitting the model in unified memory — is outpaced on raw throughput; it is relevant for inference where latency and power matter more than throughput.

Ecosystem Comparison

Capability	CUDA	ROCm/HIP	Metal
PyTorch	First-class	Supported, slight lag	Via MPS backend (limited)
JAX	First-class	Supported	Experimental
TensorFlow	First-class	Supported	Via Metal plugin
Triton kernels	First-class	ROCm backend (growing)	No
BLAS library	cuBLAS	rocBLAS	MPS
Profiler quality	Excellent (ncu, nsys)	Good (Omniperf)	Excellent (Instruments)
Community knowledge	Vast	Growing	Limited (HPC)
Debugger	cuda-gdb, compute-sanitizer	rocgdb	Xcode Metal Debugger
Docker support	`--gpus all`	`--device /dev/kfd /dev/dri`	macOS only
CI/CD integration	Mature	Workable	macOS runners only

The Portability Spectrum

GPU portability is a spectrum, not a binary:

Write-once, run-anywhere: Does not exist at the kernel level in 2026. WebGPU is the closest thing (Metal/Vulkan/D3D12 backends, runs in browsers), and it has restrictions that make it unsuitable for serious HPC.

Portable high-level code: Possible with frameworks that abstract hardware (PyTorch, JAX, MLX). You write in Python/Python-adjacent, the framework dispatches to the right backend. This works well for ML workloads and reasonably well for array-style numerical computing. It does not work if you need to write custom kernels.

Portable custom kernels via HIP: HIP code compiles for both NVIDIA and AMD. This is the practical portability option for kernel authors. It requires maintaining one codebase but accepting that architecture-specific optimizations (wavefront-64 vs warp-32) may need conditional compilation.

Platform-specific kernels with shared logic: The most common real-world pattern for performance-critical code. Write CUDA and HIP separately, share the algorithmic logic in header files or via an abstraction layer. More code, better per-platform performance.

Decision Guide

Choose CUDA if:

You need maximum ecosystem compatibility (third-party libraries, tutorials, hiring)
You are running on NVIDIA hardware and do not need AMD compatibility
You are working on ML training at scale (NCCL, cuDNN, cuBLAS are unmatched)
Your team already knows CUDA and the portability cost is not justified

Choose HIP/ROCm if:

You need code that runs on both NVIDIA and AMD hardware
Your deployment hardware is or will be AMD (MI300X, future Instinct parts)
The MI300X memory capacity or bandwidth is the decisive factor for your workload
You have a principled preference for open-source GPU software stacks
You are working with cloud providers who have made AMD competitive on price

Choose Metal if:

Your deployment target is Apple Silicon (macOS application, Apple-specific service)
You are developing and debugging on a MacBook and want GPU acceleration there
The power-efficiency of Apple Silicon matters for your use case
You are working with MLX and need to write custom kernels

None of the above if:

You can express your workload in terms of existing library operations (cuBLAS, MPS, rocBLAS). The best GPU kernel is often the one you did not write.

On the Fragmentation Tax

Every organization that uses more than one GPU platform pays a fragmentation tax. It manifests as:

Duplicate kernel implementations that must be maintained in sync
CI pipelines that test on multiple hardware targets
Engineers who know platform A but not platform B, creating knowledge silos
Library version skew where feature parity is not guaranteed across platforms

This tax is real, measurable in engineering time, and tends to grow as the codebase grows. Frameworks like PyTorch partially absorb the tax by presenting a unified API over multiple backends. But the moment you drop below the framework layer — which HPC work regularly requires — the tax reappears.

The honest answer to "which GPU platform should we use?" is often: pick one and commit, unless you have a specific, quantified reason to support multiple. The grass on the other side of the PCIe bus is sometimes genuinely greener, but moving there still costs moving expenses.

Keyboard shortcuts

High Performance Computing in 2026