Vectorization and SIMD
SIMD (Single Instruction, Multiple Data) is the most consistently underexploited source of performance in production HPC code. Unlike GPU parallelism — which requires rearchitecting algorithms — or OpenMP — which requires restructuring loops — vectorization often requires very little: write the code correctly, and the compiler does the work.
When it works, vectorization delivers 4–16× speedup on floating-point loops at zero programmer effort. When it does not work, engineers spend days hand-writing intrinsics and end up with code that is brittle, architecture-specific, and barely faster than the scalar version they were complaining about.
This chapter covers both paths.
What SIMD Is
Modern CPUs contain SIMD units — wide arithmetic units that operate on multiple data elements simultaneously. Instead of adding two floats, they add eight. Instead of multiplying one double-precision pair, they multiply four.
The relevant ISA extensions by platform:
| Extension | Width | Elements (float) | Platform |
|---|---|---|---|
| SSE2 | 128-bit | 4 × float32 | x86 (2000+) |
| AVX | 256-bit | 8 × float32 | x86 (2011+) |
| AVX2 | 256-bit | 8 × float32, integer | x86 (2013+) |
| AVX-512 | 512-bit | 16 × float32 | x86 (2017+, servers) |
| NEON | 128-bit | 4 × float32 | ARM (v7+) |
| SVE | Scalable | 4–64 × float32 | ARM (v8.2+) |
| VSX | 128-bit | 4 × float32 | POWER |
A Skylake-X Xeon supporting AVX-512 with two FMA units can execute 32 float32 FMA operations per cycle. At 3.5 GHz, that is 112 GFLOPS per core — before considering multiple cores. Leaving AVX-512 unused means leaving most of that peak performance on the table.
Auto-Vectorization
The compiler is the first line of vectorization. GCC, Clang, and MSVC all auto-vectorize loops that meet certain criteria. The criteria are simple to state and surprisingly easy to violate accidentally:
- Loop count must be known or estimable at compile time (or at loop entry for runtime vectorization)
- No loop-carried dependencies (iteration N depends on result of iteration N-1)
- No aliasing between input and output buffers
- Memory access must be contiguous (stride-1)
// Auto-vectorizes cleanly
void add(float* c, const float* a, const float* b, int n) {
for (int i = 0; i < n; i++)
c[i] = a[i] + b[i];
}
Check whether the compiler vectorized with:
# GCC
gcc -O3 -march=native -fopt-info-vec-optimized add.c
# Clang
clang -O3 -march=native -Rpass=loop-vectorize add.c
# Check the assembly
objdump -d add.o | grep -E "vmovaps|vaddps|vmulps|vfmadd"
SIMD instructions in the output (ymm/zmm registers, v-prefixed instructions) confirm vectorization.
Why Auto-Vectorization Fails
Aliasing: The compiler cannot prove that a, b, and c do not overlap. If they overlap, vectorizing would be incorrect. The restrict keyword is the fix:
// Without restrict: compiler may not vectorize (assumes aliasing possible)
void add(float* c, const float* a, const float* b, int n) { ... }
// With restrict: compiler knows pointers don't alias, can vectorize
void add(float* __restrict__ c,
const float* __restrict__ a,
const float* __restrict__ b, int n) { ... }
Loop-carried dependency: A recurrence where each iteration depends on the previous:
// Cannot vectorize: c[i] depends on c[i-1]
for (int i = 1; i < n; i++)
c[i] = c[i-1] * alpha + b[i];
This is a genuine algorithmic constraint; you cannot vectorize this with standard SIMD without restructuring the algorithm (some recurrences admit vectorized reformulations, but not trivially).
Non-contiguous access: Stride-2 or gather access prevents the compiler from generating efficient vector loads:
// Stride-2 access: difficult to auto-vectorize efficiently
for (int i = 0; i < n; i += 2)
result += a[i] * b[i];
Restructure data to be contiguous where possible. Separate arrays (structure of arrays) over interleaved layouts (array of structures) is the standard recommendation for SIMD-friendly data.
Function calls: Most function calls prevent vectorization because the compiler cannot inspect their internals. Mark small functions inline or __attribute__((always_inline)).
Enabling Vectorization
Compiler Flags
# Enable AVX-512 on a server with Skylake/Cascade Lake Xeon
gcc -O3 -march=native -funroll-loops -ftree-vectorize
# Target a specific architecture
gcc -O3 -march=skylake-avx512
# Report missed vectorizations
gcc -O3 -march=native -fopt-info-vec-missed
-march=native enables all instruction set extensions of the current CPU. Do not use it for binaries deployed to heterogeneous clusters — use the oldest common denominator (-march=znver3 for AMD Zen 3, -march=icelake-server for Intel Icelake).
Compiler Hints
// Hint that the pointer is aligned (enables aligned loads/stores)
float* a = (float*)aligned_alloc(64, n * sizeof(float));
__builtin_assume_aligned(a, 64);
// Loop vectorization hints (GCC)
#pragma GCC ivdep // Ignore vector dependencies (use when safe)
#pragma GCC unroll 4 // Unroll 4 times before vectorizing
// Clang
#pragma clang loop vectorize(enable)
#pragma clang loop interleave_count(4)
Intel Intrinsics: When You Need Control
When the compiler fails to vectorize efficiently, or when you need to guarantee specific instruction sequences, intrinsics provide direct access to SIMD instructions.
AVX-512 example: computing the sum of squares of a float array.
#include <immintrin.h> // AVX-512 intrinsics
float sum_of_squares_avx512(const float* data, int n) {
__m512 acc = _mm512_setzero_ps(); // 16 × float32, initialized to 0
int i = 0;
// Main loop: 16 elements per iteration
for (; i <= n - 16; i += 16) {
__m512 v = _mm512_loadu_ps(&data[i]); // Load 16 floats
acc = _mm512_fmadd_ps(v, v, acc); // acc += v * v (FMA)
}
// Horizontal sum of the 16 accumulators
float result = _mm512_reduce_add_ps(acc);
// Scalar tail for remaining elements
for (; i < n; i++)
result += data[i] * data[i];
return result;
}
Key intrinsic naming conventions (Intel):
_mm_prefix: 128-bit (SSE)_mm256_prefix: 256-bit (AVX/AVX2)_mm512_prefix: 512-bit (AVX-512)_pssuffix: packed single (float32)_pdsuffix: packed double (float64)_epi32suffix: packed 32-bit integers
ARM NEON Intrinsics
#include <arm_neon.h> // ARM NEON intrinsics
float dot_product_neon(const float* a, const float* b, int n) {
float32x4_t acc = vdupq_n_f32(0.0f); // 4 × float32, all zeros
int i = 0;
for (; i <= n - 4; i += 4) {
float32x4_t va = vld1q_f32(&a[i]); // Load 4 floats
float32x4_t vb = vld1q_f32(&b[i]); // Load 4 floats
acc = vfmaq_f32(acc, va, vb); // acc += va * vb (FMA)
}
// Horizontal sum
float result = vaddvq_f32(acc);
// Scalar tail
for (; i < n; i++)
result += a[i] * b[i];
return result;
}
NEON is available on all AArch64 processors: Apple Silicon, AWS Graviton, Ampere Altra. It delivers 4 × float32 per cycle with FMA, substantially less than AVX-512 but with lower power consumption and better portability in the ARM world.
ARM SVE: Scalable Vectors
SVE (Scalable Vector Extension) is ARM's answer to AVX-512, with a twist: the vector width is implementation-defined and discoverable at runtime. Code written for SVE runs on 128-bit implementations (4 × float32) and 512-bit implementations (16 × float32) without recompilation.
#include <arm_sve.h>
float sum_sve(const float* data, int n) {
svfloat32_t acc = svdup_f32(0.0f);
svbool_t pg; // Predicate (mask)
for (int i = 0; i < n; i += svcntw()) { // svcntw() = vector length in uint32s
pg = svwhilelt_b32(i, n); // Predicate: active lanes where i+lane < n
svfloat32_t v = svld1_f32(pg, &data[i]); // Masked load
acc = svmla_f32_m(pg, acc, v, v); // acc += v * v (masked FMA)
}
return svaddv_f32(svptrue_b32(), acc); // Horizontal sum
}
SVE's predicated execution handles loop tails naturally — no separate scalar tail needed. This makes the code cleaner at the cost of a more complex programming model.
The Roofline Model for CPUs
The same arithmetic intensity analysis from the GPU fundamentals chapter applies to CPUs.
For a Cascade Lake Xeon at 3.5 GHz with AVX-512 and two FMA units:
- Peak compute: 3.5 GHz × 16 (AVX-512 width) × 2 (FMA) × 2 (FMAs/cycle) = 224 GFLOPS/core
- Peak bandwidth (DDR4-2933, 6-channel): ~140 GB/s per socket
Crossover (roofline knee): 224 / 140 ≈ 1.6 FLOP/byte
Operations with arithmetic intensity above 1.6 are compute-bound and benefit from SIMD. Operations below 1.6 are memory-bandwidth-bound, and SIMD helps only insofar as it reduces the number of load/store instructions that touch the bus.
This is why streaming operations (memcpy, simple element-wise ops) do not see large speedups from SIMD on large arrays: they are already memory-bound, and wider SIMD just saturates the memory bus faster without changing the ceiling.
Practical Vectorization Workflow
-
Measure first. Profile to confirm the loop is actually hot. Vectorizing a function that takes 0.1% of runtime provides 0.1% improvement in the best case.
-
Check compiler output. Use
-fopt-info-vec-optimizedand-fopt-info-vec-missed. Read the disassembly. Confirm you are seeing ymm/zmm registers in hot loops. -
Fix obvious blockers. Add
restrict, restructure data layouts, eliminate function calls in loops. -
Enable the right flags.
-O3 -march=nativeon development hardware; target a specific-marchfor deployment. -
Use intrinsics only when necessary. If the compiler generates correct SIMD with decent performance, stop. Intrinsics code is harder to maintain and architecture-specific.
-
Test on target hardware. Performance of SIMD code varies significantly between CPU generations. AVX-512 downclocks on some Intel parts (frequency throttling when AVX-512 is active). Measure on the hardware that matters.
AVX-512 and the Frequency Throttling Problem
A notable gotcha on Intel Skylake and Cascade Lake Xeons: executing AVX-512 instructions causes the CPU to reduce its clock frequency by 100–300 MHz. This "license 2" frequency state is designed to manage power consumption during wide SIMD execution.
The implications:
- A loop that uses AVX-512 may run slower than AVX2 if the frequency reduction offsets the wider SIMD width
- The throttle applies to the entire core for a settling period (~1 ms) after AVX-512 execution — affecting non-SIMD code that runs afterward
- This is specific to older Intel microarchitectures; Ice Lake and later have significantly reduced throttling; AMD does not throttle on AVX-512
Measurement is mandatory. Do not assume AVX-512 is always faster than AVX2 on Intel hardware without benchmarking on the target CPU.
Integration with HPC Libraries
Many HPC libraries handle SIMD internally and expose architecture-agnostic interfaces:
- FFTW: Detects and uses SSE2/AVX/AVX-512/NEON at runtime based on the CPU
- OpenBLAS/MKL: Architecture-optimized BLAS with runtime dispatch
- Eigen: C++ template library that generates SIMD code; use
-DEIGEN_VECTORIZE_AVX512for AVX-512 - Highway: Google's portable SIMD library; same code, multiple targets
Highway is worth highlighting for cross-architecture portability:
#include "highway/highway.h"
namespace hn = hwy::HWY_NAMESPACE;
// This code compiles for SSE2, AVX2, AVX-512, NEON, SVE, WASM SIMD
// The right version is selected at runtime
HWY_ATTR void add(float* HWY_RESTRICT c,
const float* HWY_RESTRICT a,
const float* HWY_RESTRICT b, int n) {
const hn::ScalableTag<float> d;
const int lanes = hn::Lanes(d);
int i = 0;
for (; i + lanes <= n; i += lanes) {
auto va = hn::Load(d, a + i);
auto vb = hn::Load(d, b + i);
hn::Store(hn::Add(va, vb), d, c + i);
}
// Scalar tail
for (; i < n; i++) c[i] = a[i] + b[i];
}
Highway's abstraction works well. The generated code quality is competitive with hand-written intrinsics in most cases. If you are writing a library that needs to run efficiently across x86, ARM, and RISC-V, Highway is the right tool.
Summary
Vectorization is free performance if you write the right code and compile with the right flags. The steps in order:
- Structure data for contiguous access (SoA over AoS)
- Annotate away aliasing with
restrict - Compile with
-O3 -march=native - Verify with disassembly or compiler reports
- Reach for intrinsics only when the compiler falls short and the profiler shows it matters
The gap between scalar and fully vectorized code can be 4–16×. The engineer who ignores vectorization and instead buys a bigger GPU is wasting money that the compiler would have given them for free.