GPUs & Accelerators
The hardware that makes modern AI possible — and the constraints it puts on the software
Almost every interesting AI workload runs on accelerators: GPUs, TPUs, or AI-specific ASICs. Even when you're calling an API and never touching the hardware yourself, knowing what's underneath explains why pricing, latency, and capacity behave the way they do.
Why GPUs
Modern AI is dominated by large matrix multiplications. A CPU spreads that work across a few dozen cores at most; a GPU spreads it across thousands of simpler cores running in parallel. The architecture that originally accelerated graphics (many small cores applying the same operation to different data) turns out to be exactly what neural networks need.
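To make the parallelism concrete, here is a minimal pure-Python sketch (the function name `matmul_by_elements` is made up for illustration). It is deliberately slow; the point is only that every output element is an independent dot product, so on a GPU each one could be handed to its own thread instead of a loop iteration.

```python
from typing import List

Matrix = List[List[float]]


def matmul_by_elements(a: Matrix, b: Matrix) -> Matrix:
    """Multiply a (m x k) by b (k x n), one output element at a time."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    out = [[0.0] * n for _ in range(m)]
    # Each (i, j) reads only row i of a and column j of b, and writes only
    # out[i][j]. No iteration depends on another, so all m * n of them
    # could run at the same time on parallel hardware.
    for i in range(m):
        for j in range(n):
            out[i][j] = sum(a[i][p] * b[p][j] for p in range(k))
    return out


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
result = matmul_by_elements(a, b)
print(result)  # [[19.0, 22.0], [43.0, 50.0]]
```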
What Matters in a GPU
Three numbers usually decide whether a model fits and runs well:
- Memory (VRAM) — must hold the weights, the activations, and the KV cache. The bottleneck for inference more often than compute.
- Memory bandwidth — how fast you can move weights from VRAM to the compute units. Token-by-token generation with large models is usually bandwidth-bound, not compute-bound, because every generated token requires reading the weights again.
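- FLOPS — compute throughput in operations per second. The quoted figure depends on the precision (FP16, BF16, FP8, INT8, INT4), so compare like with like; the integer formats are strictly integer operations and are often quoted as TOPS.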
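A rough way to check whether a model fits is to add up the weights and the KV cache. The sketch below is a back-of-the-envelope estimate, not a sizing tool: the model shape (70B parameters, 80 layers, 8 KV heads, head dimension 128) and the 80 GB card are illustrative assumptions, and activations and framework overhead are ignored.

```python
def weights_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int,
                bytes_per_value: int = 2) -> float:
    """KV cache size in GB.

    Every layer stores one key and one value vector (hence the factor 2)
    per KV head, per token, per sequence in the batch.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return values * bytes_per_value / 1e9


# Hypothetical 70B-parameter model served in FP16 (2 bytes per parameter).
w = weights_gb(70e9, 2)
kv = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128,
                 seq_len=8192, batch_size=8)
total = w + kv
vram_gb = 80  # assumed capacity of a single accelerator

print(f"weights {w:.1f} GB + KV cache {kv:.1f} GB = {total:.1f} GB")
print("fits on one card" if total <= vram_gb else "needs more than one card")
```

Under these assumptions the weights alone exceed a single 80 GB card, which is why quantization (fewer bytes per parameter) or splitting the model across several GPUs comes up so quickly.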
Compute matters most for training. For inference, memory and bandwidth dominate.
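The inference claim can be sanity-checked with two upper bounds for single-sequence decoding: one set by how fast the weights can be read, one set by raw compute. This is a sketch under stated assumptions: the accelerator figures (2 TB/s, 400 TFLOPS) are round illustrative numbers rather than a vendor spec, and the estimate of roughly 2 FLOPs per parameter per token is a common approximation that ignores attention over the context.

```python
def bandwidth_bound_tokens_per_s(weight_bytes: float,
                                 bandwidth_bytes_per_s: float) -> float:
    """Decoding one token reads every weight once, so bandwidth caps speed."""
    return bandwidth_bytes_per_s / weight_bytes


def compute_bound_tokens_per_s(n_params: float, flops_per_s: float) -> float:
    """Approximation: ~2 FLOPs (multiply + add) per parameter per token."""
    return flops_per_s / (2 * n_params)


n_params = 7e9                 # hypothetical 7B-parameter model
weight_bytes = n_params * 2    # FP16: 2 bytes per parameter
bandwidth = 2e12               # assumed 2 TB/s memory bandwidth
flops = 400e12                 # assumed 400 TFLOPS at FP16

bw_bound = bandwidth_bound_tokens_per_s(weight_bytes, bandwidth)
compute_bound = compute_bound_tokens_per_s(n_params, flops)

print(f"bandwidth-bound limit: {bw_bound:8.1f} tokens/s")
print(f"compute-bound limit:   {compute_bound:8.1f} tokens/s")
print(f"compute headroom:      {compute_bound / bw_bound:.0f}x")
```

With these numbers the bandwidth limit is about 200 times lower than the compute limit, so at batch size 1 the compute units mostly wait on memory.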
The NVIDIA Lineup (and Why Everyone Talks About It)
- A100 — the workhorse of the previous generation. Still common in production.
- H100 — the Hopper-generation high end, still widely used for training and serving frontier models. Adds FP8 support and a Transformer Engine aimed at transformer workloads.
- H200 — H100 with more and faster memory, specifically aimed at large-context inference.
- B100 / B200 / GB200 — Blackwell, the current top of the line.
- L40S / L4 — inference-focused, lower power, lower price.
NVIDIA dominates because of CUDA, the software ecosystem (cuDNN, TensorRT, cuBLAS), and the gravitational pull of every framework being optimized for it first.
The Alternatives
- AMD MI300 / MI325 — viable alternative, ROCm software stack catching up. Used at significant scale by some labs and clouds.
- Google TPU v5p / v5e / Trillium — strong on training, available primarily through GCP.
- AWS Trainium / Inferentia — AWS-specific, cost-competitive for the workloads they target.
- Apple Silicon (M-series) — surprisingly capable for on-device inference, MLX framework.
- Specialized AI ASICs — Groq, Cerebras, SambaNova, Tenstorrent. Often dramatically faster on their target workloads, which for most of them means inference.
Inference vs Training Hardware
The optimal hardware differs:
- Training — needs huge memory, fast interconnect (NVLink, InfiniBand), prefers raw FLOPs. Concentrated in specialized clusters.
- Inference — needs memory bandwidth, low latency, often fine with smaller GPUs and weaker interconnect. Can run distributed across cheaper hardware.
Mismatching hardware and workload (training on inference cards, or serving on training-class clusters) wastes a lot of money quickly.
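One reason the two profiles diverge is how much state each has to keep in memory. The sketch below uses a common rule of thumb for mixed-precision training with the Adam optimizer (about 16 bytes per parameter before activations) and compares it with FP16 inference weights. The 7B model size is an illustrative assumption, and real setups vary with the optimizer, the precision, and sharding.

```python
# Assumed per-parameter byte counts for mixed-precision Adam training.
TRAINING_BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}
INFERENCE_BYTES_PER_PARAM = 2  # FP16 weights only


def training_state_gb(n_params: float) -> float:
    """Weights, gradients, and optimizer state in GB; activations excluded."""
    return n_params * sum(TRAINING_BYTES_PER_PARAM.values()) / 1e9


def inference_weights_gb(n_params: float) -> float:
    """FP16 weights in GB; KV cache and activations excluded."""
    return n_params * INFERENCE_BYTES_PER_PARAM / 1e9


n_params = 7e9  # hypothetical 7B-parameter model
train_gb = training_state_gb(n_params)
infer_gb = inference_weights_gb(n_params)

print(f"training state:    {train_gb:.0f} GB")
print(f"inference weights: {infer_gb:.0f} GB")
print(f"ratio:             {train_gb / infer_gb:.0f}x")
```

Under these assumptions the same model needs roughly eight times more memory to train than to serve, which is why training spills across many tightly interconnected GPUs while inference can often stay on one smaller card.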
Capacity Is Real
Frontier GPUs have been supply-constrained for years. Practical implications:
- Reservations matter. Spot capacity for H100s is unreliable.
- Multi-cloud / multi-region strategies are common to absorb capacity gaps.
- Older generations are still useful. A100s and even V100s serve real production workloads.
- API providers eat the capacity problem for you. That's part of what you're paying for.
When to Self-Host
Self-hosting open-weights models on your own GPUs makes sense when:
- Predictable, high-throughput workloads make API pricing painful.
- Data residency or compliance requires it.
- You need a custom model the API providers don't offer.
- You want sub-100ms latency that providers can't guarantee.
Self-hosting doesn't make sense when traffic is bursty, your team is small, or your workload fits comfortably in API pricing. The operational tax is real.
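A quick break-even estimate helps decide which side of that line a workload falls on. Every number below is a placeholder rather than a real quote: the GPU hourly rate, the API price, and the monthly operations overhead (the "operational tax") should be replaced with your own figures.

```python
HOURS_PER_MONTH = 730
SECONDS_PER_MONTH = HOURS_PER_MONTH * 3600


def api_cost_per_month(tokens_per_month: float,
                       price_per_million_tokens: float) -> float:
    return tokens_per_month / 1e6 * price_per_million_tokens


def self_host_cost_per_month(n_gpus: int, gpu_hourly_rate: float,
                             ops_overhead_per_month: float) -> float:
    """Reserved GPUs bill around the clock, whether or not they are busy."""
    return n_gpus * gpu_hourly_rate * HOURS_PER_MONTH + ops_overhead_per_month


def break_even_tokens_per_month(self_host_cost: float,
                                price_per_million_tokens: float) -> float:
    """Monthly token volume at which the API bill equals the self-host bill."""
    return self_host_cost / price_per_million_tokens * 1e6


# Placeholder inputs: 4 GPUs at $2.50/hour, $5,000/month of engineering
# and operations time, and an API priced at $1.00 per million tokens.
self_host = self_host_cost_per_month(4, 2.50, 5000.0)
break_even = break_even_tokens_per_month(self_host, 1.00)
required_tokens_per_s = break_even / SECONDS_PER_MONTH

print(f"self-host cost:  ${self_host:,.0f} per month")
print(f"break-even:      {break_even / 1e9:.1f}B tokens per month")
print(f"that is a sustained {required_tokens_per_s:,.0f} tokens/s")
```

The last line is the useful check: self-hosting only pays off if the traffic actually sustains that rate and the cluster can deliver it. Bursty traffic leaves the GPUs idle while the bill keeps running.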
The Software Layer Matters
Two clusters with the same hardware can have very different effective throughput depending on the inference engine, batching strategy, quantization, and caching. Hardware is necessary; it isn't sufficient. (See Inference Engines.)
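Batching is the simplest illustration. In the toy model below, one decode step reads the weights once and serves every sequence in the batch, so the step time is the larger of the memory time and the compute time. The hardware figures are the same kind of round, assumed numbers as elsewhere (2 TB/s, 400 TFLOPS, a 7B FP16 model), and KV cache traffic and scheduling overhead are ignored.

```python
N_PARAMS = 7e9               # hypothetical 7B-parameter model
WEIGHT_BYTES = N_PARAMS * 2  # FP16 weights
BANDWIDTH = 2e12             # assumed 2 TB/s
FLOPS = 400e12               # assumed 400 TFLOPS


def decode_throughput(batch_size: int) -> float:
    """Tokens per second across the whole batch under a simple roofline."""
    memory_time = WEIGHT_BYTES / BANDWIDTH             # weights read once per step
    compute_time = batch_size * 2 * N_PARAMS / FLOPS   # ~2 FLOPs per param per token
    step_time = max(memory_time, compute_time)
    return batch_size / step_time


for batch in (1, 32, 512):
    print(f"batch {batch:3d}: {decode_throughput(batch):9.0f} tokens/s")
```

In this model throughput grows linearly with batch size until compute becomes the limit (around batch 200 with these numbers) and then flattens. An engine that keeps batches full can therefore deliver many times the throughput of one that serves requests one at a time on the same hardware.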