Diffusion Models
How modern image, video, and audio generation actually works
Diffusion is the approach behind most serious image, video, and high-quality audio generation today. Strictly speaking it is a generative method rather than a network architecture, so it is not a direct rival to transformers, though the two have started borrowing from each other heavily. It is worth understanding on its own terms.
The Core Idea
Train a model to remove noise from data, one small step at a time. Then, to generate, start from pure noise and apply the model repeatedly, undoing the noising process step by step until you have a coherent sample.
Conceptually:
- Take a real image; gradually add Gaussian noise until it's pure static.
- Train a network to predict the noise that was added at each step.
- To generate, start with random noise and iteratively subtract the predicted noise.
It sounds wasteful, since every sample costs many passes through the network, and for a while it was. But it produces dramatically better samples than earlier generative approaches (GANs, VAEs) on most modalities.
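The noising and training steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a training script: the linear noise schedule, its default values, and the names make_alpha_bar, add_noise, and noise_prediction_loss are assumptions chosen for the example, and a random array stands in for a real image.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left after each step (assumed linear schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Jump straight to step t: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

def noise_prediction_loss(predicted_eps, true_eps):
    """Training target: mean squared error between predicted and actual noise."""
    return float(np.mean((predicted_eps - true_eps) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))  # stand-in for a real image
x_t, eps = add_noise(x0, t=999, alpha_bar=alpha_bar, rng=rng)

# At the last step almost none of the original signal remains.
print("signal fraction at final step:", alpha_bar[-1])
```

During training, a network would receive x_t and t and be scored with noise_prediction_loss against eps; the network itself is omitted here.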
Latent Diffusion
The crucial efficiency trick. Instead of running diffusion on raw pixels (expensive), encode the image into a smaller latent space with an autoencoder, run diffusion in that compressed space, then decode the result back to pixels. Stable Diffusion and most modern image models work this way.
The same trick generalizes: latent video diffusion, latent audio diffusion, latent 3D.
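To get a feel for the savings, here is a small arithmetic sketch. The 8x spatial downsampling and 4 latent channels match the original Stable Diffusion autoencoder; other models use different values, so treat them as example defaults. The function names are made up for illustration.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the latent tensor for an image of the given size."""
    return (latent_channels, height // downsample, width // downsample)

def compression_ratio(height, width, pixel_channels=3, downsample=8, latent_channels=4):
    """How many times fewer values the diffusion model has to process."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (height * width * pixel_channels) / (c * h * w)

print(latent_shape(512, 512))       # (4, 64, 64)
print(compression_ratio(512, 512))  # 48.0
```

Under these assumptions, every de-noising step works on 48 times fewer values than it would in pixel space.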
Conditioning
Pure diffusion would generate random images. Conditioning makes it controllable:
- Text conditioning — embed the prompt with a text encoder (CLIP, T5), feed it into the de-noising network at every step.
- Image conditioning — for inpainting, image-to-image, style transfer.
- Structural conditioning — depth maps, edge maps, pose skeletons (ControlNet-style).
- Reference conditioning — generate variations consistent with a reference (a product, a character).
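The text encoder matters as much as the diffusion model itself. The Imagen work found that scaling a T5 text encoder improved image fidelity and prompt alignment more than scaling the diffusion network did, and later models such as SD3 and Flux pair CLIP with T5 rather than relying on CLIP alone.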
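One common way to feed text embeddings into the de-noising network is cross-attention, where each image position looks up the prompt tokens most relevant to it. The sketch below assumes that mechanism; the dimensions, the random weights, and the function names are placeholders, and a real model would learn the projection matrices and use several attention heads.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_tokens, text_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: queries from the image, keys/values from the text."""
    q = image_tokens @ w_q  # (n_img, d)
    k = text_tokens @ w_k   # (n_txt, d)
    v = text_tokens @ w_v   # (n_txt, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_img, n_txt)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_img, d_txt, d = 32, 48, 16                     # illustrative sizes only
image_tokens = rng.standard_normal((64, d_img))  # e.g. an 8x8 latent grid, flattened
text_tokens = rng.standard_normal((77, d_txt))   # 77 is CLIP's context length
w_q = rng.standard_normal((d_img, d))
w_k = rng.standard_normal((d_txt, d))
w_v = rng.standard_normal((d_txt, d))

out, weights = cross_attention(image_tokens, text_tokens, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (64, 16) (64, 77)
```

Each row of weights sums to 1 and says how much that image position draws from each prompt token.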
Sampling and Schedulers
The de-noising loop has many design choices:
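- Number of steps: fewer is faster but usually lower quality. 20–50 is common; distilled models can run in 1–4.
- Sampler: DDPM, DDIM, Euler, DPM++, and dozens more. Each is a different numerical scheme for stepping from noise to data. Deterministic samplers such as DDIM and Euler solve an ODE, while stochastic ones such as ancestral DDPM sampling inject fresh noise at every step.
- Classifier-free guidance: run the network twice per step (with and without the condition), then extrapolate from the unconditional prediction toward the conditional one. Strengthens prompt adherence at some cost in diversity.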
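The guidance extrapolation is a one-line formula. A minimal sketch, with tiny arrays standing in for the two noise predictions and guided_noise as an illustrative name:

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Push the prediction away from the unconditional output, toward the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 0.0])  # prediction with an empty prompt
eps_cond = np.array([1.0, 2.0])    # prediction with the real prompt

print(guided_noise(eps_uncond, eps_cond, 1.0))  # [1. 2.]  plain conditional prediction
print(guided_noise(eps_uncond, eps_cond, 2.0))  # [2. 4.]  extrapolated past it
```

A scale of 1 reproduces the conditional prediction; larger values exaggerate whatever the prompt changed, which is where the diversity cost comes from.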
Distillation methods (LCM, adversarial distillation as in SDXL Turbo) collapse the loop into very few steps, enabling near-realtime generation. The distilled behavior is sometimes shipped as a LoRA adapter, as with LCM-LoRA.
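To make the step count concrete, here is a sketch of a deterministic DDIM-style loop that visits only num_steps of the training timesteps. It is a toy under stated assumptions: the linear schedule is an example choice, and oracle_noise is a hypothetical stand-in for a trained network that is exact only because the "dataset" is a single point, x_star.

```python
import numpy as np

def ddim_sample(predict_noise, shape, alpha_bar, num_steps, rng):
    """Deterministic DDIM-style sampling on an evenly strided timestep schedule."""
    total = len(alpha_bar)
    timesteps = np.linspace(total - 1, 0, num_steps).round().astype(int)
    x = rng.standard_normal(shape)  # start from pure noise
    for i, t in enumerate(timesteps):
        a_t = alpha_bar[t]
        # After the last visited timestep, jump to the clean sample (alpha_bar = 1).
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        eps = predict_noise(x, t)
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)       # current guess of the clean sample
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps  # re-noise to the next level
    return x

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed linear schedule
x_star = np.array([0.5, -0.25, 1.0])                         # toy one-point "dataset"

def oracle_noise(x, t):
    """Exact noise prediction when all data equals x_star (replaces a trained network)."""
    a = alpha_bar[t]
    return (x - np.sqrt(a) * x_star) / np.sqrt(1.0 - a)

sample = ddim_sample(oracle_noise, x_star.shape, alpha_bar, num_steps=4,
                     rng=np.random.default_rng(0))
print(sample)
```

With a perfect predictor the step count does not matter, which is why the toy recovers x_star even in one step. Real networks make errors that compound when steps are skipped, and distillation trains the model specifically to tolerate those large jumps.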
Diffusion vs Transformers
Transformers won language. Diffusion currently dominates pixels. But the line is blurring:
- Diffusion Transformers (DiT): replace the U-Net backbone of classical diffusion with a transformer. The basis of recent state-of-the-art image and video models (Sora, SD3, Flux).
- Flow matching: a generalization of diffusion with cleaner mathematical foundations and faster training. It is a training objective rather than an architecture, and it underlies several frontier models.
- Autoregressive image generation: using transformer-style next-token prediction on image tokens, an alternative path that has been getting more competitive.
The dominant pattern in 2025 frontier image and video models is "diffusion transformer with flow matching," but the field is still moving.
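For a sense of what flow matching trains and how it samples, here is a minimal sketch using straight-line paths between noise and data. The time convention (t = 0 is noise, t = 1 is data) is one of two in common use, and oracle_velocity is a hypothetical stand-in for a trained network that is exact only for a one-point dataset.

```python
import numpy as np

def flow_matching_pair(x_data, rng):
    """One training example: a point on the straight path and its target velocity."""
    x_noise = rng.standard_normal(x_data.shape)
    t = rng.uniform()
    x_t = (1.0 - t) * x_noise + t * x_data
    target_velocity = x_data - x_noise  # constant along a straight path
    return x_t, t, target_velocity

def euler_sample(velocity_fn, x_noise, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data) with Euler steps."""
    x = x_noise.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

x_star = np.array([0.5, -0.25, 1.0])  # toy one-point "dataset"

def oracle_velocity(x, t):
    """Exact velocity field when all data equals x_star (replaces a trained network)."""
    return (x_star - x) / (1.0 - t)

rng = np.random.default_rng(0)
x_t, t, v = flow_matching_pair(x_star, rng)
sample = euler_sample(oracle_velocity, rng.standard_normal(3), num_steps=8)
print(sample)
```

A network trained on such pairs regresses target_velocity from (x_t, t), and sampling is ordinary ODE integration, so the sampler choices from the previous section carry over.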
Where It's Used
- Image generation — Midjourney, DALL-E, Stable Diffusion family, Flux, Imagen.
- Video generation — Sora, Veo, Runway, Kling, Pika.
- Audio generation — voice cloning, music generation, sound effects.
- 3D generation — emerging; methods like Gaussian splatting often combined with diffusion priors.
- Scientific applications — molecule generation, protein structure prediction, fluid simulation.
What's Hard
- Controllability — getting exactly what you want, not just something good.
- Coherence over time — video generation has to maintain consistent characters and physics across frames.
- Evaluation — quality is subjective; standardized metrics (FID, CLIP score) only partially correlate with human judgment.
- Compute cost — training and serving diffusion models is expensive; per-output cost is much higher than text generation.
Why an Engineer Should Care
Even if you never train one, diffusion is the substrate of every "AI image" and "AI video" feature you'll integrate. Understanding the cost shape (roughly steps × model size × output resolution), the controllability limits, and the conditioning patterns is what separates a good integration from a frustrating one.
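A back-of-envelope version of that cost shape, as a sketch. The function names are made up, and the figure of 2 FLOPs per parameter per token is a common rule of thumb for dense transformer forward passes; it ignores attention's quadratic term, the autoencoder, and the text encoder. The model size and token count are hypothetical.

```python
def denoiser_calls(num_steps, use_cfg=True):
    """Network evaluations per output; classifier-free guidance doubles them."""
    return num_steps * (2 if use_cfg else 1)

def rough_flops(num_steps, params, tokens, use_cfg=True):
    """Very rough compute estimate (assumption: ~2 FLOPs per parameter per token)."""
    return denoiser_calls(num_steps, use_cfg) * 2 * params * tokens

params = 2_000_000_000  # hypothetical 2B-parameter denoiser
tokens = 4096           # hypothetical number of latent tokens per image

standard = rough_flops(30, params, tokens, use_cfg=True)
distilled = rough_flops(4, params, tokens, use_cfg=False)

print(denoiser_calls(30))    # 60
print(standard / distilled)  # 15.0
```

The absolute numbers are not to be trusted, but the ratio is the useful part: step count and guidance multiply, so a few-step model that runs without guidance can be an order of magnitude cheaper per output.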