Steven's Knowledge

Deep Learning & Neural Networks

How neural networks actually learn, and which ideas you carry into the era of foundation models

Deep learning is what made the current AI moment possible. The architectures changed, the scale changed, but the core mechanics — layers of differentiable functions trained by gradient descent — are the same building blocks used inside every transformer.

The Anatomy of a Neural Network

A neural network is a stack of layers, each one a parameterized function. Data flows forward through the layers; gradients flow backward to update the parameters. The key components:

  • Linear (dense) layers: affine maps of the form y = Wx + b, implemented as matrix multiplications. The workhorse.
  • Activation functions: non-linearities like ReLU, GELU, SiLU. Without them, any stack of linear layers collapses to a single linear map.
  • Normalization: LayerNorm, BatchNorm. Stabilizes training by rescaling activations to roughly zero mean and unit variance (LayerNorm across the features of each example, BatchNorm across the batch for each feature).
  • Loss function: the scalar the optimizer is trying to minimize.
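
The components above can be sketched in a few lines of plain Python. This is a toy illustration, not how a framework implements them: the helper names (linear, relu, layer_norm, mse, matmul) and all the weights are assumptions made up for the example, and the LayerNorm here omits the learnable scale and shift. The last few lines check the claim about activations: two linear layers with nothing in between give the same output as one linear layer with the merged weight matrix.

```python
# Toy forward pass in plain Python. Every name and number is illustrative.

def linear(x, W, b):
    """Dense layer: y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(x):
    """Element-wise non-linearity."""
    return [max(0.0, v) for v in x]

def layer_norm(x, eps=1e-5):
    """Rescale one example's features to zero mean and unit variance
    (no learnable scale/shift in this sketch)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

def mse(pred, target):
    """Loss: the single scalar an optimizer would try to minimize."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def matmul(A, B):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

W1, b1 = [[1.0, -2.0], [0.5, 1.0]], [0.0, 0.0]
W2, b2 = [[1.0, 1.0]], [0.0]
x, target = [1.0, 2.0], [0.0]

# Forward pass: linear -> normalize -> activate -> linear -> loss.
hidden = relu(layer_norm(linear(x, W1, b1)))
loss = mse(linear(hidden, W2, b2), target)

# Without an activation, stacking two linear layers is the same as
# applying one linear layer with weights W2 @ W1 (biases are zero here).
stacked = linear(linear(x, W1, b1), W2, b2)
merged = linear(x, matmul(W2, W1), [0.0])
print(loss, stacked, merged)
```

Dropping the ReLU is what makes the merge possible; with it in place, no single weight matrix reproduces the two-layer output for all inputs.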

Backpropagation in One Paragraph

Backpropagation is the chain rule applied mechanically to a computation graph. The forward pass computes the loss; the backward pass walks the graph in reverse, multiplying local derivatives along each path and summing over paths to get the gradient of the loss with respect to every parameter. Frameworks like PyTorch and JAX automate this: you write the forward pass and get the gradients for free.
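
To make the chain rule concrete, here is a hand-written backward pass for a one-neuron graph, checked against a finite-difference estimate. It is a minimal sketch under stated assumptions: the graph (linear, tanh, squared error), the function names forward_backward and numeric_grad, and the input values are all invented for illustration. A real framework records the graph and does this bookkeeping for you.

```python
import math

def forward_backward(w, b, x, y):
    """Return the loss and its gradients w.r.t. w and b for one neuron."""
    # Forward pass: keep the intermediates, the backward pass needs them.
    z = w * x + b            # linear
    a = math.tanh(z)         # activation
    loss = (a - y) ** 2      # squared error

    # Backward pass: walk the graph in reverse, multiplying local derivatives.
    dloss_da = 2.0 * (a - y)         # d(loss)/d(a)
    da_dz = 1.0 - a * a              # d(tanh z)/d(z)
    dloss_dz = dloss_da * da_dz      # chain rule
    dloss_dw = dloss_dz * x          # dz/dw = x
    dloss_db = dloss_dz              # dz/db = 1
    return loss, dloss_dw, dloss_db

def numeric_grad(f, v, h=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(v + h) - f(v - h)) / (2.0 * h)

w, b, x, y = 0.5, -0.1, 2.0, 0.3
loss, grad_w, grad_b = forward_backward(w, b, x, y)
check_w = numeric_grad(lambda v: forward_backward(v, b, x, y)[0], w)
check_b = numeric_grad(lambda v: forward_backward(w, v, x, y)[0], b)
print(grad_w, check_w)
print(grad_b, check_b)
```

The analytic and numeric gradients should agree to several decimal places; a mismatch in a check like this usually points to a wrong local derivative.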

Optimizers

The gradient gives you a direction. Optimizers decide how to take the step:

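  • SGD with momentum: simple, surprisingly effective for vision.
  • Adam / AdamW: adaptive per-parameter learning rates, scaled by running estimates of each gradient's mean and squared magnitude. AdamW applies weight decay separately from the gradient update. The default for transformers.
  • Learning rate schedules: warmup, cosine decay. Often matter as much as the choice of optimizer.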
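
The update rules in the list above fit in a few lines each. The sketch below works on a single scalar parameter and a toy loss (p - 3)^2; the function names, hyperparameter values, and the linear-warmup-then-cosine shape of the schedule are assumptions chosen for illustration, not the defaults of any particular library.

```python
import math

def sgd_momentum_step(p, grad, velocity, lr, momentum=0.9):
    """One SGD-with-momentum update on a scalar parameter."""
    velocity = momentum * velocity + grad        # accumulate past gradients
    return p - lr * velocity, velocity

def adamw_step(p, grad, m, v, t, lr, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One AdamW update on a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)               # bias correction for early steps
    v_hat = v / (1.0 - beta2 ** t)
    p = p - lr * weight_decay * p                # decay applied apart from the gradient
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

def warmup_cosine(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def grad_fn(p):
    """Gradient of the toy loss (p - 3)^2."""
    return 2.0 * (p - 3.0)

total_steps, warmup_steps, peak_lr = 300, 10, 0.1
p_sgd, velocity = 0.0, 0.0
p_adam, m, v = 0.0, 0.0, 0.0
for step in range(total_steps):
    lr = warmup_cosine(step, total_steps, warmup_steps, peak_lr)
    p_sgd, velocity = sgd_momentum_step(p_sgd, grad_fn(p_sgd), velocity, lr)
    p_adam, m, v = adamw_step(p_adam, grad_fn(p_adam), m, v, step + 1, lr)
print(p_sgd, p_adam)
```

Weight decay is left at 0.0 in the demo so that both optimizers aim at the same minimum, p = 3. One detail worth noticing: on its first step Adam moves by roughly lr regardless of how large the gradient is, because the gradient is divided by its own magnitude.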

Why Depth Helps

Each additional layer can compose features from the layer below. Early layers learn low-level patterns (edges, n-grams); deeper layers compose them into higher-level structure (objects, syntax, intent). Depth without the right architecture and normalization is unstable to train; the history of deep learning is largely the story of finding ways to train deeper networks reliably.
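
One way to see why naive depth is unstable: the backward pass multiplies one local derivative per layer, so the gradient reaching the first layer scales like a product over the whole stack. The sketch below is a deliberately oversimplified scalar model. It assumes every layer contributes the same local derivative (0.8 or 1.2, numbers picked purely for illustration), which real networks do not, but it shows how quickly the product shrinks or blows up as depth grows.

```python
def gradient_through_depth(local_derivative, depth):
    """Product of identical per-layer local derivatives (toy scalar model)."""
    grad = 1.0
    for _ in range(depth):
        grad *= local_derivative
    return grad

results = {}
for depth in (1, 10, 50):
    shrinking = gradient_through_depth(0.8, depth)   # each layer damps the signal
    growing = gradient_through_depth(1.2, depth)     # each layer amplifies it
    results[depth] = (shrinking, growing)
    print(depth, shrinking, growing)
```

Keeping that per-layer factor close to 1 is, loosely speaking, what careful architecture and normalization choices are for.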

Carrying These Ideas Forward

Transformers, diffusion models, and everything else built on top still use these same primitives. Understanding gradient flow, normalization, and optimizer behavior is what lets you reason about why a model trains or refuses to.
