How to decide what to test, at what level, and how to keep the suite fast and trustworthy as the codebase grows

Testing Strategy

Writing tests is the easy part. Deciding what to test, at what level, and what to do when the suite gets slow or starts lying is the part that takes judgment. A team without a strategy ends up with one of two outcomes: a suite that takes thirty minutes and still misses regressions, or a suite that everyone is afraid to touch because it breaks unpredictably on every change.

The strategy is not "test more." It is "test the right things, at the right altitude, with tests that earn their place."

The Test Pyramid

A useful mental model, attributed to Mike Cohn:

            /\
           /  \         End-to-End
          /----\        Few, slow, brittle, real-world coverage
         /      \
        /--------\      Integration
       /          \     More, medium speed, exercise wiring
      /------------\
     /              \   Unit
    /----------------\  Many, fast, exercise logic

The shape matters. The pyramid says: the bulk of the suite should be unit tests; integration tests fill the gaps unit tests cannot reach; end-to-end tests are the thin top that proves the whole thing actually runs.

Two common deviations:

Ice-cream cone. Few unit tests, many integration tests, even more E2E. The suite is slow, flaky, and every failure is hard to diagnose.
Hourglass. Many unit tests, almost no integration tests, many E2E tests. The integration layer is untested; defects there are caught only when the full stack runs.

A pyramid shape is the default for a reason. Deviations should be deliberate.

What Each Level Is For

Unit tests

A unit test exercises a single function, class, or small module in isolation, with its dependencies replaced by test doubles where they cross meaningful boundaries.

What unit tests are good at:

Logic with many cases (branching, edge values, error handling).
Pure transformations (validation, formatting, calculation).
Algorithms whose correctness is non-obvious.
Behavior that should not change accidentally.

What unit tests are bad at:

Catching defects in wiring — wrong dependency injected, wrong endpoint URL, wrong field name on a DTO.
Catching defects in external systems — schema drift, version skew, authentication failures.
Catching defects that depend on real time — race conditions, timeouts, retries.

In a codebase with good unit coverage, many of the defects that still reach production hide in the things unit tests are bad at. That is what the higher levels exist to catch.

Integration tests

An integration test exercises several components together, with at most a controlled set of substitutions. It answers: do these parts work when wired up?

Typical shapes:

Module + database. Repository code with a real database (often in a container or in memory). Catches schema mismatches, migration ordering, transaction semantics.
Handler + service + repository. A request enters the application boundary; the response is asserted. External services are stubbed.
Component + state + reducer. A UI component is rendered with its state container; events are dispatched; the rendered output is asserted.

The defining property: less is faked than in a unit test, but more is controlled than in an end-to-end test.
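As a rough sketch of the handler + service + repository shape: the handler, service, and repository below are real code wired together, and only the external payment gateway is stubbed. Every name here (OrderService, PaymentGateway, handlePlaceOrder, the status codes) is a hypothetical chosen for illustration, and the repository is an in-memory stand-in rather than a real database.

```typescript
// Hypothetical types and names, for illustration only.
type Order = { id: number; total: number };
type HttpResponse = { status: number; body: unknown };

// The external service: the only part that gets stubbed.
interface PaymentGateway {
  charge(amount: number): boolean;
}

// In-memory stand-in for the repository layer.
class InMemoryOrderRepository {
  private rows = new Map<number, Order>();
  save(order: Order): void {
    this.rows.set(order.id, { ...order });
  }
  findById(id: number): Order | undefined {
    return this.rows.get(id);
  }
}

class OrderService {
  private repo: InMemoryOrderRepository;
  private payments: PaymentGateway;
  constructor(repo: InMemoryOrderRepository, payments: PaymentGateway) {
    this.repo = repo;
    this.payments = payments;
  }
  placeOrder(order: Order): boolean {
    if (!this.payments.charge(order.total)) return false;
    this.repo.save(order);
    return true;
  }
}

// The application boundary: request in, response out.
function handlePlaceOrder(service: OrderService, body: Order): HttpResponse {
  if (body.total <= 0) return { status: 400, body: { error: "invalid total" } };
  return service.placeOrder(body)
    ? { status: 201, body: { id: body.id } }
    : { status: 402, body: { error: "payment declined" } };
}

// Sends one request through the wired-up stack and reports what happened.
function placeOrderThroughStack(gateway: PaymentGateway, body: Order) {
  const repo = new InMemoryOrderRepository();
  const service = new OrderService(repo, gateway);
  const response = handlePlaceOrder(service, body);
  return { status: response.status, stored: repo.findById(body.id) };
}
```

A test at this level would call placeOrderThroughStack with an accepting stub ({ charge: () => true }) and assert both the 201 response and the stored row, then repeat with a declining stub and assert that nothing was stored.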

End-to-end tests

An end-to-end test exercises the whole system as a user would, against a deployed (or locally booted) full stack.

What E2E tests prove:

The full path works at least once.
Critical user journeys do what they claim.
The deploy is not broken.

What E2E tests are bad at:

Coverage. Every branch in business logic at E2E level is a maintenance disaster.
Speed. Even a fast E2E test is typically orders of magnitude slower than a unit test.
Determinism. Real systems have real flakiness; the test suite inherits it.

A handful of E2E tests covering the most important journeys earns its place. Hundreds do not.

The Decision: What Goes Where

A useful heuristic: test at the lowest level that can prove the property.

A formatting function? Unit.
The wiring between an HTTP handler and a database? Integration.
"A user can complete checkout"? E2E.

Two anti-patterns appear when this is violated:

Pushing too low. Unit tests that mock so much they only verify that the mocks were called. Nothing about the real behavior is exercised.
Pushing too high. E2E tests for every business rule. The suite slows to a crawl; every infra hiccup turns into a triage session.

When unsure: write it at the lowest level that fails for the right reason.

Test Doubles: The Vocabulary

The word "mock" is overloaded. The real vocabulary (Gerard Meszaros, xUnit Test Patterns):

Type	What it is	When to use
Dummy	Passed but never used; fills a parameter slot.	A constructor needs a logger you do not exercise.
Stub	Returns canned answers to calls.	The code under test asks "what time is it?" — stub returns a fixed value.
Spy	A stub that also records how it was called.	You need to assert "send was called once with this payload."
Mock	A pre-programmed object with expectations; fails if expectations are unmet.	Rarely needed; usually a spy with an assertion is clearer.
Fake	A working implementation, simpler than production.	In-memory database, in-memory cache.
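To make the vocabulary concrete, here is a minimal hand-rolled sketch of four of the five doubles (the mock is left out, since a spy plus an assertion usually covers it). The interfaces and the sendDailyGreeting function are hypothetical, invented for this illustration.

```typescript
// Hypothetical interfaces, for illustration only.
interface Logger { log(message: string): void }
interface Clock { now(): Date }
interface Mailer { send(to: string, body: string): void }
interface Cache {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
}

// Dummy: fills a parameter slot; the test below never exercises it.
const dummyLogger: Logger = { log: () => {} };

// Stub: returns a canned answer to "what time is it?".
const stubClock: Clock = { now: () => new Date("2024-01-01T00:00:00Z") };

// Spy: records how it was called so the test can assert afterwards.
class SpyMailer implements Mailer {
  calls: Array<{ to: string; body: string }> = [];
  send(to: string, body: string): void {
    this.calls.push({ to, body });
  }
}

// Fake: a working implementation, simpler than production.
class InMemoryCache implements Cache {
  private entries = new Map<string, string>();
  get(key: string): string | undefined {
    return this.entries.get(key);
  }
  set(key: string, value: string): void {
    this.entries.set(key, value);
  }
}

// Code under test: greets each recipient at most once per day.
function sendDailyGreeting(
  to: string, clock: Clock, mailer: Mailer, cache: Cache, logger: Logger,
): void {
  if (to === "") {
    logger.log("missing recipient"); // error path, not exercised below
    return;
  }
  const today = clock.now().toISOString().slice(0, 10);
  if (cache.get(to) === today) return; // already greeted today
  mailer.send(to, `Good morning! Today is ${today}.`);
  cache.set(to, today);
}

// Calling twice on the same (stubbed) day should send exactly one mail.
const mailer = new SpyMailer();
const cache = new InMemoryCache();
sendDailyGreeting("a@example.com", stubClock, mailer, cache, dummyLogger);
sendDailyGreeting("a@example.com", stubClock, mailer, cache, dummyLogger);
```

After the two calls, mailer.calls holds a single entry. The assertion is made on what the spy recorded, while the fake cache supplies real get/set behavior instead of scripted answers.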

Two principles:

Prefer fakes over mocks where reasonable. A fake exercises real behavior; a mock proves only that the test author predicted the implementation correctly.
Mock at architectural boundaries, not inside them. Mock the database client, not the function that calls it. Mock the HTTP client, not the service wrapping it.

The smell to watch for: a test that looks identical to the implementation it tests. That test will pass even if the implementation is broken, because both move together.

Properties of a Trustworthy Suite

The FIRST mnemonic (Robert C. Martin) covers what a unit test should be: Fast, Independent, Repeatable, Self-validating, Timely. The same properties apply to the suite as a whole:

Fast enough to run constantly

If running the relevant tests takes longer than the developer's patience, they are skipped. The threshold is roughly a few seconds for a focused run and a few minutes for the full suite.

When the suite slows down:

Split: hot tests (run on every save) vs cold tests (run on push or in CI).
Profile: a single slow test often dominates the runtime of the whole tier.
Parallelize: most modern runners can.
Cull: low-value, redundant tests are not free.

Independent

A test should pass or fail on its own merits. Tests that depend on order, on shared state, or on the previous test's database row will eventually rot.

Symptoms: tests that pass alone and fail together, tests whose order in the file matters, tests that pass on the second run.
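The first two symptoms can be reproduced in a few lines. This sketch uses a hypothetical Cart class and models each test as a function returning true on pass; running a suite in reverse is one cheap way to expose order dependence.

```typescript
// Hypothetical class under test, for illustration only.
class Cart {
  private items: string[] = [];
  add(item: string): void {
    this.items.push(item);
  }
  count(): number {
    return this.items.length;
  }
}

type Test = () => boolean; // true means the test passed

// Order-dependent: both tests share one cart, so the second passes
// only if the first has already run.
function makeOrderDependentSuite(): Test[] {
  const sharedCart = new Cart();
  return [
    () => { sharedCart.add("book"); return sharedCart.count() === 1; },
    () => { sharedCart.add("pen"); return sharedCart.count() === 2; },
  ];
}

// Independent: each test builds the state it needs.
const independentTests: Test[] = [
  () => { const cart = new Cart(); cart.add("book"); return cart.count() === 1; },
  () => {
    const cart = new Cart();
    cart.add("book");
    cart.add("pen");
    return cart.count() === 2;
  },
];

// Runs every test (no short-circuit) and reports whether all passed.
function allPass(tests: Test[]): boolean {
  return tests.map((test) => test()).every((passed) => passed);
}

const dependentInOrder = allPass(makeOrderDependentSuite());
const dependentReversed = allPass(makeOrderDependentSuite().reverse());
const independentReversed = allPass([...independentTests].reverse());
```

The shared-cart suite passes in file order and fails when reversed; the independent suite passes either way.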

Repeatable

Same input, same output, every time. Sources of non-determinism:

The clock — inject a clock.
Random numbers — inject a random source.
Network — substitute or use deterministic fixtures.
File system — use a temporary directory; clean up.
Database state — wipe and seed in setup, or run in a transaction that rolls back.
Concurrency — make the test single-threaded where possible; use deterministic schedulers where not.

Every flake found and fixed is a permanent improvement. Every flake tolerated trains the team to ignore failures.
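For the first two sources, injection can be as small as a defaulted parameter. A minimal sketch, with hypothetical function names: production callers omit the extra argument and get the real clock or random source, while tests pass a fixed one.

```typescript
// Hypothetical names, for illustration only.
type Clock = () => number;  // milliseconds since the epoch
type Random = () => number; // value in [0, 1)

// Production code defaults to the real clock; tests pass a fixed one.
function isExpired(expiresAtMs: number, now: Clock = Date.now): boolean {
  return now() >= expiresAtMs;
}

// Production code defaults to Math.random; tests pass a fixed source.
function pickWinner(entrants: string[], random: Random = Math.random): string {
  const winner = entrants[Math.floor(random() * entrants.length)];
  if (winner === undefined) throw new Error("no entrants");
  return winner;
}

// Deterministic doubles for tests.
const fixedClock: Clock = () => 1_000;
const alwaysHalf: Random = () => 0.5;

const expiredAtBoundary = isExpired(1_000, fixedClock); // clock equals expiry
const notYetExpired = isExpired(1_001, fixedClock);     // expiry is 1 ms away
const winner = pickWinner(["ann", "bob", "cy"], alwaysHalf);
```

With the fixed sources, the boundary case (clock exactly at the expiry time) can be asserted directly, and pickWinner returns the same entrant on every run.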

Self-validating

A test passes or fails — a human does not have to read output and decide. Tests that print and require interpretation are not tests; they are diagnostics.

Timely

Written close to the code they test, ideally with it. Tests added a sprint later usually test what the code happens to do now, not what it was supposed to do.

What to Test, Concretely

A senior engineer's checklist for any new change:

Happy path. The primary case works.
Boundary cases. Empty, single, maximum, minimum.
Error cases. Each documented failure mode.
Regression. The exact scenario from the bug report, if this is a fix.
Contract. What callers are allowed to rely on, captured in a test that would fail if it changed.
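One way to keep the checklist visible is a case table, where each row names the category it covers. The sketch below assumes a hypothetical parseQuantity function whose contract is "a whole number from 1 to 100, otherwise throw"; the regression row stands in for an imagined bug report about surrounding whitespace.

```typescript
// Hypothetical function under test: parses an order quantity from 1 to 100.
function parseQuantity(input: string): number {
  const trimmed = input.trim();
  if (!/^\d+$/.test(trimmed)) throw new Error("not a whole number");
  const value = Number(trimmed);
  if (value < 1 || value > 100) throw new Error("out of range");
  return value;
}

type Case =
  | { name: string; input: string; expected: number }
  | { name: string; input: string; throws: string };

const cases: Case[] = [
  { name: "happy path", input: "5", expected: 5 },
  { name: "boundary: minimum", input: "1", expected: 1 },
  { name: "boundary: maximum", input: "100", expected: 100 },
  { name: "boundary: below minimum", input: "0", throws: "out of range" },
  { name: "boundary: above maximum", input: "101", throws: "out of range" },
  { name: "error: empty input", input: "", throws: "not a whole number" },
  { name: "error: not a number", input: "abc", throws: "not a whole number" },
  { name: "regression: surrounding whitespace", input: " 7 ", expected: 7 },
];

// Returns the names of failing cases; an empty array means all passed.
function runCases(): string[] {
  const failures: string[] = [];
  for (const c of cases) {
    try {
      const actual = parseQuantity(c.input);
      if (!("expected" in c) || actual !== c.expected) failures.push(c.name);
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      if (!("throws" in c) || message !== c.throws) failures.push(c.name);
    }
  }
  return failures;
}
```

The error messages double as the contract here: a caller may rely on them, and the table fails if they change.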

What not to test, usually:

Framework internals. The framework has its own tests.
Trivial getters and setters. Unless they encode logic.
Implementation details. A refactor will break tests tied to them; tests tied to the contract should survive it.
Generated code. The generator has tests.

Common Failure Modes

Tests that test the mock

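import { expect, test, vi } from 'vitest';

// Anti-pattern: the only assertion is that the mock was called.
test('saves the order', () => {
  const db = { save: vi.fn() };
  const service = new OrderService(db);
  service.placeOrder({ id: 1 });
  expect(db.save).toHaveBeenCalledWith({ id: 1 });
});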

This test passes whether or not the order is actually saved correctly. It verifies only that placeOrder calls save with a given payload, which re-implements placeOrder inside the test. If the production code mis-formats the payload, the expectation tends to be written to match it, so the test encodes the same mistake instead of catching it.

A better test asserts an observable property: after placing the order, can we retrieve it? Does the database now have a row? Does the response include the order ID?
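A minimal sketch of that idea, using a fake store instead of a mocked save. The OrderStore interface and this simplified OrderService are assumptions made for the example, not the real service, and the check is a plain function so it runs without a test runner.

```typescript
// Hypothetical shapes, for illustration only.
type Order = { id: number };

interface OrderStore {
  save(order: Order): void;
  findById(id: number): Order | undefined;
}

// Fake store: really stores and returns orders, just in memory.
class InMemoryOrderStore implements OrderStore {
  private rows = new Map<number, Order>();
  save(order: Order): void {
    this.rows.set(order.id, { ...order });
  }
  findById(id: number): Order | undefined {
    const row = this.rows.get(id);
    return row ? { ...row } : undefined;
  }
}

// Simplified stand-in for the service under test.
class OrderService {
  private store: OrderStore;
  constructor(store: OrderStore) {
    this.store = store;
  }
  placeOrder(order: Order): number {
    this.store.save(order);
    return order.id;
  }
}

// Observable property: once placed, the order can be read back.
function placedOrderCanBeRetrieved(): boolean {
  const store = new InMemoryOrderStore();
  const service = new OrderService(store);
  const returnedId = service.placeOrder({ id: 1 });
  const stored = store.findById(1);
  return returnedId === 1 && stored !== undefined && stored.id === 1;
}
```

This check keeps passing if placeOrder is refactored to call the store differently, and fails if the order stops being retrievable, which is the behavior callers care about.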

Setup-heavy tests

beforeEach(() => {
  // 80 lines of fixture setup
});

Setup that long is a smell. Usually it means:

The unit under test has too many dependencies.
The test is at the wrong level — the integration cost is too high for a unit test.
A test builder or factory would compress this.
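For the last case, a builder can be a single function with defaults and overrides. A sketch, assuming a hypothetical User type and permission rule: each test states only the fields that matter to it.

```typescript
// Hypothetical domain type and rule, for illustration only.
type User = {
  id: number;
  name: string;
  email: string;
  role: "admin" | "member";
  active: boolean;
};

// Builder: valid defaults, overridden only where the test cares.
function buildUser(overrides: Partial<User> = {}): User {
  return {
    id: 1,
    name: "Test User",
    email: "test@example.com",
    role: "member",
    active: true,
    ...overrides,
  };
}

// Code under test: only active admins may delete a project.
function canDeleteProject(user: User): boolean {
  return user.active && user.role === "admin";
}

// Each line reads as the scenario it describes.
const adminCan = canDeleteProject(buildUser({ role: "admin" }));
const memberCannot = canDeleteProject(buildUser());
const inactiveAdminCannot = canDeleteProject(buildUser({ role: "admin", active: false }));
```

The defaults live in one place, so adding a field to User means updating the builder rather than every fixture.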

Tests asserting too much

test('returns user', () => {
  const user = service.getUser(1);
  expect(user).toEqual({
    id: 1, name: 'Alice', email: 'a@b.com',
    createdAt: ..., updatedAt: ..., // 14 more fields
  });
});

Every irrelevant field becomes a source of brittleness. Assert what the test is actually about. If the test name is "returns the user's name correctly," assert only the name.
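A focused version might look like the sketch below. The getUser function and the matchesFields helper are hypothetical; in a Vitest or Jest suite, expect(user).toMatchObject({ name: 'Alice' }) does a similar partial match.

```typescript
// Hypothetical type and service function, for illustration only.
type User = {
  id: number;
  name: string;
  email: string;
  createdAt: Date;
  updatedAt: Date;
};

function getUser(id: number): User {
  const now = new Date(); // changes on every call, so full equality is brittle
  return { id, name: "Alice", email: "a@b.com", createdAt: now, updatedAt: now };
}

// True when every key listed in `expected` matches `actual`;
// keys not listed are ignored.
function matchesFields<T extends object>(actual: T, expected: Partial<T>): boolean {
  const keys = Object.keys(expected) as Array<keyof T>;
  return keys.every((key) => actual[key] === expected[key]);
}

// "returns the user's name correctly": assert the name, nothing else.
const nameIsCorrect = matchesFields(getUser(1), { name: "Alice" });
```

The timestamps can change freely without touching this test; a separate test can cover them if they are part of the contract.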

Tests asserting too little

test('places an order', () => {
  service.placeOrder({ id: 1 });
  expect(true).toBe(true);  // or no assertion at all
});

A test that cannot fail is not a test. At most, it documents that the call does not throw.

Snapshot tests as a substitute for assertions

Snapshots are useful for output that is hard to assert structurally (HTML, large objects). They are dangerous when:

The snapshot is large enough that no one reads diffs.
Snapshots are updated reflexively when they fail.

A snapshot updated without inspection is worse than no test: it reports green while verifying nothing.

When the Suite Lies

A suite that has lost trust has predictable symptoms:

Failures the team explains away ("oh, that one's flaky, just rerun").
PRs merged with red tests after a quick "looked unrelated."
Tests skipped or commented out and never re-enabled.

Recovery is unglamorous:

Stop the bleeding. No new flaky tests merged. CI is green or PRs do not ship.
Categorize the existing flakes: real bugs hidden behind flakiness, race conditions, external dependencies. Each category gets a different fix.
Quarantine, do not delete. Move flakes to a known list, fix or remove on a schedule.
Make the cost visible. A dashboard of flake rate and time-to-green is the lever that gets resources allocated.
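The flake-rate number behind such a dashboard can start very simply. This sketch assumes one particular definition, which is not the only reasonable one: a test flaked on a commit if it both passed and failed on that same commit. The TestRun record shape is hypothetical.

```typescript
// Hypothetical record of one test execution in CI.
type TestRun = { test: string; commit: string; passed: boolean };

// Assumed definition: a test "flaked" on a commit if it both passed and
// failed on the same code. Returns, per test, flaky commits / commits seen.
function flakeRates(runs: TestRun[]): Map<string, number> {
  // test name -> commit -> set of outcomes observed
  const outcomes = new Map<string, Map<string, Set<boolean>>>();
  for (const run of runs) {
    const byCommit = outcomes.get(run.test) ?? new Map<string, Set<boolean>>();
    const seen = byCommit.get(run.commit) ?? new Set<boolean>();
    seen.add(run.passed);
    byCommit.set(run.commit, seen);
    outcomes.set(run.test, byCommit);
  }

  const rates = new Map<string, number>();
  outcomes.forEach((byCommit, test) => {
    let flakyCommits = 0;
    byCommit.forEach((seen) => {
      if (seen.size === 2) flakyCommits += 1; // saw both pass and fail
    });
    rates.set(test, flakyCommits / byCommit.size);
  });
  return rates;
}

const sampleRuns: TestRun[] = [
  { test: "login", commit: "a1", passed: true },
  { test: "login", commit: "a1", passed: false }, // same code, different result
  { test: "login", commit: "b2", passed: true },
  { test: "checkout", commit: "a1", passed: false },
  { test: "checkout", commit: "a1", passed: false }, // consistent failure
];
const rates = flakeRates(sampleRuns);
```

In the sample, login flaked on one of its two commits, while checkout failed consistently and so counts as a real failure rather than a flake. That distinction is the first step of the categorization above.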

A suite is trusted when the team's first reaction to a red test is "something broke," not "rerun it."

Pre-Commit Question

Before committing a change with tests, ask:

If I broke the implementation in a plausible way, would at least one of these tests fail?

If not, the tests cover the implementation, not the behavior. Add the test that would catch the regression.
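The question can be tried by hand on a small scale. In this sketch the pricing rule, the off-by-one "plausible break," and both suites are hypothetical; each suite is a function that returns true when all of its checks pass against a given implementation.

```typescript
type Implementation = (unitPrice: number, quantity: number) => number;

// Intended behavior (hypothetical): 20% off when ordering 10 or more.
const original: Implementation = (unitPrice, quantity) => {
  const subtotal = unitPrice * quantity;
  return quantity >= 10 ? subtotal - Math.floor(subtotal / 5) : subtotal;
};

// A plausible break: off-by-one in the threshold (> instead of >=).
const broken: Implementation = (unitPrice, quantity) => {
  const subtotal = unitPrice * quantity;
  return quantity > 10 ? subtotal - Math.floor(subtotal / 5) : subtotal;
};

// Weak suite: checks both branches but never the threshold itself.
function weakSuite(total: Implementation): boolean {
  return total(10, 1) === 10 && total(10, 20) === 160;
}

// Stronger suite: adds the values on either side of the boundary.
function strongSuite(total: Implementation): boolean {
  return weakSuite(total) && total(10, 9) === 90 && total(10, 10) === 80;
}

const weakCatchesBreak = !weakSuite(broken);     // false: the break slips through
const strongCatchesBreak = !strongSuite(broken); // true: the boundary check fails
```

Both suites pass against the original. Only the one with the boundary check fails against the broken version, so that check is the test worth adding.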
