Debugging
A discipline for finding and fixing defects efficiently
Needing to debug is not a sign of missing skill. It is a regular part of the work, and the people who do it well have a method, not a knack. The method is the same regardless of language, stack, or tooling: turn the problem into a question, form hypotheses, and reduce uncertainty one step at a time.
The goal is not to be clever. It is to be systematic.
The Wrong Way
The most common debugging anti-patterns are predictable:
- Pattern-matching against past bugs. Trying remedies that worked last time, without checking whether the same cause is present.
- Random changes. Editing code that "looks suspicious" and re-running. The symptom may indeed disappear, but nobody knows why, so the fix cannot be repeated and the same bug returns somewhere else.
- Blaming the environment. "It must be the network / the cache / the database." Sometimes true; usually a way to avoid investigating.
- Stopping at the first wrong thing found. Many bugs are a chain of failures; the first one fixed is often not the root cause.
- Not reading the error. Stack traces, log messages, and assertion failures contain information. Read them carefully before guessing.
The practical effect of these patterns is that debugging takes longer than necessary, fixes are fragile, and the developer learns nothing from the bug.
The Method
1. Reproduce the failure reliably
A bug you cannot reproduce is a bug you cannot fix. The first investment is to find a sequence of steps — inputs, configuration, timing — that produces the failure consistently. Until you can, every "fix" is a guess.
Reproduction is sometimes the bulk of the work. It pays off: once you have a reliable repro, the rest of the process is straightforward, and you can verify the fix.
When the bug appears only intermittently, the cause is usually:
- A race condition or other timing dependency.
- An uninitialized value whose default varies.
- A dependency on environment state (cache, file system, time of day).
- A test or call ordering effect.
Each of these suggests a place to look.
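When the suspected cause is an ordering or timing dependency, one way to turn an intermittent failure into a reliable one is to search for a random seed that triggers it and then pin that seed. The sketch below is only an illustration: balance_after, EVENTS, and EXPECTED are hypothetical stand-ins, and it assumes the correct result should not depend on event order.

```python
import random

def balance_after(events):
    """Hypothetical buggy code: a withdrawal that arrives 'too early' is silently dropped."""
    balance = 0
    for kind, amount in events:
        if kind == "deposit":
            balance += amount
        elif kind == "withdraw" and balance >= amount:
            balance -= amount
    return balance

# Hypothetical scenario: the final balance should be 20 in any arrival order.
EVENTS = [("deposit", 50), ("deposit", 50), ("withdraw", 80)]
EXPECTED = 20

def find_failing_seed(max_seeds=1000):
    """Try seeded shuffles until one reproduces the failure; return (seed, order) or None."""
    for seed in range(max_seeds):
        order = EVENTS[:]
        random.Random(seed).shuffle(order)
        if balance_after(order) != EXPECTED:
            return seed, order
    return None

result = find_failing_seed()
```

Once a failing seed is known, replaying random.Random(seed) produces the same order every time, so the failure stops being intermittent and can be carried into the next steps.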
2. Find the smallest reproduction
Once a failing case exists, shrink it. Remove inputs, configuration, code paths, and steps until the failure either vanishes (proving the part you removed mattered) or persists (proving the part you removed was incidental).
A small reproduction:
- Localizes the bug to a small region of code.
- Removes distractions from the investigation.
- Becomes the regression test once the bug is fixed.
Modern tools make this faster: git bisect for narrowing down the commit that introduced a regression; property-based testing libraries that automatically shrink failing inputs; minimization heuristics for input data.
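The shrinking loop itself is simple enough to sketch by hand. The version below is a greedy, one-element-at-a-time simplification, not the full delta-debugging algorithm and not what any particular library does; the fails predicate is a hypothetical stand-in for "run the code and check whether the bug still shows".

```python
def minimize(items, fails):
    """Greedily drop elements for as long as the failure persists."""
    if not fails(items):
        raise ValueError("start from an input that actually fails")
    current = list(items)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if fails(candidate):
                # The removed element was incidental: keep the smaller input.
                current = candidate
                changed = True
                break
    return current

# Hypothetical failure condition: the bug needs both a negative number and a zero.
def fails(values):
    return any(v < 0 for v in values) and 0 in values

minimal = minimize([5, 3, -2, 8, 0, 7, 1], fails)
```

Here minimal ends up as [-2, 0]: every remaining element is necessary, because removing either one makes the failure vanish.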
3. State the bug precisely
Rephrase the symptom into a question whose answer would be the cause:
- Bad: "The page is broken."
- Better: "The price column shows 0 for orders placed before 2024."
- Best: "computeTotal returns 0 when order.placedAt < BILLING_EPOCH. Why?"
The precision matters. Vague questions get vague investigations. A precise question implies what to look at next.
4. Form hypotheses, then test them
Each hypothesis is a guess about what is going wrong, accompanied by an experiment that would confirm or refute it.
Hypothesis: The function uses the wrong default rate when the date is null.
Experiment: Call the function with a null date and observe the result.
Result: Returns 0. → Hypothesis confirmed.
Two disciplines matter here:
- Test the hypothesis, do not "try things." A test is something whose outcome teaches you something either way. "Trying things" is editing code at random.
- Believe the evidence. When a test contradicts a strong belief about the code, the belief is the thing that is wrong. Don't argue with the data.
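An experiment is easier to trust when it is written down as a small script with a control case next to the probe. The following is a sketch under assumptions: compute_total, RATE_PERCENT, and DEFAULT_RATE_PERCENT are hypothetical stand-ins for the code under investigation, not a real API.

```python
from datetime import date

# Hypothetical rate table and the suspected wrong default.
RATE_PERCENT = {2023: 10, 2024: 12}
DEFAULT_RATE_PERCENT = 0

def compute_total(amount_cents, placed_at):
    """Hypothetical function under investigation."""
    if placed_at is None:
        rate = DEFAULT_RATE_PERCENT
    else:
        rate = RATE_PERCENT.get(placed_at.year, DEFAULT_RATE_PERCENT)
    return amount_cents * rate // 100

def run_experiment():
    """Run one control and one probe so that either outcome is informative."""
    return {
        "control": compute_total(10_000, date(2024, 3, 1)),  # known-good input
        "probe": compute_total(10_000, None),                # hypothesized trigger
    }

observation = run_experiment()
hypothesis_confirmed = observation["control"] > 0 and observation["probe"] == 0
```

The control matters as much as the probe: if the control also returned 0, the null date would not be the explanation, and the hypothesis would be refuted rather than confirmed.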
5. Find the cause, not just a cause
Many fixes silence the symptom without addressing the underlying defect. The bug returns somewhere else, often weeks later. The Pragmatic Programmer's phrase is "Fix the Problem, Not the Blame."
A cause is the cause when:
- It explains every observed symptom, including the surprising ones.
- The fix is one targeted change, not a defensive shield.
- After the fix, you understand why the original code seemed to work — what conditions hid the bug previously.
If you cannot answer the last question, you have not finished debugging.
6. Write a regression test
The smallest reproduction from step 2 is the body of a regression test. Land it with the fix. Two things happen:
- The bug cannot return without being noticed.
- The next reader, looking at the test name, learns what mistake to avoid.
A bug fix without a regression test is incomplete.
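A useful property to check is that the test fails against the old code and passes against the fix. The sketch below uses Python's unittest for illustration; BILLING_EPOCH's value, compute_total_buggy, and compute_total are hypothetical, and the fix shown is only a placeholder for whatever the real correction is.

```python
import unittest
from datetime import date

BILLING_EPOCH = date(2024, 1, 1)  # hypothetical cutoff

def compute_total_buggy(amount, placed_at):
    # Hypothetical original defect: orders before the epoch come out as 0.
    return amount if placed_at >= BILLING_EPOCH else 0

def compute_total(amount, placed_at):
    # Hypothetical fix: the order date no longer zeroes the total.
    return amount

class PreEpochOrderTotalTest(unittest.TestCase):
    implementation = staticmethod(compute_total)

    def test_order_placed_before_billing_epoch_keeps_its_total(self):
        self.assertEqual(self.implementation(500, date(2023, 12, 31)), 500)

def passes_with(implementation):
    """Run the regression test against a given implementation."""
    PreEpochOrderTotalTest.implementation = staticmethod(implementation)
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(PreEpochOrderTotalTest)
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful()

failed_before_fix = not passes_with(compute_total_buggy)
passes_after_fix = passes_with(compute_total)
```

In a real repository the "before" check is usually done by running the new test against the commit that precedes the fix; the test name carries the lesson for the next reader.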
Tools and Techniques
Read the stack trace
A stack trace tells you exactly where the failure was raised, and the call path that led there. Read it fully — including the frames in your own code, not just the topmost one. The first frame is rarely the most informative.
When the trace points to library code, look for the closest frame in code you control; the bug is usually a misuse of the library, not a defect in it.
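Finding the closest frame in code you control can be automated. The sketch below uses Python's traceback module and, purely as an assumption for the demo, treats anything under the json package directory as "library code"; a real project would more likely test whether the filename sits under its own source root. load_settings is a hypothetical caller.

```python
import json
import os
import traceback

# Assumption for this demo: frames under the json package count as library frames.
LIBRARY_DIRS = (os.path.dirname(json.__file__),)

def closest_own_frame(exc, library_dirs=LIBRARY_DIRS):
    """Return the innermost traceback frame that is not inside library code."""
    frames = traceback.extract_tb(exc.__traceback__)  # ordered outermost to innermost
    own = [f for f in frames if not f.filename.startswith(library_dirs)]
    return own[-1] if own else None

def load_settings(raw):
    """Hypothetical caller that misuses the library by passing invalid JSON."""
    return json.loads(raw)

frame = None
try:
    load_settings("{not json}")
except json.JSONDecodeError as exc:
    frame = closest_own_frame(exc)
```

The exception is raised deep inside the JSON decoder, but the frame worth reading first is load_settings, where the bad input was handed over.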
Use the debugger
A debugger is the highest-bandwidth tool for understanding running code. Set a breakpoint near the failure, step through, inspect state. The investment to learn the debugger pays back daily.
println debugging has its place — it is fast, it works in environments where debuggers don't, and the resulting log can be diffed across runs. But for non-trivial bugs, a debugger is faster.
Logging is debugging at a distance
For bugs that surface only in production, logging is the only available debugger. A few habits make logs effective:
- Log structured data, not formatted strings. {order_id: 1234, status: "FAILED"} is searchable; "order 1234 failed" is not.
- Log at the boundary, not inside the body. The information that matters is what crossed in and what crossed out.
- Include enough context to reconstruct the scenario: request IDs, user IDs, timestamps, version.
- Log errors with their cause chain, not their message alone.
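These habits can be sketched with Python's standard logging module. The JsonFormatter class, the "fields" convention passed through extra, and the sample IDs are assumptions made for this illustration, not a standard; many teams use a dedicated structured-logging library instead.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object on one line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
            **getattr(record, "fields", {}),  # structured context, if supplied
        }
        if record.exc_info:
            # formatException includes chained causes ("raise ... from ...").
            payload["error"] = self.formatException(record.exc_info)
        return json.dumps(payload)

stream = io.StringIO()  # stands in for a real log sink
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

try:
    try:
        raise TimeoutError("payment gateway timed out")
    except TimeoutError as cause:
        raise RuntimeError("order failed") from cause
except RuntimeError:
    logger.error(
        "order_failed",
        extra={"fields": {"order_id": 1234, "status": "FAILED", "request_id": "req-42"}},
        exc_info=True,
    )

entry = json.loads(stream.getvalue())
```

The resulting entry can be filtered by order_id or request_id, and its error field keeps both the RuntimeError and the TimeoutError that caused it.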
Binary search the cause
When a failure was introduced by some change in a known time window, git bisect finds the offending commit by binary search. Twenty commits → at most five build-and-test cycles to identify the culprit.
The same idea applies to data: when a large input fails, halve it and see which half still fails. Repeat. The minimal failing input is usually a small fraction of the original.
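The search behind this can be sketched in a few lines. This is an illustration of the idea rather than of git's own implementation: it assumes the commits are in linear order from oldest to newest, that the last one is known bad, and that once the bug appears it stays. The commit names and is_bad predicate are hypothetical.

```python
def first_bad(commits, is_bad):
    """Binary-search for the first bad commit; return (commit, tests_run)."""
    lo, hi = 0, len(commits) - 1  # invariant: the first bad commit is in commits[lo..hi]
    tests_run = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests_run += 1  # one build-and-test cycle
        if is_bad(commits[mid]):
            hi = mid        # culprit is mid or earlier
        else:
            lo = mid + 1    # culprit is after mid
    return commits[lo], tests_run

commits = [f"c{i:02d}" for i in range(1, 21)]  # twenty hypothetical commits
culprit = "c13"
found, tests_run = first_bad(commits, lambda c: c >= culprit)
```

For this example the search finds c13 in four cycles, and no choice of culprit among the twenty needs more than five.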
Diff, don't theorize
When code that worked yesterday breaks today, the fastest path to the cause is usually to compare the working state with the broken state — git diff between revisions, environment-variable diffs between machines, configuration diffs between deployments. Theorizing about what might have changed is slower than looking at what did change.
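For environment or configuration state, the comparison can be as small as a dictionary diff. The function and the sample settings below are hypothetical; in practice the two dictionaries might come from os.environ captured on each machine or from parsed configuration files.

```python
def diff_settings(working, broken):
    """Return {key: (working_value, broken_value)} for every key that differs.

    A key that is absent on one side is reported with None on that side.
    """
    missing = object()
    return {
        key: (working.get(key), broken.get(key))
        for key in sorted(working.keys() | broken.keys())
        if working.get(key, missing) != broken.get(key, missing)
    }

# Hypothetical snapshots of a working and a broken deployment.
working = {"DB_HOST": "db1", "CACHE_TTL": "300", "TZ": "UTC"}
broken = {"DB_HOST": "db1", "CACHE_TTL": "30", "FEATURE_X": "on"}

changes = diff_settings(working, broken)
```

Three differences come back (CACHE_TTL, FEATURE_X, TZ), which is a far shorter list of suspects than everything that might conceivably have changed.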
Rubber-duck
Explaining the problem out loud, in detail, often reveals the cause before the listener says anything. The act of articulating forces the assumptions to the surface, and the wrong assumption usually exposes itself.
The technique takes its name from a rubber duck kept on the desk as the listener. Any patient listener, real or imagined, works just as well.
Engineering daybook
The Pragmatic Programmer recommends a running notebook of investigations. The benefits compound:
- A bug you debugged six months ago, encountered again, is solvable in minutes if your notes survive.
- The act of writing forces precision in stating the problem.
- The notes capture context the code does not — what you tried, what worked, what you almost missed.
A notebook does not have to be elaborate. A folder of dated text files, or notes in your team's wiki, suffices.
Hard Bugs
Some bugs resist the standard method. A few categories and their characteristic moves.
Heisenbugs
The bug disappears when you try to observe it. Usually a timing or memory-corruption issue: adding logging changes the timing; running under a debugger changes the memory layout.
Move: stop trying to observe the bug at the failure point. Observe at the boundary; capture state to a log or core dump; analyze offline.
Concurrency bugs
The bug reproduces only sometimes, only on certain machines, only under load. See Concurrency for the structural prevention; for debugging, look for unsynchronized shared state or order-dependent assumptions.
Move: stress-test with deliberate scheduling perturbations; use thread sanitizers; add invariant assertions and run under load until one fires.
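A deliberate scheduling perturbation can make a race deterministic. In the sketch below, a threading.Barrier forces both threads to read the shared value before either writes it back, which is the interleaving that normally happens only occasionally. Counter and run_two_increments are hypothetical names for this illustration.

```python
import threading

class Counter:
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

def run_two_increments(use_lock):
    """Increment a shared counter from two threads; return the final value."""
    counter = Counter()
    barrier = threading.Barrier(2)

    def unsafe_increment():
        seen = counter.value      # read
        barrier.wait()            # perturbation: both threads read before either writes
        counter.value = seen + 1  # write back a stale value

    def safe_increment():
        barrier.wait()
        with counter.lock:        # read-modify-write happens atomically
            counter.value += 1

    target = safe_increment if use_lock else unsafe_increment
    threads = [threading.Thread(target=target) for _ in range(2)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value

lost_update = run_two_increments(use_lock=False)
correct = run_two_increments(use_lock=True)
```

With the barrier in place the unsynchronized version loses an update on every run (the counter ends at 1 instead of 2), so the fix can be verified rather than hoped for.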
State-corruption bugs
The bug appears far from the cause, because some earlier code corrupted state that a later piece read.
Move: add invariant checks at boundaries — at function exits, at state transitions, at persistence points — until one fires before the visible failure. The earliest assertion that fires is close to the cause.
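A minimal sketch of the idea, assuming a hypothetical Inventory class whose invariant is that reserved quantities stay between zero and the stock on hand. The check runs after every state transition, so the failure is raised by the call that corrupts the state rather than by whatever reads it later.

```python
class Inventory:
    """Hypothetical example: reserved[sku] must stay within 0..stock[sku]."""

    def __init__(self, stock):
        self.stock = dict(stock)
        self.reserved = {}
        self._check_invariants()

    def _check_invariants(self):
        for sku, qty in self.reserved.items():
            on_hand = self.stock.get(sku, 0)
            if qty < 0 or qty > on_hand:
                raise AssertionError(
                    f"invariant broken for {sku}: reserved={qty}, stock={on_hand}"
                )

    def reserve(self, sku, qty):
        self.reserved[sku] = self.reserved.get(sku, 0) + qty
        self._check_invariants()

    def release(self, sku, qty):
        # Hypothetical bug: nothing prevents releasing more than was reserved.
        self.reserved[sku] = self.reserved.get(sku, 0) - qty
        self._check_invariants()

inventory = Inventory({"A": 5})
inventory.reserve("A", 2)

caught = None
try:
    inventory.release("A", 3)  # corrupts state: reserved would become -1
except AssertionError as exc:
    caught = exc
```

Without the check, the negative reservation would sit unnoticed until some report or availability calculation produced a wrong number far from release.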
Distributed-system bugs
The bug involves multiple processes, and you have only their logs.
Move: trace IDs that propagate across services; structured logs that can be joined; a clear timeline reconstructed from timestamps; comparison of behavior across replicas. Without these in place, distributed bugs are nearly impossible; with them, they reduce to careful reading.
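With trace IDs and structured logs in place, reconstructing a timeline is mostly a join and a sort. The sketch below assumes each service writes one JSON object per line with hypothetical ts, service, trace_id, and event fields, and that clocks are synchronized closely enough for timestamps to be comparable; with significant clock skew the ordering would need more care.

```python
import json
from datetime import datetime

def timeline(trace_id, *log_streams):
    """Collect every record for one trace across services, ordered by timestamp."""
    events = []
    for stream in log_streams:
        for line in stream:
            record = json.loads(line)
            if record.get("trace_id") == trace_id:
                events.append(record)
    return sorted(events, key=lambda r: datetime.fromisoformat(r["ts"]))

# Hypothetical log lines from three services.
gateway_log = [
    '{"ts": "2024-05-01T12:00:00.100+00:00", "service": "gateway", "trace_id": "t-1", "event": "request_received"}',
    '{"ts": "2024-05-01T12:00:00.200+00:00", "service": "gateway", "trace_id": "t-2", "event": "request_received"}',
    '{"ts": "2024-05-01T12:00:00.450+00:00", "service": "gateway", "trace_id": "t-1", "event": "response_500"}',
]
orders_log = [
    '{"ts": "2024-05-01T12:00:00.180+00:00", "service": "orders", "trace_id": "t-1", "event": "charge_requested"}',
]
payments_log = [
    '{"ts": "2024-05-01T12:00:00.400+00:00", "service": "payments", "trace_id": "t-1", "event": "charge_timeout"}',
]

story = [(r["service"], r["event"]) for r in timeline("t-1", gateway_log, orders_log, payments_log)]
```

Read in order, the story for t-1 shows the payment timeout immediately preceding the gateway's 500 response, which is the kind of careful reading the logs are meant to enable.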
Once the Bug Is Fixed
A bug is an opportunity to improve more than one line of code:
- Add a regression test. The minimum.
- Look for siblings. A bug that exists in one place often has cousins. Search for the same pattern elsewhere.
- Look for the cause's cause. Why did the bug get written? A misleading name? A missing type? A confusing API? Fix the upstream cause if you can.
- Capture the lesson. Note what you learned. Share it with the team if it generalizes.
The strongest engineering teams treat each defect as a small post-mortem. Most bugs do not deserve a formal write-up, but the questions — what happened, how did we miss it, what would catch the next one — are worth asking briefly.
Pre-Fix Checklist
Before declaring a bug fixed:
- Can you reproduce the bug reliably?
- Have you identified the cause, not just a symptom?
- Does the cause explain all observed behavior, including what previously seemed unrelated?
- Is there a regression test that fails before the fix and passes after?
- Are there other places in the codebase with the same pattern that should be examined?
- Is there an upstream cause — a misleading API, a missing constraint — worth fixing as well?