Steven's Knowledge

Software Engineering Solutions

LLMs across the software lifecycle — from review and testing to migration, debugging, and operations

Scenario Abstraction

Software engineering is arguably the most LLM-fluent domain: models train heavily on code, and every artifact (code, tests, docs, configs, logs) is already text. The applications section covers Code Assistants as a product category. This page covers the same domain as business scenarios: discrete problems an engineering org wants solved, each with its own pipeline, success criteria, and team to whom you'd pitch it.

A useful framing: take any place in the software lifecycle where a senior engineer currently does structured-but-not-routine reasoning — reviewing, migrating, generating tests, classifying incidents, drafting runbooks — and ask whether an LLM can carry the first draft.

Solution Shape

Most software-engineering scenarios share the same pattern:

  1. Repo / system context — code, dependency graph, recent diffs, build artifacts, design docs.
  2. Trigger — a PR opened, a commit pushed, a ticket created, an alert fired, a developer typing.
  3. Targeted retrieval — pull the relevant files, related callers, prior decisions, similar past PRs.
  4. LLM reasoning — review, explain, generate, classify, draft.
  5. Verification with deterministic tools — compile, run tests, run linters, run a static analyzer. Never trust LLM-only verdicts on code.
  6. Output — inline comment, generated file, drafted reply, ranked list, autofix PR.
  7. Human action — author / reviewer / on-call decides. Almost always Tier 2 (Copilot) at organization scale; Tier 1 inline in the editor.
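
The seven steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not a reference implementation: `Draft`, `run_pipeline`, and the `retrieve` / `reason` / `verifiers` callables are hypothetical names introduced here, and the LLM call is stubbed as an ordinary function. The one point it tries to capture is that a draft only counts as verified when deterministic checks actually ran and all of them passed.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class Draft:
    """An LLM-proposed output plus the deterministic evidence about it."""
    patch: str
    checks: Dict[str, bool] = field(default_factory=dict)

    @property
    def verified(self) -> bool:
        # No checks run means "unverified", never "fine by default".
        return bool(self.checks) and all(self.checks.values())


def run_pipeline(
    trigger: Any,
    retrieve: Callable[[Any], Any],
    reason: Callable[[Any, Any], str],
    verifiers: Dict[str, Callable[[str], bool]],
) -> Draft:
    context = retrieve(trigger)            # step 3: targeted retrieval
    patch = reason(trigger, context)       # step 4: LLM reasoning (stubbed here)
    draft = Draft(patch=patch)
    for name, check in verifiers.items():  # step 5: deterministic verification
        draft.checks[name] = check(patch)
    return draft                           # steps 6-7: surfaced to a human
```

The returned `Draft` is what a reviewer would see: the proposed patch together with a per-check pass/fail record. Nothing in this sketch merges or applies the patch; that decision stays with the human in step 7.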

The decisive bottleneck is rarely the model; it's getting the right context to the model (a working repo, the relevant files, the actual failing test output, the design-doc decision history) and closing the loop with deterministic verification.
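
As one concrete instance of closing the loop, a verifier can simply ask the real toolchain whether a draft compiles. The sketch below assumes the draft is Python source and runs the standard-library `py_compile` module in a subprocess. `compiles` is a hypothetical helper name, and a temporary directory plus a timeout is only a stand-in for a proper sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Tuple


def compiles(source: str, timeout_s: float = 10.0) -> Tuple[bool, str]:
    """Return (ok, stderr) from byte-compiling `source` in a scratch directory."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(source, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, "-m", "py_compile", str(path)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Treat a hung check as a failure, not as a pass.
            return False, "timed out"
        return result.returncode == 0, result.stderr.strip()
```

A check like this only proves the code parses; it would normally be the first of several verifiers, followed by linters and the test suite.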

Key Building Blocks

  • Repo access layer — read by ref, with permissions; symbol search; dependency graph.
  • Issue / PR / alert webhooks — to trigger on the right events.
  • Sandboxed execution — run tests / linters / static analysis safely.
  • Specialist tools — language-specific (LSP servers, AST tools); these beat raw text manipulation.
  • Diff-aware UI — propose changes as a reviewable patch, not a wall of text.
  • Eval set of historical engineering work — old PRs, fixed bugs, migrated code; ground truth is what humans actually shipped.
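
To illustrate the diff-aware UI building block, the sketch below turns an original file and an LLM-proposed rewrite into a unified diff using Python's standard `difflib` module. The helper name `as_patch` and the `a/` and `b/` path prefixes are illustrative choices, not something the page prescribes.

```python
import difflib


def as_patch(path: str, original: str, proposed: str) -> str:
    """Render a proposed rewrite of `path` as a unified diff string."""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        proposed.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


before = "def add(a, b):\n    return a - b\n"
after = "def add(a, b):\n    return a + b\n"
print(as_patch("calc.py", before, after))
```

An empty string from `as_patch` means the proposal changes nothing, which is itself worth surfacing rather than opening an empty review.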

Concrete Cases

  • PR review automation. Each opened PR gets a structured review covering correctness suspicions, missing tests, style, security flags, dependency changes. The human reviewer reads the structured review first; the LLM accelerates the reviewer but never auto-merges.
  • Legacy migration. Codemods across thousands of files (framework upgrades, language migrations, API renames). LLM handles the long tail of "looks similar but isn't quite the same" that classic codemods miss. Verification by tests + diff stats per file.
  • Test generation. Generate unit / integration / property-based tests for changed files; flag missing coverage on hot paths.
  • Bug-report triage and reproduction. Inbound issue → classify, find related issues, generate a minimal repro, suggest first-look files.
  • Incident response copilot. Alert fires → run documented diagnostics, summarize current state, draft a Slack update, suggest the next escalation. Destructive actions stay human-gated.
  • Runbook execution / drift detection. Read live system state; compare against documented config; surface drift and propose corrections.
  • API / SDK generation from OpenAPI / Protobuf. Templates already do this; the LLM step is the polish: docs, examples, edge-case warnings, and idiomatic per-language style.
  • Documentation sync. When code changes, propose corresponding doc updates; flag stale docs.
  • Code review of dependency updates. Each dependency bump gets a "what actually changed in this lib, what risks does that imply for this codebase" brief.
  • Security review. Read PRs for OWASP-class patterns, auth changes, secrets handling, new endpoints; flag for security review. Overlaps with Decision Review.
  • Spec → implementation. From a written ticket / RFC, draft the scaffolding (routes, types, migrations, tests) for an engineer to flesh out.
  • Developer-onboarding assistant. Codebase-scoped Q&A (see Knowledge Assistant), tuned to first-week questions: "where is X defined," "how do I run Y locally."
  • DB / schema migration assistant. Given a schema change, draft the migration, the rollback, the backfill, the safety checks.
  • Log / trace summarization for debugging. Long log dumps → structured summary of what likely went wrong, with line references.
  • Code archaeology. "Why is this code here?" → search history, blame, related PRs, related design docs.
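
Several of these cases pair a deterministic step with an LLM drafting step. Taking drift detection as an example, the comparison itself needs no model. The sketch below assumes both the documented config and the live state are already loaded as nested dictionaries with string keys; `find_drift` and `MISSING` are hypothetical names. An LLM would then be handed the findings to explain them and draft corrections.

```python
from typing import Any, List, Tuple

MISSING = object()  # sentinel: the key is absent on one side


def find_drift(
    documented: dict, live: dict, prefix: str = ""
) -> List[Tuple[str, Any, Any]]:
    """Return (path, documented_value, live_value) for every difference."""
    findings = []
    for key in sorted(set(documented) | set(live)):
        path = f"{prefix}.{key}" if prefix else str(key)
        want = documented.get(key, MISSING)
        have = live.get(key, MISSING)
        if isinstance(want, dict) and isinstance(have, dict):
            # Recurse so findings point at the exact nested setting.
            findings.extend(find_drift(want, have, path))
        elif want is MISSING or have is MISSING or want != have:
            findings.append((path, want, have))
    return findings


documented = {"replicas": 3, "db": {"pool": 10, "tls": True}}
live = {"replicas": 5, "db": {"pool": 10}, "debug": True}
for path, want, have in find_drift(documented, live):
    print(path, want, have)
```

Keeping the diff deterministic means the model never decides whether drift exists, only how to describe and fix it.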

Similar Scenarios

  • Engineering-side workflow automation — covered by Workflow Automation; same shape, code-aware tools.
  • Engineering knowledge assistant — internal codebase Q&A; same as Knowledge Assistant.
  • Security review of changes — overlap with Decision Review.
  • Creative software design partner — overlap with Creative Assistance for high-level design conversations.

Pitfalls & Evaluation

  • Plausible code that doesn't compile. Always run the compiler / linter / tests. LLM-only "this looks right" verdicts produce expensive merges.
  • Hallucinated APIs. The model invents methods that don't exist. Mitigate with explicit dependency-version context and symbol-resolution tools.
  • Mass codemod blast radius. A migration applied across thousands of files needs per-file verification and good rollback. Don't open one giant PR.
  • Reviewer fatigue. A noisy auto-reviewer that flags everything teaches engineers to ignore it. Calibrate to high-precision findings; offer "more detail" on request rather than a fire hose of comments.
  • Security via obscurity won't survive prompt injection. Inputs from issues, comments, and log lines can carry instructions. Treat them as untrusted data: sanitize and sandbox.
  • The shippable bar is "passes tests + a human approved." Don't move past that quickly. Engineering culture changes more slowly than model capability.
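
The hallucinated-API pitfall is one place where a symbol-resolution check can be shown in a few lines. The sketch below is a deliberately narrow illustration for Python drafts: it parses the generated code with the standard `ast` module and checks that each `module.attribute` reference resolves against the installed module. `unresolved_symbols` is a hypothetical helper. It only handles plain `import x` and `import x as y` with first-level attributes, and because it imports the modules it names, it belongs inside the sandbox with the project's pinned dependencies.

```python
import ast
import importlib
from typing import Dict, List


def unresolved_symbols(source: str) -> List[str]:
    """Return `module.attr` references in `source` that do not resolve."""
    tree = ast.parse(source)  # raises SyntaxError if the draft does not parse

    # Map each local name to the module it refers to.
    aliases: Dict[str, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.asname:
                    aliases[alias.asname] = alias.name
                else:
                    top = alias.name.split(".")[0]
                    aliases[top] = top

    missing = set()
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id in aliases
        ):
            module_name = aliases[node.value.id]
            try:
                module = importlib.import_module(module_name)
            except ImportError:
                missing.add(module_name)  # the module itself is not installed
                continue
            if not hasattr(module, node.attr):
                missing.add(f"{module_name}.{node.attr}")
    return sorted(missing)


draft = "import json\njson.loads('{}')\njson.parse_fast('{}')\n"
print(unresolved_symbols(draft))
```

A real setup would more likely lean on an LSP server or type checker, as the building blocks list suggests; the point here is only that the check is mechanical and does not need the model's opinion.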

Useful metrics: PR-review accept rate of LLM comments, time-to-first-review, test coverage delta on changed code, migration completion time vs hand baseline, bug-triage classification accuracy, post-incident "did the AI summary help?" survey, false-positive rate of automated review (engineers' most hated metric).
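
Two of these metrics fall straight out of a log of how engineers responded to each automated comment. The sketch below assumes each comment is labeled with one of three outcomes, `"accepted"`, `"ignored"`, or `"false_positive"`; those labels and the `review_metrics` helper are assumptions made for illustration, not an established schema.

```python
from collections import Counter
from typing import Dict, List, Optional


def review_metrics(outcomes: List[str]) -> Dict[str, Optional[float]]:
    """Compute accept rate and false-positive rate over labeled comments."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        # No comments yet: report "unknown" rather than a misleading 0.0.
        return {"accept_rate": None, "false_positive_rate": None}
    return {
        "accept_rate": counts["accepted"] / total,
        "false_positive_rate": counts["false_positive"] / total,
    }


print(review_metrics(["accepted", "accepted", "ignored", "false_positive"]))
```

Tracking both numbers together matters: an accept rate can look healthy while the false-positive rate is still high enough to cause the reviewer fatigue described above.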
