Incident & On-call Communication
How to communicate during outages, incidents, and postmortems in English (用英文做故障期间沟通和事后复盘)
Why Incident Comms Are a Senior Skill (为什么故障沟通是资深技能)
During an incident, clear communication matters as much as the fix. Calm, structured updates build trust with leadership; panicked updates or silence erode it. This is one of the fastest ways to be noticed (or overlooked) by senior leadership.
故障期间,清晰的沟通和修复同样重要。冷静、有结构的同步能赢得领导层的信任;恐慌或沉默则会失去信任。这是最快被高层注意到(或忽视)的方式之一。
Principle (原则): Frequent short updates beat one perfect long one. Status > silence.
频繁的简短更新胜过一次完美的长更新。有状态总比沉默好。
Declaring an Incident (宣布故障)
Quick declaration (快速宣布)
- "Calling an incident — [system] is down/degraded. Spinning up a war room." — 宣布故障——[系统] 宕机/降级。开启 war room。
- "SEV-2 declared on payments. Comms in #incident-1234." — payments 已宣布 SEV-2。沟通在 #incident-1234。(SEV = Severity)
- "We're seeing X. Treating as an incident until proven otherwise." — 我们看到 X。除非证明不是,否则按故障处理。
Severity levels (严重等级)
Many companies use a numbered scale with three to five levels. A common four-level shorthand:
很多公司使用三到五级的数字分级。常见的四级简称:
- SEV-1 / P0 — site down, major customer impact, all hands. 全站宕机,全员上。
- SEV-2 / P1 — significant degradation, urgent. 严重降级,紧急。
- SEV-3 / P2 — partial impact, address soon. 部分影响,尽快处理。
- SEV-4 / P3 — minor, can wait. 小问题,可等。
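The SEV/P shorthand above can be captured as a small lookup so that messages and tooling use one vocabulary. This is a minimal sketch that assumes the four-level scale listed here; the `Severity` class and its `priority` and `label` helpers are hypothetical names, and your company's scale may differ.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Four-level scale from the list above; real scales vary by company."""
    SEV1 = 1  # site down, major customer impact, all hands
    SEV2 = 2  # significant degradation, urgent
    SEV3 = 3  # partial impact, address soon
    SEV4 = 4  # minor, can wait

    @property
    def priority(self) -> str:
        # In the shorthand above, SEV-N pairs with P(N-1).
        return f"P{self.value - 1}"

    @property
    def label(self) -> str:
        # Combined form as written in the list, e.g. "SEV-2 / P1".
        return f"SEV-{self.value} / {self.priority}"


print(Severity.SEV2.label)
```

Because lower numbers mean higher urgency, comparisons such as `Severity.SEV1 < Severity.SEV3` read as "SEV-1 is more severe than SEV-3".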
Roles During an Incident (故障期间的角色)
- IC (Incident Commander) — runs the response, makes decisions. 指挥官。
- Comms (Communications Lead) — handles internal/external updates. 沟通负责人。
- SME (Subject Matter Expert) — debugs and fixes. 技术专家。
- Scribe — keeps the timeline. 记录员。
Taking a role (接手角色)
- "I'll take IC." — 我做 IC。
- "I'll handle comms — pinging customers and leadership." — 我负责沟通——通知客户和领导。
- "I'll keep the timeline." — 我记时间线。
- "I'm the SME on this — let me drive debugging." — 这块我懂——我来推进排查。
Status Updates During the Incident (故障期间的状态同步)
Standard format (标准格式)
Update [time]:
- Impact: [who, what, severity]
- What we know: [confirmed facts]
- What we're doing: [current actions]
- Next update: [time of the next update, not a fix ETA]
Example (例子)
Update 14:25 UTC:
- Impact: ~30% of login requests failing globally since 14:08
- What we know: Auth service returning 500s, started after deploy at 14:05
- What we're doing: Rolling back deploy, ETA on rollback ~5 min
- Next update: 14:35 UTC
Phrases for updates (同步用语)
- "As of [time], [status]." — 截至 [时间],[状态]。
- "Still investigating — no root cause yet." — 仍在调查——尚未确认根因。
- "We have a hypothesis: [X]. Validating now." — 我们有假设:[X]。正在验证。
- "Confirmed: [X] is the root cause." — 已确认:[X] 是根因。
- "Mitigation in progress — ETA [time]." — 缓解中——预计 [时间]。
- "Service is recovering — monitoring." — 服务恢复中——监控中。
- "All metrics back to normal. Standing down." — 所有指标恢复正常。结束响应。(stand down = 解除戒备)
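If your team posts updates from a script or bot, the standard format from earlier in this section can be sketched as a small data class. This is only an illustration: the `StatusUpdate` class and its field names are hypothetical, and the values are copied from the 14:25 UTC example.

```python
from dataclasses import dataclass


@dataclass
class StatusUpdate:
    """One incident update in the standard format (hypothetical helper)."""
    time: str         # when the update is posted, e.g. "14:25 UTC"
    impact: str       # who is affected, what is broken, how badly
    known: str        # confirmed facts only
    doing: str        # current actions
    next_update: str  # time of the next update, not a fix ETA

    def render(self) -> str:
        # Keep the same four bullets in the same order every time,
        # so readers can scan updates quickly.
        return "\n".join([
            f"Update {self.time}:",
            f"- Impact: {self.impact}",
            f"- What we know: {self.known}",
            f"- What we're doing: {self.doing}",
            f"- Next update: {self.next_update}",
        ])


update = StatusUpdate(
    time="14:25 UTC",
    impact="~30% of login requests failing globally since 14:08",
    known="Auth service returning 500s, started after deploy at 14:05",
    doing="Rolling back deploy, ETA on rollback ~5 min",
    next_update="14:35 UTC",
)
print(update.render())
```

Making every field required is a deliberate choice in this sketch: an update cannot be built without stating when the next one is due.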
Communicating "I Don't Know" (说"我不知道")
Mid-incident, partial info is normal. Say what you know and don't know clearly.
故障中,信息不全是正常的。清楚说出知道的和不知道的。
- "We don't know the root cause yet. We do know that X started Y minutes ago." — 我们还不知道根因。但我们知道 X 在 Y 分钟前开始。
- "Best guess right now is [X], but I want to verify before saying it's confirmed." — 现在最好的猜测是 [X],但我想验证后再确认。
- "Honest answer: we're still digging." — 老实说:还在挖。
Communicating With Stakeholders (向相关方沟通)
Internal (leadership, other teams) (内部 - 领导、其他团队)
- "Heads up — we have a SEV-2 on [system]. War room in [link]. I'll send updates every 15 min." — 提前告知——[系统] SEV-2。war room 在 [链接]。我每 15 分钟同步一次。
- "For visibility: incident is in mitigation. Customers may see X for the next ~10 min." — 知会一下:故障在缓解中。客户接下来约 10 分钟可能看到 X。
- "All clear — incident resolved at [time]. Postmortem to follow." — 全部正常——[时间] 故障已解决。postmortem 后续发出。
External (customers, support) (外部 - 客户、支持)
Use status pages for customers. Templates:
对客户用状态页。模板:
- "Investigating: We're investigating reports of [issue]. We'll provide an update by [time]." — 调查中:我们正在调查 [问题] 的报告。我们会在 [时间] 前更新。
- "Identified: We've identified the cause of [issue] and are working on a fix." — 已定位:我们已找到 [问题] 的原因并在修复。
- "Monitoring: A fix has been deployed. We're monitoring to confirm full resolution." — 监控中:修复已部署。我们在监控确认完全解决。
- "Resolved: This incident has been resolved. We apologize for the disruption." — 已解决:本次故障已解决。我们为造成的不便致歉。
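The four templates above follow a fixed order, which can be sketched as a simple forward-only progression. This is an assumption made for illustration: the `STAGES` list and `can_move` helper are hypothetical, and a real status page may let you step back (for example, from Monitoring to Investigating if a fix does not hold).

```python
# Stage order taken from the four templates above.
STAGES = ["Investigating", "Identified", "Monitoring", "Resolved"]


def can_move(current: str, new: str) -> bool:
    """Return True if `new` is the same stage as `current` or a later one.

    Assumption for this sketch: stages never move backwards.
    """
    return STAGES.index(new) >= STAGES.index(current)


print(can_move("Investigating", "Identified"))  # forward move
print(can_move("Resolved", "Monitoring"))       # backward move
```

A check like this could guard a posting script so that a stale "Investigating" message is not published after the incident has been marked "Resolved".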
Asking for Help During an Incident (故障期间求助)
Don't be a hero. Pull people in fast.
不要逞英雄。快速把人拉进来。
- "Need eyes on [service] from someone who knows it well." — 需要熟悉 [服务] 的人来看一下。
- "Paging [team] for the database side." — 呼叫 [团队] 看数据库这边。
- "Anyone available to take comms? I need to focus on debugging." — 有人能接沟通吗?我需要专注排查。
- "Can someone manage the leadership update thread?" — 有人能管一下领导层的更新串吗?
Calling for Calm (引导冷静)
When the war room gets noisy or panicked.
当 war room 嘈杂或恐慌时。
- "Let's slow down for a sec. What do we actually know?" — 我们慢一下。我们实际知道什么?
- "One person talking at a time, please." — 一次一个人说,谢谢。
- "I want to take a 60-second pause to regroup." — 我想停 60 秒重新组织。
- "This is going to be OK. Step by step." — 这会没事的。一步一步来。
Closing an Incident (结束故障)
Standing down (解除响应)
- "Service has been stable for [N min]. I'm calling it resolved as of [time]." — 服务已稳定 [N 分钟]。我宣布在 [时间] 解决。
- "Standing down the war room. Postmortem owner: [name]." — 解散 war room。postmortem 负责人:[名字]。
- "Thank you everyone for the fast response." — 感谢大家的快速响应。
Action items handoff (后续行动项交接)
- "Three follow-ups: [list]. Each has an owner — see thread." — 三个后续:[列表]。每个有负责人——见串。
- "Postmortem doc by [date]. I'll send the link when ready." — postmortem 在 [日期] 前完成。准备好我发链接。
Postmortem Writing (复盘写作)
Blameless tone (无指责语气)
| ❌ Blameful (指责) | ✅ Blameless (无指责) |
|---|---|
| "Bob deployed bad code." | "A deploy at 14:05 introduced a regression." |
| "The on-call missed the alert." | "The alert fired but was not actioned for 12 min — runbook lacked clear next steps." |
| "We forgot to test X." | "Our test suite did not cover X." |
| "Engineer should have known." | "The system did not surface this risk to the engineer." |
The framing is "the system failed," not "a person failed."
视角是系统失败,不是人失败。
Standard postmortem sections (标准 postmortem 章节)
# Postmortem: [Incident name] — [Date]
## Summary
One paragraph: what happened, impact, resolution.
## Impact
- Users affected: [number / %]
- Duration: [time]
- Severity: SEV-N
- Revenue / SLA impact: [if applicable]
## Timeline (UTC)
- 14:05 — Deploy of v1.42 to production
- 14:08 — Error rate begins climbing
- 14:12 — First alert fires
- 14:15 — On-call acknowledges, opens incident
- ...
## Root cause
[Technical explanation]
## What went well
- Detection was fast
- Rollback procedure worked as designed
- Comms cadence kept stakeholders informed
## What went poorly
- 7-min gap between the error rate climbing (14:08) and on-call acknowledgement (14:15)
- No automated rollback on canary failure
- Internal status page was not updated
## Action items
| Item | Owner | Due |
|------|-------|-----|
| Add canary auto-rollback | @alice | 2026-05-20 |
| Reduce alert ack target to 2 min | @bob | 2026-05-15 |
| Update runbook with new mitigation | @carol | 2026-05-13 |
## Lessons
[Free-form reflection]
Phrases for postmortems (postmortem 用语)
- "The triggering event was [X]." — 触发事件是 [X]。
- "The root cause was [Y], not [X]." — 根因是 [Y],不是 [X]。
- "Contributing factors included..." — 相关因素包括……
- "The system did not detect [X] because..." — 系统没检测到 [X],因为……
- "This was the first time we'd seen [pattern]." — 这是我们第一次看到 [模式]。
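When writing the Timeline and "What went poorly" sections, it helps to compute the gaps between events instead of estimating them. The sketch below uses the four timestamps from the example timeline in the template; the `timeline` dictionary keys and the `minutes_between` helper are hypothetical names, and it assumes all events fall on the same UTC day.

```python
from datetime import datetime

# Timestamps from the example timeline in the template (UTC, same day).
timeline = {
    "deploy": "14:05",
    "errors_begin": "14:08",
    "alert_fires": "14:12",
    "acknowledged": "14:15",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between(timeline["errors_begin"], timeline["alert_fires"])
time_to_ack = minutes_between(timeline["alert_fires"], timeline["acknowledged"])

print(f"Time to detect: {time_to_detect} min")
print(f"Time to acknowledge: {time_to_ack} min")
```

Stating each gap with its start and end event (for example, "alert fired to acknowledged") keeps the postmortem numbers checkable against the timeline.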
Phrases to Avoid (要避免的表达)
| ❌ Avoid (避免) | Why (原因) | ✅ Better (更好) |
|---|---|---|
| Silence during incident | Worse than bad news (比坏消息更糟) | "Still digging — next update in 10 min." |
| "Should be fixed soon." | Vague, not credible (模糊,不可信) | "Mitigation deployed, monitoring for 10 min." |
| "It was [person]'s fault." | Never in a postmortem (永远不要在 postmortem 里) | "[Process/system] allowed X." |
| "We'll be more careful next time." | Not an action item (不是 action item) | Specific, owned, dated action items. |
| "This won't happen again." (no plan) | Empty words (空话) | "We've added [specific safeguard]." |
| "Sorry, sorry, my fault." (overdoing) | Guilt is not action (内疚不是行动) | "Owning this — fix in flight." |
Cultural Notes (文化提示)
Calm beats clever (冷静胜过机智)
During an incident, the calm communicator earns trust, even when someone else's heroic debugging lands the fix. Both matter; calm comms is rarer.
故障中,冷静的沟通者赢得信任,即使最后修好问题的是英雄式 debug 的那个人。两者都重要;冷静沟通更稀缺。
Cadence is sacred (节奏神圣)
If you said "next update in 15 min," update in 15 min — even if it's "no change, still investigating." Missing your own cadence is the fastest way to lose trust.
如果你说"15 分钟后更新",就 15 分钟更新——哪怕只说"无变化,仍在调查"。错过自己定的节奏是最快失去信任的方式。
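The cadence rule is easy to check mechanically. Below is a minimal sketch, assuming a 15-minute cadence and timezone-aware UTC timestamps; the `next_update_due` and `is_overdue` helpers and the sample date are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def next_update_due(last_update: datetime, cadence_min: int = 15) -> datetime:
    """Time by which the next update was promised."""
    return last_update + timedelta(minutes=cadence_min)


def is_overdue(last_update: datetime, now: datetime, cadence_min: int = 15) -> bool:
    """True once the promised update time has passed without a new update."""
    return now > next_update_due(last_update, cadence_min)


# Sample values for illustration only.
last = datetime(2026, 5, 1, 14, 25, tzinfo=timezone.utc)
now = datetime(2026, 5, 1, 14, 41, tzinfo=timezone.utc)

print(next_update_due(last).strftime("%H:%M UTC"))
print(is_overdue(last, now))
```

A reminder built on a check like this is most useful when it fires a minute or two before the deadline, so that even a "no change, still investigating" update goes out on time.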
Blameless is non-negotiable (无指责不可妥协)
Even hinting at blame in a postmortem makes engineers hide future incidents. The cultural cost of one blameful postmortem is high.
postmortem 里哪怕暗示指责,工程师都会藏起未来的故障。一篇有指责的 postmortem 文化代价很高。
Document, don't translate live (记录,不要现场翻译)
In multilingual war rooms, write in English in the channel even if the call is in another language. The written record matters more than the call.
多语言 war room 里,即使电话用其他语言说,频道里也用英文写。书面记录比电话重要。
Postmortems are public (postmortem 是公开的)
Many companies share postmortems org-wide, so treat them as public within the company. They're a learning artifact, not a punishment. Write for the next on-call.
很多公司会在全公司范围内共享 postmortem,所以要把它当作公司内部公开的文档。这是学习材料,不是惩罚。为下一个 on-call 而写。
Tips (小贴士)
- Update on cadence, even with no news — Silence sounds like things got worse. 按节奏更新,没新闻也更新。
- One source of truth — Pick one channel for updates. Forking is chaos. 一个真相来源。
- Distinguish facts from hypotheses — "We see X" vs "we think X." 事实和假设要分开。
- Pull help fast — Heroes burn out and miss things. 快速求援。
- Postmortem within a week — Memory fades and patterns get missed. 一周内做 postmortem。
- Track action items to completion — Postmortems without follow-through are theatre. 跟踪行动项到落地。
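Tracking action items to completion can start with a simple overdue check. The sketch below reuses the three rows from the example action-items table in the postmortem template; the `done` field, the `overdue` helper, and the sample "today" date are assumptions added for illustration.

```python
from datetime import date

# Rows copied from the example action-items table; "done" is an added field.
action_items = [
    {"item": "Add canary auto-rollback", "owner": "@alice",
     "due": date(2026, 5, 20), "done": False},
    {"item": "Reduce alert ack target to 2 min", "owner": "@bob",
     "due": date(2026, 5, 15), "done": False},
    {"item": "Update runbook with new mitigation", "owner": "@carol",
     "due": date(2026, 5, 13), "done": False},
]


def overdue(items: list, today: date) -> list:
    """Open items whose due date has already passed."""
    return [i for i in items if not i["done"] and i["due"] < today]


# Sample date for illustration only.
for entry in overdue(action_items, date(2026, 5, 16)):
    print(f'{entry["owner"]}: {entry["item"]} (due {entry["due"]})')
```

Listing the owner first mirrors the phrase "Each has an owner": the follow-up message goes to a person, not to a team in general.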