Best Practices
Production job systems - idempotency, retries, dead-letter, observability, scaling workers
Job systems eat real money — failed jobs lose work, runaway jobs burn compute, missed jobs drop user-visible features. These habits keep them reliable.
Idempotency Is Mandatory
Assume every job can run more than once. Crashes, network blips, retries, and manual replays are all routine, and most queues guarantee at-least-once delivery rather than exactly-once. Design every job to be safe to repeat.
The classic mistake:
// BAD: a retry sends two emails
async function processJob({ orderId }) {
  await sendEmail(`Order ${orderId} placed!`);
}
The fix is idempotency keys:
async function processJob({ orderId }) {
  // Lock per order: if already processed, skip
  const handle = await acquireOnceLock(`send-order-email:${orderId}`);
  if (handle.alreadyDone) return handle.previousResult;
  try {
    await sendEmail(`Order ${orderId} placed!`);
    return await handle.complete({ sent: true });
  } catch (err) {
    await handle.release();
    throw err;
  }
}
The lock is per-business-operation. Use a database table with a unique constraint or a Redis SETNX. Most jobs are naturally idempotent if you key on a business ID (order:42:send-email) and check before doing the side-effect.
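To make the once-lock semantics concrete, here is a minimal in-memory sketch. OnceStore and sendOrderEmailOnce are hypothetical names introduced for illustration, and the Map stands in for a table with a unique constraint on the key; a real store would have to be shared between workers and durable.

```javascript
// In-memory stand-in for a table with a unique constraint on `key`.
// Hypothetical sketch: a real store must be shared and durable (database or Redis).
class OnceStore {
  constructor() {
    this.records = new Map(); // key -> { status: 'in_progress' | 'done', result }
  }

  acquire(key) {
    const existing = this.records.get(key);
    if (existing && existing.status === 'done') {
      return { alreadyDone: true, previousResult: existing.result };
    }
    if (existing) {
      // Another attempt holds the claim; throwing lets the queue retry later.
      throw new Error(`operation ${key} is already in progress`);
    }
    this.records.set(key, { status: 'in_progress', result: undefined });
    return {
      alreadyDone: false,
      previousResult: undefined,
      // Record the result so later attempts return it without redoing the work.
      complete: (result) => {
        this.records.set(key, { status: 'done', result });
        return result;
      },
      // Drop the claim so a retry can run the side effect again.
      release: () => {
        this.records.delete(key);
      },
    };
  }
}

// Usage: the second call sees the stored result and skips the side effect.
const store = new OnceStore();
let emailsSent = 0;

function sendOrderEmailOnce(orderId) {
  const handle = store.acquire(`send-order-email:${orderId}`);
  if (handle.alreadyDone) return handle.previousResult;
  try {
    emailsSent += 1; // stands in for sendEmail(...)
    return handle.complete({ sent: true });
  } catch (err) {
    handle.release();
    throw err;
  }
}

sendOrderEmailOnce(42);
sendOrderEmailOnce(42);
console.log(emailsSent); // 1
```

One thing this sketch leaves out: a worker that crashes between acquire and complete leaves the record in_progress forever, so a production version would likely need an expiry on unfinished claims.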
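External APIs: pass an idempotency key wherever the provider supports one. Stripe accepts an Idempotency-Key header; for others such as Plaid and Twilio, confirm the exact header or field name and which endpoints honor it. Same key → same result, even if the request runs twice.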
Retries: Exponential with Jitter
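Exponential backoff is a good default; adding jitter (a random offset on each delay) prevents a thundering herd when many jobs fail at once. The options below set plain exponential backoff:
attempts: 5,
backoff: {
  type: 'exponential',
  delay: 1000, // BullMQ: retries after ~1s, 2s, 4s, 8s (5 attempts = 4 retries)
},
For very long retries (external API outages), increase max attempts and cap the delay: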
attempts: 20,
backoff: {
  type: 'exponential',
  delay: 1000,
  // cap at e.g. 60s; depends on library
},
Differentiate retryable from terminal errors:
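import { UnrecoverableError } from 'bullmq';

async function processJob(job) {
  try {
    await callExternalAPI();
  } catch (err) {
    if (err.status === 400 || err.status === 404) {
      // Don't retry: the input is wrong and will fail the same way again
      throw new UnrecoverableError(err.message);
    }
    throw err; // retry transient errors
  }
}
BullMQ has UnrecoverableError. In Sidekiq, sidekiq_retries_exhausted is the hook that runs after retries run out; to stop retrying early, disable retries for the job class or, in recent versions, return :discard or :kill from a sidekiq_retry_in block (check the documentation for your version). Use these to avoid retrying things that will never succeed.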
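The capped backoff with jitter mentioned earlier in this section can be sketched as a pure function. This is an illustration under stated assumptions, not any library's built-in behavior: backoffDelay, baseMs, and capMs are hypothetical names, and the "full jitter" variant (a uniform random delay between zero and the capped exponential) is one common choice among several. Without a cap, doubling from 1s reaches 2^18 seconds, roughly three days, by the 19th retry.

```javascript
// Capped exponential backoff with "full jitter": pick a random delay in
// [0, min(capMs, baseMs * 2^(attemptsMade - 1))).
// `random` is injectable so the function can be tested deterministically.
function backoffDelay(attemptsMade, { baseMs = 1000, capMs = 60000, random = Math.random } = {}) {
  const exponential = baseMs * 2 ** Math.max(0, attemptsMade - 1);
  const ceiling = Math.min(capMs, exponential);
  return Math.floor(random() * ceiling);
}

// Upper bounds for the first eight retries with the defaults above
// (random fixed at 1 to show the ceiling rather than a sampled value).
const ceilings = [1, 2, 3, 4, 5, 6, 7, 8].map((n) => backoffDelay(n, { random: () => 1 }));
console.log(ceilings); // [1000, 2000, 4000, 8000, 16000, 32000, 60000, 60000]
```

In BullMQ a function like this would typically be wired in through a custom backoff strategy; check your library's documentation for the exact hook and its arguments.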
Dead-Letter Queues
When retries exhaust, the job goes to a "dead" state. Two questions:
- Where does it go? A separate queue / table you can inspect.
- What triggers action? Alerting on dead-letter rate; manual or automated replay.
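import { UnrecoverableError } from 'bullmq';

worker.on('failed', async (job, err) => {
  if (!job) return; // the event can fire without a job, e.g. if it was removed
  const retriesExhausted = job.attemptsMade >= (job.opts.attempts ?? 1);
  // Terminal errors fail before attempts run out, so check for them explicitly
  if (retriesExhausted || err instanceof UnrecoverableError) {
    await deadLetterQueue.add('dead', {
      originalQueue: job.queueName,
      originalJobName: job.name,
      data: job.data,
      error: { message: err.message, stack: err.stack },
      failedAt: new Date().toISOString(),
    });
    metrics.increment('jobs.dead_lettered', { queue: job.queueName });
  }
});
Alert on dead-letter growth: a sudden spike means something downstream is broken.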
Observability
| Metric | Why |
|---|---|
| Jobs enqueued / sec | Health of producers |
| Jobs completed / sec | Health of workers |
| Job latency (enqueue → start) | Queue depth / worker capacity |
| Job duration | Slow jobs; alert on regressions |
| Failure rate per job type | Identify problematic jobs |
| Dead-letter rate | Things that never succeed |
| Worker concurrency utilization | Capacity planning |
Most job libraries ship Prometheus exporters or have third-party ones — pipe into Prometheus & Grafana.
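The latency and duration rows in the table can be derived from three timestamps per job. The sketch below assumes millisecond epoch fields named timestamp (enqueue time), processedOn, and finishedOn, modeled on the fields BullMQ exposes on a job; jobTimings itself is a hypothetical helper, and other libraries will use different names.

```javascript
// Derive queue latency and run duration from a job's timestamps (all in ms).
// `delay` covers jobs that were scheduled to start later than their enqueue time.
function jobTimings({ timestamp, delay = 0, processedOn, finishedOn }) {
  return {
    queueLatencyMs: processedOn - (timestamp + delay), // time spent waiting for a worker
    durationMs: finishedOn - processedOn, // time spent running
  };
}

const timings = jobTimings({ timestamp: 1000, processedOn: 1250, finishedOn: 4250 });
console.log(timings); // { queueLatencyMs: 250, durationMs: 3000 }
```

Recording both values as histograms rather than averages makes slow tails visible, which is usually what the alerts should key on.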
Trace every job — propagate the trace context from the enqueue site into the worker so a job becomes part of the parent request's trace. See Tracing.
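import { SpanKind } from '@opentelemetry/api';

// getCurrentTraceContext / contextFromTraceContext are application helpers,
// e.g. thin wrappers around OpenTelemetry's propagation.inject / propagation.extract.

// Enqueue
await queue.add('process', {
  ...data,
  _traceContext: getCurrentTraceContext(),
});
// Worker
const worker = new Worker('queue', async (job) => {
  const ctx = contextFromTraceContext(job.data._traceContext);
  await tracer.startActiveSpan('process-job', { kind: SpanKind.CONSUMER }, ctx, async (span) => {
    try {
      // ...
    } finally {
      span.end(); // a span is only exported once it has ended
    }
  });
});
Scaling Workers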
| Pattern | Notes |
|---|---|
| Multiple worker processes | Each consumes from the same queue; trivial horizontal scaling |
| Concurrency per worker | Each process handles N jobs in parallel (concurrency: 5) |
| Separate queues per workload class | "Slow" queue with low concurrency, "fast" queue with high |
| Dedicated workers per queue type | Pin specific queues to specific worker pools |
| Autoscaling on queue depth | Scale workers up when backlog grows; KEDA on Kubernetes |
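For most apps, start simple: 2-3 worker processes, each with concurrency roughly equal to the number of CPU cores, and scale horizontally when queue depth grows. Treat the concurrency figure as a starting point: I/O-bound jobs usually tolerate more, while CPU-bound jobs in a single-threaded runtime such as Node.js gain little from in-process concurrency.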
Don't put long jobs and short jobs on the same queue. A 30-minute report blocks the email worker behind it. Separate queues = separate failure domains.
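For autoscaling on queue depth, the arithmetic behind a scaling rule can be sketched as follows. desiredWorkers and all of its parameters are hypothetical names, and the formula deliberately ignores the arrival rate of new jobs, so treat it as a rough sizing aid rather than a complete autoscaler.

```javascript
// Workers needed to drain `backlog` jobs within `targetDrainSeconds`,
// given the average job duration and per-worker concurrency.
// The result is clamped to [min, max] to avoid scaling to zero or without bound.
function desiredWorkers({ backlog, avgJobSeconds, concurrency, targetDrainSeconds, min = 1, max = 20 }) {
  const jobsPerWorker = (targetDrainSeconds / avgJobSeconds) * concurrency;
  const needed = Math.ceil(backlog / jobsPerWorker);
  return Math.min(max, Math.max(min, needed));
}

// 6000 queued jobs at 2s each, 5 jobs in parallel per worker, drain within 5 minutes:
// each worker clears (300 / 2) * 5 = 750 jobs, so 6000 / 750 = 8 workers.
console.log(desiredWorkers({ backlog: 6000, avgJobSeconds: 2, concurrency: 5, targetDrainSeconds: 300 })); // 8
```

Tools such as KEDA apply a similar idea with a target queue length per replica; the min and max bounds matter as much as the formula, since they cap cost during a runaway backlog.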
Backpressure and Rate Limits
When a downstream is slow, your queue grows. Two responses:
- Limit upstream production — circuit-break or rate-limit at the API layer.
- Limit consumer concurrency — fewer workers means slower drain, but also less load on the downstream.
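BullMQ's limiter option caps jobs per time window, and Sidekiq offers rate limiting through Sidekiq Enterprise or third-party throttling gems. Apply limits per queue or per job class, depending on what your library supports.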
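To show what a jobs-per-window cap does, here is a minimal fixed-window limiter. FixedWindowLimiter is a hypothetical class written for illustration; it runs in a single process, whereas a queue library's limiter coordinates across workers through its backing store.

```javascript
// Fixed-window limiter: at most `max` jobs may start per `durationMs` window.
// `now` is injectable so the behavior can be tested without real time passing.
class FixedWindowLimiter {
  constructor({ max, durationMs, now = Date.now }) {
    this.max = max;
    this.durationMs = durationMs;
    this.now = now;
    this.windowStart = -Infinity;
    this.count = 0;
  }

  // Returns 0 if a job may start now, otherwise the milliseconds to wait.
  tryAcquire() {
    const t = this.now();
    if (t - this.windowStart >= this.durationMs) {
      this.windowStart = t; // start a new window
      this.count = 0;
    }
    if (this.count < this.max) {
      this.count += 1;
      return 0;
    }
    return this.windowStart + this.durationMs - t;
  }
}

let clock = 0;
const limiter = new FixedWindowLimiter({ max: 2, durationMs: 1000, now: () => clock });
console.log(limiter.tryAcquire()); // 0
console.log(limiter.tryAcquire()); // 0
clock = 400;
console.log(limiter.tryAcquire()); // 600
clock = 1000;
console.log(limiter.tryAcquire()); // 0
```

A fixed window allows short bursts at window boundaries; if the downstream cannot absorb those, a token bucket or sliding window is the usual alternative.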
Long Jobs
Jobs over ~30 seconds are suspect:
- Worker timeouts kill them mid-flight. Configure lockDuration (BullMQ) longer than the longest job; understand the implications.
- Crashes lose progress. Break into smaller jobs that can resume.
- Dashboard UX gets weird — "running for 4 hours" looks like a stuck job.
For long work, either break into steps (each its own job) or move to workflow orchestration — see Jobs vs Workflows.
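Breaking a long job into resumable steps can be sketched as a batch job that carries a cursor and enqueues its own follow-up. processBatchJob, enqueue, and processItem are hypothetical names; enqueue stands in for the queue's add call, and in practice the job data would hold a reference to the work (an ID or query) rather than the items themselves.

```javascript
// Each job processes one batch and, if work remains, enqueues a follow-up job
// carrying the cursor. A crash then loses at most one batch of progress.
function processBatchJob({ items, cursor = 0, batchSize = 100 }, { enqueue, processItem }) {
  const end = Math.min(cursor + batchSize, items.length);
  for (let i = cursor; i < end; i += 1) {
    processItem(items[i]);
  }
  if (end < items.length) {
    enqueue({ items, cursor: end, batchSize });
    return { done: false, nextCursor: end };
  }
  return { done: true, nextCursor: end };
}

// Drive the jobs with an in-memory array standing in for the queue.
const pending = [{ items: [1, 2, 3, 4, 5], cursor: 0, batchSize: 2 }];
const processed = [];
let jobsRun = 0;
while (pending.length > 0) {
  const data = pending.shift();
  processBatchJob(data, {
    enqueue: (next) => pending.push(next),
    processItem: (item) => processed.push(item),
  });
  jobsRun += 1;
}
console.log(processed, jobsRun); // [ 1, 2, 3, 4, 5 ] 3
```

Each batch still needs to be idempotent: if a job crashes after processing items but before enqueueing the follow-up, the retry will process the same batch again.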
Stuck / Stalled Jobs
A worker process crashes mid-job. The job is "locked" but no longer being processed.
Most libraries handle this with lock expiry — after lockDuration, the job is reassigned. Tune carefully:
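- Too short: jobs still held by active workers get picked up again by other workers (concurrent execution!).
- Too long: crashes leave jobs stuck for a while.
BullMQ's default lockDuration is 30s, and the worker renews the lock automatically while its event loop stays responsive, so I/O-bound jobs rarely need more. Where the lock is not renewed, or where CPU-heavy work can block the event loop, set lockDuration to about 2× the longest expected job time. Combine with idempotency, so that a job running twice after a lock expiry is safe.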
Job Versioning
Jobs enqueued today may run tomorrow after a deploy. Backward compatibility matters:
async function processJob(job) {
  // Job data may have been enqueued by an older deploy
  const data = migrateJobData(job.data);
  // ... process
}
function migrateJobData(data) {
  if (data.version === 1) {
    // Spread first so the new version number is not overwritten by data.version
    return { ...data, version: 2, newField: deriveFromOld(data) };
  }
  return data;
}
For breaking changes, drain the old queue before deploying: put enqueueing behind a feature flag, stop enqueueing, wait for the queue to empty, then deploy.
Scheduled and Recurring Jobs
Define recurring schedules in your app code, not in the OS crontab:
// Daily summary at 06:00 UTC
await queue.add('daily-summary', {}, {
  repeat: { pattern: '0 6 * * *', tz: 'UTC' },
});
Benefits over OS cron:
- Versioned with code. Cron schedule is in your repo.
- Observability. Goes through your monitoring like any other job.
- HA. Multiple workers can claim recurring jobs without duplication (the library coordinates via Redis / DB).
One caveat: long-running scheduled jobs can collide if the previous run hasn't finished. Libraries differ here: some skip the new run, others enqueue it anyway and may run both at once. Verify the behavior of yours.
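If your library does not skip overlapping runs, a guard around the job body is one way to get that behavior. The sketch below is in-memory and single-process; OverlapGuard and runIfIdle are hypothetical names, and with several workers the flag would need to live in Redis or a database, with an expiry so a crashed run cannot block the schedule.

```javascript
// Tracks which scheduled jobs are currently running in this process.
class OverlapGuard {
  constructor() {
    this.running = new Set();
  }

  // Returns true and marks the job as running, or false if a run is in progress.
  tryStart(name) {
    if (this.running.has(name)) return false;
    this.running.add(name);
    return true;
  }

  finish(name) {
    this.running.delete(name);
  }
}

const guard = new OverlapGuard();

// Wrap a scheduled job so an overlapping trigger is skipped instead of run twice.
async function runIfIdle(name, task) {
  if (!guard.tryStart(name)) return { skipped: true };
  try {
    return { skipped: false, result: await task() };
  } finally {
    guard.finish(name); // always clear the flag, even if the task throws
  }
}
```

A worker handler would call runIfIdle('daily-summary', buildSummary), where buildSummary is whatever the job does, and record a metric when a run is skipped so that chronic overlap shows up in monitoring.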
Job Inspection / Replay
Production debugging requires:
- A dashboard showing live queues, in-flight jobs, recent failures.
- Replay for failed jobs once you've fixed the bug.
- Cancel for stuck or no-longer-needed jobs.
Bull Board, Sidekiq Web, and Flower (Celery) all provide this. Put them behind admin auth: they expose job data, which may include PII.
Security
- Never put secrets in job data. Pass IDs; let the worker fetch from Secrets Management.
- Don't log full job data if it contains PII.
- Authenticate queue access — Redis ACLs / TLS / network isolation.
- Audit destructive jobs — "delete user data" jobs should log who triggered them.
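For the logging point above, one approach is to redact known sensitive fields before job data reaches a logger. redactJobData and its default key list are hypothetical and only illustrate the shape; the actual list should come from your own data model, and the function assumes job data is plain JSON (objects, arrays, primitives).

```javascript
const REDACTED = '[REDACTED]';

// Return a deep copy of `data` with the values of sensitive keys replaced.
// Key matching is case-insensitive and applies at any nesting depth.
function redactJobData(data, sensitiveKeys = ['email', 'phone', 'ssn', 'token', 'password']) {
  const sensitive = new Set(sensitiveKeys.map((key) => key.toLowerCase()));
  const visit = (value) => {
    if (Array.isArray(value)) return value.map(visit);
    if (value !== null && typeof value === 'object') {
      return Object.fromEntries(
        Object.entries(value).map(([key, inner]) => [
          key,
          sensitive.has(key.toLowerCase()) ? REDACTED : visit(inner),
        ]),
      );
    }
    return value;
  };
  return visit(data);
}

const safe = redactJobData({ orderId: 42, user: { id: 7, email: 'a@example.com' } });
console.log(JSON.stringify(safe));
// {"orderId":42,"user":{"id":7,"email":"[REDACTED]"}}
```

A deny-list like this only catches fields someone remembered to list; logging an explicit allow-list of fields (job name, ID, business keys) is the stricter alternative.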
Common Pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| Non-idempotent jobs | Duplicate emails / charges on retry | Idempotency keys |
| Retry forever on bad input | DLQ piles up; metrics drown | UnrecoverableError for terminal failures |
| Jobs much longer than lockDuration | Concurrent execution | Increase lock duration; or break job up |
| Mixing slow + fast jobs on one queue | Latency-sensitive jobs blocked | Separate queues |
| Storing PII in job data | Logs leak | Pass IDs; fetch in worker |
| Opening a new Redis connection per job or per short-lived worker | Connection churn | Connection pool; reuse |
| Workers in same process as web server | Web latency spikes when jobs are heavy | Separate worker processes |
| No metrics on queue depth | Silent backups | Alert on growing queue depth |
| Cron jobs only on one worker host | Single point of failure | Library-coordinated recurring jobs |
| Not propagating trace context | Disconnected traces | Propagate via job data |
Checklist
Production background jobs checklist
- Every job is idempotent (or uses idempotency keys)
- Retries with exponential backoff + jitter
- Distinct retryable vs terminal errors
- Dead-letter queue for unfixable jobs
- Alert on dead-letter rate
- Metrics: enqueue rate, complete rate, latency, duration, failure rate
- Trace context propagated from enqueue to worker
- Dashboard for queue inspection (behind auth)
- Separate queues for different workload classes (fast / slow / priority)
- Workers in dedicated processes; web servers don't run them
- Autoscaling rules on queue depth
- Job data does not contain secrets / unscrubbed PII
- Recurring jobs defined in code; no OS cron
- lockDuration tuned for longest expected job
- Backward compatibility for job data shape
- Runbook for "queue is backing up" / "many failures"
- For multi-step workflows, evaluated whether to move to workflow orchestration