Steven's Knowledge

Best Practices

Production job systems - idempotency, retries, dead-letter, observability, scaling workers

Job systems eat real money — failed jobs lose work, runaway jobs burn compute, missed jobs drop user-visible features. These habits keep them reliable.

Idempotency Is Mandatory

Assume every job will run more than once. Crashes, network blips, retries, and manual replays are all routine, and most queues guarantee at-least-once delivery rather than exactly-once. Design every job to be safe to repeat.

The classic mistake:

// BAD — retry sends two emails
async function processJob({ orderId }) {
  await sendEmail(`Order ${orderId} placed!`);
}

The fix is idempotency keys:

async function processJob({ orderId }) {
  // Lock per order: if already processed, skip
  const handle = await acquireOnceLock(`send-order-email:${orderId}`);
  if (handle.alreadyDone) return handle.previousResult;

  try {
    await sendEmail(`Order ${orderId} placed!`);
    return await handle.complete({ sent: true });
  } catch (err) {
    await handle.release();
    throw err;
  }
}

The lock is per business operation. Back it with a database table that has a unique constraint, or with Redis SETNX (SET with the NX option). Most jobs become idempotent once you key on a business ID (order:42:send-email) and check that key before performing the side effect.
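To make the shape of that lock concrete, here is a minimal in-memory sketch of acquireOnceLock. The helper and its handle fields are hypothetical names taken from the example above, and the Map only stands in for a shared store: a real version would need an atomic insert (unique index or SET with NX) plus an expiry on pending claims so a crashed worker does not hold the key forever.

```javascript
// In-memory stand-in for a table with a unique constraint on `key`.
// Assumption: a single process. Across workers, replace the Map with an
// atomic insert into a shared store.
const onceLocks = new Map(); // key -> { status: 'pending' | 'done', result }

async function acquireOnceLock(key) {
  const existing = onceLocks.get(key);
  if (existing && existing.status === 'done') {
    // The operation already completed: hand back the recorded result.
    return { alreadyDone: true, previousResult: existing.result };
  }
  if (existing) {
    // Another run holds the claim; fail so the queue retries later.
    throw new Error(`Operation ${key} is already in progress`);
  }
  onceLocks.set(key, { status: 'pending', result: undefined });

  return {
    alreadyDone: false,
    // Record the result so later runs can skip the side effect.
    async complete(result) {
      onceLocks.set(key, { status: 'done', result });
      return result;
    },
    // Drop the claim so a retry can attempt the side effect again.
    async release() {
      onceLocks.delete(key);
    },
  };
}
```

Note that releasing after a failure is only safe when the side effect definitely did not happen; if the email provider may have accepted the message before the error, a retry can still send a duplicate.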

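External APIs: pass an Idempotency-Key header where the provider supports one (Stripe does; other providers offer similar mechanisms under their own header or field names, so check their docs). Same key → same result, even if the request runs twice.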
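The key only helps if every retry of the same operation sends the same value, so derive it from the business operation rather than generating a random value inside the job. A small sketch, where idempotencyHeaders is a hypothetical helper and the key format is just one reasonable choice:

```javascript
// Build request headers with a deterministic idempotency key.
// The same (operation, businessId) pair always yields the same key,
// so a retried job repeats the key instead of minting a new one.
function idempotencyHeaders(operation, businessId, headers = {}) {
  return { ...headers, 'Idempotency-Key': `${operation}:${businessId}` };
}

const headers = idempotencyHeaders('charge-order', 42, {
  'Content-Type': 'application/json',
});
```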

Retries: Exponential with Jitter

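Exponential backoff is a good default. Adding jitter (a random offset on each delay) keeps a burst of failed jobs from retrying in lockstep and causing a thundering herd. The basic exponential config: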

attempts: 5,
backoff: {
  type: 'exponential',
  delay: 1000,   // BullMQ: retries after 1s, 2s, 4s, 8s (5 attempts = 4 retries)
},

For very long retries (external API outages), increase max attempts and cap delay:

attempts: 20,
backoff: {
  type: 'exponential',
  delay: 1000,
  // cap at e.g. 60s; depends on library
},
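Without a cap, 20 attempts of doubling from 1s would push the last delays out to days. A minimal sketch of a capped delay with "full jitter" (a random value between 0 and the capped exponential delay) follows. The function name and option names are illustrative, not part of any library.

```javascript
// Capped exponential backoff with full jitter.
// `attemptsMade` starts at 1 for the first failed attempt.
// `random` is injectable so the function can be tested deterministically.
function backoffDelayMs(
  attemptsMade,
  { baseMs = 1000, capMs = 60_000, random = Math.random } = {},
) {
  const exponential = baseMs * 2 ** (attemptsMade - 1); // 1s, 2s, 4s, ...
  const capped = Math.min(capMs, exponential);          // never above capMs
  return Math.floor(random() * capped);                 // spread retries out
}
```

In BullMQ, a function like this can be plugged in as a custom backoff strategy on the worker (settings.backoffStrategy) with jobs using backoff: { type: 'custom' }; check the exact signature for your version.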

Differentiate retryable from terminal errors:

async function processJob(job) {
  try {
    await callExternalAPI();
  } catch (err) {
    if (err.status === 400 || err.status === 404) {
      // Don't retry — the input is wrong
      throw new UnrecoverableError(err.message);
    }
    throw err;   // retry transient errors
  }
}

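BullMQ has UnrecoverableError, which fails the job immediately without further attempts. Sidekiq's sidekiq_retries_exhausted hook only runs after retries run out; to stop early in Sidekiq, disable retries for the job class or, in recent versions, return :discard or :kill from a sidekiq_retry_in block. Use these to avoid retrying things that will never succeed.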

Dead-Letter Queues

When retries exhaust, the job goes to a "dead" state. Two questions:

  1. Where does it go? A separate queue / table you can inspect.
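  2. What triggers action? Alerting on dead-letter rate; manual or automated replay.

worker.on('failed', async (job, err) => {
  if (!job) return; // BullMQ can emit 'failed' without a job reference

  const retriesExhausted = job.attemptsMade >= (job.opts.attempts ?? 1);
  // Terminal errors skip the remaining attempts, so check for them explicitly
  if (retriesExhausted || err instanceof UnrecoverableError) {
    await deadLetterQueue.add('dead', {
      originalQueue: queue.name,
      originalJobName: job.name,
      data: job.data,
      error: { message: err.message, stack: err.stack },
      failedAt: new Date().toISOString(),
    });

    metrics.increment('jobs.dead_lettered', { queue: queue.name });
  }
});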

Alert on dead-letter growth — a sudden spike means something downstream is broken.
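Replay can be sketched as a loop that puts dead-lettered payloads back on their original queues once the underlying bug is fixed. This assumes entries shaped like the ones written by the failed handler above and queue objects exposing add(name, data); replayDeadLetters and its options are hypothetical names.

```javascript
// Replay dead-lettered entries onto their original queues.
// `queuesByName` is a Map of queue name -> object with add(name, data).
// `shouldReplay` lets the operator filter, e.g. by error message or date.
async function replayDeadLetters(entries, queuesByName, { shouldReplay = () => true } = {}) {
  const summary = { replayed: 0, skipped: 0 };
  for (const entry of entries) {
    const target = queuesByName.get(entry.originalQueue);
    if (!target || !shouldReplay(entry)) {
      summary.skipped += 1; // unknown queue or filtered out
      continue;
    }
    await target.add(entry.originalJobName, entry.data);
    summary.replayed += 1;
  }
  return summary;
}
```

Replayed jobs run the same code path as the originals, which is one more reason the jobs themselves need to be idempotent.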

Observability

| Metric | Why |
| --- | --- |
| Jobs enqueued / sec | Health of producers |
| Jobs completed / sec | Health of workers |
| Job latency (enqueue → start) | Queue depth / worker capacity |
| Job duration | Slow jobs; alert on regressions |
| Failure rate per job type | Identify problematic jobs |
| Dead-letter rate | Things that never succeed |
| Worker concurrency utilization | Capacity planning |

Most job libraries ship Prometheus exporters or have third-party ones — pipe into Prometheus & Grafana.

Trace every job — propagate the trace context from the enqueue site into the worker so a job becomes part of the parent request's trace. See Tracing.

// Enqueue
await queue.add('process', {
  ...data,
  _traceContext: getCurrentTraceContext(),
});

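// Worker
const worker = new Worker('queue', async (job) => {
  const ctx = contextFromTraceContext(job.data._traceContext);
  // kind 4 is SpanKind.CONSUMER in OpenTelemetry
  await tracer.startActiveSpan('process-job', { kind: 4 }, ctx, async (span) => {
    try {
      // ...
    } finally {
      span.end(); // spans started this way are not ended automatically
    }
  });
});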
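What travels in _traceContext is typically a W3C traceparent string. The sketch below shows that format with two hypothetical helpers; in a real OpenTelemetry setup the propagation API (propagation.inject and propagation.extract) handles this, so treat the code as an illustration of the payload rather than a replacement for the SDK.

```javascript
// W3C Trace Context "traceparent": version-traceId-parentId-flags,
// all lowercase hex: 2, 32, 16, and 2 characters respectively.
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function serializeTraceContext({ traceId, spanId, sampled }) {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

function parseTraceContext(traceparent) {
  const match = TRACEPARENT.exec(traceparent ?? '');
  if (!match) return null; // missing or malformed: start a fresh trace
  const [, version, traceId, spanId, flags] = match;
  // Version ff and all-zero IDs are invalid per the spec.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

Returning null for bad input lets the worker fall back to a new root span instead of failing the job over a tracing problem.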

Scaling Workers

| Pattern | Notes |
| --- | --- |
| Multiple worker processes | Each consumes from the same queue; trivial horizontal scaling |
| Concurrency per worker | Each process handles N jobs in parallel (concurrency: 5) |
| Separate queues per workload class | "Slow" queue with low concurrency, "fast" queue with high |
| Dedicated workers per queue type | Pin specific queues to specific worker pools |
| Autoscaling on queue depth | Scale workers up when backlog grows; KEDA on Kubernetes |

For most apps: 2-3 worker processes, each with concurrency = (number of CPU cores), scale horizontally when queue depth grows. Start simple.

Don't put long jobs and short jobs on the same queue. A 30-minute report blocks the email worker behind it. Separate queues = separate failure domains.
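Autoscaling on queue depth boils down to a small calculation: how many workers are needed to keep the backlog per worker under a target, clamped to sane bounds. The sketch below is illustrative only; the function name, defaults, and target are assumptions, and tools like KEDA apply their own version of this logic.

```javascript
// Desired worker count for a given backlog.
// `jobsPerWorker` is the backlog one worker is expected to absorb.
function desiredWorkerCount(queueDepth, { jobsPerWorker = 100, min = 2, max = 20 } = {}) {
  const needed = Math.ceil(queueDepth / jobsPerWorker);
  return Math.max(min, Math.min(max, needed)); // clamp to [min, max]
}

desiredWorkerCount(0);      // 2  (never below min)
desiredWorkerCount(250);    // 3
desiredWorkerCount(10_000); // 20 (never above max)
```

The upper bound matters as much as the lower one: scaling workers without limit just moves the overload onto whatever the jobs call.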

Backpressure and Rate Limits

When a downstream is slow, your queue grows. Two responses:

  1. Limit upstream production — circuit-break or rate-limit at the API layer.
  2. Limit consumer concurrency — fewer workers means slower drain, but also less load on the downstream.

BullMQ's limiter and Sidekiq's throttling middleware let you cap jobs/second. Apply per-queue or per-job-class.
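To show what a jobs-per-window cap means in practice, here is a minimal fixed-window limiter. It is a single-process sketch with hypothetical names; library limiters coordinate the same idea across workers through the shared store.

```javascript
// Allow at most `max` jobs per `durationMs` window.
// `now` is injectable so the limiter can be tested with a fake clock.
function createRateLimiter({ max, durationMs, now = Date.now }) {
  let windowStart = now();
  let used = 0;
  return {
    // Returns 0 if a job may run now, otherwise the ms to wait
    // until the current window resets.
    tryAcquire() {
      const t = now();
      if (t - windowStart >= durationMs) {
        windowStart = t; // start a new window
        used = 0;
      }
      if (used < max) {
        used += 1;
        return 0;
      }
      return windowStart + durationMs - t;
    },
  };
}
```

In BullMQ the equivalent setting is the worker's limiter option (limiter: { max, duration }), so hand-rolled code like this is rarely needed there.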

Long Jobs

Jobs over ~30 seconds are suspect:

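  • Lock or visibility timeouts can expire mid-flight, at which point another worker may pick the job up while the first is still running. In BullMQ, set lockDuration with the longest job in mind, and remember that a longer lock also delays recovery after a crash.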
  • Crashes lose progress. Break into smaller jobs that can resume.
  • Dashboard UX gets weird — "running for 4 hours" looks like a stuck job.

For long work, either break into steps (each its own job) or move to workflow orchestration — see Jobs vs Workflows.
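Breaking a long job into steps can be as simple as processing one page per job and enqueueing a follow-up job that carries a cursor. The sketch below assumes a queue object with add(name, data) and two caller-supplied functions, fetchPage and handleRow; all names are hypothetical.

```javascript
// Process one page of work per job, then enqueue the next page.
// A crash loses at most one page of progress instead of the whole run.
async function processExportChunk(job, { queue, fetchPage, handleRow, pageSize = 500 }) {
  const cursor = job.data.cursor ?? 0;
  const rows = await fetchPage(cursor, pageSize);

  for (const row of rows) {
    await handleRow(row); // must be idempotent: a retried page repeats rows
  }

  const nextCursor = cursor + rows.length;
  if (rows.length === pageSize) {
    // A full page suggests more work: hand the cursor to a follow-up job.
    await queue.add('export-chunk', { ...job.data, cursor: nextCursor });
    return { done: false, nextCursor };
  }
  return { done: true, processed: nextCursor };
}
```

When the total is an exact multiple of pageSize, one final job runs with an empty page and then reports done, which is harmless.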

Stuck / Stalled Jobs

A worker process crashes mid-job. The job is "locked" but no longer being processed.

Most libraries handle this with lock expiry — after lockDuration, the job is reassigned. Tune carefully:

  • Too short: active workers' jobs get re-picked-up by other workers (concurrent execution!).
  • Too long: crashes leave jobs stuck for a while.

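BullMQ's default lockDuration (30s) is reasonable. If you change it, set lockDuration to ~2× the longest expected job time. Note that BullMQ workers renew the lock periodically while their event loop is responsive, so expiry mostly affects crashed workers and CPU-bound jobs that block the loop. Combine with idempotency: if a job runs twice due to lock expiry, that is then safe.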

Job Versioning

Jobs enqueued today may run tomorrow after a deploy. Backward compatibility matters:

async function processJob(job) {
  // Job data shape might be old version
  const data = migrateJobData(job.data);
  // ... process
}

function migrateJobData(data) {
  if (data.version === 1) {
    // Spread first so the old version field does not overwrite the new one
    return { ...data, version: 2, newField: deriveFromOld(data) };
  }
  return data;
}

For breaking changes, drain the old queue before deploying (set up a feature flag, stop enqueueing, wait for queue to empty, deploy).

Scheduled and Recurring Jobs

Cron jobs in your app code, not your OS:

// Daily summary at 06:00 UTC
await queue.add('daily-summary', {}, {
  repeat: { pattern: '0 6 * * *', tz: 'UTC' },
});

Benefits over OS cron:

  • Versioned with code. Cron schedule is in your repo.
  • Observability. Goes through your monitoring like any other job.
  • HA. Multiple workers can claim recurring jobs without duplication (the library coordinates via Redis / DB).

One caveat: a long-running scheduled job can overlap with its next run if the previous one hasn't finished. Libraries differ on whether they skip, queue, or start the new run, so verify the behavior of yours.

Job Inspection / Replay

Production debugging requires:

  • A dashboard showing live queues, in-flight jobs, recent failures.
  • Replay for failed jobs once you've fixed the bug.
  • Cancel for stuck or no-longer-needed jobs.

Bull Board, Sidekiq Web, and Flower (Celery) all provide this. Put them behind admin auth: they expose job data, which may include PII.

Security

  • Never put secrets in job data. Pass IDs; let the worker fetch from Secrets Management.
  • Don't log full job data if it contains PII.
  • Authenticate queue access — Redis ACLs / TLS / network isolation.
  • Audit destructive jobs — "delete user data" jobs should log who triggered them.
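For the logging point above, one option is to redact known-sensitive fields before job data reaches the logger. This is a minimal sketch assuming plain JSON job data; the function name and the default key list are illustrative and would need to match your own data model.

```javascript
const REDACTED = '[REDACTED]';
const DEFAULT_SENSITIVE_KEYS = new Set(['email', 'phone', 'ssn', 'token', 'password']);

// Return a deep copy of `value` with sensitive fields replaced.
// Key matching is case-insensitive; the input is never mutated.
function redactForLog(value, sensitiveKeys = DEFAULT_SENSITIVE_KEYS) {
  if (Array.isArray(value)) {
    return value.map((item) => redactForLog(item, sensitiveKeys));
  }
  if (value === null || typeof value !== 'object') {
    return value; // primitives pass through unchanged
  }
  return Object.fromEntries(
    Object.entries(value).map(([key, inner]) =>
      sensitiveKeys.has(key.toLowerCase())
        ? [key, REDACTED]
        : [key, redactForLog(inner, sensitiveKeys)],
    ),
  );
}
```

A deny-list like this only catches fields you already know about; logging just the job ID and name is the safer default when the payload shape is not tightly controlled.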

Common Pitfalls

| Pitfall | Symptom | Fix |
| --- | --- | --- |
| Non-idempotent jobs | Duplicate emails / charges on retry | Idempotency keys |
| Retry forever on bad input | DLQ piles up; metrics drown | UnrecoverableError for terminal failures |
| Jobs much longer than lockDuration | Concurrent execution | Increase lock duration; or break job up |
| Mixing slow + fast jobs on one queue | Latency-sensitive jobs blocked | Separate queues |
| Storing PII in job data | Logs leak | Pass IDs; fetch in worker |
| Long-lived Redis connections from short-lived workers | Connection churn | Connection pool; reuse |
| Workers in same process as web server | Web latency spikes when jobs are heavy | Separate worker processes |
| No metrics on queue depth | Silent backups | Alert on growing queue depth |
| Cron jobs only on one worker host | Single point of failure | Library-coordinated recurring jobs |
| Not propagating trace context | Disconnected traces | Propagate via job data |

Checklist

Production background jobs checklist

  • Every job is idempotent (or uses idempotency keys)
  • Retries with exponential backoff + jitter
  • Distinct retryable vs terminal errors
  • Dead-letter queue for unfixable jobs
  • Alert on dead-letter rate
  • Metrics: enqueue rate, complete rate, latency, duration, failure rate
  • Trace context propagated from enqueue to worker
  • Dashboard for queue inspection (behind auth)
  • Separate queues for different workload classes (fast / slow / priority)
  • Workers in dedicated processes; web servers don't run them
  • Autoscaling rules on queue depth
  • Job data does not contain secrets / unscrubbed PII
  • Recurring jobs defined in code; no OS cron
  • lockDuration tuned for longest expected job
  • Backward compatibility for job data shape
  • Runbook for "queue is backing up" / "many failures"
  • For multi-step workflows, evaluated graduation to orchestration
