Best Practices

Job systems eat real money — failed jobs lose work, runaway jobs burn compute, missed jobs drop user-visible features. These habits keep them reliable.

Idempotency Is Mandatory

Assume every job will eventually run more than once. Crashes, network blips, retries, and manual replays are all routine, so design every job to be safe to repeat.

The classic mistake:

// BAD — retry sends two emails
async function processJob({ orderId }) {
  await sendEmail(`Order ${orderId} placed!`);
}

The fix is idempotency keys:

async function processJob({ orderId }) {
  // Lock per order: if already processed, skip
  const handle = await acquireOnceLock(`send-order-email:${orderId}`);
  if (handle.alreadyDone) return handle.previousResult;

  try {
    await sendEmail(`Order ${orderId} placed!`);
    return await handle.complete({ sent: true });
  } catch (err) {
    await handle.release();
    throw err;
  }
}

The lock is per-business-operation. Use a database table with a unique constraint or a Redis SETNX. Most jobs are naturally idempotent if you key on a business ID (order:42:send-email) and check before doing the side-effect.
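To make the shape of that lock concrete, here is a minimal in-memory sketch. `acquireOnceLock` is not a library function; this version is a hypothetical stand-in that uses a `Map` where production code would rely on a unique-constrained table row or a Redis key, and it simply throws when another run already holds the lock so that the queue retries later.

```javascript
// Hypothetical in-memory stand-in for a table with a unique key column.
// Values are { state: 'running' } or { state: 'done', result }.
const onceLocks = new Map();

async function acquireOnceLock(key) {
  const existing = onceLocks.get(key);
  if (existing && existing.state === 'done') {
    // A previous run finished: hand back its stored result instead of redoing work.
    return { alreadyDone: true, previousResult: existing.result };
  }
  if (existing) {
    // Another run holds the lock; throwing lets the queue retry later.
    throw new Error(`operation ${key} is already in progress`);
  }
  onceLocks.set(key, { state: 'running' });
  return {
    alreadyDone: false,
    // Record success so later runs short-circuit.
    async complete(result) {
      onceLocks.set(key, { state: 'done', result });
      return result;
    },
    // Give the lock back after a failure so a retry can run.
    async release() {
      onceLocks.delete(key);
    },
  };
}
```

A real implementation would also need an expiry on the "running" state, so that a worker crash between acquire and complete does not block the operation forever.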

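External APIs: pass an idempotency key when the provider supports one. Stripe, for example, accepts an Idempotency-Key header; other providers such as Plaid and Twilio may use different header or field names, or support it only on some endpoints, so check their docs. Same key → same result, even if the request runs twice.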

Retries: Exponential with Jitter

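Exponential backoff is a good default. Adding jitter (a random offset on each delay) prevents a thundering herd of jobs that failed together from all retrying at the same instant. A plain exponential configuration:

attempts: 5,
backoff: {
  type: 'exponential',
  delay: 1000,   // BullMQ: retries wait 1s, 2s, 4s, 8s (5 attempts = 4 retries)
},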

To ride out long outages (for example, an external API being down), raise the attempt count and cap the delay so it does not grow into hours or days:

attempts: 20,
backoff: {
  type: 'exponential',
  delay: 1000,
  // cap at e.g. 60s; depends on library
},
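The configurations above add neither jitter nor a cap by themselves. As a minimal sketch of what a delay function with both might look like, the helper below uses "equal jitter" (half the delay fixed, half random). The name `backoffWithJitter` and its parameters are illustrative assumptions, not part of any library. Libraries that accept a custom backoff function (BullMQ documents a `backoffStrategy` worker setting) could call something like it, but check the docs for your version.

```javascript
// Delay before the next retry: exponential growth, capped, with equal jitter.
// attemptsMade is 1 after the first failure, 2 after the second, and so on.
// `random` is injectable so the function can be tested deterministically.
function backoffWithJitter(attemptsMade, baseMs = 1000, capMs = 60000, random = Math.random) {
  const exponential = baseMs * 2 ** (attemptsMade - 1);
  const capped = Math.min(capMs, exponential);
  const half = capped / 2;
  // Keep half the delay fixed and randomize the other half,
  // so retries spread out but never fire immediately.
  return Math.round(half + random() * half);
}
```

With the defaults, the first retry waits between 0.5s and 1s, and every retry from the seventh onward waits between 30s and 60s.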

Differentiate retryable from terminal errors:

import { UnrecoverableError } from 'bullmq';

async function processJob(job) {
  try {
    await callExternalAPI();
  } catch (err) {
    if (err.status === 400 || err.status === 404) {
      // Don't retry: the input is wrong, so another attempt cannot succeed
      throw new UnrecoverableError(err.message);
    }
    throw err;   // anything else is treated as transient and retried
  }
}

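BullMQ has UnrecoverableError, which fails the job immediately without further attempts. Sidekiq's sidekiq_retries_exhausted hook is different: it runs only after retries have run out, so it suits cleanup and dead-letter handling. To stop retrying early in Sidekiq, recent versions let the sidekiq_retry_in block return :kill or :discard (check your version). Use these to avoid retrying things that will never succeed.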

Dead-Letter Queues

When retries exhaust, the job goes to a "dead" state. Two questions:

Where does it go? A separate queue / table you can inspect.
What triggers action? Alerting on dead-letter rate; manual or automated replay.

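worker.on('failed', async (job, err) => {
  if (!job) return;   // the job may no longer exist by the time the event fires

  // Dead either because attempts ran out or because the error is terminal
  // (UnrecoverableError, imported from 'bullmq', fails before attempts are used up)
  const exhausted = job.attemptsMade >= (job.opts.attempts ?? 1);
  const terminal = err instanceof UnrecoverableError;

  if (exhausted || terminal) {
    await deadLetterQueue.add('dead', {
      originalQueue: job.queueName,
      originalJobName: job.name,
      data: job.data,
      error: { message: err.message, stack: err.stack },
      failedAt: new Date().toISOString(),
    });

    metrics.increment('jobs.dead_lettered', { queue: job.queueName });
  }
});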

Alert on dead-letter growth — a sudden spike means something downstream is broken.

Observability

Metric	Why
Jobs enqueued / sec	Health of producers
Jobs completed / sec	Health of workers
Job latency (enqueue → start)	Queue depth / worker capacity
Job duration	Slow jobs; alert on regressions
Failure rate per job type	Identify problematic jobs
Dead-letter rate	Things that never succeed
Worker concurrency utilization	Capacity planning

Most job libraries ship Prometheus exporters or have third-party ones — pipe into Prometheus & Grafana.
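If you compute latency and duration yourself, a sketch like the following may help. The field names follow BullMQ's Job object (`timestamp`, `processedOn`, `finishedOn`, all in milliseconds); the helper names `jobTimings` and `percentile` are illustrative assumptions. For delayed jobs, `timestamp` is the creation time, so the latency figure includes the intended delay.

```javascript
// Derive queue latency and run duration from a job's millisecond timestamps.
function jobTimings({ timestamp, processedOn, finishedOn }) {
  return {
    latencyMs: processedOn - timestamp,      // enqueue → start
    durationMs: finishedOn - processedOn,    // start → finish
  };
}

// Nearest-rank percentile (p between 0 and 100) over a non-empty array.
// Useful for alerting on p95 duration rather than the average.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}
```

Percentiles tend to surface regressions that averages hide, because a handful of very slow jobs barely moves the mean.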

Trace every job — propagate the trace context from the enqueue site into the worker so a job becomes part of the parent request's trace. See Tracing.

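import { SpanKind } from '@opentelemetry/api';

// Enqueue: attach the caller's trace context to the job payload
await queue.add('process', {
  ...data,
  _traceContext: getCurrentTraceContext(),
});

// Worker: restore the context and run the job inside a consumer span
const worker = new Worker('queue', async (job) => {
  const ctx = contextFromTraceContext(job.data._traceContext);
  await tracer.startActiveSpan('process-job', { kind: SpanKind.CONSUMER }, ctx, async (span) => {
    try {
      // ...
    } finally {
      span.end();   // a span is only exported once it has ended
    }
  });
});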

Scaling Workers

Pattern	Notes
Multiple worker processes	Each consumes from the same queue; trivial horizontal scaling
Concurrency per worker	Each process handles N jobs in parallel (`concurrency: 5`)
Separate queues per workload class	"Slow" queue with low concurrency, "fast" queue with high
Dedicated workers per queue type	Pin specific queues to specific worker pools
Autoscaling on queue depth	Scale workers up when backlog grows; KEDA on Kubernetes

For most apps, start with 2-3 worker processes, each with concurrency set to roughly the number of CPU cores (I/O-bound jobs can usually run higher), and scale horizontally when queue depth grows. Start simple.

Don't put long jobs and short jobs on the same queue. A 30-minute report blocks the email worker behind it. Separate queues = separate failure domains.

Backpressure and Rate Limits

When a downstream is slow, your queue grows. Two responses:

Limit upstream production — circuit-break or rate-limit at the API layer.
Limit consumer concurrency — fewer workers means slower drain, but also less load on the downstream.

BullMQ's limiter and Sidekiq's throttling middleware let you cap jobs/second. Apply per-queue or per-job-class.
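To illustrate what a jobs-per-window cap does, here is a minimal fixed-window limiter. It mirrors the "max jobs per duration" shape that BullMQ's worker `limiter` option uses (`max`, `duration`), but `createLimiter` itself is a hypothetical helper, and it is local to one process, whereas a library limiter coordinates across workers.

```javascript
// Fixed-window limiter: allow at most `max` jobs per `durationMs` window.
// `now` is injectable for testing; it defaults to the wall clock.
function createLimiter({ max, durationMs }, now = Date.now) {
  let windowStart = now();
  let used = 0;
  return {
    // Returns true if a job may start now, false if the caller should wait.
    tryAcquire() {
      const t = now();
      if (t - windowStart >= durationMs) {
        // The previous window is over: start a fresh one.
        windowStart = t;
        used = 0;
      }
      if (used < max) {
        used += 1;
        return true;
      }
      return false;
    },
  };
}
```

A fixed window can let through up to twice `max` jobs around a window boundary; a sliding window or token bucket smooths that out at the cost of a little more state.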

Long Jobs

Jobs over ~30 seconds are suspect:

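Lock expiry can hand them to another worker mid-flight. BullMQ renews the lock automatically while the worker's event loop stays responsive, so the jobs most at risk are CPU-bound ones that block the loop; raise lockDuration for those, knowing that a longer lock also delays recovery after a crash.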
Crashes lose progress. Break into smaller jobs that can resume.
Dashboard UX gets weird — "running for 4 hours" looks like a stuck job.

For long work, either break into steps (each its own job) or move to workflow orchestration — see Jobs vs Workflows.
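Breaking a long job into steps can be as simple as planning one small job per slice of the input. The sketch below assumes the work is a list of item IDs; `planChunkJobs` and the job name `process-chunk` are hypothetical. The `{ name, data }` shape it returns is the kind of array that bulk-enqueue calls such as BullMQ's `queue.addBulk` accept.

```javascript
// Split one long job into small chunk jobs.
// Each payload carries its own slice, so a crash loses at most one chunk
// and the remaining chunks are unaffected.
function planChunkJobs(itemIds, chunkSize) {
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError('chunkSize must be a positive integer');
  }
  const jobs = [];
  for (let start = 0; start < itemIds.length; start += chunkSize) {
    jobs.push({
      name: 'process-chunk',   // hypothetical job name
      data: {
        ids: itemIds.slice(start, start + chunkSize),
        chunkIndex: jobs.length,   // 0-based position, handy for progress tracking
      },
    });
  }
  return jobs;
}
```

Pick a chunk size so that each chunk finishes well inside the ~30 second guideline above.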

Stuck / Stalled Jobs

A worker process crashes mid-job. The job is "locked" but no longer being processed.

Most libraries handle this with lock expiry — after lockDuration, the job is reassigned. Tune carefully:

Too short: jobs still running on a live worker get picked up by another worker (concurrent execution!).
Too long: crashes leave jobs stuck for a while.

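BullMQ's default lockDuration is 30s, and a healthy worker renews the lock automatically, so the default suits most jobs. If a job can block the event loop for long stretches, set lockDuration to ~2× the longest expected blocking time. Combine with idempotency, so that a job running twice after lock expiry is safe.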

Job Versioning

Jobs enqueued today may run tomorrow after a deploy. Backward compatibility matters:

async function processJob(job) {
  // Jobs enqueued before a deploy may carry an older data shape
  const data = migrateJobData(job.data);
  // ... process
}

function migrateJobData(data) {
  if (data.version === 1) {
    // Spread first so the new version number overrides the old one
    return { ...data, version: 2, newField: deriveFromOld(data) };
  }
  return data;
}

For breaking changes, drain the old queue before deploying (set up a feature flag, stop enqueueing, wait for queue to empty, deploy).
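Once there are several versions in flight, it can help to chain single-step migrations so each one only knows about its immediate predecessor. The following is a sketch under assumed field names (`currency`, `amount`, `amountCents` are made up for illustration), and it treats payloads with no `version` field as version 1.

```javascript
const CURRENT_VERSION = 3;

// One function per step: migrations[n] upgrades version n to n + 1.
// The fields touched here are hypothetical examples.
const migrations = {
  1: (d) => ({ ...d, version: 2, currency: 'USD' }),
  2: (d) => ({ ...d, version: 3, amountCents: Math.round(d.amount * 100) }),
};

function migrateToCurrent(data) {
  // Payloads written before versioning existed are assumed to be version 1.
  let current = { ...data, version: data.version ?? 1 };
  while (current.version < CURRENT_VERSION) {
    const step = migrations[current.version];
    if (!step) throw new Error(`no migration from version ${current.version}`);
    current = step(current);
  }
  return current;
}
```

Old steps can be deleted once no jobs of that version can still be in the queue, which is easier to reason about if the enqueue side always writes the current version.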

Scheduled and Recurring Jobs

Define cron schedules in your app code, not in the OS crontab:

// Daily summary at 06:00 UTC
await queue.add('daily-summary', {}, {
  repeat: { pattern: '0 6 * * *', tz: 'UTC' },
});

Benefits over OS cron:

Versioned with code. Cron schedule is in your repo.
Observability. Goes through your monitoring like any other job.
HA. Multiple workers can claim recurring jobs without duplication (the library coordinates via Redis / DB).

One caveat: long-running scheduled jobs can collide if the previous run hasn't finished. Some libraries skip or delay the new run and others start it regardless, so verify the behavior of yours.
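If your library does not prevent overlapping runs, a guard around the handler is one option. The sketch below keeps the flag in process memory, so it only protects a single worker process; with several processes the flag would have to become a shared lock (for example in Redis or the database). `skipIfRunning` is a hypothetical helper name.

```javascript
// Wrap a job handler so a new run is skipped while the previous one is active.
function skipIfRunning(handler) {
  let running = false;
  return async (...args) => {
    if (running) return { skipped: true };
    running = true;
    try {
      return { skipped: false, result: await handler(...args) };
    } finally {
      // Clear the flag on success and on failure, so one error
      // does not block every future run.
      running = false;
    }
  };
}
```

Skipped runs are worth counting in your metrics: a steady stream of them means the job regularly takes longer than its schedule interval.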

Job Inspection / Replay

Production debugging requires:

A dashboard showing live queues, in-flight jobs, recent failures.
Replay for failed jobs once you've fixed the bug.
Cancel for stuck or no-longer-needed jobs.

Bull Board, Sidekiq Web, and Flower (Celery) all provide this. Put them behind admin auth: they expose job data, which may include PII.

Security

Never put secrets in job data. Pass IDs; let the worker fetch from Secrets Management.
Don't log full job data if it contains PII.
Authenticate queue access — Redis ACLs / TLS / network isolation.
Audit destructive jobs — "delete user data" jobs should log who triggered them.
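For the logging point, one approach is to mask known-sensitive keys before job data reaches the logger. This is a sketch for plain JSON-like payloads; the key list and the name `redactForLog` are illustrative assumptions, and a real list should come from your own schema.

```javascript
// Keys whose values must never appear in logs (illustrative list).
const SENSITIVE_KEYS = new Set(['email', 'phone', 'address', 'ssn']);

// Return a copy of job data that is safe to log: values under sensitive
// keys are masked at any nesting depth; everything else is kept as-is.
function redactForLog(value) {
  if (Array.isArray(value)) return value.map(redactForLog);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) =>
        SENSITIVE_KEYS.has(key) ? [key, '[REDACTED]'] : [key, redactForLog(inner)]),
    );
  }
  return value;
}
```

A deny-list like this fails open when a new sensitive field is added, so logging only an explicit allow-list of fields (job ID, job name, business IDs) is the stricter alternative.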

Common Pitfalls

Pitfall	Symptom	Fix
Non-idempotent jobs	Duplicate emails / charges on retry	Idempotency keys
Retry forever on bad input	DLQ piles up; metrics drown	`UnrecoverableError` for terminal failures
Jobs much longer than `lockDuration`	Concurrent execution	Increase lock duration; or break job up
Mixing slow + fast jobs on one queue	Latency-sensitive jobs blocked	Separate queues
Storing PII in job data	Logs leak	Pass IDs; fetch in worker
Opening a new Redis connection per job or per short-lived worker	Connection churn	Reuse a shared connection or pool
Workers in same process as web server	Web latency spikes when jobs are heavy	Separate worker processes
No metrics on queue depth	Silent backups	Alert on growing queue depth
Cron jobs only on one worker host	Single point of failure	Library-coordinated recurring jobs
Not propagating trace context	Disconnected traces	Propagate via job data

Checklist
