Steven's Knowledge

Multi-Tenancy

Tenancy models (silo/pool/bridge), data isolation strategies, tenant context propagation, noisy neighbors, and per-tenant configuration and limits

A multi-tenant system serves many customers (tenants) from one deployment. Each tenant believes they have the application to themselves; in reality they share code, and often share databases and servers. The central engineering problem is isolation: tenant A must never see, touch, or starve tenant B — and the cost of guaranteeing that scales with how much you share.

This is an application-code concern first. The hard part is not provisioning servers; it is making sure every query, every cache key, and every log line carries the right tenant scope, automatically, so that one forgotten WHERE tenant_id = ? does not leak another customer's data.

Tenancy Models

There is a spectrum from "one stack per tenant" to "everything shared," usually named silo, bridge, and pool.

SILO                      BRIDGE                    POOL
(isolated)                (hybrid)                  (shared)

Tenant A → Stack A        Tenant A ┐                Tenant A ┐
Tenant B → Stack B        Tenant B ┼→ Shared app    Tenant B ┼→ Shared app
Tenant C → Stack C        Tenant C ┘  + separate    Tenant C ┘  + shared DB
                                       DB per tenant    (tenant_id column)

most isolation ─────────────────────────────────────→ most efficiency
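most cost      ←─────────────────────────────────────  least cost

| Model  | Isolation                             | Cost per tenant | Blast radius                 | Best for                                |
|--------|---------------------------------------|-----------------|------------------------------|-----------------------------------------|
| Silo   | Strongest: separate everything        | Highest         | One tenant                   | Regulated/enterprise; few large tenants |
| Bridge | Strong data isolation, shared compute | Medium          | One tenant's data            | Mixed customer sizes                    |
| Pool   | Logical only (enforced in code)       | Lowest          | All tenants (if a bug leaks) | Many small tenants; SaaS at scale       |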

Most SaaS products start pool (cheapest, fastest to ship) and peel high-value or compliance-bound tenants out into silo later. You can run different tiers simultaneously — a "tiered" or bridge approach where the same code path serves both.
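One way to keep a single code path across tiers is to resolve each tenant to a placement record and let the data layer branch on it. The following is a minimal sketch, assuming a hypothetical in-memory registry; `Placement`, `placements`, `SHARED_DB_URL`, and the URLs are illustrative names and values, not part of any library.

```typescript
// Hypothetical placement registry: which isolation tier each tenant runs on.
type Placement =
  | { kind: 'pool'; databaseUrl: string }  // shared DB, rows tagged with tenant_id
  | { kind: 'silo'; databaseUrl: string }; // dedicated DB for this tenant

const SHARED_DB_URL = 'postgres://shared.internal/app'; // assumed value

const placements = new Map<string, Placement>([
  ['bigcorp', { kind: 'silo', databaseUrl: 'postgres://bigcorp.internal/app' }],
]);

// Tenants without an explicit entry live in the shared pool.
function placementFor(tenantId: string): Placement {
  return placements.get(tenantId) ?? { kind: 'pool', databaseUrl: SHARED_DB_URL };
}

// Promoting a tenant changes the registry (after a data migration), not the code path.
function promoteToSilo(tenantId: string, databaseUrl: string): void {
  placements.set(tenantId, { kind: 'silo', databaseUrl });
}
```

Callers ask `placementFor(tenantId)` and never hard-code a tier, so moving a tenant from pool to silo becomes an operational step rather than a code change.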

Data Isolation Strategies

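This is where the model becomes concrete. There are three implementations, in order of increasing sharing. The silo/bridge/pool labels in the headings below describe the data layer only: a database per tenant behind a shared application tier is what the diagram above calls bridge, even though the data itself is siloed.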

1. Database per Tenant (Silo)

Each tenant gets a physically separate database. Routing picks the connection based on tenant.

function dbForTenant(tenantId: string): Pool {
  const connString = tenantRegistry.lookup(tenantId).databaseUrl;
  return getPool(connString); // pooled per tenant
}

  • Pro: Total isolation. A bad query cannot cross tenants. Easy per-tenant backup, restore, and even region placement.
  • Con: Hundreds of databases to migrate, monitor, and connection-pool. Migrations become a fan-out job. Connection count explodes.
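The connection-count problem can be softened by capping how many per-tenant pools stay open at once. Below is a sketch under stated assumptions: `TenantPoolCache` is a hypothetical helper, and the pool type is anything with a `close()` method rather than a specific driver's pool.

```typescript
// Anything with a close() method can stand in for a real connection pool.
interface Closable { close(): void }

class TenantPoolCache<P extends Closable> {
  // Map preserves insertion order, which is used here as least-recently-used order.
  private pools = new Map<string, P>();
  private readonly maxOpen: number;
  private readonly create: (tenantId: string) => P;

  constructor(maxOpen: number, create: (tenantId: string) => P) {
    if (maxOpen < 1) throw new Error('maxOpen must be at least 1');
    this.maxOpen = maxOpen;
    this.create = create;
  }

  get(tenantId: string): P {
    const existing = this.pools.get(tenantId);
    if (existing) {
      // Re-insert to mark this tenant as most recently used.
      this.pools.delete(tenantId);
      this.pools.set(tenantId, existing);
      return existing;
    }
    if (this.pools.size >= this.maxOpen) {
      // Evict the least recently used tenant and release its connections.
      const [oldestId, oldestPool] = this.pools.entries().next().value as [string, P];
      this.pools.delete(oldestId);
      oldestPool.close();
    }
    const pool = this.create(tenantId);
    this.pools.set(tenantId, pool);
    return pool;
  }

  get openCount(): number {
    return this.pools.size;
  }
}
```

With a real driver, `close()` would need to wait for in-flight queries before tearing the pool down; this sketch leaves that to the pool implementation.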

2. Schema per Tenant (Bridge)

One database, one schema (namespace) per tenant. Set the search path per request.

-- Postgres: switch the active schema for this connection
SET search_path TO tenant_8f2a;
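SELECT * FROM orders;   -- resolves to tenant_8f2a.orders

  • Pro: Strong logical isolation, one database to operate. Cheaper than database-per-tenant.
  • Con: Migrations still fan out across schemas. Thousands of schemas strain the catalog. The SET search_path must be bulletproof: forget it and you query the wrong (or public) schema. A plain SET also lasts for the whole session, so on a pooled connection it carries over to the next request unless you reset it or use SET LOCAL inside a transaction.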
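Because `SET` does not accept bind parameters, the schema name ends up interpolated into SQL text and has to be validated first. The sketch below assumes a `tenant_<id>` naming scheme and lowercase alphanumeric tenant ids; both are assumptions for illustration, as are the function names.

```typescript
// Derive the schema name from the tenant id, rejecting anything that is not a
// plain lowercase alphanumeric token. Assumed naming scheme: tenant_<id>.
function schemaForTenant(tenantId: string): string {
  if (!/^[a-z0-9]{1,48}$/.test(tenantId)) {
    throw new Error(`Invalid tenant id for schema lookup: ${JSON.stringify(tenantId)}`);
  }
  return `tenant_${tenantId}`;
}

// Statement to run at the start of each transaction. SET LOCAL reverts at
// COMMIT/ROLLBACK, so a pooled connection does not carry one tenant's
// search_path into the next request.
function searchPathStatement(tenantId: string): string {
  // The validation above already rules out quotes, so quoting here is safe.
  return `SET LOCAL search_path TO "${schemaForTenant(tenantId)}"`;
}
```

The allow-list pattern is deliberately strict: it is easier to loosen a validator later than to recover from an identifier injection.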

3. Shared Table + tenant_id (Pool)

All tenants share tables; every row carries a tenant_id. Cheapest, and the one with the sharpest knife.

CREATE TABLE orders (
  id          uuid PRIMARY KEY,
  tenant_id   uuid NOT NULL,
  customer_id uuid NOT NULL,
  total_cents bigint NOT NULL,
  created_at  timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX ON orders (tenant_id, created_at);  -- tenant_id leads EVERY index

The danger is obvious: a single query missing WHERE tenant_id = ? leaks or corrupts across tenants. You cannot rely on developers remembering. Two layers of defense:

Row-Level Security (RLS) — let the database enforce the filter so application bugs cannot bypass it:

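ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
-- Table owners bypass RLS unless it is forced; superusers and BYPASSRLS roles always bypass it.
ALTER TABLE orders FORCE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON orders
  USING (tenant_id = current_setting('app.tenant_id')::uuid);

-- Per request, inside a transaction (SET LOCAL has no effect outside one), set the current tenant:
SET LOCAL app.tenant_id = '8f2a...';
-- Now every query on orders from a role subject to RLS is scoped automatically. A forgotten WHERE is safe.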
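On the application side, the per-transaction setting is usually wrapped in a helper so that no caller can run a query without it. This is a sketch, not a definitive implementation: `QueryClient` and `withTenant` are hypothetical names, and the interface is meant to resemble the `query(text, values)` call found on clients such as node-postgres. It uses PostgreSQL's `set_config(name, value, is_local)` function, which behaves like `SET LOCAL` when the third argument is true and, unlike `SET`, accepts a bind parameter.

```typescript
// Minimal shape of a database client (assumed; adapt to your driver).
interface QueryClient {
  query(text: string, values?: unknown[]): Promise<unknown>;
}

// Run `work` inside a transaction that has app.tenant_id set for RLS policies.
async function withTenant<T>(
  client: QueryClient,
  tenantId: string,
  work: (client: QueryClient) => Promise<T>,
): Promise<T> {
  await client.query('BEGIN');
  try {
    // is_local = true scopes the setting to this transaction only.
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await work(client);
    await client.query('COMMIT');
    return result;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  }
}
```

Because the setting is transaction-local, it disappears on commit or rollback and cannot leak to the next request that borrows the same pooled connection.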

A repository/ORM layer that injects the filter — so no raw query escapes without scoping. RLS is the stronger guarantee because it holds even when someone bypasses the ORM with raw SQL.
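As an illustration of the repository layer, the sketch below builds parameterized SELECT statements in which the tenant predicate is always first and cannot be supplied or overridden by the caller. The names `scopedSelect`, `ScopedQuery`, and `TENANT_TABLES` are hypothetical, and the `$1`-style placeholders assume a PostgreSQL driver.

```typescript
interface ScopedQuery { text: string; values: unknown[] }

// Allow-list of tenant-scoped tables, so table names never come from user input.
const TENANT_TABLES = new Set(['orders', 'customers']); // illustrative list

function scopedSelect(
  table: string,
  tenantId: string,
  filters: Record<string, unknown> = {},
): ScopedQuery {
  if (!TENANT_TABLES.has(table)) throw new Error(`Unknown table: ${table}`);
  if (!tenantId) throw new Error('Refusing to build an unscoped query');

  // tenant_id is always the first predicate; callers cannot remove it.
  const clauses = ['tenant_id = $1'];
  const values: unknown[] = [tenantId];
  for (const [column, value] of Object.entries(filters)) {
    if (column === 'tenant_id') throw new Error('tenant_id is set by the repository');
    if (!/^[a-z_]+$/.test(column)) throw new Error(`Invalid column: ${column}`);
    values.push(value);
    clauses.push(`${column} = $${values.length}`);
  }
  return { text: `SELECT * FROM ${table} WHERE ${clauses.join(' AND ')}`, values };
}
```

A builder like this only protects queries that go through it, which is why it complements RLS rather than replacing it.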

If you choose shared-table pooling, treat RLS (or an equivalent unbypassable filter) as non-optional. The convenience of pooling is exactly what makes a single missing predicate catastrophic.

Tenant Context Propagation

Every request runs in the context of one tenant. That context has to flow from the edge all the way to the query — without being passed as an explicit argument through forty function calls.

Establish It at the Edge

Resolve the tenant once, from a trustworthy source, in middleware:

function tenantMiddleware(req, res, next) {
  // Source of truth, in order of trust:
  //   1. A claim in the validated JWT  (best — signed, can't be forged)
  //   2. A subdomain (acme.app.com)    (ok — but verify the user belongs to it)
  //   3. A header / path param         (only if independently authorized)
  const tenantId = req.auth?.claims?.tenant_id; // undefined (not a TypeError) if auth is missing
  if (!tenantId) return res.status(403).json({ error: { code: 'NO_TENANT' } });

  tenantContext.run({ tenantId }, () => next());  // bind for the rest of the request
}

Derive tenant from the authenticated identity, not from client-supplied input you haven't checked. An X-Tenant-Id header the client sets freely is an IDOR waiting to happen: a user from tenant A sends tenant B's id and reads their data.
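The trust ordering can be made explicit by treating the signed claim as the source of truth and using the subdomain only as a cross-check. The sketch below assumes the subdomain label equals the tenant id in the token; that mapping, and the names `resolveTenant`, `AuthClaims`, and `Resolution`, are assumptions for illustration.

```typescript
interface AuthClaims { sub: string; tenant_id: string }

type Resolution =
  | { ok: true; tenantId: string }
  | { ok: false; status: 401 | 403; code: string };

// `claims` must come from an already-verified token; `host` is the raw Host header.
function resolveTenant(
  claims: AuthClaims | undefined,
  host: string,
  baseDomain: string,
): Resolution {
  if (!claims) return { ok: false, status: 401, code: 'UNAUTHENTICATED' };
  if (!claims.tenant_id) return { ok: false, status: 403, code: 'NO_TENANT' };

  // If the request arrived on a tenant subdomain, it must match the signed claim.
  const suffix = '.' + baseDomain;
  const hostname = host.split(':')[0].toLowerCase();
  if (hostname.endsWith(suffix)) {
    const subdomain = hostname.slice(0, -suffix.length);
    if (subdomain !== claims.tenant_id) {
      return { ok: false, status: 403, code: 'TENANT_MISMATCH' };
    }
  }
  return { ok: true, tenantId: claims.tenant_id };
}
```

The client-controlled value (the host) can only cause a rejection; it never selects the tenant on its own.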

Carry It Implicitly

Use the runtime's request-scoped storage so deep code can read the tenant without it being threaded through every signature:

import { AsyncLocalStorage } from 'node:async_hooks';
export const tenantContext = new AsyncLocalStorage<{ tenantId: string }>();

// Deep in a repository — no tenantId parameter needed:
function currentTenant(): string {
  const ctx = tenantContext.getStore();
  if (!ctx) throw new Error('No tenant in context'); // fail closed, never default to "all"
  return ctx.tenantId;
}

Equivalents: Python contextvars, Go context.Context (passed explicitly — Go's idiom), Java ThreadLocal / Micrometer context. The rule everywhere: fail closed. Missing tenant context is an error, never "show everything."
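To see the propagation in action, the sketch below runs two overlapping "requests" and shows that each one reads its own tenant after an asynchronous hop, while code outside any request fails closed. It repeats the `tenantContext` and `currentTenant` definitions so the example is self-contained; `loadOrderLabel` and `demo` are hypothetical names.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

const tenantContext = new AsyncLocalStorage<{ tenantId: string }>();

function currentTenant(): string {
  const ctx = tenantContext.getStore();
  if (!ctx) throw new Error('No tenant in context'); // fail closed
  return ctx.tenantId;
}

// A "deep" function with no tenantId parameter.
async function loadOrderLabel(orderId: string): Promise<string> {
  await Promise.resolve(); // simulate an asynchronous hop such as I/O
  return `${currentTenant()}:${orderId}`; // still sees the caller's tenant
}

// Two overlapping requests, each bound to its own tenant.
function demo(): Promise<string[]> {
  return Promise.all([
    tenantContext.run({ tenantId: 'tenant-a' }, () => loadOrderLabel('o1')),
    tenantContext.run({ tenantId: 'tenant-b' }, () => loadOrderLabel('o2')),
  ]);
}
```

Here `demo()` resolves to `['tenant-a:o1', 'tenant-b:o2']`, and calling `currentTenant()` outside of `run` throws instead of returning a default.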

Don't Forget the Other Caches

Tenant scope leaks through everything that holds data, not just the primary database:

  • Cache keys must include the tenant — cache.get(tenantId + ":user:" + id). A shared key serves tenant A's data to tenant B.
  • Background jobs lose request context — pass tenantId in the job payload and re-establish context in the worker.
  • Logs and traces should carry tenant_id as a structured field for debugging and per-tenant analysis.
  • Search indexes / object storage need a tenant prefix or filter, same as the database.
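The first two bullets above can be sketched as a cache wrapper that builds keys itself and a job envelope that carries the tenant across the queue boundary. `TenantCache`, `makeJob`, and `runJob` are hypothetical helpers, the in-memory `Map` stands in for a real cache, and the key format assumes tenant ids contain no `:` (true for UUIDs).

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

const tenantContext = new AsyncLocalStorage<{ tenantId: string }>();

function currentTenant(): string {
  const ctx = tenantContext.getStore();
  if (!ctx) throw new Error('No tenant in context'); // fail closed
  return ctx.tenantId;
}

// Callers never build keys by hand, so the tenant prefix cannot be forgotten.
class TenantCache<V> {
  private store = new Map<string, V>();
  private key(k: string): string { return `${currentTenant()}:${k}`; }
  get(k: string): V | undefined { return this.store.get(this.key(k)); }
  set(k: string, v: V): void { this.store.set(this.key(k), v); }
}

// The tenant id is captured at enqueue time and travels with the payload.
interface Job<P> { tenantId: string; payload: P }

function makeJob<P>(payload: P): Job<P> {
  return { tenantId: currentTenant(), payload };
}

// Worker side: re-establish the context before running the handler.
function runJob<P, R>(job: Job<P>, handler: (payload: P) => R): R {
  return tenantContext.run({ tenantId: job.tenantId }, () => handler(job.payload));
}
```

Both helpers read the tenant from context rather than from an argument, so a call made outside any tenant scope throws instead of touching shared state.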

Noisy Neighbors

In pooled models, tenants share finite resources: CPU, connection pool slots, queue capacity, rate-limit budget. One heavy tenant — a bulk import, a runaway report, an abusive API client — can degrade everyone else. This is the noisy-neighbor problem.

Defenses, applied per tenant rather than globally:

| Mechanism                   | What it bounds         | Notes                                                            |
|-----------------------------|------------------------|------------------------------------------------------------------|
| Per-tenant rate limits      | Requests/sec per tenant | Reuse the token-bucket from Resilience, keyed by tenantId        |
| Per-tenant concurrency caps | In-flight expensive ops | A bulkhead per tenant stops one from eating the pool             |
| Query cost / row limits     | Runaway reads          | Cap result-set size; statement timeouts                          |
| Fair queuing                | Background job hogging | Round-robin or weighted scheduling across tenants, not pure FIFO |
| Tier-based quotas           | Plan-level fairness    | Free tier gets less; enterprise gets reserved capacity           |

// Per-tenant rate limiting: each tenant gets an independent bucket
const tenantLimiter = new Map<string, TokenBucket>();

function limiterFor(tenantId: string): TokenBucket {
  if (!tenantLimiter.has(tenantId)) {
    tenantLimiter.set(tenantId, new TokenBucket(/* capacity */ 100, /* refill */ 10));
  }
  return tenantLimiter.get(tenantId)!;
}
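The limiter above relies on a `TokenBucket` defined elsewhere. For readers without that code at hand, here is a stand-in sketch with the same `(capacity, refill)` constructor shape; the `tryRemove` method name and the injectable clock are assumptions of this sketch, not the Resilience implementation.

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSec: number;
  private readonly now: () => number;

  // `now` is injectable so the bucket can be tested without real time passing.
  constructor(capacity: number, refillPerSec: number, now: () => number = () => Date.now()) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.now = now;
    this.tokens = capacity; // start full
    this.lastRefillMs = now();
  }

  // Returns true and consumes `cost` tokens if the request is allowed.
  tryRemove(cost = 1): boolean {
    const nowMs = this.now();
    const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
    // Refill lazily based on elapsed time, never exceeding capacity.
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefillMs = nowMs;
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}
```

A request handler would call `limiterFor(tenantId).tryRemove()` and reject the request when it returns false. In a long-running process the per-tenant map also needs eviction for idle tenants, or it grows without bound.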

The structural fix for chronic noisy neighbors is to promote the heavy tenant to a silo — give them their own database or stack so their load cannot touch anyone else. That is the whole point of the silo/pool spectrum: it is a dial you turn per tenant as their value and demands grow.
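Short of moving a tenant to a silo, the fair-queuing row in the table above can be sketched as a round-robin scheduler over per-tenant FIFO queues. `FairQueue` is a hypothetical in-memory structure; a production queue would also need persistence and weighting by plan tier.

```typescript
class FairQueue<J> {
  private queues = new Map<string, J[]>(); // one FIFO per tenant
  private order: string[] = [];            // rotation of tenants with pending jobs

  enqueue(tenantId: string, job: J): void {
    let q = this.queues.get(tenantId);
    if (!q) {
      q = [];
      this.queues.set(tenantId, q);
      this.order.push(tenantId); // tenant joins the back of the rotation
    }
    q.push(job);
  }

  // Take one job from the tenant at the front of the rotation, then move
  // that tenant to the back if it still has work.
  dequeue(): J | undefined {
    const tenantId = this.order.shift();
    if (tenantId === undefined) return undefined;
    const q = this.queues.get(tenantId)!;
    const job = q.shift();
    if (q.length > 0) this.order.push(tenantId);
    else this.queues.delete(tenantId);
    return job;
  }
}
```

With three jobs queued for tenant A and one for tenant B, the dequeue order is A, B, A, A: the bulk importer no longer delays the small tenant behind its whole backlog.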

Per-Tenant Configuration & Limits

Tenants are not identical. They have different feature flags, branding, quotas, and integrations. Resolve this configuration through the same tenant context, with a sensible default-and-override layering:

interface TenantConfig {
  features: { advancedReports: boolean; sso: boolean };
  limits:   { maxUsers: number; maxStorageGb: number; apiRateLimit: number };
  branding: { logoUrl?: string; primaryColor?: string };
}

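function resolveConfig(tenantId: string): TenantConfig {
  // Assumes the tenant registry record also carries the tenant's plan.
  const tenant = tenantRegistry.lookup(tenantId);
  // Layer: global defaults  <  plan tier defaults  <  per-tenant overrides (later arguments win).
  // Start from {} in case merge mutates its first argument, as lodash's merge does.
  return merge({}, GLOBAL_DEFAULTS, planDefaults(tenant.plan), tenantOverrides(tenantId));
}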

Enforce limits where the resource is consumed, and return a clear, tenant-aware error when a tenant hits a quota: 403 with a code like PLAN_LIMIT_EXCEEDED for plan quotas (429 for rate limits), not a generic failure. Cache resolved config, but scope the cache key by tenant and invalidate it when that tenant's plan or overrides change.
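The layering and the limit check can be sketched end to end as follows. The default values, `mergeConfig`, `ConfigOverride`, `PlanLimitError`, and `assertCanAddUser` are all illustrative assumptions; the merge is written out per section instead of relying on a generic deep-merge helper.

```typescript
interface TenantConfig {
  features: { advancedReports: boolean; sso: boolean };
  limits: { maxUsers: number; maxStorageGb: number; apiRateLimit: number };
  branding: { logoUrl?: string; primaryColor?: string };
}

// Overrides may specify any subset of each section.
type ConfigOverride = { [K in keyof TenantConfig]?: Partial<TenantConfig[K]> };

const GLOBAL_DEFAULTS: TenantConfig = {
  features: { advancedReports: false, sso: false },
  limits: { maxUsers: 5, maxStorageGb: 1, apiRateLimit: 10 }, // assumed values
  branding: {},
};

// Later layers win, section by section. Inputs are never mutated.
function mergeConfig(base: TenantConfig, ...layers: ConfigOverride[]): TenantConfig {
  return layers.reduce<TenantConfig>(
    (acc, layer) => ({
      features: { ...acc.features, ...layer.features },
      limits: { ...acc.limits, ...layer.limits },
      branding: { ...acc.branding, ...layer.branding },
    }),
    base,
  );
}

class PlanLimitError extends Error {
  readonly status = 403;
  readonly code = 'PLAN_LIMIT_EXCEEDED';
  constructor(tenantId: string, limit: string, max: number) {
    super(`Tenant ${tenantId} reached ${limit} (max ${max})`);
  }
}

// Enforce the limit at the point where the resource is consumed.
function assertCanAddUser(tenantId: string, config: TenantConfig, currentUsers: number): void {
  if (currentUsers >= config.limits.maxUsers) {
    throw new PlanLimitError(tenantId, 'maxUsers', config.limits.maxUsers);
  }
}
```

A plan layer such as `{ limits: { maxUsers: 50 } }` followed by a tenant override of `{ limits: { maxUsers: 200 } }` yields 200 users while every untouched field keeps its global default.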

Decision Tree

Regulated data, or a few large enterprise tenants?
  → Silo (database/stack per tenant). Pay for isolation.

Many small tenants, cost-sensitive, shipping fast?
  → Pool (shared tables + tenant_id) WITH row-level security enforced in the DB.

Mixed: most tenants small, a few demand isolation?
  → Bridge / tiered. Pool the small ones, silo the big ones. Same codebase.

Resolving which tenant a request belongs to?
  → From the signed JWT/identity, never an unverified client header. Fail closed.

One tenant degrading others?
  → Per-tenant rate limits + bulkheads first; promote chronic offenders to a silo.

Checklist

  • The isolation model (silo/bridge/pool) is a deliberate choice, not an accident of the first table you created.
  • If pooled, the database enforces tenant scope (RLS or an equivalent unbypassable filter) — not just application code.
  • tenant_id leads every index on shared tables.
  • Tenant is resolved from authenticated identity, never unverified client input.
  • Tenant context propagates implicitly (request-scoped storage) and fails closed when missing.
  • Cache keys, background-job payloads, logs, and search indexes all carry tenant scope.
  • Per-tenant rate limits and concurrency caps protect against noisy neighbors.
  • Per-tenant config layers global defaults → plan tier → tenant overrides, and limit violations return a clear, tenant-aware error.
