Steven's Knowledge

Data Contracts

Producer/consumer data contracts, schema registry, schema evolution and compatibility (backward/forward), Avro/Protobuf/JSON Schema, contract enforcement in CI, breaking-change prevention, and data ownership.

Data Contracts

Most data incidents are not caused by bad code in the pipeline. They are caused by an upstream team renaming a column, changing a type, or dropping a field they did not know anyone depended on. A data contract is an explicit, versioned, enforceable agreement between the producer of data and its consumers - the API contract idea (see API Design) applied to data.

This page covers what goes into a contract, the serialization formats and schema registries that encode it, how schema evolution works, and how to enforce contracts in CI so breaking changes are blocked before they ship. It is the structural complement to Data Quality: quality checks values, contracts govern structure and meaning.

The Problem Contracts Solve

Without contracts, the boundary between a producer and a consumer is implicit. The producer treats their table or topic as a private internal detail; the consumer builds critical logic on top of it; neither knows about the other. Then the producer ships a "harmless" refactor:

-- Producer's migration, Tuesday afternoon
ALTER TABLE orders RENAME COLUMN total TO order_total;
ALTER TABLE orders ALTER COLUMN status TYPE int USING status_to_code(status);

Wednesday morning, three dashboards are blank, a fraud model is scoring on nulls, and nobody connected the change to the breakage. The producer never knew the consumers existed. A data contract makes that dependency explicit and the change impossible to ship silently.

Anatomy of a Data Contract

A contract is a machine-readable document describing the shape, semantics, and guarantees of a dataset.

# contracts/orders.yml
dataset: production.orders
version: 2.1.0
owner: orders-team
description: One row per customer order, emitted on order completion.

schema:
  - name: order_id
    type: string
    required: true
    unique: true
    description: UUID, stable for the life of the order
  - name: customer_id
    type: string
    required: true
    description: FK to customers.customer_id
  - name: order_total
    type: decimal(12,2)
    required: true
    constraints: [">= 0"]
    description: Total in NZD including tax
  - name: status
    type: string
    required: true
    enum: [pending, shipped, delivered, cancelled]

guarantees:
  freshness: "available within 15 minutes of order completion"
  uniqueness: "order_id is unique"
  retention: "90 days"

consumers:
  - team: finance
    usage: daily revenue reporting
  - team: ml-platform
    usage: fraud model features

compatibility: backward   # how this schema may evolve (see below)

The contract names: the schema (fields, types, constraints), the semantic meaning of each field, operational guarantees (freshness, uniqueness, retention), the known consumers, and - critically - the compatibility mode that governs how it may change.

Serialization Formats

A contract needs a concrete encoding for the data on the wire. Three formats dominate, especially in streaming (see Stream Processing).

FormatEncodingSchema locationStrength
AvroBinary, compactSeparate .avsc, registryRich evolution rules, streaming-native
ProtobufBinary, very compact.proto definitionSpeed, strong typing, gRPC
JSON SchemaText (JSON).json schema docHuman-readable, web-friendly

Avro

Avro is the default for Kafka-based pipelines. The schema is separate from the data, evolution rules are well-defined, and it integrates tightly with schema registries.

{
  "type": "record",
  "name": "Order",
  "namespace": "com.company.orders",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "customer_id", "type": "string"},
    {"name": "order_total", "type": {"type": "bytes", "logicalType": "decimal", "precision": 12, "scale": 2}},
    {"name": "status", "type": {"type": "enum", "name": "Status",
        "symbols": ["pending", "shipped", "delivered", "cancelled"]}},
    {"name": "promo_code", "type": ["null", "string"], "default": null}
  ]
}

Note promo_code: a nullable union with a default. This is the pattern that makes a field safe to add - old readers that do not know about it still work, because there is a default.

Protobuf

Protobuf is the choice when you want maximum performance and strong typing, often alongside gRPC services. Field numbers (not names) are the wire identity, which gives it clean evolution semantics.

syntax = "proto3";
package company.orders;

message Order {
  string order_id    = 1;
  string customer_id = 2;
  int64  total_cents = 3;   // integer cents avoids float rounding
  Status status      = 4;
  optional string promo_code = 5;  // new field; old readers ignore #5

  enum Status {
    PENDING   = 0;
    SHIPPED   = 1;
    DELIVERED = 2;
    CANCELLED = 3;
  }
}

JSON Schema

JSON Schema wins on readability and is ideal for HTTP/event payloads where humans and web tooling read the data directly.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Order",
  "type": "object",
  "required": ["order_id", "customer_id", "order_total", "status"],
  "properties": {
    "order_id":    {"type": "string", "format": "uuid"},
    "customer_id": {"type": "string"},
    "order_total": {"type": "number", "minimum": 0},
    "status":      {"type": "string", "enum": ["pending", "shipped", "delivered", "cancelled"]},
    "promo_code":  {"type": ["string", "null"]}
  },
  "additionalProperties": false
}

Schema Registry

A schema registry is a central service that stores schemas, assigns versions, and enforces compatibility rules at write time. Producers register their schema; the registry rejects a new version that would break the configured compatibility. This moves enforcement from "hope nobody breaks it" to "the system will not let you break it."

# Confluent Schema Registry: register and produce with schema validation
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})

# Registering an incompatible new version raises an error here -
# the registry enforces the subject's compatibility mode before accepting it.
serializer = AvroSerializer(registry, schema_str=order_schema_v2)

# Each message carries a schema ID; consumers fetch the matching schema
# from the registry to deserialize. Producer and consumer never share code.
producer.produce(topic="orders", value=serializer(order, ctx))

The registry stores schemas under a subject (usually one per topic), versions them, and is the single source of truth both producers and consumers resolve against.

Schema Evolution and Compatibility

Schemas must change - businesses add features. The art is changing them without breaking consumers. Compatibility modes define which changes are allowed.

ModeGuaranteesYou can safely...Upgrade order
BackwardNew schema reads old dataAdd optional fields, delete fieldsConsumers first
ForwardOld schema reads new dataAdd fields, delete optional fieldsProducers first
FullBoth backward and forwardAdd/remove optional fields onlyEither order
NoneNo checksAnything (dangerous)Coordinate manually

Backward compatibility

Backward-compatible means a consumer using the new schema can read data written with the old schema. This is the most common choice. It lets you upgrade consumers first, then producers.

Safe (backward-compatible):
  + Add a field WITH a default value
  - Remove a field (new readers just ignore the missing column)

Breaking (NOT backward-compatible):
  + Add a required field with no default  (old data has no value for it)
  ~ Rename a field                        (= drop + add; readers lose the old data)
  ~ Change a field's type                 (int -> string breaks deserialization)
  ~ Narrow an enum                         (old data may hold a now-invalid value)

Forward compatibility

Forward-compatible means a consumer using the old schema can read data written with the new schema. This lets you upgrade producers first - useful when producers deploy on a faster cadence than the many consumers.

The asymmetry matters: adding a field with a default is backward-compatible; removing an optional field is forward-compatible; doing only changes that are both gives you full compatibility and removes the upgrade-ordering constraint entirely. When in doubt, target full compatibility for shared, high-fan-out datasets.

Enforcing Contracts in CI

A contract that lives in a wiki is a suggestion. A contract enforced in CI is a guarantee. The goal: a pull request that would break a contract fails the build before it can merge.

# .github/workflows/data-contract.yml
name: Data Contract Check
on: [pull_request]

jobs:
  schema-compatibility:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # 1. Validate the proposed schema is well-formed
      - run: contract-cli validate contracts/orders.yml
      # 2. Check it against the registered version for breaking changes
      - run: |
          contract-cli compatibility \
            --schema contracts/orders.yml \
            --against registry://orders \
            --mode backward
      # A backward-incompatible change exits non-zero and fails the PR.
# The compatibility check, conceptually
def check_compatibility(old_schema, new_schema, mode="backward") -> list[str]:
    """Return a list of breaking changes; empty means compatible."""
    breaks = []
    for field in old_schema.fields:
        if field.name not in new_schema.field_names:
            if mode in ("forward", "full"):
                breaks.append(f"Removed field '{field.name}' breaks forward compat")
    for field in new_schema.fields:
        if field.name not in old_schema.field_names:
            if field.required and field.default is None:
                breaks.append(f"New required field '{field.name}' has no default")
    return breaks

The producer cannot merge a breaking change without either (a) making it non-breaking (add a default, deprecate before delete), or (b) explicitly bumping the major version and coordinating a migration with the named consumers. The contract turns a silent break into a forced conversation.

Preventing Breaking Changes

A pragmatic playbook for evolving a contract without incidents:

  1. Additive-only by default. New fields are optional with defaults. This is always safe under backward compatibility.
  2. Deprecate before delete. Mark a field deprecated, announce it, give consumers a migration window, then remove it in a major version.
  3. Expand, then contract. To change a type, add a new field with the new type, dual-write both, migrate consumers, then remove the old field. Never mutate in place.
  4. Version explicitly. Use semantic versioning. Major = breaking, minor = additive, patch = docs/metadata. Consumers pin to a major version.
  5. Never reuse field identities. In Protobuf, never reuse a field number; in Avro, never reuse a name with different semantics. Reserve retired identifiers.
message Order {
  reserved 3;                 // total_cents was here; never reuse #3
  reserved "total_cents";
  string order_id    = 1;
  string customer_id = 2;
  Money  total       = 6;     // the replacement, a new field number
}

Ownership

A contract without an owner is unenforceable. Ownership is the organizational backbone that makes contracts real.

  • Producers own their contracts. The team that emits the data is accountable for its schema, guarantees, and changes. Data is a product, and they are its maintainers.
  • Every dataset has exactly one owning team, recorded in the contract metadata and discoverable in a catalog. "Whose data is this?" must always have an answer.
  • Consumers register their dependency. Listing consumers in the contract is what makes the impact of a change visible - it turns "I did not know anyone used it" into a notification list.
  • Breaking changes require consumer sign-off. The contract's consumer list is the approval list. You cannot break what you have promised without the people you promised it to agreeing.

This mirrors how good API teams operate: a public interface, owned by a team, versioned, with compatibility guarantees and a deprecation policy. Treating internal datasets as products with contracts is the single highest-leverage practice for scaling data reliability across an organization. It pairs naturally with the operational discipline in DataOps and the lineage that makes consumer impact visible.

On this page