Steven's Knowledge

Data Versioning & Lineage

Datasets are code — and they need their own version control story

Code has Git. Models have model registries. Datasets, in most teams, sit in a folder somewhere with names like eval_v3_final_FINAL.csv. That's the gap data versioning fills: making datasets first-class artifacts you can pin, diff, audit, and reproduce.

Why It Matters

Without versioning, every "did this change make things better?" question becomes ambiguous. Did quality improve because the prompt changed, or because the eval set drifted? Was the fine-tune worse because the recipe changed, or because someone accidentally re-shuffled the training data? You can't answer either question reliably without immutable, named dataset versions.

It also matters for:

  • Reproducibility. Re-running a fine-tune from six months ago.
  • Audit and compliance. Showing exactly which data the production model was trained on.
  • Collaboration. Multiple engineers iterating on the same dataset without stepping on each other.
  • Rollback. Reverting an eval set or training corpus that turned out to be flawed.

What Versioning Means for Data

The minimum bar:

  • Immutable named versions. eval_v0042 always means the same set of bytes, forever.
  • Content-addressed. Hashes verify that what you loaded is what you expected.
  • Tracked transitions. Every new version is connected to its predecessor and the change that produced it.
  • Linked to artifacts. Every model, eval run, and report records the dataset version it used.

Anything less and you'll eventually find yourself debugging a regression with no idea what data it was trained on.

The Tools

DVC (Data Version Control). Pairs with Git. Tracks data files in object storage, references them by hash from your repo. Open source, language-agnostic. The most common starting point.

Git LFS. Crude data versioning for repos with binary files. Fine for small datasets; falls over at scale.

LakeFS, Pachyderm, Delta Lake. Versioned data lakes — Git-like semantics for entire object stores or warehouses. Used when datasets are too large for DVC's file-by-file model.

Weights & Biases Artifacts, MLflow, Comet. Experiment-tracking platforms with first-class dataset versioning. Useful when you want eval runs, model versions, and dataset versions linked in one system.

Hugging Face Datasets. The community standard for sharing datasets, with built-in versioning via revisions. Convenient when you're working with public or shared datasets.

Custom S3 / GCS schemes. Many teams just use object storage with strict naming conventions. Works if everyone is disciplined; a recipe for chaos otherwise.

Lineage

Versioning tells you what version. Lineage tells you how it got there.

Useful lineage for an eval set:

  • Source records — which production traces, which manual additions, which generated examples.
  • Filter steps — what was removed and why.
  • Annotation history — who labeled what, when, with which spec version.
  • Augmentations — synthetic additions, dedup passes, balancing.
  • Approvals — who signed off, against what release.

Stored as structured metadata, lineage answers "where did this label come from?" months later when someone questions a result.

A Practical Pattern

A pragmatic setup that works for many teams:

  1. Eval sets in Git with nbstripout-style hygiene, content-hashed, frozen by version tag. Small enough that Git handles them.
  2. Training and RAG corpora in DVC or LakeFS, referenced by hash from the training pipeline.
  3. A dataset registry — a small service or a versioned config file that maps version names to underlying content hashes.
  4. Lineage in metadata. Every dataset version has a JSON record of how it was produced, what it derives from, and what the diff is.
  5. Eval and training runs record dataset version IDs in their own logs.

The investment is modest; the payoff is years of debuggability.

What Goes in a Dataset Diff

Beyond just file diffs, useful dataset diffs surface:

  • Counts. Total, by category, by split.
  • Distribution shifts. Average length, token counts, label distribution.
  • New examples. What was added, ideally categorized.
  • Removed examples. What was dropped and why.
  • Modified labels. Where existing examples got re-labeled.

A pull request reviewing a dataset change should look more like a dashboard than a wall of text.

Schema and Compatibility

Datasets, like APIs, have schemas — and like APIs, schema changes break things. Treat them with the same care:

  • Add columns liberally. New optional fields rarely break anything.
  • Rename or remove cautiously. Track downstream consumers.
  • Version the schema alongside the data. Mismatches should fail fast and loudly.

A dataset whose schema changes silently is one debugging session away from biting you.

What Good Looks Like

A team with mature data versioning can answer, in under five minutes:

  • "What exact data was the production model trained on?"
  • "How does the eval set today differ from the one we used last quarter?"
  • "Where did this specific label come from, and who approved it?"
  • "Can we re-run this experiment with exactly the same data?"

Reaching that bar takes time. Starting now, with something simple, beats waiting for a perfect tool.

On this page