Annotation & Labeling

Where ground truth comes from — and the discipline that makes annotation useful at scale

Annotation is the process of attaching labels, judgments, or structured information to raw data. It's the source of ground truth for evaluation, the backbone of supervised fine-tuning, and the input to preference tuning. Done well, it's where the team's understanding of the problem gets encoded into something a model can learn from. Done poorly, it's an expensive way to create noise.

Why Annotation Is Hard

It looks like data entry; it isn't. The hard parts:

Defining the task. Annotators can't label consistently if the task is ambiguous. Most "label noise" is task definition noise.
Consistency across annotators. Different humans with the same instructions still disagree. A lot.
Calibration drift. Annotators' standards change over weeks of work. The labels at week 1 don't match the labels at week 6.
Edge cases. The rare, weird examples are exactly the ones evals depend on — and exactly the ones humans disagree about.
Cost and throughput. Quality annotation is slow and expensive.

The Annotation Spec Is the Real Artifact

The labels are the output. The spec — the document that tells annotators what to do — is the artifact that determines quality. A good spec covers:

The exact decision being made. "Is this response accurate?" is too vague. "Does the response correctly answer the user's question, using only information from the provided documents?" is workable.
The label categories. Specific, mutually exclusive, with definitions.
Examples of each label. Real cases, not invented ones.
Edge case rules. "If the response is partially correct, label it X."
What to do when uncertain. A "skip" option, an escalation path, or a clear default.

Most annotation projects fail at this step, not at the labeling step.
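
One way to keep a spec reviewable is to store it as structured data next to the labels, so it can be versioned and checked automatically. The sketch below is a hypothetical layout, not a standard schema: the AnnotationSpec and LabelDef names, the field names, the label set, and the placeholder example IDs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LabelDef:
    name: str
    definition: str
    examples: tuple[str, ...] = ()  # IDs of real cases, not invented ones


@dataclass
class AnnotationSpec:
    version: str
    decision: str                   # the exact question annotators answer
    labels: list[LabelDef]
    edge_case_rules: dict[str, str] = field(default_factory=dict)
    when_uncertain: str = "skip"    # "skip", "escalate", or a default label name

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means these basic checks pass."""
        problems = []
        names = [label.name for label in self.labels]
        if len(names) != len(set(names)):
            problems.append("duplicate label names")
        for label in self.labels:
            if not label.definition.strip():
                problems.append(f"label {label.name!r} has no definition")
            if not label.examples:
                problems.append(f"label {label.name!r} has no examples")
        for situation, target in self.edge_case_rules.items():
            if target not in names:
                problems.append(f"edge case {situation!r} maps to unknown label {target!r}")
        if self.when_uncertain not in set(names) | {"skip", "escalate"}:
            problems.append(f"unknown uncertainty policy {self.when_uncertain!r}")
        return problems


# Hypothetical spec; the example IDs are placeholders for real cases.
spec = AnnotationSpec(
    version="0.1",
    decision=("Does the response correctly answer the user's question, "
              "using only information from the provided documents?"),
    labels=[
        LabelDef("correct", "Answers the question using only the documents.", ("example-001",)),
        LabelDef("partial", "Answers part of the question correctly.", ("example-002",)),
        LabelDef("incorrect", "Wrong, or relies on outside information.", ("example-003",)),
    ],
    edge_case_rules={"partially correct response": "partial"},
    when_uncertain="escalate",
)
problems = spec.validate()
```

A check like this only catches structural gaps (a label without a definition or examples, a rule pointing at a label that does not exist). Whether the definitions are actually unambiguous still has to be tested with annotators.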

Inter-Annotator Agreement

The standard quality metric: when two annotators independently label the same example, how often do they agree? Common forms:

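Raw agreement. The simplest measure, but inflated by agreement that would occur by chance.
Cohen's kappa / Krippendorff's alpha. Both correct for chance agreement. Cohen's kappa compares two annotators; Krippendorff's alpha handles any number of annotators and missing labels.
Pairwise agreement. With more than two annotators, agreement computed for each pair and then averaged.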

Targets vary by task, but as a rule of thumb: kappa above 0.8 is solid; 0.6–0.8 is workable but suggests the spec needs sharpening; below 0.6 usually means the task is genuinely ambiguous and the spec isn't resolving it.
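
Cohen's kappa is defined as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each annotator's label frequencies. The sketch below computes it in plain Python for two annotators; the function name and the sample labels are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # p_o: fraction of items where the two annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: chance agreement, from each annotator's own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, and this sketch reports it as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels: the annotators agree on 8 of 10 items.
annotator_a = ["yes"] * 5 + ["no"] * 5
annotator_b = ["yes"] * 4 + ["no"] * 5 + ["yes"]
kappa = cohens_kappa(annotator_a, annotator_b)  # raw agreement 0.8, kappa about 0.6
```

In this example raw agreement is 0.8, but half of that is expected by chance because both annotators use each label half the time, so kappa comes out at about 0.6.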

Run agreement studies regularly, not just at project start. Drift catches you otherwise.
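
A lightweight way to run these checks continuously is to mix a few already-adjudicated items into each week's queue and track how often fresh labels match them. The sketch below assumes that setup; the record layout, the function names, and the 0.1 tolerance are illustrative choices, not recommendations.

```python
from collections import defaultdict


def weekly_gold_agreement(records):
    """records: iterable of (week, annotator_label, gold_label) tuples.

    Returns {week: fraction of labels that matched gold}, ordered by week.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for week, label, gold in records:
        totals[week] += 1
        hits[week] += int(label == gold)
    return {week: hits[week] / totals[week] for week in sorted(totals)}


def drifting_weeks(rates, tolerance=0.1):
    """Weeks whose agreement fell more than `tolerance` below the first week's rate."""
    weeks = list(rates)
    if not weeks:
        return []
    baseline = rates[weeks[0]]
    return [week for week in weeks[1:] if baseline - rates[week] > tolerance]


# Made-up records: (week, annotator label, gold label).
records = [
    (1, "a", "a"), (1, "b", "b"),
    (2, "a", "a"), (2, "a", "b"),
    (3, "a", "b"), (3, "b", "a"),
]
rates = weekly_gold_agreement(records)
flagged = drifting_weeks(rates)
```

A flagged week is a prompt for a calibration session, not proof of drift: with only a handful of gold items per week the rates are noisy.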

The Annotation Pipeline

A mature pipeline has more than just "labelers labeling":

Initial labeling. First-pass annotation of each example.
Adjudication. Disagreements go to a more senior annotator or pair.
Spot-checks. Random samples reviewed by a quality lead.
Calibration sessions. Annotators meet periodically to discuss tricky cases and align.
Spec updates. New edge cases discovered in calibration go into the spec.

Skipping adjudication and calibration is the most common shortcut, and the most expensive one.
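
The adjudication and spot-check steps can be sketched as two small routing functions. This is a minimal illustration under assumed conventions: items arrive as a mapping from item ID to independently assigned labels, unanimous items are accepted, everything else goes to adjudication, and the function names, the 5% spot-check rate, and the seed are hypothetical.

```python
import random
from collections import Counter


def route_for_adjudication(annotations, min_agreement=1.0):
    """Split items into accepted labels and items needing adjudication.

    annotations: dict of item_id -> list of labels from independent annotators.
    An item is accepted only if it has at least two labels and the share of
    the most common label is at least `min_agreement`.
    """
    accepted = {}
    needs_adjudication = []
    for item_id, labels in annotations.items():
        if len(labels) < 2:
            needs_adjudication.append(item_id)  # no second opinion yet
            continue
        top_label, top_count = Counter(labels).most_common(1)[0]
        if top_count / len(labels) >= min_agreement:
            accepted[item_id] = top_label
        else:
            needs_adjudication.append(item_id)
    return accepted, needs_adjudication


def spot_check_sample(item_ids, rate=0.05, seed=0):
    """Draw a reproducible random sample of accepted items for a quality lead."""
    ids = sorted(item_ids)
    if not ids:
        return []
    k = min(len(ids), max(1, round(len(ids) * rate)))
    return random.Random(seed).sample(ids, k)


# Made-up annotations for three items.
annotations = {"item-1": ["x", "x"], "item-2": ["x", "y"], "item-3": ["x"]}
accepted, needs_adjudication = route_for_adjudication(annotations)
to_review = spot_check_sample(accepted)
```

Sampling spot-checks from the accepted items matters because agreement alone does not guarantee correctness: two annotators can share the same misreading of the spec.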

LLMs as Annotators

Modern LLMs are surprisingly good at annotation, with caveats:

Cheaper and faster than humans for most tasks.
More consistent than human annotators (no fatigue, and no drift as long as the model version and prompt stay fixed).
Worse at edge cases and at tasks requiring deep domain expertise.
Subject to systematic biases, such as favoring longer responses or a particular answer position, that quietly skew labels.

The right pattern is usually hybrid:

LLM first pass on everything.
Human review of low-confidence or important cases.
Human gold for the eval set itself.
Disagreements between LLM and human get flagged for spec review.
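
The hybrid pattern above can be sketched as a routing step plus a disagreement check. Everything here is an assumption for illustration: the LlmLabel record, the function names, and the 0.9 threshold are hypothetical, and how a confidence score is obtained (token log-probabilities, self-reported scores, agreement across repeated runs) is left open.

```python
from dataclasses import dataclass


@dataclass
class LlmLabel:
    item_id: str
    label: str
    confidence: float        # assumed to lie in [0, 1]
    important: bool = False  # e.g. high-stakes or customer-visible items


def route_llm_labels(llm_labels, threshold=0.9):
    """Split LLM first-pass labels into auto-accepted and human-review lists."""
    auto, review = [], []
    for record in llm_labels:
        if record.important or record.confidence < threshold:
            review.append(record)
        else:
            auto.append(record)
    return auto, review


def spec_review_flags(llm_labels, human_labels):
    """Item IDs where a human reviewer chose a different label than the LLM.

    human_labels: dict of item_id -> label assigned during human review.
    """
    llm_by_id = {record.item_id: record.label for record in llm_labels}
    return sorted(
        item_id
        for item_id, human_label in human_labels.items()
        if item_id in llm_by_id and llm_by_id[item_id] != human_label
    )


# Made-up first-pass output.
first_pass = [
    LlmLabel("a", "yes", 0.95),
    LlmLabel("b", "no", 0.60),
    LlmLabel("c", "yes", 0.99, important=True),
]
auto, review = route_llm_labels(first_pass)
flags = spec_review_flags(first_pass, {"b": "yes", "c": "yes"})
```

The eval set is deliberately absent from this flow: per the list above, those labels come from humans directly rather than from an LLM first pass.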

Active Learning and Smart Sampling

You usually can't label everything. The question is which examples to prioritize:

Uncertain examples. Where the model isn't confident, labels teach it the most.
Disagreement examples. Where multiple models or annotators disagree.
Production failures. Real users finding real bugs.
Coverage gaps. Categories underrepresented in current labeled data.

A queue prioritized this way is usually far more label-efficient than uniform random sampling.
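
One way to combine the four signals is a weighted score per example. The sketch below is illustrative only: the dictionary keys, the equal default weights, and the particular formulas (normalized entropy for uncertainty, share of minority votes for disagreement, labeled count relative to the best-covered category for coverage) are assumptions, not a prescribed method.

```python
import math
from collections import Counter


def normalized_entropy(probs):
    """Entropy of a predicted label distribution, scaled to [0, 1]."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def disagreement(votes):
    """Share of votes that differ from the most common vote."""
    if not votes:
        return 0.0
    top_count = Counter(votes).most_common(1)[0][1]
    return 1 - top_count / len(votes)


def priority(example, labeled_per_category, weights=(1.0, 1.0, 1.0, 1.0)):
    """Higher score means label sooner."""
    w_unc, w_dis, w_fail, w_cov = weights
    most = max(labeled_per_category.values(), default=0)
    have = labeled_per_category.get(example["category"], 0)
    coverage_gap = 1 - have / most if most else 1.0
    return (
        w_unc * normalized_entropy(example["probs"])
        + w_dis * disagreement(example["votes"])
        + w_fail * float(example["production_failure"])
        + w_cov * coverage_gap
    )


def build_queue(examples, labeled_per_category):
    """Sort examples from highest to lowest priority."""
    return sorted(examples, key=lambda e: priority(e, labeled_per_category), reverse=True)


# Made-up candidates and current labeled counts per category.
examples = [
    {"id": "easy", "probs": [0.99, 0.01], "votes": ["a", "a", "a"],
     "production_failure": False, "category": "common"},
    {"id": "hard", "probs": [0.5, 0.5], "votes": ["a", "b"],
     "production_failure": True, "category": "rare"},
]
queue = build_queue(examples, {"common": 100, "rare": 5})
```

The weights are the part worth tuning: a team that mainly cares about regressions might weight production failures far above model uncertainty.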

Privacy and Compliance

Annotation often involves real user data:

Anonymization before data reaches third-party labelers.
Data protection impact assessments (DPIA) and GDPR obligations in regulated contexts.
Annotator-side data handling — who can see what, where it's stored, retention.
Sensitive content protocols — annotators reviewing toxic, distressing, or graphic content need real support.

These concerns scale up fast as projects grow.
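
As a minimal illustration of the anonymization step, the sketch below masks email addresses and phone-like numbers before text leaves your systems. The two patterns and the placeholder tokens are assumptions chosen for brevity. Regular expressions catch only obvious formats and miss names, addresses, and free-text identifiers, so treat this as a first filter rather than real anonymization.

```python
import re

# Deliberately simple patterns; they will produce both misses and false positives.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace email addresses and phone-like digit runs with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


sample = "Contact jane@example.com or +1 (555) 010-9999."
redacted = redact(sample)  # "Contact [EMAIL] or [PHONE]."
```

The phone pattern also matches other long digit runs, such as order numbers, which is usually an acceptable trade-off when the goal is to avoid leaking identifiers.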

Tools

Open-source / self-hosted: Label Studio and Argilla (open source); Prodigy (commercial, self-hosted).
Managed annotation services: Surge, Scale, Toloka, Labelbox.
LLM-augmented platforms: newer tools blend automated and human annotation in a single workflow.

The right tool depends on volume, sensitivity, and whether you're labeling in-house or with vendors.

A Closing Frame

Annotation is the place where the team's judgment becomes the model's behavior. Treat it like the engineering problem it is — with specs, reviews, metrics, and iteration — and the resulting datasets will repay the investment for years.
