Data Extraction & Classification

Turning messy real-world data into a clean structured dataset — at batch scale

Scenario Abstraction

There is a pile of data that is technically structured (a CSV, a table, a JSON dump) or should be structured (free-text fields, scraped HTML, mixed-language records) but is not usable as-is. The job is to normalize, enrich, classify, deduplicate, or extract it into a clean dataset for an analytics / ML / operational system.

This used to be the domain of regexes, brittle parsers, and Mechanical Turk. LLMs change the economics: for many fuzzy classification and extraction tasks they're cheaper and more accurate than maintaining hand-tuned rules, and dramatically cheaper than human labeling — if you set up evaluation properly.

This scenario differs from document-to-action in that it is batch and offline: nobody is waiting for an answer, and there is no transaction at the end. The output is data.

Solution Shape

Define the schema — what fields, what types, what enumerations, what is "unknown."
Label a small gold set — a few hundred examples carefully labeled by a domain expert. Without this, you cannot iterate.
Prototype the prompt — single-record extraction or classification with structured output.
Measure against gold — per-field accuracy; categorize errors.
Improve along the cheapest axis — better prompt, few-shot examples, schema clarification, splitting the task, then (rarely) a fine-tune.
Batch processing — async, retry-safe, with cost guardrails.
Quality monitoring at scale — sample audits, drift alerts on output distribution.
Human-in-the-loop for low-confidence rows — route uncertain extractions to a labeling queue.

Key Building Blocks

Structured output with strict schema validation.
Batch APIs (cheaper, slower) where latency doesn't matter.
A judge / verifier model for high-stakes outputs.
Confidence signal — logit-based, ensemble disagreement, or model self-reported.
Labeling tool for the human review loop.
Dataset versioning — outputs are data, treat them like data.

Concrete Cases

PII / sensitive data detection. Scan free-text fields in legacy databases, classify spans (name, SSN, health info), redact or tokenize.
Entity resolution / deduplication. "Are these two records the same company / person / product?" LLM compares on free-form attributes where fuzzy string match fails.
Product taxonomy classification. New SKUs assigned to category, subcategory, attributes by reading title + description.
Support ticket auto-tagging. Classify inbound tickets by topic, sentiment, urgency, customer segment.
Resume parsing. Pull structured experience records from PDFs; map skills to a controlled taxonomy.
Address normalization for international data. Parse messy address strings into structured fields per country format.
Sentiment / aspect-based opinion mining on reviews. Each review yields per-aspect scores, not a single number.
Synthetic data labeling. Pre-label a training set for a downstream classical ML model; humans review the LLM's labels rather than label from scratch.
Open-source repo metadata enrichment. Read READMEs at scale, classify by language ecosystem, maturity, purpose.
Healthcare coding suggestion. From clinical text, suggest ICD-10 / CPT codes for coder review.

Similar Scenarios

Web scraping post-processing — extract structured fields from heterogeneous HTML.
Survey free-text coding — classify open-ended responses into a fixed coding frame.
News / signal feeds tagging — tag incoming articles by company, theme, sentiment for analyst feeds.
Catalog cleanup — find inconsistencies across attributes of the same product.

Pitfalls & Evaluation

No gold set, no project. Without a labeled gold set you have no way to tell whether each prompt change helped or hurt. Build it first; protect it from contamination.
Class imbalance. The rare classes are usually the ones you care about. Stratify your eval.
Distribution drift. Source data changes over time. Re-sample and re-label periodically.
Cost blowup at scale. Cheap per-call, expensive per-million. Use small/fast models for first pass; reserve big models for hard examples flagged by confidence.
Format violations. Strict schema validators + retry-on-violation, or a constrained-decoding mode where available.

Useful metrics: per-field accuracy on gold, macro-F1 for classification, cost per processed record, percent routed to human queue, downstream-task lift (the dataset is only useful insofar as it improves something).