Data Extraction & Classification
Turning messy real-world data into a clean structured dataset — at batch scale
Scenario Abstraction
There is a pile of data that is technically structured (a CSV, a table, a JSON dump) or should be structured (free-text fields, scraped HTML, mixed-language records) but is not usable as-is. The job is to normalize, enrich, classify, deduplicate, or extract it into a clean dataset for an analytics / ML / operational system.
This used to be the domain of regexes, brittle parsers, and Mechanical Turk. LLMs change the economics: for many fuzzy classification and extraction tasks they can be cheaper and more accurate than maintaining hand-tuned rules, and far cheaper than human labeling, provided you set up evaluation properly.
This scenario differs from document-to-action in that it is batch and offline: nobody is waiting for an answer, and there is no transaction at the end. The output is data.
Solution Shape
- Define the schema — what fields, what types, what enumerations, what is "unknown."
- Label a small gold set — a few hundred examples carefully labeled by a domain expert. Without this, you cannot iterate.
- Prototype the prompt — single-record extraction or classification with structured output.
- Measure against gold — per-field accuracy; categorize errors.
- Improve along the cheapest axis — better prompt, few-shot examples, schema clarification, splitting the task, then (rarely) a fine-tune.
- Batch processing — async, retry-safe, with cost guardrails.
- Quality monitoring at scale — sample audits, drift alerts on output distribution.
- Human-in-the-loop for low-confidence rows — route uncertain extractions to a labeling queue.
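The first four steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `TicketRecord` schema, its field names, and its enumeration values are hypothetical, and `validate` and `per_field_accuracy` are helper names invented for this sketch.

```python
from dataclasses import dataclass, fields
from typing import Optional

# Hypothetical enumerations for an illustrative support-ticket schema.
# "unknown" is an explicit value so the model never has to guess.
TOPICS = {"billing", "bug", "feature_request", "unknown"}
URGENCIES = {"low", "medium", "high", "unknown"}


@dataclass(frozen=True)
class TicketRecord:
    topic: str
    urgency: str
    order_id: Optional[str]  # None means "no order id present in the text"


def _check_enum(name: str, value: object, allowed: set) -> None:
    if not isinstance(value, str) or value not in allowed:
        raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")


def validate(raw: dict) -> TicketRecord:
    """Reject anything outside the schema instead of silently coercing it."""
    expected = {f.name for f in fields(TicketRecord)}
    if set(raw) != expected:
        raise ValueError(f"missing or extra keys: {sorted(set(raw) ^ expected)}")
    _check_enum("topic", raw["topic"], TOPICS)
    _check_enum("urgency", raw["urgency"], URGENCIES)
    if raw["order_id"] is not None and not isinstance(raw["order_id"], str):
        raise ValueError("order_id must be a string or None")
    return TicketRecord(**raw)


def per_field_accuracy(predicted: list, gold: list) -> dict:
    """Fraction of records on which each field exactly matches the gold label."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and aligned")
    return {
        f.name: sum(getattr(p, f.name) == getattr(g, f.name)
                    for p, g in zip(predicted, gold)) / len(gold)
        for f in fields(TicketRecord)
    }
```

With two gold records, `TicketRecord("billing", "high", "A-1")` and `TicketRecord("bug", "low", None)`, and predictions that get only the first record's urgency wrong, `per_field_accuracy` returns `{"topic": 1.0, "urgency": 0.5, "order_id": 1.0}`. Reporting per field rather than per record shows which part of the prompt or schema needs work.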
Key Building Blocks
- Structured output with strict schema validation.
- Batch APIs (cheaper, slower) where latency doesn't matter.
- A judge / verifier model for high-stakes outputs.
- Confidence signal — logit-based, ensemble disagreement, or model self-reported.
- Labeling tool for the human review loop.
- Dataset versioning — outputs are data, treat them like data.
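One way to obtain a confidence signal without access to logits is ensemble disagreement: label the same record several times (different samples, prompts, or models) and treat the share of votes for the majority label as confidence. The sketch below assumes the labels have already been collected; `agreement_confidence`, `route`, the queue names, and the 0.8 threshold are all illustrative choices, and the threshold should be tuned against the gold set.

```python
from collections import Counter


def agreement_confidence(labels: list) -> tuple:
    """Return (majority_label, share_of_votes) for one record's ensemble labels."""
    if not labels:
        raise ValueError("need at least one label")
    label, votes = Counter(labels).most_common(1)[0]
    return label, votes / len(labels)


def route(record_id: str, labels: list, threshold: float = 0.8) -> dict:
    """Accept high-agreement rows; send the rest to the human review queue."""
    label, confidence = agreement_confidence(labels)
    queue = "auto_accept" if confidence >= threshold else "human_review"
    return {"id": record_id, "label": label,
            "confidence": confidence, "queue": queue}


# Five ensemble votes that all agree are accepted automatically;
# a 2-of-3 split falls below the threshold and goes to a reviewer.
unanimous = route("r1", ["bug"] * 5)
split = route("r2", ["bug", "bug", "billing"])
```

Here `unanimous["queue"]` is `"auto_accept"` and `split["queue"]` is `"human_review"`. Agreement is only a proxy: an ensemble can agree on a wrong answer, so sample audits of the auto-accepted rows are still needed.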
Concrete Cases
- PII / sensitive data detection. Scan free-text fields in legacy databases, classify spans (name, SSN, health info), redact or tokenize.
- Entity resolution / deduplication. "Are these two records the same company / person / product?" LLM compares on free-form attributes where fuzzy string match fails.
- Product taxonomy classification. New SKUs assigned to category, subcategory, attributes by reading title + description.
- Support ticket auto-tagging. Classify inbound tickets by topic, sentiment, urgency, customer segment.
- Resume parsing. Pull structured experience records from PDFs; map skills to a controlled taxonomy.
- Address normalization for international data. Parse messy address strings into structured fields per country format.
- Sentiment / aspect-based opinion mining on reviews. Each review yields per-aspect scores, not a single number.
- LLM-assisted pre-labeling. Pre-label a training set for a downstream classical ML model; humans review and correct the LLM's labels rather than label from scratch.
- Open-source repo metadata enrichment. Read READMEs at scale, classify by language ecosystem, maturity, purpose.
- Healthcare coding suggestion. From clinical text, suggest ICD-10 / CPT codes for coder review.
Similar Scenarios
- Web scraping post-processing — extract structured fields from heterogeneous HTML.
- Survey free-text coding — classify open-ended responses into a fixed coding frame.
- News / signal feeds tagging — tag incoming articles by company, theme, sentiment for analyst feeds.
- Catalog cleanup — find inconsistencies across attributes of the same product.
Pitfalls & Evaluation
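- No gold set, no project. Without a labeled gold set you cannot tell whether a prompt change helped or hurt. Build it first, and protect it from contamination: gold examples must not be reused as few-shot examples or fine-tuning data.
- Class imbalance. The rare classes are usually the ones you care about. Stratify your eval so each rare class has enough examples to measure, and report per-class metrics rather than overall accuracy alone.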
- Distribution drift. Source data changes over time. Re-sample and re-label periodically.
- Cost blowup at scale. Cheap per call, expensive per million. Use small, fast models for the first pass; reserve big models for hard examples flagged by the confidence signal.
- Format violations. Models occasionally return malformed or off-schema output. Use strict schema validators with retry-on-violation, or a constrained-decoding mode where available.
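Retry-on-violation can be sketched as a small loop around a strict parser. The model call is passed in as a plain function so the example stays provider-agnostic; `call_model`, `parse_strict`, `extract_with_retry`, the single-field `label` schema, and the retry hint wording are all assumptions made for illustration.

```python
import json
from typing import Callable

ALLOWED_LABELS = {"positive", "negative", "neutral"}  # hypothetical enumeration


def parse_strict(text: str) -> dict:
    """Parse model output; raise ValueError on anything outside the schema."""
    obj = json.loads(text)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(obj, dict) or set(obj) != {"label"}:
        raise ValueError("expected a JSON object with exactly one key, 'label'")
    label = obj["label"]
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError(f"label {label!r} is not in the allowed set")
    return obj


def extract_with_retry(call_model: Callable[[str], str],
                       base_prompt: str,
                       max_attempts: int = 3) -> dict:
    """Call the model, validate, and retry with the error appended on violation."""
    hint = 'Return only JSON such as {"label": "neutral"}.'
    prompt = base_prompt
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return parse_strict(raw)
        except ValueError as err:
            last_error = err
            # Rebuild from the base prompt so retries do not accumulate hints.
            prompt = f"{base_prompt}\n\nThe previous output was rejected: {err}. {hint}"
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error
```

Capping the attempts matters for cost: a record that fails every retry is better sent to the human queue than retried indefinitely.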
Useful metrics: per-field accuracy on gold, macro-F1 for classification, cost per processed record, percent routed to human queue, downstream-task lift (the dataset is only useful insofar as it improves something).
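To make the macro-F1 metric concrete, here is a small dependency-free sketch. Macro-F1 averages the per-class F1 scores with equal weight, so a rare class that is never predicted drags the score down even when overall accuracy looks fine. The function name `macro_f1` and the spam/ham data are illustrative.

```python
def macro_f1(gold: list, predicted: list) -> tuple:
    """Return (macro_f1, per_class_f1) for two aligned label lists."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and aligned")
    per_class = {}
    for cls in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == cls and p == cls for g, p in pairs)
        fp = sum(g != cls and p == cls for g, p in pairs)
        fn = sum(g == cls and p != cls for g, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); the denominator is never zero here
        # because every class appears in gold or predicted at least once.
        per_class[cls] = 2 * tp / (2 * tp + fp + fn)
    return sum(per_class.values()) / len(per_class), per_class


# Nine common-class rows and one rare-class row; the classifier
# predicts the common class every time.
gold_labels = ["ham"] * 9 + ["spam"]
predicted_labels = ["ham"] * 10
score, by_class = macro_f1(gold_labels, predicted_labels)
```

In this example accuracy is 0.9, but `by_class["spam"]` is 0.0 and `score` is about 0.47, which is the signal a stratified eval is meant to surface.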