Steven's Knowledge

Data Extraction & Classification

Turning messy real-world data into a clean structured dataset — at batch scale

Scenario Abstraction

There is a pile of data that is technically structured (a CSV, a table, a JSON dump) or should be structured (free-text fields, scraped HTML, mixed-language records) but is not usable as-is. The job is to normalize, enrich, classify, deduplicate, or extract it into a clean dataset for an analytics / ML / operational system.

This used to be the domain of regexes, brittle parsers, and Mechanical Turk. LLMs change the economics: for many fuzzy classification and extraction tasks they are often cheaper and more accurate than maintaining hand-tuned rules, and far cheaper than human labeling, provided you set up evaluation properly.

This scenario differs from document-to-action in that it is batch and offline: nobody is waiting for an answer, and there is no transaction at the end. The output is data.

Solution Shape

  1. Define the schema — what fields, what types, what enumerations, what is "unknown."
  2. Label a small gold set — a few hundred examples carefully labeled by a domain expert. Without this, you cannot iterate.
  3. Prototype the prompt — single-record extraction or classification with structured output.
  4. Measure against gold — per-field accuracy; categorize errors.
  5. Improve along the cheapest axis — better prompt, few-shot examples, schema clarification, splitting the task, then (rarely) a fine-tune.
  6. Batch processing — async, retry-safe, with cost guardrails.
  7. Quality monitoring at scale — sample audits, drift alerts on output distribution.
  8. Human-in-the-loop for low-confidence rows — route uncertain extractions to a labeling queue.
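
As a rough illustration of steps 1 and 4, the sketch below encodes a small schema with an explicit "unknown" value and computes per-field accuracy against a gold set. The field names (`topic`, `urgency`), their allowed values, and the helpers `validate` and `per_field_accuracy` are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical schema: field name -> allowed values.
# "unknown" is an explicit, legal value rather than a missing field.
SCHEMA = {
    "topic": {"billing", "bug", "account", "unknown"},
    "urgency": {"low", "medium", "high", "unknown"},
}


def validate(record: dict) -> list:
    """Return a list of schema violations for one extracted record."""
    errors = []
    for field, allowed in SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif record[field] not in allowed:
            errors.append(f"invalid value for {field}: {record[field]!r}")
    return errors


def per_field_accuracy(predictions: list, gold: list) -> dict:
    """Fraction of gold records on which each field matches exactly."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and aligned")
    correct = Counter()
    for pred, truth in zip(predictions, gold):
        for field in SCHEMA:
            if pred.get(field) == truth[field]:
                correct[field] += 1
    return {field: correct[field] / len(gold) for field in SCHEMA}


# Tiny made-up example: two gold rows, one urgency mistake.
gold = [
    {"topic": "billing", "urgency": "high"},
    {"topic": "bug", "urgency": "low"},
]
preds = [
    {"topic": "billing", "urgency": "low"},
    {"topic": "bug", "urgency": "low"},
]
accuracy = per_field_accuracy(preds, gold)
```

Reporting accuracy per field rather than per record makes it easier to see which part of the schema or prompt needs attention.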

Key Building Blocks

  • Structured output with strict schema validation.
  • Batch APIs (cheaper, slower) where latency doesn't matter.
  • A judge / verifier model for high-stakes outputs.
  • Confidence signal: token log-probabilities, disagreement across an ensemble of prompts or models, or the model's self-reported confidence.
  • Labeling tool for the human review loop.
  • Dataset versioning — outputs are data, treat them like data.
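
A minimal sketch of the ensemble-disagreement confidence signal, assuming the same record has already been classified several times (for example by different prompts or models). The function names `vote` and `route` and the 0.8 threshold are illustrative assumptions, not recommended values.

```python
from collections import Counter


def vote(labels: list) -> tuple:
    """Return the majority label and the fraction of runs that agree with it."""
    if not labels:
        raise ValueError("need at least one label")
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def route(labels: list, threshold: float = 0.8) -> dict:
    """Keep the majority label; flag the row for human review on low agreement."""
    label, agreement = vote(labels)
    return {
        "label": label,
        "agreement": agreement,
        "needs_review": agreement < threshold,
    }


unanimous = route(["billing", "billing", "billing"])
split = route(["billing", "billing", "bug"])
```

Rows with `needs_review` set to `True` would go to the labeling queue; the threshold is something to tune against the gold set.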

Concrete Cases

  • PII / sensitive data detection. Scan free-text fields in legacy databases, classify spans (name, SSN, health info), redact or tokenize.
  • Entity resolution / deduplication. "Are these two records the same company / person / product?" LLM compares on free-form attributes where fuzzy string match fails.
  • Product taxonomy classification. New SKUs assigned to category, subcategory, attributes by reading title + description.
  • Support ticket auto-tagging. Classify inbound tickets by topic, sentiment, urgency, customer segment.
  • Resume parsing. Pull structured experience records from PDFs; map skills to a controlled taxonomy.
  • Address normalization for international data. Parse messy address strings into structured fields per country format.
  • Sentiment / aspect-based opinion mining on reviews. Each review yields per-aspect scores, not a single number.
  • Synthetic data labeling. Pre-label a training set for a downstream classical ML model; humans review the LLM's labels rather than label from scratch.
  • Open-source repo metadata enrichment. Read READMEs at scale, classify by language ecosystem, maturity, purpose.
  • Healthcare coding suggestion. From clinical text, suggest ICD-10 / CPT codes for coder review.

Similar Scenarios

  • Web scraping post-processing — extract structured fields from heterogeneous HTML.
  • Survey free-text coding — classify open-ended responses into a fixed coding frame.
  • News / signal feed tagging — tag incoming articles by company, theme, sentiment for analyst feeds.
  • Catalog cleanup — find inconsistencies across attributes of the same product.

Pitfalls & Evaluation

  • No gold set, no project. Without a labeled gold set you have no way to tell whether each prompt change helped or hurt. Build it first; protect it from contamination.
  • Class imbalance. The rare classes are usually the ones you care about. Stratify your eval.
  • Distribution drift. Source data changes over time. Re-sample and re-label periodically.
  • Cost blowup at scale. Cheap per-call, expensive per-million. Use small/fast models for first pass; reserve big models for hard examples flagged by confidence.
  • Format violations. Model output will occasionally break the schema. Use strict schema validators with retry-on-violation, or a constrained-decoding mode where available.
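
One possible shape for retry-on-violation is sketched below. It assumes the model is reachable through a caller-supplied function `call_model(prompt) -> str`; that function, the single `urgency` field, and `extract_with_retry` itself are hypothetical stand-ins, not any provider's API.

```python
import json
from typing import Callable, Optional

# Hypothetical enumeration for a single field.
ALLOWED_URGENCY = {"low", "medium", "high", "unknown"}


def parse_and_validate(raw: str) -> dict:
    """Parse model output as JSON and check it against the schema."""
    record = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    if record.get("urgency") not in ALLOWED_URGENCY:
        raise ValueError(f"invalid urgency: {record.get('urgency')!r}")
    return record


def extract_with_retry(
    call_model: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Call the model until its output validates, or give up and return None."""
    last_error = ""
    for _ in range(max_attempts):
        # On a retry, tell the model what was wrong with its previous output.
        full_prompt = prompt
        if last_error:
            full_prompt = f"{prompt}\n\nPrevious output was invalid: {last_error}"
        try:
            return parse_and_validate(call_model(full_prompt))
        except ValueError as exc:
            last_error = str(exc)
    return None
```

Capping the number of attempts keeps a stubbornly malformed record from inflating cost; a `None` result can be routed to the human queue instead.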

Useful metrics: per-field accuracy on gold, macro-F1 for classification, cost per processed record, percent routed to human queue, downstream-task lift (the dataset is only useful insofar as it improves something).
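
To make the macro-F1 metric concrete, here is a small standard-library sketch. The labels and data are made up; the point is that macro-F1 averages per-class F1 with equal weight, so a missed rare class pulls the score down even when plain accuracy looks healthy.

```python
def macro_f1(gold: list, pred: list) -> float:
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and aligned")
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Imbalanced toy data: the classifier never predicts the rare class.
gold_labels = ["common", "common", "common", "rare"]
pred_labels = ["common", "common", "common", "common"]

accuracy = sum(g == p for g, p in zip(gold_labels, pred_labels)) / len(gold_labels)
score = macro_f1(gold_labels, pred_labels)
```

On this toy data accuracy is 0.75 while macro-F1 is 3/7 (about 0.43), because the rare class contributes an F1 of zero.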
