Steven's Knowledge

Multimodal Inspection

Reading images, video, and screens to extract structured judgments at scale

Scenario Abstraction

A human today looks at something — a photo, a frame from CCTV, a medical scan, a product on a conveyor belt, a screen — and produces a structured output: a tag, a measurement, a defect call, a fitness judgment. This is the visual analog of conversation intelligence: turning unstructured perception into structured data.

Vision-capable LLMs have removed much of the need to train a custom CV model per task. For many "look-and-say" jobs, prompting a vision LLM with a few examples can now match or beat months of bespoke training, especially when the categories evolve.

Solution Shape

  1. Capture — phone camera, fixed camera, drone, microscope, CT scanner, screen recorder.
  2. Pre-process — normalize resolution, color, orientation; sometimes crop / segment to the region of interest.
  3. Inspect with a vision LLM — given the image plus a prompt that specifies what to look for and the output schema, produce a structured response.
  4. Geometric / count corrections — for tasks where counting or measuring matters, pair with classical CV (segmentation, OCR) and combine.
  5. Confidence-aware review — low confidence routes to a human reviewer.
  6. Action / record — write the structured judgment to wherever it's needed.
  7. Continuous evaluation — keep a labeled set; rotate fresh samples in to catch drift.
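
Steps 3 and 5 above can be sketched as a small validation-and-routing layer. This is a minimal sketch, not a reference implementation: the label set, the JSON field names (`label`, `confidence`, `notes`), and the 0.8 threshold are all illustrative assumptions, and the call to the vision LLM itself is left out because it depends on the provider.

```python
import json
from dataclasses import dataclass

# Hypothetical schema: label names and threshold are assumptions for illustration.
ALLOWED_LABELS = {"ok", "scratch", "dent", "missing_part"}
REVIEW_THRESHOLD = 0.8


@dataclass
class Judgment:
    label: str
    confidence: float
    notes: str


def parse_judgment(raw: str) -> Judgment:
    """Validate the model's JSON reply against the expected schema."""
    data = json.loads(raw)
    label = data["label"]
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return Judgment(label, confidence, str(data.get("notes", "")))


def route(raw: str) -> str:
    """Return 'auto' for confident, well-formed replies, else 'human_review'."""
    try:
        judgment = parse_judgment(raw)
    except (ValueError, KeyError, TypeError):
        # Malformed or off-schema output is treated like low confidence.
        return "human_review"
    if judgment.confidence >= REVIEW_THRESHOLD:
        return "auto"
    return "human_review"
```

Treating malformed or off-schema replies the same as low-confidence ones keeps a single review path, so nothing reaches the record step without passing either the schema check or a human.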

For mission-critical visual judgments (medical, industrial QA), pair the LLM with a specialist model. The LLM is great at naming and describing; specialist models are still better at measuring.
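
One way to picture the pairing is a decision rule in which the LLM supplies the defect label and a specialist model supplies the measurement. The sketch below assumes a hypothetical setup where the specialist returns a single dimension in millimetres; the function name, parameters, and tolerance values are illustrative, not taken from any particular system.

```python
def combine(llm_label: str, measured_mm: float, nominal_mm: float, tol_mm: float) -> str:
    """Merge a vision-LLM label with a specialist measurement (hypothetical rule).

    The part passes only if the LLM sees no defect AND the measured
    dimension is within tolerance of the nominal value.
    """
    in_tolerance = abs(measured_mm - nominal_mm) <= tol_mm
    if llm_label != "ok":
        return "reject"          # LLM named a visible defect
    if not in_tolerance:
        return "reject"          # specialist measurement out of spec
    return "accept"
```

The LLM never gets to overrule the measurement, and the measurement never gets to overrule a named defect, which reflects the division of labour described above.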

Key Building Blocks

  • Image / video ingest with retention and provenance.
  • Vision LLM with image + structured-output support.
  • Classical CV pre-processors — detection, segmentation, OCR, where useful.
  • Region-of-interest cropping to reduce model load and improve accuracy.
  • Annotation tool for human review.
  • Eval set with class balance — rare-but-critical classes must be represented.
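
The class-balance requirement on the eval set can be checked mechanically. The following is a minimal sketch; the minimum of 30 examples per class is an arbitrary assumption and should be set from how much precision the per-class metrics need.

```python
from collections import Counter


def underrepresented(labels, required_classes, min_per_class=30):
    """Return {class: count} for required classes with too few eval examples.

    `min_per_class` is an illustrative default, not a recommended value.
    """
    counts = Counter(labels)
    return {
        cls: counts[cls]              # Counter returns 0 for missing classes
        for cls in required_classes
        if counts[cls] < min_per_class
    }
```

For example, an eval set with 500 `ok` images and 4 `crack` images would report `{"crack": 4}`, a signal to collect more of the rare-but-critical class before trusting its recall figure.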

Concrete Cases

  • Industrial visual quality inspection. Conveyor camera + vision LLM flags defects on parts; a specialist model measures critical dimensions; rejects are routed to a manual station.
  • Construction site safety / progress monitoring. Camera streams analyzed for PPE compliance, fall hazards, daily progress vs plan.
  • Retail shelf compliance. Store photo → planogram compliance, share-of-shelf, out-of-stock detection.
  • Insurance damage assessment. Vehicle / property photos → identify damage, estimate severity, draft initial claim memo.
  • Medical image triage. Radiology pre-read for urgency; pathology slide triage; ophthalmology screening; always feeding a physician, never replacing one.
  • Document processing of mixed-media files. Forms with handwritten and printed fields, signatures, stamps; vision LLM extracts and validates.
  • UI testing & screen agents. Read app screenshots, identify elements, locate bugs, drive a flow.
  • Geospatial / aerial analysis. Solar panel detection on rooftops; crop health from drone footage; deforestation monitoring.
  • Manufacturing assembly verification. Photo of finished assembly → check that all components are present and oriented correctly.
  • Food safety / freshness grading. Inbound produce photos → grade and route.

Similar Scenarios

  • Video analytics for media archives — same vision step, applied frame-sampled to long-form content.
  • Receipt / invoice OCR — overlaps with document-to-action when the input is image-only.
  • Accessibility alt-text generation — same engine, the "structured output" is descriptive text.
  • Counterfeit / IP detection on marketplaces — image-driven flag plus catalog cross-check.

Pitfalls & Evaluation

  • Long-tail failure modes. Vision LLMs hallucinate on edge cases (occlusion, glare, unusual angle) and often sound confident while doing so. Always have a low-confidence path.
  • Camera drift. A change in lighting, lens, or angle silently kills accuracy. Monitor by camera ID; alert on output distribution change.
  • Counting / measuring weaknesses. LLMs are not yet reliable at exact counts or precise measurements. Pair with classical CV.
  • Privacy & retention. Faces, license plates, medical data require careful retention rules.
  • Adversarial physical inputs. Stickers / patterns can fool models. Spot-check periodically.
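
The camera-drift pitfall suggests alerting on output distribution change per camera ID. A minimal sketch of one such check follows, using total variation distance between a baseline window and a recent window of predicted labels. The choice of distance and the 0.15 threshold are assumptions; other drift statistics would work as well.

```python
from collections import Counter


def label_distribution(labels):
    """Convert a list of predicted labels into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def drifted(baseline_labels, recent_labels, threshold=0.15):
    """True if one camera's recent outputs moved away from its baseline.

    `threshold` is an illustrative value; tune it per camera and per task.
    """
    distance = total_variation(
        label_distribution(baseline_labels),
        label_distribution(recent_labels),
    )
    return distance > threshold
```

Running this per camera ID, rather than over the pooled stream, keeps a single misaligned or relit camera from being averaged away by the healthy ones.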

Useful metrics: per-class precision/recall on a labeled eval set, false-reject rate (production-killing if too high), reviewer agreement on borderline cases, end-to-end yield change (in QA: how does the inspection change rejection rates?).
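
Two of these metrics are simple to compute from a labeled eval set. The sketch below assumes parallel lists of true and predicted labels and a hypothetical `ok` label for acceptable items; it returns `None` where a ratio is undefined instead of guessing a value.

```python
def per_class_precision_recall(y_true, y_pred):
    """Return {class: (precision, recall)}; None where the ratio is undefined."""
    pairs = list(zip(y_true, y_pred))
    result = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        precision = tp / (tp + fp) if (tp + fp) else None
        recall = tp / (tp + fn) if (tp + fn) else None
        result[cls] = (precision, recall)
    return result


def false_reject_rate(y_true, y_pred, ok_label="ok"):
    """Share of truly acceptable items that the system rejected."""
    good = [p for t, p in zip(y_true, y_pred) if t == ok_label]
    if not good:
        return None
    return sum(1 for p in good if p != ok_label) / len(good)
```

With four good parts of which one is flagged as defective, `false_reject_rate` comes out at 0.25, which is the figure to watch when the concern is throughput and not missed defects.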
