Multimodal Inspection
Reading images, video, and screens to extract structured judgments at scale
Scenario Abstraction
A human today looks at something — a photo, a frame from CCTV, a medical scan, a product on a conveyor belt, a screen — and produces a structured output: a tag, a measurement, a defect call, a fitness judgment. This is the visual analog of conversation intelligence: turning unstructured perception into structured data.
Vision-capable LLMs have removed a barrier: each of these tasks used to require its own custom CV model. For many "look-and-say" jobs, prompting a vision LLM with a few examples can now match or beat months of bespoke training, especially when the categories keep changing, since updating a prompt is far cheaper than retraining a model.
Solution Shape
- Capture — phone camera, fixed camera, drone, microscope, CT scanner, screen recorder.
- Pre-process — normalize resolution, color, orientation; sometimes crop / segment to the region of interest.
- Inspect with a vision LLM — given the image plus a prompt that specifies what to look for and the output schema, produce a structured response.
- Geometric / count corrections — for tasks where counting or measuring matters, pair with classical CV (segmentation, OCR) and combine.
- Confidence-aware review — low confidence routes to a human reviewer.
- Action / record — write the structured judgment to wherever it's needed.
- Continuous evaluation — keep a labeled set; rotate fresh samples in to catch drift.
For mission-critical visual judgments (medical, industrial QA), pair the LLM with a specialist model. The LLM is great at naming and describing; specialist models are still better at measuring.
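The inspect, correct, and review steps above can be sketched as a small routing function. This is a minimal illustration, not a reference implementation: `Judgment`, `inspect`, the `vision_llm` and `specialist` callables, and the 0.8 review threshold are all hypothetical names and values chosen for the example. In practice the callables would wrap whatever vision model and measurement model are in use.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Judgment:
    """Structured output of one inspection (hypothetical schema)."""
    label: str                              # e.g. "ok", "scratch"
    confidence: float                       # 0.0 to 1.0, reported or estimated
    measurement_mm: Optional[float] = None  # filled in by a specialist model


def inspect(
    image: bytes,
    vision_llm: Callable[[bytes], Judgment],
    specialist: Optional[Callable[[bytes], float]] = None,
    review_threshold: float = 0.8,          # assumed value; tune on an eval set
) -> Tuple[str, Judgment]:
    """Run the LLM, add a specialist measurement, then pick a route."""
    judgment = vision_llm(image)

    # The LLM names and describes; the specialist measures.
    if specialist is not None:
        judgment.measurement_mm = specialist(image)

    # Confidence-aware review: anything below the threshold goes to a human.
    route = "auto" if judgment.confidence >= review_threshold else "human_review"
    return route, judgment


# Usage with stand-in callables instead of real models.
route, result = inspect(
    b"fake-image-bytes",
    vision_llm=lambda img: Judgment(label="scratch", confidence=0.55),
    specialist=lambda img: 12.5,
)
print(route, result)
```

Keeping the models behind plain callables makes it easy to swap the LLM or the specialist without touching the routing logic.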
Key Building Blocks
- Image / video ingest with retention and provenance.
- Vision LLM with image + structured-output support.
- Classical CV pre-processors — detection, segmentation, OCR, where useful.
- Region-of-interest cropping to reduce model load and improve accuracy.
- Annotation tool for human review.
- Eval set with class balance — rare-but-critical classes must be represented.
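Structured-output support is only useful if the response is checked before it is recorded. The sketch below shows one way to validate a model's JSON reply against a fixed label set. The label names, the `parse_judgment` function, and the field names (`label`, `confidence`, `notes`) are assumptions made for illustration, not part of any particular model's API.

```python
import json

# Hypothetical label set; a real deployment would define its own.
ALLOWED_LABELS = {"ok", "scratch", "dent", "missing_part"}


def parse_judgment(raw: str) -> dict:
    """Parse and validate a model reply; raise ValueError if it is unusable."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    label = data.get("label")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {label!r}")

    confidence = float(data.get("confidence", 0.0))
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")

    return {
        "label": label,
        "confidence": confidence,
        "notes": str(data.get("notes", "")),
    }


print(parse_judgment('{"label": "dent", "confidence": 0.9}'))
```

A reply that fails validation can be treated like a low-confidence result and sent to the human review queue rather than dropped.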
Concrete Cases
- Industrial visual quality inspection. Conveyor camera + vision LLM flags defects on parts; a specialist measures critical dimensions; rejects are routed to a manual station.
- Construction site safety / progress monitoring. Camera streams are analyzed for PPE compliance, fall hazards, and daily progress against plan.
- Retail shelf compliance. Store photo → planogram compliance, share-of-shelf, out-of-stock detection.
- Insurance damage assessment. Vehicle / property photos → identify damage, estimate severity, draft initial claim memo.
- Medical image triage. Radiology pre-read for urgency; pathology slide triage; ophthalmology screening. The output always feeds a physician and never replaces one.
- Document processing of mixed-media files. Forms with handwritten and printed fields, signatures, stamps; vision LLM extracts and validates.
- UI testing & screen agents. Read app screenshots, identify elements, locate bugs, drive a flow.
- Geospatial / aerial analysis. Solar panel detection on rooftops; crop health from drone footage; deforestation monitoring.
- Manufacturing assembly verification. Photo of finished assembly → check that all components are present and oriented correctly.
- Food safety / freshness grading. Inbound produce photos → grade and route.
Similar Scenarios
- Video analytics for media archives — same vision step, applied frame-sampled to long-form content.
- Receipt / invoice OCR — overlaps with document-to-action when the input is image-only.
- Accessibility alt-text generation — same engine, the "structured output" is descriptive text.
- Counterfeit / IP detection on marketplaces — image-driven flag plus catalog cross-check.
Pitfalls & Evaluation
- Long-tail failure modes. Vision LLMs hallucinate on edge cases (occlusion, glare, unusual angles) and rarely signal their own uncertainty when they do. Always provide a low-confidence path to human review.
- Camera drift. A change in lighting, lens, or angle can silently degrade accuracy. Monitor outputs per camera ID and alert when the output distribution shifts.
- Counting / measuring weaknesses. LLMs are not yet reliable at exact counts or precise measurements. Pair with classical CV.
- Privacy & retention. Faces, license plates, medical data require careful retention rules.
- Adversarial physical inputs. Stickers / patterns can fool models. Spot-check periodically.
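As one possible way to act on the camera drift item, the label distribution of each camera's recent outputs can be compared with a baseline window. The sketch below uses total variation distance for the comparison. The function names, the 0.2 threshold, and the choice of distance are illustrative assumptions; other divergence measures would work as well.

```python
from collections import Counter
from typing import Dict, List


def label_distribution(labels: List[str]) -> Dict[str, float]:
    """Turn a non-empty list of predicted labels into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two label distributions (0 to 1)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def drifted_cameras(
    baseline: Dict[str, List[str]],
    recent: Dict[str, List[str]],
    threshold: float = 0.2,  # assumed value; calibrate per deployment
) -> List[str]:
    """Return camera IDs whose recent outputs moved away from the baseline."""
    alerts = []
    for camera_id, labels in recent.items():
        # Skip cameras with no recent outputs or no baseline to compare against.
        if not labels or not baseline.get(camera_id):
            continue
        distance = total_variation(
            label_distribution(baseline[camera_id]),
            label_distribution(labels),
        )
        if distance > threshold:
            alerts.append(camera_id)
    return sorted(alerts)


baseline = {"cam1": ["ok"] * 9 + ["defect"], "cam2": ["ok"] * 9 + ["defect"]}
recent = {"cam1": ["ok"] * 9 + ["defect"], "cam2": ["ok"] * 5 + ["defect"] * 5}
print(drifted_cameras(baseline, recent))
```

A shift in outputs can also reflect a real change in what the camera sees, so an alert is a prompt to investigate rather than proof that the camera has drifted.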
Useful metrics: per-class precision/recall on a labeled eval set, false-reject rate (production-killing if too high), reviewer agreement on borderline cases, end-to-end yield change (in QA: how does the inspection change rejection rates?).
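Two of these metrics can be computed directly from a labeled eval set. The following is a small sketch with made-up labels and hypothetical function names. It assumes that "false reject" means a part whose true label is `ok_label` being predicted as anything else.

```python
from typing import Dict, List, Optional, Tuple


def per_class_precision_recall(
    y_true: List[str], y_pred: List[str]
) -> Dict[str, Tuple[Optional[float], Optional[float]]]:
    """Return {class: (precision, recall)}; None where a ratio is undefined."""
    results = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        precision = tp / (tp + fp) if (tp + fp) else None
        recall = tp / (tp + fn) if (tp + fn) else None
        results[cls] = (precision, recall)
    return results


def false_reject_rate(
    y_true: List[str], y_pred: List[str], ok_label: str = "ok"
) -> Optional[float]:
    """Share of truly good items that the system rejected."""
    good = [(t, p) for t, p in zip(y_true, y_pred) if t == ok_label]
    if not good:
        return None
    return sum(1 for _, p in good if p != ok_label) / len(good)


# Made-up eval data for illustration only.
y_true = ["ok", "ok", "ok", "ok", "scratch", "scratch"]
y_pred = ["ok", "ok", "ok", "scratch", "scratch", "ok"]
print(per_class_precision_recall(y_true, y_pred))
print(false_reject_rate(y_true, y_pred))
```

Reporting the metrics per class rather than as one aggregate keeps rare-but-critical classes visible.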