Multimodal Inspection

Scenario Abstraction

A human today looks at something — a photo, a frame from CCTV, a medical scan, a product on a conveyor belt, a screen — and produces a structured output: a tag, a measurement, a defect call, a fitness judgment. This is the visual analog of conversation intelligence: turning unstructured perception into structured data.

Vision-capable LLMs have removed a barrier: each visual task used to require its own custom CV model. For many "look-and-say" jobs, prompting a vision LLM with a few examples can now outperform months of bespoke training, especially when the categories evolve.

Solution Shape

Capture — phone camera, fixed camera, drone, microscope, CT scanner, screen recorder.
Pre-process — normalize resolution, color, orientation; sometimes crop / segment to the region of interest.
Inspect with a vision LLM — given the image plus a prompt that specifies what to look for and the output schema, produce a structured response.
Geometric / count corrections — for tasks where counting or measuring matters, pair with classical CV (segmentation, OCR) and combine.
Confidence-aware review — low confidence routes to a human reviewer.
Action / record — write the structured judgment to wherever it's needed.
Continuous evaluation — keep a labeled set; rotate fresh samples in to catch drift.
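
The inspect and review steps above can be sketched as a small validation-and-routing function. This is a minimal illustration, not a reference implementation: it assumes the vision LLM has been prompted to reply with JSON containing a `label` and a self-reported `confidence` between 0 and 1. The label set, the threshold, and the names `parse_and_route` and `Judgment` are all hypothetical.

```python
import json
from dataclasses import dataclass

# Hypothetical label set and threshold; tune both on a labeled eval set.
ALLOWED_LABELS = {"ok", "scratch", "dent", "missing_part"}
REVIEW_THRESHOLD = 0.80


@dataclass
class Judgment:
    label: str
    confidence: float
    route: str  # "auto" or "human_review"


def parse_and_route(raw_response: str) -> Judgment:
    """Validate the model's JSON reply and decide who acts on it."""
    try:
        data = json.loads(raw_response)
        label = data["label"]
        confidence = float(data["confidence"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Malformed output is never silently accepted.
        return Judgment("unparseable", 0.0, "human_review")

    valid = (
        isinstance(label, str)
        and label in ALLOWED_LABELS
        and 0.0 <= confidence <= 1.0
    )
    if not valid:
        # Off-schema labels or out-of-range confidence go to a person.
        return Judgment(str(label), 0.0, "human_review")

    route = "auto" if confidence >= REVIEW_THRESHOLD else "human_review"
    return Judgment(label, confidence, route)
```

Treating schema violations the same way as low confidence keeps a single review queue. Self-reported confidence is only a rough signal, so the threshold is worth calibrating against reviewer decisions rather than trusting it at face value.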

For mission-critical visual judgments (medical, industrial QA), pair the LLM with a specialist model. The LLM is great at naming and describing; specialist models are still better at measuring.
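
One way to pair the two is to let each model veto independently: the specialist owns the measurement, the LLM owns the visual call. The sketch below assumes a single critical dimension in millimetres and an `"ok"` label for defect-free parts. The function name `combine` and its parameters are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    accept: bool
    reason: str


def combine(llm_label: str, measured_mm: float,
            nominal_mm: float, tol_mm: float) -> Decision:
    """Reject if either the dimensional check or the visual call fails."""
    deviation = abs(measured_mm - nominal_mm)
    if deviation > tol_mm:
        # The specialist measurement overrides a clean visual call.
        return Decision(False, f"out of tolerance by {deviation - tol_mm:.3f} mm")
    if llm_label != "ok":
        return Decision(False, f"visual defect: {llm_label}")
    return Decision(True, "within tolerance, no visual defect")
```

For example, `combine("ok", 10.2, 10.0, 0.05)` rejects the part even though the LLM saw nothing wrong, because the measured deviation exceeds the tolerance.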

Key Building Blocks

Image / video ingest with retention and provenance.
Vision LLM with image + structured-output support.
Classical CV pre-processors — detection, segmentation, OCR, where useful.
Region-of-interest cropping to reduce model load and improve accuracy.
Annotation tool for human review.
Eval set with class balance — rare-but-critical classes must be represented.
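
A class-balanced eval set can be approximated by reserving a minimum number of samples per class before filling the rest at random. The following is a sketch under simple assumptions: samples arrive as `(sample_id, label)` pairs with string labels, and `build_eval_set` and its parameters are hypothetical names.

```python
import random
from collections import defaultdict


def build_eval_set(samples, per_class_min, total, seed=0):
    """Return sample ids, guaranteeing rare classes are represented.

    samples: iterable of (sample_id, label) pairs.
    Reserves up to `per_class_min` ids from every class, then fills
    the remainder up to `total` from whatever is left.
    """
    rng = random.Random(seed)  # fixed seed keeps the set reproducible
    by_class = defaultdict(list)
    for sample_id, label in samples:
        by_class[label].append(sample_id)

    chosen, leftovers = [], []
    for label in sorted(by_class):
        ids = by_class[label][:]
        rng.shuffle(ids)
        chosen.extend(ids[:per_class_min])
        leftovers.extend(ids[per_class_min:])

    rng.shuffle(leftovers)
    remaining = max(0, total - len(chosen))
    chosen.extend(leftovers[:remaining])
    return chosen
```

With 95 "ok" samples and 5 "crack" samples, a purely random draw of 20 could easily contain no cracks at all; reserving per class first avoids that.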

Concrete Cases

Industrial visual quality inspection. Conveyor camera + vision LLM flags defects on parts; a specialist measures critical dimensions; rejects are routed to a manual station.
Construction site safety / progress monitoring. Camera streams analyzed for PPE compliance, fall hazards, daily progress vs plan.
Retail shelf compliance. Store photo → planogram compliance, share-of-shelf, out-of-stock detection.
Insurance damage assessment. Vehicle / property photos → identify damage, estimate severity, draft initial claim memo.
Medical image triage. Radiology pre-read for urgency; pathology slide triage; ophthalmology screening; always feeding a physician, never replacing one.
Document processing of mixed-media files. Forms with handwritten and printed fields, signatures, stamps; vision LLM extracts and validates.
UI testing & screen agents. Read app screenshots, identify elements, locate bugs, drive a flow.
Geospatial / aerial analysis. Solar panel detection on rooftops; crop health from drone footage; deforestation monitoring.
Manufacturing assembly verification. Photo of finished assembly → check that all components are present and oriented correctly.
Food safety / freshness grading. Inbound produce photos → grade and route.

Similar Scenarios

Video analytics for media archives — same vision step, applied frame-sampled to long-form content.
Receipt / invoice OCR — overlaps with document-to-action when the input is image-only.
Accessibility alt-text generation — same engine, the "structured output" is descriptive text.
Counterfeit / IP detection on marketplaces — image-driven flag plus catalog cross-check.
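
For the frame-sampled video case, the sampling itself is simple arithmetic. A minimal sketch, assuming fixed-interval sampling and a known frame rate (the function name is hypothetical; scene-change-based sampling would be a different technique):

```python
def sample_frame_indices(total_frames: int, fps: float, every_seconds: float):
    """Indices of frames to send to the vision step, one per interval."""
    step = max(1, round(fps * every_seconds))  # never sample slower than 1 frame
    return list(range(0, total_frames, step))
```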

Pitfalls & Evaluation

Long-tail failure modes. Vision LLMs hallucinate on edge cases (occlusion, glare, unusual angle) and rarely flag their own uncertainty. Always have a low-confidence path.
Camera drift. A change in lighting, lens, or angle silently kills accuracy. Monitor by camera ID; alert on output distribution change.
Counting / measuring weaknesses. LLMs are not yet reliable at exact counts or precise measurements. Pair with classical CV.
Privacy & retention. Faces, license plates, medical data require careful retention rules.
Adversarial physical inputs. Stickers / patterns can fool models. Spot-check periodically.
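
The camera-drift check can be sketched as a comparison of label distributions per camera ID. This example uses total variation distance as the drift measure, which is one reasonable choice among several. The threshold of 0.15 and all function names are assumptions for illustration.

```python
from collections import Counter


def label_distribution(labels):
    """Map each label to its share of the given outputs."""
    counts = Counter(labels)
    n = sum(counts.values())
    return {label: c / n for label, c in counts.items()}


def total_variation(p, q):
    """Half the L1 distance between two discrete distributions (0 to 1)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def drifted_cameras(baseline, recent, threshold=0.15):
    """baseline / recent: dict of camera_id -> list of output labels.

    Returns {camera_id: distance} for cameras whose recent output
    distribution moved more than `threshold` away from baseline.
    """
    alerts = {}
    for camera_id, recent_labels in recent.items():
        base_labels = baseline.get(camera_id)
        if not base_labels or not recent_labels:
            continue  # nothing to compare against
        tv = total_variation(
            label_distribution(base_labels),
            label_distribution(recent_labels),
        )
        if tv > threshold:
            alerts[camera_id] = tv
    return alerts
```

A shift in output distribution does not prove the camera changed, since the incoming parts may genuinely differ, so an alert is a prompt to inspect the camera rather than a verdict.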

Useful metrics: per-class precision/recall on a labeled eval set, false-reject rate (production-killing if too high), reviewer agreement on borderline cases, end-to-end yield change (in QA: how does the inspection change rejection rates?).
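
Two of these metrics can be computed directly from parallel lists of true and predicted labels. The sketch below uses the standard definitions of precision and recall; it assumes `"ok"` is the label for good items, and the function names are illustrative.

```python
def per_class_precision_recall(y_true, y_pred):
    """Return {class: (precision, recall)}; None where undefined."""
    pairs = list(zip(y_true, y_pred))
    out = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else None
        recall = tp / (tp + fn) if tp + fn else None
        out[c] = (precision, recall)
    return out


def false_reject_rate(y_true, y_pred, good_label="ok"):
    """Share of truly good items predicted as anything other than good."""
    good = [p for t, p in zip(y_true, y_pred) if t == good_label]
    if not good:
        return None
    return sum(1 for p in good if p != good_label) / len(good)
```

Reporting per class rather than as a single accuracy figure keeps a rare but critical class from being hidden by a dominant "ok" class.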