Multimodal Applications

Models that see, hear, and reason across modalities open up product surfaces text alone cannot

Multimodal models accept and produce mixes of text, images, audio, and video. The category has matured from "neat demo" to "genuinely useful", and it now supports products that text-only models could not.

What "Multimodal" Means in Practice

Most modern frontier models are at least vision-language: they accept images alongside text and reason about both. A growing fraction also handle audio (speech in, speech out), and some handle video natively. Image, video, and 3D generation are usually handled by separate, specialized model families.

Common Use Cases

Document understanding — extract structured data from PDFs, invoices, forms, screenshots. Replaces brittle OCR + regex pipelines.
Visual QA — answer questions about an image. Used in support, accessibility, education.
UI understanding — interpret screenshots, click coordinates, navigate apps visually. The foundation of computer-use agents.
Speech interfaces — transcribe, translate, respond by voice. Lower friction for many users than typing.
Image generation — product imagery, mockups, illustrations. Cheaper iteration than commissioning art.
Video generation — short marketing clips, prototypes, b-roll. Quality is rising fast; production use is starting.

Document Understanding Is the Sleeper Hit

The use case that quietly delivers the most value today is extracting structured information from messy documents with vision-language models. They are more robust than OCR + parsing because they use layout and context to resolve ambiguity. Patterns:

Direct extraction — paste the document image, ask for the structured fields back.
Layout-aware — preserve tables, columns, and visual hierarchy in the output.
Confidence and citation — return where in the document each field came from.
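
The direct-extraction and citation patterns can be combined in one request. The following is a minimal sketch, not a reference implementation: the field names, the JSON reply shape, and the helpers build_prompt and parse_response are all assumptions made for illustration, and the call to an actual vision-language model is left out because it depends on the provider.

```python
import json
from dataclasses import dataclass

# Hypothetical field list for an invoice; adjust per document type.
FIELDS = ["invoice_number", "total", "due_date"]


@dataclass
class ExtractedField:
    name: str
    value: str
    page: int    # page the value was read from
    bbox: tuple  # (x0, y0, x1, y1), normalized to the range 0..1


def build_prompt(fields):
    """Ask for each field plus where it came from (the citation pattern)."""
    return (
        "Extract these fields from the attached document image: "
        + ", ".join(fields)
        + ". Reply with a JSON list of objects with keys name, value, page, "
        "and bbox (x0, y0, x1, y1 normalized to 0..1). "
        "Omit any field you cannot find."
    )


def parse_response(raw, fields):
    """Validate the model's reply and drop entries that fail basic checks."""
    results = []
    for item in json.loads(raw):
        name, value, page, bbox = (
            item.get(key) for key in ("name", "value", "page", "bbox")
        )
        if name not in fields or value is None or not isinstance(page, int):
            continue  # unexpected field, missing value, or missing page
        if not isinstance(bbox, list) or len(bbox) != 4:
            continue  # citation missing or malformed
        if not all(isinstance(v, (int, float)) and 0.0 <= v <= 1.0 for v in bbox):
            continue  # citation coordinates out of range
        results.append(ExtractedField(name, str(value), page, tuple(bbox)))
    return results
```

Rejecting entries without a usable location is a design choice in this sketch: a field that cannot be traced back to the page is treated as unverified instead of being passed downstream.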

Speech Pipelines

Two architectures:

Cascaded — speech-to-text → LLM → text-to-speech. Three separate models. Easy to debug; latency stacks up.
End-to-end speech — one model handles speech in and out. Lower latency, more natural turn-taking, harder to introspect.

End-to-end is winning at the frontier; cascaded is still the right default when you need control over each stage.
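
To make the cascaded trade-off concrete, here is a small sketch of a three-stage pipeline that records the output and elapsed time of each stage. The stage functions fake_stt, fake_llm, and fake_tts are stand-ins invented for this example; in a real system each would wrap a separate model.

```python
import time


def run_cascade(stages, payload):
    """Run stages in order; return the final output and a per-stage trace."""
    trace = []
    for name, fn in stages:
        start = time.perf_counter()
        payload = fn(payload)
        elapsed = time.perf_counter() - start
        # Keeping each intermediate output is what makes a cascade easy to debug.
        trace.append({"stage": name, "seconds": elapsed, "output": payload})
    return payload, trace


# Stand-in stages (assumptions for illustration, not real models).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio bytes are the transcript


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")  # pretend the bytes are synthesized audio


STAGES = [("stt", fake_stt), ("llm", fake_llm), ("tts", fake_tts)]

if __name__ == "__main__":
    audio_out, trace = run_cascade(STAGES, b"hello")
    total = sum(step["seconds"] for step in trace)
    for step in trace:
        print(step["stage"], f"{step['seconds']:.6f}s", step["output"])
    print(f"total: {total:.6f}s")
```

The total latency is the sum of the stage times, which is the "stacks up" cost. The trace is the benefit: a wrong answer can be pinned on the transcript, the model reply, or the synthesis step.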

Image and Video Generation

Models keep improving quickly, but integration patterns matter as much as raw quality:

Reference-conditioned generation — generate variations consistent with a reference image (a product, a character).
Inpainting / editing — change part of an image, leave the rest alone.
Structured prompting — schema-based prompts that produce consistent series (catalog imagery, social posts).
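
Structured prompting can be as simple as a fixed schema rendered through one template, so every image in a series shares the same framing and only the subject changes. The sketch below assumes a hypothetical ShotSpec schema and template wording; the attributes a real image model responds to will differ.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ShotSpec:
    """Hypothetical schema for one catalog image."""
    subject: str
    background: str = "plain white studio backdrop"
    lighting: str = "soft, even lighting"
    angle: str = "three-quarter view"
    aspect_ratio: str = "1:1"


TEMPLATE = (
    "Product photo of {subject}, {angle}, {background}, {lighting}, "
    "aspect ratio {aspect_ratio}."
)


def render_prompt(spec: ShotSpec) -> str:
    """Fill the shared template from one spec."""
    return TEMPLATE.format(**asdict(spec))


def render_series(subjects, **shared):
    """Render one prompt per subject; shared overrides apply to every shot."""
    return [render_prompt(ShotSpec(subject=s, **shared)) for s in subjects]


if __name__ == "__main__":
    for prompt in render_series(["a red mug", "a blue mug"], angle="top-down view"):
        print(prompt)
```

Because the non-subject fields live in the schema and not in hand-written prompt text, changing the look of a whole series is a single override.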

What's Hard

Evaluation. Image and video quality are subjective; text-based metrics don't transfer.
Cost. Generation is much more expensive per output than text.
Latency. Generation is slow enough to need different UX patterns (queues, progress indicators, async results).
Safety. Generated imagery raises rights, likeness, and content concerns that text does not.
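
The latency point usually translates into a submit-then-poll design: the request returns a job ID immediately and the result arrives later. Here is a minimal in-memory sketch using asyncio; the GenerationJobs class and the fake_generate stand-in are assumptions for illustration, and a production system would use a durable queue in place of a dictionary.

```python
import asyncio
import uuid


class GenerationJobs:
    """In-memory job tracker: submit returns at once, results arrive later."""

    def __init__(self, generate):
        self._generate = generate  # async callable: prompt -> result
        self._jobs = {}
        self._tasks = set()

    def submit(self, prompt):
        """Schedule a job and return its ID (must run inside an event loop)."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = {"status": "queued", "result": None}
        task = asyncio.create_task(self._run(job_id, prompt))
        self._tasks.add(task)  # keep a reference so the task is not collected
        task.add_done_callback(self._tasks.discard)
        return job_id

    async def _run(self, job_id, prompt):
        job = self._jobs[job_id]
        job["status"] = "running"
        try:
            job["result"] = await self._generate(prompt)
            job["status"] = "done"
        except Exception as exc:
            job["status"] = "failed"
            job["result"] = str(exc)

    def status(self, job_id):
        """Return a snapshot a client could poll for."""
        return dict(self._jobs[job_id])

    async def wait_all(self):
        await asyncio.gather(*self._tasks)


async def fake_generate(prompt):
    await asyncio.sleep(0.01)  # stand-in for a slow generation call
    return f"image-for:{prompt}"


async def demo():
    jobs = GenerationJobs(fake_generate)
    job_id = jobs.submit("a red mug")
    before = jobs.status(job_id)["status"]  # the job has not started yet
    await jobs.wait_all()
    return before, jobs.status(job_id)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The status snapshot is what a progress indicator or polling endpoint would read while the job runs.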

What to Watch

Multimodal context is getting longer (whole videos, large image sets), the line between input and output modalities is blurring (a model that sees, listens, and speaks in one loop), and computer-use agents are pushing visual grounding into a serious engineering discipline.