Multimodal AI
AI models that process and generate multiple types of data — text, images, audio, and video — within a single unified model.
Multimodal AI refers to models that can understand and/or generate multiple data modalities — text, images, audio, video, code — within a single model, rather than relying on a separate specialized model for each.
**What "multimodal" covers**
- *Vision + language*: Analyze images, describe photos, answer questions about visual content, generate images from text (DALL-E, Gemini, GPT-4V).
- *Audio + language*: Transcribe speech (Whisper), generate voice from text (TTS models), understand spoken input (GPT-4o's native audio mode).
- *Video*: Analyze video frames, generate short video clips (Sora, Runway, Kling).
- *Document understanding*: Parse PDFs, spreadsheets, charts — not just text but layout and structure.
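In practice, vision + language requests interleave text and image parts in a single message. A minimal sketch of building such a payload, following the OpenAI Chat Completions vision format (base64 data URL); the function name and the placeholder image bytes are illustrative, and other providers use similar interleaved content structures:

```python
import base64
import json

def build_vision_message(question: str, image_bytes: bytes) -> dict:
    """Build one chat message that mixes text and an image.

    Uses the OpenAI-style content-parts shape; the image is inlined
    as a base64 data URL so no file hosting is needed.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            },
        ],
    }

# Placeholder bytes stand in for a real PNG in this sketch.
msg = build_vision_message("What is shown in this image?", b"\x89PNG...")
print(json.dumps(msg, indent=2))
```

The same message dict would be passed as one element of the `messages` list in a chat-completion request; text-only turns simply omit the image part.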
**Why multimodal matters**
Real-world information is multimodal. A product image, a diagram, a recorded meeting, a PDF invoice — these don't translate well to text. Multimodal models process the original format, preserving information that text conversion would lose.
**Architecture approaches**
Early approaches encode each modality with a separate encoder, project the outputs into a shared embedding space, and feed the result to the LLM. Newer approaches train a single model from scratch on interleaved multimodal data (text and images mixed), building native multimodal representations.
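The encode-then-project approach can be sketched with plain arrays. In this sketch the dimensions, the random projection matrix, and the patch count are all assumptions chosen for illustration; in a real system (e.g. a LLaVA-style adapter) the projection is learned, not random:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a vision encoder emitting 768-d patch features,
# an LLM whose token embeddings are 4096-d.
VISION_DIM, TEXT_DIM = 768, 4096

def project_image_features(patch_features: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Map vision-encoder outputs into the LLM's embedding space.

    (num_patches, VISION_DIM) @ (VISION_DIM, TEXT_DIM)
      -> (num_patches, TEXT_DIM) "soft tokens" the LLM can attend to.
    """
    return patch_features @ W

patches = rng.standard_normal((196, VISION_DIM))        # e.g. a 14x14 patch grid
W = rng.standard_normal((VISION_DIM, TEXT_DIM)) * 0.01  # learned in practice
soft_tokens = project_image_features(patches, W)

text_embeddings = rng.standard_normal((10, TEXT_DIM))   # 10 text tokens
# The LLM then processes image soft tokens and text tokens as one sequence.
sequence = np.concatenate([soft_tokens, text_embeddings], axis=0)
print(sequence.shape)  # (206, 4096)
```

The key design point is that after projection, image patches look like ordinary token embeddings, so the language model's attention layers need no modality-specific changes.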
**Current frontier examples**
GPT-4o ("omni"): text, image, and audio input/output in one model. Gemini 1.5 Pro: text, image, audio, video, code. Claude 3.5: vision-capable (text + image input, text output).
**Pitfalls**
Multimodal models are larger and more expensive to run than text-only models. Image understanding quality varies significantly across models. Video generation remains expensive and limited in length. Audio generation can produce convincing deepfakes, raising serious misuse concerns.