Multimodal RAG

Multimodal RAG extends the retrieval-augmented generation pattern to handle mixed content: text documents with embedded images, standalone images, and video content. Instead of retrieving only text chunks, the system retrieves the most relevant content regardless of modality.

The Pipeline
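The pipeline can be sketched end to end: embed the query, rank every indexed chunk in the shared embedding space regardless of modality, and return the top hits. This is a minimal illustration, not a real implementation; the `Chunk` class, the `retrieve` helper, and the toy three-dimensional vectors are all hypothetical stand-ins for a real multimodal embedding model that maps text and images into one vector space.

```python
# Hypothetical sketch of multimodal retrieval. All names (Chunk, retrieve)
# are illustrative; the toy vectors stand in for a real multimodal
# embedding model that embeds text and images into a shared space.
from dataclasses import dataclass

@dataclass
class Chunk:
    modality: str            # "text", "image", or "video_frame"
    content: str             # the text itself, or a path/URI for visual content
    embedding: list          # vector in the shared text+image embedding space

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def retrieve(query_embedding, index, k=2):
    """Rank every chunk, regardless of modality, by similarity to the query."""
    ranked = sorted(index, key=lambda c: cosine(query_embedding, c.embedding),
                    reverse=True)
    return ranked[:k]

# Toy index: text and image chunks live in the same vector space.
index = [
    Chunk("text",  "The pump housing is cast aluminium.", [0.9, 0.1, 0.0]),
    Chunk("image", "figures/pump-exploded-view.png",      [0.8, 0.2, 0.1]),
    Chunk("text",  "Invoice terms are net 30 days.",      [0.0, 0.1, 0.9]),
]

results = retrieve([0.9, 0.15, 0.05], index, k=2)
```

Because everything is ranked in one space, a query about the pump surfaces both the text passage and the diagram, while the unrelated invoice chunk is left out.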

Key Differences from Text RAG

  Aspect           Text RAG                     Multimodal RAG
  ---------------  ---------------------------  ------------------------------------------
  Embedding model  Text-only (voyage-4-large)   Multimodal (voyage-multimodal-3.5)
  Chunk types      Text chunks                  Text chunks + image regions + video frames
  LLM              Text LLM                     Multimodal LLM (GPT-4V, Claude 3, etc.)
  Context          Text passages                Text + images in the prompt
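The context difference can be made concrete: a multimodal prompt interleaves retrieved text passages and image references as typed content parts, with the question appended last. The part shapes below are a generic, hypothetical sketch, not any specific provider's API; real multimodal APIs each define their own content-part schema.

```python
# Hypothetical sketch: assemble retrieved text and image chunks into a
# list of typed content parts for a multimodal LLM. The dict shapes are
# illustrative, not any specific provider's schema.
def build_prompt(question, retrieved):
    # retrieved: list of (modality, content) pairs from the retriever
    parts = []
    for modality, content in retrieved:
        if modality == "text":
            parts.append({"type": "text", "text": content})
        else:  # image regions and video frames travel as references/bytes
            parts.append({"type": "image", "source": content})
    # The user's question goes last, after all retrieved context.
    parts.append({"type": "text", "text": question})
    return parts

prompt = build_prompt(
    "How is the pump assembled?",
    [("text", "Torque the housing bolts to 12 Nm."),
     ("image", "figures/pump-exploded-view.png")],
)
```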

Use Cases

  • Technical documentation with diagrams — search finds both the text explanation and the relevant diagram
  • Product catalogs — search by description, retrieve product images and specs
  • Medical records — find relevant X-rays and clinical notes together
  • Educational content — retrieve relevant slides, figures, and lecture text

Considerations

  • Multimodal embedding costs include both token-based (text) and pixel-based (image) pricing
  • The modality gap means cross-modal similarity scores (e.g., a text query against an image chunk) run systematically lower than within-modality scores, so a single absolute similarity threshold tends to filter out relevant images
  • Not all LLMs can process mixed text+image context — use a multimodal LLM for generation
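One common way to cope with the modality gap described above is to normalize scores within each modality before merging the ranked lists. The sketch below is a hypothetical min-max normalization, not a prescribed method; the function name and the sample scores are invented for illustration.

```python
# Hypothetical sketch: min-max normalize similarity scores within each
# modality before merging, so the modality gap (systematically lower
# cross-modal scores) does not push every image below every text chunk.
def normalize_per_modality(scored):
    # scored: list of (modality, raw_score, item)
    by_mod = {}
    for mod, score, _ in scored:
        by_mod.setdefault(mod, []).append(score)
    bounds = {m: (min(s), max(s)) for m, s in by_mod.items()}
    merged = []
    for mod, score, item in scored:
        lo, hi = bounds[mod]
        norm = (score - lo) / (hi - lo) if hi > lo else 1.0
        merged.append((mod, norm, item))
    return sorted(merged, key=lambda t: t[1], reverse=True)

# Raw image scores sit in a lower band than raw text scores, but after
# per-modality normalization the best image competes with the best text.
ranked = normalize_per_modality([
    ("text",  0.82, "passage A"),
    ("text",  0.74, "passage B"),
    ("image", 0.41, "diagram 1"),
    ("image", 0.28, "diagram 2"),
])
```

On raw scores, both diagrams would rank below both passages; after normalization, the top text chunk and the top image chunk share the top of the merged list.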

Further Reading