Multimodal Embeddings

Multimodal embeddings place different content types — text, images, and video frames — into the same vector space. This enables cross-modal search: finding images with text queries, or finding text documents related to an image.

How It Works

A multimodal embedding model processes different input types through specialized encoders, then maps them into a shared embedding space where similarity is meaningful across modalities.
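Because the encoders map into one shared space, an ordinary vector similarity measure such as cosine similarity works across modalities. A minimal sketch with toy 4-dimensional vectors (real models output far more dimensions, e.g. 1024; the vectors and labels here are illustrative, not real model output):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings in a shared space:
text_query = [0.9, 0.1, 0.0, 0.2]    # text: "a red bicycle"
image_a = [0.85, 0.15, 0.05, 0.1]    # image: photo of a red bicycle
image_b = [0.1, 0.9, 0.3, 0.0]       # image: photo of a cat

# The text query is closer to the matching image than to the unrelated one.
assert cosine(text_query, image_a) > cosine(text_query, image_b)
```

The point is that "similarity is meaningful across modalities": a text vector and an image vector can be compared directly, with no translation step between the two content types.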

Voyage AI Multimodal Model

Voyage AI offers voyage-multimodal-3.5, which supports text, image, and video inputs:

Feature      Details
Model        voyage-multimodal-3.5
Inputs       Text, images, video frames
Context      32K tokens
Dimensions   1024 (default), 256, 512, 2048
Pricing      $0.12/M tokens + $0.60/B pixels
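The pricing line implies a simple cost model: a per-token charge for text plus a per-pixel charge for images. A hedged sketch of the arithmetic, assuming tokens and pixels are billed independently and additively (how the provider actually meters mixed inputs may differ):

```python
# Rates taken from the table above.
TOKEN_PRICE_PER_M = 0.12   # dollars per 1M tokens
PIXEL_PRICE_PER_B = 0.60   # dollars per 1B pixels

def estimate_cost(tokens: int, pixels: int) -> float:
    """Rough cost estimate for one embedding request,
    assuming token and pixel charges simply add up."""
    return tokens / 1e6 * TOKEN_PRICE_PER_M + pixels / 1e9 * PIXEL_PRICE_PER_B

# Example: 10,000 text tokens plus one 1024x1024 image.
cost = estimate_cost(10_000, 1024 * 1024)
```

At these rates a 1024x1024 image (about a megapixel) costs well under a tenth of a cent, so for most workloads the text tokens dominate the bill.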

Use Cases

  • Visual search: Find products by describing them in text
  • Content discovery: Find articles related to an uploaded image
  • Video search: Search video content using natural language queries
  • Mixed collections: Search across documents that contain both text and images
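The mixed-collections case reduces to ranking one corpus of precomputed embeddings, regardless of each item's original modality. A minimal sketch with toy 3-dimensional vectors (the file names and vectors are illustrative placeholders, not real embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def search(query_vec, corpus):
    """Rank (item_id, embedding) pairs by similarity to the query.
    Items can be text, images, or video frames, since all embeddings
    live in the same shared space."""
    return sorted(corpus, key=lambda item: cosine(query_vec, item[1]),
                  reverse=True)

# Hypothetical precomputed embeddings for a mixed collection.
corpus = [
    ("article.txt", [0.1, 0.8, 0.2]),
    ("photo.jpg", [0.9, 0.1, 0.1]),
    ("clip_frame.png", [0.7, 0.3, 0.2]),
]
query = [0.95, 0.05, 0.1]  # embedding of a text query
results = search(query, corpus)
```

In practice the corpus would live in a vector database and the query vector would come from the embedding model, but the ranking step is exactly this: one similarity function over one shared space.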