Cross-Modal Search

Cross-modal search uses multimodal embeddings to find content in one modality using a query from another — for example, finding images using a text description, or finding documents related to an uploaded image.

How It Works

When text and images share the same embedding space, a text query like "sunset over the ocean" produces a vector that's close to images of sunsets. The vector search doesn't know or care about modalities — it just finds the nearest vectors.
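The mechanics can be sketched in a few lines. This is an illustrative toy, not real model output: in practice the vectors would come from a multimodal encoder (e.g. a CLIP-style model), but here the "image" embeddings are random unit vectors and the "text" query is a hypothetical vector placed near one of them, so nearest-vector search finds it regardless of which modality produced it.

```python
import numpy as np

def normalize(v):
    """Scale vectors to unit length so cosine similarity is a dot product."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Five indexed "image" embeddings in a shared 8-dim space (toy data).
image_vectors = normalize(rng.normal(size=(5, 8)))

# A "text" query embedded into the same space; constructed to lie close
# to image 2 so the example has a clear nearest neighbor.
query = normalize(image_vectors[2] + 0.1 * rng.normal(size=8))

# The search itself is modality-agnostic: just cosine similarity
# against every indexed vector, then take the argmax.
scores = image_vectors @ query
best = int(np.argmax(scores))
print(f"nearest image index: {best}")
```

A production system would replace the brute-force dot product with an approximate nearest-neighbor index, but the principle is the same: once everything lives in one space, retrieval never inspects the modality.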

Examples

| Query Modality | Result Modality | Example |
| --- | --- | --- |
| Text | Images | "red sports car" finds car photos |
| Image | Text | Upload a photo, find related articles |
| Text | Video | "cooking pasta" finds video clips |
| Image | Images | Upload a photo, find visually similar images |

Key Consideration: Modality Gap

There's typically a small but measurable gap between text and image embeddings — text vectors cluster together, and image vectors cluster together, with a gap between them. This means:

  • Within-modality similarity scores are higher (text-to-text, image-to-image)
  • Cross-modal similarity scores are somewhat lower (text-to-image)

This is normal and expected. See Modality Gap for details.
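The effect above can be simulated with toy data. This sketch is purely illustrative (no real model is involved): text and image embeddings are modeled as two clusters of vectors that share underlying "content" directions but are pushed apart by a constant offset, mimicking the modality gap. Within-cluster similarities then come out higher than cross-cluster ones.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(1)

# Four items, each with a shared "content" representation (toy data).
base = rng.normal(size=(4, 16))
gap = np.full(16, 0.8)  # hypothetical constant per-modality offset

text_vecs = normalize(base + gap)   # text cluster, shifted one way
image_vecs = normalize(base - gap)  # image cluster, shifted the other way

def mean_sim(a, b):
    """Mean pairwise cosine similarity between two sets of unit vectors."""
    return float(np.mean(a @ b.T))

within_text = mean_sim(text_vecs, text_vecs)
cross_modal = mean_sim(text_vecs, image_vecs)
print(f"within-modality mean: {within_text:.3f}")
print(f"cross-modal mean:     {cross_modal:.3f}")
```

The practical consequence: don't compare cross-modal scores against thresholds tuned on within-modality queries; calibrate them separately.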

Further Reading