
Quantization

Quantization reduces the storage size of embeddings by converting 32-bit floats to smaller data types such as int8 (8 bits per dimension) or binary (1 bit per dimension). This trades a small amount of accuracy for significant savings in storage and memory.

Size Comparison

Output Type   Bits per Dimension   1024-dim Vector Size   Relative Size
float32       32                   4,096 bytes            100%
int8          8                    1,024 bytes            25%
uint8         8                    1,024 bytes            25%
binary        1                    128 bytes              3.1%
ubinary       1                    128 bytes              3.1%
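
The table values follow directly from bytes = dimensions × bits ÷ 8. A quick check in Python (the helper name is illustrative, not part of vai):

```python
def embedding_size_bytes(dims: int, bits_per_dim: int) -> int:
    """Storage for one embedding: dims * bits, rounded up to whole bytes."""
    return (dims * bits_per_dim + 7) // 8

dims = 1024
float32_size = embedding_size_bytes(dims, 32)
for dtype, bits in [("float32", 32), ("int8", 8), ("binary", 1)]:
    size = embedding_size_bytes(dims, bits)
    relative = 100 * size / float32_size
    print(f"{dtype}: {size} bytes ({relative:.1f}%)")
# float32: 4096 bytes (100.0%)
# int8: 1024 bytes (25.0%)
# binary: 128 bytes (3.1%)
```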

When to Use Quantization

  • Large corpora (millions of documents) where storage costs matter
  • Edge deployment with limited memory
  • Fast approximate search as a first-pass filter before float-precision reranking
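
The first-pass-filter pattern in the last bullet can be sketched in pure Python. Everything here is an illustrative assumption, not vai behavior: the sign-based binarization, the toy random corpus, and the candidate count of 20.

```python
import math
import random

def binarize(vec):
    """Sign-quantize a float vector into bits packed in one int."""
    code = 0
    for x in vec:
        code = (code << 1) | (1 if x > 0 else 0)
    return code

def hamming(a, b):
    return bin(a ^ b).count("1")

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

random.seed(0)
corpus = [[random.gauss(0, 1) for _ in range(64)] for _ in range(1000)]
codes = [binarize(v) for v in corpus]   # compact binary index: 8 bytes per doc
query = corpus[42]                      # query identical to document 42
qcode = binarize(query)

# Stage 1: cheap Hamming-distance scan over binary codes, keep 20 candidates
candidates = sorted(range(len(corpus)), key=lambda i: hamming(qcode, codes[i]))[:20]
# Stage 2: float-precision cosine rerank of the small shortlist
best = max(candidates, key=lambda i: cosine(query, corpus[i]))
print(best)  # → 42
```

The expensive float comparison runs on only 20 vectors instead of 1,000; the binary scan does the bulk of the work.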

Quality Impact

Quantization reduces precision but the impact is often small:

  • int8: Minimal quality loss (~1-2% on retrieval benchmarks). Good default for production.
  • binary: Larger quality loss (~5-10%) but 32× smaller. Best as a coarse filter with float reranking.

# Benchmark the impact yourself
vai benchmark quantization
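
To see why the int8 loss is so small, here is a minimal symmetric-scale quantizer round-trip. This is one common scheme, not necessarily the one vai uses:

```python
def quantize_int8(vec):
    """Map floats to integers in [-127, 127] using a per-vector scale."""
    scale = max(abs(x) for x in vec) / 127.0
    return [round(x / scale) for x in vec], scale

def dequantize(qvec, scale):
    return [q * scale for q in qvec]

vec = [0.12, -0.98, 0.45, 0.07, -0.33]
qvec, scale = quantize_int8(vec)
restored = dequantize(qvec, scale)
max_err = max(abs(a - b) for a, b in zip(vec, restored))
print(qvec)     # small integers, 1 byte each instead of 4
print(max_err)  # worst-case per-dimension error is at most scale / 2
```

Each dimension is off by at most half a quantization step, which is tiny relative to typical embedding magnitudes; that is why retrieval rankings barely change.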

Using Quantization in vai

# Generate int8 embeddings
vai embed "hello world" --output-dtype int8

# Generate binary embeddings
vai embed "hello world" --output-dtype binary

# Store with quantized embeddings
vai store --text "..." --db myapp --collection docs --output-dtype int8

Further Reading