
vai pipeline

The all-in-one command: chunk documents, generate embeddings, and store vectors in MongoDB Atlas in a single step.

Synopsis

vai pipeline <input> [options]

Description

vai pipeline takes a file or directory as input and runs the complete RAG ingestion pipeline:

  1. Chunk: Split files into embedding-sized pieces using one of five strategies
  2. Embed: Generate vector embeddings, either via the Voyage AI API or locally with voyage-4-nano
  3. Store: Insert documents with embeddings into MongoDB Atlas

It reads settings from .vai.json (created by vai init) and merges them with CLI flags. For directories, it recursively scans for supported file types (.txt, .md, .html, .json, .jsonl, .pdf).
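
For reference, a minimal .vai.json might look like the following. This is an illustrative sketch; the exact field names may differ, and vai init generates the authoritative file:

```json
{
  "db": "myapp",
  "collection": "knowledge",
  "model": "voyage-4-large",
  "field": "embedding",
  "index": "vector_index"
}
```

Any CLI flag you pass overrides the corresponding value from this file.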

Optionally, --create-index auto-creates a vector search index after insertion. In v1.31.0, --local routes the embedding step through voyage-4-nano using the lightweight Python bridge while leaving chunking and MongoDB storage unchanged.
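
For context, a MongoDB Atlas vector search index definition over the default embedding field typically looks like the JSON below. This is a sketch: numDimensions must match your embedding model's output (1024 is an assumed example), and --create-index generates the equivalent definition for you:

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1024,
      "similarity": "cosine"
    }
  ]
}
```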

Options

| Flag | Description | Default |
|---|---|---|
| <input> | File or directory to process (required) | |
| --db <database> | Database name | From .vai.json |
| --collection <name> | Collection name | From .vai.json |
| --field <name> | Embedding field name | embedding |
| --index <name> | Vector search index name | vector_index |
| -m, --model <model> | Embedding model | voyage-4-large |
| --local | Use local voyage-4-nano inference for the embedding step | false |
| -d, --dimensions <n> | Output dimensions | Model default |
| -s, --strategy <strategy> | Chunking strategy: fixed, sentence, paragraph, recursive, markdown | recursive |
| -c, --chunk-size <n> | Target chunk size in characters | 512 |
| --overlap <n> | Overlap between chunks | 50 |
| --batch-size <n> | Texts per embedding API call | 25 |
| --text-field <name> | Text field for JSON/JSONL input | text |
| --extensions <exts> | File extensions to include (comma-separated) | All supported |
| --ignore <dirs> | Directory names to skip | node_modules,.git,__pycache__ |
| --create-index | Auto-create vector search index | |
| --dry-run | Show what would happen without executing | |
| --estimate | Show cost estimate, optionally switch model | |
| --json | Machine-readable JSON output | |
| -q, --quiet | Suppress non-essential output | |

Examples

Ingest a directory of docs

vai pipeline ./docs/ --db myapp --collection knowledge --create-index

Ingest with local nano inference

vai nano setup
vai pipeline ./docs/ --local --db myapp --collection knowledge --create-index

Ingest a single file with markdown chunking

vai pipeline README.md --db myapp --collection docs --strategy markdown

Dry run to preview chunks and cost

vai pipeline ./docs/ --db myapp --collection knowledge --dry-run

Cost estimate with model comparison

vai pipeline ./docs/ --db myapp --collection knowledge --estimate

Custom chunking settings

vai pipeline ./data/ --db myapp --collection docs \
--strategy recursive --chunk-size 1024 --overlap 100 --batch-size 50

How It Works

  1. File scanning: Recursively finds supported files in the input directory. .md files automatically use the markdown chunking strategy when recursive is selected.
  2. Chunking: Each file is read and split into chunks. Metadata (source path, chunk index, total chunks) is attached to each chunk.
  3. Embedding: Chunks are batched and embedded with --input-type document, either through the Voyage AI API or locally with voyage-4-nano when --local is enabled.
  4. Storage: Documents (text + embedding + metadata) are inserted into MongoDB via insertMany.
  5. Indexing (optional): A vector search index is created on the embedding field.
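
The chunking step above (step 2) can be approximated with a short sketch. This illustrates character-based chunking with overlap and the per-chunk metadata described above; it is not vai's actual implementation:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[dict]:
    """Split text into overlapping fixed-size chunks, attaching index metadata."""
    step = chunk_size - overlap  # each chunk starts `overlap` chars before the previous one ends
    chunks = []
    for start in range(0, max(len(text), 1), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append({"text": piece, "chunk_index": len(chunks)})
    # total_chunks is only known once the loop finishes
    for chunk in chunks:
        chunk["total_chunks"] = len(chunks)
    return chunks

# 1000 chars with 512-char chunks and 50-char overlap -> 3 chunks
chunks = chunk_text("a" * 1000, chunk_size=512, overlap=50)
```

In the real pipeline, each of these dicts would also carry the source path before being embedded and inserted.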

Tips

  • Initialize your project with vai init first to avoid passing --db and --collection every time.
  • The --dry-run flag shows chunk counts and estimated cost without making any API calls or database writes.
  • Use --local when you want a local-first ingestion path: voyage-4-nano handles embedding, so no Voyage API key is needed for that step.
  • After pipeline completes, use vai query to search your indexed documents.
  • For Markdown files, the recursive strategy auto-detects and switches to markdown chunking for better section-aware splits.
  • Because voyage-4-nano shares an embedding space with the rest of the Voyage 4 family, a collection indexed locally remains compatible with broader Voyage 4 workflows later.
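
Tying the last two tips together: querying the stored collection ultimately relies on MongoDB's $vectorSearch aggregation stage, which vai query wraps for you. The sketch below only builds the stage (using the defaults from the options table) without connecting to a database; the numCandidates heuristic is an assumption, not vai's documented behavior:

```python
def build_vector_search_stage(query_vector: list[float],
                              index: str = "vector_index",
                              path: str = "embedding",
                              limit: int = 5) -> dict:
    """Construct a MongoDB Atlas $vectorSearch aggregation stage."""
    return {
        "$vectorSearch": {
            "index": index,            # name passed to --index / --create-index
            "path": path,              # embedding field from --field
            "queryVector": query_vector,
            "numCandidates": limit * 20,  # candidate pool; 10-20x limit is a common heuristic
            "limit": limit,
        }
    }

stage = build_vector_search_stage([0.1, 0.2, 0.3])
```

The resulting dict would be the first stage of an aggregation pipeline run against the collection, with the query vector produced by embedding the search text with --input-type query.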