vai chunk

Split documents into embedding-sized chunks using one of five strategies. Outputs JSONL by default for piping into other commands.

Synopsis

vai chunk [input] [options]

Description

vai chunk reads files or directories, splits text into chunks, and outputs them as JSONL (one JSON object per line) with text and metadata. Each chunk includes the source file, chunk index, and total chunks from that file.

Five chunking strategies are available:

Strategy   Description
fixed      Split every N characters
sentence   Split on sentence boundaries
paragraph  Split on paragraph boundaries (double newlines)
recursive  Split recursively by paragraphs → sentences → characters
markdown   Split on Markdown headings, preserving document structure

When using recursive on .md files, vai automatically switches to the markdown strategy.
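
The recursive fallback order listed above (paragraphs → sentences → characters) can be sketched roughly as follows. This is a minimal illustration, not vai's actual implementation: the function name recursive_split, the sentence-boundary regex, and the absence of chunk merging or overlap are all assumptions made for brevity.

```python
import re


def recursive_split(text: str, chunk_size: int) -> list[str]:
    """Hypothetical sketch: try paragraphs, then sentences, then raw characters."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= chunk_size:
        return [text]
    # Separators from coarsest to finest; the sentence regex is an assumption.
    for pattern in (r"\n\s*\n", r"(?<=[.!?])\s+"):
        parts = [p for p in re.split(pattern, text) if p.strip()]
        if len(parts) > 1:
            chunks: list[str] = []
            for part in parts:
                # Each part is strictly shorter than text, so recursion terminates.
                chunks.extend(recursive_split(part, chunk_size))
            return chunks
    # Last resort: hard cut every chunk_size characters.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


doc = "First paragraph. It has two sentences.\n\nSecond paragraph is short."
for chunk in recursive_split(doc, 30):
    print(repr(chunk))
```

In this sketch the first paragraph exceeds 30 characters, so it falls through to sentence splitting, while the second paragraph fits and is kept whole.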

Options

Flag                       Description                                    Default
[input]                    File or directory to chunk
-s, --strategy <strategy>  Chunking strategy                              recursive
-c, --chunk-size <n>       Target chunk size in characters                From .vai.json
--overlap <n>              Overlap between chunks in characters           From .vai.json
--min-size <n>             Minimum chunk size (drop smaller)
-o, --output <path>        Output file (JSONL). Omit for stdout
--text-field <name>        Text field for JSON/JSONL input                text
--extensions <exts>        File extensions to include (comma-separated)   All supported
--ignore <dirs>            Directory names to skip                        node_modules,.git,__pycache__
--dry-run                  Show what would be chunked without processing
--stats                    Show chunking statistics after processing
--json                     Full JSON output (includes stats + chunks)
-q, --quiet                Suppress non-essential output
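
To make --chunk-size, --overlap, and --min-size concrete, here is a minimal sketch of fixed-size chunking. It assumes that overlap means each chunk repeats the last overlap characters of the previous one, and that chunks shorter than the minimum size are dropped; vai's exact behavior may differ. The helper fixed_chunks is hypothetical and not part of vai.

```python
def fixed_chunks(text: str, chunk_size: int, overlap: int = 0, min_size: int = 0) -> list[str]:
    """Hypothetical sketch of fixed-size chunking with overlap and a size floor."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if len(chunk) >= min_size:  # assumed meaning of --min-size
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(fixed_chunks("abcdefghij", chunk_size=4, overlap=1))
print(fixed_chunks("abcdefghij", chunk_size=4, min_size=3))
```

With an overlap of 1, consecutive chunks share one character; with a minimum size of 3, the trailing two-character remainder is dropped.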

Examples

Chunk a single file

vai chunk README.md

Chunk a directory with custom settings

vai chunk ./docs/ --strategy markdown --chunk-size 1024 --overlap 100

Save chunks to a file

vai chunk ./docs/ -o chunks.jsonl

Dry run to preview files

vai chunk ./src/ --dry-run --extensions ".js,.ts"

Write chunks to a file, then ingest

vai chunk ./docs/ -o chunks.jsonl
vai ingest --file chunks.jsonl --db myapp --collection docs --field embedding

Output Format

Each line of JSONL output:

{"text": "chunk content here...", "metadata": {"source": "docs/intro.md", "chunk_index": 0, "total_chunks": 5}}
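
Because each output line is an independent JSON object, downstream scripts can consume chunks with a few lines of code. The sketch below parses JSONL in the shape shown above and groups chunk texts by source file. The helper group_by_source is hypothetical, and the sample records are made up for illustration.

```python
import json
from collections import defaultdict
from typing import Iterable


def group_by_source(lines: Iterable[str]) -> dict[str, list[str]]:
    """Group chunk texts by metadata.source, keeping the order they appear in."""
    grouped: defaultdict[str, list[str]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        grouped[record["metadata"]["source"]].append(record["text"])
    return dict(grouped)


# Made-up sample records in the documented shape.
sample = [
    '{"text": "Intro, part one", "metadata": {"source": "docs/intro.md", "chunk_index": 0, "total_chunks": 2}}',
    '{"text": "Intro, part two", "metadata": {"source": "docs/intro.md", "chunk_index": 1, "total_chunks": 2}}',
    '{"text": "Setup steps", "metadata": {"source": "docs/setup.md", "chunk_index": 0, "total_chunks": 1}}',
]
grouped = group_by_source(sample)
print({source: len(texts) for source, texts in grouped.items()})
```

The same function accepts an open file handle, so a file written with -o chunks.jsonl can be passed in directly after opening it in text mode.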

With --json, the output is a single JSON object containing totalChunks, totalTokens, strategy, a files array, and a chunks array.

With --stats, vai chunk prints a summary that includes file count, total input characters, chunk count, average chunk size, estimated tokens, and estimated cost.
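
A summary along these lines can be approximated from JSONL output. This sketch rests on stated assumptions: the ratio of 4 characters per token is a rough rule of thumb rather than vai's estimator, and chunk_stats is a hypothetical name. Total input characters and cost are left out because they cannot be derived from the chunks alone without knowing the overlap and pricing.

```python
import json
from typing import Iterable


def chunk_stats(lines: Iterable[str], chars_per_token: float = 4.0) -> dict:
    """Approximate chunking statistics from JSONL lines (assumed token ratio)."""
    sizes: list[int] = []
    sources: set[str] = set()
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        sizes.append(len(record["text"]))
        sources.add(record["metadata"]["source"])
    total_chars = sum(sizes)
    return {
        "files": len(sources),
        "chunks": len(sizes),
        "avg_chunk_size": total_chars / len(sizes) if sizes else 0.0,
        # Assumption: roughly 4 characters per token.
        "estimated_tokens": round(total_chars / chars_per_token),
    }


# Made-up sample records in the documented shape.
sample_lines = [
    '{"text": "abcd", "metadata": {"source": "a.md", "chunk_index": 0, "total_chunks": 1}}',
    '{"text": "abcdefgh", "metadata": {"source": "b.md", "chunk_index": 0, "total_chunks": 1}}',
]
print(chunk_stats(sample_lines))
```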

Tips

  • Use --strategy markdown for Markdown files to split on headings. The recursive strategy switches to markdown automatically for .md files.
  • The --min-size option drops chunks smaller than the threshold, which is useful for filtering out heading-only chunks.
  • Settings from .vai.json (chunk strategy, size, overlap) are used as defaults when available.
  • For the full pipeline (chunk → embed → store), use vai pipeline instead.
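
The tips on heading-based splitting and --min-size can be illustrated together. The sketch below splits Markdown at ATX headings (lines starting with one to six # characters) and then drops chunks below a minimum size, which is how a heading with no body text would be filtered out. It is a simplified assumption about the behavior, not vai's implementation: split_on_headings is hypothetical, and it ignores chunk-size limits, Setext headings, and # lines inside fenced code blocks.

```python
import re

# ATX heading: one to six '#' characters followed by whitespace.
HEADING = re.compile(r"^#{1,6}\s")


def split_on_headings(markdown: str, min_size: int = 0) -> list[str]:
    """Hypothetical sketch: start a new chunk at every heading, then filter by size."""
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop empty chunks and anything below the assumed --min-size threshold.
    return [c for c in chunks if c and len(c) >= min_size]


md = "# Title\n\n## Setup\nInstall the tool and run it.\n\n## Usage\nCall it with a path."
print(split_on_headings(md))               # includes the heading-only "# Title" chunk
print(split_on_headings(md, min_size=10))  # the heading-only chunk is dropped
```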