Chunking defines the unit of retrieval. If chunks are poorly constructed, retrievers either miss relevant evidence or return noisy context, and downstream generation quality falls immediately.
Five strategy families (from simple to advanced):
- Character splitter: fixed-size chunks; fast and cheap, but brittle on long mixed-topic text.
- Recursive splitter: tries paragraph/sentence/word boundaries in priority order; best default for most text corpora.
- Document-structure-aware splitting: uses native structure such as headings, sections, pages, rows, or code blocks.
- Semantic chunking: split where embedding similarity drops between adjacent sentences.
- Agentic chunking: LLM decides boundary placement from meaning and task intent.
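The recursive splitter named above as the best default can be sketched in a few lines. This is a minimal illustration, not any specific library's implementation: it tries separators in priority order (paragraph, line, sentence, word) and falls back to a hard character split only when nothing else works.

```python
def recursive_split(text, max_len=200, seps=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most max_len characters, preferring
    paragraph, then line, then sentence, then word boundaries."""
    if len(text) <= max_len:
        return [text]
    for sep in seps:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= max_len:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    if len(part) > max_len:
                        # A single part can still be too long: recurse so a
                        # finer separator gets a chance to break it up.
                        chunks.extend(recursive_split(part, max_len, seps))
                        current = ""
                    else:
                        current = part
            if current:
                chunks.append(current)
            return chunks
    # No separator matched: fall back to a plain character split.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

The priority order is what makes this the strong baseline: chunks end at the most natural boundary available, so fixed-size brittleness only appears as a last resort.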
Decision matrix in practice:
- Choose recursive when you need strong baseline quality fast.
- Choose document-aware when structure is explicit (legal headers, markdown docs, financial sections).
- Choose semantic/agentic only when quality gains justify significantly higher ingestion cost and complexity.
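To make the semantic option concrete, here is a runnable sketch of the boundary rule: start a new chunk wherever similarity between adjacent sentences drops below a threshold. The bag-of-words `embed` function is a stand-in assumption so the example runs without a model; a production system would use a real sentence-embedding model, which is exactly the ingestion cost noted above.

```python
import math
from collections import Counter

def embed(sentence):
    # Stand-in embedding: bag-of-words token counts. Replace with a real
    # sentence-embedding model in practice; this only keeps the sketch runnable.
    return Counter(sentence.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.2):
    """Group sentences into chunks, starting a new chunk wherever
    adjacent-sentence similarity drops below the threshold."""
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(cur)
    chunks.append(" ".join(current))
    return chunks
```

The threshold is corpus-dependent and usually tuned against a retrieval eval set rather than fixed globally.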
Operational risks: over-chunking (context fragmentation), under-chunking (retrieval noise), duplicate-heavy overlap, and inconsistent policies across document types.
High-performing systems route by document type: FAQ text, policy PDFs, scanned documents, and tables often need different split strategies, not one global default.
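Routing by document type can be as simple as a dispatch table. The strategy names below are illustrative assumptions, not a standard API; in a real pipeline each name would map to a configured splitter.

```python
def choose_strategy(doc):
    """Pick a chunking strategy from the document's type.
    Strategy names are placeholders to wire to real splitters."""
    doc_type = doc.get("type")
    if doc_type == "faq":
        return "per_question"        # keep each Q/A pair as one chunk
    if doc_type in ("policy_pdf", "markdown"):
        return "structure_aware"     # split on headings and sections
    if doc_type == "scanned":
        return "ocr_then_recursive"  # OCR first, then recursive split
    if doc_type == "table":
        return "row_or_table_level"  # preserve rows or whole tables
    return "recursive"               # sensible global default
```

Centralizing the routing decision also addresses the inconsistent-policy risk above: every document type gets an explicit, reviewable chunking policy instead of an accidental one.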
Interview-Ready Deepening
- Why chunking is the most impactful RAG decision: fixed vs semantic vs agentic.
Tradeoffs You Should Be Able to Explain
- Higher recall often increases context noise; reranking and filtering are required to keep precision high.
- Smaller chunks improve semantic precision but can break cross-sentence context needed for accurate answers.
- Aggressive grounding reduces hallucinations but can increase abstentions when retrieval coverage is weak.
First-time learner note: Master one stage at a time: ingestion, retrieval, then grounded generation. Validate each stage with small test questions before tuning everything together.
Production note: Treat quality as measurable system behavior. Track retrieval relevance, groundedness, and abstention quality with repeatable eval sets.
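Making quality measurable can start with two standard retrieval metrics over a small eval set: hit rate (did a relevant chunk appear in the top k?) and mean reciprocal rank (how high did it rank?). A minimal sketch, assuming a `retrieve` function that returns ranked chunk IDs:

```python
def hit_rate_and_mrr(eval_set, retrieve, k=5):
    """Compute hit rate and mean reciprocal rank over an eval set.
    eval_set: list of (question, relevant_chunk_id) pairs.
    retrieve: function mapping a question to a ranked list of chunk IDs."""
    hits, rr_sum = 0, 0.0
    for question, relevant_id in eval_set:
        ranked = retrieve(question)[:k]
        if relevant_id in ranked:
            hits += 1
            rr_sum += 1.0 / (ranked.index(relevant_id) + 1)
    n = len(eval_set)
    return hits / n, rr_sum / n
```

Running this same eval set before and after a chunking change turns "did retrieval get better?" into a repeatable number rather than an impression.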