RAG combines language generation with retrieval over an external knowledge index. In practical terms, the LLM no longer answers from memory alone; it answers from fetched evidence.
Core limitation RAG addresses: context window size is finite while knowledge bases are effectively unbounded. Even when a model supports very large token windows, sending everything is still expensive, slow, and often lower quality because irrelevant text dilutes signal.
Tokens: model input/output is priced and bounded by tokens. This means architecture choices (chunk size, top-k, prompt template) directly affect both quality and cost.
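The chunk-size/top-k arithmetic can be sketched directly. This is a back-of-envelope sketch; the price and template sizes below are illustrative assumptions, not real model pricing.

```python
# Back-of-envelope prompt budget: chunk size and top-k drive how many
# tokens each query sends. All constants here are illustrative assumptions.
def prompt_tokens(chunk_tokens: int, top_k: int, question_tokens: int = 50,
                  template_tokens: int = 200) -> int:
    """Tokens in the final prompt: retrieved evidence + question + template."""
    return chunk_tokens * top_k + question_tokens + template_tokens

def cost_usd(tokens: int, usd_per_1k: float = 0.01) -> float:
    """Hypothetical flat per-1k-token price; real pricing varies by model."""
    return tokens / 1000 * usd_per_1k

small = prompt_tokens(chunk_tokens=200, top_k=5)   # 1250 tokens per query
large = prompt_tokens(chunk_tokens=800, top_k=10)  # 8250 tokens per query
print(small, large)
```

Doubling chunk size and top-k together multiplies per-query cost several times over, which is why these knobs are architecture decisions rather than tuning afterthoughts.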
Embeddings: text is mapped into high-dimensional vectors where semantic similarity becomes geometric proximity. A query like 'refund period' can retrieve chunks mentioning 'return window' without exact keyword overlap.
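"Semantic similarity becomes geometric proximity" is usually measured with cosine similarity. A minimal sketch with toy 3-d vectors (real embeddings have thousands of dimensions; the vectors below are made up for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy 3-d "embeddings"; an embedding model would place semantically
# related phrases near each other even without keyword overlap.
refund_period = [0.9, 0.1, 0.2]
return_window = [0.85, 0.15, 0.25]  # close in meaning -> nearby vector
pizza_recipe = [0.1, 0.9, 0.1]      # unrelated -> distant vector

print(cosine(refund_period, return_window))  # high (~0.99)
print(cosine(refund_period, pizza_recipe))   # low  (~0.24)
```

This is how 'refund period' can match 'return window': the retriever compares directions in the embedding space, not keywords.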
Vector database responsibilities:
- Indexing vectors for fast nearest-neighbor search (ANN/HNSW/IVF style internals depending on backend).
- Metadata filtering (tenant, language, policy version, date range, access scope).
- Persistence and lifecycle (upserts, deletes, re-indexing, snapshot/backup).
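The responsibilities above can be sketched as a minimal in-memory store. This is an illustration only: it does brute-force scans, whereas real backends use ANN indexes (HNSW, IVF) precisely to avoid that.

```python
import math

class TinyVectorStore:
    """Illustrative in-memory vector store: upserts, deletes, metadata
    filtering, and brute-force nearest-neighbor search."""

    def __init__(self):
        self._rows = {}  # doc_id -> (vector, metadata)

    def upsert(self, doc_id, vector, metadata):
        self._rows[doc_id] = (vector, metadata)

    def delete(self, doc_id):
        self._rows.pop(doc_id, None)

    def query(self, vector, top_k=3, where=None):
        """Rank by cosine similarity, keeping only rows whose metadata
        matches every key/value pair in `where`."""
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)

        hits = []
        for doc_id, (vec, meta) in self._rows.items():
            if where and any(meta.get(k) != v for k, v in where.items()):
                continue
            hits.append((cos(vector, vec), doc_id, meta))
        return sorted(hits, reverse=True)[:top_k]

store = TinyVectorStore()
store.upsert("a", [1.0, 0.0], {"tenant": "acme", "lang": "en"})
store.upsert("b", [0.9, 0.1], {"tenant": "other", "lang": "en"})
print(store.query([1.0, 0.0], where={"tenant": "acme"}))  # only "a" survives
```

Metadata filtering is what makes multi-tenant and access-scoped retrieval possible: the filter runs alongside the similarity search, not after it.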
The two pipelines and their boundaries:
- Injection (offline): load documents, normalize, chunk, embed, index with metadata.
- Retrieval (online): interpret query, embed query, retrieve/rank candidates, pass evidence to answer generation.
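Both pipelines, end to end, can be sketched in a few lines. The embedder below is a toy bag-of-words stand-in over a tiny made-up vocabulary (a real model maps text into thousands of learned dimensions); the document text and query are invented for illustration.

```python
import math

VOCAB = ["refund", "return", "days", "shipping", "policy", "price"]

def toy_embed(text):
    """Toy stand-in for an embedding model: counts vocabulary-word
    prefixes and L2-normalizes the result."""
    words = text.lower().split()
    vec = [float(sum(w.startswith(v) for w in words)) for v in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# Injection (offline): split the document, embed each chunk, index it.
chunks = [
    "Refunds are accepted within 30 days of purchase.",
    "Shipping takes 5 business days on average.",
]
index = [(c, toy_embed(c)) for c in chunks]

# Retrieval (online): embed the query, rank chunks by dot-product similarity.
query_vec = toy_embed("what is the refund period?")
ranked = sorted(index, key=lambda cv: -sum(a * b for a, b in zip(query_vec, cv[1])))
evidence = ranked[0][0]
print(evidence)  # the refund chunk is what gets passed to generation
```

Note the boundary: injection runs once per document, retrieval runs once per query, and the only thing they share is the embedding model and the index.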
Important production caveat: embeddings are model-specific. If you rotate embedding models, you usually need full re-embedding and re-indexing to keep similarity semantics consistent.
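One common guard for this caveat is to record the embedding model name alongside the index and refuse mismatched queries. A sketch of that pattern, with hypothetical model names:

```python
class EmbeddingIndex:
    """Records which embedding model built the index; distances between
    vectors from different models are meaningless."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        self.vectors = {}  # doc_id -> vector

    def add(self, doc_id, vector):
        self.vectors[doc_id] = vector

    def check_query(self, query_model: str):
        if query_model != self.model_name:
            # Rotating models requires full re-embedding and re-indexing.
            raise ValueError(
                f"index built with {self.model_name}, "
                f"query embedded with {query_model}"
            )

idx = EmbeddingIndex("embed-model-v1")  # hypothetical model name
idx.add("a", [0.1, 0.2])
try:
    idx.check_query("embed-model-v2")
except ValueError as e:
    print("rejected:", e)
```

Failing loudly here is cheaper than silently returning nonsense neighbors after a model rotation.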
Interview-Ready Deepening
Source-backed reinforcement: these points expand on the brief in-course hints and emphasize production tradeoffs.
- Context windows, chunking, embedding models, and the injection vs retrieval pipeline.
- Although diagrams often show three dimensions per vector for readability, popular embedding models such as OpenAI's text-embedding-3-large produce up to 3,072 dimensions per vector.
- Whether you embed a single word like "cat" or a long paragraph, the output is always one fixed-length vector (3,072 dimensions for that model).
- The retriever takes the query embedding and scans the indexed vector embeddings to find the ones closest in semantic meaning to the user's query.
- After this point the pipeline no longer deals with raw vectors: the retriever returns the top 5-10 matching chunks as text, which are passed to the generator as evidence.
Tradeoffs You Should Be Able to Explain
- Higher recall often increases context noise; reranking and filtering are required to keep precision high.
- Smaller chunks improve semantic precision but can break cross-sentence context needed for accurate answers.
- Aggressive grounding reduces hallucinations but can increase abstentions when retrieval coverage is weak.
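The grounding/abstention tradeoff in the last bullet can be made concrete with a threshold gate. The threshold value and the hit tuples below are illustrative assumptions:

```python
def answer_or_abstain(hits, min_score=0.75):
    """Grounding gate: answer only when the top retrieved hit clears a
    similarity threshold; otherwise abstain. Each hit is (score, text).
    The 0.75 cutoff is an illustrative assumption, tuned per system."""
    if not hits or hits[0][0] < min_score:
        return None  # abstain rather than risk an ungrounded answer
    return hits[0][1]

print(answer_or_abstain([(0.9, "30-day refund policy")]))  # answers
print(answer_or_abstain([(0.4, "weak match")]))            # None -> abstains
```

Raising `min_score` trades hallucinations for abstentions: when retrieval coverage is weak, more queries fall below the gate and return nothing.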
First-time learner note: Master one stage at a time: ingestion, retrieval, then grounded generation. Validate each stage with small test questions before tuning everything together.
Production note: Treat quality as measurable system behavior. Track retrieval relevance, groundedness, and abstention quality with repeatable eval sets.
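A repeatable eval set can be as simple as (question, gold document) pairs scored for retrieval hit rate. A minimal sketch with a hypothetical eval set and retriever stub:

```python
def hit_rate(eval_set, retrieve):
    """Fraction of questions whose gold doc id appears in the retrieved
    top-k list. `retrieve` maps a question to a list of doc ids."""
    hits = sum(gold in retrieve(question) for question, gold in eval_set)
    return hits / len(eval_set)

# Hypothetical eval set and retriever stub, for illustration only.
evals = [
    ("what is the refund period?", "policy_doc"),
    ("how long does shipping take?", "shipping_doc"),
]
fake_retrieve = lambda q: ["policy_doc"] if "refund" in q else ["other_doc"]
print(hit_rate(evals, fake_retrieve))  # 0.5: one of two questions hits
```

Running the same fixed set after every chunking, embedding, or prompt change turns "quality" into a number you can regress-test, which is the point of the production note above.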