Concept-Lab
RAG Systems

What is RAG, Tokens, Embeddings & Vector Databases

Context windows, chunking, embedding models, and the injection vs retrieval pipeline.

Core Theory

RAG combines language generation with retrieval over an external knowledge index. In practical terms, the LLM no longer answers from memory alone; it answers from fetched evidence.

Core limitation RAG addresses: context window size is finite while knowledge bases are effectively unbounded. Even when a model supports very large token windows, sending everything is still expensive, slow, and often lower quality because irrelevant text dilutes signal.

Tokens: model input/output is priced and bounded by tokens. This means architecture choices (chunk size, top-k, prompt template) directly affect both quality and cost.
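To make that cost link concrete, here is a minimal sketch of per-query prompt cost as a function of chunk size and top-k. The ~4-characters-per-token heuristic and the $3-per-million-input-tokens price are illustrative assumptions, not any provider's actual tokenizer or pricing:

```python
# Rough cost model for a RAG prompt: chunk size and top-k directly
# determine how many tokens each query sends to the model.
# Assumes the common ~4-characters-per-token heuristic and a
# hypothetical price of $3 per 1M input tokens.

def estimate_tokens(text: str) -> int:
    """Very rough token estimate (real tokenizers vary by model)."""
    return max(1, len(text) // 4)

def prompt_cost(chunk_chars: int, top_k: int, question_chars: int,
                usd_per_million_tokens: float = 3.0) -> float:
    tokens = (estimate_tokens("x" * question_chars)
              + top_k * estimate_tokens("x" * chunk_chars))
    return tokens * usd_per_million_tokens / 1_000_000

# Doubling top-k roughly doubles per-query input cost.
small = prompt_cost(chunk_chars=2000, top_k=5, question_chars=200)
large = prompt_cost(chunk_chars=2000, top_k=10, question_chars=200)
```

The same arithmetic also bounds latency: every retrieved chunk is more input the model must read before producing the first output token.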

Embeddings: text is mapped into high-dimensional vectors where semantic similarity becomes geometric proximity. A query like 'refund period' can retrieve chunks mentioning 'return window' without exact keyword overlap.
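A toy sketch of that geometry, using hand-made 3-dimensional vectors in place of real embeddings (which have hundreds or thousands of dimensions); the numbers are invented for the demo:

```python
# "Similarity becomes geometry": semantically close texts get vectors
# with a small angle between them, measured by cosine similarity.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

query = [0.9, 0.1, 0.0]          # "refund period"
return_window = [0.8, 0.2, 0.1]  # "return window": near-synonym, nearby vector
gpu_driver = [0.0, 0.1, 0.9]     # unrelated text, far away in vector space

sim_close = cosine(query, return_window)  # high: no shared keywords needed
sim_far = cosine(query, gpu_driver)       # low: different meaning
```

The point is that ranking happens on angles between vectors, so lexical overlap between query and document is unnecessary.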

Vector database responsibilities:

  • Indexing vectors for fast nearest-neighbor search (ANN/HNSW/IVF style internals depending on backend).
  • Metadata filtering (tenant, language, policy version, date range, access scope).
  • Persistence and lifecycle (upserts, deletes, re-indexing, snapshot/backup).
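The filter-then-rank behavior can be sketched with a tiny in-memory index; a brute-force scan stands in for a real ANN structure such as HNSW or IVF, and the tenants and vectors are made up:

```python
# Minimal sketch of a vector store's online path: apply metadata
# filters first, then rank the surviving candidates by similarity.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

index = [
    {"id": "c1", "vec": [0.9, 0.1], "meta": {"tenant": "acme", "lang": "en"}},
    {"id": "c2", "vec": [0.8, 0.3], "meta": {"tenant": "globex", "lang": "en"}},
    {"id": "c3", "vec": [0.1, 0.9], "meta": {"tenant": "acme", "lang": "en"}},
]

def search(query_vec, filters, top_k=2):
    # scope enforcement happens BEFORE ranking, so out-of-scope
    # chunks can never leak into the candidate set
    candidates = [r for r in index
                  if all(r["meta"].get(k) == v for k, v in filters.items())]
    candidates.sort(key=lambda r: cosine(query_vec, r["vec"]), reverse=True)
    return [r["id"] for r in candidates[:top_k]]

# globex's chunk never surfaces for an acme-scoped query,
# even though its vector is very similar to the query.
hits = search([1.0, 0.0], {"tenant": "acme"})
```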

The two pipelines and their boundaries:

  • Injection (offline): load documents, normalize, chunk, embed, index with metadata.
  • Retrieval (online): interpret query, embed query, retrieve/rank candidates, pass evidence to answer generation.
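The two pipelines can be sketched end to end. The `embed` function below is a hash-based stand-in for a real embedding model (toy only), and the chunk sizes are arbitrary:

```python
# Injection (offline) and retrieval (online) in one self-contained sketch.
import hashlib
import math

def embed(text: str, dims: int = 64) -> list[float]:
    # deterministic pseudo-embedding from character trigrams (toy only;
    # a real system would call an embedding model here)
    vec = [0.0] * dims
    text = text.lower()
    for i in range(max(len(text) - 2, 0)):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(doc: str, size: int = 40, overlap: int = 10) -> list[str]:
    step = size - overlap
    return [doc[i:i + size] for i in range(0, max(len(doc) - overlap, 1), step)]

# --- Injection (offline): load -> chunk -> embed -> index
index = []
for doc in ["Refunds are accepted within 30 days of purchase.",
            "GPU drivers must be updated quarterly."]:
    for c in chunk(doc):
        index.append({"text": c, "vec": embed(c)})

# --- Retrieval (online): embed query -> rank -> take top-k evidence
def retrieve(query: str, top_k: int = 2) -> list[str]:
    qv = embed(query)
    scored = sorted(index,
                    key=lambda r: sum(a * b for a, b in zip(qv, r["vec"])),
                    reverse=True)
    return [r["text"] for r in scored[:top_k]]

evidence = retrieve("refund period")
```

The boundary matters: everything above the retrieval section runs ahead of time, so the online path only pays for one query embedding plus a similarity search.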

Important production caveat: embeddings are model-specific. If you rotate embedding models, you usually need full re-embedding and re-indexing to keep similarity semantics consistent.
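One defensive pattern, sketched under the assumption that you record the embedding model name alongside the index, is to reject upserts from a different model so stale and fresh vectors never mix:

```python
# Guard against mixing embedding models in one index: similarity
# between vectors from different models is meaningless.
class VectorIndex:
    def __init__(self, embedding_model: str):
        self.embedding_model = embedding_model
        self.rows = []

    def upsert(self, chunk_id: str, vec: list[float], model: str):
        if model != self.embedding_model:
            # the correct response to a model rotation is a full
            # re-embed + re-index, not a partial write
            raise ValueError(
                f"index built with {self.embedding_model}, got {model}")
        self.rows.append((chunk_id, vec))

index = VectorIndex("embed-v1")
index.upsert("c1", [0.1, 0.9], model="embed-v1")  # accepted
try:
    index.upsert("c2", [0.3, 0.7], model="embed-v2")  # rejected
    mixed_write_allowed = True
except ValueError:
    mixed_write_allowed = False
```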

Interview-Ready Deepening

Source-backed reinforcement: the points below expand on the summary above with transcript detail and production tradeoffs.

  • Although diagrams usually show only three dimensions per vector for readability, popular embedding models such as OpenAI's text-embedding-3-large map text to vectors with up to 3,072 dimensions.
  • Whether you embed a single word like "cat" or an entire paragraph, the output is always one vector with the same fixed number of dimensions.
  • The retriever embeds the user's query, then scans the indexed vector embeddings to find the ones closest in semantic meaning to it.
  • Past that point you stop working with raw vectors: the retriever returns the top 5-10 matching chunks as text evidence for the generation step.

Tradeoffs You Should Be Able to Explain

  • Higher recall often increases context noise; reranking and filtering are required to keep precision high.
  • Smaller chunks improve semantic precision but can break cross-sentence context needed for accurate answers.
  • Aggressive grounding reduces hallucinations but can increase abstentions when retrieval coverage is weak.
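The chunk-size tradeoff in the second bullet can be demonstrated directly: a sentence that straddles a chunk boundary survives only when the overlap is large enough. Sizes here are character-based for simplicity; production chunkers typically count tokens:

```python
# Overlap exists to keep boundary-straddling facts intact in at
# least one chunk. Sizes and the example document are illustrative.
def chunk(text: str, size: int, overlap: int) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = ("Interns may view staging dashboards. "
       "Production dashboards require manager approval.")

# This fact crosses the chunk boundary at character 60.
fact = "Production dashboards require manager approval."

no_overlap = chunk(doc, size=60, overlap=0)     # fact split across chunks
with_overlap = chunk(doc, size=60, overlap=30)  # fact intact in one chunk
```

With no overlap, neither chunk contains the whole sentence, so no single retrieved chunk can answer a question about production access; with 30 characters of overlap, one chunk carries it intact.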

First-time learner note: Master one stage at a time: ingestion, retrieval, then grounded generation. Validate each stage with small test questions before tuning everything together.

Production note: Treat quality as measurable system behavior. Track retrieval relevance, groundedness, and abstention quality with repeatable eval sets.


💡 Concrete Example

Suppose your corpus has 250,000 policy chunks. A user asks, 'Can interns access production dashboards?' The query is embedded once, nearest-neighbor search returns top candidates, and metadata filtering removes chunks outside the user's org and policy version. You then send only 3-5 evidence chunks (not whole documents) into the prompt, which lowers token cost, reduces latency, and produces a precise, source-backed answer.
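The final assembly step might look like the following sketch; the template and evidence chunks are illustrative, not taken from any specific framework:

```python
# Only the filtered top-k evidence chunks go into the prompt,
# never whole documents. Numbered sources enable citations.
def build_prompt(question: str, evidence: list[dict]) -> str:
    sources = "\n".join(
        f"[{i + 1}] ({e['doc_id']}) {e['text']}"
        for i, e in enumerate(evidence)
    )
    return (
        "Answer using ONLY the sources below. Cite source numbers.\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

evidence = [
    {"doc_id": "access-policy-v3",
     "text": "Interns receive read-only staging access."},
    {"doc_id": "access-policy-v3",
     "text": "Production dashboards require full-time employee status."},
]
prompt = build_prompt("Can interns access production dashboards?", evidence)
```

The "ONLY the sources below" instruction is what makes the answer source-backed: the model is steered away from answering from parametric memory.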



🧪 Interactive Sessions

  1. Concept Drill: Manipulate key parameters and observe behavior shifts for What is RAG, Tokens, Embeddings & Vector Databases.
  2. Failure Mode Lab: Trigger an edge case and explain remediation decisions.
  3. Architecture Reorder Exercise: Reorder 5 flow steps into the correct production sequence.

💻 Code Walkthrough

Code references that map the concepts above onto a concrete implementation.

content/github_code/rag-for-beginners/2_retrieval_pipeline.py

Reference implementation path for What is RAG, Tokens, Embeddings & Vector Databases.

Open highlighted code →

  1. Define input/output contract before reading implementation details.
  2. Map each conceptual step to one concrete function/class decision.
  3. Call out one tradeoff and one failure mode in interview wording.

🎯 Interview Prep

Questions an interviewer is likely to ask about this topic. Think through your answer before reading the senior angle.

  • Q1[beginner] What is an embedding and how does it enable semantic search in a vector database?
    An embedding is a dense numeric representation of text meaning. Semantic search works by embedding the query and finding nearby vectors in index space, so lexical mismatch ('refund' vs 'return window') can still match.
  • Q2[beginner] Walk me through the injection pipeline step by step.
    Injection steps: load/parse source documents, clean + normalize text, chunk with overlap policy, generate embeddings, write vectors + metadata to DB, then validate index integrity with sample queries.
  • Q3[intermediate] Why does chunk size matter? What happens if chunks are too small vs too large?
    Tiny chunks improve precision but can lose context and increase retrieval fan-out. Oversized chunks improve context but add noise and token cost. Optimal size depends on document structure, query style, and top-k budget.
  • Q4[expert] How do metadata filters change retrieval quality in multi-tenant systems?
    Metadata filtering enforces scope (tenant, role, version, recency) before ranking. This prevents cross-tenant leakage and reduces irrelevant candidates, improving both safety and precision.
  • Q5[expert] How would you explain this in a production interview with tradeoffs?
    The real senior insight is that RAG is a precision-recall trade-off. Larger chunks = higher recall (more context) but lower precision (more noise). Smaller chunks = higher precision but risk losing context across chunk boundaries. Production systems tune chunk size empirically per document type, often with overlap (e.g. 200 token overlap) to prevent context loss at boundaries.
๐Ÿ† Senior answer angle โ€” click to reveal
Use the tier progression: beginner correctness -> intermediate tradeoffs -> expert production constraints and incident readiness.

📚 Revision Flash Cards

Test yourself before moving on. Flip each card to check your understanding โ€” great for quick revision before an interview.
