Concept-Lab
RAG Systems

What is RAG, Tokens, Embeddings & Vector Databases

Context windows, chunking, embedding models, and the injection vs retrieval pipeline.

Core Theory

RAG combines language generation with retrieval over an external knowledge index. In practical terms, the LLM no longer answers from memory alone; it answers from fetched evidence.

Core limitation RAG addresses: context window size is finite while knowledge bases are effectively unbounded. Even when a model supports very large token windows, sending everything is still expensive, slow, and often lower quality because irrelevant text dilutes signal.

Tokens: model input/output is priced and bounded by tokens. This means architecture choices (chunk size, top-k, prompt template) directly affect both quality and cost.
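To make that cost link concrete, here is a minimal sketch of per-query prompt cost as a function of chunk size and top-k. The ~4-characters-per-token heuristic and the $3-per-million-input-tokens price are illustrative assumptions, not any provider's actual tokenizer or pricing:

```python
# Rough cost model for a RAG prompt: chunk size and top-k directly
# determine how many tokens each query sends to the model.
# Assumes the common ~4-characters-per-token heuristic and a
# hypothetical price of $3 per 1M input tokens.

def estimate_tokens(text: str) -> int:
    """Very rough token estimate (real tokenizers vary by model)."""
    return max(1, len(text) // 4)

def prompt_cost(chunk_chars: int, top_k: int, question_chars: int,
                usd_per_million_tokens: float = 3.0) -> float:
    tokens = (estimate_tokens("x" * question_chars)
              + top_k * estimate_tokens("x" * chunk_chars))
    return tokens * usd_per_million_tokens / 1_000_000

# Doubling top-k roughly doubles per-query input cost.
small = prompt_cost(chunk_chars=2000, top_k=5, question_chars=200)
large = prompt_cost(chunk_chars=2000, top_k=10, question_chars=200)
```

The same arithmetic also bounds latency: every retrieved chunk is more input the model must read before producing the first output token.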

Embeddings: text is mapped into high-dimensional vectors where semantic similarity becomes geometric proximity. A query like 'refund period' can retrieve chunks mentioning 'return window' without exact keyword overlap.
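A toy sketch of that geometry, using hand-made 3-dimensional vectors in place of real embeddings (which have hundreds or thousands of dimensions); the numbers are invented for the demo:

```python
# "Similarity becomes geometry": semantically close texts get vectors
# with a small angle between them, measured by cosine similarity.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

query = [0.9, 0.1, 0.0]          # "refund period"
return_window = [0.8, 0.2, 0.1]  # "return window": near-synonym, nearby vector
gpu_driver = [0.0, 0.1, 0.9]     # unrelated text, far away in vector space

sim_close = cosine(query, return_window)  # high: no shared keywords needed
sim_far = cosine(query, gpu_driver)       # low: different meaning
```

The point is that ranking happens on angles between vectors, so lexical overlap between query and document is unnecessary.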

Vector database responsibilities:

  • Indexing vectors for fast nearest-neighbor search (ANN/HNSW/IVF style internals depending on backend).
  • Metadata filtering (tenant, language, policy version, date range, access scope).
  • Persistence and lifecycle (upserts, deletes, re-indexing, snapshot/backup).
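The filter-then-rank behavior can be sketched with a tiny in-memory index; a brute-force scan stands in for a real ANN structure such as HNSW or IVF, and the tenants and vectors are made up:

```python
# Minimal sketch of a vector store's online path: apply metadata
# filters first, then rank the surviving candidates by similarity.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

index = [
    {"id": "c1", "vec": [0.9, 0.1], "meta": {"tenant": "acme", "lang": "en"}},
    {"id": "c2", "vec": [0.8, 0.3], "meta": {"tenant": "globex", "lang": "en"}},
    {"id": "c3", "vec": [0.1, 0.9], "meta": {"tenant": "acme", "lang": "en"}},
]

def search(query_vec, filters, top_k=2):
    # scope enforcement happens BEFORE ranking, so out-of-scope
    # chunks can never leak into the candidate set
    candidates = [r for r in index
                  if all(r["meta"].get(k) == v for k, v in filters.items())]
    candidates.sort(key=lambda r: cosine(query_vec, r["vec"]), reverse=True)
    return [r["id"] for r in candidates[:top_k]]

# globex's chunk never surfaces for an acme-scoped query,
# even though its vector is very similar to the query.
hits = search([1.0, 0.0], {"tenant": "acme"})
```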

The two pipelines and their boundaries:

  • Injection (offline): load documents, normalize, chunk, embed, index with metadata.
  • Retrieval (online): interpret query, embed query, retrieve/rank candidates, pass evidence to answer generation.
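The two pipelines can be sketched end to end. The `embed` function below is a hash-based stand-in for a real embedding model (toy only), and the chunk sizes are arbitrary:

```python
# Injection (offline) and retrieval (online) in one self-contained sketch.
import hashlib
import math

def embed(text: str, dims: int = 64) -> list[float]:
    # deterministic pseudo-embedding from character trigrams (toy only;
    # a real system would call an embedding model here)
    vec = [0.0] * dims
    text = text.lower()
    for i in range(max(len(text) - 2, 0)):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(doc: str, size: int = 40, overlap: int = 10) -> list[str]:
    step = size - overlap
    return [doc[i:i + size] for i in range(0, max(len(doc) - overlap, 1), step)]

# --- Injection (offline): load -> chunk -> embed -> index
index = []
for doc in ["Refunds are accepted within 30 days of purchase.",
            "GPU drivers must be updated quarterly."]:
    for c in chunk(doc):
        index.append({"text": c, "vec": embed(c)})

# --- Retrieval (online): embed query -> rank -> take top-k evidence
def retrieve(query: str, top_k: int = 2) -> list[str]:
    qv = embed(query)
    scored = sorted(index,
                    key=lambda r: sum(a * b for a, b in zip(qv, r["vec"])),
                    reverse=True)
    return [r["text"] for r in scored[:top_k]]

evidence = retrieve("refund period")
```

The boundary matters: everything above the retrieval section runs ahead of time, so the online path only pays for one query embedding plus a similarity search.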

Important production caveat: embeddings are model-specific. If you rotate embedding models, you usually need full re-embedding and re-indexing to keep similarity semantics consistent.
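One defensive pattern, sketched under the assumption that you record the embedding model name alongside the index, is to reject upserts from a different model so stale and fresh vectors never mix:

```python
# Guard against mixing embedding models in one index: similarity
# between vectors from different models is meaningless.
class VectorIndex:
    def __init__(self, embedding_model: str):
        self.embedding_model = embedding_model
        self.rows = []

    def upsert(self, chunk_id: str, vec: list[float], model: str):
        if model != self.embedding_model:
            # the correct response to a model rotation is a full
            # re-embed + re-index, not a partial write
            raise ValueError(
                f"index built with {self.embedding_model}, got {model}")
        self.rows.append((chunk_id, vec))

index = VectorIndex("embed-v1")
index.upsert("c1", [0.1, 0.9], model="embed-v1")  # accepted
try:
    index.upsert("c2", [0.3, 0.7], model="embed-v2")  # rejected
    mixed_write_allowed = True
except ValueError:
    mixed_write_allowed = False
```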

Interview-Ready Deepening

Source-backed reinforcement: the points below expand on the summary above with transcript detail and production tradeoffs.

  • Although diagrams usually show only three dimensions per vector for readability, popular embedding models such as OpenAI's text-embedding-3-large map text to vectors with up to 3,072 dimensions.
  • Whether you embed a single word like "cat" or an entire paragraph, the output is always one vector with the same fixed number of dimensions.
  • The retriever embeds the user's query, then scans the indexed vector embeddings to find the ones closest in semantic meaning to it.
  • Past that point you stop working with raw vectors: the retriever returns the top 5-10 matching chunks as text evidence for the generation step.

Tradeoffs You Should Be Able to Explain

  • Higher recall often increases context noise; reranking and filtering are required to keep precision high.
  • Smaller chunks improve semantic precision but can break cross-sentence context needed for accurate answers.
  • Aggressive grounding reduces hallucinations but can increase abstentions when retrieval coverage is weak.
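The chunk-size tradeoff in the second bullet can be demonstrated directly: a sentence that straddles a chunk boundary survives only when the overlap is large enough. Sizes here are character-based for simplicity; production chunkers typically count tokens:

```python
# Overlap exists to keep boundary-straddling facts intact in at
# least one chunk. Sizes and the example document are illustrative.
def chunk(text: str, size: int, overlap: int) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = ("Interns may view staging dashboards. "
       "Production dashboards require manager approval.")

# This fact crosses the chunk boundary at character 60.
fact = "Production dashboards require manager approval."

no_overlap = chunk(doc, size=60, overlap=0)     # fact split across chunks
with_overlap = chunk(doc, size=60, overlap=30)  # fact intact in one chunk
```

With no overlap, neither chunk contains the whole sentence, so no single retrieved chunk can answer a question about production access; with 30 characters of overlap, one chunk carries it intact.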

First-time learner note: Master one stage at a time: ingestion, retrieval, then grounded generation. Validate each stage with small test questions before tuning everything together.

Production note: Treat quality as measurable system behavior. Track retrieval relevance, groundedness, and abstention quality with repeatable eval sets.


💡 Concrete Example

Suppose your corpus has 250,000 policy chunks. A user asks, 'Can interns access production dashboards?' The query is embedded once, nearest-neighbor search returns top candidates, and metadata filtering removes chunks outside the user's org and policy version. You then send only 3-5 evidence chunks (not whole documents) into the prompt, which lowers token cost, reduces latency, and produces a precise, source-backed answer.
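The final assembly step might look like the following sketch; the template and evidence chunks are illustrative, not taken from any specific framework:

```python
# Only the filtered top-k evidence chunks go into the prompt,
# never whole documents. Numbered sources enable citations.
def build_prompt(question: str, evidence: list[dict]) -> str:
    sources = "\n".join(
        f"[{i + 1}] ({e['doc_id']}) {e['text']}"
        for i, e in enumerate(evidence)
    )
    return (
        "Answer using ONLY the sources below. Cite source numbers.\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

evidence = [
    {"doc_id": "access-policy-v3",
     "text": "Interns receive read-only staging access."},
    {"doc_id": "access-policy-v3",
     "text": "Production dashboards require full-time employee status."},
]
prompt = build_prompt("Can interns access production dashboards?", evidence)
```

The "ONLY the sources below" instruction is what makes the answer source-backed: the model is steered away from answering from parametric memory.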



🧪 Interactive Sessions

  1. Concept Drill: Manipulate key parameters and observe behavior shifts for What is RAG, Tokens, Embeddings & Vector Databases.
  2. Failure Mode Lab: Trigger an edge case and explain remediation decisions.
  3. Architecture Reorder Exercise: Reorder 5 flow steps into the correct production sequence.

💻 Code Walkthrough

Code references that map the concepts above onto a concrete implementation.

content/github_code/rag-for-beginners/2_retrieval_pipeline.py

Reference implementation path for What is RAG, Tokens, Embeddings & Vector Databases.

Open highlighted code →

  1. Define input/output contract before reading implementation details.
  2. Map each conceptual step to one concrete function/class decision.
  3. Call out one tradeoff and one failure mode in interview wording.

🎯 Interview Prep

Questions an interviewer is likely to ask about this topic. Think through your answer before reading the senior angle.

  • Q1[beginner] What is an embedding and how does it enable semantic search in a vector database?
    An embedding is a dense numeric representation of text meaning. Semantic search works by embedding the query and finding nearby vectors in index space, so lexical mismatch ('refund' vs 'return window') can still match.
  • Q2[beginner] Walk me through the injection pipeline step by step.
    Injection steps: load/parse source documents, clean + normalize text, chunk with overlap policy, generate embeddings, write vectors + metadata to DB, then validate index integrity with sample queries.
  • Q3[intermediate] Why does chunk size matter? What happens if chunks are too small vs too large?
    Tiny chunks improve precision but can lose context and increase retrieval fan-out. Oversized chunks improve context but add noise and token cost. Optimal size depends on document structure, query style, and top-k budget.
  • Q4[expert] How do metadata filters change retrieval quality in multi-tenant systems?
    Metadata filtering enforces scope (tenant, role, version, recency) before ranking. This prevents cross-tenant leakage and reduces irrelevant candidates, improving both safety and precision.
  • Q5[expert] How would you explain this in a production interview with tradeoffs?
    The real senior insight is that RAG is a precision-recall trade-off. Larger chunks = higher recall (more context) but lower precision (more noise). Smaller chunks = higher precision but risk losing context across chunk boundaries. Production systems tune chunk size empirically per document type, often with overlap (e.g. 200 token overlap) to prevent context loss at boundaries.
๐Ÿ† Senior answer angle โ€” click to reveal
Use the tier progression: beginner correctness -> intermediate tradeoffs -> expert production constraints and incident readiness.

📚 Revision Flash Cards

Test yourself before moving on. Flip each card to check your understanding โ€” great for quick revision before an interview.
