The ingestion pipeline is where you define knowledge quality for the entire system. It is an offline process, but it determines online behavior. If ingestion is noisy or inconsistent, retrieval quality collapses.
Canonical flow: load documents → normalize text/layout → split into chunks → embed chunks → persist vectors + metadata.
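The canonical flow can be sketched end to end in plain Python. This is a minimal illustration, not a production implementation: the `normalize`, `split`, `embed`, and `ingest` names are hypothetical, and a hash-derived stub stands in for a real embedding model.

```python
import hashlib

def normalize(text: str) -> str:
    """Collapse whitespace so layout noise does not leak into chunks."""
    return " ".join(text.split())

def split(text: str, size: int = 200) -> list[str]:
    """Naive fixed-size splitter; real pipelines use smarter boundary logic."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunk: str) -> list[float]:
    """Stub embedder: a deterministic hash-derived vector stands in for a model."""
    digest = hashlib.sha256(chunk.encode()).digest()
    return [b / 255 for b in digest[:8]]

def ingest(docs: dict[str, str]) -> list[dict]:
    """load -> normalize -> split -> embed -> persist (here: an in-memory list)."""
    index = []
    for doc_id, raw in docs.items():
        for i, chunk in enumerate(split(normalize(raw))):
            index.append({
                "chunk_id": f"{doc_id}:{i}",
                "doc_id": doc_id,
                "text": chunk,
                "vector": embed(chunk),
            })
    return index
```

Swapping the stub `embed` for a real model call and the in-memory list for a vector store turns this sketch into the actual pipeline.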
Implementation details that beginners usually miss:
- Idempotency: re-running ingestion should not create duplicate vectors. Use deterministic chunk IDs (doc_id + chunk_index + content_hash).
- Versioning: track source document version and embedding model version in metadata.
- Incremental upserts: avoid full re-index for every update; ingest only changed documents when possible.
- Delete propagation: if a source document is removed, corresponding vectors must be deleted to avoid stale citations.
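The idempotency and delete-propagation points above can be sketched with a toy dict-backed store. The `VectorStore` class and its method names are illustrative assumptions, not a real client API; the key idea is the deterministic chunk ID.

```python
import hashlib

def chunk_id(doc_id: str, chunk_index: int, text: str) -> str:
    """Deterministic ID: same doc, position, and content always map to one key."""
    content_hash = hashlib.sha256(text.encode()).hexdigest()[:12]
    return f"{doc_id}:{chunk_index}:{content_hash}"

class VectorStore:
    """Toy dict-backed store illustrating idempotent upserts and delete propagation."""
    def __init__(self):
        self.rows = {}

    def upsert(self, doc_id, chunks, version, embedding_model):
        for i, text in enumerate(chunks):
            cid = chunk_id(doc_id, i, text)
            # Re-running ingestion overwrites the same key instead of duplicating.
            self.rows[cid] = {
                "doc_id": doc_id,
                "version": version,           # source document version
                "embedding_model": embedding_model,
                "text": text,
            }

    def delete_doc(self, doc_id):
        # Delete propagation: drop every vector tied to a removed source document.
        self.rows = {k: v for k, v in self.rows.items() if v["doc_id"] != doc_id}
```

Because the ID is a pure function of `(doc_id, chunk_index, content_hash)`, re-running ingestion is a no-op for unchanged content, which is exactly what incremental upserts rely on.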
Common code building blocks:
- DirectoryLoader/PyPDFDirectoryLoader for source reading.
- RecursiveCharacterTextSplitter or a semantic splitter for a stable chunking policy.
- OpenAIEmbeddings (or equivalent) to generate dense vectors.
- Chroma/Pinecone/pgvector for storage + nearest-neighbor retrieval.
Data contract for each chunk in production: {chunk_id, doc_id, source_path, page, section, created_at, version, embedding_model, text}. Missing metadata makes debugging, access control, and citation auditing painful.
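One lightweight way to enforce this data contract is a dataclass plus a completeness check at ingest time. The `ChunkRecord` and `validate` names are illustrative, not part of any library.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ChunkRecord:
    """One row of the per-chunk data contract; every field is required."""
    chunk_id: str
    doc_id: str
    source_path: str
    page: int
    section: str
    created_at: str
    version: str
    embedding_model: str
    text: str

def validate(record: ChunkRecord) -> list[str]:
    """Return the names of empty fields so metadata gaps surface at ingest time."""
    return [k for k, v in asdict(record).items() if v in ("", None)]
```

Rejecting chunks with non-empty `validate` output at ingest time is far cheaper than debugging missing citations in production.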
Operational guidance: add ingestion validation tests (chunk count sanity checks, empty-chunk rate, duplicate-chunk rate, embedding failure rate) before promoting an index version to production.
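The validation tests above reduce to a few rates computed over the chunk set. A minimal sketch, with hypothetical function names and illustrative thresholds that any real deployment would tune:

```python
def ingestion_report(chunks: list[str], failed_embeddings: int) -> dict:
    """Compute the sanity metrics gating promotion of an index version."""
    total = len(chunks)
    empty = sum(1 for c in chunks if not c.strip())
    duplicates = total - len(set(chunks))
    return {
        "chunk_count": total,
        "empty_chunk_rate": empty / total if total else 0.0,
        "duplicate_chunk_rate": duplicates / total if total else 0.0,
        "embedding_failure_rate": failed_embeddings / total if total else 0.0,
    }

def gate(report: dict, max_empty=0.01, max_dup=0.05, max_fail=0.001) -> bool:
    """Block promotion if any rate exceeds its (illustrative) threshold."""
    return (report["empty_chunk_rate"] <= max_empty
            and report["duplicate_chunk_rate"] <= max_dup
            and report["embedding_failure_rate"] <= max_fail)
```

Running `gate` in CI before swapping the production index alias catches bad ingestion runs before users see them.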
Interview-Ready Deepening
Source-backed reinforcement: these points expand on the brief on-screen hints from the source material and emphasize production tradeoffs.
- Chunk → embed → store in a vector DB; the flow is simple enough to implement from scratch.
- The third step is to send each chunk through the embedding model, convert it into a vector embedding, and store it in the vector DB.
- Passing all of the chunks to the vector-store constructor returns the populated vector store.
- At this point the documents have been chunked, every chunk has been embedded, and the embeddings have been stored in the vector database.
Tradeoffs You Should Be Able to Explain
- Higher recall often increases context noise; reranking and filtering are required to keep precision high.
- Smaller chunks improve semantic precision but can break cross-sentence context needed for accurate answers.
- Aggressive grounding reduces hallucinations but can increase abstentions when retrieval coverage is weak.
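The chunk-size tradeoff is easy to demonstrate concretely. In this sketch (hypothetical `split` helper, invented example text), a small chunk size separates two sentences that only answer a question together:

```python
def split(text: str, size: int) -> list[str]:
    """Fixed-size character splitter with no overlap, for illustration."""
    return [text[i:i + size] for i in range(0, len(text), size)]

text = ("The refund policy changed in 2023. "
        "Customers now have 60 days to return items.")

small = split(text, 40)   # the two related sentences land in different chunks
large = split(text, 200)  # one chunk keeps the cross-sentence context intact
```

A query about "the 2023 refund policy" may retrieve only the first small chunk, which never mentions the 60-day window; chunk overlap or larger chunks are the usual mitigations.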
First-time learner note: Master one stage at a time: ingestion, retrieval, then grounded generation. Validate each stage with small test questions before tuning everything together.
Production note: Treat quality as measurable system behavior. Track retrieval relevance, groundedness, and abstention quality with repeatable eval sets.