Concept-Lab
โ† RAG Systems๐Ÿ” 3 / 17
RAG Systems

Coding the Injection Pipeline

Chunk → embed → store in a vector DB, implemented from scratch.

Core Theory

The injection pipeline is where you define knowledge quality for the entire system. It is an offline process, but it determines online behavior. If ingestion is noisy or inconsistent, retrieval quality collapses.

Canonical flow: load documents → normalize text/layout → split into chunks → embed chunks → persist vectors + metadata.
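To make the flow concrete, here is a minimal plain-Python sketch of those stages. Everything here is a stand-in: the hardcoded document, the fixed-size splitter, and the hash-based "embedder" substitute for real loaders, a real splitter, and a real embedding model.

```python
import hashlib

def load_documents():
    # Stand-in loader: a real pipeline would read PDFs/HTML from disk.
    return [{"doc_id": "policy-001",
             "text": "Refunds are issued within 14 days.  Contact support for claims."}]

def normalize(text):
    # Collapse whitespace; real pipelines also fix encoding and strip layout noise.
    return " ".join(text.split())

def split_into_chunks(text, chunk_size=40):
    # Naive fixed-size splitter; production code respects sentence boundaries.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def embed(chunk):
    # Stub embedding: a deterministic pseudo-vector derived from a hash.
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [b / 255 for b in digest[:8]]

vector_store = {}  # chunk_id -> {vector, metadata}; stands in for Chroma/Pinecone

for doc in load_documents():
    text = normalize(doc["text"])
    for idx, chunk in enumerate(split_into_chunks(text)):
        chunk_id = f'{doc["doc_id"]}:{idx}'
        vector_store[chunk_id] = {"vector": embed(chunk),
                                  "doc_id": doc["doc_id"],
                                  "text": chunk}

print(len(vector_store))
```

The shape of the loop is the point: every stage is a pure function over the previous stage's output, which is what makes the pipeline testable stage by stage.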

Implementation details that beginners usually miss:

  • Idempotency: re-running ingestion should not create duplicate vectors. Use deterministic chunk IDs (doc_id + chunk_index + content_hash).
  • Versioning: track source document version and embedding model version in metadata.
  • Incremental upserts: avoid full re-index for every update; ingest only changed documents when possible.
  • Delete propagation: if a source document is removed, corresponding vectors must be deleted to avoid stale citations.
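The idempotency and versioning bullets above can be sketched in a few lines. The in-memory dict stands in for a real vector store, and the model version string is made up for illustration.

```python
import hashlib

def chunk_id(doc_id, chunk_index, text):
    # Deterministic ID: same doc + position + content always yields the same key,
    # so re-running ingestion overwrites instead of duplicating.
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return f"{doc_id}:{chunk_index}:{content_hash}"

store = {}  # stand-in for the vector store's upsert interface

def upsert(doc_id, chunks, model_version="embed-v1"):
    for i, text in enumerate(chunks):
        store[chunk_id(doc_id, i, text)] = {
            "text": text,
            "embedding_model": model_version,  # versioning metadata per bullet 2
        }

# Running ingestion twice leaves exactly one record per chunk.
upsert("doc-1", ["alpha", "beta"])
upsert("doc-1", ["alpha", "beta"])
print(len(store))  # 2
```

If a chunk's content changes, its content hash (and therefore its ID) changes too, so stale versions can be detected and deleted rather than silently coexisting with the new ones.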

Common code building blocks:

  • DirectoryLoader/PyPDFDirectoryLoader for source reading.
  • RecursiveCharacterTextSplitter or semantic splitter for stable chunking policy.
  • OpenAIEmbeddings (or equivalent) to generate dense vectors.
  • Chroma/Pinecone/pgvector for storage + nearest-neighbor retrieval.

Data contract for each chunk in production: {chunk_id, doc_id, source_path, page, section, created_at, version, embedding_model, text}. Missing metadata makes debugging, access control, and citation auditing painful.
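One way to enforce that contract is a small dataclass; the field values below are hypothetical, and the `asdict` output is what you would attach as metadata on the stored vector.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class ChunkRecord:
    chunk_id: str
    doc_id: str
    source_path: str
    page: int
    section: str
    created_at: str
    version: str
    embedding_model: str
    text: str

# Hypothetical record; every value here is illustrative.
record = ChunkRecord(
    chunk_id="policy-001:0:ab12cd34",
    doc_id="policy-001",
    source_path="corpus/policies/refunds.pdf",
    page=3,
    section="Refund windows",
    created_at=datetime.now(timezone.utc).isoformat(),
    version="2024-06",
    embedding_model="text-embedding-3-small",
    text="Refunds are issued within 14 days of purchase.",
)

# Constructing the record fails loudly if a field is missing,
# which is exactly the guarantee a data contract should give you.
metadata = asdict(record)
print(sorted(metadata))
```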

Operational guidance: add ingestion validation tests (chunk count sanity checks, empty-chunk rate, duplicate-chunk rate, embedding failure rate) before promoting an index version to production.
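A sketch of such a validation gate, assuming simple thresholds (the 1% empty-chunk and 5% duplicate limits are illustrative defaults, not recommendations):

```python
def validate_index(chunks, failed_embeddings=0,
                   max_empty_rate=0.01, max_dup_rate=0.05):
    """Gate an index version on basic ingestion health metrics."""
    total = len(chunks)
    empty = sum(1 for c in chunks if not c.strip())
    dups = total - len(set(chunks))
    report = {
        "chunk_count": total,
        "empty_rate": empty / total if total else 1.0,
        "duplicate_rate": dups / total if total else 0.0,
        "embedding_failure_rate": failed_embeddings / total if total else 1.0,
    }
    # Promote only if every metric is within tolerance.
    report["promote"] = (
        total > 0
        and report["empty_rate"] <= max_empty_rate
        and report["duplicate_rate"] <= max_dup_rate
        and report["embedding_failure_rate"] == 0.0
    )
    return report

print(validate_index(["chunk a", "chunk b", "chunk c"])["promote"])  # True
```

Running this before swapping the index alias means a bad ingestion run fails the promotion step instead of silently degrading retrieval in production.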

Interview-Ready Deepening

Source-backed reinforcement: these points restate the embed-and-store step in the instructor's own words and emphasize production tradeoffs.

  • The third step is to send all of the chunks through the embedding model, convert them into vector embeddings, and store them in the vector DB.
  • Passing the full list of chunks to the vector-store constructor returns the populated vector store in a single call.
  • Once the corpus has been chunked, every chunk embedded, and the results stored in the vector database, the ingestion side is complete.

Tradeoffs You Should Be Able to Explain

  • Higher recall often increases context noise; reranking and filtering are required to keep precision high.
  • Smaller chunks improve semantic precision but can break cross-sentence context needed for accurate answers.
  • Aggressive grounding reduces hallucinations but can increase abstentions when retrieval coverage is weak.

First-time learner note: Master one stage at a time: ingestion, retrieval, then grounded generation. Validate each stage with small test questions before tuning everything together.

Production note: Treat quality as measurable system behavior. Track retrieval relevance, groundedness, and abstention quality with repeatable eval sets.


💡 Concrete Example

A policy corpus has 12,000 PDFs. Nightly ingestion computes file hashes and detects only 73 changed files, then re-chunks and re-embeds just those files. Old vectors from retired policies are deleted, and new chunks are tagged with policy_version + embedding_model metadata. Next morning, retrieval automatically surfaces the latest clauses without a costly full-corpus rebuild.
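The nightly diffing step in this scenario can be sketched with content hashes. File names and contents here are hypothetical; in practice the hashes would be computed from files on disk and the previous run's hashes loaded from a manifest.

```python
import hashlib

def file_hash(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

def changed_files(current: dict, previous: dict):
    """Return files to (re)ingest and retired files whose vectors must be deleted."""
    to_ingest = [path for path, h in current.items() if previous.get(path) != h]
    to_delete = [path for path in previous if path not in current]
    return to_ingest, to_delete

# Last night's manifest vs. tonight's scan (hypothetical files).
previous = {"a.pdf": file_hash(b"v1"), "b.pdf": file_hash(b"v1"),
            "c.pdf": file_hash(b"v1")}
current = {"a.pdf": file_hash(b"v1"), "b.pdf": file_hash(b"v2")}  # c.pdf retired

to_ingest, to_delete = changed_files(current, previous)
print(to_ingest, to_delete)  # ['b.pdf'] ['c.pdf']
```

Only `b.pdf` is re-chunked and re-embedded, and `c.pdf`'s vectors are deleted, which is the delete-propagation behavior the example describes.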



🧪 Interactive Sessions

  1. Concept Drill: Manipulate key parameters and observe behavior shifts for Coding the Injection Pipeline.
  2. Failure Mode Lab: Trigger an edge case and explain remediation decisions.
  3. Architecture Reorder Exercise: Reorder 5 flow steps into the correct production sequence.

💻 Code Walkthrough

Use the ingestion script from the course repo to revisit document loading, chunking, embedding, and vector-store persistence.

content/github_code/rag-for-beginners/1_ingestion_pipeline.py

End-to-end ingestion flow: loader → splitter → embeddings → Chroma.

Open highlighted code →
  1. Trace where documents are loaded and validated before indexing.
  2. Observe chunk_size/chunk_overlap and how they affect context granularity.
  3. Confirm persisted vector DB path and embedding model configuration.
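To see how chunk_size and chunk_overlap interact (step 2 above), here is a deliberately simplified fixed-window splitter. It is not the actual RecursiveCharacterTextSplitter, which additionally splits on separators like paragraphs and sentences, but the windowing arithmetic is the same idea.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    # Fixed-size windows that step forward by (chunk_size - chunk_overlap),
    # so the tail of each chunk is repeated at the head of the next.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, len(text), step)
            if text[i:i + chunk_size]]

text = "Refunds are issued within 14 days. Claims require a receipt."

no_overlap = split_with_overlap(text, chunk_size=30, chunk_overlap=0)
overlapped = split_with_overlap(text, chunk_size=30, chunk_overlap=10)

# With overlap, boundary text appears in two adjacent chunks, so a query
# hitting the chunk boundary can still retrieve the surrounding context.
print(len(no_overlap), len(overlapped))
```

Larger overlap improves boundary recall but inflates index size and embedding cost, since the same text is embedded more than once.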

🎯 Interview Prep

Questions an interviewer is likely to ask about this topic. Think through your answer before reading the senior angle.

  • Q1[beginner] Why do we need to use the same embedding model in the injection pipeline and the retrieval pipeline?
    Similarity only works in the same embedding space. If injection and retrieval use different models, nearest-neighbor scores become unreliable even if vector dimensions match.
  • Q2[beginner] What is chunk_overlap and why would you set it to a non-zero value?
    Overlap preserves boundary context when concepts span chunk edges. Without overlap, crucial phrases can be split across chunks and never retrieved together.
  • Q3[intermediate] Why would you choose ChromaDB for development but Pinecone for production?
    Chroma is simple and local for rapid dev/testing; Pinecone or managed vector infra is preferred at scale for reliability, replication, and operational SLAs.
  • Q4[expert] How would you design ingestion so re-runs are safe and cheap?
    Use content hashing + deterministic chunk IDs, incremental upserts, and document-level diffing. This prevents duplicate vectors and avoids full-corpus re-embedding on every run.
  • Q5[expert] How would you explain this in a production interview with tradeoffs?
    The dimensionality of embedding vectors must be consistent: you cannot inject with text-embedding-3-small (1,536 dims) and retrieve with text-embedding-ada-002 (also 1,536 dims but a different vector space). The embedding model baked into the vector store at injection time is locked in; changing it requires re-embedding your entire corpus. Plan this decision carefully in production systems.
๐Ÿ† Senior answer angle โ€” click to reveal
Use the tier progression: beginner correctness -> intermediate tradeoffs -> expert production constraints and incident readiness.
