LangChain

RAGs - Basic Example (2)

Query-time retrieval: load the vector store, embed the question, and tune threshold and top-k to return the right chunks.

Core Theory

This second basic example focuses on the retrieval half of the pipeline. The vector store already exists, so the system now loads that persistent store and uses it to pull back the most relevant chunks for a user's question.

The most important implementation rule: the embedding model used for the user question must match the embedding model used during ingestion. If stored chunks were embedded with one model and the incoming query is embedded with another, the similarity scores become unreliable because the vectors are no longer comparable in the same space.
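The mismatch failure can be made concrete with a toy sketch. The two "embedding models" below are hypothetical stand-ins, not real models: identical text embedded by different models lands at different points, so cross-model similarity scores are meaningless.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Two stand-in "embedding models" that map text into different 5-dim spaces.
def embed_model_a(text):   # pretend ingestion-time model: vowel counts
    return [text.count(c) for c in "aeiou"]

def embed_model_b(text):   # a different model: consonant counts
    return [text.count(c) for c in "rtlvs"]

doc = "retrieval"
same_model = cosine(embed_model_a(doc), embed_model_a(doc))   # identical vectors: similarity is 1.0
mixed_model = cosine(embed_model_a(doc), embed_model_b(doc))  # same text, different spaces: well below 1.0
print(same_model, mixed_model)
```

Even for the exact same text, the mixed-model score drops well below 1.0, which is why a real query embedded with the wrong model silently degrades ranking rather than failing loudly.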

The retriever is then configured with two important controls:

  • top-k: how many highest-ranked chunks to return.
  • similarity threshold: the minimum score a chunk must have to be considered relevant.

Why tuning matters: if the threshold is too low, the generator receives noisy context. If it is too high, the retriever may return no chunks at all, even when the answer is present. This example teaches that retrieval quality is a balancing problem, not a fixed default.
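A library-agnostic sketch (toy hand-made vectors, not a real embedding model) makes the balancing problem visible: the same store returns useful chunks at a moderate threshold and nothing at all at an overly strict one.

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def retrieve(query_vec, store, k, threshold):
    """Return at most k chunks whose similarity clears the threshold."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    scored.sort(reverse=True)
    return [(score, text) for score, text in scored[:k] if score >= threshold]

# Toy vector store: (chunk text, embedding) pairs.
store = [
    ("chunk about vector stores", [0.9, 0.1, 0.0]),
    ("chunk about embeddings",    [0.7, 0.6, 0.1]),
    ("unrelated chunk",           [0.0, 0.2, 0.9]),
]
query = [1.0, 0.2, 0.0]

print(retrieve(query, store, k=2, threshold=0.5))    # two relevant chunks clear the bar
print(retrieve(query, store, k=2, threshold=0.999))  # too strict: empty result
```

The second call is the failure mode described above: the answer-bearing chunk exists in the store, but an over-tight threshold filters it out before the generator ever sees it.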


Tradeoffs You Should Be Able to Explain

  • Higher recall often increases context noise; reranking and filtering are required to keep precision high.
  • Smaller chunks improve semantic precision but can break cross-sentence context needed for accurate answers.
  • Aggressive grounding reduces hallucinations but can increase abstentions when retrieval coverage is weak.

First-time learner note: Build deterministic baseline chains first (prompt -> model -> parser), then add retrieval, memory, or tools only when the baseline is stable.

Production note: Keep contracts explicit at each boundary: input variables, output schema, retries, and logs. This is what keeps orchestration reliable at scale.
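One way to make that boundary contract concrete is a validation gate between retriever and generator. This is a minimal sketch; the `RetrievedChunk` type and `validate_context` helper are illustrative names, not LangChain API.

```python
from dataclasses import dataclass

@dataclass
class RetrievedChunk:
    text: str
    score: float

def validate_context(chunks, min_chunks=1):
    """Enforce the retriever -> generator contract before prompting.

    Failing loudly here turns silent bad answers into loggable,
    retryable errors at the boundary.
    """
    if len(chunks) < min_chunks:
        raise ValueError("retriever returned too few chunks; relax the threshold")
    for chunk in chunks:
        if not chunk.text.strip():
            raise ValueError("empty chunk text at the retrieval boundary")
    return chunks

ok = validate_context([RetrievedChunk("Chroma persists collections to disk.", 0.91)])
```

An empty retrieval result now raises at the boundary with an actionable message, instead of letting the generator hallucinate over empty context.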

This second example is the query-time half of RAG. The database already exists, so the problem shifts from 'how do we store knowledge?' to 'how do we pull back the right chunks for this question?' The transcript emphasizes two implementation rules: reload the persisted vector store correctly, and use the exact same embedding model for the user question that you used for the stored chunks. Mixing embedding models silently breaks retrieval because the vectors no longer share the same space.

The important tunables here are top-k and threshold. Top-k controls how many of the best candidate chunks survive. The similarity threshold controls how strict the retriever is allowed to be. Set the threshold too low and the model gets noisy evidence. Set it too high and the retriever can return nothing at all, even when the answer is present. The point is not to memorize one magic threshold; it is to understand that retrieval quality is a tuning problem with observable tradeoffs.

Operational takeaway: query-time retrieval should always be inspected before blaming generation. If the retrieved chunks already contain the answer, grounding is possible. If they do not, changing the answer prompt alone will not rescue the system.
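That inspection step can be scripted. The `grounding_possible` helper below is a hypothetical debugging aid, not a library function: it checks whether the retrieved evidence could even support the expected answer before anyone touches the prompt.

```python
def grounding_possible(chunks, key_phrases):
    """Check the retrieved evidence before blaming the generator.

    If none of the answer's key phrases appear in any chunk, no amount
    of prompt tuning will rescue the response: retrieval is the failing
    stage and its settings are what need to change.
    """
    joined = " ".join(chunks).lower()
    return any(phrase.lower() in joined for phrase in key_phrases)

chunks = [
    "Chroma persists collections to disk for later reloading.",
    "The retriever ranks chunks by embedding similarity.",
]
print(grounding_possible(chunks, ["persists", "reloading"]))  # True: evidence present
print(grounding_possible(chunks, ["score_threshold"]))        # False: tune retrieval, not the prompt
```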


💡 Concrete Example

Query-time retrieval walkthrough: 1) Reload the persisted Chroma store. 2) Embed the user's question with the same embedding model used for the documents. 3) Ask the retriever for the top-k chunks above the score threshold. 4) Inspect the returned chunks before blaming answer quality on the generator. 5) Adjust threshold or top-k only after seeing what evidence the retriever is actually returning. Good generation depends on getting this step right first.
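Step 5 can be rehearsed with a simple sweep over a list of hypothetical similarity scores (the numbers below are made up for illustration), showing how the survivor count collapses as the threshold tightens:

```python
def survivors(scores, k, threshold):
    """Count candidate chunks that clear both the top-k cut and the threshold."""
    return sum(1 for s in sorted(scores, reverse=True)[:k] if s >= threshold)

scores = [0.91, 0.84, 0.66, 0.41, 0.12]  # hypothetical similarity scores for one query
for threshold in (0.3, 0.6, 0.9, 0.95):
    kept = survivors(scores, k=3, threshold=threshold)
    print(f"threshold={threshold}: {kept} chunks survive")
```

Running a sweep like this against real retrieved scores is how the "observable tradeoff" becomes a concrete tuning decision rather than guesswork.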



🧪 Interactive Sessions

  1. Concept Drill: Manipulate key parameters and observe behavior shifts for RAGs - Basic Example (2).
  2. Failure Mode Lab: Trigger an edge case and explain remediation decisions.
  3. Architecture Reorder Exercise: Reorder 5 flow steps into the correct production sequence.

💻 Code Walkthrough

Second basic RAG example that extends part 1 with additional flow detail.

content/github_code/langchain-course/4_RAGs/1b_basic_part_2.py

Continuation of baseline RAG pipeline.

  1. Compare part-1 and part-2 prompt/retrieval differences.

🎯 Interview Prep

Questions an interviewer is likely to ask about this topic. Think through your answer before reading the senior angle.

  • Q1[beginner] How do you choose which retrieval parameter to tune first?
    Start from observed failures, not intuition. Inspect what the retriever actually returns on a fixed set of questions: if relevant chunks are missing, raise top-k or lower the similarity threshold; if the context is noisy, tighten the threshold or reduce k. Change one parameter at a time and re-run the same questions so each quality delta is attributable to a single change.
  • Q2[beginner] Why is A/B comparison on a fixed eval set essential for RAG iteration?
    Because retrieval behavior is constrained by the data and the embedding space, not just the settings, the same change can help one query type and hurt another. A fixed eval set makes quality, latency, and failure-recovery deltas attributable to the change under test; without it, regressions hide behind query variance and teams ship settings that only worked on the questions they happened to try.
  • Q3[intermediate] What does deduplication improve in retrieval-to-generation handoff?
    Near-duplicate chunks crowd the context window with repeated evidence, wasting tokens and biasing the generator toward whatever happens to be repeated. Deduplicating before the handoff keeps the top-k slots filled with diverse evidence, which improves answer coverage without having to raise k.
  • Q4[expert] How do you avoid overfitting retriever settings to a tiny question sample?
    Grow and stratify the eval set across query types before trusting any tuning result, and hold out a portion of the questions. A threshold that looks optimal on five questions is usually fit to their exact phrasing; verify that gains persist on held-out queries, and prefer settings that degrade gracefully over ones that maximize a single run.
  • Q5[expert] How would you explain this in a production interview with tradeoffs?
    Iteration discipline beats intuition. Senior teams tie each tuning change to a hypothesis, metric delta, and rollback path.
🏆 Senior answer angle
Use the tier progression: beginner correctness -> intermediate tradeoffs -> expert production constraints and incident readiness.
