This second basic example focuses on the retrieval half of the pipeline. The vector store already exists, so the system now loads that persistent store and uses it to pull back the most relevant chunks for a user's question.
The most important implementation rule: the embedding model used for the user question must match the embedding model used during ingestion. If stored chunks were embedded with one model and the incoming query is embedded with another, the similarity scores become unreliable because the vectors are no longer comparable in the same space.
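This rule can be made concrete with a toy sketch. The two "embedding functions" below are illustrative stand-ins, not real models: one embeds text by letter frequency, the other by word lengths. Scoring a query and a document with the same toy embedder gives a meaningful similarity, while mixing the two produces a number from an incoherent comparison, which is exactly what happens when ingestion and query use different real models.

```python
import math

# TOY embedders (hypothetical stand-ins for two real embedding models).
# Their output spaces have nothing to do with each other.
def embed_a(text: str) -> list[float]:
    """Embed as 26-dim letter-frequency vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def embed_b(text: str) -> list[float]:
    """Embed as 26-dim vector of word lengths (padded with zeros)."""
    lengths = [float(len(w)) for w in text.split()]
    return (lengths + [0.0] * 26)[:26]

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

doc = "vector stores persist embedded chunks"
query = "how are embedded chunks persisted in a vector store"

# Same model on both sides: scores live in one comparable space.
same_space = cosine(embed_a(query), embed_a(doc))
# Different models: the score is a comparison between unrelated spaces.
mixed_space = cosine(embed_a(query), embed_b(doc))
print(f"same model:   {same_space:.3f}")
print(f"mixed models: {mixed_space:.3f}")
```

The mixed-model score is not merely lower; it is meaningless, because the dimensions of the two vectors encode unrelated features.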
The retriever is then configured with two important controls:
- top-k: how many highest-ranked chunks to return.
- similarity threshold: the minimum score a chunk must have to be considered relevant.
Why tuning matters: if the threshold is too low, the generator receives noisy context. If it is too high, the retriever may return no chunks at all, even when the answer is present. This example teaches that retrieval quality is a balancing problem, not a fixed default.
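The top-k and threshold behavior can be sketched with a minimal in-memory retriever. The store, vectors, and scores below are made up for illustration; the point is only the control flow: score everything, filter by threshold, keep the k best.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query_vec, store, top_k=3, threshold=0.0):
    """Score every stored chunk, drop those below the threshold,
    and return the top_k highest-scoring survivors."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    scored = [(s, t) for s, t in scored if s >= threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Pretend these vectors came from the SAME embedding model used at ingestion.
store = [
    ("chunk about vector stores", [0.9, 0.1, 0.0]),
    ("chunk about embeddings",    [0.7, 0.6, 0.1]),
    ("unrelated chunk",           [0.0, 0.1, 0.9]),
]
query = [1.0, 0.2, 0.0]

relevant = retrieve(query, store, top_k=2, threshold=0.5)   # two relevant chunks
nothing = retrieve(query, store, top_k=2, threshold=0.999)  # too strict: empty
print(relevant)
print(nothing)
```

The second call shows the failure mode described above: the answer-bearing chunk is in the store, but an over-strict threshold filters it out before the generator ever sees it.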
Interview-Ready Deepening
Source-backed reinforcement: these points consolidate the key facts beyond brief on-screen hints and emphasize production tradeoffs.
- Query-time retrieval: load the vector store, embed the question, and tune threshold and top-k to return the right chunks.
Tradeoffs You Should Be Able to Explain
- Higher recall often increases context noise; reranking and filtering are required to keep precision high.
- Smaller chunks improve semantic precision but can break cross-sentence context needed for accurate answers.
- Aggressive grounding reduces hallucinations but can increase abstentions when retrieval coverage is weak.
First-time learner note: Build deterministic baseline chains first (prompt -> model -> parser), then add retrieval, memory, or tools only when the baseline is stable.
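A deterministic baseline chain can be as small as three composed functions. The sketch below uses a stubbed model call (not a real LLM or any specific framework) so the prompt -> model -> parser skeleton can be exercised end to end before retrieval is layered on.

```python
# Minimal deterministic baseline chain: prompt -> model -> parser.
# All names are illustrative; the model is a stub, not a real LLM call.

def build_prompt(question: str) -> str:
    return f"Answer concisely.\nQuestion: {question}\nAnswer:"

def stub_model(prompt_text: str) -> str:
    # Stand-in for a real model call; deterministic by construction.
    return " ANSWER: 42 "

def parse_answer(raw: str) -> str:
    # Normalize the raw completion into a clean string.
    return raw.strip().removeprefix("ANSWER:").strip()

def chain(question: str) -> str:
    return parse_answer(stub_model(build_prompt(question)))

result = chain("What is six times seven?")
print(result)
```

Because every stage is a plain function with an explicit input and output, each can be swapped (stub model for a real one, string parser for a schema parser) without disturbing the others.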
Production note: Keep contracts explicit at each boundary: input variables, output schema, retries, and logs. This is what keeps orchestration reliable at scale.
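One way to make those contracts concrete is to type the boundary explicitly. The sketch below is an assumed design, not a real library API: frozen dataclasses for the input and output schemas, a retry budget, and a log line per attempt, demonstrated against a deliberately flaky stub backend.

```python
# Sketch: explicit contract at the retrieval boundary -- typed input,
# typed output, retries, and logs. All names are illustrative.
import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("retrieval-boundary")

@dataclass(frozen=True)
class RetrievalRequest:       # input contract
    question: str
    top_k: int = 4
    threshold: float = 0.3

@dataclass(frozen=True)
class RetrievalResult:        # output contract
    chunks: list
    attempts: int

def retrieve_with_retries(req: RetrievalRequest, backend,
                          max_retries: int = 2) -> RetrievalResult:
    for attempt in range(1, max_retries + 2):
        try:
            chunks = backend(req.question, req.top_k, req.threshold)
            log.info("retrieval ok: attempt=%d chunks=%d", attempt, len(chunks))
            return RetrievalResult(chunks=chunks, attempts=attempt)
        except ConnectionError:
            log.warning("retrieval failed: attempt=%d", attempt)
    raise RuntimeError("retrieval exhausted retries")

# Flaky stub backend: fails once with a transient error, then succeeds.
_calls = {"n": 0}
def flaky_backend(question, top_k, threshold):
    _calls["n"] += 1
    if _calls["n"] == 1:
        raise ConnectionError("transient")
    return ["chunk A", "chunk B"][:top_k]

result = retrieve_with_retries(RetrievalRequest("what is a vector store?"),
                               flaky_backend)
print(result.attempts, len(result.chunks))
```

With the schema frozen and the retry/log behavior in one place, a failure at this boundary shows up in the logs with an attempt count instead of surfacing as a mystery downstream.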
This second example is the query-time half of RAG. The database already exists, so the problem shifts from 'how do we store knowledge?' to 'how do we pull back the right chunks for this question?' The transcript emphasizes two implementation rules: reload the persisted vector store correctly, and use the exact same embedding model for the user question that you used for the stored chunks. Mixing embedding models silently breaks retrieval because the vectors no longer share the same space.
The important tunables here are top-k and threshold. Top-k controls how many of the best candidate chunks survive. The similarity threshold controls how strict the retriever is allowed to be. Set the threshold too low and the model gets noisy evidence. Set it too high and the retriever can return nothing at all, even when the answer is present. The point is not to memorize one magic threshold; it is to understand that retrieval quality is a tuning problem with observable tradeoffs.
Operational takeaway: query-time retrieval should always be inspected before blaming generation. If the retrieved chunks already contain the answer, grounding is possible. If they do not, changing the answer prompt alone will not rescue the system.
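That inspection step can start as something very crude. The helper below is a hypothetical diagnostic, not a standard tool: it checks whether the retrieved chunks even contain the keywords an answer would need, which is often enough to separate "retrieval problem" from "generation problem".

```python
# Crude diagnostic: could the retrieved context possibly ground the answer?
# The function name and keyword-containment heuristic are illustrative.

def retrieval_can_ground(retrieved_chunks: list, expected_keywords: list) -> bool:
    """True if every expected keyword appears somewhere in the
    concatenated retrieved context (case-insensitive)."""
    context = " ".join(retrieved_chunks).lower()
    return all(kw.lower() in context for kw in expected_keywords)

chunks = [
    "The persisted store is reloaded with the same embedding model.",
    "Top-k and the similarity threshold are the main tunables.",
]

covered = retrieval_can_ground(chunks, ["embedding model"])  # present in chunks
missing = retrieval_can_ground(chunks, ["reranking"])        # absent from chunks
print(covered, missing)
```

If this check fails, no amount of prompt rewriting on the generation side will help; the fix belongs in ingestion or retrieval tuning.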