The retrieval pipeline is the online critical path. Every user request depends on it, so both quality and latency matter. A practical flow is: query preprocess → query embedding → candidate retrieval → optional rerank/filter → context assembly for generation.
Retriever configuration knobs and their impact:
- k: too low hurts recall, too high adds noise and token cost.
- score_threshold: prevents weak matches from reaching generation; enables clean abstention.
- search_type: similarity/MMR/threshold strategies depending on corpus redundancy and use case.
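How k and score_threshold interact can be shown in a few lines. This is an illustrative helper (the function name is ours, not from any library): threshold first, then cap at k, so an empty result cleanly signals abstention.

```python
from typing import List, Tuple

def select_candidates(
    scored: List[Tuple[str, float]],   # (chunk_id, similarity score)
    k: int,
    score_threshold: float,
) -> List[Tuple[str, float]]:
    """Drop weak matches below score_threshold, then keep at most k.

    An empty return value means no chunk cleared the bar: the caller
    should abstain rather than generate from noise.
    """
    kept = [(cid, s) for cid, s in scored if s >= score_threshold]
    return sorted(kept, key=lambda x: x[1], reverse=True)[:k]
```

Raising k without a threshold only adds noise and token cost; the threshold is what keeps low-quality matches out of the prompt.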
Failure modes you must design for:
- No relevant chunks: return abstention/fallback UX, not fabricated answer.
- Redundant chunks: multiple near-duplicates consume context budget; use MMR or deduplication.
- Tenant leakage: missing metadata filters can retrieve another customer's data.
- Latency spikes: embedding call or vector search tail latency can break user experience.
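Two of these failure modes, redundant chunks and the no-relevant-chunks case, can be handled in one assembly step. The sketch below uses token-level Jaccard overlap as a cheap stand-in for real near-duplicate detection (an embedding-based or MMR approach would be used in practice), and returns None to signal abstention.

```python
from typing import List, Optional

def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two chunks (cheap near-duplicate proxy)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def assemble_or_abstain(chunks: List[str], dup_threshold: float = 0.8) -> Optional[str]:
    """Greedily drop near-duplicates; return None (abstain) if nothing survives."""
    kept: List[str] = []
    for chunk in chunks:
        if all(jaccard(chunk, prev) < dup_threshold for prev in kept):
            kept.append(chunk)
    return "\n\n".join(kept) if kept else None
```

Returning an explicit None forces the calling layer to render the abstention/fallback UX instead of silently prompting the model with an empty or duplicated context.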
Production retrieval architecture guidance:
- Apply metadata filters before scoring (scope, role, locale, version).
- Cache frequent query embeddings and hot retrieval results where possible.
- Log per-query retrieval traces: candidate IDs, scores, filter decisions, and final selected chunks.
- Define latency SLOs by stage (embed/search/rerank/generate) so bottlenecks are measurable.
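The last two points, per-query traces and per-stage latency, fit naturally in one log record. A minimal sketch (the class and field names are ours, assuming a structured-logging backend consumes the record downstream):

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class RetrievalTrace:
    """One log record per query: candidates, scores, and per-stage latency."""
    query: str
    candidate_ids: List[str] = field(default_factory=list)
    scores: Dict[str, float] = field(default_factory=dict)
    stage_ms: Dict[str, float] = field(default_factory=dict)

    @contextmanager
    def stage(self, name: str):
        """Time one pipeline stage (embed/search/rerank/generate) in milliseconds."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.stage_ms[name] = (time.perf_counter() - start) * 1000.0
```

With per-stage timings in every trace, latency SLOs can be checked stage by stage, so a tail-latency spike in the embedding call is distinguishable from a slow vector search.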
Cosine similarity remains the default because embedding semantics are directional; however, retrieval quality comes from the full system: good chunking, good metadata, good thresholds, and robust no-answer behavior.
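The directional point is easy to verify: cosine compares angle only, so scaling a vector does not change its score. A self-contained reference implementation:

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; magnitude does not affect it."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

A vector and any positive multiple of it score 1.0, while orthogonal vectors score 0.0, which is why cosine suits embeddings whose meaning lives in direction rather than length.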
Interview-Ready Deepening
Source-backed reinforcement: these points go beyond quick on-screen hints and emphasize production tradeoffs.
- Query → embed → similarity search → top-k chunks → LLM prompt → answer.
- Near-miss retrieval: a chunk that does not contain the answer can still be fetched because it is semantically similar to the user's question.
Tradeoffs You Should Be Able to Explain
- Higher recall often increases context noise; reranking and filtering are required to keep precision high.
- Smaller chunks improve semantic precision but can break cross-sentence context needed for accurate answers.
- Aggressive grounding reduces hallucinations but can increase abstentions when retrieval coverage is weak.
First-time learner note: Master one stage at a time: ingestion, retrieval, then grounded generation. Validate each stage with small test questions before tuning everything together.
Production note: Treat quality as measurable system behavior. Track retrieval relevance, groundedness, and abstention quality with repeatable eval sets.