RAG retrieval that survives production traffic

Notes on improving retrieval quality, measuring relevance, and keeping latency predictable when search becomes part of the product path.

By Shashank Bhardwaj

Retrieval-augmented generation fails in two common ways: the model hallucinates confidently when context is missing, or latency balloons when every query fans out to multiple stores. Production RAG needs an evaluation harness for the same reason a shipping API needs integration tests: without metrics, tuning embeddings is guesswork.

Measure before you rewrite

Start with offline evaluation sets: each entry pairs a query with its expected document ids and graded relevance labels. Track precision@k and nDCG, and measure end-to-end latency percentiles alongside them, because users experience both quality and speed.
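
As a rough illustration of the metrics above, here is a minimal sketch in Python. The function names, the dict-of-grades layout for relevance labels, and the nearest-rank percentile method are all assumptions made for this example. The DCG gain uses the exponential form (2^grade - 1), which is one common variant; a linear gain is equally valid as long as it is applied consistently.

```python
import math
from typing import Dict, List


def precision_at_k(retrieved: List[str], relevance: Dict[str, int], k: int) -> float:
    """Fraction of the top-k retrieved ids whose relevance grade is above zero."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if relevance.get(doc_id, 0) > 0)
    return hits / k


def dcg(grades: List[int]) -> float:
    """Discounted cumulative gain with exponential gain and a log2 rank discount."""
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(grades))


def ndcg_at_k(retrieved: List[str], relevance: Dict[str, int], k: int) -> float:
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    gains = [relevance.get(doc_id, 0) for doc_id in retrieved[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def latency_percentile(samples_ms: List[float], pct: int) -> float:
    """Nearest-rank percentile over end-to-end latency samples (milliseconds)."""
    if not samples_ms:
        raise ValueError("no latency samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical labels for one query: doc id -> graded relevance.
    labels = {"doc-1": 2, "doc-7": 1}
    ranking = ["doc-7", "doc-3", "doc-1"]
    print("precision@3:", precision_at_k(ranking, labels, 3))
    print("nDCG@3:", round(ndcg_at_k(ranking, labels, 3), 3))
    print("p95 latency:", latency_percentile([120, 95, 410, 130, 101], 95))
```

Averaging these per-query scores across the whole evaluation set, and recording them next to the latency percentiles for the same run, gives one comparable row per experiment.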

When accuracy is low, inspect failure buckets: wrong domain retrieved, duplicate chunks, stale documents, or formatting that models parse poorly. Many “model” issues are actually chunking and metadata issues.

Operational guardrails

Cache stable retrievals where safe, deduplicate overlapping chunks, and cap retrieved tokens. Add circuit breakers so search outages degrade to a safe message instead of hanging the assistant path.
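
The guardrails above could look something like the following sketch. Everything named here is hypothetical: `dedupe_and_cap`, `CircuitBreaker`, the thresholds, and the whitespace-based token count are stand-ins rather than any particular library's API. Deduplication in this example only catches exact duplicates after whitespace and case normalization; detecting partially overlapping chunks would need a similarity check that is out of scope here.

```python
import time
from typing import Callable, List, Optional

# Assumed fallback text; the real wording is a product decision.
FALLBACK_MESSAGE = "Search is temporarily unavailable. Please try again shortly."


def dedupe_and_cap(chunks: List[str], max_tokens: int) -> List[str]:
    """Drop normalized exact duplicates, then stop once the token budget is spent."""
    seen = set()
    kept: List[str] = []
    used = 0
    for chunk in chunks:
        key = " ".join(chunk.lower().split())
        if key in seen:
            continue
        cost = len(chunk.split())  # crude whitespace count, not a real tokenizer
        if used + cost > max_tokens:
            break
        seen.add(key)
        kept.append(chunk)
        used += cost
    return kept


class CircuitBreaker:
    """Open after repeated failures; allow one trial call after a cool-down."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, search: Callable[[str], List[str]], query: str) -> Optional[List[str]]:
        """Return search results, or None if the breaker is open or the call fails."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return None  # still open: skip the search call entirely
            # Half-open: permit one trial; a single failure reopens the breaker.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            results = search(query)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return None
        self.failures = 0
        return results


def retrieve_context(breaker: CircuitBreaker, search: Callable[[str], List[str]],
                     query: str, max_tokens: int = 1500) -> List[str]:
    """Return capped, deduplicated chunks, or the fallback message on failure."""
    results = breaker.call(search, query)
    if results is None:
        return [FALLBACK_MESSAGE]
    return dedupe_and_cap(results, max_tokens)
```

Note that the breaker only stops repeated calls to a failing backend. It does not interrupt a call that is already hanging, so the search client itself still needs its own request timeout.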

These practices mirror what you would do for any high-traffic dependency. The difference is that the product now depends on search for every answer. Treat retrieval as a tier-one service with dashboards, alerts, and owners.