Profiling a RAG pipeline from 14s to 2.1s
A step-by-step walkthrough of using PRISM's latency profiler to identify and fix a re-ranker bottleneck responsible for about 11 seconds of P95 latency.
Step 1: The setup
Our user — a legal tech company — had a document QA pipeline with five stages: document retrieval, chunk extraction, re-ranking, context assembly, and LLM generation. End-to-end latency: 14 seconds.
Step 2: Run the simulation
We ran a 10,000-iteration Monte Carlo simulation. PRISM's latency profiler immediately highlighted the bottleneck: the re-ranking stage had a P95 latency of 11.2 seconds.
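To make the idea concrete, here is a minimal sketch of a per-stage Monte Carlo latency simulation. It is not PRISM's API or implementation. The stage names follow the pipeline above, but the lognormal model and every parameter in `STAGES` are assumptions chosen for illustration, not the customer's measurements.

```python
import math
import random
import statistics

# Hypothetical per-stage models: (median seconds, lognormal sigma).
# Illustrative values only; they are not taken from the case study.
STAGES = {
    "retrieval": (0.30, 0.4),
    "chunk_extraction": (0.10, 0.3),
    "re_ranking": (4.00, 0.6),
    "context_assembly": (0.05, 0.3),
    "llm_generation": (1.20, 0.3),
}

def p95(samples):
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(samples, n=20)[-1]

def simulate(stages, iterations=10_000, seed=0):
    """Sample every stage once per iteration and report P95 per stage and end to end."""
    rng = random.Random(seed)
    per_stage = {name: [] for name in stages}
    totals = []
    for _ in range(iterations):
        total = 0.0
        for name, (median, sigma) in stages.items():
            # For a lognormal distribution, mu is the log of the median.
            latency = rng.lognormvariate(math.log(median), sigma)
            per_stage[name].append(latency)
            total += latency
        totals.append(total)
    report = {name: p95(values) for name, values in per_stage.items()}
    report["end_to_end"] = p95(totals)
    return report

report = simulate(STAGES)
bottleneck = max(STAGES, key=lambda name: report[name])
for name, value in report.items():
    print(f"{name:>18}: P95 = {value:.2f}s")
print(f"Slowest stage at P95: {bottleneck}")
```

Ranking stages by their own P95 is the simplest way to surface a bottleneck. Note that per-stage P95 values generally do not sum to the end-to-end P95, which is why the sketch computes the total from the per-iteration sums.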
The culprit? Their self-hosted re-ranker was running on a single CPU instance. Under load, requests queued up, and the queue depth was unbounded.
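The effect of a single worker with an unbounded queue can be sketched with a small FIFO simulation. This is a simplified model under stated assumptions: Poisson arrivals, a fixed service time, and hypothetical rates. None of these values describe the customer's actual re-ranker.

```python
import random
import statistics

def simulate_single_worker(arrival_rate, service_time, n_requests=5_000, seed=0):
    """Per-request latency (queue wait + service) for one FIFO worker, unbounded queue."""
    rng = random.Random(seed)
    clock = 0.0           # arrival time of the current request
    worker_free_at = 0.0  # time at which the worker clears its backlog
    latencies = []
    for _ in range(n_requests):
        clock += rng.expovariate(arrival_rate)  # exponential gaps = Poisson arrivals
        start = max(clock, worker_free_at)      # wait if the worker is still busy
        worker_free_at = start + service_time
        latencies.append(worker_free_at - clock)
    return latencies

def p95(samples):
    return statistics.quantiles(samples, n=20)[-1]

# Hypothetical load levels: 0.5s of work per request on one worker.
light = simulate_single_worker(arrival_rate=1.0, service_time=0.5)  # 50% utilization
heavy = simulate_single_worker(arrival_rate=2.5, service_time=0.5)  # 125% utilization

print(f"P95 at 50% utilization:  {p95(light):.2f}s")
print(f"P95 at 125% utilization: {p95(heavy):.2f}s")
```

Once arrivals outpace the worker, the backlog never drains, and nothing caps the wait because the queue accepts every request. Bounding the queue or adding workers changes that behavior, at the cost of rejected requests or extra capacity.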
Step 3: Test alternatives
PRISM lets you swap out individual stages and re-run simulations. We swapped the self-hosted re-ranker for a Cohere re-rank API, which had a P95 of 0.4s. Total pipeline P95 latency dropped from 14s to 2.1s.
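The swap-and-re-run workflow can be sketched as follows. This is an illustration, not PRISM's interface: `lognormal`, `BASELINE`, `CANDIDATE`, and all latency parameters are hypothetical, and the replacement re-ranker is modeled as just another assumed distribution.

```python
import math
import random
import statistics

def lognormal(median, sigma):
    """Return a sampler that draws latencies (seconds) around the given median."""
    return lambda rng: rng.lognormvariate(math.log(median), sigma)

# Hypothetical stage models; none of these values come from the case study.
BASELINE = {
    "retrieval": lognormal(0.30, 0.4),
    "chunk_extraction": lognormal(0.10, 0.3),
    "re_ranking": lognormal(4.00, 0.6),
    "context_assembly": lognormal(0.05, 0.3),
    "llm_generation": lognormal(1.20, 0.3),
}

# Replace one stage and leave the rest of the pipeline untouched.
CANDIDATE = {**BASELINE, "re_ranking": lognormal(0.25, 0.3)}

def end_to_end_p95(stages, iterations=10_000, seed=0):
    """P95 of the summed stage latencies over `iterations` simulated requests."""
    rng = random.Random(seed)
    totals = [
        sum(sample(rng) for sample in stages.values())
        for _ in range(iterations)
    ]
    return statistics.quantiles(totals, n=20)[-1]

before = end_to_end_p95(BASELINE)
after = end_to_end_p95(CANDIDATE)
print(f"End-to-end P95 before swap: {before:.2f}s")
print(f"End-to-end P95 after swap:  {after:.2f}s")
```

Keeping every other stage fixed isolates the effect of the one change. The end-to-end figure is recomputed from fresh samples in each run because percentile differences between stages do not simply add or subtract.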