Tutorial/February 18, 2026/12 min read

Profiling a RAG pipeline from 14s to 2.1s

A step-by-step walkthrough of using PRISM's latency profiler to identify and fix a re-ranker bottleneck that was adding roughly 11 seconds of P95 latency to a document QA pipeline.

The setup

Our user, a legal tech company, had a document QA pipeline with five stages: document retrieval, chunk extraction, re-ranking, context assembly, and LLM generation. End-to-end latency was 14 seconds at P95.

Step 2: Run the simulation

We ran a 10,000-iteration Monte Carlo simulation. PRISM's latency profiler immediately highlighted the bottleneck: the re-ranking stage had a P95 latency of 11.2 seconds.
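To make the idea concrete, here is a minimal sketch of what a per-stage Monte Carlo latency simulation can look like. It does not use PRISM's API, which this post does not show. It is plain standard-library Python, and the stage names, the lognormal shape, and every parameter in `STAGES` are illustrative assumptions rather than the customer's measurements.

```python
import math
import random

# Hypothetical (median seconds, log-sigma) per stage.
# Placeholder values for illustration only, not measured data.
STAGES = {
    "document_retrieval": (0.30, 0.3),
    "chunk_extraction": (0.10, 0.3),
    "re_ranking": (4.00, 0.6),
    "context_assembly": (0.05, 0.3),
    "llm_generation": (1.00, 0.3),
}

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]

def simulate(stages, iterations=10_000, seed=0):
    """Draw one latency per stage per iteration; return per-stage samples and totals."""
    rng = random.Random(seed)
    samples = {name: [] for name in stages}
    totals = []
    for _ in range(iterations):
        total = 0.0
        for name, (median, sigma) in stages.items():
            latency = rng.lognormvariate(math.log(median), sigma)
            samples[name].append(latency)
            total += latency
        totals.append(total)
    return samples, totals

samples, totals = simulate(STAGES)
p95_by_stage = {name: percentile(vals, 95) for name, vals in samples.items()}
bottleneck = max(p95_by_stage, key=p95_by_stage.get)
print(f"end-to-end P95: {percentile(totals, 95):.2f}s, slowest stage: {bottleneck}")
```

Ranking stages by their own P95, rather than by mean latency, is what surfaces a stage whose tail dominates the end-to-end number.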

The culprit? Their self-hosted re-ranker was running on a single CPU instance. Under load, requests queued up, and the queue depth was unbounded.
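A small sketch shows why an unbounded queue in front of a single worker is so damaging. It models one FIFO worker with a fixed service time. The 0.5s service time and the arrival intervals are assumed numbers chosen for illustration, not figures from the customer's system.

```python
def single_server_latencies(arrival_times, service_time):
    """Per-request latency (queue wait + service) for one FIFO worker
    with an unbounded queue."""
    latencies = []
    server_free_at = 0.0
    for arrival in arrival_times:
        start = max(arrival, server_free_at)  # wait if the worker is still busy
        server_free_at = start + service_time
        latencies.append(server_free_at - arrival)
    return latencies

def evenly_spaced(count, interval):
    """Arrival times for `count` requests, one every `interval` seconds."""
    return [i * interval for i in range(count)]

# Light load: requests arrive more slowly than the worker can serve them.
light = single_server_latencies(evenly_spaced(100, 1.0), service_time=0.5)
# Heavy load: requests arrive faster than the worker can serve them.
heavy = single_server_latencies(evenly_spaced(100, 0.4), service_time=0.5)

print(f"light load, worst latency: {max(light):.1f}s")
print(f"heavy load, worst latency: {max(heavy):.1f}s")
```

Under light load every request takes just the service time. Once arrivals outpace the worker, each new request waits behind a longer backlog, so latency keeps climbing for as long as the load lasts because nothing caps the queue.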

Step 3: Test alternatives

PRISM lets you swap out individual stages and re-run simulations. We tested the Cohere re-rank API in place of the self-hosted re-ranker: the stage's P95 fell from 11.2s to 0.4s, and total pipeline latency dropped from 14s to 2.1s at P95.
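The swap-and-rerun workflow can be sketched in a few lines of plain Python. This is not PRISM's API. The `lognormal` helper, the stage names, and all latency parameters below are hypothetical stand-ins, so the printed numbers will not match the 14s and 2.1s figures from the case study.

```python
import math
import random

def lognormal(median, sigma):
    """Build a sampler for a lognormal latency model (an assumed shape)."""
    return lambda rng: rng.lognormvariate(math.log(median), sigma)

# Illustrative stage models; parameters are assumptions, not measured values.
baseline = {
    "document_retrieval": lognormal(0.30, 0.3),
    "chunk_extraction": lognormal(0.10, 0.3),
    "re_ranking": lognormal(4.00, 0.6),  # stands in for the self-hosted re-ranker
    "context_assembly": lognormal(0.05, 0.3),
    "llm_generation": lognormal(1.00, 0.3),
}

def p95_total(stages, iterations=10_000, seed=0):
    """Nearest-rank P95 of simulated end-to-end latency."""
    rng = random.Random(seed)
    totals = sorted(
        sum(sample(rng) for sample in stages.values())
        for _ in range(iterations)
    )
    return totals[math.ceil(iterations * 95 / 100) - 1]

# Swap only the re-ranking stage; every other stage keeps its baseline model.
candidate = {**baseline, "re_ranking": lognormal(0.25, 0.3)}

print(f"baseline P95:  {p95_total(baseline):.2f}s")
print(f"candidate P95: {p95_total(candidate):.2f}s")
```

Keeping the other four stages fixed isolates the effect of the one change, which is the point of swapping a single stage instead of rebuilding the whole pipeline model.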
