Semantic vs Agentic RAG
A team has shipped semantic RAG. The eval set is real, the reranker is in place, costs are predictable. Some queries still come back wrong, and the internet has opinions about agentic RAG being the answer. The question is when the upgrade earns its cost.
This short guide compares the two. The argument it defends: default to semantic, add agentic in measured steps, against benchmark numbers, on the slice of traffic that needs it. The expensive thing in 2026 is not building either pipeline. It is paying for unnecessary agentic loops in production.
The contrast in one paragraph each
Semantic RAG is a fixed, author-controlled pipeline: chunk, embed, retrieve top-k, optionally rerank, paste into a prompt, call the model once. Retrieval policy, k, filters, and prompt are decided in code. One retrieval, one generation call.
Agentic RAG hands flow control to the model. The model decides whether to retrieve, what to query, which tool to use, whether the evidence is sufficient, and whether to loop. The harness defines the state machine; the model walks it.
This is the same workflows-vs-agents distinction from Anthropic's Building Effective Agents: "for many applications, optimizing single LLM calls with retrieval and in-context examples is usually enough."
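The two control-flow shapes described above can be put side by side in a minimal sketch. Everything here is illustrative: `CORPUS`, `retrieve`, `semantic_rag`, and `agentic_rag` are hypothetical names, the retriever is a toy keyword match, and the model is passed in as a plain callable rather than a real LLM client.

```python
from typing import Callable, List

# Hypothetical stand-in corpus; a real system would query a vector store.
CORPUS = {
    "paris": "Paris is the capital of France.",
    "seine": "The Seine flows through Paris.",
}

def retrieve(query: str, k: int = 2) -> List[str]:
    """Toy keyword retriever: up to k passages whose key appears in the query."""
    hits = [text for key, text in CORPUS.items() if key in query.lower()]
    return hits[:k]

def semantic_rag(query: str, generate: Callable[[str], str], k: int = 2) -> str:
    """Fixed pipeline: one retrieval, one generation call, all decided in code."""
    passages = retrieve(query, k)
    prompt = "Context:\n" + "\n".join(passages) + f"\n\nQuestion: {query}"
    return generate(prompt)

def agentic_rag(
    query: str,
    decide: Callable[[str, List[str]], dict],
    max_steps: int = 4,
) -> str:
    """Model-driven loop: `decide` picks each action; the harness enforces the cap."""
    evidence: List[str] = []
    for _ in range(max_steps):
        action = decide(query, evidence)
        if action["type"] == "answer":
            return action["text"]
        # Any other action is treated as a search with a model-chosen query.
        evidence.extend(retrieve(action["query"]))
    return "I could not find enough evidence."  # forced stop at the step cap
```

In `semantic_rag` the author fixes the number of retrievals at one. In `agentic_rag` the `decide` callable, standing in for the model, chooses when to search and when to stop; the only thing the harness guarantees is the `max_steps` ceiling.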
What the benchmarks say
Self-RAG vs vanilla RAG (Asai et al., 2023):
| Task | Llama2-13B RAG | Ret-ChatGPT | Self-RAG 13B |
|---|---|---|---|
| PopQA | 14.7 | 50.8 | 55.8 |
| PubHealth | 16.0 | 54.7 | 74.5 |
| Bio (FactScore) | 55.9 | 71.8 | 80.2 |
CRAG vs Self-RAG (Yan et al., 2024, on SelfRAG-LLaMA2-7B):
| Method | PopQA | PubHealth | Bio |
|---|---|---|---|
| Vanilla RAG | 40.3 | 39.0 | 59.2 |
| Self-RAG | 54.9 | 72.4 | 81.2 |
| Self-CRAG | 61.8 | 74.8 | 86.2 |
Adaptive-RAG (Jeong et al., NAACL 2024, F1 with FLAN-T5-XL):
| Dataset | Single-step | Multi-step | Adaptive |
|---|---|---|---|
| SQuAD | 36.8 | 35.6 | 38.3 |
| Natural Questions | 47.3 | 41.8 | 47.3 |
| MuSiQue | 16.2 | 31.9 | 31.8 |
| HotpotQA | 41.4 | 56.5 | 53.8 |
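The interesting fact is not that multi-step wins on the hard datasets (MuSiQue, HotpotQA). It is that single-step wins on the easy ones (SQuAD, Natural Questions). Adaptive-RAG routes between the two: it matches or beats single-step on the easy datasets and stays within about 3 F1 of multi-step on the hard ones, at roughly half the average cost of always running multi-step.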
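The cost side of routing is simple expected-value arithmetic. The sketch below assumes hypothetical relative costs (single-step = 1 unit, multi-step = 4 units, classifier = 0.1 unit) and a hypothetical two-label classifier output; none of these numbers come from the Adaptive-RAG paper.

```python
def route(complexity: str) -> str:
    """Map a (hypothetical) classifier label to a retrieval strategy."""
    return {"simple": "single_step", "complex": "multi_step"}[complexity]

def average_cost(
    share_simple: float,
    cost_single: float = 1.0,      # assumed unit cost of one single-step answer
    cost_multi: float = 4.0,       # assumed cost of a multi-step answer
    cost_classifier: float = 0.1,  # assumed overhead of the routing classifier
) -> float:
    """Expected per-query cost under routing, in single-step units."""
    routed = share_simple * cost_single + (1 - share_simple) * cost_multi
    return cost_classifier + routed

# With 70 percent simple traffic, routing costs about 2.0 units per query,
# versus 4.0 units if every query ran the multi-step path.
print(round(average_cost(0.7), 2))
```

Under these assumed numbers the routed system costs about half of always-multi-step. The saving shrinks as the share of complex queries grows, which is why the query mix should be measured before committing to a router.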
FRAMES (Krishna et al., 2024, Gemini-Pro-1.5):
| Setting | Accuracy |
|---|---|
| Naive prompt (no retrieval) | 0.408 |
| Single-step BM25, 4 docs | 0.474 |
| Multi-step agentic (5 iters) | 0.66 |
| Oracle (all gold docs) | 0.729 |
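Relative to the no-retrieval prompt, that is a lift of more than 50 percent (0.408 to 0.66); relative to single-step BM25 (0.474), it is closer to 40 percent. The price is roughly six sequential inference calls per question.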
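The size of the lift depends on which row of the FRAMES table is taken as the baseline. A short calculation from the table values makes the difference explicit; `relative_lift` is a helper name introduced here for illustration.

```python
def relative_lift(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline

# Accuracy values copied from the FRAMES table above.
frames = {
    "naive": 0.408,
    "single_step_bm25": 0.474,
    "multi_step": 0.66,
    "oracle": 0.729,
}

lift_vs_naive = relative_lift(frames["naive"], frames["multi_step"])
lift_vs_single = relative_lift(frames["single_step_bm25"], frames["multi_step"])
gap_to_oracle = frames["oracle"] - frames["multi_step"]

print(f"{lift_vs_naive:.0%} vs naive, {lift_vs_single:.0%} vs single-step")
# prints "62% vs naive, 39% vs single-step"
print(f"{gap_to_oracle:.3f} absolute gap to oracle")
# prints "0.069 absolute gap to oracle"
```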
Cost and latency
Three rules of thumb.
- Token spend. Take semantic RAG as the 1x baseline. A Self-RAG or CRAG inner loop costs 2 to 4x; ReAct with 2 to 3 retrievals costs 4 to 8x. Anthropic's research system writeup reports about 4x the tokens of a chat interaction for single agents and about 15x for multi-agent systems.
- Latency. Single call: 0.8 to 3 s. Five-step agentic chain: 10 to 30 s. Multi-agent research: often beyond a minute.
- Per-query cost. Semantic RAG: roughly $0.001 to $0.01. Agentic RAG: $0.02 to $0.10. A 5 to 50x spread driven mostly by loop depth.
Prompt caching can roughly halve the agentic premium in well-designed systems. Anthropic bills cache reads at 0.10x the base input price, so keep the stable prefix (system prompt, tools, corpus) at the front and never reorder it.
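A back-of-the-envelope model shows where the saving from caching comes from. This is a sketch under loud assumptions: the 0.10x read multiplier is the figure quoted above, the 1.25x write multiplier is an assumed value, the fresh tokens per step are held constant (real agent contexts grow), and output tokens are ignored. The function name and the token counts are hypothetical.

```python
def input_cost_units(
    steps: int,
    prefix_tokens: int,
    fresh_tokens_per_step: int,
    cached: bool,
    cache_read_mult: float = 0.10,   # read discount quoted in the text
    cache_write_mult: float = 1.25,  # ASSUMED write surcharge on the first call
) -> float:
    """Input cost of one agentic run, in base-price token units."""
    if not cached:
        # Every step pays full price for the stable prefix again.
        return float(steps * (prefix_tokens + fresh_tokens_per_step))
    # First call writes the prefix to the cache; later calls read it at a discount.
    first = prefix_tokens * cache_write_mult + fresh_tokens_per_step
    later = (steps - 1) * (prefix_tokens * cache_read_mult + fresh_tokens_per_step)
    return first + later

uncached = input_cost_units(5, 4000, 1000, cached=False)  # 25000 units
with_cache = input_cost_units(5, 4000, 1000, cached=True)  # about 11600 units
print(round(with_cache / uncached, 2))
```

With a 4,000-token prefix and five steps, the cached run costs a little under half of the uncached one in this model. The ratio improves as the prefix grows relative to the fresh tokens, and it collapses if the prefix is reordered between calls, because the cache then misses.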
Where each actually wins
- Single-fact lookup, FAQ. Semantic wins on cost; ties on quality.
- Multi-hop / compositional questions. Agentic clearly wins (HotpotQA from ~41 F1 to 56-62; FRAMES from 0.47 with single-step retrieval to 0.66).
- Ambiguous or under-specified queries. Agentic wins via a single rewrite step. The smallest possible upgrade.
- Long-tail or out-of-domain. Agentic with a web fallback wins. CRAG's PopQA gain is almost entirely the fallback.
- Time-sensitive queries. Agentic with web search wins. Semantic is bounded by index freshness.
- Global / sensemaking questions. Agentic (specifically GraphRAG) wins decisively.
- Stable, high-volume, cacheable queries. Semantic wins. A cache hit beats any agent.
Operational differences
- Debugging. Semantic: one trace. Agentic: multi-step traces; an observability tool (LangSmith, Langfuse, Arize Phoenix) is a requirement.
- Determinism. Semantic is largely deterministic. Agentic varies in shape even at temperature zero.
- Cost predictability. Semantic is bounded. Agentic is a distribution with a long right tail. A `max_steps` cap is part of the architecture.
- Guardrails. Agentic needs step limits, tool allowlists, output validators, citation requirements, refusal-on-loop heuristics. Semantic needs grounding checks; the surface area is much smaller.
- Evaluation. Semantic: retrieval@k plus end-to-end. Agentic: step-level trajectory evaluation plus end-to-end.
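Several of the guardrails listed above (step limits, tool allowlists, loop detection) can be checked on a recorded trajectory, which is also the starting point for step-level evaluation. The sketch below is one possible shape, not a standard API: `Step`, `ALLOWED_TOOLS`, and `check_trajectory` are hypothetical names, and the loop heuristic is simply "same tool, same normalized query, seen twice".

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    tool: str
    query: str

# Hypothetical allowlist; a real deployment would define its own tool names.
ALLOWED_TOOLS = {"vector_search", "web_search"}

def check_trajectory(steps: List[Step], max_steps: int = 5) -> List[str]:
    """Return the violations found in one agent run; an empty list means it passed."""
    problems: List[str] = []
    if len(steps) > max_steps:
        problems.append(f"step limit exceeded: {len(steps)} > {max_steps}")
    seen = set()
    for i, step in enumerate(steps):
        if step.tool not in ALLOWED_TOOLS:
            problems.append(f"step {i}: tool '{step.tool}' not in allowlist")
        # Normalize the query so trivial variations still count as repeats.
        key = (step.tool, step.query.strip().lower())
        if key in seen:
            problems.append(f"step {i}: repeated call, possible loop")
        seen.add(key)
    return problems
```

Running the same check over logged production traces and over eval-set runs gives a cheap trajectory-level signal before any answer-quality scoring.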
The decision framework
```mermaid
flowchart TD
    A[Ship semantic RAG] --> B{Retrieval recall acceptable?}
    B -- "no" --> C[Fix retrieval: hybrid + reranker]
    C --> B
    B -- "yes" --> D{Answers correct on eval?}
    D -- "yes" --> Z[Stop. Ship it.]
    D -- "no" --> E{Failure shape?}
    E -- "ambiguity" --> F[Add 1 rewrite or HyDE call]
    E -- "multi-hop" --> G[Bounded multi-step loop]
    E -- "long-tail or fresh" --> H[Web-search fallback]
    E -- "global or themes" --> I[GraphRAG for that slice]
    E -- "heterogeneous mix" --> J[Adaptive-RAG routing]
    F --> Z
    G --> Z
    H --> Z
    I --> Z
    J --> Z
```
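The same flowchart can be written as a small lookup, which is handy for tagging eval failures with the upgrade they point to. The function name `next_step` and the exact label strings are illustrative choices that mirror the diagram.

```python
from typing import Optional

# Failure shape -> smallest upgrade, copied from the flowchart branches.
UPGRADES = {
    "ambiguity": "add one rewrite or HyDE call",
    "multi-hop": "bounded multi-step loop",
    "long-tail or fresh": "web-search fallback",
    "global or themes": "GraphRAG for that slice",
    "heterogeneous mix": "Adaptive-RAG routing",
}

def next_step(
    recall_ok: bool,
    answers_ok: bool,
    failure_shape: Optional[str] = None,
) -> str:
    """Walk the decision flowchart for one slice of the eval set."""
    if not recall_ok:
        return "fix retrieval: hybrid + reranker"
    if answers_ok:
        return "stop, ship it"
    if failure_shape not in UPGRADES:
        raise ValueError(f"unclassified failure shape: {failure_shape!r}")
    return UPGRADES[failure_shape]
```

The `ValueError` branch is deliberate: a failure that does not fit one of the five shapes needs more diagnosis, not an agent.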
A checklist for choosing agentic over semantic. Two or more should hold before committing.
- multi-hop or compositional reasoning required
- query distribution is heterogeneous
- recall caps below acceptable answer quality even after a reranker
- domain requires fusion across web, SQL, or multi-source corpora
- accuracy ranks far above latency and cost in the product spec
- failure cases require visible self-correction
A few clearly bad reasons.
- "agentic RAG was on a benchmark we saw last week"
- "we have a framework that supports it"
- "our PM saw a demo"
Hybrids: the practical default
Most production systems end up neither pure semantic nor pure agentic. Five hybrids are common.
- Semantic plus single rewrite. One LLM call (HyDE, paraphrase, decomposition) before retrieval. Roughly 1x extra cost, most of the agentic accuracy gain on ambiguous queries.
- Agentic with semantic as the default tool. Tool-using agent where the first tool is the existing vanilla RAG. Most queries finish in one call; the loop only fires when the model judges the result inadequate.
- Adaptive RAG. Classifier routes simple to single-step, complex to multi-step. Approaches multi-step accuracy at half the average cost.
- CRAG-style confidence-gated fallback. Semantic first, evaluator scores, web fallback only on Incorrect.
- Two-tier cache. Tier 1 is an answer cache for queries above ~95 percent similarity to a cached query. Tier 2 is a retrieval cache keyed on topic match. The agent runs only on a full miss. Reported LLM-spend reductions on high-volume surfaces range from 60 to 90 percent.
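The confidence-gated fallback from the list above fits in a few lines. This sketch simplifies CRAG to a single threshold: anything the evaluator scores below `incorrect_below` is treated as Incorrect and triggers the web path. The function name, the threshold value, and all four callables are hypothetical placeholders for real components.

```python
from typing import Callable, List

def gated_answer(
    query: str,
    retrieve: Callable[[str], List[str]],
    evaluate: Callable[[str, List[str]], float],  # assumed to return a score in [0, 1]
    web_search: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    incorrect_below: float = 0.3,  # assumed threshold, to be tuned on the eval set
) -> str:
    """Semantic retrieval first; fall back to web search only on a low evaluator score."""
    passages = retrieve(query)
    if evaluate(query, passages) < incorrect_below:
        # The local index looks wrong for this query, so replace its passages.
        passages = web_search(query)
    return generate(query, passages)
```

Because the gate is a single comparison, the share of traffic that takes the expensive path is directly observable and tunable.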
The honest framing: most "agentic RAG in production" is one of these hybrids.
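The two-tier cache hybrid can be sketched as a lookup that reports which tier hit. Assumptions are stated plainly: `jaccard` is a toy word-overlap similarity standing in for embedding cosine similarity, the topic label is supplied by the caller, and `TwoTierCache` is a hypothetical class with an in-memory store.

```python
from typing import Dict, List, Optional, Tuple

def jaccard(a: str, b: str) -> float:
    """Toy similarity on lowercase word sets; a stand-in for embedding similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

class TwoTierCache:
    def __init__(self, answer_threshold: float = 0.95):
        self.answer_threshold = answer_threshold
        self.answers: List[Tuple[str, str]] = []   # tier 1: (query, answer)
        self.passages: Dict[str, List[str]] = {}   # tier 2: topic -> passages

    def store(self, query: str, topic: str, passages: List[str], answer: str) -> None:
        self.answers.append((query, answer))
        self.passages[topic] = passages

    def lookup(self, query: str, topic: str) -> Tuple[str, Optional[object]]:
        """Return (tier, payload): the tier is 'answer_hit', 'retrieval_hit' or 'miss'."""
        for cached_query, answer in self.answers:
            if jaccard(query, cached_query) >= self.answer_threshold:
                return "answer_hit", answer
        if topic in self.passages:
            return "retrieval_hit", self.passages[topic]
        return "miss", None  # only this branch should invoke the agent
```

An `answer_hit` costs no LLM call, a `retrieval_hit` costs one generation call over cached passages, and only a `miss` pays for the full agent.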
Trade-off table
| Dimension | Semantic RAG | Agentic RAG | Hybrid |
|---|---|---|---|
| Who controls retrieval | Author (code) | Model (runtime) | Author routes, model executes on hard path |
| LLM calls per query | 1 | 3 to 10+ | 1 on easy, 2 to 5 on hard |
| Tokens vs vanilla | 1x | 5 to 20x (10 to 50x tail) | 1.5 to 3x avg |
| Latency, P50 | 1 to 3 s | 8 to 30 s | 2 to 5 s avg |
| Cost per query | $0.001 to $0.01 | $0.02 to $0.10 | $0.003 to $0.03 |
| Cost predictability | Bounded | Long right tail | Bounded by routing |
| Single-fact accuracy | High | High (overkill) | High |
| Multi-hop accuracy (HotpotQA F1) | ~41 | 55 to 62 | 50 to 55 |
| FRAMES accuracy | ~0.47 (single-step BM25) | ~0.66 | ~0.55 to 0.60 |
| Long-tail / web coverage | Poor | Good | Good on miss |
| Global / sensemaking | Poor | Good (GraphRAG) | Routed: good |
| Determinism | High | Low | Medium |
| Debuggability | Easy | Hard (tracer required) | Medium |
| Time to build (MVP) | Days | Weeks to months | One to three weeks |
| Guardrails needed | Few | Many | Moderate |
| Best fit | FAQ, docs Q&A, lookup | Multi-hop research, dynamic data | Mixed production workloads |
The left column is the fastest to build and the cheapest defensible default; it is also the floor for any hybrid. The middle column wins on the queries that need it. The right column is where most mixed production workloads end up.
The shape of the recommendation
Three concrete defaults for a team standing at the decision.
- Default to semantic. Ship the fixed pipeline, get retrieval right, build a real eval set. Measure before changing anything.
- When the eval shows a specific failure shape, add the smallest piece of agentic that addresses it. Ambiguity? One rewrite. Multi-hop? A bounded loop. Long-tail or freshness? A web fallback. Heterogeneous mix? A classifier route. Sensemaking? GraphRAG for that slice.
- Treat agentic cost as part of the architecture. Turn caps, token caps, wall-clock caps, prompt caching, and a tier-1 answer cache for the head of the query distribution. These are the difference between an agentic system at 2x semantic cost and one at 30x.
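The three caps named in the last item (turns, tokens, wall-clock) can live in one small object that the agent loop consults before each step. This is a minimal sketch; `RunBudget` and its default limits are hypothetical and would be set from the product's latency and cost targets.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RunBudget:
    max_turns: int = 5
    max_tokens: int = 20_000
    max_seconds: float = 30.0
    turns: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens_used: int) -> None:
        """Record one completed model turn and the tokens it consumed."""
        self.turns += 1
        self.tokens += tokens_used

    def exhausted(self) -> Optional[str]:
        """Return the name of the first cap that has been reached, or None."""
        if self.turns >= self.max_turns:
            return "turn cap"
        if self.tokens >= self.max_tokens:
            return "token cap"
        if time.monotonic() - self.started >= self.max_seconds:
            return "wall-clock cap"
        return None
```

A loop would call `exhausted()` before each model turn and return the best available answer, or a refusal, as soon as it reports a cap. Logging which cap fired shows where the long tail of cost comes from.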
The bet that almost always pays off is the hybrid. The bet that almost always loses is going pure-agentic by default and discovering the bill at the end of the month.
Neighbors
- Semantic RAG: the fixed pipeline this article uses as the baseline.
- Agentic RAG: the patterns this article compares against.
- The Augmented LLM: Retrieval, Tools, and Memory: the broader family.
- Choosing the Right Agentic Pattern: the same kind of decision framework for agent patterns beyond RAG.
References and Good Reads
- Anthropic. Building Effective Agents. The workflows-vs-agents framing.
- Anthropic. Multi-agent research system. Source of the 4x and 15x token-amplification numbers.
- Asai et al. Self-RAG. Source of the PopQA and PubHealth benchmark numbers.
- Yan et al. Corrective RAG (CRAG). Source of the CRAG benchmark table.
- Jeong et al. Adaptive-RAG. The classifier-routed hybrid.
- Krishna et al. FRAMES. The 0.41 to 0.66 result.
- Edge et al. GraphRAG. Local-vs-global search for sensemaking.
- Gao et al. RAG Survey. Naive, advanced, modular taxonomy.
- Singh et al. Agentic RAG Survey. The first formal survey of the agentic family.
- LangChain. On Agent Frameworks and Agent Observability. Why tracing is a requirement.
- Cohere. Agentic Multi-Stage RAG. The "rerank-first" recommendation, with an honest framing of when to go agentic.
- Pinecone. Beyond the hype: Why RAG remains essential. The "85 percent of agent compute on re-discovery" data point.
- NVIDIA. Traditional RAG vs Agentic RAG. A vendor-neutral framing for the same trade-off.
