
Semantic vs Agentic RAG

May 15, 2026 · 8 min read · AI, LLM

A team has shipped semantic RAG. The eval set is real, the reranker is in place, costs are predictable. Some queries still come back wrong, and the internet has opinions about agentic RAG being the answer. The question is when the upgrade earns its cost.

This short guide compares the two. The argument it defends: default to semantic, add agentic in measured steps, against benchmark numbers, on the slice of traffic that needs it. The expensive thing in 2026 is not building either pipeline. It is paying for unnecessary agentic loops in production.

The contrast in one paragraph each

Semantic RAG is a fixed, author-controlled pipeline: chunk, embed, retrieve top-k, optionally rerank, paste into a prompt, call the model once. Retrieval policy, k, filters, and prompt are decided in code. One retrieval, one generation call.
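The fixed pipeline fits in a few functions. What follows is a minimal sketch, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, `generate` is whatever LLM client the caller passes in, and every name here is hypothetical. The point is only that k, the ranking rule, and the prompt template all live in code.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Fixed policy: score every chunk and keep the top k. k is set in code."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


def semantic_rag(query: str, chunks: list[str], generate, k: int = 2) -> str:
    """One retrieval, one generation call. `generate` is the caller's LLM client."""
    return generate(build_prompt(query, retrieve(query, chunks, k)))


CHUNKS = [
    "The reranker reorders retrieved chunks by relevance.",
    "Prompt caching reduces the cost of repeated prefixes.",
    "Chunking splits documents into passages before embedding.",
]
```

Calling `semantic_rag("what does the reranker do", CHUNKS, generate)` invokes `generate` exactly once, with the reranker chunk ranked first in the context.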

Agentic RAG hands flow control to the model. The model decides whether to retrieve, what to query, which tool to use, whether the evidence is sufficient, and whether to loop. The harness defines the state machine; the model walks it.
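A rough sketch of that split between harness and model, under stated assumptions: `policy` stands in for the model's decision at each step, `retrieve` for any retrieval tool, and the `Step` and `State` types are hypothetical names invented for this illustration. The harness owns the loop and the exit condition; the policy chooses what happens inside it.

```python
from dataclasses import dataclass, field
from typing import Callable

FALLBACK = "Not enough evidence within the step budget."


@dataclass
class Step:
    action: str   # "retrieve" or "answer"
    payload: str  # the query to run, or the final answer text


@dataclass
class State:
    question: str
    evidence: list[str] = field(default_factory=list)
    steps_taken: int = 0


Policy = Callable[[State], Step]   # stands in for the model's decision
Tool = Callable[[str], list[str]]  # stands in for a retriever


def agentic_rag(question: str, policy: Policy, retrieve: Tool, max_steps: int = 5) -> str:
    state = State(question)
    while state.steps_taken < max_steps:
        step = policy(state)  # the model decides: retrieve again or answer
        state.steps_taken += 1
        if step.action == "answer":
            return step.payload
        if step.action == "retrieve":
            state.evidence.extend(retrieve(step.payload))
        else:
            raise ValueError(f"unknown action: {step.action}")
    # The harness, not the model, enforces the exit.
    return FALLBACK
```

A policy that never answers still terminates after `max_steps` iterations, which is what makes the loop safe to run against real traffic.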

This is the same workflows-vs-agents distinction from Anthropic's Building Effective Agents: "for many applications, optimizing single LLM calls with retrieval and in-context examples is usually enough."

What the benchmarks say

Self-RAG vs vanilla RAG (Asai et al., 2023):

Task	Llama2-13B RAG	Ret-ChatGPT	Self-RAG 13B
PopQA	14.7	50.8	55.8
PubHealth	16.0	54.7	74.5
Bio (FactScore)	55.9	71.8	80.2

CRAG vs Self-RAG (Yan et al., 2024, on SelfRAG-LLaMA2-7B):

Method	PopQA	PubHealth	Bio
Vanilla RAG	40.3	39.0	59.2
Self-RAG	54.9	72.4	81.2
Self-CRAG	61.8	74.8	86.2

Adaptive-RAG (Jeong et al., NAACL 2024, F1 with FLAN-T5-XL):

Dataset	Single-step	Multi-step	Adaptive
SQuAD	36.8	35.6	38.3
Natural Questions	47.3	41.8	47.3
MuSiQue	16.2	31.9	31.8
HotpotQA	41.4	56.5	53.8

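The interesting fact is not that multi-step wins on hard datasets. It is that single-step wins on easy ones: on SQuAD and Natural Questions it matches or beats multi-step. Adaptive-RAG routes each query by predicted complexity, so it tracks single-step on the easy datasets and approaches multi-step on the hard ones, at roughly half the average cost, because the easy half of the distribution never enters the loop.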
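The pattern is easier to see as a gap against the better fixed strategy per dataset. The snippet below only rearranges the F1 numbers from the Adaptive-RAG table above; the function name and dictionary layout are made up for this illustration.

```python
# F1 scores copied from the Adaptive-RAG table above (FLAN-T5-XL).
F1 = {
    "SQuAD": {"single": 36.8, "multi": 35.6, "adaptive": 38.3},
    "Natural Questions": {"single": 47.3, "multi": 41.8, "adaptive": 47.3},
    "MuSiQue": {"single": 16.2, "multi": 31.9, "adaptive": 31.8},
    "HotpotQA": {"single": 41.4, "multi": 56.5, "adaptive": 53.8},
}


def summarize(scores: dict) -> dict:
    """For each dataset: the better fixed strategy, and Adaptive's gap to it."""
    rows = {}
    for dataset, s in scores.items():
        best_fixed = "single" if s["single"] >= s["multi"] else "multi"
        gap = round(s["adaptive"] - s[best_fixed], 1)
        rows[dataset] = (best_fixed, gap)
    return rows


for dataset, (best_fixed, gap) in summarize(F1).items():
    print(f"{dataset}: best fixed = {best_fixed}, adaptive gap = {gap:+.1f}")
```

Single-step is the better fixed choice on SQuAD and Natural Questions, multi-step on MuSiQue and HotpotQA, and Adaptive stays within 2.7 F1 of the better fixed strategy on all four.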

FRAMES (Krishna et al., 2024, Gemini-Pro-1.5):

Setting	Accuracy
Naive prompt (no retrieval)	0.408
Single-step BM25, 4 docs	0.474
Multi-step agentic (5 iters)	0.66
Oracle (all gold docs)	0.729

More than a 50 percent relative lift over the no-retrieval baseline (0.408 to 0.66), at the cost of roughly six sequential inference calls per question.
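The size of the lift depends on which baseline is used. The arithmetic below works only from the FRAMES table above; the variable names are illustrative.

```python
# Accuracies copied from the FRAMES table above (Gemini-Pro-1.5).
FRAMES = {
    "naive": 0.408,
    "single_step_bm25": 0.474,
    "multi_step": 0.66,
    "oracle": 0.729,
}


def relative_lift(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, as a fraction of `base`."""
    return (new - base) / base


lift_vs_naive = relative_lift(FRAMES["multi_step"], FRAMES["naive"])              # about 0.62
lift_vs_single = relative_lift(FRAMES["multi_step"], FRAMES["single_step_bm25"])  # about 0.39
share_of_oracle = FRAMES["multi_step"] / FRAMES["oracle"]                         # about 0.91
```

Against single-step BM25 retrieval the relative lift is closer to 39 percent, and the multi-step pipeline reaches about 91 percent of the oracle ceiling.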

Cost and latency

Three rules of thumb.

Token spend. Semantic RAG is the 1x baseline. A Self-RAG or CRAG inner loop runs 2 to 4x. ReAct with 2 to 3 retrievals runs 4 to 8x. Anthropic's research system writeup reports about 4x the tokens of a chat interaction for a single agent and about 15x for multi-agent.
Latency. Single call: 0.8 to 3 s. Five-step agentic chain: 10 to 30 s. Multi-agent research: often beyond a minute.
Per-query cost. Semantic RAG: roughly $0.001 to $0.01. Agentic RAG: $0.02 to $0.10. A 5 to 50x spread driven mostly by loop depth.

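Prompt caching can roughly halve the agentic premium in well-designed systems. Anthropic bills cache reads at 0.10x the base input price, so keep the stable prefix (system prompt, tools, corpus) at the front of the prompt and never reorder it; any change to the prefix invalidates the cache from that point on.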
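A back-of-envelope model makes the caching effect concrete. Everything in this sketch is an assumption apart from the 0.10x cache-read multiplier quoted above: the prices and token counts are hypothetical, cache-write surcharges are ignored, and each call is given the same number of fresh tokens, whereas a real loop accumulates context as it goes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float                # hypothetical USD per million input tokens
    output_per_mtok: float               # hypothetical USD per million output tokens
    cache_read_multiplier: float = 0.10  # cache reads at 0.10x input, as quoted above


def query_cost(p: Pricing, calls: int, prefix_tokens: int, fresh_tokens: int,
               output_tokens: int, cached: bool = False) -> float:
    """Cost of `calls` sequential LLM calls that share a stable prefix.

    With caching on, the first call pays full price for the prefix and every
    later call reads it from cache. Cache-write surcharges are ignored.
    """
    total = 0.0
    for i in range(calls):
        prefix_rate = p.cache_read_multiplier if (cached and i > 0) else 1.0
        input_tokens = prefix_tokens * prefix_rate + fresh_tokens
        total += input_tokens * p.input_per_mtok / 1e6
        total += output_tokens * p.output_per_mtok / 1e6
    return total


PRICES = Pricing(input_per_mtok=1.0, output_per_mtok=5.0)  # made-up prices
semantic = query_cost(PRICES, calls=1, prefix_tokens=4000, fresh_tokens=500, output_tokens=300)
agentic = query_cost(PRICES, calls=5, prefix_tokens=4000, fresh_tokens=500, output_tokens=300)
agentic_cached = query_cost(PRICES, calls=5, prefix_tokens=4000, fresh_tokens=500,
                            output_tokens=300, cached=True)
```

With these made-up numbers the single call costs $0.006, the five-call loop $0.030 uncached and about $0.0156 cached, so the premium over semantic drops from $0.024 to $0.0096. The effect scales with how much of each call is stable prefix.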

Where each actually wins

Single-fact lookup, FAQ. Semantic wins on cost; ties on quality.
Multi-hop / compositional questions. Agentic clearly wins (HotpotQA from ~41 F1 to 56-62; FRAMES from 0.41 to 0.66).
Ambiguous or under-specified queries. Agentic wins via a single rewrite step. The smallest possible upgrade.
Long-tail or out-of-domain. Agentic with a web fallback wins. CRAG's PopQA gain is almost entirely the fallback.
Time-sensitive queries. Agentic with web search wins. Semantic is bounded by index freshness.
Global / sensemaking questions. Agentic (specifically GraphRAG) wins decisively.
Stable, high-volume, cacheable queries. Semantic wins. A cache hit beats any agent.

Operational differences

Debugging. Semantic: one trace. Agentic: multi-step traces; an observability tool (LangSmith, Langfuse, Arize Phoenix) is a requirement.
Determinism. Semantic is largely deterministic. Agentic varies in shape even at temperature zero.
Cost predictability. Semantic is bounded. Agentic is a distribution with a long right tail. A max_steps cap is part of the architecture.
Guardrails. Agentic needs step limits, tool allowlists, output validators, citation requirements, refusal-on-loop heuristics. Semantic needs grounding checks; the surface area is much smaller.
Evaluation. Semantic: retrieval@k plus end-to-end. Agentic: step-level trajectory evaluation plus end-to-end.

The decision framework

flowchart TD
    A[Ship semantic RAG] --> B{Retrieval recall acceptable?}
    B -- "no" --> C[Fix retrieval: hybrid + reranker]
    C --> B
    B -- "yes" --> D{Answers correct on eval?}
    D -- "yes" --> Z[Stop. Ship it.]
    D -- "no" --> E{Failure shape?}
    E -- "ambiguity" --> F[Add 1 rewrite or HyDE call]
    E -- "multi-hop" --> G[Bounded multi-step loop]
    E -- "long-tail or fresh" --> H[Web-search fallback]
    E -- "global or themes" --> I[GraphRAG for that slice]
    E -- "heterogeneous mix" --> J[Adaptive-RAG routing]
    F --> Z
    G --> Z
    H --> Z
    I --> Z
    J --> Z
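The same flowchart can be written as a plain function, which is handy for attaching to an eval report. This is only a transcription of the chart above: the function name and the string labels are illustrative, and the failure shape is assumed to come from manual or automated error analysis.

```python
from typing import Optional

# Failure shape -> smallest upgrade, copied from the flowchart above.
UPGRADES = {
    "ambiguity": "add 1 rewrite or HyDE call",
    "multi-hop": "bounded multi-step loop",
    "long-tail or fresh": "web-search fallback",
    "global or themes": "GraphRAG for that slice",
    "heterogeneous mix": "Adaptive-RAG routing",
}


def next_step(recall_ok: bool, answers_ok: bool, failure_shape: Optional[str] = None) -> str:
    """Walk the decision flowchart once and return the recommended action."""
    if not recall_ok:
        return "fix retrieval: hybrid + reranker"
    if answers_ok:
        return "stop, ship it"
    if failure_shape not in UPGRADES:
        raise ValueError(f"classify the failure first; got {failure_shape!r}")
    return UPGRADES[failure_shape]
```

Retrieval is checked first on purpose: no upgrade further down the chart is considered until recall is acceptable.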

A checklist for choosing agentic over semantic. Two or more should hold before committing.

multi-hop or compositional reasoning required
query distribution is heterogeneous
recall caps below acceptable answer quality even after a reranker
domain requires fusion across web, SQL, or multi-source corpora
accuracy ranks far above latency and cost in the product spec
failure cases require visible self-correction

A few clearly bad reasons.

"agentic RAG was on a benchmark we saw last week"
"we have a framework that supports it"
"our PM saw a demo"

Hybrids: the practical default

Most production systems end up neither pure semantic nor pure agentic. Five hybrids are common.

Semantic plus single rewrite. One LLM call (HyDE, paraphrase, decomposition) before retrieval. Roughly 1x extra cost, most of the agentic accuracy gain on ambiguous queries.
Agentic with semantic as the default tool. Tool-using agent where the first tool is the existing vanilla RAG. Most queries finish in one call; the loop only fires when the model judges the result inadequate.
Adaptive RAG. Classifier routes simple to single-step, complex to multi-step. Approaches multi-step accuracy at half the average cost.
CRAG-style confidence-gated fallback. Semantic first, evaluator scores, web fallback only on Incorrect.
Two-tier cache. Tier-1 answer cache for queries above ~95 percent similarity to a cached query. Tier-2 retrieval cache on topic match. Agent runs only on a full miss. Reports of 60 to 90 percent LLM-spend reduction on high-volume surfaces.
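The two-tier cache is the least obvious of the five, so here is one way it might look. This is a sketch under loud assumptions: `similarity` uses `difflib` string matching as a stand-in for embedding similarity, the topic key is supplied by the caller, and `generate` and `run_agent` are hypothetical callables for the cheap and expensive paths.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional


def similarity(a: str, b: str) -> float:
    """Stand-in for embedding similarity: a string-match ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


class TwoTierCache:
    def __init__(self, answer_threshold: float = 0.95):
        self.answer_threshold = answer_threshold
        self.answers: dict[str, str] = {}           # tier 1: query -> final answer
        self.retrievals: dict[str, list[str]] = {}  # tier 2: topic -> retrieved chunks

    def lookup_answer(self, query: str) -> Optional[str]:
        for cached_query, cached_answer in self.answers.items():
            if similarity(query, cached_query) >= self.answer_threshold:
                return cached_answer
        return None


def answer(query: str, topic: str, cache: TwoTierCache,
           generate: Callable[[str, list[str]], str],
           run_agent: Callable[[str], tuple[str, list[str]]]) -> tuple[str, str]:
    """Return (answer, path), where path is "tier1", "tier2", or "agent"."""
    hit = cache.lookup_answer(query)
    if hit is not None:
        return hit, "tier1"               # no LLM call at all
    chunks = cache.retrievals.get(topic)
    if chunks is not None:
        result, path = generate(query, chunks), "tier2"  # one call, no retrieval
    else:
        result, chunks = run_agent(query)                # full miss: run the loop
        cache.retrievals[topic] = chunks
        path = "agent"
    cache.answers[query] = result
    return result, path
```

A real deployment would also need expiry and invalidation when the index changes, which this sketch leaves out.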

The honest framing: most "agentic RAG in production" is one of these hybrids.
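The confidence-gated fallback from the list above is a good example of how small these hybrids are. The sketch below simplifies deliberately: it uses one threshold and one fallback, whereas the CRAG paper also defines an Ambiguous action that combines both sources. All four callables and the `incorrect_below` threshold are hypothetical.

```python
from typing import Callable


def gated_rag(query: str,
              retrieve: Callable[[str], list[str]],
              score: Callable[[str, str], float],
              web_search: Callable[[str], list[str]],
              generate: Callable[[str, list[str]], str],
              incorrect_below: float = 0.3) -> tuple[str, str]:
    """Semantic first; fall back to web search only when retrieval looks wrong.

    `score` stands in for a retrieval evaluator returning relevance in [0, 1].
    Returns (answer, source), where source is "index" or "web".
    """
    docs = retrieve(query)
    best = max((score(query, d) for d in docs), default=0.0)
    if best < incorrect_below:
        # No document clears the bar: treat retrieval as Incorrect.
        docs, source = web_search(query), "web"
    else:
        source = "index"
    return generate(query, docs), source
```

The model is still called once per query. The only agentic element is a scored branch, and the web tool is paid for only on the slice of traffic that fails the gate.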

Trade-off table

Dimension	Semantic RAG	Agentic RAG	Hybrid
Who controls retrieval	Author (code)	Model (runtime)	Author routes, model executes on hard path
LLM calls per query	1	3 to 10+	1 on easy, 2 to 5 on hard
Tokens vs vanilla	1x	5 to 20x (10 to 50x tail)	1.5 to 3x avg
Latency, P50	1 to 3 s	8 to 30 s	2 to 5 s avg
Cost per query	$0.001 to $0.01	$0.02 to $0.10	$0.003 to $0.03
Cost predictability	Bounded	Long right tail	Bounded by routing
Single-fact accuracy	High	High (overkill)	High
Multi-hop accuracy (HotpotQA F1)	~41	55 to 62	50 to 55
FRAMES accuracy	~0.41 to 0.47	~0.66	~0.55 to 0.60
Long-tail / web coverage	Poor	Good	Good on miss
Global / sensemaking	Poor	Good (GraphRAG)	Routed: good
Determinism	High	Low	Medium
Debuggability	Easy	Hard (tracer required)	Medium
Time to build (MVP)	Days	Weeks to months	One to three weeks
Guardrails needed	Few	Many	Moderate
Best fit	FAQ, docs Q&A, lookup	Multi-hop research, dynamic data	Mixed production workloads

The Semantic column is the cheapest defensible default and the floor for any hybrid. The Agentic column wins on the queries that need it. The Hybrid column recovers most of that gain while staying much closer to semantic on cost, latency, and build time.

The shape of the recommendation

Three concrete defaults for a team standing at the decision.

Default to semantic. Ship the fixed pipeline, get retrieval right, build a real eval set. Measure before changing anything.
When the eval shows a specific failure shape, add the smallest piece of agentic that addresses it. Ambiguity? One rewrite. Multi-hop? A bounded loop. Long-tail or freshness? A web fallback. Heterogeneous mix? A classifier route. Sensemaking? GraphRAG for that slice.
Treat agentic cost as part of the architecture. Turn caps, token caps, wall-clock caps, prompt caching, and a tier-1 answer cache for the head of the query distribution. These are the difference between an agentic system at 2x semantic cost and one at 30x.
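The three caps can live in one small guard object that the loop consults before every model call. A minimal sketch, assuming the harness can count the tokens used per turn; the class names and default limits are illustrative, not recommendations.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    max_turns: int = 5
    max_tokens: int = 20_000
    max_seconds: float = 30.0


class BudgetExceeded(Exception):
    pass


class BudgetGuard:
    def __init__(self, budget: Budget, clock=time.monotonic):
        self.budget = budget
        self.clock = clock        # injectable so tests can fake time
        self.started = clock()
        self.turns = 0
        self.tokens = 0

    def before_turn(self) -> None:
        """Call before each model call; raises once any cap has been reached."""
        if self.turns >= self.budget.max_turns:
            raise BudgetExceeded("turn cap reached")
        if self.tokens >= self.budget.max_tokens:
            raise BudgetExceeded("token cap reached")
        if self.clock() - self.started >= self.budget.max_seconds:
            raise BudgetExceeded("wall-clock cap reached")

    def after_turn(self, tokens_used: int) -> None:
        """Call after each model call with the tokens it consumed."""
        self.turns += 1
        self.tokens += tokens_used
```

The loop catches `BudgetExceeded` and returns the best answer it has so far, or a refusal, which is what keeps the right tail of the cost distribution bounded.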

The bet that almost always pays off is the hybrid. The bet that almost always loses is going pure-agentic by default and discovering the bill at the end of the month.

Neighbors

Semantic RAG: the fixed pipeline this article uses as the baseline.
Agentic RAG: the patterns this article compares against.
The Augmented LLM: Retrieval, Tools, and Memory: the broader family.
Choosing the Right Agentic Pattern: the same kind of decision framework for agent patterns beyond RAG.

References and Good Reads

Anthropic. Building Effective Agents. The workflows-vs-agents framing.
Anthropic. Multi-agent research system. Source of the 4x and 15x token-amplification numbers.
Asai et al. Self-RAG. Source of the PopQA and PubHealth benchmark numbers.
Yan et al. Corrective RAG (CRAG). Source of the CRAG benchmark table.
Jeong et al. Adaptive-RAG. The classifier-routed hybrid.
Krishna et al. FRAMES. The 0.41 to 0.66 result.
Edge et al. GraphRAG. Local-vs-global search for sensemaking.
Gao et al. RAG Survey. Naive, advanced, modular taxonomy.
Singh et al. Agentic RAG Survey. The first formal survey of the agentic family.
LangChain. On Agent Frameworks and Agent Observability. Why tracing is a requirement.
Cohere. Agentic Multi-Stage RAG. The "rerank-first" recommendation, with an honest framing of when to go agentic.
Pinecone. Beyond the hype: Why RAG remains essential. The "85 percent of agent compute on re-discovery" data point.
NVIDIA. Traditional RAG vs Agentic RAG. A vendor-neutral framing for the same trade-off.

