Semantic vs Agentic RAG

A team has shipped semantic RAG. The eval set is real, the reranker is in place, costs are predictable. Some queries still come back wrong, and the internet has opinions about agentic RAG being the answer. The question is when the upgrade earns its cost.

This short guide compares the two. The argument it defends: default to semantic, add agentic in measured steps, against benchmark numbers, on the slice of traffic that needs it. The expensive thing in 2026 is not building either pipeline. It is paying for unnecessary agentic loops in production.

The contrast in one paragraph each

Semantic RAG is a fixed, author-controlled pipeline: chunk, embed, retrieve top-k, optionally rerank, paste into a prompt, call the model once. Retrieval policy, k, filters, and prompt are decided in code. One retrieval, one generation call.
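The fixed pipeline can be sketched in a few lines. This is a minimal illustration, not a production design: the bag-of-words `embed` is a toy stand-in for a real embedding model, `llm` is any callable the caller supplies, the rerank step is omitted, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable

def chunk(text: str, size: int = 40) -> list[str]:
    """Split text into fixed-size word windows (stand-in for a real chunker)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def answer(query: str, corpus: list[str], llm: Callable[[str], str], k: int = 3) -> str:
    """One retrieval, one generation call; k and the prompt are fixed in code."""
    chunks = [c for doc in corpus for c in chunk(doc)]
    context = "\n".join(retrieve(query, chunks, k))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)
```

Every decision (chunk size, k, prompt wording) lives in the author's code, which is what makes the cost and the trace predictable.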

Agentic RAG hands flow control to the model. The model decides whether to retrieve, what to query, which tool to use, whether the evidence is sufficient, and whether to loop. The harness defines the state machine; the model walks it.
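A rough sketch of that division of labor, assuming a hypothetical `decide` callable that stands in for the model and returns either a search action or a final answer. The harness owns the loop and the step cap; the model chooses what happens inside it.

```python
from typing import Callable

# An action is {"type": "search", "query": str} or {"type": "answer", "text": str}.
Action = dict

def agentic_answer(
    question: str,
    decide: Callable[[str, list[str]], Action],  # stand-in for the model
    search: Callable[[str], list[str]],          # stand-in for a retrieval tool
    max_steps: int = 4,
) -> tuple[str, int]:
    """Run a bounded retrieve-or-answer loop; return (answer, model calls used)."""
    evidence: list[str] = []
    for step in range(1, max_steps + 1):
        action = decide(question, evidence)       # the model picks the next move
        if action["type"] == "answer":
            return action["text"], step
        evidence.extend(search(action["query"]))  # the model chose to retrieve
    # The harness, not the model, enforces the cap.
    return f"Insufficient evidence after {max_steps} steps.", max_steps
```

The number of model calls is now an output of the run rather than a constant, which is the root of the cost and latency differences discussed later.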

This is the workflows-versus-agents distinction from Anthropic's Building Effective Agents, which also notes that "for many applications, optimizing single LLM calls with retrieval and in-context examples is usually enough."

What the benchmarks say

Self-RAG vs vanilla RAG (Asai et al., 2023):

| Task | Llama2-13B RAG | Ret-ChatGPT | Self-RAG 13B |
| --- | --- | --- | --- |
| PopQA | 14.7 | 50.8 | 55.8 |
| PubHealth | 16.0 | 54.7 | 74.5 |
| Bio (FactScore) | 55.9 | 71.8 | 80.2 |

CRAG vs Self-RAG (Yan et al., 2024, on SelfRAG-LLaMA2-7B):

| Method | PopQA | PubHealth | Bio |
| --- | --- | --- | --- |
| Vanilla RAG | 40.3 | 39.0 | 59.2 |
| Self-RAG | 54.9 | 72.4 | 81.2 |
| Self-CRAG | 61.8 | 74.8 | 86.2 |

Adaptive-RAG (Jeong et al., NAACL 2024, F1 with FLAN-T5-XL):

| Dataset | Single-step | Multi-step | Adaptive |
| --- | --- | --- | --- |
| SQuAD | 36.8 | 35.6 | 38.3 |
| Natural Questions | 47.3 | 41.8 | 47.3 |
| MuSiQue | 16.2 | 31.9 | 31.8 |
| HotpotQA | 41.4 | 56.5 | 53.8 |

The interesting fact is not that multi-step wins on hard datasets. It is that single-step wins on easy ones: it beats multi-step on both SQuAD and Natural Questions. Adaptive-RAG stays within about 3 F1 of the better fixed strategy on all four datasets, at roughly half the cost of always running the multi-step loop.
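The routing idea can be sketched as follows. Adaptive-RAG trains a classifier to predict query complexity; the keyword heuristic in `classify` below is only a hypothetical stand-in so the example stays self-contained, and `single_step` and `multi_step` are whatever handlers the system already has.

```python
from typing import Callable

# Hypothetical cue words; a trained classifier would replace this list.
MULTI_HOP_CUES = (" and ", " before ", " after ", " both ", " compared ")

def classify(query: str) -> str:
    """Label a query 'simple' or 'complex' with a crude keyword heuristic."""
    padded = f" {query.lower()} "
    return "complex" if any(cue in padded for cue in MULTI_HOP_CUES) else "simple"

def route(
    query: str,
    single_step: Callable[[str], str],
    multi_step: Callable[[str], str],
) -> str:
    """Send simple queries down the cheap path and complex ones down the loop."""
    handler = multi_step if classify(query) == "complex" else single_step
    return handler(query)
```

The savings come entirely from how often the cheap branch is taken, so the classifier's error rate on the eval set matters as much as either pipeline's accuracy.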

FRAMES (Krishna et al., 2024, Gemini-Pro-1.5):

| Setting | Accuracy |
| --- | --- |
| Naive prompt (no retrieval) | 0.408 |
| Single-step BM25, 4 docs | 0.474 |
| Multi-step agentic (5 iters) | 0.66 |
| Oracle (all gold docs) | 0.729 |

Roughly a 39 percent relative lift over single-step retrieval (0.474 to 0.66), and about 62 percent over the no-retrieval prompt, at the cost of roughly six sequential inference calls per question.
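The size of the lift depends on which baseline is chosen, so it is worth recomputing from the FRAMES table above. The snippet uses only the four accuracies listed there; the helper names are illustrative.

```python
def relative_lift(baseline: float, improved: float) -> float:
    """Fractional improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline

def gap_closed(baseline: float, improved: float, ceiling: float) -> float:
    """Share of the baseline-to-ceiling gap that `improved` recovers."""
    return (improved - baseline) / (ceiling - baseline)

# Accuracies copied from the FRAMES table above.
FRAMES = {"naive": 0.408, "single_step_bm25": 0.474, "multi_step": 0.66, "oracle": 0.729}

lift_vs_single = relative_lift(FRAMES["single_step_bm25"], FRAMES["multi_step"])
lift_vs_naive = relative_lift(FRAMES["naive"], FRAMES["multi_step"])
oracle_gap = gap_closed(FRAMES["single_step_bm25"], FRAMES["multi_step"], FRAMES["oracle"])

print(f"lift over single-step: {lift_vs_single:.1%}")
print(f"lift over no retrieval: {lift_vs_naive:.1%}")
print(f"share of oracle gap closed: {oracle_gap:.1%}")
```

Against single-step retrieval the lift is about 39 percent; against the no-retrieval prompt it is about 62 percent. The multi-step loop also recovers roughly 73 percent of the gap between single-step retrieval and the oracle setting.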

Cost and latency

Three rules of thumb.

  • Token spend. Semantic RAG is 1x. Self-RAG or CRAG inner loop: 2 to 4x. ReAct with 2 to 3 retrievals: 4 to 8x. In its multi-agent research system writeup, Anthropic reports roughly 4x the tokens of a chat interaction for single agents and roughly 15x for multi-agent systems.
  • Latency. Single call: 0.8 to 3 s. Five-step agentic chain: 10 to 30 s. Multi-agent research: often beyond a minute.
  • Per-query cost. Semantic RAG: roughly $0.001 to $0.01. Agentic RAG: $0.02 to $0.10. That is a 10 to 20x spread when comparing like ends of the two ranges, driven mostly by loop depth.
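A back-of-the-envelope model shows why loop depth dominates. The sketch assumes each step re-sends the whole context plus whatever the previous step retrieved, so input tokens grow linearly per step. The per-token prices and token counts are illustrative placeholders, not quotes from any provider.

```python
# Hypothetical prices in dollars per token; substitute the actual model's rates.
IN_PRICE = 3 / 1_000_000
OUT_PRICE = 15 / 1_000_000

def query_cost(
    steps: int,
    base_input: int,
    growth_per_step: int,
    output_per_step: int,
    in_price: float = IN_PRICE,
    out_price: float = OUT_PRICE,
) -> float:
    """Dollar cost of one query whose context grows by a fixed amount each step."""
    input_tokens = sum(base_input + i * growth_per_step for i in range(steps))
    output_tokens = steps * output_per_step
    return input_tokens * in_price + output_tokens * out_price

# One call with a 1,500-token prompt and a 200-token answer.
semantic = query_cost(steps=1, base_input=1500, growth_per_step=0, output_per_step=200)
# Five calls, each adding 800 tokens of new evidence to the context.
agentic = query_cost(steps=5, base_input=1500, growth_per_step=800, output_per_step=200)

print(f"semantic: ${semantic:.4f}  agentic: ${agentic:.4f}  ratio: {agentic / semantic:.1f}x")
```

Under these assumptions the semantic query costs about $0.0075 and the five-step loop about $0.0615, roughly 8x. Because the context is re-sent each turn, input cost grows faster than the step count.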

Prompt caching can roughly halve the agentic premium in well-designed systems. Anthropic bills cache reads at 0.10x the base input price; keep the stable prefix (system prompt, tools, corpus) at the front and never reorder it.
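The effect of a cached prefix can be estimated with simple arithmetic. The sketch below works in units of "one uncached input token". The 0.10x read multiplier is the figure quoted above; the 1.25x write multiplier on the first call is an assumption to check against current pricing, and the model is simplified by keeping fresh tokens per step constant.

```python
def input_cost_units(
    steps: int,
    prefix_tokens: int,
    fresh_tokens_per_step: int,
    cached: bool,
    read_mult: float = 0.10,   # cache-read multiplier quoted in the text
    write_mult: float = 1.25,  # assumed cache-write premium on the first call
) -> float:
    """Total input cost across a loop, in uncached-input-token units."""
    if steps <= 0:
        return 0.0
    if not cached:
        return float(steps * (prefix_tokens + fresh_tokens_per_step))
    # First call writes the prefix to the cache; later calls read it back.
    first = prefix_tokens * write_mult + fresh_tokens_per_step
    later = (steps - 1) * (prefix_tokens * read_mult + fresh_tokens_per_step)
    return first + later

uncached = input_cost_units(5, prefix_tokens=4000, fresh_tokens_per_step=500, cached=False)
with_cache = input_cost_units(5, prefix_tokens=4000, fresh_tokens_per_step=500, cached=True)
print(f"cached / uncached input cost: {with_cache / uncached:.2f}")
```

With a 4,000-token stable prefix and five steps, the cached run costs about 40 percent of the uncached one on input. The saving shrinks as the uncacheable portion of each turn grows, which is why prefix order matters.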

Where each actually wins

  • Single-fact lookup, FAQ. Semantic wins on cost; ties on quality.
  • Multi-hop / compositional questions. Agentic clearly wins (HotpotQA from ~41 F1 to 56-62; FRAMES from 0.47 with single-step retrieval to 0.66).
  • Ambiguous or under-specified queries. Agentic wins via a single rewrite step. The smallest possible upgrade.
  • Long-tail or out-of-domain. Agentic with a web fallback wins. CRAG's PopQA gain is almost entirely the fallback.
  • Time-sensitive queries. Agentic with web search wins. Semantic is bounded by index freshness.
  • Global / sensemaking questions. Agentic (specifically GraphRAG) wins decisively.
  • Stable, high-volume, cacheable queries. Semantic wins. A cache hit beats any agent.

Operational differences

  • Debugging. Semantic: one trace. Agentic: multi-step traces; an observability tool (LangSmith, Langfuse, Arize Phoenix) is a requirement.
  • Determinism. Semantic is largely deterministic. Agentic varies in shape even at temperature zero.
  • Cost predictability. Semantic is bounded. Agentic is a distribution with a long right tail. A max_steps cap is part of the architecture.
  • Guardrails. Agentic needs step limits, tool allowlists, output validators, citation requirements, refusal-on-loop heuristics. Semantic needs grounding checks; the surface area is much smaller.
  • Evaluation. Semantic: retrieval@k plus end-to-end. Agentic: step-level trajectory evaluation plus end-to-end.
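Two of the guardrails listed above, a tool allowlist and a refusal-on-loop heuristic, fit in a small class. This is one possible shape rather than a standard API: `LoopGuard` and `GuardrailError` are hypothetical names, and "loop" is defined narrowly here as repeating the same tool call with the same normalized query.

```python
class GuardrailError(Exception):
    """Raised when the agent attempts a disallowed or repeated tool call."""

class LoopGuard:
    def __init__(self, allowed_tools: set[str], max_repeats: int = 1):
        self.allowed_tools = allowed_tools
        self.max_repeats = max_repeats
        self._seen: dict[tuple[str, str], int] = {}

    def check(self, tool: str, query: str) -> None:
        """Validate a proposed tool call before the harness executes it."""
        if tool not in self.allowed_tools:
            raise GuardrailError(f"tool not on allowlist: {tool}")
        # Normalize case and whitespace so trivial rephrasings count as repeats.
        key = (tool, " ".join(query.lower().split()))
        self._seen[key] = self._seen.get(key, 0) + 1
        if self._seen[key] > self.max_repeats:
            raise GuardrailError(f"repeated call to {tool}; refusing to loop")
```

The harness would call `check` before every tool invocation and turn a `GuardrailError` into a forced final answer or a refusal.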

The decision framework

```mermaid
flowchart TD
    A[Ship semantic RAG] --> B{Retrieval recall acceptable?}
    B -- "no" --> C[Fix retrieval: hybrid + reranker]
    C --> B
    B -- "yes" --> D{Answers correct on eval?}
    D -- "yes" --> Z[Stop. Ship it.]
    D -- "no" --> E{Failure shape?}
    E -- "ambiguity" --> F[Add 1 rewrite or HyDE call]
    E -- "multi-hop" --> G[Bounded multi-step loop]
    E -- "long-tail or fresh" --> H[Web-search fallback]
    E -- "global or themes" --> I[GraphRAG for that slice]
    E -- "heterogeneous mix" --> J[Adaptive-RAG routing]
    F --> Z
    G --> Z
    H --> Z
    I --> Z
    J --> Z
```

A checklist for choosing agentic over semantic. Two or more should hold before committing.

  • multi-hop or compositional reasoning required
  • query distribution is heterogeneous
  • recall caps below acceptable answer quality even after a reranker
  • domain requires fusion across web, SQL, or multi-source corpora
  • accuracy ranks far above latency and cost in the product spec
  • failure cases require visible self-correction
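The "two or more" rule is easy to encode so that it sits next to the eval results in a design review. The field names below are shorthand for the six checklist items and are not taken from any library.

```python
from dataclasses import dataclass, fields

@dataclass
class AgenticSignals:
    multi_hop: bool = False                       # multi-hop or compositional reasoning
    heterogeneous_queries: bool = False           # query distribution is mixed
    recall_capped_after_rerank: bool = False      # recall too low even with a reranker
    multi_source_fusion: bool = False             # web, SQL, or multi-corpus fusion
    accuracy_over_latency_and_cost: bool = False  # product spec ranks accuracy first
    visible_self_correction: bool = False         # failures need visible self-correction

def should_go_agentic(signals: AgenticSignals, threshold: int = 2) -> bool:
    """True when at least `threshold` checklist items hold."""
    held = sum(bool(getattr(signals, f.name)) for f in fields(signals))
    return held >= threshold
```

A team with only multi-hop questions would get `False` here, which matches the flowchart: add a bounded loop for that slice rather than committing to a fully agentic design.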

A few clearly bad reasons.

  • "agentic RAG was on a benchmark we saw last week"
  • "we have a framework that supports it"
  • "our PM saw a demo"

Hybrids: the practical default

Most production systems end up neither pure semantic nor pure agentic. Five hybrids are common.

  • Semantic plus single rewrite. One LLM call (HyDE, paraphrase, decomposition) before retrieval. Roughly one extra LLM call per query, for most of the agentic accuracy gain on ambiguous queries.
  • Agentic with semantic as the default tool. Tool-using agent where the first tool is the existing vanilla RAG. Most queries finish in one call; the loop only fires when the model judges the result inadequate.
  • Adaptive RAG. A classifier routes simple queries to single-step and complex ones to multi-step. Approaches multi-step accuracy at roughly half the average cost.
  • CRAG-style confidence-gated fallback. Semantic first, evaluator scores, web fallback only on Incorrect.
  • Two-tier cache. Tier-1 answer cache for queries above ~95 percent similarity to a cached query. Tier-2 retrieval cache on topic match. Agent runs only on a full miss. Reports of 60 to 90 percent LLM-spend reduction on high-volume surfaces.
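The two-tier cache from the last bullet might look like the sketch below. Several pieces are stand-ins: `difflib.SequenceMatcher` replaces embedding similarity, `topic_of` is a caller-supplied hypothetical function that maps a query to a topic key, and the linear scan over cached queries would be a vector index in practice.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional

class TwoTierCache:
    def __init__(self, topic_of: Callable[[str], str], threshold: float = 0.95):
        self.topic_of = topic_of
        self.threshold = threshold
        self.answers: dict[str, str] = {}         # tier 1: query -> final answer
        self.passages: dict[str, list[str]] = {}  # tier 2: topic -> retrieved passages

    def lookup(self, query: str) -> tuple[str, Optional[object]]:
        """Return ("answer", text), ("passages", list), or ("miss", None)."""
        for cached_query, cached_answer in self.answers.items():
            ratio = SequenceMatcher(None, query.lower(), cached_query.lower()).ratio()
            if ratio >= self.threshold:
                return "answer", cached_answer    # skip retrieval and generation
        topic = self.topic_of(query)
        if topic in self.passages:
            return "passages", self.passages[topic]  # skip retrieval only
        return "miss", None                       # fall through to the agent

    def store(self, query: str, answer: str, passages: list[str]) -> None:
        self.answers[query] = answer
        self.passages[self.topic_of(query)] = passages
```

Only a `"miss"` pays for the full agent run; a `"passages"` hit still needs one generation call, and an `"answer"` hit needs none.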

The honest framing: most "agentic RAG in production" is one of these hybrids.

Trade-off table

| Dimension | Semantic RAG | Agentic RAG | Hybrid |
| --- | --- | --- | --- |
| Who controls retrieval | Author (code) | Model (runtime) | Author routes, model executes on hard path |
| LLM calls per query | 1 | 3 to 10+ | 1 on easy, 2 to 5 on hard |
| Tokens vs vanilla | 1x | 5 to 20x (10 to 50x tail) | 1.5 to 3x avg |
| Latency, P50 | 1 to 3 s | 8 to 30 s | 2 to 5 s avg |
| Cost per query | $0.001 to $0.01 | $0.02 to $0.10 | $0.003 to $0.03 |
| Cost predictability | Bounded | Long right tail | Bounded by routing |
| Single-fact accuracy | High | High (overkill) | High |
| Multi-hop accuracy (HotpotQA F1) | ~41 | 55 to 62 | 50 to 55 |
| FRAMES accuracy | ~0.41 to 0.47 | ~0.66 | ~0.55 to 0.60 |
| Long-tail / web coverage | Poor | Good | Good on miss |
| Global / sensemaking | Poor | Good (GraphRAG) | Routed: good |
| Determinism | High | Low | Medium |
| Debuggability | Easy | Hard (tracer required) | Medium |
| Time to build (MVP) | Days | Weeks to months | One to three weeks |
| Guardrails needed | Few | Many | Moderate |
| Best fit | FAQ, docs Q&A, lookup | Multi-hop research, dynamic data | Mixed production workloads |

The left column is the fastest to build, the cheapest defensible default, and the floor for any hybrid. The middle column wins on the queries that need it. The right column is where most mixed production workloads end up.

The shape of the recommendation

Three concrete defaults for a team standing at the decision.

  • Default to semantic. Ship the fixed pipeline, get retrieval right, build a real eval set. Measure before changing anything.
  • When the eval shows a specific failure shape, add the smallest piece of agentic behavior that addresses it. Ambiguity? One rewrite. Multi-hop? A bounded loop. Long-tail or freshness? A web fallback. Heterogeneous mix? A classifier route. Sensemaking? GraphRAG for that slice.
  • Treat agentic cost as part of the architecture. Turn caps, token caps, wall-clock caps, prompt caching, and a tier-1 answer cache for the head of the query distribution. These are the difference between an agentic system at 2x semantic cost and one at 30x.
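The three caps in the last bullet can share one small object that the loop consults before each turn. This is a sketch with hypothetical names and default limits; real limits should come from the latency and cost targets in the product spec.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Budget:
    max_turns: int = 5
    max_tokens: int = 20_000
    max_seconds: float = 30.0
    turns: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens_used: int) -> None:
        """Record one completed model call and the tokens it consumed."""
        self.turns += 1
        self.tokens += tokens_used

    def exhausted(self) -> Optional[str]:
        """Return the name of the first cap that has been hit, or None."""
        if self.turns >= self.max_turns:
            return "turns"
        if self.tokens >= self.max_tokens:
            return "tokens"
        if time.monotonic() - self.started >= self.max_seconds:
            return "wall-clock"
        return None
```

A loop would check `budget.exhausted()` at the top of every iteration and, on a non-None result, stop retrieving and answer from the evidence gathered so far. Logging which cap fired gives a direct view of the right tail of the cost distribution.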

The bet that almost always pays off is the hybrid. The bet that almost always loses is going pure-agentic by default and discovering the bill at the end of the month.
