Agentic RAG

Semantic RAG fails on a specific class of question: the ones whose answer is not in a single retrieval. Multi-hop reasoning, long-tail entities, time-sensitive facts, dataset-level themes. The FRAMES benchmark from Google quantifies the gap: a frontier model answering FRAMES questions with no retrieval scores 0.41 accuracy. With a single-shot retrieval it scores 0.45 to 0.47. With a multi-step retrieval pipeline that lets the model issue queries in sequence, it scores 0.66 (Krishna et al., 2024).

Agentic RAG is RAG where one or more LLM-driven agents control when, what, and how to retrieve. Where semantic RAG is a directed acyclic graph the engineer drew, agentic RAG is a state machine the engineer drew that hands control-flow decisions back to the model at branch points. Accuracy on hard queries goes up; cost, latency, and operational complexity go up with it.

The basic loop

flowchart TD
    Q[Question] --> D{Retrieve?}
    D -- "yes" --> R[Retrieval tool]
    R --> E{Enough?}
    E -- "no, refine" --> D
    E -- "yes" --> G[Generate answer]
    D -- "no" --> G
    G --> END([Answer + citations])

The model makes three decisions that the semantic pipeline cannot: whether to retrieve at all, whether the retrieved evidence is sufficient, and what to query next. Every agentic-RAG pattern in the literature is a variation on which of these three decisions the model owns and how the harness implements the loop.
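
To make the three decisions concrete, here is a minimal sketch of the loop with each decision passed in as a plain callable. All five callables (`should_retrieve`, `retrieve`, `is_enough`, `next_query`, `generate`) are hypothetical stand-ins for LLM or retriever calls, and the sketch simplifies the flowchart by always retrieving again after a refine instead of re-asking whether to retrieve.

```python
from typing import Callable

def agentic_loop(
    question: str,
    should_retrieve: Callable[[str], bool],
    retrieve: Callable[[str], list[str]],
    is_enough: Callable[[str, list[str]], bool],
    next_query: Callable[[str, list[str]], str],
    generate: Callable[[str, list[str]], str],
    max_steps: int = 5,
) -> str:
    """Run the retrieve / judge / refine loop under a hard step cap."""
    evidence: list[str] = []
    if not should_retrieve(question):            # decision 1: retrieve at all?
        return generate(question, evidence)
    query = question
    for _ in range(max_steps):
        evidence.extend(retrieve(query))
        if is_enough(question, evidence):        # decision 2: is this sufficient?
            break
        query = next_query(question, evidence)   # decision 3: what to ask next?
    return generate(question, evidence)
```

Swapping which callables are model-driven and which are fixed rules is roughly what separates the patterns described in the rest of this post.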

Five patterns worth knowing

Self-RAG: reflect during generation

Self-RAG (Asai et al., 2023) trains a single LM to emit four kinds of reflection tokens: Retrieve (do I need to retrieve?), IsRel (is the passage relevant?), IsSup (is the generation supported by the passage?), and IsUse (is the response useful, scored 1 to 5). The model decides on retrieval, critiques passages, and grades its own outputs through these tokens. Self-RAG-13B beats vanilla retrieval-augmented Llama2-13B by 41 points on PopQA and 58 points on PubHealth. In production the reflection capability is usually expressed at the prompt level: a separate grader LLM call instead of a fine-tuned token.
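
A prompt-level version of the relevance and support checks might look like the sketch below. This is not the Self-RAG training recipe: the prompt wording, the `LLM` callable type, and the function names are all assumptions made for illustration, and each check costs one extra grader call.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: takes a prompt, returns the model's text

RELEVANCE_PROMPT = (
    "Question:\n{question}\n\nPassage:\n{passage}\n\n"
    "Is the passage relevant to the question? Reply 'yes' or 'no'."
)
SUPPORT_PROMPT = (
    "Passage:\n{passage}\n\nAnswer:\n{answer}\n\n"
    "Is the answer fully supported by the passage? Reply 'yes' or 'no'."
)

def _yes(llm: LLM, prompt: str) -> bool:
    """Treat any reply that starts with 'yes' as a positive verdict."""
    return llm(prompt).strip().lower().startswith("yes")

def relevant_passages(llm: LLM, question: str, passages: list[str]) -> list[str]:
    """Relevance filter: keep only passages the grader calls relevant."""
    return [p for p in passages
            if _yes(llm, RELEVANCE_PROMPT.format(question=question, passage=p))]

def is_supported(llm: LLM, answer: str, passages: list[str]) -> bool:
    """Support check: at least one passage must back the answer."""
    return any(_yes(llm, SUPPORT_PROMPT.format(passage=p, answer=answer))
               for p in passages)
```

A harness would typically call `relevant_passages` before generation and `is_supported` after it, regenerating or retrieving again when the support check fails.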

CRAG: grade retrievals, fall back to the web

CRAG (Yan et al., 2024) adds a lightweight retrieval evaluator between retrieval and generation. The evaluator classifies retrieved documents as Correct (filter and use), Ambiguous (combine corpus and web), or Incorrect (discard, fall back to web search). The web fallback is the agentic edge: it gives the system somewhere to go when the static corpus does not have the answer. On PopQA, the web fallback alone lifts accuracy by 19 points over Self-RAG.

flowchart TD
    Q[Query] --> R[Retrieve]
    R --> E{Evaluator}
    E -- "Correct" --> F[Filter]
    E -- "Ambiguous" --> BOTH[Filter + Web]
    E -- "Incorrect" --> W[Web search]
    F --> G[Generate]
    BOTH --> G
    W --> G
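
The three-way branch can be sketched as a pair of thresholds over per-document evaluator scores. The threshold values, the `web_search` callable, and the simple score filter are assumptions for illustration; the paper's own evaluator and knowledge-refinement step are more involved than this.

```python
from typing import Callable

def crag_action(scores: list[float], upper: float = 0.7, lower: float = 0.3) -> str:
    """Map evaluator scores to one of CRAG's three actions."""
    if not scores or max(scores) < lower:
        return "incorrect"      # nothing usable: discard and go to the web
    if max(scores) >= upper:
        return "correct"        # at least one confident hit: use the corpus
    return "ambiguous"          # in between: combine corpus and web

def build_context(
    query: str,
    docs: list[str],
    scores: list[float],
    web_search: Callable[[str], list[str]],
    upper: float = 0.7,
    lower: float = 0.3,
) -> list[str]:
    """Assemble the generation context according to the chosen action."""
    action = crag_action(scores, upper, lower)
    kept = [d for d, s in zip(docs, scores) if s >= lower]  # simple score filter
    if action == "correct":
        return kept
    if action == "ambiguous":
        return kept + web_search(query)
    return web_search(query)
```

Keeping the action decision separate from context assembly makes it easy to log which branch fired, which helps when tuning the thresholds against an eval set.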

Adaptive RAG: route the query

Adaptive-RAG (Jeong et al., NAACL 2024) is the explicit hybrid. A small classifier predicts query complexity and routes each query to (A) no retrieval, (B) single-step retrieval, or (C) multi-step iterative retrieval. Adaptive-RAG approaches multi-step accuracy at roughly half the cost on average. The classifier can be a fine-tuned small model or just one LLM call returning a structured "simple" or "complex" label. The interesting insight: every dollar spent looping on a simple question is wasted, and so is every dollar spent answering a multi-hop question without looping.
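
The routing step itself is small. The sketch below assumes a hypothetical `classify` callable that returns one of the labels A, B, or C, and a dictionary of handler functions, one per strategy; unknown labels fall back to single-step retrieval, which is a design choice of this sketch and not something the paper prescribes.

```python
from typing import Callable

def route_query(
    question: str,
    classify: Callable[[str], str],
    handlers: dict[str, Callable[[str], str]],
    default: str = "B",
) -> str:
    """Send the question to the handler chosen by the complexity classifier."""
    label = classify(question).strip().upper()
    handler = handlers.get(label, handlers[default])  # unknown label: single-step
    return handler(question)
```

In use, `handlers` would map "A" to a plain LLM call, "B" to a one-shot retrieve-then-generate function, and "C" to the multi-step loop.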

Tool-using RAG: the practical default

The simplest agentic-RAG architecture in production is a tool-using loop. The retriever is registered as a search(query, k) tool. The harness runs a while loop. The model decides when to call search, what query to issue, and when to stop and answer. This is what Anthropic's tool use docs describe, what the OpenAI Responses API offers either as a hosted file_search tool or a custom function, and what Claude Projects surfaces as the built-in project_knowledge_search. Self-RAG, CRAG, FLARE, and multi-hop iteration are all expressible as different prompts and tools on top of this skeleton.

GraphRAG: for sensemaking, not lookup

The hardest question semantic RAG cannot answer is "summarize the dataset." Top-k retrieval over chunks has no way to see the whole. Microsoft GraphRAG (Edge et al., 2024) builds an entity graph from the corpus, partitions it with Leiden community detection, and uses an LLM to pre-generate summaries for every community. Local search answers entity questions by walking the graph from matched entities. Global search answers sensemaking questions by map-reducing across community summaries. In the paper's eval, GraphRAG beats naive RAG on comprehensiveness and diversity with win rates of roughly 72 to 83 percent; naive RAG wins on direct lookup.
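
The map-reduce shape of global search can be sketched in a few lines. Here `map_fn` and `reduce_fn` are hypothetical wrappers around LLM calls: the first returns a partial answer plus a helpfulness score for one community summary, the second combines the surviving partial answers. Capping by `max_partials` is a simplification; a real implementation would more likely cap by token budget.

```python
from typing import Callable

def global_search(
    question: str,
    community_summaries: list[str],
    map_fn: Callable[[str, str], tuple[str, int]],
    reduce_fn: Callable[[str, list[str]], str],
    max_partials: int = 10,
) -> str:
    """Answer a corpus-level question by map-reducing over community summaries."""
    # Map: one partial answer and score per community summary.
    scored = [map_fn(question, summary) for summary in community_summaries]
    # Drop unhelpful partials and keep the highest-scoring ones.
    useful = [(text, score) for text, score in scored if score > 0]
    useful.sort(key=lambda pair: pair[1], reverse=True)
    top = [text for text, _ in useful[:max_partials]]
    # Reduce: combine the surviving partial answers into one response.
    return reduce_fn(question, top)
```

The map calls are independent of each other, so they are a natural place to parallelize.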

The operational reality

The shift from one round-trip to a loop changes four things.

Token spend. A Self-RAG or CRAG inner loop runs 2 to 4x a vanilla call. A ReAct-style agent with two or three retrieval rounds runs 4 to 8x. Anthropic's multi-agent research system reports: "Agents use about 4x more tokens than chat. Multi-agent systems use about 15x more tokens than chat."

Latency. Each extra LLM call is serial unless explicitly parallelized. A single Claude or Gemini call runs 0.8 to 3 seconds; a five-step agentic chain commonly runs 10 to 30 seconds end to end.
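
When several retrievals in one step do not depend on each other, they can run concurrently so the step costs roughly one round-trip instead of several. A minimal sketch with `asyncio`, where `fake_search` is a stand-in for a real async retriever call:

```python
import asyncio

async def fake_search(query: str, delay: float = 0.05) -> str:
    """Stand-in for an async retriever call; sleeps to simulate network time."""
    await asyncio.sleep(delay)
    return f"results for {query}"

async def search_all(queries: list[str]) -> list[str]:
    """Run independent searches concurrently; results keep the input order."""
    return await asyncio.gather(*(fake_search(q) for q in queries))
```

Calling `asyncio.run(search_all(["a", "b", "c"]))` waits for about one delay, not three. This only helps for calls that are truly independent; a query that depends on the previous result still has to wait for it.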

Failure modes. New ones not present in semantic RAG: tool-call thrash (near-duplicate queries), subagent over-spawn, empty-result spirals, and infinite loops on adversarial input. Fixes are structural: deduplicate at the orchestrator, build a CRAG-style fallback for empty results, enforce a hard turn cap.
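
Deduplicating at the orchestrator can be as simple as comparing each new query against the ones already issued. The sketch below uses token-set Jaccard similarity with a hypothetical threshold of 0.8; embedding similarity is a common alternative when paraphrases matter more than word overlap.

```python
import re

def _tokens(query: str) -> set[str]:
    """Lowercase the query and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", query.lower()))

def is_near_duplicate(query: str, seen: list[str], threshold: float = 0.8) -> bool:
    """Return True if the query overlaps heavily with any earlier query."""
    new = _tokens(query)
    for old in seen:
        old_tokens = _tokens(old)
        union = new | old_tokens
        if union and len(new & old_tokens) / len(union) >= threshold:
            return True
    return False
```

When a duplicate is detected, the harness can return a short tool result telling the model the query was already tried, which tends to break the thrash without another retrieval.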

Prompt caching is the cost lever. Anthropic bills cache reads at 0.10x the base input price. Put the system prompt, tool definitions, and the retrieved-corpus prefix at the front of the request and never reorder them; the cache then amortizes that prefix across every loop iteration. Without it, the loop pays full input price on every step.
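
A rough back-of-the-envelope for the saving, in units of base-price input tokens. The 0.10x read multiplier comes from the text above; the 1.25x write multiplier on the first step is an assumption to check against current pricing, and the sketch ignores the growing conversation tail that is billed on top of the prefix.

```python
def prefix_cost(
    prefix_tokens: int,
    steps: int,
    read_mult: float = 0.10,    # cache-read price relative to base input
    write_mult: float = 1.25,   # assumed cache-write price on the first step
) -> tuple[float, float]:
    """Return (uncached, cached) cost of re-sending a fixed prefix each step."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    uncached = float(prefix_tokens * steps)
    cached = prefix_tokens * write_mult + prefix_tokens * read_mult * (steps - 1)
    return uncached, cached
```

For a 10,000-token prefix over five steps, the uncached loop pays for 50,000 token-equivalents and the cached loop for 16,500 under these assumptions, about a third of the cost.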

A reasonable production default: a hard max_steps cap (10 to 15), a cumulative token-cost cap, a wall-clock cap, plus soft stops like "no new information for N turns" and fixed-point detection on near-duplicate queries.
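
One way to keep those caps in a single place is a small budget object the loop consults after every step. The class name, field names, and default values below are illustrative assumptions, not recommendations beyond the ranges given above.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LoopBudget:
    max_steps: int = 12
    max_tokens: int = 200_000
    max_seconds: float = 60.0
    patience: int = 3                 # turns allowed without any new passage
    steps: int = 0
    tokens: int = 0
    stale_turns: int = 0
    started: float = field(default_factory=time.monotonic)
    seen: set = field(default_factory=set)

    def record(self, tokens_used: int, passage_ids: list[str]) -> None:
        """Update counters after one loop step."""
        self.steps += 1
        self.tokens += tokens_used
        new = set(passage_ids) - self.seen
        self.stale_turns = 0 if new else self.stale_turns + 1
        self.seen |= new

    def stop_reason(self) -> Optional[str]:
        """Return why the loop should stop, or None if it may continue."""
        if self.steps >= self.max_steps:
            return "max_steps"
        if self.tokens >= self.max_tokens:
            return "token_cap"
        if time.monotonic() - self.started >= self.max_seconds:
            return "wall_clock"
        if self.stale_turns >= self.patience:
            return "no_new_information"
        return None
```

Returning a reason string instead of a bare boolean makes it straightforward to log why each run ended, which is useful when tuning the caps.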

A native tool-use loop

The smallest agentic-RAG implementation that does real work is an Anthropic tool-use loop with search as a registered tool.

import anthropic
client = anthropic.Anthropic()

TOOLS = [{
    "name": "search",
    "description": "Search the internal knowledge base. Returns top-k passages.",
    "input_schema": {"type": "object",
                     "properties": {"query": {"type": "string"}, "k": {"type": "integer", "default": 5}},
                     "required": ["query"]},
}]

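def run_search(query, k=5):                     # vector, BM25, hybrid, graph, whatever
    # `retriever` is assumed to be defined elsewhere. tool_result content must be
    # a string (or a list of content blocks), so the passages are joined as text.
    return "\n\n".join(str(p) for p in retriever.retrieve(query, k=k))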

def agentic_rag(question, max_turns=8):
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        resp = client.messages.create(
            model="claude-opus-4-7",
            system="Answer using the search tool. Cite passages by id.",
            tools=TOOLS, messages=messages, max_tokens=1024,
        )
        messages.append({"role": "assistant", "content": resp.content})
        if resp.stop_reason != "tool_use":
            return "".join(b.text for b in resp.content if b.type == "text")
        results = [
            {"type": "tool_result", "tool_use_id": b.id, "content": run_search(**b.input)}
            for b in resp.content if b.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    raise RuntimeError("max_turns hit")

The loop is the entire architecture. Self-RAG, CRAG, and multi-hop iteration are all expressible as different system prompts and additional tools on top of this skeleton.

A LangGraph state machine

When the structure has more than one branch (relevance grader, query rewriter, web fallback), a state machine is cleaner than a free-form tool loop. The canonical Self-RAG and CRAG-flavored graph: retrieve, grade, either generate or rewrite, loop.

from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from pydantic import BaseModel, Field

class State(TypedDict):
    question:  str
    documents: list[str]
    answer:    str          # written by generate(); must be declared in the state schema

class Grade(BaseModel):
    binary_score: str = Field(description="'yes' or 'no'")

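# retriever, grader, rewriter, and answer_llm are assumed to be defined elsewhere
# as LangChain-style runnables that accept the inputs shown here.
def retrieve(s):     return {"documents": retriever.invoke(s["question"])}
def grade_docs(s):
    structured = grader.with_structured_output(Grade)   # build once, not per document
    keep = [d for d in s["documents"]
            if structured.invoke({"q": s["question"], "d": d}).binary_score == "yes"]
    return {"documents": keep}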
def rewrite(s):      return {"question": rewriter.invoke(s["question"]).content}
def generate(s):     return {"answer": answer_llm.invoke({"q": s["question"],
                                                          "ctx": s["documents"]}).content}

def route(s):        return "generate" if s["documents"] else "rewrite"

g = StateGraph(State)
for n, f in [("retrieve", retrieve), ("grade_docs", grade_docs),
             ("rewrite", rewrite), ("generate", generate)]:
    g.add_node(n, f)
g.add_edge(START, "retrieve")
g.add_edge("retrieve", "grade_docs")
g.add_conditional_edges("grade_docs", route, {"generate": "generate", "rewrite": "rewrite"})
g.add_edge("rewrite", "retrieve")
g.add_edge("generate", END)
graph = g.compile()

Three extensions are common: a hallucination grader between generate and END, a router node before retrieve that picks web_search for current-events queries, and a turn counter to enforce a hard cap on the rewrite-and-retrieve cycle. LangChain's Build a custom RAG agent tutorial walks the same skeleton.
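
The turn counter is the smallest of the three extensions. A sketch of it in plain Python, assuming a hypothetical `rewrites` field added to the state and a hypothetical cap of three; when the cap is hit the router returns "generate" so it stays within the two branch names the conditional edge already maps.

```python
from typing import TypedDict

MAX_REWRITES = 3  # hypothetical cap on the rewrite-and-retrieve cycle

class CappedState(TypedDict):
    question: str
    documents: list[str]
    rewrites: int

def count_rewrite(s: CappedState) -> dict:
    """State update to merge into the rewrite node's return value."""
    return {"rewrites": s["rewrites"] + 1}

def route_with_cap(s: CappedState) -> str:
    """Like the basic router, but stop rewriting once the cap is reached."""
    if s["documents"]:
        return "generate"
    if s["rewrites"] >= MAX_REWRITES:
        return "generate"   # give up on rewriting; answer with what is available
    return "rewrite"
```

A stricter variant would route to a dedicated node that returns "no answer found" instead of generating from an empty context.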

When agentic is overkill

Three common ways teams over-spend.

  • Reaching for a loop when the bottleneck is recall. If retrieval recall is below an acceptable level, an agent will keep looping on bad retrievals until it hits its cap. Fix retrieval first. Hybrid plus reranker is the cheapest large gain available.
  • Reaching for multi-agent when single-agent works. Multi-agent costs 15x tokens versus chat. If a single-agent loop with three to five turns clears the eval, an orchestrator-worker system rarely earns the cost.
  • Building the loop without a budget. Turn cap, token cap, wall-clock cap. Not afterthoughts.

The shape of the recommendation: default to semantic. Add one rewrite or HyDE call when ambiguity is the bottleneck. Move to grade-and-retry only when single-pass retrieval is genuinely insufficient. Use Adaptive-RAG routing when the query mix is heterogeneous. Reach for GraphRAG when the questions are about the corpus rather than within it. Reach for multi-agent only when the answer is worth paying for.
