
The Augmented LLM: Retrieval, Tools, and Memory

February 18, 2026 · 6 min read

A language model in isolation is a frozen system. It cannot look up a fact published after its training cutoff, cannot act in the world beyond emitting tokens, and cannot remember the last conversation it had. Every practical agentic architecture, whether a three-line pipeline or a multi-agent cluster, is an attempt to lift one or more of those three constraints. Anthropic's guide to building effective agents names the resulting configuration an augmented LLM and identifies three augmentations that matter: retrieval, tools, and memory (Anthropic, 2024).

The vocabulary is worth adopting. Naming which augmentation a system depends on turns arguments about frameworks, prompts, and orchestration into arguments about primitives. Every pattern covered later in this series is an opinion on how to combine these three.

The frozen system and the three exits

flowchart TD
    U([User query]) --> C[Context assembler]
    R[Retrieval: vector store, search, DB] --> C
    M1[Short-term memory: session, scratchpad] --> C
    M2[Long-term memory: facts, preferences] --> C
    C --> L[LLM call]
    L -->|tool call| T[Tool: API, code, search]
    T -->|observation| L
    L --> O([Response])
    L -.writes.-> M1
    L -.writes.-> M2

The diagram names the three exits from the frozen system. Retrieval brings external knowledge in. Tools let the model act. Memory persists state across turns. The context assembler is a conceptual component; its job is to decide, on each turn, what enters the prompt.
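The assembler's decision can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the function name `assemble_context`, the section titles, and the character budget (a crude stand-in for token counting) are all hypothetical.

```python
def assemble_context(
    query: str,
    retrieved: list[str],
    short_term: list[str],
    long_term: list[str],
    budget_chars: int = 2000,  # assumption: characters approximate tokens
) -> str:
    """Build one prompt from the three sources, dropping items that overflow the budget."""
    # Sections are filled in priority order; earlier sections claim budget first.
    sections = [
        ("Known facts", long_term),
        ("Conversation so far", short_term),
        ("Retrieved context", retrieved),
    ]
    parts: list[str] = []
    remaining = budget_chars
    for title, items in sections:
        kept = []
        for item in items:
            cost = len(item) + 3  # "- " prefix plus a newline
            if cost > remaining:
                break  # cut this section at the first item that does not fit
            kept.append(f"- {item}")
            remaining -= cost
        if kept:
            parts.append(f"{title}:\n" + "\n".join(kept))
    parts.append(f"Question: {query}")
    return "\n\n".join(parts)


prompt = assemble_context(
    "Where is my refund?",
    retrieved=["Policy: refunds processed within 5 business days."],
    short_term=["user: I was charged twice."],
    long_term=["Prefers email over phone."],
)
```

The priority order is itself a design choice. A system that trusts retrieval more than stored preferences would fill the retrieved section first.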

Retrieval: pulling fresh context in

The model's knowledge stops at its training cutoff and is shallow on anything the training data covered poorly. Retrieval addresses this by pulling relevant fragments from a corpus, a database, or a search index and injecting them into the prompt at inference time. The corpus stays outside the model; only the fragments relevant to this query enter the context window.

The excerpt below injects retrieved passages without any framework. The retrieval function is an ordinary Python call; the model sees the results as plain text under a labeled heading.

from openai import OpenAI

client = OpenAI()

def retrieve(query: str) -> list[str]:
    # In production: vector search, BM25, or a hybrid. Here: stub.
    return ["Policy: refunds processed within 5 business days.",
            "Policy: duplicate charges are refunded automatically within 24h."]

def answer_with_retrieval(question: str) -> str:
    passages = retrieve(question)
    context = "\n".join(f"- {p}" for p in passages)
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    ).choices[0].message.content

Tools: letting the model act

The model cannot do anything beyond producing tokens. Tools address this by exposing a set of functions the model is allowed to call. Each tool has a name, a typed signature, and an effect in the world. When the model decides it needs one, it emits a structured call; the harness executes it, and the result flows back in. Tools are the bridge between reasoning and action.

That exchange is a loop: the harness keeps calling the model until it replies without requesting a tool. The LangChain version of the earlier example, shown below, collapses most of the wiring. The retriever and a bound tool appear as first-class objects.

from langchain.chat_models import init_chat_model
from langchain_core.tools import tool
from langchain_core.messages import HumanMessage, SystemMessage

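def lookup_order(order_id: str) -> str:
    # Stub so the excerpt runs; in production this would query the order system.
    return f"Order {order_id}: shipped"

@tool
def get_order_status(order_id: str) -> str:
    """Look up the status of an order by ID."""
    return lookup_order(order_id)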

model = init_chat_model("gpt-4o-mini").bind_tools([get_order_status])

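def answer(question: str, retriever) -> str:
    passages = retriever.invoke(question)
    context = "\n".join(f"- {p.page_content}" for p in passages)
    reply = model.invoke([
        SystemMessage("Answer using only the provided context. Use tools for orders."),
        HumanMessage(f"Context:\n{context}\n\nQuestion: {question}"),
    ])
    # If the model requested get_order_status, reply.tool_calls is populated and
    # reply.content may be empty; a full harness executes the call and re-invokes.
    return reply.content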
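
The excerpt above binds a tool but makes only one model call. The loop itself can be sketched without any framework. Everything here is illustrative: `fake_model` is a hypothetical stand-in for a real LLM call so the example runs offline, the message dictionaries are a simplified shape rather than any provider's wire format, and `run_tool_loop` is an assumed name.

```python
import json
from typing import Callable


def get_order_status(order_id: str) -> str:
    """Stub tool: a real one would query an order system."""
    return f"Order {order_id}: shipped"


# Registry of callable tools, keyed by the name the model uses to request them.
TOOLS: dict[str, Callable[..., str]] = {"get_order_status": get_order_status}


def fake_model(messages: list[dict]) -> dict:
    """Hypothetical model: requests the tool once, then answers from the observation."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"content": f"Status update: {last['content']}", "tool_call": None}
    return {
        "content": None,
        "tool_call": {
            "name": "get_order_status",
            "arguments": json.dumps({"order_id": "A17"}),
        },
    }


def run_tool_loop(question: str, model=fake_model, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = model(messages)
        call = reply["tool_call"]
        if call is None:
            return reply["content"]  # no tool requested: this is the final answer
        if call["name"] not in TOOLS:
            observation = f"error: unknown tool {call['name']}"
        else:
            observation = TOOLS[call["name"]](**json.loads(call["arguments"]))
        # Record both the request and the observation so the next call sees them.
        messages.append({"role": "assistant", "content": None, "tool_call": call})
        messages.append({"role": "tool", "content": observation})
    raise RuntimeError("tool loop did not terminate within max_steps")


print(run_tool_loop("Where is order A17?"))
```

The `max_steps` cap is the piece most often left out: without it, a model that keeps requesting tools keeps the harness running indefinitely.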

Memory: carrying state across turns

A single call has no past and no future. Memory addresses this by persisting information across calls. Short-term memory holds the current conversation and the working scratchpad of an ongoing task. Long-term memory holds facts, preferences, and artifacts that should survive restart.

Memory is covered in depth in the Memory stage of this series. At this level of abstraction, the shape is simple: a store that the harness reads from before each call and writes to after each call. Production systems typically use a combination of a message buffer (short-term) and a vector or relational store (long-term).
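
The read-before, write-after shape can be sketched as follows. This is a toy under stated assumptions: the `Memory` class and `chat_turn` function are hypothetical names, a bounded deque plays the message buffer, a plain dictionary stands in for the vector or relational store, and `llm` is any callable that maps a prompt string to a reply string.

```python
from collections import deque
from typing import Callable, Optional


class Memory:
    def __init__(self, max_turns: int = 6):
        # Short-term: a bounded buffer of (role, text) pairs; old turns fall off.
        self.short_term = deque(maxlen=max_turns)
        # Long-term: durable facts; a dict stands in for a real store here.
        self.long_term: dict[str, str] = {}

    def read(self) -> str:
        facts = "\n".join(f"- {k}: {v}" for k, v in self.long_term.items())
        turns = "\n".join(f"{role}: {text}" for role, text in self.short_term)
        return f"Known facts:\n{facts}\n\nRecent turns:\n{turns}"

    def write(self, user_text: str, reply_text: str,
              facts: Optional[dict[str, str]] = None) -> None:
        self.short_term.append(("user", user_text))
        self.short_term.append(("assistant", reply_text))
        self.long_term.update(facts or {})


def chat_turn(memory: Memory, user_text: str, llm: Callable[[str], str]) -> str:
    prompt = f"{memory.read()}\n\nuser: {user_text}"  # read before the call
    reply = llm(prompt)
    memory.write(user_text, reply)  # write after the call
    return reply
```

How facts get extracted into long-term memory is left out on purpose; here the caller passes them to `write` explicitly.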

Diagnosing by which primitive is missing

Each augmentation introduces a family of failures the bare model does not have. Diagnosing these quickly requires naming them separately.

Retrieval failures are either recall failures or precision failures. If the right passage is not retrieved, the model answers from its priors and hallucinates. If too many irrelevant passages are retrieved, the relevant one is diluted and the model may anchor on whichever fragment is most prominent. The fix is to measure retrieval separately from generation, typically with a held-out query set and a recall-at-k metric: the fraction of known-relevant passages that appear in the top k results.
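
Recall-at-k is small enough to compute by hand. The sketch below assumes passages are identified by string IDs and that each held-out query comes with a labeled set of relevant IDs; the function names and the toy retriever are illustrative.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant IDs that appear among the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("each evaluation query needs at least one relevant ID")
    hits = relevant.intersection(retrieved[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(
    eval_set: list[tuple[str, set[str]]],
    retrieve: Callable[[str], list[str]],
    k: int,
) -> float:
    """Average recall-at-k over a held-out set of (query, relevant IDs) pairs."""
    scores = [recall_at_k(retrieve(query), relevant, k) for query, relevant in eval_set]
    return sum(scores) / len(scores)


# Toy retriever with fixed rankings, standing in for vector or keyword search.
RANKINGS = {
    "refund time": ["d1", "d9"],
    "duplicate charge": ["d2", "d8", "d3"],
}
eval_set = [("refund time", {"d1"}), ("duplicate charge", {"d2", "d3"})]

print(mean_recall_at_k(eval_set, RANKINGS.__getitem__, k=2))  # 0.75
```

At k=2 the second query finds only one of its two relevant passages, so the mean is (1.0 + 0.5) / 2. Because this number is computed without calling the model, a drop in it points at the retriever and not at generation.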

Tool failures are interface failures. A tool with a vague description, ambiguous parameters, or inconsistent output will be misused by the model far more often than a well-specified one. Anthropic's engineering team specifically calls out the Agent-Computer Interface as where agentic systems succeed or fail, and recommends investing as much effort in tool design as in prompt design (Anthropic, 2024). That topic gets a dedicated article later in the series.
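
As an illustration of what "well-specified" can mean, the sketch below writes a tool definition in the JSON Schema style that function-calling APIs commonly accept, then checks model-supplied arguments against it before anything executes. The `ORD-` ID format, the description wording, and `check_arguments` are assumptions made for the example, and the checker handles only string parameters.

```python
import re

ORDER_STATUS_TOOL = {
    "type": "function",
    "function": {
        "name": "get_order_status",
        # The description says what the tool does and when not to use it.
        "description": (
            "Look up the current fulfilment status of one order. "
            "Use only when the user has supplied an order ID; never guess an ID."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "pattern": "^ORD-[0-9]{6}$",  # hypothetical ID format
                    "description": "Order ID exactly as printed, e.g. ORD-123456.",
                },
            },
            "required": ["order_id"],
            "additionalProperties": False,
        },
    },
}


def check_arguments(tool: dict, args: dict) -> list[str]:
    """Return a list of problems with model-supplied arguments (empty if none)."""
    schema = tool["function"]["parameters"]
    props = schema["properties"]
    errors = [
        f"missing required argument: {name}"
        for name in schema["required"]
        if name not in args
    ]
    for name, value in args.items():
        if name not in props:
            errors.append(f"unexpected argument: {name}")
        elif not isinstance(value, str):
            errors.append(f"wrong type for argument: {name}")
        elif "pattern" in props[name] and not re.fullmatch(props[name]["pattern"], value):
            errors.append(f"malformed argument: {name}")
    return errors
```

Returning the error strings to the model as the tool observation gives it a chance to correct the call instead of failing silently.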

Memory failures are mostly staleness and contamination failures. Long-term memory that grows without bound pollutes the context window; memory that is never written to is useless; memory that mixes sessions leaks information across users. Each failure mode has a different fix, and none of them is "add more memory."
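
Two of these failures, unbounded growth and cross-user leakage, can be ruled out structurally. The sketch below is one possible shape, not a recommendation: `ScopedMemory` is a hypothetical class that namespaces every entry by user ID and evicts the least recently written entry once a per-user cap is reached.

```python
from collections import OrderedDict, defaultdict


class ScopedMemory:
    def __init__(self, max_items_per_user: int = 3):
        self.max_items = max_items_per_user
        # One ordered mapping per user; entries never share a namespace.
        self._store: defaultdict[str, OrderedDict[str, str]] = defaultdict(OrderedDict)

    def remember(self, user_id: str, key: str, value: str) -> None:
        items = self._store[user_id]
        items[key] = value
        items.move_to_end(key)  # rewriting a key marks it as fresh
        while len(items) > self.max_items:
            items.popitem(last=False)  # evict the least recently written entry

    def recall(self, user_id: str) -> dict[str, str]:
        # Reads are scoped to one user and return a copy, not the live store.
        return dict(self._store.get(user_id, {}))


memory = ScopedMemory(max_items_per_user=3)
for key in ["a", "b", "c", "d"]:
    memory.remember("alice", key, f"value-{key}")

print(list(memory.recall("alice")))  # ['b', 'c', 'd']
print(memory.recall("bob"))          # {}
```

Eviction by write order is a blunt policy. It bounds the store but says nothing about whether a surviving entry is still true, so staleness needs its own mechanism.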

Augmentation without governance becomes a liability regardless of primitive. A tool that can send emails, a retriever that can read across tenants, and a memory store that persists secrets are all augmentations with safety implications. Guardrails are covered later in this series; they should never be an afterthought.

Stacking the three

The question in practice is rarely whether to augment but which augmentation to reach for first.

The task is knowledge-bound. Failure reads as "the model does not know about this document, this customer, this policy." Add retrieval.
The task is action-bound. Failure reads as "the model cannot compute this, cannot call this API, cannot run this query." Add tools.
The task is history-bound. Failure reads as "the model forgot what we discussed, forgot preferences, cannot learn from past sessions." Add memory.
The task demands all three. A research assistant needs retrieval for source material, tools for calculations and external systems, and memory for continuity across sessions. Production systems almost always reach this state.

Naming the augmentation by its failure mode matters. Teams sometimes add retrieval when they actually need tools, or add memory when the real problem is that retrieval is not being invoked on the right turns. A correct diagnosis precedes the correct fix.

What gets built on this substrate

The augmented LLM is not a pattern in the strict sense. It is the substrate on which every pattern in this series is built. Runnable examples live at github.com/subodhjena/agentic-patterns under examples/01_basic_call.py and the later tool-use examples.

Structured output is the next article in the series and is the prerequisite that makes tool calls reliable by fixing the shape of the model's response. Context engineering, the article after that, is the discipline that decides which retrieved passages, which memory items, and which tool definitions enter the prompt on any given turn. ReAct is the pattern that fuses tools and reasoning into a loop. The Memory stage later in the series formalizes the distinction between short-term and long-term memory introduced here.

References

Anthropic. Building effective agents. December 2024.
Anthropic. Introducing the Model Context Protocol. November 2024.
Lewis, Patrick, et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
OpenAI. Function calling and other API updates. 2023.
LangChain. Augmented LLMs with retrieval, tools, and memory. 2024.

