AI / LLM

Context Engineering for Agents

February 23, 20266 min readAILLM

Prompt engineering is the shorthand most teams use for the practice of getting useful behavior out of a language model. The shorthand is leaky. The prompt, strictly speaking, is only one of the inputs the model sees. Retrieved passages, tool definitions, prior messages, summaries of earlier conversations, and the structured outputs of previous calls all compete for the same finite budget of tokens on every turn. Decisions about what to include, what to exclude, and in what form have more effect on system behavior than the wording of any single instruction.

Anthropic names this broader discipline context engineering, defined as "the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference" (Anthropic, 2024). Every later pattern in this series assumes a context-engineering discipline is in place. Failing to separate this discipline from prompt engineering is one of the most common causes of agent systems that work in demos but not in production.

The window is a workspace

flowchart TD
    IN([User turn]) --> CE[Context engineer]
    SP[System prompt: right altitude] --> CE
    JIT[Just-in-time retrieval: ids to payloads] --> CE
    COMP[Compaction of prior history] --> CE
    NOTES[Persistent notes written by the agent] --> CE
    SUB[Sub-agent summaries] --> CE
    TOOLS[Tool definitions] --> CE
    CE --> W[(Context window)]
    W --> LLM[LLM inference]
    LLM --> OUT([Turn output])
    LLM -.writes.-> NOTES

A model call is a function of everything in its context window at the moment of inference. Two calls with identical user queries can produce radically different outputs when the surrounding tokens differ: a different system prompt, a different set of tool definitions, a different batch of retrieved passages, a different slice of conversation history. The context window is a workspace, and what a team chooses to place on that workspace at each turn is a design decision.

Five strategies that compose

The discipline splits into a small number of choices. No single strategy is sufficient on its own; none is optional at scale. All five are drawn from Anthropic's engineering writeups on agent design.

Right-altitude system prompts

A prompt that is too vague gives the model no signal. A prompt that is too specific hardcodes brittle decision logic the first unexpected input will break. The recommended practice is to state the role, the scope, the hard constraints, and a handful of edge-case examples, then let the model reason about everything else. Start minimal, fail the prompt against a set of real inputs, and add instructions only in response to specific failures.

Just-in-time retrieval

Store lightweight identifiers in the context rather than full payloads. When the model needs the payload, it issues a tool call and receives only the relevant fragment. The pattern preserves the window for reasoning rather than spending it on material the model may never use.

Compaction

Summarize prior history when the window approaches its limit. The discipline is to maximize recall first, then optimize precision: an under-compacted summary wastes tokens, but an over-compacted summary silently drops the fact the next turn will need. The safest compaction is clearing stale tool results, which rarely need to persist once they have been reasoned over.

Structured note-taking

Have the agent write persistent notes outside the window. Anthropic reports that a Claude model playing Pokemon maintained tallies, maps, and strategy learnings across thousands of steps entirely through self-written notes (Anthropic, 2025). The notes act as external memory; only the relevant ones are pulled back when needed.

Sub-agent context isolation

Run a focused subtask in a clean window and return a one-to-two-thousand-token summary. The caller never pays the token cost of the subtask's exploration. This strategy becomes a pattern of its own in the Multi-Agent stage of the series.

Two strategies in code

The excerpt below shows the Right-Altitude principle without a framework. The three variants of the same call expose how much behavior changes when only the system prompt changes.

from openai import OpenAI

client = OpenAI()

UNDER_SPECIFIED = "You are a helpful assistant."

OVER_SPECIFIED = (
    "If return is within 30 days, approve it. If after 30 days, deny. "
    "If the item is damaged, deny. Output 'APPROVED' or 'DENIED'."
)

RIGHT_ALTITUDE = (
    "You are a customer service agent for an electronics retailer.\n"
    "Policy: standard returns within 30 days in original condition. "
    "Defective items covered under 90-day warranty. Physical damage by the "
    "customer is not covered.\n"
    "Guidelines: be empathetic but honest about policy. "
    "Offer to escalate when a case is ambiguous."
)

def answer(system: str, query: str) -> str:
    return client.responses.create(model="gpt-4o-mini",
                                   instructions=system,
                                   input=query).output_text

The excerpt below shows just-in-time retrieval as a tool in LangChain. The knowledge base is not preloaded into the prompt; the model sees only identifiers and pulls the payload when needed.

from langchain.chat_models import init_chat_model
from langchain_core.tools import tool

KB = {"returns": "30-day standard, 90-day defective, ...",
      "shipping": "3-5 business days domestic, ..."}

@tool
def load_article(article_id: str) -> str:
    """Load the full text of a knowledge base article by id."""
    return KB.get(article_id, "not found")

SYSTEM = (
    "You are a customer service agent. Knowledge base articles available: "
    f"{list(KB.keys())}. Call load_article(id) when you need the full policy."
)

model = init_chat_model("gpt-4o-mini").bind_tools([load_article])

def answer(query: str) -> str:
    return model.invoke([("system", SYSTEM), ("user", query)]).content

Compaction is typically a scheduled LLM call that rewrites the message history. Structured note-taking is a tool the agent calls. Sub-agent isolation is covered later in this series. Full examples live at github.com/subodhjena/agentic-patterns under examples/03_context_engineering.py.

Four anti-patterns

Every strategy has a failure mode. Anthropic's guide names four that account for most degradation observed in production.

Exhaustive edge-case lists in the prompt. The prompt becomes a runbook, and the agent follows the closest rule instead of reasoning. Compress the list into principles and a handful of examples.
Pre-loading all potentially relevant data. The window is full before the user has said anything. Attention dilutes and the model fixates on the wrong passage. Store identifiers and retrieve on demand.
Over-compaction. The summary removes a subtle fact that the next turn needs. Compact only tool results by default; compact conversation history only with a prompt designed to preserve constraints, decisions, and open questions.
Notes that nobody reads. The agent writes notes but never consults them. The retrieval of notes must be a tool the agent is prompted to use; writing without reading is dead weight.

A system that "generally feels off" is usually suffering from two or three of these at once. Diagnosing them in isolation is faster than rewriting the whole prompt.

Adopting the strategies in order

Each strategy pays a cost in complexity for a win on a specific axis. Under deadline pressure, the order of adoption matters.

Strategy	What you gain	What you pay
Right-altitude system prompt	Robustness across unseen inputs	Time in prompt design, more evaluation runs
Just-in-time retrieval	Smaller windows, sharper attention	Extra tool calls, retrieval pipeline to maintain
Compaction	Unbounded conversation length	Occasional loss of subtle context, extra model calls
Structured note-taking	Memory across long tasks without a full memory system	Discipline about when to read notes back
Sub-agent isolation	Clean windows for focused subtasks	Coordination cost, output-contract discipline

Over time, production systems adopt most of these at once. Early, the question is which one to invest in first, and that depends on which failure is hurting the system most.

Neighbors in the series

Structured output, the previous article, is the prerequisite that makes sub-agent summaries reliable. The augmented LLM article names retrieval, tools, and memory as the raw primitives; context engineering is the discipline that decides how those primitives show up in the window on any given turn. Prompt chaining, covered next, is itself an application of context engineering: each step in the chain gets a narrower window than a one-shot call would use. Short-term and long-term memory, covered later, formalizes the persistence side of what structured note-taking does informally.

References

Anthropic. Effective context engineering for AI agents. 2024.
Anthropic. Claude plays Pokemon: a case study in long-horizon behavior. 2025.
Anthropic. Building effective agents. December 2024.
Google Cloud. Prompt design strategies. 2024.
LangChain. Context management for agents. 2024.

agentic-patterns context-engineering prompt-engineering ai llm

← Back to all posts