Scaling and Cost Optimization for LLM Agentic Systems

Production agentic systems live under two constraints that research systems do not. Tokens cost money, and bad architecture choices get expensive fast. Multi-agent systems amplify the second constraint: adding agents adds calls, and adding calls can either multiply capability or multiply error.

Google DeepMind's "Towards a Science of Scaling Agent Systems" tested 180 configurations across five architectures, four benchmarks, and three model families (Han et al., 2024). The results name specific conditions under which multi-agent systems degrade. Anthropic's production guidance reports a complementary finding: a strong lead model coordinating cheaper worker models outperformed the same strong model working alone by about 90 percent on an internal evaluation. Both lines of work converge on a practical takeaway: the right architecture depends on the task, and the wrong architecture amplifies errors rather than capability.

This article covers the three scaling findings from the DeepMind paper, the Anthropic production heuristics for agent count and tool budget, the expensive-planner-cheap-worker cost pattern, and the latency shape that determines whether parallelization is worth its cost.

Three scaling findings

The DeepMind paper reports three findings that inform every multi-agent design decision.

Tool-coordination tradeoff. Tasks requiring more than sixteen tools show disproportionate error amplification in multi-agent setups. The communication overhead fragments reasoning: each agent sees a subset of the tools and makes decisions with incomplete information about what other agents can do. The finding implies a ceiling: when the tool count passes sixteen, simple multi-agent patterns start hurting rather than helping. The fix is not to add more agents; it is to reorganize the tool structure (skills, tool search, hierarchical delegation).

Capability saturation. Adding agents yields diminishing returns past a threshold. A three-agent team outperforms a one-agent team on the right task. A thirty-agent team rarely outperforms a five-agent team on the same task. The right agent count is task-specific and usually smaller than teams assume.

Topology-dependent error amplification. This is the strongest finding. The same task, run across different multi-agent topologies, produces radically different error rates.

  • Independent agents with no communication: 17.2x error amplification over single-agent baseline.
  • Centralized coordination with an orchestrator: 4.4x error amplification.
  • Parallelizable tasks with centralized coordination: up to 81 percent improvement.
  • Sequential planning tasks across multi-agent variants: 39 to 70 percent degradation.

The implication is sharp. In these experiments, multi-agent architecture helped only on parallelizable tasks with centralized coordination. On sequential planning tasks, every multi-agent variant measured degraded performance. The DeepMind paper's predictive model correctly identified the optimal architecture for 87 percent of unseen configurations, suggesting that architecture selection is learnable from task characteristics.
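The findings above can be condensed into a rule-of-thumb selector. The sketch below is a hand-written heuristic, not the paper's predictive model; `TaskProfile`, `suggest_architecture`, the returned labels, and the use of sixteen tools as a hard cutoff are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    parallelizable: bool  # can subtasks run without seeing each other's output?
    tool_count: int       # number of distinct tools the task needs


TOOL_CEILING = 16  # assumed cutoff, borrowed from the tool-coordination finding


def suggest_architecture(task: TaskProfile) -> str:
    """Return a coarse architecture suggestion (heuristic only)."""
    if not task.parallelizable:
        # Sequential planning degraded under every multi-agent variant measured.
        return "single-agent"
    if task.tool_count > TOOL_CEILING:
        # Reorganize the tool structure before adding agents.
        return "restructure-tools-first"
    # Parallelizable work with a manageable tool set: use an orchestrator.
    return "centralized-orchestrator"
```

A real selector would be fitted to measured task characteristics; this version only encodes the direction of each finding.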

Anthropic's scaling heuristics

Anthropic's production writeups report practical agent counts for different task shapes.

Simple fact-finding tasks:    1 agent  /  3-10 tool calls
Moderate research:            3-5 agents  /  10-30 total tool calls
Complex investigation:        10+ agents  /  divided responsibilities
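One way to put the table to work is as a guardrail that flags runs exceeding the suggested budget. The sketch below is illustrative: the upper bounds are copied from the table, while the `Budget` class, the dictionary keys, and `over_budget` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Budget:
    max_agents: Optional[int]      # None: the guidance states no upper bound
    max_tool_calls: Optional[int]  # None: the guidance states no figure


# Upper bounds taken from the table above; the keys are made-up labels.
BUDGETS = {
    "simple": Budget(max_agents=1, max_tool_calls=10),
    "moderate": Budget(max_agents=5, max_tool_calls=30),
    "complex": Budget(max_agents=None, max_tool_calls=None),
}


def over_budget(task_shape: str, agents: int, tool_calls: int) -> bool:
    """Return True when a run used more agents or tool calls than suggested."""
    b = BUDGETS[task_shape]
    too_many_agents = b.max_agents is not None and agents > b.max_agents
    too_many_calls = b.max_tool_calls is not None and tool_calls > b.max_tool_calls
    return too_many_agents or too_many_calls
```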

Token usage explains roughly 80 percent of performance variance in Anthropic's tests. A single-agent system that burns many tokens often matches a multi-agent system. The multi-agent win appears when the work is genuinely parallelizable and the cost of coordination is smaller than the cost of sequential execution.

Anthropic's own multi-agent research system uses Opus (the strongest, most expensive model) as the lead researcher and Sonnet (a smaller, cheaper model) for three to five subagents. This configuration outperformed single-agent Opus by 90.2 percent on their internal benchmark.

The expensive-planner-cheap-worker pattern

Most agent workloads have two distinct cost profiles inside them. Planning and synthesis need a strong model; execution and routine tool-calling do not. Routing the two to different models produces substantial cost reductions at near-identical quality.

from langchain_anthropic import ChatAnthropic

planner = ChatAnthropic(model="claude-sonnet-4-6")    # strong reasoning
worker  = ChatAnthropic(model="claude-haiku-4-5")     # fast, cheap execution

The cost shape is stark.

All Sonnet:      $$$$$    (planning + N workers at Sonnet price)
Sonnet + Haiku:  $$       (savings bounded by the Sonnet-to-Haiku price ratio)

For tasks with a fan-out of ten or twenty workers, the savings dominate the bill. Haiku may be slightly less reliable on any given call, but with clear subtask descriptions (covered in the orchestrator-workers article) the reliability gap is narrow and the cost gap is wide.
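A back-of-envelope model makes the fan-out effect concrete. The sketch below is illustrative only: the prices are placeholder numbers at a 3:1 ratio rather than published rates, the token counts are invented, and `task_cost` and `savings` are hypothetical helpers. It shows that savings grow with the number of workers and can never exceed one minus the cheap-to-strong price ratio.

```python
def task_cost(planner_tokens: int, worker_tokens: int, n_workers: int,
              planner_price: float, worker_price: float) -> float:
    """Cost of one task; prices are per million tokens."""
    planner = planner_tokens * planner_price
    workers = n_workers * worker_tokens * worker_price
    return (planner + workers) / 1_000_000


STRONG_PRICE = 3.0  # placeholder price per million tokens, not a real rate
CHEAP_PRICE = 1.0   # placeholder price per million tokens, not a real rate


def savings(n_workers: int, planner_tokens: int = 20_000,
            worker_tokens: int = 30_000) -> float:
    """Fractional saving of the tiered setup over running everything on the strong model."""
    all_strong = task_cost(planner_tokens, worker_tokens, n_workers,
                           STRONG_PRICE, STRONG_PRICE)
    tiered = task_cost(planner_tokens, worker_tokens, n_workers,
                       STRONG_PRICE, CHEAP_PRICE)
    return 1 - tiered / all_strong


print(round(savings(1), 3))   # 0.4
print(round(savings(10), 3))  # 0.625
```

With these placeholder numbers the ceiling is 1 - 1/3, about 67 percent; a wider price gap between tiers raises the ceiling.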

The same principle applies to workflow patterns inside a single agent. A ReAct loop that uses a strong model for the first and last steps and a cheaper model for routine tool calls in the middle can cut costs substantially, often with little quality loss. The structure tends to pay off, but teams should measure both cost and quality before committing.
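The tiered loop can be sketched without any provider SDK by passing the models in as plain callables. Everything here is an assumption made for illustration: the `tiered_loop` name, the `"DONE"` sentinel the cheap model returns when it has nothing left to do, and the transcript format.

```python
from typing import Callable

Model = Callable[[str], str]  # prompt in, text out


def tiered_loop(goal: str, strong: Model, cheap: Model,
                run_tool: Callable[[str], str], max_steps: int = 5) -> str:
    """ReAct-style loop: strong model plans and concludes, cheap model acts."""
    # First step: the strong model writes the plan.
    transcript = [f"Goal: {goal}", "Plan: " + strong(goal)]
    for _ in range(max_steps):
        # Middle steps: the cheap model picks the next routine tool call.
        action = cheap("\n".join(transcript))
        if action == "DONE":  # assumed stop signal
            break
        transcript.append(f"Action: {action}")
        transcript.append(f"Observation: {run_tool(action)}")
    # Last step: the strong model writes the final answer.
    return strong("\n".join(transcript) + "\nWrite the final answer.")
```

In practice `strong` and `cheap` would wrap calls to two model tiers, and the stop condition would come from structured output rather than a string match.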

The latency shape

flowchart LR
    subgraph Sequential["Sequential execution"]
        direction LR
        P1[Plan] --> W1[Worker 1] --> W2[Worker 2] --> W3[Worker 3] --> OUT1([Done])
    end
    subgraph Parallel["Parallel execution"]
        direction LR
        P2[Plan] --> FAN{Fan out}
        FAN --> PW1[Worker 1]
        FAN --> PW2[Worker 2]
        FAN --> PW3[Worker 3]
        PW1 --> OUT2([Done])
        PW2 --> OUT2
        PW3 --> OUT2
    end

The latency math for parallel versus sequential execution is simple but worth stating.

Sequential:   latency = plan_time + (worker_time × N)
Parallel:     latency = plan_time + max(worker_times)

For N equal to ten and worker_time equal to five seconds, sequential takes fifty seconds plus plan time; parallel takes five seconds plus plan time. The gap is substantial. For user-facing agents where latency matters, parallel is often the only viable shape for multi-worker tasks.

The caveat is that the parallel shape requires the subtasks to be genuinely independent. When subtasks depend on each other, serialization is unavoidable; attempting to parallelize produces the error amplification the DeepMind paper measured.
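The two formulas translate directly into code. This is a minimal sketch with hypothetical function names; it generalizes `worker_time × N` to a sum so that workers may take different times, and, like the formulas, it leaves out synthesis time.

```python
def sequential_latency(plan_time: float, worker_times: list[float]) -> float:
    """Workers run one after another, so their times add up."""
    return plan_time + sum(worker_times)


def parallel_latency(plan_time: float, worker_times: list[float]) -> float:
    """Workers run concurrently, so the slowest one sets the pace."""
    return plan_time + max(worker_times, default=0.0)


# Ten workers at five seconds each, with an assumed two-second plan step.
print(sequential_latency(2.0, [5.0] * 10))  # 52.0
print(parallel_latency(2.0, [5.0] * 10))    # 7.0
```

When worker times vary, the parallel figure tracks the slowest worker, so one straggler can erase much of the gain.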

Two versions in code

The excerpt below shows the expensive-planner-cheap-worker split without a framework. The planner uses a strong model; workers use a cheaper model; parallelism is asyncio.gather.

import asyncio
from openai import AsyncOpenAI
from pydantic import BaseModel

# One async client for every call, so nothing blocks the event loop.
client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

class TaskPlan(BaseModel):
    subtasks: list[str]

async def plan(goal: str) -> TaskPlan:
    r = await client.responses.parse(
        model="gpt-4o",                 # strong model
        instructions="Break the goal into 3-5 subtasks.",
        input=goal, text_format=TaskPlan)
    return r.output_parsed              # the parsed TaskPlan instance

async def worker(subtask: str) -> str:
    r = await client.responses.create(
        model="gpt-4o-mini",            # cheap model
        instructions="Complete the subtask concisely.",
        input=subtask)
    return r.output_text

async def run(goal: str) -> str:
    p = await plan(goal)
    # Fan out: all workers run concurrently on the cheap model.
    results = await asyncio.gather(*(worker(s) for s in p.subtasks))
    r = await client.responses.create(
        model="gpt-4o",                 # strong model again for synthesis
        instructions="Synthesize the subtask results.",
        input=f"Goal: {goal}\n\nResults:\n" + "\n".join(results))
    return r.output_text

The LangGraph version uses two init_chat_model instances and the Send API for dynamic fan-out. The tiering is explicit in the node configuration.

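from operator import add
from typing import Annotated, TypedDict

from langchain.chat_models import init_chat_model
from langgraph.graph import StateGraph, START, END
from langgraph.types import Send
from pydantic import BaseModel

planner_model = init_chat_model("claude-sonnet-4-6")   # strong tier
worker_model = init_chat_model("claude-haiku-4-5")     # cheap tier

class TaskPlan(BaseModel):
    subtasks: list[str]

class State(TypedDict):
    goal: str
    plan: TaskPlan
    results: Annotated[list[str], add]   # reducer merges parallel worker outputs
    final: str

# Nodes return only the keys they update. Returning {**s, ...} would send
# "results" back through the add reducer and duplicate every entry.
def plan_node(s: State) -> dict:
    structured = planner_model.with_structured_output(TaskPlan)
    return {"plan": structured.invoke(s["goal"])}

def fan_out(s: State) -> list[Send]:
    # One Send per subtask: the worker count is decided at run time.
    return [Send("worker", {"subtask": t}) for t in s["plan"].subtasks]

def worker_node(payload: dict) -> dict:
    return {"results": [worker_model.invoke(payload["subtask"]).content]}

def synthesize_node(s: State) -> dict:
    return {"final": planner_model.invoke(
        f"Goal: {s['goal']}\nResults:\n" + "\n".join(s["results"])).content}

builder = StateGraph(State)
builder.add_node("plan", plan_node)
builder.add_node("worker", worker_node)
builder.add_node("synthesize", synthesize_node)
builder.add_edge(START, "plan")
builder.add_conditional_edges("plan", fan_out, ["worker"])
builder.add_edge("worker", "synthesize")
builder.add_edge("synthesize", END)
graph = builder.compile()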

Full runnable versions will live at github.com/subodhjena/agentic-patterns under examples/27_cost_optimization.py as that lesson lands.

Where scaling strategies break

The strategies above have known failure modes.

Planner-worker mismatch. A planner that produces subtasks the worker model cannot handle wastes both calls. The subtasks must fit the worker's capability. Test the worker on realistic subtasks before committing to the split.

Parallelism without independence. Firing up ten workers on a sequential task produces ten workers stepping on each other. The DeepMind paper's 17.2x error amplification for independent agents was measured on tasks where communication would have been needed. Parallelize only what is actually independent.

Cost optimization at the wrong layer. Optimizing tokens on a call that runs a thousand times a day matters. Optimizing tokens on a call that runs ten times a day rarely matters. Profile usage before optimizing.
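Profiling can start as a simple ranking of call sites by daily spend. The sketch below is a minimal illustration; the call-site names, token counts, and prices are invented, and `rank_by_daily_spend` is a hypothetical helper.

```python
def rank_by_daily_spend(
    call_sites: dict[str, tuple[int, int, float]],
) -> list[tuple[str, float]]:
    """Rank call sites by daily cost, most expensive first.

    Each value is (calls_per_day, tokens_per_call, price_per_million_tokens).
    """
    costs = {
        name: calls * tokens * price / 1_000_000
        for name, (calls, tokens, price) in call_sites.items()
    }
    return sorted(costs.items(), key=lambda item: item[1], reverse=True)


# Invented numbers: a small call that runs often versus a large call that runs rarely.
usage = {
    "router": (1000, 2_000, 1.0),
    "report": (10, 50_000, 3.0),
}
print(rank_by_daily_spend(usage))  # [('router', 2.0), ('report', 1.5)]
```

Here the small, frequent call costs more per day than the large, rare one, which is the case the paragraph above warns about.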

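Over-aggressive caching. Prompt caching helps when prompt prefixes are stable. It can cost more than no caching when prefixes change frequently, because some providers bill cache writes at a premium and a write that is never read back is pure overhead. Measure cache hit rates before assuming caching saves money.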
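The break-even hit rate follows from two price multipliers. The sketch below assumes a simplified billing model in which every miss is billed as a cache write and every hit as a cache read; the multipliers in the example (1.25 for writes, 0.1 for reads) are placeholders to be replaced with the provider's actual pricing, and both function names are hypothetical.

```python
def cached_cost_multiplier(hit_rate: float, write_mult: float,
                           read_mult: float) -> float:
    """Expected prefix cost relative to no caching (1.0 means no change)."""
    return hit_rate * read_mult + (1 - hit_rate) * write_mult


def break_even_hit_rate(write_mult: float, read_mult: float) -> float:
    """Hit rate above which caching is cheaper than not caching."""
    if write_mult <= 1.0:
        return 0.0  # writes carry no premium, so caching never costs extra
    return (write_mult - 1.0) / (write_mult - read_mult)


rate = break_even_hit_rate(write_mult=1.25, read_mult=0.1)
print(round(rate, 3))  # 0.217
```

Under these placeholder multipliers, a prefix that is reused on fewer than about one request in five costs more with caching than without.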

Synthesizer overload. A planner-worker split with ten workers feeds ten results into a synthesizer, which may overflow its context window. Cap the number of workers and compact their outputs before synthesis.
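Both guards are a few lines each. In this sketch, plain truncation stands in for real compaction (a summarization call would usually do better), character counts stand in for token counts, and the helper names and default limits are hypothetical.

```python
def cap_subtasks(subtasks: list[str], max_workers: int = 8) -> list[str]:
    """Keep at most max_workers subtasks so the fan-out stays bounded."""
    return subtasks[:max_workers]


def compact(result: str, max_chars: int = 2000) -> str:
    """Truncate one worker result; a crude stand-in for summarization."""
    if len(result) <= max_chars:
        return result
    return result[:max_chars] + " [truncated]"


def synthesis_input(goal: str, results: list[str], max_chars: int = 2000) -> str:
    """Build the synthesizer prompt from compacted worker results."""
    body = "\n".join(compact(r, max_chars) for r in results)
    return f"Goal: {goal}\n\nResults:\n{body}"
```

The worst-case prompt size is then roughly `max_workers × max_chars` plus the goal, which can be checked against the synthesizer's window before the call is made.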

Fallback that defeats the savings. A cheap worker that frequently falls back to the expensive planner for retries turns the cost savings into a cost increase. Measure fallback rates; tune worker prompts until fallback is rare.
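The point at which fallback erases the savings can be estimated directly. The sketch below assumes the simplest possible policy: every subtask is tried once on the cheap model, and a failed attempt is retried exactly once on the strong model. The function names and the example costs are hypothetical.

```python
def effective_cost(cheap_cost: float, strong_cost: float,
                   fallback_rate: float) -> float:
    """Expected cost per subtask: the cheap attempt always runs, the strong retry sometimes."""
    return cheap_cost + fallback_rate * strong_cost


def break_even_fallback_rate(cheap_cost: float, strong_cost: float) -> float:
    """Fallback rate at which the tiered setup costs the same as strong-only."""
    return 1 - cheap_cost / strong_cost


# Invented per-subtask costs at a 3:1 ratio.
print(effective_cost(1.0, 3.0, 0.5))              # 2.5
print(round(break_even_fallback_rate(1.0, 3.0), 3))  # 0.667
```

Above the break-even rate, sending every subtask straight to the strong model would have been cheaper.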

Trade-offs against a naive single-model, single-agent baseline

The table summarizes the decisions under different constraints.

Axis                       Single-model single-agent   Tiered multi-agent with parallelism
Cost per task              Simple and known            Lower when workers are used well
Latency                    Bounded by single sequence  Bounded by slowest parallel branch
Implementation complexity  Low                         Medium
Error amplification risk   Low                         High if architecture is wrong
Parallelism                None inherent               Native
Fit                        Short, simple tasks         Research, multi-subtask synthesis

The scaling win is real but conditional. Teams that default to multi-agent architecture without measurement end up with systems that are more expensive and less reliable than a single agent would have been.

Neighbors in the series

Orchestrator-workers and parallelization, in the Workflows stage, are the patterns that this article optimizes. Supervisor patterns, in the Multi-Agent stage, are where the planner-worker split typically lives. Harness design, the previous article, describes the broader architecture the cost strategies plug into. The decision framework, next in the Synthesis stage, uses these scaling findings to decide between patterns.

References

  1. Anthropic. Building effective agents. December 2024.
  2. Han, Joshua, et al. Towards a Science of Scaling Agent Systems. Google DeepMind, 2024.
  3. Anthropic. How we built our multi-agent research system. June 2025.
  4. LangChain. LangGraph parallel branches and the Send API. 2024.
  5. OpenAI. Prompt caching documentation. 2024.