Generative Agents Memory: The Stanford Architecture for Persistent LLM Agents

April 22, 2026 · 8 min read

In 2023, Park and collaborators at Stanford and Google DeepMind built a small simulation called Smallville. Twenty-five LLM-powered agents lived in it. They woke, made plans, took actions, spoke with each other, and went to sleep. Over the course of several simulated days they formed relationships, organized a Valentine's Day party, and spread invitations through their social network, all without explicit instructions to do so. The study's lasting contribution was not the simulation itself but the memory architecture that made coherent behavior possible over that timespan (Park et al., 2023).

The architecture became the reference design for persistent agent memory. CrewAI's production memory system, the reflection patterns in later multi-agent frameworks, and most serious attempts at building long-running agents all trace back to three pillars described in the Smallville paper: the memory stream, the retrieval scoring function, and reflection. This article describes each pillar and the role reflection plays in keeping agent behavior coherent across time.

Three pillars

flowchart LR
    OBS[Observations from environment] --> STREAM[(Memory stream)]
    STREAM --> RET[Retrieval: recency + importance + relevance]
    RET --> CONTEXT[Context for current action]
    STREAM --> REF[Reflection: synthesize higher-level conclusions]
    REF --> STREAM

The memory stream is a timestamped log of everything the agent perceives. Each entry has a natural-language description, a creation timestamp, a last-accessed timestamp, and an importance score (from one to ten, assigned by the LLM at creation time). Perception is broadly defined: observations of the environment, things other agents said, things the agent itself said or did, and, once the reflection pillar fires, its own reflections.

The retrieval scoring function decides which memories enter the prompt on any given turn. Three signals combine into a single score. Recency decays exponentially with time since the memory was last accessed. Importance is the score assigned when the memory was created, with mundane observations ranked low and meaningful events ranked high. Relevance is embedding similarity between the memory and the current context. The final score is a weighted sum; Stanford reports roughly equal weights as a reasonable default.
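The weighted sum can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `Memory` fields, the `rank` function, and the sample values are hypothetical, relevance is assumed to be precomputed, and the 0.995 per-hour decay and min-max rescaling of each signal follow what the paper reports.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    hours_since_access: float  # simulated hours since last retrieval
    importance: float          # 1 to 10, assigned at creation
    relevance: float           # similarity to the query, assumed precomputed


def min_max(values: list[float]) -> list[float]:
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank(memories: list[Memory], w_recency: float = 1.0,
         w_importance: float = 1.0, w_relevance: float = 1.0,
         decay: float = 0.995) -> list[Memory]:
    """Order memories by the weighted sum of three rescaled signals."""
    recency = min_max([decay ** m.hours_since_access for m in memories])
    importance = min_max([m.importance for m in memories])
    relevance = min_max([m.relevance for m in memories])
    scored = [
        (w_recency * r + w_importance * i + w_relevance * s, m)
        for r, i, s, m in zip(recency, importance, relevance, memories)
    ]
    return [m for _, m in sorted(scored, key=lambda p: p[0], reverse=True)]


# Illustrative values only
memories = [
    Memory("ate breakfast", 1, 1, 0.10),
    Memory("Klaus mentioned his research deadline", 48, 8, 0.80),
    Memory("walked past the library", 2, 2, 0.30),
]
top = rank(memories)
```

With equal weights, the older but important and relevant memory about Klaus ranks first. Raising `w_recency` to 5 pushes it to the bottom, which shows how strongly a single weight can reshape the context.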

Reflection is the pillar that separates this architecture from plain retrieval. Periodically, the agent reads the memories that have accumulated recently and synthesizes higher-level conclusions from them. These reflections are stored back in the memory stream like any other entry, and they can themselves become inputs to later reflections. The mechanism is explicit: when the cumulative importance of recent observations crosses a threshold, a reflection call fires.

Why reflection matters

The paper's strongest ablation is the one on reflection. When the reflection pillar was removed, agents still behaved coherently in the short term (a few minutes of simulated time) because retrieval surfaced the observations that mattered. Over longer horizons (hours to days), behavior degraded. Agents forgot recurring themes, failed to develop personalities that stayed consistent, and struggled to plan based on patterns across many observations.

Reflection fixes this by compressing many observations into a few abstract conclusions. The authors give a canonical example. An agent makes the observations "Klaus saw papers on his desk" and "Klaus talked about his research project" and "Klaus stayed up late in the library." Individually, these are low-importance observations. Together, they compose into a reflection: "Klaus is busy with an important research deadline." The reflection is short, high-importance, and relevant to any future context involving Klaus. The agent can retrieve the reflection more reliably than it could retrieve the three raw observations, and the reflection conveys more than their sum.

Reflections can reflect on reflections. Over many cycles, the memory stream accumulates a layered structure: raw observations at the bottom, low-level reflections in the middle, higher-level reflections above, and, at the top, durable conclusions about people, places, and patterns that behave almost like personality. The paper describes this layering as a tree of reflections, with observations at the leaves and increasingly abstract thoughts above them, and it is part of what makes the emergent social behavior in Smallville possible.
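One way to make the layering concrete is to record which entries each reflection was built from. The sketch below is an assumption for illustration (the `Node` class and its `depth` property are hypothetical, not part of the paper): observations have depth 0, and a reflection sits one level above its deepest source.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    sources: list["Node"] = field(default_factory=list)

    @property
    def depth(self) -> int:
        """0 for raw observations, otherwise one more than the deepest source."""
        if not self.sources:
            return 0
        return 1 + max(s.depth for s in self.sources)


# Three raw observations, taken from the example above
o1 = Node("Klaus saw papers on his desk")
o2 = Node("Klaus talked about his research project")
o3 = Node("Klaus stayed up late in the library")

# A first-level reflection built from the observations
r1 = Node("Klaus is busy with an important research deadline", [o1, o2, o3])

# A hypothetical second-level reflection that builds on the first
r2 = Node("Klaus prioritizes work over social plans", [r1, o3])
```

Storing `sources` also gives a later audit step something to check when a reflection looks suspicious.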

Two versions in code

The excerpt below shows the three pillars without a framework. The memory stream is a list of typed entries; retrieval is a composite scoring function; reflection is an LLM call triggered by an importance threshold.

import math
from datetime import datetime

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REFLECTION_THRESHOLD = 30  # cumulative importance that triggers a reflection
DECAY_HOURS = 72           # time constant of the recency decay, in hours

class Entry(BaseModel):
    text: str
    created: datetime
    accessed: datetime
    importance: float       # 1 to 10
    embedding: list[float]  # cached at write time so recall does not re-embed

class Score(BaseModel):
    importance: float

def embed(text: str) -> list[float]:
    r = client.embeddings.create(model="text-embedding-3-small", input=text)
    return r.data[0].embedding

def assess_importance(text: str) -> float:
    r = client.responses.parse(
        model="gpt-4o-mini",
        instructions="Rate importance 1-10: 1=mundane (ate lunch), 10=major event.",
        input=text, text_format=Score)
    # output_parsed holds the parsed Score; clamp in case the model strays
    return min(10.0, max(1.0, r.output_parsed.importance))

class Stream:
    def __init__(self):
        self.entries: list[Entry] = []
        self.recent_importance = 0.0

    def _add(self, text: str, importance: float) -> None:
        now = datetime.now()
        self.entries.append(Entry(text=text, created=now, accessed=now,
                                  importance=importance, embedding=embed(text)))

    def observe(self, text: str) -> None:
        imp = assess_importance(text)
        self._add(text, imp)
        self.recent_importance += imp
        if self.recent_importance > REFLECTION_THRESHOLD:
            self.reflect()

    def reflect(self) -> None:
        recent = self.entries[-20:]
        r = client.responses.create(
            model="gpt-4o-mini",
            instructions=("Given the recent observations, produce 2-3 higher-level "
                          "conclusions. Each conclusion on its own line."),
            input="\n".join(f"- {e.text}" for e in recent))
        for line in r.output_text.splitlines():
            line = line.strip()
            if line:
                # floor at 6 so reflections outrank routine observations
                self._add(line, max(assess_importance(line), 6.0))
        self.recent_importance = 0.0

    def recall(self, query: str, k: int = 5) -> list[Entry]:
        q = embed(query)
        now = datetime.now()
        scored = []
        for e in self.entries:
            hours = (now - e.accessed).total_seconds() / 3600
            rec = math.exp(-hours / DECAY_HOURS)
            # dot product of embeddings; the three signals are not min-max
            # normalized against each other (simplified)
            rel = sum(a * b for a, b in zip(q, e.embedding))
            score = rec + (e.importance / 10) + rel
            scored.append((score, e))
        scored.sort(reverse=True, key=lambda x: x[0])
        top = [e for _, e in scored[:k]]
        for e in top:
            e.accessed = now  # retrieval refreshes recency
        return top

The LangGraph version represents the memory stream as a persistent Store and schedules reflection as a periodic graph node that reads recent entries and writes new ones.

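import uuid
from datetime import datetime
from typing import TypedDict

from langchain.chat_models import init_chat_model
from langgraph.graph import StateGraph, START, END
from langgraph.store.memory import InMemoryStore

store = InMemoryStore()
model = init_chat_model("gpt-4o-mini")

class State(TypedDict):
    user_id: str
    recent_observations: list[str]
    since_last_reflection: float

def put_memory(user_id: str, text: str, importance: float) -> None:
    # A fresh UUID key keeps repeated observations from overwriting each other
    store.put(("agent", user_id, "memories"), str(uuid.uuid4()),
              {"text": text, "importance": importance,
               "created": datetime.now().isoformat()})

def observe_node(s: State) -> State:
    for obs in s["recent_observations"]:
        put_memory(s["user_id"], obs, 5.0)  # fixed importance; simplified
    # count every observation toward the reflection threshold
    added = 5.0 * len(s["recent_observations"])
    return {**s, "since_last_reflection": s["since_last_reflection"] + added}

def reflect_node(s: State) -> State:
    if s["since_last_reflection"] < 30:
        return s
    # returns up to 20 items; sort by "created" if strict recency matters
    memories = store.search(("agent", s["user_id"], "memories"), limit=20)
    body = "\n".join(f"- {m.value['text']}" for m in memories)
    reflection = model.invoke(f"Produce 2-3 conclusions from:\n{body}").content
    for line in reflection.splitlines():
        if line.strip():
            put_memory(s["user_id"], line.strip(), 8.0)
    return {**s, "since_last_reflection": 0.0}

# Wire the nodes so the reflection check runs after every observation pass
builder = StateGraph(State)
builder.add_node("observe", observe_node)
builder.add_node("reflect", reflect_node)
builder.add_edge(START, "observe")
builder.add_edge("observe", "reflect")
builder.add_edge("reflect", END)
graph = builder.compile()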

Full runnable versions will live at github.com/subodhjena/agentic-patterns alongside the short-term and long-term memory examples.

Tuning the composite score

The score has three weights and several secondary knobs. Each affects behavior in ways worth naming.

Semantic weight. Too high, and retrieval becomes a pure similarity search; the agent misses recent or important memories that are not phrased the way the current query is. Too low, and retrieval stops tracking the query; the agent keeps surfacing the same recent or important memories regardless of context.

Recency weight. Too high, and old memories never resurface; the agent has no long-term continuity. Too low, and old memories crowd out the relevant recent ones.

Importance weight. Too high, and routine observations never make it into context; the agent feels amnesic about small but pertinent details. Too low, and low-signal observations flood every query.

Recency decay half-life. Stanford used a decay factor of 0.995 per simulated hour, which works out to a half-life of roughly 138 simulated hours. Production agents running in real time often use half-lives of days to weeks. Tune this against realistic usage.
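Decay rates get quoted three ways: as a per-hour factor, as an exponential time constant, and as a half-life. The helpers below are a small sketch with hypothetical names for converting between them, so the knob can be compared across implementations. For example, the 72-hour time constant in the framework-free excerpt above corresponds to a half-life of about 50 hours.

```python
import math


def half_life_from_factor(decay_per_hour: float) -> float:
    """Hours until recency halves, given a per-hour factor such as 0.995."""
    return math.log(0.5) / math.log(decay_per_hour)


def factor_from_half_life(half_life_hours: float) -> float:
    """Per-hour decay factor that yields the requested half-life."""
    return 0.5 ** (1.0 / half_life_hours)


def half_life_from_time_constant(tau_hours: float) -> float:
    """Half-life of exp(-hours / tau)."""
    return tau_hours * math.log(2)


one_week = factor_from_half_life(7 * 24)  # per-hour factor for a one-week half-life
```

A per-hour factor of 0.995 comes out at about 138 hours. Working in half-lives tends to be easier to reason about when the agent runs on wall-clock time.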

Reflection threshold. Too low, and reflections fire constantly and dilute each other. Too high, and reflections never fire and behavior drifts. Stanford's threshold of roughly 150 cumulative importance points is a starting point; tune empirically.
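Before tuning against a live agent, it can help to replay a log of importance scores and count how often reflection would fire at a given threshold. The function below is a minimal sketch of that replay; its name and the constant-importance sample are assumptions for illustration, and it uses the same "accumulate, fire, reset" rule as the excerpt above.

```python
def count_reflections(importances: list[float], threshold: float) -> int:
    """Count how many times cumulative importance crosses the threshold."""
    total = 0.0
    fired = 0
    for imp in importances:
        total += imp
        if total > threshold:
            fired += 1
            total = 0.0  # reset after each reflection
    return fired


# 300 observations, each rated 5 (illustrative only)
log = [5.0] * 300
at_30 = count_reflections(log, 30)    # threshold used in the excerpt above
at_150 = count_reflections(log, 150)  # threshold reported for Smallville
```

On this sample, a threshold of 30 fires roughly five times as often as a threshold of 150, and each firing costs at least one LLM call plus the importance ratings for the new reflections.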

Where the architecture misfires

Common failures are specific to the pillars.

Importance inflation. An LLM asked to rate importance tends to rate almost everything high. When that happens, most memories tie near the top of the importance signal and it stops discriminating. Constrain importance to a rubric with explicit examples and verify the distribution on a held-out set.
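Verifying the distribution can be as simple as summarizing the scores the rater produced on the held-out set. The sketch below is one possible check; the function names and the cutoffs (`high`, `max_share_high`, `min_stdev`) are assumptions to be tuned, not values from the paper.

```python
from collections import Counter
from statistics import mean, pstdev


def importance_report(scores: list[float], high: float = 7.0) -> dict:
    """Summarize a batch of importance scores from a held-out set."""
    if not scores:
        raise ValueError("scores must not be empty")
    hist = Counter(round(s) for s in scores)
    return {
        "mean": mean(scores),
        "stdev": pstdev(scores),
        "share_high": sum(s >= high for s in scores) / len(scores),
        "histogram": dict(sorted(hist.items())),
    }


def looks_inflated(scores: list[float], max_share_high: float = 0.5,
                   min_stdev: float = 1.0) -> bool:
    """Flag a rater whose scores bunch at the top or barely vary."""
    report = importance_report(scores)
    return report["share_high"] > max_share_high or report["stdev"] < min_stdev


inflated = [8, 9, 8, 9, 10, 8, 9, 8]  # illustrative: everything rated high
healthy = [1, 2, 2, 3, 5, 8, 3, 1]    # illustrative: mostly mundane, one event
```

Running the check after every prompt or model change catches regressions before they flatten the retrieval score.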

Reflection that invents. A reflection call produces plausible but false conclusions when the underlying observations are noisy. Prompt reflections to cite specific memory entries and treat them as hypotheses rather than facts.
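Citations are only useful if they are checked. The sketch below assumes the reflection prompt asks for one conclusion per line, ending in a suffix such as `(because of 1, 3)` that points at numbered observations. The suffix format and the `parse_reflection` helper are assumptions for illustration. Lines without a valid citation are rejected rather than stored.

```python
import re

# claim text, then "(because of <comma-separated observation numbers>)"
CITATION = re.compile(r"^(?P<claim>.+?)\s*\(because of (?P<ids>[\d,\s]+)\)\s*$")


def parse_reflection(line: str, n_observations: int):
    """Return (claim, cited ids) or None if the citation is missing or invalid."""
    match = CITATION.match(line.strip())
    if not match:
        return None
    ids = [int(token) for token in re.findall(r"\d+", match.group("ids"))]
    # every cited id must refer to one of the numbered observations
    if not ids or any(i < 1 or i > n_observations for i in ids):
        return None
    return match.group("claim"), ids


ok = parse_reflection(
    "Klaus is busy with an important research deadline (because of 1, 3)", 3)
uncited = parse_reflection("Klaus dislikes the library", 3)
out_of_range = parse_reflection("Klaus is moving away (because of 7)", 3)
```

A valid citation does not prove the conclusion is true, so the stored reflection is still best treated as a hypothesis. It does make the claim traceable to specific entries.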

Stream bloat. A long-running agent accumulates tens of thousands of memories. Retrieval stays fast with an approximate vector index; the LLM-based importance assessment becomes the bottleneck. Batch assessments or use a smaller model for them.
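Batching can be sketched independently of any particular model. In the example below, `score_batch` is a hypothetical callable that rates a list of texts in one LLM call and returns one score per text. The batch size and the fallback score are likewise assumptions.

```python
from typing import Callable, Iterator


def chunks(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def assess_in_batches(texts: list[str],
                      score_batch: Callable[[list[str]], list[float]],
                      size: int = 20, default: float = 5.0) -> list[float]:
    """Rate texts in batches; fall back to `default` on a malformed reply."""
    scores: list[float] = []
    for chunk in chunks(texts, size):
        result = score_batch(chunk)
        if len(result) != len(chunk):
            # the rater returned the wrong number of scores for this batch
            result = [default] * len(chunk)
        scores.extend(result)
    return scores


# Stand-in rater for demonstration; a real one would call the model once per batch
calls: list[int] = []

def fake_rater(batch: list[str]) -> list[float]:
    calls.append(len(batch))
    return [3.0] * len(batch)

scores = assess_in_batches([f"observation {i}" for i in range(45)], fake_rater)
```

Forty-five observations cost three calls here instead of forty-five. The trade-off is that a single malformed reply degrades a whole batch to the default score.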

Forgotten reflections. Reflections scored purely on recency fade. Scoring them on importance fixes this, but only if importance inflation is controlled.

Context pollution. Too many retrieved memories drown out the current query. Top-k of three to five is the usable range; more than ten rarely helps.

Cross-agent leakage. In a multi-agent setting, the memory stream belongs to a single agent. Shared memory is a different pattern (covered in the shared scratchpad article) and should not be accidental.

Trade against simpler memory

Generative Agents Memory is substantially more than plain vector retrieval. The table compares the approaches.

Axis	Vector RAG	Generative agents memory
Retrieval signal	Similarity only	Similarity + recency + importance
Abstraction	None	Reflections synthesize observations
Freshness	Implicit	Explicit decay
Importance awareness	None	Explicit per-entry score
Write cost	One embedding	Embedding + importance call
Implementation complexity	Low	Medium
Fit	Short-session QA	Long-running persistent agents

For agents that live for minutes, vector RAG is sufficient. For agents that live for days, the Stanford architecture is the reference design.

Neighbors in the series

Short-term and long-term memory, the previous article, introduced the four kinds of memory and the composite score. This article goes deeper on the specific architecture Stanford proposed. Persistence and checkpointing, the next article, covers the infrastructure that makes the memory stream durable across sessions. Reflexion, in the Reasoning stage, is a narrower application of reflection: post-failure self-critique on trajectories rather than periodic synthesis on observations. Context engineering, in Foundations, covers the compaction side of memory management.

References

Park, Joon Sung, et al. Generative Agents: Interactive Simulacra of Human Behavior. UIST 2023.
Anthropic. Building effective agents. December 2024.
CrewAI. Memory systems documentation. 2024.
Shinn, Noah, et al. Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023.
LangChain. Long-term memory with Store. 2024.
