
Persistence and Checkpointing: Time Travel and Recovery for LLM Agents

April 24, 2026 · 8 min read

An agent that works once in a notebook is not an agent that works in production. The two differ in requirements that rarely surface in demos. Production agents need to survive process restarts, support human approvals mid-run, let engineers replay past executions for debugging, and continue from the last successful step when an error interrupts them. All four requirements share an underlying capability: the ability to snapshot the agent's state at every step and restore from any snapshot on demand.

Checkpointing is the name for this capability. LangGraph saves a state snapshot at every super-step of execution (LangChain, 2024); OpenAI's Assistants API persists thread state by default; Temporal and other workflow engines apply the same pattern at a larger granularity. The primitives are the same regardless of framework: a checkpointer that stores snapshots, a thread that groups snapshots for one conversation or task, and optionally a store that holds state across threads. This article covers the four capabilities checkpointing enables and the tradeoffs between the common backends.

What checkpoints enable

flowchart LR
    START([Start]) --> A[Node A]
    A --> B[Node B]
    B --> C[Node C]
    C --> END([End])
    A -.snapshot.-> CP1[(Checkpoint 1)]
    B -.snapshot.-> CP2[(Checkpoint 2)]
    C -.snapshot.-> CP3[(Checkpoint 3)]
    CP1 -.resume from.-> A
    CP2 -.resume from.-> B
    CP3 -.resume from.-> C

A checkpoint is a snapshot of the agent's entire state at a specific point in execution: messages, tool call history, working memory, intermediate results, and any state variables declared in the graph. The checkpointer writes one snapshot per super-step. From any saved checkpoint, the runtime can resume execution, either by continuing forward or by forking from the snapshot with modified state.
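
To make the primitive concrete, here is a minimal sketch of a checkpoint log, assuming state is a plain dict and snapshots live in process memory. The `CheckpointLog` class and its method names are hypothetical and are not part of any framework; they only illustrate the snapshot, resume, and fork operations described above.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class CheckpointLog:
    """Hypothetical in-memory checkpointer: one deep-copied snapshot per step."""
    snapshots: list = field(default_factory=list)

    def save(self, state: dict) -> int:
        # Deep-copy so later mutations of `state` cannot alter the snapshot.
        self.snapshots.append(copy.deepcopy(state))
        return len(self.snapshots) - 1  # index of the new checkpoint

    def resume(self, index: int = -1) -> dict:
        # Continue forward from a saved snapshot (the latest by default).
        return copy.deepcopy(self.snapshots[index])

    def fork(self, index: int, **overrides) -> dict:
        # Start a new branch from an old snapshot with modified state.
        state = self.resume(index)
        state.update(overrides)
        return state


log = CheckpointLog()
state = {"messages": [], "plan": "draft"}
for text in ["step A done", "step B done", "step C done"]:
    state["messages"].append(text)
    log.save(state)  # one snapshot per step

latest = log.resume()                  # state after step C
branch = log.fork(0, plan="revised")   # state after step A, with a new plan
```

Forking returns a copy, so the original history stays intact and both branches can be explored independently.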

Four capabilities fall out of this primitive.

Conversational memory. A thread accumulates state across runs. When a user sends a second message to the same thread, the runtime loads the last checkpoint, appends the new message, and resumes. The agent has no memory of its own; the checkpointer gives it continuity.

Human-in-the-loop. An agent can pause at a designated checkpoint, surface its state to a human for review, and resume once the human responds. The pause is free because the state is already checkpointed; the loop just waits on an external signal. Human-in-the-loop has a dedicated article later in this series; checkpointing is what makes it tractable.

Time travel. Any saved checkpoint can become the root of a new run. Engineers replay past executions for debugging, rewind to a decision point and take a different branch, or fork production traces into reproducible test cases. Without checkpointing, traces are read-only logs; with it, they become starting points.

Fault tolerance. When a process crashes mid-run, the next invocation loads the last successful checkpoint and resumes from there. There is no need to retry every step, because completed work is already durably saved.
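
The fault-tolerance behavior can be sketched in a few lines. This is an illustration under stated assumptions: the `run_with_recovery` helper is hypothetical, steps are plain functions from state to state, and a dict stands in for a durable checkpoint store.

```python
def run_with_recovery(steps, checkpoints: dict, thread_id: str) -> dict:
    """Run `steps` in order, skipping any already completed for this thread."""
    done, state = checkpoints.get(thread_id, (0, {}))
    for index in range(done, len(steps)):
        state = steps[index](dict(state))            # run one step on a copy
        checkpoints[thread_id] = (index + 1, state)  # record only after success
    return state


calls = []
attempts = {"summarize": 0}


def fetch(state: dict) -> dict:
    calls.append("fetch")
    return {**state, "data": [1, 2, 3]}


def summarize(state: dict) -> dict:
    calls.append("summarize")
    attempts["summarize"] += 1
    if attempts["summarize"] == 1:
        raise RuntimeError("simulated crash")  # fails on the first attempt only
    return {**state, "total": sum(state["data"])}


checkpoints = {}  # stand-in for a durable store such as SQLite or Postgres
try:
    run_with_recovery([fetch, summarize], checkpoints, "t1")
except RuntimeError:
    pass  # the process "dies" here; the checkpoint written after fetch survives

result = run_with_recovery([fetch, summarize], checkpoints, "t1")
```

On the second invocation only `summarize` runs; `fetch` is skipped because its result was checkpointed before the crash.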

Threads and stores

LangGraph and similar frameworks distinguish two scopes of persistence. Thread state is the conversation or task at hand; checkpoints save it. Cross-thread state is durable data that outlives any single conversation; the Store is the API that persists it across threads.

# Thread state: the conversation with thread_id "t1"
agent.invoke({"messages": [("user", "Hello")]},
             config={"configurable": {"thread_id": "t1"}})
# Later, same user, same thread: continues where it left off
agent.invoke({"messages": [("user", "What did I just say?")]},
             config={"configurable": {"thread_id": "t1"}})

# Cross-thread state: a fact that should survive any conversation
store.put(("user", user_id, "preferences"), "theme", {"value": "dark"})
results = store.search(("user", user_id, "preferences"))

Two concepts, two APIs, two different lifecycles. Thread state is automatic and framework-managed; cross-thread state is explicit and tool-driven. The short-term versus long-term memory split covered earlier in this series maps directly onto this: short-term lives in thread state, long-term lives in the store.
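
To show what the cross-thread side amounts to, the sketch below is a toy store that mirrors the shape of the `put` and `search` calls above. `NamespacedStore` is a hypothetical class written for illustration; it is not LangGraph's Store implementation, and prefix matching on the namespace tuple is an assumption about how lookups behave.

```python
class NamespacedStore:
    """Hypothetical cross-thread store keyed by (namespace tuple, key)."""

    def __init__(self) -> None:
        self._items = {}  # maps (namespace, key) -> value

    def put(self, namespace: tuple, key: str, value: dict) -> None:
        self._items[(namespace, key)] = value

    def search(self, prefix: tuple) -> list:
        # Return every (key, value) whose namespace starts with `prefix`.
        return [(key, value)
                for (namespace, key), value in self._items.items()
                if namespace[:len(prefix)] == prefix]


store = NamespacedStore()
store.put(("user", "u1", "preferences"), "theme", {"value": "dark"})
store.put(("user", "u2", "preferences"), "theme", {"value": "light"})

u1_prefs = store.search(("user", "u1"))  # only u1's items are visible
```

Because the user id is part of the namespace, a lookup scoped to one user cannot return another user's data.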

Two versions in code

The excerpt below shows the raw shape without a framework. State is a dict serialized to a SQLite row per step; resumption loads the last row for the given thread id.

import sqlite3
import json
from openai import OpenAI

client = OpenAI()
db = sqlite3.connect("agent.db")
db.execute("""CREATE TABLE IF NOT EXISTS checkpoints(
    thread_id TEXT, step INT, state TEXT, PRIMARY KEY (thread_id, step))""")

def load(thread_id: str) -> tuple[int, dict]:
    row = db.execute(
        "SELECT step, state FROM checkpoints WHERE thread_id=? "
        "ORDER BY step DESC LIMIT 1", (thread_id,)).fetchone()
    return (row[0], json.loads(row[1])) if row else (-1, {"messages": []})

def save(thread_id: str, step: int, state: dict) -> None:
    db.execute("INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
               (thread_id, step, json.dumps(state)))
    db.commit()

def step(thread_id: str, user_msg: str) -> str:
    current_step, state = load(thread_id)
    state["messages"].append({"role": "user", "content": user_msg})
    r = client.chat.completions.create(
        model="gpt-4o-mini", messages=state["messages"])
    reply = r.choices[0].message.content
    state["messages"].append({"role": "assistant", "content": reply})
    save(thread_id, current_step + 1, state)
    return reply
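
The same table layout also supports time travel. The sketch below is self-contained, so it redeclares the table against an in-memory database and leaves out the model call; `load_at` and `fork` are hypothetical helpers, not part of the excerpt above.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE checkpoints(
    thread_id TEXT, step INT, state TEXT, PRIMARY KEY (thread_id, step))""")


def save(thread_id: str, step: int, state: dict) -> None:
    db.execute("INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
               (thread_id, step, json.dumps(state)))
    db.commit()


def load_at(thread_id: str, step: int) -> dict:
    # Load one specific checkpoint rather than the latest one.
    row = db.execute(
        "SELECT state FROM checkpoints WHERE thread_id=? AND step=?",
        (thread_id, step)).fetchone()
    if row is None:
        raise KeyError(f"no checkpoint {step} for thread {thread_id!r}")
    return json.loads(row[0])


def fork(src_thread: str, step: int, new_thread: str) -> dict:
    # Copy an old snapshot into a new thread; the source history is untouched.
    state = load_at(src_thread, step)
    save(new_thread, 0, state)
    return state


messages = []
for step_no, text in enumerate(["first", "second", "third"]):
    messages.append({"role": "user", "content": text})
    save("t1", step_no, {"messages": messages})

forked = fork("t1", 1, "t1-debug")  # rewind to the state after "second"
```

Writing the fork under a new thread id keeps the production history read-only while the copy is replayed or modified.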

The LangGraph version uses SqliteSaver for local workflows and PostgresSaver for production. The prebuilt ReAct agent wired with a checkpointer gains persistence with no other changes.

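import sqlite3

from langchain.chat_models import init_chat_model
from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.sqlite import SqliteSaver

# In recent langgraph-checkpoint-sqlite releases, from_conn_string() is a
# context manager, so the saver is built from an explicit connection here.
conn = sqlite3.connect("agent.db", check_same_thread=False)
checkpointer = SqliteSaver(conn)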

agent = create_react_agent(
    model=init_chat_model("gpt-4o-mini"),
    tools=[...],
    checkpointer=checkpointer,
)

# First turn
agent.invoke({"messages": [("user", "Book me a flight to Tokyo")]},
             config={"configurable": {"thread_id": "user-42"}})

# A week later, same thread: state is loaded automatically
agent.invoke({"messages": [("user", "Change it to Osaka")]},
             config={"configurable": {"thread_id": "user-42"}})

# Time travel: resume from a specific checkpoint
history = list(agent.get_state_history(
    config={"configurable": {"thread_id": "user-42"}}))
# History is ordered newest first; this assumes at least four checkpoints exist
earlier_state = history[3]
# Resuming re-runs everything after that checkpoint, including tool calls
agent.invoke(None, config=earlier_state.config)

Full runnable versions will live at github.com/subodhjena/agentic-patterns under examples/22_persistence.py as that lesson lands.

Choosing a backend

LangGraph ships three checkpointer backends that cover most production needs.

InMemorySaver. Stores checkpoints in process memory. Fast and simple, but it loses everything on restart. Appropriate only for development, testing, and ephemeral workloads where persistence is not required.

SqliteSaver. Stores checkpoints in a local SQLite file. Durable across restarts, easy to deploy, suitable for single-machine workflows and small teams. Transaction boundaries align with checkpoint boundaries; restart recovery is automatic.

PostgresSaver. Stores checkpoints in a Postgres database. Suitable for production deployments, multi-process access, and high-throughput agent systems. Supports the same API as SqliteSaver; the substitution is a one-line change.

For most teams, SQLite is enough for development and Postgres for production. The distinction between them is operational, not functional; the agent code does not change.

When checkpointing costs more than it saves

Every checkpoint is a database write. For agents that run many short tasks with no user continuity, the write volume can exceed the value of the snapshots.

High-frequency, stateless calls. Classification services that handle thousands of requests per second often do not need per-call checkpointing. The right configuration is a checkpointer at the batch boundary, not the call boundary.

Short-lived, fire-and-forget tasks. A one-shot LLM call that produces a report and exits does not benefit from checkpointing. Spin up the agent, run, save the final output.

Large intermediate state. Agents that carry megabytes of intermediate tool results checkpoint slowly. Either compact aggressively (compaction is covered in the context engineering article) or move heavy artifacts to blob storage and keep pointers in the checkpoint.
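
The pointer approach can be sketched as follows. Everything here is an assumption made for illustration: a temporary directory stands in for blob storage, `MAX_INLINE_BYTES` is an arbitrary threshold, and `offload` and `restore` are hypothetical helpers.

```python
import hashlib
import json
import tempfile
from pathlib import Path

BLOB_DIR = Path(tempfile.mkdtemp())  # stand-in for an object store (assumption)
MAX_INLINE_BYTES = 1024              # hypothetical size threshold


def offload(state: dict) -> dict:
    """Replace large top-level values with {"$blob": <sha256>} pointers."""
    slim = {}
    for key, value in state.items():
        raw = json.dumps(value).encode()
        if len(raw) > MAX_INLINE_BYTES:
            digest = hashlib.sha256(raw).hexdigest()
            (BLOB_DIR / digest).write_bytes(raw)  # content-addressed write
            slim[key] = {"$blob": digest}
        else:
            slim[key] = value
    return slim


def restore(slim: dict) -> dict:
    """Resolve blob pointers back into full values."""
    full = {}
    for key, value in slim.items():
        if isinstance(value, dict) and set(value) == {"$blob"}:
            full[key] = json.loads((BLOB_DIR / value["$blob"]).read_bytes())
        else:
            full[key] = value
    return full


state = {"messages": ["hi"], "search_results": ["x" * 200] * 50}
checkpoint = offload(state)  # this is what would be written to the checkpointer
```

Content addressing means an unchanged artifact is written once, however many checkpoints point to it.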

Snapshot pollution. Without a retention policy, checkpoint tables grow without bound. Configure TTL, cap per-thread checkpoint counts, or archive older checkpoints to cheaper storage.
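
A per-thread cap is the simplest of these policies. The sketch below assumes a `checkpoints` table with `(thread_id, step)` as its primary key, as in the raw SQLite example earlier in this article; the `prune` function is hypothetical.

```python
import sqlite3


def prune(db: sqlite3.Connection, thread_id: str, keep_last: int) -> int:
    """Delete all but the newest `keep_last` checkpoints for one thread."""
    cursor = db.execute(
        """DELETE FROM checkpoints
           WHERE thread_id = ?
             AND step NOT IN (
                 SELECT step FROM checkpoints
                 WHERE thread_id = ?
                 ORDER BY step DESC LIMIT ?)""",
        (thread_id, thread_id, keep_last))
    db.commit()
    return cursor.rowcount  # number of rows removed


db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE checkpoints(
    thread_id TEXT, step INT, state TEXT, PRIMARY KEY (thread_id, step))""")
db.executemany(
    "INSERT INTO checkpoints VALUES (?, ?, ?)",
    [("t1", i, "{}") for i in range(10)] + [("t2", i, "{}") for i in range(2)])

removed = prune(db, "t1", keep_last=3)
```

Pruning trades away time travel to the deleted steps, so the cap should reflect how far back debugging realistically needs to reach.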

For most workloads, the tradeoff is easy: the capabilities are too valuable to skip. But "checkpoint everything, forever" is not a production default; retention and scope need explicit decisions.

Where persistence goes wrong

Common failures are operational.

Schema evolution without migration. Adding a field to the state type can break deserialization of old checkpoints. Treat the checkpoint schema as a versioned API; write migrations whenever the shape changes.
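
One way to version the schema is to stamp each serialized state and upgrade on load. This is a sketch under assumptions: state is JSON, the `tool_calls` field is a made-up example of a schema change, and all function names are hypothetical.

```python
import json

CURRENT_VERSION = 2


def _v1_to_v2(state: dict) -> dict:
    # Hypothetical change: version 2 added a `tool_calls` list to the state.
    return {**state, "tool_calls": [], "version": 2}


MIGRATIONS = {1: _v1_to_v2}  # from-version -> upgrade function


def serialize(state: dict) -> str:
    # Stamp every checkpoint with the schema version that wrote it.
    return json.dumps({**state, "version": CURRENT_VERSION})


def deserialize(raw: str) -> dict:
    state = json.loads(raw)
    version = state.get("version", 1)  # unversioned rows are treated as v1
    while version < CURRENT_VERSION:
        state = MIGRATIONS[version](state)  # apply upgrades one step at a time
        version = state["version"]
    return state


old_row = json.dumps({"messages": ["hi"]})  # written before versioning existed
upgraded = deserialize(old_row)
```

Chaining single-step migrations keeps each upgrade small and lets checkpoints of any age load through the same path.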

Thread id collisions. Using a non-unique thread id (a username instead of a user id plus session id) mixes different conversations into one thread. The state becomes muddled, and the agent responds to yesterday's context. Use unique, structured thread ids.
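
A structured id can be as simple as the sketch below. The tenant, user, and session components and the `:` separator are assumptions chosen for illustration; `make_thread_id` is a hypothetical helper.

```python
import uuid
from typing import Optional


def make_thread_id(tenant_id: str, user_id: str,
                   session_id: Optional[str] = None) -> str:
    """Build a thread id that is unique per tenant, user, and session."""
    session_id = session_id or uuid.uuid4().hex  # new session if none is given
    parts = (tenant_id, user_id, session_id)
    if any(":" in part or not part for part in parts):
        # Reject empty parts and separators to keep ids unambiguous.
        raise ValueError("id components must be non-empty and contain no ':'")
    return ":".join(parts)


first = make_thread_id("acme", "u42")
second = make_thread_id("acme", "u42")           # same user, new session
resumed = make_thread_id("acme", "u42", "s-001")  # explicit session to resume
```

Two sessions for the same user get distinct threads, while passing a known session id resumes an existing one.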

Concurrent writes to the same thread. Two requests with the same thread id, arriving in parallel, race on the checkpointer. Most backends serialize on the thread id; verify the guarantee before relying on it.
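
Where the backend offers no such guarantee, the race can at least be made visible. The sketch below assumes the `(thread_id, step)` primary key from the raw SQLite example and uses a plain `INSERT`, so a second writer for the same step fails instead of silently overwriting, which is what `INSERT OR REPLACE` would allow. `save_exclusive` and `StaleCheckpoint` are hypothetical names.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE checkpoints(
    thread_id TEXT, step INT, state TEXT, PRIMARY KEY (thread_id, step))""")


class StaleCheckpoint(Exception):
    """Another writer already saved this step for this thread."""


def save_exclusive(thread_id: str, step: int, state: dict) -> None:
    try:
        # Plain INSERT: the primary key rejects a second write of the same step.
        db.execute("INSERT INTO checkpoints VALUES (?, ?, ?)",
                   (thread_id, step, json.dumps(state)))
        db.commit()
    except sqlite3.IntegrityError as exc:
        db.rollback()
        raise StaleCheckpoint(f"{thread_id} step {step} already written") from exc


# Two requests both loaded step 0 and now try to write step 1.
save_exclusive("t1", 1, {"messages": ["from request A"]})
try:
    save_exclusive("t1", 1, {"messages": ["from request B"]})
    conflict = False
except StaleCheckpoint:
    conflict = True  # request B should reload the thread and retry
```

The losing request learns that its view of the thread is stale and can reload before retrying.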

Time travel with side effects. Resuming from an earlier checkpoint can re-execute side-effectful tool calls. A payment tool called on the resumed branch charges the card again. Make side-effectful tools idempotent or gate them through human-in-the-loop on resume.
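
Idempotency can be sketched with a ledger keyed on a caller-supplied idempotency key. The assumptions are significant: the key is taken to be something stable across replays (for example a tool call id already stored in the checkpointed state), the ledger would need to be durable in practice, and `IdempotentTool` is a hypothetical wrapper.

```python
class IdempotentTool:
    """Wrap a side-effectful function so replays return the recorded result."""

    def __init__(self, fn, ledger: dict) -> None:
        self.fn = fn
        self.ledger = ledger  # a dict here; must be durable storage in practice

    def __call__(self, idempotency_key: str, **kwargs):
        if idempotency_key not in self.ledger:
            # First time this key is seen: perform the side effect and record it.
            self.ledger[idempotency_key] = self.fn(**kwargs)
        return self.ledger[idempotency_key]


charges = []


def charge_card(amount_cents: int) -> str:
    charges.append(amount_cents)  # stands in for the real payment call
    return f"receipt-{len(charges)}"


ledger = {}
charge = IdempotentTool(charge_card, ledger)

first = charge("call_abc", amount_cents=4200)
replay = charge("call_abc", amount_cents=4200)  # resumed branch, same tool call
```

This only protects replays that reuse the same key. A fork that re-runs the model and produces a new tool call gets a new key, which is where a human approval gate still matters.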

Cross-thread leakage via store keys. A cross-thread store that uses a flat key namespace leaks data across users. Scope every store key with the user id or tenant id; enforce at the library level.

Checkpoint size bloat. An agent's messages list grows on every turn; without compaction, checkpoints grow linearly. Apply context compaction, which is covered in the context engineering article, before the state is serialized.
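
As a stand-in for real compaction, the sketch below uses plain truncation: keep system messages plus the most recent turns. This is a deliberately simpler technique than summarization, the `compact` helper and its `keep_last` default are hypothetical, and messages are assumed to be dicts with a `role` key.

```python
def compact(messages: list, keep_last: int = 6) -> list:
    """Keep system messages plus the most recent `keep_last` other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    recent = rest[-keep_last:] if keep_last > 0 else []
    return system + recent


history = [{"role": "system", "content": "You are a travel agent."}]
for turn in range(10):
    history.append({"role": "user", "content": f"question {turn}"})
    history.append({"role": "assistant", "content": f"answer {turn}"})

compacted = compact(history, keep_last=6)  # apply before serializing the state
```

Truncation can separate a tool call from its result, so a production version would need to cut on turn boundaries.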

Tradeoffs against stateless agents

For many use cases, a stateless agent that takes everything it needs as input is simpler. The table below compares the two approaches.

Axis	Stateless agent	Checkpointed agent
State across runs	None	Thread-scoped
Cross-session memory	None	Store-based
Human-in-the-loop	Hard to implement	Native (pause at checkpoint)
Time travel	Not possible	Native (fork any checkpoint)
Fault tolerance	Full retry	Resume from last checkpoint
Operational cost	Minimal	Database writes per step
Fit	Short, one-shot tasks	Conversations, long-running tasks

Checkpointing is the right default for any agent that will run longer than a few seconds, have more than one turn, or need to pause for a human. Stateless is the right choice for high-frequency classification and one-shot generation.

Neighbors in the series

Short-term and long-term memory, two articles ago, introduces the distinction that thread state and the store implement respectively. Generative agents memory, the previous article, describes the content of what thread state might hold for long-running agents. Human-in-the-loop, in the Safety stage, depends on checkpointing for pause-and-resume behavior. Harness design, in the Production stage, describes how persistence fits into the planner-generator-evaluator architecture.

References

LangChain. Persistence and the checkpointer concept. 2024.
Anthropic. Building effective agents. December 2024.
OpenAI. Assistants API and thread persistence. 2024.
Temporal. Durable execution for agent workflows. 2024.
LangChain. LangGraph checkpointer backends. 2024.
