AI / LLM

Tree-of-Thought: Branching Reasoning with Search for LLMs

March 25, 20267 min readAILLM

Chain-of-thought reasoning commits to one trajectory from the first token. The model writes a step, then the next, then the next, and each step conditions on the previous without ever backing up. On problems where the answer follows from a sequential derivation, that is sufficient. On problems where the right move at step two depends on something that only becomes visible at step four, the single chain paints into a corner.

Tree-of-thought, introduced by Yao and colleagues in 2023, generalizes the chain to a search (Yao et al., 2023). At each step the model proposes several candidate thoughts rather than one. A separate evaluator LLM scores how promising each candidate is. A search algorithm, typically breadth-first or depth-first with pruning, decides which branch to expand and which to abandon. The pattern's flagship result is on the Game of 24 arithmetic benchmark, where it reached 74 percent accuracy compared with 4 percent for chain-of-thought alone. The rest of this article describes the four components and the class of problems for which the cost is justified.

Four components

The pattern decomposes cleanly into four roles.

Thought decomposition is the act of deciding what constitutes a "step" in the reasoning. For Game of 24 it is a single arithmetic operation; for crossword solving it is a word candidate for a clue; for creative writing it is a plot beat. The decomposition is problem-specific and needs to be defined before the search can start.

Thought generation is an LLM call that produces k candidate thoughts given the current state. The candidates compete for expansion. A reasonable k is three to five; much larger and the search balloons, much smaller and the tree collapses back to a chain.

State evaluation is an LLM call that scores each candidate on how likely it is to lead to a correct answer. The evaluator is prompted with the candidate thought plus the state that preceded it; it returns a numeric value or a classification (promising, maybe, unlikely). The evaluator should be separate from the generator, for the same reason the evaluator in evaluator-optimizer is separate from the generator.

Search decides which node to expand next. Breadth-first expands all siblings before descending; depth-first commits to a path and backtracks on failure. Pruning thresholds cut low-scoring branches without exploring them. The choice of algorithm depends on the problem shape and the budget.

The tree in motion

flowchart TD
    ROOT[Root state] --> T1[Thought A]
    ROOT --> T2[Thought B]
    ROOT --> T3[Thought C]
    T1 --> E1{Eval: low}
    T2 --> E2{Eval: high}
    T3 --> E3{Eval: medium}
    E1 -.prune.-> X1[ ]
    E2 --> T2A[Thought B1]
    E2 --> T2B[Thought B2]
    T2A --> T2AE{Eval: high}
    T2B --> T2BE{Eval: low}
    T2BE -.prune.-> X2[ ]
    T2AE --> OUT[Final answer]

The search frontier is a set of states. The generator expands a state into candidate thoughts; the evaluator scores each; the search algorithm keeps the survivors and prunes the rest. When a state is judged to be a terminal answer, the search terminates.

Two versions in code

The excerpt below is a minimal breadth-first tree-of-thought search without a framework. For clarity the depth is bounded, the branching factor is small, and the evaluator returns a single float.

from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Literal

client = OpenAI()

class Thoughts(BaseModel):
    candidates: list[str]

class Score(BaseModel):
    value: float = Field(ge=0.0, le=1.0)
    verdict: Literal["sure", "maybe", "impossible"]

def generate(state: str, k: int = 3) -> list[str]:
    r = client.responses.parse(
        model="gpt-4o-mini",
        instructions=f"Propose {k} next reasoning steps. Be diverse.",
        input=state, text_format=Thoughts,
    )
    return r.output[0].content[0].parsed.candidates

def evaluate(state: str, candidate: str) -> Score:
    r = client.responses.parse(
        model="gpt-4o-mini",
        instructions="Score how likely this candidate leads to the correct answer.",
        input=f"State: {state}\nCandidate: {candidate}", text_format=Score,
    )
    return r.output[0].content[0].parsed

def search(root: str, depth: int = 3, k: int = 3, beam: int = 2) -> str:
    frontier = [root]
    for _ in range(depth):
        scored = []
        for state in frontier:
            for c in generate(state, k):
                s = evaluate(state, c)
                if s.verdict != "impossible":
                    scored.append((s.value, state + "\n- " + c))
        scored.sort(reverse=True)
        frontier = [s for _, s in scored[:beam]]
        if not frontier:
            return "search exhausted"
    return frontier[0]

The LangGraph version expresses the frontier as part of the state and uses conditional edges to loop until the depth budget is reached or a terminal state is found.

from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langchain.chat_models import init_chat_model

model = init_chat_model("gpt-4o-mini")

class State(TypedDict):
    frontier: list[str]
    depth: int
    best: str

def expand_node(s: State) -> State:
    scored = []
    for st in s["frontier"]:
        cands = model.with_structured_output(Thoughts).invoke(
            f"Propose 3 next steps for: {st}").candidates
        for c in cands:
            sc = model.with_structured_output(Score).invoke(
                f"Score: {c} given {st}")
            if sc.verdict != "impossible":
                scored.append((sc.value, st + "\n- " + c))
    scored.sort(reverse=True)
    return {**s, "frontier": [x for _, x in scored[:2]],
            "depth": s["depth"] + 1,
            "best": scored[0][1] if scored else s["best"]}

def keep_going(s: State) -> str:
    return "done" if s["depth"] >= 3 or not s["frontier"] else "expand"

graph = (StateGraph(State)
         .add_node("expand", expand_node)
         .add_edge(START, "expand")
         .add_conditional_edges("expand", keep_going,
                                {"expand": "expand", "done": END})
         .compile())

Full runnable versions live at github.com/subodhjena/agentic-patterns under examples/13_tree_of_thought.py and examples/13_tree_of_thought_langgraph.py.

When the search is worth the cost

Tree-of-thought is expensive. A depth-three search with branching factor three generates 27 candidates and evaluates every one; self-consistency at five samples generates five chains with no evaluator. The pattern is worth the premium on a specific class of problems.

Problems with backtracking. Chess-puzzle-like search, crossword solving, Game of 24. The right move depends on lookahead.

Problems with parallel hypotheses. Scientific hypothesis generation, debugging candidates for a bug, plausible root causes for an incident. Enumerating and scoring several alternatives beats committing to one.

Problems where the evaluator is sharper than the generator. If a model can recognize a good answer when it sees one but struggles to produce one from scratch, tree-of-thought leverages that asymmetry.

Problems where wrong answers are cheap to detect. If evaluation is mostly a sanity check, search is efficient. If evaluation itself requires deep reasoning, the pattern loses its cost advantage over self-consistency.

Where the pattern misfires

The failure modes are specific to the search machinery.

Poor evaluator calibration. If the evaluator is over-optimistic, the tree never prunes and cost balloons. If it is over-pessimistic, promising branches are cut early. Measure the evaluator on labeled intermediate states before trusting it.

Shallow trees. A depth of one or two collapses to a CoT with voting; the search is decoration.

Over-branched trees. A branching factor of ten with depth three is 1,000 leaf evaluations. Cost climbs past the point of usefulness quickly. Keep the branching factor small and tune the pruning threshold.

Non-decomposable problems. If the problem has no natural "step" the generator can propose, the pattern has nothing to work with. Chain-of-thought or direct prompting is the right response.

Unbounded runtime. Without a depth cap or a time budget, search on adversarial inputs runs forever. Always cap.

Generator and evaluator on the same model. Using one model for both roles is cheap but often underperforms a two-tier setup (cheap generator, stronger evaluator). Measure before defaulting.

Trade against chain-of-thought and self-consistency

The Reasoning stage has three shapes to choose among. The axes below describe the choice.

Axis	Chain-of-Thought	Self-Consistency	Tree-of-Thought
Structure	Linear	N parallel chains	Branching search
Cost per query	Baseline	N times baseline	k^depth times baseline in worst case
Latency	Low	Same as baseline (parallel)	High, sequential expansion
Accuracy on exploration-heavy tasks	Low	Medium	High
Accuracy on sequential derivation	Medium to High	High	Similar to CoT, for more cost
Implementation complexity	Low	Low	Medium

Tree-of-thought is the pattern to reach for when the problem itself is a search. For sequential reasoning, a cheaper pattern dominates.

Neighbors in the series

Chain-of-thought, the previous article, is the linear reasoning pattern that tree-of-thought generalizes. Graph-of-thought, covered next, extends the search to a graph, adding aggregation and refinement edges that trees do not support. LATS, covered later in the Reasoning stage, replaces breadth-first or depth-first with Monte Carlo Tree Search, unifying reasoning and action selection. Evaluator-optimizer, earlier in the Workflows stage, uses the same generator-evaluator split for iterative refinement on a single output.

References

Yao, Shunyu, et al. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023.
Wei, Jason, et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.
Long, Jieyi. Large Language Model Guided Tree-of-Thought. 2023.
LangChain. Tree-of-thought implementations in LangGraph. 2024.
Anthropic. Extended thinking and reasoning models. 2025.

agentic-patterns tree-of-thought reasoning search bfs dfs ai llm

← Back to all posts