AI / LLM

Graph-of-Thought: Non-Linear Reasoning with Merge and Refine

March 30, 20267 min readAILLM

Tree-of-thought reasoning adds branching and backtracking to a chain. The search is richer than a single chain but still constrained: thoughts flow from parent to child, never across siblings, never back into themselves. That restriction excludes a class of natural reasoning moves. When a human solver consolidates three partial solutions into one unified answer, or loops back to refine an earlier step based on something discovered later, the structure traced on a whiteboard is a graph, not a tree.

Graph-of-Thought, introduced by Besta and colleagues at ETH Zurich in 2023, generalizes tree-of-thought to a directed graph of thoughts (Besta et al., 2023). Thoughts are vertices; operations on thoughts are edges. The pattern supports four operations rather than two: generate, aggregate, refine, and score. The first two and the last are familiar from earlier patterns; aggregation and refinement are the additions that make the structure a graph. The rest of this article describes the four operations, illustrates the pattern in code, and names the problem class for which the extra machinery earns its keep.

Four operations

Generate is the operation that produces new thoughts from existing ones. It is identical to the generator in tree-of-thought: an LLM call that proposes k candidate next thoughts given a current state. Graph-of-thought inherits this operation without modification.

Score is the operation that evaluates a thought. Also identical to tree-of-thought's evaluator, it returns a value or a verdict that informs later decisions. Many graph-of-thought implementations run score over every thought eagerly; some defer it until a selection is needed.

Aggregate is the first addition. It merges multiple thoughts into a single new thought. The LLM call reads several inputs and produces a consolidated output. Aggregation is what lets the search converge parallel exploration into one synthesized result rather than picking a single winner.

Refine is the second addition. It takes a thought and produces a better version of itself. The LLM call reads the current thought plus any relevant context and rewrites the thought. Refinement is what lets the search revisit an earlier step with information learned later.

Together these four operations define a richer reasoning space than the tree supports. A Game-of-24 solve does not benefit; a research report that synthesizes three subqueries does.

From tree to graph

flowchart TD
    ROOT[Problem] --> T1[Thought A]
    ROOT --> T2[Thought B]
    ROOT --> T3[Thought C]
    T1 --> AGG[Aggregate]
    T2 --> AGG
    T3 --> AGG
    AGG --> MERGED[Merged thought]
    MERGED --> REF[Refine]
    REF --> REFINED[Refined thought]
    REFINED --> SCORE{Score}
    SCORE --> OUT[Final]

The diagram traces a flow that a tree cannot express. Three parallel thoughts converge on an aggregate. The aggregate is refined. The refined thought is scored and either returned or fed back. In a tree, the three thoughts would remain in three separate branches, and the final answer would come from the single highest-scored path rather than from a synthesis.

Aggregation over three parallel branches

Aggregation is the operation that most frequently justifies reaching for graph-of-thought. When a problem can be attacked from several angles, and the best final answer requires combining them rather than picking one, aggregation is the right tool. Research synthesis, multi-perspective analysis, and ensemble-style generation are canonical fits.

The excerpt below shows aggregation without a framework. Three generation calls run in parallel, each producing a candidate thought. An aggregation call receives all three and produces a merged thought.

import asyncio
from openai import AsyncOpenAI
from pydantic import BaseModel

client = AsyncOpenAI(timeout=60.0, max_retries=2)

class Thought(BaseModel):
    content: str

async def generate(problem: str, perspective: str) -> Thought:
    r = await client.responses.parse(
        model="gpt-4o-mini",
        instructions=f"Solve from the {perspective} perspective.",
        input=problem, text_format=Thought,
    )
    return r.output[0].content[0].parsed

async def aggregate(thoughts: list[Thought]) -> Thought:
    body = "\n\n".join(f"[{i}] {t.content}" for i, t in enumerate(thoughts))
    r = await client.responses.parse(
        model="gpt-4o-mini",
        instructions="Merge the perspectives into a single coherent thought.",
        input=body, text_format=Thought,
    )
    return r.output[0].content[0].parsed

async def solve(problem: str) -> str:
    perspectives = ["technical", "economic", "social"]
    results = await asyncio.gather(*(generate(problem, p) for p in perspectives))
    merged = await aggregate(results)
    return merged.content

Refinement is the other addition. A thought produced early can be rewritten later when downstream work reveals what it was missing. The loop is bounded by a quality threshold or a maximum refinement count.

The LangChain excerpt below shows refinement paired with scoring. The loop exits when the scorer is satisfied or the budget runs out.

from langchain.chat_models import init_chat_model
from pydantic import BaseModel, Field
from typing import Literal

class Thought(BaseModel):
    content: str

class Score(BaseModel):
    value: float = Field(ge=0.0, le=1.0)
    verdict: Literal["pass", "needs_work"]

model = init_chat_model("gpt-4o-mini")

def refine(thought: Thought, context: str) -> Thought:
    r = model.with_structured_output(Thought).invoke(
        f"Refine the thought using this context.\n\n"
        f"Thought: {thought.content}\n\nContext: {context}")
    return r

def score(thought: Thought) -> Score:
    return model.with_structured_output(Score).invoke(
        f"Score the quality of this thought. >=0.8 = pass.\n\n{thought.content}")

def solve(initial: Thought, context: str, max_rounds: int = 3) -> Thought:
    thought = initial
    for _ in range(max_rounds):
        s = score(thought)
        if s.verdict == "pass":
            return thought
        thought = refine(thought, context)
    return thought

Full runnable versions live at github.com/subodhjena/agentic-patterns under the tree-of-thought examples; graph-specific implementations sit at the same layer but add aggregate and refine calls around the existing generator and scorer.

Where aggregation earns its cost

The operation is cheap to describe and surprisingly hard to get right. It earns its cost in two specific situations.

Synthesis problems. When three parallel analyses produce complementary findings, aggregation produces a better final output than any single analysis would. Research summaries, multi-perspective decision documents, and literature reviews fit this shape.

Voting with understanding. Aggregation generalizes majority voting. The voting variant of parallelization, covered earlier in this series, picks the most common answer. Aggregation reads the underlying reasoning and produces an output that acknowledges the overlaps and disagreements.

When a simple majority vote suffices, aggregation is over-engineering. When the underlying content carries information the vote would discard, aggregation is a better aggregator than a counter.

Refinement is closer to evaluator-optimizer than it is to tree-of-thought. The operation earns its cost when the refinement signal is genuinely informative.

Late-arriving context. A thought produced before later information is known is a candidate for refinement once that information arrives. Research drafts revised after sources are checked, code revised after test output is seen, and plans revised after the first step executed all illustrate.

Targeted polish. A nearly-correct thought that needs a single rewrite to be correct benefits more from one refinement call than from regenerating from scratch.

Refinement degenerates into a poor evaluator-optimizer when there is no new context, which is why graph-of-thought tends to use refinement sparingly and in combination with aggregation rather than as the primary mechanism.

Where the graph goes wrong

The pattern has the highest implementation surface area of any reasoning pattern covered in this series. Failures cluster around three issues.

Aggregation that averages. A naive aggregator reads three candidates and produces a bland mean. The symptom is a merged thought that loses the specificity each input had. Prompt the aggregator to preserve distinctive claims and resolve conflicts explicitly.

Refinement that drifts. Successive refinements move further from the original goal. Anchor the refinement prompt to the original problem, not just the previous thought.

Graph that does not terminate. Without termination criteria on both the search and the refinement loops, the pattern runs indefinitely. Cap both budgets.

Over-engineering. Reaching for graph-of-thought when tree-of-thought or evaluator-optimizer would do produces code that is hard to maintain for a marginal gain. Use the simplest pattern that meets the accuracy bar.

Evaluator that cannot judge aggregates. An evaluator trained on atomic thoughts may rate an aggregated thought poorly because it has a different shape. Calibrate the evaluator on aggregated outputs when aggregation is in use.

Trade against tree-of-thought

Graph-of-thought and tree-of-thought address overlapping problem classes. The table below names the choice.

Axis	Tree-of-Thought	Graph-of-Thought
Primary operation	Branch and backtrack	Branch, aggregate, refine
Best-fit problems	Search with clear winners	Synthesis and iterative refinement
Implementation complexity	Medium	High
Cost scaling	k^depth	Higher; aggregation and refinement add calls
Observability	Clean trees	Graphs are harder to visualize
Typical use	Game-of-24, combinatorial search	Research synthesis, multi-perspective analysis

Graph-of-thought is the right pattern when the problem needs synthesis or refinement in addition to search. When the problem is purely search, tree-of-thought is cheaper and sufficient.

Neighbors in the series

Tree-of-thought, the previous article, is the pattern graph-of-thought generalizes. Chain-of-thought, two articles ago, is the linear baseline both branch from. Evaluator-optimizer, earlier in the Workflows stage, is the non-branching cousin of refinement: a generator and a critic in a loop on one artifact. Orchestrator-workers, also in Workflows, uses a synthesis step that resembles aggregation but without the option to refine afterward.

References

Besta, Maciej, et al. Graph of Thoughts: Solving Elaborate Problems with Large Language Models. AAAI 2024.
Yao, Shunyu, et al. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023.
Wei, Jason, et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.
Madaan, Aman, et al. Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS 2023.
LangChain. Graph-based reasoning patterns in LangGraph. 2024.

agentic-patterns graph-of-thought reasoning aggregation refinement ai llm

← Back to all posts