LangGraph Tutorial (Python): Implementing Retry Logic for Advanced Developers

By Cyprian Aarons · Updated 2026-04-22

This tutorial shows how to add retry logic to a LangGraph workflow in Python without turning your graph into a mess of nested try/except blocks. You’ll build a small agent pipeline that retries failed model calls, preserves state, and stops after a controlled number of attempts.

What You'll Need

  • Python 3.10+
  • langgraph
  • langchain-openai
  • langchain-core
  • An OpenAI API key set as OPENAI_API_KEY
  • Basic familiarity with:
    • StateGraph
    • typed state with TypedDict
    • conditional edges in LangGraph

Install the packages:

pip install langgraph langchain-openai langchain-core

Step-by-Step

  1. Start by defining state that tracks both the task and the retry count. The key is to keep retry metadata inside graph state so every node can make decisions without relying on global variables. Plain fields are deliberate here: LangGraph's default last-write-wins update lets a successful attempt clear a stale error by writing an empty string, whereas a custom "keep the non-empty value" reducer would silently preserve old failures and break the routing below.
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class RetryState(TypedDict):
    prompt: str        # the task to run
    result: str        # set on success, empty otherwise
    error: str         # set on failure, overwritten with "" on success
    attempts: int      # how many times the risky node has executed
    max_attempts: int  # the retry budget for this invocation
  2. Next, create the node that does the risky work. In production this is usually an LLM call, tool call, or external API request; here we simulate random failures so you can see the retry path clearly.
import random


def risky_operation(state: RetryState) -> dict:
    # Count this attempt up front so the router always sees a current number.
    attempts = state.get("attempts", 0) + 1

    # Simulate a transient failure 60% of the time.
    if random.random() < 0.6:
        return {
            "attempts": attempts,
            "error": f"Transient failure on attempt {attempts}",
            "result": "",
        }

    return {
        "attempts": attempts,
        "error": "",
        "result": f"Processed prompt: {state['prompt']}",
    }
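Note that the node returns a partial update, not the whole state; LangGraph merges the returned keys into the current state. Calling the node directly makes that visible:

# Nodes return only the keys they want to update.
update = risky_operation(
    {"prompt": "demo", "result": "", "error": "", "attempts": 0, "max_attempts": 3}
)
print(update)  # e.g. {'attempts': 1, 'error': 'Transient failure on attempt 1', 'result': ''}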
  3. Add a router that decides whether to retry or stop. This keeps retry policy outside the node itself, which is cleaner when you later swap in backoff, circuit breakers, or different policies per node.
def route_after_risky_operation(state: RetryState) -> str:
    if state.get("error") and state["attempts"] < state["max_attempts"]:
        return "retry"
    if state.get("error"):
        return "fail"
    return "done"
  4. Build the graph with explicit retry and failure paths. The retry edge loops back into the same node, while the failure edge terminates with a final error message.
def fail_node(state: RetryState) -> dict:
    return {
        "result": "",
        "error": f"Failed after {state['attempts']} attempts: {state['error']}",
    }


builder = StateGraph(RetryState)

builder.add_node("risky_operation", risky_operation)
builder.add_node("fail", fail_node)

builder.add_edge(START, "risky_operation")
builder.add_conditional_edges(
    "risky_operation",
    route_after_risky_operation,
    {
        "retry": "risky_operation",
        "fail": "fail",
        "done": END,
    },
)
builder.add_edge("fail", END)

graph = builder.compile()
  5. Run it with an initial state that sets your retry budget. In real systems this budget should be small and tied to the type of failure you expect; transient network errors deserve retries, bad inputs usually do not.
initial_state: RetryState = {
    "prompt": "Summarize policy document A12",
    "result": "",
    "error": "",
    "attempts": 0,
    "max_attempts": 3,
}

final_state = graph.invoke(initial_state)
print(final_state)
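One caveat: every trip around the retry edge counts against LangGraph's recursion limit (25 steps by default), so a large max_attempts needs a matching limit. You can raise it per invocation:

# Each retry hop consumes part of LangGraph's step budget (default 25),
# so raise the recursion limit alongside a larger max_attempts.
final_state = graph.invoke(initial_state, config={"recursion_limit": 50})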
  6. If you want this pattern around an actual LLM call, wrap the model invocation inside the node and catch only transient exceptions. The graph-level routing stays the same; only the node body changes.
from langchain_openai import ChatOpenAI
from openai import APIConnectionError, InternalServerError, RateLimitError


llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)


def llm_node(state: RetryState) -> dict:
    attempts = state.get("attempts", 0) + 1

    try:
        response = llm.invoke(state["prompt"])
        return {
            "attempts": attempts,
            "error": "",
            "result": response.content,
        }
    except (APIConnectionError, RateLimitError, InternalServerError) as e:
        # Only transient failures land here (timeouts subclass
        # APIConnectionError); bad requests and auth errors propagate
        # immediately instead of burning the retry budget.
        return {
            "attempts": attempts,
            "error": str(e),
            "result": "",
        }
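Wiring the LLM node in reuses the router and failure node from earlier unchanged; only the node name differs. A sketch, assuming those definitions are still in scope:

llm_builder = StateGraph(RetryState)

llm_builder.add_node("llm_call", llm_node)
llm_builder.add_node("fail", fail_node)

llm_builder.add_edge(START, "llm_call")
llm_builder.add_conditional_edges(
    "llm_call",
    route_after_risky_operation,
    {
        "retry": "llm_call",
        "fail": "fail",
        "done": END,
    },
)
llm_builder.add_edge("fail", END)

llm_graph = llm_builder.compile()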

Testing It

Run the graph several times and watch how it behaves across success and failure cases. You should see attempts increase until either a successful result is returned or max_attempts is reached.
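A small loop makes that easy to observe; each invocation starts fresh from the initial state:

# Run the graph a handful of times to see both success and failure paths.
for i in range(5):
    out = graph.invoke(initial_state)
    outcome = out["result"] or out["error"]
    print(f"run {i}: attempts={out['attempts']} -> {outcome}")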

To verify the retry path deterministically, temporarily force failures by raising the random threshold in risky_operation to 1.0 so every attempt fails. Then confirm the graph exits through fail instead of looping forever.
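With that change in place, a quick check confirms the budget is honored, assuming the max_attempts of 3 from the earlier initial state:

# With risky_operation forced to fail, every run must exhaust the budget
# and exit through the fail node.
out = graph.invoke(initial_state)
assert out["attempts"] == initial_state["max_attempts"]
assert out["error"].startswith("Failed after")
assert out["result"] == ""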

For LLM-backed nodes, test against a known bad input and inspect whether only transient failures are retried. If your exception handling catches everything, you’ll end up retrying validation errors and wasting tokens.
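One way to exercise that without waiting for a real outage is a stub model that raises a genuinely transient error once and then succeeds. FlakyLLM below is hypothetical, and it assumes the openai SDK's APIConnectionError can be constructed from the originating httpx.Request:

from types import SimpleNamespace

import httpx
from openai import APIConnectionError


class FlakyLLM:
    """Hypothetical stand-in for ChatOpenAI: one transient failure, then success."""

    def __init__(self):
        self.calls = 0

    def invoke(self, prompt):
        self.calls += 1
        if self.calls == 1:
            # Raise a transient error type that llm_node's except clause catches.
            raise APIConnectionError(
                request=httpx.Request("POST", "https://api.openai.com/v1/chat/completions")
            )
        return SimpleNamespace(content="stub answer")


llm = FlakyLLM()  # rebind the module-level llm that llm_node reads

out = llm_graph.invoke(initial_state)
print(out["attempts"], repr(out["result"]))  # expect 2 and 'stub answer'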

Next Steps

  • Add exponential backoff by storing next_retry_at in state and routing through a wait node (a minimal sketch follows this list).
  • Split retry policy by error class so timeouts retry but schema violations fail fast.
  • Combine this pattern with LangGraph checkpointing so retries survive process restarts.
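For the backoff item, here is a minimal sketch of a wait node in the loop. time.sleep stands in for real scheduling, and the doubling policy is an assumed starting point rather than a recommendation:

import time


def wait_node(state: RetryState) -> dict:
    # Exponential backoff: 1s, 2s, 4s, ... based on attempts so far.
    # In a real system you'd likely store next_retry_at instead of sleeping.
    time.sleep(2 ** (state["attempts"] - 1))
    return {}


backoff_builder = StateGraph(RetryState)

backoff_builder.add_node("risky_operation", risky_operation)
backoff_builder.add_node("wait", wait_node)
backoff_builder.add_node("fail", fail_node)

backoff_builder.add_edge(START, "risky_operation")
backoff_builder.add_conditional_edges(
    "risky_operation",
    route_after_risky_operation,
    {
        "retry": "wait",  # back off before looping around
        "fail": "fail",
        "done": END,
    },
)
backoff_builder.add_edge("wait", "risky_operation")
backoff_builder.add_edge("fail", END)

backoff_graph = backoff_builder.compile()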

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
