LangGraph Tutorial (Python): implementing retry logic for beginners

By Cyprian Aarons
Updated 2026-04-22

This tutorial shows you how to add retry logic to a LangGraph workflow in Python using real graph nodes, conditional routing, and state updates. You need this when an LLM call, API request, or tool invocation fails intermittently and you want the graph to try again without crashing the whole run.

What You'll Need

  • Python 3.10+
  • langgraph
  • langchain-core
  • langchain-openai (optional, if you want to use OpenAI models)
  • An OpenAI API key set as OPENAI_API_KEY
  • Basic familiarity with LangGraph nodes, edges, and state

Install the packages:

pip install langgraph langchain-core langchain-openai

Step-by-Step

1) Define the graph state

Retry logic needs state that tracks both the work result and how many times you've retried. Keep it simple: store the user input, the latest output, an error message, and a retry counter.

from typing import TypedDict, Optional

class GraphState(TypedDict):
    input_text: str
    result: Optional[str]
    error: Optional[str]
    retry_count: int

2) Write a node that can fail

This example simulates an unstable operation. In production, this could be an LLM call, a database query, or a third-party API request.

import random

def unstable_node(state: GraphState) -> GraphState:
    text = state["input_text"]

    if random.random() < 0.6:
        raise RuntimeError("Transient failure from downstream service")

    return {
        **state,
        "result": f"Processed: {text}",
        "error": None,
    }

3) Add a retry handler node

When the unstable node fails, we route to a retry handler that increments the counter and carries the recorded error forward. The decision to stop retrying lives in the routing logic (step 4), not in this node.

def handle_retry(state: GraphState) -> GraphState:
    # Bump the counter; the error recorded by the safe wrapper
    # is carried through unchanged via **state.
    return {
        **state,
        "retry_count": state["retry_count"] + 1,
    }
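Because the retry handler is a plain function of state, you can sanity-check it without building a graph. A minimal standalone sketch (the state shape and handler are restated here so it runs on its own):

```python
from typing import TypedDict, Optional

class GraphState(TypedDict):
    input_text: str
    result: Optional[str]
    error: Optional[str]
    retry_count: int

def handle_retry(state: GraphState) -> GraphState:
    # Bump the counter; the recorded error rides along via **state.
    return {**state, "retry_count": state["retry_count"] + 1}

state: GraphState = {
    "input_text": "hello",
    "result": None,
    "error": "Transient failure",
    "retry_count": 1,
}
updated = handle_retry(state)
print(updated["retry_count"])  # 2
print(updated["error"])        # Transient failure
```

Note that the handler returns a new dict rather than mutating its input, so the original state is untouched.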

4) Build conditional routing for retries

This is the core pattern. The graph runs the unstable node, then uses a router function to decide whether to end successfully, retry, or fail permanently.

from langgraph.graph import StateGraph, START, END

MAX_RETRIES = 3

def route_after_failure(state: GraphState) -> str:
    if state["retry_count"] >= MAX_RETRIES:
        return "end"
    return "retry"
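To see the routing decisions concretely, you can call the router directly with different counter values. A standalone sketch (MAX_RETRIES redefined so it runs on its own, with a plain dict standing in for the state):

```python
MAX_RETRIES = 3

def route_after_failure(state: dict) -> str:
    # Give up once the counter reaches the limit; otherwise loop back.
    if state["retry_count"] >= MAX_RETRIES:
        return "end"
    return "retry"

for count in range(5):
    print(count, route_after_failure({"retry_count": count}))
# Counts 0-2 route back to "retry"; 3 and above give up with "end".
```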

5) Assemble the graph with exception-safe execution

If you want controlled retries, the risky node must not let exceptions escape and crash the run. Wrap the risky work in a safe node that catches exceptions and stores them in state.

def safe_unstable_node(state: GraphState) -> GraphState:
    try:
        return unstable_node(state)
    except Exception as e:
        return {
            **state,
            "result": None,
            "error": str(e),
        }
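The catch-and-record pattern can also be checked in isolation. A minimal sketch with a function that always fails (always_fails is a made-up helper for illustration, not part of the tutorial's graph):

```python
from typing import TypedDict, Optional

class GraphState(TypedDict):
    input_text: str
    result: Optional[str]
    error: Optional[str]
    retry_count: int

def always_fails(state: GraphState) -> GraphState:
    raise RuntimeError("boom")

def safe(state: GraphState) -> GraphState:
    # Convert the exception into state instead of letting it escape.
    try:
        return always_fails(state)
    except Exception as e:
        return {**state, "result": None, "error": str(e)}

out = safe({"input_text": "x", "result": None, "error": None, "retry_count": 0})
print(out["error"])  # boom
```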

builder = StateGraph(GraphState)
builder.add_node("work", safe_unstable_node)
builder.add_node("retry", handle_retry)

builder.add_edge(START, "work")
builder.add_conditional_edges(
    "work",
    lambda state: "success" if state["error"] is None else route_after_failure(state),
    {
        "success": END,
        "retry": "retry",
        "end": END,
    },
)
builder.add_edge("retry", "work")

graph = builder.compile()

6) Run it with initial state

Start with zero retries and no result. The graph will keep looping until it succeeds or hits MAX_RETRIES.

initial_state: GraphState = {
    "input_text": "hello world",
    "result": None,
    "error": None,
    "retry_count": 0,
}

final_state = graph.invoke(initial_state)

print(final_state)

Testing It

Run the script multiple times because the failure is randomized. On successful runs, result should contain "Processed: hello world" and error should be None. On repeated failures, retry_count should stop at MAX_RETRIES, and error should contain the last exception message.

If you want deterministic testing, replace random.random() with a fixed sequence or mock unstable_node. That makes it easy to verify both branches: success on first try and failure after retries.
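One way to script the outcomes, if you would rather not patch random, is to drive the node from a queue of planned results so each branch is reproducible. The scripted_node helper below is hypothetical, and the while loop simulates the work/retry cycle in plain Python rather than invoking the compiled graph:

```python
from collections import deque

outcomes = deque(["fail", "fail", "ok"])  # fail twice, then succeed

def scripted_node(state: dict) -> dict:
    # Pop the next planned outcome instead of rolling random.random().
    if outcomes.popleft() == "fail":
        raise RuntimeError("Transient failure from downstream service")
    return {**state, "result": f"Processed: {state['input_text']}", "error": None}

MAX_RETRIES = 3
state = {"input_text": "hello world", "result": None, "error": None, "retry_count": 0}

# Simulate the work -> retry loop the graph performs.
while True:
    try:
        state = scripted_node(state)
        break
    except RuntimeError as e:
        state = {**state, "result": None, "error": str(e)}
        if state["retry_count"] >= MAX_RETRIES:
            break
        state = {**state, "retry_count": state["retry_count"] + 1}

print(state["result"])       # Processed: hello world
print(state["retry_count"])  # 2
```

With two scripted failures, the loop retries twice and then succeeds, which exercises both the retry branch and the success branch deterministically.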

A good production check is logging each transition so you can see whether your workflow is bouncing between "work" and "retry" as expected.
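A lightweight way to get that visibility is to wrap each node function in a logger before registering it with the builder. This decorator sketch is stdlib-only and works on any state-in, state-out callable; demo_node is a stand-in for illustration:

```python
import logging
from functools import wraps

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("graph")

def logged(name, fn):
    # Log entry into the node and the current retry counter on every call.
    @wraps(fn)
    def wrapper(state):
        log.info("entering %s (retry_count=%s)", name, state.get("retry_count"))
        return fn(state)
    return wrapper

# Usage with the tutorial's graph would look like:
#   builder.add_node("work", logged("work", safe_unstable_node))
def demo_node(state):
    return {**state, "result": "done"}

out = logged("work", demo_node)({"retry_count": 0})
print(out["result"])  # done
```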

Next Steps

  • Add exponential backoff before retrying by storing timestamps in state.
  • Replace the simulated failure with a real LLM call using ChatOpenAI.
  • Split retries by error type so validation errors fail fast while transient errors retry.
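For the backoff bullet, the usual shape is a delay of base * 2 ** retry_count, capped and jittered. A stdlib-only sketch, with illustrative numbers rather than tuned values:

```python
import random

def backoff_delay(retry_count: int, base: float = 0.5, cap: float = 10.0) -> float:
    # Exponential growth with a ceiling, plus jitter so concurrent
    # retries don't all fire at the same instant.
    delay = min(cap, base * (2 ** retry_count))
    return delay * random.uniform(0.5, 1.0)

for n in range(5):
    print(n, round(backoff_delay(n), 2))
```

In a graph node you would record a timestamp in state when scheduling the retry, then sleep for the computed delay before re-entering the work node.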

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.
