LangGraph Tutorial (Python): implementing retry logic for beginners
This tutorial shows you how to add retry logic to a LangGraph workflow in Python using real graph nodes, conditional routing, and state updates. You need this when an LLM call, API request, or tool invocation fails intermittently and you want the graph to try again without crashing the whole run.
What You'll Need
- Python 3.10+
- `langgraph`
- `langchain-core`
- `langchain-openai` if you want to use OpenAI models
- An OpenAI API key set as `OPENAI_API_KEY`
- Basic familiarity with LangGraph nodes, edges, and state
Install the packages:
```
pip install langgraph langchain-core langchain-openai
```
Step-by-Step
1) Define the graph state
Retry logic needs state that tracks both the work result and how many times you've retried. Keep it simple: store the user input, the latest output, an error message, and a retry counter.
```python
from typing import TypedDict, Optional

class GraphState(TypedDict):
    input_text: str
    result: Optional[str]
    error: Optional[str]
    retry_count: int
```
2) Write a node that can fail
This example simulates an unstable operation. In production, this could be an LLM call, a database query, or a third-party API request.
```python
import random

def unstable_node(state: GraphState) -> GraphState:
    text = state["input_text"]
    # Simulate a transient failure 60% of the time.
    if random.random() < 0.6:
        raise RuntimeError("Transient failure from downstream service")
    return {
        **state,
        "result": f"Processed: {text}",
        "error": None,
    }
```
3) Add a retry handler node
When the unstable node fails, we route to a retry handler that increments the counter. The error message itself is already recorded in state by the safe wrapper in step 5; the router in step 4 is what stops the loop once retries are exhausted.
```python
def handle_retry(state: GraphState) -> GraphState:
    # The error is already in state; just count the attempt.
    return {
        **state,
        "retry_count": state["retry_count"] + 1,
    }
```
4) Build conditional routing for retries
This is the core pattern. The graph runs the unstable node, then uses a router function to decide whether to end successfully, retry, or fail permanently.
```python
from langgraph.graph import StateGraph, START, END

MAX_RETRIES = 3

def route_after_failure(state: GraphState) -> str:
    # Give up once the retry budget is spent; otherwise loop back.
    if state["retry_count"] >= MAX_RETRIES:
        return "end"
    return "retry"
```
5) Assemble the graph with exception-safe execution
LangGraph nodes should not crash your whole application if you want controlled retries. Wrap the risky work in a safe node that catches exceptions and stores them in state.
```python
def safe_unstable_node(state: GraphState) -> GraphState:
    try:
        return unstable_node(state)
    except Exception as e:
        # Record the failure in state instead of crashing the run.
        return {
            **state,
            "result": None,
            "error": str(e),
        }

builder = StateGraph(GraphState)
builder.add_node("work", safe_unstable_node)
builder.add_node("retry", handle_retry)

builder.add_edge(START, "work")
builder.add_conditional_edges(
    "work",
    lambda state: "success" if state["error"] is None else route_after_failure(state),
    {
        "success": END,
        "retry": "retry",
        "end": END,
    },
)
builder.add_edge("retry", "work")

graph = builder.compile()
```
6) Run it with initial state
Start with zero retries and no result. The graph will keep looping until it succeeds or hits MAX_RETRIES.
```python
initial_state: GraphState = {
    "input_text": "hello world",
    "result": None,
    "error": None,
    "retry_count": 0,
}

final_state = graph.invoke(initial_state)
print(final_state)
```
Testing It
Run the script multiple times because the failure is randomized. On successful runs, `result` should contain `"Processed: hello world"` and `error` should be `None`. On repeated failures, `retry_count` should stop at `MAX_RETRIES`, and `error` should contain the last exception message.
If you want deterministic testing, replace random.random() with a fixed sequence or mock unstable_node. That makes it easy to verify both branches: success on first try and failure after retries.
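To make that concrete, here is a stdlib-only sketch that repeats the node definitions from the steps above and uses `unittest.mock` to pin `random.random()` to each branch. It tests the node wrapper in isolation rather than the compiled graph, so it runs even without LangGraph installed.

```python
import random
from typing import TypedDict, Optional
from unittest import mock

class GraphState(TypedDict):
    input_text: str
    result: Optional[str]
    error: Optional[str]
    retry_count: int

def unstable_node(state: GraphState) -> GraphState:
    if random.random() < 0.6:
        raise RuntimeError("Transient failure from downstream service")
    return {**state, "result": f"Processed: {state['input_text']}", "error": None}

def safe_unstable_node(state: GraphState) -> GraphState:
    try:
        return unstable_node(state)
    except Exception as e:
        return {**state, "result": None, "error": str(e)}

state: GraphState = {
    "input_text": "hello world",
    "result": None,
    "error": None,
    "retry_count": 0,
}

# Pin the dice roll below 0.6 to force the failure branch.
with mock.patch.object(random, "random", return_value=0.1):
    failed = safe_unstable_node(state)
assert failed["error"] == "Transient failure from downstream service"

# Pin it at 0.6 or above to force the success branch.
with mock.patch.object(random, "random", return_value=0.99):
    ok = safe_unstable_node(state)
assert ok["result"] == "Processed: hello world"
assert ok["error"] is None
```

The same two patches can wrap `graph.invoke(...)` to drive the full graph down each branch deterministically.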
A good production check is logging each transition so you can see whether your workflow is bouncing between "work" and "retry" as expected.
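One lightweight way to get that visibility is a decorator that logs state on entry to each node; `logged` below is a name invented for this sketch, not a LangGraph API. (You can also iterate `graph.stream(initial_state)` to inspect state between steps.)

```python
import functools
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("graph")

def logged(name, node):
    """Wrap a node function so every call logs the node name and retry state."""
    @functools.wraps(node)
    def wrapper(state):
        log.info("-> %s (retry_count=%s, error=%r)",
                 name, state.get("retry_count"), state.get("error"))
        return node(state)
    return wrapper

# Register the wrapped nodes instead of the bare ones:
# builder.add_node("work", logged("work", safe_unstable_node))
# builder.add_node("retry", logged("retry", handle_retry))
```

Because the wrapper returns the node's result unchanged, routing behaves exactly as before.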
Next Steps
- Add exponential backoff before retrying by storing timestamps in state.
- Replace the simulated failure with a real LLM call using `ChatOpenAI`.
- Split retries by error type so validation errors fail fast while transient errors retry.
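For the first idea, here is a minimal backoff sketch. It takes the simpler route of blocking in the handler with `time.sleep` rather than storing timestamps in state; `backoff_delay` and `BASE_DELAY` are names invented here.

```python
import time

BASE_DELAY = 0.5  # seconds; tune for your downstream service

def backoff_delay(retry_count: int, base: float = BASE_DELAY) -> float:
    """Delay doubles with each attempt: 0.5s, 1s, 2s, ..."""
    return base * (2 ** (retry_count - 1))

def handle_retry_with_backoff(state):
    retry_count = state["retry_count"] + 1
    # Block before the graph loops back to the "work" node.
    time.sleep(backoff_delay(retry_count))
    return {**state, "retry_count": retry_count}
```

Swap this in for `handle_retry` when registering the `retry` node. The timestamp-in-state variant is better when blocking a worker is unacceptable.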
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.