LangGraph Tutorial (Python): Implementing Retry Logic for Advanced Developers
This tutorial shows how to add retry logic to a LangGraph workflow in Python without turning your graph into a mess of nested try/except blocks. You’ll build a small agent pipeline that retries failed model calls, preserves state, and stops after a controlled number of attempts.
What You'll Need
- Python 3.10+
- `langgraph`
- `langchain-openai`
- `langchain-core`
- An OpenAI API key set as `OPENAI_API_KEY`
- Basic familiarity with:
  - `StateGraph`
  - typed state with `TypedDict`
  - conditional edges in LangGraph

Install the packages:

```bash
pip install langgraph langchain-openai langchain-core
```
Step-by-Step
- Start by defining state that tracks both the task and the retry count. The key is to keep retry metadata inside graph state so every node can make decisions without relying on global variables. Plain fields use LangGraph's default last-value-wins update, which is exactly what a retry loop needs: a successful attempt must be able to overwrite a stale `error` with an empty string. (A custom reducer that keeps the old value when the new one is empty would stop the error from ever clearing, and the router would keep retrying even after a success.)

```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END

class RetryState(TypedDict):
    prompt: str
    result: str        # last successful output, empty until success
    error: str         # last failure message; a success clears it
    attempts: int
    max_attempts: int
```
- Next, create the node that does the risky work. In production this is usually an LLM call, tool call, or external API request; here we simulate a failure so you can see the retry path clearly.

```python
import random

def risky_operation(state: RetryState) -> dict:
    attempts = state.get("attempts", 0) + 1
    # Simulate a transient failure 60% of the time.
    if random.random() < 0.6:
        return {
            "attempts": attempts,
            "error": f"Transient failure on attempt {attempts}",
            "result": "",
        }
    return {
        "attempts": attempts,
        "error": "",
        "result": f"Processed prompt: {state['prompt']}",
    }
```
- Add a router that decides whether to retry or stop. This keeps retry policy outside the node itself, which is cleaner when you later swap in backoff, circuit breakers, or different policies per node.

```python
def route_after_risky_operation(state: RetryState) -> str:
    if state.get("error") and state["attempts"] < state["max_attempts"]:
        return "retry"
    if state.get("error"):
        return "fail"
    return "done"
```
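Because the router is a pure function of state, the retry policy can be unit-tested without building a graph at all. A minimal sketch (the router and a simplified state are duplicated here so the snippet runs standalone):

```python
from typing import TypedDict

class RetryState(TypedDict):
    prompt: str
    result: str
    error: str
    attempts: int
    max_attempts: int

def route_after_risky_operation(state: RetryState) -> str:
    if state.get("error") and state["attempts"] < state["max_attempts"]:
        return "retry"
    if state.get("error"):
        return "fail"
    return "done"

base = {"prompt": "p", "result": "", "max_attempts": 3}
print(route_after_risky_operation({**base, "error": "boom", "attempts": 1}))  # retry
print(route_after_risky_operation({**base, "error": "boom", "attempts": 3}))  # fail
print(route_after_risky_operation({**base, "error": "", "attempts": 2}))      # done
```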
- Build the graph with explicit retry and failure paths. The retry edge loops back into the same node, while the failure edge terminates with a final error message.

```python
def fail_node(state: RetryState) -> dict:
    return {
        "result": "",
        "error": f"Failed after {state['attempts']} attempts: {state['error']}",
    }

builder = StateGraph(RetryState)
builder.add_node("risky_operation", risky_operation)
builder.add_node("fail", fail_node)
builder.add_edge(START, "risky_operation")
builder.add_conditional_edges(
    "risky_operation",
    route_after_risky_operation,
    {
        "retry": "risky_operation",
        "fail": "fail",
        "done": END,
    },
)
builder.add_edge("fail", END)
graph = builder.compile()
```
- Run it with an initial state that sets your retry budget. In real systems this budget should be small and tied to the type of failure you expect; transient network errors deserve retries, bad inputs usually do not.

```python
initial_state: RetryState = {
    "prompt": "Summarize policy document A12",
    "result": "",
    "error": "",
    "attempts": 0,
    "max_attempts": 3,
}

final_state = graph.invoke(initial_state)
print(final_state)
```
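One operational detail to keep in mind: each pass through a node also counts toward LangGraph's recursion limit (25 by default), so a retry budget much larger than this example needs a matching limit in the invocation config. A fragment showing the knob:

```python
# Raise the graph-level step ceiling when max_attempts is large;
# otherwise LangGraph raises GraphRecursionError before your own cap.
final_state = graph.invoke(initial_state, config={"recursion_limit": 50})
```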
- If you want this pattern around an actual LLM call, wrap the model invocation inside the node and catch only transient exceptions; a bare `except Exception` would also retry validation and authentication errors, wasting tokens and hiding real bugs. The graph-level routing stays the same; only the node body changes.

```python
import openai
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def llm_node(state: RetryState) -> dict:
    attempts = state.get("attempts", 0) + 1
    try:
        response = llm.invoke(state["prompt"])
        return {
            "attempts": attempts,
            "error": "",
            "result": response.content,
        }
    except (openai.APITimeoutError, openai.APIConnectionError, openai.RateLimitError) as e:
        # Only transient failures populate `error`, so only they are retried;
        # anything else propagates and fails the run immediately.
        return {
            "attempts": attempts,
            "error": str(e),
            "result": "",
        }
```
Testing It
Run the graph several times and watch how it behaves across success and failure cases. You should see attempts increase until either a successful result is returned or max_attempts is reached.
To verify retry logic properly, temporarily force failures by changing the random threshold in risky_operation to always fail. Then confirm the graph exits through fail instead of looping forever.
For LLM-backed nodes, test against a known bad input and inspect whether only transient failures are retried. If your exception handling catches everything, you’ll end up retrying validation errors and wasting tokens.
Next Steps
- Add exponential backoff by storing `next_retry_at` in state and routing through a wait node.
- Split retry policy by error class so timeouts retry but schema violations fail fast.
- Combine this pattern with LangGraph checkpointing so retries survive process restarts.
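As a starting point for the backoff bullet, here is a minimal sketch. `backoff_delay` and `wait_node` are hypothetical names, and for simplicity it sleeps inside the wait node rather than storing a `next_retry_at` timestamp in state; you would route the "retry" branch through "wait" before looping back to the risky node:

```python
import time
from typing import TypedDict

class RetryState(TypedDict, total=False):
    attempts: int

def backoff_delay(attempts: int, base: float = 1.0, cap: float = 30.0) -> float:
    # 1s, 2s, 4s, ... doubling per attempt, capped at `cap` seconds.
    return min(base * (2 ** max(attempts - 1, 0)), cap)

def wait_node(state: RetryState) -> dict:
    # Hypothetical wait node: sleeps before the next attempt and
    # returns no state updates.
    time.sleep(backoff_delay(state.get("attempts", 0)))
    return {}

print(backoff_delay(1))   # 1.0
print(backoff_delay(3))   # 4.0
print(backoff_delay(10))  # 30.0
```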
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit