LangGraph Tutorial (Python): implementing retry logic for intermediate developers
This tutorial shows you how to add retry logic to a LangGraph workflow in Python so failed model calls or tool calls can be retried without collapsing the whole run. You need this when external APIs are flaky, rate limits kick in, or a single node occasionally throws a transient error.
What You'll Need
- Python 3.10+
- langgraph
- langchain-openai
- langchain-core
- An OpenAI API key set as `OPENAI_API_KEY`
- Basic familiarity with LangGraph nodes, edges, and `StateGraph`

Install the packages:

```shell
pip install langgraph langchain-openai langchain-core
```
Step-by-Step
- Start with a simple graph state and a node that can fail.
The key idea is to keep retry metadata in state instead of hiding it inside your business logic. That gives you visibility into how many attempts happened and lets downstream nodes make decisions based on failure history.
```python
from typing import TypedDict, Annotated
from operator import add

class GraphState(TypedDict):
    prompt: str
    result: str
    error: str
    attempts: int
    logs: Annotated[list[str], add]  # appended to across node runs
```
- Build a node that catches transient failures instead of letting them escape.
Here I’m using an LLM call, but the pattern is the same for any intermediate step that can fail. The node updates attempt count and records whether it succeeded or failed.
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def call_model(state: GraphState) -> dict:
    attempt = state.get("attempts", 0) + 1
    try:
        response = llm.invoke(state["prompt"])
        return {
            "result": response.content,
            "error": "",
            "attempts": attempt,
            "logs": [f"model call succeeded on attempt {attempt}"],
        }
    except Exception as e:
        # Record the failure in state instead of letting it crash the run.
        return {
            "result": "",
            "error": str(e),
            "attempts": attempt,
            "logs": [f"model call failed on attempt {attempt}: {e}"],
        }
```
- Add a retry router that decides whether to loop or stop.
This is the core of the retry logic. Instead of letting one failure kill the graph, route back to the same node until you hit your retry limit.
```python
MAX_RETRIES = 3

def route_after_call(state: GraphState) -> str:
    # Success: a non-empty result means we're done.
    if state.get("result"):
        return "end"
    # Failure with retries left: loop back to the same node.
    if state.get("attempts", 0) < MAX_RETRIES:
        return "retry"
    # Retries exhausted: exit cleanly with the error still in state.
    return "end"
```
- Wire the graph with conditional edges.
LangGraph makes this clean: one node does the work, one function routes based on state, and the graph loops only when needed. The END node is reached either on success or after retries are exhausted.
```python
from langgraph.graph import StateGraph, START, END

builder = StateGraph(GraphState)
builder.add_node("call_model", call_model)
builder.add_edge(START, "call_model")
builder.add_conditional_edges(
    "call_model",
    route_after_call,
    {
        "retry": "call_model",
        "end": END,
    },
)

graph = builder.compile()
```
- Run it with an initial state and inspect logs.
You want retries to be observable, not mysterious. The `logs` field gives you a simple audit trail you can print or send to your app logger.
```python
initial_state: GraphState = {
    "prompt": "Write one sentence about why retries matter in distributed systems.",
    "result": "",
    "error": "",
    "attempts": 0,
    "logs": [],
}

final_state = graph.invoke(initial_state)

print("Attempts:", final_state["attempts"])
print("Result:", final_state["result"])
print("Error:", final_state["error"])
print("Logs:")
for line in final_state["logs"]:
    print("-", line)
```
Testing It
Run the script once with a valid API key and confirm that `result` is populated and `attempts` is usually 1. Then temporarily break your API key or point the model name at something invalid to force failures and confirm the graph retries up to `MAX_RETRIES`. Check that the workflow exits cleanly after the limit instead of raising an uncaught exception from inside the graph.
If you want more realistic testing, wrap a flaky tool function instead of an LLM call and randomly raise exceptions for one out of every few runs. The important behavior is consistent: failures stay local to the node, retries are controlled by state, and downstream steps only see success or final failure.
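As a sketch of that flaky-tool setup (the `flaky_search` name and failure rate are my own for illustration, not part of any library), swap the LLM call for a function that fails at random:

```python
import random

def flaky_search(query: str) -> str:
    # Hypothetical tool: fails roughly one run in three to
    # simulate a transient upstream error.
    if random.random() < 0.33:
        raise TimeoutError("simulated transient failure")
    return f"results for {query!r}"

def call_tool(state: dict) -> dict:
    # Same shape as call_model: count the attempt and keep the
    # failure local to the node instead of letting it escape.
    attempt = state.get("attempts", 0) + 1
    try:
        return {
            "result": flaky_search(state["prompt"]),
            "error": "",
            "attempts": attempt,
            "logs": [f"tool succeeded on attempt {attempt}"],
        }
    except Exception as e:
        return {
            "result": "",
            "error": str(e),
            "attempts": attempt,
            "logs": [f"tool failed on attempt {attempt}: {e}"],
        }
```

Register `call_tool` in place of `call_model` and the routing logic stays identical: the router only ever looks at `result` and `attempts`, not at which node produced them.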
Next Steps
- Add exponential backoff between retries using `time.sleep()` inside a dedicated wait node.
- Split retry policies by error type so rate limits, timeouts, and validation errors behave differently.
- Move from simple looping to a supervisor-style graph where separate nodes handle recovery, fallback prompts, or human escalation.
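The backoff idea can be sketched as a dedicated wait node; the node name, base delay, and wiring shown in the comments are my own assumptions, not a LangGraph API:

```python
import time

BASE_DELAY = 0.5  # seconds; doubles after each failed attempt

def wait_before_retry(state: dict) -> dict:
    # Exponential backoff: 0.5s after attempt 1, 1s after attempt 2, ...
    delay = BASE_DELAY * (2 ** (state.get("attempts", 1) - 1))
    time.sleep(delay)
    return {"logs": [f"waited {delay:.1f}s before retrying"]}

# Wiring sketch: route "retry" to the wait node instead of straight back,
# then add a normal edge from the wait node to the worker:
#   builder.add_node("wait", wait_before_retry)
#   builder.add_conditional_edges("call_model", route_after_call,
#                                 {"retry": "wait", "end": END})
#   builder.add_edge("wait", "call_model")
```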
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.