How to Fix 'rate limit exceeded in production' in LangGraph (Python)
What this error means
A "rate limit exceeded" error in production usually means your LangGraph app is making more LLM calls than the provider allows in a short window. In practice, this shows up when a graph retries too aggressively, fans out too many parallel nodes, or runs multiple user requests through the same model account without throttling.
The actual exception often comes from the underlying SDK, not LangGraph itself. You’ll see errors like openai.RateLimitError, anthropic.RateLimitError, or a provider-specific 429 Too Many Requests bubbling up through your graph execution.
The Most Common Cause
The #1 cause is uncontrolled concurrency inside the graph.
A lot of people build a graph node that loops over items and calls the model for each one, or they run multiple branches in parallel without limiting throughput. In dev, it works. In production, traffic spikes and you hit the provider’s requests-per-minute (RPM) or tokens-per-minute (TPM) ceiling.
Here’s the broken pattern:
| Broken | Fixed |
|---|---|
| Calls the LLM inside a tight loop with no backoff | Batches work and limits concurrency |
| Lets every request fan out immediately | Uses a semaphore / rate limiter |
| Retries instantly on 429 | Retries with exponential backoff |
```python
# broken.py
from typing import TypedDict

from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI

class State(TypedDict):
    items: list
    summaries: list

llm = ChatOpenAI(model="gpt-4o-mini")

def summarize_items(state: State):
    summaries = []
    for item in state["items"]:
        # Bad: one request per item with no throttle
        resp = llm.invoke(f"Summarize this: {item}")
        summaries.append(resp.content)
    return {"summaries": summaries}

graph = StateGraph(State)
graph.add_node("summarize_items", summarize_items)
graph.set_entry_point("summarize_items")
graph.add_edge("summarize_items", END)
app = graph.compile()
```
```python
# fixed.py
import time
from threading import Semaphore
from typing import TypedDict

from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI

class State(TypedDict):
    items: list
    summaries: list

llm = ChatOpenAI(model="gpt-4o-mini")
limit = Semaphore(3)  # cap concurrent model calls

def call_llm(prompt: str):
    with limit:
        return llm.invoke(prompt)

def summarize_items(state: State):
    summaries = []
    for item in state["items"]:
        # Better: controlled throughput
        resp = call_llm(f"Summarize this: {item}")
        summaries.append(resp.content)
        time.sleep(0.2)  # optional pacing for bursty workloads
    return {"summaries": summaries}

graph = StateGraph(State)
graph.add_node("summarize_items", summarize_items)
graph.set_entry_point("summarize_items")
graph.add_edge("summarize_items", END)
app = graph.compile()
```
If you’re using async nodes, use asyncio.Semaphore instead of threading.Semaphore. Same idea: stop every branch from hammering the provider at once.
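A minimal async sketch of the same throttle (the state shape and prompt are illustrative):

```python
import asyncio

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
limit = asyncio.Semaphore(3)  # at most 3 in-flight model calls

async def call_llm(prompt: str):
    async with limit:
        return await llm.ainvoke(prompt)

async def summarize_items(state):
    # Fan out over all items, but the semaphore caps concurrency at 3
    results = await asyncio.gather(
        *(call_llm(f"Summarize this: {item}") for item in state["items"])
    )
    return {"summaries": [r.content for r in results]}
```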
Other Possible Causes
1. Retry logic that replays the same failed call too fast
If you catch exceptions and immediately retry inside the node, you can turn one 429 into five more 429s.
```python
# bad retry
for _ in range(5):
    try:
        return llm.invoke(prompt)
    except Exception:
        pass  # instant retry, no delay
```
Use exponential backoff and only retry rate-limit errors:
```python
import time

from openai import RateLimitError

def invoke_with_backoff(prompt, attempts=5):
    delay = 1
    for _ in range(attempts):
        try:
            return llm.invoke(prompt)
        except RateLimitError:
            time.sleep(delay)
            delay *= 2  # 1s, 2s, 4s, 8s, 16s
    raise RuntimeError("Still rate limited after retries")
```
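If you’d rather not hand-roll the loop, LangChain runnables (including ChatOpenAI) expose a with_retry() helper that retries with exponential backoff and jitter. A sketch reusing the llm from the earlier examples:

```python
from openai import RateLimitError

# Retry only on rate-limit errors, with exponential backoff + jitter
resilient_llm = llm.with_retry(
    retry_if_exception_type=(RateLimitError,),
    wait_exponential_jitter=True,
    stop_after_attempt=5,
)

resp = resilient_llm.invoke("Summarize this: the quarterly report")
```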
2. Parallel branches multiplying token usage
LangGraph makes it easy to branch execution. That’s useful, but three branches calling the model at once can triple your request rate.
```python
# conceptual example: too many parallel LLM calls in branches
graph.add_node("branch_a", branch_a)
graph.add_node("branch_b", branch_b)
graph.add_node("branch_c", branch_c)
```
If each branch invokes an LLM, add throttling at a shared layer, not just inside each node.
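One way to do that, assuming a recent langchain-core (which ships an InMemoryRateLimiter), is to attach the limiter to the single model instance every branch shares:

```python
from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain_openai import ChatOpenAI

# One limiter shared by every branch that uses this model instance
rate_limiter = InMemoryRateLimiter(
    requests_per_second=2,      # ~120 requests/minute across all branches combined
    check_every_n_seconds=0.1,  # how often a waiting call re-checks the bucket
    max_bucket_size=5,          # allow small bursts
)
llm = ChatOpenAI(model="gpt-4o-mini", rate_limiter=rate_limiter)

# branch_a, branch_b, and branch_c all call this llm, so their combined
# request rate stays under the cap no matter how the graph fans out.
```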
3. Long-running conversations with no context trimming
If your state keeps growing and every turn sends the full transcript back to the model, token usage climbs until you hit TPM limits.
```python
# bad: keep appending forever
state["messages"].append(user_msg)
response = llm.invoke(state["messages"])
```
Trim messages before invoking:
```python
messages = state["messages"][-10:]
response = llm.invoke(messages)
```
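If you want token-aware trimming rather than a fixed message count, langchain-core also ships a trim_messages helper. A sketch, with the token budget picked arbitrarily:

```python
from langchain_core.messages import trim_messages

# Keep the most recent messages that fit in ~3000 tokens,
# counted with the model's own tokenizer
trimmed = trim_messages(
    state["messages"],
    strategy="last",
    max_tokens=3000,
    token_counter=llm,
    include_system=True,  # always keep the system prompt
    start_on="human",     # don't start the window mid-exchange
)
response = llm.invoke(trimmed)
```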
4. Multiple workers sharing one API key
This is common in production when you scale horizontally. Each worker looks fine alone, but together they exceed account limits.
```yaml
# example symptom: 8 gunicorn workers all using one API key
workers: 8
env:
  OPENAI_API_KEY: ${OPENAI_API_KEY}
```
Fix by lowering worker count, adding a distributed rate limiter, or moving heavy LLM traffic to a queue-backed worker pool.
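A minimal sketch of the distributed-limiter option, assuming redis-py and a Redis instance every worker can reach (the key prefix and the 100-calls-per-minute budget are made up for illustration):

```python
import time

import redis  # shared Redis coordinates all workers

r = redis.Redis(host="localhost", port=6379)
MAX_CALLS_PER_MINUTE = 100  # one budget shared across every worker

def acquire_slot() -> None:
    """Block until the shared per-minute budget has room (fixed-window counter)."""
    while True:
        window = f"llm_calls:{int(time.time() // 60)}"
        count = r.incr(window)
        if count == 1:
            r.expire(window, 120)  # clean up old windows
        if count <= MAX_CALLS_PER_MINUTE:
            return
        time.sleep(1)  # over budget: wait for the window to roll over

def call_llm(prompt: str):
    acquire_slot()
    return llm.invoke(prompt)
```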
How to Debug It
- Check the exact exception class
  - Look for openai.RateLimitError, anthropic.RateLimitError, or HTTP 429 Too Many Requests.
  - If it’s wrapped by LangGraph, inspect the root cause in logs or stack traces.
- Log per-node call counts
  - Add counters around every LLM invocation.
  - You want to know which node explodes under load.
```python
import logging

log = logging.getLogger(__name__)

def tracked_call(node_name, prompt):
    # Log which node made the call so hot spots show up in production logs
    log.info("node=%s calling_llm", node_name)
    return llm.invoke(prompt)
```
- Test with concurrency set to 1
  - Run one request at a time.
  - If the error disappears, your problem is fan-out or shared-account throughput (see the config sketch after this list).
- Inspect retries and worker count
  - Search for retry decorators, custom loops, Celery workers, Gunicorn workers, or async gather patterns.
  - A hidden retry loop plus parallel workers is a classic production-only failure mode.
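One quick way to run that one-at-a-time test without rewriting any nodes: LangGraph accepts a standard runnable config, and recent versions honor max_concurrency to cap how many tasks execute in parallel (treat the exact behavior as version-dependent):

```python
# Diagnostic run: serialize execution for a single request.
# If the 429s disappear here, the problem is fan-out or shared-account
# throughput rather than any single prompt.
result = app.invoke(
    {"items": ["first doc", "second doc"]},
    config={"max_concurrency": 1},
)
```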
Prevention
- Put every model call behind a shared throttle or queue.
- Use exponential backoff on 429 responses; never instant-retry.
- Keep graph state small: trim messages, summarize history, and avoid repeated full-context calls.
- Load test your LangGraph app with production-like concurrency before shipping.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.