How to Fix 'rate limit exceeded in production' in LangGraph (Python)
What this error means
A "rate limit exceeded" error in production usually means your LangGraph app is making more LLM calls than the provider allows in a short window. In practice, this shows up when a graph retries too aggressively, fans out too many parallel nodes, or runs multiple user requests through the same model account without throttling.
The actual exception often comes from the underlying SDK, not LangGraph itself. You’ll see errors like openai.RateLimitError, anthropic.RateLimitError, or a provider-specific 429 Too Many Requests bubbling up through your graph execution.
The Most Common Cause
The #1 cause is uncontrolled concurrency inside the graph.
A lot of people build a graph node that loops over items and calls the model for each one, or they run multiple branches in parallel without limiting throughput. In dev, it works. In production, traffic spikes and you hit the provider’s requests-per-minute (RPM) or tokens-per-minute (TPM) ceiling.
Here’s the broken pattern:
| Broken | Fixed |
|---|---|
| Calls the LLM inside a tight loop with no backoff | Batches work and limits concurrency |
| Lets every request fan out immediately | Uses a semaphore / rate limiter |
| Retries instantly on 429 | Retries with exponential backoff |
```python
# broken.py
from typing import TypedDict

from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI

class State(TypedDict):
    items: list
    summaries: list

llm = ChatOpenAI(model="gpt-4o-mini")

def summarize_items(state: State):
    summaries = []
    for item in state["items"]:
        # Bad: one request per item with no throttle
        resp = llm.invoke(f"Summarize this: {item}")
        summaries.append(resp.content)
    return {"summaries": summaries}

graph = StateGraph(State)
graph.add_node("summarize_items", summarize_items)
graph.set_entry_point("summarize_items")
graph.add_edge("summarize_items", END)
app = graph.compile()
```
```python
# fixed.py
import time
from threading import Semaphore
from typing import TypedDict

from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI

class State(TypedDict):
    items: list
    summaries: list

llm = ChatOpenAI(model="gpt-4o-mini")
limit = Semaphore(3)  # cap concurrent model calls

def call_llm(prompt: str):
    with limit:
        return llm.invoke(prompt)

def summarize_items(state: State):
    summaries = []
    for item in state["items"]:
        # Better: controlled throughput
        resp = call_llm(f"Summarize this: {item}")
        summaries.append(resp.content)
        time.sleep(0.2)  # optional pacing for bursty workloads
    return {"summaries": summaries}

graph = StateGraph(State)
graph.add_node("summarize_items", summarize_items)
graph.set_entry_point("summarize_items")
graph.add_edge("summarize_items", END)
app = graph.compile()
```
If you’re using async nodes, use asyncio.Semaphore instead of threading.Semaphore. Same idea: stop every branch from hammering the provider at once.
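A minimal async sketch of the same throttle (the state shape and prompt are illustrative):

```python
import asyncio

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
limit = asyncio.Semaphore(3)  # at most 3 in-flight model calls

async def call_llm(prompt: str):
    async with limit:
        return await llm.ainvoke(prompt)

async def summarize_items(state):
    # Fan out over all items, but the semaphore caps concurrency at 3
    results = await asyncio.gather(
        *(call_llm(f"Summarize this: {item}") for item in state["items"])
    )
    return {"summaries": [r.content for r in results]}
```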
Other Possible Causes
1. Retry logic that replays the same failed call too fast
If you catch exceptions and immediately retry inside the node, you can turn one 429 into five more 429s.
```python
# bad retry
for _ in range(5):
    try:
        return llm.invoke(prompt)
    except Exception:
        pass  # instant retry, no delay
```
Use exponential backoff and only retry rate-limit errors:
```python
import time

from openai import RateLimitError

def invoke_with_backoff(prompt, attempts=5):
    delay = 1
    for _ in range(attempts):
        try:
            return llm.invoke(prompt)
        except RateLimitError:
            time.sleep(delay)
            delay *= 2  # 1s, 2s, 4s, 8s, 16s
    raise RuntimeError("Still rate limited after retries")
```
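If you’d rather not hand-roll the loop, LangChain runnables (including ChatOpenAI) expose a with_retry() helper that retries with exponential backoff and jitter. A sketch reusing the llm from the earlier examples:

```python
from openai import RateLimitError

# Retry only on rate-limit errors, with exponential backoff + jitter
resilient_llm = llm.with_retry(
    retry_if_exception_type=(RateLimitError,),
    wait_exponential_jitter=True,
    stop_after_attempt=5,
)

resp = resilient_llm.invoke("Summarize this: the quarterly report")
```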
2. Parallel branches multiplying token usage
LangGraph makes it easy to branch execution. That’s useful, but three branches calling the model at once can triple your request rate.
```python
# conceptual example: too many parallel LLM calls in branches
graph.add_node("branch_a", branch_a)
graph.add_node("branch_b", branch_b)
graph.add_node("branch_c", branch_c)
```
If each branch invokes an LLM, add throttling at a shared layer, not just inside each node.
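One way to do that, assuming a recent langchain-core (which ships an InMemoryRateLimiter), is to attach the limiter to the single model instance every branch shares:

```python
from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain_openai import ChatOpenAI

# One limiter shared by every branch that uses this model instance
rate_limiter = InMemoryRateLimiter(
    requests_per_second=2,      # ~120 requests/minute across all branches combined
    check_every_n_seconds=0.1,  # how often a waiting call re-checks the bucket
    max_bucket_size=5,          # allow small bursts
)
llm = ChatOpenAI(model="gpt-4o-mini", rate_limiter=rate_limiter)

# branch_a, branch_b, and branch_c all call this llm, so their combined
# request rate stays under the cap no matter how the graph fans out.
```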
3. Long-running conversations with no context trimming
If your state keeps growing and every turn sends the full transcript back to the model, token usage climbs until you hit TPM limits.
```python
# bad: keep appending forever
state["messages"].append(user_msg)
response = llm.invoke(state["messages"])
```
Trim messages before invoking:
```python
messages = state["messages"][-10:]
response = llm.invoke(messages)
```
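If you want token-aware trimming rather than a fixed message count, langchain-core also ships a trim_messages helper. A sketch, with the token budget picked arbitrarily:

```python
from langchain_core.messages import trim_messages

# Keep the most recent messages that fit in ~3000 tokens,
# counted with the model's own tokenizer
trimmed = trim_messages(
    state["messages"],
    strategy="last",
    max_tokens=3000,
    token_counter=llm,
    include_system=True,  # always keep the system prompt
    start_on="human",     # don't start the window mid-exchange
)
response = llm.invoke(trimmed)
```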
4. Multiple workers sharing one API key
This is common in production when you scale horizontally. Each worker looks fine alone, but together they exceed account limits.
```yaml
# example symptom: 8 gunicorn workers all using one API key
workers: 8
env:
  OPENAI_API_KEY: ${OPENAI_API_KEY}
```
Fix by lowering worker count, adding a distributed rate limiter, or moving heavy LLM traffic to a queue-backed worker pool.
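A minimal sketch of the distributed-limiter option, assuming redis-py and a Redis instance every worker can reach (the key prefix and the 100-calls-per-minute budget are made up for illustration):

```python
import time

import redis  # shared Redis coordinates all workers

r = redis.Redis(host="localhost", port=6379)
MAX_CALLS_PER_MINUTE = 100  # one budget shared across every worker

def acquire_slot() -> None:
    """Block until the shared per-minute budget has room (fixed-window counter)."""
    while True:
        window = f"llm_calls:{int(time.time() // 60)}"
        count = r.incr(window)
        if count == 1:
            r.expire(window, 120)  # clean up old windows
        if count <= MAX_CALLS_PER_MINUTE:
            return
        time.sleep(1)  # over budget: wait for the window to roll over

def call_llm(prompt: str):
    acquire_slot()
    return llm.invoke(prompt)
```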
How to Debug It
- Check the exact exception class
  - Look for openai.RateLimitError, anthropic.RateLimitError, or HTTP 429 Too Many Requests.
  - If it’s wrapped by LangGraph, inspect the root cause in logs or stack traces.
- Log per-node call counts
  - Add counters around every LLM invocation.
  - You want to know which node explodes under load.
```python
import logging

log = logging.getLogger(__name__)

def tracked_call(node_name, prompt):
    # Log which node made the call so hot spots show up in production logs
    log.info("node=%s calling_llm", node_name)
    return llm.invoke(prompt)
```
- Test with concurrency set to 1
  - Run one request at a time.
  - If the error disappears, your problem is fan-out or shared-account throughput (see the config sketch after this list).
- Inspect retries and worker count
  - Search for retry decorators, custom loops, Celery workers, Gunicorn workers, or async gather patterns.
  - A hidden retry loop plus parallel workers is a classic production-only failure mode.
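One quick way to run that one-at-a-time test without rewriting any nodes: LangGraph accepts a standard runnable config, and recent versions honor max_concurrency to cap how many tasks execute in parallel (treat the exact behavior as version-dependent):

```python
# Diagnostic run: serialize execution for a single request.
# If the 429s disappear here, the problem is fan-out or shared-account
# throughput rather than any single prompt.
result = app.invoke(
    {"items": ["first doc", "second doc"]},
    config={"max_concurrency": 1},
)
```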
Prevention
- Put every model call behind a shared throttle or queue.
- Use exponential backoff on 429 responses; never instant-retry.
- Keep graph state small: trim messages, summarize history, and avoid repeated full-context calls.
- Load test your LangGraph app with production-like concurrency before shipping.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.