How to Fix 'rate limit exceeded when scaling' in LangGraph (Python)

By Cyprian Aarons
Updated 2026-04-21

When you see "rate limit exceeded" errors while scaling a LangGraph Python app, it usually means your graph is firing too many model calls at once. This shows up when you add parallel branches, fan-out loops, retries, or multiple users hitting the same agent at the same time.

In practice, the problem is rarely LangGraph itself. It’s usually your execution pattern pushing OpenAI, Anthropic, Azure OpenAI, or another provider past its request-per-minute or token-per-minute limits.

The Most Common Cause

The #1 cause is uncontrolled parallelism inside a graph node or across graph branches.

A common mistake is to spawn multiple LLM calls at once with asyncio.gather() or to fan out too aggressively in a conditional edge. LangGraph will happily execute your graph logic, but the provider will reject bursts with errors like:

  • RateLimitError: Error code: 429
  • openai.RateLimitError: Rate limit exceeded
  • anthropic.RateLimitError: rate_limit_error
  • AzureOpenAIError: TooManyRequests

Broken vs fixed pattern

Broken pattern                                Fixed pattern
--------------------------------------------  --------------------------
Fires many requests at once                   Limits concurrency
No backoff on 429s                            Retries with jitter
Fan-out without throttling                    Queue or batch work
Reuses same provider quota for all users      Isolates per-tenant limits

# BROKEN
import asyncio
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

async def summarize_many(chunks):
    # This can burst 20+ requests at once
    tasks = [llm.ainvoke(f"Summarize: {chunk}") for chunk in chunks]
    return await asyncio.gather(*tasks)

# FIXED
import asyncio
from tenacity import retry, wait_exponential_jitter, stop_after_attempt
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
sem = asyncio.Semaphore(3)  # cap concurrency

@retry(wait=wait_exponential_jitter(initial=1, max=20), stop=stop_after_attempt(5))
async def safe_invoke(prompt: str):
    async with sem:
        return await llm.ainvoke(prompt)

async def summarize_many(chunks):
    tasks = [safe_invoke(f"Summarize: {chunk}") for chunk in chunks]
    return await asyncio.gather(*tasks)

If this error appears only after you “scale” from one user to many, this is almost always the issue.

Other Possible Causes

1) Retries multiplying traffic

If your HTTP client, LangChain wrapper, and application code all retry independently, one failed call can become 6–10 calls very quickly.

# Watch out for stacked retries
from tenacity import retry, stop_after_attempt

@retry(stop=stop_after_attempt(3))
def call_llm():
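    # the client layer may already retry 429s on its own (e.g. ChatOpenAI's max_retries),
    # so this decorator stacks another retry loop on top of it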
    return llm.invoke("...")

If the SDK already retries 429s and you add another retry layer on top, you can amplify the problem.
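
One way to avoid the stacking is to keep exactly one retry layer. A minimal sketch, assuming the langchain_openai client from the earlier examples, where max_retries controls the client-level retries:

# Keep a single retry layer
from langchain_openai import ChatOpenAI
from tenacity import retry, wait_exponential_jitter, stop_after_attempt

# Turn off client-level retries so tenacity owns all backoff decisions
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0, max_retries=0)

@retry(wait=wait_exponential_jitter(initial=1, max=20), stop=stop_after_attempt(5))
def call_llm(prompt: str):
    return llm.invoke(prompt)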

2) Graph loops that re-enter too often

A recursive LangGraph setup can accidentally keep calling the model on every loop iteration.

# Example: conditional edge keeps routing back to LLM node
# without a hard stop condition.

Check your StateGraph transitions and make sure your loop has (see the sketch after this list):

  • a maximum iteration count
  • a termination condition based on state
  • no accidental self-edge on failure paths
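
A minimal sketch of a loop with a hard stop, assuming a plain TypedDict state; the node name, the iteration cap of 5, and the done flag are illustrative:

# Cap a looping LLM node with an iteration counter in state
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    iterations: int
    done: bool

def llm_node(state: AgentState) -> AgentState:
    # ... call the model here ...
    return {"iterations": state["iterations"] + 1, "done": False}

def should_continue(state: AgentState) -> str:
    # Terminate on the state flag OR after a maximum number of iterations
    if state["done"] or state["iterations"] >= 5:
        return "end"
    return "continue"

graph = StateGraph(AgentState)
graph.add_node("llm", llm_node)
graph.set_entry_point("llm")
graph.add_conditional_edges("llm", should_continue, {"continue": "llm", "end": END})
app = graph.compile()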

3) Batch size too large

If you process documents in batches of 50 and each item triggers an LLM call, you can exceed RPM instantly.

batch_size = 50  # too large for some providers

for batch in batches(docs, batch_size=batch_size):
    results = await asyncio.gather(*(llm.ainvoke(doc) for doc in batch))

Drop batch size to something aligned with your quota and add pacing between batches.
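
A minimal sketch of paced batching, assuming the llm client from the earlier examples; the batch size, pause length, and the batches() helper are illustrative, not provider recommendations:

import asyncio

BATCH_SIZE = 5        # assumption: well under your RPM quota
PAUSE_SECONDS = 2.0   # pause between batches to spread requests out

def batches(items, batch_size):
    # Simple chunking helper (hypothetical; swap in your own batching utility)
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

async def summarize_all(docs):
    results = []
    for batch in batches(docs, BATCH_SIZE):
        results += await asyncio.gather(*(llm.ainvoke(f"Summarize: {d}") for d in batch))
        await asyncio.sleep(PAUSE_SECONDS)  # pacing between batches
    return results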

4) Multiple workers sharing one API key

This happens a lot in FastAPI, Celery, or Kubernetes deployments. Each worker looks fine alone; together they exceed the account limit.

# Example symptom: 8 replicas * 5 concurrent requests each = burst traffic
replicas: 8
concurrency: 5

If all workers share one provider key, they also share one rate limit bucket unless your vendor scopes limits differently.
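
To put rough numbers on it: 8 replicas that each allow 5 concurrent calls can hold 40 requests in flight at once; if a call takes about 2 seconds, that works out to roughly 20 requests per second, or around 1,200 requests per minute hitting a single key.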

How to Debug It

  1. Confirm the exact provider error

    • Look for 429 responses.
    • Check whether it’s openai.RateLimitError, anthropic.RateLimitError, or an Azure throttling error.
    • If the failure only surfaces as a LangGraph error, inspect the underlying exception chain (the provider's 429 is usually in __cause__).
  2. Measure concurrency at runtime

    • Log how many model calls are active at once (see the sketch after this list).
    • Count node executions per second.
    • If spikes happen during fan-out nodes or parallel edges, you found the source.
  3. Inspect your graph topology

    • Review every conditional edge and loop.
    • Find any node that can trigger multiple downstream LLM calls.
    • Add a hard stop counter in state if recursion is possible.
  4. Check deployment-wide traffic

    • Compare single-user local runs vs production traffic.
    • Look at number of replicas, worker processes, and background jobs.
    • If local works but prod fails under load, it’s usually aggregate throughput.
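
For step 2, a minimal sketch of in-process concurrency logging, assuming the async llm client from the earlier examples; the gauge and wrapper are illustrative, and in a multi-worker deployment you would aggregate this in your metrics system instead:

import time

active_calls = 0  # simple in-process gauge of concurrent model calls

async def tracked_invoke(prompt: str):
    global active_calls
    active_calls += 1
    print(f"{time.strftime('%H:%M:%S')} active model calls: {active_calls}")
    try:
        return await llm.ainvoke(prompt)
    finally:
        active_calls -= 1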

Prevention

  • Put a semaphore or queue around every external model call that can fan out (see the rate-limiter sketch after this list).
  • Add exponential backoff with jitter for 429 responses.
  • Set explicit limits on graph loops, batch sizes, and concurrent workers.
  • Track per-provider usage metrics so you know when you’re close to RPM/TPM ceilings before production traffic hits them.
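
If you prefer a client-level guard over a hand-rolled semaphore, here is a minimal sketch using langchain_core's InMemoryRateLimiter, assuming it is available in your installed langchain-core version and that the numbers are tuned to your own quota:

# Client-level throttling shared by every call made through this model
from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain_openai import ChatOpenAI

limiter = InMemoryRateLimiter(
    requests_per_second=2,      # assumption: tune to your provider quota
    check_every_n_seconds=0.1,  # how often waiting callers re-check the bucket
    max_bucket_size=4,          # cap on burst size
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0, rate_limiter=limiter)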

If you want a clean rule of thumb: LangGraph should orchestrate work, not flood the model provider. Keep concurrency bounded, retries controlled, and loops finite.


By Cyprian Aarons, AI Consultant at Topiax.