How to Fix 'rate limit exceeded when scaling' in LangChain (Python)

By Cyprian Aarons · Updated 2026-04-21

When you see rate limit exceeded when scaling in a LangChain Python app, it usually means your app increased concurrency faster than your model provider’s quota can handle. This shows up when you move from a single prompt to batch(), abatch(), parallel chains, or multiple workers hitting the same API key.

In practice, the failure is almost always about request rate, token rate, or both. LangChain is just the layer surfacing the provider error through classes like ChatOpenAI, AzureChatOpenAI, or ChatAnthropic.

The Most Common Cause

The #1 cause is uncontrolled concurrency.

You start with something like a single call, then “scale” by wrapping it in asyncio.gather(), RunnableParallel, or a worker pool. The provider sees a burst of requests and returns errors like:

  • openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for requests...'}}
  • anthropic.RateLimitError: 429 rate_limit_exceeded
  • httpx.HTTPStatusError: Client error '429 Too Many Requests'

Broken vs fixed pattern

Broken pattern → Fixed pattern

  • Fire off unlimited parallel calls → Cap concurrency with max_concurrency or a semaphore
  • Retry only after the whole batch fails → Retry per request with backoff
  • Scale workers without checking quota → Match worker count to RPM/TPM limits

# BROKEN
import asyncio
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

async def summarize(text: str):
    return await llm.ainvoke(f"Summarize this: {text}")

texts = [f"Document {i}" for i in range(100)]

# This can easily burst past RPM/TPM limits
results = await asyncio.gather(*[summarize(t) for t in texts])

# FIXED
import asyncio
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

sem = asyncio.Semaphore(5)  # tune to your provider limits

async def summarize(text: str):
    async with sem:
        return await llm.ainvoke(f"Summarize this: {text}")

texts = [f"Document {i}" for i in range(100)]
results = await asyncio.gather(*[summarize(t) for t in texts])

If you’re using LangChain runnables, set concurrency explicitly:

results = await chain.abatch(
    inputs,
    config={"max_concurrency": 5}
)

That one setting often fixes the issue because LangChain stops flooding the provider.

Other Possible Causes

1) Token usage exceeds TPM even if request count looks fine

You may be under the requests-per-minute limit but over tokens-per-minute. Large prompts, long chat history, and big retrieved context chunks are common triggers.

# Bad: huge context stuffed into every call
prompt = f"""
Answer using all this context:
{very_large_retrieval_context}
Question: {question}
"""

Fix by trimming context before sending it:

# Better: cap retrieved docs and shorten history
retriever.search_kwargs["k"] = 4
memory_messages = memory_messages[-6:]

2) Multiple processes or pods sharing one API key

A single Python worker may be fine, but three Gunicorn workers plus a Celery queue plus a cron job will all hit the same quota.

# Example deployment issue
gunicorn app:app --workers 4
celery -A tasks worker --concurrency=8

If all of those share one key, their request rates add up against the same quota, so each process sees rate limiting that looks random. Reduce worker count, or split traffic across keys/accounts where policy allows it.
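You can also give each process a conservative client-side limit so the combined rate stays under the shared quota. Recent langchain-core versions ship an InMemoryRateLimiter that attaches to a chat model; the numbers below are placeholders you would tune to your real quota divided by the number of workers sharing the key:

# Sketch: per-process throttling (values are placeholders, not recommendations)
from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain_openai import ChatOpenAI

rate_limiter = InMemoryRateLimiter(
    requests_per_second=0.5,    # ~30 requests/minute from this one process
    check_every_n_seconds=0.1,  # how often to poll for an available slot
    max_bucket_size=5,          # maximum burst size
)

llm = ChatOpenAI(model="gpt-4o-mini", rate_limiter=rate_limiter)

Note that this limiter only throttles the process it lives in; it caps each worker rather than coordinating across them.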

3) Missing retries with exponential backoff

LangChain won’t magically absorb every 429 unless you configure retries around the model call.

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-mini",
    max_retries=0,  # bad if you expect bursts
)

Use retries and backoff:

llm = ChatOpenAI(
    model="gpt-4o-mini",
    max_retries=3,
)

For stricter control, wrap calls with your own retry policy using tenacity.
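A minimal tenacity sketch, assuming you only want to retry on the provider's rate-limit error (the wait times and attempt count are illustrative, not recommendations):

from langchain_openai import ChatOpenAI
from openai import RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

llm = ChatOpenAI(model="gpt-4o-mini", max_retries=0)  # let tenacity own the retry policy

@retry(
    retry=retry_if_exception_type(RateLimitError),       # only retry 429-style errors
    wait=wait_exponential(multiplier=1, min=2, max=30),  # 2s, 4s, 8s... capped at 30s
    stop=stop_after_attempt(5),
)
def summarize(text: str) -> str:
    return llm.invoke(f"Summarize this: {text}").content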

4) Streaming plus fan-out creates hidden bursts

Streaming feels lighter, but each stream still counts as a full request (and its input tokens) against your quota the moment it starts, so launching many at once creates the same burst.

# Bad: many concurrent streaming calls
streams = [llm.astream(prompt) for prompt in prompts]

Throttle stream creation the same way you throttle normal invocations.
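For example, a sketch reusing the semaphore pattern from earlier (the limit of 3 is an assumption to tune against your own quota):

import asyncio

sem = asyncio.Semaphore(3)  # cap concurrent streams; tune to your limits

async def stream_one(prompt: str) -> str:
    async with sem:
        chunks = []
        async for chunk in llm.astream(prompt):
            chunks.append(chunk.content)
        return "".join(chunks)

results = await asyncio.gather(*[stream_one(p) for p in prompts])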

How to Debug It

  1. Read the exact exception

    • Look for the provider class name and HTTP status.
    • Examples:
      • openai.RateLimitError
      • anthropic.RateLimitError
      • httpx.HTTPStatusError: 429 Too Many Requests
    • If it says “requests per minute” or “tokens per minute,” that tells you which quota is failing.
  2. Check whether scaling changed concurrency

    • Compare single-request behavior vs batch behavior.
    • Test these separately:
      await llm.ainvoke("test")
      await chain.abatch(inputs, config={"max_concurrency": 1})
      
    • If concurrency 1 works and higher values fail, you found the issue.
  3. Measure prompt size

    • Log approximate input tokens (see the token-count sketch after the diagnostic log below).
    • Watch retrieved docs, chat history, and tool outputs.
    • If failures happen on long documents only, it’s probably TPM rather than RPM.
  4. Inspect deployment topology

    • Count workers, pods, threads, and background jobs.
    • One API key across four services can look like “random” rate limiting.
    • Add request logging with timestamps so you can see bursts.

A simple diagnostic log helps:

import time

start = time.time()
try:
    result = await llm.ainvoke(prompt)
except Exception as e:
    print(type(e).__name__, str(e))
    raise
finally:
    print("elapsed_sec=", round(time.time() - start, 2))

Prevention

  • Set explicit concurrency caps everywhere:

    • Semaphore for async code
    • max_concurrency for LangChain runnables
    • worker limits in Celery/Gunicorn/Kubernetes
  • Build retry logic with backoff around model calls.

    • Treat 429 as expected under load, not exceptional noise.
  • Keep prompts small and predictable.

    • Trim chat history.
    • Limit retrieval depth.
    • Avoid sending entire documents unless necessary.

If you’re seeing this error during scaling, don’t start by changing models. Start by reducing burstiness, then verify token load, then add retries. In most LangChain Python apps, that fixes it fast.


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

