# How to Fix 'rate limit exceeded when scaling' in LangChain (Python)
When you see a "rate limit exceeded" error while scaling a LangChain Python app, it usually means your app increased concurrency faster than your model provider's quota can absorb. It shows up when you move from a single prompt to `batch()`, `abatch()`, parallel chains, or multiple workers hitting the same API key.

In practice, the failure is almost always about request rate, token rate, or both. LangChain is just the layer surfacing the provider error through classes like `ChatOpenAI`, `AzureChatOpenAI`, or `ChatAnthropic`.
## The Most Common Cause
The #1 cause is uncontrolled concurrency.
You start with something like a single call, then "scale" by wrapping it in `asyncio.gather()`, `RunnableParallel`, or a worker pool. The provider sees a burst of requests and returns errors like:

- `openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for requests...'}}`
- `anthropic.RateLimitError: 429 rate_limit_exceeded`
- `httpx.HTTPStatusError: Client error '429 Too Many Requests'`
### Broken vs fixed pattern
| Broken pattern | Fixed pattern |
|---|---|
| Fire off unlimited parallel calls | Cap concurrency with max_concurrency or a semaphore |
| Retry only after the whole batch fails | Retry per request with backoff |
| Scale workers without checking quota | Match worker count to RPM/TPM limits |
```python
# BROKEN
import asyncio
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

async def summarize(text: str):
    return await llm.ainvoke(f"Summarize this: {text}")

texts = [f"Document {i}" for i in range(100)]

# This can easily burst past RPM/TPM limits
results = await asyncio.gather(*[summarize(t) for t in texts])
```
```python
# FIXED
import asyncio
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
sem = asyncio.Semaphore(5)  # tune to your provider limits

async def summarize(text: str):
    async with sem:
        return await llm.ainvoke(f"Summarize this: {text}")

texts = [f"Document {i}" for i in range(100)]
results = await asyncio.gather(*[summarize(t) for t in texts])
```
If you’re using LangChain runnables, set concurrency explicitly:
```python
results = await chain.abatch(
    inputs,
    config={"max_concurrency": 5},
)
```
That one setting often fixes the issue because LangChain stops flooding the provider.
## Other Possible Causes
### 1) Token usage exceeds TPM even if request count looks fine
You may be under the requests-per-minute limit but over tokens-per-minute. Large prompts, long chat history, and big retrieved context chunks are common triggers.
```python
# Bad: huge context stuffed into every call
prompt = f"""
Answer using all this context:
{very_large_retrieval_context}
Question: {question}
"""
```
Fix by trimming context before sending it:
```python
# Better: cap retrieved docs and shorten history
retriever.search_kwargs["k"] = 4
memory_messages = memory_messages[-6:]
```
### 2) Multiple processes or pods sharing one API key
A single Python worker may be fine, but three Gunicorn workers plus a Celery queue plus a cron job will all hit the same quota.
```shell
# Example deployment issue
gunicorn app:app --workers 4
celery -A tasks worker --concurrency=8
```
If all of those use one key, your effective request rate is multiplied across every worker while the quota stays fixed. Reduce worker counts or split traffic across keys/accounts where your provider's policy allows it.
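It helps to budget this out explicitly. A rough sketch of the arithmetic, where the 500 RPM quota and the worker counts are made-up numbers (substitute your real quota and topology):

```python
# Rough budget math: one provider quota split across every concurrent caller.
# All numbers here are hypothetical -- substitute your real quota and topology.
total_rpm_quota = 500  # e.g. the requests-per-minute limit on one API key

callers = {
    "gunicorn_workers": 4,
    "celery_concurrency": 8,
    "cron_job": 1,
}

total_callers = sum(callers.values())          # 13 processes share the key
per_caller_rpm = total_rpm_quota // total_callers

print(total_callers)   # 13
print(per_caller_rpm)  # 38 -> each caller must stay under ~38 requests/min
```

If each of those 13 callers assumes it has the full 500 RPM to itself, you will see "random" 429s the moment a few of them burst at once.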
### 3) Missing retries with exponential backoff
LangChain won’t magically absorb every 429 unless you configure retries around the model call.
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-mini",
    max_retries=0,  # bad if you expect bursts
)
```
Use retries and backoff:
```python
llm = ChatOpenAI(
    model="gpt-4o-mini",
    max_retries=3,
)
```
For stricter control, wrap calls with your own retry policy using tenacity.
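tenacity's `retry`, `wait_exponential`, and `stop_after_attempt` decorators express this policy declaratively; the hand-rolled equivalent below is a sketch of what such a policy actually does. The helper name is illustrative, and the exception filter is deliberately broad here (narrow it to your provider's rate-limit error, e.g. `openai.RateLimitError`, in real code):

```python
import asyncio
import random

async def with_backoff(coro_fn, *args, max_attempts=5, base=1.0, cap=30.0):
    """Retry an async call with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return await coro_fn(*args)
        except Exception:  # narrow to e.g. openai.RateLimitError in real code
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: sleep a random slice of the exponential window,
            # so retries from many workers don't re-synchronize into a burst.
            await asyncio.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# Usage with an `llm` like the ones above:
# result = await with_backoff(llm.ainvoke, prompt)
```

The jitter matters: without it, every worker that got a 429 at the same moment retries at the same moment too.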
### 4) Streaming plus fan-out creates hidden bursts
Streaming feels lighter, but each stream still counts as a full request against your quota, so starting many streams at once creates the same burst.
```python
# Bad: many concurrent streaming calls
streams = [llm.astream(prompt) for prompt in prompts]
```
Throttle stream creation the same way you throttle normal invocations.
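One way to do that is to hold a semaphore slot for the whole lifetime of each stream, not just its creation. A sketch, where `stream_one`/`stream_all` are illustrative helper names and the `.content` attribute matches what LangChain chat-model chunks expose:

```python
import asyncio

async def stream_one(llm, prompt: str, sem: asyncio.Semaphore) -> str:
    chunks = []
    async with sem:  # hold the slot until the stream is fully consumed
        async for chunk in llm.astream(prompt):
            chunks.append(chunk.content)  # chat-model chunks expose .content
    return "".join(chunks)

async def stream_all(llm, prompts, limit: int = 5):
    sem = asyncio.Semaphore(limit)  # at most `limit` streams in flight
    return await asyncio.gather(*(stream_one(llm, p, sem) for p in prompts))
```

Releasing the slot only after the stream finishes is the key detail; releasing it as soon as the stream starts would let unlimited streams run concurrently.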
## How to Debug It
**1. Read the exact exception.**

- Look for the provider class name and HTTP status.
- Examples:
  - `openai.RateLimitError`
  - `anthropic.RateLimitError`
  - `httpx.HTTPStatusError: 429 Too Many Requests`
- If it says "requests per minute" or "tokens per minute," that tells you which quota is failing.
**2. Check whether scaling changed concurrency.**

- Compare single-request behavior vs batch behavior.
- Test these separately:

```python
await llm.ainvoke("test")
await chain.abatch(inputs, config={"max_concurrency": 1})
```

- If concurrency 1 works and higher values fail, you found the issue.
**3. Measure prompt size.**

- Log approximate input tokens.
- Watch retrieved docs, chat history, and tool outputs.
- If failures happen only on long documents, it's probably TPM rather than RPM.
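For logging purposes, a crude character-based estimate is often enough to spot TPM problems. The 4-characters-per-token ratio below is a rough heuristic for English text, not the provider's real tokenizer; use the provider's tokenizer (e.g. tiktoken for OpenAI models) when you need exact counts:

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Swap in the provider's tokenizer (e.g. tiktoken) for exact counts.
    return max(1, len(text) // 4)

prompt = "Summarize this: " + "word " * 2000  # 10,016 characters
print(approx_tokens(prompt))  # 2504 -- log this alongside each request
```

Logging this per request makes it obvious when a handful of oversized prompts is what blows the tokens-per-minute budget.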
**4. Inspect deployment topology.**

- Count workers, pods, threads, and background jobs.
- One API key shared across four services can look like "random" rate limiting.
- Add request logging with timestamps so you can see bursts.
A simple diagnostic log helps:
```python
import time

start = time.time()
try:
    result = await llm.ainvoke(prompt)
except Exception as e:
    print(type(e).__name__, str(e))
    raise
finally:
    print("elapsed_sec=", round(time.time() - start, 2))
```
## Prevention
- Set explicit concurrency caps everywhere:
  - `Semaphore` for async code
  - `max_concurrency` for LangChain runnables
  - worker limits in Celery/Gunicorn/Kubernetes
- Build retry logic with backoff around model calls. Treat `429` as expected under load, not exceptional noise.
- Keep prompts small and predictable:
  - Trim chat history.
  - Limit retrieval depth.
  - Avoid sending entire documents unless necessary.
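At the deployment layer, the same caps look like this. The counts are placeholders, not recommendations; derive them from your quota divided by the number of services sharing the key:

```shell
# Hypothetical caps for one shared key -- tune to (your RPM quota / callers)
gunicorn app:app --workers 2 --threads 2
celery -A tasks worker --concurrency=4
```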
If you’re seeing this error during scaling, don’t start by changing models. Start by reducing burstiness, then verify token load, then add retries. In most LangChain Python apps, that fixes it fast.
## Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.