How to Fix 'rate limit exceeded in production' in LangChain (Python)
A "rate limit exceeded" error in production usually means your app is sending more requests to the model provider than your account or deployment allows. In practice, it shows up when you move from local testing to real traffic, or when a chain or agent starts making multiple LLM calls per user request.
The tricky part is that the error often looks like a LangChain problem, but the root cause is usually request volume, retries, concurrency, or token usage against OpenAI, Anthropic, Azure OpenAI, or another provider.
The Most Common Cause
The #1 cause is uncontrolled concurrency: your code fires too many invoke() or ainvoke() calls at once. This happens a lot with asyncio.gather(), background workers, batch jobs, or agent loops that fan out requests.
Here’s the broken pattern:
```python
import asyncio

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

async def summarize(text: str):
    return await llm.ainvoke(f"Summarize this:\n{text}")

async def main(docs):
    # BAD: unbounded parallelism -- one task per document, all fired at once
    results = await asyncio.gather(*(summarize(doc) for doc in docs))
    return results
```
And here’s the fixed pattern with bounded concurrency:
```python
import asyncio

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Allow at most 3 requests to the model at any one time
semaphore = asyncio.Semaphore(3)

async def summarize(text: str):
    async with semaphore:
        return await llm.ainvoke(f"Summarize this:\n{text}")

async def main(docs):
    results = await asyncio.gather(*(summarize(doc) for doc in docs))
    return results
```
If you are using LangChain batches, the same rule applies. Don’t blast 100 inputs at once unless your provider quota can handle it.
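On recent LangChain versions the simplest throttle for batches is the runnable config's max_concurrency option. Here is a minimal sketch, assuming docs is your list of inputs:

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompts = [f"Summarize this:\n{doc}" for doc in docs]

# Process every input, but keep at most 5 requests in flight at a time
results = llm.batch(prompts, config={"max_concurrency": 5})
```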
| Broken pattern | Fixed pattern |
|---|---|
| asyncio.gather() over hundreds of tasks | Semaphore, queue, or worker pool |
| chain.batch(inputs) with no throttling | Smaller batches + delay between batches |
| Agent tool loops without guardrails | Max iterations + rate-limited tools |
Other Possible Causes
1. Retry storms from automatic retries
LangChain and the provider SDK may retry failed requests. If every request gets retried immediately, you multiply traffic during an outage or quota spike.
```python
from langchain_openai import ChatOpenAI

# Risky if your app already has high traffic
llm = ChatOpenAI(
    model="gpt-4o-mini",
    max_retries=6,
)
```
Fix it by lowering retries and adding backoff at the application layer:
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-mini",
    max_retries=2,
)
```
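If you want backoff in your own code rather than relying only on the SDK, a minimal sketch looks like this. It assumes the openai v1 SDK, which exports RateLimitError at the package root and is what langchain_openai uses under the hood; the helper name invoke_with_backoff is just illustrative.

```python
import asyncio
import random

from openai import RateLimitError  # the error that bubbles up through langchain_openai

async def invoke_with_backoff(llm, prompt, max_attempts=4):
    """Retry one call with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return await llm.ainvoke(prompt)
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            # 1s, 2s, 4s, ... plus up to 1s of jitter so workers don't retry in lockstep
            await asyncio.sleep(2 ** attempt + random.random())
```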
2. A chain is making more LLM calls than you expect
A single user request may trigger multiple calls through LLMChain, RetrievalQA, agents, memory summarization, and tool use. What looks like one prompt can become five or ten API calls.
```python
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
)

# One user request can fan out into multiple internal calls
answer = qa.invoke({"query": "What does our policy say about refunds?"})
```
If you need fewer calls, simplify the chain or cache retrieval results.
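For repeated prompts, LangChain's built-in LLM cache is an easy complement to caching retrieval results. On recent versions the imports live in langchain_core; older releases expose the same helpers from langchain.globals and langchain.cache.

```python
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache

# Identical prompts are now answered from the cache instead of hitting the API again
set_llm_cache(InMemoryCache())
```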
3. Token usage is too high per request
Rate limits are not only about request count. Many providers also enforce tokens-per-minute limits, so long prompts and huge context windows can trigger throttling.
```python
# BAD: dumping large documents into every prompt
prompt = f"""
Use all of this context:
{very_large_policy_document}
Question: {question}
"""
```
Fix by chunking and retrieving only relevant passages:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_text(very_large_policy_document)
```
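The splitter only handles chunking; to actually send less context per request you also need to retrieve just the top few chunks. A minimal sketch, assuming a recent langchain-core that ships InMemoryVectorStore (any vector store such as FAISS or Chroma works the same way):

```python
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

# Index the chunks once, then pull only the most relevant ones per question
vectorstore = InMemoryVectorStore.from_texts(chunks, OpenAIEmbeddings())
relevant = vectorstore.similarity_search(question, k=3)

context = "\n\n".join(doc.page_content for doc in relevant)
prompt = f"Use this context:\n{context}\n\nQuestion: {question}"
```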
4. Multiple app instances share one API key
This is common in production behind autoscaling. Two pods in Kubernetes or several Celery workers can each look fine locally but together exceed account limits.
```yaml
# Example symptom: 6 replicas all using the same OpenAI key
replicas: 6
env:
  - name: OPENAI_API_KEY
    valueFrom:
      secretKeyRef:
        name: openai-secret
        key: api_key
```
If traffic spikes, reduce replicas temporarily or partition workloads across keys/accounts if your vendor allows it.
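Another mitigation, assuming you are on langchain-core 0.2.24 or newer where the rate_limiter parameter exists, is to give each replica a per-process rate limiter so that replicas times per-replica rate stays under the shared account limit:

```python
from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain_openai import ChatOpenAI

# With 6 replicas, ~1 request/second each keeps the fleet near 360 requests/minute total
rate_limiter = InMemoryRateLimiter(requests_per_second=1, max_bucket_size=5)

llm = ChatOpenAI(model="gpt-4o-mini", rate_limiter=rate_limiter)
```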
How to Debug It
- Inspect the exact exception. Look for provider-specific errors such as:
  - openai.RateLimitError
  - anthropic.RateLimitError
  - HTTP 429 Too Many Requests

  In LangChain logs, you'll often see something like:
  - RateLimitError: Error code: 429
  - openai.RateLimitError: Rate limit reached for gpt-4o-mini...
- Count how many LLM calls one user action triggers. Add tracing with LangSmith or simple logging around every .invoke(), .ainvoke(), .batch(), and tool call (see the sketch after this list). If one API request causes five model calls, your "one request" assumption is wrong.
- Check concurrency at runtime. Print active task counts, worker counts, and batch sizes. If you see dozens of parallel invocations from a single endpoint, throttle them.
- Compare tokens and RPM/TPM limits. Check your provider dashboard for:
  - requests per minute
  - tokens per minute
  - daily quota

  A low RPM issue needs throttling; a low TPM issue needs shorter prompts and smaller retrieved context.
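Here is a minimal call-counting sketch using LangChain's callback system. The class name LLMCallCounter is just illustrative, and qa is the chain from the earlier example:

```python
from langchain_core.callbacks import BaseCallbackHandler

class LLMCallCounter(BaseCallbackHandler):
    """Counts how many model calls a single request triggers."""

    def __init__(self):
        self.calls = 0

    def on_llm_start(self, serialized, prompts, **kwargs):
        self.calls += 1

    def on_chat_model_start(self, serialized, messages, **kwargs):
        self.calls += 1

counter = LLMCallCounter()
answer = qa.invoke(
    {"query": "What does our policy say about refunds?"},
    config={"callbacks": [counter]},
)
print(f"Model calls for this request: {counter.calls}")
```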
Prevention
- Put a hard cap on concurrency for every LLM-backed endpoint.
- Use retries sparingly and add exponential backoff with jitter.
- Cache repeated outputs where the same prompt appears often.
- Keep prompts short and retrieve only relevant context.
- Load test before production with realistic traffic patterns, not just one-off prompts.
If you want a practical rule: treat LLM calls like database writes. Bound them, batch them carefully, and assume production traffic will be messier than local tests.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.