How to Fix 'rate limit exceeded during development' in LangChain (Python)
When you see a 'rate limit exceeded' error during development in a LangChain Python app, it usually means your code is sending more requests to the model provider than your account or project quota allows. In practice, this shows up during testing loops, agent retries, chain fan-out, or when you accidentally call the same model multiple times per user request.
The important part: this is rarely a LangChain bug. It’s usually a request pattern problem, a config problem, or both.
The Most Common Cause
The #1 cause is repeated LLM calls inside a loop or agent executor without any throttling. In LangChain, this often happens when you call invoke() repeatedly on the same input, or when an agent retries tool calls and each retry triggers another model request.
Here’s the broken pattern and the fix:
| Broken | Fixed |
|---|---|
| Calls the model for every item with no batching/backoff | Batches work or adds retry/backoff control |
# BROKEN: repeated calls can trigger provider rate limits fast
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini")
questions = [
"Summarize policy A",
"Summarize policy B",
"Summarize policy C",
]
answers = []
for q in questions:
    answers.append(llm.invoke(q).content)
# FIXED: reduce call frequency and add retry behavior
from langchain_openai import ChatOpenAI
from tenacity import retry, wait_exponential_jitter, stop_after_attempt
llm = ChatOpenAI(
model="gpt-4o-mini",
max_retries=2,
)
@retry(wait=wait_exponential_jitter(initial=1, max=10), stop=stop_after_attempt(3))
def ask(question: str) -> str:
    return llm.invoke(question).content
questions = [
"Summarize policy A",
"Summarize policy B",
"Summarize policy C",
]
answers = [ask(q) for q in questions]
If you’re using an agent, the same issue can be hidden behind tool loops. A stack trace often looks like this:
openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit exceeded'}}
Or in LangChain wrapper terms:
langchain_core.exceptions.OutputParserException
openai.RateLimitError: Rate limit exceeded during development
The fix is to reduce unnecessary invocations first. Retries help, but they do not solve a bad call pattern.
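If you cannot avoid bursts entirely, you can also throttle on the client side. Recent langchain-core versions (0.2.24+) ship an InMemoryRateLimiter that chat models accept through their rate_limiter parameter; the numbers below are illustrative, so tune them to your own quota:

# Sketch: client-side throttling with LangChain's built-in rate limiter
# (requires a langchain-core version that provides langchain_core.rate_limiters)
from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain_openai import ChatOpenAI

rate_limiter = InMemoryRateLimiter(
    requests_per_second=0.5,    # at most one request every two seconds
    check_every_n_seconds=0.1,  # how often a waiting request re-checks the bucket
    max_bucket_size=5,          # allow short bursts of up to five requests
)

llm = ChatOpenAI(model="gpt-4o-mini", rate_limiter=rate_limiter)

This slows your own code down before the provider has to, and it pairs well with the retry logic above.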
Other Possible Causes
1. Your concurrency is too high
If you run multiple tasks in parallel with asyncio.gather() or a thread pool, you can spike request volume instantly.
# Too aggressive
results = await asyncio.gather(*[
    llm.ainvoke(prompt) for prompt in prompts
])
Use bounded concurrency instead:
import asyncio

sem = asyncio.Semaphore(3)

async def limited_call(prompt):
    async with sem:
        return await llm.ainvoke(prompt)

results = await asyncio.gather(*[limited_call(p) for p in prompts])
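If you would rather not manage a semaphore yourself, LangChain runnables also accept a max_concurrency value in the config passed to batch() or abatch(). A minimal sketch, assuming llm and prompts from the snippet above:

# Bound parallel model calls through the runnable config instead of a manual semaphore
results = await llm.abatch(prompts, config={"max_concurrency": 3})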
2. You are hitting token-based limits, not just request limits
Some providers throttle by tokens per minute. A single long prompt plus a long output can hit the ceiling even if request count is low.
llm = ChatOpenAI(
model="gpt-4o-mini",
max_tokens=4000,
)
Reduce output size and prompt size:
llm = ChatOpenAI(
model="gpt-4o-mini",
max_tokens=800,
)
Also trim chat history before passing it into MessagesPlaceholder.
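One way to do that is the trim_messages helper from langchain_core (available in recent releases). The token budget and messages below are illustrative:

# Sketch: cap history size before it reaches the prompt
from langchain_core.messages import (
    AIMessage,
    HumanMessage,
    SystemMessage,
    trim_messages,
)
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

chat_history = [
    SystemMessage("You are a concise policy assistant."),
    HumanMessage("Summarize policy A"),
    AIMessage("Policy A covers ..."),
    HumanMessage("Now compare it with policy B"),
]

# Keep only the most recent messages that fit the budget, preserving the system message
trimmed = trim_messages(
    chat_history,
    max_tokens=1000,
    strategy="last",
    token_counter=llm,   # the model counts tokens with its own tokenizer
    include_system=True,
)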
3. You are creating a new client/model object on every call
This won’t always cause rate limiting directly, but it often leads to poor retry behavior and noisy development patterns.
# Bad: instantiate inside hot path
def handle_request(text):
    llm = ChatOpenAI(model="gpt-4o-mini")
    return llm.invoke(text)
Prefer one shared client per process:
llm = ChatOpenAI(model="gpt-4o-mini", max_retries=2)
def handle_request(text):
    return llm.invoke(text)
4. Your environment variables point to the wrong project or key
A common dev mistake is using a personal key with a tiny quota while thinking you’re on the team account.
Check these:
echo $OPENAI_API_KEY
echo $LANGCHAIN_TRACING_V2
echo $OPENAI_ORG_ID
If you use multiple environments, keep them explicit:
import os
os.environ["OPENAI_API_KEY"] = os.getenv("DEV_OPENAI_API_KEY", "")
How to Debug It
- Inspect the exact exception
  - Look for openai.RateLimitError, HTTP 429, or provider-specific quota messages.
  - If you see insufficient_quota, that's different from pure throttling.
- Count how many model calls one user action triggers
  - Add logging around every invoke(), ainvoke(), chain run, and tool execution.
  - Agents often make 3-10 calls per "one question."
- Disable parallelism temporarily
  - Run everything sequentially.
  - If the error disappears, your issue is concurrency, not total volume.
- Print prompt sizes and token usage
  - Large histories and long retrieval context can push you over token-per-minute limits.
  - Check response metadata when available.
Example debug hook:
def traced_invoke(llm, prompt):
print(f"Invoking with {len(str(prompt))} chars")
result = llm.invoke(prompt)
print("Done")
return result
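For counting calls across chains and agents, a callback handler is more reliable than wrapping invoke() by hand. This is a minimal sketch: CallCounter is my own class, not part of LangChain, and usage_metadata is only populated on reasonably recent langchain-core / langchain-openai versions:

# Sketch: count every chat-model call triggered by one user action
from langchain_core.callbacks import BaseCallbackHandler
from langchain_openai import ChatOpenAI

class CallCounter(BaseCallbackHandler):
    def __init__(self):
        self.calls = 0

    def on_chat_model_start(self, serialized, messages, **kwargs):
        self.calls += 1
        print(f"Chat model call #{self.calls}")

llm = ChatOpenAI(model="gpt-4o-mini")
counter = CallCounter()

result = llm.invoke("Summarize policy A", config={"callbacks": [counter]})
print(result.usage_metadata)  # token counts, when the provider reports them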
Prevention
- Add retry with exponential backoff on all external LLM calls.
- Cap concurrency in async jobs and background workers.
- Keep prompts short and trim conversation history before each turn.
- Reuse one configured ChatOpenAI instance per service process.
- Set up usage monitoring early so you see spikes before users do (see the sketch below this list).
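For a quick dev-time view of usage and spend, the OpenAI callback context manager from langchain_community tallies tokens and estimated cost for everything run inside it. A sketch, assuming langchain_community is installed:

# Sketch: rough token/cost accounting while developing
from langchain_community.callbacks import get_openai_callback
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

with get_openai_callback() as cb:
    llm.invoke("Summarize policy A")

print(f"{cb.total_tokens} tokens, ~${cb.total_cost:.4f}")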
If this only happens during development, check your test harness too. A tight reload loop, notebook reruns, or a frontend double-submit can make LangChain look guilty when the real problem is duplicated traffic.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.