How to Fix 'rate limit exceeded during development' in LangChain (Python)

By Cyprian Aarons · Updated 2026-04-21

When you see rate limit exceeded during development in a LangChain Python app, it usually means your code is sending more requests to the model provider than your account or project quota allows. In practice, this shows up during testing loops, agent retries, chain fan-out, or when you accidentally call the same model multiple times per user request.

The important part: this is rarely a LangChain bug. It’s usually a request pattern problem, a config problem, or both.

The Most Common Cause

The #1 cause is repeated LLM calls inside a loop or agent executor without any throttling. In LangChain, this often happens when you call invoke() in a tight loop over many inputs, or when an agent retries tool calls and each retry triggers another model request.

Here’s the broken pattern:

Broken (calls the model for every item with no batching or backoff):

# BROKEN: repeated calls can trigger provider rate limits fast
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

questions = [
    "Summarize policy A",
    "Summarize policy B",
    "Summarize policy C",
]

answers = []
for q in questions:
    answers.append(llm.invoke(q).content)

Fixed (batches work or adds retry/backoff control):

# FIXED: reduce call frequency and add retry behavior
from langchain_openai import ChatOpenAI
from tenacity import retry, wait_exponential_jitter, stop_after_attempt

llm = ChatOpenAI(
    model="gpt-4o-mini",
    max_retries=2,
)

@retry(wait=wait_exponential_jitter(initial=1, max=10), stop=stop_after_attempt(3))
def ask(question: str) -> str:
    return llm.invoke(question).content

questions = [
    "Summarize policy A",
    "Summarize policy B",
    "Summarize policy C",
]

answers = [ask(q) for q in questions]
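
If you want the throttling handled at the client level instead of wrapping every call yourself, recent versions of langchain_core ship an in-memory rate limiter that can be attached directly to the chat model. A minimal sketch, assuming langchain_core 0.2.24 or newer:

# Client-side throttling: the model waits for capacity before each request
from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain_openai import ChatOpenAI

rate_limiter = InMemoryRateLimiter(
    requests_per_second=0.5,    # at most one request every two seconds
    check_every_n_seconds=0.1,  # how often to check for an available slot
    max_bucket_size=1,          # no bursting
)

llm = ChatOpenAI(model="gpt-4o-mini", max_retries=2, rate_limiter=rate_limiter)

Note that this limits request rate only, not tokens per minute.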

If you’re using an agent, the same issue can be hidden behind tool loops. A stack trace often looks like this:

openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit exceeded'}}

Or, when the failure happens inside a chain or agent, the rate limit can be buried under a LangChain wrapper exception:

langchain_core.exceptions.OutputParserException
openai.RateLimitError: Rate limit exceeded during development

The fix is to reduce unnecessary invocations first. Retries help, but they do not solve a bad call pattern.
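
During development specifically, a lot of repeat traffic comes from re-running the same prompts in notebooks and test loops. Enabling LangChain's in-process LLM cache serves identical calls from memory instead of hitting the provider again. A minimal sketch:

# Cache identical prompts in-process so reruns don't re-hit the API
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache

set_llm_cache(InMemoryCache())

The cache only matches exact prompt repeats, so it helps most in tight development loops rather than with varied production traffic.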

Other Possible Causes

1. Your concurrency is too high

If you run multiple tasks in parallel with asyncio.gather() or a thread pool, you can spike request volume instantly.

# Too aggressive
results = await asyncio.gather(*[
    llm.ainvoke(prompt) for prompt in prompts
])

Use bounded concurrency instead:

sem = asyncio.Semaphore(3)

async def limited_call(prompt):
    async with sem:
        return await llm.ainvoke(prompt)

results = await asyncio.gather(*[limited_call(p) for p in prompts])
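
If you would rather not manage a semaphore by hand, LangChain runnables accept a max_concurrency setting in their config, which runs the batch with a bounded worker pool. A brief sketch:

# Let LangChain cap parallelism instead of a manual semaphore
results = await llm.abatch(prompts, config={"max_concurrency": 3})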

2. You are hitting token-based limits, not just request limits

Some providers throttle by tokens per minute. A single long prompt plus a long output can hit the ceiling even if request count is low.

llm = ChatOpenAI(
    model="gpt-4o-mini",
    max_tokens=4000,
)

Reduce output size and prompt size:

llm = ChatOpenAI(
    model="gpt-4o-mini",
    max_tokens=800,
)

Also trim chat history before passing it into MessagesPlaceholder.
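
langchain_core ships a trim_messages helper for exactly this. A sketch, where chat_history stands in for your accumulated message list and the model itself is used as the token counter:

# Keep only the most recent messages that fit under a token budget
from langchain_core.messages import trim_messages

trimmed = trim_messages(
    chat_history,            # placeholder: your accumulated message list
    strategy="last",         # keep the newest messages
    token_counter=llm,       # use the model's tokenizer to count
    max_tokens=1000,         # budget for the history, not the whole prompt
    include_system=True,     # always keep the system message
    start_on="human",        # don't start the history mid-exchange
)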

3. You are creating a new client/model object on every call

This won’t always cause rate limiting directly, but it scatters retry configuration across the codebase and makes your request pattern harder to reason about during development.

# Bad: instantiate inside hot path
def handle_request(text):
    llm = ChatOpenAI(model="gpt-4o-mini")
    return llm.invoke(text)

Prefer one shared client per process:

llm = ChatOpenAI(model="gpt-4o-mini", max_retries=2)

def handle_request(text):
    return llm.invoke(text)

4. Your environment variables point to the wrong project or key

A common dev mistake is using a personal key with a tiny quota while thinking you’re on the team account.

Check these:

echo $OPENAI_API_KEY
echo $LANGCHAIN_TRACING_V2
echo $OPENAI_ORG_ID

If you use multiple environments, keep them explicit:

import os

os.environ["OPENAI_API_KEY"] = os.getenv("DEV_OPENAI_API_KEY", "")
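
You can also pass the key to the model explicitly instead of relying on whatever OPENAI_API_KEY happens to be exported in your shell. A small sketch, assuming DEV_OPENAI_API_KEY holds the key you intend to use:

# Make the key choice explicit so a stale shell profile can't sneak in
import os

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-mini",
    api_key=os.environ["DEV_OPENAI_API_KEY"],  # raises KeyError if unset
)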

How to Debug It

  1. Inspect the exact exception
    • Look for openai.RateLimitError, HTTP 429, or provider-specific quota messages.
    • If you see insufficient_quota, that’s different from pure throttling.
  2. Count how many model calls one user action triggers
    • Add logging around every invoke(), ainvoke(), chain run, and tool execution.
    • Agents often make 3-10 calls per “one question.”
  3. Disable parallelism temporarily
    • Run everything sequentially.
    • If the error disappears, your issue is concurrency, not total volume.
  4. Print prompt sizes and token usage
    • Large histories and long retrieval context can push you over token-per-minute limits.
    • Check response metadata when available.

Example debug hook:

def traced_invoke(llm, prompt):
    print(f"Invoking with {len(str(prompt))} chars")
    result = llm.invoke(prompt)
    print("Done")
    return result
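
To cover debug steps 1 and 4 in one place, you can catch the rate limit error explicitly and print per-call token usage from the response. A sketch, assuming the openai SDK and a langchain_core version that attaches usage_metadata to AI messages:

# Surface both the exact exception and the token cost of each call
import openai

def debug_invoke(llm, prompt):
    try:
        result = llm.invoke(prompt)
    except openai.RateLimitError as exc:
        print(f"Rate limited: {exc}")
        raise
    usage = getattr(result, "usage_metadata", None)
    if usage:
        print(f"Token usage: {usage}")  # input_tokens / output_tokens / total_tokens
    return result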

Prevention

  • Add retry with exponential backoff on all external LLM calls.
  • Cap concurrency in async jobs and background workers.
  • Keep prompts short and trim conversation history before each turn.
  • Reuse one configured ChatOpenAI instance per service process.
  • Set up usage monitoring early so you see spikes before users do.

If this only happens during development, check your test harness too. A tight reload loop, notebook reruns, or a frontend double-submit can make LangChain look guilty when the real problem is duplicated traffic.

