How to Fix 'rate limit exceeded' in LangChain (Python)

By Cyprian Aarons | Updated 2026-04-21
Tags: rate-limit-exceeded, langchain, python

A rate limit exceeded error in LangChain usually means the upstream model provider rejected your request because you sent too many requests, too many tokens, or both. In practice, it shows up when you loop over documents, run concurrent chains, or let retries amplify traffic without any backoff.

The exact exception text varies by provider, but with OpenAI-backed LangChain apps you’ll often see something like:

openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org_...'}}

The Most Common Cause

The #1 cause is uncontrolled request volume from a loop, batch job, or parallel execution. In LangChain, this usually happens when people call an LLM inside a for loop without throttling, or they fan out too many concurrent tasks with RunnableParallel / batch().

Here’s the broken pattern:

Broken: sends requests as fast as the loop runs.
Fixed: adds batching, pacing, and retry-aware limits.
# Broken: fires one request per item with no pacing
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

questions = [
    "Summarize policy A",
    "Summarize policy B",
    "Summarize policy C",
]

for q in questions:
    response = llm.invoke(q)
    print(response.content)

# Fixed: paced sequential calls with bounded retries
import time
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-mini",
    max_retries=3,
)

questions = [
    "Summarize policy A",
    "Summarize policy B",
    "Summarize policy C",
]

for q in questions:
    response = llm.invoke(q)
    print(response.content)
    time.sleep(1.2)  # simple throttle; tune to your provider limits

If you’re using LCEL or batch(), keep concurrency low:

results = chain.batch(inputs, config={"max_concurrency": 2})

If the provider’s limit is per minute, raw throughput matters more than retry count. A retry without backoff just makes the spike worse.
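
If you know your requests-per-minute limit, you can pace the loop against a clock instead of guessing a fixed sleep. A minimal sketch, assuming a 60 RPM limit for illustration and reusing the llm and questions defined above:

import time

RPM_LIMIT = 60                   # assumed limit for illustration; check your plan
MIN_INTERVAL = 60.0 / RPM_LIMIT  # seconds between requests

last_call = 0.0
for q in questions:
    wait = MIN_INTERVAL - (time.monotonic() - last_call)
    if wait > 0:
        time.sleep(wait)
    last_call = time.monotonic()
    print(llm.invoke(q).content)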

Other Possible Causes

1) Your prompt is too large

You can hit token-rate limits even with a small number of requests if each request is huge.

# Risky: stuffing full documents into one prompt
prompt = f"Analyze this document:\n\n{large_text}"
result = llm.invoke(prompt)

Fix it by chunking input and summarizing incrementally:

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
chunks = text_splitter.split_text(large_text)
summaries = [llm.invoke(f"Summarize:\n{chunk}").content for chunk in chunks]
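
If you need one final answer, a follow-up call can combine the partial summaries. A sketch that assumes the chunk summaries together fit comfortably in a single prompt:

combined = llm.invoke("Combine these summaries into one:\n\n" + "\n\n".join(summaries))
print(combined.content)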

2) Retries are multiplying traffic

LangChain retries can help transient failures, but if your app is already above quota, retries create more 429s.

llm = ChatOpenAI(model="gpt-4o", max_retries=10)

Use a smaller retry count and exponential backoff at the HTTP layer if needed:

llm = ChatOpenAI(model="gpt-4o", max_retries=2)
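
If you want backoff handled explicitly at the LangChain layer, recent langchain-core releases expose with_retry() on any runnable. A sketch that retries only on rate-limit errors, with exponential jitter between attempts:

from openai import RateLimitError

# Keep max_retries low on the underlying client so retries don't stack
resilient_llm = llm.with_retry(
    retry_if_exception_type=(RateLimitError,),
    wait_exponential_jitter=True,
    stop_after_attempt=3,
)
result = resilient_llm.invoke("Summarize policy A")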

3) You’re using parallel execution too aggressively

This is common with RunnableParallel, async gathers, or high max_concurrency.

# Too aggressive for many org-level quotas
results = chain.batch(inputs, config={"max_concurrency": 20})

Dial it down:

results = chain.batch(inputs, config={"max_concurrency": 2})

For async code:

import asyncio

sem = asyncio.Semaphore(2)

async def guarded_call(x):
    async with sem:
        return await chain.ainvoke(x)
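
To run the guarded calls, gather them; the semaphore keeps at most two requests in flight at a time (inputs here stands in for your list of chain inputs):

async def run_all(inputs):
    return await asyncio.gather(*(guarded_call(x) for x in inputs))

results = asyncio.run(run_all(inputs))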

4) You’re on the wrong model or tier for your quota

Some models have tighter RPM (requests per minute) and TPM (tokens per minute) limits than others. If you moved from a lower-tier model to something like gpt-4o, your old traffic pattern may now exceed quota.

Check the model name and provider settings:

llm = ChatOpenAI(model="gpt-4o")   # may be rate-limited sooner than expected
# try a lower-cost / higher-throughput option if it fits your use case

Also verify your API key belongs to the right organization and project. Wrong org context can make healthy usage look like a quota issue.
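
One way to rule out the wrong-organization case is to pin the organization on the client instead of relying on environment defaults. A sketch; openai_organization is optional and the ID below is a placeholder:

llm = ChatOpenAI(
    model="gpt-4o",
    openai_organization="org-...",  # placeholder; use the org that owns your quota
)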

How to Debug It

  1. Read the full exception

    • Don’t stop at RateLimitError.
    • Look for provider details like Error code: 429, requests per minute, tokens per minute, or headers such as x-ratelimit-limit-requests.
  2. Log request volume

    • Count how many times your chain calls the model per user action.
    • If one web request triggers 10 LLM calls, that’s usually the source (see the callback sketch after this list).
  3. Check concurrency

    • Inspect batch(), async gathers, Celery workers, FastAPI background tasks, and any queue consumers.
    • Reduce concurrency to 1–2 and retest.
  4. Measure prompt size

    • Log input token estimates before invoking the model.
    • If errors happen only on long documents, you’re likely hitting TPM rather than RPM.
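
For OpenAI-backed models, the get_openai_callback() context manager counts requests and tokens for everything run inside its block, which covers steps 2 and 4. This assumes langchain-community is installed; the chain input is a placeholder:

from langchain_community.callbacks import get_openai_callback

with get_openai_callback() as cb:
    chain.invoke({"question": "example"})  # placeholder input for your chain

print("LLM requests:", cb.successful_requests)
print("Prompt tokens:", cb.prompt_tokens)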

A quick diagnostic wrapper helps:

def safe_invoke(llm, prompt):
    # Rough size check before the call; get_num_tokens uses the model's tokenizer
    print("Prompt chars:", len(prompt), "| est. tokens:", llm.get_num_tokens(prompt))
    return llm.invoke(prompt)

Prevention

  • Set explicit throttling and low concurrency from day one.
    • Use max_concurrency, semaphores, a queue worker, or a client-side rate limiter (see the sketch after this list).
  • Keep prompts small and split large documents before calling the model.
    • Chunk first, summarize second.
  • Configure sane retries.
    • Use a small max_retries and backoff-aware behavior instead of blind retry storms.
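
For the throttling bullet above, recent langchain-core versions ship an InMemoryRateLimiter that chat models accept through the rate_limiter parameter. A minimal sketch, with the 0.5 requests-per-second value as an assumption to tune:

from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain_openai import ChatOpenAI

rate_limiter = InMemoryRateLimiter(
    requests_per_second=0.5,    # roughly one request every two seconds
    check_every_n_seconds=0.1,
    max_bucket_size=1,
)

llm = ChatOpenAI(model="gpt-4o-mini", rate_limiter=rate_limiter)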

If you want one rule to remember: treat LLM calls like external API traffic with hard quotas. LangChain won’t save you from bad throughput design; it will just surface the provider’s 429 faster.

