How to Fix 'rate limit exceeded' in LangChain (TypeScript)
What the Error Means
rate limit exceeded means the provider rejected your request because you sent too many tokens, too many requests, or both, inside its current quota window. In LangChain TypeScript, this usually shows up when you call OpenAI, Anthropic, Azure OpenAI, or another model provider from a loop, batch job, or parallel worker pool.
The error often looks like this:
```
Error: 429 Rate limit exceeded
```
Or, with OpenAI’s SDK under LangChain:
```
RateLimitError: 429 You exceeded your current quota, please check your plan and billing details.
```
The Most Common Cause
The #1 cause is uncontrolled parallelism. Developers often map over a list of inputs and fire all requests at once with Promise.all(), which is fine for 5 calls and a problem for 500.
Here’s the difference at a glance:

| Broken | Fixed |
|---|---|
| Fires all requests at once | Limits concurrency |
| Easy to write | Safer under provider quotas |
| Causes bursts and 429s | Smooths traffic |

The broken pattern:
```typescript
// Broken: unbounded concurrency
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

const prompts = [
  "Summarize policy A",
  "Summarize policy B",
  "Summarize policy C",
  // ...hundreds more
];

const results = await Promise.all(
  prompts.map((prompt) => llm.invoke(prompt))
);
```
The fix is to cap concurrency. In LangChain, use the `batch()` method with a `maxConcurrency` option, or your own queue. If you’re just invoking a model directly, the simplest production-safe pattern is a small concurrency limiter.
```typescript
// Fixed: bounded concurrency
import pLimit from "p-limit";
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

const limit = pLimit(3);

const prompts = [
  "Summarize policy A",
  "Summarize policy B",
  "Summarize policy C",
];

const results = await Promise.all(
  prompts.map((prompt) =>
    limit(() => llm.invoke(prompt))
  )
);
```
If you’re using LangChain’s batching APIs, keep the batch size small and test against real provider limits. The important point is this: `Promise.all()` does not respect API quotas.
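If you would rather not add a dependency, the limiter is small enough to write yourself. Here is a sketch; the `mapWithConcurrency` name is illustrative, not a LangChain API, and the real call would wrap `llm.invoke`:

```typescript
// Hypothetical dependency-free limiter: at most `limit` calls run at
// once, and results come back in input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker pulls the next unclaimed index until the list is exhausted.
  // Claiming the index is synchronous, so two workers never take the same one.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }

  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}

// With a real model this would be:
// const results = await mapWithConcurrency(prompts, 3, (p) => llm.invoke(p));
```

The worker-pool shape matters: instead of scheduling all promises up front, only `limit` loops are alive at any moment, so traffic to the provider stays flat.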
Other Possible Causes
1) Your prompt is too large
You can hit rate limits through token throughput, not just request count. Long system prompts, huge context windows, and document stuffing can push you over the provider’s tokens-per-minute cap.
```typescript
// Too much context in one call
const response = await llm.invoke([
  { role: "system", content: bigPolicyManual },
  { role: "user", content: customerCaseFile },
]);
```
Fix it by chunking documents and reducing context before generation.
```typescript
// Better: chunk first, then summarize in stages
const chunks = splitText(customerCaseFile);
for (const chunk of chunks) {
  await llm.invoke(`Summarize this chunk:\n${chunk}`);
}
```
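The `splitText` helper above is a placeholder. A minimal version is a fixed-size character splitter with a small overlap, so a sentence cut at a chunk boundary still appears whole in at least one chunk:

```typescript
// Illustrative stand-in for splitText: fixed-size chunks with overlap.
// Sizes are in characters, not tokens; a rough heuristic is ~4 characters
// per token for English text.
function splitText(text: string, chunkSize = 2000, overlap = 200): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    start += chunkSize - overlap; // step back by `overlap` each time
  }
  return chunks;
}
```

In a real pipeline you would likely reach for LangChain’s `RecursiveCharacterTextSplitter` instead, which splits on paragraph and sentence boundaries rather than raw character offsets.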
2) Retries are multiplying traffic
LangChain and the underlying SDK may retry failed requests automatically. If your app already has retries at the job runner level, you can accidentally turn one failed call into three or five more calls.
```typescript
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxRetries: 6, // each failed call may be attempted up to 7 times
});
```
If your queue system also retries jobs, lower one side. Don’t stack aggressive retries everywhere.
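The amplification is multiplicative: each queue-level retry replays the whole call, including all of its SDK-level retries. A quick sanity check (the helper name is illustrative):

```typescript
// Worst-case provider request count when retries stack across layers.
// Each layer attempts (retries + 1) times, and the layers multiply.
function worstCaseCalls(queueRetries: number, sdkRetries: number): number {
  return (queueRetries + 1) * (sdkRetries + 1);
}

// One "failed job" with 2 queue retries and maxRetries: 6 in the client
// can hit the provider up to 21 times.
console.log(worstCaseCalls(2, 6)); // → 21
```

That is why lowering retries in just one layer is usually enough to stop the cascade.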
3) Multiple workers share one API key
This happens in background jobs all the time. One service instance looks fine in dev; then four replicas go live and all hammer the same key.
```
OPENAI_API_KEY=sk-...
```
If every pod uses that same key without coordination, you can hit org-wide limits fast. Add a distributed rate limiter or reduce worker count per key.
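The classic building block for this is a token bucket. The sketch below is per-process only, with an injected clock for testability; a real multi-replica deployment would keep the bucket state in shared storage such as Redis so every worker draws from one budget:

```typescript
// Sketch of a token-bucket rate limiter (single process).
// `capacity` bounds burst size; `refillPerSecond` is the sustained rate.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,
    private refillPerSecond: number,
    private now: () => number = () => Date.now() // injected for tests
  ) {
    this.tokens = capacity;
    this.lastRefill = this.now();
  }

  // Returns true if a request may proceed now, false if it should wait.
  tryAcquire(): boolean {
    const current = this.now();
    const elapsedSec = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSec * this.refillPerSecond
    );
    this.lastRefill = current;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

Sizing it is simple: if your provider allows N requests per minute on the key, give the shared bucket a refill rate of N/60 and split nothing per pod.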
4) You’re hitting the wrong provider tier
Sometimes the code is fine and the account isn’t. The error message may be explicit:
RateLimitError: You exceeded your current quota, please check your plan and billing details.
That means no amount of code changes will fix it until you raise limits or move to a higher tier.
How to Debug It
1. Check whether it fails on one request or many.
   - If a single `llm.invoke()` fails consistently, suspect quota or prompt size.
   - If failures appear only under load, suspect concurrency.
2. Log request size and timing.
   - Record prompt length, token estimates, and how many requests are in flight.
   - If failures cluster during bursts, your limiter is missing or too loose.
3. Disable app-level retries temporarily.
   - Set `maxRetries` low and remove queue retries for one test run.
   - This tells you whether retries are amplifying the problem.
4. Test with a single worker.
   - Run one process with one request at a time.
   - If it stops failing, your issue is almost certainly parallelism or shared-key contention.
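Measuring requests in flight is a one-function wrapper. This is a hypothetical debugging helper, not a LangChain API; it would wrap `llm.invoke` during a test run:

```typescript
// Hypothetical debugging wrapper: reports how many wrapped calls are
// in flight each time a new one starts, so you can correlate 429s
// with concurrency spikes.
function trackInFlight<T extends unknown[], R>(
  fn: (...args: T) => Promise<R>,
  onSample: (inFlight: number) => void
): (...args: T) => Promise<R> {
  let inFlight = 0;
  return async (...args: T) => {
    inFlight++;
    onSample(inFlight);
    try {
      return await fn(...args);
    } finally {
      inFlight--; // always decrement, even when the call throws
    }
  };
}

// Usage against a real model might look like:
// const tracked = trackInFlight(
//   (p: string) => llm.invoke(p),
//   (n) => console.log(`${n} requests in flight`)
// );
```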
Prevention
- Use bounded concurrency by default. Start with 2–5 concurrent requests per key unless the provider docs say otherwise.
- Put token budgeting into your pipeline. Chunk large documents before they reach `ChatOpenAI`, `AzureChatOpenAI`, or any other LLM wrapper.
- Centralize retry policy. Keep retries in one layer only: either the application queue or the LangChain client config, not both.
If you’re building agents for production systems like claims triage or KYC workflows, treat rate limiting as an architecture problem, not just an exception handler issue. The fix is usually less about catching 429s and more about controlling how traffic enters the model layer.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit