# How to Fix "rate limit exceeded" in Production in LangChain (TypeScript)
When LangChain throws rate limit exceeded in production, it usually means your app is sending more requests than the model provider allows in a given time window. In TypeScript apps, this often shows up after you move from local testing to real traffic, where parallel requests, retries, and long-running chains all hit the same API key.
The error is rarely “just OpenAI being down.” It’s usually your code pattern, your concurrency, or your retry settings.
## The Most Common Cause
The #1 cause is uncontrolled parallelism.
A common pattern is calling Promise.all() over a list of inputs and letting LangChain fire off dozens of LLM calls at once. That works locally with 3 items, then falls apart in production when the queue spikes.
| Broken pattern | Fixed pattern |
|---|---|
| Fires all requests at once | Limits concurrency |
| Easy to write | Safe under load |
| Causes burst rate limits | Smooths request volume |
```typescript
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  apiKey: process.env.OPENAI_API_KEY,
});

const prompts = [
  "Summarize policy A",
  "Summarize policy B",
  "Summarize policy C",
  "Summarize policy D",
];

// BROKEN: all requests go out at once
const results = await Promise.all(
  prompts.map((prompt) => llm.invoke(prompt))
);
```
```typescript
import pLimit from "p-limit";
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  apiKey: process.env.OPENAI_API_KEY,
});

const prompts = [
  "Summarize policy A",
  "Summarize policy B",
  "Summarize policy C",
  "Summarize policy D",
];

const limit = pLimit(2); // keep only 2 in flight

// FIXED: controlled concurrency
const results = await Promise.all(
  prompts.map((prompt) => limit(() => llm.invoke(prompt)))
);
```
If you’re using `RunnableSequence`, `.map()`, or a custom queue worker, the same rule applies. The provider sees request bursts, not your intent.
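If you’d rather not add a dependency, the same cap can be written by hand. The sketch below uses a hypothetical `mapWithConcurrency` helper (not a LangChain API) to run at most `max` promises at a time:

```typescript
// Minimal concurrency limiter: run `fn` over `items` with at most
// `max` calls in flight at once. Hypothetical helper for illustration;
// p-limit does the same job with more edge cases handled.
async function mapWithConcurrency<T, R>(
  items: T[],
  max: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker pulls the next index until the list is exhausted.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }

  await Promise.all(
    Array.from({ length: Math.min(max, items.length) }, worker)
  );
  return results;
}
```

You would call it as `mapWithConcurrency(prompts, 2, (p) => llm.invoke(p))` to mirror the `p-limit` example above.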
## Other Possible Causes
### 1) Retries are multiplying traffic
LangChain retries can help with transient failures, but they also amplify load if your app is already near the limit.
```typescript
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxRetries: 6, // can make burst traffic worse
});
```
If you’re already hitting provider limits, reduce retries and add backoff outside the model call.
```typescript
const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxRetries: 2,
});
```
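"Backoff outside the model call" can be a small wrapper around the invocation. This is an illustrative sketch: `withBackoff` and its parameters are hypothetical, and the `err?.status === 429` check assumes the provider error object carries an HTTP status code:

```typescript
// Retry with exponential backoff plus jitter, but only on
// rate-limit-style errors. Hypothetical wrapper for illustration.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseMs = 500
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      // Assumption: rate-limit errors expose an HTTP status of 429.
      const isRateLimit = err?.status === 429;
      if (!isRateLimit || attempt + 1 >= maxAttempts) throw err;
      // Exponential delay with random jitter to avoid retry stampedes.
      const delay = baseMs * 2 ** attempt + Math.random() * baseMs;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Usage would look like `withBackoff(() => llm.invoke(prompt))`, with `maxRetries` on the model itself kept low so the two retry layers don’t multiply.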
### 2) Multiple workers share one API key
This is common in production when you scale horizontally. One pod looks fine; five pods all using the same key push you over the account limit.
```shell
OPENAI_API_KEY=sk-prod-shared-key
```
If each worker can independently spike traffic, rate limits will look random. The fix is usually shared throttling or a centralized job queue.
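A shared throttle ultimately needs external state (Redis is a common choice), but the core mechanism can be sketched in-process. The `TokenBucket` class below is a hypothetical illustration; a real multi-worker limiter would keep the token count in a central store instead of instance memory:

```typescript
// Token bucket: requests consume tokens; tokens refill over time.
// In-process illustration only. For five pods sharing one API key,
// the token state would live in Redis or another shared store.
class TokenBucket {
  private tokens: number;
  private last = Date.now();

  constructor(
    private capacity: number,
    private refillPerSec: number
  ) {
    this.tokens = capacity;
  }

  // Returns true if a request may proceed right now.
  tryTake(): boolean {
    const now = Date.now();
    // Refill proportionally to elapsed time, capped at capacity.
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((now - this.last) / 1000) * this.refillPerSec
    );
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

A worker would call `tryTake()` before each LLM request and queue or delay when it returns `false`.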
### 3) Your chain makes more LLM calls than you think
A single user request may trigger multiple model calls through tools, agents, retrieval steps, or output parsing retries.
```typescript
// One request can become many internal calls:
const chain = prompt.pipe(llm).pipe(parser);
```
Agents are especially noisy because tool loops can re-enter the model several times per user action. Check whether your “one endpoint” is actually making five or ten upstream requests.
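One quick way to check is to wrap the model call in a counter and compare calls against user actions. `countCalls` is a hypothetical helper for illustration, not a LangChain feature:

```typescript
// Wrap any async function so every call increments a counter.
// Hypothetical helper: useful for spotting hidden fan-out.
function countCalls<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>
): { fn: (...args: A) => Promise<R>; count: () => number } {
  let calls = 0;
  return {
    fn: async (...args: A) => {
      calls++;
      return fn(...args);
    },
    count: () => calls,
  };
}
```

Wrapping `llm.invoke` this way and logging `count()` per endpoint hit makes a 1:5 fan-out visible immediately.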
### 4) Token-heavy prompts increase provider throttling
Some providers enforce rate limits by tokens per minute, not just request count. Large context windows can trip limits even with low concurrency.
```typescript
const longContext = docs.join("\n\n"); // huge payload
await llm.invoke(`Answer using this context:\n${longContext}`);
```
Trim retrieved documents, summarize first, or cap context size before sending it to the model.
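A crude but effective cap is a character budget (roughly 4 characters per token is a common heuristic for English text). `capContext` below is a hypothetical helper; real systems often use a proper tokenizer instead:

```typescript
// Keep whole documents until the character budget is exhausted.
// Hypothetical helper; a tokenizer-based budget is more precise.
function capContext(docs: string[], maxChars: number): string {
  const parts: string[] = [];
  let total = 0;
  for (const doc of docs) {
    if (total + doc.length > maxChars) break;
    parts.push(doc);
    total += doc.length;
  }
  return parts.join("\n\n");
}
```

Calling `capContext(docs, 8000)` before the `invoke()` above would bound the payload at roughly 2,000 tokens regardless of how many documents retrieval returns.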
## How to Debug It
- **Log every upstream call**
  - Add request IDs around each `invoke()`.
  - Log model name, prompt size, and timestamps.
  - If one user action triggers many logs in a burst, you found the problem.
- **Check whether the error is a provider `429`**
  - In LangChain/OpenAI setups, this often surfaces as an HTTP `429 Too Many Requests`.
  - You may also see messages like `RateLimitError` or `openai.RateLimitError` depending on SDK version.
  - If it’s a true rate limit issue, retries alone won’t fix it.
- **Measure concurrency under load**
  - Count in-flight requests per process.
  - Compare local dev vs. production worker count.
  - If production has more pods or threads, assume concurrency is higher than expected.
- **Inspect chain behavior**
  - Turn on verbose logging for chains and agents.
  - Watch for repeated tool calls or retry loops.
  - If a single endpoint fans out into multiple LLM calls, reduce fan-out before touching limits.
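The logging step can be a thin wrapper around each call. `loggedInvoke` is a hypothetical helper and the log field names are illustrative:

```typescript
import { randomUUID } from "node:crypto";

// Wrap a call with a request ID, start/end events, and timing.
// Hypothetical helper; adapt the fields to your logging stack.
async function loggedInvoke<T>(
  label: string,
  fn: () => Promise<T>
): Promise<T> {
  const id = randomUUID();
  const start = Date.now();
  console.log(JSON.stringify({ id, label, event: "start", ts: start }));
  try {
    return await fn();
  } finally {
    console.log(
      JSON.stringify({ id, label, event: "end", ms: Date.now() - start })
    );
  }
}
```

With `loggedInvoke("summarize", () => llm.invoke(prompt))` around every model call, a burst of `start` events with close timestamps under one user action is the fingerprint of uncontrolled fan-out.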
## Prevention
- **Put a concurrency cap around every batch job**
  - Use `p-limit`, BullMQ workers, or an internal queue.
  - Never let an unbounded `Promise.all()` hit an LLM provider in production.
- **Treat retries as part of capacity planning**
  - Keep `maxRetries` low.
  - Add jittered backoff and stop retrying on hard rate limits unless you have headroom.
- **Instrument token usage and request counts**
  - Track requests per minute and tokens per minute by service.
  - Alert before you hit provider ceilings so production doesn’t discover them first.
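The instrumentation point can be sketched with a sliding-window counter. `RpmCounter` is a hypothetical in-process illustration; production metrics usually live in Prometheus or a similar system:

```typescript
// Sliding-window requests-per-minute counter. Hypothetical
// illustration; timestamps are injectable to keep it testable.
class RpmCounter {
  private stamps: number[] = [];

  record(now = Date.now()): void {
    this.stamps.push(now);
  }

  // Count of requests recorded in the trailing 60 seconds.
  perMinute(now = Date.now()): number {
    this.stamps = this.stamps.filter((t) => now - t < 60_000);
    return this.stamps.length;
  }
}
```

Calling `record()` beside every model call and alerting when `perMinute()` approaches your provider’s documented ceiling turns rate limits from surprises into dashboards.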
If you’re seeing `RateLimitError` or HTTP `429 Too Many Requests` in LangChain with TypeScript, start with concurrency. In real systems, that’s usually where the problem lives.
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.