How to Fix 'rate limit exceeded when scaling' in LangChain (TypeScript)
When you see a "rate limit exceeded" error while scaling a LangChain TypeScript app, it usually means your concurrency grew faster than the upstream model provider can accept requests. In practice, this shows up when you move from one-off calls to batch processing, parallel Promise.all fan-outs, or agent loops that fan out too aggressively.
The error is not a LangChain bug by itself. It’s usually your app sending more requests per second than the model provider allows, or retry behavior multiplying traffic under load.
The Most Common Cause
The #1 cause is uncontrolled parallelism.
A lot of TypeScript code starts with a clean-looking Promise.all(...), then gets wrapped inside another loop or batch job. That works at small scale, then falls over with errors like:
- `429 Too Many Requests`
- `RateLimitError: Rate limit exceeded`
- `openai.RateLimitError: 429 You exceeded your current quota`
- provider-specific messages surfaced through `ChatOpenAI` or `OpenAIEmbeddings`
Broken pattern vs fixed pattern
| Broken | Fixed |
|---|---|
| Fires everything at once with Promise.all | Limits concurrency with a queue or batch size |
| No backpressure | Explicit throttling |
| Retries amplify the spike | Retries happen inside a controlled rate |
```typescript
// ❌ Broken: unbounded parallelism
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

async function summarizeDocs(docs: string[]) {
  return Promise.all(
    docs.map((doc) =>
      llm.invoke([
        { role: "system", content: "Summarize briefly." },
        { role: "user", content: doc },
      ])
    )
  );
}
```
```typescript
// ✅ Fixed: controlled concurrency
import pLimit from "p-limit";
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

const limit = pLimit(3); // tune to your provider limits

async function summarizeDocs(docs: string[]) {
  return Promise.all(
    docs.map((doc) =>
      limit(() =>
        llm.invoke([
          { role: "system", content: "Summarize briefly." },
          { role: "user", content: doc },
        ])
      )
    )
  );
}
```
If you’re using LangChain’s batching APIs, the same rule applies. Keep batch sizes small enough that request bursts stay below the provider’s RPM/TPM ceiling.
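For example, recent versions of @langchain/core let you cap parallelism on a batch call through a maxConcurrency option. This is a minimal sketch assuming that option and the llm and docs from the example above; verify the option against your installed version.

```typescript
// Minimal sketch: let LangChain cap parallelism during a batch call.
// Assumes `llm` and `docs` from the example above; maxConcurrency support
// depends on your @langchain/core version.
const summaries = await llm.batch(
  docs.map((doc) => [
    { role: "system", content: "Summarize briefly." },
    { role: "user", content: doc },
  ]),
  { maxConcurrency: 3 } // keep bursts well under the provider's RPM/TPM ceiling
);
```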
Other Possible Causes
1. Retry settings are too aggressive
LangChain wrappers and underlying SDKs can retry failed calls. If you already hit the limit, retries can make it worse by immediately re-sending traffic.
```typescript
// Too aggressive for bursty workloads
const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxRetries: 6,
});
```
Use fewer retries and add jitter/backoff if your workload is spiky.
```typescript
const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxRetries: 2,
});
```
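If you want the backoff and jitter under your own control, one option is to wrap calls in a small helper. This is a minimal sketch, not a LangChain API: the withBackoff name, attempt count, and delays are assumptions to tune.

```typescript
// Minimal sketch: exponential backoff with jitter around any LangChain call.
// The helper name, attempt count, and delays are assumptions to tune.
async function withBackoff<T>(fn: () => Promise<T>, maxAttempts = 4): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts - 1) throw err;
      // 500ms, 1s, 2s, ... plus up to 250ms of random jitter so every
      // worker doesn't retry at the same instant
      const delayMs = 500 * 2 ** attempt + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage: keep maxRetries low on the model and let the wrapper pace retries
// const summary = await withBackoff(() => llm.invoke([{ role: "user", content: doc }]));
```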
2. Your agent loop is calling tools too often
Agents can create hidden request multiplication. One user query can trigger multiple LLM calls across planning, tool selection, and final response generation.
```typescript
// Prebuilt ReAct agent from LangGraph
import { createReactAgent } from "@langchain/langgraph/prebuilt";

const agent = createReactAgent({ llm, tools });

// One input may produce several model calls: planning, tool calls, final answer
await agent.invoke({
  messages: [
    { role: "user", content: "Analyze these 50 records and draft a report" },
  ],
});
```
If the agent is doing bulk work, consider replacing it with a deterministic pipeline, as sketched after this list:

- split documents first
- run bounded map steps
- aggregate results once
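A minimal sketch of that shape, reusing the llm and pLimit limiter from the earlier fixed example; the function name and prompts are illustrative.

```typescript
// Minimal sketch: map each record with bounded concurrency, then aggregate once.
// Reuses the `llm` and `limit` instances defined earlier; prompts are illustrative.
async function analyzeRecords(records: string[]) {
  // Map step: one bounded summarization call per record
  const partials = await Promise.all(
    records.map((record) =>
      limit(() =>
        llm.invoke([
          { role: "system", content: "Summarize this record in two sentences." },
          { role: "user", content: record },
        ])
      )
    )
  );

  // Aggregate step: a single final call over the partial summaries
  return llm.invoke([
    { role: "system", content: "Draft a report from these summaries." },
    { role: "user", content: partials.map((p) => String(p.content)).join("\n\n") },
  ]);
}
```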
3. Embeddings are being generated in one giant burst
This happens a lot during ingestion. OpenAIEmbeddings or similar embedding classes can hit rate limits fast if you send thousands of chunks at once.
```typescript
import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings({
  modelName: "text-embedding-3-small",
});

// Bad if chunks is huge and called all at once
await embeddings.embedDocuments(chunks);
```
Fix it by chunking and pacing the ingestion job, as sketched after this list:

- process in slices of 50–200 chunks
- sleep between slices if needed
- run ingestion as a background job, not on the request path
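A minimal sketch under those constraints, assuming the embeddings instance above; the slice size and pause duration are assumptions to tune against your provider's limits.

```typescript
// Minimal sketch: embed in slices with a pause between them.
// `embeddings` comes from the example above; sliceSize and pauseMs are assumptions.
async function embedInSlices(chunks: string[], sliceSize = 100, pauseMs = 1_000) {
  const vectors: number[][] = [];
  for (let i = 0; i < chunks.length; i += sliceSize) {
    const slice = chunks.slice(i, i + sliceSize);
    vectors.push(...(await embeddings.embedDocuments(slice)));
    // Pause between slices so bursts stay under the RPM/TPM ceiling
    if (i + sliceSize < chunks.length) {
      await new Promise((resolve) => setTimeout(resolve, pauseMs));
    }
  }
  return vectors;
}
```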
4. Model/provider limits are lower than you think
Sometimes the code is fine, but your account tier or deployment has low RPM/TPM limits. Azure OpenAI, OpenAI, Anthropic, and Bedrock all expose different ceilings.
Check your config for things like:
- deployment-level throughput caps
- per-minute token limits
- regional throttles
- org-level quotas
If your logs show consistent failure at the same volume, this is likely the issue.
How to Debug It
1. Log every call count per second
- Count how many times `llm.invoke`, `chain.invoke`, or `agent.invoke` runs; a counting sketch follows this step.
- If spikes line up with failures, you've found your pressure point.
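One way to get that count, as a minimal sketch: attach a callback handler when constructing the model. The handler method names come from LangChain's callback interface; which one fires for chat models can vary by @langchain/core version, so both are wired up.

```typescript
// Minimal sketch: count model calls via a callback handler on the model itself.
let callCount = 0;

const countedLlm = new ChatOpenAI({
  model: "gpt-4o-mini",
  callbacks: [
    {
      handleLLMStart: async () => {
        callCount += 1;
      },
      handleChatModelStart: async () => {
        callCount += 1;
      },
    },
  ],
});

// Log callCount on an interval and line the spikes up with your 429s.
```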
2. Inspect the exact error class
- Look for `RateLimitError`, HTTP `429`, or provider-specific exceptions; a try/catch sketch follows this step.
- In LangChain wrappers, check whether the error comes from `ChatOpenAI`, `AzureChatOpenAI`, or another integration.
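A minimal sketch for surfacing the error's shape. The status and name fields checked here are assumptions based on OpenAI-style SDK errors; log the raw error first and adjust the checks to what you actually see.

```typescript
// Minimal sketch: log the failing error's shape before deciding how to react.
try {
  await llm.invoke([{ role: "user", content: "ping" }]);
} catch (err) {
  const e = err as { name?: string; status?: number; message?: string };
  console.error("name:", e.name, "status:", e.status, "message:", e.message);
  if (e.status === 429 || e.name === "RateLimitError") {
    // Rate limited: back off rather than retrying immediately
  }
  throw err;
}
```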
3. Disable parallelism temporarily
- Replace `Promise.all` with a simple `for...of`.
- If errors disappear, concurrency was the trigger.

```typescript
for (const doc of docs) {
  await llm.invoke([{ role: "user", content: doc }]);
}
```
4. Reduce retries to isolate amplification
- Set `maxRetries` low.
- If failures become more visible but less explosive, retries were hiding the real problem.
Prevention
- Use bounded concurrency everywhere you fan out requests.
- Separate ingestion jobs from interactive request flows.
- Treat embeddings, agents, and chains as rate-limited systems by default.
- Add metrics for request rate, retry count, and provider-side throttling before shipping to production.
If you build LangChain TypeScript apps for real workloads, assume rate limits will be hit eventually. The fix is not “retry harder”; it’s controlling how much traffic you generate in the first place.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.