How to Fix 'rate limit exceeded when scaling' in LangChain (TypeScript)
When you see a "rate limit exceeded" error while scaling a LangChain TypeScript app, it usually means your concurrency grew faster than the upstream model provider can accept requests. In practice, this shows up when you move from one-off calls to batch processing, parallel Promise.all fan-outs, or agent loops that fan out too aggressively.
The error is not a LangChain bug by itself. It’s usually your app sending more requests per second than the model provider allows, or retry behavior multiplying traffic under load.
The Most Common Cause
The #1 cause is uncontrolled parallelism.
A lot of TypeScript code starts with a clean-looking Promise.all(...), then gets wrapped inside another loop or batch job. That works at small scale, then falls over with errors like:
- `429 Too Many Requests`
- `RateLimitError: Rate limit exceeded`
- `openai.RateLimitError: 429 You exceeded your current quota`
- provider-specific messages surfaced through `ChatOpenAI` or `OpenAIEmbeddings`
Broken pattern vs fixed pattern
| Broken | Fixed |
|---|---|
| Fires everything at once with Promise.all | Limits concurrency with a queue or batch size |
| No backpressure | Explicit throttling |
| Retries amplify the spike | Retries happen inside a controlled rate |
```typescript
// ❌ Broken: unbounded parallelism
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

async function summarizeDocs(docs: string[]) {
  return Promise.all(
    docs.map((doc) =>
      llm.invoke([
        { role: "system", content: "Summarize briefly." },
        { role: "user", content: doc },
      ])
    )
  );
}
```
```typescript
// ✅ Fixed: controlled concurrency
import pLimit from "p-limit";
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

const limit = pLimit(3); // tune to your provider limits

async function summarizeDocs(docs: string[]) {
  return Promise.all(
    docs.map((doc) =>
      limit(() =>
        llm.invoke([
          { role: "system", content: "Summarize briefly." },
          { role: "user", content: doc },
        ])
      )
    )
  );
}
```
If you’re using LangChain’s batching APIs, the same rule applies. Keep batch sizes small enough that request bursts stay below the provider’s RPM/TPM ceiling.
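For example, recent versions of @langchain/core let you cap parallelism on a batch call through a maxConcurrency option. This is a minimal sketch assuming that option and the llm and docs from the example above; verify the option against your installed version.

```typescript
// Minimal sketch: let LangChain cap parallelism during a batch call.
// Assumes `llm` and `docs` from the example above; maxConcurrency support
// depends on your @langchain/core version.
const summaries = await llm.batch(
  docs.map((doc) => [
    { role: "system", content: "Summarize briefly." },
    { role: "user", content: doc },
  ]),
  { maxConcurrency: 3 } // keep bursts well under the provider's RPM/TPM ceiling
);
```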
Other Possible Causes
1. Retry settings are too aggressive
LangChain wrappers and underlying SDKs can retry failed calls. If you already hit the limit, retries can make it worse by immediately re-sending traffic.
```typescript
// Too aggressive for bursty workloads
const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxRetries: 6,
});
```
Use fewer retries and add jitter/backoff if your workload is spiky.
```typescript
const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxRetries: 2,
});
```
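If you want the backoff and jitter under your own control, one option is to wrap calls in a small helper. This is a minimal sketch, not a LangChain API: the withBackoff name, attempt count, and delays are assumptions to tune.

```typescript
// Minimal sketch: exponential backoff with jitter around any LangChain call.
// The helper name, attempt count, and delays are assumptions to tune.
async function withBackoff<T>(fn: () => Promise<T>, maxAttempts = 4): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts - 1) throw err;
      // 500ms, 1s, 2s, ... plus up to 250ms of random jitter so every
      // worker doesn't retry at the same instant
      const delayMs = 500 * 2 ** attempt + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage: keep maxRetries low on the model and let the wrapper pace retries
// const summary = await withBackoff(() => llm.invoke([{ role: "user", content: doc }]));
```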
2. Your agent loop is calling tools too often
Agents can create hidden request multiplication. One user query can trigger multiple LLM calls across planning, tool selection, and final response generation.
```typescript
// Prebuilt ReAct agent from LangGraph
import { createReactAgent } from "@langchain/langgraph/prebuilt";

const agent = createReactAgent({ llm, tools });

// One input may produce several model calls: planning, tool calls, final answer
await agent.invoke({
  messages: [
    { role: "user", content: "Analyze these 50 records and draft a report" },
  ],
});
```
If the agent is doing bulk work, consider replacing it with a deterministic pipeline, as sketched after this list:

- split documents first
- run bounded map steps
- aggregate results once
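A minimal sketch of that shape, reusing the llm and pLimit limiter from the earlier fixed example; the function name and prompts are illustrative.

```typescript
// Minimal sketch: map each record with bounded concurrency, then aggregate once.
// Reuses the `llm` and `limit` instances defined earlier; prompts are illustrative.
async function analyzeRecords(records: string[]) {
  // Map step: one bounded summarization call per record
  const partials = await Promise.all(
    records.map((record) =>
      limit(() =>
        llm.invoke([
          { role: "system", content: "Summarize this record in two sentences." },
          { role: "user", content: record },
        ])
      )
    )
  );

  // Aggregate step: a single final call over the partial summaries
  return llm.invoke([
    { role: "system", content: "Draft a report from these summaries." },
    { role: "user", content: partials.map((p) => String(p.content)).join("\n\n") },
  ]);
}
```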
3. Embeddings are being generated in one giant burst
This happens a lot during ingestion. OpenAIEmbeddings or similar embedding classes can hit rate limits fast if you send thousands of chunks at once.
```typescript
import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings({
  modelName: "text-embedding-3-small",
});

// Bad if chunks is huge and called all at once
await embeddings.embedDocuments(chunks);
```
Fix it by chunking and pacing the ingestion job, as sketched after this list:

- process in slices of 50–200 chunks
- sleep between slices if needed
- run ingestion as a background job, not on the request path
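A minimal sketch under those constraints, assuming the embeddings instance above; the slice size and pause duration are assumptions to tune against your provider's limits.

```typescript
// Minimal sketch: embed in slices with a pause between them.
// `embeddings` comes from the example above; sliceSize and pauseMs are assumptions.
async function embedInSlices(chunks: string[], sliceSize = 100, pauseMs = 1_000) {
  const vectors: number[][] = [];
  for (let i = 0; i < chunks.length; i += sliceSize) {
    const slice = chunks.slice(i, i + sliceSize);
    vectors.push(...(await embeddings.embedDocuments(slice)));
    // Pause between slices so bursts stay under the RPM/TPM ceiling
    if (i + sliceSize < chunks.length) {
      await new Promise((resolve) => setTimeout(resolve, pauseMs));
    }
  }
  return vectors;
}
```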
4. Model/provider limits are lower than you think
Sometimes the code is fine, but your account tier or deployment has low RPM/TPM limits. Azure OpenAI, OpenAI, Anthropic, and Bedrock all expose different ceilings.
Check your config for things like:
- deployment-level throughput caps
- per-minute token limits
- regional throttles
- org-level quotas
If your logs show consistent failure at the same volume, this is likely the issue.
How to Debug It
1. Log every call count per second
- Count how many times `llm.invoke`, `chain.invoke`, or `agent.invoke` runs; a counting sketch follows this step.
- If spikes line up with failures, you've found your pressure point.
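One way to get that count, as a minimal sketch: attach a callback handler when constructing the model. The handler method names come from LangChain's callback interface; which one fires for chat models can vary by @langchain/core version, so both are wired up.

```typescript
// Minimal sketch: count model calls via a callback handler on the model itself.
let callCount = 0;

const countedLlm = new ChatOpenAI({
  model: "gpt-4o-mini",
  callbacks: [
    {
      handleLLMStart: async () => {
        callCount += 1;
      },
      handleChatModelStart: async () => {
        callCount += 1;
      },
    },
  ],
});

// Log callCount on an interval and line the spikes up with your 429s.
```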
2. Inspect the exact error class
- Look for `RateLimitError`, HTTP `429`, or provider-specific exceptions; a try/catch sketch follows this step.
- In LangChain wrappers, check whether the error comes from `ChatOpenAI`, `AzureChatOpenAI`, or another integration.
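A minimal sketch for surfacing the error's shape. The status and name fields checked here are assumptions based on OpenAI-style SDK errors; log the raw error first and adjust the checks to what you actually see.

```typescript
// Minimal sketch: log the failing error's shape before deciding how to react.
try {
  await llm.invoke([{ role: "user", content: "ping" }]);
} catch (err) {
  const e = err as { name?: string; status?: number; message?: string };
  console.error("name:", e.name, "status:", e.status, "message:", e.message);
  if (e.status === 429 || e.name === "RateLimitError") {
    // Rate limited: back off rather than retrying immediately
  }
  throw err;
}
```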
3. Disable parallelism temporarily
- Replace `Promise.all` with a simple `for...of`.
- If errors disappear, concurrency was the trigger.

```typescript
for (const doc of docs) {
  await llm.invoke([{ role: "user", content: doc }]);
}
```
4. Reduce retries to isolate amplification
- Set `maxRetries` low.
- If failures become more visible but less explosive, retries were hiding the real problem.
Prevention
- Use bounded concurrency everywhere you fan out requests.
- Separate ingestion jobs from interactive request flows.
- Treat embeddings, agents, and chains as rate-limited systems by default.
- Add metrics for request rate, retry count, and provider-side throttling before shipping to production.
If you build LangChain TypeScript apps for real workloads, assume rate limits will be hit eventually. The fix is not “retry harder”; it’s controlling how much traffic you generate in the first place.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.