How to Fix 'rate limit exceeded' in LangChain (TypeScript)

By Cyprian Aarons · Updated 2026-04-21
Tags: rate-limit-exceeded, langchain, typescript

What the error means

rate limit exceeded means the provider rejected your request because you sent too many tokens, too many requests, or both, inside its current quota window. In LangChain TypeScript, this usually shows up when you call OpenAI, Anthropic, Azure OpenAI, or another model provider from a loop, batch job, or parallel worker pool.

The error often looks like this:

Error: 429 Rate limit exceeded

Or, with OpenAI’s SDK under LangChain:

RateLimitError: 429 You exceeded your current quota, please check your plan and billing details.

The Most Common Cause

The #1 cause is uncontrolled parallelism. Developers often map over a list of inputs and fire all requests at once with Promise.all(), which is fine for 5 calls and a problem for 500.

Here’s the broken pattern:

Broken                        Fixed
Fires all requests at once    Limits concurrency
Easy to write                 Safer under provider quotas
Causes bursts and 429s        Smooths traffic

// Broken: unbounded concurrency
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

const prompts = [
  "Summarize policy A",
  "Summarize policy B",
  "Summarize policy C",
  // ...hundreds more
];

const results = await Promise.all(
  prompts.map((prompt) => llm.invoke(prompt))
);

The fix is to cap concurrency. In LangChain, you can set maxConcurrency on the model or pass it to batch(), or run requests through your own queue. If you’re just invoking a model directly, the simplest production-safe pattern is a small concurrency limiter.

// Fixed: bounded concurrency
import pLimit from "p-limit";
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

const limit = pLimit(3);

const prompts = [
  "Summarize policy A",
  "Summarize policy B",
  "Summarize policy C",
];

const results = await Promise.all(
  prompts.map((prompt) =>
    limit(() => llm.invoke(prompt))
  )
);

If you’re using LangChain’s batching APIs, keep the batch size small and test against real provider limits. The important point is this: Promise.all() does not respect API quotas.
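
If you want to stay inside LangChain for this, a minimal sketch (assuming a recent @langchain/core, where maxConcurrency is accepted in the call options) looks like this:

// Sketch: bounded concurrency via LangChain's batch()
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

const prompts = [
  "Summarize policy A",
  "Summarize policy B",
  "Summarize policy C",
];

// maxConcurrency caps how many requests are in flight at once
const results = await llm.batch(prompts, { maxConcurrency: 3 });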

Other Possible Causes

1) Your prompt is too large

You can hit rate limits through token throughput, not just request count. Long system prompts, huge context windows, and document stuffing can push you over the provider’s tokens-per-minute cap.

// Too much context in one call
const response = await llm.invoke([
  { role: "system", content: bigPolicyManual },
  { role: "user", content: customerCaseFile },
]);

Fix it by chunking documents and reducing context before generation.

// Better: chunk first, then summarize in stages
const chunks = splitText(customerCaseFile);
for (const chunk of chunks) {
  await llm.invoke(`Summarize this chunk:\n${chunk}`);
}
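
The splitText helper above is a placeholder. One way to implement it, assuming you have the @langchain/textsplitters package installed, is a recursive character splitter:

// Sketch: one way to implement the splitText placeholder above
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 2000,   // characters per chunk; tune to your token budget
  chunkOverlap: 200,  // keep a little shared context between chunks
});

// splitText returns an array of string chunks
const chunks = await splitter.splitText(customerCaseFile);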

2) Retries are multiplying traffic

LangChain and the underlying SDK may retry failed requests automatically. If your app already has retries at the job runner level, you can accidentally turn one failed call into three or five more calls.

import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxRetries: 6, // client-level retries; these stack on top of any queue retries
});

If your queue system also retries jobs, lower one side. Don’t stack aggressive retries everywhere.
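
A minimal sketch of keeping retries in one layer: here the LangChain client keeps a modest retry budget, and the job runner (the jobOptions object below is a hypothetical stand-in for whatever queue you use) does not retry the same work again:

// Sketch: let the LangChain client own retries, not the queue
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxRetries: 2, // the single retry layer for this call path
});

// Hypothetical queue config: do not retry the job on top of client retries
const jobOptions = { attempts: 1 };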

3) Multiple workers share one API key

This happens in background jobs all the time. One service instance looks fine in dev; then four replicas go live and all hammer the same key.

OPENAI_API_KEY=sk-...

If every pod uses that same key without coordination, you can hit org-wide limits fast. Add a distributed rate limiter or reduce worker count per key.
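
A crude but workable sketch: divide one key’s concurrency budget across replicas. Both values below are assumptions you would configure for your own deployment (REPLICAS is not a standard variable):

// Sketch: split one API key's concurrency budget across replicas
import pLimit from "p-limit";

const GLOBAL_CONCURRENCY_PER_KEY = 8;                  // assumed per-key budget
const replicas = Number(process.env.REPLICAS ?? "1");  // assumed env var

const perWorker = Math.max(1, Math.floor(GLOBAL_CONCURRENCY_PER_KEY / replicas));
const limit = pLimit(perWorker);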

4) You’re hitting the wrong provider tier

Sometimes the code is fine and the account isn’t. The error message may be explicit:

RateLimitError: You exceeded your current quota, please check your plan and billing details.

That means no amount of code changes will fix it until you raise limits or move to a higher tier.

How to Debug It

  1. Check whether it fails on one request or many

    • If a single llm.invoke() fails consistently, suspect quota or prompt size.
    • If failures appear only under load, suspect concurrency.
  2. Log request size and timing

    • Record prompt length, token estimates, and how many requests are in flight (a logging sketch follows this list).
    • If failures cluster during bursts, your limiter is missing or too loose.
  3. Disable app-level retries temporarily

    • Set maxRetries low and remove queue retries for one test run.
    • This tells you whether retries are amplifying the problem.
  4. Test with a single worker

    • Run one process with one request at a time.
    • If it stops failing, your issue is almost certainly parallelism or shared-key contention.
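
A minimal logging sketch for step 2 (the loggedInvoke helper and its fields are illustrative, not a LangChain API): wrap invoke() so each call records prompt size, how many requests were in flight, and duration.

// Sketch: wrap invoke() to log size, timing, and in-flight count
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });

let inFlight = 0;

async function loggedInvoke(prompt: string) {
  const concurrent = ++inFlight; // in-flight requests, including this one
  const start = Date.now();
  try {
    return await llm.invoke(prompt);
  } finally {
    inFlight--;
    console.log(JSON.stringify({
      promptChars: prompt.length,
      concurrent,
      ms: Date.now() - start,
    }));
  }
}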

Prevention

  • Use bounded concurrency by default (see the config sketch after this list).
    • Start with 2-5 concurrent requests per key unless the provider docs say otherwise.
  • Put token budgeting into your pipeline.
    • Chunk large documents before they reach ChatOpenAI, AzureChatOpenAI, or any other LLM wrapper.
  • Centralize retry policy.
    • Keep retries in one layer only: either the application queue or the LangChain client config, not both.
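
A small configuration sketch that bakes the first and last points into the model wrapper itself (maxConcurrency and maxRetries are constructor options on recent LangChain JS chat models; check your version):

// Sketch: rate-limit defaults baked into the model wrapper
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxConcurrency: 3, // bounded concurrency by default
  maxRetries: 2,     // the one retry layer for this app
});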

If you’re building agents for production systems like claims triage or KYC workflows, treat rate limiting as an architecture problem, not just an exception handler issue. The fix is usually less about catching 429s and more about controlling how traffic enters the model layer.


By Cyprian Aarons, AI Consultant at Topiax.