How to Fix 'rate limit exceeded in production' in LangChain (TypeScript)

By Cyprian Aarons · Updated 2026-04-21

When LangChain throws rate limit exceeded in production, it usually means your app is sending more requests than the model provider allows in a given time window. In TypeScript apps, this often shows up after you move from local testing to real traffic, where parallel requests, retries, and long-running chains all hit the same API key.

The error is rarely “just OpenAI being down.” It’s usually your code pattern, your concurrency, or your retry settings.

The Most Common Cause

The #1 cause is uncontrolled parallelism.

A common pattern is calling Promise.all() over a list of inputs and letting LangChain fire off dozens of LLM calls at once. That works locally with 3 items, then falls apart in production when the queue spikes.

Broken pattern             | Fixed pattern
Fires all requests at once | Limits concurrency
Easy to write              | Safe under load
Causes burst rate limits   | Smooths request volume

import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  apiKey: process.env.OPENAI_API_KEY,
});

const prompts = [
  "Summarize policy A",
  "Summarize policy B",
  "Summarize policy C",
  "Summarize policy D",
];

// BROKEN: all requests go out at once
const results = await Promise.all(
  prompts.map((prompt) => llm.invoke(prompt))
);

import pLimit from "p-limit";
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  apiKey: process.env.OPENAI_API_KEY,
});

const prompts = [
  "Summarize policy A",
  "Summarize policy B",
  "Summarize policy C",
  "Summarize policy D",
];

const limit = pLimit(2); // keep only 2 in flight

// FIXED: controlled concurrency
const results = await Promise.all(
  prompts.map((prompt) => limit(() => llm.invoke(prompt)))
);

If you’re using RunnableSequence, map(), or a custom queue worker, the same rule applies. The provider sees request bursts, not your intent.

Other Possible Causes

1) Retries are multiplying traffic

LangChain retries can help with transient failures, but they also amplify load if your app is already near the limit.

import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxRetries: 6, // can make burst traffic worse
});

If you’re already hitting provider limits, reduce retries and add backoff outside the model call.

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxRetries: 2,
});
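
If you do add backoff yourself, wrap the call instead of relying on a high maxRetries. This is a minimal sketch, not LangChain's built-in retry: invokeWithBackoff is a hypothetical helper, and the err?.status === 429 check assumes the underlying OpenAI SDK attaches an HTTP status to thrown errors.

import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxRetries: 2,
});

// Hypothetical wrapper: jittered exponential backoff, retrying only on 429s.
async function invokeWithBackoff(prompt: string, attempts = 4) {
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await llm.invoke(prompt);
    } catch (err: any) {
      const isRateLimit = err?.status === 429; // error shape depends on SDK version
      if (!isRateLimit || attempt === attempts - 1) throw err;
      const delayMs = 500 * 2 ** attempt + Math.random() * 250; // jittered delay
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error("unreachable");
}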

2) Multiple workers share one API key

This is common in production when you scale horizontally. One pod looks fine; five pods all using the same key push you over the account limit.

OPENAI_API_KEY=sk-prod-shared-key

If each worker can independently spike traffic, rate limits will look random. The fix is usually shared throttling or a centralized job queue.
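
If you cannot give each pod its own quota, a Redis-backed queue lets the whole fleet share one limiter. The sketch below assumes BullMQ and a reachable Redis instance; the queue name, connection details, and limits are placeholders, not values from this article.

import { Worker } from "bullmq";
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });

// Every pod runs this worker against the same Redis-backed queue, so the
// limiter below is enforced across the fleet rather than per process.
const worker = new Worker(
  "llm-jobs", // hypothetical queue name
  async (job) => {
    const result = await llm.invoke(job.data.prompt);
    return result.content;
  },
  {
    connection: { host: "redis", port: 6379 }, // adjust for your environment
    concurrency: 2, // in-flight jobs per worker process
    limiter: { max: 30, duration: 60_000 }, // ~30 jobs per minute, queue-wide
  }
);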

3) Your chain makes more LLM calls than you think

A single user request may trigger multiple model calls through tools, agents, retrieval steps, or output parsing retries.

// Looks like a single call, but tools, retrieval steps, and parser retries
// can multiply the model calls behind it:
const chain = prompt.pipe(llm).pipe(parser);

Agents are especially noisy because tool loops can re-enter the model several times per user action. Check whether your “one endpoint” is actually making five or ten upstream requests.
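
To see the real fan-out, count model starts per request. This is a sketch: the counter is purely illustrative, and it assumes a LangChain version that accepts plain handler objects in the callbacks call option.

import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });
const prompt = ChatPromptTemplate.fromTemplate("Summarize: {input}");
const chain = prompt.pipe(llm).pipe(new StringOutputParser());

let llmCalls = 0; // model starts observed during this invocation

await chain.invoke(
  { input: "policy A" },
  {
    callbacks: [
      {
        // Fires every time any model call starts inside the chain
        handleLLMStart: async () => {
          llmCalls += 1;
        },
      },
    ],
  }
);

console.log(`upstream LLM calls for this request: ${llmCalls}`);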

4) Token-heavy prompts increase provider throttling

Some providers enforce rate limits by tokens per minute, not just request count. Large context windows can trip limits even with low concurrency.

const longContext = docs.join("\n\n"); // huge payload
await llm.invoke(`Answer using this context:\n${longContext}`);

Trim retrieved documents, summarize first, or cap context size before sending it to the model.
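
A rough way to cap context size, reusing docs and llm from the snippet above. Characters are only a proxy for tokens and the budget is an arbitrary placeholder, so treat this as a sketch rather than a provider-aware limit.

const MAX_CONTEXT_CHARS = 12_000; // placeholder budget, not a provider limit

function buildContext(allDocs: string[], budget = MAX_CONTEXT_CHARS): string {
  const parts: string[] = [];
  let used = 0;
  for (const doc of allDocs) {
    if (used + doc.length > budget) break; // stop before exceeding the budget
    parts.push(doc);
    used += doc.length + 2; // roughly account for the "\n\n" separator
  }
  return parts.join("\n\n");
}

const context = buildContext(docs);
await llm.invoke(`Answer using this context:\n${context}`);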

How to Debug It

  1. Log every upstream call

    • Add request IDs around each invoke().
    • Log model name, prompt size, and timestamps.
    • If one user action triggers many logs in a burst, you found the problem (a combined sketch for steps 1-3 follows this list).
  2. Check whether the error is a provider 429

    • In LangChain/OpenAI setups, this often surfaces as an HTTP 429 Too Many Requests.
    • You may also see error names like RateLimitError (exposed as OpenAI.RateLimitError in the Node SDK), depending on SDK version.
    • If it’s a true rate limit issue, retries alone won’t fix it.
  3. Measure concurrency under load

    • Count in-flight requests per process.
    • Compare local dev vs production worker count.
    • If production has more pods or threads, assume concurrency is higher than expected.
  4. Inspect chain behavior

    • Turn on verbose logging for chains and agents.
    • Watch for repeated tool calls or retry loops.
    • If a single endpoint fans out into multiple LLM calls, reduce fan-out before touching limits.
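
Here is one way to wire steps 1 to 3 together. loggedInvoke is a hypothetical wrapper, the JSON log lines are just console output, and the err?.status === 429 check assumes the underlying SDK attaches an HTTP status to thrown errors.

import { randomUUID } from "node:crypto";
import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });

let inFlight = 0; // rough per-process concurrency gauge

// Hypothetical wrapper: emits start/end events with a request ID, prompt size,
// timing, in-flight count, and whether a failure was a provider 429.
async function loggedInvoke(prompt: string) {
  const requestId = randomUUID();
  const startedAt = Date.now();
  inFlight += 1;
  console.log(JSON.stringify({ requestId, event: "llm_start", promptChars: prompt.length, inFlight }));
  try {
    return await llm.invoke(prompt);
  } catch (err: any) {
    const rateLimited = err?.status === 429; // error shape depends on SDK version
    console.log(JSON.stringify({ requestId, event: "llm_error", rateLimited }));
    throw err;
  } finally {
    inFlight -= 1;
    console.log(JSON.stringify({ requestId, event: "llm_end", ms: Date.now() - startedAt, inFlight }));
  }
}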

Prevention

  • Put a concurrency cap around every batch job

    • Use p-limit, BullMQ workers, or an internal queue.
    • Never let unbounded Promise.all() hit an LLM provider in production.
  • Treat retries as part of capacity planning

    • Keep maxRetries low.
    • Add jittered backoff and stop retrying on hard rate limits unless you have headroom.
  • Instrument token usage and request counts

    • Track requests per minute and tokens per minute by service.
    • Alert before you hit provider ceilings so production doesn’t discover them first (see the sketch below).
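
A minimal sketch of that instrumentation, assuming a recent @langchain/core where the returned message carries usage_metadata; meteredInvoke and the in-memory counters are hypothetical, and in production you would push these numbers to your metrics system.

import { ChatOpenAI } from "@langchain/openai";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });

let requestCount = 0; // requests sent by this process
let tokenCount = 0;   // total tokens reported by the provider

async function meteredInvoke(prompt: string) {
  const result = await llm.invoke(prompt);
  requestCount += 1;
  // Recent LangChain versions attach usage_metadata to the returned message;
  // older ones may expose token counts under response_metadata instead.
  tokenCount += result.usage_metadata?.total_tokens ?? 0;
  console.log(`requests=${requestCount} tokens=${tokenCount}`);
  return result;
}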

If you’re seeing RateLimitError or HTTP 429 Too Many Requests in LangChain TypeScript, start with concurrency. In real systems, that’s usually where the problem lives.


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

