How to Fix 'token limit exceeded in production' in LangChain (TypeScript)

By Cyprian Aarons · Updated 2026-04-21
Tags: token-limit-exceeded-in-production, langchain, typescript

When LangChain throws a "token limit exceeded" error in production, it usually means your prompt, retrieved context, chat history, or model output pushed the request past the model’s context window. In TypeScript apps, this shows up most often after you add memory, RAG, or long-running conversations and then deploy with real user traffic.

The fix is almost never “just use a bigger model.” You need to find which part of the chain is growing uncontrollably and cap it before the request hits the LLM.

The Most Common Cause

The #1 cause is unbounded conversation history being stuffed into the prompt on every request. In LangChain TypeScript, this often happens with BufferMemory, ConversationChain, or manual message accumulation.

Broken                          Fixed
Appends every message forever   Trims or summarizes history
No token budget                 Enforces a max token window

Here’s the broken pattern:
// Broken: memory grows without bounds
import { ChatOpenAI } from "@langchain/openai";
import { BufferMemory } from "langchain/memory";
import { ConversationChain } from "langchain/chains";

const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });
const memory = new BufferMemory({
  returnMessages: true,
});

const chain = new ConversationChain({
  llm,
  memory,
});

await chain.call({ input: "Summarize our policy discussion." });
// After enough turns, you'll hit errors like:
// Error: This model's maximum context length is 128000 tokens.
// However, your messages resulted in 131245 tokens.

The fixed pattern uses a bounded memory strategy:

import { ChatOpenAI } from "@langchain/openai";
import { BufferWindowMemory } from "langchain/memory";
import { ConversationChain } from "langchain/chains";

const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });
const memory = new BufferWindowMemory({
  k: 6, // keep only the last 6 exchanges (12 messages)
  returnMessages: true,
});

const chain = new ConversationChain({
  llm,
  memory,
});

await chain.call({ input: "Summarize our policy discussion." });

If you need long conversations, use summary-based memory instead of raw accumulation:

import { ConversationSummaryMemory } from "langchain/memory";

That keeps the prompt size stable while preserving useful context.
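
A minimal sketch of that setup, reusing the same llm for both the conversation and the running summary (the prompt text is just a placeholder):

import { ChatOpenAI } from "@langchain/openai";
import { ConversationSummaryMemory } from "langchain/memory";
import { ConversationChain } from "langchain/chains";

const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });

// Older turns are compressed into a rolling summary instead of being replayed verbatim
const memory = new ConversationSummaryMemory({
  llm, // model used to produce the summary
  returnMessages: true,
});

const chain = new ConversationChain({ llm, memory });
await chain.call({ input: "Summarize our policy discussion." });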

Other Possible Causes

1. Retrieval returns too many documents

With RAG, similaritySearch can easily dump too much text into the prompt.

// Too many docs
const docs = await vectorStore.similaritySearch(query, 10);

Fix it by reducing k and truncating content before prompt assembly:

const docs = await vectorStore.similaritySearch(query, 3);
const context = docs.map(d => d.pageContent.slice(0, 1200)).join("\n\n");

If you use RetrievalQAChain, make sure your retriever is capped:

const retriever = vectorStore.asRetriever({ k: 3 });
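
For example, wired into RetrievalQAChain (a sketch; it assumes the llm and vectorStore shown elsewhere in this guide already exist):

import { RetrievalQAChain } from "langchain/chains";

// The capped retriever keeps the stuffed context to at most 3 documents
const qaChain = RetrievalQAChain.fromLLM(llm, vectorStore.asRetriever({ k: 3 }));
const result = await qaChain.call({ query: "What does the refund policy say?" });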

2. Your prompt template is too verbose

A bloated system prompt plus user input plus retrieved docs can exceed limits fast.

const systemPrompt = `
You are an extremely detailed assistant...
[200 lines of instructions]
`;

Trim it to only what changes behavior:

const systemPrompt = `
Answer using only provided context.
If context is insufficient, say so.
Return concise answers.
`;

3. The model output limit is too high

Sometimes the input is fine, but a high maxTokens reserves too much of the context window for output, so input plus requested output no longer fits.

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxTokens: 4000,
});

Lower it if you don’t need long responses:

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxTokens: 500,
});

This matters when chains call tools or produce structured outputs that expand unexpectedly.

4. You are concatenating raw documents or logs

This one shows up in production pipelines where someone passes full PDFs, tickets, or audit logs directly into a chain.

const prompt = `
User question:
${question}

Docs:
${docs.map(d => d.pageContent).join("\n")}
`;

Instead, pre-process and extract only relevant chunks:

const trimmedDocs = docs
  .map(d => d.pageContent.slice(0, 800))
  .join("\n---\n");

If you’re feeding logs into an agent, summarize first before retrieval or generation.
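
One way to do that, sketched with loadSummarizationChain (rawLogText and llm are assumed to exist; the splitter settings are illustrative):

import { loadSummarizationChain } from "langchain/chains";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

// Chunk the raw logs, then map-reduce them into a short summary
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 2000, chunkOverlap: 100 });
const logDocs = await splitter.createDocuments([rawLogText]);

const summarizer = loadSummarizationChain(llm, { type: "map_reduce" });
const { text: logSummary } = await summarizer.call({ input_documents: logDocs });

// Feed logSummary, not rawLogText, into the downstream agent or chain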

How to Debug It

  1. Print token counts before calling the LLM

    Use the tokenizer for your target model and inspect each component separately (a counting sketch follows this list):

    • system prompt
    • chat history
    • retrieved docs
    • user input

    If one piece spikes after several requests, that’s your culprit.

  2. Log the exact LangChain error

    Look for messages like:

    • Error: This model's maximum context length is X tokens
    • BadRequestError: Prompt too long
    • 400 InvalidRequestError

    The wording tells you whether you exceeded input tokens or output constraints.

  3. Disable parts of the chain one by one

    Run the same request with:

    • no memory
    • no retriever
    • shorter system prompt
    • smaller maxTokens

    If the error disappears when memory is removed, you found the leak.

  4. Inspect runtime growth in production

    Add metrics for:

    • number of messages in history
    • total retrieved characters/tokens
    • average completion length

    Production failures usually come from gradual growth, not a single bad request.
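
Here is the counting sketch for step 1. It assumes js-tiktoken as the tokenizer; any tokenizer that matches your model works, and the part names passed in are up to you:

import { getEncoding } from "js-tiktoken";

const enc = getEncoding("o200k_base"); // encoding used by gpt-4o-family models (use cl100k_base for older models)

function logTokenCounts(parts: Record<string, string>): number {
  let total = 0;
  for (const [label, text] of Object.entries(parts)) {
    const tokens = enc.encode(text).length;
    console.log(`${label}: ${tokens} tokens`);
    total += tokens;
  }
  console.log(`total input: ${total} tokens`);
  return total;
}

// Call it right before invoking the chain, e.g.:
// logTokenCounts({ systemPrompt, history: historyText, docs: contextText, input: userInput });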

Prevention

  • Use bounded memory by default.
    • Prefer BufferWindowMemory or summary-based memory over raw buffers.
  • Put hard caps on retrieval.
    • Limit k, truncate document text, and rank before stuffing context.
  • Add token budgeting in code.
    • Fail early if prompt construction crosses a threshold instead of sending a doomed request.

If you’re building agents for production systems like banking workflows or claims support, treat token budget like any other resource limit. Enforce it at composition time, not after OpenAI rejects the request.
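
A minimal fail-early guard along those lines. It uses a rough characters-per-token heuristic so it has no tokenizer dependency; the budget number is illustrative and should sit well below your model’s context window, leaving headroom for output tokens:

const MAX_PROMPT_TOKENS = 24_000; // illustrative budget, tune to your model

// Rough heuristic: ~4 characters per token for English prose
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function assertWithinBudget(parts: Record<string, string>): void {
  const total = Object.values(parts).reduce((sum, p) => sum + estimateTokens(p), 0);
  if (total > MAX_PROMPT_TOKENS) {
    throw new Error(`Prompt budget exceeded: ~${total} tokens > ${MAX_PROMPT_TOKENS}`);
  }
}

// Run this during prompt assembly, before the LLM call:
// assertWithinBudget({ systemPrompt, history, context, question });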

