How to Fix 'token limit exceeded during development' in LangChain (TypeScript)

By Cyprian Aarons · Updated 2026-04-21
Tags: token-limit-exceeded-during-development, langchain, typescript

When you see token limit exceeded during development in a LangChain TypeScript app, it usually means your prompt, retrieved context, chat history, or tool output is too large for the model’s context window. In practice, this shows up during local testing when you keep appending messages or stuffing full documents into a single call.

The key point: this is not a LangChain bug. It’s almost always a prompt assembly problem, and the fix is to control what gets sent to the model.

The Most Common Cause

The #1 cause is unbounded chat history or document stuffing.

A common broken pattern is passing every prior message into ChatOpenAI through a MessagesPlaceholder, or dumping full retrieval results into the prompt without trimming them first. That works for small tests and then blows up once the conversation or documents grow.

Broken pattern → fixed pattern:

  • Send all messages / full docs every time → Trim history and cap retrieved context
  • No token budget → Explicit token budgeting
  • Assumes the model will “handle it” → Enforce input size before invoke

// BROKEN: unbounded history + full docs
import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate, MessagesPlaceholder } from "@langchain/core/prompts";

const model = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });

const prompt = ChatPromptTemplate.fromMessages([
  ["system", "You are a banking assistant."],
  new MessagesPlaceholder("history"),
  ["human", "{input}"],
]);

const chain = prompt.pipe(model);

// history keeps growing forever
await chain.invoke({
  input: "Summarize my policy options.",
  history: allPreviousMessages,
});

// FIXED: trim history before invoking the model
import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate, MessagesPlaceholder } from "@langchain/core/prompts";
import { trimMessages } from "@langchain/core/messages";

const model = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });

const prompt = ChatPromptTemplate.fromMessages([
  ["system", "You are a banking assistant."],
  new MessagesPlaceholder("history"),
  ["human", "{input}"],
]);

const trimmedHistory = await trimMessages(allPreviousMessages, {
  maxTokens: 2000,
  strategy: "last", // keep the most recent messages within the budget
  tokenCounter: async (msgs) => msgs.length * 100, // rough ~100 tokens/message estimate; swap in a real tokenizer for accuracy
});

const chain = prompt.pipe(model);

await chain.invoke({
  input: "Summarize my policy options.",
  history: trimmedHistory,
});

If you’re using retrieval, the same issue applies. A RetrievalQAChain or custom RAG flow that injects entire documents will hit the model limit fast. Keep only top-k chunks and cap chunk size.

Other Possible Causes

1) Your retriever returns too many chunks

If you use vectorStore.asRetriever(20) and each chunk is large, your context can explode.

const retriever = vectorStore.asRetriever(20); // too many for most prompts

Fix it by lowering k and tightening chunk size at ingestion:

const retriever = vectorStore.asRetriever(4);
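
Chunk size matters as much as k. Below is a minimal sketch of capping chunk size at ingestion with RecursiveCharacterTextSplitter from @langchain/textsplitters; the 800/100 values are illustrative, and rawDocs / vectorStore are assumed to already exist in your ingestion code.

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

// Smaller chunks at ingestion mean each retrieved hit costs fewer tokens.
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 800,    // characters per chunk (illustrative)
  chunkOverlap: 100, // small overlap so context survives chunk boundaries
});

const chunks = await splitter.splitDocuments(rawDocs); // rawDocs: the Document[] you loaded
await vectorStore.addDocuments(chunks);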

2) Tool output is being injected raw

Agents can fail when a tool returns massive JSON, logs, HTML, or entire records.

// BROKEN
return JSON.stringify(bigApiResponse);

Fix by summarizing or selecting only fields needed by the model:

// FIXED
return JSON.stringify({
  id: bigApiResponse.id,
  status: bigApiResponse.status,
  summary: bigApiResponse.summary,
});
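
If you can’t predict the tool’s output shape, a blunt safety net is to hard-cap the output length before it reaches the agent. A rough sketch; MAX_TOOL_CHARS is an arbitrary name and limit used here for illustration.

// Last-resort cap: truncate oversized tool output before the agent sees it.
const MAX_TOOL_CHARS = 4000; // illustrative limit; tune per tool

function capToolOutput(output: string): string {
  return output.length > MAX_TOOL_CHARS
    ? output.slice(0, MAX_TOOL_CHARS) + "\n[truncated]"
    : output;
}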

3) You picked a smaller-context model than you think

Some TypeScript apps silently switch models via env vars like OPENAI_MODEL. If your dev environment points to a smaller window, you’ll get errors sooner.

const model = new ChatOpenAI({
  model: process.env.OPENAI_MODEL ?? "gpt-4o-mini",
});

Check the actual deployed value and compare it to your expected context size.
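
A cheap safeguard, sketched below, is to log and validate the resolved model at startup so a surprise env value shows up immediately; the allow-list is an assumption you’d adapt to your own environments.

// Fail fast when OPENAI_MODEL points at something unexpected for this environment.
const EXPECTED_MODELS = new Set(["gpt-4o-mini", "gpt-4o"]); // illustrative allow-list
const resolvedModel = process.env.OPENAI_MODEL ?? "gpt-4o-mini";

console.log(`[llm] using model: ${resolvedModel}`);
if (!EXPECTED_MODELS.has(resolvedModel)) {
  throw new Error(`Unexpected OPENAI_MODEL value: ${resolvedModel}`);
}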

4) Your system prompt is bloated

Teams often paste policy docs, legal text, and product manuals directly into the system message. That eats tokens before user input even starts.

// BROKEN
const system = `
You are an assistant.
[...12 pages of compliance text...]
`;

Move long policy content into retrieval or external rules evaluation instead of hardcoding it into every call.
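
One way to do that, as a sketch: keep the system message short and retrieve only the relevant policy sections per request. Here policyStore (a vector store over the compliance docs) and userQuestion are assumed to already exist in your app.

// FIXED (sketch): short system prompt plus retrieved policy excerpts per request.
const system = "You are a banking assistant. Follow the attached policy excerpts.";

const policyRetriever = policyStore.asRetriever(3); // low k keeps the context small
const policyDocs = await policyRetriever.invoke(userQuestion); // getRelevantDocuments on older versions
const policyContext = policyDocs.map((d) => d.pageContent).join("\n---\n");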

How to Debug It

  1. Print the final payload before invoke

    • Log the assembled messages, retrieved docs, and tool outputs.
    • In LangChain JS/TS, inspect what you pass into prompt.pipe(model).invoke(...).
  2. Count tokens on each component (see the sketch after this list)

    • Check system prompt length.
    • Check chat history length.
    • Check retrieved context length.
    • Check tool output length.
    • The problem is usually one oversized piece plus accumulated history.
  3. Reduce one input at a time

    • Set history = [].
    • Set retriever k = 2.
    • Disable tools.
    • If the error disappears, you’ve found the source.
  4. Read the actual model error

    • Common runtime messages include:
      • 400 Bad Request
      • context_length_exceeded
      • This model's maximum context length is ... tokens
      • OpenAI-style errors surfaced through LangChain wrappers like BadRequestError
    • Don’t guess from LangChain alone; inspect the provider response.
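
For step 2, a crude per-component count is usually enough to find the offender. A minimal sketch using a ~4 characters-per-token heuristic; systemPrompt, historyMessages, retrievedDocs, and userInput are assumed names for the pieces of your prompt.

// Rough estimate: ~4 characters per token. Swap in a real tokenizer
// (for example js-tiktoken) if you need exact counts.
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

const components = {
  system: systemPrompt,
  history: historyMessages.map((m) => String(m.content)).join("\n"),
  context: retrievedDocs.map((d) => d.pageContent).join("\n"),
  input: userInput,
};

for (const [name, text] of Object.entries(components)) {
  console.log(`${name}: ~${estimateTokens(text)} tokens`);
}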

Prevention

  • Put hard caps on everything

    • Cap chat history with trimMessages.
    • Cap retrieval with low k.
    • Cap tool output size before returning it to the agent.
  • Design prompts around budgets

    • Reserve tokens for user input and completion.
    • Don’t spend your entire window on instructions and context.
  • Add token checks in CI or local tests

    • Run a preflight check on prompt assembly.
    • Fail fast when combined inputs cross your threshold, as in the sketch after this list.
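
A minimal preflight guard along those lines, assuming the same rough character-based estimate as above; the 6000-token budget is illustrative.

// Fail fast in tests or CI when the assembled prompt crosses your budget.
const TOKEN_BUDGET = 6000; // illustrative; leave headroom for the completion

function assertWithinBudget(assembledPrompt: string): void {
  const estimated = Math.ceil(assembledPrompt.length / 4); // rough ~4 chars/token
  if (estimated > TOKEN_BUDGET) {
    throw new Error(`Prompt uses ~${estimated} tokens, over the ${TOKEN_BUDGET} budget`);
  }
}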

If you’re building agents in production, treat token budgeting like memory management. The fix is not “use a bigger model” every time; it’s controlling what reaches the model in the first place.



By Cyprian Aarons, AI Consultant at Topiax.
