How to Fix 'context length exceeded when scaling' in LangChain (TypeScript)

By Cyprian Aarons · Updated 2026-04-21

When LangChain throws 'context length exceeded' as your app scales, it usually means your prompt, retrieved documents, chat history, or tool output grew past the model’s token limit. In TypeScript apps, this tends to show up after you add memory, retrieval, or looped agent calls and the chain starts accumulating more text on every run.

The error is not about LangChain being “broken”. It’s about your request payload getting too large for the model you picked.

The Most Common Cause

The #1 cause is unbounded chat history or document stuffing. You keep appending messages or retrieved chunks into the prompt, and eventually ChatOpenAI gets a payload that exceeds the model context window.

Here’s the broken pattern:

import { ChatOpenAI } from "@langchain/openai";
import { BufferMemory } from "langchain/memory";
import { ConversationChain } from "langchain/chains";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

const memory = new BufferMemory({
  returnMessages: true,
  memoryKey: "history",
  inputKey: "input",
});

const chain = new ConversationChain({
  llm,
  memory,
});

for (const msg of userMessages) {
  const res = await chain.invoke({ input: msg });
  console.log(res.response);
}

This looks fine until the conversation gets long. BufferMemory keeps everything, so each call sends a bigger prompt than the last.

Here’s the fixed pattern:

import { ChatOpenAI } from "@langchain/openai";
import { BufferWindowMemory } from "langchain/memory";
import { ConversationChain } from "langchain/chains";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

const memory = new BufferWindowMemory({
  k: 6, // keep only the last 6 turns
  returnMessages: true,
  memoryKey: "history",
  inputKey: "input",
});

const chain = new ConversationChain({
  llm,
  memory,
});

for (const msg of userMessages) {
  const res = await chain.invoke({ input: msg });
  console.log(res.response);
}

If you need long-term context, don’t dump it all into the prompt. Summarize old turns and store them elsewhere.
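
If summarization fits your use case, ConversationSummaryBufferMemory keeps recent turns verbatim and compresses older ones into a running summary. A minimal sketch, reusing the llm from above; the 500-token limit is an assumption you should tune for your model:

import { ConversationSummaryBufferMemory } from "langchain/memory";

const summaryMemory = new ConversationSummaryBufferMemory({
  llm, // the same chat model writes the running summary of older turns
  maxTokenLimit: 500, // once history grows past this, older turns get summarized
  returnMessages: true,
  memoryKey: "history",
  inputKey: "input",
});

const summarizingChain = new ConversationChain({ llm, memory: summaryMemory });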

Broken → Fixed
• BufferMemory with unlimited growth → BufferWindowMemory or summary-based memory
• Full chat transcript sent every turn → Last N turns only
• Raw documents concatenated into prompt → Retrieved top-k chunks with size limits

Other Possible Causes

1. Retrieval returns too many chunks

If you use a vector store retriever with a high k, your prompt can blow up fast.

const retriever = vectorStore.asRetriever(12); // too many for large chunks

Fix it by lowering k and chunk size:

const retriever = vectorStore.asRetriever(4);

Also make sure your splitter is sane, e.g. with RecursiveCharacterTextSplitter:

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 800,
  chunkOverlap: 100,
});
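
Wiring that splitter into a store, a minimal sketch; MemoryVectorStore, OpenAIEmbeddings, and rawDocs are assumptions standing in for your own store, embeddings, and loaded documents:

import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";

// rawDocs: documents you loaded elsewhere; splitter is the instance configured above
const chunks = await splitter.splitDocuments(rawDocs);

const vectorStore = await MemoryVectorStore.fromDocuments(chunks, new OpenAIEmbeddings());
const retriever = vectorStore.asRetriever(4); // low k keeps retrieved context bounded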

2. Tool output is being injected verbatim

Agents can fail when a tool returns huge JSON, HTML, logs, or API responses.

import { DynamicTool } from "@langchain/core/tools";

const tools = [
  new DynamicTool({
    name: "fetchCustomerData",
    description: "Fetch customer profile",
    func: async () => JSON.stringify(bigResponse), // dumps the whole API response into the prompt
  }),
];

Trim before returning:

func: async () =>
  JSON.stringify({
    id: bigResponse.id,
    status: bigResponse.status,
    riskScore: bigResponse.riskScore,
  }),
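
If you cannot cherry-pick fields because the response shape varies, a blunt fallback is to cap the string length before it re-enters the prompt. A hypothetical helper, not part of LangChain:

const MAX_TOOL_OUTPUT_CHARS = 2000; // rough budget; tune it for your model

// Truncate oversized tool output so it cannot blow up the agent's context
function capToolOutput(raw: string): string {
  return raw.length > MAX_TOOL_OUTPUT_CHARS
    ? raw.slice(0, MAX_TOOL_OUTPUT_CHARS) + " ...[truncated]"
    : raw;
}

// e.g. func: async () => capToolOutput(JSON.stringify(bigResponse)),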

3. Recursive agent loops are amplifying tokens

If an agent keeps calling itself or re-running steps, each iteration adds more context.

Watch for patterns like this:

while (true) {
  const result = await agentExecutor.invoke({ input });
}

Add hard limits:

const agentExecutor = new AgentExecutor({
  agent,
  tools,
  maxIterations: 3,
});
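
If you drive the loop yourself instead of letting AgentExecutor cap it, bound it explicitly. A minimal sketch; the stopping condition on result.output is an assumption about your executor's output shape:

const MAX_TURNS = 3;

for (let turn = 0; turn < MAX_TURNS; turn++) {
  const result = await agentExecutor.invoke({ input });
  if (result.output) break; // stop as soon as there is a usable answer
}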

4. You picked a smaller-context model than you think

It’s easy to assume every OpenAI model has plenty of room. Not all of them do, and the limits differ between model families.

Check your config:

const llm = new ChatOpenAI({
  model: "gpt-4o-mini", // smaller context than larger GPT-4 variants
});

If your workload is genuinely heavy, a model with a larger context window can help, but shrink the prompt first rather than switching models blindly.

How to Debug It

  1. Log token estimates before every LLM call
    Count prompt size using a tokenizer or a rough character-to-token estimate; see the sketch after this list. If the number climbs per request, you’ve found an accumulation problem.

  2. Print the final messages passed to the model
    In LangChain TypeScript, inspect the actual array going into ChatOpenAI. Look for duplicated history, repeated docs, or giant tool outputs.

  3. Disable components one at a time
    Turn off memory first, then retrieval, then tools. If the error disappears when one component is removed, that component is your culprit.

  4. Check stack traces and model limits
    Errors often surface as provider errors wrapped by LangChain. Look for messages like:

    • Error from OpenAI
    • context length exceeded
    • This model's maximum context length is ...
    • BadRequestError
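
A rough estimate is usually enough to spot the growth described in step 1. A minimal sketch using the common ~4 characters per token heuristic instead of a real tokenizer; messages stands in for whatever payload you are about to send:

// Very rough heuristic: roughly 4 characters per token for English text
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

const promptText = JSON.stringify(messages); // messages: the array you are about to send
console.log(`~${estimateTokens(promptText)} estimated tokens in this request`);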

Prevention

  • Use bounded memory by default.
  • Cap retrieval at low k and keep chunks small.
  • Summarize tool output and old conversation turns instead of passing raw data back into the prompt.
  • Add token budget checks in CI for prompts that are generated dynamically; see the sketch after this list.
  • Treat every extra message as paid context; if it doesn’t help the next answer, don’t send it.
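
As a sketch of that CI check: buildPrompt and sampleInput are hypothetical stand-ins for your own prompt builder and a representative input, and the budget reuses the rough 4-characters-per-token estimate:

import { strict as assert } from "node:assert";

const TOKEN_BUDGET = 6000; // keep well under the target model's context window

// buildPrompt(): hypothetical function that assembles your dynamic prompt
const prompt = buildPrompt(sampleInput);
assert.ok(prompt.length / 4 <= TOKEN_BUDGET, "Generated prompt exceeds token budget");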

If you’re building agents for banking or insurance workflows, this matters even more because customer history and policy data grow quickly. The fix is usually not “use a bigger model”; it’s controlling what enters the prompt in the first place.


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

