How to Fix 'context length exceeded' Errors in Production in LangChain (TypeScript)
A "context length exceeded" error means your request payload is larger than the model can accept. In LangChain TypeScript, it usually shows up after you keep appending chat history, retrieved documents, or tool outputs until the prompt crosses the model's token limit.
The failure is common in production because traffic patterns are different from local tests. A single user thread can grow for hours, and your chain keeps stuffing more text into every call.
The Most Common Cause
The #1 cause is unbounded conversation memory. People wire BufferMemory or manually append messages, then reuse the same history forever.
Here’s how the broken pattern compares to the fix:
| Broken | Fixed |
|---|---|
| Keeps all messages forever | Trims or summarizes history |
| No token budgeting | Enforces a max token window |
| Fails once conversations get long | Stays within model limits |
// ❌ Broken: unbounded chat history
import { ChatOpenAI } from "@langchain/openai";
import { BufferMemory } from "langchain/memory";
import { ConversationChain } from "langchain/chains";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });

const memory = new BufferMemory({
  returnMessages: true,
  memoryKey: "history",
});

const chain = new ConversationChain({
  llm,
  memory,
});

await chain.invoke({ input: "Summarize my policy claim status" });

// Later, after many turns...
// Error you'll eventually see:
// BadRequestError: 400 This model's maximum context length is 128000 tokens.
// However, your messages resulted in 131245 tokens.
// ✅ Fixed: bounded memory that summarizes older turns
import { ChatOpenAI } from "@langchain/openai";
import { ConversationSummaryBufferMemory } from "langchain/memory";
import { ConversationChain } from "langchain/chains";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });

const memory = new ConversationSummaryBufferMemory({
  llm,
  maxTokenLimit: 6000,
  returnMessages: true,
  memoryKey: "history",
});

const chain = new ConversationChain({
  llm,
  memory,
});

await chain.invoke({ input: "Summarize my policy claim status" });
If you’re on newer LangChain code, the same idea applies with RunnableWithMessageHistory: do not store unlimited messages per session. Trim, summarize, or persist only the last N turns.
Other Possible Causes
1) Retriever returns too many documents
A common production bug is using k=10 or k=20 on a vector store and stuffing full document text into the prompt.
// Too much context
const retriever = vectorStore.asRetriever(12);
Fix it by reducing k and chunk size:
const retriever = vectorStore.asRetriever(4);
Also shorten chunks when indexing:
chunkSize: 800,
chunkOverlap: 100
2) Tool output is being passed through raw
If an agent tool returns large JSON, HTML, logs, or database rows, LangChain will happily inject it into the next prompt.
// Bad: raw tool output is huge
return JSON.stringify(resultRows);
Trim it before returning:
return JSON.stringify(resultRows.slice(0, 5));
Or return a compact summary instead of full payloads:
return JSON.stringify({
  count: resultRows.length,
  topMatches: resultRows.slice(0, 3),
});
3) Prompt template is duplicating context
Sometimes the prompt template includes {history}, {input}, and {context}, and the same text ends up in more than one of those fields.
const prompt = ChatPromptTemplate.fromMessages([
  ["system", "Use this context:\n{context}\nHistory:\n{history}"],
  ["human", "{input}\n\nContext again:\n{context}"],
]);
That doubles your token usage for no reason. Keep one source of truth for each piece of context.
4) Model limit mismatch
You may be using a smaller-context model in production than in local testing.
const llm = new ChatOpenAI({
  model: "gpt-3.5-turbo",
});
If your prompt was tuned for a larger window, switch to a larger-context model or reduce input size. Check the exact error message:
- `This model's maximum context length is ...`
- `messages resulted in ... tokens`
- `BadRequestError: 400`
How to Debug It
- Log token usage before every call (see the token-count sketch below).
  - Count prompt tokens and completion tokens.
  - If you’re near the limit before generation starts, the input is too large.
- Inspect what changed between working and failing requests.
  - Compare message count, retrieved docs, and tool outputs.
  - Production issues usually come from one user thread growing over time.
- Binary search the payload.
  - Remove memory first.
  - Then remove retrieval.
  - Then remove tools.
  - The component that makes the error disappear is your culprit.
- Print serialized inputs.
  - Log the final prompt/messages sent to the model.
  - In LangChain TypeScript, inspect what your chain actually passes into invoke() rather than what you think it passes.
Example diagnostic snippet:
console.log("messages:", messages.length);
console.log("last message chars:", messages[messages.length - 1]?.content?.length);
console.log("retrieved docs:", docs.map(d => d.pageContent.length));
Prevention
- Use bounded memory by default:
  - ConversationSummaryBufferMemory
  - message trimming
  - session-level TTLs
- Budget tokens at every layer:
  - user input
  - chat history
  - retrieved docs
  - tool results
- Add guardrails in production (see the budget-check sketch after this list):
  - reject oversized uploads early
  - cap retriever k
  - truncate verbose tool output before it reaches the LLM
If you treat context like a finite resource instead of an infinite log file, this error stops being random and becomes predictable. That’s the difference between a demo chain and something you can run in production without paging yourself at midnight.
Keep learning
- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit, a PDF checklist plus starter code
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.