How to Fix 'token limit exceeded when scaling' in LangChain (TypeScript)
When LangChain says token limit exceeded when scaling, it usually means your prompt grew past the model’s context window as your app processed more messages, documents, or tool outputs. In TypeScript, this often shows up after a few turns of chat, during retrieval-augmented generation, or when you “scale” by feeding more data into the same chain.
The error is not about compute scaling. It’s about input size: too many tokens in the prompt, too much retrieved text, or memory that keeps expanding without trimming.
The Most Common Cause
The #1 cause is unbounded chat history or document stuffing. You keep appending messages into a `ChatPromptTemplate`, `BufferMemory`, or a manual message array until the next `invoke()` call exceeds the model's context window.
Here’s the broken pattern:
```ts
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage, AIMessage } from "@langchain/core/messages";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });

// Every turn gets appended and nothing is ever trimmed.
const messages = [
  new HumanMessage("Summarize this policy."),
  new AIMessage("Sure, here is the summary..."),
  // ...many prior turns
  new HumanMessage(veryLargeTranscript), // huge input stacked on top of the full history
];

const res = await llm.invoke(messages);
```
And here’s the fixed pattern using trimming before invocation:
```ts
import { ChatOpenAI } from "@langchain/openai";
import { trimMessages } from "@langchain/core/messages";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });

// Keep only the most recent messages that fit within the token budget.
const trimmedMessages = await trimMessages(messages, {
  maxTokens: 6000,
  strategy: "last", // drop the oldest messages first
  tokenCounter: llm, // let the model count its own tokens
});

const res = await llm.invoke(trimmedMessages);
```
If you are using memory, the same problem appears with `BufferMemory` and long-running conversations. The fix is to switch to bounded memory patterns like summary memory or windowed history.
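For example, here is a minimal windowed-memory sketch, assuming the classic `langchain/memory` and `langchain/chains` entry points are available in your installed version; the window size `k: 5` is an arbitrary illustration:

```ts
import { ChatOpenAI } from "@langchain/openai";
import { BufferWindowMemory } from "langchain/memory";
import { ConversationChain } from "langchain/chains";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });

// Keep only the last k exchanges instead of the whole transcript.
const memory = new BufferWindowMemory({ k: 5 });

const chain = new ConversationChain({ llm, memory });

const res = await chain.invoke({ input: "What did we decide about the deductible?" });
console.log(res.response);
```

`ConversationSummaryMemory` works the same way but compresses old turns into a running summary instead of dropping them, which preserves more context at the cost of an extra LLM call per turn.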
| Broken | Fixed |
|---|---|
| Append every turn forever | Keep only last N tokens/messages |
| Stuff full transcripts into the prompt | Summarize or chunk before prompting |
| Use BufferMemory with no cap | Use windowed/summarized memory |
Other Possible Causes
1) Retrieval returns too many chunks
Your retriever may be returning 10–20 chunks, each with several hundred tokens. That looks fine in isolation, then explodes once concatenated into a single prompt.
```ts
const docs = await retriever.getRelevantDocuments(query); // too many / too large
```
Fix by limiting results and chunk size:
```ts
const retriever = vectorStore.asRetriever(4); // not 12+
```
Also make sure your splitter produces sane chunks:
```ts
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 800,
  chunkOverlap: 100,
});
```
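If limiting `k` is not enough, you can also cap the total amount of retrieved text before it reaches the prompt. Here is a minimal sketch using a simple character budget; `MAX_CONTEXT_CHARS` and `formatDocs` are illustrative names, not LangChain APIs:

```ts
import type { Document } from "@langchain/core/documents";

const MAX_CONTEXT_CHARS = 8000; // rough budget for retrieved context (illustrative)

// Concatenate chunks until the budget is used up, then stop.
function formatDocs(docs: Document[]): string {
  const parts: string[] = [];
  let used = 0;
  for (const doc of docs) {
    if (used + doc.pageContent.length > MAX_CONTEXT_CHARS) break;
    parts.push(doc.pageContent);
    used += doc.pageContent.length;
  }
  return parts.join("\n\n");
}

const docs = await retriever.invoke(query);
const context = formatDocs(docs); // bounded, no matter how many chunks came back
```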
2) Tool output is being passed back raw
Tool calls can dump large JSON payloads into the next LLM turn. This is common with search tools, database queries, or insurance policy lookup tools.
```ts
// Bad: pass full raw tool response back to the model
const toolResult = await fetchPolicyData(policyId);
messages.push({ role: "tool", content: JSON.stringify(toolResult) });
```
Trim or summarize tool output before re-injecting it:
```ts
const compactResult = {
  policyId: toolResult.policyId,
  status: toolResult.status,
  premium: toolResult.premium,
};

messages.push({
  role: "tool",
  content: JSON.stringify(compactResult),
});
```
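If you cannot hand-pick fields, a blunt fallback is to hard-cap the serialized output. A minimal sketch; the 2,000-character limit is an arbitrary budget you should tune:

```ts
// Serialize tool output and truncate it before it goes back into the prompt.
function compactToolOutput(result: unknown, maxChars = 2000): string {
  const serialized = JSON.stringify(result);
  return serialized.length > maxChars
    ? `${serialized.slice(0, maxChars)} ...[truncated]`
    : serialized;
}

messages.push({ role: "tool", content: compactToolOutput(toolResult) });
```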
3) You are using a model with a smaller context window than you think
A chain that works on one model can fail on another. Switching from a larger-context model to a smaller one often surfaces this immediately.
```ts
const llm = new ChatOpenAI({
  model: "gpt-4o-mini", // smaller context than some larger models
});
```
If your prompt is already near the limit, reduce input size or move to a model with a larger context window.
4) Your chain is duplicating context
This happens when you manually add conversation history and also let LangChain inject memory automatically. The result is repeated messages and token bloat.
```ts
// Bad: history added twice (once here, once by the attached memory)
await chain.invoke({
  input,
  chat_history,
});
```
Check whether your prompt template already includes `{chat_history}` and whether your memory object also injects it. Use one source of truth.
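One way to enforce a single source of truth is to declare the history slot once in the prompt and pass the messages in yourself, with no memory object attached. A sketch assuming you maintain `chatHistory` as a plain array of messages:

```ts
import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate, MessagesPlaceholder } from "@langchain/core/prompts";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });

// History appears exactly once, via this placeholder.
const prompt = ChatPromptTemplate.fromMessages([
  ["system", "You are a helpful assistant."],
  new MessagesPlaceholder("chat_history"),
  ["human", "{input}"],
]);

const chain = prompt.pipe(llm);

// Supply the history explicitly on every call; nothing else injects it.
const res = await chain.invoke({ input, chat_history: chatHistory });
```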
How to Debug It
- Print token counts before every LLM call
  - Use `llm.getNumTokensFromMessages(messages)` for chat prompts.
  - Log the count right before `.invoke()` so you know which step crosses the line (see the sketch after this list).
- Inspect what was added last
  - Compare message length after each turn.
  - If it spikes after retrieval, tools, or memory injection, that's your source.
- Disable components one by one
  - Remove retriever context first.
  - Then disable memory.
  - Then remove tool outputs.
  - The component that makes the error disappear is your culprit.
- Log raw prompt size
  - Print message roles and approximate character counts.
  - Large JSON blobs and repeated transcripts are easy to spot once you look at them directly.
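Here is a minimal pre-flight logging sketch, assuming your chat model implements `getNumTokensFromMessages` (in recent versions it returns a total count plus per-message counts):

```ts
// Log token usage right before the call so you can see which step crosses the line.
const { totalCount, countPerMessage } = await llm.getNumTokensFromMessages(messages);
console.log(`[debug] ${messages.length} messages, ~${totalCount} tokens`, countPerMessage);

const res = await llm.invoke(messages);
```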
Prevention
- Cap everything:
  - Limit retrieved docs.
  - Limit chat history.
  - Limit tool output size.
- Prefer summarization over accumulation:
  - Summarize old turns.
  - Summarize long documents before injecting them into prompts.
- Build token checks into your chain (see the sketch after this list):
  - Reject oversized prompts early.
  - Trim before calling the model instead of waiting for runtime failures.
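A minimal guard sketch that combines the last two points, reusing `llm`, `messages`, and `trimMessages` from the earlier examples; `MAX_PROMPT_TOKENS` is an illustrative budget below your model's context window:

```ts
import { trimMessages } from "@langchain/core/messages";

const MAX_PROMPT_TOKENS = 6000; // illustrative budget, tune per model

// Trim first, then fail fast if even the trimmed prompt is still too large.
const bounded = await trimMessages(messages, {
  maxTokens: MAX_PROMPT_TOKENS,
  strategy: "last",
  tokenCounter: llm,
});

const { totalCount } = await llm.getNumTokensFromMessages(bounded);
if (totalCount > MAX_PROMPT_TOKENS) {
  throw new Error(`Prompt is still ${totalCount} tokens; refusing to call the model.`);
}

const res = await llm.invoke(bounded);
```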
If you want one rule to keep in mind: never let prompt growth be unbounded. In LangChain TypeScript, “scaling” usually means your data got bigger than your context window long before your infrastructure did.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.