How to Fix 'token limit exceeded' in LangChain (TypeScript)
What the error means
A "token limit exceeded" error usually means the prompt you sent to the model is larger than the model's context window. In LangChain TypeScript, it shows up when you stuff too much chat history, retrieved context, or raw documents into a single invoke() call.
The failure often appears inside ChatOpenAI, RunnableSequence, or an agent loop, once one more message pushes the request over the limit. The exact message varies by provider, but the root problem is the same: your input tokens plus requested output tokens exceed the model's maximum context length.
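As a rough sketch of that budget: whatever you reserve for the completion comes out of the same window as your prompt. The 128,000-token figure for gpt-4o-mini below is an assumption; check your provider's documentation for the real limit.
import { ChatOpenAI } from "@langchain/openai";

// Assumed context window for the model below; verify against provider docs.
const CONTEXT_WINDOW = 128_000;
const MAX_OUTPUT_TOKENS = 1_000;

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  maxTokens: MAX_OUTPUT_TOKENS, // tokens reserved for the model's answer
});

// Everything you send (system prompt, history, docs, question) must fit here:
const inputBudget = CONTEXT_WINDOW - MAX_OUTPUT_TOKENS; // 127,000 tokens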
The Most Common Cause
The #1 cause is unbounded prompt growth, usually from chat history or document stuffing.
A common broken pattern is appending every prior message and every retrieved chunk into one prompt without trimming.
| Broken | Fixed |
|---|---|
| Keeps growing forever | Trims history and limits context |
| Sends full documents | Sends only top-k relevant chunks |
| Fails once conversation gets long | Stays under model context window |
import { ChatOpenAI } from "@langchain/openai";
import { RunnableSequence } from "@langchain/core/runnables";
import { StringOutputParser } from "@langchain/core/output_parsers";
const llm = new ChatOpenAI({
model: "gpt-4o-mini",
temperature: 0,
});
const chain = RunnableSequence.from([
async (input: { history: string[]; question: string; docs: string[] }) => {
// WRONG: unbounded growth
return `
Chat history:
${input.history.join("\n")}
Documents:
${input.docs.join("\n\n")}
Question:
${input.question}
`;
},
llm,
new StringOutputParser(),
]);
// This eventually triggers something like:
// BadRequestError: 400 This model's maximum context length is ...
await chain.invoke({
history: hugeHistory,
question: "What is our refund policy?",
docs: allRetrievedDocs,
});
The fixed version trims history and caps retrieved documents before the prompt is assembled:
import { ChatOpenAI } from "@langchain/openai";
import { RunnableSequence } from "@langchain/core/runnables";
import { StringOutputParser } from "@langchain/core/output_parsers";
const llm = new ChatOpenAI({
model: "gpt-4o-mini",
temperature: 0,
});
// Keep only the most recent turns so history cannot grow without bound.
function trimHistory(history: string[], maxMessages = 8) {
return history.slice(-maxMessages);
}
// Cap how many retrieved chunks reach the prompt.
function trimDocs(docs: string[], maxDocs = 3) {
return docs.slice(0, maxDocs);
}
const chain = RunnableSequence.from([
async (input: { history: string[]; question: string; docs: string[] }) => {
const safeHistory = trimHistory(input.history);
const safeDocs = trimDocs(input.docs);
return `
Chat history:
${safeHistory.join("\n")}
Documents:
${safeDocs.join("\n\n")}
Question:
${input.question}
`;
},
llm,
new StringOutputParser(),
]);
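The call site does not change; the trimming happens inside the chain:
const answer = await chain.invoke({
  history: hugeHistory,
  question: "What is our refund policy?",
  docs: allRetrievedDocs,
});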
If you are using agents, memory, or retrieval QA, this is usually where the bug lives.
Other Possible Causes
1. Your retriever returns too many chunks
If k is too high, your prompt gets bloated fast.
const retriever = vectorStore.asRetriever(10); // risky
Use a smaller value first:
const retriever = vectorStore.asRetriever(3);
If you need more recall, rerank results instead of dumping all chunks into the prompt.
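A hedged sketch of that idea, using a hypothetical rerankByRelevance helper as a stand-in for a cross-encoder or a provider reranking API (vectorStore and query are assumed from the snippet above):
import type { Document } from "@langchain/core/documents";

// Hypothetical helper: swap in a real cross-encoder or reranking service.
declare function rerankByRelevance(
  query: string,
  docs: Document[]
): Promise<Document[]>;

// Cast a wide net at retrieval time...
const candidates = await vectorStore.asRetriever(10).invoke(query);

// ...but only the top few reranked chunks go into the prompt.
const reranked = await rerankByRelevance(query, candidates);
const topChunks = reranked.slice(0, 3);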
2. Your chunk size is too large
Large chunks mean each retrieved document consumes more tokens.
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 4000, // too big for many RAG setups
chunkOverlap: 200,
});
A safer starting point:
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 800,
chunkOverlap: 100,
});
3. You are using the wrong model for the context window
Some models have small context windows. If you send long inputs to a smaller model, it will fail even if your code is correct.
const llm = new ChatOpenAI({
model: "gpt-3.5-turbo", // may be too small for your payload
});
Switch to a larger-context model if your use case needs it:
const llm = new ChatOpenAI({
model: "gpt-4o-mini",
});
Check the provider’s token limits before assuming LangChain is at fault.
4. You are mixing system prompts, tools, and memory without budgeting tokens
Agents can explode in size because tool schemas, scratchpads, and prior turns all get injected automatically.
// Example of hidden prompt growth in agent loops
const agent = await createOpenAIFunctionsAgent({
llm,
tools,
prompt,
});
If your tool descriptions are long or your agent keeps intermediate steps, trim them aggressively and cap iterations.
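A hedged sketch of capping the loop with AgentExecutor's maxIterations option (the value 5 is an arbitrary starting point; agent and tools come from the snippet above):
import { AgentExecutor } from "langchain/agents";

const executor = new AgentExecutor({
  agent,
  tools,
  maxIterations: 5, // stop runaway loops before the scratchpad eats the token budget
  returnIntermediateSteps: false, // don't carry the full scratchpad in the result
});

const result = await executor.invoke({ input: "What is our refund policy?" });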
How to Debug It
- Print the final prompt before calling the model. Log the exact text or messages passed into invoke(). If you cannot see what LangChain sends, you are debugging blind.
- Count tokens on the assembled input. Use a tokenizer utility or provider-side token estimation before execution, and compare the input token count against the target model's max context window (see the sketch after this list).
- Remove components one at a time. Test with:
  - no chat history
  - no retrieved docs
  - no tools
  - no memory
  Add them back until the error returns.
- Check LangChain stack traces for where expansion happens. Look for RunnableSequence, AgentExecutor, BufferMemory, ConversationalRetrievalQAChain, or your custom formatter. The bug is usually in prompt assembly, not in the LLM call itself.
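For the token-count step above, a minimal sketch using the js-tiktoken package (an assumption: it is a separate dependency, not part of LangChain, and the prompt string stands in for whatever your chain actually renders):
import { getEncoding } from "js-tiktoken";

// cl100k_base matches the gpt-3.5/gpt-4 family; newer models may use a different encoding.
const enc = getEncoding("cl100k_base");

function countTokens(text: string): number {
  return enc.encode(text).length;
}

// Apply it to the final rendered prompt from your chain.
const prompt = "Chat history:\n...";
console.log(`Prompt is ${countTokens(prompt)} tokens`);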
Prevention
- Budget tokens explicitly. Set hard limits for history length, retrieved documents, and output size before building prompts.
- Prefer trimmed memory over full transcript replay. Use recent-message windows or summary memory instead of appending every turn forever (see the sketch after this list).
- Keep retrieval narrow. Start with a low k, smaller chunks, and reranking before increasing context size.
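A sketch of the recent-message-window idea using trimMessages, assuming a recent @langchain/core version that exports it; the character-based token counter is a rough stand-in for a real tokenizer:
import {
  AIMessage,
  HumanMessage,
  SystemMessage,
  trimMessages,
} from "@langchain/core/messages";

const history = [
  new SystemMessage("You are a support assistant."),
  new HumanMessage("What is our refund policy?"),
  new AIMessage("Refunds are available within 30 days."),
  // ...many more turns
];

const trimmed = await trimMessages(history, {
  maxTokens: 1000,
  strategy: "last",    // keep the most recent messages
  includeSystem: true, // but never drop the system prompt
  tokenCounter: (msgs) =>
    msgs.reduce((n, m) => n + Math.ceil(String(m.content).length / 4), 0),
});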
If you are still seeing BadRequestError: This model's maximum context length has been exceeded, inspect the final rendered prompt first. In LangChain TypeScript, that usually tells you exactly which part of your pipeline is growing out of control.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.