# How to Fix 'OOM error during inference in production' in LangChain (TypeScript)
When you see OOM error during inference in production, it usually means your Node process got killed because memory usage spiked during a model call. In LangChain TypeScript, this typically shows up under load, with long prompts, large retrieved context, or when you accidentally keep too much state in memory.
The key thing: this is usually not a LangChain bug. It’s almost always an application pattern that causes the process to hold onto too much data at once.
## The Most Common Cause
The #1 cause is building huge prompts or chat histories in memory before calling the model. In LangChain TS, this often happens when developers concatenate documents, conversation state, and tool outputs into one giant string.
Here’s the broken pattern next to the fixed one:
| Broken | Fixed |
|---|---|
| Builds one massive prompt string | Truncates, batches, or retrieves only what’s needed |
| Keeps full history in memory | Uses bounded memory or summary memory |
| Sends all docs to the model | Selects top-k chunks only |
```ts
// Broken: unbounded context growth
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage, SystemMessage } from "@langchain/core/messages";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

export async function answer(question: string, docs: string[], history: string[]) {
  const prompt = [
    new SystemMessage("You are a banking assistant."),
    ...history.map((msg) => new HumanMessage(msg)),
    new HumanMessage(
      `Question: ${question}\n\nContext:\n${docs.join("\n\n")}`
    ),
  ];

  const res = await llm.invoke(prompt);
  return res.content;
}
```
```ts
// Fixed: bounded input size
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage, SystemMessage } from "@langchain/core/messages";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

function truncate(text: string, maxChars = 4000) {
  return text.length > maxChars ? text.slice(0, maxChars) : text;
}

export async function answer(question: string, docs: string[], history: string[]) {
  const recentHistory = history.slice(-6); // keep last N turns only
  const topDocs = docs.slice(0, 3).map((d) => truncate(d));

  const prompt = [
    new SystemMessage("You are a banking assistant."),
    ...recentHistory.map((msg) => new HumanMessage(msg)),
    new HumanMessage(`Question: ${question}\n\nContext:\n${topDocs.join("\n\n")}`),
  ];

  const res = await llm.invoke(prompt);
  return res.content;
}
```
If you’re using `BufferMemory`, `ConversationSummaryMemory`, or custom message arrays, check whether they grow without bounds across requests. In production, that becomes a slow memory leak even if each request looks fine in isolation.
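If you manage your own message array, one defensive pattern is to enforce a hard bound every time you append, so the array can never grow across requests. A minimal sketch in plain TypeScript (`Turn`, `boundHistory`, and the limits are hypothetical names and values, not LangChain APIs):

```ts
// Hypothetical sketch: a per-append bound on stored chat history.
// `Turn` is a stand-in for whatever message shape you keep in memory.
type Turn = { role: "human" | "ai"; content: string };

const MAX_TURNS = 12;        // hard cap on the number of stored turns
const MAX_TURN_CHARS = 2000; // hard cap on a single turn's size

function boundHistory(history: Turn[], next: Turn): Turn[] {
  // Clip the incoming turn so one huge message can't dominate the prompt.
  const clipped: Turn = {
    ...next,
    content: next.content.slice(0, MAX_TURN_CHARS),
  };
  // Append, then keep only the most recent MAX_TURNS entries.
  return [...history, clipped].slice(-MAX_TURNS);
}
```

Because the bound is applied on every append rather than at read time, the in-memory state itself stays small, not just the prompt you build from it.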
## Other Possible Causes
### 1. Loading too many documents into retrieval
If you use a `VectorStoreRetriever` with a high `k`, you may be sending far more text than needed.
```ts
const retriever = vectorStore.asRetriever({ k: 20 }); // risky
const retriever = vectorStore.asRetriever({ k: 4 });  // safer
```
Also watch for chunk size. Huge chunks mean fewer retrieval hits but much larger prompt payloads.
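One way to keep `k` and chunk size honest together is to enforce a total character budget when assembling the retrieved context, regardless of how many documents come back. A hedged sketch (`packContext` and the 8000-character budget are illustrative, not a LangChain API):

```ts
// Sketch: cap the total size of context sent to the model.
// Documents past the budget are truncated or dropped entirely.
function packContext(docs: string[], budgetChars = 8000): string {
  const parts: string[] = [];
  let used = 0;
  for (const doc of docs) {
    if (used >= budgetChars) break; // budget exhausted: drop the rest
    const slice = doc.slice(0, budgetChars - used);
    parts.push(slice);
    used += slice.length;
  }
  return parts.join("\n\n");
}
```

With a budget like this, raising `k` or chunk size can change *which* text reaches the model, but not *how much*, which keeps prompt memory flat under load.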
### 2. Returning full tool outputs into the chain
Some tools return large JSON blobs or HTML pages. If you pass that raw output back into the LLM, memory jumps fast.
```ts
// Bad
const toolResult = await myTool.invoke(input);
messages.push(new HumanMessage(JSON.stringify(toolResult)));

// Better
const compactResult = {
  id: toolResult.id,
  status: toolResult.status,
  summary: toolResult.summary,
};
messages.push(new HumanMessage(JSON.stringify(compactResult)));
```
### 3. Streaming buffers not being released
If you buffer every token chunk before sending it to the client, you can spike heap usage on long completions.
```ts
// Risky
let fullText = "";
for await (const chunk of stream) {
  fullText += chunk.content;
}

// Better
for await (const chunk of stream) {
  res.write(chunk.content);
}
```
### 4. Large parallel inference batches

Running too many `Promise.all()` calls against the LLM at once can blow up memory and socket usage.
```ts
// Risky
await Promise.all(questions.map((q) => chain.invoke(q)));

// Better
for (const q of questions) {
  await chain.invoke(q);
}
```
If you need concurrency, cap it with a limiter like `p-limit`.
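If you’d rather not add a dependency, the core of what `p-limit` does fits in a few lines. A sketch (`createLimiter` is a hypothetical helper, and the cap of 4 is illustrative):

```ts
// Minimal concurrency limiter: at most `max` tasks in flight at once.
function createLimiter(max: number) {
  let active = 0;
  const queue: (() => void)[] = [];

  const release = () => {
    active--;
    queue.shift()?.(); // wake the next waiter, if any
  };

  return async function limit<T>(fn: () => Promise<T>): Promise<T> {
    if (active >= max) {
      // Park this call until a running task releases its slot.
      await new Promise<void>((resolve) => queue.push(resolve));
    }
    active++;
    try {
      return await fn();
    } finally {
      release();
    }
  };
}

// Usage sketch: at most 4 chain.invoke() calls in flight at once.
// const limit = createLimiter(4);
// await Promise.all(questions.map((q) => limit(() => chain.invoke(q))));
```

The key point is that only `max` model calls (and their prompt payloads) exist in memory at any moment, while the rest wait in a cheap queue of resolvers.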
## How to Debug It
1. **Check whether the process is actually being killed by memory.** Look for signals like:

   - `FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory` in the Node logs
   - Kubernetes pod restarts with exit code 137
   - AWS ECS tasks stopped with reason `OOMKilled`

   If you see these, it’s runtime memory pressure, not an application exception from LangChain.

2. **Measure prompt size before calling `invoke()`.** Log the number of messages and the approximate character count. If your prompt grows linearly across requests, your state management is broken.

   ```ts
   console.log({
     messageCount: messages.length,
     chars: messages.reduce((sum, m) => sum + String(m.content).length, 0),
   });
   ```

3. **Inspect retriever and tool payloads.** Print the length of each retrieved document and the size of tool outputs before they enter the chain. The usual culprit is one giant document or JSON blob.

4. **Run with heap profiling.** Start Node with:

   ```sh
   node --inspect --max-old-space-size=2048 dist/server.js
   ```

   Then use Chrome DevTools or `clinic heapprofiler` to see which objects stay alive. If arrays of messages or documents keep growing after each request, that’s your leak.
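Between profiling runs, it also helps to log coarse per-request heap numbers using Node’s built-in `process.memoryUsage()`. A minimal sketch (`logHeap` is a hypothetical helper):

```ts
// Log heapUsed in MB; call at the start and end of each request handler.
// A steady upward trend across requests points at a leak.
function logHeap(label: string): number {
  const mb = process.memoryUsage().heapUsed / 1024 / 1024;
  console.log(`[mem] ${label}: ${mb.toFixed(1)} MB heapUsed`);
  return mb;
}
```

This is much coarser than a heap snapshot, but it is cheap enough to leave on in production and makes the "grows on every request" pattern visible in ordinary logs.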
## Prevention

- **Keep chat history bounded.** Store only recent turns, or summarize older turns before reuse.
- **Cap retrieval and output sizes.** Use a small `k`, smaller chunks, and trim tool responses before passing them to the model.
- **Set explicit Node memory limits in production.** For example, `NODE_OPTIONS="--max-old-space-size=2048"`. This won’t fix bad code, but it makes failures predictable and easier to observe.
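The "summarize older turns" idea can be sketched without committing to a specific memory class. Here `summarize` stands in for a single LLM call (for example, `llm.invoke` over the old turns); `compactHistory` and the keep-6 default are hypothetical:

```ts
// Sketch: compact history by summarizing everything except the last N turns.
// `summarize` is a stand-in for one LLM call that condenses old text.
async function compactHistory(
  turns: string[],
  summarize: (text: string) => Promise<string>,
  keepRecent = 6
): Promise<string[]> {
  if (turns.length <= keepRecent) return turns; // nothing to compact yet

  const old = turns.slice(0, -keepRecent);
  const recent = turns.slice(-keepRecent);
  const summary = await summarize(old.join("\n"));

  // One summary line replaces an unbounded prefix of old turns.
  return [`Summary of earlier conversation: ${summary}`, ...recent];
}
```

The result is a history whose length is bounded at `keepRecent + 1` entries no matter how long the conversation runs, at the cost of one extra model call per compaction.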
If you’re seeing 'OOM error during inference in production' in LangChain TypeScript, start with prompt growth first. In real systems, that’s the source of most incidents I’ve debugged.
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.