How to Fix 'OOM error during inference in production' in LangChain (TypeScript)

By Cyprian Aarons · Updated 2026-04-21

When you see OOM error during inference in production, it usually means your Node process got killed because memory usage spiked during a model call. In LangChain TypeScript, this typically shows up under load, with long prompts, large retrieved context, or when you accidentally keep too much state in memory.

The key thing: this is usually not a LangChain bug. It’s almost always an application pattern that causes the process to hold onto too much data at once.

The Most Common Cause

The #1 cause is building huge prompts or chat histories in memory before calling the model. In LangChain TS, this often happens when developers concatenate documents, conversation state, and tool outputs into one giant string.

Here’s the broken pattern next to the fixed one:

  • Broken: builds one massive prompt string. Fixed: truncate, batch, or retrieve only what’s needed.
  • Broken: keeps full history in memory. Fixed: use bounded memory or summary memory.
  • Broken: sends all docs to the model. Fixed: select only the top-k chunks.
// Broken: unbounded context growth
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage, SystemMessage } from "@langchain/core/messages";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

export async function answer(question: string, docs: string[], history: string[]) {
  const prompt = [
    new SystemMessage("You are a banking assistant."),
    ...history.map((msg) => new HumanMessage(msg)),
    new HumanMessage(
      `Question: ${question}\n\nContext:\n${docs.join("\n\n")}`
    ),
  ];

  const res = await llm.invoke(prompt);
  return res.content;
}

// Fixed: bounded input size
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage, SystemMessage } from "@langchain/core/messages";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

function truncate(text: string, maxChars = 4000) {
  return text.length > maxChars ? text.slice(0, maxChars) : text;
}

export async function answer(question: string, docs: string[], history: string[]) {
  const recentHistory = history.slice(-6); // keep last N turns only
  const topDocs = docs.slice(0, 3).map((d) => truncate(d));

  const prompt = [
    new SystemMessage("You are a banking assistant."),
    ...recentHistory.map((msg) => new HumanMessage(msg)),
    new HumanMessage(`Question: ${question}\n\nContext:\n${topDocs.join("\n\n")}`),
  ];

  const res = await llm.invoke(prompt);
  return res.content;
}

If you’re using BufferMemory, ConversationSummaryMemory, or custom message arrays, check whether they grow without bounds across requests. In production, that becomes a slow memory leak even if each request looks fine in isolation.
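
If you roll your own per-session history, the simplest guard is to cap the array every time you append. Here is a minimal sketch, assuming an in-memory map keyed by session ID (the store, cap, and helper name are illustrative, not a LangChain API):

import { BaseMessage } from "@langchain/core/messages";

const MAX_TURNS = 12; // illustrative cap; tune it to your token budget

// Illustrative in-memory store; swap in Redis or your own session storage.
const sessions = new Map<string, BaseMessage[]>();

export function appendToHistory(sessionId: string, message: BaseMessage) {
  const history = sessions.get(sessionId) ?? [];
  history.push(message);
  // Keep only the most recent turns so the array can never grow unbounded.
  sessions.set(sessionId, history.slice(-MAX_TURNS));
}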

Other Possible Causes

1. Loading too many documents into retrieval

If you use VectorStoreRetriever with a high k, you may be sending far more text than needed.

const retriever = vectorStore.asRetriever({ k: 20 }); // risky
const retriever = vectorStore.asRetriever({ k: 4 });  // safer

Also watch chunk size. Huge chunks mean each retrieved hit carries far more text, so even a small k can produce a very large prompt payload.
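
If you control indexing, splitting documents into smaller chunks keeps each hit cheap to include. A rough sketch, assuming a recent LangChain version where the splitter lives in @langchain/textsplitters (older versions export it from langchain/text_splitter); the sizes are illustrative:

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 800,   // characters per chunk; tune for your documents
  chunkOverlap: 80, // small overlap preserves context across chunk boundaries
});

export async function splitForIndexing(text: string) {
  // Returns Document objects ready for vectorStore.addDocuments().
  return splitter.createDocuments([text]);
}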

2. Returning full tool outputs into the chain

Some tools return large JSON blobs or HTML pages. If you pass that raw output back into the LLM, memory jumps fast.

// Bad
const toolResult = await myTool.invoke(input);
messages.push(new HumanMessage(JSON.stringify(toolResult)));

// Better
const compactResult = {
  id: toolResult.id,
  status: toolResult.status,
  summary: toolResult.summary,
};
messages.push(new HumanMessage(JSON.stringify(compactResult)));

3. Streaming buffers not being released

If you buffer every token chunk before sending it to the client, you can spike heap usage on long completions.

// Risky
let fullText = "";
for await (const chunk of stream) {
  fullText += chunk.content;
}

// Better
for await (const chunk of stream) {
  res.write(chunk.content);
}

4. Large parallel inference batches

Running too many Promise.all() calls against the LLM at once can blow up memory and socket usage.

// Risky
await Promise.all(questions.map((q) => chain.invoke(q)));

// Better
for (const q of questions) {
  await chain.invoke(q);
}

If you need concurrency, cap it with a queue like p-limit.
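
A minimal sketch with p-limit, assuming your chain is a standard Runnable; the limit of 4 is illustrative and should be tuned to your infrastructure:

import pLimit from "p-limit";
import type { Runnable } from "@langchain/core/runnables";

// At most 4 model calls in flight at once; the rest wait in the queue.
const limit = pLimit(4);

export async function answerAll(chain: Runnable<string, unknown>, questions: string[]) {
  return Promise.all(questions.map((q) => limit(() => chain.invoke(q))));
}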

How to Debug It

  1. Check whether the process is actually being killed by memory

    • Look for logs like:
      • FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
      • Kubernetes pod restarts with exit code 137
      • AWS ECS task stopped due to OOMKilled
    • If you see these, it’s runtime memory pressure, not an application exception from LangChain.
  2. Measure prompt size before calling invoke()

    • Log the number of messages and approximate character count.
    • If your prompt grows linearly across requests, your state management is broken.
console.log({
  messageCount: messages.length,
  chars: messages.reduce((sum, m) => sum + String(m.content).length, 0),
});
  3. Inspect retriever and tool payloads (see the sketch after this list)

    • Print the length of retrieved documents.
    • Print the size of tool outputs before they enter the chain.
    • The usual culprit is one giant document or JSON blob.
  4. Run with heap profiling

    • Start Node with:
      node --inspect --max-old-space-size=2048 dist/server.js
      
    • Use Chrome DevTools or Clinic.js heapprofiler to see which objects stay alive.
    • If arrays of messages or documents keep growing after each request, that’s your leak.
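
For step 3, a tiny logging helper is usually enough. A sketch, assuming your retriever returns standard Document objects (the helper name and logged fields are illustrative):

import type { Document } from "@langchain/core/documents";

// Logs the size of everything about to enter the chain.
export function logPayloadSizes(docs: Document[], toolOutput: unknown) {
  console.log({
    docCount: docs.length,
    docChars: docs.reduce((sum, d) => sum + d.pageContent.length, 0),
    toolChars: JSON.stringify(toolOutput).length,
  });
}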

Prevention

  • Keep chat history bounded.
    • Store only recent turns, or summarize older turns before reuse (see the sketch after this list).
  • Cap retrieval and output sizes.
    • Use small k, smaller chunks, and trim tool responses before passing them to the model.
  • Set explicit Node memory limits in production.
    • Example:
      NODE_OPTIONS="--max-old-space-size=2048"
      
    • This won’t fix bad code, but it makes failures predictable and easier to observe.
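
If you want summarization rather than plain truncation, the classic memory module can do it for you. A sketch, assuming your LangChain version still ships ConversationSummaryMemory in langchain/memory (newer releases may steer you toward LangGraph-based persistence instead):

import { ConversationSummaryMemory } from "langchain/memory";
import { ChatOpenAI } from "@langchain/openai";

const memory = new ConversationSummaryMemory({
  llm: new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 }),
  memoryKey: "chat_history",
});

export async function recordTurn(userInput: string, modelOutput: string) {
  // Older turns are folded into a rolling summary instead of piling up verbatim.
  await memory.saveContext({ input: userInput }, { output: modelOutput });
}

export async function getHistory() {
  // Returns { chat_history: "<summary so far>" } to prepend to the next prompt.
  return memory.loadMemoryVariables({});
}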

If you’re seeing OOM error during inference in production in LangChain TypeScript, start by checking prompt growth. In real systems, that’s the source of most incidents I’ve debugged.


Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
