How to Fix 'OOM error during inference when scaling' in LangChain (TypeScript)

By Cyprian Aarons · Updated 2026-04-21

OOM errors during inference mean your process is running out of memory while LangChain is building prompts, holding intermediate results, or running model calls in parallel. In TypeScript projects, this usually shows up when you scale from one request to many, or when a chain that worked locally starts failing under load.

The error often looks like this:

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory

Or, in containerized deployments:

Killed

The Most Common Cause

The #1 cause is unbounded concurrency. In LangChain TypeScript, this usually happens when you call Promise.all() over a large array of inputs, or when a chain internally fans out too many requests at once.

This is especially common with RunnableSequence, RunnableMap, map(), and bulk .invoke() patterns.

Broken vs fixed

Broken pattern                          Fixed pattern
Fires too many model calls at once      Limits concurrency
Holds all outputs in memory             Streams or batches results
Easy to write, hard to scale            Slightly more code, stable under load

// ❌ Broken: unbounded parallelism
import { ChatOpenAI } from "@langchain/openai";
import { PromptTemplate } from "@langchain/core/prompts";

const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });

const prompt = PromptTemplate.fromTemplate(
  "Summarize this claim note:\n\n{note}"
);

const notes = hugeArrayOfNotes; // thousands of items

const summaries = await Promise.all(
  notes.map(async (note) => {
    const msg = await prompt.pipe(llm).invoke({ note });
    return msg.content;
  })
);

// ✅ Fixed: bounded concurrency with batching
import pLimit from "p-limit";
import { ChatOpenAI } from "@langchain/openai";
import { PromptTemplate } from "@langchain/core/prompts";

const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });

const prompt = PromptTemplate.fromTemplate(
  "Summarize this claim note:\n\n{note}"
);

const limit = pLimit(5); // keep memory and API pressure under control

const summaries = await Promise.all(
  hugeArrayOfNotes.map((note) =>
    limit(async () => {
      const msg = await prompt.pipe(llm).invoke({ note });
      return msg.content;
    })
  )
);

If you’re using LangChain’s batch APIs, keep the batch size small:

const chain = prompt.pipe(llm);
const results = await chain.batch(inputs, { maxConcurrency: 5 });
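
Note that both Promise.all and batch() still hold every result in memory until the whole run finishes. If the outputs themselves are large, process the inputs in fixed-size chunks and hand each chunk off before starting the next. A minimal sketch (writeSummaries is a hypothetical sink, standing in for whatever storage or queue you use):

const CHUNK_SIZE = 50;

for (let i = 0; i < hugeArrayOfNotes.length; i += CHUNK_SIZE) {
  const chunk = hugeArrayOfNotes.slice(i, i + CHUNK_SIZE);

  // maxConcurrency bounds the in-flight model calls within the chunk
  const results = await chain.batch(
    chunk.map((note) => ({ note })),
    { maxConcurrency: 5 }
  );

  // Persist or forward results now instead of accumulating them all
  await writeSummaries(results.map((msg) => msg.content));
}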

Other Possible Causes

1. Huge prompts or documents loaded into memory

If you pass entire PDFs, transcripts, or chat histories into a chain, LangChain will keep those strings in memory while formatting prompts and waiting on the model.

// ❌ Bad: entire document stuffed into one prompt
await chain.invoke({
  context: fullPolicyDocumentText,
  question: "What is the deductible?"
});

// ✅ Better: chunk first, then retrieve only relevant pieces
const docs = await splitter.splitText(fullPolicyDocumentText);
const topChunks = docs.slice(0, 4); // stand-in for real retrieval of the most relevant chunks

await chain.invoke({
  context: topChunks.join("\n\n"),
  question: "What is the deductible?"
});

If you’re using retrieval chains, make sure your retriever isn’t returning too many documents.
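
For example, a vector-store retriever lets you cap how many documents it returns per query. A minimal sketch, assuming you already have a vectorStore instance built:

// Only hand the model the 4 most relevant chunks per query
const retriever = vectorStore.asRetriever({ k: 4 });

const docs = await retriever.getRelevantDocuments("What is the deductible?");
console.log(docs.length); // 4, not hundreds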


2. Chat history growing without bound

The usual culprit here is BufferMemory, or any custom conversation state that appends every turn forever.

// ❌ Bad: unbounded chat history
import { BufferMemory } from "langchain/memory";

const memory = new BufferMemory({
  returnMessages: true,
});

// ✅ Better: cap history size
import { BufferWindowMemory } from "langchain/memory";

const memory = new BufferWindowMemory({
  k: 6, // last 6 messages only
  returnMessages: true,
});

If you’re on newer LangChain patterns, use explicit message trimming before each invoke.
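
A minimal sketch of trimming before each call, assuming a chain whose prompt takes chat_history and question variables (if your @langchain/core version ships trimMessages, you can swap in token-based trimming instead):

import { HumanMessage, AIMessage, type BaseMessage } from "@langchain/core/messages";

// Keep only the most recent turns; older messages are dropped before each invoke
function trimHistory(history: BaseMessage[], maxMessages = 6): BaseMessage[] {
  return history.slice(-maxMessages);
}

const history: BaseMessage[] = [
  new HumanMessage("Hi, I'd like to check my claim."),
  new AIMessage("Sure - which policy is it for?"),
  // ...many more turns appended over the conversation...
];

await chain.invoke({
  chat_history: trimHistory(history),
  question: "What is the deductible?",
});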


3. Large tool outputs returned to the model

Agents can blow up memory when tools return massive payloads like raw search results, full database records, or HTML pages.

// ❌ Bad: tool returns everything
async function getCustomerRecords() {
  return await db.customer.findMany(); // huge array
}

// ✅ Better: return only what the agent needs
async function getCustomerRecords() {
  const rows = await db.customer.findMany({
    select: { id: true, status: true, updatedAt: true },
    take: 20,
  });

  return rows;
}

For agent tools, keep outputs short and structured. If a tool returns more than a few KB repeatedly, expect memory pressure.
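
One simple guard is to cap the serialized size of whatever a tool hands back to the model. A sketch (the 4000-character limit and the truncation marker are assumptions to tune for your agent):

const MAX_TOOL_CHARS = 4000;

// Serialize and truncate a tool result before it goes back to the model
function capToolOutput(result: unknown): string {
  const text = typeof result === "string" ? result : JSON.stringify(result);
  return text.length > MAX_TOOL_CHARS
    ? text.slice(0, MAX_TOOL_CHARS) + " [truncated]"
    : text;
}

async function getCustomerRecords(): Promise<string> {
  const rows = await db.customer.findMany({
    select: { id: true, status: true, updatedAt: true },
    take: 20,
  });
  return capToolOutput(rows);
}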


4. Node heap too small for your workload

Sometimes the code is fine and the runtime simply has no headroom. This shows up as:

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

If you’re running large batches locally or in CI:

node --max-old-space-size=4096 dist/index.js

For Docker:

ENV NODE_OPTIONS="--max-old-space-size=4096"

This does not fix bad concurrency or giant prompts. It just gives Node more room before crashing.

How to Debug It

  1. Check whether the crash happens during fan-out

    • Look for Promise.all, .batch(), .map() over large arrays, or agent loops.
    • If memory spikes linearly with input count, concurrency is your problem (a heap-usage logging sketch follows this list).
  2. Log prompt sizes before invocation

    console.log("prompt chars:", JSON.stringify(input).length);
    

    If one request is enormous, inspect document loading, chat history, and tool output.

  3. Disable parallelism temporarily

    • Set maxConcurrency to 1.
    • Process one item at a time.
    • If OOM disappears, the issue is load amplification rather than a single bad object.
  4. Inspect what your tools and retrievers return

    • Print result counts and payload sizes.
    • For retrievers:
      const docs = await retriever.getRelevantDocuments(query);
      console.log(docs.length, docs.map(d => d.pageContent.length));
      
    • For tools:
      console.log(JSON.stringify(toolResult).length);
      
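For step 1, a quick way to see whether memory climbs with fan-out is to log heap usage while the batch runs. A minimal sketch using Node's built-in process.memoryUsage() (the one-second interval is arbitrary):

// Log heap usage once per second while the batch runs
const timer = setInterval(() => {
  const { heapUsed, heapTotal } = process.memoryUsage();
  console.log(
    `heap: ${Math.round(heapUsed / 1024 / 1024)}MB / ${Math.round(heapTotal / 1024 / 1024)}MB`
  );
}, 1000);

try {
  await chain.batch(inputs, { maxConcurrency: 5 });
} finally {
  clearInterval(timer);
}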

Prevention

  • Use bounded concurrency everywhere you call models in bulk.
  • Trim chat history and retrieved documents before they hit the prompt.
  • Keep tool outputs small; return IDs and summaries instead of raw datasets.
  • Set sane Node heap limits in dev and production so failures are visible early.

If you’re seeing OOM errors during inference when scaling LangChain in TypeScript, start with concurrency. In real systems, that’s the cause most of the time.


By Cyprian Aarons, AI Consultant at Topiax.
