How to Fix 'OOM error during inference during development' in LangChain (Python)
When you hit an OOM (out-of-memory) error during inference in a LangChain Python app, it usually means your process ran out of memory while loading the model, building prompts, or keeping too much intermediate state in RAM. In practice, this shows up during local development when you test with a large model, a long chat history, or an agent that keeps chaining calls without clearing state.
The fix is usually not “buy more RAM.” It’s almost always one of a few patterns: loading too much into memory, using the wrong model size, or letting LangChain retain more context than you intended.
The Most Common Cause
The #1 cause is loading a model or embedding pipeline into memory repeatedly inside a request path, loop, or chain execution. In LangChain apps, this often happens when people instantiate `ChatOpenAI`, `HuggingFacePipeline`, or an embeddings object inside a function that gets called many times.
Here’s the broken pattern:
```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

def answer_question(question: str) -> str:
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # recreated every call
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a helpful assistant."),
        ("human", "{question}"),
    ])
    chain = prompt | llm
    return chain.invoke({"question": question}).content
```
And here’s the fixed pattern:
```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("human", "{question}"),
])
chain = prompt | llm

def answer_question(question: str) -> str:
    return chain.invoke({"question": question}).content
```
The difference looks small, but it matters. If your app is creating new clients, new tokenizers, new vector stores, or new local models on every request, memory usage climbs fast and you’ll eventually hit an OOM.
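If building objects at import time is awkward (for example, you want tests to skip model loading), a cached factory gives the same single-instance guarantee. A minimal sketch, using a stand-in class instead of a real LLM client:

```python
from functools import lru_cache

class ExpensiveClient:
    """Stand-in for an LLM client, tokenizer, or vector store."""
    instances = 0

    def __init__(self):
        # In a real app this is where memory gets allocated.
        ExpensiveClient.instances += 1

@lru_cache(maxsize=1)
def get_client() -> ExpensiveClient:
    # Built on the first call, then reused forever after.
    return ExpensiveClient()

def answer_question(question: str) -> str:
    client = get_client()  # same object every call, nothing re-allocated
    return f"answered with client #{id(client)}"
```

`lru_cache(maxsize=1)` makes the factory body run exactly once per process, so the client is lazy but still long-lived.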
This is especially common with local models through HuggingFacePipeline or transformers. If you see errors like:
- `RuntimeError: CUDA out of memory`
- `torch.OutOfMemoryError: CUDA out of memory`
- `MemoryError`
- `Killed` on Linux, after the OS terminates the process
then your inference path is holding too much in memory.
Other Possible Causes
| Cause | What it looks like | Fix |
|---|---|---|
| Huge prompt / chat history | Context length exceeded followed by memory pressure | Trim messages and summarize history |
| Large local model | CUDA out of memory from PyTorch / Transformers | Use a smaller model or quantization |
| Too many concurrent requests | Memory spikes under load | Limit concurrency and batch carefully |
| Vector store loaded in RAM | App slows then OOMs on startup | Move to persistent storage or lazy load |
1. Unbounded chat history in ConversationBufferMemory
If you use old-style memory classes, this can grow forever:
```python
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(return_messages=True)
```
Use a bounded alternative:
```python
from langchain.memory import ConversationSummaryBufferMemory

memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=2000,
    return_messages=True,
)
```
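Under the hood, a bounded buffer keeps the message list within a token budget. The trimming half of that idea can be sketched in plain Python; `trim_history` is a hypothetical helper, and the 4-characters-per-token estimate is a crude stand-in for a real tokenizer:

```python
def trim_history(messages: list[str], max_tokens: int = 2000) -> list[str]:
    """Keep only the most recent messages that fit the token budget.

    Uses a rough 4-characters-per-token estimate; a real implementation
    would count tokens with the model's tokenizer.
    """
    kept: list[str] = []
    budget = max_tokens
    for msg in reversed(messages):       # walk newest-first
        cost = max(1, len(msg) // 4)
        if cost > budget:
            break                        # older messages no longer fit
        kept.append(msg)
        budget -= cost
    return list(reversed(kept))          # restore chronological order
```

The point is that the history has a hard upper bound no matter how long the conversation runs, which is exactly what the unbounded buffer lacks.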
2. Loading the full vector store into process memory
This pattern hurts when your local dev dataset is large:
```python
from langchain_community.vectorstores import FAISS

# bad: loads everything eagerly into process memory
vectorstore = FAISS.load_local("index_dir", embeddings)
```
Prefer lazy loading where possible and keep the index on disk. If you must load locally, make sure you’re not rebuilding embeddings on every run.
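A simple guard against re-embedding on every run is to check for a persisted index before building one. A sketch of the pattern; `build_and_save` and `load` are placeholders for calls like `FAISS.from_documents(...).save_local(...)` and `FAISS.load_local(...)`:

```python
import os

def load_or_build_index(index_dir, build_and_save, load):
    """Reuse a persisted index instead of re-embedding on every run.

    build_and_save(index_dir) creates the index and writes it to disk;
    load(index_dir) reads it back. Both are illustrative placeholders.
    """
    if os.path.isdir(index_dir) and os.listdir(index_dir):
        return load(index_dir)           # cheap path: index already on disk
    return build_and_save(index_dir)     # expensive path: runs only once
```

The first run pays the embedding cost; every restart after that only pays the load cost.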
3. Using a model that is too large for your machine
A 7B+ model can easily blow up RAM or VRAM in development.
```python
from langchain_huggingface import HuggingFacePipeline

# bad for small dev machines
llm = HuggingFacePipeline.from_model_id(
    model_id="meta-llama/Llama-3.1-8B-Instruct",
    task="text-generation",
)
```
Use a smaller model first:
```python
llm = HuggingFacePipeline.from_model_id(
    model_id="google/flan-t5-base",
    task="text2text-generation",
)
```
If you need the larger model locally, use quantization or offload settings from transformers.
4. Too much concurrency in async chains or agents
A common mistake is firing off many requests at once without limits:
```python
# bad: unbounded fan-out
results = await asyncio.gather(*tasks)
```
Add backpressure:
```python
import asyncio

sem = asyncio.Semaphore(2)

async def limited_task(task):
    async with sem:
        return await task

results = await asyncio.gather(*(limited_task(t) for t in tasks))
```
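To see the semaphore doing its job, here is a self-contained version with a fake inference coroutine that records how many calls are in flight at once (all names here are illustrative):

```python
import asyncio

async def fake_inference(i: int, tracker: dict) -> int:
    # Stand-in for an LLM call; tracks concurrent executions.
    tracker["current"] += 1
    tracker["peak"] = max(tracker["peak"], tracker["current"])
    await asyncio.sleep(0.01)
    tracker["current"] -= 1
    return i

async def run_all(n: int, limit: int) -> tuple[list[int], int]:
    tracker = {"current": 0, "peak": 0}
    sem = asyncio.Semaphore(limit)

    async def limited(i: int) -> int:
        async with sem:                  # backpressure: at most `limit` in flight
            return await fake_inference(i, tracker)

    results = await asyncio.gather(*(limited(i) for i in range(n)))
    return results, tracker["peak"]
```

Running `asyncio.run(run_all(10, limit=2))` fires ten tasks but never lets more than two execute simultaneously, which is what keeps the per-request memory bounded under load.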
How to Debug It
- Check whether the OOM happens on startup or on first request. If it fails immediately after importing or initializing LangChain objects, the issue is likely model loading or vector store initialization.
- Print memory usage around each step. Use `psutil` to find the spike point:

  ```python
  import os
  import psutil

  p = psutil.Process(os.getpid())
  print(f"RSS MB: {p.memory_info().rss / 1024 / 1024:.2f}")
  ```

- Remove components one by one. Test with only:
  - prompt + LLM
  - then add memory
  - then add retrieval
  - then add tools/agent logic

  The component that pushes memory over the edge is usually obvious.
- Inspect your actual stack trace. Look for:
  - `torch.OutOfMemoryError`
  - `RuntimeError: CUDA out of memory`
  - `MemoryError`
  - process killed by the OS (exit code 137)

  That tells you whether this is GPU VRAM exhaustion, system RAM exhaustion, or runaway process growth.
Prevention
- Instantiate long-lived objects once at module scope: LLM clients, embeddings, retrievers, and chains should not be recreated per request.
- Put hard limits on context growth: use summarizing memory, truncate message history, and cap retrieved documents.
- Match model size to hardware early: if it OOMs on your laptop during development, it will be worse under concurrent production traffic.
If you’re building LangChain agents for internal tools, treat memory as part of the design surface. Most OOM bugs come from architecture choices, not from LangChain itself.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.