How to Fix 'OOM error during inference in production' in CrewAI (Python)
When CrewAI throws an OOM error during inference, it means the model process ran out of memory while generating a response. In practice, this usually shows up when an agent is asked to process too much context, too many tools are loaded at once, or the model backend is running on a machine with too little RAM/VRAM.
This is not a CrewAI bug in most cases. It’s usually a workload shape problem: oversized prompts, long conversation history, large documents, or an inference backend that cannot fit the model and its KV cache in memory.
The Most Common Cause
The #1 cause is passing too much context into the agent task. In CrewAI, people often keep appending raw documents, chat history, and tool outputs into every task until the prompt becomes huge.
Here’s the broken pattern:
```python
from crewai import Agent, Task, Crew

researcher = Agent(
    role="Researcher",
    goal="Summarize customer policy documents",
    backstory="Insurance analyst",
)

# Every document is dumped into a single prompt.
task = Task(
    description=f"""
Summarize these documents:
{open("policy_1.txt").read()}
{open("policy_2.txt").read()}
{open("policy_3.txt").read()}
{open("policy_4.txt").read()}
{open("policy_5.txt").read()}
""",
    expected_output="A combined summary of all documents",  # required in recent CrewAI versions
    agent=researcher,
)

crew = Crew(agents=[researcher], tasks=[task])
result = crew.kickoff()
```
And here’s the fixed pattern:
```python
from crewai import Agent, Task, Crew

def chunk_text(text: str, size: int = 4000):
    """Split text into bounded chunks so no single prompt gets huge."""
    return [text[i:i + size] for i in range(0, len(text), size)]

researcher = Agent(
    role="Researcher",
    goal="Summarize customer policy documents",
    backstory="Insurance analyst",
)

docs = []
for name in ["policy_1.txt", "policy_2.txt", "policy_3.txt", "policy_4.txt", "policy_5.txt"]:
    with open(name) as f:
        docs.extend(chunk_text(f.read(), size=3000))

tasks = [
    Task(
        description=f"Summarize this chunk:\n\n{chunk}",
        expected_output="A short summary of the chunk",  # required in recent CrewAI versions
        agent=researcher,
    )
    for chunk in docs[:3]  # bound the batch size; loop over further batches for the rest
]

crew = Crew(agents=[researcher], tasks=tasks)
result = crew.kickoff()
```
The fix is simple:
- split large inputs before sending them to the model
- summarize incrementally instead of all at once (see the sketch below)
- keep task descriptions short and focused

If you are using `Process.sequential` with long-running tasks, this matters even more, because context can accumulate across steps.
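One way to summarize incrementally is a map-then-reduce layout: one bounded task per chunk, then a final task that merges only the per-chunk summaries. Here is a minimal sketch, reusing `docs` and `researcher` from the fixed pattern above and assuming a CrewAI version where `Task` accepts `context` and `expected_output`:

```python
from crewai import Crew, Process, Task

# Map step: each task sees one chunk, never the full corpus.
chunk_tasks = [
    Task(
        description=f"Summarize this chunk in under 150 words:\n\n{chunk}",
        expected_output="A short summary of the chunk",
        agent=researcher,
    )
    for chunk in docs
]

# Reduce step: only the small summaries flow in, not the raw documents.
combine = Task(
    description="Merge the chunk summaries into one consolidated policy overview.",
    expected_output="A single consolidated summary",
    agent=researcher,
    context=chunk_tasks,
)

crew = Crew(
    agents=[researcher],
    tasks=chunk_tasks + [combine],
    process=Process.sequential,
)
result = crew.kickoff()
```

Each prompt stays bounded by one chunk plus a handful of short summaries, instead of growing with the whole corpus.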
Other Possible Causes
| Cause | What it looks like | Fix |
|---|---|---|
| Too many tools attached to one agent | Agent loads every tool schema and metadata into prompt context | Reduce tool count per agent |
| Large `memory=True` conversation history | Context grows across turns until inference fails | Trim memory or reset between runs |
| Running a large local model on small hardware | Backend logs show CUDA or RAM exhaustion | Use a smaller model or quantized build |
| Huge tool outputs returned verbatim | Tool returns multi-MB JSON or HTML into the next prompt | Summarize or truncate tool output |
1) Too many tools on one agent
Every tool adds overhead. If you attach 10+ tools to a single Agent, you increase prompt size and sometimes trigger backend memory pressure.
```python
# Bad: one agent carries every tool schema in its prompt
agent = Agent(
    role="Claims Assistant",
    goal="Handle claims tasks",
    backstory="Generalist claims agent",  # backstory is required by CrewAI
    tools=[search_tool, db_tool, email_tool, pdf_tool, web_tool, jira_tool],
)

# Better: split responsibilities so each agent loads only the tools it needs
claims_reader = Agent(
    role="Claims Reader",
    goal="Read claim data",
    backstory="Pulls structured claim data from internal systems",
    tools=[db_tool, pdf_tool],
)

claims_writer = Agent(
    role="Claims Writer",
    goal="Draft claim responses",
    backstory="Writes customer-facing claim correspondence",
    tools=[email_tool],
)
```
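To keep that split effective, give each agent its own task and pass only the reader's output forward. A sketch under the same assumptions as above (the claim ID is a placeholder):

```python
read_task = Task(
    description="Extract the key fields from claim CL-123.",  # placeholder claim ID
    expected_output="Structured claim data",
    agent=claims_reader,
)
write_task = Task(
    description="Draft a customer response based on the extracted claim data.",
    expected_output="A draft email",
    agent=claims_writer,
    context=[read_task],  # only the reader's output crosses over, not its tools
)
crew = Crew(agents=[claims_reader, claims_writer], tasks=[read_task, write_task])
```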
2) Memory enabled without trimming
If you keep chat history forever, inference will eventually blow up.
```python
# Bad: unbounded memory growth across every run
agent = Agent(
    role="Support Agent",
    goal="Answer policy questions",
    backstory="Handles policyholder support",  # backstory is required by CrewAI
    memory=True,
)

# Better: keep memory off and scope history per case/session yourself
agent = Agent(
    role="Support Agent",
    goal="Answer policy questions",
    backstory="Handles policyholder support",
)
```
If your app stores conversation state externally, only pass the last relevant turns back into CrewAI.
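A minimal sketch of that idea, assuming your app keeps turns in a plain list of `(role, text)` pairs; `conversation_turns` and the cap of 6 are illustrative:

```python
MAX_TURNS = 6  # illustrative cap; tune to your context budget

def recent_history(turns: list[tuple[str, str]]) -> str:
    """Format only the most recent turns for the next task."""
    return "\n".join(f"{role}: {text}" for role, text in turns[-MAX_TURNS:])

task = Task(
    description=(
        "Answer the customer's latest question. Recent conversation:\n\n"
        + recent_history(conversation_turns)
    ),
    expected_output="A concise answer to the latest question",
    agent=agent,
)
```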
3) Tool output is too large
A common failure mode is returning full HTML pages or giant JSON blobs from a tool.
```python
import requests

# Bad: the entire response body flows into the next prompt
def fetch_policy_doc(policy_id):
    return requests.get(f"https://api.example.com/policies/{policy_id}").text

# Better: bound what the agent ever sees
def fetch_policy_doc(policy_id):
    text = requests.get(f"https://api.example.com/policies/{policy_id}").text
    return text[:8000]  # or extract only the relevant fields
```
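Blind truncation can cut a record in half. If the endpoint returns JSON, a safer variant is to keep only the fields the agent needs; the endpoint and field names here are assumptions:

```python
import json
import requests

def fetch_policy_summary(policy_id: str) -> str:
    resp = requests.get(f"https://api.example.com/policies/{policy_id}", timeout=10)
    resp.raise_for_status()
    data = resp.json()
    # Field names are illustrative; keep only what the agent needs.
    slim = {k: data.get(k) for k in ("id", "title", "coverage", "exclusions")}
    return json.dumps(slim)
```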
4) Local model/backend does not have enough memory
If you are using Ollama, vLLM, LM Studio, or another local runtime behind CrewAI, the issue may be outside Python.
```bash
# Example: use a smaller quantized model
ollama run llama3.1:8b-instruct-q4_K_M
```
Or reduce generation load:
```python
llm_config = {
    "temperature": 0.2,
    "max_tokens": 300,
}
```
Large `max_tokens` values can increase KV cache usage during inference.
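A rough back-of-envelope shows why. The KV cache stores one key and one value vector per layer per token; the figures below are assumptions roughly matching an 8B Llama-class model with grouped-query attention in fp16:

```python
# All figures are assumptions for an 8B Llama-class model (fp16, GQA).
n_layers, n_kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2

kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
print(kv_bytes_per_token)  # 131072 -> 128 KiB per token

context_tokens = 8192  # prompt plus reserved generation budget
print(kv_bytes_per_token * context_tokens / 2**30)  # ~1.0 GiB for the cache alone
```

That gigabyte has to fit alongside the model weights, so trimming prompts and `max_tokens` buys real headroom.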
How to Debug It
- Check whether the crash happens before or during generation.
  - If it fails immediately after `crew.kickoff()`, inspect prompt size.
  - If it fails after tool execution, inspect tool output size.
- Log the exact input sent to the task.
  - Print task descriptions.
  - Measure character count and approximate token count.
  - Look for repeated document dumps or copied chat history.
- Remove variables until it stops failing.
  - Run with one task.
  - Remove all but one tool.
  - Disable memory.
  - Replace your local model with a smaller one if needed.
- Inspect backend logs.
  - For local models: look for CUDA OOM, CPU RAM exhaustion, or context-length errors.
  - For hosted APIs: check whether you're hitting request size limits instead of true memory exhaustion.
A useful sanity check:
```python
prompt = task.description
print("chars:", len(prompt))
print("approx tokens:", len(prompt) // 4)  # ~4 chars per token, rough heuristic
```
If your prompt is already tens of thousands of tokens before inference starts, that’s your problem.
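You can promote that check into a guard that runs over every task before kickoff; the 4-chars-per-token ratio is a crude heuristic and the 12,000-token threshold is an arbitrary example:

```python
def check_prompt_sizes(tasks, max_tokens: int = 12_000) -> None:
    """Fail fast if any task description looks too large. Heuristic only."""
    for i, t in enumerate(tasks):
        approx = len(t.description) // 4  # ~4 chars per token
        print(f"task {i}: {len(t.description)} chars, ~{approx} tokens")
        if approx > max_tokens:
            raise ValueError(f"task {i} is ~{approx} tokens; split its input first")

check_prompt_sizes(tasks)
result = Crew(agents=[researcher], tasks=tasks).kickoff()
```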
Prevention
- Keep each `Task` narrow. One task should do one thing with one bounded input.
- Summarize intermediate results instead of passing raw documents through multiple agents.
- Put hard limits on tool output size and conversation history length (see the decorator sketch below).
- Prefer smaller models for production workflows that don't need deep reasoning.
- Test with production-like payload sizes before shipping to avoid surprises under load.
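For the hard limit on tool output, a small decorator works with any plain-function tool; the 8,000-character cap is an arbitrary example:

```python
from functools import wraps

import requests

def cap_output(max_chars: int = 8_000):
    """Truncate whatever a tool returns before it re-enters the prompt."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            out = str(fn(*args, **kwargs))
            return out if len(out) <= max_chars else out[:max_chars] + "\n[truncated]"
        return wrapper
    return decorator

@cap_output(max_chars=8_000)
def fetch_policy_doc(policy_id: str) -> str:
    return requests.get(f"https://api.example.com/policies/{policy_id}").text
```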
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.