How to Fix 'OOM error during inference when scaling' in CrewAI (Python)
An OOM error during inference means your process ran out of memory while the model was generating a response. In CrewAI, this usually shows up when you scale from one agent/task to many, increase context size, or run multiple workers in parallel.
The pattern is usually the same: things work in local testing, then fail once you add more tasks, longer prompts, bigger tools, or concurrent execution.
The Most Common Cause
The #1 cause is unbounded context growth across agents and tasks.
In CrewAI, people often pass large objects into task descriptions, keep full chat history forever, or let each agent accumulate too much state. That pushes token usage and memory up until Python or the model backend throws something like:
- `RuntimeError: CUDA out of memory`
- `MemoryError`
- `litellm.exceptions.APIError: OOM error during inference`
- `ValueError: Context length exceeded`
Broken pattern vs fixed pattern
| Broken | Fixed |
|---|---|
| Passes full documents and history into every task | Passes only the minimum relevant context |
| Reuses one giant shared string | Trims and summarizes before each task |
| Lets agents carry unlimited memory | Caps memory and resets state between runs |
```python
# BROKEN
from crewai import Agent, Task, Crew

researcher = Agent(
    role="Researcher",
    goal="Analyze claims",
    backstory="You are a careful analyst.",
)

task = Task(
    description=f"""
    Analyze this entire customer file and all prior messages:
    {full_customer_record}
    {conversation_history}
    {all_support_notes}
    """,
    agent=researcher,
)

crew = Crew(
    agents=[researcher],
    tasks=[task],
)

result = crew.kickoff()
```
```python
# FIXED
from crewai import Agent, Task, Crew

def build_summary(record: str) -> str:
    return record[:4000]  # replace with real summarization / extraction

researcher = Agent(
    role="Researcher",
    goal="Analyze claims",
    backstory="You are a careful analyst.",
)

compact_context = build_summary(full_customer_record)

task = Task(
    description=f"""
    Analyze the claim using only this extracted context:
    {compact_context}
    Focus on policy coverage, dates, and stated exceptions.
    """,
    agent=researcher,
)

crew = Crew(
    agents=[researcher],
    tasks=[task],
)

result = crew.kickoff()
```
If you’re seeing OOM during scaling, this is usually where the problem starts. The fix is not “buy more RAM”; it’s reducing what each inference call carries.
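The `record[:4000]` slice above is only a placeholder. A slightly more realistic, still model-free sketch keeps the start and end of the record within a character budget; the assumption that the middle is the least useful part is mine, and in production you would swap this for real extraction or a cheap summarization call:

```python
# Sketch: bounded context builder. The 4000-character budget and the
# "keep head and tail" heuristic are illustrative assumptions, not CrewAI requirements.
def build_summary(record: str, max_chars: int = 4000) -> str:
    if len(record) <= max_chars:
        return record
    half = max_chars // 2
    return record[:half] + "\n...[truncated]...\n" + record[-half:]
```

Whatever strategy you pick, the point is the same: the amount of text entering each task is bounded no matter how large the source record grows.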
Other Possible Causes
1) Too much parallelism
If you scale with multiple processes or threads, each worker loads its own model/client state.
```python
# risky
crew.kickoff(inputs=inputs)  # called from many workers at once
```
Fix by limiting concurrency:
```python
# safer
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(run_crew_job, jobs))
```
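Here `run_crew_job` and `jobs` are not CrewAI APIs; they are whatever wrapper and payload you define. A minimal sketch, assuming each job is a dict of `kickoff` inputs and `build_crew()` is your own crew factory (see section 4 below):

```python
# Sketch of the per-job wrapper used above; run_crew_job and jobs are your own code, not CrewAI APIs
def run_crew_job(job: dict) -> str:
    crew = build_crew()                 # your own crew factory (see section 4 below)
    result = crew.kickoff(inputs=job)
    return str(result)

jobs = [{"claim_id": "A-1001"}, {"claim_id": "A-1002"}]  # hypothetical kickoff inputs
```

Keeping `max_workers` small is what actually bounds memory: only that many prompts, responses, and model-client buffers are in flight at once.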
2) Large tool outputs fed back into the agent
A tool that returns a huge JSON blob can blow up context fast.
```
# risky tool result
{
    "transactions": [... thousands of rows ...]
}
```
Trim at the tool boundary:
```python
def fetch_transactions(account_id):
    rows = query_db(account_id)
    return {
        "count": len(rows),
        "top_20": rows[:20],
        "summary": summarize_rows(rows),
    }
```
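Note that `query_db` and `summarize_rows` are placeholders for your own data layer, not CrewAI helpers. A minimal sketch of the summarizer, assuming each row is a dict with an `amount` field:

```python
# Sketch only: assumes rows are dicts with an "amount" key; adapt to your schema
def summarize_rows(rows: list[dict]) -> str:
    if not rows:
        return "No transactions found."
    total = sum(float(row.get("amount", 0)) for row in rows)
    return f"{len(rows)} transactions, total amount {total:,.2f}"
```

The agent then sees one summary line plus the top 20 rows instead of thousands of serialized records.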
3) Model too large for your hardware
If you’re running local inference through Ollama, vLLM, llama.cpp, or a GPU-backed provider wrapper, the model may simply not fit.
Check your model config:
```python
from crewai import LLM

llm = LLM(
    model="ollama/llama3.1:70b",
)
```
Try a smaller model:
```python
llm = LLM(
    model="ollama/llama3.1:8b",
)
```
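A quick way to sanity-check whether a model can fit: weights alone take roughly parameters × bytes per parameter. At 16-bit precision that is about 2 GB per billion parameters; 4-bit quantization brings it down to roughly 0.5 GB per billion, and that is before KV cache and runtime overhead.

```python
# Back-of-the-envelope weight memory only; ignores KV cache, activations, and framework overhead
def approx_weight_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    return params_billion * bytes_per_param

print(approx_weight_gb(70))       # ~140 GB at 16-bit: will not fit on a single consumer GPU
print(approx_weight_gb(70, 0.5))  # ~35 GB with 4-bit quantization: still a large machine
print(approx_weight_gb(8, 0.5))   # ~4 GB: fits comfortably on most modern GPUs
```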
4) Memory leak from repeated crew creation
Creating new agents/LLMs inside a hot loop can keep allocating until the process dies.
```python
# risky
for job in jobs:
    crew = build_crew()  # new objects every iteration
    crew.kickoff(inputs=job)
```
Reuse objects where possible:
```python
crew = build_crew()
for job in jobs:
    crew.kickoff(inputs=job)
```
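Keep in mind `build_crew` is a hypothetical factory, not a CrewAI function. If you keep one, the important part is constructing the expensive pieces, especially the LLM client, once and reusing them. A minimal sketch mirroring the Agent/Task setup from the earlier examples (adjust fields such as `expected_output` to whatever your CrewAI version requires):

```python
from crewai import Agent, Task, Crew, LLM

# Created once at import time and shared by every crew this module builds
shared_llm = LLM(model="ollama/llama3.1:8b")

def build_crew() -> Crew:
    researcher = Agent(
        role="Researcher",
        goal="Analyze claims",
        backstory="You are a careful analyst.",
        llm=shared_llm,  # reuse the shared client instead of allocating a new one per crew
    )
    task = Task(
        description="Analyze the claim using the context supplied in the kickoff inputs.",
        agent=researcher,
    )
    return Crew(agents=[researcher], tasks=[task])
```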
How to Debug It
- Check where the failure happens.
  - If it fails on the first request: likely model size or prompt size.
  - If it fails after several runs: likely a memory leak or accumulated context.
  - If it fails only under load: likely concurrency.
- Log prompt size before kickoff: `print("task chars:", len(task.description))` and `print("tool payload chars:", len(str(tool_output)))`. If these numbers are huge, reduce inputs before calling `Crew.kickoff()` (see the helper sketch after this list).
- Disable parallel execution. Run one crew at a time. If the OOM disappears, your issue is worker fan-out or shared resource pressure.
- Inspect backend logs. Look for exact messages like:
  - `CUDA out of memory`
  - `OOM error during inference`
  - `Context length exceeded`
  - `Killed` from the OS OOM killer
- If the backend says context is too long, trim prompts. If it says CUDA OOM, lower model size or batch/concurrency.
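To make the "log prompt size" step repeatable, a small pre-flight check can run right before `kickoff()`. This is a sketch: the 20,000-character threshold is an arbitrary example, and it assumes the `Crew` exposes its tasks via `.tasks` as in the examples above.

```python
# Sketch: print how much text each task carries before kickoff (threshold is illustrative)
def log_prompt_sizes(crew, warn_chars: int = 20_000) -> None:
    for i, task in enumerate(crew.tasks):
        size = len(task.description or "")
        note = "  <-- large, consider trimming" if size > warn_chars else ""
        print(f"task {i}: {size} chars{note}")

log_prompt_sizes(crew)
result = crew.kickoff()
```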
Prevention
- Keep task descriptions small and explicit. Push raw data into retrieval or preprocessing instead of dumping it into prompts.
- Put hard limits on tool outputs. Return summaries, counts, IDs, and top-N records instead of entire datasets.
- Test scaling with production-like concurrency early. A single-agent happy path does not tell you anything about memory behavior under load.
If you want a simple rule: every agent should see only what it needs for one decision. Once you stop feeding CrewAI giant prompts and oversized tool payloads, most OOM errors disappear fast.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.