How to Fix 'OOM error during inference when scaling' in CrewAI (Python)

By Cyprian Aarons · Updated 2026-04-21

An OOM error during inference means your process ran out of memory while the model was generating a response. In CrewAI, this usually shows up when you scale from one agent/task to many, increase context size, or run multiple workers in parallel.

The pattern is usually the same: things work in local testing, then fail once you add more tasks, longer prompts, bigger tools, or concurrent execution.

The Most Common Cause

The #1 cause is unbounded context growth across agents and tasks.

In CrewAI, people often pass large objects into task descriptions, keep full chat history forever, or let each agent accumulate too much state. That pushes token usage and memory up until Python or the model backend throws something like:

  • RuntimeError: CUDA out of memory
  • MemoryError
  • litellm.exceptions.APIError: OOM error during inference
  • ValueError: Context length exceeded

Broken pattern vs fixed pattern

  • Broken: passes full documents and history into every task. Fixed: pass only the minimum relevant context.
  • Broken: reuses one giant shared string. Fixed: trim and summarize before each task.
  • Broken: lets agents carry unlimited memory. Fixed: cap memory and reset state between runs.

# BROKEN
from crewai import Agent, Task, Crew

researcher = Agent(
    role="Researcher",
    goal="Analyze claims",
    backstory="You are a careful analyst."
)

task = Task(
    description=f"""
    Analyze this entire customer file and all prior messages:
    {full_customer_record}
    {conversation_history}
    {all_support_notes}
    """,
    agent=researcher,
)

crew = Crew(
    agents=[researcher],
    tasks=[task],
)

result = crew.kickoff()

# FIXED
from crewai import Agent, Task, Crew

def build_summary(record: str) -> str:
    return record[:4000]  # replace with real summarization / extraction

researcher = Agent(
    role="Researcher",
    goal="Analyze claims",
    backstory="You are a careful analyst.",
)

compact_context = build_summary(full_customer_record)

task = Task(
    description=f"""
    Analyze the claim using only this extracted context:
    {compact_context}
    Focus on policy coverage, dates, and stated exceptions.
    """,
    agent=researcher,
)

crew = Crew(
    agents=[researcher],
    tasks=[task],
)
result = crew.kickoff()

If you’re seeing OOM during scaling, this is usually where the problem starts. The fix is not “buy more RAM”; it’s reducing what each inference call carries.
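The build_summary stub in the fixed example above truncates blindly. A slightly more realistic extraction step might keep only the lines the agent actually needs; this is still illustrative — the keyword list and character cap are assumptions you would tune for your domain:

```python
def build_summary(record: str, keywords=("policy", "claim", "date"), max_chars=4000) -> str:
    """Keep only the lines mentioning relevant terms, then cap total size."""
    relevant = [
        line for line in record.splitlines()
        if any(k in line.lower() for k in keywords)
    ]
    return "\n".join(relevant)[:max_chars]

record = (
    "Claim filed 2024-01-02\n"
    "Lunch order: pizza\n"
    "Policy covers water damage"
)
print(build_summary(record))  # keeps the claim and policy lines only
```

Real pipelines often replace the keyword filter with retrieval or an LLM summarization pass, but even a cheap filter like this keeps irrelevant bulk out of every inference call.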

Other Possible Causes

1) Too much parallelism

If you scale with multiple processes or threads, each worker loads its own model/client state.

# risky
crew.kickoff(inputs=inputs)  # called from many workers at once

Fix by limiting concurrency:

# safer
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(run_crew_job, jobs))

2) Large tool outputs fed back into the agent

A tool that returns a huge JSON blob can blow up context fast.

# risky tool result
{
  "transactions": [... thousands of rows ...]
}

Trim at the tool boundary:

def fetch_transactions(account_id):
    rows = query_db(account_id)  # your own data-access helper
    return {
        "count": len(rows),
        "top_20": rows[:20],              # cap raw rows at the boundary
        "summary": summarize_rows(rows),  # your own summarizer
    }

3) Model too large for your hardware

If you’re running local inference through Ollama, vLLM, llama.cpp, or a GPU-backed provider wrapper, the model may simply not fit.

Check your model config:

from crewai import LLM

llm = LLM(
    model="ollama/llama3.1:70b",
)

Try a smaller model:

llm = LLM(
    model="ollama/llama3.1:8b",
)

4) Memory leak from repeated crew creation

Creating new agents/LLMs inside a hot loop can keep allocating until the process dies.

# risky
for job in jobs:
    crew = build_crew()   # new objects every iteration
    crew.kickoff(inputs=job)

Reuse objects where possible:

crew = build_crew()

for job in jobs:
    crew.kickoff(inputs=job)
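To confirm a leak like this, you can watch allocation growth across iterations with the standard library's tracemalloc — nothing CrewAI-specific. Here leaky_build is a stand-in for repeatedly constructing crews:

```python
import tracemalloc

def leaky_build():
    # Stand-in for build_crew(): allocates fresh objects on every call.
    return [object() for _ in range(10_000)]

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

kept = []
for _ in range(3):
    kept.append(leaky_build())  # simulating references held across runs

current = tracemalloc.take_snapshot()
top = current.compare_to(baseline, "lineno")[0]
print("largest growth:", top.size_diff, "bytes")
```

If the top entries keep growing run after run and point at your crew-construction code, you have your answer: hoist those objects out of the loop.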

How to Debug It

  1. Check where the failure happens

    • If it fails on first request: likely model size or prompt size.
    • If it fails after several runs: likely memory leak or accumulated context.
    • If it fails only under load: likely concurrency.
  2. Log prompt size before kickoff

    print("task chars:", len(task.description))
    print("tool payload chars:", len(str(tool_output)))

    If these numbers are huge, reduce inputs before calling Crew.kickoff().

  3. Disable parallel execution. Run one crew at a time. If the OOM disappears, your issue is worker fan-out or shared resource pressure.

  4. Inspect backend logs. Look for exact messages like:

    • CUDA out of memory
    • OOM error during inference
    • Context length exceeded
    • Killed (from the OS OOM killer)

If the backend says context is too long, trim prompts. If it says CUDA OOM, lower model size or batch/concurrency.
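If trimming is the fix, a rough character-based cap works as a stopgap in front of kickoff. The ~4-characters-per-token ratio is a common approximation for English text, not an exact count, and trim_to_budget is a name introduced here for illustration:

```python
def trim_to_budget(text: str, max_tokens: int = 4000, chars_per_token: int = 4) -> str:
    """Crude context cap: roughly 4 characters per token for English text."""
    limit = max_tokens * chars_per_token
    if len(text) <= limit:
        return text
    # Keep the start, where instructions usually live, and mark the cut.
    return text[:limit] + "\n[truncated to fit context budget]"

print(len(trim_to_budget("x" * 50_000)))  # far below the original 50,000
```

For anything production-grade, swap the character heuristic for your provider's real tokenizer, but even this crude guard turns a hard OOM crash into a degraded-but-working response.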

Prevention

  • Keep task descriptions small and explicit. Push raw data into retrieval or preprocessing instead of dumping it into prompts.
  • Put hard limits on tool outputs. Return summaries, counts, IDs, and top-N records instead of entire datasets.
  • Test scaling with production-like concurrency early. A single-agent happy path does not tell you anything about memory behavior under load.
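The second point can be enforced mechanically. Here is a sketch of a hard cap at the tool boundary; MAX_ITEMS and MAX_CHARS are illustrative limits to tune against your context window:

```python
MAX_ITEMS = 20     # assumption: top-N records is enough for one decision
MAX_CHARS = 2_000  # assumption: per-field string budget

def cap_tool_output(result: dict) -> dict:
    """Truncate list fields and long strings before they reach the agent."""
    capped = {}
    for key, value in result.items():
        if isinstance(value, list):
            capped[key] = value[:MAX_ITEMS]
            capped[f"{key}_total"] = len(value)  # preserve the real count
        elif isinstance(value, str) and len(value) > MAX_CHARS:
            capped[key] = value[:MAX_CHARS] + "..."
        else:
            capped[key] = value
    return capped

raw = {"transactions": list(range(1_000)), "note": "ok"}
safe = cap_tool_output(raw)
print(safe["transactions_total"], len(safe["transactions"]))  # 1000 20
```

Wrapping every tool's return value in a cap like this means no single tool call can blow up the context, no matter what the database hands back.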

If you want a simple rule: every agent should see only what it needs for one decision. Once you stop feeding CrewAI giant prompts and oversized tool payloads, most OOM errors disappear fast.


By Cyprian Aarons, AI Consultant at Topiax.