How to Fix 'OOM error during inference' in CrewAI (Python)

By Cyprian Aarons · Updated 2026-04-21

What the error means

OOM error during inference means your process ran out of memory while an LLM call was being executed. In CrewAI, this usually shows up when an agent, tool, or task pushes too much context into the model, or when you try to run too many heavy inference jobs at once.

You’ll typically see it during long task chains, large document processing, or when multiple agents share a bloated prompt history.

The Most Common Cause

The #1 cause is oversized context being sent into the model. In CrewAI, this often happens when you keep appending full tool outputs, full documents, or long chat history into Task.description, Agent.goal, or memory-backed conversations.

Here’s the broken pattern and the fixed pattern side by side:

Broken                                      Fixed
Sends raw document text into every task     Passes only the relevant chunk
Keeps accumulating history in memory        Truncates or summarizes state
Reuses huge outputs as prompts              Stores outputs externally and references them
# BROKEN
from crewai import Agent, Task, Crew
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)

researcher = Agent(
    role="Researcher",
    # Entire report inlined into the goal -- resent on every LLM call
    goal="Analyze the following report in full detail: " + open("claims_report.txt").read(),
    backstory="You are a senior claims analyst.",
    llm=llm,
)

task = Task(
    # ...and the same full file inlined a second time here
    description="Read this entire file and extract all risks: " + open("claims_report.txt").read(),
    agent=researcher,
)

crew = Crew(agents=[researcher], tasks=[task])
crew.kickoff()
# FIXED
from crewai import Agent, Task, Crew
from langchain_openai import ChatOpenAI

def load_chunk(path: str, start: int = 0, size: int = 4000) -> str:
    # Binary mode makes seek() safe at arbitrary byte offsets;
    # text-mode seek() only accepts cookies returned by tell().
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(size).decode("utf-8", errors="ignore")

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

chunk = load_chunk("claims_report.txt")

researcher = Agent(
    role="Researcher",
    goal="Extract risks from the provided excerpt only.",
    backstory="You are a senior claims analyst.",
    llm=llm,
)

task = Task(
    description=f"Analyze this excerpt and return only high-signal risks:\n\n{chunk}",
    agent=researcher,
)

crew = Crew(agents=[researcher], tasks=[task])
crew.kickoff()

The mistake is not “using CrewAI wrong.” It’s feeding the pipeline more tokens than your runtime can handle. With a local model, an oversized prompt inflates the KV cache until RAM or VRAM runs out; even with a hosted model like gpt-4o, building and holding huge prompt strings (plus the copies CrewAI keeps in history and memory) can exhaust your process’s memory.
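A cheap guard is to measure the prompt before you build the task. The sketch below uses a rough four-characters-per-token heuristic; `approx_tokens`, `check_prompt_budget`, and the 8,000-token budget are illustrative assumptions, not CrewAI APIs:

```python
# Rough heuristic: ~4 characters per token for English text.
# This is a budget check, not an exact tokenizer.
MAX_PROMPT_TOKENS = 8000

def approx_tokens(text: str) -> int:
    return len(text) // 4

def check_prompt_budget(description: str, limit: int = MAX_PROMPT_TOKENS) -> str:
    """Truncate a task description that would blow past the token budget."""
    if approx_tokens(description) <= limit:
        return description
    # Keep the head of the prompt and flag the truncation explicitly.
    keep_chars = limit * 4
    return description[:keep_chars] + "\n\n[TRUNCATED: input exceeded token budget]"
```

Run anything that is about to become a `Task.description` through a check like this and you catch the runaway prompt before it ever reaches the model.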

Other Possible Causes

1) Running too many agents in parallel

If you fan out multiple tasks at once (in CrewAI, by marking them async_execution=True; there is no process="parallel" option), each in-flight inference call needs its own memory footprint.

# Problematic parallelism: async tasks hold several inference calls in flight
task1 = Task(description="Analyze region A", agent=agent1, async_execution=True)
task2 = Task(description="Analyze region B", agent=agent2, async_execution=True)
task3 = Task(description="Analyze region C", agent=agent3, async_execution=True)

crew = Crew(agents=[agent1, agent2, agent3], tasks=[task1, task2, task3])

Fix it by running tasks sequentially when memory is tight (Process.sequential is CrewAI’s default):

from crewai import Process

crew = Crew(
    agents=[agent1, agent2, agent3],
    tasks=[task1, task2, task3],
    process=Process.sequential,
)
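If you genuinely need some parallelism, cap it instead of fanning out unbounded. A minimal sketch where each job is a function that calls `crew.kickoff()` (`run_with_cap` is an illustrative helper, not a CrewAI API):

```python
from concurrent.futures import ThreadPoolExecutor

def run_with_cap(jobs, max_workers: int = 2):
    """Run callables with a hard cap on how many execute concurrently.

    Each job should be a zero-argument function, e.g. lambda: crew.kickoff().
    The executor queues the rest, so at most max_workers inference calls
    are in flight at once.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(job) for job in jobs]
        return [f.result() for f in futures]
```

Two concurrent calls is a sane starting point on a single machine; raise the cap only after measuring memory headroom.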

2) Using a large local model on limited RAM/VRAM

If you’re using a local backend such as Ollama, vLLM, llama.cpp, or Hugging Face Transformers, the model itself may simply be too large for your machine.

from langchain_community.llms import Ollama

llm = Ollama(model="llama3:70b")  # likely to OOM on modest hardware

Use a smaller model:

llm = Ollama(model="llama3:8b")
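As a rough rule, 4-bit quantized weights need on the order of 0.5 to 0.7 GB of memory per billion parameters, plus headroom for the KV cache. A sketch of picking a tag from free RAM (`choose_ollama_model` and its thresholds are illustrative assumptions, not Ollama guidance):

```python
def choose_ollama_model(free_ram_gb: float) -> str:
    """Pick the largest llama3 tag that plausibly fits in memory.

    Thresholds assume 4-bit quantized weights plus KV-cache headroom;
    treat them as rough starting points, not guarantees.
    """
    if free_ram_gb >= 48:
        return "llama3:70b"
    if free_ram_gb >= 6:
        return "llama3:8b"
    raise MemoryError("Not enough free RAM for any llama3 variant")
```

Failing fast with a clear error beats letting the OS kill the process halfway through a run.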

3) Long-running memory accumulation across tasks

CrewAI memory can help with continuity, but it can also grow without bound if you keep storing every intermediate result.

crew = Crew(
    agents=[agent],
    tasks=[task1, task2, task3],
    memory=True
)

If you don’t need persistent memory across steps, disable it:

crew = Crew(
    agents=[agent],
    tasks=[task1, task2, task3],
    memory=False
)

Or summarize state before passing it forward:

summary_task = Task(
    description="Summarize prior findings in under 200 words.",
    agent=agent,
)
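Outside CrewAI’s memory layer, you can apply the same idea by keeping a rolling state string and clamping it between tasks. A sketch (`roll_state` is a hypothetical helper, not part of CrewAI):

```python
def roll_state(state: str, new_finding: str, max_chars: int = 2000) -> str:
    """Append a finding, then keep only the most recent max_chars of state.

    Keeping the tail assumes recent findings matter more than old ones;
    swap in an LLM summarization step if you need the full history condensed.
    """
    combined = (state + "\n" + new_finding).strip()
    return combined[-max_chars:]
```

Pass the clamped state into the next `Task.description` instead of the raw accumulated history.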

4) Oversized tool outputs being injected back into prompts

A common trap is returning huge JSON blobs from tools and then feeding them straight into another agent.

from crewai.tools import tool

@tool("fetch_claims")
def fetch_claims() -> str:
    """Return every claim on record."""
    with open("all_claims.json", "r", encoding="utf-8") as f:
        return f.read()  # huge payload, injected straight into the next prompt

Return a compact result instead:

@tool("fetch_claims")
def fetch_claims() -> str:
    """Return a bounded slice of the claims on record."""
    with open("all_claims.json", "r", encoding="utf-8") as f:
        data = f.read()
    return data[:5000]  # or better: filter to the relevant records before returning

Better still: store large results in S3/database and pass a pointer:

return {"artifact_id": "claims_2024_q1", "location": "s3://bucket/claims_2024_q1.json"}
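That pointer pattern can be sketched without any cloud dependency: write the payload locally and hand the agent only a small reference. `ArtifactStore` below is an illustrative stand-in for S3 or a database, not a CrewAI feature:

```python
import tempfile
from pathlib import Path

class ArtifactStore:
    """Toy stand-in for S3/database: store payloads, return small pointers."""

    def __init__(self, root=None):
        self.root = Path(root) if root else Path(tempfile.mkdtemp())

    def put(self, artifact_id: str, payload: str) -> dict:
        path = self.root / f"{artifact_id}.json"
        path.write_text(payload, encoding="utf-8")
        # The agent only ever sees this pointer, never the full payload.
        return {"artifact_id": artifact_id, "location": str(path)}

    def get(self, pointer: dict) -> str:
        return Path(pointer["location"]).read_text(encoding="utf-8")
```

A tool returns `store.put(...)`; only a downstream tool that actually needs the data calls `store.get(...)`, so the blob never rides along in the prompt.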

How to Debug It

  1. Check whether the prompt is exploding

    • Log Task.description, tool outputs, and any memory payload being passed between tasks.
    • If you see multi-page text dumps or giant JSON objects, that’s your first suspect.
  2. Reduce to one agent and one small task

    • Run a minimal crew with a tiny input.
    • If the error disappears, add complexity back one piece at a time until it returns.
  3. Switch to a smaller model

    • Move from gpt-4o to gpt-4o-mini, or from a local 70B model to an 8B model.
    • If OOM disappears immediately, your issue is capacity-related rather than logic-related.
  4. Disable memory and parallel execution

    • Set memory=False.
    • Use sequential processing.
    • If that fixes it, your problem is accumulation or concurrency pressure.
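Step 1 above (checking whether the prompt is exploding) can be sketched as a quick size audit. `audit_payloads` and the 20,000-character threshold are illustrative starting points, not CrewAI APIs:

```python
def audit_payloads(payloads: dict, warn_chars: int = 20_000) -> list:
    """Log the size of each payload and return the names of OOM suspects.

    payloads maps a label (e.g. "task.description") to the string being
    sent into the model or passed between tasks.
    """
    suspects = []
    for name, text in payloads.items():
        size = len(text)
        print(f"{name}: {size:,} chars")
        if size > warn_chars:
            suspects.append(name)
    return suspects
```

Call it on every `Task.description`, tool output, and memory payload just before kickoff; the first name it flags is usually your culprit.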

A good diagnostic workflow looks like this:

from crewai import Process

crew = Crew(
    agents=[agent],
    tasks=[small_task],
    memory=False,
    process=Process.sequential,
)
result = crew.kickoff()
print(result)

If that works but your real pipeline fails, reintroduce one variable at a time:

  • bigger input
  • memory enabled
  • multiple tasks
  • parallel execution
  • larger model

Prevention

  • Keep task inputs small and specific. Pass excerpts, IDs, summaries, or retrieved chunks instead of full documents.
  • Prefer sequential execution unless you’ve measured that your runtime can handle parallel inference safely.
  • Put hard limits on tool output size and summarize before handing results to another agent.
  • Match model size to hardware. A local 70B model on a laptop is not an optimization; it’s an OOM ticket.

If you’re seeing OOM error during inference in CrewAI Python code right now, start by shrinking context. In practice that fixes most cases faster than changing frameworks or rewriting agents.



By Cyprian Aarons, AI Consultant at Topiax.
