CrewAI Tutorial (Python): chunking large documents for advanced developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows how to split a large document into manageable chunks, assign each chunk to a CrewAI agent, and then merge the results into one structured output. You need this when your source material is too large for a single model context window or when you want parallel analysis across sections without losing traceability.

What You'll Need

  • Python 3.10+
  • crewai
  • crewai-tools
  • An OpenAI API key in OPENAI_API_KEY
  • A long text file to process, for example policy.txt or contract.txt
  • Basic familiarity with CrewAI agents, tasks, and crews

Install the packages:

pip install crewai crewai-tools

Step-by-Step

  1. Start by loading the document and chunking it in plain Python. Keep the chunker deterministic so you can reproduce results and map outputs back to source ranges later.
from pathlib import Path

def chunk_text(text: str, chunk_size: int = 3000, overlap: int = 300):
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks

document_path = Path("policy.txt")
text = document_path.read_text(encoding="utf-8")
chunks = chunk_text(text)

print(f"Loaded {len(text)} characters")
print(f"Created {len(chunks)} chunks")
print(chunks[0][:500])
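Because the chunker is deterministic, it is easy to extend it to record character offsets. The variant below is an illustrative sketch (not part of CrewAI): it returns `(start, end, text)` spans so each chunk summary can later be traced back to an exact source range.

```python
def chunk_with_offsets(text: str, chunk_size: int = 3000, overlap: int = 300):
    """Like chunk_text, but also record (start, end) character offsets
    so downstream findings can be mapped back to the source document."""
    spans = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        spans.append((start, end, text[start:end]))
        if end == len(text):
            break
        start = end - overlap
    return spans

spans = chunk_with_offsets("a" * 7000, chunk_size=3000, overlap=300)
print([(s, e) for s, e, _ in spans])  # → [(0, 3000), (2700, 5700), (5400, 7000)]
```

Storing these spans alongside the chunk text costs almost nothing and pays off when a reviewer asks where a specific finding came from.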
  2. Define a worker agent that analyzes one chunk at a time. For document-heavy workflows, keep the task narrow so each result is structured and easy to merge later.
from crewai import Agent, Task, Crew, Process
from crewai.llm import LLM

llm = LLM(model="gpt-4o-mini")

chunk_analyst = Agent(
    role="Document Chunk Analyst",
    goal="Extract concise, structured findings from a single document chunk",
    backstory="You analyze legal, policy, and compliance documents with precision.",
    llm=llm,
    verbose=True,
)

def make_task(chunk: str, index: int) -> Task:
    return Task(
        description=(
            f"Analyze chunk {index} of a larger document.\n\n"
            "Return:\n"
            "- key points\n"
            "- risks or ambiguities\n"
            "- named entities\n"
            "- any dates, thresholds, or obligations\n\n"
            f"Chunk content:\n{chunk}"
        ),
        expected_output="A compact markdown summary with the requested fields.",
        agent=chunk_analyst,
    )
  3. Build one task per chunk and run them through a single sequential crew. The tasks execute one after another, but because each task only sees its own chunk, this pattern still scales to documents far larger than what a single agent could read in one pass.
tasks = [make_task(chunk, i + 1) for i, chunk in enumerate(chunks)]

crew = Crew(
    agents=[chunk_analyst],
    tasks=tasks,
    process=Process.sequential,
    verbose=True,
)

results = crew.kickoff()
print(results)
  4. Merge the per-chunk outputs into a single report. In production, this is where you add deduplication and conflict resolution, because repeated entities and overlapping chunks will produce repeated findings.

def build_merge_prompt(chunk_outputs):
    prompt_lines = [
        "Combine these chunk summaries into one final report.",
        "Remove duplicates.",
        "Preserve exact dates and thresholds.",
        "Group findings under: Summary, Risks, Obligations, Entities.",
        ""
    ]
    for i, output in enumerate(chunk_outputs, start=1):
        prompt_lines.append(f"Chunk {i} summary:\n{output}\n")
    return "\n".join(prompt_lines)

merge_agent = Agent(
    role="Report Synthesizer",
    goal="Merge multiple chunk summaries into one coherent final report",
    backstory="You reconcile overlapping analyses into clean executive output.",
    llm=llm,
)

merge_task = Task(
    description=build_merge_prompt(results.tasks_output),
    expected_output="A deduplicated final markdown report.",
    agent=merge_agent,
)

merge_crew = Crew(
    agents=[merge_agent],
    tasks=[merge_task],
    process=Process.sequential,
)

final_report = merge_crew.kickoff()
print(final_report)
  5. If your document is extremely large, switch from a single sequential crew run to batched processing. This keeps token usage predictable and gives you control over retries when one batch fails.
def batched(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

all_summaries = []

BATCH_SIZE = 4

for batch_num, batch in enumerate(batched(chunks, BATCH_SIZE), start=1):
    # Offset keeps chunk numbering global across batches, so summaries
    # still map back to the original chunk order
    offset = (batch_num - 1) * BATCH_SIZE
    batch_tasks = [make_task(chunk, offset + i + 1) for i, chunk in enumerate(batch)]
    batch_crew = Crew(
        agents=[chunk_analyst],
        tasks=batch_tasks,
        process=Process.sequential,
        verbose=False,
    )
    batch_result = batch_crew.kickoff()
    all_summaries.append(str(batch_result))
    print(f"Finished batch {batch_num}")

print(f"Collected {len(all_summaries)} batch summaries")
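The batch loop can be hardened with a small retry wrapper. The sketch below is illustrative: `run_fn` stands in for whatever builds and kicks off the crew for one batch, and the fixed-delay retry policy is an assumption, not a CrewAI feature.

```python
import time

def run_batch_with_retries(batch, run_fn, max_attempts: int = 3, delay: float = 2.0):
    """Call run_fn(batch), retrying on any exception up to max_attempts
    times with a fixed delay between attempts. Re-raises the last error
    if every attempt fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_fn(batch)
        except Exception as exc:
            if attempt == max_attempts:
                raise
            print(f"Batch failed (attempt {attempt}): {exc}; retrying")
            time.sleep(delay)
```

In the loop above you would wrap the crew execution, for example `run_batch_with_retries(batch, lambda b: batch_crew.kickoff())`, so one transient API error does not sink the whole run.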

Testing It

Run the script against a real file with at least 10 pages of text so you can see whether chunk boundaries are sensible. Check that each chunk summary contains only local facts from its own section and that the final merged report removes duplicates cleanly.

If the model starts hallucinating cross-chunk facts, reduce the chunk size and increase overlap slightly. Also verify that your merge step preserves exact numbers like policy limits, deadlines, and clause references.

A good sanity check is to compare three things: raw text excerpts from each chunk, individual summaries, and the final merged report. If those three line up consistently across multiple runs, your pipeline is stable enough for production hardening.
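One way to make that sanity check mechanical is to verify that specific values from each chunk survive into its summary. The helper below is a minimal sketch for ISO-style dates only; the regex and the sample strings are assumptions you would adapt to your own documents (policy limits, clause numbers, and so on).

```python
import re

# Matches ISO-style dates like 2026-12-31 (an assumption about your documents)
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def missing_dates(chunk: str, summary: str) -> list[str]:
    """Return dates that appear in the source chunk but not in its summary."""
    return [d for d in DATE_RE.findall(chunk) if d not in summary]

chunk = "Coverage ends 2026-12-31 unless renewed by 2026-11-01."
summary = "- Coverage ends 2026-12-31 unless renewed."
print(missing_dates(chunk, summary))  # → ['2026-11-01']: the renewal deadline was dropped
```

Run this per chunk and again against the merged report; any non-empty result points at exactly which value the pipeline lost.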

Next Steps

  • Add metadata to each chunk: page number, section heading, byte offset
  • Replace plain text splitting with semantic splitting using headings or paragraph boundaries
  • Add structured outputs with Pydantic models so downstream systems can consume JSON instead of markdown
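As a starting point for the semantic-splitting item above, here is a minimal paragraph-boundary chunker: it splits on blank lines and greedily packs paragraphs up to a size limit, so chunks never cut a paragraph in half. This is a sketch, not a drop-in replacement for every document shape (a single oversized paragraph still becomes its own chunk).

```python
def chunk_by_paragraphs(text: str, max_chars: int = 3000) -> list[str]:
    """Split on blank lines, then greedily pack whole paragraphs into
    chunks without exceeding max_chars per chunk."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # +2 accounts for the "\n\n" separator re-inserted between paragraphs
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

sample = "Intro paragraph.\n\nSecond paragraph.\n\nThird paragraph."
print(chunk_by_paragraphs(sample, max_chars=40))  # → two chunks, split at a paragraph boundary
```

Swapping this in for chunk_text keeps the rest of the pipeline unchanged, since both return a list of strings.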


By Cyprian Aarons, AI Consultant at Topiax.
