CrewAI Tutorial (Python): chunking large documents for beginners

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to split a large document into smaller chunks, hand those chunks to CrewAI agents, and get structured summaries back. You need this when your source material is too large for one prompt, or when you want parallel processing over contracts, policies, reports, or claims files.

What You'll Need

  • Python 3.10+
  • crewai
  • crewai-tools
  • langchain-openai
  • langchain-text-splitters (provides RecursiveCharacterTextSplitter)
  • An OpenAI API key set as OPENAI_API_KEY
  • A text file to test with, for example document.txt

Install the packages:

pip install crewai crewai-tools langchain-openai langchain-text-splitters

Step-by-Step

  1. Start by loading a document and splitting it into manageable chunks. For beginners, character-based chunking is enough and easier to debug than token-based splitting.
from pathlib import Path
from langchain_text_splitters import RecursiveCharacterTextSplitter

def load_and_chunk_document(file_path: str, chunk_size: int = 2000, chunk_overlap: int = 200):
    text = Path(file_path).read_text(encoding="utf-8")
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", " ", ""],
    )
    return splitter.split_text(text)

chunks = load_and_chunk_document("document.txt")
print(f"Created {len(chunks)} chunks")
print(chunks[0][:500])
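To see what chunk_size and chunk_overlap actually do, here is a simplified pure-Python sketch of overlapping character chunking. It only illustrates the sliding-window idea; RecursiveCharacterTextSplitter additionally tries to break on the separators list so chunks end at natural boundaries.

```python
def chunk_text(text: str, chunk_size: int = 20, overlap: int = 5) -> list[str]:
    # Slide a fixed-size window over the text, stepping forward by
    # (chunk_size - overlap) so consecutive chunks share `overlap` characters.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

demo_chunks = chunk_text("abcdefghijklmnopqrstuvwxyz" * 2, chunk_size=20, overlap=5)
for c in demo_chunks:
    print(len(c), c)

# The last 5 characters of each chunk are the first 5 of the next.
assert demo_chunks[1][:5] == demo_chunks[0][-5:]
```

The overlap is what lets a sentence cut in half at a chunk boundary still appear whole in the neighboring chunk.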
  2. Define one agent that summarizes each chunk and another agent that merges the summaries into a final result. This keeps the workflow simple and gives you a clean map-reduce style pipeline.
from crewai import Agent

chunk_summarizer = Agent(
    role="Chunk Summarizer",
    goal="Summarize one document chunk accurately and concisely",
    backstory="You extract the key facts from legal and business documents.",
    verbose=True,
)

final_summarizer = Agent(
    role="Final Summarizer",
    goal="Combine multiple chunk summaries into one coherent summary",
    backstory="You produce clear executive summaries from partial inputs.",
    verbose=True,
)
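Stripped of CrewAI, this two-agent workflow is plain map-reduce: summarize each chunk independently, then merge the partial results. The sketch below uses a fake_summarize stand-in (a hypothetical helper, not an LLM call) purely to show the shape of the pipeline.

```python
def fake_summarize(chunk: str) -> str:
    # Stand-in for the Chunk Summarizer agent: keep only the first sentence.
    return chunk.split(".")[0].strip() + "."

def map_reduce(chunks: list[str], summarize, merge) -> str:
    partial_summaries = [summarize(c) for c in chunks]  # map step
    return merge(partial_summaries)                     # reduce step

merged = map_reduce(
    ["Alpha is first. Extra detail here.", "Beta is second. Extra detail here."],
    fake_summarize,
    lambda parts: " ".join(parts),  # stand-in for the Final Summarizer agent
)
print(merged)  # → "Alpha is first. Beta is second."
```

In the CrewAI version below, the map step becomes one task per chunk and the reduce step becomes the final consolidation task.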
  3. Create tasks dynamically, one per chunk. Each task asks the agent to summarize only its assigned chunk, which avoids context overflow on large documents.
from crewai import Task

def build_chunk_tasks(chunks):
    tasks = []
    for i, chunk in enumerate(chunks, start=1):
        tasks.append(
            Task(
                description=(
                    f"Summarize chunk {i} of {len(chunks)}.\n\n"
                    f"Focus on facts, dates, entities, obligations, and risks.\n\n"
                    f"Chunk text:\n{chunk}"
                ),
                expected_output=f"A concise summary of chunk {i}.",
                agent=chunk_summarizer,
            )
        )
    return tasks

chunk_tasks = build_chunk_tasks(chunks[:3])  # start with a few chunks while testing
  4. Run the chunk tasks in a crew, then pass their outputs into a second crew for consolidation. This two-stage pattern is what you want in production when documents are too large for a single pass.
from crewai import Crew, Process

chunk_crew = Crew(
    agents=[chunk_summarizer],
    tasks=chunk_tasks,
    process=Process.sequential,
    verbose=True,
)

chunk_results = chunk_crew.kickoff()

# kickoff() returns a CrewOutput; the per-task results live in .tasks_output
summary_text = "\n\n".join(str(task_output) for task_output in chunk_results.tasks_output)

final_task = Task(
    description=(
        "Merge these chunk summaries into one final summary.\n"
        "Remove duplicates and keep the structure clear.\n\n"
        f"Chunk summaries:\n{summary_text}"
    ),
    expected_output="A single consolidated summary.",
    agent=final_summarizer,
)

final_crew = Crew(
    agents=[final_summarizer],
    tasks=[final_task],
    process=Process.sequential,
    verbose=True,
)

final_result = final_crew.kickoff()
print(final_result)
  5. If you want better control over output quality, add a lightweight validation step after summarization. In real systems, this is where you enforce length limits, required fields, or JSON output.
def validate_summary(text: str) -> bool:
    required_terms = ["summary"]
    return all(term.lower() in text.lower() for term in required_terms)

result_text = str(final_result)
if validate_summary(result_text):
    print("Summary passed validation")
else:
    print("Summary failed validation")
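If you later switch the final task to JSON output, the same validation hook can parse the result and check required keys plus a length cap. The field names below are illustrative, not a CrewAI convention.

```python
import json

def validate_structured_summary(raw: str,
                                required_fields=("summary", "risks"),
                                max_chars: int = 4000) -> bool:
    # Reject oversized output first, then require valid JSON with the expected keys.
    if len(raw) > max_chars:
        return False
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return all(field in data for field in required_fields)

good = '{"summary": "Covers delivery terms.", "risks": ["late delivery"]}'
bad = "Free text, not JSON."
print(validate_structured_summary(good))  # True
print(validate_structured_summary(bad))   # False
```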

Testing It

Run the script against a real text file with several pages of content. Start with only the first 2-3 chunks so you can inspect whether each summary stays focused on its own section.

If the outputs are too vague, reduce chunk_size or tighten the task instructions around what to extract. If summaries repeat themselves too much in the final output, increase chunk_overlap slightly or make the final agent explicitly remove duplicates.
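"Repeat themselves too much" can be made measurable. A rough heuristic (my own, not part of CrewAI) is to count word n-grams shared between chunk summaries; a rising score suggests the chunks overlap too heavily or the final agent needs stronger deduplication instructions.

```python
def repetition_score(summaries: list[str], n: int = 3) -> float:
    # Fraction of distinct word n-grams that appear in more than one summary.
    first_seen: dict[tuple, int] = {}
    shared: set[tuple] = set()
    for idx, summary in enumerate(summaries):
        words = summary.lower().split()
        for i in range(len(words) - n + 1):
            gram = tuple(words[i:i + n])
            if gram in first_seen and first_seen[gram] != idx:
                shared.add(gram)
            else:
                first_seen.setdefault(gram, idx)
    return len(shared) / (len(first_seen) or 1)

print(repetition_score(["the contract covers delivery terms",
                        "the contract covers payment terms"]))  # 0.2
print(repetition_score(["alpha beta gamma", "delta epsilon zeta"]))  # 0.0
```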

For a proper smoke test, use a document with headings like scope, risks, exclusions, and obligations. You should see those concepts preserved across individual chunk summaries and then merged cleanly in the final result.
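If you do not have a suitable document handy, a small generator can produce one with exactly those headings. The helper below is just for smoke testing; the section text is filler.

```python
from pathlib import Path

SECTIONS = ("Scope", "Risks", "Exclusions", "Obligations")

def build_smoke_test_doc(path: str = "document.txt") -> str:
    # One heading per section, followed by repeated filler sentences so the
    # splitter has enough text to produce several chunks.
    parts = []
    for name in SECTIONS:
        parts.append(name.upper())
        parts.append(f"This section describes the {name.lower()} of the agreement. " * 8)
    text = "\n\n".join(p.strip() for p in parts)
    Path(path).write_text(text, encoding="utf-8")
    return text

doc = build_smoke_test_doc()
print(f"Wrote {len(doc)} characters to document.txt")
```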

Next Steps

  • Add metadata to each chunk so you can trace summaries back to page numbers or sections.
  • Switch from plain-text files to PDF extraction using pymupdf or pdfplumber.
  • Change the final task to output structured JSON for downstream systems like case management or claims workflows.
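For the first bullet, one simple approach is to carry character offsets alongside each chunk so a summary can be traced back to its source span. This is an illustrative pattern, not a CrewAI feature; real page numbers would need a page-aware loader.

```python
from dataclasses import dataclass

@dataclass
class ChunkRecord:
    index: int
    start: int  # character offset of the chunk in the source document
    end: int
    text: str

def chunk_with_offsets(text: str, chunk_size: int = 2000,
                       overlap: int = 200) -> list[ChunkRecord]:
    # Fixed-window chunking that records where each chunk came from.
    step = chunk_size - overlap
    records = []
    for i, start in enumerate(range(0, len(text), step), start=1):
        piece = text[start:start + chunk_size]
        records.append(ChunkRecord(index=i, start=start,
                                   end=start + len(piece), text=piece))
    return records

records = chunk_with_offsets("x" * 5000)
for r in records:
    print(f"chunk {r.index}: chars {r.start}-{r.end}")
```

Include the start/end values in each task description and ask the agent to echo them, and every line of the final summary becomes traceable.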

By Cyprian Aarons, AI Consultant at Topiax.
