CrewAI Tutorial (Python): chunking large documents for intermediate developers
This tutorial shows you how to split large documents into smaller, token-safe chunks and process them with CrewAI in Python. You need this when a single PDF, policy manual, or contract is too large for one LLM call and you want reliable summarization, extraction, or review across the full document.
What You'll Need
- Python 3.10+
- crewai
- crewai-tools
- openai
- An OpenAI API key set as OPENAI_API_KEY
- A large text file to test with, such as docs/policy.txt
- Basic familiarity with CrewAI agents, tasks, and crews
Install the packages:
pip install crewai crewai-tools openai
Step-by-Step
- Start by loading the document and splitting it into chunks with overlap.
Overlap matters because legal and insurance documents often break concepts across paragraph boundaries.
from pathlib import Path

def chunk_text(text: str, chunk_size: int = 4000, overlap: int = 400) -> list[str]:
    """Split text into overlapping chunks so facts that straddle boundaries survive."""
    chunks = []
    start = 0
    text_length = len(text)
    while start < text_length:
        end = min(start + chunk_size, text_length)
        chunks.append(text[start:end])
        if end == text_length:
            break  # stop here, or the final tail gets emitted twice
        start = end - overlap
    return chunks

document_path = Path("docs/policy.txt")
text = document_path.read_text(encoding="utf-8")
chunks = chunk_text(text)
print(f"Loaded {len(text)} characters")
print(f"Created {len(chunks)} chunks")
print(chunks[0][:500])
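Note that chunk_size here is measured in characters, not tokens. A rough rule of thumb for English text is about four characters per token, so a 4000-character chunk is roughly 1000 tokens; this ratio is an assumption, and a tokenizer library gives exact counts for a specific model. A minimal sketch of the estimate:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; ~4 characters per token is a common heuristic for English."""
    return max(1, round(len(text) / chars_per_token))

# A 4000-character chunk lands around 1000 tokens, well inside most context windows.
print(estimate_tokens("x" * 4000))  # → 1000
```

If your chunks consistently overshoot the model's limit, shrink chunk_size rather than trusting the heuristic.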
- Define a focused agent that processes one chunk at a time.
Keep the agent narrow: one job, one output format, no extra responsibilities.
from crewai import Agent

chunk_summarizer = Agent(
    role="Document Analyst",
    goal="Extract the key points from a single document chunk",
    backstory=(
        "You review enterprise documents for risk, obligations, deadlines, "
        "and exceptions. You write concise summaries."
    ),
    verbose=True,
)
- Create a task template that tells the model exactly what to produce from each chunk.
For production work, ask for structured output so downstream aggregation is predictable.
from crewai import Task

def build_chunk_task(chunk: str) -> Task:
    return Task(
        description=(
            "Analyze the following document chunk and extract:\n"
            "- main topic\n"
            "- obligations or requirements\n"
            "- deadlines or dates\n"
            "- risks or exceptions\n"
            "- any numbers or thresholds\n\n"
            f"CHUNK:\n{chunk}"
        ),
        expected_output=(
            "A compact bullet list with the five requested sections. "
            "Do not invent missing details."
        ),
        agent=chunk_summarizer,
    )
- Run each chunk through its own crew execution and collect the results.
This pattern scales better than stuffing the whole document into one prompt.
from crewai import Crew, Process

def process_chunk(chunk: str) -> str:
    crew = Crew(
        agents=[chunk_summarizer],
        tasks=[build_chunk_task(chunk)],
        process=Process.sequential,
        verbose=True,
    )
    result = crew.kickoff()
    return str(result)

chunk_results = []
for i, chunk in enumerate(chunks[:5], start=1):
    print(f"Processing chunk {i}/{min(len(chunks), 5)}")
    chunk_results.append(process_chunk(chunk))

print(chunk_results[0])
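Collecting results in a Python list is fine for a short run, but persisting each chunk's output makes the pipeline restartable and gives reviewers an audit trail. A minimal sketch using a JSON Lines file (the chunk_outputs.jsonl filename and record fields are assumptions, not part of CrewAI):

```python
import json
from pathlib import Path

def save_chunk_result(path: Path, index: int, chunk: str, result: str) -> None:
    """Append one chunk's output as a JSON line so claims can be traced to source text."""
    record = {"chunk_index": index, "chunk_preview": chunk[:200], "result": result}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

log_path = Path("chunk_outputs.jsonl")
save_chunk_result(log_path, 1, "Policy text...", "- main topic: coverage limits")
print(log_path.read_text(encoding="utf-8").splitlines()[0])
```

You could call save_chunk_result inside the loop above, right after process_chunk returns.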
- Add a second pass that merges all chunk outputs into a final report.
This is where you turn many local summaries into one global answer.
final_analyst = Agent(
    role="Senior Document Reviewer",
    goal="Combine chunk-level findings into one coherent report",
    backstory=(
        "You reconcile repeated points, remove duplication, "
        "and produce an executive summary."
    ),
    verbose=True,
)

merge_task = Task(
    description=(
        "Combine these chunk-level notes into one final report.\n"
        "Return:\n"
        "- executive summary\n"
        "- key obligations\n"
        "- key risks\n"
        "- important dates and thresholds\n"
        "- open questions or ambiguities\n\n"
        f"NOTES:\n{chr(10).join(chunk_results)}"
    ),
    expected_output="A clean final report with clear headings and no duplication.",
    agent=final_analyst,
)

merge_crew = Crew(
    agents=[final_analyst],
    tasks=[merge_task],
    process=Process.sequential,
    verbose=True,
)

final_report = merge_crew.kickoff()
print(final_report)
Testing It
Run the script against a real document that is larger than your model’s comfortable context window. If you get a complete final report without truncation errors, the chunking logic is doing its job.
Check two failure modes first: repeated content at chunk boundaries and missing facts in the merged report. If those show up, reduce chunk_size or increase overlap.
For documents with tables or dense formatting, inspect the raw text extraction before blaming CrewAI. Bad PDF-to-text conversion is usually the real problem.
If you want stronger verification, compare the final report against known sections in the source document. In regulated domains, I also log each chunk output so reviewers can trace every claim back to source text.
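One mechanical way to spot boundary duplication before it pollutes the merged report is to measure how much of each chunk's head also appears in the previous chunk's tail. A minimal sketch (the 200-character window is an arbitrary assumption):

```python
def boundary_overlap(prev_chunk: str, next_chunk: str, window: int = 200) -> int:
    """Length of the longest prefix of next_chunk that appears in prev_chunk's tail."""
    tail = prev_chunk[-window:]
    for size in range(min(window, len(next_chunk)), 0, -1):
        if next_chunk[:size] in tail:
            return size
    return 0

a = "alpha beta gamma delta"
b = "gamma delta epsilon"
print(boundary_overlap(a, b))  # → 11, the shared "gamma delta"
```

If the measured overlap is much larger than your configured overlap parameter, your chunker is re-emitting text; if it is zero, you may be losing sentences at the seams.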
Next Steps
- Replace plain-text loading with PDF extraction using pypdf or pdfplumber
- Add structured outputs with Pydantic models for stricter downstream parsing
- Introduce parallel processing for chunks when you need higher throughput
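For the parallel-processing step, a thread pool keeps the one-crew-per-chunk pattern intact while overlapping the network waits. A minimal sketch, where summarize is a stand-in for the process_chunk call from the tutorial and max_workers=4 is an arbitrary assumption:

```python
from concurrent.futures import ThreadPoolExecutor

def summarize(chunk: str) -> str:
    # Stand-in for process_chunk(chunk); the real version calls the LLM.
    return f"summary of {len(chunk)} chars"

chunks = ["a" * 100, "b" * 250, "c" * 40]

# executor.map preserves input order, so results line up with chunks.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(summarize, chunks))

print(results)
```

Because executor.map returns results in input order, the merge pass downstream sees the same ordering as the sequential loop. Watch your provider's rate limits before raising max_workers.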
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.