CrewAI Tutorial (Python): handling long documents for intermediate developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to build a CrewAI pipeline that can ingest long documents, split them into manageable chunks, summarize each chunk, and produce a final answer without blowing past context limits. You need this when your source material is too large for a single prompt, but you still want structured analysis instead of random truncation.

What You'll Need

  • Python 3.10+
  • crewai
  • crewai-tools
  • langchain-text-splitters
  • An OpenAI API key
  • A text file or PDF converted to plain text
  • Basic familiarity with CrewAI agents, tasks, and crews

Install the packages:

pip install crewai crewai-tools langchain-text-splitters

Set your API key:

export OPENAI_API_KEY="your-api-key"

Step-by-Step

  1. Start by loading the document and splitting it into chunks. For long-document workflows, chunking is the first real control point because it determines how much context each agent sees.
from pathlib import Path
from langchain_text_splitters import RecursiveCharacterTextSplitter

doc_path = Path("long_document.txt")
text = doc_path.read_text(encoding="utf-8")

splitter = RecursiveCharacterTextSplitter(
    chunk_size=3000,
    chunk_overlap=300,
)
chunks = splitter.split_text(text)

print(f"Loaded {len(text)} characters")
print(f"Created {len(chunks)} chunks")
print(chunks[0][:500])
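To see what the splitter is doing under the hood, here is a deliberately simplified, stdlib-only sketch of fixed-size chunking with overlap. It is not the real RecursiveCharacterTextSplitter, which also prefers paragraph and sentence boundaries, but it shows why consecutive chunks share text:

```python
def naive_chunks(text: str, chunk_size: int = 3000, overlap: int = 300) -> list[str]:
    """Fixed-size chunking with overlap (simplified illustration only)."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    # Each chunk starts `step` characters after the previous one,
    # so the last `overlap` characters of chunk i reappear in chunk i+1.
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

sample = "".join(str(i % 10) for i in range(7000))
pieces = naive_chunks(sample)
print(len(pieces))      # 3
print(len(pieces[0]))   # 3000
```

The overlap is what lets a sentence cut in half at a chunk boundary still appear whole in the next chunk, at the cost of some duplicated tokens.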
  2. Create one agent that extracts structured notes from each chunk. Keep this agent narrow: its job is not to answer the final question, only to compress each chunk into useful evidence.
from crewai import Agent

chunk_summarizer = Agent(
    role="Document Analyst",
    goal="Extract concise, structured notes from a document chunk",
    backstory="You are precise and only keep facts relevant to downstream analysis.",
    verbose=True,
)
  3. Define a task template that will run once per chunk. Each task should ask for consistent output so the final aggregation step can merge results cleanly.
from crewai import Task

def build_chunk_task(chunk: str) -> Task:
    return Task(
        description=(
            "Read the following document chunk and produce:\n"
            "1) A 5-bullet summary\n"
            "2) Key entities or dates\n"
            "3) Any risks, obligations, or decisions mentioned\n\n"
            f"CHUNK:\n{chunk}"
        ),
        expected_output="Structured notes with bullets and short paragraphs.",
        agent=chunk_summarizer,
    )
  4. Run the chunk tasks through a crew and collect the outputs. This gives you one compact artifact per chunk instead of one giant prompt that fails on length.
from crewai import Crew, Process

chunk_tasks = [build_chunk_task(chunk) for chunk in chunks[:3]]  # first 3 chunks only, to keep the demo cheap; drop the slice for a full run

crew = Crew(
    agents=[chunk_summarizer],
    tasks=chunk_tasks,
    process=Process.sequential,
    verbose=True,
)

results = crew.kickoff()
print(results)
  5. Add a second agent to synthesize all chunk summaries into one final response. This is where you turn many local summaries into one global answer.
from crewai import Agent, Task, Crew, Process

synthesizer = Agent(
    role="Senior Reviewer",
    goal="Combine chunk summaries into one accurate final report",
    backstory="You identify repeated themes, contradictions, and missing details.",
    verbose=True,
)

final_task = Task(
    description=(
        "Using the provided chunk summaries, write a consolidated report with:\n"
        "- Main themes\n"
        "- Important facts\n"
        "- Open questions or inconsistencies\n"
        "- Final recommendation\n\n"
        f"SUMMARIES:\n{results}"
    ),
    expected_output="A concise executive-style report.",
    agent=synthesizer,
)

final_crew = Crew(
    agents=[synthesizer],
    tasks=[final_task],
    process=Process.sequential,
    verbose=True,
)

final_report = final_crew.kickoff()
print(final_report)
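If the document produces dozens of chunk summaries, the synthesis prompt itself can outgrow the context window. A common fix is hierarchical (map-reduce style) reduction: merge summaries in batches, then merge the merged batches, until one remains. Below is a sketch of the control flow only; `summarize_batch` is a placeholder you would back with a small synthesis crew, not a CrewAI API:

```python
from typing import Callable

def hierarchical_reduce(
    summaries: list[str],
    summarize_batch: Callable[[list[str]], str],
    batch_size: int = 8,
) -> str:
    """Repeatedly merge summaries in batches until a single one remains."""
    while len(summaries) > 1:
        summaries = [
            summarize_batch(summaries[i:i + batch_size])
            for i in range(0, len(summaries), batch_size)
        ]
    return summaries[0]

# Toy reducer: in practice each call would kick off a small synthesis crew.
merged = hierarchical_reduce(
    [f"summary {i}" for i in range(20)],
    summarize_batch=lambda batch: " | ".join(batch),
    batch_size=8,
)
print(merged.count("summary"))  # 20 — every input summary survives the reduction
```

With batch_size 8 and a 3,000-character chunk size, this keeps every individual prompt bounded no matter how long the source document is.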
  6. Wrap it in a reusable script so you can point it at any long document. In production, this is where you would add file validation, retry logic, and persistence for intermediate outputs.
from pathlib import Path
from langchain_text_splitters import RecursiveCharacterTextSplitter
from crewai import Agent, Task, Crew, Process

def analyze_document(file_path: str):
    text = Path(file_path).read_text(encoding="utf-8")
    splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=300)
    chunks = splitter.split_text(text)

    analyst = Agent(
        role="Document Analyst",
        goal="Extract structured notes from each document chunk",
        backstory="You are careful with legal and business documents.",
        verbose=False,
    )

    tasks = [
        Task(
            description=f"Extract structured notes (summary, key entities, risks) from this chunk:\n{chunk}",
            expected_output="Structured notes with bullets and short paragraphs.",
            agent=analyst,
        )
        for chunk in chunks
    ]

    crew = Crew(agents=[analyst], tasks=tasks, process=Process.sequential)
    return crew.kickoff()

print(analyze_document("long_document.txt"))
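As a sketch of the production hardening mentioned above, the helpers below add a simple retry wrapper and write each intermediate result to disk. The retry policy and file layout here are illustrative assumptions, not CrewAI features:

```python
import json
import time
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Run fn, retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable: attempts must be >= 1")

def persist_summary(out_dir: Path, chunk_index: int, summary: str) -> Path:
    """Save one chunk summary as JSON so a failed run can resume."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"chunk_{chunk_index:04d}.json"
    path.write_text(json.dumps({"chunk": chunk_index, "summary": summary}), encoding="utf-8")
    return path
```

Inside analyze_document you would wrap each kickoff in with_retries, call persist_summary per chunk, and on restart skip any chunk whose file already exists.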

Testing It

Run the script against a document that is clearly longer than a single prompt window, such as a policy manual or contract bundle. Check that you get multiple intermediate outputs rather than one truncated response.

Then inspect whether each chunk summary preserves concrete details like dates, obligations, exceptions, and named entities. If those are missing, reduce chunk_size or tighten the summarization prompt.
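One way to automate that spot check is to pull concrete tokens out of each source chunk and verify they survive into its summary. The date regex below is a crude assumption for illustration; extend it for the entity formats in your own documents:

```python
import re

DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def missing_dates(chunk: str, summary: str) -> list[str]:
    """Return ISO dates that appear in the chunk but not in its summary."""
    return [d for d in DATE_RE.findall(chunk) if d not in summary]

chunk = "The policy renews on 2025-01-15 and lapses on 2025-07-01 if unpaid."
summary = "Policy renews 2025-01-15; lapse condition mentioned."
print(missing_dates(chunk, summary))  # ['2025-07-01'] — a signal to tighten the prompt
```

A nonempty result does not always mean the summary is wrong, but it is cheap to compute for every chunk and flags exactly the omissions this section asks you to look for.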

Finally, compare the synthesized report against the original source for obvious omissions or contradictions. For banking and insurance use cases, this is where you catch missed clauses before they become workflow bugs.

Next Steps

  • Add metadata to each chunk so you can trace every summary back to its source section.
  • Store intermediate summaries in a database or object store for auditability.
  • Replace plain-text files with PDF ingestion using pypdf or OCR for scanned documents.
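For the first item, a minimal stdlib sketch of chunk metadata: record each chunk's source file and character offset so every summary can be traced back to its position in the original. The record shape is an assumption for illustration:

```python
def attach_metadata(chunks: list[str], source: str, text: str) -> list[dict]:
    """Pair each chunk with its source file and character offset in the full text."""
    records, cursor = [], 0
    for i, chunk in enumerate(chunks):
        start = text.find(chunk, cursor)  # search from the previous match onward
        records.append({"chunk_id": i, "source": source, "start": start, "text": chunk})
        cursor = start + 1 if start >= 0 else cursor
    return records

recs = attach_metadata(["bcd", "def"], "doc.txt", "abcdefg")
print(recs[0]["start"], recs[1]["start"])  # 1 3
```

Including the start offset in each chunk task's prompt metadata lets the final report cite "doc.txt, offset 5400" instead of an untraceable summary.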

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

