CrewAI Tutorial (Python): handling long documents for advanced developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows how to build a CrewAI workflow that can ingest long documents, split them into manageable chunks, analyze each chunk, and merge the results into one usable output. You need this when a single prompt hits context limits, when you want better recall over large PDFs or reports, or when you need repeatable document processing instead of ad hoc prompting.

What You'll Need

  • Python 3.10+
  • crewai
  • crewai-tools
  • python-dotenv
  • An OpenAI API key set as OPENAI_API_KEY
  • A long text document in .txt format for the example
  • Basic familiarity with CrewAI agents, tasks, and crews

Install the packages:

pip install crewai crewai-tools python-dotenv

Step-by-Step

  1. Start by loading your environment and defining a chunking strategy. For long documents, the important part is not “one giant prompt” but controlled segmentation with overlap so you do not lose context at boundaries.
import os
from dotenv import load_dotenv

load_dotenv()

def chunk_text(text: str, chunk_size: int = 4000, overlap: int = 400):
    """Split text into overlapping chunks so boundary context is kept."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back so the tail of this chunk reappears at the head of the next.
        start = end - overlap
    return chunks

with open("long_document.txt", "r", encoding="utf-8") as f:
    document_text = f.read()

chunks = chunk_text(document_text)
print(f"Loaded {len(chunks)} chunks")
  2. Create a small set of specialized agents. One agent extracts structured findings from each chunk, and another agent consolidates the outputs into a single final answer.
from crewai import Agent

chunk_analyzer = Agent(
    role="Document Chunk Analyst",
    goal="Extract the most important facts, risks, entities, and obligations from a document chunk",
    backstory="You are precise and conservative. You only report what is supported by the text.",
    verbose=True,
)

synthesizer = Agent(
    role="Document Synthesizer",
    goal="Merge chunk-level findings into one coherent summary without duplicating points",
    backstory="You reconcile overlapping notes and produce a clean final report.",
    verbose=True,
)
  3. Turn each chunk into a task. The pattern here is simple: ask for structured output per chunk so the downstream merge step has consistent input.
from crewai import Task

tasks = []
for i, chunk in enumerate(chunks[:5]):  # keep the example bounded for execution
    tasks.append(
        Task(
            description=(
                f"Analyze chunk {i+1} of the document.\n\n"
                f"Return:\n"
                f"- key facts\n"
                f"- named entities\n"
                f"- risks or obligations\n"
                f"- any dates or numbers\n\n"
                f"Chunk text:\n{chunk}"
            ),
            expected_output="A concise bullet list of extracted findings.",
            agent=chunk_analyzer,
        )
    )

merge_task = Task(
    description=(
        "Combine all prior chunk findings into one final report.\n"
        "Remove duplicates and group related items under clear headings."
    ),
    expected_output="A consolidated executive summary with grouped insights.",
    agent=synthesizer,
)
  4. Build and run the crew. Use sequential execution here because long-document workflows usually depend on deterministic aggregation more than agent-to-agent improvisation.
from crewai import Crew, Process

crew = Crew(
    agents=[chunk_analyzer, synthesizer],
    tasks=tasks + [merge_task],
    process=Process.sequential,
    verbose=True,
)

result = crew.kickoff()
print(result)
  5. If you want a more production-friendly version, persist intermediate results before synthesis. This makes retries cheaper and lets you inspect which chunk caused bad extraction.
import json

chunk_outputs = []
for idx, task in enumerate(tasks):
    output = task.execute_sync()
    chunk_outputs.append({"chunk": idx + 1, "output": str(output)})

with open("chunk_outputs.json", "w", encoding="utf-8") as f:
    json.dump(chunk_outputs, f, indent=2)

synthesis_input = "\n\n".join(
    [f"Chunk {item['chunk']}:\n{item['output']}" for item in chunk_outputs]
)

final_task = Task(
    description=f"Synthesize these findings into one report:\n\n{synthesis_input}",
    expected_output="A consolidated report with no duplicated findings.",
    agent=synthesizer,
)

# Run the synthesis task directly, the same way the chunk tasks ran above.
final_result = final_task.execute_sync()
print(final_result)

Testing It

Run the script against a real long document that has multiple sections, repeated concepts, or dense policy language. Check that each chunk output stays focused on its own section and that the final synthesis does not repeat the same point three times.

If the result looks thin, reduce chunk_size or increase overlap so important sentences are not split across boundaries. If it looks noisy, tighten the extraction prompt so each chunk returns only facts that matter to your use case.
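The overlap behavior is easy to sanity-check in isolation before touching a real document. A minimal sketch, with `chunk_text` re-declared so the snippet runs standalone and a synthetic marker string standing in for a document:

```python
def chunk_text(text: str, chunk_size: int = 4000, overlap: int = 400):
    # Same chunker as in step 1, repeated here so this snippet is standalone.
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks

# Synthetic document: 100 position markers, so boundaries are easy to inspect.
sample = "".join(f"<s{i}>" for i in range(100))

small = chunk_text(sample, chunk_size=50, overlap=10)
large = chunk_text(sample, chunk_size=50, overlap=25)

# More overlap means more chunks over the same text...
print(len(small), len(large))  # prints: 12 19

# ...and each chunk starts with the tail of the previous one, so a
# sentence split at a boundary still appears whole in one of them.
assert small[1][:10] == small[0][-10:]
assert large[1][:25] == large[0][-25:]
```

Tuning on a toy string like this is cheap; once the boundary behavior looks right, re-run against the real document.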

For bank or insurance workflows, validate against known sections like obligations, exclusions, limits, dates, counterparties, and exceptions. Those are usually where long-document failures show up first.
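That spot check can be automated with a simple keyword sweep over the final report. A minimal sketch; the topic list and sample report below are illustrative, not taken from any real policy:

```python
# Illustrative checklist for finance/insurance reviews; adjust to your domain.
REQUIRED_TOPICS = ["obligation", "exclusion", "limit", "date", "counterpart", "exception"]

def missing_topics(report: str, topics=REQUIRED_TOPICS):
    """Return the checklist topics that never appear in the final report."""
    lowered = report.lower()
    # Substring match so "counterpart" also catches "counterparty"/"counterparties".
    return [t for t in topics if t not in lowered]

report = """Executive summary:
- Obligations: the insurer must notify within 30 days.
- Exclusions: flood damage is excluded.
- Limits: coverage capped at $2M.
- Key dates: policy renews 2026-01-01.
"""

print(missing_topics(report))  # prints: ['counterpart', 'exception']
```

A non-empty result is a cue to re-run extraction on the chunks covering those sections rather than trusting the synthesis.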

Next Steps

  • Add a retrieval layer with embeddings so you only send relevant chunks to agents instead of every chunk.
  • Replace plain-text files with PDF extraction using pymupdf or unstructured.
  • Add schema validation with Pydantic so your extracted fields are machine-checkable before synthesis.
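The retrieval idea in the first bullet can be prototyped without any embedding service. The sketch below ranks chunks against a query with bag-of-words cosine similarity; in production you would swap `vectorize` for a real embedding model, and the example strings here are invented:

```python
import math
import re
from collections import Counter

def vectorize(text: str) -> Counter:
    # Bag-of-words stand-in for a real embedding model. Swap this for an
    # embeddings API call when moving beyond a prototype.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_chunks(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank chunks by similarity to the query and keep only the top k."""
    qv = vectorize(query)
    return sorted(chunks, key=lambda c: cosine(qv, vectorize(c)), reverse=True)[:k]

chunks = [
    "Termination obligations and notice periods for either party.",
    "Office locations and general company history.",
    "Liability limits and indemnification exceptions.",
]
print(top_chunks("What are the liability limits?", chunks, k=1))
# prints: ['Liability limits and indemnification exceptions.']
```

Feeding only the top-k chunks into the analyzer tasks keeps token usage flat as documents grow, at the cost of possibly missing relevant text outside the retrieved set.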

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.
