LlamaIndex Tutorial (Python): testing agents locally for advanced developers
This tutorial shows you how to run and test a LlamaIndex agent locally in Python, with enough structure to catch tool, memory, and routing issues before you ever point it at production. You need this when your agent works in notebooks but starts failing once you add tools, async calls, or multi-step reasoning.
What You'll Need
- Python 3.10+
- llama-index
- llama-index-llms-openai
- llama-index-tools-mcp if you want external tools later, but not needed for this tutorial
- An OpenAI API key set as OPENAI_API_KEY
- Optional: python-dotenv for loading local env vars
- A terminal and a plain .py file, not just a notebook
Step-by-Step
- Install the packages and set up a clean project. Keep the environment minimal so failures are easier to trace.
python -m venv .venv
source .venv/bin/activate
pip install llama-index llama-index-llms-openai python-dotenv
export OPENAI_API_KEY="your-key-here"
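Since python-dotenv is on the install list, you can load the key from a local .env file instead of exporting it in every shell session. A minimal sketch, assuming a .env file containing OPENAI_API_KEY=... sits next to your script:

from dotenv import load_dotenv
import os

load_dotenv()  # reads key=value pairs from .env into the environment
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"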
- Create a small local dataset and index it. This gives the agent something deterministic to query while you test tool behavior.
from llama_index.core import Document, VectorStoreIndex

docs = [
    Document(text="Policy A covers accidental damage with a $500 deductible."),
    Document(text="Policy B covers theft but excludes items left in vehicles overnight."),
]
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
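Before wrapping the engine as a tool, it's worth one direct query so you know retrieval works in isolation; if this fails, no agent on top of it will behave. A quick check:

# Bypass the agent entirely and hit the query engine directly
print(query_engine.query("What is the deductible for accidental damage?"))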
- Wrap the query engine as a tool and build an agent around it. This is the part most teams get wrong locally: they test the model, but not the tool contract.
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.llms.openai import OpenAI
from llama_index.core.agent import ReActAgent

policy_tool = QueryEngineTool(
    query_engine=query_engine,
    metadata=ToolMetadata(
        name="policy_search",
        description="Search internal policy documents for coverage details.",
    ),
)

llm = OpenAI(model="gpt-4o-mini", temperature=0)
agent = ReActAgent.from_tools([policy_tool], llm=llm, verbose=True)
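The name and description in ToolMetadata are the entire contract the LLM sees, so print them back to confirm they say what you think they say. One caveat: this tutorial uses the classic ReActAgent.from_tools interface; newer llama-index releases also ship a workflow-based agent under llama_index.core.agent.workflow, so if the import above fails, check which version you have pinned.

# The model never sees your Python code, only this metadata:
print(policy_tool.metadata.name)         # policy_search
print(policy_tool.metadata.description)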
- Add a local test harness that runs fixed prompts and prints results. Use this to validate that the agent chooses the right tool and answers consistently across runs.
test_cases = [
    "What is the deductible for accidental damage?",
    "Does Policy B cover theft from a car overnight?",
]

for prompt in test_cases:
    response = agent.chat(prompt)
    print("\nPROMPT:", prompt)
    print("ANSWER:", response.response)
- If you want repeatable local testing, put your prompts into assertions. This is where advanced teams move from “it seems fine” to actual regression testing.
def ask(text: str) -> str:
    return agent.chat(text).response.lower()

answer_1 = ask("What is the deductible for accidental damage?")
answer_2 = ask("Does Policy B cover theft from a car overnight?")

assert "$500" in answer_1
assert "exclude" in answer_2 or "not cover" in answer_2
print("Local checks passed.")
- Test failure modes too, not just happy paths. If your agent will be used by analysts or ops teams, you need to know how it behaves on ambiguous questions.
edge_cases = [
    "Summarize all policies in one sentence.",
    "What does policy C say about fire damage?",
]

for prompt in edge_cases:
    result = agent.chat(prompt).response
    print("\nPROMPT:", prompt)
    print("RESPONSE:", result)
Testing It
Run the script from your terminal and confirm that each prompt produces an answer grounded in the two sample documents. The first assertion should find $500 in the answer, and the second should find language about exclusion or non-coverage for items left in vehicles overnight.
If the agent hallucinates or ignores the tool, check three things first: your tool description, your model choice, and whether temperature=0 is set. If you see inconsistent outputs across runs, your issue is usually prompt ambiguity or too much freedom in tool selection.
For deeper validation, log both the raw response and any intermediate reasoning traces from verbose=True. That makes it easier to see whether failures come from retrieval, planning, or final answer generation.
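For that kind of trace logging, llama-index ships a built-in observability hook; a minimal sketch, assuming the "simple" global handler is available in your installed version:

from llama_index.core import set_global_handler

# Prints every LLM prompt/response pair to stdout as the agent runs
set_global_handler("simple")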
Next Steps
- Add more tools and write one assertion per tool path so you can catch routing regressions early.
- Swap the in-memory index for a persistent vector store once your local tests are stable.
- Add pytest around these prompts so every change to prompts, tools, or models gets checked automatically; a minimal sketch follows.
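A sketch of that pytest wrapper, assuming you move the agent setup from this tutorial into a hypothetical agent_under_test.py module so tests can import it:

import pytest
from agent_under_test import agent  # hypothetical module wrapping the setup above

@pytest.mark.parametrize(
    "prompt,expected",
    [
        ("What is the deductible for accidental damage?", "$500"),
        ("Does Policy B cover theft from a car overnight?", "exclude"),
    ],
)
def test_agent_answers(prompt, expected):
    answer = agent.chat(prompt).response.lower()
    assert expected in answer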
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.