Haystack Tutorial (Python): testing agents locally for intermediate developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to build and test a Haystack agent locally in Python without wiring it into your production stack. You need this when you want fast feedback on tool use, prompt changes, and failure handling before the agent ever touches real users or internal systems.

What You'll Need

  • Python 3.10 or newer
  • pip and a virtual environment
  • Haystack installed locally
  • An OpenAI API key for the generator model
  • Basic familiarity with Haystack pipelines, components, and documents

Install the packages first:

pip install haystack-ai openai python-dotenv

Set your API key in the shell:

export OPENAI_API_KEY="your-key-here"

Step-by-Step

  1. Create a small local workspace and load your environment variables. Keep this isolated so you can run the same test repeatedly without touching shared services.
from dotenv import load_dotenv
import os

load_dotenv()

api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set")

print("API key loaded")
  2. Build a tiny tool the agent can call. For local testing, use deterministic tools first so you can tell whether failures come from the model or from your business logic.
from haystack.tools import Tool

def lookup_policy_status(policy_id: str) -> str:
    mock_db = {
        "POL-1001": "Active",
        "POL-1002": "Pending payment",
        "POL-1003": "Cancelled",
    }
    return mock_db.get(policy_id, "Policy not found")

policy_tool = Tool(
    name="lookup_policy_status",
    description="Look up the current status of an insurance policy by policy ID.",
    parameters={"type": "object", "properties": {"policy_id": {"type": "string"}}, "required": ["policy_id"]},
    function=lookup_policy_status,
)
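
Before the model gets involved, you can exercise the tool wrapper directly. The line below is a quick sanity check; it assumes your Haystack version exposes Tool.invoke (if it does not, call lookup_policy_status directly with the same argument).

# Sanity-check the tool wrapper itself, with no model in the loop.
# Assumes Tool.invoke is available; otherwise call lookup_policy_status("POL-1003") directly.
print(policy_tool.invoke(policy_id="POL-1003"))  # expected: "Cancelled"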
  3. Create a simple agent pipeline with a chat generator and attach the tool. This keeps the setup close to how you would run it in production, but still easy to inspect locally.
from haystack import Pipeline
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

generator = OpenAIChatGenerator(model="gpt-4o-mini", generation_kwargs={"temperature": 0})

pipeline = Pipeline()
pipeline.add_component("llm", generator)

messages = [
    ChatMessage.from_system(
        "You are a support agent for an insurance company. "
        "Use tools when needed and answer briefly."
    ),
    ChatMessage.from_user("Check policy POL-1002 and tell me its status."),
]
  4. Run the model once without tools to see baseline behavior, then wire in the tool call loop. This is useful because you want to know whether the model can answer directly or whether it needs structured assistance.
result = pipeline.run({"llm": {"messages": messages}})
print(result["llm"]["replies"][0].content)
  5. Test tool use explicitly by giving the model a task that should trigger the tool. In local testing, inspect both the final answer and whether the tool returned what you expected.
tool_messages = [
    ChatMessage.from_system(
        "You are a support agent for an insurance company. "
        "If asked about policy status, call lookup_policy_status."
    ),
    ChatMessage.from_user("What is the status of policy POL-1001?"),
]

result = generator.run(messages=tool_messages, tools=[policy_tool])
reply = result["replies"][0]
print(reply.text)  # may be None when the model opts for a tool call; older Haystack releases used .content
print(reply.tool_calls)
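
If the reply does contain a tool call, you can close the loop by hand: run the tool yourself and send the result back as a tool message. The snippet below is a minimal sketch of that round trip, assuming the reply holds at least one tool call and that your Haystack version exposes ChatMessage.from_tool; follow_up is just a local variable name.

# Close the loop manually: execute the requested tool call and return its result to the model.
if reply.tool_calls:
    tool_call = reply.tool_calls[0]
    tool_output = lookup_policy_status(**tool_call.arguments)  # run the deterministic tool locally

    follow_up = tool_messages + [
        reply,  # the assistant message carrying the tool call
        ChatMessage.from_tool(tool_result=tool_output, origin=tool_call),  # the tool result message
    ]
    final = generator.run(messages=follow_up)
    print(final["replies"][0].text)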
  6. Add assertions so your local test behaves like a unit test instead of an interactive notebook. That gives you repeatable checks when prompts or tools change.
def test_policy_lookup():
    assert lookup_policy_status("POL-1001") == "Active"
    assert lookup_policy_status("POL-9999") == "Policy not found"

test_policy_lookup()
print("Local tests passed")

Testing It

Run the script end to end and confirm three things: your environment loads, the tool returns deterministic output, and the model produces a sensible response for a policy-status question. If you get an empty or off-target response, check whether the prompt actually instructs the model to use the tool and whether your model supports tool calls in that configuration.

For better signal, keep temperature=0 while testing. That minimizes randomness, so changes in output usually mean something changed in your code, prompt, or tool schema.

If you want stronger validation, wrap these checks in pytest and add cases for missing policy IDs, malformed inputs, and unexpected user questions.
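
As one possible starting point, here is a minimal pytest sketch; the module name agent_local is hypothetical and stands in for wherever you saved the tutorial code.

import pytest

from agent_local import lookup_policy_status  # hypothetical module holding the tool function

@pytest.mark.parametrize(
    "policy_id, expected",
    [
        ("POL-1001", "Active"),            # happy path
        ("POL-1002", "Pending payment"),
        ("POL-9999", "Policy not found"),  # missing policy ID
        ("", "Policy not found"),          # malformed/empty input
    ],
)
def test_lookup_policy_status(policy_id, expected):
    assert lookup_policy_status(policy_id) == expected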

Next Steps

  • Add more tools and test routing between them with separate assertions.
  • Replace the mock lookup function with a real internal service client behind a thin interface (see the sketch after this list).
  • Move from single-turn tests to multi-turn conversation tests so you can validate memory and follow-up behavior.
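
For the second point above, one way to keep the swap painless is a Protocol-based interface. The sketch below is an assumption about how you might structure it; a real client implementing the same get_status method would slot in behind the Tool without changing the rest of the pipeline.

from typing import Protocol

from haystack.tools import Tool

class PolicyStatusClient(Protocol):
    def get_status(self, policy_id: str) -> str: ...

class MockPolicyClient:
    """Deterministic client for local tests."""
    def get_status(self, policy_id: str) -> str:
        return {"POL-1001": "Active"}.get(policy_id, "Policy not found")

def make_policy_tool(client: PolicyStatusClient) -> Tool:
    # The agent only ever sees the Tool; swapping MockPolicyClient for a real
    # client changes nothing else in the pipeline.
    return Tool(
        name="lookup_policy_status",
        description="Look up the current status of an insurance policy by policy ID.",
        parameters={"type": "object", "properties": {"policy_id": {"type": "string"}}, "required": ["policy_id"]},
        function=client.get_status,
    )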

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
