Haystack Tutorial (Python): testing agents locally for advanced developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to build and test a Haystack agent locally in Python, with a tight loop for debugging tool calls, prompts, and outputs. You need this when you want deterministic agent tests before wiring anything to production APIs, especially for banking and insurance workflows where bad tool selection is expensive.

What You'll Need

  • Python 3.10+
  • pip or uv
  • haystack-ai
  • openai or another chat model provider supported by Haystack
  • An API key for your chosen model provider
  • pytest if you want to turn the local checks into automated tests

Install the basics:

pip install haystack-ai openai pytest

Set your model key before running anything:

export OPENAI_API_KEY="your-key-here"
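
On Windows PowerShell, the equivalent is:

$env:OPENAI_API_KEY = "your-key-here"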

Step-by-Step

  1. Start with a small agent surface area.
    For local testing, keep the tool set tiny so you can see exactly why the agent picked a tool and what it returned. Haystack's Agent component wraps the chat generator and runs the tool-execution loop (model call, tool invocation, final reply) for you, so the replies you test contain real tool output.
from typing import Annotated

from haystack.components.agents import Agent
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.tools import Tool

def lookup_policy(policy_id: Annotated[str, "Policy ID"]) -> str:
    # Canned data keeps local tests deterministic; swap in a real lookup later.
    if policy_id == "P-1001":
        return "Policy P-1001: active, premium paid, renewal due in 32 days."
    return f"Policy {policy_id}: not found."

policy_tool = Tool(
    name="lookup_policy",
    description="Look up a policy by ID.",
    parameters={
        "type": "object",
        "properties": {
            "policy_id": {"type": "string"}
        },
        "required": ["policy_id"],
    },
    function=lookup_policy,
)

agent = Agent(
    chat_generator=OpenAIChatGenerator(model="gpt-4o-mini"),
    tools=[policy_tool],
    system_prompt="You are a support agent. Use tools when needed and answer concisely.",
)
agent.warm_up()
  2. Build a local test harness around one user request.
    This is the part most teams skip. Wrap the agent call in a function so you can run it from a REPL, a script, or a test file without changing code.
def run_agent(user_text: str) -> str:
    # Agent.run executes the full loop: model reply -> tool call ->
    # tool result -> final text answer.
    result = agent.run(messages=[ChatMessage.from_user(user_text)])
    return result["messages"][-1].text

if __name__ == "__main__":
    print(run_agent("Check policy P-1001 and tell me its status."))
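    When a run fails, it helps to see the whole exchange, not just the final answer. The Agent's output includes the full message trace; a quick inspection loop (a sketch, assuming the Agent's default output format) looks like this:
result = agent.run(messages=[ChatMessage.from_user("Check policy P-1001.")])
for message in result["messages"]:
    # Print the role, any tool calls the model made, and the text
    # (text is None for pure tool-call turns).
    print(message.role, message.tool_calls, message.text)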
  3. Add deterministic assertions for local verification.
    You do not want to inspect outputs manually every time. Check that the response contains expected business facts and that the tool path is being used on known inputs.
def test_policy_lookup_happy_path():
    reply = run_agent("Check policy P-1001 and tell me its status.")
    assert "P-1001" in reply
    assert "active" in reply.lower()

def test_policy_lookup_unknown_id():
    reply = run_agent("Check policy P-9999.")
    assert "not found" in reply.lower()
  4. Test tool behavior directly before testing the full agent loop.
    This isolates failures faster than debugging everything through the LLM. If the tool is wrong, no prompt engineering will save you.
def test_tool_function_directly():
    assert lookup_policy("P-1001") == (
        "Policy P-1001: active, premium paid, renewal due in 32 days."
    )
    assert lookup_policy("P-9999") == "Policy P-9999: not found."

if __name__ == "__main__":
    test_tool_function_directly()
    print("direct tool checks passed")
  5. Run the script and then run pytest locally.
    The first pass confirms your model credentials and Haystack setup are correct. The second pass turns your checks into repeatable regression tests.
python your_script.py
pytest -q

Testing It

Start by running the script with a few known prompts and confirm the answer matches your expected business logic. Then run pytest and make sure both direct tool tests and agent-level tests pass consistently.

If responses vary too much, tighten the system prompt and lower the sampling temperature where your provider wrapper supports it. For production-grade testing, keep one set of fixed prompts as regression cases for every important workflow branch.
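
With the OpenAI generator, for example, you can pin sampling behavior through its generation_kwargs parameter (a sketch; option names vary by provider wrapper):

agent = Agent(
    chat_generator=OpenAIChatGenerator(
        model="gpt-4o-mini",
        generation_kwargs={"temperature": 0.0},  # reduce sampling randomness
    ),
    tools=[policy_tool],
    system_prompt="You are a support agent. Use tools when needed and answer concisely.",
)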

Next Steps

  • Add more tools and test routing decisions with one prompt per tool.
  • Replace ad hoc assertions with structured output validation using Pydantic models (see the sketch after this list).
  • Move from direct generator calls to an explicit Haystack pipeline when you need multi-step orchestration and traceable execution paths.
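
A minimal validation sketch for that second item, assuming Pydantic v2 and a hypothetical PolicyStatus schema (prompt the agent to answer in JSON first):

from pydantic import BaseModel, ValidationError

class PolicyStatus(BaseModel):
    # Hypothetical schema for illustration; align it with your real reply format.
    policy_id: str
    active: bool

def validate_reply(raw_json: str) -> PolicyStatus:
    # Raises ValidationError when the agent's output drifts from the schema.
    return PolicyStatus.model_validate_json(raw_json)

try:
    status = validate_reply('{"policy_id": "P-1001", "active": true}')
except ValidationError as err:
    print(err)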

By Cyprian Aarons, AI Consultant at Topiax.