AutoGen Tutorial (Python): mocking LLM calls in tests for advanced developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows how to replace real LLM calls in AutoGen tests with deterministic fakes so your test suite stays fast, offline, and stable. You need this when you want to verify agent logic, message routing, tool execution, and retries without paying for tokens or depending on model nondeterminism.

What You'll Need

  • Python 3.10+
  • pyautogen installed
  • pytest installed
  • No OpenAI API key required for the mocked tests
  • Optional: an OpenAI API key if you want to compare mocked behavior with a real run later

Install the dependencies:

pip install pyautogen pytest

Step-by-Step

  1. Create a small AutoGen setup with a swappable model client.
    The trick is to keep your production agent code intact and swap only the model client in tests.
from autogen import AssistantAgent, UserProxyAgent

# In production this agent would carry a real llm_config (a config_list with a
# model, api_key, timeout, temperature, and so on). For mocked tests we pass
# llm_config=False so AutoGen never constructs a real OpenAI client, which an
# empty config_list would still attempt (and fail without an API key). The
# fake client is attached in the tests below.
assistant = AssistantAgent(
    name="assistant",
    llm_config=False,
)

user = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    code_execution_config=False,
)
  2. Build a fake LLM client that returns fixed responses.
    This lets you test the orchestration layer without calling any external API.
from types import SimpleNamespace

class FakeLLMClient:
    """Stand-in for AutoGen's OpenAIWrapper that replays scripted replies."""

    def __init__(self, replies):
        self.replies = list(replies)
        self.calls = []  # every request is recorded for later assertions

    def create(self, messages, **kwargs):
        self.calls.append({"messages": messages, "kwargs": kwargs})
        content = self.replies.pop(0) if self.replies else "default mocked reply"
        # Mimic just enough of an OpenAI chat completion response object.
        return SimpleNamespace(
            choices=[SimpleNamespace(message=SimpleNamespace(content=content))]
        )

    @staticmethod
    def extract_text_or_completion_object(response):
        # AutoGen calls this on its client to pull the text out of a response,
        # so the fake must provide it as well.
        return [choice.message.content for choice in response.choices]
  3. Wire the fake client into an AutoGen agent for tests.
    In AutoGen, the model client is what ultimately produces completions, so replacing it gives you deterministic behavior.
from autogen import AssistantAgent

fake_client = FakeLLMClient(["Mocked answer from the assistant"])

assistant = AssistantAgent(
    name="assistant",
    llm_config=False,  # no real client; the fake is injected below
)

# ConversableAgent keeps its model client on the `client` attribute, so
# replacing it swaps the completion backend without touching agent logic.
assistant.client = fake_client

result = assistant.generate_reply(messages=[{"role": "user", "content": "Say hello"}])
print(result)  # "Mocked answer from the assistant"
print(fake_client.calls[0]["messages"][-1]["content"])  # "Say hello"
  4. Use pytest to assert both the returned text and the request payload.
    Good tests check not just the output but also that your agent sent the right prompt structure.
# test_agent_mock.py
from types import SimpleNamespace
from autogen import AssistantAgent

class FakeLLMClient:
    def __init__(self, replies):
        self.replies = list(replies)
        self.calls = []

    def create(self, messages, **kwargs):
        self.calls.append({"messages": messages, "kwargs": kwargs})
        content = self.replies.pop(0)
        return SimpleNamespace(
            choices=[SimpleNamespace(message=SimpleNamespace(content=content))]
        )

    @staticmethod
    def extract_text_or_completion_object(response):
        return [choice.message.content for choice in response.choices]

def test_assistant_uses_mocked_llm():
    fake_client = FakeLLMClient(["Approved"])
    assistant = AssistantAgent(name="assistant", llm_config=False)
    assistant.client = fake_client

    reply = assistant.generate_reply(messages=[{"role": "user", "content": "Approve claim"}])

    assert reply == "Approved"
    assert fake_client.calls[0]["messages"][-1]["content"] == "Approve claim"
  5. Mock multi-turn behavior when your agent logic depends on sequence.
    This is useful for workflows like validation followed by correction or escalation.
from types import SimpleNamespace
from autogen import AssistantAgent

class FakeLLMClient:
    def __init__(self, replies):
        self.replies = list(replies)

    def create(self, messages, **kwargs):
        content = self.replies.pop(0)
        return SimpleNamespace(
            choices=[SimpleNamespace(message=SimpleNamespace(content=content))]
        )

    @staticmethod
    def extract_text_or_completion_object(response):
        return [choice.message.content for choice in response.choices]

fake_client = FakeLLMClient([
    "I need more information.",
    "Final answer: proceed.",
])

assistant = AssistantAgent(name="assistant", llm_config=False)
assistant.client = fake_client

first = assistant.generate_reply(messages=[{"role": "user", "content": "Process request"}])
second = assistant.generate_reply(messages=[{"role": "user", "content": "Here are the details"}])

print(first)   # "I need more information."
print(second)  # "Final answer: proceed."

Testing It

Run pytest -q and confirm the test passes without any network access. If your setup is correct, the assertions should verify both the exact response string and the prompt sent into create(). For a stronger check, add one test that exhausts the fake reply list and confirm your fallback behavior is explicit rather than accidental.
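
Here is one way to write that exhaustion test, sketched against the step 2 FakeLLMClient (the variant with the "default mocked reply" fallback):

def test_fallback_reply_is_explicit():
    fake_client = FakeLLMClient([])  # no scripted replies at all
    assistant = AssistantAgent(name="assistant", llm_config=False)
    assistant.client = fake_client

    reply = assistant.generate_reply(messages=[{"role": "user", "content": "Anything"}])

    # If this assertion surprises you, the fallback was accidental.
    assert reply == "default mocked reply"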

If you see real API traffic during tests, your agent is still using a live config somewhere in your stack. That usually means you patched the wrong object or instantiated a second agent outside the test boundary.
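
One way to catch this early is an autouse fixture that fails loudly whenever a test constructs a real OpenAI client. This is a sketch, assuming the openai package (a pyautogen dependency) is importable:

# conftest.py
import pytest

@pytest.fixture(autouse=True)
def block_real_openai(monkeypatch):
    def _blocked(*args, **kwargs):
        raise RuntimeError("a real OpenAI client was constructed during tests")

    # Any code path that instantiates openai.OpenAI now fails the test.
    monkeypatch.setattr("openai.OpenAI", _blocked)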

Next Steps

  • Add mocks for tool calls so you can test function-calling flows end to end.
  • Wrap this pattern in pytest fixtures so every test gets a clean fake client (see the sketch after this list).
  • Compare this approach with AutoGen’s built-in caching if you also want replayable integration tests.
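
For the fixture idea above, a minimal sketch reusing the FakeLLMClient from step 4 (the fixture names are illustrative):

import pytest

@pytest.fixture
def fake_client():
    return FakeLLMClient(["Approved"])

@pytest.fixture
def assistant(fake_client):
    agent = AssistantAgent(name="assistant", llm_config=False)
    agent.client = fake_client  # every test gets a fresh fake
    return agent

def test_with_fixtures(assistant, fake_client):
    reply = assistant.generate_reply(messages=[{"role": "user", "content": "Approve claim"}])
    assert reply == "Approved"
    assert fake_client.calls  # the request was recorded by the fake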

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
