LlamaIndex Tutorial (Python): streaming agent responses for intermediate developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to build a LlamaIndex agent in Python that streams partial responses back to the caller instead of waiting for the full answer. You need this when your app has to feel responsive, show token-by-token progress, or forward agent output into a UI, webhook, or SSE endpoint.

What You'll Need

  • Python 3.10+
  • llama-index installed
  • An OpenAI API key set as OPENAI_API_KEY
  • A terminal and a virtual environment
  • Basic familiarity with LlamaIndex agents and tools

Install the package:

pip install llama-index

Set your API key:

export OPENAI_API_KEY="your-key-here"

Step-by-Step

  1. Start with a simple tool the agent can call.
    Streaming only matters if the agent has something useful to do, so we’ll give it a small calculator tool first.
from llama_index.core.tools import FunctionTool

def multiply(a: float, b: float) -> float:
    return a * b

multiply_tool = FunctionTool.from_defaults(fn=multiply)
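Before handing the tool to an agent, you can sanity-check it directly. This is a minimal sketch assuming FunctionTool's call() method, which runs the wrapped function and returns a ToolOutput, so no LLM call is involved:

# Sanity check: invoke the tool directly, no LLM involved.
print(multiply_tool.metadata.name)   # "multiply", inferred from the function name
print(multiply_tool.call(17, 23))    # ToolOutput whose content is "391.0"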
  2. Create the agent.
    With the workflow-based FunctionAgent, streaming is built in: agent.run() returns a handler, and the handler's stream_events() method yields events as they arrive instead of buffering everything until completion, so there is no separate streaming flag to set.
import os
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])

agent = FunctionAgent(
    tools=[multiply_tool],
    llm=llm,
    system_prompt="You are a helpful assistant. Use tools when needed.",
)
  3. Run the agent and print tokens as they arrive.
    In a real service, this loop is where you would push chunks into your frontend, SSE stream, or async queue.
import asyncio

async def main():
    handler = agent.run("What is 17 times 23? Explain briefly.")
    async for event in handler.stream_events():
        if hasattr(event, "delta") and event.delta:
            print(event.delta, end="", flush=True)

    result = await handler
    print("\n\nFinal response:")
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
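The hasattr check above is deliberately loose. If you'd rather be explicit about which events carry token deltas, and also surface tool activity, you can match on event types. A sketch assuming the AgentStream, ToolCall, and ToolCallResult event classes from llama_index.core.agent.workflow:

from llama_index.core.agent.workflow import AgentStream, ToolCall, ToolCallResult

async def main_typed():
    handler = agent.run("What is 17 times 23? Explain briefly.")
    async for event in handler.stream_events():
        if isinstance(event, AgentStream):
            # A token delta from the LLM.
            print(event.delta, end="", flush=True)
        elif isinstance(event, ToolCall):
            # The agent decided to call a tool.
            print(f"\n[calling {event.tool_name} with {event.tool_kwargs}]")
        elif isinstance(event, ToolCallResult):
            # The tool finished; the agent reads this result next.
            print(f"[{event.tool_name} returned {event.tool_output}]")
    result = await handler
    print("\n\nFinal response:", result)

The isinstance checks make the intent obvious in logs and keep UI code from accidentally rendering internal workflow events as user-visible text.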
  4. Handle streamed output in a way that works in production.
    The pattern below separates raw streaming from final result handling, which makes it easier to log, trace, and retry without mixing concerns.
import asyncio

async def stream_agent_answer(prompt: str) -> tuple[str, str]:
    handler = agent.run(prompt)
    chunks = []

    async for event in handler.stream_events():
        delta = getattr(event, "delta", None)
        if delta:
            chunks.append(delta)
            print(delta, end="", flush=True)

    result = await handler
    return "".join(chunks), str(result)

if __name__ == "__main__":
    text, final = asyncio.run(stream_agent_answer("Multiply 12 by 8 and keep it short."))
    print("\n---")
    print("Streamed text:", text)
    print("Final text:", final)
  5. If you want to stream from an API endpoint later, keep the same core pattern.
    Your web layer should just forward chunks as they arrive; don’t rebuild streaming logic inside the route handler.
async def get_streamed_response(prompt: str):
    handler = agent.run(prompt)

    async for event in handler.stream_events():
        delta = getattr(event, "delta", None)
        if delta:
            yield delta

# Example usage in an async context:
# async for chunk in get_streamed_response("What is 9 times 11?"):
#     send_to_client(chunk)

Testing It

Run the script from your terminal and watch the response appear incrementally instead of all at once. If streaming is working, you should see partial text printed before the final response object is returned.

Test both a direct-answer prompt and a tool-using prompt like “What is 17 times 23?” so you can confirm the agent can still call tools while streaming output. If nothing streams until the end, check that you are iterating handler.stream_events() before awaiting the handler and that your OpenAI key is valid.

For debugging, log each event type before filtering on delta. That makes it easier to see whether you’re receiving model tokens, tool events, or other internal workflow events.
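A one-line variant of the loop makes every event visible; the exact class names you see will depend on your llama-index version:

async for event in handler.stream_events():
    # Log the event class before filtering, so you can see what the workflow emits.
    print(f"[{type(event).__name__}]", getattr(event, "delta", None) or "")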

Next Steps

  • Wire the same streaming loop into FastAPI or Starlette using Server-Sent Events; a minimal sketch follows this list.
  • Add more tools with FunctionTool and test how tool calls behave during streamed runs.
  • Learn how to persist conversation state so streamed agents can continue multi-turn workflows.
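As a starting point for the first item, here is a minimal FastAPI sketch. It reuses the get_streamed_response generator from step 5 unchanged; the /ask route, query parameter, and SSE framing (the data: prefix plus a blank line) are assumptions to adapt to your client:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/ask")
async def ask(prompt: str):
    async def sse():
        # Forward each chunk with minimal SSE framing.
        async for chunk in get_streamed_response(prompt):
            yield f"data: {chunk}\n\n"
    return StreamingResponse(sse(), media_type="text/event-stream")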

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.
