LlamaIndex Tutorial (Python): streaming agent responses for advanced developers
This tutorial shows how to build a LlamaIndex agent in Python that streams partial responses token-by-token instead of waiting for the full answer. You need this when your app has a chat UI, long-running tool calls, or any workflow where users should see progress immediately.
What You'll Need
- Python 3.10+
- llama-index
- An OpenAI API key
- A terminal and a virtual environment
- Basic familiarity with LlamaIndex AgentRunner/ReActAgent concepts
Install the required packages:
pip install llama-index llama-index-llms-openai
Set your API key in the environment:
export OPENAI_API_KEY="your-key-here"
Step-by-Step
- Start with a minimal agent that can stream output. The key idea is to create the LLM with streaming=True, then call agent.stream_chat() (or astream_chat() from async code) and iterate over the response_gen it returns.
from llama_index.core.agent import ReActAgent
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini", temperature=0, streaming=True)

agent = ReActAgent.from_tools(
    tools=[],
    llm=llm,
    verbose=False,
)

# response_gen yields tokens as they arrive instead of one final string
response = agent.stream_chat("Explain streaming in one sentence.")
for chunk in response.response_gen:
    print(chunk, end="", flush=True)
print()
- Add a real tool so the stream includes both reasoning and tool execution. This is where streaming becomes useful in production, because users can watch the model think while your backend calls external systems.
from llama_index.core.tools import FunctionTool

def get_policy_status(policy_id: str) -> str:
    # Stand-in for a real backend lookup (database, policy admin system, etc.)
    return f"Policy {policy_id} is active and paid through 2026-01-31."

policy_tool = FunctionTool.from_defaults(fn=get_policy_status)

agent = ReActAgent.from_tools(
    tools=[policy_tool],
    llm=llm,
    verbose=True,
)

response = agent.stream_chat("Check policy 12345 and summarize the status.")
for chunk in response.response_gen:
    print(chunk, end="", flush=True)
print()
- If you need structured access to the final answer while still streaming tokens to the user, keep the generator for display and read the final response afterward. This pattern works well when your frontend consumes chunks and your backend stores the completed result.
response = agent.stream_chat("What is policy 12345 status?")
partial_text = []
for chunk in response.response_gen:
partial_text.append(chunk)
print(chunk, end="", flush=True)
final_answer = "".join(partial_text)
print("\n\nFinal answer captured:")
print(final_answer)
- For chat applications, preserve conversation state with a memory object so streaming responses stay contextual across turns. Without memory, each streamed turn is isolated and you lose multi-step behavior.
from llama_index.core.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=2000)

agent = ReActAgent.from_tools(
    tools=[policy_tool],
    llm=llm,
    memory=memory,
    verbose=False,
)

first = agent.stream_chat("Remember that policy 12345 belongs to Alice.")
for chunk in first.response_gen:
    print(chunk, end="", flush=True)
print()

second = agent.stream_chat("What policy did I mention?")
for chunk in second.response_gen:
    print(chunk, end="", flush=True)
print()
- Wrap streaming in a helper function before wiring it into FastAPI, Streamlit, or a websocket server. That keeps your transport layer thin and makes it easier to test the agent logic independently.
def stream_agent_reply(agent, prompt: str) -> str:
    response = agent.stream_chat(prompt)
    output = []
    for chunk in response.response_gen:
        output.append(chunk)
        print(chunk, end="", flush=True)
    print()
    return "".join(output)

result = stream_agent_reply(agent, "Summarize policy 12345 in one paragraph.")
print("Stored result length:", len(result))
Testing It
Run the script from your terminal and confirm you see text appearing before the full answer completes. If nothing streams until the end, check that you're calling stream_chat() rather than chat(), that you're iterating over response.response_gen instead of printing the response object, and that streaming=True is set on the LLM instance.
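If you want an objective check rather than eyeballing the terminal, time how long the first chunk takes relative to the full answer. The helper below is a small illustrative sketch (not part of LlamaIndex); it assumes the agent built in the earlier steps. If the first chunk only shows up when the total is nearly done, streaming is not actually happening.

import time

def check_streaming(agent, prompt: str) -> None:
    start = time.perf_counter()
    first_chunk_at = None
    response = agent.stream_chat(prompt)
    for chunk in response.response_gen:
        if first_chunk_at is None:
            # Record when the very first token arrived
            first_chunk_at = time.perf_counter() - start
        print(chunk, end="", flush=True)
    total = time.perf_counter() - start
    if first_chunk_at is None:
        print("\nNo chunks were streamed.")
    else:
        print(f"\nFirst chunk after {first_chunk_at:.2f}s, full answer after {total:.2f}s")

check_streaming(agent, "Explain streaming in two sentences.")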
Then test a prompt that triggers your tool so you can verify both streamed text and tool execution logs appear. If you enabled verbose=True, you should see intermediate agent steps while tokens still flow to stdout.
Finally, test conversation continuity by asking two related questions in sequence. If memory is wired correctly, the second response should reference context from the first turn.
Next Steps
- Wire response.response_gen into a FastAPI websocket endpoint for browser-based live updates (a sketch follows this list).
- Add retries and timeouts around tool functions so streaming does not hide backend failures (second sketch below).
- Explore ContextChatEngine if you want streaming over retrieval-backed conversations instead of direct agent tools (third sketch below).
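First, a minimal sketch of the websocket idea. It assumes the agent from the steps above is in scope, and that the async counterparts astream_chat() and async_response_gen() are available in your llama-index version; the route path and one-prompt-in, tokens-out protocol are illustrative choices, not a fixed convention.

from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws/agent")
async def agent_ws(websocket: WebSocket):
    await websocket.accept()
    # One prompt in, one streamed answer out; adapt the protocol to your UI
    prompt = await websocket.receive_text()
    response = await agent.astream_chat(prompt)
    async for token in response.async_response_gen():
        await websocket.send_text(token)
    await websocket.close()

Run it with uvicorn (for example, uvicorn app:app --reload if the file is named app.py) and connect from the browser with a plain WebSocket.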
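Second, a rough standard-library sketch of the retry-and-timeout idea. It reuses FunctionTool and get_policy_status from the earlier steps; the wrapper name, retry count, and fallback error string are illustrative. Note that a timed-out call keeps running in its worker thread, so backends that need real cancellation require more work.

import concurrent.futures

_tool_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def with_retry_and_timeout(fn, retries: int = 2, timeout_s: float = 5.0):
    def wrapped(*args, **kwargs):
        last_error = None
        for _ in range(retries + 1):
            future = _tool_pool.submit(fn, *args, **kwargs)
            try:
                return future.result(timeout=timeout_s)
            except Exception as exc:  # timeout or backend failure
                last_error = exc
        # Returning a string keeps the agent loop alive instead of crashing it
        return f"Tool call failed after {retries + 1} attempts: {last_error}"
    return wrapped

safe_policy_tool = FunctionTool.from_defaults(
    fn=with_retry_and_timeout(get_policy_status),
    name="get_policy_status",
    description="Look up the status of a policy by ID.",
)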
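Third, a sketch of the ContextChatEngine direction. It assumes you have documents in a local ./docs folder (a hypothetical path) and the default OpenAI embeddings that ship with the llama-index meta package; chat_mode="context" builds a ContextChatEngine over the index.

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)
chat_engine = index.as_chat_engine(chat_mode="context", llm=llm)

# stream_chat on the chat engine exposes the same response_gen pattern
response = chat_engine.stream_chat("What do the documents say about renewals?")
for token in response.response_gen:
    print(token, end="", flush=True)
print()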
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.