LlamaIndex Tutorial (Python): streaming agent responses for beginners
This tutorial shows you how to build a Python LlamaIndex agent that streams its response token-by-token instead of waiting for the full answer. You need this when you want a chat UI, terminal app, or API endpoint to feel responsive while the model is still generating output.
What You'll Need
- Python 3.10+
- An OpenAI API key set as `OPENAI_API_KEY`
- `llama-index` installed
- `python-dotenv` if you want to load environment variables from a `.env` file
- Basic familiarity with LlamaIndex agents and `QueryEngineTool`
Install the packages:
```bash
pip install llama-index python-dotenv
```
Step-by-Step
1. Start by setting your OpenAI key in the environment. LlamaIndex will read it automatically through the OpenAI integration.

```python
import os

os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
```
2. Create a simple tool the agent can call. For beginners, a local `FunctionTool` is easier than wiring up a full vector index.
```python
from llama_index.core.tools import FunctionTool

def get_policy_status(policy_id: str) -> str:
    return f"Policy {policy_id} is active and next renewal is 2026-01-15."

policy_tool = FunctionTool.from_defaults(
    fn=get_policy_status,
    name="get_policy_status",
    description="Look up the status of an insurance policy by policy ID.",
)
```
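Before wiring the tool into an agent, you can call it directly to confirm the wrapper works. This is a quick sketch; in recent LlamaIndex versions `FunctionTool.call()` returns a `ToolOutput` whose `content` field holds the string result, so check your installed version if the attribute differs.

```python
# Quick sanity check: invoke the tool directly, with no LLM or agent involved.
result = policy_tool.call(policy_id="12345")
print(result.content)  # expected: "Policy 12345 is active and next renewal is 2026-01-15."
```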
3. Build an agent with that tool. With `FunctionAgent` you do not need a special constructor flag for streaming: the handler returned by `agent.run()` emits events as the model generates, and you will consume them in the next step. Pass an LLM explicitly (for example `gpt-4o-mini`) or configure one globally via `Settings.llm`.

```python
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI

agent = FunctionAgent(
    tools=[policy_tool],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are an insurance support assistant. Use tools when needed.",
)
```
4. Run a streaming chat request and print each chunk as it arrives. This is the core pattern you will use in a CLI, WebSocket, or server-sent events handler.

```python
import asyncio

async def main():
    handler = agent.run("Check policy 12345 and tell me the status.")
    async for chunk in handler.stream_events():
        if hasattr(chunk, "delta") and chunk.delta:
            print(chunk.delta, end="", flush=True)

asyncio.run(main())
```
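The `hasattr` check above works, but if you also want to surface tool activity you can match on the event classes the workflow emits. The sketch below assumes the `AgentStream` and `ToolCallResult` events exported by recent `llama_index.core.agent.workflow` releases; names may differ in older versions.

```python
import asyncio

from llama_index.core.agent.workflow import AgentStream, ToolCallResult

async def main():
    handler = agent.run("Check policy 12345 and tell me the status.")
    async for event in handler.stream_events():
        if isinstance(event, AgentStream):
            # Token-by-token text from the LLM.
            print(event.delta, end="", flush=True)
        elif isinstance(event, ToolCallResult):
            # Emitted once a tool call finishes; useful for logging and debugging.
            print(f"\n[tool {event.tool_name} -> {event.tool_output}]")

asyncio.run(main())
```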
5. If you want cleaner terminal output, collect the streamed text into a buffer while still printing live updates. This makes it easier to log the final response after streaming finishes.

```python
import asyncio

async def main():
    handler = agent.run("Check policy 12345 and tell me the status.")
    full_text = []
    async for chunk in handler.stream_events():
        if hasattr(chunk, "delta") and chunk.delta:
            print(chunk.delta, end="", flush=True)
            full_text.append(chunk.delta)
    print("\n\nFinal response:")
    print("".join(full_text))

asyncio.run(main())
```
Testing It
Run the script from your terminal and watch for text to appear immediately instead of all at once at the end. If streaming is working, you should see partial output printed before the full answer completes.
Try changing the prompt so the agent clearly needs the tool, such as asking for a specific policy ID. If the tool call happens correctly, you should see the final answer include the mocked policy status.
If nothing streams, check three things: your OpenAI key is set, your installed LlamaIndex version matches the imports above, and your event loop is running correctly. In practice, most failures come from version mismatches or missing API credentials.
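A small diagnostic script can rule out the first two causes quickly. The package names below are what the `llama-index` meta package typically installs, so adjust them if your environment differs:

```python
# Check the two most common causes of "nothing streams": missing key, wrong versions.
import os
from importlib.metadata import PackageNotFoundError, version

print("OPENAI_API_KEY set:", bool(os.environ.get("OPENAI_API_KEY")))
for pkg in ("llama-index", "llama-index-core", "llama-index-llms-openai"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```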
Next Steps
- Replace the mock function with a real internal API call for policy lookup or claims status.
- Add a `QueryEngineTool` backed by a vector index so the agent can stream answers from documents.
- Wire this pattern into FastAPI with Server-Sent Events for browser-based streaming responses; see the sketch below.
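For the FastAPI route, here is a minimal sketch of what a server-sent events endpoint could look like. It assumes the `agent` built in this tutorial is importable from your own module (shown here as a hypothetical `my_agent`), and that you install `fastapi` and `uvicorn` separately:

```python
# Hedged sketch: stream agent output to the browser via Server-Sent Events.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

from my_agent import agent  # hypothetical module exposing the agent from this tutorial

app = FastAPI()

@app.get("/chat")
async def chat(q: str):
    async def event_stream():
        handler = agent.run(q)
        async for chunk in handler.stream_events():
            if hasattr(chunk, "delta") and chunk.delta:
                # Each SSE frame is a "data:" line followed by a blank line.
                yield f"data: {chunk.delta}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```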
Keep learning
- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit (PDF checklist + starter code)
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit