AutoGen Tutorial (Python): building a RAG pipeline for intermediate developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to build a working Retrieval-Augmented Generation (RAG) pipeline in Python using AutoGen, FAISS, and OpenAI-compatible embeddings. You need this when your assistant must answer from your own documents instead of guessing from model memory.

What You'll Need

  • Python 3.10+
  • autogen-agentchat
  • autogen-ext (installed with the openai extra)
  • openai
  • faiss-cpu
  • numpy
  • An OpenAI API key in OPENAI_API_KEY
  • A small document set to index, such as policy docs, runbooks, or product specs

Install the packages:

pip install autogen-agentchat "autogen-ext[openai]" openai faiss-cpu numpy

Step-by-Step

  1. Start by creating a tiny document store and embedding helper. For production, you would load files from disk or object storage, but this example keeps the data inline so the full pipeline is runnable end-to-end.
import os
import numpy as np
import faiss
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

documents = [
    "AutoGen lets you build multi-agent workflows in Python.",
    "RAG combines retrieval with generation so answers come from source documents.",
    "FAISS is a fast vector index for similarity search.",
    "Chunking documents improves retrieval quality for long texts."
]

def embed_texts(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    vectors = [item.embedding for item in response.data]
    return np.array(vectors, dtype="float32")
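As a quick, optional sanity check, you can confirm the helper returns one float32 row per input before building the index; text-embedding-3-small produces 1536-dimensional vectors by default.

# Optional check: one embedding row per input text.
sample = embed_texts(["hello", "world"])
print(sample.shape, sample.dtype)  # expected: (2, 1536) float32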
  2. Build a FAISS index over the document embeddings. The important part is to keep the mapping between vector rows and source text so you can return grounded context later.
doc_vectors = embed_texts(documents)
dimension = doc_vectors.shape[1]

index = faiss.IndexFlatIP(dimension)
faiss.normalize_L2(doc_vectors)
index.add(doc_vectors)

def retrieve(query: str, k: int = 2) -> list[str]:
    query_vector = embed_texts([query])
    faiss.normalize_L2(query_vector)
    scores, ids = index.search(query_vector, k)
    return [documents[i] for i in ids[0] if i != -1]
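Before involving any agent, it is worth confirming retrieval on its own with a query that obviously maps to one of the indexed documents. This small check is not part of the pipeline itself.

# Smoke test: the top hit should be the RAG sentence from the document list.
print(retrieve("What does RAG combine?", k=2))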
  3. Wire retrieval into an AutoGen agent through a tool function. This pattern keeps retrieval deterministic and lets the LLM focus on synthesis instead of doing search itself.
from autogen_agentchat.agents import AssistantAgent
from autogen_ext.models.openai import OpenAIChatCompletionClient

def get_context(query: str) -> str:
    chunks = retrieve(query, k=2)
    return "\n".join(f"- {chunk}" for chunk in chunks)

# Shared model client for the AutoGen agents; it reads OPENAI_API_KEY from the environment.
model_client = OpenAIChatCompletionClient(model="gpt-4o-mini")

agent = AssistantAgent(
    name="rag_assistant",
    model_client=model_client,
    tools=[get_context],  # plain callables are wrapped as tools by AssistantAgent
    system_message="Answer using only the context returned by the get_context tool.",
)

def build_prompt(question: str) -> str:
    context = get_context(question)
    return (
        "Answer only using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Return a concise answer with no speculation."
    )
  4. Add the actual generation call using the OpenAI chat API. AutoGen handles agent orchestration well, but for a compact RAG tutorial it is cleaner to keep retrieval separate and call the model with grounded context directly.
def answer_question(question: str) -> str:
    prompt = build_prompt(question)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a precise assistant that answers from provided context."},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    question = "What does RAG combine?"
    print(answer_question(question))
  5. If you want AutoGen to own the conversation loop, wrap the same retrieval logic as a callable tool. This is the version you extend when multiple agents need access to shared knowledge.
from autogen_core.tools import FunctionTool

rag_tool = FunctionTool(
    get_context,
    description="Retrieve relevant context chunks for a user question.",
)

tool_enabled_agent = AssistantAgent(
    name="tool_rag_assistant",
    model_client=model_client,
    tools=[rag_tool],
    reflect_on_tool_use=True,  # answer from the tool output instead of echoing it verbatim
    system_message="Answer using only the context returned by the retrieval tool.",
)

print("Tool registered:", rag_tool.name)

Testing It

Run the script and ask questions that clearly map to your sample documents, like “What is FAISS used for?” or “Why do we chunk documents?”. You should see answers that directly reflect retrieved text rather than generic model output.

If you get empty or irrelevant results, check three things first: your embedding model name, whether FAISS normalization is applied consistently, and whether your query text actually overlaps with indexed content. For real documents, test with known facts from each file before wiring this into an agent workflow.

A good sanity check is to print the retrieved chunks before generation. If retrieval looks correct but answers are weak, tighten the system prompt so the model does not invent missing details.
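One way to do that, reusing the functions defined above, is a small wrapper (answer_with_trace is just an illustrative name):

# Debug helper: show what retrieval returns before the model sees it.
def answer_with_trace(question: str) -> str:
    for chunk in retrieve(question, k=2):
        print("retrieved:", chunk)
    return answer_question(question)

print(answer_with_trace("What is FAISS used for?"))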

Next Steps

  • Replace inline documents with a loader for PDFs, Markdown files, or database records.
  • Add chunking with overlap before embedding long documents (see the sketch after this list).
  • Move from single-turn Q&A to an AutoGen multi-agent setup where one agent retrieves and another validates citations.
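
A minimal sketch of fixed-size chunking with overlap; the sizes are illustrative defaults, not tuned values, and real loaders usually split on sentence or heading boundaries rather than raw character offsets.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    # Slide a fixed-size window over the text; each chunk repeats the last
    # `overlap` characters of the previous one so context is not cut mid-thought.
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

Each chunk is then embedded and indexed exactly like the inline documents above.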

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
