Haystack Tutorial (Python): connecting to PostgreSQL for beginners
This tutorial shows you how to wire Haystack to PostgreSQL so you can store and query document embeddings from Python. You’d do this when you want a persistent vector store for retrieval-augmented generation, instead of keeping everything in memory or relying on a local demo setup.
What You'll Need
- Python 3.10+
- A running PostgreSQL instance with the pgvector extension enabled
- A database user with permission to create tables
- haystack-ai
- pgvector-haystack (the package that provides the PostgreSQL integration)
- psycopg[binary]
- An embedding model, either:
  - an OpenAI API key for OpenAITextEmbedder, or
  - another Haystack-compatible embedder you already use
- Basic familiarity with Haystack pipelines and documents
Install the packages:
```shell
pip install haystack-ai pgvector-haystack "psycopg[binary]"
```
Step-by-Step
- First, make sure PostgreSQL is ready for vector search. Haystack’s PostgreSQL integration expects pgvector, so create the extension in your target database before you try to write documents.
```sql
CREATE EXTENSION IF NOT EXISTS vector;
```
- Create a connection string and initialize the PostgreSQL document store. This store will hold your documents, embeddings, and metadata.
```python
from haystack.utils import Secret
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore

DATABASE_URL = "postgresql://postgres:postgres@localhost:5432/haystack"

document_store = PgvectorDocumentStore(
    connection_string=Secret.from_token(DATABASE_URL),
    table_name="haystack_docs",
    embedding_dimension=1536,  # must match your embedding model's output size
)
print("Document store ready")
```
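If the connection fails, the first thing to check is the URL itself. It packs user, password, host, port, and database name into one string, and the standard library can pull those pieces apart for you (the values below are the tutorial's local defaults, not anything you must use):

```python
from urllib.parse import urlsplit

DATABASE_URL = "postgresql://postgres:postgres@localhost:5432/haystack"

# Break the connection string into its components to sanity-check each one.
parts = urlsplit(DATABASE_URL)
print("user:    ", parts.username)
print("host:    ", parts.hostname)
print("port:    ", parts.port)
print("database:", parts.path.lstrip("/"))
```

Comparing these four values against what your PostgreSQL server actually uses catches most "connection refused" and "database does not exist" errors early.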
- Add a small set of documents. In production, these would come from files, tickets, policies, or internal knowledge bases. For beginners, hardcoding a few documents is the fastest way to confirm the storage layer works.
```python
from haystack import Document

docs = [
    Document(content="PostgreSQL supports vector similarity search with pgvector.", meta={"source": "db"}),
    Document(content="Haystack can index documents into external vector stores.", meta={"source": "haystack"}),
    Document(content="Retrieval works best when chunks are clean and focused.", meta={"source": "search"}),
]

document_store.write_documents(docs)
print("Documents written:", len(docs))
```
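When the documents come from real sources instead of hardcoded strings, it pays to filter out junk before it reaches the store. A minimal sketch, with plain strings standing in for Document content:

```python
# A tiny pre-write check: drop empty or whitespace-only chunks
# before they are turned into Documents and written to the store.
raw_chunks = [
    "PostgreSQL supports vector similarity search with pgvector.",
    "   ",  # whitespace-only chunk -> dropped
    "Haystack can index documents into external vector stores.",
]

clean_chunks = [c.strip() for c in raw_chunks if c.strip()]
print("kept", len(clean_chunks), "of", len(raw_chunks), "chunks")
```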
- Embed the documents and store their vectors. The example below uses OpenAI because it’s common in Haystack setups, but any embedder that returns embeddings of the configured dimension will work.
```python
import os

from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.document_stores.types import DuplicatePolicy

os.environ["OPENAI_API_KEY"] = "your-openai-api-key"  # or export it in your shell

doc_embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small")
embedded_docs = doc_embedder.run(documents=docs)["documents"]

# These are the same documents (and therefore the same IDs) written above,
# so overwrite them instead of failing on duplicates.
document_store.write_documents(embedded_docs, policy=DuplicatePolicy.OVERWRITE)
print("Embedded documents stored")
```
- Run a similarity search against PostgreSQL. This is the part you care about in a RAG system: given a query, retrieve the most relevant chunks from your database.
```python
from haystack.components.embedders import OpenAITextEmbedder
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever

query_embedder = OpenAITextEmbedder(model="text-embedding-3-small")
query_result = query_embedder.run(text="How does Haystack store vectors?")

retriever = PgvectorEmbeddingRetriever(document_store=document_store, top_k=2)
retrieved = retriever.run(query_embedding=query_result["embedding"])["documents"]

for doc in retrieved:
    print(doc.content)
    print(doc.meta)
```
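Under the hood, pgvector ranks stored vectors by their similarity to the query vector; cosine similarity is the usual default. A toy sketch of that ranking with 2-dimensional vectors (real embeddings have hundreds or thousands of dimensions, but the math is the same):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0]
vectors = {"doc_a": [0.9, 0.1], "doc_b": [0.0, 1.0]}

# Rank documents by similarity to the query, best first.
ranked = sorted(vectors, key=lambda k: cosine_similarity(query, vectors[k]), reverse=True)
print(ranked)  # doc_a is nearly parallel to the query, so it ranks first
```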
- If you want this inside a pipeline, connect the embedder and retriever explicitly. This is the pattern you’ll use once you move beyond manual calls and into an application flow.
```python
from haystack import Pipeline
from haystack.components.embedders import OpenAITextEmbedder
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever

pipeline = Pipeline()
pipeline.add_component("query_embedder", OpenAITextEmbedder(model="text-embedding-3-small"))
pipeline.add_component(
    "retriever",
    PgvectorEmbeddingRetriever(document_store=document_store, top_k=2),
)
pipeline.connect("query_embedder.embedding", "retriever.query_embedding")

result = pipeline.run({
    "query_embedder": {"text": "What is pgvector used for?"}
})
for doc in result["retriever"]["documents"]:
    print(doc.content)
```
Testing It
The quickest test is to run the script end to end and confirm that PostgreSQL accepts the writes without errors. If your setup is correct, you should see documents stored and then retrieved in relevance order for the sample query.
If retrieval returns nothing, check three things first: the vector extension exists, the embedding dimension configured on the document store matches your embedding model’s output, and your database URL points to the right schema and table. Also verify that your OpenAI key is set if you’re using OpenAI embedders.
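The dimension mismatch is the easiest of these to catch in code: the size configured on the document store must equal what the model actually returns. A small lookup sketch, using OpenAI's published output dimensions (other providers differ, so check your model's docs):

```python
# Published output dimensions for common OpenAI embedding models.
EMBEDDING_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}

model = "text-embedding-3-small"
store_dimension = 1536  # the value configured on the document store

if EMBEDDING_DIMS[model] != store_dimension:
    raise ValueError(
        f"{model} returns {EMBEDDING_DIMS[model]}-dim vectors, "
        f"but the store expects {store_dimension}"
    )
print("dimensions match")
```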
For a more realistic test, swap the hardcoded documents with 10–20 short chunks from one internal policy or product guide. Then ask a question that should clearly match one of those chunks and inspect whether the top result is actually relevant.
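To produce those test chunks, even a naive word-window splitter is enough to get started. A sketch (a real pipeline would use sentence-aware splitting, for example Haystack's DocumentSplitter component):

```python
def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Naive fixed-size word windows; fine for a first retrieval test."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# Stand-in for one internal policy document (120 placeholder words).
policy_text = " ".join(f"word{i}" for i in range(120))
chunks = chunk_text(policy_text)
print(len(chunks), "chunks, largest has", max(len(c.split()) for c in chunks), "words")
```

Each chunk then becomes one Document, exactly like the hardcoded examples earlier in the tutorial.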
Next Steps
- Add metadata filters so retrieval can scope results by source, tenant, or document type.
- Replace manual embedding calls with a full RAG pipeline using a generator component.
- Learn how to chunk long documents properly before indexing them into PostgreSQL.
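As a taste of the first item: Haystack metadata filters are plain dicts. The sketch below shows the comparison-filter shape and applies the same predicate locally to plain dicts so you can see what it selects; the field name matches the sample documents from this tutorial.

```python
# Haystack-style comparison filter: keep only documents whose meta.source is "db".
source_filter = {"field": "meta.source", "operator": "==", "value": "db"}

# The same predicate applied locally, to show which records it would keep.
docs = [
    {"content": "pgvector doc", "meta": {"source": "db"}},
    {"content": "haystack doc", "meta": {"source": "haystack"}},
]
matched = [d for d in docs if d["meta"]["source"] == source_filter["value"]]
print([d["content"] for d in matched])
```

In a real retrieval call you would pass a dict like this as the filters argument of the retriever, so PostgreSQL only searches rows whose metadata matches.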
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.