Haystack Tutorial (Python): connecting to PostgreSQL for intermediate developers
This tutorial shows you how to wire Haystack up to PostgreSQL in Python so you can store, retrieve, and query documents from a real database instead of keeping everything in memory. You need this when your RAG app has to persist data across restarts, support larger corpora, and behave like something you can actually deploy.
What You'll Need
- •Python 3.10+
- •A running PostgreSQL instance
- •A PostgreSQL database and user with write access
- •
pipfor installing packages - •Haystack 2.x
- •
psycopgorpsycopg2depending on your environment - •Optional: an embedding model if you want semantic retrieval instead of plain keyword lookup
Install the packages first:
pip install haystack-ai psycopg[binary]
If you want embeddings later, install a generator component too:
pip install sentence-transformers
Step-by-Step
- •Start by connecting to PostgreSQL with a normal Python connection string. Haystack’s PostgreSQL integration uses your database as the backing store, so the first thing is making sure the connection details are correct and the database is reachable.
import os
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
DATABASE_URL = os.getenv(
"DATABASE_URL",
"postgresql://postgres:postgres@localhost:5432/haystack_demo"
)
document_store = PgvectorDocumentStore(
connection_string=DATABASE_URL,
table_name="haystack_documents",
embedding_dimension=384,
)
print("Connected to PostgreSQL-backed document store")
- •Create a few documents and write them into PostgreSQL. In production, this is where you’d ingest PDFs, tickets, policies, or knowledge base articles after preprocessing.
from haystack import Document
documents = [
Document(content="PostgreSQL stores structured data and supports transactions."),
Document(content="Haystack can use PgvectorDocumentStore for persistent document retrieval."),
Document(content="Vector search works best when documents have embeddings."),
]
document_store.write_documents(documents)
print(f"Wrote {len(documents)} documents")
- •If you want semantic retrieval, generate embeddings and index them. The document store needs vectors before similarity search becomes useful, so this step turns raw text into searchable representations.
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
embedder = SentenceTransformersDocumentEmbedder(
model="sentence-transformers/all-MiniLM-L6-v2"
)
embedder.warm_up()
embedded_documents = embedder.run(documents=documents)["documents"]
document_store.write_documents(embedded_documents)
print("Embedded and stored documents")
- •Build a retrieval pipeline that queries PostgreSQL through Haystack. This is the part you’ll call from your app when a user asks a question and you want relevant context back.
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever
query_embedder = SentenceTransformersTextEmbedder(
model="sentence-transformers/all-MiniLM-L6-v2"
)
retriever = PgvectorEmbeddingRetriever(document_store=document_store)
pipeline = Pipeline()
pipeline.add_component("query_embedder", query_embedder)
pipeline.add_component("retriever", retriever)
pipeline.connect("query_embedder.embedding", "retriever.query_embedding")
result = pipeline.run({
"query_embedder": {"text": "How does Haystack use PostgreSQL?"}
})
for doc in result["retriever"]["documents"]:
print(doc.content)
- •Clean up or inspect what’s in the table when you need operational control. This matters when you are reindexing data, rotating environments, or debugging why retrieval looks wrong.
all_docs = document_store.filter_documents()
print(f"Total documents in store: {len(all_docs)}")
for doc in all_docs[:5]:
print(doc.id, doc.content)
# Optional cleanup during development
# document_store.delete_documents()
# print("Deleted all documents")
Testing It
Run the script end to end and confirm that it connects to PostgreSQL without raising an exception. Then verify that write_documents() succeeds and that filter_documents() returns the same records you inserted.
For retrieval testing, ask a query that should clearly match one of your sample documents, like “What stores structured data?” If the retriever returns relevant content from PostgreSQL, your setup is working.
If results look empty, check these first:
- •The database URL is correct
- •The table exists or can be created by your user
- •Your embedding dimension matches the model output size
- •You embedded documents before trying vector search
Next Steps
- •Add a generator component and build a full RAG pipeline on top of PostgreSQL.
- •Learn how to chunk long documents before embedding them.
- •Add metadata filters so you can scope retrieval by tenant, product line, or policy type.
Keep learning
- •The complete AI Agents Roadmap — my full 8-step breakdown
- •Free: The AI Agent Starter Kit — PDF checklist + starter code
- •Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit