Haystack Tutorial (Python): Connecting to PostgreSQL for Advanced Developers
This tutorial shows how to wire Haystack up to PostgreSQL so you can persist documents, metadata, and retrieval state in a real database instead of keeping everything in memory. You need this when your prototype turns into a service that must survive restarts, support multiple workers, and keep a consistent index for RAG workloads.
What You'll Need
- Python 3.10+
- A running PostgreSQL instance with the pgvector extension available
- `pip` access to install Haystack and the PostgreSQL integration
- A PostgreSQL user with permission to create tables in your target schema
- Optional: `psycopg` v3 if you want to inspect the database directly from Python
- Haystack 2.x packages:
  - `haystack-ai`
  - `pgvector-haystack`, the PostgreSQL (pgvector) document store integration
Step-by-Step
- Install the packages you need and verify the interpreter can import them. For production work, pin versions in `requirements.txt` or your lockfile so your indexing behavior does not drift between deployments.

```shell
pip install haystack-ai pgvector-haystack "psycopg[binary]"
python -c "from haystack import Document; print('haystack ok')"
```
- Create a PostgreSQL connection string and initialize the document store. The store owns the table setup, so point it at a dedicated database or schema rather than sharing an application schema with unrelated tables.

```python
from haystack.utils import Secret
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore

DATABASE_URL = "postgresql://postgres:postgres@localhost:5432/haystack_demo"

document_store = PgvectorDocumentStore(
    connection_string=Secret.from_token(DATABASE_URL),
    table_name="documents",
    embedding_dimension=384,  # must match your embedding model's output size
)
print(document_store)
```
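In real deployments you usually assemble the connection string from environment variables rather than hardcoding credentials. A minimal sketch, assuming libpq-style variable names (`PGHOST`, `PGUSER`, etc.); the `build_pg_url` helper itself is hypothetical, not part of Haystack:

```python
import os
from urllib.parse import quote


def build_pg_url(env=os.environ):
    """Compose a PostgreSQL URL from libpq-style env vars, with local defaults."""
    user = env.get("PGUSER", "postgres")
    password = quote(env.get("PGPASSWORD", "postgres"), safe="")  # escape @, : etc.
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    dbname = env.get("PGDATABASE", "haystack_demo")
    return f"postgresql://{user}:{password}@{host}:{port}/{dbname}"


print(build_pg_url({}))  # -> postgresql://postgres:postgres@localhost:5432/haystack_demo
```

URL-escaping the password matters: a literal `@` or `:` in an unescaped password silently corrupts the parsed host and port.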
- Write documents into PostgreSQL using Haystack `Document` objects. This is where persistence starts paying off: once written, your corpus survives process restarts and can be shared across multiple retrieval pipelines. Embed the documents before writing; embedding-based retrieval only finds documents that carry an embedding.

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

docs = [
    Document(content="PostgreSQL is a relational database.", meta={"source": "wiki"}),
    Document(content="Haystack can use external stores for retrieval.", meta={"source": "docs"}),
    Document(content="Vector search works well for semantic retrieval.", meta={"source": "notes"}),
]

# Compute embeddings so the retriever can match these documents later.
doc_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
doc_embedder.warm_up()
docs = doc_embedder.run(documents=docs)["documents"]

document_store.write_documents(docs)
print("Documents written:", len(docs))
```
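Real corpora rarely arrive as tidy one-sentence documents; long sources are usually split into overlapping chunks before writing. Haystack ships a `DocumentSplitter` component for this; the sketch below is an illustrative stand-alone word-based chunker (the `chunk_text` helper and its parameters are hypothetical) showing the idea:

```python
def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into word-based chunks; overlap lets context span boundaries."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk already covers the tail
    return chunks


sample = " ".join(f"w{i}" for i in range(120))
parts = chunk_text(sample, chunk_size=50, overlap=10)
print(len(parts))  # -> 3
```

Each chunk would then become its own `Document`, typically with the parent document's id stored in `meta` so you can reassemble context later.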
- Retrieve by embedding with an embedding model and a retriever. In Haystack, the document store handles persistence while the retriever handles similarity search; that separation keeps your pipeline easier to test and swap later. Note that each store ships its own retriever: `InMemoryEmbeddingRetriever` only works with the in-memory store, so here we use `PgvectorEmbeddingRetriever`.

```python
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever

text_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
retriever = PgvectorEmbeddingRetriever(document_store=document_store)

pipeline = Pipeline()
pipeline.add_component("embedder", text_embedder)
pipeline.add_component("retriever", retriever)
pipeline.connect("embedder.embedding", "retriever.query_embedding")

result = pipeline.run({
    "embedder": {"text": "How do I persist documents for retrieval?"},
})
for doc in result["retriever"]["documents"]:
    print(doc.content, doc.meta)
```
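Raw retrieval results often include weak matches, so in practice you cut them with a similarity-score threshold before handing them to a generator. A minimal sketch over plain `(content, score)` records; the `filter_by_score` helper and the threshold value are illustrative, not a Haystack API:

```python
def filter_by_score(docs, min_score=0.5, top_k=3):
    """Keep the highest-scoring documents above a similarity threshold."""
    kept = [d for d in docs if d["score"] >= min_score]
    kept.sort(key=lambda d: d["score"], reverse=True)
    return kept[:top_k]


hits = [
    {"content": "PostgreSQL is a relational database.", "score": 0.31},
    {"content": "Vector search works well for semantic retrieval.", "score": 0.82},
    {"content": "Haystack can use external stores for retrieval.", "score": 0.64},
]
for d in filter_by_score(hits):
    print(d["score"], d["content"])  # 0.31 hit is dropped
```

Tune the threshold empirically per embedding model; absolute score ranges differ between cosine and inner-product similarity.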
- Add a clean startup check so your app fails fast when PostgreSQL is unavailable. In services, you want connection issues to surface during boot rather than halfway through a user request.

```python
def ensure_store_ready(store):
    """Fail fast at startup if PostgreSQL is unreachable."""
    count = store.count_documents()
    print(f"Connected. Documents currently stored: {count}")

ensure_store_ready(document_store)
```
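Under container orchestration, Postgres may become reachable a few seconds after your app boots, so a small retry loop around the startup check avoids spurious crash loops. A sketch, where the `retry` helper and its defaults are illustrative rather than a Haystack API (the `flaky` function below just simulates a database that comes up on the third attempt):

```python
import time


def retry(fn, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fn, retrying with exponential backoff; re-raise the last error."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            sleep(base_delay * (2 ** i))  # 0.5s, 1s, 2s, ...


calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("not ready")
    return "ok"

print(retry(flaky, sleep=lambda s: None))  # -> ok

# in the service: retry(lambda: ensure_store_ready(document_store))
```

Injecting `sleep` as a parameter keeps the backoff testable without real waiting.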
Testing It
Run the script end-to-end and confirm that the insert step reports the expected number of documents. Then restart your Python process and call count_documents() again; if the count persists, PostgreSQL is doing its job.
Next, run a query that should clearly match one of your sample documents, like "semantic retrieval" or "relational database". If retrieval returns relevant results with stable metadata, your indexing path is correct.
If you get empty results, check three things first: the embedding model's output size matches the store's `embedding_dimension`, PostgreSQL is reachable from your machine, and the table name points to the same database across runs. Also confirm the stored documents actually have embeddings; documents written without them are invisible to embedding retrieval.
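The dimension mismatch in particular is cheap to catch programmatically: compare the length of one embedding against what the store was created with. A sketch; the `check_dimension` helper is illustrative, and 384 is the output size of `all-MiniLM-L6-v2`:

```python
def check_dimension(embedding, expected=384):
    """Raise early if the embedding size does not match the store's column."""
    actual = len(embedding)
    if actual != expected:
        raise ValueError(f"embedding has {actual} dims, store expects {expected}")
    return True


fake_embedding = [0.0] * 384  # stand-in for a real model output
print(check_dimension(fake_embedding))  # -> True
```

Run this once at startup against a real model output; pgvector will otherwise reject mismatched vectors only at insert or query time.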
Next Steps
- Add a `DocumentWriter` component to build a full ingestion pipeline instead of calling `write_documents()` directly.
- Introduce filters on metadata fields like `source`, `tenant_id`, or `doc_type` for multi-tenant RAG.
- Move from local development to managed Postgres with connection pooling and migration scripts for table lifecycle management.
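Metadata filters in Haystack 2.x are plain dictionaries, so a small builder keeps multi-tenant filters consistent across call sites. A sketch assuming the Haystack 2.x filter schema (`operator`/`conditions` dictionaries over `meta.*` fields); the `tenant_filter` helper itself is hypothetical:

```python
def tenant_filter(tenant_id, source=None):
    """Build a Haystack 2.x-style filter dict scoping retrieval to one tenant."""
    conditions = [{"field": "meta.tenant_id", "operator": "==", "value": tenant_id}]
    if source is not None:
        conditions.append({"field": "meta.source", "operator": "==", "value": source})
    return {"operator": "AND", "conditions": conditions}


f = tenant_filter("acme", source="wiki")
print(f["conditions"][0]["value"])  # -> acme
# passed at query time, e.g. pipeline.run({..., "retriever": {"filters": f}})
```

Centralizing filter construction like this makes it much harder for one code path to forget the tenant scope, which in multi-tenant RAG is a data-leak bug, not just a relevance bug.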
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.