Haystack Tutorial (Python): connecting to PostgreSQL for advanced developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows how to wire Haystack up to PostgreSQL so you can persist documents, metadata, and retrieval state in a real database instead of keeping everything in memory. You need this when your prototype turns into a service that must survive restarts, support multiple workers, and keep a consistent index for RAG workloads.

What You'll Need

  • Python 3.10+
  • A running PostgreSQL instance
  • pip access to install Haystack and the PostgreSQL integration
  • A PostgreSQL user with permission to create tables in your target schema
  • Optional: psycopg v3 if you want to inspect the database directly from Python
  • Haystack 2.x packages:
    • haystack-ai
    • pgvector-haystack, the PostgreSQL (pgvector) document store integration

Step-by-Step

  1. Install the packages you need and verify the interpreter can import them.
    For production work, pin versions in requirements.txt or your lockfile so your indexing behavior does not drift between deployments.
pip install haystack-ai pgvector-haystack "psycopg[binary]"
python -c "from haystack import Document; print('haystack ok')"
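To keep indexing behavior stable across deployments, pin the packages in a requirements.txt. The version numbers below are placeholders for illustration only; replace them with the exact versions from your own `pip freeze`.

```
# requirements.txt -- example pins, versions are illustrative; check PyPI for current releases
haystack-ai==2.6.0        # example pin
pgvector-haystack==1.0.0  # example pin
psycopg[binary]==3.2.1    # example pin
```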
  2. Create a PostgreSQL connection string and initialize the document store.
    The store owns the table setup, so point it at a dedicated database or schema rather than sharing an application schema with unrelated tables.
from haystack.utils import Secret
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore

DATABASE_URL = "postgresql://postgres:postgres@localhost:5432/haystack_demo"

document_store = PgvectorDocumentStore(
    connection_string=Secret.from_token(DATABASE_URL),
    table_name="documents",
    embedding_dimension=384,
)

print(document_store)
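Hardcoding credentials is fine for a demo, but in a service you usually read the connection string from the environment. A minimal sketch, assuming an environment variable named DATABASE_URL (a common convention, not something Haystack requires):

```python
import os

def database_url(default="postgresql://postgres:postgres@localhost:5432/haystack_demo"):
    # Prefer the environment so credentials stay out of source control;
    # fall back to a local default for development.
    return os.environ.get("DATABASE_URL", default)
```

You would then pass `Secret.from_token(database_url())` to the store, or point the store at the variable directly via `Secret.from_env_var`.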
  3. Write documents into PostgreSQL using Haystack Document objects.
    This is where persistence starts paying off: once written, your corpus survives process restarts and can be shared across multiple retrieval pipelines.
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

docs = [
    Document(content="PostgreSQL is a relational database.", meta={"source": "wiki"}),
    Document(content="Haystack can use external stores for retrieval.", meta={"source": "docs"}),
    Document(content="Vector search works well for semantic retrieval.", meta={"source": "notes"}),
]

# Embed the documents before writing: embedding retrieval can only
# match documents that were stored with vectors.
doc_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
doc_embedder.warm_up()
docs_with_embeddings = doc_embedder.run(docs)["documents"]

document_store.write_documents(docs_with_embeddings)
print("Documents written:", len(docs_with_embeddings))
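Re-running an ingestion script can collide with documents already in the table. Haystack derives a stable Document id from the content, and write_documents accepts a duplicate policy (DuplicatePolicy from haystack.document_stores.types) to control what happens on collision. A stdlib-only sketch of the deterministic-id idea, with a hypothetical helper name:

```python
import hashlib

def stable_id(content: str, meta: dict) -> str:
    # Hash the content plus sorted metadata so the same document
    # always maps to the same id, making re-ingestion idempotent.
    key = content + "".join(f"{k}={meta[k]}" for k in sorted(meta))
    return hashlib.sha256(key.encode()).hexdigest()

a = stable_id("PostgreSQL is a relational database.", {"source": "wiki"})
b = stable_id("PostgreSQL is a relational database.", {"source": "wiki"})
print(a == b)  # same inputs produce the same id
```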
  4. Retrieve by embedding with an embedding model and a retriever.
    In Haystack, the document store handles persistence while the retriever handles similarity search; that separation keeps your pipeline easier to test and swap later.
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever

text_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)

retriever = PgvectorEmbeddingRetriever(document_store=document_store)

pipeline = Pipeline()
pipeline.add_component("embedder", text_embedder)
pipeline.add_component("retriever", retriever)
pipeline.connect("embedder.embedding", "retriever.query_embedding")

result = pipeline.run({
    "embedder": {"text": "How do I persist documents for retrieval?"},
})
for doc in result["retriever"]["documents"]:
    print(doc.content, doc.meta)
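Under the hood, embedding retrieval ranks stored documents by vector similarity between the query embedding and each document embedding. Cosine similarity is a common choice for pgvector setups; a minimal stdlib sketch of the math:

```python
import math

def cosine_similarity(a, b):
    # Dot product of the vectors, normalized by their magnitudes;
    # 1.0 means identical direction, 0.0 means orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical vectors -> 1.0
```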
  5. Add a clean startup check so your app fails fast when PostgreSQL is unavailable.
    In services, you want connection issues to surface during boot rather than halfway through a user request.
def ensure_store_ready(store):
    count = store.count_documents()
    print(f"Connected. Documents currently stored: {count}")

ensure_store_ready(document_store)
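The check above surfaces connection errors by letting them raise. If you prefer an explicit exit code for your process supervisor, a slightly more defensive sketch (the function name and error handling here are illustrative, not a Haystack API):

```python
import sys

def ensure_store_ready_strict(store):
    # Fail fast at boot: treat any store/driver error as fatal so
    # connection problems never surface mid-request.
    try:
        count = store.count_documents()
    except Exception as exc:
        print(f"document store unavailable: {exc}", file=sys.stderr)
        sys.exit(1)
    print(f"Connected. Documents currently stored: {count}")
    return count
```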

Testing It

Run the script end-to-end and confirm that the insert step reports the expected number of documents. Then restart your Python process and call count_documents() again; if the count persists, PostgreSQL is doing its job.

Next, run a query that should clearly match one of your sample documents, like "semantic retrieval" or "relational database". If retrieval returns relevant results with stable metadata, your indexing path is correct.

If you get empty results, check three things first: the embedding model's output dimension matches the store's embedding_dimension, PostgreSQL is reachable from your machine, and every run points at the same database and table name.
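The dimension check is the one most worth automating, since a mismatch often fails silently. A small illustrative guard (not part of Haystack):

```python
def check_embedding_dimension(embedding, expected=384):
    # The model's output size must match the store's configured
    # embedding dimension, or retrieval will error or return nothing.
    if len(embedding) != expected:
        raise ValueError(
            f"embedding has {len(embedding)} dimensions, store expects {expected}"
        )
    return True
```

Run it once at startup against a throwaway embedding from your model.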

Next Steps

  • Add a DocumentWriter component to build a full ingestion pipeline instead of calling write_documents() directly.
  • Introduce filters on metadata fields like source, tenant_id, or doc_type for multi-tenant RAG.
  • Move from local development to managed Postgres with connection pooling and migration scripts for table lifecycle management.
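For the metadata filtering mentioned above, Haystack 2.x filters are plain dictionaries. A sketch of a single condition and an AND-combined filter (field names like tenant_id and doc_type are illustrative):

```python
# Match only documents whose source metadata equals "docs".
single = {"field": "meta.source", "operator": "==", "value": "docs"}

# Combine several conditions; the outer operator can be "AND" or "OR".
combined = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.tenant_id", "operator": "==", "value": "acme"},
        {"field": "meta.doc_type", "operator": "in", "value": ["faq", "guide"]},
    ],
}
```

Filters like these can be passed to the retriever at query time to scope results per tenant or document type.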


By Cyprian Aarons, AI Consultant at Topiax.
