pgvector vs Elasticsearch for Batch Processing: Which Should You Use?

By Cyprian Aarons · Updated 2026-04-21
Tags: pgvector, elasticsearch, batch-processing

pgvector is a PostgreSQL extension for vector similarity search. Elasticsearch is a distributed search engine that also does vector retrieval, filtering, and full-text search at scale.

For batch processing, pick pgvector if your workload already lives in Postgres and your batches stay in the low millions of vectors. Pick Elasticsearch only when you need high-volume indexing, distributed throughput, or hybrid search across text and vectors.

Quick Comparison

Learning curve
  • pgvector: Low if you already know SQL and Postgres. You use CREATE EXTENSION vector, CREATE INDEX ... USING hnsw, and normal SQL queries.
  • Elasticsearch: Higher. You need to understand index mappings, shards, analyzers, refresh cycles, and vector fields.

Performance
  • pgvector: Strong for single-node or modest-scale batch jobs. HNSW and IVFFlat work well when the data fits cleanly in Postgres.
  • Elasticsearch: Better for large distributed ingestion and retrieval. Built for parallel indexing across shards and nodes.

Ecosystem
  • pgvector: Best when paired with PostgreSQL tooling, transactions, joins, and ETL pipelines already using SQL.
  • Elasticsearch: Strong for search-heavy systems with logs, documents, observability, and hybrid retrieval.

Pricing
  • pgvector: Usually cheaper because you run one database stack instead of two. Fewer moving parts means lower ops cost.
  • Elasticsearch: More expensive operationally. Cluster management, shard tuning, storage overhead, and memory pressure add up fast.

Best use cases
  • pgvector: Batch embedding lookup, deduplication, semantic joins, RAG over relational data, metadata-heavy filtering.
  • Elasticsearch: Large-scale document ingestion, hybrid lexical + vector search, multi-tenant search platforms, near-real-time indexing.

Documentation
  • pgvector: Small but direct. The core API surface is simple: vector, halfvec, hnsw, ivfflat, <->, <=>, <#>.
  • Elasticsearch: Broad but fragmented across search concepts: mappings, dense_vector, kNN queries, ingest pipelines, bulk API, refresh settings.
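Those operator symbols map to concrete distance functions: <-> is Euclidean (L2) distance, <=> is cosine distance, and <#> is negative inner product. A rough illustration in plain Python (this is only a sketch of the semantics; pgvector computes these in C inside Postgres):

```python
import math

def l2_distance(a, b):          # pgvector's <-> operator
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):      # pgvector's <=> operator: 1 - cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

def neg_inner_product(a, b):    # pgvector's <#> operator: negative inner product
    return -sum(x * y for x, y in zip(a, b))

a, b = [1.0, 0.0], [0.0, 1.0]
print(l2_distance(a, b))        # ~1.414 (sqrt of 2)
print(cosine_distance(a, b))    # 1.0 (orthogonal vectors)
```

Smaller values mean closer matches for all three, which is why the SQL examples below ORDER BY the distance ascending.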

When pgvector Wins

Use pgvector when your batch job is really a SQL job with embeddings attached.

  • You already store the source records in PostgreSQL

    • If your batch pipeline reads invoices, claims, customer profiles, or policy records from Postgres, adding embeddings to the same system is the cleanest path.
    • You can run joins directly:
      SELECT c.id,
             c.text,
             e.embedding <-> q.embedding AS distance
      FROM customer_notes c
      JOIN query_vectors q ON q.id = $1
      ORDER BY distance
      LIMIT 20;
      
  • You need transactional consistency

    • Batch jobs often update metadata alongside vectors.
    • With pgvector you get normal Postgres transactions, foreign keys, constraints, and rollbacks without building a second consistency model.
  • Your filtering logic is relational

    • If every batch query includes tenant_id, region, product_line, or date windows, SQL wins.
    • pgvector handles this naturally with WHERE clauses before or after similarity search.
  • Your scale is sane

    • For hundreds of thousands to a few million vectors per tenant or dataset slice, pgvector is enough.
    • Use:
      CREATE INDEX ON embeddings USING hnsw (embedding vector_cosine_ops);
      
      or:
      CREATE INDEX ON embeddings USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
      
    • That covers most batch enrichment jobs without introducing a cluster.
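The transactional and filtering points above combine naturally in one pass. As a sketch only (the embeddings table and its tenant_id, updated_at columns are illustrative, not a real schema), a nightly enrichment job might upsert vectors and query them inside a single transaction:

```sql
BEGIN;

-- Upsert a fresh batch of vectors alongside their metadata, atomically.
INSERT INTO embeddings (id, tenant_id, embedding, updated_at)
VALUES (101, 'acme', '[0.12, 0.34, 0.56]', now())
ON CONFLICT (id) DO UPDATE
  SET embedding  = EXCLUDED.embedding,
      updated_at = EXCLUDED.updated_at;

-- Relational filter first, similarity ordering second.
SELECT id,
       embedding <=> '[0.11, 0.33, 0.55]' AS cosine_distance
FROM embeddings
WHERE tenant_id = 'acme'
  AND updated_at >= now() - interval '1 day'
ORDER BY cosine_distance
LIMIT 20;

COMMIT;
```

If either statement fails, the whole batch rolls back, so you never end up with metadata that points at missing vectors.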

When Elasticsearch Wins

Use Elasticsearch when batch processing means large-scale indexing and retrieval across many documents.

  • You ingest huge volumes of documents

    • If your pipeline loads millions of records per run and needs parallel indexing throughput, Elasticsearch is built for that.
    • The _bulk API is the standard move here:
      POST /my-index/_bulk
      { "index": { "_id": "1" } }
      { "text": "claim summary", "embedding": [0.1, 0.2] }
      
  • You need hybrid search

    • Elasticsearch gives you lexical relevance plus vectors in one query model.
    • That matters when batch jobs must support both keyword matching and semantic ranking on the same corpus.
  • Your data model is document-first

    • If each record is a denormalized JSON document with nested fields and sparse attributes, Elasticsearch fits better than forcing it into relational tables.
    • Mapping controls like dense_vector, analyzers, and nested objects are native concepts there.
  • You need horizontal scaling as the default

    • Batch workloads that outgrow one database node benefit from Elasticsearch’s shard-based architecture.
    • It handles distributed ingestion and retrieval better than trying to stretch Postgres into a search cluster.

For Batch Processing Specifically

My recommendation: use pgvector by default.

Batch processing usually means scheduled enrichment jobs, nightly scoring runs, deduplication passes, or embedding backfills over data that already has a home in Postgres. In that world, pgvector gives you fewer systems to operate, simpler failure handling, and faster development because you stay inside SQL.

Choose Elasticsearch only if your batch job is really part of a search platform: massive document volume, distributed indexing pressure, or hybrid retrieval requirements that Postgres cannot handle cleanly.


By Cyprian Aarons, AI Consultant at Topiax.