
Building a Production-Ready Event Store with PostgreSQL: Lessons from Running It at Scale


Your event-sourced system worked perfectly in development. Then you hit 10 million events, projections started lagging by minutes, and your ops team is asking why the database is consuming 80% CPU during peak hours. Here’s how to build an event store that actually survives production.

Event sourcing promises an immutable audit log, time-travel debugging, and the ability to rebuild state from scratch. What the tutorials don’t mention: most event stores fail not because of architectural flaws, but because of operational neglect. The schema that seemed elegant at 100,000 events becomes a maintenance nightmare at 100 million. The projection that rebuilt in seconds now takes hours.

This guide covers the hard-won lessons from running PostgreSQL as an event store at scale—the schema decisions that prevent late-night emergencies, the concurrency patterns that don’t deadlock under load, and the operational practices that keep your system healthy.


Why PostgreSQL Makes Sense (And When It Doesn’t)

PostgreSQL solves the hardest problem in event sourcing: guaranteed ordering and durability. When you append an event with a specific version number, PostgreSQL’s ACID guarantees ensure that either the write succeeds atomically or fails entirely. No partial writes. No lost events. No ordering ambiguity. This is the foundation that event sourcing requires—without it, you’re constantly second-guessing whether your event log actually represents what happened.

The transactional model also means you can update projections in the same transaction as the event append—something that requires complex coordination with dedicated event stores. For many systems, this simplifies the architecture dramatically. Instead of wrestling with eventual consistency between your event store and your read models, you get the option of immediate consistency when you need it. This proves especially valuable for user-facing operations where stale reads cause confusion.

PostgreSQL’s NOTIFY/LISTEN provides native pub/sub without additional infrastructure. Your projections can subscribe to new events in real-time without polling, without Kafka, without Redis. One less system to operate, one less failure mode to handle. When you’re a small team running a handful of services, this operational simplicity matters more than theoretical throughput limits you’ll never reach.

The ecosystem advantage shouldn’t be underestimated either. Your team likely already knows PostgreSQL. Your monitoring, backup, and disaster recovery procedures exist. Connection pooling with PgBouncer, replication for read scaling, and point-in-time recovery for safety—all battle-tested patterns that apply directly.

When PostgreSQL falls short:

Specialized event stores like EventStoreDB offer features PostgreSQL can’t match: built-in catch-up subscriptions with guaranteed ordering, native projection support, and optimized storage for append-only workloads. If you’re building a system where event sourcing is the core architectural pattern—not just one component—the operational overhead of a dedicated store may be worth it.

EventStoreDB handles projections as first-class citizens. You write projection logic in JavaScript, and the database manages checkpoint tracking, exactly-once semantics, and parallel processing automatically. With PostgreSQL, you build all of this yourself. For teams going all-in on event sourcing across dozens of aggregates, that infrastructure investment adds up quickly.

The hidden costs of PostgreSQL as an event store emerge at scale:

  • Write amplification: every append writes to the heap and to every index on the table, and all of it lands in the WAL—multiplying the actual I/O per event. Rolled-back appends and write-heavy projection tables in the same database generate dead tuples on top.
  • Vacuum pressure: Those dead tuples require aggressive vacuuming. An event store with 10,000 writes per minute generates significant maintenance overhead. Vacuum workers compete for resources with your application queries.
  • Index bloat: indexes grow relentlessly alongside an append-only table, and inserts on randomly distributed keys (such as stream UUIDs) cause constant page splits that leave pages half-empty and fragment performance.
  • Connection overhead: Each projection worker needs its own connection for long-running polling or LISTEN subscriptions. Connection limits become a constraint before CPU or disk do.

The break-even point varies, but teams running PostgreSQL event stores beyond 500 million events often report spending more time on operational tuning than on feature development. Know your scale trajectory before committing. If you’re building the next high-frequency trading platform, PostgreSQL isn’t your answer. If you’re building a SaaS application serving thousands of customers with audit requirements, it’s probably the right choice.


Schema Design That Won’t Bite You Later

The events table is the foundation. Get this wrong, and you’re looking at a painful migration under production load. Every decision you make here compounds over millions of events, so it’s worth understanding the tradeoffs deeply.

001_create_events_table.sql
CREATE TABLE events (
    -- Global ordering across all streams, assigned from a sequence
    global_position BIGSERIAL NOT NULL,
    -- Stream identification
    stream_id UUID NOT NULL,
    stream_type VARCHAR(255) NOT NULL,
    -- Version within the stream (for optimistic concurrency)
    version INTEGER NOT NULL,
    -- Event metadata
    event_type VARCHAR(255) NOT NULL,
    event_id UUID NOT NULL DEFAULT gen_random_uuid(),
    -- Payload - JSONB for flexibility, indexed fields for queries
    data JSONB NOT NULL,
    metadata JSONB NOT NULL DEFAULT '{}',
    -- Timestamps
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
) PARTITION BY RANGE (created_at);

-- Create partitions for the next 12 months
CREATE TABLE events_2026_01 PARTITION OF events
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
CREATE TABLE events_2026_02 PARTITION OF events
    FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');
-- Continue for remaining months...

-- Compound unique index prevents duplicate versions within a stream.
-- A unique constraint declared on the partitioned parent would have to include
-- the partition key (created_at), which would defeat the purpose, so it is
-- created on each partition instead. Concurrent appends always land in the
-- current partition, so racing writers still collide. These indexes also serve
-- single-stream reads (the most common operation).
CREATE UNIQUE INDEX events_2026_01_stream_version_unique
    ON events_2026_01 (stream_id, version);
CREATE UNIQUE INDEX events_2026_02_stream_version_unique
    ON events_2026_02 (stream_id, version);
-- ...and on every new partition

-- Index for projection catch-up (reading from a position)
CREATE INDEX idx_events_global_position
    ON events (global_position);
-- Index for filtering by event type
CREATE INDEX idx_events_event_type
    ON events (event_type);
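
The "continue for remaining months" step is easy to forget until an insert fails for lack of a partition, so most teams automate it (or use a tool like pg_partman). Here's a minimal sketch assuming the schema above; the function names and scheduling are illustrative—run it from whatever scheduler you already operate:

create_partitions.py
import asyncpg
from datetime import date

def _next_month(d: date) -> date:
    """First day of the month after d."""
    return date(d.year + d.month // 12, d.month % 12 + 1, 1)

async def ensure_next_month_partition(conn: asyncpg.Connection, today: date) -> None:
    """Create next month's events partition and its unique index if missing."""
    start = _next_month(today.replace(day=1))
    end = _next_month(start)
    name = f"events_{start.year}_{start.month:02d}"
    await conn.execute(
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF events "
        f"FOR VALUES FROM ('{start}') TO ('{end}')"
    )
    # Recreate the per-partition concurrency guarantee on every new partition
    await conn.execute(
        f"CREATE UNIQUE INDEX IF NOT EXISTS {name}_stream_version_unique "
        f"ON {name} (stream_id, version)"
    )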

The unique index on (stream_id, version) is the key to safe concurrent writes. When two requests try to append version 5 to the same stream simultaneously, one succeeds and one fails with a unique constraint violation. No locks required. No coordination needed. PostgreSQL handles the race condition for you at the storage layer, which is far more efficient than application-level locking.

The global_position column deserves special attention. It’s a monotonically increasing sequence that orders events across all streams. Projections use this to process events in the exact order they were committed, regardless of which stream they belong to. Without global ordering, you can’t guarantee that your read models see a consistent view of the world. Two events in different streams that happen in the same millisecond need a tiebreaker—global position provides that.

Separating stream_type from stream_id pays dividends for operational queries. When you need to find all events for a specific aggregate type (say, all Order events), you don’t need to maintain a separate mapping. It also enables type-specific retention policies and monitoring.

JSONB vs. typed columns for payloads:

JSONB offers schema flexibility—you can add fields to events without migrations. But you pay for this at query time. Every query that filters on payload fields must parse JSON, which adds CPU overhead and prevents the query planner from using statistics effectively.

If you frequently filter or join on specific payload fields, extract them into indexed columns:

002_add_indexed_fields.sql
-- For events that always have these fields.
-- Note: adding a STORED generated column rewrites the table, so schedule this
-- during a low-traffic window on a large events table.
ALTER TABLE events ADD COLUMN user_id UUID
    GENERATED ALWAYS AS ((data->>'user_id')::UUID) STORED;

CREATE INDEX idx_events_user_id ON events (user_id)
    WHERE user_id IS NOT NULL;

Generated columns compute the value at insert time and store it physically. You get the query performance of typed columns with the flexibility of keeping the full event in JSONB. The storage cost is minimal compared to the query benefits for frequently-accessed fields.
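
Queries can then filter on the typed column and hit the partial index directly—for example (the UUID literal is just a placeholder):

-- Uses idx_events_user_id instead of parsing JSONB per row
SELECT global_position, event_type, created_at
FROM events
WHERE user_id = '8b1c2f44-0000-4000-8000-000000000000'
ORDER BY global_position;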

Consider creating covering indexes for your most common projection queries:

003_covering_indexes.sql
-- Covering index for projection catch-up queries.
-- The INCLUDEd columns let the query be satisfied from the index alone.
-- Caution: included columns count against the B-tree index row size limit
-- (roughly a third of a page), so omit data here if your events can be large.
CREATE INDEX idx_events_position_covering
    ON events (global_position)
    INCLUDE (stream_id, event_type, data);

Covering indexes eliminate heap fetches for queries that only need the included columns. For projection workers processing thousands of events per second, this reduces I/O significantly.

Partitioning strategy:

Time-based partitioning (shown above) works best when:

  • You have time-based retention policies
  • Most queries filter by time range
  • You need to archive old data efficiently

Partition by aggregate type when:

  • Different aggregates have vastly different event volumes
  • Queries almost always filter by stream type
  • You want to scale specific aggregate types independently

The choice affects query planning and maintenance operations. Time-based partitions can be dropped in constant time when data expires—no scanning, no copying, just metadata updates. This makes archival a non-event rather than a migration project.

💡 Pro Tip: Start with time-based partitioning. It’s the most common access pattern and makes archival straightforward. You can always add stream-type partitioning as a second dimension later. Monthly partitions work well for most workloads; switch to weekly if you’re generating millions of events per day.


Implementing Optimistic Concurrency Without Deadlocks

Optimistic concurrency control is the standard pattern for event sourcing: read the current version, perform business logic, attempt to write with an expected version. If someone else wrote in between, retry. This approach assumes conflicts are rare and optimizes for the happy path.

The naive approach uses SELECT FOR UPDATE:

naive_append.py
# DON'T DO THIS - creates unnecessary lock contention
async def append_event_naive(conn, stream_id: str, event: dict, expected_version: int):
    async with conn.transaction():
        # This lock blocks all other writers to this stream.
        # (FOR UPDATE can't be combined with MAX(), so lock the latest row instead.)
        row = await conn.fetchrow(
            """
            SELECT version FROM events
            WHERE stream_id = $1
            ORDER BY version DESC
            LIMIT 1
            FOR UPDATE
            """,
            stream_id
        )
        current_version = row['version'] if row else 0
        if current_version != expected_version:
            raise ConcurrencyError(f"Expected {expected_version}, got {current_version}")
        await conn.execute(
            "INSERT INTO events (stream_id, version, ...) VALUES ($1, $2, ...)",
            stream_id, expected_version + 1, ...
        )

This works but serializes all writes to a stream. Under high concurrency, you’re essentially single-threading your write path. The lock is held for the entire transaction, including network round trips and any business logic. If your transaction takes 50ms, you’ve capped throughput at 20 writes per second per stream—regardless of how fast your database actually is.

Worse, SELECT FOR UPDATE across multiple streams in different orders leads to deadlocks. Transaction A locks stream 1, transaction B locks stream 2, then A wants stream 2 while B wants stream 1. PostgreSQL detects this and aborts one transaction, but the retry logic adds latency and complexity.

The better approach: let PostgreSQL’s unique constraint do the work.

event_store.py
import asyncpg
from dataclasses import dataclass
from typing import Optional
import json


@dataclass
class Event:
    stream_id: str
    stream_type: str
    event_type: str
    data: dict
    metadata: Optional[dict] = None


class ConcurrencyError(Exception):
    """Raised when expected version doesn't match current version."""
    pass


class EventStore:
    def __init__(self, pool: asyncpg.Pool):
        self.pool = pool

    async def append(
        self,
        stream_id: str,
        stream_type: str,
        events: list[Event],
        expected_version: int
    ) -> int:
        """
        Append events to a stream with optimistic concurrency.
        Returns the new version number after appending.
        Raises ConcurrencyError if expected_version doesn't match.
        """
        if not events:
            return expected_version
        async with self.pool.acquire() as conn:
            try:
                # Build a single INSERT with multiple VALUES
                values = []
                params = []
                param_idx = 1
                for i, event in enumerate(events):
                    version = expected_version + i + 1
                    values.append(
                        f"(${param_idx}, ${param_idx+1}, ${param_idx+2}, "
                        f"${param_idx+3}, ${param_idx+4}, ${param_idx+5})"
                    )
                    params.extend([
                        stream_id,
                        stream_type,
                        version,
                        event.event_type,
                        json.dumps(event.data),
                        json.dumps(event.metadata or {})
                    ])
                    param_idx += 6
                query = f"""
                    INSERT INTO events
                        (stream_id, stream_type, version, event_type, data, metadata)
                    VALUES {', '.join(values)}
                    RETURNING version
                """
                rows = await conn.fetch(query, *params)
                return rows[-1]['version']
            except asyncpg.UniqueViolationError:
                # Another writer got there first
                current = await conn.fetchval(
                    "SELECT MAX(version) FROM events WHERE stream_id = $1",
                    stream_id
                )
                raise ConcurrencyError(
                    f"Stream {stream_id}: expected version {expected_version}, "
                    f"current version is {current}"
                )

    async def read_stream(
        self,
        stream_id: str,
        from_version: int = 0
    ) -> list[dict]:
        """Read all events from a stream starting at a version."""
        async with self.pool.acquire() as conn:
            rows = await conn.fetch(
                """
                SELECT version, event_type, data, metadata, created_at
                FROM events
                WHERE stream_id = $1 AND version > $2
                ORDER BY version
                """,
                stream_id, from_version
            )
            return [dict(row) for row in rows]

The key insight: the INSERT either succeeds entirely or fails with a unique constraint violation. No locks held during business logic processing. No deadlock risk. The retry logic happens at the application layer where you have full control:

retry_handler.py
import asyncio
from typing import Awaitable, Callable, TypeVar

from event_store import ConcurrencyError

T = TypeVar('T')


async def with_retry(
    operation: Callable[[], Awaitable[T]],
    max_attempts: int = 3,
    base_delay: float = 0.1
) -> T:
    """Execute operation with exponential backoff on concurrency errors."""
    for attempt in range(max_attempts):
        try:
            return await operation()
        except ConcurrencyError:
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay)

The exponential backoff prevents thundering herd problems. If multiple writers conflict simultaneously, they’ll retry at staggered intervals rather than all retrying at once and conflicting again.

For high-contention streams where conflicts are common, consider adding jitter to the delay:

import random
delay = base_delay * (2 ** attempt) * (0.5 + random.random())

This spreads retries more evenly and reduces the chance of repeated conflicts.

⚠️ Warning: Always reload the aggregate state before retrying. The business rules may produce different events given the new state. Never just replay the same events with a higher version number. If you accepted an order at quantity 5 but someone else reduced inventory to 3, your retry must re-evaluate whether the order can still be fulfilled.
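
Putting the pieces together, a command handler can wrap the whole read–decide–append cycle in with_retry so that every attempt re-reads the stream first. A sketch using the EventStore and with_retry helpers above—rebuild_order_state and decide_add_item are hypothetical stand-ins for your own event fold and business rules:

add_item_handler.py
from event_store import EventStore
from retry_handler import with_retry

async def handle_add_item(store: EventStore, stream_id: str, item: dict) -> int:
    async def attempt() -> int:
        # Reload on every attempt so the decision sees the latest state
        history = await store.read_stream(stream_id)
        state = rebuild_order_state(history)       # hypothetical fold over past events
        new_events = decide_add_item(state, item)  # hypothetical business rules
        return await store.append(
            stream_id=stream_id,
            stream_type="Order",
            events=new_events,
            expected_version=len(history),  # versions start at 1, so count == current version
        )

    return await with_retry(attempt)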


Projections That Keep Up With Your Write Load

Projections transform your event stream into queryable read models. The challenge: keeping them consistent while handling high throughput. Your choices here affect both performance and correctness guarantees.

Synchronous projections update in the same transaction as the event append:

sync_projection.py
async def place_order(conn, order_id: str, items: list[dict]):
    async with conn.transaction():
        # Append the event
        await conn.execute(
            "INSERT INTO events (...) VALUES (...)",
            ...
        )
        # Update projection in same transaction
        await conn.execute(
            """
            INSERT INTO order_summaries (order_id, item_count, total, status)
            VALUES ($1, $2, $3, 'placed')
            ON CONFLICT (order_id) DO UPDATE SET
                item_count = EXCLUDED.item_count,
                total = EXCLUDED.total,
                status = EXCLUDED.status
            """,
            order_id, len(items), sum(i['price'] for i in items)
        )

The upside: guaranteed consistency. The read model reflects the write immediately. Users placing an order can see it in their order list without any delay. For user-facing operations where stale reads cause confusion or support tickets, this matters.

The downside: write latency increases with every projection, and a slow projection blocks all writes. If you have ten projections, each adding 5ms, your write latency just increased by 50ms. A single projection hitting a slow query path—perhaps due to missing indexes or a lock conflict—can cascade into system-wide degradation.

Synchronous projections also create coupling. Adding a new projection requires changing the write path. Removing one requires a code change. Your event store becomes a transaction coordinator, not just a log.

Asynchronous projections decouple write latency from projection updates:

projection_worker.py
import asyncio
import asyncpg
from dataclasses import dataclass
from typing import Callable, Dict
import json


@dataclass
class Checkpoint:
    projection_name: str
    last_position: int


class ProjectionWorker:
    def __init__(
        self,
        pool: asyncpg.Pool,
        projection_name: str,
        handlers: Dict[str, Callable]
    ):
        self.pool = pool
        self.projection_name = projection_name
        self.handlers = handlers  # event_type -> handler function

    async def run(self, batch_size: int = 100):
        """Process events in batches, updating checkpoint after each batch."""
        while True:
            processed = await self._process_batch(batch_size)
            if processed < batch_size:
                # Caught up - wait for notifications or poll
                await asyncio.sleep(0.1)

    async def _process_batch(self, batch_size: int) -> int:
        async with self.pool.acquire() as conn:
            # Get current checkpoint
            checkpoint = await conn.fetchval(
                """
                SELECT last_position FROM projection_checkpoints
                WHERE projection_name = $1
                """,
                self.projection_name
            ) or 0
            # Fetch next batch of events
            events = await conn.fetch(
                """
                SELECT global_position, stream_id, event_type, data, metadata
                FROM events
                WHERE global_position > $1
                ORDER BY global_position
                LIMIT $2
                """,
                checkpoint, batch_size
            )
            if not events:
                return 0
            # Process events in a transaction
            async with conn.transaction():
                for event in events:
                    handler = self.handlers.get(event['event_type'])
                    if handler:
                        await handler(
                            conn,
                            event['stream_id'],
                            json.loads(event['data']),
                            json.loads(event['metadata'])
                        )
                # Update checkpoint
                new_position = events[-1]['global_position']
                await conn.execute(
                    """
                    INSERT INTO projection_checkpoints (projection_name, last_position)
                    VALUES ($1, $2)
                    ON CONFLICT (projection_name) DO UPDATE SET
                        last_position = EXCLUDED.last_position,
                        updated_at = now()
                    """,
                    self.projection_name, new_position
                )
            return len(events)

The checkpoint table is critical:

004_projection_checkpoints.sql
CREATE TABLE projection_checkpoints (
    projection_name VARCHAR(255) PRIMARY KEY,
    last_position BIGINT NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

This gives you exactly-once processing semantics without distributed transactions. If the worker crashes mid-batch, the transaction rolls back, and the next run restarts from the last committed checkpoint. No events are skipped. No events are processed twice. The checkpoint and projection updates are atomic.

Batching matters. Processing events one at a time incurs massive overhead—network round trips, transaction commits, WAL flushes. Batching 100-500 events per transaction can improve throughput by 10-50x.

However, batch size is a tradeoff. Larger batches mean longer transactions, which hold locks longer and delay checkpoint updates. If your worker crashes mid-batch of 1000 events, you’ll replay all 1000 on restart. Find the sweet spot for your workload through measurement.

For projections that aggregate across many events (like daily totals), consider a two-phase approach: accumulate changes in memory across a batch, then write a single update at the end. This reduces database round trips dramatically.
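
A sketch of that two-phase shape for a daily-totals read model—it assumes the batch rows include created_at and that OrderPlaced payloads carry a total field; the daily_order_totals table is illustrative:

daily_totals_projection.py
import json
from collections import defaultdict
from datetime import date

import asyncpg

async def project_daily_totals(conn: asyncpg.Connection, events: list) -> None:
    """Accumulate a whole batch in memory, then write one row per touched day."""
    totals: dict[date, float] = defaultdict(float)
    for event in events:
        if event['event_type'] == 'OrderPlaced':
            data = json.loads(event['data'])
            totals[event['created_at'].date()] += data['total']
    for day, amount in totals.items():
        await conn.execute(
            """
            INSERT INTO daily_order_totals (day, total)
            VALUES ($1, $2)
            ON CONFLICT (day) DO UPDATE SET
                total = daily_order_totals.total + EXCLUDED.total
            """,
            day, amount
        )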

📝 Note: For projections that can tolerate staleness, consider using materialized views with REFRESH MATERIALIZED VIEW CONCURRENTLY. This shifts the CPU cost to scheduled refreshes rather than per-event updates. It works well for analytics dashboards that don’t need real-time accuracy.
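
A minimal sketch of that pattern—the view definition is illustrative; the one real requirement is the unique index, which CONCURRENTLY needs:

CREATE MATERIALIZED VIEW order_stats AS
SELECT data->>'status' AS status, count(*) AS orders
FROM events
WHERE event_type = 'OrderPlaced'
GROUP BY 1;

-- REFRESH ... CONCURRENTLY requires a unique index on the view
CREATE UNIQUE INDEX order_stats_status_idx ON order_stats (status);

-- Run from cron or pg_cron; readers aren't blocked during the refresh
REFRESH MATERIALIZED VIEW CONCURRENTLY order_stats;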


Real-Time Subscriptions with LISTEN/NOTIFY

PostgreSQL’s NOTIFY/LISTEN provides real-time event streaming without external infrastructure. A trigger fires on every insert, broadcasting the event to all connected subscribers. This eliminates the need for polling and reduces latency to milliseconds.

005_event_notifications.sql
-- Create a function to broadcast new events
CREATE OR REPLACE FUNCTION notify_new_event()
RETURNS TRIGGER AS $$
DECLARE
    payload JSON;
BEGIN
    -- Build a minimal payload (NOTIFY has an 8KB limit)
    payload := json_build_object(
        'position', NEW.global_position,
        'stream_id', NEW.stream_id,
        'event_type', NEW.event_type
    );
    -- Broadcast to the 'new_events' channel
    PERFORM pg_notify('new_events', payload::TEXT);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Attach trigger to events table
CREATE TRIGGER events_notify_trigger
AFTER INSERT ON events
FOR EACH ROW
EXECUTE FUNCTION notify_new_event();

Notifications are transactional. pg_notify queues the message inside the current transaction, and PostgreSQL delivers it only after that transaction commits. If the transaction rolls back, no notification is sent. This prevents subscribers from acting on events that never actually persisted.

The subscriber listens for notifications and processes them:

event_subscriber.py
import asyncio
import asyncpg
import json
from typing import Callable


class EventSubscriber:
    def __init__(self, pool: asyncpg.Pool, event_store):
        self.pool = pool
        self.event_store = event_store
        self.last_position = 0
        self._handler = None

    async def subscribe(self, handler: Callable, from_position: int = 0):
        """
        Subscribe to new events with automatic catch-up.
        Handles reconnection and missed events by tracking position.
        """
        self._handler = handler
        self.last_position = from_position
        # First, catch up on any missed events
        await self._catch_up(handler)
        # Then listen for new events
        async with self.pool.acquire() as conn:
            await conn.add_listener('new_events', self._on_notify)
            try:
                while True:
                    # Process any pending notifications
                    await asyncio.sleep(0.01)
            finally:
                await conn.remove_listener('new_events', self._on_notify)

    async def _catch_up(self, handler: Callable):
        """Process events from last_position to current."""
        async with self.pool.acquire() as conn:
            while True:
                events = await conn.fetch(
                    """
                    SELECT global_position, stream_id, event_type, data
                    FROM events
                    WHERE global_position > $1
                    ORDER BY global_position
                    LIMIT 1000
                    """,
                    self.last_position
                )
                if not events:
                    break
                for event in events:
                    await handler(event)
                    self.last_position = event['global_position']

    def _on_notify(self, conn, pid, channel, payload):
        """Handle incoming notification."""
        data = json.loads(payload)
        # Queue for processing (don't do heavy work in callback)
        asyncio.create_task(self._process_notification(data))

    async def _process_notification(self, data: dict):
        """Fetch anything newer than the last processed position."""
        if data['position'] > self.last_position:
            await self._catch_up(self._handler)

The 8KB limit is the main constraint. A notification payload cannot exceed roughly 8KB after JSON encoding—PostgreSQL rejects anything larger outright. That’s why we only send the position and stream ID; the subscriber fetches the full event data separately. Sending full event payloads would make the append itself fail whenever an event grows too large, because the trigger’s pg_notify error aborts the transaction.

This pattern also improves efficiency. The notification is a hint that new data exists, not the data itself. The subscriber can batch multiple notifications and fetch all pending events in a single query.
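
One way to exploit that, sketched below, is to treat each notification purely as a wake-up signal: the callback sets an asyncio.Event, and a single loop drains everything since the last position in one catch-up pass. The class is illustrative; catch_up would be something like the subscriber's _catch_up above.

notification_wakeup.py
import asyncio

class CoalescingWakeup:
    """Collapse bursts of NOTIFY callbacks into single catch-up passes."""

    def __init__(self):
        self._wakeup = asyncio.Event()

    def on_notify(self, conn, pid, channel, payload):
        # Cheap and synchronous: just record that new events exist
        self._wakeup.set()

    async def run(self, catch_up, idle_poll_seconds: float = 5.0):
        """Wait for a signal (or a poll timeout as a safety net), then catch up once."""
        while True:
            try:
                await asyncio.wait_for(self._wakeup.wait(), timeout=idle_poll_seconds)
            except asyncio.TimeoutError:
                pass  # periodic poll covers anything a lost notification missed
            self._wakeup.clear()
            await catch_up()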

Handling missed events:

Notifications are fire-and-forget. If your subscriber is disconnected, anything sent during that window is never delivered to it—there is no replay for sessions that weren’t listening. The solution: always track the last processed position and catch up on reconnection. The subscription pattern above demonstrates this—_catch_up runs before entering the listen loop.

This catch-up-then-listen pattern is essential. Never assume LISTEN alone is sufficient. Treat notifications as an optimization to reduce latency, not as a reliability mechanism.

Scaling beyond single-node:

NOTIFY/LISTEN only works within a single PostgreSQL instance. For read replicas or multiple databases:

  1. Use logical replication to stream the events table to subscribers
  2. Each subscriber maintains its own checkpoint
  3. The trigger fires on the primary, but subscribers read from their local replica

Logical replication streams committed row changes for specific tables in commit order, rather than shipping raw WAL segments. Subscribers can apply changes at their own pace and maintain independent positions.
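
The setup itself is standard logical replication; a sketch, with illustrative names and connection string:

-- On the primary (requires wal_level = logical)
CREATE PUBLICATION events_pub FOR TABLE events
    WITH (publish_via_partition_root = true);  -- present partitions as one table

-- On each subscriber database
CREATE SUBSCRIPTION events_sub
    CONNECTION 'host=primary.internal dbname=app user=replicator'
    PUBLICATION events_pub;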

For geographically distributed systems, consider adding a message queue (like Kafka or SQS) as a fan-out layer. The PostgreSQL trigger publishes to the queue, and subscribers in different regions consume from their nearest queue endpoint.


Operational Realities: Maintenance and Monitoring

Running an event store in production requires different operational practices than typical OLTP workloads. The append-only pattern creates specific challenges that default PostgreSQL configurations don’t handle well.

Vacuum tuning for append-only workloads:

Standard autovacuum settings assume tables have mixed insert/update/delete patterns. An event store is almost entirely inserts, so the dead-tuple-based triggers rarely fire on their own—yet the table still needs regular vacuuming to keep the visibility map current for index-only scans, to freeze old tuples before transaction-ID wraparound, and to clean up after rolled-back appends. Tune these parameters on each partition (autovacuum works on leaf partitions, not the partitioned parent):

-- More aggressive autovacuum, applied to each events partition
ALTER TABLE events_2026_01 SET (
    autovacuum_vacuum_scale_factor = 0.01,          -- Trigger at 1% dead tuples
    autovacuum_vacuum_insert_scale_factor = 0.01,   -- Also trigger on inserts (PG 13+)
    autovacuum_analyze_scale_factor = 0.005,
    autovacuum_vacuum_cost_delay = 1                -- Less throttling (default is 2ms)
);
-- Repeat for each partition (script this alongside partition creation)

The default autovacuum_vacuum_scale_factor of 0.2 means vacuum doesn’t trigger until 20% of the table is dead tuples. For a 100 million row table, that’s 20 million dead rows before cleanup starts. By then, you’ve accumulated massive bloat.

Monitor vacuum progress with:

SELECT relname,
       n_dead_tup,
       n_live_tup,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
WHERE relname LIKE 'events%';

If n_dead_tup consistently grows between vacuum runs, your autovacuum isn’t keeping up. Consider adding more autovacuum workers or reducing the cost delay further.
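
Both knobs are instance-wide settings; a sketch of raising them (the values are illustrative starting points, not recommendations):

ALTER SYSTEM SET autovacuum_max_workers = 6;          -- takes effect after a restart
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000; -- more work per cost-limited round
SELECT pg_reload_conf();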

Key metrics to monitor:

| Metric                    | Warning Threshold   | Critical Threshold  |
|---------------------------|---------------------|---------------------|
| Replication lag           | > 10 seconds        | > 60 seconds        |
| Projection checkpoint lag | > 10,000 events     | > 100,000 events    |
| Event throughput drop     | > 20% from baseline | > 50% from baseline |
| Dead tuple ratio          | > 10%               | > 25%               |
| Table bloat               | > 20%               | > 50%               |

Projection checkpoint lag is the most important operational metric. If projections fall behind, your read models become stale, and users see inconsistent data. Set up alerts:

-- Query for projection health dashboard
SELECT
    pc.projection_name,
    pc.last_position,
    (SELECT MAX(global_position) FROM events) - pc.last_position AS lag,
    pc.updated_at,
    now() - pc.updated_at AS time_since_update
FROM projection_checkpoints pc
ORDER BY lag DESC;

A projection that hasn’t updated in several minutes while events continue flowing indicates a stuck worker—likely a bug, a deadlock, or a crashed process.

Track write throughput as a leading indicator. A sudden drop often precedes problems: disk filling up, connection pool exhaustion, or a slow query monopolizing resources. Baseline your normal throughput and alert on significant deviations.
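
A simple way to baseline it is to count recent appends and chart the result (the window is arbitrary):

-- Events appended in the last five minutes, for comparison against baseline
SELECT count(*) AS events_last_5m
FROM events
WHERE created_at > now() - interval '5 minutes';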

Archiving old events:

Events older than your retention period can move to cold storage. With time-based partitioning, this is straightforward:

  1. Detach the partition: ALTER TABLE events DETACH PARTITION events_2024_01;
  2. Export to S3/blob storage: pg_dump or COPY TO
  3. Drop the partition: DROP TABLE events_2024_01;

The detach operation is instant. Your main table remains available throughout. No table locks, no downtime.

Before archiving, verify no projections need those events. Query your checkpoint table—if any projection’s last_position falls within the partition being archived, that projection will break. Either rebuild the projection first or keep the partition longer.
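
A quick safety check before detaching (partition name illustrative): confirm every projection has advanced past the highest position stored in it.

-- Safe to archive only if caught_up is true for every projection
SELECT pc.projection_name,
       pc.last_position >= old.max_position AS caught_up
FROM projection_checkpoints pc,
     (SELECT max(global_position) AS max_position FROM events_2024_01) AS old;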

Backup considerations:

Point-in-time recovery is essential for event stores. A corrupted or missing event breaks the entire history. Configure WAL archiving and test restores regularly:

Terminal window
# Test restore to a specific point: restore a base backup into $PGDATA, then
# set the recovery target and start the server (PostgreSQL 12+; paths illustrative)
echo "restore_command = 'cp /wal_archive/%f %p'"            >> "$PGDATA/postgresql.conf"
echo "recovery_target_time = '2026-01-15 14:30:00'"         >> "$PGDATA/postgresql.conf"
touch "$PGDATA/recovery.signal"
pg_ctl start -D "$PGDATA"

Unlike CRUD systems where losing recent data means losing recent state, losing events means losing the ability to reconstruct any state. Your backup strategy must match this criticality.

⚠️ Warning: Never delete events without archiving them first. Unlike CRUD systems where you can often reconstruct state, deleted events are gone forever. The audit trail is the system. A compliance auditor asking about transactions from 18 months ago won’t accept “we deleted those” as an answer.


Migration Path: From Monolith to Event-Sourced

Adopting event sourcing doesn’t require a big-bang rewrite. The strangler pattern lets you migrate incrementally, aggregate by aggregate. You run the old and new systems in parallel, gradually shifting traffic and trust.

Phase 1: Identify the candidate aggregate

Pick an aggregate with:

  • Clear boundaries (doesn’t touch half the database)
  • High business value for historical tracking
  • Moderate complexity (not your simplest or most complex domain)
  • Active development (you’ll touch this code anyway)

Order management is often a good starting point—clear lifecycle, important audit trail, bounded interactions. Avoid core entities that everything else depends on; their fan-out makes parallel operation complex.

Phase 2: Dual-write during transition

The existing system continues to operate while you build the event-sourced version alongside it:

┌─────────────────┐      ┌─────────────────┐
│   API Request   │─────▶│ Command Handler │
└─────────────────┘      └────────┬────────┘
                                  │
                     ┌────────────┴────────────┐
                     ▼                         ▼
             ┌───────────────┐         ┌───────────────┐
             │   Legacy DB   │         │  Event Store  │
             │   (source     │         │   (target     │
             │   of truth)   │         │   validation) │
             └───────────────┘         └───────────────┘

Write to both systems, but read from the legacy system. Compare the event-sourced projections against the legacy data to validate correctness:

# Nightly validation job
async def validate_order_projections():
    legacy_orders = await legacy_db.fetch("SELECT * FROM orders")
    for order in legacy_orders:
        projected = await projection_db.fetchrow(
            "SELECT * FROM order_projections WHERE id = $1",
            order['id']
        )
        if not projections_match(order, projected):
            log.error(f"Mismatch for order {order['id']}")
            # Alert, but don't fail - you're validating, not blocking

Start with writes that can fail silently. If the event store write fails, log the error but let the request succeed. You’re building confidence in the new system, not depending on it yet.
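
A sketch of that dual-write shape—the legacy repository, order object, and serializer are illustrative stand-ins for your existing code:

dual_write_orders.py
import logging

from event_store import Event, EventStore

log = logging.getLogger("migration")

async def place_order_dual_write(legacy_repo, store: EventStore, order) -> None:
    # Legacy path stays the source of truth
    await legacy_repo.save_order(order)
    try:
        await store.append(
            stream_id=str(order.id),
            stream_type="Order",
            events=[Event(
                stream_id=str(order.id),
                stream_type="Order",
                event_type="OrderPlaced",
                data=order.to_event_payload(),
            )],
            expected_version=0,  # brand-new stream for a new order
        )
    except Exception:
        # Shadow write only: never fail the request because of the new store
        log.exception("event store dual-write failed for order %s", order.id)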

As validation passes over days and weeks, increase the strictness. Fail requests where the event store write fails. Then start reading from projections for non-critical paths. Finally, flip the source of truth.

Phase 3: Flip the switch

Once validation passes consistently:

  1. Make the event store the source of truth
  2. Keep dual-writes to legacy for rollback safety
  3. Migrate dependent systems to read from projections
  4. Eventually deprecate the legacy tables

Keep the legacy tables around longer than feels necessary. Six months of overlap costs little but provides insurance. When a bug emerges in the event-sourced system, you can cross-reference against the legacy data to understand what went wrong.

This approach takes longer than a rewrite but carries far less risk. You can stop at any phase if priorities change, and the legacy system remains fully functional throughout. Teams that attempt big-bang event sourcing migrations often discover issues only after committing—dual-write lets you discover them while you still have options.


Key Takeaways

  • Start with time-based partitioning on your events table from day one—retrofitting partitioning on a large table requires downtime and careful planning that’s better avoided entirely.

  • Implement projection checkpointing before you need it; rebuilding projections from scratch becomes impractical past a few million events, and you’ll want the exactly-once processing guarantees from the start.

  • Use LISTEN/NOTIFY for real-time updates but always design projections to handle replay from checkpoints—notifications are unreliable during disconnections, and your recovery path must work without them.

  • Monitor vacuum activity and autovacuum settings aggressively—append-only workloads have different tuning needs than typical OLTP, and default settings will cause problems at scale.

  • Adopt event sourcing incrementally using the strangler pattern—validate your event store against existing systems before trusting it as the source of truth.