
NATS JetStream in Practice: Persistent Messaging and Event Replay


Your microservice restarts. Takes thirty seconds. In that window, 10,000 events fired across the system—order confirmations, inventory updates, payment callbacks. With core NATS pub/sub, every single one of them is gone. No buffer, no replay, no recovery. The producers didn’t wait, the broker didn’t store, and your service woke up blind to everything that happened while it was down.

This isn’t a theoretical edge case. It’s the default behavior of a fire-and-forget messaging model, and it surfaces exactly when you can least afford it: deployments, crashes, scaling events, network partitions. Teams working with core NATS often paper over this with application-level workarounds—polling databases for missed state, building ad-hoc retry queues, or simply accepting data loss as an operational reality. None of these are solutions. They’re symptoms of reaching for the wrong tool.

JetStream is NATS’s answer to this class of problems, built directly into the server rather than bolted on as a separate system. It brings persistence, consumer tracking, and replay guarantees to the same infrastructure you’re already running—without the operational complexity of Kafka or the protocol overhead of RabbitMQ. The distinction matters: JetStream isn’t a plugin or a sidecar. It’s a first-class feature of the NATS server, which means zero additional infrastructure and tight integration with the existing security and clustering model.

Before diving into streams and consumers, it’s worth understanding exactly why core NATS falls short—and what JetStream actually adds at the architectural level.

Why Core NATS Isn’t Enough: The Case for JetStream

NATS is fast. Benchmarks routinely show it delivering millions of messages per second with sub-millisecond latency, and its operational simplicity is genuinely refreshing compared to heavier brokers. But that speed comes with a fundamental constraint: Core NATS is a pure pub/sub system with no memory. A message published to a subject is delivered to whatever subscribers are connected at that instant, then discarded. If your consumer is offline, restarting, or simply slow, that message is gone.

Visual: core NATS fire-and-forget vs JetStream persistent delivery model

This fire-and-forget model is perfect for telemetry, live dashboards, and service-to-service RPC. It is a liability the moment you need guaranteed delivery, ordered processing, or the ability to replay history.

The Gap JetStream Fills

JetStream is NATS’s persistence layer, built directly into the NATS server binary. It adds three capabilities that Core NATS deliberately omits:

Persistence. Messages are written to disk (or held in memory with configurable limits) and retained according to stream policies you define—by message count, total byte size, or age. A consumer that reconnects after a two-hour outage picks up exactly where it left off.

Consumer tracking. JetStream maintains per-consumer acknowledgment state on the server side. It knows which messages each named consumer has processed, retries unacknowledged messages after a configurable timeout, and surfaces this state through its API. You do not need to implement any of this in your application.

Replay. Any consumer—whether catching up from a crash or a brand-new service being onboarded—can start reading from the beginning of a stream, from a specific sequence number, or from a point in time. This is a first-class feature, not a workaround.

Why Not NATS Streaming (Stan)?

NATS Streaming, known as Stan, was the earlier attempt to add persistence to NATS. It was implemented as a separate process that sat in front of the NATS server, introduced its own protocol, and carried significant operational complexity. The NATS team deprecated Stan in 2023. JetStream replaces it entirely with an architecture that lives inside the server itself, uses the standard NATS protocol, and eliminates the split-brain failure modes Stan was prone to under network partitions.

JetStream vs. Kafka, RabbitMQ, and Pulsar

Kafka wins when you need petabyte-scale log retention, a mature ecosystem of connectors, or stream processing via Kafka Streams. Pulsar is worth considering if you need multi-tenancy at the namespace level baked into the broker. RabbitMQ remains a solid choice for complex routing topologies built around AMQP.

JetStream earns its place when your requirements include reliable at-least-once delivery, event replay, and durable consumers—but you want to run a single binary with a fraction of the operational surface area. A three-node JetStream cluster on Kubernetes requires no ZooKeeper, no separate schema registry, and no JVM tuning. For teams that already use NATS for its speed and simplicity, adding JetStream is an upgrade, not an architectural replacement.

With the motivation clear, the next section maps out the core building blocks—streams, subjects, and consumers—before writing any configuration or code.

JetStream Core Concepts: Streams, Subjects, and Consumers

JetStream extends NATS with a persistence layer built directly into the server. Before writing a single line of configuration, understanding the three fundamental primitives—streams, subjects, and consumers—eliminates most of the confusion that comes from treating JetStream like a drop-in Kafka replacement. It isn’t. The mental model is different, and once that model clicks, configuration choices become obvious.

Visual: JetStream stream, subject binding, and consumer relationship diagram

Streams Are Append-Only Logs Bound to Subject Patterns

A stream is a persistent, ordered sequence of messages captured from one or more NATS subjects. When a publisher sends a message to orders.created, JetStream intercepts it (if a stream is configured to capture that subject) and appends it to the stream’s log before delivering it to any subscribers. The original message still flows through core NATS, but JetStream now owns a durable copy.

Streams bind to subject patterns, not individual subjects. A single stream can capture orders.* and thereby absorb orders.created, orders.updated, and orders.cancelled into one ordered log. This is a meaningful architectural decision: it determines how you replay history, how you enforce limits, and how consumers traverse the message sequence.

Retention Policies Define What Gets Kept

Every stream has a retention policy that governs when messages are discarded:

Limits-based retention (the default) keeps messages until a threshold is crossed—maximum message count, total byte size, or message age. Once the limit is hit, the oldest messages are dropped. This behaves like a sliding window and is appropriate for telemetry or log-style workloads where historical depth matters but unbounded growth doesn’t.

Interest-based retention discards a message only after all bound consumers have acknowledged it. If no consumers exist, messages are not retained at all. This mode suits fan-out scenarios where every downstream service must process each event, but you don’t want the stream to accumulate data indefinitely once consumers are caught up.

Work-queue retention combines persistence with exclusive consumption: a message is removed from the stream the moment any consumer acknowledges it. This turns JetStream into a distributed task queue with durability guarantees.

Push vs. Pull Consumers

Consumers define how a client reads from a stream. JetStream supports two delivery modes.

Push consumers have the server deliver messages to a NATS subject, which the client subscribes to. The server controls the flow. This works well for low-latency pipelines where the consumer is always running and can keep up with message volume.

Pull consumers require the client to explicitly request messages in batches. The client controls the pace. This is the correct choice for worker pools, batch processing jobs, and any scenario where consumers scale horizontally or have variable processing capacity. Pull consumers also handle back-pressure naturally—if workers are busy, they simply stop requesting.

Durable Consumers and Sequence Tracking

A consumer becomes durable when it has a name. That name anchors the consumer’s state on the server: its current position in the stream, any unacknowledged messages, and delivery metadata. If the client disconnects and reconnects—even days later—it resumes from where it left off. Anonymous (ephemeral) consumers exist only for the lifetime of the subscription.

Sequence tracking is per-stream. JetStream assigns a monotonically increasing sequence number to every message, and the consumer tracks which sequence it last delivered. This makes delivery policies precise:

  • all — replay from the beginning of the stream
  • new — receive only messages published after the consumer is created
  • last — start with the single most recent message
  • by-start-sequence — begin at an explicit sequence number
  • by-start-time — begin with the first message at or after a given timestamp

💡 Pro Tip: Delivery policies are set at consumer creation time and cannot be changed afterward. If you need to reposition a consumer, you create a new one with the desired policy—which is exactly how event replay is triggered in practice.

With streams, retention, and consumer mechanics established, the next step is getting JetStream running in a production-grade Kubernetes cluster where these abstractions translate into concrete Helm values and resource configurations.

Deploying a JetStream Cluster on Kubernetes with Helm

A single-node NATS deployment is fine for local development, but production workloads require replication, persistence across pod restarts, and leader election. This section walks through deploying a three-node JetStream cluster using the official NATS Helm chart, with persistent volumes and Raft-based clustering configured correctly from the start.

Add the Helm Repository and Configure Values

First, add the NATS Helm repository and pull down the default values file to customize:

Terminal window
helm repo add nats https://nats-io.github.io/k8s/helm/charts/
helm repo update

Create a values.yaml that enables JetStream, configures clustering, and attaches persistent storage:

nats-values.yaml
config:
  cluster:
    enabled: true
    replicas: 3
    name: nats-cluster
  jetstream:
    enabled: true
    fileStore:
      enabled: true
      dir: /data/jetstream
      pvc:
        enabled: true
        size: 10Gi
        storageClassName: standard
container:
  image:
    tag: 2.10.14-alpine
natsBox:
  enabled: true
podDisruptionBudget:
  enabled: true
  minAvailable: 2

The cluster.replicas: 3 field instructs the chart to deploy a StatefulSet with three pods. The Raft protocol handles meta-leader election automatically — no external coordination is required. Each pod gets its own PVC, so stream data survives pod restarts and rescheduling without manual intervention. Setting minAvailable: 2 in the PodDisruptionBudget ensures the cluster maintains quorum during rolling upgrades or node drains. With a three-node cluster, losing more than one member simultaneously would break quorum and halt stream operations, so this budget acts as a hard safety rail.
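The quorum arithmetic behind that safety rail is simple majority math. A quick sketch (the function names are ours, not from the NATS codebase):

```go
package main

import "fmt"

// quorum is the minimum number of live Raft peers required for the
// JetStream meta-group (or a replicated stream) to make progress.
func quorum(replicas int) int { return replicas/2 + 1 }

// faultTolerance is how many members can fail simultaneously before
// stream operations halt.
func faultTolerance(replicas int) int { return replicas - quorum(replicas) }

func main() {
	for _, n := range []int{3, 5} {
		fmt.Printf("R%d: quorum=%d, tolerates %d failure(s)\n", n, quorum(n), faultTolerance(n))
	}
}
```

With three replicas, quorum is two, so `minAvailable: 2` keeps exactly enough members running; a five-node cluster would tolerate two simultaneous failures at the cost of more replication traffic.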

Note: Always set storageClassName explicitly. Relying on the default storage class in production clusters often results in provisioning volumes on the wrong tier — unintentionally landing on slower NFS-backed storage instead of SSDs. For JetStream workloads, I/O latency directly impacts consumer delivery throughput, so SSD-backed volumes are strongly recommended.

Deploy and Verify the Cluster

Install the chart into a dedicated namespace:

Terminal window
kubectl create namespace messaging
helm install nats nats/nats \
  --namespace messaging \
  --values nats-values.yaml \
  --version 1.2.4

Pinning --version prevents unintentional chart upgrades during CI re-runs. Wait for all three pods to reach Running state before proceeding:

Terminal window
kubectl rollout status statefulset/nats --namespace messaging

Once the pods are up, use the NATS Box sidecar to run a health check against the cluster:

Terminal window
kubectl exec -n messaging -it deployment/nats-box -- \
nats server list --server nats://nats:4222

Expected output shows all three servers with their cluster roles:

╭──────────────────────────────────────────────────────────────────────────────────╮
│ Server Overview │
├─────────────┬────────────┬──────────┬───────────┬──────┬─────────┬──────────────┤
│ Name │ Cluster │ IP │ Version │ JS │ Streams │ Consumers │
├─────────────┼────────────┼──────────┼───────────┼──────┼─────────┼──────────────┤
│ nats-0 │ nats-cluster│ 10.0.1.4│ 2.10.14 │ true │ 0 │ 0 │
│ nats-1 │ nats-cluster│ 10.0.1.5│ 2.10.14 │ true │ 0 │ 0 │
│ nats-2 │ nats-cluster│ 10.0.1.6│ 2.10.14 │ true │ 0 │ 0 │
╰─────────────┴────────────┴──────────┴───────────┴──────┴─────────┴──────────────╯

The JS: true column confirms JetStream is active on each node. If any pod is missing from this list, check its logs for PVC binding errors before investigating network or configuration issues — volume provisioning failures are the most frequent cause of incomplete cluster formation.

Verify JetStream Is Functional

Run a JetStream health check from within the cluster:

Terminal window
kubectl exec -n messaging -it deployment/nats-box -- \
nats server check jetstream --server nats://nats:4222

A healthy response reports JetStream OK with current memory and storage usage figures. If any node reports disabled, confirm that its PVC reached Bound status:

Terminal window
kubectl get pvc --namespace messaging

A PVC stuck in Pending typically indicates that no storage provisioner matched the requested storageClassName, or that the cluster has exhausted available persistent volume capacity. Resolve the storage issue first — JetStream will initialize automatically once the volume becomes available, without requiring a pod restart.

With three nodes online, Raft-elected leadership established, and persistent volumes attached to each pod, the cluster is ready to handle durable streams. The next section covers creating those streams, defining retention policies, and publishing your first persisted messages.

Creating Streams and Publishing Messages

With your JetStream cluster running, the next step is defining streams and publishing messages that persist beyond the initial publish call. Unlike Core NATS subjects that vanish if no subscriber is listening, a stream captures every message matching its subject filter and holds it according to your retention policy.

Defining a Stream with the NATS CLI

The fastest way to prototype a stream is through the CLI. This command creates a stream named ORDERS that captures all subjects under the orders.> hierarchy:

create-stream.sh
nats stream add ORDERS \
  --subjects "orders.>" \
  --storage file \
  --retention limits \
  --max-age 72h \
  --max-bytes 10GiB \
  --max-msgs 5000000 \
  --replicas 3

--replicas 3 mirrors the stream across all three JetStream nodes, so a single node failure does not interrupt delivery or replay. The limits retention policy discards the oldest messages when any threshold—age, bytes, or count—is exceeded.

Creating Streams Programmatically in Go

CLI commands are useful for exploration, but production services manage their own stream topology at startup. The nats.go client exposes the full JetStream management API:

stream/manager.go
package stream

import (
	"context"
	"time"

	"github.com/nats-io/nats.go/jetstream"
)

func EnsureOrdersStream(ctx context.Context, js jetstream.JetStream) (jetstream.Stream, error) {
	cfg := jetstream.StreamConfig{
		Name:        "ORDERS",
		Description: "Persistent order lifecycle events",
		Subjects:    []string{"orders.>"},
		Storage:     jetstream.FileStorage,
		Retention:   jetstream.LimitsPolicy,
		Replicas:    3,
		MaxAge:      72 * time.Hour,
		MaxBytes:    10 * 1024 * 1024 * 1024, // 10 GiB
		MaxMsgs:     5_000_000,
		Compression: jetstream.S2Compression,
	}
	stream, err := js.CreateOrUpdateStream(ctx, cfg)
	if err != nil {
		return nil, err
	}
	return stream, nil
}

CreateOrUpdateStream is idempotent: it creates the stream on first run and updates the configuration on subsequent calls if the diff is compatible. This makes it safe to call at service startup without guarding against duplicate errors.

Subject Hierarchies and Wildcard Bindings

The orders.> subject filter matches any subject with orders. as its prefix—orders.created, orders.shipped, orders.payment.failed, and any depth beyond that. The single-token wildcard * matches exactly one token, so orders.* captures orders.created but not orders.payment.failed.

Design your subject hierarchy before defining streams. A well-structured hierarchy lets you bind multiple independent streams to non-overlapping slices of the same subject space as your domain evolves.
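The matching rules are easy to model. The sketch below is an illustrative matcher, not code from the NATS server; `matchSubject` is an invented helper.

```go
package main

import (
	"fmt"
	"strings"
)

// matchSubject reports whether a NATS subject matches a filter pattern:
// "*" matches exactly one token, ">" matches one or more trailing tokens.
func matchSubject(filter, subject string) bool {
	f := strings.Split(filter, ".")
	s := strings.Split(subject, ".")
	for i, tok := range f {
		if tok == ">" {
			return i < len(s) // ">" must cover at least one token
		}
		if i >= len(s) {
			return false
		}
		if tok != "*" && tok != s[i] {
			return false
		}
	}
	return len(f) == len(s)
}

func main() {
	fmt.Println(matchSubject("orders.>", "orders.payment.failed"))
	fmt.Println(matchSubject("orders.*", "orders.payment.failed"))
}
```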

Publishing with Headers and Inspecting Metadata

JetStream enriches every message with server-assigned metadata: a monotonically increasing sequence number, the server timestamp, and the stream name. You can also attach your own headers for tracing and idempotency:

publisher/orders.go
package publisher

import (
	"context"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func PublishOrderCreated(ctx context.Context, js jetstream.JetStream, orderID string, payload []byte) error {
	msg := &nats.Msg{
		Subject: "orders.created",
		Data:    payload,
		Header:  nats.Header{},
	}
	msg.Header.Set("X-Order-ID", orderID)
	msg.Header.Set("X-Idempotency-Key", "ord-20240315-"+orderID)

	ack, err := js.PublishMsg(ctx, msg)
	if err != nil {
		return err
	}
	// ack.Sequence is the stream-assigned sequence number;
	// use it for ordered replay or deduplication auditing.
	_ = ack.Sequence
	return nil
}

The PublishMsg call blocks until the stream leader acknowledges persistence to the required number of replicas. This is the publish acknowledgement that Core NATS never provided—your message is durable the moment this call returns without error.

💡 Pro Tip: Set Nats-Msg-Id on the message header to enable JetStream’s built-in deduplication window. The server rejects duplicate publishes with the same ID within the DuplicateWindow period you configure on the stream, eliminating double-publish races without application-level coordination.

Retention Policies in Practice

Limits-based retention keeps storage predictable but silently drops older messages when thresholds are crossed. For audit streams where no message can be lost, switch to InterestPolicy—messages are retained until every bound consumer has acknowledged them—or WorkQueuePolicy for single-consumer fan-out where a message is deleted immediately after one acknowledgement.

With streams capturing and persisting your events, the next step is attaching durable consumers that track per-subscriber progress and guarantee at-least-once delivery even through restarts and network partitions.

Durable Consumers and At-Least-Once Delivery

A consumer in JetStream is more than a subscription—it’s a named, server-side cursor that tracks delivery position and acknowledgment state. When a consumer is durable, that state persists across client restarts, network partitions, and pod evictions. Your processing pipeline picks up exactly where it left off.

Creating a Durable Pull Consumer

Pull consumers give your application explicit control over when messages are fetched, making them the right default for workloads where processing time is variable or resource-bound.

consumer.go
package main

import (
	"context"
	"log"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect("nats://nats.messaging.svc.cluster.local:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	cons, err := js.CreateOrUpdateConsumer(context.Background(), "ORDERS", jetstream.ConsumerConfig{
		Durable:       "orders-processor",
		FilterSubject: "orders.>",
		AckPolicy:     jetstream.AckExplicitPolicy,
		AckWait:       30 * time.Second,
		MaxDeliver:    5,
		BackOff: []time.Duration{
			5 * time.Second,
			15 * time.Second,
			60 * time.Second,
		},
	})
	if err != nil {
		log.Fatal(err)
	}

	for {
		msgs, err := cons.Fetch(10, jetstream.FetchMaxWait(5*time.Second))
		if err != nil {
			log.Printf("fetch failed: %v", err)
			continue
		}
		for msg := range msgs.Messages() {
			if err := processOrder(msg); err != nil {
				msg.Nak()
				continue
			}
			msg.Ack()
		}
	}
}

The Durable field is what makes this consumer survive restarts. JetStream stores the consumer configuration and its delivery sequence on the server—reconnecting clients that reference orders-processor resume from the last acknowledged message.

The Acknowledgment Flow

JetStream exposes four acknowledgment signals, each serving a distinct purpose in failure handling:

Signal      Method             Effect
Ack         msg.Ack()          Message processed successfully; advance the cursor
Nak         msg.Nak()          Processing failed; redeliver after AckWait
InProgress  msg.InProgress()   Reset the AckWait timer; processing continues
Term        msg.Term()         Poison message; discard without redelivery

InProgress is critical for long-running jobs. If your handler calls an external API that takes 45 seconds but AckWait is 30 seconds, the server redelivers the message to another consumer instance before the first finishes. Call msg.InProgress() periodically to extend the deadline:

long_running_handler.go
func processOrder(msg jetstream.Msg) error {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	ticker := time.NewTicker(20 * time.Second)
	defer ticker.Stop()

	done := make(chan error, 1)
	go func() {
		done <- callExternalFulfillmentAPI(ctx, msg.Data())
	}()

	for {
		select {
		case err := <-done:
			return err
		case <-ticker.C:
			msg.InProgress() // extend the AckWait deadline
		}
	}
}

Term handles the case where a message is structurally invalid and retrying it is pointless—a malformed payload that will never deserialize correctly. Terminating it immediately prevents MaxDeliver redelivery cycles from wasting resources and lets you route the message to a dead-letter stream via an advisory subject.

MaxDeliver and BackOff

MaxDeliver: 5 caps redelivery attempts. After five failed deliveries, JetStream marks the message as “num delivery exceeded” and stops attempting. Pair this with BackOff to implement exponential retry spacing—the slice maps attempt number to delay. In the example above, the second attempt waits 5 seconds, the third waits 15, and subsequent attempts wait 60 seconds each.

💡 Pro Tip: Enable a dedicated dead-letter stream by subscribing to $JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.ORDERS.orders-processor. This advisory subject fires whenever a message hits its delivery ceiling, letting you capture, inspect, and alert on poison messages without polling.

Consumer Groups for Horizontal Scaling

Multiple instances sharing the same durable consumer name form a competing-consumer group automatically. JetStream dispatches each fetched message to exactly one instance—no additional coordination required.

scaled_consumer.go
// All three replicas use identical configuration.
// JetStream distributes pulled messages across whichever replica fetches first.
cons, _ := js.CreateOrUpdateConsumer(context.Background(), "ORDERS", jetstream.ConsumerConfig{
Durable: "orders-processor", // shared name across all replicas
FilterSubject: "orders.>",
AckPolicy: jetstream.AckExplicitPolicy,
AckWait: 30 * time.Second,
MaxDeliver: 5,
})

Scale your Deployment to three replicas and all three instances share the same server-side consumer state. There is no leader election, no partition assignment, and no rebalancing delay—each Fetch call races to claim the next available batch.
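The exactly-one-instance dispatch can be modeled locally with a shared channel standing in for the server-side cursor. This is an illustration of competing-consumer semantics, not NATS client code; `distribute` is an invented helper.

```go
package main

import (
	"fmt"
	"sync"
)

// distribute fans messages out to workers pulling from one shared
// channel: every message is claimed by exactly one worker, with no
// partition assignment or rebalancing step.
func distribute(msgs []int, workers int) map[int]int {
	ch := make(chan int)
	var mu sync.Mutex
	counts := map[int]int{} // message -> times delivered

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for m := range ch {
				mu.Lock()
				counts[m]++
				mu.Unlock()
			}
		}()
	}
	for _, m := range msgs {
		ch <- m
	}
	close(ch)
	wg.Wait()
	return counts
}

func main() {
	counts := distribute([]int{1, 2, 3, 4, 5}, 3)
	fmt.Println(len(counts)) // every message claimed by exactly one worker
}
```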

With durable consumers handling failures and horizontal scaling covered, the next natural question is what happens when you need to reprocess historical messages—whether you’re onboarding a new downstream service or debugging a production incident by replaying the exact event sequence that caused it.

Event Replay: Rewinding History for New Consumers and Debugging

One of JetStream’s most powerful capabilities is the ability to replay stored messages at will. Unlike traditional message queues that discard messages after delivery, JetStream streams function as a durable log—new consumers can position themselves anywhere in that log and reprocess events from that point forward. This unlocks bootstrapping new services, recovering from bugs, and implementing event sourcing patterns without bolting on additional infrastructure like Apache Kafka or EventStore.

Replaying from the Beginning

When a new service comes online and needs the full event history, configure its consumer with DeliverAllPolicy. This positions the consumer at the first message still retained in the stream and works through every message before switching to live delivery.

replay_all.go
func subscribeFromStart(js nats.JetStreamContext) (*nats.Subscription, error) {
	sub, err := js.Subscribe(
		"orders.>",
		func(msg *nats.Msg) {
			var order Order
			if err := json.Unmarshal(msg.Data, &order); err != nil {
				msg.Nak()
				return
			}
			if err := processOrder(order); err != nil {
				msg.Nak()
				return
			}
			msg.Ack()
		},
		nats.Durable("orders-read-model-v1"),
		nats.DeliverAll(),
		nats.AckExplicit(),
		nats.MaxAckPending(50),
	)
	if err != nil {
		return nil, fmt.Errorf("subscribe failed: %w", err)
	}
	// The caller owns the subscription. Note that sub.Unsubscribe() also
	// deletes the durable consumer on the server, so only call it when the
	// read model is being decommissioned for good.
	return sub, nil
}

The MaxAckPending(50) limit acts as backpressure during the replay burst—without it, JetStream floods the consumer faster than it processes messages, causing timeouts and redeliveries.

Time-Based Replay After a Bug Fix

When a bug corrupts downstream state, you need to reprocess only the affected window rather than the entire stream history. DeliverByStartTimePolicy lets you specify an exact timestamp as the starting position.

replay_since.go
func replayFromTime(js nats.JetStreamContext, since time.Time) (*nats.Subscription, error) {
	sub, err := js.Subscribe(
		"orders.>",
		func(msg *nats.Msg) {
			var order Order
			if err := json.Unmarshal(msg.Data, &order); err != nil {
				msg.Term() // malformed payload; retrying will not help
				return
			}
			rebuildReadModel(order)
			msg.Ack()
		},
		nats.Durable("orders-repair-20260215"),
		nats.StartTime(since), // legacy SubOpt for DeliverByStartTimePolicy
		nats.AckExplicit(),
	)
	if err != nil {
		return nil, fmt.Errorf("time-based replay failed: %w", err)
	}
	return sub, nil
}

💡 Pro Tip: Use a uniquely named durable consumer for each repair run (e.g., orders-repair-20260215). This guarantees an independent position pointer with no interference from your production consumer’s ack state.

State Reconstruction with DeliverLastPerSubject

For key-value style projections where only the latest event per entity matters, DeliverLastPerSubjectPolicy delivers exactly one message per subject before switching to live updates. This is ideal for rebuilding a read model without processing every historical state transition.

replay_last_per_subject.go
sub, err := js.Subscribe(
	"orders.*",
	func(msg *nats.Msg) {
		// msg.Subject is "orders.ord-7842", "orders.ord-9103", etc.
		var order Order
		if err := json.Unmarshal(msg.Data, &order); err != nil {
			msg.Nak()
			return
		}
		upsertOrderProjection(order)
		msg.Ack()
	},
	nats.Durable("orders-projection-v3"),
	nats.DeliverLastPerSubject(),
	nats.AckExplicit(),
)

A wildcard subject like orders.* paired with DeliverLastPerSubject gives you a snapshot of current state for every order ID in the stream—eliminating the need to fold hundreds of intermediate events to arrive at the same result.
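The snapshot semantics reduce to “last write wins per subject.” A minimal model (illustrative only; `event` and `lastPerSubject` are invented names):

```go
package main

import "fmt"

// event is a simplified stream entry: subject plus stream sequence.
type event struct {
	subject string
	seq     uint64
}

// lastPerSubject models DeliverLastPerSubject: keep only the most
// recent sequence seen for each subject.
func lastPerSubject(history []event) map[string]uint64 {
	snapshot := map[string]uint64{}
	for _, e := range history {
		snapshot[e.subject] = e.seq // later entries overwrite earlier ones
	}
	return snapshot
}

func main() {
	history := []event{
		{"orders.ord-7842", 1},
		{"orders.ord-9103", 2},
		{"orders.ord-7842", 3}, // supersedes sequence 1
	}
	fmt.Println(lastPerSubject(history))
}
```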

Practical Rebuild: Order Read Model

Combining these policies in a real deployment looks like this: on first deployment, use DeliverAll to build the initial projection. After a schema migration or bug fix targeting a known time window, spin up a repair consumer with DeliverByStartTime. When you need a lightweight cache warm-up, DeliverLastPerSubject reconstructs current state in a single pass.

The stream itself requires no changes across any of these scenarios—replay policy is entirely a consumer-level concern. This separation between the durable log and how you read from it is what makes JetStream’s replay model genuinely flexible, rather than a brittle workaround that collapses under operational pressure.

With event replay handling recovery and bootstrapping, the remaining challenge is operating JetStream reliably in production. The next section covers monitoring stream health, configuring retention and storage limits, and the operational gotchas that surface under real traffic.

Production Considerations: Monitoring, Limits, and Gotchas

Running JetStream in production surfaces a class of problems that never appear in local testing: streams that quietly fill your disks, consumers that fall behind without alerting anyone, and retention policies that discard messages you expected to keep. Addressing these before you ship saves significant operational pain.

Metrics That Actually Matter

Three signals determine JetStream cluster health:

Consumer lag (jetstream_consumer_num_pending) is the most critical. It measures how many messages a consumer has not yet processed. A lag that grows monotonically means your consumer is slower than your publisher — flow control will eventually kick in and back-pressure publishers, causing latency spikes upstream. Alert on sustained lag, not instantaneous spikes.

Stream storage usage (jetstream_stream_bytes and jetstream_stream_messages) tells you how close you are to your configured max_bytes or max_msgs limits. When a stream hits its limit, the configured discard policy takes effect — either new messages are dropped (DiscardNew) or the oldest messages are evicted (DiscardOld). Whichever policy you chose, silent data movement is happening. Track utilization and alert well before 80%.

Ack pending count (jetstream_consumer_num_ack_pending) tracks messages delivered but not yet acknowledged. A high and growing value usually indicates consumers are crashing mid-processing or your AckWait timeout is too short for your workload, causing repeated redelivery.
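The alerting rules above reduce to two small calculations. A sketch with invented helper names (`storageUtilization`, `lagSustained`), using the thresholds recommended here:

```go
package main

import "fmt"

// storageUtilization returns stream bytes as a percentage of max_bytes.
func storageUtilization(bytes, maxBytes int64) float64 {
	if maxBytes <= 0 {
		return 0
	}
	return float64(bytes) / float64(maxBytes) * 100
}

// lagSustained fires only when every sample in the window exceeds the
// threshold, filtering out instantaneous spikes.
func lagSustained(samples []int, threshold int) bool {
	if len(samples) == 0 {
		return false
	}
	for _, lag := range samples {
		if lag <= threshold {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(storageUtilization(5<<30, 10<<30)) // 5 GiB used of a 10 GiB stream
	fmt.Println(lagSustained([]int{12_000, 15_000, 14_000}, 10_000))
	fmt.Println(lagSustained([]int{12_000, 300}, 10_000)) // spike, not sustained
}
```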

Exposing Metrics to Prometheus

The NATS server exposes a monitoring HTTP endpoint (/jsz, /accountz, /healthz) on port 8222 by default. Use the nats-exporter sidecar or the NATS Surveyor to scrape these endpoints and publish Prometheus-compatible metrics. In your Helm values, enable monitoring explicitly:

monitor:
  enabled: true
  port: 8222

Then point your ServiceMonitor at port 8222 and import the official NATS JetStream Grafana dashboard (ID 14929) for an immediate baseline view of stream and consumer state.

The Gotchas That Catch Teams Off Guard

Unbounded stream growth happens when you configure no max_bytes, no max_msgs, and RetentionPolicy: Limits — which is the default. Without explicit limits, JetStream stores everything until the disk is full.

Wrong retention mode is subtle. WorkQueuePolicy deletes a message once any consumer acknowledges it. If you later add a second consumer expecting to see the full message history, it sees nothing. Use LimitsPolicy for multi-consumer fan-out scenarios.

AckWait shorter than processing time causes messages to be redelivered mid-processing, leading to duplicate work. Set AckWait to at least 2x your p99 processing latency and implement idempotent handlers regardless.

💡 Pro Tip: Run nats stream report from the NATS CLI before and after every deployment. It surfaces consumer lag, storage utilization, and redelivery counts in a single view — faster than digging through Grafana during an incident.

Pre-Production Checklist

  • max_bytes or max_msgs configured on every stream
  • discard policy explicitly chosen and tested
  • AckWait set to 2x p99 processing latency
  • Consumer lag alert at sustained > 10,000 messages
  • Storage utilization alert at > 80%
  • nats-exporter deployed and Grafana dashboard verified
  • Retention policy validated against your consumer topology

With monitoring and limits locked down, you have the full operational picture of a production JetStream deployment: a Raft-clustered server with persistent volumes, streams with explicit retention bounds, durable pull consumers with calibrated ack timeouts, replay policies for bootstrapping and recovery, and the observability layer to detect problems before they become incidents.

Key Takeaways

  • Use pull consumers with explicit acks for all critical workloads—push consumers are convenient but harder to control under backpressure
  • Set stream retention policies and storage limits before going to production; unbounded streams will silently fill your PVCs
  • Implement event replay from the start by choosing sequence-based delivery policies, so new service instances and post-incident recovery are a configuration change, not a data migration
  • Monitor consumer lag and ack-pending counts as your primary JetStream health signals—not just server uptime