NATS JetStream in Production: From Fire-and-Forget to Durable Event Streaming


Your microservices are dropping messages, and you might not know it yet. Core NATS is genuinely fast—benchmarks routinely show millions of messages per second on commodity hardware—but that throughput comes with an explicit trade-off baked into its design: at-most-once delivery. No subscriber listening when you publish? That message is gone. Consumer crashes mid-flight? Same outcome. The broker never knew it needed to care.

For a class of workloads, this is exactly the right trade-off. Telemetry streams, heartbeats, live sensor feeds—losing the occasional data point is operationally irrelevant, and the simplicity of fire-and-forget keeps your infrastructure lean. The moment you move financial transactions, order events, or any payload where loss has a business cost into that same pattern, you have a silent reliability problem waiting to surface under load.

This is the architectural fork that catches teams mid-scale. The system works fine in staging, where timing is controlled and consumers are always healthy. Production introduces the edge cases: a pod restart during a burst, a downstream service degraded at exactly the wrong moment, a deploy that takes a consumer offline for thirty seconds. Core NATS does not buffer for you. JetStream does—and the distance between those two sentences is the entire operational decision space.

JetStream is not a different system bolted onto NATS. It is NATS with a persistence layer, replay semantics, consumer groups, and configurable acknowledgment modes. The question is never whether JetStream is more capable than Core NATS. It is whether your specific workload actually needs that capability, and if it does, where exactly to draw the boundary.

That starts with understanding why delivery guarantees and raw throughput are structurally at odds—and what that tension costs you at different points in your architecture.

The Delivery Guarantee Problem: Why Speed and Reliability Are at Odds

Every distributed messaging system makes a fundamental promise to its users, and that promise lives on a spectrum. At one end: raw throughput, sub-millisecond latency, and zero persistence overhead. At the other: guaranteed delivery, ordered processing, and durable storage that survives restarts and network partitions. You cannot fully optimize for both simultaneously. Understanding where your use case falls on that spectrum is the first architectural decision you need to make before touching any NATS configuration.

Visual: delivery guarantee spectrum from fire-and-forget to durable streaming

Core NATS: A Deliberate Design Choice

Core NATS operates on at-most-once delivery. When a publisher sends a message, NATS routes it to any currently connected subscribers and immediately discards it. No queue, no journal, no retry. If no subscriber is present at the moment of publication—because the consumer process restarted, is mid-deployment, or simply hasn’t connected yet—the message is gone. This is not a limitation or a bug. It is a deliberate design decision that makes NATS exceptionally fast and operationally lightweight.

For a large class of problems, at-most-once is exactly the right guarantee. Infrastructure telemetry, where you are sampling metrics every few seconds and a dropped data point is irrelevant, benefits from this model. Heartbeat signals, presence notifications, live game state updates, and sensor readings that are only meaningful if they arrive within a narrow time window are all cases where persistence adds cost without adding value. Storing every heartbeat is wasteful; storing every GPS coordinate from a fleet of vehicles that publishes ten times per second is actively counterproductive.

When Fire-and-Forget Becomes a Liability

The calculus changes the moment a message carries business-critical state. A payment event, an order confirmation, a fraud detection signal, an inventory reservation—these are messages where “the subscriber was restarting” is not an acceptable reason for data loss. A financial system that drops events under load does not have a performance problem; it has a correctness problem. The distinction matters because correctness failures are categorically different from latency failures, and they require categorically different tooling.

This is the architectural fork point. Identify the flows in your system where message loss produces incorrect state, broken audit trails, or failed transactions. Those flows need durable streaming. The rest—the high-frequency, ephemeral, loss-tolerant flows—are exactly what Core NATS was designed for, and reaching for heavier tooling there introduces unnecessary complexity.

NATS vs. Kafka: A Different Set of First-Class Values

Kafka treats persistence as the default. Every message is written to a distributed log, and the system is designed around the assumption that durability is always desirable. NATS inverts this: simplicity and operational lightness are the defaults, and JetStream adds persistence as an explicit, opt-in layer for the flows that require it.

This positioning matters for teams managing infrastructure at scale. A single NATS server binary with no external dependencies can handle both ephemeral pub/sub and durable streaming within the same deployment—a sharp contrast to operating a Kafka cluster with its ZooKeeper or KRaft coordination layer, separate consumer group management, and substantial tuning surface area.

💡 Pro Tip: Map your message flows into two columns before writing any code: “loss-tolerant” and “loss-critical.” Subject hierarchy design in the final section will use this exact division to structure your NATS subject namespace.

With that mental model in place, the next step is understanding how JetStream’s persistence layer actually works—what streams and consumers are, how messages are stored, and what guarantees the architecture provides before you write a single producer or consumer.

JetStream Architecture: Streams, Consumers, and the Persistence Layer

JetStream is not a separate service you bolt onto NATS—it ships inside the NATS server binary. There is no ZooKeeper ensemble to manage, no separate persistence daemon, no coordination layer to operate. You enable it with a single server configuration flag and the persistence layer activates within the same process. This architectural decision shapes everything about JetStream’s operational profile: lower latency between the broker and its storage, fewer moving parts to monitor, and a dramatically simpler deployment story than Kafka or Pulsar.
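Enabling it is a small server-side change. A minimal configuration block looks like the following — the paths and limits here are illustrative placeholders, not recommendations:

```
jetstream {
  store_dir: /data/jetstream   # where file-backed streams persist
  max_memory_store: 1G         # cap for memory-backed streams
  max_file_store: 10G          # cap for file-backed streams
}
```

For local experimentation, `nats-server -js` starts the server with JetStream enabled using default storage settings.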

Visual: JetStream architecture showing streams, consumers, and the persistence layer

Streams: Capturing and Retaining Messages

A stream is the foundational storage primitive. It defines a subject filter (or set of subjects) whose messages JetStream captures and persists. When a publisher sends to orders.created, a stream configured to capture orders.* intercepts that message and writes it to durable storage before acknowledging the publish.

Streams carry three distinct retention policies that govern when messages are discarded:

  • Limits-based retention holds messages until they exceed a configured threshold—maximum message count, total byte size, or per-message age. This is the default and the right choice when you want a bounded replay window.
  • Interest-based retention discards a message only after all bound consumers have acknowledged it. Use this when every consumer group must process every message and you want automatic cleanup.
  • Work-queue retention discards a message after any single consumer acknowledges it, turning the stream into a distributed task queue.

Choosing the wrong retention policy is one of the most common misconfigurations in JetStream deployments. Limits-based retention on a work-queue pattern means acknowledged messages accumulate until your size limit triggers eviction, silently dropping messages other consumers haven’t seen.

Consumers: The Subscription Layer

Consumers are the mechanism through which subscribers read from streams. The push/pull distinction matters significantly for horizontal scaling.

Push consumers have the server deliver messages to a subject the client subscribes to. They suit low-latency pipelines where backpressure is managed externally, but they require careful flow control configuration to avoid overwhelming slow consumers.

Pull consumers invert the control: the client explicitly requests a batch of messages. This model is superior for horizontally scaled worker pools because each instance fetches only what it can process, and the server never needs to track per-client delivery rates. Pull consumers are the default recommendation for any production workload with variable processing throughput.

The durable versus ephemeral distinction controls whether the consumer’s position is remembered across reconnections. Durable consumers have a named identity the server persists; ephemeral consumers evaporate when their last subscriber disconnects. Ephemeral consumers are useful for one-off replay operations, but every service that must survive restarts needs a durable consumer with a stable name.

Ack/Nak Flow and Delivery State

JetStream tracks per-consumer delivery state without a separate offset store. Each consumer maintains a sequence cursor and a delivery count for in-flight messages. When a client acknowledges (Ack) a message, the cursor advances. When it negatively acknowledges (Nak) or the ack window times out, JetStream redelivers—up to the consumer’s configured MaxDeliver threshold.

This means retry logic is built into the broker. A processing failure does not require the application to re-enqueue—it requires the application to either Nak immediately or let the ack timeout expire.
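The broker-side decision can be sketched as a small state machine — a conceptual model, not the server's actual implementation; the `inflight` type and `onFailure` name are invented for illustration:

```go
package main

import "fmt"

// Sketch of the broker's redelivery decision for one in-flight message,
// assuming a consumer configured with MaxDeliver = 5. Illustrative only.
type inflight struct {
	seq        uint64
	deliveries int // attempts so far, including the first delivery
}

const maxDeliver = 5

// onFailure models a Nak or an expired ack window.
// It returns true if the broker should redeliver.
func onFailure(m *inflight) bool {
	if m.deliveries >= maxDeliver {
		return false // terminal: the server emits an advisory instead
	}
	m.deliveries++
	return true
}

func main() {
	m := &inflight{seq: 42, deliveries: 1} // first delivery already counted
	redeliveries := 0
	for onFailure(m) {
		redeliveries++
	}
	fmt.Println(redeliveries) // 4 redeliveries on top of the initial attempt
}
```

The key property: the attempt counter lives on the broker, so a crashed consumer cannot reset it and loop forever.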

Storage Backends

JetStream supports file-based and in-memory storage per stream. File storage survives server restarts and is the correct default for any event that must not be lost. Memory storage offers lower write latency but evaporates on restart—appropriate for ephemeral caching streams or high-throughput telemetry where loss is acceptable.

💡 Pro Tip: In clustered deployments, JetStream uses Raft-based consensus to replicate stream data across a configurable number of replicas (R1, R3). A single-replica stream offers no fault tolerance. Set replicas: 3 for any stream where data loss is unacceptable.

With this mental model in place—streams capturing subjects, consumers reading with explicit durability and pull semantics, and ack flow handled at the broker—the next step is translating these concepts into working Go code using the nats.go client library.

Building a Durable Event Pipeline in Go with nats.go

Core NATS gives you speed. JetStream gives you guarantees. This section walks through a production-ready Go implementation that connects those two realities — from stream creation through confirmed publishing, pull-based consumption, and redelivery handling. Each pattern here reflects decisions you’ll face on day one of a real deployment, not edge cases you’ll discover after an incident.

Connecting and Configuring a Stream

Start by establishing a connection and creating a JetStream context. The stream definition is where you encode your retention and replay requirements — get this wrong and you’ll be recreating streams under load.

stream_setup.go
package main

import (
	"time"

	"github.com/nats-io/nats.go"
)

func setupStream(nc *nats.Conn) (nats.JetStreamContext, error) {
	js, err := nc.JetStream(nats.PublishAsyncMaxPending(256))
	if err != nil {
		return nil, err
	}

	streamConfig := &nats.StreamConfig{
		Name:       "ORDERS",
		Subjects:   []string{"orders.>"},
		Storage:    nats.FileStorage,
		Retention:  nats.LimitsPolicy,
		MaxAge:     72 * time.Hour,
		MaxBytes:   1 << 30, // 1 GiB
		MaxMsgSize: 1 << 20, // 1 MiB
		Replicas:   3,
	}

	if _, err = js.AddStream(streamConfig); err != nil {
		// Stream may already exist — update it instead
		if _, err = js.UpdateStream(streamConfig); err != nil {
			return nil, err
		}
	}
	return js, nil
}

Replicas: 3 is non-negotiable in production. A single-replica stream loses data on node failure. FileStorage provides durability across restarts; MemoryStorage trades persistence for lower latency in ephemeral use cases like session tokens or short-lived caches. The AddStream/UpdateStream fallback pattern is intentional — idempotent setup lets your service restart cleanly without manual stream management.

LimitsPolicy is the right default for most pipelines: messages are retained until they hit an age, byte, or count limit. Use WorkQueuePolicy if you want the stream to auto-delete messages after they’ve been consumed, eliminating the need to size long-term retention.

Publishing with Confirmation

Fire-and-forget publishing has no place in a durable pipeline. JetStream provides two confirmation modes: synchronous (blocks until the server acknowledges persistence) and asynchronous (returns a future you reconcile later).

publisher.go
func publishOrder(js nats.JetStreamContext, orderID string, payload []byte) error {
	// Synchronous: safe for critical paths, adds latency
	ack, err := js.Publish("orders.created", payload,
		nats.MsgId(orderID), // Enable deduplication
	)
	if err != nil {
		return fmt.Errorf("publish failed for order %s: %w", orderID, err)
	}
	log.Printf("order %s persisted on stream %s seq %d", orderID, ack.Stream, ack.Sequence)
	return nil
}

func publishOrderAsync(js nats.JetStreamContext, payload []byte) (nats.PubAckFuture, error) {
	// Asynchronous: higher throughput, requires explicit future handling
	future, err := js.PublishAsync("orders.created", payload)
	if err != nil {
		return nil, err
	}
	return future, nil
}

nats.MsgId enables server-side deduplication within the stream’s Duplicates window (default 2 minutes). Use a stable, business-meaningful ID — order UUID, event fingerprint — not a random value generated at publish time. The server tracks message IDs for the duration of the window and drops repeats, so producer retries within the window cannot cause double-storage; a retry that lands after the window expires will be stored again, which is why idempotent consumers remain good practice.

The PublishAck returned by the synchronous path contains the sequence number assigned by the server. Logging this gives you an audit trail you can reconcile against consumer delivery records during incident investigation.

Pro Tip: For async publishing, drain the pending futures in a separate goroutine watching js.PublishAsyncComplete(). Unread futures are not error-handled automatically — they silently drop on connection close.

Pull Consumer with Explicit Acknowledgment

Push consumers hand delivery timing to the server. Pull consumers let your application control the rate — critical when downstream processing is expensive or when you need predictable memory pressure.

consumer.go
func createPullConsumer(js nats.JetStreamContext) error {
	_, err := js.AddConsumer("ORDERS", &nats.ConsumerConfig{
		Durable:       "orders-processor",
		FilterSubject: "orders.created",
		AckPolicy:     nats.AckExplicitPolicy,
		AckWait:       30 * time.Second,
		MaxDeliver:    5,
		MaxAckPending: 500,
	})
	return err
}

func processMessages(js nats.JetStreamContext) {
	sub, err := js.PullSubscribe("orders.created", "orders-processor")
	if err != nil {
		log.Fatalf("subscribe failed: %v", err)
	}
	defer sub.Drain()

	for {
		msgs, err := sub.Fetch(25, nats.MaxWait(5*time.Second))
		if err != nil && err != nats.ErrTimeout {
			log.Printf("fetch error: %v", err)
			continue
		}
		for _, msg := range msgs {
			if err := handleOrder(msg.Data); err != nil {
				// Nak with a delay — back off before redelivery
				msg.NakWithDelay(10 * time.Second)
				continue
			}
			msg.Ack()
		}
	}
}

MaxDeliver: 5 caps redelivery attempts. After the fifth failure, JetStream stops redelivering and publishes a max-deliveries advisory (on $JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.<stream>.<consumer>)—there is no automatic dead-letter queue. Wire up a consumer on that advisory subject that persists the failed messages before you go to production; silent expiry is how data loss hides until an audit.

AckWait: 30s defines how long the server waits before assuming the consumer died and redelivering. Size this to your p99 processing time with headroom, not your average. A consumer that processes in 2 seconds on average but occasionally takes 45 seconds will generate phantom redeliveries and duplicate processing if AckWait is set to 30 seconds.

MaxAckPending: 500 prevents a slow consumer from accumulating an unbounded in-flight window. When this ceiling is hit, the server stops delivering new messages to that consumer until pending acks are resolved — a natural backpressure mechanism.

Queue Groups for Competing Consumers

Horizontal scaling with pull consumers requires binding multiple subscribers to the same durable consumer. JetStream distributes fetch requests across all active subscribers automatically.

worker_pool.go
func startWorkerPool(js nats.JetStreamContext, workerCount int) {
	for i := 0; i < workerCount; i++ {
		go func(id int) {
			sub, err := js.PullSubscribe(
				"orders.created",
				"orders-processor", // Same durable = competing consumers
				nats.BindStream("ORDERS"),
			)
			if err != nil {
				log.Fatalf("worker %d subscribe failed: %v", id, err)
			}
			defer sub.Drain()

			for {
				msgs, err := sub.Fetch(10, nats.MaxWait(2*time.Second))
				if err != nil && err != nats.ErrTimeout {
					log.Printf("worker %d fetch error: %v", id, err)
					continue
				}
				for _, msg := range msgs {
					if err := handleOrder(msg.Data); err != nil {
						msg.NakWithDelay(15 * time.Second)
						continue
					}
					msg.Ack()
				}
			}
		}(i)
	}
}

nats.BindStream is explicit about which stream backs the consumer, preventing ambiguous resolution when subject patterns overlap across multiple streams. Each Fetch call is a lease — one worker gets those messages exclusively. No external coordination required, no ZooKeeper, no Redis, no additional infrastructure.

Tune the Fetch batch size to your processing characteristics. Small batches reduce head-of-line blocking when individual messages are slow; larger batches amortize round-trip latency when throughput matters more than tail latency. Start at 10–25 and adjust based on observed consumer lag metrics.

The Go implementation gives you the primitives. The next section covers the same durable pipeline patterns in Node.js using nats.js and TypeScript, where async iterators and the JetStreamManager API reshape the consumer model around promises and event loops.

Node.js Integration: Event Streaming with nats.js and TypeScript

The nats.js client brings first-class JetStream support to Node.js with a TypeScript-native API that maps cleanly onto the same streaming concepts covered in the Go section. The main operational difference is the Node.js event loop: a single-threaded runtime that makes back-pressure management an explicit concern rather than something the OS scheduler handles for you. Unlike Go’s goroutine-per-connection model, every message callback you register competes for the same thread, which means a slow handler doesn’t just fall behind—it can starve unrelated I/O and degrade your entire process.

Connecting and Initializing the JetStream Context

Install the client and codec packages, then establish a connection with a JetStream manager handle:

src/messaging/client.ts
import { connect } from "nats";

export async function createJetStreamClient() {
  const nc = await connect({
    servers: ["nats://nats-cluster.internal:4222"],
    reconnect: true,
    maxReconnectAttempts: 10,
    reconnectTimeWait: 2000,
  });

  const js = nc.jetstream();
  const jsm = await nc.jetstreamManager();
  return { nc, js, jsm };
}

nc.jetstream() returns a publish/subscribe handle backed by the persistence layer. The jetstreamManager() handle is used for administrative operations—stream creation, consumer configuration, and stream introspection—and is separate by design. Keep the nc reference in scope for the lifetime of your process and close it explicitly on shutdown; once the underlying connection closes, both handles fail with connection-closed errors on the next operation.

Publishing Structured Events

Define a typed event envelope and publish to a subject hierarchy that reflects your domain:

src/messaging/publisher.ts
import { JetStreamClient, JSONCodec } from "nats";

interface OrderEvent {
  orderId: string;
  customerId: string;
  status: "placed" | "confirmed" | "shipped" | "cancelled";
  totalCents: number;
  timestamp: string;
}

const jc = JSONCodec<OrderEvent>();

export async function publishOrderEvent(
  js: JetStreamClient,
  event: OrderEvent
): Promise<void> {
  const subject = `orders.${event.status}`;
  const ack = await js.publish(subject, jc.encode(event));
  console.log(`Published to ${subject}, seq=${ack.seq}, stream=${ack.stream}`);
}

The subject pattern orders.<status> lets consumers subscribe to specific lifecycle stages or the entire orders.> wildcard. The returned PubAck confirms the server wrote the message to the stream—at this point durability is guaranteed regardless of what happens to downstream consumers. If js.publish throws, the message was not persisted; wrap publish calls in retry logic with exponential backoff for production use rather than letting transient network errors silently drop events.

Creating a Durable Push Consumer

A push consumer with manual acknowledgment is the right default for order-processing workflows where silent message loss is unacceptable:

src/messaging/consumer.ts
import {
  JetStreamClient,
  JetStreamManager,
  AckPolicy,
  DeliverPolicy,
  JSONCodec,
} from "nats";

const jc = JSONCodec<{ orderId: string; status: string }>();

export async function startOrderConsumer(
  js: JetStreamClient,
  jsm: JetStreamManager
): Promise<void> {
  await jsm.consumers.add("ORDERS", {
    durable_name: "order-processor-v1",
    ack_policy: AckPolicy.Explicit,
    deliver_policy: DeliverPolicy.All,
    deliver_subject: "push.orders.processor",
    max_ack_pending: 20,
  });

  const sub = await js.subscribe("orders.>", {
    config: { durable_name: "order-processor-v1" },
    mack: true, // manual ack — without this the client acks on delivery
  });

  for await (const msg of sub) {
    try {
      const event = jc.decode(msg.data);
      await processOrder(event.orderId, event.status);
      msg.ack();
    } catch (err) {
      msg.nak();
      console.error(`Processing failed for msg seq=${msg.seq}:`, err);
    }
  }
}

async function processOrder(orderId: string, status: string): Promise<void> {
  // domain logic here
}

Back-Pressure and the Event Loop

max_ack_pending: 20 is the key back-pressure lever. NATS stops delivering new messages once 20 are outstanding, which prevents the event loop from accumulating a queue of unprocessed work that grows faster than your handler can drain it.

Choosing the right value depends on your handler’s execution profile:

  • I/O-bound handlers (database writes, downstream HTTP calls): 20 is a reasonable starting point. Awaiting network I/O yields the event loop between messages, so moderate concurrency is safe.
  • CPU-bound handlers (heavy JSON transformation, synchronous computation): lower to 5 or fewer. CPU work blocks the event loop entirely while it runs, and stacking 20 pending callbacks will cause measurable latency spikes across the rest of the process.
  • Mixed workloads: profile under realistic load before committing to a value. Use process.hrtime() around the handler body to measure wall-clock time per message, then tune max_ack_pending until p99 latency stays within your SLA.

The nak() call on error is equally important. Without it, an unacknowledged message sits in-flight until the ack_wait timeout expires—typically 30 seconds—before NATS redelivers it. Calling nak() immediately triggers redelivery, reducing the window during which a message is stuck in a failed handler.

Pro Tip: Use DeliverPolicy.LastPerSubject for consumers that only need the most recent state per entity—inventory snapshots, configuration updates—rather than a full ordered history. This dramatically reduces replay time on restart and avoids processing thousands of stale events just to arrive at current state.

Integration Testing Without External Dependencies

The nats.js ecosystem includes a server binary wrapper that starts an embedded NATS server in-process for tests:

src/messaging/__tests__/consumer.test.ts
import { NatsServer } from "@nats-io/nats-server";
import { connect, JSONCodec } from "nats";
import { startOrderConsumer } from "../consumer";

describe("OrderConsumer", () => {
  let server: NatsServer;

  beforeAll(async () => {
    server = await NatsServer.start({ port: 14222, jetstream: true });
  });

  afterAll(async () => {
    await server.stop();
  });

  it("acks a valid order event", async () => {
    const nc = await connect({ servers: "nats://localhost:14222" });
    const js = nc.jetstream();
    const jc = JSONCodec();

    await js.publish(
      "orders.placed",
      jc.encode({ orderId: "ord-8821", status: "placed" })
    );

    // assert handler processed without throwing
    await nc.drain();
  });
});

This pattern eliminates the test-environment dependency on a running NATS cluster and keeps CI pipelines self-contained. Each test suite spins up a fresh server on a separate port, so parallel test runs never share state. For tests that exercise redelivery behavior—verifying that a nak() causes the message to reappear—set a short ack_wait in the consumer configuration (two seconds is sufficient in tests) so you’re not waiting on the production 30-second default.

With the Node.js integration patterns established, the next section shifts focus to the operational layer: how to run a JetStream cluster in production, what observability signals matter, and how the system behaves under the failure modes that eventually hit every distributed messaging deployment.

Production Deployment Patterns: Clustering, Observability, and Failure Modes

Deploying JetStream beyond a single node introduces topology decisions that directly affect your latency profile, data durability, and operational complexity. Getting these decisions wrong early means painful migrations later.

Topology: Clusters vs. Superclusters

A standard NATS cluster (3–7 nodes) handles most production workloads. All nodes participate in Raft-based consensus for JetStream metadata and stream replication. Latency stays low because clients connect to any node and the cluster routes internally.

Superclusters connect multiple clusters via gateway connections; leaf nodes are a separate construct—lightweight edge extensions that proxy traffic to a hub cluster. Use these topologies when you have strict data residency requirements, genuinely separate failure domains across regions, or edge deployments where you want local pub/sub with selective replication to a central cluster. Don’t reach for superclusters because your architecture diagram looks impressive—the operational overhead is real, and cross-cluster stream mirroring adds latency and complexity that a well-sized single cluster avoids entirely.

Replication Factor and What R3 Actually Guarantees

Setting replicas: 3 on a stream means JetStream writes each message to three nodes before acknowledging to the producer. The cluster requires a quorum of two nodes to confirm writes, so you can tolerate one node failure without data loss or write unavailability. R3 does not protect you against a three-node cluster losing network connectivity to a quorum simultaneously—that results in write stalls until quorum is restored.
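The arithmetic behind these guarantees is plain Raft majority math — the sketch below is a conceptual illustration, not server code:

```go
package main

import "fmt"

// Raft quorum arithmetic for a stream with R replicas: a write commits once
// a majority confirms it, so tolerable failures = R minus quorum.
func quorum(replicas int) int { return replicas/2 + 1 }

func main() {
	for _, r := range []int{1, 3, 5} {
		fmt.Printf("R%d: quorum=%d, tolerates %d node failure(s)\n",
			r, quorum(r), r-quorum(r))
	}
}
```

This is also why R2 is never offered as a meaningful option: a two-replica quorum is still two, so R2 tolerates zero failures while doubling the write cost.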

💡 Pro Tip: R1 streams are appropriate for ephemeral work queues where losing unprocessed messages on a node failure is acceptable. R3 is the minimum for anything you’d consider durable.

Metrics That Actually Matter

Four metrics determine whether your JetStream deployment is healthy:

Pending consumer count measures messages delivered but not yet acknowledged. A sustained rise indicates slow consumer processing or a hung consumer. Set alerts at 2× your normal processing batch size.

Redelivery rate exposes how often AckWait timeouts trigger redelivery. Elevated rates mean your consumers are taking longer than AckWait to process, your MaxDeliver limit is too aggressive, or you have poison-pill messages that consistently fail.

Stream lag (the difference between last sequence and consumer delivered sequence) tells you whether consumers are keeping pace with producers. Unbounded lag leads to backpressure, which stalls producers once the stream hits its MaxBytes or MaxMsgs limit.

Ack wait timeouts per consumer group surface the specific slow consumers within a group without requiring log analysis.

Failure Modes to Anticipate

Network partitions in a three-node cluster that isolate a single node are handled gracefully. Partitions that split the cluster into 1+2 leave the minority unable to confirm writes. Slow consumers cause stream backpressure when retention limits are reached—producers receive MaxBytesExceeded errors rather than silent drops, which is intentional but requires producer-side retry logic. Consumer group rebalancing during rolling deployments briefly increases redelivery rates as in-flight messages time out and reassign.

When JetStream Is Not the Right Answer

JetStream is not Kafka. If you need long-term event log retention measured in weeks with consumers replaying months of history, Kafka’s log compaction and tiered storage are better suited. If your use case is stateful stream processing—joins, windowed aggregations, complex event patterns—Flink’s processing model handles that natively where JetStream would require external state management. JetStream earns its place in low-latency, operational event pipelines where operational simplicity and Go/Node.js client ergonomics matter more than ecosystem breadth.

With the operational layer understood, the next design decision is how you structure the subjects and consumer groups that your streams and services communicate through—and that structure has more architectural impact than most teams anticipate.

Subject Hierarchy Design and Queue Group Patterns for Scalable Systems

Subject naming is infrastructure. Get it wrong early, and you spend the next two years working around a naming convention that made sense for three services but breaks down at thirty. NATS’s dot-delimited subject hierarchy gives you a structured namespace that routes messages, enforces boundaries, and adapts to scale—if you design it deliberately from the start.

Dot-Delimited Hierarchies and Wildcard Routing

The canonical pattern is <domain>.<entity>.<action>.<qualifier>. A fleet telemetry system might use subjects like fleet.vehicle.telemetry.gps, fleet.vehicle.telemetry.engine, and fleet.vehicle.alert.geofence. This structure lets consumers subscribe with wildcards: fleet.vehicle.telemetry.* captures all telemetry types, while fleet.> captures everything in the domain.

The * wildcard matches a single token; > matches everything from that position forward. Use * for type-safe subscriptions where you know the depth, and > for catch-all consumers like audit loggers or debug subscribers that need the full event stream.
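These matching rules are easy to state precisely in code. The sketch below models the semantics for illustration — it is not the server's matcher, and `subjectMatches` is an invented name:

```go
package main

import (
	"fmt"
	"strings"
)

// subjectMatches models NATS wildcard semantics: '*' matches exactly one
// token, '>' matches one or more trailing tokens. Illustrative only.
func subjectMatches(pattern, subject string) bool {
	p := strings.Split(pattern, ".")
	s := strings.Split(subject, ".")
	for i, tok := range p {
		if tok == ">" {
			return i < len(s) // '>' must consume at least one token
		}
		if i >= len(s) {
			return false // subject is shorter than the pattern
		}
		if tok != "*" && tok != s[i] {
			return false // literal token mismatch
		}
	}
	return len(p) == len(s) // no wildcard left to absorb extra tokens
}

func main() {
	fmt.Println(subjectMatches("fleet.vehicle.telemetry.*", "fleet.vehicle.telemetry.gps"))  // true
	fmt.Println(subjectMatches("fleet.vehicle.telemetry.*", "fleet.vehicle.alert.geofence")) // false
	fmt.Println(subjectMatches("fleet.>", "fleet.vehicle.telemetry.engine"))                 // true
	fmt.Println(subjectMatches("fleet.>", "fleet"))                                          // false
}
```

The last case is worth internalizing: fleet.> does not match the bare subject fleet, because > requires at least one token after its position.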

Queue Groups for Stateless Horizontal Scaling

When you need load-balanced consumers without persistence, Core NATS queue groups deliver exactly that. Multiple subscribers registering the same queue group name receive messages round-robin—no JetStream required.

queue_group.go
nc, _ := nats.Connect("nats://prod-cluster.internal:4222")

// All instances of the telemetry processor share this group
nc.QueueSubscribe("fleet.vehicle.telemetry.*", "telemetry-processors", func(msg *nats.Msg) {
	processTelemetry(msg.Subject, msg.Data)
})

This pattern suits stateless processors where losing an in-flight message during a crash is acceptable—think real-time dashboards or non-critical metric aggregators.

Combining JetStream Consumers with Queue Groups

For durable processing with horizontal scale, bind a JetStream push consumer to a queue group. Each consumer instance in the group receives an exclusive slice of messages, with acknowledgment tracked per-instance. If a worker crashes, unacknowledged messages redeliver to surviving group members.

jetstream_queue_group.go
js, _ := nc.JetStream()

js.AddStream(&nats.StreamConfig{
	Name:     "FLEET",
	Subjects: []string{"fleet.>"},
})

js.AddConsumer("FLEET", &nats.ConsumerConfig{
	Durable:        "telemetry-processor",
	DeliverGroup:   "telemetry-workers",
	DeliverSubject: "fleet.vehicle.telemetry.deliver",
	AckPolicy:      nats.AckExplicitPolicy,
	FilterSubject:  "fleet.vehicle.telemetry.>",
})

nc.QueueSubscribe("fleet.vehicle.telemetry.deliver", "telemetry-workers", func(msg *nats.Msg) {
	if err := processTelemetry(msg.Data); err != nil {
		msg.Nak()
		return
	}
	msg.Ack()
})

💡 Pro Tip: Set DeliverGroup and DeliverSubject together on the JetStream consumer. Without DeliverGroup, each subscriber in your queue group receives every message independently—defeating load balancing entirely.

Subject Mapping for Multi-Tenant Systems

NATS server-side subject mapping lets you rewrite subjects without changing publisher or subscriber code. In a multi-tenant fleet platform, you map tenant.acme.vehicle.> to the canonical fleet.vehicle.> stream at the account level, keeping tenant isolation in the routing layer rather than baking it into application logic.
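The rewrite itself is a simple token transformation. The sketch below performs it in application code purely to make the mapping concrete — in production this belongs in NATS server-side subject mapping, not in your services, and `canonicalize` is an invented name:

```go
package main

import (
	"fmt"
	"strings"
)

// canonicalize models the tenant-to-canonical subject rewrite:
//   tenant.<name>.vehicle.>  ->  fleet.vehicle.>
// Illustrative only — the real mapping runs inside the NATS server.
func canonicalize(subject string) string {
	parts := strings.Split(subject, ".")
	if len(parts) >= 3 && parts[0] == "tenant" {
		// Drop the tenant prefix and name, re-root under the canonical domain.
		return "fleet." + strings.Join(parts[2:], ".")
	}
	return subject // non-tenant subjects pass through unchanged
}

func main() {
	fmt.Println(canonicalize("tenant.acme.vehicle.telemetry.gps")) // fleet.vehicle.telemetry.gps
	fmt.Println(canonicalize("fleet.vehicle.alert.geofence"))      // unchanged
}
```

Keeping this transformation at the routing layer means tenant identity never leaks into stream definitions or consumer code.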

For fleet telemetry specifically, the mixed-durability architecture looks like this: GPS and engine telemetry land in a persistent JetStream stream for replay and analytics. Alert subjects like fleet.vehicle.alert.> use Core NATS to queue groups—latency matters more than durability for real-time geofence violations, and a dropped alert triggers a re-evaluation on the next telemetry pulse anyway.

This separation—durable streams for data you analyze later, fire-and-forget queue groups for time-sensitive reactions—maps directly to the operational trade-offs covered in this post’s opening sections, and informs how you configure the clustering and replication strategies discussed in the production deployment section above.

Key Takeaways

  • Audit your message types before choosing a delivery model: use Core NATS for ephemeral/high-frequency data and JetStream only where you need replay, persistence, or guaranteed delivery—mixing both in one system is valid and often optimal
  • Configure pull consumers with explicit ack/nak and a MaxDeliver limit for any business-critical event processing; this gives you at-least-once delivery without runaway redelivery loops when a consumer is broken
  • Design your subject hierarchy before writing consumer code—a well-structured dot-delimited namespace with wildcards costs nothing upfront and saves a painful refactor when you add your fifth service
  • Monitor stream consumer lag and redelivery rates as primary health signals; a rising redelivery rate is an early warning of consumer instability before it cascades into a full backlog