Building Production-Grade MongoDB Observability with Prometheus and Grafana


You’re getting paged at 3 AM because MongoDB is slow, but your monitoring only shows CPU and memory. By the time you SSH in and run db.currentOp(), the spike is gone. Your infrastructure dashboards tell you the server is healthy—70% CPU, plenty of RAM, disk I/O looks normal. But your application is timing out, users are complaining, and you have no idea what actually happened inside MongoDB.

This is the gap that breaks production MongoDB deployments. Traditional infrastructure monitoring sees the host, not the database. It can’t tell you that a collection scan just locked your primary for 200ms, that replication lag spiked to 15 seconds, or that connection pool exhaustion is queuing operations. You need visibility into MongoDB’s internal state: operation counts, lock percentages, query executor metrics, and replication health. The kind of data that lives in db.serverStatus() and rs.status(), not /proc/meminfo.

The MongoDB Prometheus exporter solves this by continuously scraping these internal metrics and exposing them in a format that Prometheus can ingest. Combined with Grafana, you get real-time dashboards showing exactly what MongoDB is doing—not just how the server it runs on is performing. You can track slow queries as they happen, watch lock contention build up before it causes timeouts, and alert on replication lag before it compromises your RPO.

Setting this up properly requires more than installing an exporter and pointing Prometheus at it. You need to understand which metrics matter, how to structure your recording rules for performance, and how to build dashboards that surface problems before they page you.

Why Infrastructure Metrics Aren’t Enough for MongoDB

When your MongoDB cluster starts degrading at 2 AM, your infrastructure monitoring will tell you the server is fine. CPU usage is normal. Memory isn’t maxed out. Disk I/O looks healthy. Yet users are experiencing 5-second query latencies and your application is throwing timeout errors.

Visual: Infrastructure metrics showing healthy servers while MongoDB performance degrades

This scenario plays out in production environments because infrastructure metrics measure the wrong layer. System-level tools like CloudWatch, Datadog’s host agent, or Prometheus node_exporter track server health, but MongoDB’s performance problems rarely manifest as straightforward resource exhaustion.

The Database Performance Blind Spot

Infrastructure monitoring answers questions like “Is the server running?” and “Are we out of disk space?” Database performance monitoring answers fundamentally different questions: “Why is this query taking 800ms when it should take 15ms?” and “Which index is missing that’s causing a collection scan on 40 million documents?”

Consider a common production issue: a developer deploys code with an unindexed query on a growing collection. For weeks, the query performs fine because the dataset is small. Your infrastructure metrics show nothing unusual—CPU ticks up slightly, memory usage is stable. Then the collection crosses 10 million documents and suddenly that query triggers a full collection scan. Response times spike to multiple seconds. By the time your infrastructure alerts fire for elevated CPU, users have been experiencing degraded performance for 20 minutes.

Traditional monitoring misses the critical leading indicators: rising operation execution times, increasing number of collection scans, growing lock wait times, and replication lag that hasn’t yet triggered a failover.

What You’re Not Seeing

MongoDB exposes over 300 internal metrics through its serverStatus command that infrastructure tools never capture. These include operation-specific latency distributions (not just averages), lock acquisition patterns, working set size versus cache size, individual operation queue depths, and granular replication metrics including oplog window and apply batch times.

When a secondary replica starts falling behind, you need to know whether it’s due to network issues, slow disk writes, or application load patterns. Infrastructure metrics show network throughput and disk IOPS. MongoDB metrics show oplog application batch sizes, replication heartbeat latencies, and the specific operations causing bottlenecks.

💡 Pro Tip: The gap between your working set size and available cache is often the first indicator of an approaching performance cliff. Infrastructure metrics report total memory usage; MongoDB metrics show you’re about to exceed WiredTiger cache capacity and trigger excessive disk I/O.

Understanding these database-level metrics requires purpose-built monitoring. The MongoDB Prometheus exporter bridges this gap by extracting the metrics that matter for database performance and exposing them in a format your existing observability stack can consume.

Setting Up the MongoDB Prometheus Exporter

The mongodb_exporter serves as the bridge between MongoDB’s internal metrics and Prometheus’s time-series database. Getting it configured correctly from the start prevents security gaps and monitoring blind spots that become expensive to fix later.

Installation and Basic Configuration

The percona/mongodb_exporter is the most actively maintained option for production deployments. Deploy it as a systemd service on each MongoDB host or as a sidecar container in Kubernetes environments.
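For the systemd route, a minimal unit sketch looks like the following (binary path, environment file location, and user name are assumptions to adapt to your packaging):

mongodb_exporter.service
[Unit]
Description=Percona MongoDB exporter for Prometheus
After=network-online.target
Wants=network-online.target

[Service]
User=mongodb_exporter
# MONGODB_URI is read from an environment file so credentials stay out of the unit
EnvironmentFile=/etc/default/mongodb_exporter
ExecStart=/usr/local/bin/mongodb_exporter --web.listen-address=127.0.0.1:9216
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target

Keeping the connection URI in an environment file also makes credential rotation a matter of updating one file and restarting the service.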

For a replica set deployment, create a dedicated monitoring user with minimal privileges:

create_exporter_user.js
db.getSiblingDB("admin").createUser({
  user: "mongodb_exporter",
  pwd: "secure_password_from_vault",
  roles: [
    { role: "clusterMonitor", db: "admin" },
    { role: "read", db: "local" }
  ]
})

The clusterMonitor role provides read access to server statistics and replication state without exposing application data. The read role on the local database allows monitoring oplog metrics critical for replication lag tracking.

Configure the exporter with connection details and metric collection flags:

mongodb-exporter-config.yaml
global:
  scrape-interval: 30s
mongodb:
  uri: "mongodb://mongodb_exporter:secure_password_from_vault@localhost:27017/admin?tls=true&tlsCAFile=/etc/ssl/certs/mongodb-ca.pem"
exporter:
  bind-addr: "127.0.0.1:9216"
  web-telemetry-path: "/metrics"
  collectors:
    - diagnostic_data
    - replicaset_status
    - top_metrics
    - index_stats
    - collection_stats

The collector flags determine which metric groups the exporter exposes. The diagnostic_data collector captures server status metrics including operation counters, connection counts, and memory usage. Enable replicaset_status for replica set health metrics, top_metrics for per-collection operation statistics, and index_stats to track index utilization patterns. The collection_stats collector provides storage metrics like document counts and data sizes, but it adds non-trivial overhead on clusters with hundreds of collections—enable it selectively based on your monitoring requirements.

💡 Pro Tip: Keep the scrape interval at 30 seconds or higher for production clusters. MongoDB’s serverStatus command adds measurable load, and sub-30s intervals rarely provide actionable signal for database-level metrics.

Securing the Exporter

Never expose the exporter directly to the network. Bind to 127.0.0.1 and configure Prometheus to scrape through a reverse proxy or service mesh with mTLS. In Kubernetes, use NetworkPolicies to restrict access to the Prometheus namespace:

exporter-networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: mongodb-exporter-policy
  namespace: database
spec:
  podSelector:
    matchLabels:
      app: mongodb-exporter
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: monitoring
      ports:
        - protocol: TCP
          port: 9216

For TLS-enabled MongoDB clusters, the exporter must authenticate using certificates. Mount the client certificate and CA bundle, then reference them in the connection URI parameters. The certificate’s Common Name or Subject Alternative Names must match the hostname Prometheus uses to connect, and the certificate must be valid for client authentication. Configure MongoDB to require client certificates by setting net.tls.mode to requireTLS and net.tls.allowConnectionsWithoutCertificates to false in the server configuration.
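Put together, a client-certificate connection URI might look like this (hostnames and file paths are illustrative):

exporter-tls-uri.txt
mongodb://mongodb_exporter:secure_password_from_vault@mongo-rs0-0.internal:27017/admin?tls=true&tlsCAFile=/etc/ssl/certs/mongodb-ca.pem&tlsCertificateKeyFile=/etc/ssl/private/exporter-client.pem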

Rotate the exporter’s credentials regularly through your secrets management system. Use Kubernetes secrets with rotation policies or HashiCorp Vault’s dynamic secrets to avoid hardcoded credentials in configuration files. When rotating credentials, ensure the exporter pod restarts automatically to pick up new values—configure liveness probes that fail when authentication errors exceed a threshold.
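A minimal probe sketch that restarts the exporter pod when its metrics endpoint stops responding (checking authentication-error rates specifically would require a custom health endpoint, which this sketch does not assume):

exporter-liveness-probe.yaml
livenessProbe:
  httpGet:
    path: /metrics
    port: 9216
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3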

Configuring Prometheus Scrape Targets

Replica set monitoring requires scraping each member independently to capture node-specific metrics like replication lag and member state. Use Prometheus’s static_configs with target labels for small deployments:

prometheus-mongodb-scrape.yaml
scrape_configs:
  - job_name: 'mongodb-replicaset'
    static_configs:
      - targets:
          - 'mongo-rs0-0.internal:9216'
          - 'mongo-rs0-1.internal:9216'
          - 'mongo-rs0-2.internal:9216'
        labels:
          cluster: 'production-primary'
          environment: 'prod'
          replica_set: 'rs0'

For sharded clusters, configure separate jobs for each shard replica set and the config server replica set. This separation allows targeted alerting rules and prevents metric cardinality explosion when aggregating across the entire cluster. Label each job with the shard identifier to enable cross-shard comparisons in your dashboards.

In dynamic environments, use Prometheus service discovery mechanisms like kubernetes_sd_configs or consul_sd_configs to automatically detect new MongoDB instances as they join the cluster. This prevents monitoring gaps during scaling events and reduces manual configuration overhead. Configure relabeling rules to extract replica set names and shard identifiers from pod labels or service tags, ensuring consistent metric labeling across your infrastructure.
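A sketch of pod-based discovery with relabeling, assuming exporter pods carry app and replica_set pod labels (the label names are assumptions for your environment):

prometheus-mongodb-sd.yaml
scrape_configs:
  - job_name: 'mongodb-k8s'
    scrape_interval: 30s
    scrape_timeout: 15s
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['database']
    relabel_configs:
      # Keep only MongoDB exporter pods
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: mongodb-exporter
        action: keep
      # Promote the replica set name from a pod label to a metric label
      - source_labels: [__meta_kubernetes_pod_label_replica_set]
        target_label: replica_set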

Set appropriate scrape timeouts to handle slow responses during maintenance windows or heavy load. A timeout of 10-15 seconds prevents Prometheus from marking targets as down during temporary slowdowns while still catching genuine outages. Configure scrape_timeout slightly below your scrape_interval to avoid overlapping scrapes that amplify load on already-stressed database nodes.

With the exporter running and Prometheus scraping successfully, you have raw access to hundreds of MongoDB metrics. The challenge shifts to identifying which metrics actually matter for your workload’s performance and reliability.

Essential MongoDB Metrics to Track

Production MongoDB monitoring requires tracking metrics across four critical dimensions: query performance, replication health, resource utilization, and connection management. These metrics serve as leading indicators for issues that degrade application performance before users notice.

Visual: Dashboard showing essential MongoDB metrics organized by operational priority

Operation Counters and Latency Percentiles

Operation counters (opcounters) track the rate of commands, queries, updates, deletes, and getmore operations per second. While raw throughput matters, latency percentiles reveal how your database actually performs under load. Monitor P95 and P99 read and write latencies separately—a P99 write latency above 100ms typically indicates storage layer contention or insufficient WiredTiger tickets.
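If your exporter exposes operation latencies as Prometheus histograms (bucket metric names vary by exporter version, so treat this as a sketch), a P99 read latency query looks like:

p99-read-latency.promql
## P99 read latency over 5 minutes (MongoDB reports opLatencies in microseconds)
histogram_quantile(0.99,
  sum by (le) (rate(mongodb_op_latencies_histogram_bucket{type="read"}[5m]))
)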

Track the opcountersRepl metric separately to understand replication overhead. High replication operation rates without corresponding primary operations suggest oplog replay lag, which compounds during recovery scenarios.

Query execution statistics expose inefficient patterns before they cause outages. The totalKeysExamined to totalDocsReturned ratio identifies missing indexes—ratios above 10:1 indicate collection scans that will eventually overwhelm your cluster. The cursorTimeoutCount metric reveals application connection leaks or unbounded result sets.
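A scan-efficiency query built from the executor counters (metric names assumed from the exporter's serverStatus output):

scan-ratio.promql
## Keys and objects examined per document returned; sustained values above 10 suggest missing indexes
rate(mongodb_metrics_query_executor_scanned[5m])
/
rate(mongodb_metrics_document_returned[5m])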

Replication Lag and Oplog Window

Replication lag measures the time delay between primary writes and secondary application. Track replicationLag per secondary node, not as a cluster-wide average. A single lagging secondary during maintenance is acceptable; widespread lag above 10 seconds signals write pressure exceeding secondary replay capacity.

The oplog window determines how long secondaries can remain offline before requiring full resyncs. Monitor oplogWindowHours continuously—this metric shrinks during high write volumes. A window below 24 hours leaves insufficient recovery time for hardware failures or maintenance. The oplogSizeMB and oplogUsedMB metrics help predict when oplog expansion becomes necessary.
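One way to derive the oplog window from the exporter's oplog timestamp metrics (metric names assumed; verify against your exporter's output):

oplog-window.promql
## Oplog window in hours: newest minus oldest oplog entry timestamp
(mongodb_mongod_replset_oplog_head_timestamp - mongodb_mongod_replset_oplog_tail_timestamp) / 3600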

Connection Pool Exhaustion and Cursor Timeouts

MongoDB’s connection model makes pool exhaustion a leading cause of application timeouts. Track currentConnections against availableConnections to identify saturation. When active connections consistently exceed 80% of the configured limit, applications begin queuing requests, adding hundreds of milliseconds to P99 latencies.

Monitor activeClients broken down by read and write operations. Spikes in active writers during read-heavy workloads indicate lock contention from background operations or index builds. The currentQueue metric shows operations waiting for execution—sustained queuing above zero indicates capacity problems that connection pool tuning cannot solve.

WiredTiger Cache and Ticket Utilization

WiredTiger’s cache hit ratio determines whether MongoDB serves data from memory or disk. Track wiredTiger.cache.bytesCurrentlyInCache against wiredTiger.cache.maximumBytesConfigured. Cache utilization consistently above 95% triggers eviction storms that spike read latencies. The wiredTiger.cache.pagesEvicted rate quantifies this thrashing.

WiredTiger tickets control concurrent operations—available read tickets (wiredTiger.concurrentTransactions.read.available) and write tickets (wiredTiger.concurrentTransactions.write.available) act as admission control. When available tickets drop to zero, operations queue regardless of CPU or I/O availability. Default ticket counts (128 per type) rarely match production workload characteristics.

💡 Pro Tip: Combine low available tickets with high P99 latency in a single alert expression to catch capacity issues before connection pool exhaustion causes outages.
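A sketch of that combination as one PromQL condition (the ticket metric name is an assumption; exporters name it differently across versions):

ticket-pressure.promql
## Read tickets nearly exhausted while P99 read latency is elevated (latency in microseconds)
mongodb_ss_wt_concurrent_transactions_available_tickets{txn_type="read"} < 16
and on (instance)
histogram_quantile(0.99, sum by (instance, le) (rate(mongodb_op_latencies_histogram_bucket{type="read"}[5m]))) > 100000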

With these metrics instrumented, building effective Grafana dashboards requires organizing them by operational workflow rather than technical category.

Building Grafana Dashboards for Operations Teams

The difference between useful monitoring and metric overload is dashboard design. A well-structured Grafana dashboard surfaces actionable insights in seconds, while a poorly designed one buries critical signals under dozens of irrelevant panels.

Designing for Different Personas

Your development team and SRE team need fundamentally different views of the same MongoDB cluster. Developers care about query performance, index usage, and collection-level metrics. SREs focus on replica set health, replication lag, and resource saturation.

Create two separate dashboards rather than cramming everything into one. A developer dashboard should highlight slow queries, connection pool utilization, and document scan ratios. An SRE dashboard prioritizes cluster topology, oplog window size, and member state changes.

Use Grafana’s dashboard variables to make these dashboards flexible. Define variables for replica set, database, and collection names so users can filter metrics without creating duplicate dashboards:

dashboard-variables.promql
## Replica set variable
label_values(mongodb_up, replica_set)
## Database variable
label_values(mongodb_db_stats_collections{replica_set="$replica_set"}, database)
## Collection variable
label_values(mongodb_collection_stats_size{database="$database"}, collection)

This variable-driven approach scales particularly well in multi-tenant environments. A single dashboard template can serve dozens of replica sets by simply changing the $replica_set variable. Chain variables together so selecting a replica set automatically populates available databases, and selecting a database filters collection options.

Consider creating a third dashboard type for executives and product managers focused on business-level metrics: request rates, error percentages, and SLA compliance rather than technical infrastructure details. These stakeholders need visualization clarity over technical depth—use gauge panels for uptime percentages and stat panels for current query latency rather than time series graphs.

Aggregating Metrics Across Replica Sets

Raw per-instance metrics rarely tell the complete story. Aggregating data across replica set members reveals cluster-wide patterns that individual node metrics obscure.

Track total operations per second across all secondaries to understand read distribution:

replica-set-ops.promql
## Total read operations across all secondaries
sum(rate(mongodb_op_counters_total{type="query", state="secondary", replica_set="production-cluster"}[5m]))
## Write operations on primary (should be zero on secondaries)
sum(rate(mongodb_op_counters_total{type="insert", state="primary", replica_set="production-cluster"}[5m])) +
sum(rate(mongodb_op_counters_total{type="update", state="primary", replica_set="production-cluster"}[5m])) +
sum(rate(mongodb_op_counters_total{type="delete", state="primary", replica_set="production-cluster"}[5m]))

Monitor replication lag across all secondaries with percentile aggregations. Maximum lag identifies the worst-case scenario, but p95 lag reveals whether most secondaries stay healthy:

replication-lag.promql
## Maximum replication lag across secondaries
max(mongodb_replset_member_replication_lag{state="secondary", replica_set="production-cluster"})
## p95 replication lag
quantile(0.95, mongodb_replset_member_replication_lag{state="secondary", replica_set="production-cluster"})

Track connection pool saturation across your application fleet:

connection-saturation.promql
## Connection usage: current / (current + available), since `available` reports remaining capacity
sum(mongodb_connections{state="current"})
/
(sum(mongodb_connections{state="current"}) + sum(mongodb_connections{state="available"})) * 100

When this ratio consistently exceeds 80%, your applications are approaching connection limits. Set threshold alerts at 70% to give yourself runway before saturation causes application timeouts.
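Expressed as an alert rule (treating `available` as remaining rather than total capacity):

connection-saturation-alert.yaml
groups:
  - name: mongodb-connections
    rules:
      - alert: ConnectionPoolNearSaturation
        expr: |
          sum(mongodb_connections{state="current"})
          /
          (sum(mongodb_connections{state="current"}) + sum(mongodb_connections{state="available"}))
          * 100 > 70
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Connection usage above 70% of configured limit"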

Aggregate cache efficiency to assess WiredTiger performance cluster-wide. A rising miss ratio often precedes performance degradation. Because pages read into cache represent misses, divide them by total cache requests:

cache-efficiency.promql
## Cluster-wide cache miss ratio (pages read into cache are misses)
sum(rate(mongodb_wiredtiger_cache_pages_read_into_cache_total[5m]))
/
sum(rate(mongodb_wiredtiger_cache_pages_requested_from_cache_total[5m])) * 100

When aggregating across instances, always verify that your queries account for replica set topology changes. A newly added secondary should not skew your baseline metrics, and a failed node should not create gaps in your aggregate calculations.

Building Drill-Down Panels for Query Investigation

The most valuable dashboards enable rapid investigation without leaving Grafana. Structure your panels to support progressive drill-down from cluster-wide views to individual query patterns.

Start with a high-level panel showing operations by type:

operations-by-type.promql
## Operations per second by command type
sum by (type) (rate(mongodb_op_counters_total{replica_set="production-cluster"}[5m]))

Create a second panel showing slow query count trends:

slow-queries.promql
## Slow queries per minute (>100ms threshold)
rate(mongodb_slow_queries_total{threshold="100"}[1m]) * 60

Add a table panel that breaks down operations by database and collection:

collection-operations.promql
## Top collections by operation count
topk(10,
  sum by (database, collection) (
    rate(mongodb_collection_stats_operations{replica_set="production-cluster"}[5m])
  )
)

Link these panels together using Grafana’s data links feature. When users click a spike in slow queries, they should land on a detailed view showing which collections generated those queries. Configure data links to preserve time ranges and variable selections across navigation, maintaining context as engineers move between dashboard layers.

💡 Pro Tip: Set up text panels with links to your MongoDB profiler logs or APM tools. When dashboard metrics show degradation, engineers need immediate access to detailed query traces without hunting through documentation.

Use row collapsing to organize related panels. Group all replication metrics under a “Replication Health” row, all cache metrics under “WiredTiger Cache”, and query performance under “Query Analysis”. This structure prevents dashboard sprawl while keeping related information accessible.

Implement query inspection panels that correlate multiple data sources. Place a slow query count panel directly above index usage statistics and collection scan ratios. When slow queries spike, adjacent panels immediately show whether missing indexes or inefficient query patterns caused the degradation. This spatial correlation reduces investigation time from minutes to seconds.

With dashboards tailored to specific workflows and drill-down paths that accelerate investigation, your team can move from “something is wrong” to “here’s the problematic query” in under a minute. The next step is converting those observations into automated alerts that catch issues before they impact users.

Alerting Strategies That Reduce Noise

Alert fatigue kills on-call culture. When your monitoring system sends 50 alerts per day, teams learn to ignore them—and eventually miss the critical ones. For MongoDB in production, the challenge is designing alerts that catch genuine degradation without triggering on normal operational variance.

Beyond Simple Thresholds

Traditional threshold alerts fire when a metric crosses a static value. This approach fails for databases with variable workloads:

prometheus-alerts-bad.yml
groups:
  - name: mongodb-naive
    rules:
      - alert: HighConnectionCount
        expr: mongodb_connections{state="current"} > 500
        annotations:
          summary: "Connection count above 500"

This alert triggers every time your application scales up during business hours. A better approach uses rate of change and historical context:

prometheus-alerts-improved.yml
groups:
  - name: mongodb-intelligent
    rules:
      - alert: ConnectionSpikeAnomaly
        expr: |
          (
            mongodb_connections{state="current"}
            -
            avg_over_time(mongodb_connections{state="current"}[1h] offset 1d)
          ) >
          2 * stddev_over_time(mongodb_connections{state="current"}[1h] offset 1d)
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Connections deviate 2σ from daily pattern"
          description: "Deviation above yesterday's average at this hour: {{ $value }} connections"

This rule compares current connections against the same hour yesterday, accounting for normal daily patterns. It fires only when behavior deviates significantly from historical norms—more than two standard deviations from the baseline. The for: 10m clause prevents transient spikes from triggering alerts, requiring the anomaly to persist before firing.

Trend-based detection extends beyond simple comparison. For metrics like memory usage that grow predictably, forecast-based alerts prevent surprise outages:

prometheus-alerts-predictive.yml
groups:
  - name: mongodb-predictive
    rules:
      - alert: MemoryExhaustionPredicted
        expr: |
          predict_linear(
            mongodb_mem_resident[4h],
            3600 * 24
          ) > mongodb_mem_virtual * 0.9
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage trending toward exhaustion within 24h"
          description: "Projected resident memory in 24h: {{ $value }}"

This alert fires when linear regression of the past 4 hours predicts memory will reach 90% capacity within 24 hours, giving teams time to investigate before an actual outage occurs.

Correlating Multiple Signals

Real incidents rarely manifest as a single metric spike. Effective alerts combine multiple signals to confirm problems:

prometheus-alerts-correlated.yml
groups:
  - name: mongodb-correlation
    rules:
      - alert: ReplicationLagWithHighOplog
        expr: |
          mongodb_replset_member_replication_lag > 10
          and
          rate(mongodb_oplog_size_bytes[5m]) > 1048576
          and
          rate(mongodb_network_bytes_total{state="out"}[5m]) < 524288
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Secondary falling behind with network issues"
          description: "Replication lag {{ $value }}s with active oplog growth but low network throughput on {{ $labels.instance }}"

This alert triggers only when three conditions align: replication lag exceeds 10 seconds, the oplog is actively growing (indicating primary writes), and network output is low (suggesting network or secondary performance issues). This correlation eliminates false positives from planned maintenance or normal catch-up scenarios.

Multi-signal correlation proves especially valuable for cache efficiency alerts. A low cache hit ratio alone doesn’t warrant paging someone, but combined with increased disk I/O and query latency, it indicates a genuine working set expansion:

prometheus-alerts-cache.yml
groups:
  - name: mongodb-cache-correlation
    rules:
      - alert: WorkingSetExpansion
        expr: |
          rate(mongodb_metrics_query_executor_scanned_objects[5m])
          /
          rate(mongodb_metrics_query_executor_scanned[5m]) > 10
          and
          rate(mongodb_extra_info_page_faults[5m]) > 100
          and
          rate(mongodb_op_latencies_histogram_sum{type="read"}[5m])
          /
          rate(mongodb_op_latencies_histogram_count{type="read"}[5m]) > 50
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cache pressure detected across multiple metrics"
          description: "High scan ratio, page faults, and read latency suggest working set exceeds available memory"

Tiered Severity and Escalation

Not all issues require waking someone at 3 AM. Implement severity tiers that match business impact:

prometheus-alerts-severity.yml
groups:
  - name: mongodb-tiered
    rules:
      - alert: QueryExecutionSlow
        expr: |
          rate(mongodb_op_latencies_histogram_sum{type="command"}[5m])
          /
          rate(mongodb_op_latencies_histogram_count{type="command"}[5m])
          > 100
        for: 15m
        labels:
          severity: warning
          team: database
        annotations:
          summary: "Average query latency above 100ms"
          runbook: "https://wiki.company.io/mongodb-slow-queries"
      - alert: QueryExecutionCritical
        expr: |
          rate(mongodb_op_latencies_histogram_sum{type="command"}[5m])
          /
          rate(mongodb_op_latencies_histogram_count{type="command"}[5m])
          > 1000
        for: 5m
        labels:
          severity: critical
          team: database
          oncall: page
        annotations:
          summary: "Average query latency above 1s - user impact likely"
          runbook: "https://wiki.company.io/mongodb-critical-latency"

The warning-level alert gives teams visibility into degradation with a 15-minute evaluation window. The critical alert has a tighter threshold and shorter window, triggering pages only when user experience is actively impacted. Route warning alerts to Slack; reserve PagerDuty for critical severity.
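That routing split maps directly onto Alertmanager configuration (receiver names and integration settings are placeholders):

alertmanager-routing.yaml
route:
  receiver: slack-database
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-database
receivers:
  - name: slack-database
    slack_configs:
      - channel: '#db-alerts'
  - name: pagerduty-database
    pagerduty_configs:
      - routing_key: '<from-secrets>'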

Escalation policies should account for alert duration. An issue that persists for an hour despite initial response warrants escalation to senior engineers or management:

prometheus-alerts-escalation.yml
groups:
  - name: mongodb-escalation
    rules:
      - alert: PersistentPerformanceDegradation
        expr: |
          ALERTS{alertname=~"QueryExecutionSlow|ReplicationLagWithHighOplog",alertstate="firing"}
        for: 1h
        labels:
          severity: critical
          escalation: senior
        annotations:
          summary: "Performance alert unresolved for 1 hour"
          description: "Alert {{ $labels.alertname }} has been firing for over an hour - escalating to senior team"

This meta-alert triggers when any performance-related alert remains unresolved for an hour, automatically escalating the incident. This pattern ensures persistent issues get appropriate attention without immediately over-escalating transient problems.

With intelligent alerting rules in place, the next challenge is scaling this observability infrastructure across multiple MongoDB clusters and geographic regions.

Scaling Monitoring for Large Deployments

When you’re managing dozens of MongoDB clusters across multiple regions, your monitoring infrastructure becomes a bottleneck. A single Prometheus instance starts dropping metrics, queries time out, and your dashboards become unusable. The solution requires architectural changes to how you collect, store, and query MongoDB metrics at scale.

Managing High Cardinality in Sharded Clusters

Sharded MongoDB deployments generate metric explosions. Each shard, replica set member, and database creates unique label combinations. A 10-shard cluster with 3 replicas per shard monitoring 50 databases produces over 150,000 time series for basic operation counters alone. Without intervention, Prometheus memory usage spikes and ingestion falls behind.

Reduce cardinality by filtering metrics at collection time. Configure the MongoDB exporter to exclude databases you don’t need to monitor individually:

mongodb-exporter-config.yaml
## Only collect metrics for production databases
mongodb:
  uri: "mongodb://monitor:password@mongodb-cluster:27017"
  collstats-colls: "prod_orders,prod_users,prod_inventory"
  indexstats-colls: "prod_orders,prod_users"
  # Cap the number of collections with per-collection stats
  collstats-limit: 20
  ## Disable per-collection metrics for low-priority clusters
  compatible-mode: false
  discovering-mode: false

For sharded clusters, deploy one exporter per mongos router instead of per shard. This aggregates metrics naturally and reduces the total number of scrape targets. A 12-shard cluster drops from 36 scrape targets (3 replicas × 12 shards) to just 2-3 mongos instances.

Additionally, relabel metrics to drop unnecessary labels. Remove the database label from metrics where you only need cluster-level visibility:

prometheus-relabel.yaml
scrape_configs:
  - job_name: 'mongodb'
    # metric_relabel_configs (not relabel_configs) is required to drop
    # individual time series by metric name after the scrape
    metric_relabel_configs:
      # Drop per-database metrics for background operations
      - source_labels: [__name__, database]
        regex: 'mongodb_ss_backgroundFlushing.*;.*'
        action: drop

This relabeling alone can reduce cardinality by 40-60% in multi-tenant environments.

Implementing Prometheus Federation

Federation splits metric collection across multiple Prometheus instances. Configure regional Prometheus servers to scrape local MongoDB clusters, then aggregate critical metrics into a global Prometheus instance.

prometheus-federation.yaml
## Global Prometheus configuration
scrape_configs:
  - job_name: 'federate-us-east'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="mongodb",region="us-east-1"}'
        - '{__name__=~"mongodb_up|mongodb_connections|mongodb_opcounters_total"}'
    static_configs:
      - targets:
          - 'prometheus-us-east-1.internal:9090'
  - job_name: 'federate-eu-west'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="mongodb",region="eu-west-1"}'
        - '{__name__=~"mongodb_up|mongodb_connections|mongodb_opcounters_total"}'
    static_configs:
      - targets:
          - 'prometheus-eu-west-1.internal:9090'

This architecture keeps high-resolution data regional while centralizing essential metrics for global dashboards. Regional Prometheus instances retain 30 days of data with 15-second scrape intervals; the global instance stores only federated metrics for 90 days at 30-second resolution. This tiered retention strategy reduces global storage requirements by 85% while maintaining detailed regional metrics for troubleshooting.

A single Prometheus server applies one retention window to all metrics, so enforce different retention per metric class at the architecture level: federate long-lived metrics like replication lag and connection counts into the 90-day global instance, and let verbose operation counters age out of the regional instances after 14 days.

Optimizing with Recording Rules

Complex queries that aggregate metrics across shards kill dashboard performance. Pre-compute these aggregations using recording rules that run every 30 seconds:

mongodb-recording-rules.yaml
groups:
  - name: mongodb_aggregations
    interval: 30s
    rules:
      - record: cluster:mongodb_connections:sum
        expr: sum(mongodb_connections{state="current"}) by (cluster)
      - record: cluster:mongodb_opcounters:rate5m
        expr: sum(rate(mongodb_opcounters_total[5m])) by (cluster, type)
      - record: cluster:mongodb_replication_lag:max
        expr: max(mongodb_mongod_replset_member_replication_lag) by (cluster, replset)
      - record: cluster:mongodb_query_executor:p95
        expr: histogram_quantile(0.95, sum(rate(mongodb_ss_metrics_query_executor_bucket[5m])) by (cluster, le))

Recording rules reduce dashboard query load by 70% and enable sub-second dashboard rendering even with 50+ MongoDB clusters. They also enable longer retention of aggregated metrics—store pre-computed cluster totals for a year while keeping raw per-shard metrics for only 30 days.
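Dashboard panels then query the pre-computed series instead of re-running the aggregation on every refresh:

```promql
# Before: recomputed across all shards on each dashboard load
sum(rate(mongodb_opcounters_total[5m])) by (cluster, type)

# After: a single lookup of the pre-computed series
cluster:mongodb_opcounters:rate5m
```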

💡 Pro Tip: Use recording rules for any query that appears in more than three dashboard panels. The storage overhead is minimal compared to the query performance gains.

With these scaling patterns in place, you need to translate metrics into actionable operational procedures. The next section covers building runbooks that help on-call engineers respond to alerts effectively.

From Metrics to Action: Building Runbooks

Dashboards and alerts are only as valuable as the actions they trigger. The most effective MongoDB monitoring implementations bridge the gap between metrics and resolution by embedding operational knowledge directly into the observability stack.

Linking Panels to Response Procedures

Grafana supports panel-level documentation through the description field and external links. Configure each critical panel with direct links to your incident response runbooks stored in Confluence, Notion, or your internal wiki. When a high replication lag panel shows degradation, engineers should see a one-click link to the “Replication Lag Remediation” procedure. This eliminates the context-switching overhead during incidents when every second counts.
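In the dashboard JSON model this is a per-panel `links` array alongside the `description` field; a sketch (the wiki URL is illustrative):

```json
{
  "title": "Replication Lag",
  "description": "Max lag across replica set members. Consult the linked runbook before restarting members.",
  "links": [
    {
      "title": "Runbook: Replication Lag Remediation",
      "url": "https://wiki.internal/runbooks/replication-lag",
      "targetBlank": true
    }
  ]
}
```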

For teams using PagerDuty or Opsgenie, include runbook URLs in alert payloads. The on-call engineer receives not just the symptom—“Connection pool exhausted on production cluster”—but also the standard operating procedure for resolution, including commands to check connection pool statistics, identify long-running queries, and safely increase pool limits.
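One common pattern is to attach the runbook URL as an alert annotation, which Alertmanager then forwards in the webhook payload to PagerDuty or Opsgenie. A sketch of such a rule (the expression, threshold, and URL are illustrative):

```yaml
groups:
  - name: mongodb_alerts
    rules:
      - alert: MongoDBConnectionPoolExhausted
        expr: >
          mongodb_connections{state="current"}
            / (mongodb_connections{state="current"} + mongodb_connections{state="available"}) > 0.9
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Connection pool exhausted on {{ $labels.cluster }}"
          runbook_url: "https://wiki.internal/runbooks/mongodb-connection-pool"
```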

Pattern Recognition and Standard Responses

Document the recurring performance patterns your team encounters. Slow query spikes correlating with batch jobs warrant scheduled index creation during maintenance windows. Connection pool exhaustion following deployments indicates inadequate connection reuse in application code. Lock contention during backups suggests moving to filesystem snapshots instead of mongodump.

Create a decision tree that maps metric patterns to root causes. When operations per second drops while queue depth increases, check for lock contention. When working set size approaches RAM, evaluate index usage and consider horizontal sharding. These heuristics belong in runbooks referenced from dashboard annotations.
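The decision tree can be sketched as a small lookup that maps observed metric patterns to the runbook an engineer should open first (thresholds and runbook names here are illustrative, not a definitive triage policy):

```python
def suggest_runbook(ops_per_sec_delta, queue_depth_delta,
                    working_set_bytes, ram_bytes):
    """Map coarse metric patterns to a suggested runbook."""
    # Throughput falling while queued operations climb points
    # at lock contention.
    if ops_per_sec_delta < 0 and queue_depth_delta > 0:
        return "lock-contention"
    # Working set near RAM: evaluate index usage, consider sharding.
    if working_set_bytes > 0.9 * ram_bytes:
        return "working-set-pressure"
    return "general-triage"

print(suggest_runbook(-150, 40, 8 << 30, 16 << 30))  # → lock-contention
```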

💡 Pro Tip: Use Grafana’s dashboard variables to generate runbook-specific queries. A runbook for investigating slow queries can include a pre-populated link to MongoDB Compass filtered to the specific database and collection showing degradation.

Metrics in Deployment Pipelines

Integrate MongoDB metrics into deployment verification. After canary deployments, automated pipelines should query Prometheus for the last 15 minutes of query execution times, replication lag, and error rates on the canary subset. If metrics exceed baseline thresholds, trigger automatic rollback before promoting to production.

This metrics-driven deployment validation catches performance regressions that pass functional tests. A new index strategy might execute correctly but triple query latency under production load—measurable immediately through continuous monitoring integration rather than discovered through user reports.
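A minimal sketch of such a gate, assuming a hypothetical internal Prometheus endpoint; the baseline comparison itself is a pure function the pipeline can apply to any fetched metric:

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus.internal:9090"  # hypothetical endpoint

def query_prometheus(promql):
    """Run an instant query against the Prometheus HTTP API."""
    url = f"{PROM_URL}/api/v1/query?query={urllib.parse.quote(promql)}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["data"]["result"]

def exceeds_baseline(canary, baseline, tolerance=0.20):
    """True if the canary regressed past baseline * (1 + tolerance)."""
    return canary > baseline * (1 + tolerance)

def should_rollback(samples):
    """samples: list of (metric name, canary value, baseline value)."""
    return [name for name, canary, base in samples
            if exceeds_baseline(canary, base)]

if __name__ == "__main__":
    # Values a pipeline might have fetched via query_prometheus().
    checks = [
        ("p95_query_ms", 48.0, 45.0),     # within tolerance
        ("replication_lag_s", 9.0, 2.0),  # regression
        ("error_rate", 0.01, 0.01),       # unchanged
    ]
    print("rollback due to:", should_rollback(checks))
```

The 20% tolerance is an assumption to tune per metric; replication lag in particular may warrant an absolute ceiling rather than a relative one.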

With runbooks in place, your monitoring system evolves from a diagnostic tool into an operational knowledge base that guides teams from detection through resolution.

Key Takeaways

  • Deploy mongodb_exporter with authentication and monitor replica set health, replication lag, and WiredTiger cache metrics
  • Build role-specific Grafana dashboards using PromQL aggregations that help teams diagnose issues in under 2 minutes
  • Implement multi-signal alerting rules that correlate operation latency, connection pool usage, and replication lag to reduce false positives by 80%