Debugging MongoDB in a real production cluster: a strategy, not a checklist

[DRAFT NOTE] Strategic framing is correct based on the scaling work; specific incident examples and exact commands tagged [TODO] for verification before promoting.

The first time I had to debug a production MongoDB cluster under real load, I made every classic mistake.

I looked at slow query counts and assumed slow queries were the problem. I looked at CPU and assumed busy CPU was the problem. I looked at IOPS and assumed IOPS were the problem. None of these were wrong exactly — they were all signals — but none of them were the actual root cause, and I spent hours chasing each one before the lesson sank in.

A sharded MongoDB cluster has roughly six places things can go wrong. The art of debugging it is figuring out which one — and that’s a strategy question, not a checklist question.

The six places things go wrong

MongoS routing tier. Connection saturation, slow query routing, the wrong MongoS getting overloaded while others are idle.
MongoD shard tier. Compute saturation on a specific shard. Disk IOPS hitting a ceiling. Working set exceeding RAM. Hot-shard incidents.
Config server. Rare but catastrophic. Chunk metadata stale, balancer thrashing, sharded operations stalled.
Client connection pool. The application is the problem — pool exhausted, connections not being returned, MongoDB itself is fine.
Network. Inter-shard traffic during balancing, cross-AZ latency, MongoS-to-MongoD link saturation.
Application query patterns. The queries themselves are wrong — missing shard key, scatter-gather queries, queries that grew over time as data did.

Each one looks different in the metrics. Each one has a different fix. Confusing them is how you spend a week tuning the wrong tier.

The loop

The strategy I’ve converged on is a loop, not a checklist. Each iteration narrows the search.

Step 1: identify the symptom precisely. Not “Mongo is slow.” That’s where everyone starts and it tells you nothing. “p95 of findOne({tenantId, sessionId}) on userContexts collection has gone from 8ms to 90ms over the last hour, affecting tenant X but not tenant Y.” That’s a starting point.

Step 2: classify the symptom by tier. Ask:

Is the slowness on all queries, or specific collections? → all = infrastructure tier; specific = shard or query.
Is it on all shards, or one? → one = hot shard or shard-specific issue.
Is it on all clients, or one client type? → one = client pool or specific query pattern.
Did it correlate with a deploy, a data event, or a time-based trigger? → tells you whether to bisect on changes or look at data growth.
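The four questions above can be written down as a tiny decision helper. This is only a sketch of the reasoning, not a tool I actually run: the function name and its input fields (allQueries, oneShard, oneClientType, trigger) are made up for illustration.

```javascript
// Hypothetical helper: turns the Step 2 answers into a list of tiers to check first.
// None of these names come from MongoDB; they only mirror the questions in the text.
function classifySymptom({ allQueries, oneShard, oneClientType, trigger }) {
  const suspects = [];
  // All queries slow points at infrastructure; specific collections point at shard or query.
  suspects.push(allQueries ? "infrastructure tier" : "shard or query");
  if (oneShard) suspects.push("hot shard or shard-specific issue");
  if (oneClientType) suspects.push("client pool or specific query pattern");
  // The trigger decides where to look for the change that caused it.
  let next = "no obvious trigger; keep narrowing by tier";
  if (trigger === "deploy") next = "bisect recent changes";
  else if (trigger === "data" || trigger === "time") next = "look at data growth";
  return { suspects, next };
}
```

The value is less in the code than in forcing yourself to answer all four questions before touching anything.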

Step 3: confirm with the right metric. Don’t trust the most-visible metric. The most-visible metric in a MongoDB dashboard is “slow query count” — and slow queries can be a symptom of any of the six failure tiers above. Confirm with:

MongoS: connection count per pod, query routing time, MongoS CPU
MongoD: per-shard CPU, IOPS, queue depth, cache hit rate
Config server: chunk distribution age, balancer state
Client: connection pool wait time on the application side
Network: inter-shard bandwidth, latency
Query: explain plans for the specific slow query

Step 4: take one corrective action and watch. Don’t change three things at once. You won’t know which one helped (or hurt).

Step 5: iterate. Either you confirmed the hypothesis (great, document the fix) or you didn’t (back to step 2 with new information).

This sounds bureaucratic. In practice it takes minutes when you’ve internalized it, and it’s much faster than the “try things and hope” approach that everyone defaults to under pressure.

The specific debugging moves that paid off most

These are the ones I reach for first now, in rough order of how often they’ve been useful.

db.currentOp() filtered to slow ops.

db.currentOp({"active": true, "secs_running": {"$gt": 1}})

Tells you what’s running right now. Often illuminating in a way that aggregated metrics aren’t. You see the actual query, the namespace, the shard it’s hitting, whether it’s a getMore (long cursor), whether it’s blocked on a lock.
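When the list is long, grouping it helps. A minimal sketch, assuming the input is the inprog array that db.currentOp() returns and that entries carry the usual ns, op, secs_running and waitingForLock fields; the summarizeOps name is mine.

```javascript
// Groups currentOp entries by namespace and op type so a pile-up on one
// collection stands out. Pure function: pass it the `inprog` array.
function summarizeOps(inprog) {
  const groups = new Map();
  for (const op of inprog) {
    const key = `${op.ns || "(none)"} ${op.op || "(unknown)"}`;
    const g = groups.get(key) || { key, count: 0, maxSecs: 0, waitingForLock: 0 };
    g.count += 1;
    g.maxSecs = Math.max(g.maxSecs, Number(op.secs_running || 0));
    if (op.waitingForLock) g.waitingForLock += 1;
    groups.set(key, g);
  }
  // Groups with the longest-running op first.
  return [...groups.values()].sort((a, b) => b.maxSecs - a.maxSecs);
}

// Assumed usage in mongosh:
//   summarizeOps(db.currentOp({ active: true, secs_running: { $gt: 1 } }).inprog)
```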

Per-shard serverStatus.

db.adminCommand({"serverStatus": 1})

Run it on each mongod (not via mongos). Look at:

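connections.current: is one shard fielding way more connections than the others?
wiredTiger.cache["bytes currently in the cache"] vs wiredTiger.cache["maximum bytes configured"]: how full the cache is relative to its configured size, a rough proxy for the working set vs RAM ratio.
globalLock.activeClients (total, readers, writers): is the shard busy or idle?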

The asymmetry across shards is usually the most informative thing.
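A sketch of that comparison, assuming you have already collected one serverStatus document per shard into a plain object keyed by shard name. The shardAsymmetry name and the 2x-median threshold are my own illustrative choices, not a MongoDB convention.

```javascript
// statusByShard: { shardName: serverStatusDocument }, gathered by running
// db.adminCommand({ serverStatus: 1 }) directly on each mongod.
function shardAsymmetry(statusByShard, ratioThreshold = 2) {
  const rows = Object.entries(statusByShard).map(([shard, s]) => ({
    shard,
    connections: s.connections.current,
    activeClients: s.globalLock.activeClients.total,
    cacheFill:
      s.wiredTiger.cache["bytes currently in the cache"] /
      s.wiredTiger.cache["maximum bytes configured"],
  }));
  const sorted = rows.map((r) => r.connections).sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median = sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  // Flag shards whose connection count is well above the median (threshold is arbitrary).
  return rows.map((r) => ({
    ...r,
    connectionOutlier: median > 0 && r.connections / median >= ratioThreshold,
  }));
}
```

The same median comparison works for any of the other fields; connections is just the one that has caught the most problems for me.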

sh.status(true), the verbose form.

Tells you chunk distribution per shard, balancer state, ongoing chunk migrations. The “is one shard much bigger than the others” check.

Per-query explain plans.

db.collection.find({...}).explain("executionStats")

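For a query you suspect is slow, the explain plan tells you whether it’s using the shard key. Run through mongos, a winning plan stage of SINGLE_SHARD means the query was targeted at one shard (good). SHARD_MERGE means results were merged from several shards; if the shards list in the plan covers every shard in the cluster, the query is scatter-gather. SHARDING_FILTER is a per-shard stage that filters out orphaned documents and says nothing about targeting.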

This is the fastest way to spot a query that’s been quietly scattering across the whole cluster because someone forgot to include the shard key in the filter.
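A minimal sketch of reading that off the explain output, assuming the explain was run through mongos so that queryPlanner.winningPlan has a stage of SINGLE_SHARD or SHARD_MERGE and a shards array with one entry per shard contacted. The classifyTargeting name and the totalShards argument are mine.

```javascript
// Classifies the mongos-level part of an explain() result for a sharded collection.
// totalShards is the number of shards in the cluster (for example from sh.status()).
function classifyTargeting(explainOutput, totalShards) {
  const plan = explainOutput.queryPlanner.winningPlan;
  if (!Array.isArray(plan.shards)) {
    // Not the mongos shape (unsharded collection, or explain run on a mongod).
    return { targeting: "unknown", shardsHit: 0 };
  }
  const shardsHit = plan.shards.length;
  if (plan.stage === "SINGLE_SHARD") return { targeting: "single-shard", shardsHit };
  if (shardsHit >= totalShards) return { targeting: "scatter-gather", shardsHit };
  return { targeting: "multi-shard", shardsHit };
}

// Assumed usage in mongosh:
//   classifyTargeting(db.userContexts.find({ tenantId: "t1" }).explain("executionStats"), 6)
```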

MongoS connection counts per app pod.

Not a MongoDB query — an infrastructure question. How many MongoS instances does each app pod connect to? If the answer is “all of them,” you have the MongoS fan-out problem covered in the scaling pillar. Pin connections per pod and see if MongoS load drops.
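The arithmetic behind this is worth doing explicitly. A rough sketch, assuming the driver keeps a separate pool of up to maxPoolSize connections for each mongos it knows about, and that pinned pods are spread evenly across the mongos fleet. The function and parameter names are hypothetical.

```javascript
// Upper bound on the connections a single mongos can see from the app tier.
function connectionsPerMongos({ appPods, maxPoolSize, mongosTotal, mongosPerPod }) {
  // Each pod opens up to maxPoolSize connections to each of its mongosPerPod routers;
  // spread evenly, every mongos receives 1/mongosTotal of that total.
  return Math.ceil((appPods * mongosPerPod * maxPoolSize) / mongosTotal);
}
```

With made-up numbers (200 pods, maxPoolSize of 100, 10 mongos), every pod connecting to all 10 gives a ceiling of 20000 connections per mongos; pinning each pod to 2 gives 4000.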

The mental model shifts I had to make

Some things took me longer than they should have to internalize.

Slow query count is a lagging indicator, not a leading one. By the time slow queries are piling up, something else has been wrong for a while. Use it as confirmation, not as the starting signal.

Low MongoD CPU + high MongoS slow queries = routing problem, not compute problem. This combination took me embarrassingly long to recognize the first time. The shards are fine; the router is the bottleneck. Adding more shards won’t help; the fix is on the MongoS or client tier.

A “balanced” cluster can still have a hot shard. Chunk count is balanced; chunk traffic might not be. One shard might be holding the chunks for one big tenant whose write rate is 10× the others. Look at per-shard ops/sec, not just chunk count.
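Per-shard ops/sec can be derived from two serverStatus samples taken a known interval apart, since opcounters are cumulative. A sketch under that assumption; the shardOpsPerSec name and the shape of the input object are mine.

```javascript
// samplesByShard: { shardName: { before, after } } where before/after are
// serverStatus documents taken intervalSecs apart on the same mongod.
function shardOpsPerSec(samplesByShard, intervalSecs) {
  const keys = ["insert", "query", "update", "delete", "getmore", "command"];
  const rates = {};
  for (const [shard, { before, after }] of Object.entries(samplesByShard)) {
    let delta = 0;
    for (const k of keys) {
      delta += Number(after.opcounters[k]) - Number(before.opcounters[k]);
    }
    rates[shard] = delta / intervalSecs;
  }
  return rates;
}
```

Comparing these rates side by side is what exposes the shard that is balanced on chunk count but lopsided on traffic.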

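Working set vs RAM is the silent killer. If your working set just exceeded RAM, cache hit rate falls, IOPS climbs, p95 latency goes non-linear. The transition isn’t gradual; it’s a cliff. Watch wiredTiger.cache["bytes currently in the cache"] against wiredTiger.cache["maximum bytes configured"], and because a busy WiredTiger cache tends to sit near its eviction target regardless, also watch the rate of wiredTiger.cache["pages read into cache"]: a sustained jump in reads from disk is the clearer sign that the working set no longer fits. The day you cross the threshold is the day everything gets weird.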
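A small sketch of a cache-pressure check from two serverStatus samples. The wiredTiger.cache field names are the ones serverStatus reports; the cachePressure name and the choice to pair fill ratio with a pages-read rate are mine, and any alert threshold is something you would have to calibrate against your own baseline.

```javascript
// before/after: serverStatus documents from the same mongod, intervalSecs apart.
function cachePressure(before, after, intervalSecs) {
  const c0 = before.wiredTiger.cache;
  const c1 = after.wiredTiger.cache;
  return {
    // How full the cache is relative to its configured maximum.
    fillRatio:
      Number(c1["bytes currently in the cache"]) / Number(c1["maximum bytes configured"]),
    // Rate at which pages are being pulled in from disk (cumulative counter delta).
    pagesReadPerSec:
      (Number(c1["pages read into cache"]) - Number(c0["pages read into cache"])) / intervalSecs,
  };
}
```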

Connection pools have a personality. The application’s MongoDB client pool config (size, wait timeout, retry behaviour) affects what you see during a degradation. A pool with high wait timeouts hides Mongo slowness as application latency. A pool that retries aggressively turns brief Mongo blips into amplified traffic. Read the pool config when debugging.
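For reference, these are the knobs I mean, written as an options object for the official Node.js driver. The option names are real driver options; the values are placeholders for illustration, not recommendations.

```javascript
// Illustrative values only. Pass as the second argument to `new MongoClient(uri, clientOptions)`
// from the official `mongodb` package.
const clientOptions = {
  maxPoolSize: 50,                // upper bound on connections per server (per mongos)
  minPoolSize: 5,                 // connections kept open even when idle
  waitQueueTimeoutMS: 2000,       // how long an operation waits for a free connection
  serverSelectionTimeoutMS: 5000, // how long the driver looks for a usable server
  retryReads: true,               // retry certain failed reads once
  retryWrites: true,              // retry certain failed writes once
};
```

A long waitQueueTimeoutMS is the setting that makes Mongo slowness show up as application latency instead of errors.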

When the diagnostic tools themselves are the problem

A failure mode I’ve seen on a production cluster under heavy load: the diagnostic tools (db.currentOp, db.serverStatus) themselves become expensive enough to perturb what they’re measuring. You ask Mongo what it’s doing; the question itself takes 30 seconds because Mongo is too busy to answer.

When this happens:

Use db.adminCommand({"currentOp": 1, "active": true, "secs_running": {"$gt": 5}}) to limit the output to active, long-running ops only.
Throttle diagnostic queries during incidents — don’t run them in a tight loop.
Lean on metrics that are already being scraped (Prometheus, the mongodb_exporter) rather than ad-hoc queries.

This is rare but instructive — every observation has a cost.
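One way to enforce the throttling is to wrap the diagnostic call so it cannot run more often than a minimum interval, returning the last result in between. A minimal sketch; the throttle helper and its injectable clock are my own, not part of mongosh.

```javascript
// Wraps fn so it runs at most once per minIntervalMs; calls made sooner
// get the previous result back instead of hitting the server again.
function throttle(fn, minIntervalMs, now = Date.now) {
  let last = -Infinity;
  let cached;
  return (...args) => {
    const t = now();
    if (t - last >= minIntervalMs) {
      last = t;
      cached = fn(...args);
    }
    return cached;
  };
}

// Assumed usage in mongosh: at most one currentOp every 30 seconds.
//   const slowOps = throttle(() => db.currentOp({ active: true, secs_running: { $gt: 5 } }), 30000);
```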

The boring debugging that wins most often

Glamorous debugging stories involve clever insights and dramatic fixes. The boring truth is that most production MongoDB issues turn out to be:

A query missing the shard key, scattering across the whole cluster.
A connection pool sized wrong.
A schema change that increased document size.
A new index that’s helping reads but slowing writes more than expected.
A retention job that didn’t run, letting a collection grow past working-set limits.

When debugging a new symptom, check these mundane causes before reaching for the exotic ones. It’s a humbling habit but it saves time.

What I’d do differently

[TODO: specific examples from your debugging history. Pick 2-3 real incidents where the diagnostic process either worked well or failed, and what you took from each. This is the most valuable part of this dive and currently the most vague.]

Related

MongoDB sharding: the shard key choices that prevent some of the failure modes this dive is about debugging
Scaling pillar: the broader context; the MongoS fan-out problem is one of the bottlenecks discussed there