This site is a work in progress — some sections are incomplete.
Deep Dive draft

MongoDB sharding: picking the right key, surviving the wrong one

A sharded MongoDB cluster lives and dies by its shard key. Picking one for a collection that has both write-heavy ingest and tenant-scoped queries is a balancing act. This is how I think about it, why MongoDB v8's online resharding changed the stakes, and the specific keys we landed on for our largest collections.

mongodbshardingdatav8

For about a decade, picking a MongoDB shard key was the single most career-defining decision you could make on a sharded cluster. Get it wrong, and you’d either live with a hot shard forever or commit to a multi-week dump-and-reload migration with downtime windows that nobody wanted to schedule.

MongoDB v8 changed that with online resharding. The cost is still non-trivial — you pay in extra storage and IOPS during the rebalance — but it’s no longer the kind of decision that defines the next two years of your operational life. That’s a big deal, and it’s the reason I’m willing to ship a shard key change in production now.

This dive covers the shard key reasoning, the v8 features that earned their keep, and the gotchas I hit along the way.

The fundamental tension

In a multi-tenant platform with sharded MongoDB, the same collection usually has to serve two access patterns at once:

  • High-volume writes spread across tenants — you want write distribution, otherwise one shard becomes a hot shard.
  • Tenant-scoped reads — you want locality, otherwise every read scatters to all shards and you pay a fan-out tax.

These pull in opposite directions. The default temptation is to shard on _id because that gives perfect write distribution. It also kills tenant-scoped reads — every query has to scatter, every query is slow. Shard on tenantId (hashed) and you get the opposite problem: a big tenant becomes a hot shard all by themselves.

The shape that resolves this for our high-volume operational collections is compound: { tenantId: 1, <high-cardinality-field>: 1 }.

  • tenantId first gives tenant query locality. A query with tenantId in the filter routes to a small number of chunks rather than scattering across the whole cluster.
  • The second field gives cardinality within a tenant. Without it, every document for a big tenant lands on one shard. With it, a single tenant’s data spreads across many chunks while still preserving the locality benefit.

The trap I want to flag: don’t make the second field something monotonically increasing like createdAt. The newest chunk is always hot. Append-only collections — audit logs, event streams — can sometimes get away with this if they’re strictly append-then-archive. For anything else, use a hashed value or a high-cardinality natural identifier.

What the four shapes actually look like

ShapeWrite distributionTenant query localityWhen to use
_id (default)ExcellentTerrible (always scatter)Never, if you have tenant queries
Hashed tenantIdGood across small/medium tenants; hot-shard risk on big tenantsSingle shard (routed)Small per-tenant volumes
{ tenantId: 1, createdAt: 1 }Hot newest chunkGoodAppend-then-archive only
{ tenantId: 1, <hash/random>: 1 }EvenRoutes to a small chunk setDefault for big mixed-pattern collections

The fourth shape is the one we use for our largest operational collections. Compound keys with a hashed or high-cardinality second field cooperate with prefix-only queries — a query on just { tenantId } can still route, even though it doesn’t know the second field. A query on the second field alone, however, scatters. That’s a design constraint your application has to respect.

What v8 actually changed

Online resharding is the headline feature, and it deserves it. Pre-v8, the cost of getting a shard key wrong was a downtime window measured in hours-to-days. Post-v8, the cost is a temporary IOPS and storage overhead during the copy phase. You can ship a key change as a normal operational task instead of a project plan.

We used it to migrate a couple of collections from hashed tenantId to compound { tenantId, X } after the first hot-shard incident on a big tenant. The migration ran for [TODO: hours] during off-peak; the application kept reading and writing through it. We monitored IOPS and storage carefully — both spike during the copy — but didn’t have to coordinate downtime.

Compound shard keys with hashed prefixes went GA in a way that made the second-field-is-hash pattern much cleaner to implement. Before this, you’d compose the hash yourself; now Mongo handles it.

The other v8 features — cluster-wide configuration propagation, time-series collection improvements — exist but didn’t dominate our work. We use cluster-wide settings sparingly; they’re convenient for propagating config across shards but expand blast radius. Treat them with the same care as any global config.

What I read about and skipped: queryable encryption (no regulatory driver in our workload), some of the new query optimizer hints (the optimizer was making the right choices for us by default). Interviewers respect “evaluated and chose not to” more than “we used everything new.”

How I actually run the shard key decision

For each new sharded collection, I write a one-pager covering:

  1. The shard key shape and why. Cardinality estimates on the second field — validate them on real data, not assumed distributions. Cardinality estimates lie surprisingly often.
  2. Expected chunk size distribution. If one tenant produces 50% of writes, you’ll have skew even with a “good” key. Decide whether that’s acceptable.
  3. The query patterns the key supports cheaply. A query on { tenantId, x: ?, y: ? } routes well. A query on { y: ? } alone scatters. List both columns explicitly.
  4. The pre-split strategy. Empty chunks at launch will move around expensively if you don’t seed them. For a known-skewed tenant distribution, pre-split to avoid the rebalance storm on day one.
  5. Migration notes if we change later. With v8 online resharding the cost is lower, but it’s not free. Treat the initial key as the contract you’re shipping.

Where this paid off

Hot-shard incidents on the two biggest collections went from a recurring monthly problem to zero after the key changes. Tenant-scoped query p95 dropped meaningfully (the routing reduction is the dominant factor — fewer shards touched means less network and less coordinator overhead).

Write distribution across shards is now within [TODO: ~5-10%] of even on the worst-skewed collection. Perfectly even is unachievable when tenant write rates vary; “within an order of magnitude” is the goal, and we beat it comfortably.

Migration confidence is the less-quantifiable win. The team will now propose a shard key change in a design review without it being controversial. Before v8 that conversation would have been a multi-meeting argument.

Gotchas

  • A shard key you can’t change is a contract with your future self. Online resharding helps, but it’s still operationally expensive — don’t design as if it’s free.
  • Pre-splitting matters. New collections with known tenant distributions should start pre-split. Otherwise the balancer thrashes for days.
  • Compound keys cooperate with prefix queries, not suffix queries. Document this for the application team or someone will write a query on the second field and wonder why it’s slow.
  • Cardinality estimates lie. Validate on real data. The estimate is a starting point, not a commitment.
  • MongoS connection fan-out is a separate problem. Even with a great shard key, if every app pod opens connections to every MongoS instance, the routing tier saturates. Pin connections per pod — covered in the scaling pillar.
  • MongoDB debugging strategy — how to figure out which of the many ways a sharded cluster can degrade is the one biting you today
  • Scaling pillar — the broader context: MongoDB was the third bottleneck on the way to 13k CCU