The 3-year-old restart bug nobody understood

[SKELETON] Structural draft based on the latent-bug pattern. Specific details — which pod class, which code path, the actual root cause — need filling in.

What happened

[TODO: which pod class started restarting consistently — likely related to scale-up. When the behaviour became impossible to ignore. The bug was simple once we found it; the path to finding it required understanding three years of codebase evolution.]

For about three years, this class of pod had been restarting roughly once every [TODO: time period] in production. Rare. Random. Always restarted clean. Never affected customers. The team treated it as background noise — one of those things that “just happens sometimes.”

Then we scaled the deployment up by [TODO: factor], and the restarts went from rare to consistent. Every pod restarted every [TODO: shorter interval]. Now it was customer-visible. Now it was a priority.

Why it was latent for three years

The bug was triggered by a specific combination of conditions that happened rarely at the original scale:

[TODO: the actual triggering combination — likely related to memory growth past a threshold, a specific cleanup pattern that only fired under certain conditions, an interaction between two subsystems that only collided under load]

Each individual condition happened often. The intersection of all of them within a single pod’s lifetime was rare. At the original deployment scale, with [TODO: original replica count], the math worked out to roughly one pod hitting all the conditions per [TODO: time period]. At the new scale, with [TODO: new replica count], the math had every pod hitting all the conditions within its lifetime.
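To make the arithmetic concrete, here is a minimal sketch of the "common conditions, rare intersection" math. It assumes the conditions are independent, which is a simplification, and every number in it is hypothetical: the real replica counts and per-condition probabilities are still TODO above. The function names are mine, not anything from our codebase.

```python
from math import prod


def joint_probability(condition_probs):
    """Chance that every condition occurs within one pod lifetime.

    Assumes the conditions are independent, which real subsystems
    rarely are; treat the result as a rough estimate.
    """
    if not all(0.0 <= p <= 1.0 for p in condition_probs):
        raise ValueError("each probability must be between 0 and 1")
    return prod(condition_probs)


def expected_affected_pods(replicas, condition_probs):
    """Expected number of pods that hit all conditions in one lifetime."""
    return replicas * joint_probability(condition_probs)


def prob_any_pod_affected(replicas, condition_probs):
    """Chance that at least one pod in the fleet hits all conditions."""
    p = joint_probability(condition_probs)
    return 1.0 - (1.0 - p) ** replicas


if __name__ == "__main__":
    # Hypothetical numbers for illustration only, not the real deployment's.
    old_probs = [0.3, 0.2, 0.1]     # each condition fairly common, intersection rare
    new_probs = [0.99, 0.99, 0.99]  # load pushes each condition toward certainty
    for label, replicas, probs in [("old", 10, old_probs), ("new", 100, new_probs)]:
        print(
            label,
            joint_probability(probs),
            expected_affected_pods(replicas, probs),
            prob_any_pod_affected(replicas, probs),
        )
```

One thing the sketch makes visible: adding replicas alone only multiplies the fleet-wide count. For every pod to hit the intersection, the per-pod probabilities themselves have to rise, which is presumably what the scale-up did to at least one of the conditions.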

This is the classic latent-bug shape: a real bug present from day one, hidden by being too rare to be statistically obvious until the deployment grew.

The detection signal

[TODO: what specifically clued us in]

The shift from “occasionally restarts” to “consistently restarts” was the signal. The base rate change was the data point. Without it we’d have continued treating the restarts as noise.

This is worth flagging: “a thing that used to happen rarely now happens consistently” is a meaningful signal in its own right. Not because the thing is suddenly worse — the underlying behaviour didn’t change — but because the conditions for it changed and that change is information.
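One way to turn "rarely" versus "consistently" into a number is to normalise restarts by pod-hours and compare two observation windows. The sketch below is an assumption about how that could look, not what we actually ran; the Window type, the function names, and the 2x threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Restart counts observed over one period (hypothetical shape)."""
    restarts: int
    pods: int
    hours: float

    def __post_init__(self):
        if self.restarts < 0 or self.pods <= 0 or self.hours <= 0:
            raise ValueError("need restarts >= 0, pods > 0, hours > 0")

    @property
    def rate_per_pod_hour(self) -> float:
        # Normalising by pod-hours keeps a bigger fleet from looking
        # worse merely because it has more pods to restart.
        return self.restarts / (self.pods * self.hours)


def rate_ratio(baseline: Window, current: Window) -> float:
    """How many times higher the per-pod restart rate is now."""
    if baseline.restarts == 0:
        return float("inf") if current.restarts > 0 else 1.0
    return current.rate_per_pod_hour / baseline.rate_per_pod_hour


def is_base_rate_shift(baseline: Window, current: Window, threshold: float = 2.0) -> bool:
    """True when the per-pod rate grew by at least `threshold` times."""
    return rate_ratio(baseline, current) >= threshold
```

The per-pod normalisation matters here. Ten times the pods with ten times the restarts is the same behaviour at a larger size; ten times the pods with a hundred times the restarts means the conditions changed.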

The investigation

[TODO: how you traced from symptom to root cause]

The investigation took [TODO: how long]. There were several false leads: [TODO: what we initially thought it was]. The breakthrough came from [TODO: what — a specific log, a git blame, a deliberate reproduction in staging].

Once we had the right code path identified, the bug was a few lines. The fix was a few more.

Root cause statement

[TODO: clean statement of the actual bug]

The fix

[TODO: what changed]

The interesting part of the fix wasn’t the fix itself. It was deciding whether the fix needed to apply elsewhere — whether the same shape of bug existed in other parts of the codebase that hadn’t yet hit the conditions to trigger it.

We did an audit of similar code paths. [TODO: what we found]. The audit took longer than the original fix; it was the right call.

What this taught me

Latent bugs are normal in long-lived systems. Not because we’re careless, but because conditions change. A system that handled 100 requests/sec for two years and then starts handling 10,000 is going to surface bugs that never manifested at the old volume. The bug was there; the conditions weren’t.

“Rare and harmless” is a story we tell to defer investigation. The pod restart was rare. It was probably never harmless; it was just under the threshold where it mattered. By the time investigation became unavoidable, the cost included three years of intermittent noise on dashboards, the cognitive load of “ignore that restart, it’s just the thing,” and the time spent investigating once it finally broke.

The policy I now hold: anything you’ve been ignoring is a debt. Track it. Decide when you’ll pay it. “Just monitor and ignore” is a valid choice only if it’s a deliberate one.

Base rate changes are signals. If something that used to happen at X now happens at 10X, that’s data. Even if the thing isn’t dangerous, the change in conditions is worth understanding. Many of the most informative bugs I’ve found started with “wait, why does that happen now when it didn’t before?”

The audit after the fix is often more important than the fix. If a bug existed in one code path, it likely exists in others written by the same person, in the same era, with the same mental model. Audit the shape, not just the instance.

What I’d do differently

[TODO]

Periodically review “long-known noise”: restarts, occasional errors, intermittent slowdowns that we’ve categorized as background. Decide whether they’re worth investigating now, while they’re cheap, or later, when scale forces it.
Track the rate of these classes of events, not just their presence. A doubling of background noise is news.
After fixing a latent bug, write down the shape and search for it elsewhere.
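
The first two items above could be backed by something as small as a register of known noise, each entry carrying a baseline rate and a date by which someone has to look at it again. This is a sketch of the idea under my own assumptions; the KnownNoise type, its fields, and needs_attention are hypothetical names, and the default factor of 2.0 just encodes "a doubling is news."

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class KnownNoise:
    """One entry in a hypothetical register of deliberately ignored events."""
    name: str
    baseline_per_week: float
    review_by: date


def needs_attention(item, observed_per_week, today, growth_factor=2.0):
    """Return the reasons, if any, this entry should be looked at again."""
    reasons = []
    if item.baseline_per_week == 0:
        # Anything at all is news when the baseline was zero.
        if observed_per_week > 0:
            reasons.append("events appeared where the baseline was zero")
    elif observed_per_week >= growth_factor * item.baseline_per_week:
        reasons.append(f"rate grew at least {growth_factor}x over baseline")
    if today >= item.review_by:
        reasons.append("scheduled review date reached")
    return reasons
```

An empty list means the "monitor and ignore" decision still stands; a non-empty one means the decision has expired and has to be made again on purpose.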

VM-to-K8s migration pillar — discusses the pattern of latent VM-era assumptions surfacing under K8s, a related class of bug
Scaling pillar — multiple bottlenecks discovered there had similar “fine at small scale, broken at large scale” patterns