[SKELETON] This is the structural draft for the retention incident. Replace
[TODO] markers with specific dates, queries, and recovery details before promoting.
What happened
A scheduled retention job at Kore, intended to clean up operational data older than [TODO: retention window], deleted records it was not supposed to touch. The blast radius was [TODO: specific scope — collection, time range, tenant impact]. Detection took [TODO: time], primarily because nobody had a “did we delete the right thing?” check between the job’s “completed successfully” report and the data actually being gone.
The first signal that something was wrong came from [TODO: customer report? internal anomaly?]. By the time we traced the symptom back to the retention job, the data had been gone for [TODO: days/weeks]. Recovery was partial: [TODO: what we could restore from backups, what was lost, what we had to tell affected tenants].
This is the kind of incident that makes you question every other “boring” maintenance job in your system.
Why it happened
The retention job had been working correctly for [TODO: years] before this. The query it ran was:
[TODO: paste a representative query / pattern. Likely something like
db.collection.deleteMany({ createdAt: { $lt: someDate } })
or a variant that joins against something else]
The query was not the problem when it was written. The problem was running that same query in 2026 against a data model that had evolved since [TODO: year].
Specifically, [TODO: explain the evolution]. A field that meant one thing originally was now also being populated by [TODO: a different code path]. A document type that didn’t exist when the retention job was written now used the same collection. The query’s filter, perfectly accurate in [TODO: original year], silently matched records it was never intended to match.
The query wasn’t wrong by any local reading. The query was wrong because the world around it changed.
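A minimal sketch of that drift, using plain in-memory JavaScript rather than the real collection. The `kind` field, the document shapes, the dates, and both helper functions are hypothetical, invented for illustration; the actual fields involved are the ones still marked TODO above.

```javascript
// Hypothetical documents sharing one collection. The `kind` field and both
// document shapes are assumptions made for this illustration only.
const cutoff = new Date("2026-01-01T00:00:00Z");

const docs = [
  { _id: 1, kind: "operational", createdAt: new Date("2025-06-01T00:00:00Z") },
  { _id: 2, kind: "operational", createdAt: new Date("2026-02-01T00:00:00Z") },
  // A document type added later. It reuses `createdAt` but was never meant
  // to be subject to retention.
  { _id: 3, kind: "tenantConfig", createdAt: new Date("2024-03-15T00:00:00Z") },
];

// In-memory equivalent of the filter { createdAt: { $lt: cutoff } }.
const matchesAgeFilter = (doc) => doc.createdAt < cutoff;

// The same filter with the intent written down explicitly.
const matchesIntendedFilter = (doc) =>
  doc.kind === "operational" && doc.createdAt < cutoff;

const ageOnlyIds = docs.filter(matchesAgeFilter).map((d) => d._id);
const intendedIds = docs.filter(matchesIntendedFilter).map((d) => d._id);

console.log(ageOnlyIds);  // [ 1, 3 ]
console.log(intendedIds); // [ 1 ]
```

The age-only filter never changed, yet it now matches document 3, which did not exist as a type when the filter was written.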
The detection gap
The retention job’s monitoring was: “did it run successfully? did it complete in expected time?” Both true, in every run, for years.
What it didn’t monitor: “did it delete the right things?” There was no check that the kinds of records being deleted matched expectations, no assertion that the deletion volume was within the normal range for the time window, and no anomaly alert on a sudden 10× increase in deleted document count.
This is the point that makes the incident worth writing down. Alerting on “the job succeeded” tells you nothing about whether the job did the right thing. Alerting on “the job’s effects look like the job’s intent” requires a separate model of what the intent was.
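One way to picture that separate model of intent is a small declared expectation that a run's observed effects are compared against. This is a sketch under assumptions: `intent`, its fields, the `kind` attribute, and `checkEffects` are hypothetical names, and the ceiling is an illustrative number rather than a real threshold.

```javascript
// Hypothetical declaration of what the retention job is *meant* to do.
const intent = {
  allowedKinds: ["operational"], // kinds the job may delete (assumed field)
  maxDeleted: 5000,              // illustrative ceiling, not a tuned value
};

// Compare what a run actually deleted against the declared intent.
// `deleted` is an array of { kind } summaries for the removed records.
function checkEffects(deleted, intent) {
  const byKind = new Map();
  for (const rec of deleted) {
    byKind.set(rec.kind, (byKind.get(rec.kind) ?? 0) + 1);
  }

  const violations = [];
  for (const [kind, count] of byKind) {
    if (!intent.allowedKinds.includes(kind)) {
      violations.push(`unexpected kind "${kind}" (${count} records)`);
    }
  }
  if (deleted.length > intent.maxDeleted) {
    violations.push(`deleted ${deleted.length}, above ceiling ${intent.maxDeleted}`);
  }

  return {
    ok: violations.length === 0,
    byKind: Object.fromEntries(byKind),
    violations,
  };
}

const report = checkEffects(
  [{ kind: "operational" }, { kind: "operational" }, { kind: "tenantConfig" }],
  intent
);
console.log(report.ok, report.violations);
```

A "job succeeded" alert would stay green for this run; the intent check reports one violation because a kind outside `allowedKinds` was deleted.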
Recovery
[TODO: detailed recovery steps]
- What we could restore from backups
- What customer notifications went out
- Process for affected tenants
Root cause statement
Single sentence: [TODO: cleanly state the actual bug]. The query’s filter relied on an implicit assumption about which records lived in the collection; the assumption silently became false as the schema evolved.
The fix
Three changes shipped, in order of how confident I am they prevent recurrence:
- A volume-based assertion in the retention job. Before deleting, the job now computes the expected deletion count from historical patterns and aborts if the number of records it is about to delete falls outside a sanity band. Not a perfect signal, but it would have caught this specific failure mode immediately.
- An explicit type filter in the query. The deletion now restricts to documents of the type it was originally meant to clean up, not all documents matching the age criterion. This makes the query’s intent legible from the query itself rather than from the surrounding context.
- A pre-deletion audit log. Before any deletion, a sample of the records about to be deleted is logged. A human can spot-check this sample if the volume anomaly fires. It doesn’t prevent the bug; it gives a fast triage path next time.
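The three changes above can be sketched together as one guarded routine. This is an illustration under stated assumptions, not the code that shipped: the `kind` field, the median-based band, the factor of 3, the sample size of 20, and the `history` and `log` parameters are all hypothetical. The `collection` argument is assumed to behave like a MongoDB Node.js driver `Collection` (`countDocuments`, `find().limit().toArray()`, `deleteMany`).

```javascript
// Median of recent deletion counts; a stand-in for "historical patterns".
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// True when `count` lies within [median / factor, median * factor].
// With an empty history the median is NaN, so the check fails closed.
function withinBand(count, history, factor = 3) {
  const expected = median(history);
  return count >= expected / factor && count <= expected * factor;
}

async function runRetention(collection, { cutoff, kind, history, log = console.log }) {
  // Explicit type filter: the intent is part of the query itself.
  const filter = { kind, createdAt: { $lt: cutoff } };

  // Pre-deletion audit log: record the count and a sample of candidates
  // before anything is removed, so it is available if the guard fires.
  const count = await collection.countDocuments(filter);
  const sample = await collection.find(filter).limit(20).toArray();
  log({ event: "retention.sample", count, ids: sample.map((d) => d._id) });

  // Volume-based assertion: abort if the count is outside the sanity band.
  if (!withinBand(count, history)) {
    throw new Error(`retention aborted: ${count} candidates is outside the expected band`);
  }

  const result = await collection.deleteMany(filter);
  return result.deletedCount;
}
```

The count and the delete are separate calls here, so records written in between are not covered by the guard; the sketch accepts that in exchange for aborting before any data is touched.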
We also added a [TODO: post-deletion validation? data-shape monitor?] to the broader observability stack — the goal is that this class of bug, not just this specific instance, has a detection path.
What this taught me
The dangerous bugs hide in the things that have always worked. A retention job that’s run for years without incident gets less review attention than a new feature. The cumulative drift in everything around it is invisible until it isn’t.
“Did the job run?” is not the same as “did the job do the right thing?” I’d known this in the abstract; the data loss made me know it concretely. Every scheduled job we own now gets a “did it produce the intended effect?” check defined at design time, even when the job seems trivial.
Recovery confidence is an organisational asset. The reason this incident was even partially recoverable was the backup strategy and the restore-time work we’d put in years ago. That investment looked unnecessary every quarter it didn’t pay off. It paid off entirely on the day it was needed.
Telling customers is its own skill. The technical fix was the easy part. The communication — what we lost, what we restored, what to do about it — was harder and more important. I now spend deliberate time on communication templates for incident classes, not just on the technical postmortems.
What I’d do differently
[TODO: honest reflection on what would have caught this earlier]
- A periodic review of long-running scheduled jobs and what they assume.
- An assertion-style approach to data-modifying operations: pre-conditions, post-conditions, explicit failure on violations.
- A “schema evolution review” gate: when changing the meaning of a field or adding to a collection, an explicit check on every job that reads from or deletes from it.
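The assertion-style approach in the list above could look roughly like the wrapper below. It is a sketch on an in-memory array; `runWithContract`, the check objects, the `kind` and `age` fields, and the 365-day window are hypothetical and chosen only to keep the example self-contained.

```javascript
// Run a data-modifying step only if its pre-conditions hold, then verify
// its post-conditions. Any violation fails loudly with the check's label.
function runWithContract({ name, pre = [], run, post = [] }, state) {
  for (const check of pre) {
    if (!check.test(state)) {
      throw new Error(`${name}: pre-condition failed: ${check.label}`);
    }
  }
  const result = run(state);
  for (const check of post) {
    if (!check.test(state, result)) {
      throw new Error(`${name}: post-condition failed: ${check.label}`);
    }
  }
  return result;
}

// Illustrative in-memory "collection"; `age` is in days.
const state = {
  docs: [
    { _id: 1, kind: "operational", age: 400 },
    { _id: 2, kind: "tenantConfig", age: 900 },
  ],
};

const removed = runWithContract(
  {
    name: "retention",
    pre: [{ label: "collection is not empty", test: (s) => s.docs.length > 0 }],
    run: (s) => {
      const gone = s.docs.filter((d) => d.kind === "operational" && d.age > 365);
      s.docs = s.docs.filter((d) => !gone.includes(d));
      return gone;
    },
    post: [
      {
        label: "only operational records removed",
        test: (_s, gone) => gone.every((d) => d.kind === "operational"),
      },
    ],
  },
  state
);

console.log(removed.length, state.docs.length); // 1 1
```

A post-condition only detects a bad deletion after the fact. To prevent one, the same check has to run as a pre-condition against the candidate set, or the operation has to run somewhere it can be rolled back.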
Related reading
- Analytics pipeline pillar — discusses the “what would silently break this and how would I know?” question; this incident is what made me ask it for every new system.
- MongoDB debugging — the diagnostic loop that, in hindsight, would have flagged the deletion volume anomaly faster.