This site is a work in progress — some sections are incomplete.
Incident planned

When a Glue job broke the analytics pipeline mid-migration

A scheduled AWS Glue job started failing during what was supposed to be a routine data migration. The job's failure cascaded — Kafka backed up, downstream dashboards went stale, and the migration window we'd carefully planned around stretched from hours to days. The interesting part wasn't the fix; it was discovering that 'Glue job failure' covered three distinct root causes that needed to be untangled before any one of them could be fixed.

Severity SEV2
MTTD [TODO]
MTTR [TODO]
sreincidentgluehudidatapostmortem

[SKELETON] Structure draws on the three failure modes we’ve documented; specific dates, dashboards, and the customer impact need filling in.

What happened

During a planned data migration — [TODO: what was being migrated and why], the AWS Glue job that handles the daily Mongo → Hudi load started failing repeatedly. Each retry attempt either failed in the same way or surfaced a new symptom. Kafka topics behind the job started backing up. Dashboards that depended on the freshness of the Hudi tables went stale. The migration window we’d scheduled — [TODO: hours? a weekend?] — extended to [TODO: how long].

By the time the dust settled, we’d diagnosed not one failure but three, layered on top of each other in a way that masked the actual root cause for most of the incident.

The first symptom

The Glue job started failing with ExecutorLostFailure and OutOfMemoryError exceptions in the Spark driver logs. Classic executor OOM. Initial reaction: bump the worker type, retry.

Worker type bump worked — for one run. The next run failed in the same way. The run after that failed differently: a schema mismatch error this time, not an OOM. The third class of failure showed up later in the week: a “no new offsets to process” report where the job claimed success but no data was actually flowing.

The pattern that should have clued us in earlier: the symptoms were shifting, which usually means there’s more than one root cause and you’re just seeing whichever one is currently dominant.

The three failure modes, untangled

Failure mode 1: skew-induced OOM. The migration was writing data with a skewed key distribution — one tenant’s data dwarfed the others. The Spark join that should have been parallel was actually serial because one executor was trying to hold the giant partition. Worker type bumps “fixed” this by giving the unlucky executor more memory; they didn’t actually fix the skew.

The real fix: salt the skewed key on the join side, broadcast the small side where possible, and tune Hudi’s bulk_insert.shuffle.parallelism to a higher value so partitions stayed bounded.

Failure mode 2: schema drift introduced by the migration. The migration was changing the schema of some records. We had no schema registry at the Kafka layer; producers wrote what they wanted, consumers hoped for the best. When the new schema flowed through, Glue’s inferred schema diverged from what Hudi expected, and write attempts hit IncompatibleSchemaException.

The real fix: add the new schema explicitly to the Hudi table’s schema evolution config, allow additive evolution (new optional fields) and reject incompatible changes. This was a few lines of config; the work was identifying that schema drift was the problem in the first place.

Failure mode 3: a silent sync stop. After we fixed the OOM and the schema, the job ran “successfully” but no new data appeared in Hudi. Investigation showed that a Kafka rebalance during one of the failed runs had left one partition unassigned. The Glue job’s bookmark stayed at the old offset; subsequent runs saw “no new data to process” and reported success.

The real fix: reset the consumer group’s offset to the actual latest assigned offset, then add pipeline_lag_seconds as a freshness SLI so this class of “success but stopped” failure had a detection path.

Why this took so long to untangle

Each failure looked like the only failure when you saw it. The diagnostic muscle of “step back and ask whether there might be more than one root cause” took longer to engage than it should have.

The other thing that slowed us down: the Glue logs were workable but not great. We ended up enabling Spark UI through the Glue console for some runs and reading executor metrics manually, which is fine but not what you want during an incident.

The customer impact

[TODO: what dashboards went stale, which customers noticed, what we told them, how we made up for the freshness gap during the incident]

Recovery

[TODO: step-by-step recovery — which fixes deployed in which order, how we caught the pipeline up, how we backfilled the missed window]

Root cause statement

Single sentence: A schema-evolving migration interacted with a skewed key distribution and a missing freshness SLI, producing three layered failure modes that masked each other during diagnosis. Each individual failure had been encountered before; the combination was novel.

The fix

Beyond the immediate fixes for each failure mode:

  • Schema registry at the Kafka layer. Producers register schema changes explicitly. Compatible-only evolution by default; incompatible changes require a documented migration. This eliminates failure mode 2 going forward.
  • Skew detection in the Glue jobs. Before processing, sample the input and compute the largest-partition ratio. Above a threshold, automatically apply salting or fail the run with a clear error rather than producing an OOM mid-flight.
  • pipeline_lag_seconds SLI on every pipeline component. Already discussed in the analytics pipeline pillar. This incident is what made it not-optional.
  • Runbook for “Glue job failure” that explicitly enumerates the three failure modes and how to triage which one(s) are in play. The runbook structure is in Glue + Hudi failure modes (now retired); the content lives in the team’s incident response wiki.

What this taught me

Multiple root causes is the default in complex systems, not the exception. When a system has been working for months and three new failures show up at once, the prior should not be “three independent things broke.” It should be “one underlying event triggered three failure modes.” Look for the trigger.

Job logs need a triage tree, not a paragraph. “Read the Glue logs” was the team’s playbook for Glue failures; that’s not a playbook, it’s an aspiration. A triage tree — “check for OOM first; if not OOM, check for schema mismatch; if not, check for partition assignment” — is a playbook.

Migrations are when latent bugs surface. The schema, the skew, the bookmark drift — all were latent. The migration exercised the system in ways the steady state didn’t. Plan for surprise during migrations, not just for the migration itself.

Spark debugging is its own skill set. I lost time during this incident because I’m not a deep Spark debugger. The team has invested in skilling this up since.

What I’d do differently

[TODO: honest reflection]

  • Schema contracts before any migration, not as a follow-up.
  • Pre-migration validation: skew profile of the input, before the job runs.
  • Multi-burn-rate alerts on pipeline freshness, deployed before any pipeline goes to production.