Building and evolving an analytics pipeline

[DRAFT NOTE] Pipeline shape and failure modes are accurate; specific volumes, costs, and freshness numbers tagged [TODO] need verification before promoting.

Mongo is great for operational reads and writes. It is, broadly, not great as an analytics store. Once your product team starts wanting “all conversations from this tenant in the last 90 days grouped by intent with timing percentiles,” you have two choices: degrade the operational tier with analytics queries, or build an off-ramp.

We built the off-ramp.

The design choices that mattered

The pipeline shape is fairly standard:

MongoDB  ─(CDC)─▶  Kafka topics  ─▶  AWS Glue jobs  ─▶  Hudi tables on S3
                                                              │
                                                       Athena / Trino  ─▶  dashboards, exports

The two non-obvious choices were Hudi and Glue.

Why Hudi and not Iceberg or plain Parquet. Hudi was the right call at the time we made it — its upsert and incremental-query story was meaningfully ahead of Iceberg’s for our access pattern. Our writes weren’t append-only: conversations don’t always close in order, late-arriving events have to update existing rows, occasional backfills need to merge with the live stream. Parquet doesn’t give you updates without rewriting partitions; Hudi gives you upsert primitives that handle this cleanly. Iceberg has caught up significantly since then. If I were picking again today for a new table, I’d cost-model both before committing — Iceberg’s catalog story is simpler operationally, and our upsert intensity isn’t extreme.

Why Glue and not Spark on EMR/EKS. Headcount math. We didn’t have a dedicated data platform team. Operating our own Spark fleet would have meant owning the cluster, the scheduler, the autoscaling, the spot-instance handling, and the upgrade cycle. Glue lets you not own any of that. You pay AWS to do it. The price is real — Glue isn’t cheap — but the price of a half-engineer maintaining our own Spark is higher at our scale. The break-even is somewhere north of where we were operating.

Kafka was the boring choice. Decouples producers from consumers, gives us a replay buffer when Glue jobs need to be re-run, lets us add new consumers without touching the source. We use change streams from MongoDB into Kafka; partition by tenant for parallelism.

The first 6 months: it worked, mostly

For a while, the pipeline just ran. Dashboards refreshed. BI users got their data. We patted ourselves on the back.

Then production found three failure modes, and one of them was almost philosophical.

Failure mode 1: Glue OOMs at peak

Glue jobs would die during peak business hours. The error was a classic Spark out-of-memory on executors — a join had skewed data, one executor was trying to hold a partition that was orders of magnitude larger than the others, and JVM couldn’t handle it.

The diagnostic took longer than it should have. Glue’s logs are workable but not great; we ended up enabling Spark UI through the Glue console and reading the executor metrics manually. Classic Spark debugging: find the stage with the long-tail task, find the partition with the most rows, find the skewed key.

The fix was a mix of things — broadcast joins for the small-side dataset where the join key was the culprit, repartitioning by a more uniform key before the offending stage, and bumping worker type for the jobs that legitimately needed more memory. None of it was clever. All of it was the kind of work you only know to do after you’ve stared at Spark’s DAG for an hour.

Failure mode 2: schema drift

A producer changed a field’s type — string to nested object, because someone needed structured data there. The change shipped to production without coordinating with the downstream pipeline. Glue job hit the new shape, didn’t know what to do, errored out, and Kafka backed up.

We had no schema contract at the Kafka layer. The producer wrote what it wanted; the consumer hoped for the best. Worked fine until it didn’t.

The fix: enforce schema at the Kafka layer with a registry, add explicit schema evolution rules in Hudi (we allowed additive changes — new optional fields — and rejected incompatible changes), and put the schema in the same PR review as code that produces or consumes the data. The cultural shift was harder than the technical one. Engineers were used to changing a Mongo document shape freely; they had to internalize that some shapes were now contracts.

Failure mode 3: small-files pileup

Hudi writes a new file per micro-batch by default. Over time, you accumulate millions of tiny files. Query performance degrades non-linearly — every query has to open and read metadata from every file. Athena queries that took seconds started taking minutes.

This one was a slow burn — easy to miss because the degradation was gradual. By the time someone complained about slow dashboards, we’d been writing small files for weeks.

The fix was Hudi compaction tuning: clean policies to drop old file versions, cluster operations to combine small files into larger ones, and an explicit file-size target. We added a compaction_lag_seconds metric so we could see when compaction was falling behind, which it did during high-write periods. Compaction itself became a workload we had to monitor, which is the recursive nature of operating data infrastructure — every fix introduces a new thing to watch.

The philosophical failure: the sync that silently stopped

This is the one I think about most.

One day a customer messaged support that their dashboard was 6 hours stale. We checked. The pipeline was “running.” Jobs were scheduled. Each one was successful — they’d start, find no new data to process, and report success. So no alerts. So nobody knew.

What had happened: a Kafka consumer rebalance had left one partition unassigned. The Glue job’s bookmark stayed where it was. Subsequent runs saw no new offsets to process. Nothing was failing. The pipeline was just silently not moving forward.

The detection fix was straightforward — a pipeline_lag_seconds gauge computed as the difference between the latest Mongo write timestamp and the latest Hudi commit timestamp, with a multi-burn-rate alert against an SLO of “lag under 15 minutes.” Once we had it, this class of bug was visible.

The deeper lesson was about how I’d designed alerting in general. I’d been alerting on the presence of errors. This bug had no errors. The right question — and the one I now ask for every new pipeline component — is: “what would silently break this, and how would I know?” Sometimes the answer is a freshness SLI. Sometimes it’s a heartbeat. Sometimes it’s an external probe that exercises the path end-to-end. But the question has to be asked, because the absence of errors is not the presence of correctness.

We added this question to design review for any new pipeline. It’s caught problems before they shipped.

The query layer

Once data is on S3 as Hudi tables, you need a way to query it. Primary path is Athena. Easy to set up, integrates with Glue Data Catalog, charges per byte scanned which keeps people honest about column projection.

For heavier ad-hoc work — multi-table joins, big aggregations — we ran Trino on EMR. Trino is faster for these queries because it’s stateful (long-running cluster, working set in memory) where Athena is stateless (cold start every query). The trade-off is operational: Trino is a cluster you have to operate; Athena is a service AWS operates.

Trino had its own failure modes — the Trino memory spike incident is one of them — but for the workloads that needed it, the latency improvement was significant. As Athena’s Hudi support matured, we migrated more workloads back to it and shrank the Trino fleet.

What this looks like in steady state

Freshness: p95 lag under [TODO: 5-15 minutes] for the live stream after the pipeline_lag_seconds SLO was in place. Backfills are explicitly throttled to not starve the live job.
Job success rate: rose from [TODO: ~95%] to [TODO: ~99.5%] after the OOM, schema, and small-files fixes.
Storage cost: meaningfully lower than the “leave it in Mongo” alternative, even with Hudi metadata overhead and Glue compute cost.
Operational toil: weekly failed-job rotation went from [TODO: hours] to [TODO: under an hour].

What I’d do differently

Build the freshness SLI before the first production load, not after the first silent-stop incident. This is the lesson I keep relearning across systems.

Treat schema as a contract from day one. We added the schema registry late. Every shape change before that point was a small bet that nothing downstream would break.

Cost-model Iceberg vs Hudi now for any new tables. The 2023-era reasoning for picking Hudi was correct; the 2026-era reasoning may favor Iceberg. The way to know is to run the comparison, not to be loyal to a past decision.

Plan compaction monitoring from day one. Compaction is a workload, not a maintenance task. It needs its own metrics and SLOs. We added them after small-files had already accumulated.

Things people ask me about this

Why Hudi, not Iceberg? Right call when we made it (upsert-heavy access pattern, Iceberg’s upsert was less mature). I’d rerun the comparison today before picking for a new table — Iceberg’s catalog story is now cleaner, and for less-upsert-heavy workloads it’d be the simpler operational fit.

Why Glue rather than running Spark yourself? Headcount. A half-engineer operating our own Spark fleet is more expensive than Glue’s price tag at our scale. The flip-point is real but we’re nowhere near it.

What does the silent-sync failure tell you about the design? That I was alerting on error presence rather than success absence. Every pipeline component now gets the “what would silently break this” question at design review. The freshness metric came out of that; multiple other metrics on other systems came out of the same question.

How do you handle backfills that conflict with the live stream? Hudi’s pre-combine field disambiguates late-arriving writes from the live stream. Backfill emits events with backdated timestamps; Hudi’s upsert resolves to the highest pre-combine value. Operationally we throttle backfills so they don’t starve the live job.

What’s the bottleneck if you doubled the data volume? At current sizing, probably Glue concurrency limits — we’d start queueing jobs. The fix would be either bigger workers or splitting the job into shards that can run in parallel. Hudi compaction would be the next thing to watch; doubling write volume doubles compaction work and we’d need to make sure cluster operations keep up.

Datalake setup — log shipping with Promtail/Alloy, Prometheus, and ClickHouse for the operational metrics side
MongoDB sharding — what the source side of the pipeline looks like
Incidents: Glue job broke the pipeline · Trino memory spikes in JP production