The JP Trino cascade: a coordinator and worker that took each other down

[DRAFT NOTE] Timeline timestamps, structural facts, and root-cause analysis are confirmed against server.log-20260518.093256, launcher.log, /var/log/messages, and the application slow-query log. Customer-impact specifics tagged [TODO] need to be pulled from the support and CSM logs before promoting to final.

A page came in around 18:30 UTC. The JP Trino cluster’s ports were refusing TCP. Both JVMs (coordinator and worker, running side by side on a single m5.2xlarge) were dead. DevOps restarted at 18:42. The worker came up, ran for about sixteen minutes, and crashed again at 18:58.

That second crash is what made this incident useful. A restart fixes a transient. A restart that doesn’t hold tells you the cause is still there.

What happened

The JP cluster is a single m5.2xlarge (8 vCPU, 32 GB RAM) running the Trino coordinator and one worker as separate JVM processes. Both processes were configured at -Xmx14G — 28 GB of heap on a 32 GB box. There is no second worker; the box is the cluster.

At 17:53 UTC, admin-dashboard traffic on the koreserver fleet spiked. Slow queries to Trino jumped roughly 7.6×, from a baseline of ~30/hr to ~228/hr. In the 40-minute crash window, 288 queries hit the application’s 60-second HTTP timeout: 82.8% of all slow queries logged. 88% of them originated from BotsServiceAdmin (containment-metrics, conversation, and overview dashboards).

Every one of those dashboard requests fanned out into 8–12 Trino queries on the application side, and each Trino query was split into 4 UNION ALL sub-queries over adjacent time windows. One dashboard render was 32–48 Trino queries. The cluster, sized for steady-state load and never load-tested against an admin-dashboard storm, started accumulating queued and in-flight queries faster than it could finish them.

Both JVMs exited on OutOfMemoryError via -XX:+ExitOnOutOfMemoryError; neither was killed by the kernel OOM-killer (verified by the absence of oom-killer entries in /var/log/messages). The visible symptom on monitoring was the Datadog tcp_check failing on both Trino ports at 18:36:55 UTC.
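
The kernel-versus-JVM distinction above can be checked mechanically. This is a minimal sketch, assuming the system log has already been read into a list of lines for the crash window; the function names are hypothetical, and the two patterns are the usual kernel OOM-killer message shapes, whose exact wording varies by kernel version.

```python
import re

# Typical kernel OOM-killer messages. Older kernels log "Kill process",
# newer ones "Killed process"; adjust if your kernel words it differently.
OOM_KILLER_PATTERN = re.compile(
    r"invoked oom-killer|Out of memory: Kill(?:ed)? process"
)


def kernel_oom_lines(log_lines):
    """Return the lines that show kernel OOM-killer activity."""
    return [line for line in log_lines if OOM_KILLER_PATTERN.search(line)]


def classify_exit(log_lines):
    """Say whether the system log contains evidence of a kernel OOM kill.

    An empty result does not prove a JVM self-exit on its own; it only rules
    out the kernel path, which is the check described in the text.
    """
    if kernel_oom_lines(log_lines):
        return "kernel-oom-killer"
    return "no-kernel-oom-evidence"
```

In practice the input would be the lines of /var/log/messages filtered to the incident window; "no-kernel-oom-evidence" plus an OutOfMemoryError in the Trino logs is what points at -XX:+ExitOnOutOfMemoryError.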

Timeline

Time (UTC)	Event
17:53	Admin-dashboard traffic begins climbing. Slow-query rate starts trending up.
17:55:48	Worker’s `/v1/memory` endpoint stalls for 20.89 s responding to the coordinator’s poll. First visible signal of GC pressure.
18:23	Coordinator’s self-poll to `localhost:7001` is stalled at 898 s (~15 min) with no response. JVMs are now in a sustained GC death spiral.
18:23 (range)	Internal JWT tokens are arriving at the coordinator already expired by 798 s — the request queue is delayed by 13+ minutes.
~18:25	One query enters `FINISHING` and stays there for 1,321,448 ms (~22 minutes), still holding allocated memory.
18:36:55	Datadog tcp_check on both Trino ports starts failing. Both JVMs have died via `OutOfMemoryError` self-exit.
18:42:21	DevOps restarts the Trino services. Cluster comes back.
18:58:06	Worker JVM crashes again. ~16 minutes after the restart.
~19:[TODO]	Manual mitigation: applied the heap cap reduction (`-Xmx14g → -Xmx12g`) and uncommented `query.max-total-memory-per-node`.
~19:[TODO]	Cluster stable. Dashboard traffic still elevated but the cluster is now degrading gracefully (killing individual queries, not the whole worker).
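
The intervals quoted in the timeline are easy to get slightly wrong by eye, so here is a small sketch that recomputes them from the timestamps and raw durations in the table. Nothing here is new data; the helper names are hypothetical.

```python
from datetime import datetime, timedelta


def at(hms):
    """Parse an HH:MM:SS timestamp from the timeline (same UTC day assumed)."""
    return datetime.strptime(hms, "%H:%M:%S")


def as_minutes(delta):
    """Express a timedelta in minutes as a float."""
    return delta.total_seconds() / 60


# Restart at 18:42:21, second worker crash at 18:58:06.
restart_held = at("18:58:06") - at("18:42:21")

# Raw durations quoted in the timeline rows.
self_poll_stall = timedelta(seconds=898)
jwt_expired_by = timedelta(seconds=798)
finishing_stuck = timedelta(milliseconds=1_321_448)

for label, delta in [
    ("restart held", restart_held),
    ("self-poll stall", self_poll_stall),
    ("JWT expiry lag", jwt_expired_by),
    ("stuck in FINISHING", finishing_stuck),
]:
    print(f"{label}: {as_minutes(delta):.2f} min")
```

The restart-to-crash interval comes out at 15 minutes 45 seconds, which is why the table rounds it to ~16 minutes.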

The investigation

The first hour was the wrong hour. We were looking at it as a worker-only OOM — heap full, restart, wait. The second crash forced us to step back.

False lead. I initially assumed the application’s slow-query rate climb was the cause and that we were looking at a normal capacity overrun. It was the trigger, not the cause. A cluster sized to handle expected steady-state should degrade under burst by queueing, slowing down, killing the worst offenders — not by both JVMs dying and refusing TCP. The fact that a clean restart didn’t hold meant something about the cluster’s structural configuration was making any version of this workload lethal.

The breakthrough. Reading the coordinator log carefully, two numbers jumped out and didn’t fit a simple “worker ran out of memory” story:

A self-poll from coordinator to localhost:7001 stalled for 898 seconds without a response. That’s not a worker that OOM’d cleanly — that’s a worker (and a coordinator) in GC pause, both stuck.
A query stuck in FINISHING for 22 minutes. Trino had decided the query should be cleaned up; the cleanup itself couldn’t make progress. That memory was held the entire time.

Combined with the host spec (32 GB RAM) and the JVM configs (-Xmx14G × 2 = 28 GB committed heap), the picture cleared. Both JVMs were competing for the last few GB of physical RAM with each other, with native memory needed for direct buffers and JIT, and with the OS. There was no slack for GC pressure to dissipate. Once GC fell behind on either JVM, the other followed, because they were on the same box and the OS would page or stall under the pressure.

The dashboard query storm was the spark. The 28-GB-on-32-GB heap commitment was the bomb. The missing query.max-total-memory-per-node was why no individual query was killed before the whole worker went down.

Root cause

One sentence: A spike in admin-dashboard traffic at 17:53 UTC drove the co-located coordinator and worker — both with 14 GB heaps on a 32 GB host, no per-query memory cap configured, and an application that didn’t cancel queries when its own HTTP client timed out — into a sustained GC death spiral, during which orphaned query state from timed-out client requests accumulated faster than Trino could clean it up, ultimately exhausting heap on both JVMs.

Breaking that down:

Structural (the bomb):

S1 — Co-location overcommit. 28 GB of Java heap on a 32 GB host left no headroom for native memory, page cache, or other processes. The ReservedCodeCacheSize of 512M × 2 alone reserves up to another 1 GB (512M is a jvm.config setting here; HotSpot’s built-in default is 240 MB). Direct buffers for network I/O are another several hundred MB per JVM. The cluster was at the edge of physical memory even at rest.
S2 — Missing per-query memory ceiling. query.max-total-memory-per-node was commented out in both config.properties files. The only effective ceiling on any single query was the JVM heap itself. One runaway query could consume everything.
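
The overcommit in S1 is easier to see as a budget. The sketch below adds up heap, code cache, and direct buffers per JVM and subtracts the total from host RAM. The heap and code-cache figures are the ones in this write-up; the 512 MB direct-buffer figure is an assumption standing in for "several hundred MB", not a measurement, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

MB_PER_GB = 1024


@dataclass(frozen=True)
class JvmBudget:
    heap_gb: float            # -Xmx
    code_cache_mb: float      # -XX:ReservedCodeCacheSize
    direct_buffers_mb: float  # ASSUMPTION: illustrative, not measured

    def total_gb(self):
        native_gb = (self.code_cache_mb + self.direct_buffers_mb) / MB_PER_GB
        return self.heap_gb + native_gb


def host_headroom_gb(host_ram_gb, jvms):
    """RAM left for the OS, page cache, thread stacks, metaspace, and the rest."""
    return host_ram_gb - sum(jvm.total_gb() for jvm in jvms)


# Coordinator and worker as configured at the time of the incident.
at_incident = [JvmBudget(14, 512, 512)] * 2
# The same pair with the heap and code-cache values applied as mitigation.
after_mitigation = [JvmBudget(12, 256, 512)] * 2

print(host_headroom_gb(32, at_incident))       # GB left at the incident
print(host_headroom_gb(32, after_mitigation))  # GB left after mitigation
```

Under these assumptions the incident configuration leaves about 2 GB for everything that is not a Trino JVM, and the mitigated configuration about 6.5 GB. The exact numbers move with the direct-buffer assumption; the shape of the result does not.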

Trigger (the spark):

T1 — Dashboard query storm. Slow-query rate jumped from ~30/hr to ~228/hr (roughly 7.6× baseline) starting at 17:53 UTC. 82.8% of these hit the application’s 60-second HTTP timeout. Most originated from BotsServiceAdmin.
T2 — Application-side query fan-out. Each dashboard request became 8–12 Trino queries, and each of those was split on the application side into 4 UNION ALL sub-queries. One dashboard render ≈ 32–48 Trino queries.
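
The fan-out in T2 is plain multiplication, sketched here so the 32–48 figure can be re-derived if either factor changes. The function name is hypothetical; the factors are the ones stated above.

```python
def scans_per_render(queries_per_request, union_branches=4):
    """Sub-queries Trino has to execute for one dashboard render."""
    return queries_per_request * union_branches


low = scans_per_render(8)    # lightest dashboard request
high = scans_per_render(12)  # heaviest dashboard request
print(f"{low}-{high} sub-queries per render")
```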

Why memory wasn’t released after queries failed

The smoking gun is the 22-minute FINISHING query. Even when Trino knows a query should be cleaned up, releasing memory is asynchronous and depends on a cleanup thread that, under GC pressure, can’t run. The mechanism falls apart under three conditions, all of which were happening:

Cleanup is slow under GC pressure. Task state objects, exchange buffers between stages, and completed-query metadata are released by background threads that get starved when the JVM is GC-thrashing.
The connector may not propagate cancellation correctly. The MongoDB connector honours maxTimeMS at the cursor level, but if MongoDB returns a timeout error after delivering partial results, those partial results are in Trino’s heap until the operator processes the exception. Older connector versions had bugs where interrupted state wasn’t checked between batch reads. Connector version on the JP host not yet verified.
Client-side timeout ≠ server-side cancellation. This is the big one. When the application’s HTTP client times out at 60 seconds, it drops the TCP connection. Trino is not told that the client gave up; at best it notices later, once the query has gone unpolled for its own client timeout, and until then the query keeps its memory. For Trino to cancel the query promptly, the client must explicitly call DELETE /v1/query/{queryId}. The koreserver Trino client did not. The ABANDONED_QUERY and ABANDONED_TASK statuses littering the coordinator log are exactly what shows up when the client disconnected without calling DELETE.
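
The third condition is the one a client can fix, so here is a minimal sketch of cancel-on-timeout. It is not the koreserver client. The DELETE /v1/query/{queryId} path is the one named above; the function names, the injected poll and cancel callables, and the use of an X-Trino-User header are assumptions for illustration, and an authenticated cluster would need more than that header.

```python
import time
import urllib.request


def build_cancel_request(base_url, query_id, user):
    """Build the explicit DELETE that tells Trino to drop the query."""
    url = f"{base_url.rstrip('/')}/v1/query/{query_id}"
    return urllib.request.Request(
        url, method="DELETE", headers={"X-Trino-User": user}
    )


def wait_or_cancel(poll, cancel, timeout_s,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll until a result arrives; on timeout or any error, cancel first.

    poll   -- callable returning the result, or None while still running
    cancel -- callable that sends the DELETE, for example
              lambda: urllib.request.urlopen(build_cancel_request(...))
    """
    deadline = clock() + timeout_s
    try:
        while clock() < deadline:
            result = poll()
            if result is not None:
                return result
            sleep(0.1)
        raise TimeoutError(f"no result within {timeout_s}s")
    except BaseException:
        # Tell the server we gave up instead of silently dropping the socket.
        cancel()
        raise
```

The design point is that cancellation sits in the error path of the wait itself, so a timeout, a user disconnect surfaced as an exception, and a crash in the polling code all take the same route to the DELETE.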

The combined effect was a functional memory leak even though no single piece of code was leaking in the traditional sense. Memory in (new queries from dashboard renders) > memory out (queries failing and being slowly cleaned up). Heap fills. GC thrashes. OOM.
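
The "memory in > memory out" argument can be made concrete with Little's law: the average number of queries holding resources equals the arrival rate times how long each one holds them. The sketch uses the ~228/hr slow-query rate and two residence times from this write-up: 60 seconds (state released at the client timeout) and the 1,321,448 ms the stuck FINISHING query held on. The second figure is a single worst observed case, so treat that result as an illustrative upper end, not a measurement of the fleet.

```python
def queries_in_flight(arrivals_per_hour, residence_seconds):
    """Little's law: L = arrival rate x time in system."""
    return arrivals_per_hour / 3600 * residence_seconds


# State released when the client times out at 60 seconds.
released_at_timeout = queries_in_flight(228, 60)

# State held as long as the stuck FINISHING query (1,321,448 ms).
held_like_finishing = queries_in_flight(228, 1_321_448 / 1000)

print(f"{released_at_timeout:.1f} vs {held_like_finishing:.1f} queries holding memory")
```

Same arrival rate, roughly twenty times more queries holding memory at once. Nothing leaks in the traditional sense; residence time alone does the damage.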

What we did to mitigate

Immediate (during the incident):

Restarted the Trino services at 18:42:21. Restored TCP connectivity. Held for ~16 minutes before the worker re-OOM’d.
After the second crash, shrank both heaps from -Xmx14g to -Xmx12g (24 GB committed on 32 GB), and uncommented query.max-total-memory-per-node=2GB and query.max-memory-per-node=1GB in both config files. Restarted again. This time the cluster held.
Asked the dashboard owners to throttle their refresh loops while we stabilised. Effective, but explicitly a stopgap.

Within 48 hours:

JVM flags applied to both processes: -XX:InitiatingHeapOccupancyPercent=35, -XX:G1ReservePercent=15, -XX:G1PeriodicGCInterval=60000, -XX:ReservedCodeCacheSize=256M. GC logging enabled.
query.max-run-time=5m, query.max-execution-time=3m, query.client.timeout=3m set explicitly.
spill-enabled=true, max-spill-per-node=10GB, after verifying disk type and free space on the host.
query.low-memory-killer.policy=total-reservation-on-blocked-nodes so that under future squeeze the cluster kills the heaviest query rather than crashing the worker.

Within two weeks:

Application change: koreserver Trino client now sends DELETE /v1/query/{queryId} on HTTP timeout, user disconnect, or duplicate request. Closes the orphan-query loop that turned every dashboard timeout into a memory leak.
Application change: the 4-way UNION ALL fan-out pattern replaced with a single scan and a CASE WHEN for split labels. ~4× less scan cost per dashboard render.
MongoDB indexes added on botId, timestampValue, and _id to support the Trino mongo connector’s predicate pushdown.
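
The UNION ALL rewrite is easier to picture with the generated SQL in front of you. This sketch builds a single-scan query that labels each row with its time window through CASE WHEN, instead of scanning once per window. The table name, the count(*) aggregate, the time_window alias, and the numeric window edges are all assumptions for illustration; only the timestampValue column name comes from this write-up, and the real dashboard queries select other things.

```python
def single_scan_sql(table, ts_col, edges):
    """Build one scan that buckets rows into adjacent time windows.

    edges -- sorted numeric boundaries; N + 1 edges define N windows,
             each covering [edges[i], edges[i + 1]).
    Identifiers are interpolated directly, so this is for trusted,
    code-defined inputs only.
    """
    if len(edges) < 2:
        raise ValueError("need at least two edges to define a window")
    whens = "\n".join(
        f"    WHEN {ts_col} >= {lo} AND {ts_col} < {hi} THEN 'w{i}'"
        for i, (lo, hi) in enumerate(zip(edges, edges[1:]))
    )
    return (
        f"SELECT\n  CASE\n{whens}\n  END AS time_window,\n"
        f"  count(*) AS n\n"
        f"FROM {table}\n"
        f"WHERE {ts_col} >= {edges[0]} AND {ts_col} < {edges[-1]}\n"
        f"GROUP BY 1"
    )


# Four adjacent windows, one table scan (hypothetical table name).
print(single_scan_sql("mongodb.bots.messages", "timestampValue",
                      [0, 10, 20, 30, 40]))
```

The outer WHERE keeps the overall range predicate in one place, which is what gives the connector a single range to push down rather than four.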

The structural fix and the JVM tuning belong to this incident. The application-side cancellation, query rewrites, and broader observability work belong to the pillar.

What got shipped vs what didn’t

Shipped and stuck:

All JVM and Trino config changes on the JP host. Same changes staged for India production at the time of writing.
Application-side query cancellation on timeout (koreserver Trino client).
The UNION-ALL rewrite for the worst dashboard.
MongoDB indexes.

Shipped, watching it stick:

Default LIMIT on dashboard queries that previously had ORDER BY without one. Enforced in the client wrapper, but easy to bypass if developers go around it. Needs a lint rule we haven’t written yet.

Logged but not yet shipped:

Moving the worker off the coordinator host (the cleanest structural fix). Cost case not yet built.
A load test that reproduces the dashboard storm. Open work.
A lint rule or pre-commit check that prevents new dashboard queries from shipping without a LIMIT.
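
For the lint rule that does not exist yet, a first pass could be as crude as the sketch below. It is regex-based, not a SQL parser, so it will misfire on ORDER BY inside window functions and can be fooled by a LIMIT in a subquery or a string literal. Treat it as a starting point under those assumptions; the function name is hypothetical.

```python
import re

ORDER_BY = re.compile(r"\border\s+by\b", re.IGNORECASE)
# A numeric LIMIT or a FETCH FIRST/NEXT clause counts as a bound.
BOUNDED = re.compile(
    r"\blimit\s+\d+\b|\bfetch\s+(?:first|next)\b", re.IGNORECASE
)


def missing_limit(sql):
    """True when a statement sorts its output but never bounds its size."""
    return bool(ORDER_BY.search(sql)) and not BOUNDED.search(sql)
```

Wired into a pre-commit hook over the dashboard query definitions, a True result would fail the commit and point at the offending statement.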

Quietly didn’t stick:

The “ask dashboard owners to throttle refresh” stopgap drifted back within a sprint, as those things do. The right fix is application-side throttling and caching, not asking humans to be careful.

What this taught me

A client-side timeout is not a cancellation. I had assumed, without ever checking, that when our HTTP client timed out, the upstream query would be cancelled. It wasn’t. Trino has no telepathy; short of waiting out its own client timeout, the only way it knows a client gave up is an explicit DELETE. Every distributed system I work with from now on gets a “what does the upstream do when the client disconnects” question at design time. The answer is usually “nothing”, which is fine if you know it.

Restarts that don’t hold are diagnostic, not embarrassing. The second crash at 18:58 felt like a failed mitigation in the moment. It was actually the most useful signal of the incident — it ruled out “transient overload” and forced us to look at structural causes. If a restart fixes the symptom, it might be a transient. If a restart doesn’t fix the symptom, the cause is still there.

Commented-out safety configs are landmines. query.max-total-memory-per-node had been commented out long enough that nobody on the current team remembered why. The template the JP cluster was deployed from carried that commented-out line forward, and nobody had questioned it. The deployment template now has it uncommented with a default value, and a comment explaining that anyone who wants to remove the cap must say so explicitly.

Two near-100%-sized JVMs on one host are one JVM with extra GC. I had been thinking of the coordinator and worker as two independent processes that happened to share a host. They aren’t. They share physical RAM, page cache, OS scheduling, and the kernel’s view of available memory. When either fell behind on GC, the other was already starved. The mental model has to be “one host, two memory consumers that compete” — not “one cluster, two processes that cooperate”.

What I’d do differently

Question the runbook on the second recurrence, not the tenth. The JP cluster had needed restarts every few weeks for months. Each one was treated as a one-off. The cumulative time spent on those restarts and the eventual incident itself far exceeded what a half-day spike on cause analysis would have cost in February.

Ship structural caps before traffic. query.max-total-memory-per-node should be set in the deployment template, not as a post-incident response. Same for query.low-memory-killer.policy. These are defaults; if you have a reason not to set them, write the reason down.
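
One way to keep those defaults from silently disappearing again is to audit config.properties in the deployment pipeline. The sketch below reports required keys that are missing or commented out. The two keys are the ones named above; the parsing is a simplified take on the .properties format (no line continuations, no escaped separators), and the function names are hypothetical.

```python
REQUIRED_KEYS = (
    "query.max-total-memory-per-node",
    "query.low-memory-killer.policy",
)


def parse_properties(text):
    """Return (active_keys, commented_keys) from .properties-style text."""
    active, commented = set(), set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        is_comment = line[0] in "#!"
        body = line.lstrip("#! ").strip()
        if "=" not in body:
            continue  # prose comment or malformed line
        key = body.split("=", 1)[0].strip()
        (commented if is_comment else active).add(key)
    return active, commented


def audit(text, required=REQUIRED_KEYS):
    """Map each required key that is not active to why it is not."""
    active, commented = parse_properties(text)
    report = {}
    for key in required:
        if key in active:
            continue
        report[key] = "commented out" if key in commented else "missing"
    return report
```

An empty report means every required cap is set. Anything else should block the deploy until someone either sets the value or writes down why not.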

Build the load test that reproduces the dashboard storm. Without it, “we fixed the issue” is reasoning, not evidence. The dashboard query pattern is a well-defined workload — a small number of queries, a known fan-out, an observable rate. There is no good reason we don’t have a synthetic version of it in load-test.

Make application-side cancellation a default in the client library. The koreserver Trino client now does the right thing. If we’d built the client library with cancellation-on-timeout as the default behaviour from day one, this incident’s worst characteristic — the orphan-query memory leak — wouldn’t have existed.

Trino performance and stability pillar — the broader work this incident kicked off.
Trino memory and JVM tuning — the mechanics each of these mitigations is reasoning about.
Analytics pipeline pillar — where the analytics Trino lives in the broader pipeline.