Stabilising Trino: from restarts and guesswork to a tuned cluster

[DRAFT NOTE] Confirmed facts are tagged inline. Numbers tagged [TODO] need verification before promoting to final — primarily before/after latency, query-success rate, and any specific cost deltas.

For about two months I’d been doing the same thing in different regions. Heap full, OOM, restart. Heap up two gigs, hope it holds. It would, for a while. Then another region would page. Then we’d tweak min/max worker counts. Then someone would ask “should we just throw more RAM at it?” and we’d do that, and it would hold for another week.

None of this was engineering. It was housekeeping. We didn’t understand the cluster — we were treating it like a black box that occasionally needed a kick.

Then on May 19 the JP production Trino went down in a 40-minute cascade that a restart didn’t fix. Both JVMs OOM’d, DevOps restarted, the worker re-OOM’d 15 minutes later. That was the tipping point. The next four weeks were the actual work — the kind I should have started months earlier.

This is the story of stopping the housekeeping and doing the engineering.

What the cluster is and what runs on it

Trino is the query layer over our analytics data — Hudi tables on S3, written by the Mongo → Kafka → Glue → Hudi pipeline. It powers admin dashboards (containment metrics, conversation summaries, FAQ overviews) and ad-hoc internal queries. In some regions there is also a mongo catalog so a small set of analytical queries can hit the operational MongoDB directly for fresher data.

The JP cluster, where the incident happened, was the smallest deployment we had: one m5.2xlarge box (8 vCPU, 32 GB RAM) running both the Trino coordinator and a worker as separate JVM processes. India production had multiple workers on dedicated boxes; JP had the single-host topology to keep cost in line with regional demand. The single-host design was the cheapest thing that worked — until the day it stopped working.

I owned this workstream end-to-end: the post-incident investigation, all configuration changes, the application-side fixes (collaborative with the koreserver team), and the rollout across regions.

The starting point: incidents nobody could pattern-match

Across India, JP, and the smaller satellite regions, Trino was generating roughly one operational incident a fortnight. They were never identical:

Worker OOM under “normal” load.
Coordinator pegged at 100% CPU, then unresponsive.
A single query hanging in FINISHING for tens of minutes after the client had clearly given up.
Slow query rate climbing for no obvious reason, dragging dashboard latency from sub-second to 60-second timeouts.

We had no shared mental model of why any of these were happening. The on-call response was always the same — restart, watch for a recurrence, log a ticket, move on. Each restart felt like a fix because the symptom went away. None of the underlying conditions had changed.

Looking back, that’s the failure mode I want to call out before the technical details: a recurring incident with a stale “ack and restart” runbook is a sign you don’t understand the system, not a sign that the system is generally healthy with bad luck. The first time I dug into a Trino crash properly was after PLAT-68052. I should have done it after the second or third recurrence, not the tenth.

The tipping point: PLAT-68052

I’ll keep this brief here — the full postmortem is in the JP cluster cascade incident.

At 17:53 UTC on May 19, admin-dashboard traffic spiked. Slow queries on the cluster jumped roughly 5× — from ~30/hr to ~228/hr. The application’s HTTP client timed out on each at 60 seconds and silently disconnected. Trino, having no way to know the client had left, kept running each query until something else killed it.

The “something else” took 5–22 minutes per query, during which the memory those queries had reserved stayed reserved. With both coordinator and worker JVMs configured at -Xmx14G — 28 GB of heap on a 32 GB box — there was no slack for GC pressure to dissipate. The cluster went into a GC death spiral. Coordinator’s self-poll to the worker stalled at one point for 898 seconds. One abandoned query spent 22 minutes in the FINISHING phase still holding memory. Both JVMs OutOfMemoryError-exited around 18:36 UTC. DevOps restarted at 18:42. The worker crashed again at 18:58.

What the postmortem made obvious — and what every prior incident had hinted at without anyone listening — was that the failure had two structural causes and two triggers:

Structural: co-located coordinator and worker overcommitted physical memory; per-query memory caps were commented out.
Trigger: dashboard query storm; application-side fan-out turned each dashboard render into 32–48 Trino queries.

A restart fixes neither structural cause. That’s why the second crash happened.

What I had to actually learn

Most of the next two weeks was reading and experimenting. I’d been operating Trino without understanding it. The detailed mechanics — memory categories, the JVM heap topology, what each G1GC flag actually does, why a global ORDER BY is different from a LIMITed one, why a UNION ALL over four adjacent time windows is worse than one scan — are in the Trino memory and JVM tuning deep-dive. The short version for this pillar:

A Trino worker tracks User memory (joins, aggregations, sorts) and System memory (exchange buffers, network I/O, scan readers) separately. Total = User + System. Caps are per-node, per-query, and cluster-wide; if any are unset, the heap is the only ceiling.
The JVM heap is inside the physical RAM, with off-heap direct buffers and JIT code cache sitting alongside. The rule of thumb is to cap -Xmx at 75–80% of physical RAM, not 87.5%.
The default G1GC settings are tuned for general-purpose Java workloads, not for Trino’s bursty, large-object allocation pattern. A handful of flags meaningfully change behaviour under load.
A query the client has stopped waiting for is not a query Trino has stopped running. The application must DELETE /v1/query/{queryId} on timeout or disconnect, or the query keeps spending memory until something else kills it.

Each of these became a change.

What we shipped

The fix wasn’t one thing; it was six categories of change, rolled in roughly this order so each one’s effect could be measured before the next.

1. Application changes

The single highest-leverage fix.

Cancel on timeout. The koreserver Trino client now sends DELETE /v1/query/{queryId} on HTTP timeout, user disconnect, or duplicate request. Before this, every timed-out dashboard render left an orphan query running on Trino until the connector or the 5-minute client-timeout caught it.
Eliminate the 4-split UNION ALL pattern. The dashboards split one time range into four sub-queries and UNION ALL’d them. This forced the optimiser to plan four independent execution trees, multiplied scan cost by 4×, and inflated coordinator stage-count. Replaced with a single WHERE timestampValue BETWEEN t0 AND t4 scan and a CASE WHEN over the same boundaries when split labels were genuinely needed (see the deep-dive for the rewrite).
Bounded result sets. Every dashboard query that previously had ORDER BY without LIMIT now has a default LIMIT. This triggers Trino’s Top-N optimisation — the workers keep a bounded priority queue locally and the coordinator merges 100s of rows, not millions.

2. JVM tuning

Both processes on the JP host went from -Xmx14g to -Xmx12g, taking total heap commitment from 28 GB to 24 GB on the 32 GB box. The remaining 8 GB went to OS, JIT code cache, direct network buffers, and GC overhead — what the 87.5%-commit configuration had been starving.

G1GC flags added to both jvm.config files:

-XX:InitiatingHeapOccupancyPercent=35
-XX:G1ReservePercent=15
-XX:G1PeriodicGCInterval=60000
-XX:ReservedCodeCacheSize=256M
-Xlog:gc*:file=var/log/gc.log:time,uptime,level,tags:filecount=10,filesize=100M

In plain language: start old-generation collection earlier (default 45 → 35), keep more headroom for object evacuation (default 10 → 15), proactively return unused committed pages to the OS every 60 seconds, and shrink the JIT code cache (default 512M → 256M) since we have two competing JVMs on one box. GC logging finally enabled so we could see what was happening rather than guessing.

The mechanics of each flag are in the deep-dive. The motivation for all of them was the same: the JP host had no slack, so the JVM had to be configured to use what it had efficiently.

3. Trino configuration: memory caps and timeouts

The structural problem in the incident was that query.max-total-memory-per-node was commented out. With no per-query cap, one runaway query could consume all the worker’s heap.

The post-incident config.properties settings on the JP cluster:

query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
query.max-runtime=5m
query.max-execution-time=3m
query.client-timeout=3m
query.low-memory-killer.policy=total-reservation-on-blocked-nodes
spill-enabled=true
max-spill-per-node=10GB

Soft memory limit at 85% of heap. The low-memory-killer policy means that if the cluster gets squeezed, Trino kills the heaviest query rather than allowing the worker to crash. Spill-to-disk gives us a fallback for queries that legitimately need more memory than the per-node cap — they go slower, but they don’t take the worker down.

The MongoDB connector’s server-side timeout is set to 60s so that any MongoDB cursor a Trino query is reading from gives up before Trino’s own runtime cap fires.

4. MongoDB indexes for the Trino query patterns

The Trino mongo connector pushes down predicates where it can. Without supporting indexes, even pushed-down filters end up as collection scans. We added indexes on the three fields every Trino query was filtering or joining on:

botId
timestampValue
_id

Index choice was conservative — these are the only fields that appear in WHERE clauses across the dashboard workload, and the cardinality of (botId, timestampValue) is high enough to be a strong starting filter. Verified by EXPLAIN on representative queries before and after.

5. Query refactoring: one scan beats four

The 4-way UNION pattern was the worst offender. The same logical aggregation is now a single scan over the full time range, with a CASE WHEN block tagging rows into split buckets only when the application actually needs split-level grouping. Single scan, one stage, same logical output, ~4× less scan cost. Pattern and the before/after SQL are in the deep-dive.

6. Observability

The bare minimum that should have existed before any of this:

GC pause duration and frequency, per JVM.
Worker heap utilisation and spill volume.
Live and prepared query count.
Slow-query log with BotsServiceAdmin-style tagging so we can attribute load by source service.
Query-kill rate (distinguishing user cancel, timeout, and low-memory-killer).
-XX:HeapDumpOnOutOfMemoryError with a configured path so the next OOM leaves evidence we can analyse rather than just a dead JVM.

If we’d had even half of these dashboards in place, PLAT-68052 would have been visible 30 minutes earlier as a GC pause that wouldn’t stop growing.

Outcome

Metric	Before	After
Recurring Trino restart cadence (JP)	~every 2 weeks	0 since rollout
Coordinator+worker JVM heap commitment (JP)	28 GB on 32 GB host (87.5%)	24 GB on 32 GB host (75%)
Dashboard query timeout rate during peak	[TODO: ~30/hr baseline, ~228/hr in incident window]	[TODO: post-rollout baseline]
Per-query memory cap	none (commented out)	2 GB total, 1 GB user, per node
Orphan queries from client timeout	unbounded — held memory until 5-min client-timeout fired	cancelled within seconds of HTTP timeout
UNION-ALL fan-out per dashboard render	4 sub-queries × 8–12 Trino queries = 32–48	1 scan × 8–12 Trino queries = 8–12

Headline operational result: the JP cluster has not needed an unplanned restart since the rollout completed. The same changes are being staged for the India production cluster, which has more workers and a slightly different shape but the same structural risks.

Open questions I haven’t closed

A few honest gaps:

Exchange buffer retention. If the application isn’t draining results fast enough, the coordinator holds output buffers indefinitely. query.max-output-stage-buffer-size is set to the default; I don’t yet know what the right value is for our client behaviour, or how to detect stale buffer retention proactively.
What actually consumes coordinator heap under steady state. Stage metadata, exchange buffers, query history, result sets — I have a model from reading, but haven’t profiled to confirm which is the dominant consumer in our workload.
Worker memory split. How much of the 12 GB worker heap is task execution vs spill metadata vs shuffle, on a typical query mix. Needed before we can confidently size the next deployment.
Completed query cleanup TTL. Trino retains completed/abandoned query info briefly. I don’t know whether tuning this TTL downward would meaningfully reduce coordinator pressure during a storm, or just shorten the diagnostic window after the fact.
Multi-node migration economics. The cleanest fix for JP is to move the worker to its own host. Cost-wise this is a real change. I haven’t built the case for it yet — current cluster stability bought enough time to defer the decision.

These are tracked; they’re in the queue, not lost.

What I’d do differently

Read the system before tuning it. I spent two months adjusting heap sizes without understanding what the heap was holding. The post-incident reading I did in two weeks would have prevented most of the prior incidents.

Set per-query memory caps as a deployment default. query.max-total-memory-per-node being commented out is the cluster equivalent of running prod without a circuit breaker. There is no good reason to ship a Trino cluster without it; the reason it was unset here was that the original config was copied from a template and nobody had thought about it since.

Treat recurring incidents as a signal, not a fact of life. “We restart the cluster every two weeks” was being normalised across multiple teams. The right reaction the first time it happened twice was a half-day spike to find the cause — not a ticket to add to the backlog.

Cancel queries on client disconnect from day one. The orphan-query problem was the single biggest contributor to the incident. It’s also one of the easiest fixes — a few lines of cancellation logic in the client wrapper. I would have caught this on a first-principles read of the Trino API, but I never did the first-principles read.

Profile, don’t speculate. Half the open questions above exist because I tuned and then moved on. Profiling each layer (JVM, Trino, query plan) before declaring the work done would have given us either confidence or a list of next problems. We have the list-by-default; we should have the confidence-by-default.

Things people ask me about this

Q: Why was a coordinator and worker on the same box in the first place?

A: Cost. JP traffic was a fraction of India’s; running a dedicated worker on its own m5.2xlarge would have roughly doubled the cluster spend for that region. The single-host topology is a defensible choice if you size memory honestly and accept that you’ll never burst. We did neither — both JVMs were sized as if the box weren’t shared.

Q: If you had to pick one fix as the most important, which?

A: The application-side query cancellation. Without it, every other fix is fragile — caps and timeouts cap the damage, but the workload keeps generating orphan queries. With cancellation, even a misconfigured cluster degrades gracefully because no query outlives the user’s interest in it.

Q: Why not increase the heap further instead of capping queries?

A: We tried that. It bought a few weeks each time. The next OOM happened at the new higher threshold because the workload was unbounded — there is no heap size that protects you from a runaway query pattern with no per-query cap. The cap is the answer; heap sizing is the slack you give the cap to work in.

Q: How is the Trino-side UNION ALL different from a single scan with a CASE WHEN?

A: Cost-Based Optimiser. Each UNION ALL branch is a separate execution tree the CBO has to compile and the coordinator has to track. With four branches over adjacent time ranges, you get 4× the plan complexity, 4× the scan cost (each branch hits the same source), and 4× the stage metadata. A single scan with a CASE WHEN is one plan, one scan, one set of stages. Same logical result; an order of magnitude less coordinator work.

Q: Did the connector version matter?

A: Possibly. The MongoDB connector’s behaviour around partial results and cancellation has had bugs in older versions where interrupted state wasn’t checked between batch reads. I haven’t confirmed the exact version on the JP host yet — that’s in the open-questions list.

Q: Was anything not fixed?

A: The exchange buffer question. The completed-query TTL question. The single-host topology itself (still in place on JP). And we haven’t yet built a load test that reproduces the dashboard storm; until we have, the confidence that “this won’t happen again” is partly based on understanding rather than evidence.

Trino memory and JVM tuning — the mechanics each of these fixes is reasoning about.
The JP cluster cascade incident (PLAT-68052) — the 40-minute timeline and root-cause analysis.
Analytics pipeline pillar — where the Hudi tables that Trino queries come from.
Datalake setup — broader context on the analytics stack Trino sits inside.