[DRAFT NOTE] Configurations and ratios reflect what was applied on the JP single-host cluster (m5.2xlarge, 32 GB). Values for larger multi-worker deployments are derived from the same reasoning but tagged where the absolute numbers would change.
The cluster was a 32 GB box running two JVMs at 14 GB each. Twenty-eight gigs of heap on a thirty-two gig box. The first time someone walked me through that math, I assumed I was missing something. Surely Trino’s well-trodden enough that you don’t get to ship a config like that without a warning.
You do. And it had been running fine for months. Then a five-minute traffic spike took both JVMs down in a cascade, and the cluster discovered why the 75% rule exists.
This is the dive I wish I’d read before that incident. It covers the three layers of Trino memory accounting, the JVM flags that actually move the needle under production load, and the query patterns that look fine on paper and destroy coordinator memory in practice.
For the business context and what we shipped, read the pillar post. This dive covers the mechanics.
The three layers, top to bottom
Trino doesn’t have one memory limit. It has three nested ones, and they’re configured at three different layers.
+------------------------------------------------------------+
|                     PHYSICAL HOST RAM                      |
|                                                            |
|  +------------------------+  +-------------------------+   |
|  | JVM HEAP (-Xmx)        |  | OFF-HEAP + OS           |   |
|  |                        |  |  - direct buffers       |   |
|  |  +------------------+  |  |  - JIT code cache       |   |
|  |  | Total Memory     |  |  |  - page cache           |   |
|  |  | (Trino cap)      |  |  |  - OS / other procs     |   |
|  |  |                  |  |  |                         |   |
|  |  |  +------------+  |  |  |  ~20-25% of host        |   |
|  |  |  | User Memory|  |  |  |                         |   |
|  |  |  | (Trino cap)|  |  |  +-------------------------+   |
|  |  |  +------------+  |  |                                |
|  |  +------------------+  |                                |
|  |  + heap-headroom       |                                |
|  +------------------------+                                |
+------------------------------------------------------------+
Each box is enforced by a different mechanism, fires under different conditions, and protects against a different failure mode. The incident that drives this post happened because the outer box (physical RAM) and the middle box (JVM heap) were sized as if the inner box (Trino’s per-query cap) was enforced. The inner box was commented out. So nothing was enforced anywhere.
Layer 1: physical host RAM
The host has a fixed amount. Everything else — JVMs, OS, page cache, kernel — fights for it. The rule that matters: never commit more than 75–80% of physical RAM to JVM heaps. The remaining 20–25% is for direct network buffers (Trino does a lot of these), the JIT code cache (default 512 M per JVM, which adds up fast on a co-located host), file system caching, and the kernel itself. Cross 90% and the OS starts paging or invoking the OOM-killer, and at that point your “well-behaved JVM” is no longer in charge.
On a 32 GB host running both coordinator and worker, this means total heap across the two JVMs should be 24 GB max — -Xmx12g each, not -Xmx14g each.
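The arithmetic behind that rule is small enough to keep as a sanity check next to the config. A minimal sketch, assuming equal heaps per JVM and the 75% ceiling from the text; the function names are ours, not part of Trino:

```python
def max_heap_per_jvm_gb(host_ram_gb: float, jvm_count: int,
                        heap_fraction: float = 0.75) -> float:
    """Largest equal -Xmx (in GB) that keeps total heap within heap_fraction of host RAM."""
    if jvm_count < 1:
        raise ValueError("jvm_count must be at least 1")
    return host_ram_gb * heap_fraction / jvm_count


def committed_fraction(host_ram_gb: float, heaps_gb: list[float]) -> float:
    """Share of physical RAM committed to JVM heaps."""
    return sum(heaps_gb) / host_ram_gb


print(max_heap_per_jvm_gb(32, 2))        # 12.0
print(committed_fraction(32, [14, 14]))  # 0.875, the config that fell over
print(committed_fraction(32, [12, 12]))  # 0.75
```

Anything that prints above 0.80 for a host is worth a second look before it ships.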
Layer 2: the JVM heap (-Xmx)
This is the Java heap the JVM allocates. Garbage collection happens here. OutOfMemoryError fires when this is full and GC can’t free anything.
Two important things go inside the heap that aren’t query state:
- memory.heap-headroom-per-node (Trino config). Explicit reservation for internal Trino operations, network buffer allocations, and GC overhead. Default is fine on large multi-worker clusters; on a tight single-host cluster, set it explicitly (e.g. 2GB on a 12 GB worker heap) so Trino’s own per-node caps don’t try to use space the JVM itself needs.
- G1GC reserve regions. G1 keeps a percentage of the heap free as “to-space” for evacuating live objects during collection. If application threads allocate faster than G1 can copy, you get an evacuation failure: a Stop-The-World fallback to a Full GC. This is where G1ReservePercent lives (defaults to 10, worth raising to 15 on bursty workloads).
Layer 3: Trino’s User / System / Total accounting
Trino tracks query memory in three categories, all inside the JVM heap:
- User memory. RAM consumed directly by SQL operations: hash tables for joins, state for GROUP BY aggregations, sort buffers for ORDER BY. Capped by query.max-memory-per-node.
- System memory. RAM consumed by internal execution machinery: exchange buffers between stages, network I/O queues, scan reader buffers. Not directly capped on its own.
- Total memory. Exactly User + System. Capped by query.max-total-memory-per-node.
Both caps are per query, per node. They’re the safety net that says “no single query can take more than X of this worker’s heap.” If you leave them unset — as the JP config did — the JVM heap itself is the only ceiling, and any one query can run the whole worker out of memory.
Rule of thumb:
- query.max-memory-per-node ≈ 8–10% of worker heap (e.g. 1 GB on 12 GB).
- query.max-total-memory-per-node ≈ 16–20% of worker heap (e.g. 2 GB on 12 GB).
- Soft memory limit at 85% of heap.
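To turn those percentages into concrete numbers for a given heap, something like the following works. It is a sketch under assumptions: the 8% and 16% defaults are the low end of the ranges above, the helper names are hypothetical, and the property names are the ones used in this post, so check them against the Trino version you actually run.

```python
def per_node_caps_mb(worker_heap_gb: float,
                     user_fraction: float = 0.08,
                     total_fraction: float = 0.16) -> dict[str, int]:
    """Derive per-query, per-node caps (in MB) from the worker heap size."""
    heap_mb = worker_heap_gb * 1024
    return {
        "query.max-memory-per-node": round(heap_mb * user_fraction),
        "query.max-total-memory-per-node": round(heap_mb * total_fraction),
    }


def as_properties(caps: dict[str, int]) -> str:
    """Render the caps as config.properties lines."""
    return "\n".join(f"{name}={value}MB" for name, value in caps.items())


caps = per_node_caps_mb(12)
print(as_properties(caps))
```

On a 12 GB worker heap this lands just under the 1 GB and 2 GB examples above, which is the intended neighbourhood.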
Conservative? Yes. These caps make queries fail fast when they exceed them. The alternative is the cluster dying. A failed query is a recoverable event; a dead worker is an incident.
Coordinator vs worker: different jobs, different memory profiles
People often size coordinator and worker the same. That’s wrong, and it’s especially wrong on a co-located host where the two are competing for the same RAM.
Workers do the heavy lifting: data scans, partitioned shuffles, hash-aggregation state, partial sorts, join build sides. Memory is dominated by user memory in operators with large in-flight state. This is the JVM that benefits from a big heap and aggressive G1 tuning.
Coordinator does planning and orchestration: SQL parsing, cost-based optimisation, split scheduling, stage metadata, query history, and — critically — the final stage of any global ORDER BY without a LIMIT. That last one is the trap. If a query ends in a global sort with no LIMIT, every worker streams its data into the coordinator’s heap so the coordinator can do the final sort in one process. This is how a 50-row query plan results in the coordinator going OOM.
The split for a multi-host cluster (e.g. 1 box for coord + worker, 1 box dedicated to two workers, both 32 GB):
| Host | Process | -Xmx | Notes |
|---|---|---|---|
| Shared host | Coordinator | ~7 GB | Smaller — planning + Stage 0 only |
| Shared host | Worker | ~20 GB | Take the larger share |
| Dedicated host | Worker 2 | ~14 GB | Two workers, ~equal split |
| Dedicated host | Worker 3 | ~14 GB | |
On a shared host, also set node-scheduler.include-coordinator=false so the coordinator never gets assigned actual processing splits, and cap task.max-worker-threads on the local worker to ~50% of host cores so the coordinator isn’t starved of CPU.
JVM flags that actually matter for Trino
Most of the JVM flag advice on the internet is generic. The handful below are the ones whose absence we could measure under load:
-XX:+UseG1GC
-XX:InitiatingHeapOccupancyPercent=35
-XX:G1ReservePercent=15
-XX:G1HeapRegionSize=32M
-XX:G1PeriodicGCInterval=60000
-XX:+ExplicitGCInvokesConcurrent
-XX:ReservedCodeCacheSize=256M
-Xlog:gc*:file=var/log/gc.log:time,uptime,level,tags:filecount=10,filesize=100M
What each one does, and why:
- InitiatingHeapOccupancyPercent=35 (default 45). Tells G1 to begin old-generation marking earlier: when the heap is 35% full, not 45%. For Trino’s bursty allocation pattern, the extra runway keeps marking from starting too late when a big query arrives.
- G1ReservePercent=15 (default 10). More heap kept free for to-space evacuation. Prevents the evacuation failure → Full GC fallback that manifests in logs as to-space exhausted. This is the flag you wish you’d set after you see your first Full GC pause in a Trino worker.
- G1HeapRegionSize=32M (default size depends on heap). Larger G1 regions handle wide rows cleanly. Without this, objects above the “humongous” threshold (half the region size) bypass standard regions, land directly in old-gen, and fragment the heap. 32M is large enough to keep most analytical row shapes out of humongous allocation territory.
- G1PeriodicGCInterval=60000 (default: disabled). If 60 seconds pass with no normal GC activity, run one proactively. This deflates the JVM’s memory footprint after a big query finishes. Without it, the JVM holds the committed pages until the next GC, which may not happen for a while if traffic is idle. The flag was introduced by JEP 346.
- ExplicitGCInvokesConcurrent. If something calls System.gc() programmatically, run it concurrently instead of Stop-The-World. Cheap insurance.
- ReservedCodeCacheSize=256M (512M in Trino’s recommended jvm.config; the JVM’s own default with tiered compilation is 240M). The JIT compiled-code cache. On a co-located host with two JVMs, 512M × 2 = 1 GB of physical RAM pinned to compiled code. Trino doesn’t generate enough hot bytecode to need more than 256M; halving this on both JVMs freed half a gig of host RAM. On a dedicated worker, leave it at 512M.
- GC logging. Not optional. The next OOM should leave evidence. Without GC logs, “why did the worker die?” is a guessing game. With them, you can usually see the death spiral building over several minutes.
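It helps to see what those percentages mean in bytes on a 12 GB heap. The sketch below treats each flag as a plain fraction of -Xmx, which is an approximation: G1 adjusts the marking threshold adaptively at runtime by default, so read the output as orders of magnitude rather than exact trigger points. The function name is ours.

```python
MB = 1024 * 1024
GB = 1024 * MB


def g1_budget(heap_bytes: int, region_bytes: int,
              reserve_percent: int, ihop_percent: int) -> dict[str, int]:
    """Rough byte values implied by the G1 flags for a given heap."""
    return {
        # Objects larger than half a region are allocated as humongous.
        "humongous_threshold_bytes": region_bytes // 2,
        # Heap kept free as to-space for evacuation.
        "reserve_bytes": heap_bytes * reserve_percent // 100,
        # Occupancy at which concurrent marking is first scheduled.
        "marking_starts_at_bytes": heap_bytes * ihop_percent // 100,
        "region_count": heap_bytes // region_bytes,
    }


budget = g1_budget(12 * GB, 32 * MB, reserve_percent=15, ihop_percent=35)
for name, value in budget.items():
    print(f"{name}: {value}")

# Code cache reclaimed by going from 512M to 256M on two co-located JVMs.
code_cache_saved_mb = (512 - 256) * 2
print(f"code cache saved: {code_cache_saved_mb} MB")
```

With 32M regions, a single object has to exceed 16 MB before it becomes humongous, and the 15% reserve holds back roughly 1.8 GB of a 12 GB heap.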
The “heap dump on OOM” pair worth adding for diagnostic purposes:
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/trino/heapdumps
The dump is large — make sure the path has space — but the one OOM you actually need to diagnose is invaluable. Without this, you have the absence of a process.
Soft limits vs hard limits
Two different mechanisms; people confuse them.
Hard limits fire at the per-query, per-node level: query.max-memory-per-node and query.max-total-memory-per-node. When a query crosses one, it is killed. The cluster keeps running. Other queries are unaffected.
Soft limits fire at the cluster level, via resource groups. softMemoryLimit is a percentage of a resource group’s pool. When the group crosses it, the coordinator queues new queries from that group rather than admitting them. Running queries continue. This is the elastic safety brake — it slows arrivals before the cluster is forced to start killing things.
Workers have no idea about soft limits. They only see hard caps, and they only see them per-query, per-node. The cluster-level orchestration lives entirely on the coordinator.
Practical implication: hard caps protect the worker from a runaway query. Soft caps protect the cluster from runaway arrivals. You need both. The JP incident had neither.
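A toy model makes the division of labour explicit. This is not Trino code and none of these names exist in Trino; it only mirrors the two decisions described above, with made-up limits.

```python
from dataclasses import dataclass

MB = 1024 * 1024
GB = 1024 * MB


@dataclass(frozen=True)
class Limits:
    per_query_per_node_bytes: int  # hard cap, checked on each worker
    group_soft_limit_bytes: int    # soft cap, checked on the coordinator


def worker_check(query_bytes_on_node: int, limits: Limits) -> str:
    """Worker-side view: only this query's usage on this node matters."""
    if query_bytes_on_node > limits.per_query_per_node_bytes:
        return "KILL"
    return "OK"


def coordinator_admit(group_bytes_in_use: int, limits: Limits) -> str:
    """Coordinator-side view: queue new arrivals once the group is over its soft limit."""
    if group_bytes_in_use >= limits.group_soft_limit_bytes:
        return "QUEUE"
    return "ADMIT"


limits = Limits(per_query_per_node_bytes=1 * GB, group_soft_limit_bytes=10 * GB)
print(worker_check(2 * GB, limits))        # KILL
print(coordinator_admit(11 * GB, limits))  # QUEUE
```

Note that `worker_check` never sees the group total and `coordinator_admit` never kills anything, which is the point: each limit covers a failure the other cannot.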
Query patterns that destroy coordinator memory
Most production Trino OOMs are not “the workload is too big”. They are “the workload is the wrong shape.” Three patterns are worth knowing.
The 4-way UNION ALL fan-out
This is the pattern that triggered our incident. The application split one logical query over four adjacent time windows and UNION ALL’d the results:
SELECT sum(...), _id
FROM (
(SELECT ... WHERE timestampValue BETWEEN t0 AND t1 GROUP BY ...)
UNION ALL
(SELECT ... WHERE timestampValue BETWEEN t1 AND t2 GROUP BY ...)
UNION ALL
(SELECT ... WHERE timestampValue BETWEEN t2 AND t3 GROUP BY ...)
UNION ALL
(SELECT ... WHERE timestampValue BETWEEN t3 AND t4 GROUP BY ...)
)
GROUP BY _id;
What looks like one query is four execution plans from the coordinator’s point of view. Each UNION ALL branch gets its own plan tree, its own stages, its own scheduling. The cost-based optimiser compiles four plans. The scan cost is paid four times even though all four hit the same source. Coordinator stage-count balloons.
The rewrite is one continuous scan with a CASE WHEN for the split labels (only if the labels are actually used downstream):
SELECT sum(...), _id
FROM (
SELECT ...,
CASE
WHEN timestampValue BETWEEN t0 AND t1 THEN 'split0'
WHEN timestampValue BETWEEN t1 AND t2 THEN 'split1'
WHEN timestampValue BETWEEN t2 AND t3 THEN 'split2'
ELSE 'split3'
END AS splitId
FROM source
WHERE timestampValue BETWEEN t0 AND t4
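-- Trino does not resolve SELECT aliases in GROUP BY, so the CASE expression is repeated here.
GROUP BY _id,
CASE
WHEN timestampValue BETWEEN t0 AND t1 THEN 'split0'
WHEN timestampValue BETWEEN t1 AND t2 THEN 'split1'
WHEN timestampValue BETWEEN t2 AND t3 THEN 'split2'
ELSE 'split3'
END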
)
GROUP BY _id;
Same logical output. One plan, one scan, one set of stages. The coordinator does ~4× less work. Workers parallelise the single scan naturally across splits.
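If you want to convince yourself the two shapes return the same rows before touching production SQL, an in-memory SQLite table is enough. This is a stand-in, not Trino: it says nothing about plan shape or coordinator load, the table and values are invented, and it uses half-open windows so that a row sitting exactly on a boundary is not counted by two UNION ALL branches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (_id TEXT, timestampValue INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [("a", 5, 10), ("a", 15, 20), ("b", 25, 30), ("b", 35, 40), ("a", 38, 1)],
)

# Window bounds t0..t4; each window is [lo, hi).
bounds = {"t0": 0, "t1": 10, "t2": 20, "t3": 30, "t4": 40}


def branch(lo: str, hi: str) -> str:
    """One UNION ALL branch: aggregate a single time window."""
    return (
        "SELECT _id, sum(amount) AS s FROM source "
        f"WHERE timestampValue >= :{lo} AND timestampValue < :{hi} GROUP BY _id"
    )


fan_out_sql = (
    "SELECT _id, sum(s) FROM ("
    + " UNION ALL ".join(branch(f"t{i}", f"t{i + 1}") for i in range(4))
    + ") GROUP BY _id ORDER BY _id"
)

single_scan_sql = """
SELECT _id, sum(s) FROM (
    SELECT _id,
           CASE
               WHEN timestampValue < :t1 THEN 'split0'
               WHEN timestampValue < :t2 THEN 'split1'
               WHEN timestampValue < :t3 THEN 'split2'
               ELSE 'split3'
           END AS splitId,
           sum(amount) AS s
    FROM source
    WHERE timestampValue >= :t0 AND timestampValue < :t4
    GROUP BY _id, splitId
) GROUP BY _id ORDER BY _id
"""

fan_out = conn.execute(fan_out_sql, bounds).fetchall()
single_scan = conn.execute(single_scan_sql, bounds).fetchall()
print(fan_out == single_scan, single_scan)
```

SQLite accepts the `splitId` alias in GROUP BY, which keeps the sketch short; do not read that as a statement about Trino's dialect.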
Global ORDER BY without LIMIT
A global sort forces the coordinator to pull all of the (possibly aggregated) rows from every worker into its own heap to do the final sort. For an ORDER BY over millions of rows, this is millions of rows landing in one process.
Add a LIMIT and Trino’s optimiser switches to Top-N: each worker builds a bounded priority queue of size N, only the top-N from each worker is shipped to the coordinator, and the coordinator sorts a small set (N × worker count). On a 3-worker cluster with LIMIT 100, the coordinator sorts 300 rows instead of millions.
SELECT user_id, total_spend
FROM hive.sales.daily_aggregates
ORDER BY total_spend DESC
LIMIT 100;
Rule: dashboards and BI workloads that order data should always have a default LIMIT enforced at the application or wrapper layer. Beware high OFFSET: OFFSET 500000 LIMIT 100 makes workers keep priority queues of size 500,100, defeating the optimisation.
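The Top-N shape is easy to simulate. The sketch below models three workers and a coordinator with plain lists and `heapq`; it illustrates the idea described above, not Trino's operator code, and the data is synthetic with distinct spend values so ties do not muddy the comparison.

```python
import heapq
import random

Row = tuple[str, int]  # (user_id, total_spend)


def worker_top_n(rows: list[Row], n: int) -> list[Row]:
    """Each worker keeps only its own n largest rows."""
    return heapq.nlargest(n, rows, key=lambda r: r[1])


def coordinator_top_n(partials: list[list[Row]], n: int) -> list[Row]:
    """The coordinator sorts at most n * worker_count candidate rows."""
    candidates = [row for partial in partials for row in partial]
    return heapq.nlargest(n, candidates, key=lambda r: r[1])


rng = random.Random(42)
spends = list(range(30_000))
rng.shuffle(spends)
all_rows = [(f"user{i}", spend) for i, spend in enumerate(spends)]
workers = [all_rows[i::3] for i in range(3)]  # spread rows over 3 workers

N = 100
partials = [worker_top_n(rows, N) for rows in workers]
shipped = sum(len(p) for p in partials)
result = coordinator_top_n(partials, N)

print(shipped)  # 300 rows reach the coordinator instead of 30,000
```

The result matches a full global sort truncated to 100 rows, while the coordinator only ever holds 300 candidates.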
Client backpressure
A client that pulls data slowly causes the coordinator’s output buffer to fill up. Workers keep producing data; the coordinator can’t ship it; the buffer (query.max-output-stage-buffer-size, exchange.max-buffer-size) sits at capacity. This is the easiest way to make a “correctly written” query take down a coordinator that’s otherwise idle.
The fix is on the client side. Node.js example: use async iterators (for await (const row of query)) instead of .fetchall() or .on('data'). Async iterators apply natural TCP backpressure — when the client is slow, the coordinator’s send slows, the worker’s production slows, and the buffer stays bounded. With .fetchall(), every byte the server produces lands in the coordinator’s buffer until the entire result fits, regardless of what the client is doing.
Defensive guardrails for catastrophic dumps:
exchange.max-buffer-size=512MB
query.max-output-positions=500000
These are the hard ceiling. The client behaviour is the actual fix.
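The property that makes pull-based consumption safe is a bounded buffer whose producer blocks when it is full. A minimal sketch of that mechanism with a thread and a `queue.Queue`; it is a generic model of backpressure, not a Trino client, and `run_pipeline` is a name invented for the example.

```python
import queue
import threading


def run_pipeline(total_rows: int, buffer_rows: int) -> tuple[list[int], int]:
    """Stream rows through a bounded buffer; return rows received and the buffer's high-water mark."""
    buf: queue.Queue = queue.Queue(maxsize=buffer_rows)
    done = object()  # sentinel marking end of stream
    high_water = 0

    def producer() -> None:
        nonlocal high_water
        for row in range(total_rows):
            buf.put(row)  # blocks while the buffer is full: this is the backpressure
            high_water = max(high_water, buf.qsize())
        buf.put(done)

    thread = threading.Thread(target=producer)
    thread.start()

    received = []
    while True:
        item = buf.get()  # the consumer pulls at its own pace
        if item is done:
            break
        received.append(item)
    thread.join()
    return received, high_water


rows, peak = run_pipeline(total_rows=1000, buffer_rows=8)
print(len(rows), peak <= 8)
```

However slow the consumer loop is, the buffer never holds more than `buffer_rows` items; the producer simply waits.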
Connector lifecycle: cancellation isn’t free
Worth distinguishing two timeouts on connectors that people frequently confuse:
- connection-timeout (per-catalog). The allowance for the initial network handshake: authentication, opening sockets. It does nothing to terminate an active, streaming query. If your connector’s connection-timeout is 10s and a query has been streaming data for an hour, this setting will not save you.
- query.max-execution-time and friends (Trino-wide). Cluster execution caps. When one fires, Trino dismantles the query’s local tasks and broadcasts cleanup signals to the connector. For the MongoDB connector, this is a killCursors call that releases the cursor on the MongoDB side. The external system stops working on the query too.
The MongoDB connector specifically:
- Respects maxTimeMS at the cursor level. If MongoDB hits its own 60s server-side timeout, the cursor throws and the connector propagates an exception up to the operator.
- Has had bugs in older versions where the interrupted state wasn’t checked between batch reads, so a cancelled query would keep consuming the current batch even after Trino had cancelled it. Check the connector version on your deployment.
- Stores partial results in heap during a batch read. If MongoDB returns the timeout after delivering partial results, those results sit in Trino’s heap until the operator processes the exception, which under GC pressure can be a long time.
The application’s role
You can size a Trino cluster perfectly and still die because the application doesn’t behave. Two things every Trino-consuming application needs:
- DELETE /v1/query/{queryId} on disconnect. If the client times out, the user cancels, the upstream HTTP request is dropped, or the application is shutting down, send the DELETE. Without this, Trino has no way to know the client stopped caring. The query keeps running until something else kills it (the 5-minute query.client.timeout, an execution-time cap, or memory pressure). Every uncancelled abandoned query is held memory.
- Per-query timeouts that match the application’s tolerance. If the application’s HTTP client times out at 60s, set query.max-execution-time=60s (or less). Don’t leave Trino’s caps higher than the application’s; you’ll create exactly the orphan-query window that caused our incident.
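The cancellation call itself is a one-liner in most HTTP clients. A minimal sketch using only the Python standard library, assuming an unauthenticated coordinator that identifies the caller through the X-Trino-User header; the host name, query ID, and function names are placeholders, and a secured cluster will need whatever authentication your deployment uses.

```python
import urllib.request


def build_cancel_request(coordinator_url: str, query_id: str,
                         user: str) -> urllib.request.Request:
    """Build the DELETE /v1/query/{queryId} request without sending it."""
    url = f"{coordinator_url.rstrip('/')}/v1/query/{query_id}"
    return urllib.request.Request(url, method="DELETE",
                                  headers={"X-Trino-User": user})


def cancel_query(coordinator_url: str, query_id: str, user: str,
                 timeout_s: float = 5.0) -> int:
    """Send the cancellation and return the HTTP status code."""
    request = build_cancel_request(coordinator_url, query_id, user)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status


# Placeholder values for illustration only.
request = build_cancel_request("http://trino.example:8080/",
                               "20240101_000000_00001_abcde", "app")
print(request.get_method(), request.full_url)
```

Wire `cancel_query` into the same code paths that handle client timeouts, user cancels, and shutdown hooks, and give it a short timeout of its own so a sick coordinator cannot stall the application's exit.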
Gotchas
- Commented-out caps are landmines. A safety config that exists in the template but is commented out reads as “we considered this” but enforces as “we shipped without it.” Either set it with a sensible default or delete the line.
- Two JVMs on one host are one memory consumer with extra GC. Heap sizing has to account for both -Xmx values plus 20–25% off-heap headroom. The default ReservedCodeCacheSize of 512M × 2 alone is a non-trivial chunk on a tight host.
- spill-enabled=true requires actually checking disk. Spill writes to local storage. If the spill path is on a small volume or a slow disk, spilling makes things worse instead of better. Verify disk type and free space before enabling.
- GC logs are evidence; without them you have nothing. A worker that crashes silently teaches you nothing. Configure -Xlog:gc* with file rotation, on every JVM, every time.
- Connector version matters. Particularly for MongoDB, where older versions had cancellation bugs. Verify the version, not just the connector type.
Related reading
- Trino performance and stability pillar — what we shipped, in business context.
- The JP cluster cascade incident — the postmortem this dive is the technical companion to.
- Analytics pipeline pillar — the broader pipeline Trino sits inside.