RabbitMQ Cluster
Placeholder — content coming soon.
Trino on the JVM is three nested memory systems — physical RAM, Java heap, and Trino's own User/System/Total accounting — and most production incidents come from one of those layers being out of step with the others. This dive walks through the mechanics that matter, the G1GC flags worth setting, the query patterns that wreck coordinator memory, and the application-side behaviour that turns transient slow queries into leaks.
Took the analytics Trino cluster from a service that needed restarts every few weeks and reactive heap-resizing across regions to one that survives dashboard storms. Tipping point was a 40-minute coordinator+worker cascade on a single-host JP deployment; the work after that was structural — JVM and Trino memory caps, query lifecycle fixes in the application, query-shape rewrites, and observability that would have caught the incident on the way down rather than after it.
How we built the operational data lake at Kore — Promtail (then Grafana Alloy) for log shipping, Prometheus for metrics, ClickHouse as the queryable backing store for the high-cardinality dimensions Prometheus can't handle gracefully. Three agents collapsed into one when Alloy matured; the ClickHouse layer is what made high-cardinality analytics actually queryable.
The most common Kubernetes mistake I see is using the same `/health` endpoint for liveness, readiness, and startup probes. Each probe answers a different question. Conflating them produces cluster thrashing, cascading restarts, and on-call pages that the application can't explain because the application itself is fine.
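A minimal sketch of splitting probes by intent, assuming a hypothetical service that tracks three state flags. The endpoint paths (`/startupz`, `/readyz`, `/livez`), the `AppState` fields, and the `probe_status` helper are illustrative names, not taken from any real codebase.

```python
from dataclasses import dataclass


@dataclass
class AppState:
    """Hypothetical application state; field names are illustrative."""
    started: bool = False          # one-time initialisation has finished
    dependencies_ok: bool = False  # downstream services reachable right now
    loop_responsive: bool = True   # the process itself is not wedged


def probe_status(path: str, state: AppState) -> int:
    """Return the HTTP status that each probe endpoint would serve."""
    if path == "/startupz":
        # Startup asks: has initialisation finished yet?
        return 200 if state.started else 503
    if path == "/readyz":
        # Readiness asks: should traffic be routed to this pod right now?
        return 200 if state.started and state.dependencies_ok else 503
    if path == "/livez":
        # Liveness asks: is the process stuck and in need of a restart?
        # It deliberately ignores dependencies, so a downstream outage
        # takes the pod out of rotation instead of restarting it.
        return 200 if state.loop_responsive else 503
    return 404
```

With a dependency down, `/readyz` fails while `/livez` still passes, so the pod stops receiving traffic without being restarted. That separation is what a single shared `/health` endpoint cannot express.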
When a company moves to Kubernetes and most of engineering is learning it on the job, the same five misconceptions show up over and over. They cost teams in production before anyone realizes they're misconceptions. This is the list I kept catching in design reviews and lunch-and-learns at Kore.
How code and infrastructure changes flow into Kore production. Harness orchestrates pipelines; Terraform manages cloud infrastructure across AWS and Azure; Artifactory hosts our internal artefacts after we moved off JFrog Cloud for multi-region. Docker images, Kubernetes manifests, and VM-era artefacts all flow through variations of the same pipeline.
A tour of the Kore.AI platform infrastructure — the Node.js monolith, the constellation of supporting services, the data tier, the messaging tier, the dual-cloud deployment, and the operational pain points that came from each choice. Written as the context that every other deep-dive on this site assumes.
Prometheus + Loki + Grafana + Grafana Alloy as the observability core, with ClickHouse for the high-cardinality queries Prometheus can't handle. The hardest part wasn't picking the stack; it was building the dashboards and discipline around it. Most observability work is cardinality control and label hygiene, dressed up as 'building an observability platform.'
We run RabbitMQ in production. The team has spent significant time evaluating Kafka and Pulsar as replacements for certain workload classes. This dive covers the actual operational tradeoffs, why we moved off ha-all without picking quorum queues, and the decision tree I'd use today if I were picking a broker from scratch.
The MongoDB diagnostic guides online assume you have one Mongo instance and a few users. Debugging a sharded production cluster under load is a different sport: the queries are sometimes lying to you, the metrics often don't pinpoint the actual problem, and the thing that breaks at 5k CCU was invisible at 500. This is the loop I've developed for figuring out what's actually wrong.
A sharded MongoDB cluster lives and dies by its shard key. Picking one for a collection that has both write-heavy ingest and tenant-scoped queries is a balancing act. This is how I think about it, why MongoDB v8's online resharding changed the stakes, and the specific keys we landed on for our largest collections.
We picked Istio at Kore because we needed simultaneous VM and Kubernetes support, and Linkerd didn't support VMs at the time. It was the right call for that constraint and a heavy choice for everything else. Eventually we migrated from per-pod sidecars to Istio ambient mode, which cut the per-pod overhead by a lot. This is the comparison I'd actually use to pick a mesh today.
A scheduled AWS Glue job started failing during what was supposed to be a routine data migration. The job's failure cascaded — Kafka backed up, downstream dashboards went stale, and the migration window we'd carefully planned around stretched from hours to days. The interesting part wasn't the fix; it was discovering that 'Glue job failure' covered three distinct root causes that needed to be untangled before any one of them could be fixed.
A retention job designed to clean up stale operational data quietly extended its reach to a class of records it was never meant to touch. By the time someone noticed, weeks of customer history were gone. The root cause was a query that worked perfectly when written and silently widened as the data model evolved around it.
A class of pods started restarting at irregular intervals. The behaviour had been there for three years, intermittent and rare enough to be ignored. When a scale increase made it consistent, we finally traced it back to a code path that had been written before the codebase looked the way it does now. The bug was simple. The reason it had survived three years was instructive.
Took a multi-tenant conversational-AI platform from a 6k CCU ceiling to 13k by eliminating bottlenecks one tier at a time — RabbitMQ, Redis, MongoDB, workload isolation. Infra cost per scaling event dropped ~27% along the way.
Moved a stateful, customer-facing conversational-AI platform off VMs and onto Kubernetes across two clouds, without downtime, while most of the engineering org was learning K8s for the first time. The single highest-leverage change was splitting health probes by intent — startup, readiness, liveness all answer different questions, and we'd been using one endpoint for all three.
Built the analytics off-ramp that takes operational data out of MongoDB and lands it as analytical tables on S3 — Mongo → Kafka → AWS Glue → Apache Hudi. The interesting part wasn't the design; it was the failure modes we discovered in production: Glue jobs that OOMed at peak, schema drift that broke writes, small-files pileups that slowed every query, and a sync that would silently stop with no alert.
Three RabbitMQ problems at three different scale levels. ha-all mirroring melting the cluster at 800 CCU. Single-cluster topology causing cross-workload interference at 3k CCU. One-message-one-job consumer patterns burning CPU on broker round-trips for high-volume queues. Each needed a different fix; together they took RMQ from primary bottleneck to non-issue.
CPU-based HPA is the default and it's almost always wrong for a Node.js monolith. We replaced it with KEDA + per-pool saturation metrics — RMQ queue depth, event-loop lag, RPS per pod — combined with workload-specific node groups and asymmetric cooldowns. The cost saving wasn't from clever scaling math; it was from no longer paying ML-instance prices to run an API pod.
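A rough sketch of the asymmetric-cooldown idea: scale up quickly on a saturation signal such as queue depth, and scale down only after a longer quiet period. This is not KEDA's implementation; the function names, thresholds, and cooldown values are all hypothetical.

```python
import math


def desired_replicas(queue_depth: int, target_per_pod: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed so each pod handles about target_per_pod messages."""
    wanted = math.ceil(queue_depth / target_per_pod)
    # Clamp to the configured bounds for the pool.
    return max(min_replicas, min(max_replicas, wanted))


def next_replicas(current: int, desired: int,
                  seconds_since_last_scale: float,
                  up_cooldown: float = 30.0,
                  down_cooldown: float = 300.0) -> int:
    """Apply asymmetric cooldowns: short for scale-up, long for scale-down."""
    if desired > current and seconds_since_last_scale >= up_cooldown:
        return desired
    if desired < current and seconds_since_last_scale >= down_cooldown:
        return desired
    # Still inside the relevant cooldown window, or nothing to change.
    return current
```

With these illustrative defaults, a pool that scaled 45 seconds ago may grow immediately but must wait out the 300-second window before shrinking, which avoids flapping when load is bursty.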
On May 19, a dashboard query storm pushed the JP Trino cluster — a single m5.2xlarge running both coordinator and worker — into a 40-minute GC death spiral. Both JVMs OOM-exited. DevOps restarted at 18:42; the worker re-OOM'd 15 minutes later. The root cause was two structural problems (co-location overcommit, missing per-query memory cap) meeting two triggers (dashboard query storm, application-side query fan-out).
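The co-location overcommit can be sketched as simple arithmetic. An m5.2xlarge has 32 GiB of RAM; every other number in the usage note (heap sizes, off-heap estimates, OS reserve) is a hypothetical placeholder, not a value from the incident, and `JvmBudget` and `overcommit_gib` are illustrative names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class JvmBudget:
    """Memory plan for one JVM on the host; values are supplied by the caller."""
    name: str
    heap_gib: float      # the -Xmx setting
    off_heap_gib: float  # estimate for metaspace, thread stacks, direct buffers

    @property
    def footprint_gib(self) -> float:
        return self.heap_gib + self.off_heap_gib


def overcommit_gib(host_ram_gib: float, os_reserve_gib: float,
                   jvms: Iterable[JvmBudget]) -> float:
    """GiB by which the JVM footprints exceed usable RAM (positive = overcommitted)."""
    usable = host_ram_gib - os_reserve_gib
    return sum(j.footprint_gib for j in jvms) - usable
```

For example, a hypothetical coordinator at 12 GiB heap plus 2 GiB off-heap and a worker at 20 GiB heap plus 3 GiB off-heap, on a 32 GiB host with 2 GiB reserved for the OS, gives `overcommit_gib(32, 2, [...]) == 7`: the two JVMs can jointly claim 7 GiB more than the host can supply.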