All

22 entries

Deep Dive · 17 Jun 2026
RabbitMQ Cluster
Placeholder — content coming soon.
rabbitmq, messaging

Deep Dive · 27 May 2026
Trino memory and JVM tuning: what each knob actually does
Trino on the JVM is three nested memory systems — physical RAM, Java heap, and Trino's own User/System/Total accounting — and most production incidents come from one of those layers being out of step with the others. This dive walks through the mechanics that matter, the G1GC flags worth setting, the query patterns that wreck coordinator memory, and the application-side behaviour that turns transient slow queries into leaks.
trino, jvm, g1gc, memory, performance

Core Work · 27 May 2026
Stabilising Trino: from restarts and guesswork to a tuned cluster
Took the analytics Trino cluster from a service that needed restarts every few weeks and reactive heap-resizing across regions to one that survives dashboard storms. Tipping point was a 40-minute coordinator+worker cascade on a single-host JP deployment; the work after that was structural — JVM and Trino memory caps, query lifecycle fixes in the application, query-shape rewrites, and observability that would have caught the incident on the way down rather than after it.
trino, jvm, performance, mongodb, data, observability

Deep Dive · 24 May 2026
Data lake setup: log shipping with Alloy, metrics with Prometheus, analytics with ClickHouse
How we built the operational data lake at Kore — Promtail (then Grafana Alloy) for log shipping, Prometheus for metrics, ClickHouse as the queryable backing store for the high-cardinality dimensions Prometheus can't handle gracefully. Three agents collapsed into one when Alloy matured; the ClickHouse layer is what made high-cardinality analytics actually queryable.
observability, alloy, prometheus, clickhouse, loki, grafana

Deep Dive · 24 May 2026
Healthcheck probes at scale: three probes, three questions
The most common Kubernetes mistake I see is using the same `/health` endpoint for liveness, readiness, and startup probes. Each probe answers a different question. Conflating them produces cluster thrashing, cascading restarts, and on-call pages that the application can't explain because the application itself is fine.
kubernetes, probes, sre, production

Deep Dive · 24 May 2026
Kubernetes misconceptions I had to keep unteaching
When a company moves to Kubernetes and most of engineering is learning it on the job, the same five misconceptions show up over and over. They cost teams in production before anyone realises they are misconceptions. This is the list I kept catching in design reviews and lunch-and-learns at Kore.
kubernetes, enablement, platform, learning

Deep Dive · 24 May 2026
Kore CI/CD pipeline: Harness, Terraform, Artifactory, and the multi-region story
How code and infrastructure changes flow into Kore production. Harness orchestrates pipelines; Terraform manages cloud infrastructure across AWS and Azure; Artifactory hosts our internal artefacts after we moved off JFrog Cloud for multi-region. Docker images, Kubernetes manifests, and VM-era artefacts all flow through variations of the same pipeline.
cicd, harness, terraform, artifactory, docker, kubernetes

Deep Dive · 24 May 2026
Kore infrastructure overview: architecture, reasoning, pain points
A tour of the Kore.AI platform infrastructure — the Node.js monolith, the constellation of supporting services, the data tier, the messaging tier, the dual-cloud deployment, and the operational pain points that came from each choice. Written as the context that every other deep-dive on this site assumes.
architecture, kubernetes, aws, azure, mongodb, rabbitmq

Deep Dive · 24 May 2026
Kore observability setup: what we built, what works, what doesn't
Prometheus + Loki + Grafana + Grafana Alloy as the observability core, with ClickHouse for the high-cardinality queries Prometheus can't handle. The hardest part wasn't picking the stack; it was building the dashboards and discipline around it. Most observability work is cardinality control and label hygiene, dressed up as 'building an observability platform.'
observability, sre, prometheus, grafana, loki, alloy

Deep Dive · 24 May 2026
Comparing message brokers: RabbitMQ, Kafka, Pulsar — and why we deferred quorum queues
We run RabbitMQ in production. The team has spent significant time evaluating Kafka and Pulsar as replacements for certain workload classes. This dive covers the actual operational tradeoffs, why we moved off ha-all without picking quorum queues, and the decision tree I'd use today if I were picking a broker from scratch.
rabbitmq, kafka, pulsar, messaging, ha, tradeoffs

Deep Dive · 24 May 2026
Debugging MongoDB in a real production cluster: a strategy, not a checklist
The MongoDB diagnostic guides online assume you have one Mongo instance and a few users. Debugging a sharded production cluster under load is a different sport — the queries are sometimes lying to you, the metrics often don't pin the actual problem, and the thing that breaks at 5k CCU was invisible at 500. This is the loop I've developed for figuring out what's actually wrong.
mongodb, debugging, sharding, production, performance

Deep Dive · 24 May 2026
MongoDB sharding: picking the right key, surviving the wrong one
A sharded MongoDB cluster lives and dies by its shard key. Picking one for a collection that has both write-heavy ingest and tenant-scoped queries is a balancing act. This is how I think about it, why MongoDB v8's online resharding changed the stakes, and the specific keys we landed on for our largest collections.
mongodb, sharding, data, v8

Deep Dive · 24 May 2026
Service mesh tradeoffs: Istio, Linkerd, Kuma — picking the wrong one, then living with it
We picked Istio at Kore because we needed simultaneous VM and Kubernetes support, and Linkerd didn't support VMs at the time. It was the right call for that constraint and a heavy choice for everything else. Eventually we migrated from per-pod sidecars to Istio ambient mode, which removed the sidecar from each pod and cut per-pod overhead substantially. This is the comparison I'd actually use to pick a mesh today.
istio, linkerd, kuma, service-mesh, kubernetes, ambient-mode

Incident · 24 May 2026
When a Glue job broke the analytics pipeline mid-migration
A scheduled AWS Glue job started failing during what was supposed to be a routine data migration. The job's failure cascaded — Kafka backed up, downstream dashboards went stale, and the migration window we'd carefully planned around stretched from hours to days. The interesting part wasn't the fix; it was discovering that 'Glue job failure' covered three distinct root causes that needed to be untangled before any one of them could be fixed.
SEV2 · MTTR: [TODO] · MTTD: [TODO]
sre, incident, glue, hudi, data, postmortem

Incident · 24 May 2026
The retention policy that deleted data it wasn't supposed to
A retention job designed to clean up stale operational data quietly extended its reach to a class of records it was never meant to touch. By the time someone noticed, data was gone for weeks of customer history. The root cause was a query that worked perfectly when written and silently widened as the data model evolved around it.
SEV2 · MTTR: [TODO] · MTTD: [TODO]
sre, incident, data-loss, retention, postmortem

Incident · 24 May 2026
The 3-year-old restart bug nobody understood
A class of pods started restarting at irregular intervals. The behaviour had been there for three years, intermittent and rare enough to be ignored. When a scale increase made it consistent, we finally traced it back to a code path that had been written before the codebase looked the way it does now. The bug was simple. The reason it had survived three years was instructive.
SEV3 · MTTR: [TODO] · MTTD: [TODO]
sre, incident, latent, debugging, postmortem

Core Work · 24 May 2026
Scaling a production system to 15k concurrent users
Took a multi-tenant conversational-AI platform from a 6k CCU ceiling to 13k by eliminating bottlenecks one tier at a time — RabbitMQ, Redis, MongoDB, workload isolation. Infra cost per scaling event dropped ~27% along the way.
cloud, kubernetes, mongodb, redis, rabbitmq, performance

Core Work · 24 May 2026
Migrating critical infrastructure from VMs to Kubernetes
Moved a stateful, customer-facing conversational-AI platform off VMs and onto Kubernetes across two clouds, without downtime, while most of the engineering org was learning K8s for the first time. The single highest-leverage change was splitting health probes by intent — startup, readiness, liveness all answer different questions, and we'd been using one endpoint for all three.
kubernetes, migration, sre, platform, enablement

Core Work · 24 May 2026
Building and evolving an analytics pipeline
Built the analytics off-ramp that takes operational data out of MongoDB and lands it as analytical tables on S3 — Mongo → Kafka → AWS Glue → Apache Hudi. The interesting part wasn't the design; it was the failure modes we discovered in production: Glue jobs that OOMed at peak, schema drift that broke writes, small-files pileups that slowed every query, and a sync that would silently stop with no alert.
data, kafka, glue, hudi, s3, mongodb

Core Work · 24 May 2026
Optimizing RabbitMQ for reliability and throughput
Three RabbitMQ problems at three different scale levels. ha-all mirroring melting the cluster at 800 CCU. Single-cluster topology causing cross-workload interference at 3k CCU. One-message-one-job consumer patterns burning CPU on broker round-trips for high-volume queues. Each needed a different fix; together they took RMQ from primary bottleneck to non-issue.
rabbitmq, messaging, performance, ha, backend, throughput

Core Work · 24 May 2026
Experimenting with autoscaling in real-world production systems
CPU-based HPA is the default and it's almost always wrong for a Node.js monolith. We replaced it with KEDA + per-pool saturation metrics — RMQ queue depth, event-loop lag, RPS per pod — combined with workload-specific node groups and asymmetric cooldowns. The cost saving wasn't from clever scaling math; it was from no longer paying ML-instance prices to run an API pod.
kubernetes, hpa, keda, autoscaling, capacity-planning, performance

Incident · 19 May 2026
The JP Trino cascade: a coordinator and worker that took each other down
On May 19, a dashboard query storm pushed the JP Trino cluster — a single m5.2xlarge running both coordinator and worker — into a 40-minute GC death spiral. Both JVMs OOM-exited. DevOps restarted at 18:42; the worker re-OOM'd 15 minutes later. The root cause was two structural problems (co-location overcommit, missing per-query memory cap) meeting two triggers (dashboard query storm, application-side query fan-out).
SEV3 · MTTR: ~16 minutes (first restart) / ~[TODO] (full stabilisation) · MTTD: [TODO]
sre, incident, trino, jvm, memory, postmortem