[DRAFT NOTE] Stack composition and the agent consolidation story are accurate. Specific cardinality numbers, alert volumes, and MTTR/MTTD figures are tagged [TODO].
Every observability stack ever built has the same surface description: “metrics, logs, traces, dashboards, alerts.” If that’s all you need to know, you don’t need this dive.
This dive is the honest version. What we picked, why we picked it, what it costs us operationally, the mistakes we made along the way, and the disciplines that determine whether the stack is useful or just a collection of expensive databases.
The starting point
When Kore was running on VMs, “observability” was a patchwork. Host metrics from CloudWatch. Application logs in MongoDB (yes, really — a separate retention story that eventually became its own cost problem). A handful of Grafana dashboards built by individual engineers, each one telling a slightly different version of the truth.
There was no single picture of “is the platform healthy right now” that the on-call could open at 3am. The question “what would I look at if I were paged right now?” became the design brief for everything that came after.
The constraints were:
- Multi-cloud (AWS, Azure). Anything vendor-locked was off the table.
- Cost-sensitive. SaaS observability (Datadog, New Relic) was attractive, but at our cardinality it was priced at multiples of what running it ourselves costs, and it raised data residency concerns for some tenants.
- Operable by a small team. No dedicated observability platform team.
We landed on the OSS Grafana stack: Prometheus for metrics, Loki for logs, Grafana for querying and dashboards, and eventually ClickHouse for the cardinalities Prometheus couldn’t handle. Three agents per node (Promtail, node_exporter, and Grafana Agent) were consolidated into one (Grafana Alloy) once Alloy matured.
The why-not-Datadog conversation comes up repeatedly. The honest answer: cost. At our cardinality and log volume, the run rate would have been [TODO: rough number] — multiples of running it ourselves. The OpEx for our self-hosted stack (compute, storage, engineering time) is less than the SaaS price, and the engineering time we spend on it is non-trivial but bounded.
Metrics: Prometheus, with discipline
Prometheus is excellent within its sweet spot and miserable outside it. The sweet spot is time-series with bounded cardinality. The discipline that keeps it inside that spot is the actual work of running it.
Federation / remote-write strategy. [TODO: confirm the storage backend — Mimir, Thanos, plain remote-write to long-term?] Per-cloud Prometheus instances scrape locally and remote-write to a long-term store. Queries that need cross-region data hit the long-term store; local queries hit the Prometheus instance directly for lower latency.
Exporters that mattered. node_exporter for host. mongodb_exporter for Mongo internals. redis_exporter for Redis. rabbitmq_exporter (or the built-in management API depending on version) for RMQ. Application-level metrics emitted by the koreserver process directly via a Prometheus client.
Recording rules. Anything a dashboard queries should already be pre-computed by recording rules, so the aggregation cost is paid once at rule-evaluation time rather than on every dashboard load. A dashboard panel that takes 30 seconds to load because it’s aggregating across millions of series in real time is a panel about to be replaced with a recording rule.
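To make the pre-computation idea concrete, here is a minimal Python sketch of what a `sum by (...)` style recording rule does to a set of series. It is an illustration only, not Prometheus code, and the metric labels and values are hypothetical.

```python
from collections import defaultdict

def pre_aggregate(samples, keep_labels):
    """Sum sample values grouped by keep_labels, dropping all other labels.

    Mirrors the shape of a `sum by (...)` aggregation: the output has at
    most one series per distinct combination of the kept labels.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels.get(name, "")) for name in keep_labels)
        totals[key] += value
    return dict(totals)

# Hypothetical raw series: one per (service, pod, path) combination.
raw = [
    ({"service": "api", "pod": "api-1", "path": "/a"}, 3.0),
    ({"service": "api", "pod": "api-2", "path": "/a"}, 2.0),
    ({"service": "api", "pod": "api-2", "path": "/b"}, 1.0),
    ({"service": "worker", "pod": "w-1", "path": "/a"}, 4.0),
]

# Four raw series collapse to two pre-aggregated series.
by_service = pre_aggregate(raw, keep_labels=["service"])
```

The dashboard then reads the two `by_service` series instead of scanning every raw series on each refresh.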
Label hygiene. Reviewed in PRs. No per-user labels. No per-session labels. No request_id. Tenant labels only on metrics where tenant-level views are genuinely needed and the tenant cardinality is bounded.
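Part of that review can be automated. The following is a minimal sketch of a label lint, assuming metric definitions can be listed as a name plus label names. The forbidden label names, the `tenant` label name, and the allowlist mechanism are all hypothetical and stand in for whatever convention the review actually enforces.

```python
# Hypothetical names for labels whose value space is unbounded.
FORBIDDEN_LABELS = {"user_id", "session_id", "request_id"}
TENANT_LABEL = "tenant"

def lint_metric(name, labels, tenant_allowlist=frozenset()):
    """Return a list of human-readable problems for one metric definition."""
    problems = []
    for label in labels:
        if label in FORBIDDEN_LABELS:
            problems.append(f"{name}: label '{label}' is unbounded")
        elif label == TENANT_LABEL and name not in tenant_allowlist:
            # Tenant labels are allowed only where explicitly approved.
            problems.append(f"{name}: tenant label needs an allowlist entry")
    return problems

issues = lint_metric("http_requests_total", ["method", "request_id"])
```

A check like this does not replace the PR review; it only catches the obvious cases before a human looks.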
The discipline is the point. The technology is easy; the operating model that keeps Prometheus from melting is what’s hard.
Logs: Loki, with cardinality control as the main pain
Loki was picked over alternatives (Elasticsearch, Splunk, SaaS log providers) for the same reasons as Prometheus: cost and multi-cloud portability.
The main operational pain has been cardinality control on the label set. Loki indexes labels; high-cardinality labels blow up the index. The patterns that cost us:
- request_id as a label. Each request creates a new stream. Streams pile up; index size explodes.
- High-cardinality dimensions in structured logs labelled directly. Same problem in a slightly different shape.
The fix is structural: high-cardinality data goes in the log body, not the label set. Stream-level label pruning at the Loki ingest. A quarterly sweep of “what streams are eating the most space?” with corrective action.
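The split between labels and body can be sketched as a small shaping step before shipping a log record. This is a minimal illustration under assumptions: the allowlist of label names is hypothetical, and the function only shapes data; it does not call any Loki API.

```python
import json

# Hypothetical allowlist: low-cardinality fields that are queried by.
ALLOWED_LABELS = ("app", "env", "level")

def to_loki_entry(record):
    """Split a structured log record into (labels, log line).

    Only allowlisted fields become stream labels; everything else,
    including request_id, is serialised into the log body.
    """
    labels = {k: str(record[k]) for k in ALLOWED_LABELS if k in record}
    body = {k: v for k, v in record.items() if k not in ALLOWED_LABELS}
    return labels, json.dumps(body, sort_keys=True)

labels, line = to_loki_entry({
    "app": "koreserver",
    "level": "error",
    "request_id": "r-123",
    "msg": "upstream timeout",
})
```

The request is still findable at query time by filtering on the body, for example with a LogQL query such as `{app="koreserver"} | json | request_id="r-123"`, without `request_id` ever creating a stream.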
The lesson that recurs across observability work: the wrong default is “label everything that might be useful.” The right default is “label only what you’ll query by; everything else goes in the body.”
ClickHouse: where Prometheus can’t go
[TODO: this section needs your specific use cases]
ClickHouse fills the gap Prometheus and Loki can’t. Specifically:
- High-cardinality analytical queries. Per-tenant, per-request, per-user breakdowns over weeks of data.
- Wide tables with many columns where most queries read a few.
- Concurrent ad-hoc analytical work.
[TODO: what specific use cases drove ClickHouse adoption at Kore? Per-tenant analytics? Request-level dashboards? Operational data warehouse?]
[TODO: ingest path — direct from application? CDC from somewhere? Kafka?]
[TODO: schema and partitioning approach]
What ClickHouse isn’t:
- A drop-in replacement for Prometheus. Different query model; PromQL doesn’t translate.
- A general-purpose database. Updates and deletes are not cheap; design for append-mostly.
- A low-latency operational store. Sub-second query latency is achievable; sub-millisecond isn’t.
The mental model: Prometheus for “the platform’s vital signs,” Loki for “what did this specific component say,” ClickHouse for “what was the platform doing across all tenants between time X and Y, broken down by Z.”
The incident-response dashboard
This is the single most-used artefact in the whole stack. Top fold shows, for each major stack:
- Current p50 / p95 / p99 latency
- Error-rate sparkline over the last hour
- Saturation indicators (Mongo IOPS, RMQ queue depth, Redis memory, NFS IOPS)
- Per-container delays (ML, FAQ, voice path) so the on-call can see which downstream is slow without clicking through
Below the fold: drill-down panels per stack with the Loki queries pre-wired. The dashboard is the runbook entry point — every alert links to its corresponding panel.
The thing that makes this dashboard useful is the discipline of “if you can’t answer ‘is the platform up and if not what’s slow’ from the top fold, the top fold is wrong.” When a new incident type surfaces a question the dashboard can’t answer, the dashboard gets updated. It’s a living artefact, not a one-time deliverable.
Alerting philosophy
Multi-burn-rate SLO alerts for customer-visible paths. Alertmanager routing by service class and tenant tier. Grouping and inhibition rules so a cascade incident doesn’t fan out to 50 simultaneous pages.
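The multi-burn-rate condition can be sketched in a few lines of Python. This illustrates the general technique rather than the production rule configuration: the window pairs and thresholds (14.4 over 1h and 5m, 6 over 6h and 30m) are the commonly cited defaults for a 30-day SLO window, and the 99.9% target and function names are assumptions.

```python
# (long window, short window, burn-rate threshold); commonly cited
# defaults, assumed here rather than taken from the real alert rules.
WINDOWS = [("1h", "5m", 14.4), ("6h", "30m", 6.0)]

def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are accruing."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget

def should_page(error_ratios, slo_target):
    """Page only if both the long and the short window exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert resets soon after recovery.
    """
    for long_w, short_w, threshold in WINDOWS:
        if (burn_rate(error_ratios[long_w], slo_target) >= threshold
                and burn_rate(error_ratios[short_w], slo_target) >= threshold):
            return True
    return False

# Hypothetical error ratios per window for a 99.9% SLO.
ongoing = {"1h": 0.02, "5m": 0.02, "6h": 0.004, "30m": 0.02}
recovered = {"1h": 0.02, "5m": 0.0005, "6h": 0.004, "30m": 0.001}
page_ongoing = should_page(ongoing, slo_target=0.999)
page_recovered = should_page(recovered, slo_target=0.999)
```

In the `recovered` case the 1h window is still burning fast, but the 5m window has dropped back, so no page is sent for an incident that has already ended.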
The principle: an alert should mean “a human needs to do something now.” If it doesn’t, it’s noise. Noise eventually trains people to ignore real alerts. We periodically audit the alert tree and prune anything that’s been firing without action being needed.
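One way to make that audit repeatable is to score each alert by how often a firing led to action. The sketch below assumes a firing history is available as (alert name, action taken) pairs; the thresholds and alert names are hypothetical, and the output is a list of candidates for review, not an automatic deletion list.

```python
from collections import Counter

def prune_candidates(firings, min_firings=5, max_action_ratio=0.2):
    """Return alerts that fire often but rarely need a human to act.

    firings: iterable of (alert_name, action_taken) pairs.
    Alerts with fewer than min_firings are skipped: too little evidence.
    """
    fired = Counter()
    acted = Counter()
    for alert_name, action_taken in firings:
        fired[alert_name] += 1
        if action_taken:
            acted[alert_name] += 1
    return sorted(
        name for name, count in fired.items()
        if count >= min_firings and acted[name] / count <= max_action_ratio
    )

# Hypothetical history: one noisy alert, one useful alert, one rare alert.
history = (
    [("DiskAlmostFull", False)] * 6
    + [("ApiErrorBudgetBurn", True)] * 5
    + [("RareAlert", False)] * 2
)
candidates = prune_candidates(history)
```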
[TODO: specific alert volume / pager fatigue numbers if you have them]
The Alloy migration
After running Promtail + node_exporter + Grafana Agent on every node for [TODO: months], the agent footprint and config sprawl became a real maintenance cost. Three DaemonSets, three config languages, three upgrade cycles.
Grafana Alloy collapses all of this into a single agent. Same telemetry, one binary, one config language (an HCL-inspired syntax).
Migration was tranched by node pool, with a back-out path per tranche, and old DaemonSets stayed disabled for weeks before being removed. Zero data gaps, zero customer-visible incidents.
Detail in the data lake setup deep-dive.
What I’d do differently
Start with SLOs before alerts. We retrofitted SLOs onto an existing alert tree. Cleaner the other way: define what “good” looks like for each customer-facing path, alert on burn rate, build dashboards that show SLO status. The alerts then come from the SLO model rather than vice versa.
Adopt OpenTelemetry collectors earlier for traces. Metrics and logs we did well. Tracing came late and we’re still catching up. Distributed-system debugging is genuinely harder without traces — the metrics-and-logs combination doesn’t tell you what touched what in a request path.
Centralise dashboard ownership earlier. “Every team owns its dashboards” produced inconsistent panels, broken links across reorgs, and no shared mental model for incident response. A central convention for the top-fold incident-response dashboards, even where team-specific dashboards stay decentralised, would have paid for itself.
Build the cardinality discipline before the cardinality crisis. We learned about high-cardinality labels by watching Loki struggle. The discipline should have been documented and PR-reviewed from day one.
Things people ask me about this
Why not Datadog? Cost at our cardinality and log volume. Also two clouds and tenant data residency concerns made SaaS messier than it looked.
What does cardinality control look like in your day-to-day? PR review on metric/label changes. A recording-rules layer that pre-aggregates anything a dashboard actually queries. A quarterly sweep that finds high-cardinality streams in Loki and prunes them. There’s no silver bullet — it’s a discipline.
Walk me through one row of your incident-response dashboard. [TODO: pick a real row and narrate it]. The point is the on-call can answer “is the platform up and if not what’s slow” without leaving the top fold.
What’s the case for moving to Tempo and Mimir later? Tempo for trace correlation in Grafana. Mimir for scaling the metrics tier past a single Prometheus. Same operational model for traces/metrics/logs. The argument against is operational burden — we’d be running another distributed system. The flip happens when we’re at the scale where Prometheus is straining; we’re not quite there yet.
Where does ClickHouse fit operationally? [TODO: specific answer]. Different ownership? Different on-call rotation? Or part of the same observability team?
Related reading
- Data lake setup — the data layer of this story (Alloy, Loki, Prometheus, ClickHouse); this dive is the consumption layer (dashboards, alerts, SLOs)
- Kore infrastructure overview — what we’re observing
- Scaling pillar — every scaling diagnosis used this stack