Experimenting with autoscaling in real-world production systems

The first autoscaling change I shipped at Kore was wrong, and I knew it was wrong, and I shipped it anyway because the alternative — letting on-call operators keep manually scaling pools before customer launches — was clearly worse.

That’s the honest framing for this entire workstream. Production autoscaling is a series of “this is wrong, but it’s less wrong than what we had” iterations until you eventually land somewhere that’s actually correct. There is no clean theory you apply once and walk away. There’s a default that mostly works, and an operating model around it that handles the cases where it doesn’t.

The default was wrong in a specific way

When I picked up the autoscaling work, the platform was running HPA on CPU. That’s the usual Kubernetes starting point. It’s a fine default for some workloads. It is a terrible default for ours.

Koreserver — our Node.js monolith — is a single-threaded process with an event loop. When it’s “busy,” what that actually means is: requests are queued behind I/O, or the event loop is blocked on a downstream call, or RabbitMQ has piled up work, or Redis is slow. CPU utilization in any of those cases can look almost idle. A pod can be at 30% CPU and serving requests at p99 latency of 8 seconds — because the bottleneck is downstream and CPU has nothing to do with it.

HPA on CPU in that world has two failure modes:

It scales too late. By the time CPU is actually high, the user-visible latency has been bad for several minutes. Users are already angry.
It scales for the wrong reasons. A GC pause spikes CPU briefly. A batch job starts on the same pod. HPA scales up, the spike resolves, HPA scales down. We just paid for half an hour of extra pods to ride out a 30-second blip.
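
To make the first failure mode concrete, here is a minimal sketch of the replica calculation as the Kubernetes HPA documentation describes it: desired = ceil(current × currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default). The 60% CPU target and the pod counts are illustrative assumptions, not Kore’s actual settings.

```python
import math

def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Simplified form of the HPA replica calculation from the Kubernetes docs."""
    ratio = current_value / target_value
    # Inside the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# Hypothetical pool: 10 pods, 60% CPU target, pods sitting at 30% CPU
# while p99 latency is 8 s because a downstream dependency is slow.
print(desired_replicas(10, current_value=30, target_value=60))  # 5
```

Under these assumptions a CPU-driven HPA would not just fail to scale up during the latency incident, it would recommend halving the pool.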

The operational consequence: operators were manually pre-scaling pools before known traffic events (customer launches, scheduled load tests, big tenant onboardings). It worked but it didn’t scale. Every launch needed a person watching graphs.

The principle: scale on what’s actually saturating

The fix wasn’t a single signal. It was a different signal per traffic class.

The platform breaks into roughly four pool types, each scaling on a different metric:

Pool	Primary signal	Why
Voice / IVR (latency-sensitive)	p95 event-loop lag + in-flight request count	CPU hides event-loop saturation; lag is the user-visible signal
Async worker (RMQ consumers)	RMQ queue depth via KEDA	Queue depth is the saturation signal for a consumer
FAQ / ML-heavy	CPU + p95 downstream call latency, combined	ML inference is actually CPU-bound; downstream latency catches dependency saturation
General HTTP	RPS per pod, CPU as guardrail ceiling	RPS normalized per pod is a cleaner signal than raw CPU

CPU stays as a guardrail on every pool. If pods are scaling up while CPU is idle, that’s a red flag that the custom metric is misbehaving. If CPU is pinned and we haven’t scaled up yet, we want to know.
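
The guardrail idea can be sketched as a small consistency check between the custom metric’s scaling decision and pool CPU. This is an illustration only: the `PoolSnapshot` type, the function name, and the 20%/85% thresholds are hypothetical, and in practice this would more likely live in an alerting rule than in application code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PoolSnapshot:
    desired_replicas: int    # what the custom-metric scaler is asking for
    current_replicas: int    # what is running right now
    avg_cpu_percent: float   # average CPU utilization across the pool

def guardrail_alert(s: PoolSnapshot, idle_cpu: float = 20.0,
                    pinned_cpu: float = 85.0) -> Optional[str]:
    """Return a warning when CPU and the custom metric disagree, else None."""
    scaling_up = s.desired_replicas > s.current_replicas
    if scaling_up and s.avg_cpu_percent < idle_cpu:
        return "scaling up while CPU is idle: check the custom metric"
    if not scaling_up and s.avg_cpu_percent > pinned_cpu:
        return "CPU pinned but no scale-up requested: check the custom metric"
    return None

print(guardrail_alert(PoolSnapshot(desired_replicas=12, current_replicas=8,
                                   avg_cpu_percent=10.0)))
```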

The thing that makes this work practically — and the thing I underestimated initially — is KEDA. HPA on custom metrics is possible via Prometheus Adapter, but it’s fiddly. KEDA gives you scalers for queue depth, message age, custom Prometheus queries, even cron-based triggers, with a declarative ScaledObject resource that’s much more legible than the HPA equivalent:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: consumer-runtime
spec:
  scaleTargetRef:
    name: consumers            # Deployment to scale
  minReplicaCount: 0           # scale-to-zero in lower envs
  maxReplicaCount: 40
  triggers:
    - type: rabbitmq
      metadata:
        queueName: runtime-jobs
        mode: QueueLength
        value: "20"            # target: 20 msgs per consumer pod
        # connection settings (host/hostFromEnv or a TriggerAuthentication) omitted here

KEDA also gives us scale-to-zero for lower environments, which on its own is one of the biggest wins of this work — dev and staging environments drop to zero replicas outside business hours instead of running idle.
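
As a rough mental model of what the queue-length target above means, the replica count tracks backlog divided by the per-pod target, clamped to the configured bounds. The sketch below is an approximation for reasoning about the numbers, not KEDA’s implementation: KEDA handles the zero-to-one activation itself and delegates the rest to an HPA, which applies its own tolerance and rate limits.

```python
import math

def consumers_for_queue(queue_length: int, per_pod_target: int = 20,
                        min_replicas: int = 0, max_replicas: int = 40) -> int:
    """Approximate replica count for a queue-length target, clamped to the bounds."""
    wanted = math.ceil(queue_length / per_pod_target)
    return max(min_replicas, min(max_replicas, wanted))

for depth in (0, 15, 250, 5000):
    print(depth, consumers_for_queue(depth))
```

With the values from the manifest, a backlog of 250 messages maps to 13 pods, and anything past 800 messages pins the pool at the 40-pod ceiling, at which point the queue grows no matter what the autoscaler wants.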

Cooldowns: asymmetric on purpose

One of the lessons that took longest to internalize was that scale-up and scale-down should not be symmetric.

The cost of being under-provisioned during a spike is high — users experience degraded service, SLOs burn, sometimes errors. The cost of being over-provisioned for a while is low — you pay for some idle pods.

So: scale up fast, scale down slow. In practice:

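# HPA spec.behavior; on a KEDA ScaledObject this sits under
# spec.advanced.horizontalPodAutoscalerConfig
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
      - type: Percent
        value: 100              # double replicas in 30s if needed
        periodSeconds: 30
  scaleDown:
    stabilizationWindowSeconds: 1800   # 30-minute buffer
    policies:
      - type: Percent
        value: 10               # remove at most 10% of pods per minute
        periodSeconds: 60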

Scale-up has no stabilization window — if the signal goes red, we react immediately and can double pod count in 30 seconds. Scale-down has a 30-minute stabilization window and removes at most 10% of pods per minute. That asymmetry is what eliminates oscillation. Symmetric cooldowns produce sawtooth scaling graphs and pod churn that itself becomes a cost.
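
To get a feel for how slow the 10%-per-minute policy is, here is a simplified model of the scale-down rate limit. It assumes one scaling step per 60-second period and integer rounding down; the real controller tracks scaling events within each period, so treat the output as an approximation. The floor of 4 pods is a hypothetical value.

```python
def scale_down_path(start: int, floor: int, percent: int = 10) -> list[int]:
    """Replica count after each period when at most `percent` of pods may go per period."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    path = [start]
    current = start
    while current > floor:
        # Integer floor of current * (1 - percent/100), never below the pool floor.
        allowed = current * (100 - percent) // 100
        current = max(floor, allowed)
        path.append(current)
    return path

path = scale_down_path(start=40, floor=4)
print(path)
print(len(path) - 1, "periods to reach the floor")
```

In this model, shrinking from 40 pods to 4 takes 16 one-minute steps, and that clock only starts after the 30-minute stabilization window has passed. Scale-up, by contrast, can go from 4 to 32 pods in three 30-second doublings.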

For RTM/WebSocket pools, scale-up triggers earlier (lower threshold) because of a specific quirk: long-lived TCP connections stick to specific backend pods. New pods come up idle while existing pods stay overloaded. The fix isn’t autoscaling, it’s connection draining and probe tuning, but earlier scale-up gives those mechanisms more time to redistribute load.

The bigger win: node groups

Per-pool autoscaling is the part of this work that sounds technically interesting. The part that saved actual money was less glamorous.

Before this work, every pod ran on the same node pool. Node sizing was determined by the most resource-hungry workload — ML inference, which needed memory-optimized instances with lots of RAM. Every other pod — API servers, async workers, NLP — ran on the same nodes. A scale-up event for the API pool provisioned ML-sized nodes for API pods. We were paying premium prices for commodity workloads.

We split into workload-specific node groups:

Node group	Instance	Workload
`app-compute`	c6i.8xlarge (32 vCPU, 64 GB, 3.5 GHz Ice Lake)	koreserver, consumers, NLP
`ml-memory`	memory-optimized	ML inference, embeddings
`rmq-dedicated`	c5.18xlarge	RabbitMQ only
`mongo-compute`	c5d.18xlarge	MongoDB shards

(MongoDB shards are on EC2 not K8s, but they’re part of the same capacity planning conversation.)

The instance choices weren’t random. c6i.8xlarge specifically (Intel Ice Lake at 3.5 GHz) was chosen because koreserver’s NLP paths are clock-speed sensitive. Azure equivalents at 2.4 GHz with turbo boost performed measurably worse on the same workload, and turbo boost timing is unpredictable enough that we couldn’t size capacity against it.

The ~27% cost reduction per scaling event that gets mentioned in the main scaling pillar is mostly from this. Per-pool HPA fixed the false-positive scale-ups. Node groups fixed the structural over-provisioning. Together they made cost growth sub-linear against concurrent users (CCU).

Cluster Autoscaler, not Karpenter (yet)

We use Cluster Autoscaler for node-level scaling. Karpenter was evaluated and rejected — at the time we made the decision, its operational model (NodePools, disruption budgets, consolidation policies) was more complex than we had bandwidth to operationalize. CA with explicit node groups and over-provisioning buffers met our needs.

This is the kind of decision worth revisiting. Karpenter has matured significantly. Its consolidation behavior and spot-instance handling could meaningfully improve our bin-packing. If I were starting this work today I’d evaluate it again before committing to CA.

Cron-based minimum replica management

This one is operational, not technical, but it matters.

Production environments need a minimum floor above zero — KEDA scale-to-zero is great for dev/staging but in production we want pods warm to absorb the first request burst. The “right” minimum varies by time of day. At 3am, traffic is low; at 9am as Asian business hours start, traffic ramps fast.

A cron job adjusts the replica floor on a schedule (minReplicas on a plain HPA, minReplicaCount on a KEDA ScaledObject): lowered to a skeleton footprint outside business hours, raised to a warm baseline 15 minutes before business hours start. The 15-minute lead time is empirical: it’s enough for the Cluster Autoscaler to provision nodes if needed, and for the new pods to pass startup probes before the first burst of traffic arrives.

Without this, the first 5 minutes of business hours saw cold-start latency every day. With it, the ramp is invisible.
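
The scheduling logic behind that cron job can be sketched as a pure function of the clock. Everything specific here is an assumption for illustration: the 09:00 to 18:00 window, the skeleton and warm replica counts, and the function name. It also assumes naive datetimes in the pool’s local timezone and a business window that does not cross midnight.

```python
from datetime import datetime, time, timedelta

def replica_floor(now: datetime, business_start: time = time(9, 0),
                  business_end: time = time(18, 0),
                  lead: timedelta = timedelta(minutes=15),
                  skeleton: int = 2, warm: int = 10) -> int:
    """Return the minimum replica count that should be in force at `now`."""
    # Raise the floor `lead` before business hours so nodes and pods are ready.
    start = datetime.combine(now.date(), business_start) - lead
    end = datetime.combine(now.date(), business_end)
    return warm if start <= now < end else skeleton

print(replica_floor(datetime(2024, 1, 8, 8, 44)))  # still on the skeleton floor
print(replica_floor(datetime(2024, 1, 8, 8, 45)))  # warm floor kicks in
```

A job running this every few minutes would then patch the floor onto whichever object owns it. Keeping the decision in one testable function makes the lead time easy to tune when node provisioning or startup probes get slower.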

What I’d do differently

Set per-pool metric alerts before launching per-pool HPA. I shipped the new autoscaling config before the dashboards and alerts caught up. The first week was partially blind: autoscaling was making decisions and I didn’t have great visibility into why. I should have built the observability first.

Define min/max replicas from load tests, not gut feel. Initial values were “reasonable-looking numbers.” Several pools had minReplicas: 1, which is a self-inflicted DoS waiting to happen for any customer-facing pool. We tightened these later based on actual data; they should have been data-driven from the start.

Re-evaluate Karpenter sooner. After the initial rejection, I kept deferring a second evaluation because we had a working CA setup. The savings from Karpenter’s consolidation and spot handling are real and we left them on the table.

Codify the autoscaling design in the golden-path manifests. Each new deployment has to think about its scaling strategy. Most don’t. The golden-path templates should ship with a sensible ScaledObject and force the team to either accept the default or articulate why their workload is different.
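
For the point above about deriving min/max replicas from load tests, one way to turn measurements into bounds looks like the sketch below. The inputs (baseline and peak RPS, sustainable RPS per pod), the 25% headroom, and the hard floor of 2 are all hypothetical; the floor of 2 simply encodes the lesson that a customer-facing pool should never sit on a single pod.

```python
import math

def replica_bounds(baseline_rps: float, peak_rps: float, rps_per_pod: float,
                   headroom: float = 1.25) -> tuple[int, int]:
    """Derive (min_replicas, max_replicas) from load-test numbers."""
    # Never go below 2 so one pod failure cannot remove all serving capacity.
    min_replicas = max(2, math.ceil(baseline_rps / rps_per_pod))
    # Size the ceiling for the measured peak plus headroom.
    max_replicas = max(min_replicas, math.ceil(peak_rps * headroom / rps_per_pod))
    return min_replicas, max_replicas

print(replica_bounds(baseline_rps=300, peak_rps=2000, rps_per_pod=100))
```

With those example numbers the result is a floor of 3 and a ceiling of 25. The useful part is less the formula than the fact that each bound traces back to a measured number someone can challenge.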

Things people ask me about this

What’s the single most impactful change you made? Splitting into workload node groups. Not the metric choice — the node group split. The metric change fixed false-positive scaling and made the system more responsive. The node group split is what made cost growth sub-linear: you stop paying ML-instance prices to run an API pod. The autoscaling logic itself matters; the substrate it runs on matters more.

Why not VPA? VPA is useful for calibrating initial requests/limits in dev, and we use it in that mode. As a runtime scaling lever it’s not the right tool: it doesn’t add pods, it resizes existing pods (which has traditionally meant a restart), and it composes poorly with HPA when both act on the same metric, such as CPU.

How do you handle a new queue or traffic class? Someone writes a new KEDA ScaledObject. We have a template they fork. The thing they’re forced to think about is: what’s the right saturation signal for this workload? The default — copy whatever the previous service used — is usually wrong. The forcing function is the most valuable part.

Why is your scale-down 30 minutes? Empirical for our traffic patterns. Long enough to absorb a second peak after an initial spike (e.g., a customer job that runs twice). Short enough to not burn a meaningful budget on idle pods. Tune for your traffic shape. The principle — scale down much more conservatively than up — matters more than the specific number.

How do you handle traffic that doesn’t fit the per-pool model, such as a tenant with a wildly different workload mix? Honestly: not well. We’ve considered tenant-aware routing that would send specific tenants to specific pools, but operationally that’s a maintenance burden we haven’t taken on. Today, a tenant with an atypical workload gets averaged into whatever pool their traffic class hits. This is a real limitation; the fix would be either tenant-aware sharding (expensive) or pool subdivision (also expensive). For now, the worst-offending tenants get tracked and the pools they hit get extra capacity headroom.
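
On the 30-minute scale-down question above: the reason a long window absorbs a second peak is that, per the Kubernetes HPA documentation, scale-down uses the highest recommendation seen within the stabilization window. A minimal sketch of that rule follows; the class name and the timeline are hypothetical, and timestamps are plain seconds.

```python
from collections import deque

class ScaleDownStabilizer:
    """Keep the highest replica recommendation seen within the window."""

    def __init__(self, window_seconds: int = 1800):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp, recommendation) pairs

    def recommend(self, now: int, raw_recommendation: int) -> int:
        self._history.append((now, raw_recommendation))
        # Drop recommendations that have aged out of the window.
        while self._history[0][0] <= now - self.window_seconds:
            self._history.popleft()
        return max(rec for _, rec in self._history)

s = ScaleDownStabilizer()
# Spike at t=0, lull at t=600, second peak at t=1500, quiet afterwards.
for t, raw in [(0, 30), (600, 10), (1500, 28), (1800, 10), (3300, 10)]:
    print(t, raw, s.recommend(t, raw))
```

In this timeline the pool stays at 30 through the lull, is still at 28 when the first spike ages out, and only drops to 10 once the second peak is 30 minutes old. A 5-minute window would have scaled down during the lull and paid for a second scale-up.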

Scaling to 15k CCU — the broader scaling work this autoscaling fits into
VM-to-Kubernetes migration — the K8s platform this autoscaling runs on
Kore infrastructure overview — the broader infra context, including node groups