Healthcheck probes at scale: three probes, three questions

If you’ve ever seen a Kubernetes cluster that’s “definitely fine” but is somehow restarting half its pods every few minutes during a deploy, the cause is almost always probes.

Specifically: probes that don’t know what they’re supposed to do.

This dive is about why the three probes exist, what each one actually means, the failure modes you get when you conflate them, and the configuration pattern I now use everywhere as the golden path.

Three probes, three questions

Kubernetes gives you three probe types. The names hide what they do, which is part of the problem.

Probe	What it asks	What happens on failure
`startupProbe`	“Has the process finished booting?”	Liveness and readiness wait while it is still failing. The container is killed only if the whole startup budget runs out.
`readinessProbe`	“Can I serve traffic right now?”	Pod removed from Service endpoints. No restart.
`livenessProbe`	“Am I actually alive (not deadlocked)?”	Container killed and restarted.

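The crucial thing is the last column. Only liveness restarts a container that is already running. Readiness just removes you from load balancing. Startup buys time: it holds the other two probes off until boot finishes, and kills the container only if boot takes longer than failureThreshold × periodSeconds.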

Now look at what happens if you use the same /health endpoint for all three. Whatever logic that endpoint runs has to answer the most-permissive question (am I alive) when liveness checks it, and the strictest (am I ready now) when readiness checks it. You can’t have it both ways. Either you restart pods that are just slow, or you serve traffic from pods that aren’t ready yet. Usually both, at different times, in ways that are infuriating to debug.

The failure modes you actually see

The reason this matters is that the failure modes have specific, predictable shapes, and once you’ve seen them you spot them immediately.

“CrashLoopBackOff that isn’t a real crash.” Pod has a long cold start (let’s say 45 seconds: caches, plugins, connection pools). Liveness fires every 20 seconds at the same /health endpoint that fails until startup completes. After 3 consecutive failures the kubelet kills the container and restarts it in place. The restarted container does the same thing. The pod stays in CrashLoopBackOff forever. The application has never finished starting. Nobody who looks at it sees an actual crash because there isn’t one. It’s K8s killing healthy containers because they’re slow.
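
The timing behind this loop is easy to check by hand. The sketch below assumes the kubelet runs the first liveness probe right after initialDelaySeconds and then once per periodSeconds, that the endpoint fails immediately rather than timing out, and it ignores the small jitter the kubelet adds at container start. The function names are hypothetical.

```python
def liveness_kill_time(period_s: float, failure_threshold: int,
                       initial_delay_s: float = 0.0) -> float:
    """Seconds after container start at which the last allowed failure lands."""
    # Failures land at delay, delay + period, delay + 2 * period, ...
    return initial_delay_s + (failure_threshold - 1) * period_s


def boots_before_kill(boot_s: float, period_s: float, failure_threshold: int,
                      initial_delay_s: float = 0.0) -> bool:
    """True if the app finishes booting before liveness gives up on it."""
    return boot_s < liveness_kill_time(period_s, failure_threshold, initial_delay_s)


# The scenario above: 45 s cold start, liveness every 20 s, 3 failures allowed.
kill_at = liveness_kill_time(period_s=20, failure_threshold=3)
survives = boots_before_kill(boot_s=45, period_s=20, failure_threshold=3)
```

Under these assumptions the third failure lands at 40 seconds, five seconds before the app would have finished booting, which is why every restart ends the same way.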

Cascading restarts during downstream blips. Liveness endpoint checks “can I talk to MongoDB?” MongoDB has a 5-second blip. Every pod’s liveness fails. K8s restarts every pod. The restart cascade itself adds load to MongoDB, which extends the blip. Now you’re in a feedback loop.

Pods marked ready that can’t actually serve. Readiness endpoint is the same as liveness, which is the same as a generic health check. The generic check passes the moment the HTTP server is responding. Pod gets traffic. Pod can’t actually serve it yet because its caches aren’t warm. First few requests time out. Users see errors. The dashboard says everything’s healthy.

I’ve seen all three of these in production. The first one is the most embarrassing because it’s so basic. The second is the scariest because it amplifies real failures.

The pattern that works

Three endpoints. Three semantics. Don’t share them.

```yaml
# Container-level probe fields (a fragment of the pod spec)
startupProbe:
  httpGet:
    path: /health/started
    port: 8080
  failureThreshold: 30      # 30 * 10s = 5 minute boot budget
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
  timeoutSeconds: 2

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 20
  failureThreshold: 3
  timeoutSeconds: 5
```
Each endpoint checks one specific thing:

/health/started — process is up, dependencies initially loaded. DB connection pool created, config fetched, plugins initialised, any one-time startup work complete. Returns 200 once startup is done. After that it can stay 200 forever — Kubernetes stops calling this probe once it succeeds.
/health/ready — pod can serve a request right now. Cheap. Should not call expensive downstream dependencies. Should reflect “am I in a state to take traffic” — could be a flag you flip during graceful shutdown, a check on whether your local connection pools have capacity, whether you’re in a known-bad state. Toggling readiness should be near-instant.
/health/live — process is alive, not deadlocked. Most applications need almost nothing here. If your HTTP server is responding at all, you’re alive. Resist the urge to check downstream dependencies in liveness. A downstream blip is not a reason to restart your own process; it’s a reason to fail requests temporarily. Liveness checks that test downstream are how cascade restarts happen.
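
To make the split concrete, here is a minimal sketch of the three endpoints using only the Python standard library. The `HealthState` class, its flag names, and `serve_health` are hypothetical and not taken from our codebase. The sketch assumes startup completion and draining can each be represented by a single flag that the application sets.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthState:
    """Flags the application flips as it moves through its lifecycle."""

    def __init__(self) -> None:
        self.started = threading.Event()   # set once one-time startup work is done
        self.draining = threading.Event()  # set when graceful shutdown begins

    def status_for(self, path: str) -> int:
        """Map a probe path to an HTTP status without calling any dependency."""
        if path == "/health/live":
            # Answering at all is the proof of life; no downstream checks here.
            return 200
        if path == "/health/started":
            # Stays 200 forever once startup has completed.
            return 200 if self.started.is_set() else 503
        if path == "/health/ready":
            ok = self.started.is_set() and not self.draining.is_set()
            return 200 if ok else 503
        return 404


def make_handler(state: HealthState):
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            self.send_response(state.status_for(self.path))
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, format, *args) -> None:
            pass  # keep probe traffic out of the request log

    return HealthHandler


def serve_health(state: HealthState, port: int = 8080) -> ThreadingHTTPServer:
    """Start the health server on a daemon thread and return it."""
    server = ThreadingHTTPServer(("0.0.0.0", port), make_handler(state))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The application would call `state.started.set()` when boot work finishes and `state.draining.set()` at the start of graceful shutdown, so readiness flips near-instantly. In a real service these routes belong on the application's own HTTP server rather than a side thread. Otherwise a responding /health/live says little about whether the request path itself is deadlocked.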

A quick decision tree

When someone shows me a probe config that mixes concerns, I run them through three questions:

If a downstream dependency is down, do you want the pod restarted? No → don’t put dependency checks in liveness.
If the app takes time to boot, do you want liveness killing it during boot? No → use startupProbe.
If the app is alive but temporarily can’t take traffic (e.g., draining, warming up), should it be restarted? No → use readinessProbe.

The answers are almost always no, no, no. Which gives you the three-endpoint split.
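
Two of the three questions can be checked mechanically from the config. This rough sketch assumes the container spec has already been parsed into a plain dict (for example from YAML), and the function names are hypothetical. The first question, dependency checks inside liveness, is not visible in config at all. You have to read the handler code for that.

```python
from typing import List, Optional


def probe_path(container: dict, probe: str) -> Optional[str]:
    """Return a probe's httpGet path, or None if absent or not an httpGet probe."""
    http_get = (container.get(probe) or {}).get("httpGet") or {}
    return http_get.get("path")


def find_mixed_concerns(container: dict) -> List[str]:
    """Flag the config shapes that the decision tree above argues against."""
    warnings = []
    live = probe_path(container, "livenessProbe")
    ready = probe_path(container, "readinessProbe")
    if live is not None and live == ready:
        warnings.append("liveness and readiness share an endpoint: " + live)
    if "livenessProbe" in container and "startupProbe" not in container:
        warnings.append("no startupProbe: liveness can fire during boot")
    return warnings
```

An empty list does not prove the probes are right. It only means the two config-level mistakes are absent.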

At scale, this matters more than it looks

Here’s a thing I underestimated initially. At 1,000 pods, with shared probes hitting a /health endpoint that does work:

Liveness probe period 5s across 1,000 pods = 200 calls/second fleet-wide, all running the same handler.
That handler maybe does a DB ping, maybe checks a config, maybe takes 20ms.
If that 20ms is mostly CPU, that’s 4 seconds of CPU per second across the fleet, just for health checks.
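
The estimate above is a one-line calculation, sketched here so you can plug in your own numbers. It treats the per-call cost as pure CPU time, which is an assumption. A handler that mostly waits on a DB ping burns less CPU than this, but pushes the load onto the database instead. The function name is hypothetical.

```python
def probe_cpu_seconds_per_second(pods: int, period_s: float, cost_ms: float) -> float:
    """Fleet-wide CPU seconds spent on one probe type per wall-clock second."""
    calls_per_second = pods / period_s
    return calls_per_second * cost_ms / 1000.0


# Shared endpoint: 1,000 pods, probed every 5 s, 20 ms per call.
shared = probe_cpu_seconds_per_second(pods=1000, period_s=5, cost_ms=20)

# Hypothetical split: liveness every 20 s against a handler costing about 1 ms.
split_live = probe_cpu_seconds_per_second(pods=1000, period_s=20, cost_ms=1)
```

With the article's numbers, `shared` works out to 4.0 CPU seconds per second. The 1 ms figure for the split liveness handler is illustrative, not a measurement.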

If your /health endpoint is non-trivial, this is a real cost. Splitting probes by intent means /health/live can be near-free (basically “am I responding?”). /health/ready is cheap and rarely changes. /health/started runs once per pod lifecycle. The total CPU spent on health checks drops noticeably.

We saw this directly during the VM-to-K8s migration — application CPU spent on health checks dropped meaningfully once liveness stopped firing every 5s on a shared endpoint.

Gotchas

startupProbe is the most underused. If your app has any meaningful cold start, add it. Without it, liveness kills warm-up. With it, liveness waits.
timeoutSeconds should be shorter than periodSeconds. A probe that can take longer to time out than its period holds up the next check, so the real probing interval quietly stretches beyond what the config says.
Don’t make readiness expensive. It runs every few seconds across every pod. Expense multiplies.
exec probes spawn a process every invocation. Prefer httpGet unless you have a strong reason. Process creation overhead at scale is real.
If you’re using Istio (or any sidecar mesh), set holdApplicationUntilProxyStarts: true. Otherwise the app starts before the sidecar is ready, probes fire against the sidecar, probes fail, you debug this for hours.
Probe values are workload-specific. The numbers in the example above are sensible defaults for our monolith. Tune yours based on actual boot time and steady-state behaviour. Don’t copy-paste.
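
The two gotchas that are visible in config (timeout versus period, and exec probes) can be caught before a deploy. This is a small sketch, again assuming the container spec is available as a plain dict, with hypothetical function names. The fallback values are the Kubernetes defaults for unset fields: a 10-second period and a 1-second timeout.

```python
from typing import List

PROBES = ("startupProbe", "readinessProbe", "livenessProbe")


def check_gotchas(container: dict) -> List[str]:
    """Return human-readable findings for the config-level gotchas above."""
    findings = []
    for name in PROBES:
        probe = container.get(name)
        if not probe:
            continue
        period = probe.get("periodSeconds", 10)   # Kubernetes default
        timeout = probe.get("timeoutSeconds", 1)  # Kubernetes default
        if timeout >= period:
            findings.append(
                f"{name}: timeoutSeconds ({timeout}) is not shorter "
                f"than periodSeconds ({period})"
            )
        if "exec" in probe:
            findings.append(f"{name}: exec probe spawns a process on every run")
    return findings
```

Run against the example config earlier in this post it returns an empty list. It says nothing about whether the numbers fit your workload's boot time, which still needs measuring.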

What this looked like in practice

When we shipped the three-probe split across the migrated workloads at Kore:

Cascading-restart incidents during deploys: dropped to essentially zero.
“CrashLoopBackOff that isn’t a real crash” pages: gone.
Application CPU spent on health checks: dropped noticeably.
On-call confidence that “the cluster looks weird” meant a real problem (and not a probe config) went up significantly.

Full migration context in the VM-to-K8s migration pillar.