Two years ago, the company committed to running everything on Kubernetes. Most of the org had never run K8s in production. That’s a deceptively small sentence.
The platform was running on a mix of VMs across AWS and Azure — services pinned to specific hosts, security groups stitched together with care, sticky-session state living wherever it happened to land. The conversational-AI platform was the hardest tranche to migrate: largest footprint, most stateful, most customer-facing. And alongside the migration itself, there was a quieter but equally important problem — getting the rest of engineering productive on K8s before the migration introduced more failure modes than it solved.
This is the story of both — the technical migration and the adoption work that made it stick.
What we were dealing with
VM-era operations have a specific texture. Services live on specific hosts. You SSH in when something breaks. Security groups encode “service A talks to service B over port 8080” and as long as nobody changes the SG, traffic flows. Config is baked into the AMI or set via instance metadata. When a host fails, you build a new one and hope state was elsewhere.
K8s breaks every one of those assumptions. Pods are ephemeral. Networking is declared, not implicit. Configuration is data, not part of the image. State has to be explicitly modeled — emptyDir, PVC, external service. None of this is hard once you internalize it. The “once you internalize it” is the entire problem.
The first thing that broke: cascading restarts
The conversational-AI platform’s monolith — koreserver — was slow to boot. Warm caches, plugin loads, downstream connection pools, ML model warmup. Cold start was on the order of 30-60 seconds depending on environment.
We migrated the first tranche with probes that were basically what the existing health-check endpoint exposed, copy-pasted three times — same /health URL for liveness, readiness, and startup probe. K8s killed pods during startup because liveness was failing on a pod that was still loading. Replacement pod came up, K8s started killing that one for the same reason, and the cluster started thrashing. We had to roll back the first tranche within hours.
It’s an embarrassing failure mode in retrospect — probes are entry-level K8s — but it’s also incredibly common. The K8s docs make a clear distinction between the three probes; manifest copy-paste makes them look identical. And the names of the probes hide what they actually do. “Readiness” doesn’t restart anything; “liveness” does. “Startup” is the only one that respects boot time. If you don’t know that going in, the three-probe split looks like over-engineering.
We rebuilt the probe configuration around what each probe actually means:
startupProbe:            # "am I done booting?"
  httpGet:
    path: /health/started
    port: 8080
  failureThreshold: 30
  periodSeconds: 10      # 5 minutes of grace before liveness kicks in
readinessProbe:          # "can I take a request right now?"
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3    # quick to remove from LB; never restarts
livenessProbe:           # "am I deadlocked?"
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 20
  failureThreshold: 3
  timeoutSeconds: 5      # only fires after startupProbe succeeds
Three endpoints, three semantics: /health/started means “process is up and dependencies loaded”; /health/ready means “I can serve a request right now”; /health/live means “I’m not deadlocked, please don’t restart me.” Same monolith, but the application now exposes three distinct truths about itself.
After we shipped this across the migrated workloads, the cascading-restart incidents went to roughly zero. Health-check call volume into the application also dropped: liveness no longer fires every 5s on every pod, which had been consuming meaningful CPU at our pod count. The full breakdown, including the failure modes for each probe, is in Healthcheck probes at scale.
The rest of the migration
The probe fix was the highest-leverage change. The rest was a longer list of less dramatic work:
Config & secrets. Env-vars baked into AMIs went into ConfigMaps and Secrets, mounted as files where the application supported file-based config (the kubelet refreshes mounted files when the object changes, so rotation needs no redeploy). Runtime-tunable flags moved to a config service callable at startup.
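A minimal sketch of the file-mounted pattern, under stated assumptions: the object names, mount paths, registry, image tag, and config key below are all illustrative placeholders, not the platform's actual manifests.

```yaml
# Hypothetical names throughout; only the mounting pattern is the point.
apiVersion: v1
kind: ConfigMap
metadata:
  name: koreserver-config
data:
  app.properties: |
    cache.warmup.enabled=true
---
apiVersion: v1
kind: Pod
metadata:
  name: koreserver-example
spec:
  containers:
    - name: koreserver
      image: registry.example.com/koreserver:1.0.0   # placeholder registry and tag
      volumeMounts:
        - name: config
          mountPath: /etc/koreserver/config     # whole-directory mount, no subPath
          readOnly: true
        - name: secrets
          mountPath: /etc/koreserver/secrets
          readOnly: true
  volumes:
    - name: config
      configMap:
        name: koreserver-config
    - name: secrets
      secret:
        secretName: koreserver-secrets          # assumed to exist already
```

One caveat worth knowing: a key mounted through subPath does not receive updates, and values injected as environment variables are fixed at container start, so the whole-directory mount is what keeps rotation cheap. The application still has to re-read the file.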
Networking. VM security groups translated into NetworkPolicy. Where NetworkPolicy couldn’t express the intent — path-level access control was the biggest gap — Istio AuthorizationPolicy filled in. This surfaced latent permission issues that VM SGs had been silently allowing; messy in the short term, net positive long term. The Istio rationale is its own story — see service mesh tradeoffs.
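To make the translation concrete, here is a hedged sketch of the “service A talks to service B over port 8080” rule as a NetworkPolicy, followed by the kind of path-level rule that needs Istio. The namespace, labels, service account, and path are hypothetical, and the AuthorizationPolicy assumes mesh mTLS is on so that source principals are available.

```yaml
# L3/L4: which pods may reach service-b, and on which port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-service-a-to-service-b
  namespace: platform              # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: service-b
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: service-a
      ports:
        - protocol: TCP
          port: 8080
---
# L7: which identity may call which methods and paths.
# NetworkPolicy has no way to express this part.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: service-b-paths
  namespace: platform
spec:
  selector:
    matchLabels:
      app: service-b
  action: ALLOW
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/platform/sa/service-a"]
      to:
        - operation:
            methods: ["GET"]
            paths: ["/api/v1/*"]   # illustrative path prefix
```

Once any ALLOW policy selects a workload, requests that match no rule are denied, which is exactly how the latent permission issues tend to surface.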
Stateful surfaces. Shared state went through PVCs with ReadWriteMany against an NFS mount. This is the decision I’d most like to redo. We treated NFS as a constant and built around it; we should have used the migration as the window to design it out. The scaling work later paid for that decision with inode exhaustion and IOPS upgrades.
Sticky-session paths were the other interesting case. On VMs, sticky sessions worked by routing to a specific instance. On K8s, instance identity isn’t stable — pods get rescheduled, scaled, replaced. We refactored those paths to be session-store-backed (Redis) rather than node-affinity-backed. Took longer than expected because some of the session state was implicit — code that assumed “the next request will hit the same pod” without saying so explicitly.
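For the shared-state half of this, the claim itself is small; the cost is in what sits behind it. A sketch, assuming an NFS-backed StorageClass (the class name, claim name, and size are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-state               # hypothetical name
spec:
  accessModes:
    - ReadWriteMany                # many pods on many nodes, one filesystem
  storageClassName: nfs            # assumed class; the real name depends on the cluster
  resources:
    requests:
      storage: 100Gi               # placeholder size
```

Nothing in this manifest hints at inode limits or IOPS ceilings on the backing filesystem, which is part of why the dependency is easy to carry forward without noticing.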
Rollout. Each workload migrated behind a routing-layer feature flag. Customer traffic shifted in increments: 1% → 10% → 50% → 100%. Rollback was a flag flip, not a redeploy. We kept the rollback path in place until the workload had been at 100% for a week with no anomalies. Zero customer-visible downtime across both clouds.
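The post does not say how the routing-layer flag was implemented. As one possible shape, assuming the split lives in an Istio VirtualService, weighted routes give the same increments and the same cheap rollback. Every hostname here is hypothetical.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: koreserver-cutover
spec:
  hosts:
    - koreserver.example.internal                       # hypothetical entry host
  http:
    - route:
        - destination:
            host: koreserver-vm.example.internal        # legacy VM fleet (assumed reachable from the mesh)
          weight: 99
        - destination:
            host: koreserver.platform.svc.cluster.local # new K8s service
          weight: 1                                     # first step of 1 -> 10 -> 50 -> 100
```

Weights must sum to 100. Moving to the next step, or rolling back to 100/0, is a change to two numbers rather than a redeploy of the workload.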
The other half: getting the org productive on K8s
When the dust settled on the first few migrated workloads, a new pattern emerged. Other teams wanted to migrate their services too — and they were going to do it whether or not the migration was well-supported. The choice wasn’t “centralized vs decentralized adoption”, it was “structured adoption vs the wild west”.
The wild west was already happening. Teams were copy-pasting manifests that worked once for someone else’s service. The mistakes propagated: missing PDBs, imagePullPolicy: Always with :latest tags, identical probes for liveness/readiness/startup, no resource requests so the scheduler had nothing to plan against. CrashLoopBackOff was the most common Slack message in the K8s channel.
No dedicated platform team existed. The enablement had to fit alongside everything else. Three deliverables:
Golden-path manifests. A small repo with templated manifests for four shapes: stateless service, async worker, CronJob, stateful service. Each shipped with the boring-but-essential defaults already filled in — requests and limits set (with comments explaining the numbers), probes split correctly with links to the probe deep-dive, PDB, ServiceAccount with least-privilege RBAC, NetworkPolicy stub. The intent was less compliance and more “give the average engineer a working starting point that doesn’t need debugging.”
It worked partly because we were honest about what the defaults were and weren’t. Resource requests weren’t “the right answer for your service” — they were “a sensible starting point for a koreserver-style workload; tune based on your actual usage.” Probes weren’t “what you should always do” — they were “split by intent because they answer different questions.”
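A sketch of what the stateless-service shape might look like with those defaults filled in. The service name, registry, image tag, replica count, and every resource number are placeholders chosen for illustration; the ServiceAccount is assumed to be defined elsewhere in the template.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
  labels:
    app: example-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      serviceAccountName: example-service   # dedicated account, not "default"
      containers:
        - name: app
          image: registry.example.com/example-service:1.4.2   # pinned tag, never :latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
          resources:
            requests:                # what the scheduler plans against
              cpu: 500m
              memory: 512Mi
            limits:                  # starting point only; tune from observed usage
              cpu: "1"
              memory: 1Gi
          startupProbe:              # "am I done booting?"
            httpGet: {path: /health/started, port: 8080}
            periodSeconds: 10
            failureThreshold: 30
          readinessProbe:            # "can I take a request right now?"
            httpGet: {path: /health/ready, port: 8080}
            periodSeconds: 5
            timeoutSeconds: 2
            failureThreshold: 3
          livenessProbe:             # "am I deadlocked?"
            httpGet: {path: /health/live, port: 8080}
            periodSeconds: 20
            timeoutSeconds: 5
            failureThreshold: 3
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-service
spec:
  minAvailable: 2                    # with 3 replicas, one pod may be evicted at a time
  selector:
    matchLabels:
      app: example-service
```

The inline comments carry as much weight as the values: they are what tells the next engineer which numbers are defaults to tune and which are conventions to keep.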
Runbooks. About a dozen, each targeting a specific failure mode that on-call had been paged for repeatedly: CrashLoopBackOff triage tree (image → config → probes → dependency → app), probe flap diagnosis, pending pods (the scheduling decision tree), ImagePullBackOff, resource pressure, networking (“why can’t A call B?” — NetworkPolicy → DNS → Istio AuthorizationPolicy). Each runbook structured the same way: symptom → likely causes in order of probability → checks → fixes. Linked directly from the alert.
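“Linked directly from the alert” can be as simple as an annotation. A sketch, assuming Prometheus-style alerting rules with kube-state-metrics installed; the post does not name the alerting stack, and the runbook URL and thresholds are hypothetical.

```yaml
groups:
  - name: k8s-workload-health
    rules:
      - alert: PodCrashLooping
        # kube-state-metrics reports 1 while a container waits in this state.
        expr: |
          max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}[5m]) >= 1
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is in CrashLoopBackOff"
          runbook_url: https://runbooks.example.com/k8s/crashloopbackoff   # hypothetical URL
```

The annotation puts the triage tree one click away from the page, which matters most for the engineer who has never seen that failure before.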
Lunch-and-learns. Half-hour talk, half-hour Q&A, recorded. The topics weren’t syntax — syntax is documented everywhere. They were misconceptions. “Pod is a VM” (it isn’t, and treating it like one will hurt you). “Restart fixes everything” (it doesn’t, and reaching for it first will hide real bugs). “Requests = limits” (they shouldn’t be — see K8s misconceptions).
And one cheap thing that paid back hugely: a weekly office hour. Any team, any K8s question, any stage. Mostly used by teams about to migrate or about to ship a manifest change. The misconceptions surfaced there first — by the time someone asked the question in office hours, three other people in their team had the same one privately.
What this got us
Migration outcomes:
- Both clouds (EKS and AKS) carrying the platform with zero customer-visible downtime during the cutover.
- Cascading-restart incidents during deploys: dropped to essentially zero after the probe fix.
- Health-check noise on the monolith: significantly reduced (liveness no longer firing every 5s on every pod).
- Networking: surfaced latent permission gaps that had been hidden by overly-permissive VM SGs. Painful to clean up, but the platform is more correct for it.
Adoption outcomes:
- Most production services now share a common manifest shape. Platform-wide changes — probe defaults, NetworkPolicy upgrades, label conventions — are tractable instead of per-service archaeology.
- Time-to-diagnose for routine deploy failures dropped significantly for teams using the runbooks. The once-common reflex of “I have no idea what’s wrong, I’ll just ask in #platform” became rare; the common case is now “let me check the runbook first.”
What I’d do differently
Make NetworkPolicy a hard gate from tranche one, not tranche three. Early migrations inherited “allow all” because nobody had written the policies yet. We paid for that cleanup later. The right call would have been to slow the first migration by a week and ship it with NetworkPolicy in place.
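The gate itself can be tiny. A sketch of a namespace-wide default deny, which would have forced each migrated service to declare its callers up front (the namespace name is a placeholder):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: platform        # hypothetical namespace
spec:
  podSelector: {}            # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress                # no ingress rules listed, so all inbound traffic is denied
```

With this in place, “allow all” stops being the inherited default and every allowed path has to exist as a reviewed policy.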
Document the probe pattern as a manifest from day one. I figured out the three-probe split during incident response. By the time it was written down, several teams had reinvented it differently. We standardized eventually, but the period of “everyone has a slightly different probe config” was longer than it needed to be.
Treat golden-path manifests as a real internal product. Versioned, changelogged, with a deprecation policy. We ran them as “shared YAML” for too long. When the defaults needed to change — and they did, e.g. when we tuned probe periods after scaling experiments — there was no clean way to propagate the change. Teams forked at v1 and never came back.
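One way to get versioning and a propagation path, offered as an assumption rather than what was actually built: publish the templates as Kustomize bases and have each service pin a tagged release. The repository URL, directory, tag, and patch file name are all hypothetical.

```yaml
# kustomization.yaml in a consuming service's repo
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Remote base pinned to a release tag of the (hypothetical) golden-path repo.
  - https://github.com/example-org/golden-path//stateless-service?ref=v1.3.0
patches:
  # Service-specific overrides live here instead of in a forked copy.
  - path: resources-patch.yaml
```

Picking up new defaults then means bumping the ref and reading the changelog, and the set of services still pinned to an old tag is something you can list.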
Retire NFS during the migration. This is the single biggest regret. We had a once-in-a-decade window to redesign the shared-state dependency and we didn’t take it. The scaling work then had to deal with NFS as a constraint.
Run lunch-and-learns as a rotation, not a series. A one-time series doesn’t reach the people who join after it ran. New engineers inherited old misconceptions from peer copy-paste. The misconception material should have been onboarding content from the start.
Things people ask me about this
What’s the single biggest mistake teams make with K8s probes? Using the same endpoint for all three. Liveness reuses readiness → slow pods get killed instead of removed from LB. Startup not used → liveness kills warm-up. The names hide what each probe does; the docs make it clearer than the manifest copy-paste pattern does. Probe deep-dive has the full breakdown.
What’s the misconception you fight the most? “A pod is a VM.” It manifests as over-provisioned resources, services that misuse local disk, and engineers running kubectl exec to fix things in place instead of pushing a config change. The fix is mental model work, not documentation. See K8s misconceptions I had to unteach.
How did you handle stateful workloads? Per-workload analysis up front, written down, before moving. Ephemeral data stayed ephemeral. Shared state went through PVCs (NFS, regrettably). Sticky-session paths refactored to be session-store-backed. The discipline was “decide and document, then migrate” rather than “lift-and-shift and figure out the state model in production.”
How do you measure adoption? Number of services on the golden-path template. Deploy frequency per team (should rise as confidence grows). Incident rate caused by K8s-shape problems (should fall). Attendee count for lunch-and-learns is a vanity metric — it tells you whether people showed up, not whether they got better.
How do you keep golden-path manifests from going stale? Changes go through review with the platform owners. New defaults are tied to specific incidents or postmortems, so the rationale is documented. Publish a changelog. Stale templates are worse than no templates because people trust them.
Related reading
- Healthcheck probes at scale — the deep-dive on the highest-leverage fix
- K8s misconceptions I had to unteach — the lunch-and-learn content, written down
- Kore infrastructure overview — the broader infra context
- Kore CI/CD pipeline — how deploys actually flow through this stack
- Related pillar: Scaling to 15k CCU — what we did on top of the K8s platform