Kubernetes misconceptions I had to keep unteaching

If you spend enough time helping people adopt Kubernetes, you start to recognize patterns. The same five mistakes show up over and over, in different teams, in different services, with subtly different surface symptoms but the same root cause: a mental model from the previous world (VMs, mostly) that’s been ported into Kubernetes without being updated.

This is the list. Each one cost someone something in production at Kore. Each one takes a few minutes of conversation to fix if you catch it early, and weeks of incident response if you don’t.

1. “A pod is just a small VM”

This is the most damaging one because everything else flows from it.

Why it’s wrong: pods are designed to be ephemeral, disposable, and identical. VMs are designed to be long-lived, stateful, and individually tuned. Treating a pod like a VM gives you over-provisioned resources (because you size each one for worst case “in case it’s important”), kubectl exec-based fix-in-place debugging (because you SSH’d into VMs to fix things), and persistent local state (because VMs had stable disks). It also gives you genuine surprise the first time a pod is evicted or replaced, its successor lands on another node, and your local data is gone.

The mental model that replaces it: pods are interchangeable replicas of a Deployment. If a pod is broken, you fix the Deployment, not the pod. State goes somewhere durable — a database, a PVC, an object store. Resource requests reflect a typical pod, not a generous one. The pod that’s running right now is going to die at some point; that’s the design.
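A minimal sketch of what that model looks like in a manifest. The service name, image, Secret, and resource numbers are all hypothetical placeholders, not anything Kore actually ran:

```yaml
# Hypothetical Deployment: every name and number here is illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api                  # hypothetical service name
spec:
  replicas: 3                       # interchangeable replicas; no pod is special
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      containers:
        - name: orders-api
          image: registry.example.com/orders-api:1.4.2   # placeholder image
          env:
            - name: DATABASE_URL    # state lives in an external database
              valueFrom:
                secretKeyRef:
                  name: orders-api-db   # hypothetical Secret
                  key: url
          resources:
            requests:               # sized for a typical pod, not a generous one
              cpu: 250m
              memory: 256Mi
```

Any tuning goes into this file and rolls out to every replica. Nothing is changed inside a running pod, so there is nothing to lose when one is replaced.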

The cost when this isn’t internalized: I’ve seen teams keep pets — specific pod names they’d refer to in dashboards because they’d manually tuned config inside the pod. The first horizontal scale event blew that pattern up immediately.

2. “Restart fixes everything”

Particularly common with teams coming from environments where “restart the service” was a triage step that often genuinely fixed things.

Why it’s wrong: in Kubernetes, restart is a tool, not a strategy. If your team’s mental model for any problem is “restart the pod,” you’re masking root causes that will return at scale, and you’re masking them in a way that makes them harder to find next time because the symptom (pod restart) is also the workaround. Liveness probes that bounce pods automatically just industrialise the masking.

The corrosive version of this is “memory leak? Just restart every hour.” It works. Until the leak rate increases. Until you scale up and the restarts collide. Until the next person on the team inherits the workaround and forgets there was ever a real bug.

What replaces it: restart is a diagnostic step. “Does the problem return immediately or after time?” tells you something useful. If a service genuinely needs frequent restarts to stay healthy, that’s a bug worth fixing, not a feature to automate around.

3. “Requests should equal limits”

This one comes from people who’ve read enough K8s docs to be dangerous.

Why it’s wrong: requests and limits answer different questions.

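Requests = “minimum I need to be scheduled.” Used by the scheduler to find a node with enough unreserved capacity. The pod is guaranteed this much.
Limits = “absolute ceiling above which I get throttled or killed.” Enforced by the kernel through cgroups: CPU above the limit is throttled, and memory above the limit gets the container OOM-killed. The pod cannot exceed this.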

Setting them equal (for both CPU and memory, on every container) puts your pod in the Guaranteed QoS class, which has some advantages: it’s among the last to be evicted under node pressure, and with whole-number CPU requests it can get exclusive cores under the kubelet’s static CPU Manager policy. But it also means you’ve sized for the worst case for both scheduling and runtime. The scheduler thinks the pod always needs the full amount; the kernel won’t let it burst above it. You’ve optimised for both edges of the distribution simultaneously, which is usually wasteful.

What replaces it: requests at the typical workload, limits at the safety ceiling. Equal only when you genuinely need Guaranteed QoS and accept the cost of provisioning for worst case.
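As a rough sketch, the split looks something like this. The numbers are made up for illustration and should come from observed usage; a bare Pod is used only to keep the example short, and in practice the resources block sits in a Deployment’s pod template:

```yaml
# Illustrative values only; derive real ones from measured usage.
apiVersion: v1
kind: Pod
metadata:
  name: resources-example           # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0.0   # placeholder image
      resources:
        requests:                   # what the scheduler reserves: typical load
          cpu: 250m
          memory: 512Mi
        limits:                     # the safety ceiling enforced at runtime
          cpu: "1"                  # CPU above this is throttled
          memory: 1Gi               # memory above this gets the container OOM-killed
```

Because requests and limits differ, this pod lands in the Burstable QoS class rather than Guaranteed. That is the trade being made: denser packing in exchange for earlier eviction under node pressure.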

Specific failure mode I’ve seen: someone sets requests: 8Gi, limits: 8Gi because “the docs say to make them equal for predictability.” Cluster runs out of memory headroom because every pod is reserving 8Gi even when it normally uses 2Gi. Scheduler can’t pack pods densely. Cost goes up. Pods get OOMKilled anyway because actual usage occasionally peaks above 8Gi.

4. “Liveness and readiness are basically the same”

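Covered in detail in the healthcheck probes deep-dive. Short version: liveness asks “should this container be restarted?” and readiness asks “should this pod receive traffic?” Using the same endpoint for both produces predictable failure modes: a pod that is merely busy or waiting on a dependency gets restarted instead of just taken out of rotation. It’s the #1 cause of cluster thrashing I’ve seen during VM-to-K8s migrations.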

What replaces it: split the probes, give each its own endpoint, use startupProbe for anything with a meaningful cold start. The probe deep-dive has the configuration template.
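For orientation, a split configuration might look roughly like the following. This is not the template from the deep-dive; the endpoint paths, port, and timings are assumptions picked for the example:

```yaml
# Sketch only: /livez, /readyz, port 8080, and all timings are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: probes-example
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0.0   # placeholder image
      ports:
        - containerPort: 8080
      startupProbe:                 # holds off the other probes during cold start
        httpGet:
          path: /livez              # reuses the liveness endpoint (an assumption)
          port: 8080
        periodSeconds: 5
        failureThreshold: 30        # allows up to 30 x 5s = 150s to start
      livenessProbe:                # "should this container be restarted?"
        httpGet:
          path: /livez
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:               # "should this pod receive traffic?"
        httpGet:
          path: /readyz
          port: 8080
        periodSeconds: 5
        failureThreshold: 3
```

The idea is that /livez checks only the process itself, while /readyz is allowed to fail when a dependency is down. A failing readiness probe removes the pod from Service endpoints without restarting it.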

5. “Helm makes it portable”

Helm packages YAML and templates it with values. That’s what it does. That’s all it does.

Why it’s wrong: portability comes from writing manifests that depend only on standard Kubernetes primitives, having an explicit values story per environment, and testing on every cluster you claim to support. Helm doesn’t abstract away cluster differences, RBAC scopes, CRD versions, or environment-specific behaviour. A Helm chart that “deploys cleanly” is not a Helm chart that “works correctly.”

What replaces it: Helm is a packaging convention, not a portability layer. Use it for what it’s good at — composing related manifests, parametrising environment-specific values, providing an install/upgrade flow. Don’t expect it to insulate you from cluster differences. Test each cluster.
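An “explicit values story per environment” can be as simple as one values file per cluster. The keys below belong to a hypothetical chart; a real chart defines its own:

```yaml
# values-staging.yaml (hypothetical chart keys)
replicaCount: 2
image:
  tag: "1.4.2"
ingress:
  className: nginx                  # assumes staging runs an nginx ingress controller
---
# values-prod.yaml (hypothetical chart keys)
replicaCount: 6
image:
  tag: "1.4.2"                      # same artifact, different environment settings
ingress:
  className: alb                    # assumes prod uses a different ingress class
```

Each file is passed with helm’s -f (--values) flag at install or upgrade time. The differences between clusters are then written down where they can be reviewed, but Helm still won’t tell you whether the result works on that cluster. Only testing does.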

The bonus one nobody warns you about

“Everything should be :latest” is convenient until a node restart pulls a different image than the one the pod was running, and behaviour diverges silently across the fleet. With a :latest tag, imagePullPolicy defaults to Always, so every container start re-resolves the tag against the registry. You now have a heisenbug that only appears on pods that happened to be scheduled after the image was repushed.

The fix is mechanical: pin image tags. Use digests for the truly load-bearing components (the digest is a content hash; it can’t drift). Document the policy in the golden-path manifest so the next team to ship doesn’t repeat the mistake.
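In manifest terms the two levels of pinning look something like this. The image names are placeholders, and the digest is deliberately left as a marker to be replaced with the real value from your registry:

```yaml
# Sketch only: image names are hypothetical and the digest is a placeholder.
apiVersion: v1
kind: Pod
metadata:
  name: pinned-images-example
spec:
  containers:
    - name: web
      # Pinned tag: readable, but only as immutable as your registry policy.
      image: registry.example.com/web:1.4.2
    - name: payments
      # Pinned digest: a content hash, so the reference cannot drift.
      # Replace <64-hex-digest> with the actual sha256 digest of the image.
      image: "registry.example.com/payments@sha256:<64-hex-digest>"
```

A tag can still be repushed unless the registry enforces immutability, which is why the digest form is worth the extra friction for the components that matter most.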

How I teach against these

These five (six) come up in design reviews, code reviews, and incident postmortems. The pattern that worked at Kore was:

One misconception per lunch-and-learn session. Cover the wrong mental model, the cost when it’s not corrected, then the replacement.
Tie each one to a specific incident whenever possible. Stories stick; lists don’t. “Remember when service X got OOMKilled during Black Friday” is more memorable than “limits and requests should differ.”
Show the bad pattern and what it costs. Then show the corrected pattern and what it changes.
Add the corrected pattern to the golden-path manifest. The next team to ship gets it for free, without needing to attend the talk.

After about [TODO: months] of this rotation, the misconceptions still surface — but in design reviews, where they’re cheap to fix, not in production, where they’re expensive. The signal that adoption is working isn’t that misconceptions disappear; it’s that they get caught earlier each time.

VM-to-K8s migration pillar — the broader context for this work
Healthcheck probes — the deep-dive on misconception #4
Kore CI/CD pipeline — how deploys flow, including the image-tagging policy that fights misconception #6