Deep Dive draft

Service mesh tradeoffs: Istio, Linkerd, Kuma — picking the wrong one, then living with it

We picked Istio at Kore because we needed one mesh across both VMs and Kubernetes, and Linkerd didn't support VMs at the time. It was the right call for that constraint and a heavy choice for everything else. We later migrated from per-pod sidecars to Istio ambient mode, which eliminated the per-pod sidecar overhead. This is the comparison I'd actually use to pick a mesh today.

Tags: istio, linkerd, kuma, service-mesh, kubernetes, ambient-mode, tradeoffs

Service meshes are one of those categories where the right answer depends entirely on what constraint you’re optimizing for. Most of the comparison articles online don’t lead with “what’s your actual constraint?” They lead with feature matrices, which makes Istio look obviously best because Istio has the most features.

In practice, Istio’s features are also Istio’s costs. We picked Istio for a specific reason — VM support — and lived with the cost for everything else. This dive is that story plus the comparison I’d use today.

Why we picked Istio (a single constraint)

When we made the mesh decision at Kore, the platform was mid-migration from VMs to Kubernetes. Some services were on VMs (EC2, Azure VMs). Most were moving to Kubernetes (EKS, AKS). The mesh had to handle both.

That single constraint eliminated Linkerd. At the time, Linkerd didn’t support VM-based workloads. We’d have had to run something else for the VM services and Linkerd for the K8s services, then bridge them — operational complexity that defeated the point of having a mesh in the first place.

That left Istio and Kuma. Istio’s ecosystem was richer (more clients, more operator familiarity, more troubleshooting content). Kuma is interesting and we did consider it; it ended up losing on operational familiarity rather than on capability. If we were picking today with the same VM-and-K8s constraint, I’d give Kuma a more serious look — its multi-zone, multi-cloud story has matured significantly.

If we hadn’t had the VM constraint, Linkerd would have been the first choice. It’s lighter, simpler, and has fewer footguns. Several of the failure modes we’ve debugged in Istio production simply don’t exist in Linkerd, because Linkerd doesn’t have the features that cause them.

What Istio actually gave us

The marketing list for Istio is long. The list of things we actually use is shorter, and that’s normal — most mesh adoptions use a fraction of the available surface area.

mTLS. Service-to-service authentication across the mesh. Once the mesh is in place, this is essentially free. Doing it right without a mesh is expensive — you have to manage certificate rotation, distribution, and validation across every service yourself. The mesh removes a class of work that would otherwise consume real engineering time.

Observability. Request rates, latencies, and error rates per service and route, emitted by the proxy without any application changes. This was the biggest day-to-day win. The team got a baseline picture of the request path without touching application code — which mattered because instrumenting the monolith would have been a months-long effort we didn’t have budget for.

Configurable timeouts per route. Declared in VirtualService, varied by workload latency profile. Replaced ad-hoc client-side timeouts that were inconsistent across services. This is one of those things that sounds boring but eliminates a category of incidents (cascading timeouts because every client picked its own value).
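As a rough sketch of what per-route timeouts can look like, the snippet below builds a VirtualService manifest as a plain Python dict and prints it as JSON, which kubectl accepts. The service name orders, the /reports prefix, and both timeout values are hypothetical; the only part taken from the Istio API is the timeout field on an HTTP route.

```python
import json


def virtual_service(host, default_timeout, slow_routes):
    """Build a VirtualService with one timeout per route.

    slow_routes maps a URI prefix to a longer timeout for that prefix.
    Every name and value passed in here is an illustrative assumption.
    """
    destination = [{"destination": {"host": host}}]
    http = [
        {
            "match": [{"uri": {"prefix": prefix}}],
            "timeout": timeout,
            "route": destination,
        }
        for prefix, timeout in slow_routes.items()
    ]
    # The catch-all route goes last because HTTP routes are matched in order.
    http.append({"timeout": default_timeout, "route": destination})
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": host},
        "spec": {"hosts": [host], "http": http},
    }


vs = virtual_service("orders", "2s", {"/reports": "30s"})
print(json.dumps(vs, indent=2))
```

Keeping the values in one declared place is the point: a slow reporting route gets its own budget, and everything else inherits a tight default instead of whatever each client happened to choose.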

Outlier detection / circuit breaking. Shedding traffic from a misbehaving pod without a restart. The kind of thing that means a single bad pod stops causing user-visible errors before anyone is paged.
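For illustration, outlier detection is configured on a DestinationRule. The sketch below builds one as a Python dict. The host name and all four thresholds are assumptions chosen for the example, not the values we ran in production.

```python
import json


def outlier_detection_rule(host, consecutive_5xx=5, interval="10s",
                           base_ejection="30s", max_ejection_percent=50):
    """Build a DestinationRule that ejects hosts returning repeated 5xx errors.

    The defaults are placeholder values; tune them per workload.
    """
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "DestinationRule",
        "metadata": {"name": host},
        "spec": {
            "host": host,
            "trafficPolicy": {
                "outlierDetection": {
                    # Eject a host after this many consecutive 5xx responses.
                    "consecutive5xxErrors": consecutive_5xx,
                    # How often hosts are evaluated for ejection.
                    "interval": interval,
                    # Minimum time an ejected host stays out of rotation.
                    "baseEjectionTime": base_ejection,
                    # Cap on ejections so a bad deploy cannot empty the pool.
                    "maxEjectionPercent": max_ejection_percent,
                }
            },
        },
    }


dr = outlier_detection_rule("orders")
print(json.dumps(dr, indent=2))
```

The cap on ejection percentage is worth setting deliberately: without it, a fault that affects every pod could leave the proxy with nowhere to send traffic.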

What we evaluated but didn’t lean on

Rate limiting. We had internal application-level rate limiting already. Duplicating it at the mesh layer added complexity without clear benefit. The general lesson: don’t move a working capability to the mesh just because the mesh supports it.

Traffic splitting for canaries. Some use, but most canary routing happened at the ingress layer instead. Cleaner — fewer moving parts in the request path, easier to reason about during incidents.

The sidecar problem — and what we did about it

Per-pod Envoy sidecars sound fine until you do the math at scale.

Each sidecar consumes 50-100 MB of RAM and a measurable slice of CPU at non-trivial RPS. Multiply by pod count — at our scale, 1,000+ pods — and you’re paying for a shadow cluster of proxies. At one point we calculated that the aggregate sidecar overhead was costing us [TODO: meaningful number] in compute spend per month.
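The memory side of that math is easy to reproduce. The sketch below uses only the figures quoted above (1,000 pods, 50-100 MB per sidecar) and treats MB as MiB for simplicity. It deliberately stops at memory: turning that into a monthly cost needs your own instance pricing, which is not assumed here.

```python
def sidecar_memory_gib(pod_count, mb_per_sidecar):
    """Aggregate sidecar memory in GiB (treats MB as MiB for simplicity)."""
    return pod_count * mb_per_sidecar / 1024


low = sidecar_memory_gib(1000, 50)
high = sidecar_memory_gib(1000, 100)
print(f"{low:.1f} to {high:.1f} GiB of RAM spent on proxies")
# prints "48.8 to 97.7 GiB of RAM spent on proxies"
```

Run it with your own pod count and measured sidecar footprint rather than these round numbers.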

The operational overhead compounds. Sidecar injection has to be managed per namespace. Init containers extend pod startup time. Then there is the classic startup race: the application container starts before the sidecar is ready and fails its first few requests. The mitigation is explicit: set holdApplicationUntilProxyStarts: true in the proxy config, either mesh-wide or per workload through an annotation on the pod template. It is not the default, and I'm telling you now so you don’t debug it for hours like I did.
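One way to apply that setting per workload is the proxy.istio.io/config annotation on the pod template. The helper below is a minimal sketch that adds it to a Deployment represented as a dict. The Deployment itself is a trimmed, hypothetical example (no selector or labels), there only to show where the annotation lands.

```python
import json


def with_hold_until_proxy_starts(deployment):
    """Return a copy of a Deployment dict with the proxy-config annotation set."""
    patched = json.loads(json.dumps(deployment))  # cheap deep copy
    template_meta = patched["spec"]["template"].setdefault("metadata", {})
    annotations = template_meta.setdefault("annotations", {})
    # The annotation value is itself a small YAML/JSON document.
    annotations["proxy.istio.io/config"] = json.dumps(
        {"holdApplicationUntilProxyStarts": True}
    )
    return patched


# Hypothetical, trimmed Deployment; a real one also needs a selector and labels.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "orders"},
    "spec": {
        "template": {
            "spec": {"containers": [{"name": "app", "image": "example/orders:1.0"}]}
        }
    },
}

patched = with_hold_until_proxy_starts(deployment)
print(json.dumps(patched["spec"]["template"]["metadata"], indent=2))
```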

The fix is Istio ambient mode, which replaces per-pod sidecars with two components:

  • ztunnel — a DaemonSet (one per node) that handles L4 transparently: mTLS, connection tracking, basic telemetry. No per-pod overhead; the cost is fixed per node, which is dramatically cheaper at high pod count.
  • Waypoint proxies — per-namespace (or per-service) Envoy deployments for L7 policy: HTTP routing, retries, circuit breaking, AuthorizationPolicy. Deployed only where L7 policy is actually needed, not by default everywhere.

We migrated namespace-by-namespace. The mechanical part was straightforward — enable ambient mode on the namespace, remove sidecar injection annotations, deploy waypoint proxies for namespaces that needed L7. Existing VirtualService and DestinationRule resources mostly carried over.
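A sketch of what those mechanical steps can look like as manifests, built as Python dicts. It assumes sidecar injection was enabled through the istio-injection namespace label (revision-based installs use istio.io/rev instead) and uses the label and Gateway class names from recent Istio ambient documentation as I understand them. Verify both against the Istio version you run. The namespace payments and the waypoint name are hypothetical.

```python
import json


def ambient_namespace_patch(waypoint_name=None):
    """JSON merge patch for a Namespace: opt in to ambient, drop sidecar injection."""
    labels = {
        "istio.io/dataplane-mode": "ambient",
        # A null value in a JSON merge patch removes the key.
        "istio-injection": None,
    }
    if waypoint_name:
        # Only namespaces that need L7 policy get routed through a waypoint.
        labels["istio.io/use-waypoint"] = waypoint_name
    return {"metadata": {"labels": labels}}


def waypoint_gateway(namespace, name="waypoint"):
    """Gateway API resource that asks Istio to deploy a waypoint proxy."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "Gateway",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "gatewayClassName": "istio-waypoint",
            "listeners": [{"name": "mesh", "port": 15008, "protocol": "HBONE"}],
        },
    }


patch = ambient_namespace_patch(waypoint_name="waypoint")
gateway = waypoint_gateway("payments")
print(json.dumps(patch))
print(json.dumps(gateway, indent=2))
```

The patch can be applied with kubectl patch namespace payments --type merge -p followed by the printed JSON. Pods that were already injected keep their sidecars until they are restarted, so a rollout restart per workload is part of the migration.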

What improved:

  • Per-pod sidecar overhead eliminated. Cluster memory footprint reduced proportionally to pod count — meaningful at our scale.
  • Sidecar injection complexity removed. No more init container ordering races, no more holdApplicationUntilProxyStarts debugging.
  • Startup time per pod improved (no sidecar to initialize before the app can start).
  • ztunnel handles only L4, so traffic that needs no L7 policy skips HTTP parsing entirely, which lowers latency on high-throughput paths.

What to watch for:

  • Waypoint proxies are a new failure domain. If a waypoint is misconfigured or unhealthy, all L7 policy for that namespace is affected — previously it was per-pod and contained. Treat waypoints with the same operational seriousness as any stateless service.
  • Ambient mode was newer when we migrated. Less operational familiarity in the team meant a steeper learning curve for incidents. Production-ready, but expect to spend time understanding the new failure modes.
  • Not all Istio features have parity in ambient yet. Verify your specific VirtualService features before migrating.

What I’d pick today

If I were starting a new mesh deployment today, the decision tree would be:

  1. Do you need to mesh non-K8s workloads (VMs, bare metal)?

    • Yes: Istio or Kuma. Kuma is the cleaner choice operationally if you don’t already have Istio.
    • No: continue to step 2.
  2. Are you a single-team service mesh deployment, or are you a platform team supporting many app teams?

    • Single team: Linkerd. Lighter, simpler, fewer footguns, gets you 80% of the value at 30% of the operational cost.
    • Platform team: Istio (ambient mode). The feature surface area you’ll need eventually justifies the operational cost.
  3. What’s your existing operator familiarity?

    • This matters more than feature comparisons. A team that knows Istio will operate Istio better than they’ll operate Linkerd. A team starting fresh should probably pick the simpler tool.
  4. What’s your traffic pattern?

    • L4-heavy with mTLS as the main need: ambient-mode Istio or any mesh, frankly. The L4 features are commoditized.
    • L7-heavy with complex routing, traffic splitting, auth policies: Istio still has the richest surface area; Linkerd is catching up but not there yet for the most complex use cases.
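The first two questions reduce to a small function. This is only a sketch of the tree as written above: the function and parameter names are mine, and steps 3 and 4 (operator familiarity, traffic pattern) are judgment calls that are left out on purpose.

```python
def pick_mesh(needs_non_k8s, platform_team, already_runs_istio=False):
    """Encode steps 1 and 2 of the decision tree; steps 3 and 4 stay human."""
    if needs_non_k8s:
        # VMs or bare metal in the mesh: only Istio and Kuma qualify.
        return "Istio" if already_runs_istio else "Kuma (or Istio)"
    if platform_team:
        # Many app teams: the larger feature surface earns its cost.
        return "Istio (ambient mode)"
    # Single team, Kubernetes only: take the simpler tool.
    return "Linkerd"


print(pick_mesh(needs_non_k8s=True, platform_team=True))    # Kuma (or Istio)
print(pick_mesh(needs_non_k8s=False, platform_team=True))   # Istio (ambient mode)
print(pick_mesh(needs_non_k8s=False, platform_team=False))  # Linkerd
```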

WebSocket and the long-lived connection problem

One quirk worth flagging because it bit us: long-lived WebSocket connections (our RTM channel) don’t redistribute when you scale up. The connection is already established and sticky to a specific pod; new pods come up idle while existing pods stay overloaded.

This is not a service mesh problem per se — it’s a property of long-lived TCP connections. But the mesh affects how you mitigate it. We tuned connection draining (pods signal readiness to close existing connections before being removed from endpoints), tuned readiness probes (pods are not marked ready until they can actually serve RTM), and adjusted autoscaling to trigger earlier for RTM pools (so new pods come up before the existing pods are fully saturated). The mesh helps with the observability side — you can see per-pod RTM connection count — but it doesn’t redistribute connections for you.
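A toy model makes the imbalance concrete. The numbers are invented (four pods at 1,000 connections each, four pods added, 400 new connections arriving afterwards), and least-connections placement is assumed because it is the best case for evening things out.

```python
def scale_up(connections_per_pod, new_pods, new_connections):
    """Model a scale-up: existing connections stay put, only new ones are placed."""
    pods = list(connections_per_pod) + [0] * new_pods
    for _ in range(new_connections):
        # Send each new connection to the currently least-loaded pod.
        idx = pods.index(min(pods))
        pods[idx] += 1
    return pods


after = scale_up([1000, 1000, 1000, 1000], new_pods=4, new_connections=400)
print(after)
# prints [1000, 1000, 1000, 1000, 100, 100, 100, 100]
```

Even with ideal placement of new connections, the original pods carry exactly the load they had before the scale-up. That is why triggering autoscaling earlier matters more than any balancing setting: new capacity only helps connections that have not been opened yet.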

Gotchas

  • holdApplicationUntilProxyStarts: true is not the default. Set it, or debug mysterious startup failures under load.
  • Quantify sidecar overhead in your cluster before committing. Vendor numbers are best-case. At 1,000+ pods the math is not trivial.
  • Pick the mesh for a reason, then own that reason. “We picked Istio because we needed VM support” is defensible. “We picked Istio because it has the most features” is how you end up with operational pain you didn’t budget for.
  • Waypoint proxies need explicit ownership. In ambient mode, agree on who owns each namespace’s waypoint before something breaks.
  • Keep an explicit map of “mesh gives us X, app gives us Y.” The line blurs over time. When something breaks at runtime you need to know which layer to debug first.