Optimizing RabbitMQ for reliability and throughput

RabbitMQ has a deserved reputation for being one of those technologies that “just works” until it suddenly very much doesn’t.

For the first 18 months I was at Kore, RMQ was the messaging substrate I touched least. It was working. We had queues, we had consumers, things moved. Then we tried to scale past 800 concurrent users and RMQ became the entire problem.

This is the story of three distinct RMQ fights, each at a different scale level, each with its own personality.

Fight 1: ha-all and the cluster that ate itself (800 → 1,500 CCU)

The first signal we were in trouble was Linux load average on RMQ nodes. Sitting at 150+ on machines with 32 cores. That’s a 5× backlog of processes waiting for CPU. Queues were piling up, consumers were stalling, error rates were climbing.

But here’s the trap: every individual metric looked plausible. Memory was fine. Disk IOPS was fine. Network throughput was fine. Just the load average, sitting there at 150, mocking us.

Three things had compounded. They were all separately fixable; together they were fatal.

No CPU limits on RMQ pods. They were sharing nodes with application workloads and helping themselves to every spare core. Application pods would slow down because RMQ was eating their CPU; RMQ would slow down because its work was being interrupted by application workloads spinning up. Classic noisy-neighbor pattern, except RMQ was both the noisy neighbor and the neighbor being kept up at night.

Erlang scheduler count set to 72 — matched to the host vCPU count, not the pod’s allocated CPUs. Erlang’s runtime assumes its schedulers own dedicated CPU. With 72 schedulers and no actual CPU reservation, they were time-slicing against each other and against everything else on the node. The scheduler’s own metrics showed huge amounts of sleep time — the schedulers were spending more time waiting for CPU than doing work.

ha-mode: all mirror policy. Every queue mirrored to every node. In an 8-node cluster, every message gets seven copies. Under any meaningful write rate, the replication traffic dominates the actual work traffic. Network and disk become bottlenecks.

The fix was four changes in sequence — and I emphasize “in sequence” because each one’s effect was only legible after the previous one was applied:

CPU limit of 8 cores per RMQ pod. Node affinity so RMQ pods only schedule on dedicated RMQ nodes.
Erlang schedulers reduced from 72 → 8 (one per allocated CPU). Load average dropped from 150 to ~25 within minutes of the deploy. This was the single most dramatic moment of the entire scaling work — a graph that had been red for weeks suddenly went green.
Mirror policy ha-all → ha-two for high-throughput queues, ha-three for the few that genuinely needed it. Applied per queue class via name pattern, staged carefully across queue classes rather than big-bang:
```
rabbitmqctl set_policy ha-two \
  "^(runtime\.|worker\.).*" \
  '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic"}' \
  --priority 10 --apply-to queues
```
After this plus rebalancing jobs across nodes, error rate dropped from 6% to 0.05%.
Expanded the cluster: 8 → 32 nodes (c5.18xlarge: 72 vCPU, 144 GB, 25 Gbps). Reached 1,500 CCU.

Later we tuned schedulers back up to 16 after seeing high sleep time at 8 under moderate load. There’s no formula here — the right number depends on actual concurrency, instance type, and queue traffic pattern. Profile sleep time and tune.

We also considered quorum queues — they’re the modern replacement for classic mirrored queues, RMQ 3.13+ deprecates the old mirror policy — but deferred. The team was actively evaluating Kafka and Pulsar for some workloads, and migrating classic-mirrored → quorum → Kafka would be wasted intermediate work. The full quorum-queues-vs-everything-else thinking is in the message brokers compared deep-dive.

Fight 2: one big cluster, many problems (1,500 → 3,000 CCU)

With ha-all and scheduler tuning out of the way, we got to about 1,500 CCU before queues started piling up again. This time the metrics were less dramatic — load average was reasonable, mirror replication was bounded — but consumers couldn’t keep up and queue depth was creeping.

The diagnosis was more interesting. We were running one large RMQ cluster: 32 nodes, all queue classes mixed together. A few problems flowed from this:

Rebalancing storms. When a node joined or left the cluster, queue ownership had to redistribute. With many queues and many nodes, the rebalance took time and was CPU/network heavy. Any queue not actively being rebalanced was still affected by the rebalance traffic.
Cross-class interference. Analytics queues, runtime queues, and admin queues all shared the same cluster. A spike in analytics traffic would consume cluster resources that runtime needed. We had no isolation between workload classes that had wildly different SLOs.
One failure domain. If something went wrong with the cluster — a network partition, a botched config push, a buggy plugin — everything was affected.

The fix was conceptually simple, operationally complex: split into 4 independent clusters of 8 nodes each. One cluster per workload class (runtime, async work, analytics, admin).

The operational complexity was real. The application layer now had to know which cluster owned which queue class. We did this via environment variables per deployment — RMQ_RUNTIME_HOSTS, RMQ_WORKER_HOSTS, etc. — and a small library wrapping the connection logic. Migrations of existing queues to the new clusters had to be done carefully, queue by queue, with consumers watching both old and new during the transition.

What it bought us: blast-radius isolation. A spike in one workload class no longer affected the others. Rebalancing on one cluster didn’t touch the other three. Different mirror policies and instance sizes per cluster, tuned to the workload.

Result: 4,000 CCU at 100× server capacity, 4,400 at 120×. RMQ stopped being the bottleneck; the next constraint surfaced (Redis and MongoDB, covered in the scaling pillar).

Fight 3: consumer overhead at scale

This was a smaller fight in CCU terms but mattered for high-volume queues — analytics events, batch jobs, anything where individual messages were small and frequent.

The pattern most of our consumers used was the obvious one: pull one message, do the work, ack, repeat. For small messages with small work units, the per-message overhead — broker round-trip for the fetch, broker round-trip for the ack, all the per-call setup in the consumer process — was the same order of magnitude as the actual work. We were CPU-bound on overhead, not on doing things.

Adding more consumer pods worked for a while. It also compounded the broker load, because every consumer pod meant more channels, more open connections, more individual acks. We were trading consumer CPU for broker overhead.

The fix was bounded batching with explicit backpressure:

Batch shape. Pull up to N messages or wait T milliseconds, whichever happens first. Both N and T tuned per queue based on average message work-time. For high-volume queues, N was in the 50-200 range and T was around 50ms. Acks issued in bulk via multiple=true after batch work completes.

Prefetch. basic.qos (prefetch_count) set to roughly 2× batch size — enough to keep the consumer fed during processing, low enough that a single slow consumer doesn’t park a huge backlog.

Poison-message handling. This is the part that’s easy to get wrong, because a naive “reject the whole batch on any error” makes batching strictly worse than single-message processing in the face of bad data. The pattern: per-message try/catch inside the batch loop; on failure, ack the surrounding successful messages and individually nack the failing message to a DLQ. The batch boundary is for amortization, not atomicity.

Backpressure. When the consumer’s downstream (DB, downstream HTTP) signals slowness via increased latency, stop pulling new messages. RMQ’s flow control then surfaces back to producers naturally. The metric to watch is downstream call p95 latency; threshold tuned per workload.

Separately, we cut consumer thread pool from 128 → 10/11. At 128 threads per pod, lock contention was a meaningful fraction of CPU time. The actual concurrency the pod could serve was much lower. This single change moved SLA at 3,200 CCU from above 2 seconds to comfortably below — the kind of result that feels like it shouldn’t be that simple.

Where this leaves us

Phase	Change	Result
Mirror policy + scheduler	ha-all → ha-two, CPU limits, node affinity, scheduler tuning	800 → 1,500 CCU; load avg 150 → 15; error rate 6% → 0.05%
Cluster expansion	8 → 32 nodes	1,500 CCU at stable error rate
Topology	1×32 → 4×8 clusters	4,000–4,400 CCU; RMQ no longer the bottleneck
Consumer batching	Bounded batching, prefetch tuning, thread pool reduction	Per-pod throughput up significantly; broker round-trips down; SLA achieved at 3,200+

What I’d do differently

Set CPU limits and node affinity on RMQ from day one, not as a fix. The racing-for-cores problem existed because limits weren’t set. This is policy, not architecture. Every messaging tier should have it.

Plan for multiple clusters from the start. The 1 → 4 cluster migration on a live system was careful work. Designing for multiple clusters at the start would have been straightforward; retrofitting it took weeks of cautious migration.

Add DLQ routing before the first bad message in production. The batching implementation needed poison-message handling immediately; we didn’t build it until the first time a bad message stalled a queue. That gap shouldn’t have existed.

Get an explicit timeline for quorum queues. Classic mirrored queues are deprecated. We deferred quorum for good reasons (Kafka/Pulsar evaluation in flight), but “deferred” can quietly become “forgotten.” Better to have a “decide by Q3” item than no item at all.

Things people ask me about this

Was the actual bottleneck ha-all or the lack of CPU limits? Both, and they compounded. CPU limits alone would have helped some — pods would have stopped clobbering each other. ha-all alone would still have been a problem — replication traffic dominates at 8 nodes. The 150 load average came from both at once. Fixing one without the other gave partial improvement; fixing both is what moved the needle.

Why four clusters of eight rather than two clusters of sixteen? Workload class isolation. We had four natural groups with different traffic profiles, SLOs, and criticality. Splitting along those lines gave each class its own failure domain. Two clusters would have been a compromise — better than one, worse than four.

What’s the right batch size? It’s a function of per-message work time and per-call broker overhead. Heuristic: pick N so the time to process a batch is on the order of seconds, not tens of milliseconds (under-amortized) or tens of seconds (latency-violating). Sweep ±50% in load testing and pick the local optimum. The answer is different per queue.

Why not just move everything to Kafka? Kafka is the right answer for some of these workloads long-term. It’s not a 6-week project; it’s a multi-quarter migration. The RMQ fixes I describe here unblocked scaling now; they don’t preclude the Kafka migration later. The cost of switching messaging substrates is real; you want to do it when you’ve earned the scale that justifies it, not before.

Message brokers compared — RMQ ha-all/ha-two/quorum, Kafka, Pulsar tradeoffs
Related pillar: Scaling to 15k CCU — RMQ was the first of four bottlenecks; this pillar is the broader context
Related pillar: Analytics pipeline — uses Kafka rather than RMQ for stream processing