Deep Dive draft

Comparing message brokers: RabbitMQ, Kafka, Pulsar — and why we deferred quorum queues

We run RabbitMQ in production. The team has spent significant time evaluating Kafka and Pulsar as replacements for certain workload classes. This dive covers the actual operational tradeoffs, why we moved off ha-all without picking quorum queues, and the decision tree I'd use today if I were picking a broker from scratch.

Tags: rabbitmq, kafka, pulsar, messaging, ha, tradeoffs

The honest framing for this comparison: there is no “best” message broker. There are brokers that fit specific workload shapes, and there are brokers you already operate. The intersection of “fits the workload” and “your team can run it well” is usually the right answer.

This dive is the comparison I wish I’d had when I started touching the messaging tier at Kore. It covers RabbitMQ’s HA policy choice (ha-all → ha-two), why we deferred quorum queues despite the deprecation notice, and what an honest Kafka vs RMQ vs Pulsar tradeoff looks like for our workloads.

What we run, and why

Kore runs RabbitMQ as the primary work bus. Classic mirrored queues (not quorum yet, more on that). Four independent clusters of 8 nodes each, segregated by workload class. Mirror policy ha-two for high-throughput queues, ha-three for highest-criticality.

We also run Kafka for the analytics pipeline (Mongo → Kafka → Glue → Hudi, covered in the analytics pipeline pillar). It’s not a replacement for RMQ in production yet, but the team has done serious evaluation work on whether some workload classes should migrate.

Pulsar was evaluated and deferred. More on that below.

Why ha-two and not ha-all (or quorum queues)

This is the most consequential RMQ decision we made, so it gets the most space.

ha-all mirrors every queue to every node. In an 8-node cluster, every message is stored 8 times: once on the queue master and once on each of the 7 mirrors. Sounds safe; it’s the default people reach for when they want availability. Two problems become severe at scale:

  1. Replication overhead grows with cluster size. Every write triggers fan-out to every other node. Network and disk on each node are doing 7× the work of the actual write.
  2. Partition recovery is catastrophically slow. When the cluster heals after a network partition, the mirrors on rejoining nodes have to re-sync every queue from its master, and under ha-all that means every queue in the cluster. With large queues and many nodes, the sync storm can take minutes to tens of minutes, during which the cluster is degraded.

At Kore, ha-all was the policy when I got there. Node load average sat at 150+ even with only 8 nodes. We thought it was a capacity problem and added more nodes; the problem got worse because more nodes meant more replication targets.

ha-two keeps each queue on exactly two nodes. Tolerates one node loss (the surviving mirror takes over). Memory and sync cost is bounded by the queue itself, not by cluster size. Adopted for our high-throughput queues.

ha-three does the same with three. Tolerates two-node loss simultaneously. Adopted for a small set of highest-criticality state queues where surviving two concurrent failures was a requirement.
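The difference in replication cost is easy to see with a little arithmetic. The sketch below is only an illustration: `mirror_writes` is a hypothetical helper, not a RabbitMQ tool, and it counts nothing more than the number of mirror copies one publish has to reach under each policy.

```shell
#!/bin/sh
# mirror_writes NODES MODE [COPIES]
# Prints the number of mirror writes a single publish triggers,
# i.e. copies of the message beyond the queue master.
# Hypothetical helper for illustration only.
mirror_writes() {
  nodes=$1
  mode=$2
  copies=${3:-1}
  case "$mode" in
    all)
      # ha-all: every other node in the cluster holds a mirror.
      echo $((nodes - 1))
      ;;
    exactly)
      # ha-mode "exactly": total copies cannot exceed the cluster size.
      if [ "$copies" -gt "$nodes" ]; then
        copies=$nodes
      fi
      echo $((copies - 1))
      ;;
    *)
      echo "unknown mode: $mode" >&2
      return 1
      ;;
  esac
}

# Compare the policies as the cluster grows.
for n in 8 12 16; do
  echo "nodes=$n ha-all=$(mirror_writes "$n" all) ha-two=$(mirror_writes "$n" exactly 2) ha-three=$(mirror_writes "$n" exactly 3)"
done
```

Under ha-all the count rises with every node added, which is why adding capacity made things worse. Under ha-two and ha-three it stays at 1 and 2 regardless of cluster size.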

Applied per queue class via name pattern, staged carefully — not big-bang:

rabbitmqctl set_policy ha-two \
  "^(runtime\.|worker\.).*" \
  '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic"}' \
  --priority 10 --apply-to queues

The result was dramatic: load average dropped from 150+ to ~25, error rate from 6% to 0.05%. The full operational sequence is in the RMQ optimizations pillar.
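The ha-three policy has the same shape as the ha-two command above, with only the name, pattern, and copy count changed. As a minimal sketch of how a staged rollout might be prepared, the helper below prints the command for review instead of running it. `mirror_policy_cmd` is a hypothetical function, and the `^state\.` pattern is an assumed name for the highest-criticality queues, not the real one.

```shell
#!/bin/sh
# mirror_policy_cmd NAME PATTERN COPIES
# Prints (does not run) a set_policy command for an "exactly N" mirror policy,
# using the same definition keys and flags as the ha-two command.
mirror_policy_cmd() {
  name=$1
  pattern=$2
  copies=$3
  definition=$(printf '{"ha-mode":"exactly","ha-params":%d,"ha-sync-mode":"automatic"}' "$copies")
  printf "rabbitmqctl set_policy %s '%s' '%s' --priority 10 --apply-to queues\n" \
    "$name" "$pattern" "$definition"
}

# Assumed pattern for the state queues; substitute the real one.
mirror_policy_cmd ha-three '^state\.' 3
```

Printing first makes it possible to check the pattern against the live queue list and apply one queue class at a time. If two policy patterns could match the same queue, give them different priorities so the winner is explicit.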

So why not quorum queues?

Quorum queues are the modern Raft-based replacement for classic mirrored queues. They have better durability guarantees, simpler failover semantics, and they’re the recommended path forward: classic mirrored queues have been deprecated since 2021 and are removed entirely in RabbitMQ 4.0.

We evaluated them and deferred. Three reasons:

  1. Consumer code compatibility. Quorum queues have stricter consumer semantics: no auto-ack, different message redelivery behaviour. A non-trivial portion of our consumer code would need review and potentially changes. Not catastrophic, but not free either.
  2. Memory footprint. Higher per-message than classic mirrored for some of our workloads. Would need to be re-benchmarked at our message volumes before committing.
  3. The longer-term broker strategy is in flux. The team is actively evaluating Kafka and Pulsar for some workload classes. Migrating classic-mirrored → quorum → Kafka is wasted intermediate work. Deferring quorum queues preserved optionality.

This is a deferred adoption, not a permanent rejection. The deprecation will eventually force the migration — better to plan it with a clear timeline than discover it during an emergency RMQ upgrade.
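Planning that timeline starts with knowing which queues are affected. The sketch below is one rough way to take that inventory, under two assumptions: mirror policies are named with an `ha-` prefix (as ours are), and the input is tab-separated `name`, `type`, `policy` rows. `classic_mirrored` is a hypothetical helper and the sample rows are invented. Check the `rabbitmqctl` flags in the comment against your installed version before relying on them.

```shell
#!/bin/sh
# classic_mirrored
# Reads tab-separated "name<TAB>type<TAB>policy" rows on stdin and prints the
# names of classic queues whose policy name starts with "ha-".
classic_mirrored() {
  awk -F '\t' '$2 == "classic" && $3 ~ /^ha-/ { print $1 }'
}

# Against a live node (not run here):
#   rabbitmqctl -q list_queues --no-table-headers name type policy | classic_mirrored

# Invented sample rows, for illustration only.
printf 'runtime.jobs\tclassic\tha-two\nstate.session\tclassic\tha-three\nscratch.tmp\tclassic\t\naudit.events\tquorum\t\n' \
  | classic_mirrored
```

On the sample rows this prints `runtime.jobs` and `state.session`: the unmirrored classic queue and the quorum queue are skipped.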

RabbitMQ vs Kafka: the comparison that actually matters

If you read the marketing copy you’d think RMQ and Kafka are competitors. They’re not, really. They optimize for different things, and the workload you have tells you which one fits.

Dimension | RabbitMQ | Kafka
Message model | Queue (work distribution) | Log (replayable stream)
Per-message overhead | Higher (acks, routing) | Lower (append-only, batch)
Throughput ceiling | Tens to hundreds of thousands of msg/sec per cluster | Millions of msg/sec
Latency | Sub-millisecond achievable | Single-digit milliseconds typical
Consumer model | Push (broker delivers up to the prefetch limit; consumer works, acks) | Pull (consumer reads from offset)
Replay | Hard (messages gone after ack) | Native (rewind to any offset)
Routing | Rich (exchanges, bindings, headers) | Minimal (partition + key)
Operational complexity | Moderate | Higher (especially with the KRaft vs ZooKeeper choice)

The pattern that decides for me: if you need work distribution with backpressure and per-message acks, RabbitMQ. If you need event streaming with replay and multiple independent consumers, Kafka.

At Kore, the runtime job queues are work-distribution — pull a job, do it, ack, repeat — and RMQ fits cleanly. The analytics pipeline is event-streaming — many consumers, replay needed for backfills, partition-ordered processing — and Kafka fits cleanly. We didn’t pick one broker for everything; we picked the right broker for each workload class.

The workloads in the middle — high-volume async work with replay-ish characteristics — are where the tradeoff gets interesting. For some of those, Kafka would be the right long-term answer at our scale. The reason we haven’t migrated yet is operational cost: every workload that moves to Kafka adds a new operational surface. We do it deliberately, not opportunistically.

Where Pulsar fits

Pulsar is the broker I find most theoretically interesting and have used the least in production.

It tries to unify queue (RMQ-like) and stream (Kafka-like) semantics in one system, with native multi-tenancy and tiered storage. On paper it’s exactly what you’d want if you had to pick one broker for everything.

In practice, the team’s evaluation flagged:

  • Operational complexity is real. Pulsar’s architecture (brokers + BookKeeper + ZooKeeper, plus Pulsar Proxy in many setups) is more moving parts than either RMQ or Kafka individually. You need more operational depth, especially for failure modes.
  • Ecosystem and tooling are thinner. RMQ has decades of client libraries and operational tools. Kafka has the Confluent ecosystem and immense community. Pulsar has both, but smaller and patchier.
  • The “best of both worlds” pitch hides tradeoffs. Queue semantics in Pulsar work; they’re not identical to RMQ. Stream semantics in Pulsar work; they’re not identical to Kafka. For workloads that genuinely need both, Pulsar is great. For workloads that fit cleanly into one model, you’re paying complexity tax.

We deferred adoption. The right time to revisit Pulsar would be either a workload that genuinely needs unified queue+stream semantics, or a long-term consolidation effort that takes “we run three brokers” and tries to make it “we run one.”

The decision tree I use now

If I’m picking a broker for a new workload today:

  1. Is it work distribution (jobs to be done) or event streaming (a log of things that happened)?

    • Work: continue to step 2.
    • Streaming: continue to step 3.
    • Both/unclear: Pulsar is worth a serious look; otherwise pick the dominant model.
  2. Work distribution — what’s the throughput requirement?

    • Hundreds of K/sec or less: RabbitMQ (with ha-two, dedicated nodes, CPU limits, sensible scheduler tuning).
    • Higher: Kafka with consumer groups, accepting the worse fit for work-distribution semantics.
  3. Streaming — what’s the replay requirement?

    • Unbounded replay, long retention: Kafka.
    • Bounded, short-term: RabbitMQ Streams (yes, this exists; under-used).
  4. What does your team already operate well?

    • This is the question nobody asks. If you have 5 years of RMQ ops experience and zero Kafka, RMQ is probably the right answer even if Kafka technically fits better. Operational competence dominates technology choice.
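Steps 1 to 3 are mechanical enough to write down as code. The function below is a sketch, not a tool we run: `pick_broker` and its argument names are hypothetical, and the 500000 msg/sec cutoff is an assumed stand-in for "hundreds of K/sec". Step 4 is deliberately left out, because it is a human override on whatever this prints.

```shell
#!/bin/sh
# pick_broker MODEL MSGS_PER_SEC REPLAY
#   MODEL:        work | stream | both
#   MSGS_PER_SEC: integer peak rate (only used for MODEL=work)
#   REPLAY:       long | short (only used for MODEL=stream)
pick_broker() {
  model=$1
  rate=$2
  replay=$3
  case "$model" in
    work)
      # Step 2: throughput decides. The cutoff is an assumption.
      if [ "$rate" -le 500000 ]; then
        echo "rabbitmq"
      else
        echo "kafka"
      fi
      ;;
    stream)
      # Step 3: replay requirement decides.
      if [ "$replay" = "long" ]; then
        echo "kafka"
      else
        echo "rabbitmq-streams"
      fi
      ;;
    both)
      # Step 1, third branch: worth a serious look, not an automatic pick.
      echo "pulsar"
      ;;
    *)
      echo "unknown model: $model" >&2
      return 1
      ;;
  esac
}

pick_broker work 50000 short     # runtime job queues
pick_broker stream 200000 long   # analytics pipeline
```

The two calls print `rabbitmq` and `kafka`, matching where those workloads actually live. Treat the output as a starting point for step 4, not as the answer.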

Things people ask me about this

Why RabbitMQ in 2026? Because it works, our team operates it well, and the replacement cost (any of: migrate to quorum, migrate to Kafka, migrate to Pulsar) is currently higher than the operational pain. The economics will shift; until they do, optimising the broker we have beats migrating to one we’d have to learn.

Why not just move everything to Kafka and be done? Kafka isn’t great at work distribution. For our runtime job queues — which are work distribution with strict ack semantics — Kafka would be a worse fit operationally. The right model is “different brokers for different workload classes,” not “one broker to rule them all.”

Will you migrate to quorum queues eventually? Yes. The deprecation will force it. The trigger is either an upgrade to RabbitMQ 4.0 or later, which removes classic mirrored queues, or a workload that genuinely needs quorum’s durability guarantees over what ha-three already gives us. Until then we’re carrying the deprecation debt deliberately.

What’s the cost of running RMQ at your scale? Significant infrastructure cost (4 clusters × 8 nodes × c5.18xlarge isn’t cheap), modest operational cost (a few hours per week of platform time once the post-scaling fixes settled in). The cost of not running it well — what the 800-CCU ceiling cost us — was much higher.