Scaling a production system to 15k concurrent users

The number on the slide was 15,000.

That was the engineering target: enterprise sales had committed contracts that needed it, the SaaS environment was onboarding new clients, and on-prem customers were expanding. Infra needed to scale urgently, and cost was a real concern. At the time, production was running a mix of bot complexities at around 4k concurrent users on 60× capacity.

A previous NFR experiment had topped out at 6k CCU before degrading. The gap between production load and the target was roughly 4×, and every team that had looked at it had landed on the same conclusion: throw more hardware at it.

It didn’t work past 6k. That’s where I came in.

This is the story of getting from 6k to 13k. It took about a month of load testing, a sequence of bottlenecks that surfaced one after another, and a lot of staring at dashboards that said “everything looks fine” while the system was clearly not fine.

What the platform actually is

Kore.AI is a multi-tenant conversational-AI platform — voice (IVR), chat (WebSocket/RTM), FAQ, ML-driven flows. The runtime is a Node.js monolith we call koreserver, surrounded by specialised containers: Chatscript (CS), ML inference, FAQ engine, message consumers, analytics. State lives in MongoDB (sharded), Redis (ElastiCache, for cache and session), RabbitMQ (job and event bus), and a shared NFS mount for cross-pod logs and artefacts. Everything runs on both AWS (EKS) and Azure (AKS), in different regions for different customer commitments.

“Concurrent user” needs a definition, because it’s the entire benchmark. We mean an active session — either an inflight HTTP request, an in-progress IVR call, or an open WebSocket. RTM (WebSocket) sessions are long-lived and cheap per request; voice (IVR) and webhook sessions are shorter but heavier — more service calls per turn. Our SLA target for moderate-complexity bots: total response under 2 seconds.

Our reference bot throughout this work was the C-IVR bot: the “Confirm Appointment” task on the IVR channel, moderate complexity. Unless otherwise stated, CCU numbers in this piece refer to C-IVR load. At the start of this work, C-IVR topped out at 2,400 CCU regardless of how much infrastructure was added.

My role: I owned this workstream end-to-end under Chandrasekhar Poshamolla, our Engineering Director. Infrastructure changes, load testing, RMQ/Redis/MongoDB tuning — that was mine. Application-side changes were collaborative; I drove the platform side.

The starting point: 800 users, and the dashboards lie

At 800 concurrent C-IVR users, with 20× server capacity, the system would degrade. Hard. Latency climbed non-linearly, error rates spiked, and requests started dropping. The previous NFR experiment had pushed this to 2,400 by adding hardware and then hit a wall — more resources produced no improvement.

The first thing I had to accept: more hardware wasn’t the answer. The bottleneck was somewhere specific, and it wasn’t showing up in the obvious places. CPU graphs looked relaxed. Memory was fine. MongoDB had headroom. NFS wasn’t saturated.

But RMQ node load average was sitting at 150+.

If you’re not familiar with Linux load average: a sustained value north of the core count means tasks are queueing, usually for CPU (Linux also counts tasks blocked in uninterruptible I/O). 150 on a machine with 32 cores is a backlog of nearly 5×; the cluster was thrashing. Nobody had noticed because nobody was watching load average on the message broker. Everyone was watching CPU utilisation, IOPS, and queue depth on the components they expected to be the bottleneck.
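The same check is easy to script. The sketch below normalises load average by core count; it uses only the Python standard library, and the function names and the 1.0 threshold are illustrative assumptions rather than the tooling used during this work.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Return load average divided by core count (values above 1.0 suggest queueing)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


def is_queueing(load_avg: float, cores: int, threshold: float = 1.0) -> bool:
    """True when per-core load exceeds the (assumed) threshold."""
    return load_per_core(load_avg, cores) > threshold


def current_load_per_core() -> float:
    """Read the 1-minute load average of the local machine (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return load_per_core(one_minute, os.cpu_count() or 1)


# The figures from the RMQ nodes: load average 150 on 32 cores.
print(f"{load_per_core(150, 32):.1f}x")  # prints 4.7x
```

Pointing `current_load_per_core()` at the broker nodes, not just the application nodes, is the part that matters.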

That’s lesson zero: metric selection is half the diagnosis. CPU% on each component tells you almost nothing if you’re watching the wrong components. Load average on RMQ told us a bigger story.

Bottleneck #1: RabbitMQ (Week 0–1, 800 → 1,500 CCU)

The first few days of looking at RMQ told the whole story. Three things had compounded:

No CPU limits. The pods would happily consume every spare core on the node, competing with koreserver, NLP, and everything else scheduled there.

Erlang scheduler count set to 32. Each RMQ pod had a CPU request of 8 cores, but the Erlang scheduler count was matched to the host vCPU count — 32. Under CFS scheduling with no CPU limits, you end up with 32 scheduler threads fighting for cores they don’t own. When multiple RMQ nodes land on the same instance, the contention compounds. Most schedulers spend time sleeping, not processing.

ha-all mirror policy. Every queue mirrored to every node. In an 8-node cluster, every message gets replicated 7 times. Under any meaningful write load, replication traffic alone saturates the network and disk across every node simultaneously.

The fix was four changes in sequence, each validated by load test before the next:

1. CPU limit of 8 cores per RMQ pod, plus node affinity to isolate RMQ pods on dedicated nodes (no sharing with application workloads).
2. Reduced active Erlang schedulers from 32 → 8 (one per allocated CPU, not per host CPU). Load average dropped from 150 to ~25 almost immediately. This one felt magical.
3. Switched ha-all to ha-two for high-throughput queues and ha-three for the highest-criticality queues. Applied per queue class via rabbitmqctl set_policy with a regex match against queue names, as a staged rollout rather than a big bang. Error rate dropped from 6% to 0.05%.
4. Expanded the RMQ cluster from 8 → 32 nodes (c5.18xlarge: 72 vCPU, 144 GB, 25 Gbps network).
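The "one scheduler per allocated CPU" rule can be derived from the pod's own CPU quota instead of being hard-coded. The sketch below is a minimal illustration, assuming cgroup v2 (where the quota is exposed as `cpu.max`) and assuming the scheduler count is passed to the Erlang VM with the `+S` flag, for example through `RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS`. How the value was actually set in this cluster isn't stated here, and the function names are hypothetical.

```python
import math
from typing import Optional


def cpus_from_cpu_max(cpu_max: str) -> Optional[int]:
    """Parse cgroup v2 cpu.max ("<quota> <period>" or "max <period>").

    Returns the CPU limit rounded up to whole CPUs, or None when unlimited.
    """
    quota, period = cpu_max.split()
    if quota == "max":
        return None
    return max(1, math.ceil(int(quota) / int(period)))


def erlang_scheduler_flag(cpu_max: str, host_cpus: int) -> str:
    """Build an Erlang "+S total:online" flag sized to the pod, not the host."""
    limit = cpus_from_cpu_max(cpu_max)
    schedulers = host_cpus if limit is None else min(limit, host_cpus)
    return f"+S {schedulers}:{schedulers}"


# An 8-core limit on a 32-vCPU host yields 8 schedulers instead of 32.
print(erlang_scheduler_flag("800000 100000", host_cpus=32))  # +S 8:8
```

With no limit set, the function falls back to the host CPU count, which is exactly the 32-scheduler situation described above.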

Result: 1,500 CCU. We also upgraded NFS to 25k IOPS at the same point — which had zero measurable effect, confirming that RMQ was the constraint we’d just fixed.

The full HA-policy reasoning — including why we didn’t move to quorum queues — is in the message broker comparison deep-dive. Short version: the team was evaluating Kafka and Pulsar; migrating classic-mirrored → quorum → Kafka would have been wasted intermediate work.
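For a sense of what the policy change buys, the sketch below counts mirror copies per message and builds a `rabbitmqctl set_policy` argument list. It assumes that ha-two and ha-three mean two and three total replicas (leader included), expressed with RabbitMQ's `ha-mode: exactly`; the queue-name pattern is hypothetical, and the real policy definitions aren't shown in this piece.

```python
import json
from typing import List, Optional


def mirrors_per_message(nodes: int, copies: Optional[int] = None) -> int:
    """Mirror copies written per message, excluding the queue leader.

    copies=None models mirroring to every node (ha-all); otherwise `copies`
    is the total replica count, leader included, capped at the cluster size.
    """
    total = nodes if copies is None else min(copies, nodes)
    return total - 1


def set_policy_args(name: str, pattern: str, copies: int) -> List[str]:
    """Argument list for `rabbitmqctl set_policy` using ha-mode "exactly"."""
    definition = {"ha-mode": "exactly", "ha-params": copies}
    return ["rabbitmqctl", "set_policy", "--apply-to", "queues",
            name, pattern, json.dumps(definition)]


print(mirrors_per_message(8))      # 7 mirrors per message with ha-all on 8 nodes
print(mirrors_per_message(8, 2))   # 1 mirror per message with two replicas
# "^bulk\." is a made-up queue-name prefix for illustration.
print(" ".join(set_policy_args("ha-two", r"^bulk\.", 2)))
```

Note that under ha-all the replication cost grows with cluster size, so expanding the cluster without changing the policy would have made it worse.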

Bottleneck #1.5: still RabbitMQ, but different (Week 2)

Queues kept piling up despite the node expansion. Two things went wrong simultaneously.

We’d reduced schedulers to 8 in Week 1, which was right when each pod was being clobbered. But with isolated nodes and CPU limits in place, the schedulers were now under-utilised — Erlang’s scheduler sleep time was climbing. We tuned back up to 16. The lesson: scheduler tuning is empirical. Profile sleep time, adjust, repeat. There’s no formula.

The second problem was Botkit. We were running one Botkit pod per two app pods, and they were restarting under load. Botkit writes logs directly to NFS without logrotate — a design choice we’d come to regret. Under sustained load, inode pressure from unbounded log growth slowly degraded NFS performance for every other workload sharing the mount. We reduced Botkit count to 5, which was sufficient for 120× scale, and the restart cascade stopped.

Bottleneck #2: Redis and MongoDB connection fan-out (Week 3, 1,500 → 3,000 CCU)

With RMQ stable, the next constraint surfaced almost immediately.

Redis (ElastiCache, 3 shards × 4 replicas) crossed 94% engine CPU. Redis is single-threaded per shard — once a shard is pegged, you cannot make it faster by adding replicas. Your only option is more shards. We went 3 → 4, then to 6 shards × 3 replicas on cache.c7gn.8xlarge. Engine CPU distributed across more processes; problem resolved.
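Why shards rather than replicas: in cluster mode every key hashes to exactly one of 16,384 slots, and each slot is owned by one shard, so adding shards is what spreads keys across more engine processes. The sketch below follows the Redis Cluster key-slot rule (CRC16 of the key, or of its hash tag, modulo 16384). The even, contiguous slot split in `shard_for_slot` is an assumption for illustration; the real slot layout of this ElastiCache cluster isn't stated here.

```python
import binascii

HASH_SLOTS = 16384


def key_slot(key: bytes) -> int:
    """Redis Cluster hash slot for a key, honouring {hash tags}."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end > start + 1:  # only a non-empty tag replaces the key
            key = key[start + 1:end]
    # binascii.crc_hqx with initial value 0 is CRC16-CCITT (XMODEM).
    return binascii.crc_hqx(key, 0) % HASH_SLOTS


def shard_for_slot(slot: int, shards: int) -> int:
    """Shard index, assuming slots are split into equal contiguous ranges."""
    return slot * shards // HASH_SLOTS


slot = key_slot(b"123456789")
print(slot)                      # 12739
print(shard_for_slot(slot, 3))   # shard index with 3 shards
print(shard_for_slot(slot, 6))   # shard index with 6 shards
```

Keys sharing a hash tag always land on the same shard, so a hot tag can still peg a single shard regardless of shard count.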

MongoDB was more interesting. MongoS slow query counter was showing 13,000+ slow queries per second — but MongoD CPU was low. That combination is diagnostic: the queries aren’t slow because the shards are saturated, they’re slow because something at the routing tier is choking.

What we found: every pod (250+) was opening connections to every MongoS instance, round-robin via the connection string. With dozens of pod types, multiple replicas, and multiple MongoS instances, the total connection count was enormous and MongoS network interfaces were saturating. We tried adding 8 more MongoS instances — it didn’t help. Total connection count stayed the same; it was just spread across more endpoints.

The actual fix was the opposite: we updated each pod’s config to connect to exactly one MongoS instance instead of all of them. Since we had 8 MongoS nodes, each pod’s connection count fell to one-eighth of what it was. Network load fell, slow queries dropped, and the SLA recovered. Same data, same shards, same query patterns; the only change was which MongoS each pod connected to.
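The arithmetic behind pinning is simple, because drivers keep a separate connection pool per MongoS in the connection string. The sketch below is one possible way to do the pinning, assuming a deterministic hash of the pod name picks the MongoS. The host names, the database name, and the pool size of 10 are hypothetical; how pods were actually assigned isn't described here.

```python
import hashlib
from typing import Sequence


def pick_mongos(pod_name: str, mongos_hosts: Sequence[str]) -> str:
    """Deterministically pin a pod to one MongoS by hashing its name."""
    digest = hashlib.sha256(pod_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(mongos_hosts)
    return mongos_hosts[index]


def mongo_uri(hosts: Sequence[str], database: str = "app") -> str:
    """Connection string listing only the given MongoS hosts."""
    return f"mongodb://{','.join(hosts)}/{database}"


def total_connections(pods: int, mongos_per_pod: int, pool_size: int) -> int:
    """Upper bound on connections: one pool per MongoS per pod."""
    return pods * mongos_per_pod * pool_size


hosts = [f"mongos-{i}:27017" for i in range(8)]  # hypothetical host names
print(mongo_uri([pick_mongos("koreserver-7f9c", hosts)]))
print(total_connections(250, 8, 10))  # every pod to every MongoS: 20000
print(total_connections(250, 1, 10))  # one MongoS per pod: 2500
```

Hashing the pod name keeps the assignment stable across restarts, though the spread across MongoS instances is only approximately even.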

We also switched the ML model from Ontology to Few-shot at this stage, and dropped the ml-embeddings memory limit from 25 GB to 8 GB per pod — it was heavily under-utilised. That’s a 3× density improvement on a memory-bound workload, which made the scale-out arithmetic considerably friendlier.

Result: 3,000 CCU.

Bottleneck #3: thread pools, log noise, and RMQ topology (Week 4, 3,000 → 5,000 CCU)

The remaining lift in this phase came from several smaller changes rather than one large one. Two mattered most.

Consumer thread pool: 128 → 10. Here, “thread pool” refers to the worker concurrency setting in the consumer service (Node.js, using the default libuv thread pool or, in some cases, custom worker pools per library). It’s tempting to assume that more threads means higher throughput, but at 128 threads per pod we saw excessive lock contention and context switching: the system spent more time coordinating threads than processing messages. Reducing the pool to 10, in line with the pod’s true concurrency (CPU count and I/O pattern), improved performance dramatically. Total response time at 3,200 CCU fell from over 2 seconds to comfortably below the 2-second SLA, with more predictable latency and less jitter.
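The idea of capping worker concurrency can be shown in a few lines. The consumer service itself is Node.js, so the sketch below is only a Python analogue: a bounded pool that processes messages and records the peak number of handlers running at once. The function name and the handler are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple


def consume(messages: Iterable, handler: Callable, workers: int = 10) -> Tuple[List, int]:
    """Process messages with at most `workers` concurrent handlers.

    Returns the results in input order and the peak concurrency observed.
    """
    active = 0
    peak = 0
    lock = threading.Lock()

    def tracked(message):
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        try:
            return handler(message)
        finally:
            with lock:
                active -= 1

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(tracked, messages))
    return results, peak


results, peak = consume(range(50), lambda m: m * 2, workers=10)
print(len(results), peak <= 10)
```

The right pool size is still something to measure under load; the cap just makes the setting explicit instead of inheriting a large default.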

RMQ topology: 1 × 32 nodes → 4 × 8 nodes. A single 32-node cluster is one failure domain. When a node restarts, the rebalancing storm touches every queue. When one queue class misbehaves, it affects all others. Splitting into 4 independent clusters meant teaching the application which cluster owns which queue class — operationally more complex — but gave us blast-radius isolation. Result: 4,000 CCU at 100× capacity, 4,400 CCU at 120×.
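"Teaching the application which cluster owns which queue class" amounts to a routing table. The sketch below shows one minimal shape for it; the queue-name prefixes and cluster endpoints are invented for illustration and are not the real queue classes.

```python
import re
from typing import List, Pattern, Tuple

# Hypothetical queue-name prefixes and cluster endpoints, for illustration only.
QUEUE_CLASS_RULES: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"^rt\."), "amqp://rmq-realtime:5672"),
    (re.compile(r"^nlp\."), "amqp://rmq-nlp:5672"),
    (re.compile(r"^analytics\."), "amqp://rmq-analytics:5672"),
    (re.compile(r"^logs\."), "amqp://rmq-logs:5672"),
]


def cluster_for_queue(queue_name: str) -> str:
    """Return the endpoint of the cluster that owns this queue's class."""
    for pattern, endpoint in QUEUE_CLASS_RULES:
        if pattern.search(queue_name):
            return endpoint
    # Fail loudly: a silent default would undo the blast-radius isolation.
    raise LookupError(f"no cluster owns queue {queue_name!r}")


print(cluster_for_queue("rt.ivr.turns"))
```

Raising on an unmapped queue is a deliberate choice in this sketch: new queue classes must be assigned to a cluster explicitly.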

We also made certain log streams conditional: log_for_debug and log_for_transition were burning RMQ queue capacity and NFS inode budget for telemetry nobody was reading at runtime. Making them flag-based — only enabled for customers who explicitly need them — got us to 5,000 CCU at 110×.
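A flag-gated log stream can be as small as a check before publishing. The sketch below is a hedged illustration: the stream names `log_for_debug` and `log_for_transition` come from the text above, while the `TenantLogFlags` class and the `publish` callback are hypothetical stand-ins for the real configuration and RMQ publisher.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class TenantLogFlags:
    """Per-customer switches; both streams are off unless explicitly enabled."""
    log_for_debug: bool = False
    log_for_transition: bool = False


def publish_log(stream: str, payload: Dict[str, Any], flags: TenantLogFlags,
                publish: Callable[[str, Dict[str, Any]], None]) -> bool:
    """Publish only if the tenant enabled this stream; return whether it was sent."""
    if not getattr(flags, stream, False):
        return False  # dropped before it costs queue capacity or NFS writes
    publish(stream, payload)
    return True


sent = []
flags = TenantLogFlags(log_for_debug=True)
publish_log("log_for_debug", {"turn": 1}, flags, lambda s, p: sent.append(s))
publish_log("log_for_transition", {"turn": 1}, flags, lambda s, p: sent.append(s))
print(sent)  # ['log_for_debug']
```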

Bottleneck #4: analytics writes competing with runtime traffic

At 5,000 C-IVR CCU, a different constraint emerged. Analytics workloads were writing heavily to the same MongoDB cluster serving real-time traffic. The write amplification from analytics collections was significant — but analytics data didn’t need to be real-time. A delay of several minutes was entirely acceptable for every analytics consumer we had.

We separated all analytics collections into a dedicated MongoDB cluster, isolating the heavy background writes from the operational shard set. This pushed C-IVR to 5,000 CCU at 100× scale with analytics fully enabled; without this change, runs had stalled at 4,000 CCU at 100×. That is 50 CCU per capacity unit, up from 40: 25% more headroom at the same infrastructure footprint, from one architectural boundary.

The principle is straightforward: if a workload doesn’t require low latency, don’t let it compete with one that does.

What “13k CCU” actually means

The headline number is bot-dependent. Simpler bots make fewer service calls per turn, so they scale further on the same infrastructure. Here’s where we ended up across the test scenarios:

Bot / Channel	CCU	Capacity multiplier	Avg latency	p95	Error rate	RPS
PT bot — RTM	13,000	60×	317 ms	625 ms	0.01%	1,142
PT bot — IVR	9,000	60×	477 ms	567 ms	0.07%	695
PT bot — Webhook	9,000	60×	472 ms	562 ms	0.00%	746
C-IVR (moderate)	4,500	90×	670 ms	1,841 ms	0.33%	340

When I say “13k”, I mean the simple-bot RTM scenario — a real, validated number, but the easiest workload. Production was running a mixed load (simple and moderate bots) sitting around 4k aggregate before this work; post-optimisation we comfortably supported that mix with headroom to grow. C-IVR — our hardest benchmark — reached 5,000 with analytics fully enabled.
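One way to compare the rows is to normalise CCU by the capacity multiplier. The sketch below does that with the figures from the table; treating the multiplier as a linear unit of capacity is an assumption made for the comparison, not something the load tests established.

```python
# (scenario, CCU, capacity multiplier), copied from the results table.
ROWS = [
    ("PT bot, RTM", 13_000, 60),
    ("PT bot, IVR", 9_000, 60),
    ("PT bot, Webhook", 9_000, 60),
    ("C-IVR (moderate)", 4_500, 90),
]


def ccu_per_unit(ccu: int, multiplier: int) -> float:
    """Concurrent users supported per unit of capacity."""
    return ccu / multiplier


for name, ccu, multiplier in ROWS:
    print(f"{name:<18} {ccu_per_unit(ccu, multiplier):6.1f} CCU per capacity unit")
```

On that basis the simple RTM scenario supports roughly four times as many users per capacity unit as C-IVR, which is why the headline number needs its qualifier.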

Getting moderate bots to 10k is Phase 2, and it’s a fundamentally different problem: MongoS broadcast queries that fan out without shard keys, stricter query discipline at the application layer, and potentially moving certain queue classes off RabbitMQ entirely.
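Query discipline can start with a crude lint. MongoS can route a query to a subset of shards only when the filter includes the shard key or a prefix of a compound shard key; otherwise it broadcasts. The sketch below is a deliberately simplified check that looks only for the leading shard-key field at the top level of the filter. The field names are hypothetical, and operators such as `$or` are not handled.

```python
from typing import Any, Mapping, Sequence


def is_targeted(query_filter: Mapping[str, Any], shard_key: Sequence[str]) -> bool:
    """Rough check: does the filter constrain the leading shard-key field?

    False means MongoS would have to broadcast the query to every shard.
    """
    if not shard_key:
        raise ValueError("shard_key must name at least one field")
    return shard_key[0] in query_filter


SHARD_KEY = ("tenantId", "sessionId")  # hypothetical compound shard key

print(is_targeted({"tenantId": "t-42", "status": "open"}, SHARD_KEY))  # True
print(is_targeted({"status": "open"}, SHARD_KEY))                      # False
```

A check like this could run in code review or in a test suite over known query shapes, well before a load test exposes the fan-out.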

The other half: ~27% cost reduction per scaling event

Scaling the system was half the work. The other half was ensuring subsequent scaling events didn’t cost a fortune.

The old model was simple and wasteful: one node pool, sized for the most resource-hungry workload (ML inference), so every pod got ML-instance prices regardless of what it actually needed. Every scaling event paid the same premium.

We split into dedicated node groups:

Node group	Instance type	Workloads
`app-compute`	`c6i.8xlarge` (32 vCPU, 64 GB, 3.5 GHz Ice Lake)	koreserver, consumers, NLP
`ml-memory`	memory-optimised	ML inference, embeddings
`rmq-dedicated`	`c5.18xlarge`	RabbitMQ only
`mongo-compute`	`c5d.18xlarge`	MongoDB shards

And we replaced CPU-only HPA with KEDA driven by per-pool saturation signals. The autoscaling design is its own story — covered in the autoscaling pillar. The 27% is the delta between the old “scale everything together” model and the new “scale the actual bottleneck” model, measured on the same synthetic load profile. Not a one-time saving — a per-event reduction that compounds with each subsequent scaling cycle.

What I’d do differently

Instrument before tuning. I spent days diagnosing bottlenecks by tail-grepping load averages and slow-query counts. We had Prometheus; we just didn’t have the right dashboards. Every iteration would have been faster with a proper observability layer in place first. The next major scaling work will start with dashboards, not load tests.

Set CPU limits on every pod from day one, as policy. The RMQ-pods-racing-for-cores problem existed only because limits weren’t set. This shouldn’t require debugging; it should be a default. For application pods that don’t spawn threads against host concurrency, I’d now start with limits on, validate under load, then remove the CPU limit (keeping requests) if throttling is measured and benign — not assumed.
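Measuring throttling, rather than assuming it, is cheap: the kernel exposes the counters in the container's cgroup `cpu.stat` file. The sketch below parses that text and returns the share of scheduling periods in which the container was throttled. The sample values are made up, and the function name is hypothetical.

```python
def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of CFS periods in which the cgroup was throttled.

    Expects the contents of a cgroup cpu.stat file ("key value" per line).
    """
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # no limit set, or no periods elapsed yet
    return stats.get("nr_throttled", 0) / periods


# Made-up sample in the cpu.stat format.
SAMPLE = """usage_usec 918273645
nr_periods 200
nr_throttled 50
throttled_usec 1234567
"""
print(throttle_ratio(SAMPLE))  # 0.25
```

Sampling this before and after a load test gives a concrete number to base the "keep the limit or remove it" decision on.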

Design out NFS, don’t just buy more IOPS. We upgraded NFS IOPS three times during this work. Each time it bought runway; each time we should have been building the replacement — pod-local storage for logs, object storage for artefacts. NFS was treated as a constant when it should have been a deprecation target from the start.

Model the next ceiling before declaring victory. When we hit 5k for moderate bots, the win was shipped and the team moved on. We should have spent another week modelling where the next constraint would appear. Phase 2 started without that prediction and has been slower for it.

Scaling and cost optimisation are different problems with different feedback cycles. Scaling unblocks throughput limits. Cost optimisation reduces waste. They can pull in opposite directions — buying scale headroom with brute force, then paying to make that brute force cheaper. The right sequence: scale safely first (downtime and growth blockers are existential), then optimise once stability is established. Chasing both simultaneously leads to confused priorities and half-finished work in both directions.

Things people ask me about this

Q: What was actually the first bottleneck — I’d expect MongoDB at this scale.

A: RMQ, not MongoDB. MongoDB had headroom at 800 CCU. The signal was load average on RMQ nodes sitting at 150+, not IOPS anywhere. Everyone was watching CPU utilisation on the components they expected to be the bottleneck. Watching the wrong metric on the wrong component is exactly how the previous attempts had stalled.

Q: How did you distinguish MongoS fan-out from MongoD compute saturation?

A: 13k slow queries per second on MongoS with low MongoD CPU is diagnostic. If the shards were saturated, you’d see high MongoD CPU alongside the slow queries. Low MongoD CPU with a high MongoS slow-query rate means the routing tier is the constraint, not compute. The fix was connection pinning per pod, not adding MongoD or MongoS capacity.

Q: Why ha-two and not quorum queues?

A: Quorum queues were evaluated and deferred. The team was actively evaluating Kafka and Pulsar for some queue classes; migrating classic-mirrored → quorum → Kafka would have been wasted intermediate work. ha-two gave us the fault tolerance we needed without ha-all’s replication overhead. Full reasoning in the broker comparison.

Q: How is the 27% cost number defensible?

A: It’s a delta — same synthetic load profile, old single-pool model vs new workload-grouped + KEDA-driven model. Not a one-time saving; a per-event saving that compounds with each cycle. The methodology is measurable, not narrative, and I can walk through it in detail.

Q: What would break first if you needed 15k for moderate bots?

A: MongoS broadcast queries — queries without a shard key that fan out to every shard. At 5k moderate-bot CCU, we were tolerating a meaningful fraction of those; at 15k they’d dominate. The 4-cluster RMQ topology also has a ceiling. Phase 2 has both as primary workstreams.

Deep-dives from this work: MongoDB sharding · Message brokers compared · Service mesh tradeoffs · Healthcheck probes at scale · RabbitMQ cluster architecture
The autoscaling design that kept cost growth sub-linear: Autoscaling pillar
Underlying infra context: Kore infrastructure overview