High Available Mosquitto MQTT on Kubernetes

51 jandeboevrie 31 5/14/2025, 8:42:36 PM raymii.org ↗

Comments (31)

zrail · 122d ago

To preface, I'm not a Kubernetes or Mosquitto expert by any means.

I'm confused about one point. A k8s Service sends traffic to pods matching the selector that are in "Ready" state, so wouldn't you accomplish HA without the pseudocontroller by just putting both pods in the Service? The Mosquitto bridge mechanism is bi-directional so you're already getting data re-sync no matter where a client writes.

edit: I'm also curious if you could use a headless service and use an init container on the secondary to set up the bridge to the primary by selecting the IP that isn't its own.

jandeboevrie · 122d ago

> so wouldn't you accomplish HA without the pseudocontroller by just putting both pods in the Service?

I'm not sure how fast that would be, the extra controller container is needed for the almost instant failover.

Answering your second question (why not an init container in the secondary): because now we can scale that failover controller up over multiple nodes. If the node where the (fairly stateless) controller runs goes down, we'd still have to wait until k8s schedules another pod instead of failing over almost instantly.

rad_gruchalski · 122d ago

> without the pseudocontroller

I am making an assumption. I assume that you mean the deployment. The deployment is responsible for individual pods. If a pod goes away, the deployment brings a new pod in. The deployment controls individual pods.

To answer your question: yes, you can simply create pods without the deployment. But then you are fully responsible for their lifecycle and failures. The deployment makes your life easier.

zrail · 121d ago

I was referring to the pod running the kubectl loop. As far as I can tell (I could be wrong! I haven't experimented yet) the script is relying on the primary Mosquitto pod's ready state, which is also what a Service relies on by default.
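To make the comparison concrete, here is a minimal sketch under a simplified model where each pod only reports a ready flag. The function names and pod names are hypothetical, not taken from the article's script or from any Kubernetes API. It shows that both approaches consume the same readiness signal and differ only in how many pods end up receiving traffic.

```python
from typing import Dict, List, Optional


def service_endpoints(ready: Dict[str, bool]) -> List[str]:
    """Pods a plain Service would route to: every pod that is Ready."""
    return sorted(name for name, is_ready in ready.items() if is_ready)


def controller_target(
    ready: Dict[str, bool], primary: str, secondary: str
) -> Optional[str]:
    """Pod a primary/secondary failover loop would pick (hypothetical logic):
    the primary while it is Ready, otherwise the secondary if it is Ready."""
    if ready.get(primary, False):
        return primary
    if ready.get(secondary, False):
        return secondary
    return None


# Both pods healthy: the Service spreads clients over both pods,
# while the failover loop sends everyone to the primary.
healthy = {"mosquitto-primary": True, "mosquitto-secondary": True}
both = service_endpoints(healthy)
chosen = controller_target(healthy, "mosquitto-primary", "mosquitto-secondary")

# Primary not Ready: both approaches end up on the secondary.
degraded = {"mosquitto-primary": False, "mosquitto-secondary": True}
fallback = controller_target(degraded, "mosquitto-primary", "mosquitto-secondary")
```

In this model the only behavioural difference is active/active versus active/passive; how quickly the readiness change is noticed is a separate question that the sketch does not capture.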

andrewfromx · 122d ago

When dealing with long-lasting TCP connections, why add that extra layer of network complexity with k8s? I work for a big IoT company and we have 1.8M connections spread across 15 ec2 c8g.xlarge boxes. Not even using an NLB, just round-robin DNS. Wrote our own broker with https://github.com/lesismal/nbio and use a packer .hcl file to make the AMI that each ec2 box boots. Using https://github.com/lesismal/llib/tree/master/std/crypto/tls to make nbio work with TLS.

stackskipton · 122d ago

Ops type here who deals with this around Kafka.

It comes down to how much you use Kubernetes. At my company, just about everything is in Kubernetes except for databases which are hosted by Azure. So having random VMs means we need to get Ansible, SSH Keys and SOC2 compliance annoyance. So the workload effort to get VMs running may be higher than Kubernetes even if you have to put in extra hacks.

NewJazz · 122d ago

You don't need Ansible if it is all packed into the AMI.

stackskipton · 120d ago

Packer only works if you can replace machines on a repeatable basis and data can be properly moved.

If not, you need Ansible to run apt update;apt upgrade -y periodically, make sure Security Software is installed and other maintenance tasks.

spotman · 121d ago

Having worked at multiple IoT companies with many millions of connections. This is the way.

People tend to overcomplicate things with K8S. I have never once seen a massively distributed IoT system run without a TON of headache and outages with k8s. Sure, it can be done, but it requires spending 4-8x the amount of development time and has many more outages due to random things.

It's not just the network, it's also the amount of config you have to do to get a deterministic system. For IoT, you don't need as much bursting (for most workloads). It's a bunch of devices that are connected 24/7 with fairly deterministic workloads, that are usually using some type of TCP connection that is not HTTP, and trying to shove it into an HTTP paradigm costs more money and more complexity and is not reliable.

avianlyric · 122d ago

K8s itself doesn’t introduce any real additional network complexity, at least not vanilla k8s.

At the end of the day, K8s only takes care of scheduling containers, and provides a super basic networking proxy layer for convenience. But there’s absolutely nothing in k8s that requires you use that proxy layer, or any other network overlay.

You can easily set up pods that directly expose their ports on the node they’re running on, and have k8s services just provide the IPs of nodes running associated pods as a list. Then either rely on clients to handle multiple addresses themselves (by picking an address at random, and failing over to another random address if needed), configure k8s DNS to provide DNS round robin, or put an NLB or something in front of it all.
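The client-side option (pick an address at random, fail over to another if needed) can be sketched roughly as follows. This is only an illustration: `connect_with_failover` and the injected `connect` callable are hypothetical names, and a real MQTT client would plug in its own connect call and retry policy.

```python
import random
from typing import Callable, Optional, Sequence, Tuple, TypeVar

Address = Tuple[str, int]
Conn = TypeVar("Conn")


def connect_with_failover(
    addresses: Sequence[Address],
    connect: Callable[[Address], Conn],
    rng: Optional[random.Random] = None,
) -> Tuple[Address, Conn]:
    """Try the given addresses in random order and return the first that connects.

    `connect` is assumed to raise OSError (or a subclass) on failure.
    """
    rng = rng or random.Random()
    candidates = list(addresses)
    rng.shuffle(candidates)  # random order spreads clients across nodes
    last_error: Optional[OSError] = None
    for addr in candidates:
        try:
            return addr, connect(addr)
        except OSError as exc:
            last_error = exc  # remember the failure and try the next address
    raise ConnectionError(f"all {len(candidates)} addresses failed") from last_error
```

The list of node addresses would come from wherever the cluster publishes it (DNS, a config endpoint, and so on); that part is outside the sketch.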

Everyone uses network overlays with k8s because it makes it easy for services in k8s to talk to other services in k8s. But there’s no requirement to force all your external inbound traffic through that layer. You can just use k8s to handle nodes, and collect needed meta-data for upstream clients to connect directly to services running on nodes with nothing but the container layer between the client and the running service.

andrewfromx · 122d ago

| Aspect | Direct EC2 (No K8s) | Kubernetes (K8s Pods) |
|---|---|---|
| Networking Layers | Direct connection to EC2 instance (optional load balancer). | Service VIP → kube-proxy → CNI → pod (plus optional external load balancer). |
| Load Balancing | Optional, handled by ELB/ALB or application. | Built-in via kube-proxy (iptables/IPVS) and Service. |
| IP Addressing | Static or dynamic EC2 instance IP. | Pod IPs are dynamic, abstracted by Service VIP. |
| Connection Persistence | Depends on application and OS TCP stack. | Depends on session affinity, graceful termination, and application reconnection logic. |
| Overhead | Minimal (direct TCP). | Additional latency from kube-proxy, CNI, and load balancer. |
| Resilience | Connection drops if instance fails. | Connection may drop if pod is rescheduled, but Kubernetes can reroute to new pods. |
| Configuration Complexity | Simple (OS-level TCP tuning). | Complex (session affinity, PDBs, graceful termination, CNI tuning). |

avianlyric · 121d ago

If you read my reply again, you’ll notice that I explicitly highlight that K8s does not require the use of a CNI. There’s a reason CNIs are plugins, and not core parts of k8s.

How do you think external network traffic gets routed into a CNI’s front proxy? It’s not via kube-proxy, kube-proxy isn’t designed for use in proper production systems, it’s only a stopgap to provide a functioning cluster to enable bootstrapping of a proper network management layer.

There is absolutely nothing preventing a network layer directly routing external traffic to pods, with the only translation being a basic iptables rule so that data sent to a node’s network interface with a pod IP is accepted by the node and routed to the pod. Given it’s just basic Linux network interface bridging, happening entirely in the kernel with zero copies, the impact of this layer is practically zero.

Indeed the k8s services set up with external load balancers basically handle all of this setup for you.

There are plenty of reasons not to use k8s, but arguing that a k8s cluster must inherently introduce multiple additional network components and complexity is simply incorrect.

andrewfromx · 121d ago

While Kubernetes theoretically allows for simple iptables-based routing, in practice, very few production environments stop there. Most clusters do use CNIs, kube-proxy, Service abstraction, and often external load balancers or Ingress controllers, which together form a nontrivial networking stack.

The claim that kube-proxy is “not designed for use in proper production systems” is simply incorrect. It is widely used in production and supported by Kubernetes core. While it has limitations—especially with high-connection-load environments—it is still the default in most distros, and even advanced CNIs like Calico and Cilium often interact with or replace kube-proxy functionality rather than ignore it.

If kube-proxy is just a stopgap, why do so many managed Kubernetes platforms like GKE, EKS, and AKS still ship with it?

While it’s true CNIs are plugins and not core to Kubernetes, this outsources complexity rather than eliminates it. Different CNIs (e.g., Calico, Cilium, Flannel, etc.) use different overlay models, BPF implementations, or routing approaches.

Even in a CNI with kernel fast-path routing, pod churn, rolling updates, or horizontal scaling still introduce issues. Service IPs are stable, but pod IPs are ephemeral, which means long-lived TCP clients pinned to pod IPs are disrupted whenever those pods are replaced.

You can design a lean Kubernetes network path with minimal abstractions, but: You lose things like dynamic service discovery, load balancing, and readiness-based traffic shifting. You must manage more infra manually (e.g., configure iptables or direct routing rules yourself). You’re fighting against the grain of what Kubernetes is designed to do.

avianlyric · 120d ago

Not entirely sure what your point is. You state that Kubernetes must come with a complex networking stack, and I point out that simply isn’t true, which apparently you agree with.

But because other people use complex networking stacks in Kubernetes, for a bunch of good reasons, that apparently means it can’t be used without that stack? If you don’t need that functionality, and you have different requirements, then why would you implement a k8s stack that includes all that functionality? That would be foolish.

Everyone ships with kube-proxy because it’s the simplest way to provide an unopinionated network stack that will give you a fully functional k8s cluster with expected bells and whistles like simple service discovery. But most people quickly replace it with a CNI for various reasons.

My understanding is that your application is all about managing high throughput, direct to application TCP connections. So I have no idea why you’re talking about the need for fully fledged service discovery, or long lived TCP connections pinned to a single pod (I mean, that’s just the nature of TCP, it’s got nothing to do with k8s or any of its network stacks).

> You can design a lean Kubernetes network path with minimal abstractions, but: You lose things like dynamic service discovery, load balancing, and readiness-based traffic shifting. You must manage more infra manually (e.g., configure iptables or direct routing rules yourself). You’re fighting against the grain of what Kubernetes is designed to do.

If you’re building a stack without k8s, then you don’t get any of those features either, so not sure what the fuss is about. That doesn’t mean k8s’s core competency as a workload scheduler, and its general framework for managing stateful and stateless resources, doesn’t provide a lot of value. You just need to learn how to take advantage of it, rather than following some cookie cutter setup for a stack that doesn’t meet your needs.

andrewfromx · 119d ago

My point is this:

Having worked at multiple IoT companies with many millions of connections. This is the way.

https://news.ycombinator.com/item?id=44032250

avianlyric · 116d ago

> People tend to overcomplicate things with K8S. I have never once seen a massively distributed IoT system run without a TON of headache and outages with k8s

I don’t dispute that. I’m simply saying that people overcomplicating their K8s setup isn’t evidence that you can’t build simple K8s setups with simple networking. It’s just that it requires people to actually think about their k8s design, and not just copy common k8s setups which are usually geared towards different workloads.

> that are usually using some type of TCP connection that is not HTTP, and trying to shove it into an HTTP paradigm costs more money and more complexity and is not reliable.

What has HTTP got to do with k8s? There’s absolutely nothing in k8s that requires the use of HTTP for network. I’ve run plenty of workloads that used long lived TCP connections with binary protocols that shared nothing with HTTP.

oulipo · 122d ago

Wouldn't more modern implementations like EMQx be better suited for HA ?

jpgvm · 122d ago

I built a high scale MQTT ingestion system by utilising the MQTT protocol handler for Apache Pulsar (https://github.com/streamnative/mop). I ran a forked version and contributed back some of the non-proprietary bits.

A lot more work than Mosquitto but obviously HA/distributed and some tradeoffs w.r.t features. Worth it if you want to run Pulsar anyway for other reasons.

oulipo · 122d ago

I was going to go for Redpanda, what would be the pro/cons of Pulsar you think?

jpgvm · 121d ago

With Redpanda you would need to build something external. With Pulsar the protocol handlers run within the Pulsar proxy execution mode and all of your authn/authz can be done by Pulsar etc.

Redpanda might be more resource efficient however and less operational overhead than a Pulsar system.

Pulsar has some very distinct advantages over Redpanda when it comes to actually consuming messages though. Specifically it enables both queue-like and streaming consumption patterns (it is still a distributed log underneath but does selective acknowledgement at the subscription level).

oulipo · 121d ago

I'm not sure what you mean by "queue-like and streaming consumption patterns"?

A stream is a form of queue for me, no?

jpgvm · 121d ago

Definitely not. Stream is an ordered log, a queue is a heap.

A stream has cumulative acknowledgement, i.e. I have read up to X offset on partition Y, if I restart unexpectedly please redeliver all messages since X. This means that if any message on Y is failing you can't update the committed offset X without a) dropping it into the ether or b) writing it to a retry topic. b) sounds like a solution but it's really just kicking the can down the road, because you face the same choice there until it ends up in a dead-letter topic where you send stuff that can't be dealt with automatically. In the literature this is called head of line blocking.

Queues are completely different. Usually instead of having a bunch of partitions with exclusive consumers you want a work-stealing approach that has consumers rip whatever work items they can get and stay as well fed as possible and be able to deal with failing items by Nack'ing them and sending them back to the queue. In order to facilitate this though the queue needs to implement the ability to selectively Ack(nowledge) messages and keep track of which messages haven't been successfully consumed.

This is easy with a traditional queuing system because they usually don't offer any ordering guarantees (or if they do they are per key or something and pretty loose) and they store the set of messages "yet to be delivered" rather than "all messages in order" like a streaming system does. This makes it trivial to acknowledge a message has been processed (delete it) or nack it (remove the processing lock, start a redelivery timer for it). Naturally though this means the ability to re-consume already acknowledged messages pretty much doesn't exist in most queue systems as they are long-gone once they have been successfully processed.

Mixing the two is the magic of Pulsar. It has the underlying stream storage approach, and with it come ordering properties and a whole bunch of stuff that is good for scaling and reliability, but it layers on a queue-based consumption API by storing subscription state durably on the cluster, i.e. it tracks which individual messages have been Ack'd rather than offsets like Kafka consumer groups or similar APIs.
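A toy model of the difference, to make the head-of-line blocking point concrete. This is a simplified sketch and not Pulsar's or Kafka's actual API: the function names are made up, and a single in-memory list stands in for one partition.

```python
from typing import List, Set


def cumulative_redelivery(log: List[str], processed_ok: Set[int]) -> List[int]:
    """Offset-based (stream style) consumption.

    The committed offset can only advance over a contiguous prefix of
    successfully processed messages, so everything from the first failure
    onward is redelivered after a restart.
    """
    committed = 0
    while committed < len(log) and committed in processed_ok:
        committed += 1
    return list(range(committed, len(log)))


def selective_redelivery(log: List[str], processed_ok: Set[int]) -> List[int]:
    """Per-message acknowledgement (queue style): only un-acked messages return."""
    return [i for i in range(len(log)) if i not in processed_ok]


log = ["m0", "m1", "m2", "m3", "m4"]
ok = {0, 1, 3, 4}  # message at offset 2 keeps failing

stream_view = cumulative_redelivery(log, ok)  # [2, 3, 4]
queue_view = selective_redelivery(log, ok)    # [2]
```

With cumulative acknowledgement the one failing message at offset 2 drags offsets 3 and 4 back with it; with selective acknowledgement only offset 2 comes back.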

Building this yourself on Kafka/Redpanda is possible but it's extremely difficult to do correctly and you need to be very careful about how you store the subscription state (usually on a set of compacted topics on the cluster). I say this because I took this path in the past and I don't recommend it for anyone that isn't sufficiently brave. :)

jandeboevrie · 122d ago

Would they be as performant and use the same amount of (less, almost nothing) resources? I've run Mosquitto clusters with tens of thousands of connected clients, thousands of messages per second, on 2 cores and 2GB of RAM, while mostly idling. (Without retention, using clean sessions and only QoS 0)...

bo0tzz · 122d ago

EMQX just locked HA/clustering behind a paywall: https://www.emqx.com/en/blog/adopting-business-source-licens...

zrail · 122d ago

Sigh that's annoying.

Edit: it's not a paywall. It's the standard BSL with a 4 year Apache revert. I personally have zero issue with this.

casper14 · 122d ago

Oh can you comment on what this means? I'm not too familiar with it. Thanks!

zrail · 122d ago

BSL is a source-available license that by default forbids production use. After a certain period after the date of any particular release, not to exceed four years, that release automatically converts to an open source license, typically the Apache license.

Projects can add additional license grants to the base BSL. EMQX, for example, adds a grant for commercial production use of single-node installations, as well as production use for non-commercial applications.

bo0tzz · 122d ago

It is a paywall, clustering won't work unless you have a license key.

zrail · 122d ago

Yeah I see that now. Ugh.

seized · 122d ago

VerneMQ also has built in clustering and message replication which would make this easy.

oulipo · 122d ago

Have you tried both EMQx and VerneMQ and would you specifically recommend one over the other? I don't have experience with VerneMQ