RabbitMQ is an AMQP broker with an interesting set of HA abilities. Do a little research and your head will start spinning as you work out the differences between persistent messages, durable queues... or was it durable messages and HA queues with transactions? Hopefully the following is all the information you need in one place.
Before evaluating them you need to define your requirements.
- Do you want queues to survive broker failures?
- Do you want unconsumed messages to survive a broker failure?
- What matters more, publisher speed, or the above? Or do you want a nice compromise?
RabbitMQ allows you to:
- Make a cluster of Rabbits where clients can communicate with any node in the cluster
- Make a queue durable, meaning the queue definition itself will survive broker failure
- Make a message persistent, meaning it will get stored to disk, which you do by setting the message's delivery mode to persistent
- Make a queue HA, meaning its contents will be replicated across brokers: a specified list of nodes, all of them, or a given number of them
- Even an HA queue has a single master that handles all operations on that queue, even if the client is connected to a different node in the cluster; the master sends information to the replicas, which are called slaves
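The first two building blocks can be sketched with the Python pika client (an assumption on my part; the post names no library, and the queue name "orders" is purely illustrative):

```python
# Sketch assuming the pika client library and a broker on localhost.
# The queue name "orders" is illustrative, not from the post.
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# Durable queue: the queue *definition* survives a broker restart.
ch.queue_declare(queue="orders", durable=True)

# Persistent message: delivery_mode=2 asks Rabbit to write it to disk.
ch.basic_publish(
    exchange="",
    routing_key="orders",
    body=b"hello",
    properties=pika.BasicProperties(delivery_mode=2),
)
conn.close()
```

Mirroring itself is configured server-side with a policy rather than in client code, e.g. `rabbitmqctl set_policy ha-all "^orders$" '{"ha-mode":"all"}'` to mirror matching queues across all nodes.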
Okay so you have a durable queue that is HA and you're using persistent messages (you
really want it all!). How do you work with the queue correctly?
Producing to an HA queue
You have three options for publishing to an HA queue:
- Accept the defaults: the publish returns with no guarantees in the event of broker failure
- Transactions
- Publisher confirms
The defaults: You went to all that effort of making a durable HA queue and sending a persistent message, and then you just fire and forget? Sounds crazy, but it's not. You might have done the above to make sure you don't lose a lot of messages, but you don't want the performance impact of waiting for any form of acknowledgement. You're essentially accepting a few failures when you lose a rabbit that is the master for any of your queues.
Transactions: To use RabbitMQ transactions you do a txSelect on your channel. Then after you publish a message you call txCommit, which won't return until your message has been accepted by the master and all of the queue's slaves. If your message is persistent then that means it is on the disk of them all; you're safe! What's not to like? The speed! Every persistent message that is published in a transaction results in an fsync to disk. You need a compromise, you say?
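In pika (again, an assumption; any client exposing tx.select/tx.commit works the same way) the transactional dance looks roughly like this:

```python
# Sketch assuming pika and a broker on localhost; "orders" is illustrative.
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.tx_select()  # put the channel into transactional mode

ch.basic_publish(
    exchange="",
    routing_key="orders",
    body=b"important",
    properties=pika.BasicProperties(delivery_mode=2),
)

# tx_commit blocks until the master and every slave have accepted the
# message -- and, since it's persistent, fsynced it to disk.
ch.tx_commit()
conn.close()
```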
Publisher confirms: So you don't want to lose your messages and you want to speed things up. Then you can enable publisher confirms on your channel. RabbitMQ will then send you a confirmation when the message has made it to disk on all the rabbits, but it won't do it right away; it will flush things to disk in
batches. You can either block periodically or set up a listener to get notified. Then you can put logic in your
publisher to do retries etc. You might even write logic to limit the number of published messages that haven't been
confirmed. But wait, isn't queueing meant to be easy?
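That "limit the number of unconfirmed messages" logic is worth sketching. On the channel you would first enable confirms (pika's `confirm_delivery()`, for example); the bookkeeping itself is plain Python and independent of any client library. This is a rough sketch, not a production implementation:

```python
# Rough sketch of throttling on unconfirmed publishes; the sequence
# numbers stand in for the broker's delivery tags.
class ConfirmTracker:
    def __init__(self, max_outstanding=100):
        self.max_outstanding = max_outstanding
        self.next_seq = 1
        self.unconfirmed = set()

    def can_publish(self):
        # Block or back off when too many publishes are unconfirmed.
        return len(self.unconfirmed) < self.max_outstanding

    def published(self):
        # Call once per basic.publish; returns the sequence number.
        seq = self.next_seq
        self.next_seq += 1
        self.unconfirmed.add(seq)
        return seq

    def confirmed(self, seq, multiple=False):
        # Rabbit may confirm in batches: multiple=True acknowledges
        # every outstanding tag up to and including seq.
        if multiple:
            self.unconfirmed = {s for s in self.unconfirmed if s > seq}
        else:
            self.unconfirmed.discard(seq)


tracker = ConfirmTracker(max_outstanding=2)
a = tracker.published()
b = tracker.published()
assert not tracker.can_publish()      # throttle the publisher here
tracker.confirmed(b, multiple=True)   # batch confirm covers a and b
assert tracker.can_publish()
```

A nack (or a timeout) for an outstanding sequence number is where your retry logic would hook in.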
Consuming from an HA queue
Okay, so you have your message on the queue - how do you consume it? This is simpler:
- Auto-ack: As soon as a message is delivered RabbitMQ discards it
- Ack: Your consumer has to manually ack each message
If your consumer crashes and disconnects from Rabbit, then the message will be re-queued. However, if you have a
bug and you just don't ack it, then Rabbit will hold on to it until you disconnect, and then it will be
re-queued. I bet that leads to some interesting bugs!
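A manual-ack consumer in pika (again an assumption; the queue name and the `process` helper are hypothetical) looks something like this:

```python
# Sketch assuming pika and a broker on localhost; "orders" and
# process() are illustrative, not from the post.
import pika


def process(body):
    print("processing", body)  # stand-in for your real work


def handle(ch, method, properties, body):
    process(body)
    # Ack only after the work succeeds; crash before this line and
    # Rabbit re-queues the message for another consumer.
    ch.basic_ack(delivery_tag=method.delivery_tag)


conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
# auto_ack=False: Rabbit holds the message until we explicitly ack it.
ch.basic_consume(queue="orders", on_message_callback=handle, auto_ack=False)
ch.start_consuming()
```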
So what could go wrong?
This sounds peachy: you don't care about performance, so you have a durable HA queue with persistent messages, you're using transactions for producing and acks when consuming, and so you've guaranteed exactly-once delivery, right? Well, no. Imagine your consumer crashes having consumed the message but just before sending the ack. Rabbit will re-send the message to another consumer.
HA queueing is hard!
There is no magic bullet; you really need to understand the software you use for HA queueing. It is
complicated, and I didn't even cover topics like network partitions. Rabbit is a great piece of software and its
automatic failover is really great, but every notch you add on (transactions etc.) will degrade your performance.