UH Message Broker upgrade to clustered environment

Overview

We plan to upgrade the UH Message Broker from a single host to a clustered environment to improve availability. Here is what you can expect:

Item | Previously | Changed to… | Comments
Server configuration | Single server | 3-node cluster
SSL protocols | Supports TLS 1.0, 1.1 and 1.2 | Only TLS 1.1 and 1.2 are supported
Software versions | RabbitMQ 3.1.5, Erlang 19 | RabbitMQ 3.7.12 or higher, Erlang 21.2.6 or higher


RabbitMQ client | Whatever version was current when you downloaded your client

Although we expect older clients to work, we recommend that you upgrade to the latest client

The oldest Java client you should use is 3.6.6.  Prior versions may force the use of the unsupported TLS 1.0, regardless of Java version and settings.

We still need to determine how SSL certificate verification should be implemented, given this warning from later clients:

WARN [localhost-startStop-1] com.rabbitmq.client.TrustEverythingTrustManager.<init> SECURITY ALERT: this trust manager trusts every certificate, effectively disabling peer verification. This is convenient for local development but offers no protection against man-in-the-middle attacks. Please see https://www.rabbitmq.com/ssl.html to learn more about peer certificate verification.
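
One possible way to avoid the trust-everything manager is to give the client an SSLContext that actually verifies the server certificate. The sketch below uses only the JDK; the class and method names are hypothetical, and it assumes the broker's certificate chains to a CA in the trust store you pass in (or in the JVM's default trust store when you pass null). It is a starting point for the open question above, not a decided approach.

```java
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

// Hypothetical helper: builds a TLS 1.2 context with real peer verification.
final class VerifyingTls {
    private VerifyingTls() {}

    /**
     * Returns a TLS 1.2 SSLContext whose trust managers verify the server
     * certificate against trustStore. Passing null makes the JDK fall back
     * to the JVM's default trust store.
     */
    static SSLContext verifyingContext(KeyStore trustStore) throws GeneralSecurityException {
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trustStore);

        // TLS 1.2 is one of the protocols the new cluster supports.
        SSLContext context = SSLContext.getInstance("TLSv1.2");
        // No client certificate (first argument) and default randomness (third argument).
        context.init(null, tmf.getTrustManagers(), null);
        return context;
    }
}
```

With the RabbitMQ Java client, the resulting context would typically be handed to the connection factory with factory.useSslProtocol(VerifyingTls.verifyingContext(null)) instead of the no-argument useSslProtocol().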

AMQP heartbeat | Default is 65535 seconds (18.2 hours) | Default is 60 seconds

You need to set the AMQP heartbeat in your client settings to 60 seconds so that it matches the server's expectations. Otherwise, the server may think that you are no longer connected.

Here's a Java example: https://www.rabbitmq.com/heartbeats.html#using-heartbeats-in-java

This smaller heartbeat value will generate network traffic every 60s, thus preventing network devices from dropping your connection when it is idle.

Publish confirms | Recommended | Strongly recommended | If you publish messages, you've always been expected to use publish confirms or risk not being notified of failed messages. This is even more important in a clustered environment.
Handling dropped broker connections

If your RabbitMQ client does not already do this for you, you should have code that handles dropped connections to the broker.  The code should repeatedly attempt to reconnect until it's successful, and if applicable, retry the interrupted operation.

No change. Continue doing the same.

If you are not currently doing this, you should, but it's usually only an issue for those who have persistent connections to the broker (continuously consuming messages from their queue). 


Don't worry too much about this if you connect-consume-and-disconnect on a daily or hourly basis.

WARNING: If you have your own code that reconnects, check whether your version of the RabbitMQ client is also capable of reconnecting, especially if you're upgrading from a really old version. Only one entity should handle lost connections; otherwise, you may end up with multiple re-connections for each connection that was lost.

This is how the RabbitMQ client recovers from network failures:
https://www.rabbitmq.com/api-guide.html#recovery

Note: It is possible to have a lost connection in a "zombie" state while at the same time having a newly created recovery connection. To speed up the cleanup of the zombie connection, you should set the recovery retry interval to 3 times the heartbeat value. To further reduce the likelihood of zombies, you could set the heartbeat to 30 seconds and the recovery retry interval to 90 seconds. See also https://groups.google.com/forum/#!topic/rabbitmq-users/7AZz4Nr0_Rk
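
The timing rule in the note above can be sketched as follows. This is a minimal illustration under stated assumptions: the class and method names are hypothetical, and the commented lines assume you use the RabbitMQ Java client's ConnectionFactory with its automatic recovery rather than your own reconnect code.

```java
import java.time.Duration;

// Hypothetical helper that derives the recovery retry interval from the heartbeat.
final class RecoveryTiming {
    private RecoveryTiming() {}

    /** Recovery retry interval from the note above: 3 times the heartbeat. */
    static Duration recoveryInterval(Duration heartbeat) {
        if (heartbeat.isZero() || heartbeat.isNegative()) {
            throw new IllegalArgumentException("heartbeat must be positive");
        }
        return heartbeat.multipliedBy(3);
    }

    public static void main(String[] args) {
        Duration heartbeat = Duration.ofSeconds(30);
        Duration interval = recoveryInterval(heartbeat);

        // With the RabbitMQ Java client these values would typically be applied as:
        //   factory.setRequestedHeartbeat((int) heartbeat.getSeconds());
        //   factory.setAutomaticRecoveryEnabled(true);
        //   factory.setNetworkRecoveryInterval(interval.toMillis());
        System.out.println("heartbeat=" + heartbeat.getSeconds() + "s, recovery interval="
                + interval.getSeconds() + "s");
    }
}
```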

High availability

This is a single server, so any major failure could cause the broker to be unavailable for several minutes or hours.

The broker should come back within 16 seconds.

If you set your dropped connection retry interval to 16 seconds, that should result in a successful re-connection after any dropped connection.

The load balancer health check runs every 5 seconds, with 3 failures triggering the switch to another cluster node, so a failed node is detected after roughly 15 seconds.
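
If your client does not recover connections for you, the "retry until successful" behaviour with a 16-second interval could look like the sketch below. It uses only the JDK; the class and method names are hypothetical, and the connect action stands in for whatever call opens your broker connection. Use it only if nothing else in your application is already reconnecting.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical helper: retries a connect action at a fixed interval until it succeeds.
final class ReconnectSketch {
    private ReconnectSketch() {}

    /**
     * Calls connect repeatedly until it returns normally, sleeping for
     * retryInterval after each failed attempt. Interruption stops the loop.
     */
    static <T> T connectWithRetry(Callable<T> connect, Duration retryInterval)
            throws InterruptedException {
        while (true) {
            try {
                return connect.call();
            } catch (InterruptedException e) {
                // Let the caller decide what to do when the thread is interrupted.
                throw e;
            } catch (Exception e) {
                System.err.println("Connection attempt failed (" + e + "); retrying in "
                        + retryInterval.toMillis() + " ms");
                Thread.sleep(retryInterval.toMillis());
            }
        }
    }
}
```

With the RabbitMQ Java client, a call might look like ReconnectSketch.connectWithRetry(factory::newConnection, Duration.ofSeconds(16)), followed by re-creating channels and consumers on the new connection.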
Mirrored queues | N/A

Queues are mirrored and synchronized across all 3 nodes unless the queue name begins with an underscore.


Mirroring allows your queue to be serviced by any node when you reconnect to the cluster after a failure.

Mirroring uses up more resources, especially when it's a large test queue that hardly gets consumed.  You can skip mirroring for such test queues by using an underscore as the first character in the queue name.
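
The naming rule above can be expressed as a small client-side check, for example to warn when a test queue is about to be created with a mirrored name. This is only a sketch of the rule as described here; the class and method names are hypothetical, and the actual policy pattern configured on the broker is not shown in this document.

```java
// Hypothetical helper mirroring the naming rule described above.
final class QueueNaming {
    private QueueNaming() {}

    /** Returns true when a queue with this name would be mirrored across the cluster. */
    static boolean isMirrored(String queueName) {
        // Queues whose names begin with an underscore are not mirrored.
        return !queueName.startsWith("_");
    }

    public static void main(String[] args) {
        System.out.println(isMirrored("orders"));     // mirrored
        System.out.println(isMirrored("_load-test")); // not mirrored
    }
}
```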

Consumer prefetch | Optional and possibly set by you or the RabbitMQ client you use.

One application reported connection resets after reading over a thousand messages without acknowledging them. The solution was to add this line:

channel.basicQos(10);

Please refer to https://www.rabbitmq.com/consumer-prefetch.html

We don't know if the need to add this line of code was the result of the newer RabbitMQ client version, the new RabbitMQ cluster or because the application code itself changed. There's also the possibility that the application should have had this line of code under the old broker (this is not a new feature), but it got lucky and never experienced the issue since its queue doesn't get backlogged (its publishing is throttled).

If your application uses basicConsume or similar (as opposed to basicGet) to read thousands of messages *before* eventually acknowledging them, you may want to consider adding this line of code.

Timeline

Date | Event
Jul 24 2019 | Test cluster environment is available for developers to test
Oct 20 2019 | Production migration to cluster environment