UH Message Broker upgrade to clustered environment
Overview
We plan to upgrade the UH Message Broker from a single host to a clustered environment to improve availability. Here is what you can expect:
Item | Previously | Changed to… | Comments |
---|---|---|---|
Server configuration | Single server | 3-node cluster | |
SSL protocols | Supports TLS 1.0, 1.1 and 1.2 | Only TLS 1.1 and 1.2 are supported | |
Software versions | RabbitMQ 3.1.5, Erlang 19 | RabbitMQ 3.7.12 or higher, Erlang 21.2.6 or higher | |
RabbitMQ client | Whatever the current version was when you downloaded your client | Although we expect older clients to work, we recommend that you upgrade to the latest client. The oldest Java client you should use is 3.6.6; earlier versions may force the use of the unsupported TLS 1.0, regardless of Java version and settings. | We still need to determine how SSL certificate verification should be implemented, given this warning from later clients: WARN [localhost-startStop-1] com.rabbitmq.client.TrustEverythingTrustManager.<init> SECURITY ALERT: this trust manager trusts every certificate, effectively disabling peer verification. This is convenient for local development but offers no protection against man-in-the-middle attacks. Please see https://www.rabbitmq.com/ssl.html to learn more about peer certificate verification. |
AMQP heartbeat | Default is 65535 seconds (18.2 hours) | Default is 60 seconds | Set the AMQP heartbeat in your client settings to 60 seconds so that it matches the server's expectations. Otherwise, the server may think that you are no longer connected. Here's a Java example: https://www.rabbitmq.com/heartbeats.html#using-heartbeats-in-java. The smaller heartbeat value generates network traffic every 60 seconds, which prevents network devices from dropping your connection when it is idle. |
Publish confirms | Recommended | Strongly recommended | If you publish messages, you've always been expected to use publish confirms or risk not being notified of failed messages. This is even more important in a clustered environment. |
Handling dropped broker connections | If your RabbitMQ client does not already do this for you, you should have code that handles dropped connections to the broker. The code should repeatedly attempt to reconnect until it's successful and, if applicable, retry the interrupted operation. | No change. Continue doing the same. If you are not currently doing this, you should, but it's usually only an issue for those who have persistent connections to the broker (continuously consuming messages from their queue). | Don't worry too much about this if you connect, consume and disconnect on a daily or hourly basis. WARNING: if you have your own code that reconnects, check whether your version of the RabbitMQ client is also capable of reconnecting, especially if you're upgrading from a really old version. Only one entity should handle lost connections; otherwise you may end up with multiple reconnections for each connection that was lost. This is how the RabbitMQ client recovers from network failures: Note: It is possible to have a lost connection in a "zombie" state while at the same time having a newly created recovery connection. To speed up the cleanup of the zombie connection, you should set the recovery retry interval to 3 times the heartbeat value. To further reduce the likelihood of zombies, you could set the heartbeat to 30 seconds and the recovery retry interval to 90 seconds. See also https://groups.google.com/forum/#!topic/rabbitmq-users/7AZz4Nr0_Rk |
High availability | This is a single server, so any major failure could cause the broker to be unavailable for several minutes or hours. | The broker should come back within 16 seconds. If you set your dropped connection retry interval to 16 seconds, that should result in a successful reconnection after any dropped connection. | The load balancer runs a health check every 5 seconds, and 3 consecutive failures trigger the switch to another cluster node, so a failed node is detected after about 15 seconds. |
Mirrored queues | N/A | Queues are mirrored and synchronized across all 3 nodes unless the queue name begins with an underscore. | Mirroring allows your queue to be serviced by any node when you reconnect to the cluster after a failure. Mirroring uses up more resources, especially when it's a large test queue that hardly gets consumed. You can skip mirroring for such test queues by using an underscore as the first character in the queue name. |
Consumer prefetch | Optional and possibly set by you or the RabbitMQ client you use. | An application reported getting connection resets after reading over a thousand messages without acknowledging them. The solution was to add this line: channel.basicQos(10); Please refer to https://www.rabbitmq.com/consumer-prefetch.html. | We don't know whether the need for this line of code was the result of the newer RabbitMQ client version, the new RabbitMQ cluster, or a change in the application code itself. It is also possible that the application should have had this line of code under the old broker (this is not a new feature), but got lucky and never experienced the issue because its queue doesn't get backlogged (its publishing is throttled). If your application uses basicConsume or similar (as opposed to basicGet) to read thousands of messages *before* eventually acknowledging them, you may want to consider adding this line of code. |
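
The reconnect behavior described in the "Handling dropped broker connections" row can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the broker team's code: the names `ReconnectSketch`, `connectWithRetry` and `recoveryInterval` are hypothetical, and the `connect` action is a stand-in for whatever call your client uses to open a connection (for example, the Java client's `ConnectionFactory.newConnection()`). Use something like this only if your RabbitMQ client is not already recovering connections for you, so that a single entity handles lost connections.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical helper class; not part of the RabbitMQ client.
class ReconnectSketch {

    // Recovery retry interval suggested in the table: 3 times the heartbeat.
    static Duration recoveryInterval(Duration heartbeat) {
        return heartbeat.multipliedBy(3);
    }

    // Repeatedly runs the connect action until it succeeds,
    // sleeping for retryInterval after each failed attempt.
    static <T> T connectWithRetry(Callable<T> connect, Duration retryInterval)
            throws InterruptedException {
        while (true) {
            try {
                return connect.call();
            } catch (InterruptedException e) {
                // Let the caller stop the retry loop by interrupting the thread.
                throw e;
            } catch (Exception e) {
                System.err.println("Connect failed (" + e.getMessage()
                        + "); retrying in " + retryInterval.toMillis() + " ms");
                Thread.sleep(retryInterval.toMillis());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated broker that rejects the first two attempts.
        int[] attempts = {0};
        String result = connectWithRetry(() -> {
            if (++attempts[0] < 3) {
                throw new IOException("broker unavailable");
            }
            return "connected";
        }, Duration.ofMillis(10)); // short interval for the demo only

        System.out.println(result + " after " + attempts[0] + " attempts");

        // With a 30-second heartbeat, the suggested recovery interval is 90 seconds.
        System.out.println(recoveryInterval(Duration.ofSeconds(30)).getSeconds() + " s");
    }
}
```

In a real application the retry interval would come from the guidance in the table (for example, 16 seconds, or 3 times the heartbeat) rather than the 10 ms used to keep the demo fast.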
Timeline
Date | Event |
---|---|
Jul 24 2019 | Test cluster environment is available for developers to test |
Oct 20 2019 | Production migration to cluster environment |