Overview
We plan to upgrade the UH Message Broker from a single host to a clustered environment to improve availability. Here is what you can expect:
Item | Currently | Changing to… | Comments |
---|---|---|---|
Server configuration | Single server | 3-node cluster | |
SSL protocols | Supports TLS 1.0, 1.1 and 1.2 | Only TLS 1.1 and 1.2 are supported | |
Software versions | RabbitMQ 3.1.5, Erlang 19 | RabbitMQ 3.7.12 or higher, Erlang 21.2.6 or higher | |
RabbitMQ client | Whatever version was current when you downloaded your client | Although we expect older clients to work, we recommend that you upgrade to the latest client. The oldest Java client you should use is 3.6.6; earlier versions may force the use of the unsupported TLS 1.0, regardless of Java version and settings. | We still need to determine how SSL certificate verification should be implemented, given this warning from later clients: "WARN [localhost-startStop-1] com.rabbitmq.client.TrustEverythingTrustManager.<init> SECURITY ALERT: this trust manager trusts every certificate, effectively disabling peer verification. This is convenient for local development but offers no protection against man-in-the-middle attacks. Please see https://www.rabbitmq.com/ssl.html to learn more about peer certificate verification." |
AMQP heartbeat | Default is 65535 seconds (18.2 hours) | Default is 60 seconds | Set the AMQP heartbeat in your client settings to 60 seconds so that it matches the server's expectations; otherwise, the server may think that you are no longer connected. (Java example: https://www.rabbitmq.com/heartbeats.html#using-heartbeats-in-java ) The smaller heartbeat value also generates network traffic at least every 60 seconds, which prevents network devices from dropping your connection when it is idle. |
Publish confirms | Recommended | Strongly recommended | If you publish messages, you've always been expected to use publish confirms or risk not being notified of failed messages. This is even more important in a clustered environment. |
Handling dropped broker connections | If your RabbitMQ client does not already do this for you, you should have code that handles dropped connections to the broker. The code should repeatedly attempt to reconnect until it succeeds and, if applicable, retry the interrupted operation. | No change; continue doing the same. If you are not currently doing this, you should, but it's usually only an issue for those who keep persistent connections to the broker (continuously consuming messages from their queue). | Don't worry too much about this if you connect, consume, and disconnect on a daily or hourly basis. WARNING: if you have your own code that reconnects, check whether your version of the RabbitMQ client is also capable of reconnecting, especially if you're upgrading from a very old version. Only one entity should handle lost connections; otherwise you may end up with multiple reconnections for each connection that was lost. This is how the RabbitMQ client recovers from network failures: Note: It is possible to have a lost connection in a "zombie" state while at the same time having a newly created recovery connection. To speed up the cleanup of the zombie connection, set the recovery retry interval to 3 times the heartbeat value. To further reduce the likelihood of zombies, you could set the heartbeat to 30 seconds and the recovery retry interval to 90 seconds. See also https://groups.google.com/forum/#!topic/rabbitmq-users/7AZz4Nr0_Rk |
High availability | This is a single server, so any major failure could make the broker unavailable for several minutes or hours. | The broker should come back within 16 seconds. If you set your dropped-connection retry interval to 16 seconds, a reconnection attempt after any dropped connection should succeed. | The load balancer runs a health check every 5 seconds, and 3 failures trigger the switch to another cluster node. |
Mirrored queues | N/A | Queues are mirrored and synchronized across all 3 nodes unless the queue name begins with an underscore. | Mirroring allows your queue to be serviced by any node when you reconnect to the cluster after a failure. Mirroring uses up more resources, especially when it's a large test queue that hardly gets consumed. You can skip mirroring for such test queues by using an underscore as the first character in the queue name. |
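
The timing figures in the table above are simple arithmetic, and the reconnect behavior described under "Handling dropped broker connections" can be sketched without reference to any particular client library. The Python sketch below is illustrative only: `connect` is a hypothetical callable standing in for whatever your RabbitMQ client uses to open a connection, and all function and constant names are assumptions rather than part of any RabbitMQ API. If your client raises its own exception types on connection failure, the `except` clause would need to be adapted. If your client already performs automatic recovery, do not add a loop like this on top of it.

```python
import time

HEARTBEAT_SECONDS = 60             # new server default, per the table above
HEALTH_CHECK_INTERVAL_SECONDS = 5  # load balancer health check period
HEALTH_CHECK_FAILURES = 3          # failed checks before switching nodes


def recovery_interval(heartbeat_seconds=HEARTBEAT_SECONDS):
    """Recovery retry interval suggested in the table: 3 times the heartbeat."""
    return 3 * heartbeat_seconds


def failover_detection_seconds(interval=HEALTH_CHECK_INTERVAL_SECONDS,
                               failures=HEALTH_CHECK_FAILURES):
    """Approximate time for the load balancer to notice a failed node."""
    return interval * failures


def connect_with_retry(connect, retry_interval, sleep=time.sleep):
    """Call connect() until it succeeds, pausing retry_interval seconds between tries.

    `connect` is a hypothetical zero-argument callable supplied by the caller.
    Only OSError (which includes ConnectionError and timeouts) is retried;
    other exceptions propagate so that programming errors are not hidden.
    """
    while True:
        try:
            return connect()
        except OSError:
            sleep(retry_interval)
```

With the 60-second heartbeat, `recovery_interval()` gives 180 seconds, and `recovery_interval(30)` gives the 90 seconds mentioned in the table. `failover_detection_seconds()` gives 15 seconds, which is in line with the 16-second figure quoted for high availability.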
Timeline
Date | Event |
---|---|
Jul 24 2019 | Test cluster environment is available for developers to test |
Oct 20 2019 | Production migration to cluster environment |
Verify your access to the future broker
- Log in to your PRODUCTION server where your application runs and connects to esb.hawaii.edu.
- If you can run a program that connects to a different broker in this production environment without affecting your application's real data, continue. Otherwise, STOP.
- Test a connection to our future broker server using these settings (if possible, DO NOT CONSUME MESSAGES YET):
  - Host: esb-cluster.its.hawaii.edu (temporary name until we switch to esb.hawaii.edu for the upgrade)
  - Port: 5671
  - Vhost: whatever vhost your application currently uses; for those using UHIMS Events, this should be uhims
  - User and password: same as your production application accounts in esb.hawaii.edu
  - Queues: same as your production queues
  - IP addresses: we allowed the same IP addresses you provided for esb.hawaii.edu. Let us know if you need additional IP addresses allowed, ESPECIALLY IF YOU WERE USING esbprod1.pvt.hawaii.edu
- Test consuming messages
  - Once you are connected to the above broker, you can proceed to test consumption of messages.
  - The queues in this broker will become the real queues on the day of the upgrade, so they receive live copies of production messages and will be trimmed to match the point where the old server stopped during the upgrade.
  - DO NOT RUN THESE TESTS AFTER 10/17/2019 4:30 PM. The queues will begin synchronizing in preparation for the upgrade.
  - You have three options for testing message consumption, in order of preference:
    - If possible, consume only one message and disconnect. You are done testing.
    - Otherwise, see if you can consume without ack, then disconnect. This puts the messages back on the queue.
    - If neither of the above is possible, go ahead and consume and ack as few messages as you can.
  - Don't forget to stop any consumers and disconnect from the future broker.
- Do not use esb-cluster.its.hawaii.edu for anything other than these pre-deployment tests! On the day of the upgrade, esb-cluster.its.hawaii.edu may still point to the production server, but the certificate's host name will no longer match because it will change to esb.hawaii.edu.
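
As a first check before involving a RabbitMQ client at all, the connection settings above can be exercised with a plain TLS handshake. The following Python sketch uses only the standard library and is an illustration under stated assumptions, not a prescribed procedure: the function names are hypothetical, and it assumes the broker's certificate chains to a CA already present in the system trust store (otherwise the context would need the CA loaded explicitly). It performs no AMQP login, so it does not touch any queue or consume any message. Unlike the trust-everything manager mentioned in the warning quoted earlier, this context verifies both the certificate and the host name.

```python
import socket
import ssl

BROKER_HOST = "esb-cluster.its.hawaii.edu"  # temporary test name from the steps above
BROKER_PORT = 5671                          # port from the steps above


def make_tls_context():
    """Build a context that verifies the certificate and host name, TLS 1.2 or later."""
    # create_default_context() enables CERT_REQUIRED and host name checking.
    context = ssl.create_default_context()
    # The new cluster no longer accepts TLS 1.0; require 1.2 as a safe floor.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def check_tls(host=BROKER_HOST, port=BROKER_PORT, timeout=10):
    """Open a TLS connection, return the negotiated protocol version, then close."""
    context = make_tls_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.version()
```

Calling `check_tls()` from the production host returns the negotiated protocol version as a string if the IP address is allowed and the certificate verifies. A host name mismatch or an untrusted certificate raises `ssl.SSLCertVerificationError`, and a blocked IP address typically shows up as a timeout or a refused connection.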