How to patch a TMG array: some thoughts on NLB high availability
One of the reasons for using an array is the availability of NLB, which is known to provide fault tolerance and load balancing.
NLB relies on heartbeats to determine whether the cluster nodes are alive. The nodes divide the potential client IP addresses (more precisely, the hashes of those addresses) among themselves and send each other heartbeats to signal that they are up and running.
As soon as a node is down (that is, it fails to send heartbeats), the remaining nodes take ownership of the failed node’s IP hashes, providing coverage for all the clients normally served by the broken node.
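To make the ownership idea concrete, here is a toy Python model of it. It is only a sketch: the bucket count, the hash function, and the names `bucket_for`, `initial_ownership`, and `fail_node` are assumptions made for illustration and do not describe the actual NLB algorithm.

```python
import hashlib

BUCKETS = 60  # illustrative bucket count (an assumption for this sketch)


def bucket_for(client_ip: str) -> int:
    """Map a client IP to a bucket with a stable hash (not NLB's real hash)."""
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    return int.from_bytes(digest[:4], "big") % BUCKETS


def initial_ownership(nodes):
    """Spread all buckets round-robin over the nodes of the array."""
    ordered = sorted(nodes)
    return {b: ordered[b % len(ordered)] for b in range(BUCKETS)}


def fail_node(ownership, failed):
    """Hand the failed node's buckets to the survivors; other buckets stay put."""
    survivors = sorted(set(ownership.values()) - {failed})
    if not survivors:
        raise ValueError("no surviving nodes to take over")
    orphaned = [b for b, node in sorted(ownership.items()) if node == failed]
    updated = dict(ownership)
    for i, bucket in enumerate(orphaned):
        updated[bucket] = survivors[i % len(survivors)]
    return updated


if __name__ == "__main__":
    before = initial_ownership(["tmg1", "tmg2", "tmg3"])
    after = fail_node(before, "tmg2")  # tmg2 stopped sending heartbeats
    for ip in ("10.0.0.7", "10.0.0.8", "10.0.0.9"):
        b = bucket_for(ip)
        print(ip, "bucket", b, ":", before[b], "->", after[b])
```

Note that the model only decides which node handles a client's new traffic. It says nothing about the state of connections that were already open on the failed node.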
How does this all relate to patching?
For new connections, the above behavior is straightforward, but what if we just “unplug” one of our TMG machines? What happens to existing connections served by this box? No other node will be aware of the state of these connections, so essentially they will simply fail.
This is exactly what happens if you patch and reboot one of your TMG nodes without any preparation.
How can we circumvent this? Is there any workaround?
In general, NLB supports what is called “drain mode”.
When you “drainstop” a node, NLB still serves the existing connections owned by that node, but the node no longer accepts new connections. New connections are handled by the other available nodes in the array.
So if you are intentionally taking a node offline, you can use drainstopping to let all the active connections finish before you take the node offline for patching.
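The drainstop behavior can be pictured with a small state model. This is a simplified Python sketch, not NLB code: the `Node` class, its state names, and the `dispatch` helper are hypothetical and exist only to illustrate the rule "keep existing connections, refuse new ones".

```python
class Node:
    """Toy model of one array member with started/draining/stopped states."""

    def __init__(self, name):
        self.name = name
        self.state = "started"
        self.connections = set()

    def accept(self, conn_id):
        """Take a new connection only while the node is fully started."""
        if self.state != "started":
            return False
        self.connections.add(conn_id)
        return True

    def drainstop(self):
        """Stop taking new connections; existing ones keep being served."""
        if self.state == "started":
            self.state = "draining"
            self._stop_if_idle()

    def close(self, conn_id):
        """A client finished; a draining node stops once it is idle."""
        self.connections.discard(conn_id)
        self._stop_if_idle()

    def _stop_if_idle(self):
        if self.state == "draining" and not self.connections:
            self.state = "stopped"


def dispatch(conn_id, nodes):
    """Give a new connection to the first node that accepts it."""
    for node in nodes:
        if node.accept(conn_id):
            return node.name
    return None


if __name__ == "__main__":
    tmg1, tmg2 = Node("tmg1"), Node("tmg2")
    tmg1.accept("client-a")
    tmg1.drainstop()
    print(tmg1.state, dispatch("client-b", [tmg1, tmg2]))
    tmg1.close("client-a")
    print(tmg1.state)
```

In this model a node with a long-lived connection stays in the draining state indefinitely, which is why the session count matters before taking the node offline.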
Therefore, when patching a particular TMG node, ideally you will:
1. drain the node, wait until the session count drops to zero (sessions tab)
2. suspend it (so that NLB won’t be automatically started on next reboot)
3. patch the node
4. make sure the system operates properly with the patch
5. start NLB again to make the node join the array again
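The five steps above can be sketched as a small orchestration routine. This is only an outline in Python under stated assumptions: every callable passed in (`drain`, `session_count`, `suspend`, `apply_patch`, `healthy`, `start_nlb`) is a hypothetical placeholder for whatever you do manually in the TMG Management console or with your own tooling, and the drain timeout is an addition of this sketch, not part of the procedure itself.

```python
import time


def patch_node(node, *, drain, session_count, suspend, apply_patch,
               healthy, start_nlb, drain_timeout=600.0, poll_interval=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Run the drain / suspend / patch / verify / start sequence for one node.

    All callables are hypothetical hooks supplied by the caller.
    Returns True if the node rejoined the array, False if it was left out.
    """
    # Step 1: drain, then wait for the session count to reach zero.
    drain(node)
    deadline = clock() + drain_timeout  # assumed safety limit, not from the article
    while session_count(node) > 0 and clock() < deadline:
        sleep(poll_interval)

    # Step 2: suspend so NLB does not start by itself after the reboot.
    suspend(node)

    # Step 3: install the patch (including any reboot it needs).
    apply_patch(node)

    # Step 4: verify the system before letting clients back in.
    if not healthy(node):
        return False  # keep the node out of the array for investigation

    # Step 5: start NLB so the node joins the array again.
    start_nlb(node)
    return True
```

Patching one node at a time with a routine like this keeps the other array members available to pick up new connections while the node is out.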
Here is a screenshot of the NLB options available in the TMG Management console:
Reference:
http://technet.microsoft.com/en-us/library/cc725691.aspx
Authors
Balint Toth
Support Escalation Engineer
Microsoft CSS Forefront Edge Team
Technical Reviewer
Eric Detoc
Escalation Engineer
Microsoft CSS Forefront Edge Team
Comments
Anonymous
January 01, 2003
Quick information: when you drain-stop NLB on a TMG server and then reboot the server or restart the Firewall service, NLB is started again automatically.
Anonymous
November 16, 2011
The described procedure doesn't work! When draining and stopping, TMG disconnects the IPsec site-to-site tunnel, and it does not reconnect on the remaining node. Additionally, sessions NEVER drop to 0! I tried draining for over a week and, besides TMG's own connections, there were still other old and NEW sessions.
Anonymous
April 18, 2012
Drainstopping is still the best method for intentionally taking an NLB node out of service; however, I have noticed that with persistent TCP connections (such as from Outlook Anywhere / ActiveSync clients), the node will never completely drain. The idea is to allow all non-persistent connections to drain off (which should not take very long), and then you can stop NLB completely. Persistent connections will be cut off, but those clients should attempt to re-establish them and will end up on the other active NLB nodes.
Anonymous
October 28, 2013
That's why you should set the NLB service to suspend before you reboot the server. As the article says, reboot the server, verify that everything looks OK (in eventvwr, etc.), and then start the service.