This is the true story of a Citrix administrator's worst day on the job. It's about how the loss of five Citrix servers in his environment brought the remaining twelve down, why the default load evaluator is to blame, and how you can prevent it from happening in your environment.
...So I get a call from a client asking me to perform a “root cause analysis” as to why their “entire farm” went down all at once. The words “root cause” coming from this client are pretty strange, seeing as how most days any problem they have is solved by a simple reboot of the offending server (but I digress). Anyway, I schedule an onsite visit for the next day and get a conference call going to gather some initial information.
It seems that the day before, the entire farm had become “unavailable” over the course of about 30 minutes. The event that seemed to start it was the failure of a single switch. Of course, the obvious question was, “Is your entire farm on that switch?” No, the farm was actually split across four major switches, with a switch in each rack uplinking to these four. The total number of servers in the farm was 18, with 17 hosting apps and one acting as a dedicated data collector. The 18 servers were split between 5 different racks because the environment had grown over time and it made sense to keep them separate in case of power loss to a rack or a SWITCH FAILURE in a rack.
After showing up on site, looking at the environment, reviewing the performance data on the servers just prior to “The Crash,” and asking a few simple questions, we quickly figured out what had happened. One simple failure and a couple of default settings had brought the entire farm to its knees.
Here's how it went down:
- Switch in rack two fails. This disconnects the users that were attached to Servers 1 through 5. Average user load: 52 users per server.
- These 250-odd users then attempt to reconnect to their applications via the NFuse page.
- The remaining 12 servers in the farm are running at about 51 or 52 users per server.
- The data collector looks at load and begins to distribute users to the server with the lowest load. Server 9 has only 45 users and begins to see a little of the “black hole” effect referenced in this article.
- Server 9’s performance suddenly goes in the toilet. The other remaining servers follow shortly.
- Within 10 minutes these 12 servers each have an extra 20-25 users trying to connect to them (the sketch after this list walks through the math).
- 15 minutes into this a number of servers are no longer sending performance data or are sending data with “gaps” in it where the server seems to pause.
- Users continue to disconnect and reconnect in an attempt to get to a “better server” which only worsens the problem due to the increase in logon and logoff processes.
- 30 minutes into this process the Citrix admin begins disabling logons to the servers, but finds a number of them unresponsive.
- Well, let's just say it's not a pretty afternoon.
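To put rough numbers on it, here's a quick back-of-the-envelope sketch in Python. The per-server figures are the approximate ones from the story above, not exact counts:

```python
# Rough numbers from the incident; all figures are approximate.
failed_servers = 5        # servers lost when the rack switch died
surviving_servers = 12    # app servers left to absorb the reconnects
users_per_server = 52     # average sessions per server before the failure
tested_ceiling = 60       # point where performance was known to degrade

displaced_users = failed_servers * users_per_server        # ~260 sessions
extra_per_server = displaced_users / surviving_servers     # ~22 each
new_load = users_per_server + extra_per_server             # ~74 each

print(f"Displaced users:        {displaced_users}")
print(f"Extra users per server: {extra_per_server:.0f}")
print(f"New load per server:    {new_load:.0f} (tested ceiling: {tested_ceiling})")
```

Every surviving server gets pushed 10-15 sessions past the point where its performance was known to fall apart, and it happens to all of them at once.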
So what happened? Some might say he didn’t have enough servers. But unless you go into DR mode, who has an extra 25-30% capacity lying around? No, they had tested these servers and found that they could go up to 60, maybe 62, users before performance went in the toilet. But 70-75 users per server was just too much.
To understand what happened, you need to look at the Citrix default load evaluator that gets used when you begin to use load balancing with the default configuration. This load evaluator looks at one metric only: the user count on the server. To make matters worse, its default maximum is 100 users, regardless of the hardware you have or the applications you are running. This basically means that the zone data collector (ZDC) thinks the server can still handle new load until it reaches 100 total users. So when this environment lost 5 servers at once, the other 12 were asked to accept a load they could not handle.
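Here's a minimal sketch of how that default rule behaves, assuming the 0-10000 load index scale Citrix uses to report server load to the data collector. The exact scale is incidental; the point is that session count is the only input and “full” means 100 users:

```python
# Minimal model of the default load evaluator: one rule, server user load,
# reporting full load at 100 users. The 0-10000 load index scale is an
# assumption for illustration; what matters is that session count is the
# ONLY input the data collector sees.
FULL_LOAD_INDEX = 10000
DEFAULT_MAX_USERS = 100

def default_load_index(current_users: int) -> int:
    """Load index the ZDC sees for a server under the default evaluator."""
    return min(FULL_LOAD_INDEX,
               int(current_users / DEFAULT_MAX_USERS * FULL_LOAD_INDEX))

# A server already past its real-world ceiling of ~60 users still looks
# only half to three-quarters full, so it keeps receiving new connections.
for users in (52, 62, 73):
    print(f"{users:>3} users -> load index {default_load_index(users):>5} "
          f"of {FULL_LOAD_INDEX}")
```

A server at 73 users, well past its real limit, still reports itself as roughly three-quarters full, so the ZDC keeps handing it connections.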
The cure for this problem is to set up an evaluator based on your maximum user load and possibly a couple of other metrics. If they had set up an evaluator with a max load of 60 or 62 (their tested maximum), the other 12 servers would have stopped accepting connections once they reached that point. Sure, 100 or 120 users would not have been able to connect until the failed servers were brought back up, but that's better than the almost 900 users that were affected for several hours that day.
Personally, I would recommend setting the user load on your new evaluator right at the point where performance starts to degrade, since an extra session or two can sometimes sneak in beyond the load evaluator settings. In addition, I would recommend throwing in some other metrics like memory in use and processor utilization; these are nice generic metrics that almost any environment can use. If you happen to know the bottleneck on your servers, use that metric in your evaluator too. The trick is to keep the remaining productive servers and users online even if you have a massive failure somewhere else.
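As an illustration, here's a simple model of an evaluator along those lines: a user-load ceiling set at the tested limit, plus CPU and memory rules. How Load Manager actually combines multiple rules is its own topic, so treat this as a sketch of the intent (a server stops taking logons as soon as any rule reports full) rather than Citrix's exact algorithm; the thresholds are examples, not recommendations:

```python
# Illustrative sketch, NOT Citrix's exact algorithm: a custom evaluator that
# combines a user-count ceiling with CPU and memory rules, treating a server
# as full when any single rule reports full load.
FULL = 10000

def rule_load(current: float, full_at: float) -> int:
    """Scale one metric onto a 0-10000 load index, capping at full load."""
    return min(FULL, int(current / full_at * FULL))

def server_load(users: int, cpu_pct: float, mem_pct: float,
                max_users: int = 60, max_cpu: float = 90,
                max_mem: float = 90) -> int:
    """Report the worst (highest) load of the three rules."""
    return max(rule_load(users, max_users),
               rule_load(cpu_pct, max_cpu),
               rule_load(mem_pct, max_mem))

# With the ceiling at the tested limit of 60 users, a server at 60 sessions
# (or one that is CPU-bound at fewer sessions) reports full load and the
# data collector stops routing new connections to it.
print(server_load(users=52, cpu_pct=65, mem_pct=70))  # below full: accepts logons
print(server_load(users=60, cpu_pct=75, mem_pct=80))  # full on user count: refuses
print(server_load(users=48, cpu_pct=95, mem_pct=60))  # full on CPU: refuses
```

Tuned this way, the twelve surviving servers would have filled up to their real ceiling and then refused further logons, leaving the displaced users queued at the NFuse page instead of dragging everyone else down with them.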
These guys had run on the default evaluator for almost two years without a problem. Sure, they would lose a server once in a while or roll one or two out for maintenance, but they had never lost five at once. Of course, this situation also could have been avoided if their servers had been patched into more than the single switch in the rack, but then you could always have a power loss in the rack, or, my personal favorite, the time a pipe above a server room broke and poured water directly onto two racks. Just because you make the NICs and power supplies and everything else redundant doesn't mean things can't happen.