How the Default Load Evaluator can Ruin your Day

This is the true story of a Citrix administrator's worst day on the job. It's about how the loss of five Citrix servers in his environment brought the remaining twelve down, why the default load evaluator is to blame, and how you can prevent it from happening in your environment.

...So I get a call from a client asking me to perform a “root cause analysis” of why their “entire farm” went down all at once. The words “root cause” coming from this client are pretty strange, seeing how most days any problem they have is solved by a simple reboot of the offending server (but I digress). Anyway, I schedule the next day to come onsite and get a conference call going to gather some initial information.

It seems that the day before, the entire farm became “unavailable” over the course of about 30 minutes. The event that seemed to start it was the failure of a single switch. Of course the obvious question was, “Is your entire farm on that switch?” No, the farm was actually split across four major switches, with a switch in each rack that uplinks to these four. The total number of servers in the farm was 18, with 17 hosting apps and a dedicated data collector. The 18 servers were split between 5 different racks because the environment had grown over time and it made sense to keep them separate in case of power loss to a rack or a SWITCH FAILURE in a rack.

After showing up on site, looking at the environment, reviewing the performance data on the servers just prior to “The Crash,” and asking a few simple questions, we quickly figured out what had happened. One simple failure and a couple of default settings had brought the entire farm to its knees.

Here's how it went down:

  1. Switch in rack two fails. This disconnects the users that were attached to Servers 1 through 5. Average user load: 52 users per server.
  2. These 250-odd users then attempt to reconnect to their applications via the NFuse page.
  3. The remaining 12 servers in the farm are running at about 51 or 52 users per server.
  4. The data collector looks at load and begins to distribute users to the server with the lowest load. Server 9 has only 45 users and begins to see a little of the black hole effect referenced in this article.
  5. Server 9’s performance suddenly goes in the toilet. The other remaining servers follow shortly.
  6. Within 10 minutes these 12 servers each have an extra 20-25 users trying to connect to them.
  7. Fifteen minutes into this, a number of servers are no longer sending performance data, or are sending data with “gaps” in it where the server seems to pause.
  8. Users continue to disconnect and reconnect in an attempt to get to a “better server,” which only worsens the problem due to the increase in logon and logoff processes.
  9. Thirty minutes into this process the Citrix admin begins to disable logons to the servers but finds a number of them unresponsive.
  10. Well, let's just say it's not a pretty afternoon.
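A quick back-of-the-envelope check, sketched here in Python using the figures from the story, shows why the surviving servers never stood a chance:

```python
# Failure arithmetic from the story above. The figures come from the
# article; the calculation itself is just illustrative.

FAILED_SERVERS = 5
SURVIVING_SERVERS = 12
USERS_PER_FAILED_SERVER = 52      # average load on Servers 1-5
BASELINE_LOAD = 51                # users already on each surviving server
TESTED_MAX = 62                   # point where performance tanked in testing

displaced = FAILED_SERVERS * USERS_PER_FAILED_SERVER   # 260 users
extra_per_server = displaced / SURVIVING_SERVERS       # ~21.7 each
new_load = BASELINE_LOAD + extra_per_server            # ~72.7 users

print(f"Displaced users: {displaced}")
print(f"New load per surviving server: {new_load:.0f}")
print(f"Over the tested maximum by about {new_load - TESTED_MAX:.0f} users")
```

Roughly 73 users per server against a tested ceiling of 60-62: every surviving server was pushed past the point where performance falls apart, all at once.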

So what happened? Some might say he didn't have enough servers. But unless you go into DR mode, who has an extra 25-30% capacity lying around? No, they had tested these servers and found that they could go up to 60, maybe 62, users before performance went in the toilet. But 70-75 users per server was just too much.

To understand what happened you need to look at the default Citrix load evaluator that gets used when you begin load balancing with default configurations. This load evaluator looks at one metric only: the user count on the server. To make matters worse, its default maximum is 100 users, regardless of the hardware you have or the applications you are running. This basically means that the ZDC thinks the server can still handle new load until you reach 100 total users. So when this environment lost 5 servers at once, the other 12 were asked to accept a load they could not handle.
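To make the mechanics concrete, here is a minimal Python sketch of how a user-count-only evaluator behaves. The 0-10000 load index scale and the 100-user default match what's described above; the function names are invented for illustration and this is not Citrix's actual code.

```python
# Sketch (not Citrix code) of a user-count-only load evaluator.

DEFAULT_MAX_USERS = 100

def load_index(user_count, max_users=DEFAULT_MAX_USERS):
    """Scale the session count to a 0 (idle) - 10000 (full) load index."""
    return min(int(user_count / max_users * 10000), 10000)

def pick_server(user_counts):
    """ZDC-style routing: send the new session to the lowest load index."""
    return min(user_counts, key=lambda name: load_index(user_counts[name]))

# Twelve surviving servers at ~51 users each still report only ~5100,
# barely "half full" on the default scale, so the data collector keeps
# handing out connections.
surviving = {f"srv{n}": 51 for n in range(6, 18)}
surviving["srv9"] = 45                         # the "black hole" candidate
print(pick_server(surviving), load_index(45))  # srv9 gets hammered first
```

This is why the farm looked healthy to the ZDC right up until it wasn't: a server drowning at 52 real users still reported a load index of only 5200.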

The cure for this problem is to set up an evaluator based on your maximum user load and possibly a couple of other metrics. If they had set up an evaluator with a max load of 60 or 62 (their tested maximum), the other 12 servers would have stopped accepting connections once they reached that point. Sure, 100 or 120 users would not have been able to connect until the failed servers were brought back up, but that's better than the almost 900 users that were affected that day for several hours.

Personally, I would recommend that you set the user load on your new evaluator right at the point where performance starts to degrade; sometimes an extra session or two can sneak in beyond the load evaluator settings. In addition, I would recommend throwing in some other metrics, like memory in use and processor utilization. These are nice generic metrics that almost any environment can use. If you happen to know the bottleneck on your servers, use that metric in your evaluator too. The trick is to keep the remaining productive servers and users online even if you have a massive failure somewhere else.
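A sketch of what that kind of multi-rule evaluator might look like (illustrative thresholds and invented function names, not Citrix's actual rule engine): the server reports full load as soon as any one rule hits its ceiling, so a tested user maximum acts as a hard stop even when CPU and memory still look healthy.

```python
# Illustrative multi-rule evaluator along the lines recommended above.
# Thresholds (60 users, 85% CPU, 80% memory) are example values; tune
# them to your own tested limits and known bottlenecks.

def composite_load_index(users, cpu_pct, mem_pct,
                         max_users=60, max_cpu=85.0, max_mem=80.0):
    """Report the worst rule on the 0-10000 scale, capped at full load."""
    worst = max(users / max_users, cpu_pct / max_cpu, mem_pct / max_mem)
    return min(int(worst * 10000), 10000)

def accepting_logons(users, cpu_pct, mem_pct):
    """A server at full load (10000) stops taking new connections."""
    return composite_load_index(users, cpu_pct, mem_pct) < 10000

# A server at its tested 60-user ceiling reports full even with idle CPU,
# so the farm stops routing there instead of piling on to 70-75 users.
print(accepting_logons(45, 50.0, 40.0))   # True
print(accepting_logons(60, 15.0, 30.0))   # False
```

The key design choice is taking the worst rule rather than an average: averaging would let a comfortable CPU number mask a server that is already past its tested user ceiling.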

These guys had run on the default evaluator for almost two years without a problem. Sure, they would lose a server once in a while or roll one or two out for maintenance, but they had never lost five at once. Of course, this situation also could have been avoided if their servers had been patched into more than the single switch in the rack, but then you could always have a power loss in the rack, or, my personal favorite, the time a pipe above a server room broke and poured water directly onto two racks. Just because you make the NICs and power supplies and everything else redundant doesn't mean things can't happen.

Join the conversation

This message was originally posted by Marc Palombo on December 10, 2004
Ron, you should have asked Brian Madden for help on this one. He could have told you that eG Innovations has a powerful "root cause analysis" engine and would have found all the problems and kinks in a matter of minutes. You can check out a live demo on Citrix Testdrive: (click on the eG Suite link halfway down the apps page)

Feel free to e-mail me directly if you'd like for us to schedule you to get some agents in-house to try out for free for 30 days. Rob Koury from RapidApp has seen the product at the last iForum show and I've been talking with Mitch Northcutt as well.
This message was originally posted by Ron Oglesby on December 10, 2004
I had the answer in a 30-minute con call and an hour or so on site. It took me more time to write all the docs. Anyway, would the root cause analysis thing have said, "Hey meatballs, why didn't you implement a real load evaluator?" I do that anyway (been doing this a little while), and they just didn't have me build their system. I only get the call when they break it. Don't get me wrong, I love tools, but in this case root cause might have shown the switch failure and the 5 servers dropping, not said, "Hey, if you had set up a real evaluator you would be set; see the QFARM /load output and how the remaining servers are reporting a load index of 5200, yeah that's bad." Ron
This message was originally posted by Marc Palombo on December 10, 2004
Ron, you should ask Brian and Rob for their opinion on the product when you get a chance. We would still appreciate your time to look at the product. You might be right regarding this particular instance; however, I would bet that eG would have "proactively" alerted you to these problems BEFORE they became an issue not afterward. Now, will it respond "hey meat balls" when an alert comes up - I'll have to ask - you might be able to customize the responses with their SDK tool... ;-)

This message was originally posted by Brian Madden on December 10, 2004
Don't get me wrong, I think eG makes a great product, but it would not have proactively notified me that my load evaluators were configured in such a way that my site would fail. Sure, once the switch failed it would be great, and it would proactively tell me "Warning, your other servers suddenly have more users than they usually do." But this failure was so fast that by the time eG notified me it would already be too late. The only way to save this is to configure the load evaluators so that it never happened in the first place.
This message was originally posted by Ron Oglesby on December 10, 2004
I guess that is my question: will it tell me that my servers run at 30 users on average but could handle 120, and that I should adjust my load evaluator?
This message was originally posted by an anonymous visitor on December 10, 2004
Setting load evaluators that include CPU utilization without hard performance numbers on how many total users each server can support under full load could create disastrous results. Idle users don't chew CPU, and load balancing based on CPU would continue to load users onto the server. If all users then got busy, there would be a potential run on resources on the box, causing the same effect described with the default load evaluator.

As an alternative, limits could be placed directly on the number of users per server at the protocol level, leaving the default load evaluator in place.
This message was originally posted by Ron Oglesby on December 11, 2004
But I didn't say just to use proc, did I? I recommended using the user count AND proc and mem. Alternately, you can set it at the protocol level. BUT the issue I have seen with that is that the default evaluator is left in place, all the protocol listeners are set with max sessions, then one gets changed lower (possibly by accident), and since it is a single-server change, the load evaluator continues to route users to that server even though it has hit its incorrectly set max. So if you keep the user max set at the evaluator, you know it's the same on all servers.
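Ron's drift scenario can be sketched in a few lines of Python (all names invented, purely illustrative): the default evaluator routes on its own session counts while each server's listener enforces a separately configured limit, so a mis-set listener becomes a magnet for rejected connections.

```python
# Sketch of the drift problem described above: the evaluator routes on
# its own session counts, while each server's protocol listener enforces
# a separately configured limit. One limit has drifted low.

session_counts = {"srvA": 40, "srvB": 40, "srvC": 35}
listener_limits = {"srvA": 60, "srvB": 60, "srvC": 35}  # srvC mis-set low

def route_one_connection():
    """Evaluator picks the lowest count; the listener may still refuse."""
    target = min(session_counts, key=session_counts.get)
    if session_counts[target] >= listener_limits[target]:
        return (target, "rejected")   # count never rises, so the
                                      # evaluator picks srvC again
    session_counts[target] += 1
    return (target, "connected")

# Every new connection is steered at srvC and bounced off its listener.
print(route_one_connection())   # ('srvC', 'rejected')
print(route_one_connection())   # ('srvC', 'rejected')
```

Because the rejected connections never raise srvC's session count, the evaluator keeps picking it, which is exactly why keeping the user maximum in the evaluator (where it is farm-wide) is safer than scattering it across per-server listener settings.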
This message was originally posted by Jeff Pitsch on December 11, 2004
The problem with setting limits based on user count (even if tested properly) is that all users do not use the system the same way every time. Just because a server can handle 50 connections does not mean that it will always be able to handle 50 users. Having the added assurance of the other evaluators would be the prudent way to go.
Ron and Brian - you were right. Here are the comments from eG's CEO on the above e-mail thread:

"Read through the link. Boy, this is a hard one. The failure in question is a hard failure - the switch went down without a warning (it is possible there was a warning and the customer/Ron didn't even see it). Once the switch went down, it triggered a bunch of reactions. We'd have spotted the reactions and would have provided documented proof of exactly what happened - i.e., you'd see from our graphs the ripple effect that Ron talks about. The root cause of the problem in this case is the switch going down, and the load evaluator being improper is a secondary effect, which would have been reflected in our alerts. To do the analysis that Ron did - to determine that 100 users is the max assumed by the default load evaluator and that this was causing the problem - would still require a human."

I disagree. There is rarely a run on memory except in the event of a memory leak, so I really don't believe that should factor into a load balancer; it should be flushed out during performance testing. CPU resources have runs on them from time to time based on the application, but they are usually short-lived. If it is a longer spike that creates sustained performance issues for other connected users, then AppSense or similar CPU-clamping products are more effective than simply trying to route the users elsewhere. Idle users lying in wait on a server and then all getting busy at the same time, in an environment where you truly have not determined the absolute maximum user connections at full utilization, is just dangerous. What I am saying is that using these three metrics is not a replacement for proper performance testing. A properly tested server deemed able to handle 50 users will always be able to handle 50 users, regardless of how they use the system.
When designing performance tests you will have to make assumptions about the behaviour of the user. These assumptions will not necessarily be applicable to real life.
In my opinion, the load of a box can easily be defined by evaluators based on CPU utilization, memory usage, etc., because to each of those metrics a threshold can be applied as a division between acceptable load and high load.
I agree that a single application (stable, without memory leaks) causes temporary peaks of CPU and memory utilization. But for a sufficiently high number of users (like the previously mentioned 50), the average CPU and memory utilization of ALL applications does not show the erratic behaviour you would expect of a single user or application.

You might be able to deduce from your performance tests that each of your servers will be able to handle a maximum of 50 users in all cases. Firstly, this only applies to behaviour that is comparable to what you assumed in your performance tests. The behaviour of a user might, and probably will, change over time. Secondly, you might even be wasting some of the performance of your servers, because there will be some resources held in reserve for more users up to your theoretical maximum.

I'd feel much safer basing my load balancers on directly measurable metrics instead of assumptions about the users' behaviour.

Here's a suggestion: how about simply turning on the built-in CPU utilization management? You might even see the black hole effect disappear, and perhaps even see the server actually handle the extra user load.

Monitoring CPU utilization as a metric is stupid. The sweet spot for the built-in CPU management is 100%, so monitoring the load below this level is nonsense. You pay for 100% of a CPU, so why can't you use it? Because you have been brainwashed with dumb-ass "Windows" thinking.