If you administer a busy Citrix MetaFrame farm, you probably have certain periods with ten or more fresh logons per minute. During these logon peaks, do you feel confident rebooting or starting up a server?
I know I wouldn’t be, thanks to what I call the “Black Hole Effect.” The Black Hole Effect is what happens when a freshly started server comes online during a busy logon period in a load-balanced environment. Left unchecked, it can have serious consequences for your entire farm.
To understand this problem, let’s look at a sample farm that’s made up of 50 MetaFrame Presentation Servers. Imagine it’s 8:00 AM and your users are getting ready to work, happily logging into the farm and checking their mail while drinking a nice warm cup of coffee. By 8:25, the average load on each server is about 60 users, which is about 80% of the maximum load that would occur that day.
At that moment, your colleague (a fellow farm administrator) remembers that he forgot to power on one of the MetaFrame servers that had been out of service due to a power supply failure. He heads down to the datacenter, turns on the server, and picks up a cup of coffee on the way back to his desk.
He’s back to his desk by 8:28, and he’s ready to log on and check his email just like all of the users.
Twenty seconds after starting his session, he notices that something is not right. Nothing happens after entering his credentials. (By now the logon script should have kicked off.) After another twenty seconds of waiting, your fellow administrator decides to disconnect his session and give it another try. No luck there either—the session freezes right after logon.
Before he can figure out what’s going on, the phone starts ringing. It turns out that he’s not the only one having logon problems. The helpdesk is reporting many logon freezes and users are failing to actually connect or start a session on the farm.
You decide to investigate this problem that’s getting more dramatic every moment. Using the Citrix Management Console, you enable the load evaluator log and notice that all new sessions are being connected to a single server. Your colleague confirms that this server is the one that he just rebooted.
You try disabling logons on that server but the farm is still unresponsive. RDP logons don’t work either, so you have nothing left to do except to shut down that server. You do so, and the farm operation returns to normal.
So what is the lesson here? Should your partner not be allowed to touch anything before he has his morning coffee? Probably so, but more importantly you have experienced the “Black Hole Effect.”
In my opinion, the Black Hole Effect is a justified name for what can happen in these situations. Whether it’s triggered by a newly installed Citrix server or by an unexpected reboot after a good old-fashioned blue screen, the result is just as disastrous if it occurs during a peak logon period.
This black hole effect is by “design.” Ordinarily, Citrix MetaFrame servers are load balanced in such a way that their loads are more-or-less equal at all times. The problem is that if you have a busy farm (like the example that had 60 users on each server), a new server that comes online will have zero users. Therefore, in order for the load-balancing algorithm to keep the server loads equal, it will send the next 60 users to the new server. During busy logon periods this effectively creates an unintentional D.o.S. attack against the new server.
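To make the arithmetic concrete, here is a minimal sketch of that behavior. It assumes a simplified balancer that always sends the next logon to the server with the fewest sessions; the function name `dispatch` is made up for illustration, and the real Load Management algorithm is more involved than this.

```python
def dispatch(loads, new_logons):
    """Send each new logon to the least-loaded server (simplified model).

    Returns how many of the new logons each server received.
    """
    loads = list(loads)
    received = [0] * len(loads)
    for _ in range(new_logons):
        # Pick the index of the server that currently has the fewest sessions.
        target = min(range(len(loads)), key=loads.__getitem__)
        loads[target] += 1
        received[target] += 1
    return received


# 49 busy servers with 60 users each, plus one freshly booted server.
farm = [60] * 49 + [0]
received = dispatch(farm, 60)
print(received[-1])        # 60: every new logon lands on the fresh server
print(sum(received[:-1]))  # 0: the other 49 servers receive nothing
```

In this model the fresh server keeps attracting logons until it has caught up with the rest of the farm, which is exactly the window in which it gets swamped.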
The problem is that regular logons can be pretty demanding on the server’s resources. Just think of everything that takes place:
- The session is initialized
- Another listener is created to facilitate the next session
- The profile is loaded
- Group policies are applied
- The logon script configures all kinds of settings
- The application or desktop is started
While a few sessions starting simultaneously normally pose no problem for a server, it is a very different story when more than ten sessions connect in a very short period.
The problem is magnified by the fact that Citrix’s session-based load managing algorithms only count active sessions. Disconnected sessions have no influence on the reported server load. When a server becomes unresponsive, frustrated users tend to disconnect the session, meaning that the initialized session is not accounted for in load management. The user load officially stays low and new sessions keep being directed to the already overloaded server.
Until the server is configured so that logons are disabled, it will continue to accept new sessions and frustrated users disconnect to retry their logon attempt. The server “sucks” in new sessions like a black hole, and your farm seems unavailable to all new users. (Of course other users already working on other servers experience no problems.) In some cases, the new server can become so overloaded that its IMA service stops responding, effectively meaning that server is out of commission until it’s powered down.
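The feedback loop can be illustrated with a toy model. This is only a sketch under loud assumptions: every session that lands on the overloaded server hangs, every frustrated user disconnects and leaves the session behind, and the balancer stops sending users once the reported load reaches a hypothetical `user_limit`. None of these names come from Citrix.

```python
def sessions_piled_up(attempts, user_limit, count_disconnected):
    """Return how many sessions end up on the overloaded server (toy model)."""
    active = 0
    disconnected = 0
    for _ in range(attempts):
        reported = active + (disconnected if count_disconnected else 0)
        if reported >= user_limit:
            break  # the balancer considers the server full and stops sending users
        active += 1        # a new session lands on the server and hangs
        active -= 1        # the frustrated user disconnects...
        disconnected += 1  # ...but the session stays behind on the server
    return active + disconnected


print(sessions_piled_up(200, 75, count_disconnected=False))  # 200
print(sessions_piled_up(200, 75, count_disconnected=True))   # 75
```

When only active sessions count, the reported load never rises and every attempt is swallowed. Counting disconnected sessions as well would cap the pile-up at the limit.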
A Load Evaluator-Based Solution?
The first and simplest thing that most people try is to configure a “fail-safe” load evaluator in Load Management. In addition to the usual “Server User Load” evaluator, you can configure the “CPU Utilization” rule in the CMC to have minimum and maximum values that are only one percentage point apart. (For example, 93 and 94.) If the CPU load is 93% or lower, the “CPU Utilization” load evaluator will not report any load and will not influence load management statistics. When the CPU utilization hits 94%, this rule kicks in and immediately reports a full load (a load index of 10000). This will cause load management to stop directing new connections to the server until the average load drops below 93% again.
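As a rough illustration of such a fail-safe rule, the sketch below maps a CPU percentage to a load index. The function name `cpu_rule_load` is hypothetical, and the linear ramp between the two thresholds is an assumption; with thresholds only one point apart the ramp hardly matters.

```python
FULL_LOAD = 10000  # the load index that means "send no more users here"


def cpu_rule_load(cpu_percent, low=93, high=94):
    """Return the load index reported by a fail-safe CPU rule (sketch)."""
    if cpu_percent <= low:
        return 0  # at or below the lower limit, the rule reports no load
    if cpu_percent >= high:
        return FULL_LOAD  # at or above the upper limit, it reports full load
    # Assumed behavior between the limits: a linear ramp.
    return round(FULL_LOAD * (cpu_percent - low) / (high - low))


print(cpu_rule_load(50))  # 0
print(cpu_rule_load(93))  # 0
print(cpu_rule_load(94))  # 10000
```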
While this is a great way to configure your rules in general, this specific configuration won’t fix the Black Hole Effect. The problem is that the “CPU Utilization” load evaluator calculates CPU load based on an average of ten 30-second sampling intervals, so a full load will not be reported until two or three minutes have passed. That is more than enough time for a freshly booted server to be swamped by users during a logon peak.
It’s also possible to configure other rules (like the context switches) as a failsafe. To do this, you configure the lower limit (where no load is reported) above normal operations and the upper limit (when full load is reported) just one notch higher. By doing so, this rule will not influence normal operations but will “fire” when the limit has been reached, reporting full load and preventing additional users from starting new sessions on that server.
However, this “circuit breaker” type of rule configuration still will not counteract the Black Hole Effect. To completely eliminate the Black Hole Effect, Citrix could (or “should”) create a new rule called “Maximum New Logons” that could be used to throttle the rate of new logons. (Ideally, this rule would allow you to specify that X sessions may connect within Y seconds.)
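Such a rule does not exist in Load Management, but the idea of “X sessions within Y seconds” can be sketched as a sliding-window counter. The class name `MaxNewLogons` and its method are purely hypothetical, and timestamps are passed in as plain numbers of seconds to keep the example deterministic.

```python
from collections import deque


class MaxNewLogons:
    """Allow at most max_sessions new logons per window_seconds (sketch)."""

    def __init__(self, max_sessions, window_seconds):
        self.max_sessions = max_sessions
        self.window_seconds = window_seconds
        self._stamps = deque()  # timestamps of recently accepted logons

    def allow(self, now):
        # Forget logons that have fallen out of the sliding window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_sessions:
            return False  # too many recent logons: refuse this one
        self._stamps.append(now)
        return True


rule = MaxNewLogons(max_sessions=3, window_seconds=10)
print([rule.allow(t) for t in (0, 1, 2, 3)])  # [True, True, True, False]
print(rule.allow(10))                         # True: the logon at t=0 has expired
```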
It would be nice if Citrix also created a new rule that allowed you to load balance your servers based on all sessions—active and disconnected—in addition to the current rule that counts only active sessions.
DADE Power Tools
Since Citrix’s current version of Load Management provides no solution, I’ve typically implemented a simple startup script that counts new logons in a 30-second timeframe. This “Logon Throttle” script runs for 30 minutes whenever a server is booted. When the logon load is too high, it puts the server “on hold” by disabling logons. The script then re-enables logons after the server has digested these logons properly. The script is only intended to act as a failsafe device: under normal conditions it never puts a server on hold.
The script prevents the Black Hole Effect by temporarily disabling the logons. While effective, this was a somewhat crude method. Disabling logons on the server prevented users from reconnecting to disconnected sessions and it prevented administrators from connecting even via RDP. However, this was the only option we had.
I am not a programmer. Fortunately, my respected coworkers Daniel Nikolic and Dennis Damen have some serious programming skills. They converted my script into two services (called DADE Power Tools) which can be easily controlled with a few registry settings. They’ve also provided an ADM policy template so that these settings can be centrally administered with group policy objects.
The DADE Power Tools are completely free, and they include two services: LogonThrottler and BlackHole Protector.
The LogonThrottler Service
The LogonThrottler service keeps track of how many logons (configurable) occur within a certain timeframe (also configurable). When the logons exceed the specified logon rate, the service protects the server by disabling logons. The service disables logons with the “change logon /disable” command.
The LogonThrottler settings are stored in the registry in the following location: HKLM\SOFTWARE\DADE\LogonThrottler. It makes use of several values:
The PollInterval value determines the length of time between polls, in seconds. The default value is 5 seconds, which should be enough for most environments.
The PollHistoryCount value specifies the number of polls that are used to calculate the logon rate. The default value is 12 polls.
The total timeframe in which the server is monitored is PollHistoryCount x PollInterval. With the default values, the total timeframe is 12 x 5 = 60 seconds, so the LogonThrottler service monitors how many logons have occurred in the last 60 seconds.
The MaxLogons value allows you to specify how many new sessions you find acceptable within the timeframe. The default settings allow a maximum of 8 new sessions within 60 seconds. When the logon rate exceeds 8 sessions within a minute, the LogonThrottler service disables logons by executing the “change logon /disable” command.
Since logoffs also consume resources, they can be added to the counted logons. The LogoffFactor value determines how heavily a logoff is weighted. The value has a minimum of 0 and a maximum of 200%. The default value is 50%, which means that each logoff counts as half a logon. For example, when 7 logons and 6 logoffs occur within 60 seconds, 50% of the 6 logoffs is counted as 3 logons. Added to the 7 logons, that gives a count of 10 new sessions, so the LogonThrottler will pause the server for 200 seconds (10 x 20, at the default SleepDuration of 20 seconds per session). When the LogoffFactor value is configured as 0, logoffs have no influence on the LogonThrottler.
The value of SleepDuration determines how long the server will pause per new session. If 11 sessions were started within 60 seconds, the LogonThrottler service will disable logons for 11 x 20 = 220 seconds. After 220 seconds have passed, the LogonThrottler service will re-enable logons. The default value of SleepDuration is 20 seconds.
The MaxRuntime value specifies the number of seconds that the LogonThrottler service will be active. For example, if you only want to throttle logons for 15 minutes after a reboot, you would configure this value to be 900. When MaxRuntime has a value of zero (default), the LogonThrottler service will always be monitoring the server.
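The way these values interact can be sketched in a few lines. This is not the DADE source code, only a model of the arithmetic described above: the Python names are made up for illustration, the defaults are the ones quoted in the text, and how the real service handles fractional counts is unknown, so the sketch simply keeps them as fractions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThrottleSettings:
    poll_interval: int = 5        # PollInterval, seconds between polls
    poll_history_count: int = 12  # PollHistoryCount, polls kept in the window
    max_logons: int = 8           # MaxLogons, sessions tolerated per window
    logoff_factor: int = 50       # LogoffFactor, percent weight of a logoff
    sleep_duration: int = 20      # SleepDuration, pause seconds per session

    @property
    def window_seconds(self):
        # Monitored timeframe = PollHistoryCount x PollInterval.
        return self.poll_history_count * self.poll_interval


def counted_sessions(logons, logoffs, settings):
    """Logons plus logoffs weighted by LogoffFactor."""
    return logons + logoffs * settings.logoff_factor / 100


def pause_seconds(logons, logoffs, settings=ThrottleSettings()):
    """Seconds to keep logons disabled, or 0 if the rate is acceptable."""
    counted = counted_sessions(logons, logoffs, settings)
    if counted <= settings.max_logons:
        return 0
    return counted * settings.sleep_duration


print(ThrottleSettings().window_seconds)  # 60
print(pause_seconds(7, 6))                # 200.0 (7 logons + 3 weighted logoffs)
print(pause_seconds(11, 0))               # 220.0
```

The same 7 logons and 6 logoffs with LogoffFactor set to 0 would count as only 7 sessions, which stays under MaxLogons and triggers no pause.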
The BlackHole Protector Service
The BlackHole Protector service can be considered an “emergency brake” for Terminal Servers and MetaFrame Presentation Servers. This service keeps track of all RDP and ICA sessions, both disconnected and active. It allows you to limit the total number of sessions possible on a server. Especially in twin-datacenter environments, Citrix servers run the risk of being overloaded when one of the datacenters fails. By default, this service is set to the manual startup type, so you need to explicitly enable the BlackHole Protector service by setting its startup type to automatic.
The BlackHole Protector settings are stored in the registry under the following key: HKLM\SOFTWARE\DADE\BlackHole
MaxSession tells the BlackHole Protector how many sessions in total (ICA and RDP, active and disconnected) are allowed on the server. Logons are disabled when this number is reached. You should configure this value to be 25% above the value of the Server User Load evaluator.
ReleaseLevel determines when the BlackHole protector re-enables logons.
This setting determines the length of time between each poll in seconds. The default value is 15 seconds.
The MaxRuntime value specifies the number of seconds that the BlackHole Protector service will run. When MaxRuntime has a value of zero (default), the BlackHole Protector service will always be monitoring the server.
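The “emergency brake” behavior can be modeled as a simple on/off switch with two thresholds. This is a sketch, not the service’s actual logic: it assumes that ReleaseLevel is a session count at or below which logons are re-enabled, which the description above does not spell out, and both function names are hypothetical.

```python
def suggested_max_session(server_user_load):
    """MaxSession set 25% above the Server User Load value (integer math)."""
    return server_user_load * 125 // 100


def next_logon_state(total_sessions, logons_enabled, max_session, release_level):
    """Return whether logons should be enabled after this poll (sketch).

    total_sessions counts ICA and RDP sessions, active and disconnected.
    """
    if logons_enabled and total_sessions >= max_session:
        return False  # brake: the session ceiling has been reached
    if not logons_enabled and total_sessions <= release_level:
        return True   # assumed release rule: sessions dropped far enough
    return logons_enabled


limit = suggested_max_session(80)
print(limit)  # 100

enabled = True
history = []
for sessions in (99, 100, 90, 80):
    enabled = next_logon_state(sessions, enabled, limit, release_level=80)
    history.append(enabled)
print(history)  # [True, False, False, True]
```

Keeping the release threshold below the ceiling stops the server from flapping between enabled and disabled logons when the session count hovers around the limit.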
For the latest information and discussion about the DADE Power Tools and other tools like the Flex Profile Kit, check: