Dealing with the "Black Hole Effect": Throttling Logons to New Servers

If you administer a busy Citrix MetaFrame farm, you probably have certain periods with ten or more fresh logons per minute. During these logon peaks, do you feel confident rebooting or starting up a server?

I know I wouldn’t, thanks to what I call the “Black Hole Effect.” The Black Hole Effect is what happens when a fresh server comes online during a busy logon period in a load-balanced environment. Left unchecked, it can have serious consequences for your entire farm.

To understand this problem, let’s look at a sample farm that’s made up of 50 MetaFrame Presentation Servers. Imagine it’s 8:00 AM and your users are getting ready to work, happily logging into the farm and checking their mail while drinking a nice warm cup of coffee. By 8:25, the average load on each server is about 60 users, which is about 80% of the maximum load that would occur that day.

At that moment, your colleague (a fellow farm administrator) remembers that he forgot to power on one of the MetaFrame servers that had been out of service due to a power supply failure. He heads down to the datacenter, turns on the server, and picks up a cup of coffee on the way back to his desk.

He’s back at his desk by 8:28, and he’s ready to log on and check his email just like all of the users.

Twenty seconds after starting his session, he notices that something is not right. Nothing happens after he enters his credentials. (By now the logon script should have kicked off.) After another twenty seconds of waiting, your fellow administrator decides to disconnect his session and give it another try. No luck there either: the session freezes right after logon.

Before he can figure out what’s going on, the phone starts ringing. It turns out that he’s not the only one having logon problems. The helpdesk is reporting many logon freezes and users are failing to actually connect or start a session on the farm.

You decide to investigate this problem that’s getting more dramatic every moment. Using the Citrix Management Console, you enable the load evaluator log and notice that all new sessions are being connected to a single server. Your colleague confirms that this server is the one that he just powered on.

You try disabling logons on that server but the farm is still unresponsive. RDP logons don’t work either, so you have nothing left to do except to shut down that server. You do so, and the farm operation returns to normal.

So what is the lesson here? Should your partner not be allowed to touch anything before he has his morning coffee? Probably so, but more importantly you have experienced the “Black Hole Effect.”

In my opinion, the Black Hole Effect is a justified name for what can happen in these situations. Whether it’s caused by a newly installed Citrix server or by an unexpected reboot after a good old-fashioned blue screen, the result can be equally disastrous if it occurs during a peak logon period.

The Black Hole Effect is “by design.” Ordinarily, Citrix MetaFrame servers are load balanced in such a way that their loads are more or less equal at all times. The problem is that if you have a busy farm (like the example that had 60 users on each server), a new server that comes online will have zero users. Therefore, in order for the load-balancing algorithm to keep the server loads equal, it will send the next 60 users to the new server. During busy logon periods this effectively creates an unintentional DoS attack against the new server.
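To make the arithmetic concrete, here is a minimal Python sketch of a purely session-count-based "least loaded server wins" policy. It is an illustration under assumptions, not Citrix's actual algorithm: the function name `route_logons` is hypothetical, and the farm is modeled as 49 busy servers plus the one that was just powered on.

```python
def route_logons(loads, new_logons):
    """Send each new logon to the server that currently has the fewest sessions."""
    loads = list(loads)  # work on a copy so the caller's list is untouched
    placements = []
    for _ in range(new_logons):
        target = min(range(len(loads)), key=lambda i: loads[i])
        loads[target] += 1
        placements.append(target)
    return loads, placements


# Assumed farm: 49 servers with 60 users each, plus a fresh server (index 49) with none.
farm = [60] * 49 + [0]
final_loads, placements = route_logons(farm, 60)

print(sorted(set(placements)))  # which servers received the new logons
print(final_loads[49])          # sessions now on the fresh server
```

In this model the fresh server receives every one of the next 60 logons before any other server is considered again.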

The problem is that regular logons can be pretty demanding on the server’s resources. Just think of everything that takes place:

  • The session is initialized
  • Another listener is created to facilitate the next session
  • The profile is loaded
  • Group policies are applied
  • The logon script configures all kinds of settings
  • The application or desktop is started

While a few sessions starting simultaneously normally pose no problem for a server, it is a very different story when more than ten sessions connect in a very short period.

The problem is magnified by the fact that Citrix’s session-based load managing algorithms only count active sessions. Disconnected sessions have no influence on the reported server load. When a server becomes unresponsive, frustrated users tend to disconnect the session, meaning that the initialized session is not accounted for in load management. The user load officially stays low and new sessions keep being directed to the already overloaded server.
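The gap between what load management sees and what the server is actually carrying can be sketched in a few lines of Python. The session states and the numbers below are made up for illustration, and the function names are hypothetical.

```python
def reported_load(session_states):
    """Count sessions the way the article describes: active sessions only."""
    return sum(1 for state in session_states if state == "active")


def actual_sessions(session_states):
    """Count every initialized session, active or disconnected."""
    return len(session_states)


# Assumed scenario: 30 users hit the fresh server, 25 gave up and disconnected.
sessions = ["active"] * 5 + ["disconnected"] * 25

print(reported_load(sessions))    # what load management balances on
print(actual_sessions(sessions))  # what the server really has to process
```

With these assumed numbers the server reports 5 users while holding 30 sessions, so it still looks like the least-loaded server in the farm.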

Until the server is configured so that logons are disabled, it will continue to accept new sessions while frustrated users disconnect and retry their logon attempts. The server “sucks” in new sessions like a black hole, and your farm seems unavailable to all new users. (Of course other users already working on other servers experience no problems.) In some cases, the new server can become so overloaded that its IMA service stops responding, effectively meaning that the server is out of commission until it’s powered down.

A Load Evaluator-Based Solution?

The first and simplest thing that most people try is to configure a “fail-safe” load evaluator in Load Management. In addition to the usual “Server User Load” evaluator, you can configure the “CPU Utilization” rule in the CMC to have minimum and maximum values that are only one percentage point apart (for example, 93 and 94). If the CPU load is 93% or lower, the “CPU Utilization” load evaluator will not report any load and will not influence load management statistics. When the CPU utilization hits 94%, this rule kicks in and immediately reports a full load (a load index of 10000). This will cause load management to stop directing new connections to the server until the average load drops back to 93% or below.
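The “circuit breaker” behavior of such a rule can be sketched as a simple threshold function. This is only an illustration: the function name is hypothetical, and the linear ramp between the two thresholds is my assumption (with thresholds one point apart it hardly matters).

```python
FULL_LOAD = 10000  # load index that means "send no more users here"


def cpu_rule_load(cpu_percent, no_load_at=93, full_load_at=94):
    """Map an averaged CPU percentage to a load index for a fail-safe rule."""
    if cpu_percent <= no_load_at:
        return 0           # rule stays silent during normal operation
    if cpu_percent >= full_load_at:
        return FULL_LOAD   # rule "fires" and reports a full load
    # Assumption: linear ramp between the two thresholds.
    span = full_load_at - no_load_at
    return round(FULL_LOAD * (cpu_percent - no_load_at) / span)


for cpu in (50, 93, 94, 100):
    print(cpu, cpu_rule_load(cpu))
```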

While this is a great way to configure your rules in general, this specific configuration won’t fix the Black Hole Effect. The problem is that the “CPU Utilization” load evaluator calculates CPU load based on an average of ten 30-second sampling intervals. A full load will not be reported until after two or three minutes. This is more than enough time for a freshly booted server to be swamped by users during that logon peak.
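A rough Python sketch shows why an averaged rule reacts late. It assumes a plain moving average over the last ten 30-second samples and a CPU that jumps from a steady baseline to a steady spike; the real evaluator may weight or sample differently, so treat the numbers as illustrative only. The function name is hypothetical.

```python
from collections import deque


def seconds_until_full_load(baseline, spike, threshold=94, samples=10, interval=30):
    """Return how long a moving average takes to reach the threshold, or None."""
    window = deque([baseline] * samples, maxlen=samples)
    elapsed = 0
    while sum(window) / samples < threshold:
        window.append(spike)   # the oldest sample drops out automatically
        elapsed += interval
        if elapsed > samples * interval:
            return None        # window is all spike and still below threshold
    return elapsed


print(seconds_until_full_load(baseline=85, spike=100))
print(seconds_until_full_load(baseline=20, spike=100))
```

Under these assumptions the delay is measured in minutes rather than seconds, and it gets longer the lower the starting CPU level is, which is exactly the situation on a freshly booted server.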

It’s also possible to configure other rules (like the context switches) as a failsafe. To do this, you configure the lower limit (where no load is reported) above normal operations and the upper limit (when full load is reported) just one notch higher. By doing so, this rule will not influence normal operations but will “fire” when the limit has been reached, reporting full load and preventing additional users from starting new sessions on that server.

However, this “circuit breaker” type of rule configuration still will not counteract the Black Hole Effect. To completely eliminate the Black Hole Effect, Citrix could (or “should”) create a new rule called “Maximum New Logons” that could be used to throttle the rate of new logons. (Ideally, this rule would allow you to specify that X number of sessions may connect within Y seconds.)
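No such rule exists in Load Management, so the following is purely a sketch of what “X sessions within Y seconds” could look like as a sliding-window check. The class name `MaxNewLogonsRule` and its interface are hypothetical.

```python
from collections import deque


class MaxNewLogonsRule:
    """Allow at most max_logons new sessions within any window_seconds span."""

    def __init__(self, max_logons, window_seconds):
        self.max_logons = max_logons
        self.window_seconds = window_seconds
        self._times = deque()  # timestamps of recently accepted logons

    def allow(self, now):
        """Return True and record the logon if it fits in the current window."""
        # Forget logons that have aged out of the window.
        while self._times and now - self._times[0] >= self.window_seconds:
            self._times.popleft()
        if len(self._times) >= self.max_logons:
            return False
        self._times.append(now)
        return True


rule = MaxNewLogonsRule(max_logons=3, window_seconds=30)
results = [rule.allow(t) for t in (0, 5, 10, 15, 31)]
print(results)
```

In this run the fourth logon (at 15 seconds) is refused, and the fifth (at 31 seconds) is accepted because the first one has aged out of the window.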

It would be nice if Citrix also created a new rule that allowed you to load balance your servers based on all sessions—active and disconnected—in addition to the current rule that counts only active sessions.

DADE Power Tools

Since Citrix’s current version of Load Management provides no solution, I typically implemented a simple startup script that counted new logons in a 30-second timeframe. This “Logon Throttle” script ran for 30 minutes whenever a server was booted. When the logon load was too high, it put the server “on hold” by disabling logons. The script then re-enabled logons after the server had digested these logons properly. This script was only intended to act as a failsafe device, so under normal conditions it would never put a server on hold.

The script prevented the Black Hole Effect by temporarily disabling logons. While effective, this was a somewhat crude method. Disabling logons on the server prevented users from reconnecting to disconnected sessions, and it prevented administrators from connecting even via RDP. However, this was the only option we had.

I am not a programmer. Fortunately, my respected coworkers Daniel Nikolic and Dennis Damen have some serious programming skills. They converted my script into two services (called DADE Power Tools) which can be easily controlled with a few registry settings. They’ve also provided an ADM policy template so that these settings can be centrally administered with group policy objects.

The DADE Power Tools are completely free, and they include two services: LogonThrottler and BlackHole Protector.

The LogonThrottler Service

The LogonThrottler service keeps track of how many logons (configurable) occur within a certain timeframe (also configurable). When the logons exceed the specified logon rate, the service protects the server by disabling logons. The service disables logons with the “change logon /disable” command.

The LogonThrottler settings are stored in the registry in the following location: HKLM\SOFTWARE\DADE\LogonThrottler. It makes use of several values:

"PollInterval"="5"
This setting determines the length of time between each poll in seconds. The default value is 5 seconds, which should be enough for most environments.

"PollHistoryCount"="12"
The PollHistoryCount value specifies the number of polls that are used to calculate the logon rate. The default value is 12 polls.

The total timeframe in which the server is monitored is PollHistoryCount x PollInterval. With the default values, the total timeframe is 12 x 5 = 60 seconds. The LogonThrottler service will monitor how many logons have occurred in the last 60 seconds.

"MaxLogons"="8"
The MaxLogons value allows you to specify how many new sessions you find acceptable within the timeframe. The default settings allow a maximum of 8 new sessions within 60 seconds. When the logon rate exceeds 8 sessions within a minute, the LogonThrottler service disables logons by executing the “change logon /disable” command.

"LogoffFactor"="50"
Since logoffs also consume resources, they can be added to the counted logons. The LogoffFactor value determines how heavily a logoff is weighted. It is a percentage with a minimum of 0 and a maximum of 200. The default value is 50, which means that each logoff counts as half a logon. For example, when 7 logons and 6 logoffs occur within 60 seconds, 50% of the 6 logoffs is counted as 3 logons. Added to the 7 logons, this gives a count of 10 new sessions, which exceeds MaxLogons, so the LogonThrottler service pauses the server for 200 seconds (10 x 20, where 20 is the SleepDuration value described below). When LogoffFactor is set to 0, logoffs have no influence on the LogonThrottler.

"SleepDuration"="20"
The value of SleepDuration determines how long the server will pause per new session. If 11 sessions were started within 60 seconds, the LogonThrottler service will disable logons for 11 x 20 = 220 seconds. After 220 seconds have passed, the LogonThrottler service will re-enable logons. The default value of SleepDuration is 20 seconds.

"MaxRuntime"="0"
The MaxRuntime value specifies the number of seconds that the LogonThrottler service will be active. For example, if you only want to throttle logons for 15 minutes after a reboot, you would configure this value to be 900. When MaxRuntime has a value of zero (default), the LogonThrottler service will always be monitoring the server.
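The arithmetic behind these settings can be sketched in Python. This is not the DADE source code, just a reconstruction of the calculations described above; the function names are hypothetical, and how the real service rounds fractional counts is not documented here, so the sketch simply keeps the fraction.

```python
def monitoring_window(poll_interval=5, poll_history_count=12):
    """Length of the monitored timeframe in seconds."""
    return poll_interval * poll_history_count


def weighted_logons(logons, logoffs, logoff_factor=50):
    """Logons plus logoffs weighted by LogoffFactor (a percentage, 0 to 200)."""
    return logons + logoffs * logoff_factor / 100


def pause_seconds(logons, logoffs=0, max_logons=8, logoff_factor=50, sleep_duration=20):
    """Seconds to keep logons disabled; 0 means the server is not paused."""
    count = weighted_logons(logons, logoffs, logoff_factor)
    if count <= max_logons:
        return 0
    return count * sleep_duration


print(monitoring_window())   # default timeframe in seconds
print(pause_seconds(7, 6))   # the 7 logons + 6 logoffs example
print(pause_seconds(11))     # the 11 logons example
print(pause_seconds(8))      # at the limit, not over it
```

With the default values this reproduces the figures from the text: a 60-second window, a 200-second pause for 7 logons plus 6 logoffs, and a 220-second pause for 11 logons.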

The BlackHole Protector Service

The BlackHole Protector service can be considered an “emergency brake” for Terminal Servers and MetaFrame Presentation Servers. This service keeps track of all RDP and ICA sessions, both active and disconnected, and lets you limit the total number of sessions on a server. This is especially useful in twin-datacenter environments, where Citrix servers run the risk of being overloaded when one of the datacenters fails. By default, this service is set to the manual startup type, so you need to explicitly enable the BlackHole Protector service and set its startup type to automatic.

The BlackHole Protector settings are stored in the registry under the following key: HKLM\SOFTWARE\DADE\BlackHole

"MaxSessions"="140"
MaxSessions tells the BlackHole Protector how many sessions in total (ICA and RDP, active and disconnected) are allowed on the server. Logons are disabled when this number is reached. You should configure this value to be 25% above the value of the Server User Load evaluator.

"ReleaseLevel"="120"
ReleaseLevel determines when the BlackHole Protector re-enables logons.

"PollInterval"="15"
This setting determines the length of time between each poll in seconds. The default value is 15 seconds.

"MaxRuntime"="0"
This value specifies the number of seconds that the BlackHole Protector service will run. When MaxRuntime has a value of zero (default), the BlackHole Protector service will always be monitoring the server.
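Taken together, MaxSessions and ReleaseLevel describe a simple on/off switch with a gap between the two thresholds. The Python sketch below illustrates that idea. It is not the DADE implementation: the function names are hypothetical, I am assuming that logons are re-enabled once the session count drops to ReleaseLevel or below, and the user load limit of 112 is a made-up value chosen only because 25% above it is the default of 140.

```python
def recommended_max_sessions(server_user_load_limit):
    """MaxSessions at 25% above the Server User Load limit, per the advice above."""
    return round(server_user_load_limit * 1.25)


def logons_enabled(total_sessions, currently_enabled, max_sessions=140, release_level=120):
    """Decide whether logons should be enabled after one poll."""
    if currently_enabled and total_sessions >= max_sessions:
        return False   # emergency brake: limit reached
    if not currently_enabled and total_sessions <= release_level:
        return True    # assumption: release at or below ReleaseLevel
    return currently_enabled


print(recommended_max_sessions(112))

enabled = True
history = []
for total in (100, 139, 140, 130, 121, 120, 125):  # one value per poll
    enabled = logons_enabled(total, enabled)
    history.append(enabled)
print(history)
```

The gap between the two values keeps the service from flapping between enabled and disabled when the session count hovers around the limit.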

For the latest information and discussion about the DADE Power Tools and other tools like the Flex Profile Kit, check:
http://portal.loginconsultants.nl/forum/index.php?board=16


 


DADEPowerTools.zip

18 comments

This message was originally posted by French Reader on October 4, 2004
This message was originally posted by Gabe Knuth on October 4, 2004
Thanks Jeroen!
This message was originally posted by Jeppe on October 6, 2004
CTX104926 - Data Collector resolves Load Balance requests to a Single Server
Resolution
Citrix has added a time-out that can be set through the registry to stop this issue from occurring. If the server is not responding to a ticket request, the data collector is notified and updates the server load to maximum. The fix prevents any new ticket requests to that server. The unresponsive server is notified to stop its own IMA service. This prevents an unresponsive least loaded server from causing a farm to become unresponsive.
The fix is contained in the following hotfixes:
• For servers running MetaFrame Presentation Server 3.0 and Windows Server 2000, install Hotfix MPSE300W2K013 on all servers in the farm
• For servers running MetaFrame XP 1.0 Feature Release 3 and Windows Server 2000, install Hotfix XE103W2K125 on all servers in the farm
• For servers running MetaFrame XP 1.0 Feature Release 3 and Windows Server 2003, install Hotfix XE103W2K3029 on all servers in the farm
This message was originally posted by Jeppe on October 6, 2004
If the Citrix solution works, it's still a great job, Jeroen... ;-)
This message was originally posted by an anonymous visitor on October 5, 2004
Brian. If this is the type of stuff you teach in your training class I can't wait to attend! I'm sure Citrix will add this feature but never admit you gave them the idea. That's OK, we all know where to check for some answers.
This message was originally posted by Jeroen on October 6, 2004
Thanks all, but Daniel Nikolic & Dennis Damen deserve most of the credit here. They invested a lot of time in developing and testing the DADE Power Tools.

This message was originally posted by Jeroen on October 6, 2004
About CTX 104926, it looks really promising but does not prevent a server from being overloaded or the Black Hole Effect from happening. The fix does not throttle the logons.

The fix prevents IMA from sending new sessions to an already overloaded and unresponsive server, which makes this a really essential fix, whatever the circumstances. Just like a CPU "Circuit Breaker" load evaluator, it does help prevent a server from sucking in new sessions until an administrator manually intervenes.

The problem with the black hole effect is that it can easily happen with an overloaded but still responsive server, since load management works only with active sessions and does not account for disconnected sessions.
This message was originally posted by Jeff on October 21, 2004
Does this solution work on 2003 servers?
This message was originally posted by an anonymous visitor on October 21, 2004

Has anyone seen this happen out of the blue? I have a Citrix box that takes about 20 sessions, then people start having the problems described above. I reboot all my Citrix boxes every night. I also have one server that will not take a single logon and throws errors similar to those above. Thank you
Rlevandoski@aisequip.com

Hi
Check this
http://support.citrix.com/article/CTX108549
 
1. Incoming connections to a load balanced farm fail to fully connect. This occurs when a new server comes online during peak logon periods. At that time, the load balancer is sending all incoming connections to the new server, essentially overwhelming and preventing it from fully updating its server load.
With this Slow-Start Load Balancing fix, logons are given a logarithmic load bias during connection time to limit the number of simultaneous logon requests. This biasing level is used in conjunction with the server’s “real” load to route connections to the least loaded server. This allows time for servers to gradually increment the number of connections in environments where the server load is well below the farm average load, which is often the case when you restart a server during the work day.

Note: To resolve the issue in its entirety, you must install this hotfix on both the zone master and all member servers of the farm.
How To Use Slow-Start Load Balancing
Slow-Start Load Balancing uses Intelligent Load Biasing (ILB). ILB works by giving logons a higher load bias. The default ILB algorithm assigns a bias of ½ the remaining load capacity. Essentially, the default algorithm is Current Resolution Load += (Maximum Load - Current Resolution Load) / 2. The ILB adjusts itself back down after pending logons are complete.
For example:

INCOMING LOGONS    RESOLUTION LOAD
1                  5000
2                  7500
3                  8750

To turn off ILB, set the registry value: HKEY_LOCAL_MACHINE\Software\Citrix\IMA\LMS\UseILB to 0 and restart the server.

To tweak the ILB algorithm, adjust the value of: HKEY_LOCAL_MACHINE\Software\Citrix\IMA\LMS\ILBMultiplier and restart the server. By default, the value is set to 2. By increasing the value, you can allow more concurrent logons. Essentially, the algorithm is changed. For example, if you change the value to 4, the ILB algorithm is Current Resolution Load += (Maximum Load - Current Resolution Load) / 4.





INCOMING LOGONS    RESOLUTION LOAD
1                  2500
2                  4375
3                  5781

Note: Maximum load = 10000.
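
The two tables can be reproduced with a few lines of Python. This is only a sketch of the biasing arithmetic as quoted above (each pending logon adds the remaining capacity divided by the ILBMultiplier value), not Citrix code. The function name is hypothetical, and rounding down to a whole number is an assumption that happens to match the 5781 in the second table.

```python
MAX_LOAD = 10000  # maximum load index, per the note above


def resolution_loads(incoming_logons, multiplier=2, start_load=0):
    """Resolution load after each pending logon, for a given ILBMultiplier."""
    load = start_load
    loads = []
    for _ in range(incoming_logons):
        # Each logon adds (remaining capacity / multiplier), rounded down.
        load += (MAX_LOAD - load) // multiplier
        loads.append(load)
    return loads


print(resolution_loads(3, multiplier=2))
print(resolution_loads(3, multiplier=4))
```

A larger multiplier makes each pending logon "weigh" less, so more concurrent logons fit before the server looks full.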

Wouldn't I also have to remove the server from the Published Application or else IMA still sends users to the disabled server??

ORIGINAL: Seizure

Wouldn't I also have to remove the server from the Published Application or else IMA still sends users to the disabled server??


That was my thought too. change logon /disable still results in people getting directed to that server and getting the "logons disabled" error, correct?

Put the server into a load evaluator based on the Scheduling Rule which will prevent users that are using published apps from reaching the server.

Shawn