The erratic freezes and pauses that sometimes occur on Terminal Servers are probably the most annoying and difficult performance problems to troubleshoot. The good news (and the bad news) about these kinds of problems is that they are almost never your fault. The resolution usually points back to a device driver, service pack, or hotfix of some sort.
Erratic issues usually fall into two categories:
- An application or process pegs the processor at 100% utilization for a short time and then returns to normal. During this time, users' sessions are usually unresponsive.
- The server just freaks out for a few seconds. Every so often everything freezes, including the Performance Monitor MMC snap-in. Then after a few (or even 20 or 30) seconds, the performance chart jumps ahead in time to the current position. The "blackout" period leaves a gap in the chart, with all performance counters showing zeros or no data.
The best attack plan for these types of problems is as follows:
- Check the web for your specific problem.
- Update service packs, apply hotfixes, and/or update device drivers.
- Use the Performance MMC snap-in to check for anomalies.
Step 1. Search the Web for your Problem
Solutions to the erratic problems are almost never intuitive, so it's worth spending ten or fifteen minutes on the web to gain an understanding of your problem before you try to do anything on the server.
For example, Windows Server 2003 was released in April 2003. In July, Microsoft KB article 821467 was published with the title "Windows Server 2003 Terminal Server Stops Responding." This article indicated that the problem only happens with Windows 2003 Servers, and that a fix is available from Microsoft. Why bother troubleshooting on your own when someone else might have done it already?
Here are the most useful websites for researching these types of problems. They're presented in the order in which most people search them.
- Google groups (groups.google.com). Your best luck usually comes from the Microsoft newsgroups. Often the Microsoft MVPs keep these lists up to date with hotfixes, and they're usually faster than Microsoft KB articles.
- Microsoft Knowledge Base (www.microsoft.com/support).
- The THIN.net archives (www.thethin.net). The archive searching tool on the THIN's main website is sometimes awkward to use, so you might try searching the archives via their listserv provider (www.freelists.org/archives/thin).
- Citrix Support Knowledge Base (support.citrix.com/kb). The search engine never seems to find what you're looking for, but this site is essential for locating Citrix hotfixes.
Step 2. Update Service Packs, Hotfixes, and Drivers
Most of the erratic problems have already been fixed at some point. Do a quick search on the Microsoft Knowledge Base for "server stops responding" and you'll see that 90% of the solutions say "obtain the latest service pack or hotfix."
For example, Service Pack 2 for Windows 2000 fixes a problem with the registry cache locking. This problem usually occurs on busy Terminal Servers and causes the whole system to pause if any registry writes need to be made while the registry is being backed up (which Windows does periodically). Applying Service Pack 2 or newer completely eradicates the problem.
However, you also need to be careful about updating production servers. When Service Pack 4 for Windows 2000 first came out in June 2003, it broke many people's Terminal Server environments. Check the web resources listed in Step 1 to make sure that whatever patch you're applying is safe. Also, apply the patch to a test server before putting it on a production server. Roy Tokeshi's Thin Client Support community at www.tokeshi.com is a great reference site for all the latest Terminal Server support, hotfix, and service pack information.
While you're at it, you should also update the hardware device drivers and firmware. There have been countless cases in which hard drive firmware or driver updates have "magically" fixed the occasional hiccup. Keep in mind that the users in a Terminal Server environment really push a server to its limits, and your hardware (and the drivers) are definitely getting their exercise.
Step 3. Launch the Performance Monitor MMC Snap-In
If web searching and server patching didn't fix your erratic problems, you'll have to continue the investigation yourself. Chances are that you've already fired up Performance Monitor. If you're still having the problem, you'll need to have it active during one of the glitches.
Add counters for your processors (Processor | % Processor Time). It's best to add one counter for each processor instance instead of the "_Total" instance, since that makes it easier to see whether a single process is pegging one CPU.
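As a rough illustration of the "one counter per processor" idea, the following Python sketch builds one counter path per processor instance. The function names are hypothetical, and the second helper assumes the typeperf command-line tool is available on the server (it is not mentioned above and is used here only as an example of something that accepts counter paths).

```python
def processor_counter_paths(cpu_count):
    """Return one '% Processor Time' counter path per processor instance.

    Instances are numbered from 0, so a four-way server yields
    Processor(0) through Processor(3) rather than the single _Total instance.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return [rf"\Processor({i})\% Processor Time" for i in range(cpu_count)]


def typeperf_command(cpu_count, interval_seconds=1):
    """Build an argument list that samples every processor separately.

    Assumption: typeperf is installed; -si sets the sample interval.
    """
    return ["typeperf", *processor_counter_paths(cpu_count),
            "-si", str(interval_seconds)]


if __name__ == "__main__":
    for path in processor_counter_paths(4):
        print(path)
```

The argument list could be handed to subprocess.run on the server itself, but printing the paths is enough to check them against what the Add Counters dialog shows.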
If your problem is extremely intermittent, don't forget that you can configure an alert to watch for it. Then, you can configure that alert to automatically start a performance data log. The only problem with this is that you'll need to configure your alert to check at a frequent interval—maybe every few seconds. Be careful that you don't create a Catch-22 situation where your complex monitoring, logging, and alerting schemes actually tax the system more. The ideal situation is for you to be able to view the problem live. (You could also configure Performance Monitor on a remote computer to track your Terminal Server.)
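If you do end up with a performance log rather than a live view, the condition an alert watches for can also be checked after the fact. The sketch below is only an illustration: it assumes you have already exported the CPU readings as a plain list of percentages taken at a fixed interval, and the function name, the 95% threshold, and the three-sample minimum are all made-up values to adjust.

```python
def find_sustained_spikes(samples, threshold=95.0, min_samples=3):
    """Return (start_index, end_index) pairs for every run of at least
    min_samples consecutive readings at or above threshold.

    Requiring several consecutive readings filters out the single-sample
    blips that any busy server produces.
    """
    spikes = []
    run_start = None
    for i, value in enumerate(samples):
        if value >= threshold:
            if run_start is None:
                run_start = i          # a new run begins here
        else:
            if run_start is not None and i - run_start >= min_samples:
                spikes.append((run_start, i - 1))
            run_start = None
    # Handle a run that is still in progress when the log ends.
    if run_start is not None and len(samples) - run_start >= min_samples:
        spikes.append((run_start, len(samples) - 1))
    return spikes


if __name__ == "__main__":
    readings = [10, 99, 100, 98, 20, 97, 30, 100, 100, 100]
    print(find_sustained_spikes(readings))
```

With a one-second sample interval, a min_samples of 3 means "pegged for at least three seconds." A longer interval taxes the server less but makes short spikes easier to miss.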
Look for Applications that Are Taking Up 100% of the Processor
Look at what exactly is happening when your system slows down. Does the CPU spike? If so, there are a few different approaches you can take. First, try to determine what's causing the spike. Is someone's roaming profile loading? Did a bunch of users just log on? In some cases, you may encounter an application or process that simply consumes too much CPU time.
Dealing with Overzealous Applications
The easiest solution in these cases (especially in Terminal Server environments) is to add CPU throttling software to your server. This software monitors all running processes and clamps down on anything that hits a predefined limit. This is helpful for applications that aren't really Terminal Server-friendly but that your users insist on using.
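To make the "monitor everything, clamp down on anything that hits a limit" idea concrete, here is a minimal Python sketch of the decision logic only. It is not how any particular product works: the class name, the 80% limit, and the two-sample grace period are assumptions for illustration, and a real tool would go on to lower the offending process's priority or otherwise restrict it.

```python
class ThrottlePolicy:
    """Decide which processes to clamp down on, given periodic CPU readings."""

    def __init__(self, cpu_limit=80.0, grace_samples=2):
        self.cpu_limit = cpu_limit          # hypothetical predefined limit (%)
        self.grace_samples = grace_samples  # readings over the limit before acting
        self._over_count = {}               # process name -> consecutive hits

    def update(self, usage_by_process):
        """Take {process_name: cpu_percent}; return the names to throttle now."""
        to_throttle = []
        for name, cpu in usage_by_process.items():
            if cpu >= self.cpu_limit:
                self._over_count[name] = self._over_count.get(name, 0) + 1
            else:
                self._over_count[name] = 0  # back under the limit: reset
            if self._over_count[name] >= self.grace_samples:
                to_throttle.append(name)
        # Forget processes that are no longer running.
        for name in list(self._over_count):
            if name not in usage_by_process:
                del self._over_count[name]
        return sorted(to_throttle)


if __name__ == "__main__":
    policy = ThrottlePolicy()
    print(policy.update({"winword.exe": 95.0, "notepad.exe": 1.0}))
    print(policy.update({"winword.exe": 90.0, "notepad.exe": 2.0}))
```

The grace period is the design choice that matters: without it, every application that briefly needs the whole CPU (at launch, for example) would be throttled immediately.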
Many vendors make products that help ensure CPU resources are available when they're needed. Here are some of the more popular tools:
- Appsense Optimizer ( www.appsense.com )
- Aurema ARMTech ( www.aurema.com )
- RES PowerFuse CPUShield ( www.respowerfuse.com )
- RTO Software TScale 3.0 ( www.rtosoft.com )
- TAME ( www.tamedos.com )
- TMuLimit ( www.tmurgent.com/TMuLimit.htm )
- ThreadMaster ( threadmaster.tripod.com )
Each of these tools approaches CPU over-utilization in a different way, so you should investigate all of them. There are situations in which one of these tools will fail to control a process while another tool works. If you don't get the results you need with one tool, it's worth trying another.
Look for Periods when Everything Goes to Zero
Hopefully searching the web, patching your server, and updating your drivers and firmware will alleviate any sporadic problems that you were having. If you're still experiencing performance issues, there are a few things left to try.
In some cases, you'll notice that all Performance Monitor counters will just disappear for a few seconds and everything pauses. When the system comes back, Performance Monitor has jumped ahead, with nothing but zeros left in the twilight zone.
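If you have logged the counters to a file, these blackout periods show up as holes in the sample timestamps, which are easier to find with a script than by eye. The following Python sketch assumes you have already parsed the log's timestamps into datetime objects; the function name and the tolerance factor of 2 are illustrative choices, not anything Performance Monitor defines.

```python
from datetime import datetime, timedelta


def find_blackouts(timestamps, expected_interval, tolerance=2.0):
    """Return (last_seen, next_seen, gap) for every pair of consecutive
    samples that are further apart than tolerance * expected_interval.
    """
    limit = expected_interval * tolerance
    blackouts = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        gap = curr - prev
        if gap > limit:
            blackouts.append((prev, curr, gap))
    return blackouts


if __name__ == "__main__":
    start = datetime(2003, 7, 1, 9, 0, 0)
    # One-second samples with a hole between second 3 and second 25.
    times = [start + timedelta(seconds=s) for s in (0, 1, 2, 3, 25, 26)]
    for last_seen, next_seen, gap in find_blackouts(times, timedelta(seconds=1)):
        print(f"No samples between {last_seen} and {next_seen} ({gap})")
```

Knowing exactly when each gap began makes it easier to line the freeze up with entries in the event log or with scheduled activity on the server.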
In other less extreme cases, you might not notice anything strange in Performance Monitor. In fact, everything might look normal even as your system takes a performance hit. In these cases, you'll need to continue stepping through this document.
As with the previous problems, you'll also need to consider everything that's happening on your server. One company's Terminal Servers would max out at 25 users, even though Performance Monitor showed no problems. It was later discovered that they were redirecting the users' "Application Data" folder to a remote home drive location. Because this critical profile folder was redirected, the server would slow to a crawl each time a user needed to access data from that folder (a very frequent event). Stepping through the performance troubleshooting steps caused them to consider everything that was happening on their server, and ultimately led them to experiment by turning off folder redirection. That is how they were able to isolate their problem.