
High Pages/Sec & Disk Transfers/sec, in the Citrix XenApp / Presentation Server forum on BrianMadden.com


Top 500 Contributor
Points 1,097
Chris Norman posted on Mon, Mar 24 2008 2:29 PM
Hey folks
Hp DL360s G4p
MS 2003 Enterprise Server SP2
PS 4.0 Ro4
8 GB of RAM and dual 3.40 GHz processors

We publish out the entire desktop (about 30 users to a server)

I'm investigating some stalls we seem to have every so often throughout the day. The users complain about latency where they type and nothing happens for a few seconds. I've tried everything I know to resolve this. I've got local text echo on (set on the Web Interface), but it doesn't seem to be helping.

The normal resource monitor shows nothing out of the ordinary, the cpu is normally bouncing from 20 to 30% and the ram usage is rarely above 4gig.

Today I'm running perfmon and watching Pages/sec & Disk Transfers/sec
I'm seeing Pages/sec jump up and down between 750 and 15,000 for an hour or so at a time. Then it calms down and sits at zero with some quick spikes to 1,500. In some instances I see Disk Transfers/sec following the same peaks and dips. This seems crazy high to me. I do have the /PAE switch in the boot.ini file, but I'm starting to wonder if the RAM above 4 GB is even being used. Is this normal to see during peak times of the day?
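
For a sense of scale, those Pages/sec readings can be roughly converted into disk throughput. This is only a sketch: it assumes the standard 4 KB page size of 32-bit x86 Windows, and the helper name is made up for illustration.

```python
PAGE_SIZE_BYTES = 4096  # assumption: standard 4 KB page on 32-bit x86 Windows

def pages_per_sec_to_mb_per_sec(pages_per_sec):
    """Convert a perfmon Pages/sec reading into MB/s of paging traffic."""
    return pages_per_sec * PAGE_SIZE_BYTES / (1024 * 1024)

# The three readings quoted in the post above.
for reading in (750, 1500, 15000):
    mb = pages_per_sec_to_mb_per_sec(reading)
    print(f"{reading:>6} pages/sec is about {mb:.1f} MB/s")
```

On that assumption the 15,000 peak works out to a little under 59 MB/s of disk traffic, which is a lot to ask of a pair of local disks for an hour at a stretch.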

Senior Administrator (Citrix)
USI Holdings

No matter where I am i'm never where I want to be.


All Replies

Top 500 Contributor
Points 650
You may be running into kernel-mode memory limitations. Check your paged and non-paged pool usage and see what limits are currently set. Process Explorer with symbols installed provides a good way of monitoring the limits and usage of the paged and non-paged pool memory. For a good article on the limitations of memory in Windows systems, check out Brian's article http://www.brianmadden.com/content/article/The-4GB-Windows-Memory-Limit-What-does-it-really-mean-
Greg Guhin
Banner Health
Phoenix AZ
Top 25 Contributor
Points 7,712
Hi Chris,

There are a number of possible reasons for the behaviour you're seeing. Judging by what you're seeing, my initial guess would be some sort of i/o request bottleneck, but the fact that you've got 8 GB of RAM complicates things since it could also be a kernel memory issue.

I'll go through the obvious possibilities.

On the network side, one fairly common reason for system pauses and apparent latency is the SMB i/o request queue filling up. When the queue is full, i/o essentially stops until the queue is available again. This can lead to momentary and in some cases quite long pauses in system performance and can also show up as significant latency.

When an SMB client (in this case the TS system) connects to a file server, the file server defines the size of the client's request queue length. The default value is the workstation default and is too small for TS environments. This can be fixed by maxmpxct/maxworkitems tweaking at the file server end, not the TS end. In really busy environments, the maxmpxct/maxworkitems values should also be increased on your domain controllers.

The following unmanaged GP snippet will handle the SMB tuning on back-end servers:

--------------------------------------------
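; Assumption: if this snippet is used as a standalone .adm file it needs a CLASS line.
; The values below live under HKLM, hence CLASS MACHINE.
CLASS MACHINE

CATEGORY "SMB tuning - Redirector Settings"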

POLICY "SMB Redirector Parameters"
KEYNAME "SYSTEM\CurrentControlSet\Services\Lanmanserver\Parameters"
PART "Set MaxWorkItems" DROPDOWNLIST REQUIRED
VALUENAME "MaxWorkItems"
ITEMLIST
NAME "Default - 210" VALUE NUMERIC 210
NAME "Enhanced - 4096" VALUE NUMERIC 4096 DEFAULT
NAME "Maximum - 8192" VALUE NUMERIC 8192
END ITEMLIST
END PART
PART "Set MaxMpxCt" DROPDOWNLIST REQUIRED
VALUENAME "MaxMpxCT"
ITEMLIST
NAME "Default - 50" VALUE NUMERIC 50
NAME "Enhanced - 1024" VALUE NUMERIC 1024 DEFAULT
NAME "Maximum - 2048" VALUE NUMERIC 2048
END ITEMLIST
END PART
PART "Set MaxRawWorkItems" DROPDOWNLIST REQUIRED
VALUENAME "MaxRawWorkItems"
ITEMLIST
NAME "Default - 64" VALUE NUMERIC 64
NAME "Enhanced - 512" VALUE NUMERIC 512 DEFAULT
END ITEMLIST
END PART
PART "Set MaxFreeConnections" DROPDOWNLIST REQUIRED
VALUENAME "MaxFreeConnections"
ITEMLIST
NAME "Default - 100" VALUE NUMERIC 100
NAME "Enhanced - 4096" VALUE NUMERIC 4096 DEFAULT
END ITEMLIST
END PART
PART "Set MinFreeConnections" DROPDOWNLIST REQUIRED
VALUENAME "MinFreeConnections"
ITEMLIST
NAME "Default - 32" VALUE NUMERIC 32
NAME "Enhanced - 256" VALUE NUMERIC 256 DEFAULT
END ITEMLIST
END PART
END POLICY ; smb

END CATEGORY ; smb tuning
----------------------------------------

It's interesting to note that in a scenario where users are hitting client drives quite heavily, the TS server is now a file server and needs similar SMB tuning.

Other possible reasons are excessive registry refreshes, fixed by changing the refresh interval, and filled pending writes to your disk subsystem, which requires a battery backed write cache on your RAID controller to fix.

The registry refresh interval is a TS tweak and can be handled by the following GP snippet, which also removes an unnecessary file system overhead of updating file access time/dates:

----------------------------------------
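; Assumption: if this snippet is used as a standalone .adm file it needs a CLASS line.
; The values below live under HKLM, hence CLASS MACHINE.
CLASS MACHINE

CATEGORY "Registry/File System Tuning"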

POLICY "Reduce Registry Update Impact"
KEYNAME "System\CurrentControlSet\Control\Session Manager\Configuration Manager"
PART "Set RegistryLazyFlushInterval" DROPDOWNLIST REQUIRED
VALUENAME "RegistryLazyFlushInterval"
ITEMLIST
NAME "Default - 6 sec" VALUE NUMERIC 6
NAME "Enhanced - 30 sec" VALUE NUMERIC 30 DEFAULT
END ITEMLIST
END PART
END POLICY

POLICY "No Access Timestamp Update"
KEYNAME "System\CurrentControlSet\Control\FileSystem"
PART "Set NTFSDisableLastAccessUpdate" CHECKBOX
VALUENAME "NTFSDisableLastAccessUpdate"
VALUEON NUMERIC 1
VALUEOFF NUMERIC 0
END PART
END POLICY ; no access timestamp update

END CATEGORY ; reg/file tuning
-----------------------------------

Then there are kernel memory issues which can get far more interesting.

Running a 32-bit TS system with more than 4 GB of physical memory can at times be a real juggling act because if kernel memory is an issue, you'll be juggling the various kernel memory allocations (paged pool, PTEs etc) to get something that goes well.

Using server 2003 enterprise and /pae lets the operating system use more than 4 GB of RAM, but with TS, you're generally running a large number of processes which puts a much greater strain on available kernel memory.

I've had a number of instances where stuff like antivirus software pushed 4+GB (and even 4GB) systems over the edge and made the systems downright flakey in a lot of interesting ways. However that's just by way of warning you that your configuration could lead to problems, because the symptoms you've described are more in line with i/o issues.

regards,

Rick

Ulrich Mack
Quest Software
Provision Networks Division




Top 25 Contributor
Points 7,712
Hi Chris,

I left out answering part of your question. Paging activity is normal and will happen regardless of the amount of RAM, unless you're game to run without a page file, which works but which I wouldn't recommend unless you like living dangerously. However, there is a difference between maintenance paging and aggressive swapping, where your system will slow down, again because of i/o limitations.

If the amount of free memory is still adequate, then things are probably ok on the paging front, but it is of course essential that your page file is contiguous, and that the start and maximum sizes are the same. If the default page file settings are used, your server can end up with a very fragmented page file which won't help things at all. If you've got to fix this, defrag your disk to get a large enough contiguous free space and then run the Sysinternals PageDefrag utility to get your page file back into one piece.

Run defrag nightly as a scheduled task (defrag c:) and your disk subsystem will at least keep going as well as it can.

regards,

Rick


Ulrich Mack
Quest Software
Provision Networks Division

Top 10 Contributor
Points 24,510
I think Rick hit the nail on the head mentioning the write cache settings on your RAID controller. You probably need to enable a write cache policy in the RAID BIOS (assuming you have an on-board battery on your RAID controller).

http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&objectID=c00818421&prodTypeId=15351&prodSeriesId=397638

Cheers,




Alan Osborne

President (MCSE, CCNA, VCP, CCA)

VCIT Consulting - Citrix/Terminal Services Remote Desktop Solutions for SMB


Top 500 Contributor
Points 1,097
Thanks for all the good info guys...

Rick
I defrag every evening via Diskeeper. I also reboot the servers every night which (correct me if I'm wrong) should defrag the pagefile. I set the page file to 4092/4092 and moved it off the system partition for good measure.

I set the following on the file server and on the TS server:

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanworkstation\parameters]
"MaxCmds"=dword:00000800

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters]
"MaxWorkItems"=dword:00002004
"MaxMpxCt"=dword:00000800
"MaxRawWorkItems"=dword:00000200
"MaxFreeConnections"=dword:00000064
"MinFreeConnections"=dword:00000020

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Configuration Manager]
"RegistryLazyFlushInterval"=dword:0000003c

I don't seem to be doing the time stamp update, but that shouldn't be a critical point, should it? I think I have all the settings you listed in one of the GPOs I use (True Control Template). I might have to take another look at it. GPOs are much better than the reg hacks I have been doing in the past.

How necessary would it be to apply this to my DC?
I have about 900 users on at peak times and I'm adding more every day. 60ish Citrix servers. I think that DC is also the server Exchange uses. We point all the Citrix servers to the same DC.
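
The hex DWORDs in the registry lines above are easier to compare with the template's decimal values once decoded. A minimal sketch (the dictionary simply copies the values from the post; the helper name is illustrative):

```python
# Hex DWORD values copied from the .reg lines in the post above.
reg_values = {
    "MaxCmds": "00000800",
    "MaxWorkItems": "00002004",
    "MaxMpxCt": "00000800",
    "MaxRawWorkItems": "00000200",
    "MaxFreeConnections": "00000064",
    "MinFreeConnections": "00000020",
    "RegistryLazyFlushInterval": "0000003c",
}

def decode_dword(hex_text):
    """Turn the hex text of a .reg dword into a decimal integer."""
    return int(hex_text, 16)

decoded = {name: decode_dword(text) for name, text in reg_values.items()}
for name, value in decoded.items():
    print(f"{name:<26} {value}")
```

Decoded this way, MaxWorkItems comes out as 8196 rather than 8192, and MaxFreeConnections and MinFreeConnections sit at 100 and 32, which are the values the template earlier in the thread labels as defaults.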

Senior Administrator (Citrix)
USI Holdings

No matter where I am i'm never where I want to be.

Top 25 Contributor
Points 7,712
Rick Mack replied on Mon, Mar 24 2008 11:18 PM
Hi Chris,

Diskeeper should be taking care of ntfs, the mft and pagefile so things should be fine in that area.

The SMB tuning (maxmpxct/maxworkitems etc) looks fine as does the longer registry refresh interval. However that doesn't quite let SMB out of the mix. I'll explain.

Once the SMB tuning has happened, you shouldn't see many network i/o request queue type problems, except in the scenario where the file server is slowing down and therefore causing more pending i/o requests (current commands on client). Things that can cause this are:

- again, fragmentation. If you're running Diskeeper on your TS systems, I'd be really surprised if you weren't also using it on your file server.
- stuff like running a backup between 9-5, SAN maintenance, anything that hits your disk subsystem
- misconfigured switches in your data center
- file system damage, generally security descriptors. This is an interesting one because it only affects the TS servers IF you hit one of the damaged security descriptors. This is often an artefact of a file server BSOD or unplanned power outage. Automatic shutdown on power loss is a must-have for your file server.
- folder redirection in a scenario where you have applications that, for example, really hammer %appdata%. That can generate a huge amount of extra network i/o requests.
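
The link between a slower file server and more pending requests can be sketched with Little's law (average requests in flight = request rate x time each request takes). The request rate and latencies below are made-up numbers for illustration only; the limit of 50 is the "Default" MaxMpxCt value from the template earlier in the thread.

```python
QUEUE_LIMIT = 50         # "Default - 50" MaxMpxCt from the template above
REQUESTS_PER_SEC = 2000  # hypothetical steady request rate from one TS box

def requests_in_flight(requests_per_sec, latency_ms):
    """Little's law: average number of requests outstanding at any moment."""
    return requests_per_sec * latency_ms / 1000.0

# Hypothetical file server response times: healthy, busy, struggling.
for latency_ms in (5, 20, 50):
    pending = requests_in_flight(REQUESTS_PER_SEC, latency_ms)
    state = "over the limit" if pending > QUEUE_LIMIT else "ok"
    print(f"{latency_ms:>3} ms per request -> {pending:.0f} pending ({state})")
```

The request rate never changes in this sketch; only the back-end response time does, and that alone is enough to push the pending count past the queue limit.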

Note that I mention network i/o requests rather than i/o. The importance of this was brought home to me a few years ago when we were supporting a school that was running Corel Draw 9 and had decided to redirect the Corel Draw folder to the user's home drive.

The day they went live with Corel Draw, the whole farm (10-15 servers) locked up for about 10-15 minutes. Corel Draw was loading a 14k INI file, one byte at a time. So starting up Corel Draw generated 14,000 separate 1 byte network i/o requests per student, in a class of 24 students. The back-end file server hosting the user drive shares just got overwhelmed. The end result was the TS systems hanging, but if you waited long enough they recovered.
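
To put numbers on that, here is the arithmetic as a small sketch. The 14,000 bytes and 24 students come from the story above; the 4 KB buffered read used for comparison is an assumption, picked only to show the contrast.

```python
import math

INI_SIZE_BYTES = 14_000    # "14k INI file" from the story above
STUDENTS = 24              # class size from the story above
BUFFERED_READ_BYTES = 4096 # assumption: a modest 4 KB read buffer, for comparison

# One network request per byte, as the application actually behaved.
byte_at_a_time = INI_SIZE_BYTES * STUDENTS

# The same file read in 4 KB chunks instead.
buffered = math.ceil(INI_SIZE_BYTES / BUFFERED_READ_BYTES) * STUDENTS

print(f"one byte per request : {byte_at_a_time} requests")
print(f"4 KB per request     : {buffered} requests")
print(f"ratio                : {byte_at_a_time // buffered}x")
```

Same amount of data either way, a few hundred KB per student at most, but the request count differs by more than three orders of magnitude, which is why the distinction between i/o volume and i/o requests matters.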

SMB tuning helped though we ended up using much bigger values for maxworkitems/maxmpxct, as did adjusting the antivirus package to check on write only. We finally got things going okay after looking at the SAN configuration as well and finding it was hugely suboptimal. Oh, and disabling hyperthreading on the file server is definitely worth doing to avoid hitting the wall suddenly.

Monitor the redirector current commands with perfmon to get an idea of just how busy things are. Note that the current commands counter is the sum of commands pending to all servers and it tends to be fairly static. I almost have a suspicion that the counters don't get updated properly, but if current commands are consistently over about 120-150, then chances are good you could have a back-end problem.
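
If the counter is logged to a file, a few lines of script can pick out the periods where it stays high. This is a rough sketch: the sample list is made up, the threshold of 120 is the lower end of the range mentioned above, and the minimum run length of 3 samples is an arbitrary choice.

```python
def sustained_over(samples, threshold=120, min_run=3):
    """Return (start_index, length) for each run of at least min_run
    consecutive samples above threshold."""
    runs = []
    start = None
    for i, value in enumerate(samples):
        if value > threshold:
            if start is None:
                start = i          # a new run begins here
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - start))
            start = None           # run (if any) has ended
    # Handle a run that is still open at the end of the log.
    if start is not None and len(samples) - start >= min_run:
        runs.append((start, len(samples) - start))
    return runs

# Made-up Current Commands samples, one per logging interval.
samples = [12, 30, 140, 155, 160, 90, 20, 130, 125, 150, 170]
print(sustained_over(samples))
```

Looking for runs rather than single spikes fits the advice above, since it is the consistently high readings that point at a back-end problem.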

If you have got performance/throughput issues because your file server just isn't fast enough, then have a look at HP's Polyserve product. It's a brilliant way to supply as much network file throughput as you could possibly ever need.

Disabling the time stamp update won't make a significant difference, but it's one of the things you disable to tweak things up a bit.

Doing the DC is definitely worth while, although when we have seen DC SMB issues, they were seen at logon time, with some users experiencing very slow logins, together with system pauses.

Printer drivers can also be a culprit, though that generally manifests itself as a huge CPU spike when users log on, so I doubt that's your problem.

After that you have to start looking at your kernel memory allocations to see if something's broken there.

regards,

Rick

Ulrich Mack
Quest Software
Provision Networks Division

Top 500 Contributor
Points 1,097
Rick you explain things well. :)

Here's how I have things set up. I only redirect "My Documents"; I believe the less you redirect the better off you are. It gets redirected to the user's H:\my Docs. We use a mandatory profile for everyone and save the settings using Flex profiling. The Flex INI files are stored in H:\Windows. The users' login times are ok; I can see in EdgeSight that they are all getting the same login times and it's rarely a problem. I will have a chat with our DC admin and see if we can apply the registry changes to the DCs anyway.

Printers... well, I have to admit we are still old school. We have tons of thin clients out in the field, so we have a number of print servers in our data center. We only use tested and approved drivers (most are HP). We have a policy to not allow kernel mode drivers. Once a month we completely delete the drivers off the Citrix servers and reload fresh copies to prevent corruption. Even though we do this, I do see that damn HP port resolver stacking up in the task manager. I could have a driver issue and not know it. Pauses could happen when certain users log in. Just haven't seen that happen yet.

I'm actually not defragging the file server. Here's a crazy one for you (kind of off topic)... We are migrating off of our EMC NAS to a Windows file server because we can't seem to solve some performance issues with it. So I built a Win2003 R2 64-bit file server and we presented a LUN to it off our EMC SAN. We are slowly moving people over to it. I am looking at the Current Commands now and it's been on zero for an hour. :)

Antivirus... (TrendMicro ServerProtect 5.58) is set to scan incoming only. I also built some exclusions so it will ignore things like winlogon, userinit, smss & wfshell. So I think I'm good in that department.

Backup... (Legato) I have to look into this one. I know we have so much data it takes more than 2 days to back it all up. We have not gotten our archive solution up and running yet, so backup has been very difficult to keep under control. I'll ping the admin and see what order things are getting backed up in. Maybe I can get my stuff done first so we can rule that out.

Power... Well, we have our equipment in an AT&T data center with onsite redundant power. So power isn't a problem. However, I do have to manually power down a server or two every now and again. They seem to get stuck rebooting sometimes. I'm not sure what is causing it and it doesn't happen a lot. But in any case I suppose that's not healthy. I used to do a chkdsk /r once a month during our maintenance window. I could start doing that again. That way the bad blocks/sectors could at least get identified.

One thing I failed to mention...
This issue has been going on since around November, which is about the same time we started migrating people off of Lotus Notes and on to Outlook. This may or may not be a factor. The main reason we upped the RAM on our servers from 4 to 8 GB was because of Lotus Notes. It loves RAM, and the longer you keep it open the more it takes.

Senior Administrator (Citrix)
USI Holdings

No matter where I am i'm never where I want to be.

Top 500 Contributor
Points 1,097
One of the things we are trying...

Tonight I'm segmenting one office away from the general population so we can see if the EMC NAS is our problem. We have an 80-man office whose files we've moved off the NAS to the MS file server. I'm going to force them onto 3 Citrix servers by themselves. They will have no ties at all to the EMC NAS and won't be affected by any other users.


BTW… We haven't done any tuning on the EMC NAS. We were told by EMC it's ready for whatever we throw at it. So we've gone on the belief that nothing needed to be done. We did open a ticket with them specifically on the maxconnections topic.


I think you're right about my issue being i/o; the characteristics are in line. The entire server stalls. I get multiple servers stalling at the same time when it happens.

Senior Administrator (Citrix)
USI Holdings

No matter where I am i'm never where I want to be.

Top 25 Contributor
Points 7,712
Hi Chris,

After being screwed around by EMC, HP and IBM with underperforming SANs I'm starting to get a bit sceptical about what optimised really means. Optimised for databases doesn't necessarily mean optimised for file serving, or vice versa. There's nothing like a good i/o throughput test to find out whether the optimisation is at all useful.

If you're seeing multiple copies of the port resolver it means you've got some broken print drivers. HP stuffed up big time and managed to produce a lot of TS-unfriendly drivers up 'til about a year ago when Rob Tuft came back on the scene to fix things up.

Check out http://www.hp.com/pond/ljbeta for an "admission" of some of the problems. The Citrix/Terminal Server Fix description says it all. Citrix technote article CTX111947 is getting a bit dated, but it lists some of the problem HP drivers.

Get a copy of AddPrinter or StressPrinter (Citrix utilities download) and test all your printer drivers. Once you get rid of the dud drivers things will just go a bit better.

One of the things to watch out for with outlook is personal folders (.pst files). These can cause outlook to generate a lot more network activity, particularly if they're on the user's home drive. That may just have pushed the NAS over the edge.

regards,

Rick








Ulrich Mack
Quest Software
Provision Networks Division

Top 500 Contributor
Points 387
Hi Chris, hi Rick

It seems that we have almost exactly the same problem in our very similar environment.
Our current commands go up to about 80, and I read that they shouldn't be much higher than the number of NICs in the TS!
So, what was the outcome of the migration of the 80-man office?
I would be very interested in the result.
My main question is:
Do we have the high number of current commands because our SAN cannot serve the files fast enough, so the SMB queries from the TSes are lining up in the redirector? So what seems to be a TS, network or FS problem is nothing but a SAN with bad response times?
Our FS has about 900 current connections, with a 2.6 TB and a 0.8 TB mirrored (Storage Foundation) volume (Diskeeper running) and a 1 Gbit NIC. 21 TSes (~40 users) connect to the FS. Only "My Documents" and the IE cache (limited to 1 MB) are redirected. Login times are normal at about 25 sec (measured with EdgeSight). We are running LN 6.55 and definitely have performance issues. No clue if ServerProtect 5.58 could be causing the high number of current commands.
We have applied all the LanmanWorkstation (only on TS) and LanmanServer (only on FS, not yet on TS) settings from Michael Roth's File Server Tuning article.
How could I check/prove that the SAN is the bottleneck, to force the SAN team to provide more spindles? Right now they are only building 4+1 LUNs without using meta-LUNs.
Looking forward to some answers. I do agree, Rick, you explain things very well.

Bye, Daniel
Top 50 Contributor
Points 5,251
Greg A replied on Sat, Apr 19 2008 3:06 PM
Random keyboard typing latency is usually related to one of two things.

1. Most HP servers don't come with BBWC cache. Without disk cache your disk time will be constantly high. You should get the BBWC cache add-on for all your Citrix servers. If your server is not HP you will still need a hard disk cache controller with write cache enabled.

2. Queue mouse movements and keystrokes. You can test this by using the PN agent to connect to your servers and disabling/enabling "queue mouse movements and keystrokes" in the PN agent properties. Test by holding down a key on the keyboard in any editor program such as Wordpad. Hold the key down for an entire page and look for the lag. If this turns out to be the problem (as verified with the PN agent) you can disable queue mouse movements and keystrokes in the web interface config files so that web interface users have queue mouse movements and keystrokes disabled when they connect to the servers.
Top 25 Contributor
Points 7,712
Hi Daniel,

Current commands doesn't seem to be dynamically updated with any sort of rhyme or reason.

I suspect it is at best an average with a fairly large update period. The description about matching the number of disk spindles is bullshit on a terminal server.

A Current Commands value of 80 means that on average you've got 80 pending smb/rpc network i/o commands in the queue. With standard out-of-the-box tuning it means that you're probably getting pretty close to the edge at times. When the pending requests queue is full, things stop momentarily until the queue has free slots.

If you don't increase the maxmpxct/maxworkitems values on the file server, anything at all that slows down your file server will potentially impact on your TS systems. Tuning maxcmds on the TS end won't do a thing for you unless the other end is increased as well.

Diskeeper on your file server is a good investment because it removes one of the common bottlenecks.

regards,

Rick

Ulrich Mack
Quest Software
Provision Networks Division

Top 500 Contributor
Points 1,097
Let me toss another thing into the mix here.

I still have my issue, BTW. We've been looking real hard at Exchange and have discovered that our consultants (EMC) didn't scale the SAN LUNs with enough drives. So we are running into disk queuing on the Exchange side. This seems to have an adverse effect on the Citrix servers as well as non-Citrix users. When Outlook stalls it seems to cause a CX server or workstation stall (not network). Anyone on that server seems to experience about 5 seconds of server lag. The sucky part in all this is I cannot reproduce anything on the fly. I have to wait for it to happen.

I'm also replacing all my HP DL360 "G3" servers with G5s, and I got the battery-backed cache specifically to try and set it up like you guys mentioned in this thread. It was time to replace them, so I figured I'd make that part of my standard build.

I have about 1200 users spread out over 39 servers (growing every day). We average about 25 to 30 people per server, which should be just fine. We didn't have any problems till we tossed Outlook into the picture. So it remains our prime suspect for now.

Daniel, there is a way to look at the disk queuing on your SAN. I'll ask our SAN admin how he is looking at it. I think you can see it on the SAN Surfer side. I need to confirm that though. I suppose it also depends on what kind of SAN you have.

Senior Administrator (Citrix)
USI Holdings

No matter where I am i'm never where I want to be.

Top 500 Contributor
Points 1,097
I checked with our SAN admin and he said he has been using SAN Surfer to look at the disk queue.

Senior Administrator (Citrix)
USI Holdings

No matter where I am i'm never where I want to be.
