High Pages/Sec & Disk Transfers/sec, in the Citrix XenApp / Presentation Server forum on BrianMadden.com

Not Answered This post has 0 verified answers | 73 Replies | 3 Followers

Top 500 Contributor
Points 1,097
Chris Norman posted on Mon, Mar 24 2008 2:29 PM
Hey folks
Hp DL360s G4p
MS 2003 Enterprise Server SP2
PS 4.0 Ro4
8 GB of RAM and dual 3.40 GHz processors

We publish out the entire desktop (about 30 users to a server)

I'm investigating some stalls we seem to have every so often throughout the day. The users complain about latency where they type and nothing happens for a few seconds. I've tried everything I know to resolve this. I've got local text echo on (set on the Web Interface), but it doesn't seem to be helping.

The normal resource monitor shows nothing out of the ordinary, the cpu is normally bouncing from 20 to 30% and the ram usage is rarely above 4gig.

Today I'm running perfmon and watching Pages/sec & Disk Transfers/sec
I'm seeing Pages/sec jump up and down between 750 and 15,000 for an hour or so at a time. Then it calms down and sits at zero with some quick spikes to 1,500. In some instances Disk Transfers/sec shows the same peaks and dips. This seems crazy high to me. I do have the /PAE switch in the boot.ini file, but I'm starting to wonder if the RAM above 4 GB is even being used. Is this normal to see during peak times of the day?
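For a sense of scale, the Pages/sec readings above can be turned into an approximate disk throughput. This is only a rough sketch: it assumes the standard 4 KB x86 page size and that every page counted actually goes to or from disk, and the function name is made up for illustration.

```python
# Rough conversion of a perfmon Pages/sec reading into paging throughput.
# Assumption: 4 KB pages (standard on x86 Windows), no large pages.
PAGE_SIZE_BYTES = 4096


def paging_throughput_mib(pages_per_sec, page_size=PAGE_SIZE_BYTES):
    """Return the MiB/s of paging traffic implied by a Pages/sec value."""
    return pages_per_sec * page_size / (1024 * 1024)


if __name__ == "__main__":
    # The low, spike, and peak values reported in the post.
    for rate in (750, 1500, 15000):
        print(f"{rate:>6} pages/sec is roughly {paging_throughput_mib(rate):.1f} MiB/s")
```

At the 15,000 peak that is close to 59 MiB/s of paging traffic, which is a lot to ask of a mirrored pair of local disks if the access pattern is random.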

Senior Administrator (Citrix)
USI Holdings

No matter where I am i'm never where I want to be.

  • | Post Points: 140

All Replies

Top 25 Contributor
Points 7,712
Hi Chris,

Very few SAN administrators actually look at the SAN as anything but a black box full of disks. But the sad truth is that SANs don't have any magic. They're big boxes with loads of disk, lots of cache and some software that makes the whole lot look like a huge disk drive that we can partition up any way we like.

But the basic building block is still a disk drive. The read write characteristics of a disk drive are that as long as you're doing sequential reads or writes things are pretty quick but as soon as you do random i/o things suck. The way you try to make up for the random i/o deficiencies is to use a huge amount of read and write cache. But if your data sets are large enough, the cache read hit ratio starts dropping and the write cache soon fills up. And your SAN bogs down.

Solid state drives are going to majorly change this, but at $10,000 per terabyte for flash drives, that's a lot of money to spend unless you find that improving SQL performance by 300-400% etc on your existing server hardware is worth the cost.

Disregarding that for the moment, when disk i/o slows on a busy file print server and the Citrix/TS-end i/o request queues have filled up, your Citrix servers will stop. The stop might be a few milliseconds, seconds or even minutes until the file server has caught up and you can send more i/o requests. And it affects your whole farm.

Where you've got competing i/o tasks on the SAN, in particular stuff like exchange, database servers etc, one way to make life a bit easier is to take the SAN architecture into account, so that competing i/o tasks aren't hammering the same disk sets. That can mean a bit less throughput overall and some wasted disk space, but it means that exchange can't hose your file server or sql performance.

Anyway, going back to your scenario which does seem to be i/o related, if you haven't already increased maxmpxct/maxworkitems on your file server, then DO IT!

The enclosed policy template section will do what you need and the enhanced settings will generally be all that's needed unless things are really bogging down. Note that maximum isn't a true maximum, it's just the highest values you can use without having a significant negative effect on kernel memory on the file server.

-----------------
CLASS MACHINE

CATEGORY "Backend Server Tuning"

CATEGORY "Redirector Settings"

POLICY "SMB Redirector Parameters"
KEYNAME "SYSTEM\CurrentControlSet\Services\Lanmanserver\Parameters"
PART "Set MaxWorkItems" DROPDOWNLIST REQUIRED
VALUENAME "MaxWorkItems"
ITEMLIST
NAME "Default - 210" VALUE NUMERIC 210
NAME "Enhanced - 4096" VALUE NUMERIC 4096 DEFAULT
NAME "Maximum - 8192" VALUE NUMERIC 8192
END ITEMLIST
END PART
PART "Set MaxMpxCt" DROPDOWNLIST REQUIRED
VALUENAME "MaxMpxCt"
ITEMLIST
NAME "Default - 50" VALUE NUMERIC 50
NAME "Enhanced - 1024" VALUE NUMERIC 1024 DEFAULT
NAME "Maximum - 2048" VALUE NUMERIC 2048
END ITEMLIST
END PART
PART "Set MaxRawWorkItems" DROPDOWNLIST REQUIRED
VALUENAME "MaxRawWorkItems"
ITEMLIST
NAME "Default - 64" VALUE NUMERIC 64
NAME "Enhanced - 512" VALUE NUMERIC 512 DEFAULT
END ITEMLIST
END PART
PART "Set MaxFreeConnections" DROPDOWNLIST REQUIRED
VALUENAME "MaxFreeConnections"
ITEMLIST
NAME "Default - 100" VALUE NUMERIC 100
NAME "Enhanced - 4096" VALUE NUMERIC 4096 DEFAULT
END ITEMLIST
END PART
PART "Set MinFreeConnections" DROPDOWNLIST REQUIRED
VALUENAME "MinFreeConnections"
ITEMLIST
NAME "Default - 32" VALUE NUMERIC 32
NAME "Enhanced - 256" VALUE NUMERIC 256 DEFAULT
END ITEMLIST
END PART
END POLICY ; smb

POLICY "Opportunistic Locking"
KEYNAME "SYSTEM\CurrentControlSet\Services\Lanmanserver\Parameters"
PART "Disable OpLocks" CHECKBOX
VALUENAME "EnableOplocks"
VALUEON NUMERIC 0 DEFAULT
VALUEOFF NUMERIC 1
END PART
END POLICY ; oplocks

END CATEGORY ; redirector settings

END CATEGORY ; backend server tuning
--------------------
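If you'd rather not load a policy template, the same "Enhanced" values can be written straight to the registry. The sketch below just builds the text of a .reg file for the Lanmanserver Parameters key used in the template; the function and dictionary names are made up for illustration. It is untested against a live file server, so review the output before importing it, and expect to restart the Server service (or reboot) before the values take effect.

```python
# Build a .reg file carrying the "Enhanced" values from the template above.
# The key path and value names come from the template; everything else
# (function name, dictionary name) is illustrative.
LANMANSERVER_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet"
    r"\Services\Lanmanserver\Parameters"
)

ENHANCED = {
    "MaxWorkItems": 4096,
    "MaxMpxCt": 1024,
    "MaxRawWorkItems": 512,
    "MaxFreeConnections": 4096,
    "MinFreeConnections": 256,
}


def build_reg_file(key, values):
    """Return .reg file text that sets each name to a REG_DWORD value."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{key}]"]
    for name, number in values.items():
        # .reg files store DWORDs as eight hex digits.
        lines.append(f'"{name}"=dword:{number:08x}')
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    print(build_reg_file(LANMANSERVER_KEY, ENHANCED))
```

Save the output to a file such as smb-tuning.reg (a hypothetical name) and import it on the file server, not on the Citrix servers.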

regards,

Rick

Ulrich Mack
Quest Software
Provision Networks Division

  • | Post Points: 5
Top 50 Contributor
Points 5,251
Greg A replied on Tue, Apr 22 2008 3:17 PM
I doubt Outlook is pausing your entire Citrix server. I would think it is more likely due to % Disk Time being very high on each server. Look at % Disk Time for the physical disks in Performance Monitor on a Citrix server. The extra server usage of Outlook may have just helped push the servers over the tipping point. Your new servers with BBWC should help if so.

You should still try the test I mentioned above. It could be a work-around until you get the BBWC servers. I had to do this on a farm where the company was too cheap to buy BBWC and it stopped the pauses from the user's point of view.

---
2. Queue mouse movements and keystrokes. You can test this by using the PN agent to connect to your servers and disabling/enabling "queue mouse movements and keystrokes" in the PN agent properties. Test by holding down a key on the keyboard in any editor program such as Wordpad. Hold the key down for an entire page and look for the lag. If this turns out to be the problem (as verified with the PN agent) you can disable queue mouse movements and keystrokes in the web interface config files so that web interface users have queue mouse movements and keystrokes disabled when they connect to the servers.
---
  • | Post Points: 20
Not Ranked
Points 150
By chance, are you running 2003 R2 quotas on the file servers?

Is there any pattern to the times when the freeze occurs?

Rgds

Andy
  • | Post Points: 5
Not Ranked
Points 235
Chris,

What's the status of your issue now? We are experiencing the same thing here, though we are on a smaller scale, 120 users over 3 servers.

I think the culprit is Outlook too, but not necessarily Exchange.

Do you have the email notifications enabled in your Outlook client?
  • | Post Points: 20
Top 500 Contributor
Points 1,097
Hi
No, problem still pending. Only because I'm not done going through my list of things to try.
We've found that people not using Citrix seem to be having similar problems, so everyone is working on this. "If you want to find something wrong in your environment, just roll out Citrix to find it". ;-)

One thing the messaging group tried over the weekend was to replace all the QLogic HBAs with Emulex. They have seen substantial performance improvement. This was done in our Philly data center, which is much smaller than the one I have all my Citrix servers in.

I'm in the middle of building my server image all over again from scratch to be sure there are no issues with it (I don't expect to find anything wrong). Then I'm rolling out some new servers to replace some aging ones. I'm also making sure all the servers have the cache battery and the settings are in line with some of the suggestions in this thread. I already do the stuff Rick suggested with max connections etc. The file server actually seems to be running just fine (very low amount of io activity on it).

I said we suspect Exchange as one of the main problems. I really don't think it's the exchange software just something in that environment. 90% of the complaints revolve around email. We didn't have any problems when we were a Lotus Notes shop.
I’ll be sure to post an update as soon as we get a handle on it.

Are you guys using QLogic HBAs?

Senior Administrator (Citrix)
USI Holdings

No matter where I am i'm never where I want to be.

  • | Post Points: 20
Not Ranked
Points 235
I'm recanting my earlier post about it being Outlook.

I'm running perfmon and it looks like it's all io wait on the local disks. I'm already running 15K drives, and have around 40 users per server. Using ntfilemon from Microsoft, formerly sysinternals, I can see that 99% of the user io is from Internet Explorer and its web cache, pages and cookies.

It's not the bandwidth either, it's the sheer number of random io requests. We are seeing 100 reads and 50 writes every second with spikes of up to 300 each, which averages to around 150 random IOPS at 2:1 read/write, all off a single RAID1 (effectively a single spindle). Since it's a 15K disk, the average rotational latency is (60 / 15000) / 2 = 2 ms, and the manufacturer lists read seek at 3.5 ms and write seek at 4 ms. That works out to about 182 random reads or 167 random writes a second, so the total number of random IOPS it can handle is around 174. This explains the choking and high io waits!
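The arithmetic behind that 174 figure can be sketched as below. It uses the numbers quoted in the post (15K RPM, 3.5 ms read seek, 4 ms write seek) and assumes each random io costs one average seek plus half a rotation, ignoring controller cache, queuing and transfer time. The function names are made up for illustration.

```python
# Back-of-the-envelope random IOPS estimate for a single spindle.
# Assumption: service time = average seek + average rotational latency.


def avg_rotational_latency_ms(rpm):
    """Half a rotation, in milliseconds."""
    return (60.0 / rpm) / 2 * 1000


def random_iops(rpm, seek_ms):
    """Random ios per second for a given spindle speed and seek time."""
    service_ms = avg_rotational_latency_ms(rpm) + seek_ms
    return 1000.0 / service_ms


if __name__ == "__main__":
    read_iops = random_iops(15000, 3.5)
    write_iops = random_iops(15000, 4.0)
    # Simple mean of the read and write figures, as in the post.
    estimate = (read_iops + write_iops) / 2
    print(f"rotational latency: {avg_rotational_latency_ms(15000):.1f} ms")
    print(f"read: {read_iops:.0f}  write: {write_iops:.0f}  mean: {estimate:.0f} IOPS")
```

With an observed average of about 150 IOPS and spikes of 300, the disk is running near or past this estimate most of the time, which fits the io wait seen in perfmon.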

Now I need to find a way to move that io off the main hard drives and somewhere else... I would love to move the whole "Documents and Settings" onto a solid state device, but these servers are blades with only 2 hard drives each. I have thought about using USB 2.0 flash drives too, but the blades have no reliable method of securing flash drives to them, and USB is slow and can't handle the IOPS needed.

The only thing I can think of is to have a "flash drive" iSCSI target. I can set up a 1U Linux host with IET and share out a solid state device, maybe a 32GB solid state SATA disk partitioned into 4 8GB partitions, 1 for each terminal server. I then have these either mounted onto a junction on the server or as a second hard disk and use group policy to push the users' profiles off of C: onto the solid state iSCSI target.

-Ross
  • | Post Points: 5
Not Ranked
Points 235

I'm going to attempt a threaded reply here, hope it comes out ok...

[quote=Chris Norman]Hi
No, problem still pending. Only because I'm not done going through my list of things to try.
We've found that people not using Citrix seem to be having similar problems. SO everyone is working on this. "If you want to find something wrong in your environment, just roll out Citrix to find it". ;-)[/quote]

When you say people not using Citrix do you mean desktops, or people running straight terminal services? I'm running just terminal services here, but if it's desktops too, then well there are going to be other problems too, whether they are related is anybody's guess at this point.

[quote=Chris Norman]One thing the messaging group tried over the weekend was to replace all the QLogic HBAs with Emulex. They have seen substantial performance improvement. This was done in our Philly data center, which is much smaller than the one I have all my Citrix servers in.[/quote]

We're on a smaller scale and we use iSCSI here with software initiators. We have limited space and also have our Exchange and SQL clusters running on blades with iSCSI, and the blades really limit the external expansion capabilities.

[quote=Chris Norman]I'm in the middle of building my server image all over again from scratch to be sure there are no issues with it (I don't expect to find anything wrong). Then I'm rolling out some new servers to replace some aging ones. I'm also making sure all the servers have the cache battery and the settings are in line with some of the suggestions in this thread. I already do the stuff Rick suggested with max connections etc. The file server actually seems to be running just fine (very low amount of io activity on it).[/quote]

Take a look at my analysis, maybe you are seeing the same issues I am. Don't know what your primary storage devices are in your environment, but if they are simple mirrored hard drives then you might be seeing the same problem.

[quote=Chris Norman]I said we suspect Exchange as one of the main problems. I really don't think it's the exchange software just something in that environment. 90% of the complaints revolve around email. We didn't have any problems when we were a Lotus Notes shop.
I’ll be sure to post an update as soon as we get a handle on it.[/quote]

Look forward to hearing the results.

[quote=Chris Norman]Are you guys using QLogic HBAs?[/quote]

iSCSI software initiators, the local disks on the blades are Seagate 15K U320 in a hardware RAID1. Dell hardware, so it's the simple PERC 4/im controllers.
  • | Post Points: 5
Not Ranked
Points 235
Ok, I ran a little more analysis with filemon. I gathered almost 36K operations, dumped them to a file and used string utilities to figure out the percentage of operations by process. Here's what I got:

Operations: 35939

Process         Operations  Share
csrss.exe             2333     6%
EXCEL.EXE             1472     4%
explorer.exe          2448     6%
iexplore.exe         12154    33%
lsass.exe                8     0%
mcshield.exe          1023     2%
mstsc.exe               86     0%
NTAbl.exe               15     0%
OUTLOOK.EXE            396     1%
proquota.exe         11483    31%
services.exe             2     0%
spoolsv.exe            413     1%
svchost.exe           1530     4%
taskmgr.exe              6     0%
userinit.exe           341     0%
winlogon.exe          1467     4%
WINWORD.EXE            758     2%
wsrm.exe                 4     0%

There were 34 users at the time.

What I didn't notice was how much impact the proquota.exe process has on the box. I missed that completely. As it stands, IE takes 33% of all io operations and proquota 31%. So disabling the profile quota tool should give a little breathing room until I can get the solid state iSCSI solution together.
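For anyone who wants to repeat the breakdown without string utilities, a tally like the one above can be sketched in a few lines. This assumes the saved log is tab-delimited with the process name in the third column, possibly followed by ":PID". That layout is an assumption, so check it against your own export; the function names and the sample lines are made up for illustration.

```python
from collections import Counter


def tally_by_process(lines, process_column=2):
    """Count logged operations per process name."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) <= process_column:
            continue  # skip headers or malformed lines
        # Drop a trailing ":PID" if the log includes one.
        name = fields[process_column].split(":")[0].strip()
        if name:
            counts[name] += 1
    return counts


def percentages(counts):
    """Whole-number share of operations per process, truncated."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100 * n // total for name, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the real capture.
    sample = [
        "1\t10:00:01\tiexplore.exe:1234\tREAD\tC:\\cache\\a\tSUCCESS\t",
        "2\t10:00:01\tproquota.exe:2200\tQUERY\tC:\\profile\tSUCCESS\t",
        "3\t10:00:02\tiexplore.exe:1234\tWRITE\tC:\\cache\\b\tSUCCESS\t",
    ]
    counts = tally_by_process(sample)
    shares = percentages(counts)
    for name, n in counts.most_common():
        print(f"{name}: {n} {shares[name]}%")
```

The percentages are truncated rather than rounded, which is what the figures in the table correspond to.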

-Ross
  • | Post Points: 35
Top 25 Contributor
Points 7,712
Hi Ross,

Profile quotas aren't a good idea on TS, it's one of those things that I just never do to the point where I don't even ask anymore "'cause you just don't do it". :-(

Provided the users' desktop and personal (My Documents) folders are redirected, profile quotas are simply unnecessary.

By all means disable quotas.

Samsung and Sandisk both have 64 GB flash drop-in replacements for local hard drives (SCSI, SATA or IDE). I think they're still too expensive, but in your scenario might provide a really quick fix. They handle random iops 100-1000 times faster than physical hard drives.

Have a look at i/o reads/writes via Task Manager; that'll give you a good realtime look at what's doing most of the i/o.

regards,

Rick

Ulrich Mack
Quest Software
Provision Networks Division

  • | Post Points: 20
Not Ranked
Points 235
Thanks Rick,

I disabled the proquota utility. I still have profile quotas even though I redirect "My Documents", "Application Data" and "Desktop" to the users' home drives. They are set to something like 15MB, more to prevent users from hoarding unauthorized software in their profiles and to reduce the "it's so slow logging in" complaints. I will take a look at that though. I need a way to keep user profiles small. Essentially a profile should just contain the user's preferences, but I have noticed more and more web apps (e.g. WebEx) that install directly to the profile and not the users' "Application Data" directory like they should... I wish developers of Microsoft software would follow Microsoft's guidelines.

  • | Post Points: 5
Not Ranked
Points 235
Ok,

Disabling proquota helped a lot, about 30% less io on average which is tremendous.

Now if I can get Internet cache off the C: onto a solid state device I believe I can reduce another 30%. I need to get all sorts of planned infrastructure upgrades complete first, but I am adding this to the list.

-Ross
  • | Post Points: 20
Top 500 Contributor
Points 1,097
Ross
[quote=Ross]When you say people not using Citrix do you mean desktops, or people running straight terminal services? I'm running just terminal services here, but if it's desktops too, then well there are going to be other problems too, whether they are related is anybody's guess at this point.[/quote]

Yes local workstations as well. We have a mixed environment. The goal is to get everyone on Citrix but it's taking some time.

[quote=Ross]Take a look at my analysis, maybe you are seeing the same issues I am. Don't know what your primary storage devices are in your environment, but if they are simple mirrored hard drives then you might be seeing the same problem.[/quote]


Yeah I'm doing mirrored hard drives on the Citrix servers 15k.

[quote=Ross]Disabling proquota helped a lot, about 30% less io on average which is tremendous.[/quote]

Yeah, I figured that would help you. If you're having problems with the profiles getting out of control, I'd suggest trying Flex Profiling with a single mandatory profile.

Somewhat of an update:
I read this a while back and dismissed it as not being my issue, but the messaging team believes it's the root of their performance problem on the Exchange end. I thought I'd post a link. You never know, it might be useful to someone.
http://msexchangeteam.com/archive/2007/07/18/446400.aspx

Senior Administrator (Citrix)
USI Holdings

No matter where I am i'm never where I want to be.

  • | Post Points: 35
Top 100 Contributor
Points 1,837
Hi Chris,

I too ran into this very situation with similar hardware and Winders set-up.
-Dell 2950 2x Xeon dual core 2.0GHz procs 1066MHz FSB.
-73GB RAID1 15k SAS drives (two partitions 20/50).
-Perc5/i controller 256MB cache.
-8GB RAM, 4GB Page file.
-Windows Server 2003 Ent. (PAE) SP2.
-MF PS4.0 ent. R02 + specific hotfixes
-Publishing full (locked-down) desktops.
-Utilizing hybrid profiles, Kixtart scripting to map printers, copy desktop shortcuts, build Outlook profile, proflwiz run at logon.cmd and logoff.cmd to load/save user appdata into their respective profile.
-Mandatory profiles reside on each server as well as Kixtart executable and scripts
(to load faster).
-User \My Documents and \appdata folders redirected to Xiotech SAN
-Exchange folders on different LUN on same SAN
-UPHclean service running.
-Server NICs Gig-E to Cisco, then fibre to Brocade fabric connected to SAN.
-Server Second NIC on 100MB "heartbeat" switch for internal IMA traffic.
-Third NIC configured as first and disabled - as a quick backup in case #1 failed.

I too, experienced high pages/sec and context switches/sec with this server config. Which did not make (typical Microsoft) sense. The OS was barely touching the installed RAM and the disk was going crazy! To the point that the user community would experience the same hesitations, or sporadic excessive logon times.

Earlier you mentioned that the console shows everything fine, but PerfMon displays high pages/sec and context switches/sec. Do you have the Citrix Resource Manager Summary database configured? If not, I recommend setting it up: http://sbc.vanbragt.net/mambo/articles/using-citrix-resource-manager-2.html, http://support.citrix.com/article/CTX107434 However, I would recommend changing the default count metrics. The defaults are set too low and you'll find yourself getting Yellow and Red Alerts several times a day (when the problems don't really exist). I ended up setting pages/sec and context switches/sec to double the default value.

In my analysis (running PerfMon in datalog mode over a week), I was able to determine a couple/three things, especially spikes around holidays and at lunchtime (heavy IE usage periods).
1) Outlook 2K3 is a hog, but unfortunately I had to leave certain services on in the .prf file (like support for .pst's). One is also unable to use the new "Cached Exchange Mode" when installed on TS (but network was never an issue, at least on the local NIC; dunno about the server switch or Brocade switch).
2) IE was the #2 major cause of disk i/o with respect to context changes and paging. Still working to deal with this - can't avoid it when publishing desktops this way - they gotta get to the Intranet and Sharepoint...
3) Loading and unloading profiles was the #1 cause of the problem. Adjusting the session timeout and disconnect behavior can help a lot in this respect - the downside is sessions left open for extended periods (security), and higher consumption of licenses.

Oh, almost forgot. It's a good idea to also tweak your anti-virus client so that it's not looking at every single thing! Again, another security compromise.

My last project to address this was looking into client redirection of IE. The problem for me was 1/2 my Thin clients were Blazer (or Wyse 1200LE/S10 type) OS, so no native browser. Those Thin-Clients with WinCE(S30) or XPe(V90) should be able to do it. You just open-up a whole new scope of duties in keeping the media plug-ins up-to-date!

Chris, please update this thread should you find anything else that may have been missed.

Best regards,

Samuel A. Rodriguez
Sr. Systems Administrator

  • | Post Points: 5
Top 500 Contributor
Points 880
Chris

We too are experiencing very similar intermittent pausing and typing delays.

For us, we have been able to align the user pause or typing delay with a logoff event.

However, the end user impact does not occur on every logoff event, only intermittently. And only some of the users experience the pause.

By monitoring all application processes % Processor Time at 2 second intervals, we observed that during only some of the logoff events the active application processes (nlnotes.exe) will spike to some number over 100 percent.

It is during this spike that we have been able to align the user impact. And we cannot reproduce this in our test farm either. Even if we reduce the server to one core and have 6 users log in and observe the logoff event, the problem is NOT reproducible.

Note: The pausing only occurs for our users intermittently when the server has more than 30 users on it. The Microsoft escalation performance team is currently looking at this and has told us it does not look like a disk resource issue, at least for now. The last test they requested from us was to run without antivirus. The results were the same, problem remains.

We have tried all of the registry settings and 6 hotfixes from MS, as well as disabling the TCP Offload Engine, problem still remains. I even defragged, still problem remains.

For us this seems to be a scalability issue of some sort, and only occurs when more than 30 sessions are on the server.

Currently, we are waiting for the next step from MS.

ScottC.

  • | Post Points: 20
Top 500 Contributor
Points 1,097
Hi Scott,
Welcome to the thread.

Because we are transitioning off of Notes and on to Outlook, we have to have both loaded for the users. We run version 6.5.3; because of some dependencies we can't upgrade. I noticed the same thing when a user logs in, but I can't reproduce it. You have me thinking though. I have noticed that sometimes when I log out I get a process hang and I have to end task. I don't recall what it is exactly, system tray something or another and explorer.exe. I need to check my notes and revisit that.

What version of Notes are you guys using?

Are you running PS4 on win2003?

Do you have complaints from the users when they log off? Like they have to end task on something?

Did you notice if you started having this issue after windows 2003 SP2?

Senior Administrator (Citrix)
USI Holdings

No matter where I am i'm never where I want to be.

  • | Post Points: 20