Hi,
I'm troubleshooting a performance issue with a CPS farm that was migrated from CPS 3.0 to a new farm with CPS 4.5. Migration was really just a rebuild of all apps on new servers to a new farm. The CPS 3.0 farm is really just a performance reference.
The users complain of performance issues, typing delays, and disconnects on the new farm.
Summary of environment:
- ASP environment - essentially just one published app and MS Word in use
- CPS 4.5 (HRP-02) 32-bit servers running a very few published apps on W2K3 R2 SP2 - all server builds identical
- Data center houses Dell M600 blade center used for CPS farm - (2) quad core processors, 16 GB RAM per blade (these servers are new to this CPS 4.5 farm, old CPS 3.0 farm was on older rack mount Dell servers)
- ICA sessions are established via a static ICA file and are load balanced using an F5 Big-IP load balancer with sticky sessions (persistence) enabled
- CPS servers (blade center) are in a DMZ along with file server cluster, AD DCs are on inside LAN.
- Appropriate ports have been opened between DMZ and inside LAN for AD authentication.
- Client PCs run web client and static ICA file is launched via a custom web portal (in other words, connections don't actually go through a WI server, instead all ICA sessions target a FQDN that resolves to the VIP of the Big-IP load balancer)
- Citrix UPD is used exclusively
- Folder redirection in place with redirection of My Docs and Desktop to share on clustered file server (FC SAN - Dell CX400)
- TS Roaming profiles in use, hundreds of users but only 200 concurrent users max load balanced across farm.
Initially, the farm was built out with 64-bit CPS 4.5 servers, which is where the performance issues arose. There was a good deal of troubleshooting done and issues with Citrix UPD on the 64-bit OS were resolved, yet the performance issues continued. At first, troubleshooting pointed to a problem with network connectivity on the edge, as not all users were complaining (it appeared to be limited to high latency, low bandwidth connections on the edge). However, it turned out that more users began to complain, even from remote locations that had low latency, high bandwidth pipes. Perf mon analysis revealed that local disk I/O was getting queued too often and % disk idle was too low - other counters (CPU, network I/O, paging, memory, etc) were OK. An upgrade of firmware and drivers appeared to resolve the local disk I/O issues (although M600 blades do not have write caching). The focus returned to network issues, so we put in place bandwidth restrictions via Citrix policy. Unfortunately, performance issues persisted. The old CPS 3.0 farm never exhibited these issues, but its servers were not blades; however, the Big-IP was load balancing that farm too.
The client decided to rebuild the entire farm with 32-bit CPS 4.5 instead, blaming the published app. Unfortunately, the 32-bit farm is showing the same issues as the 64-bit farm was. One blade server removed from the VIP pool on the Big-IP and targeted directly (tweaked host file) showed no performance issues; however, the user count on that box was pretty low (only 25 users - long story). As before the farm rebuild, CPS servers in the VIP pool are suffering from performance issues, yet Perf mon analysis shows no issues.
Attention is focused again on network infrastructure - this time in the data center, but it's proving difficult to troubleshoot. Also, the new Dell M600 blade servers are suspect, but it's not currently possible to directly target a blade server with 60 user sessions outside of the VIP pool of the Big-IP. The reason is that all users reference the same ICA file, which in turn references a public FQDN that resolves to the VIP of the Big-IP, so targeting a single server requires inserting an entry in the host file on the client side to override public DNS resolution. My client doesn't want to make these client side changes.
Ideally, I would like to get the Big-IP load balancer out of the equation by implementing a CSG/WI server; however, I don't think this will work with a static ICA file for launching sessions due to STA ticket expiry issues. So, my client cannot abandon the use of the static ICA file launching the app for now as users are accustomed to launching published app via web portal and AD credentials are embedded within the ICA file (not secure, I know).
Help!!!
Any ideas would be greatly appreciated.
Alan Osborne President (MCSE, CCNA, VCP, CCA) VCIT Consulting - Citrix/Terminal Services Remote Desktop Solutions for SMB p: 604-288-7325 c: 778-836-8025 web: http://www.vcit.ca blog: http://www.vcit.ca/wordpress
Bump! Isn't there anyone that can offer some ideas??
A little more information to add:
- Wireshark traces at the remote end and on the CPS server show that the CPS servers are issuing RST packets in the middle of TCP conversations for no apparent reason (the TCP conversations are otherwise pretty clean).
- An ACK is sent by the client PC exactly 60 secs prior to the RST packet, which corresponds with the ICA keep-alive interval. If the ICA keep-alive is changed to 70 secs, the delay between the last client PC ACK and the RST from the CPS server is then 70 secs. There is no traffic between endpoints in between the client ACK and the RST from the CPS server. It would seem that CPS is issuing a RST because it thinks the client has been disconnected. I don't want to disable ICA keep-alive altogether though.
- A CPS 3.0 server behind the same network infrastructure has no performance issues and we aren't seeing the RST packets being issued by this server either.
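For anyone wanting to repeat the timing check described above, here is a minimal sketch. It assumes the trace has been exported to simple (timestamp, sender, TCP flags) records; that record layout, the `gaps_before_rst` name and the sample values are hypothetical and not taken from the actual capture.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (timestamp in seconds, sender, TCP flags).
Packet = Tuple[float, str, str]


def gaps_before_rst(packets: Iterable[Packet], client: str, server: str) -> List[float]:
    """For each RST sent by the server, return the seconds elapsed
    since the most recent ACK sent by the client."""
    gaps = []
    last_client_ack = None
    for ts, sender, flags in sorted(packets):
        if sender == client and "ACK" in flags:
            last_client_ack = ts
        elif sender == server and "RST" in flags and last_client_ack is not None:
            gaps.append(round(ts - last_client_ack, 3))
            last_client_ack = None
    return gaps


# Made-up sample: last client ACK at t=10s, server RST at t=70s.
sample = [
    (0.0, "client", "ACK"),
    (0.2, "server", "PSH,ACK"),
    (10.0, "client", "ACK"),
    (70.0, "server", "RST"),
]
print(gaps_before_rst(sample, "client", "server"))
```

If the gaps cluster around the configured ICA keep-alive interval, and move when that interval is changed, that would match the pattern described above.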
Thanks in advance for any advice you can offer.
Alan Osborne President (MCSE, CCNA, VCP, CCA) VCIT Consulting - Citrix/Terminal Services Remote Desktop Solutions for SMB p: 604-288-7325 c: 778-836-8025 web: http://www.vcit.ca blog: http://www.vcit.ca/wordpress
Hello Alan,
If the CPS 4.5 servers are sending RSTs, then it's probably safe to assume they are detecting client disconnects (Automatic Client Reconnection enabled). I'd first verify the ICA listener settings, then I'd flash the server NICs with the latest firmware and re-configure the settings. Also, double-check the port settings at the switch they are connected to, to ensure they match.
As for the user performance issue, it sounds as though you may have nailed that one down? I don't fully understand the benefit of the Big-IP appliance over the built-in load balancing when configured properly. Would you mind providing a little more detail on this set-up?
You said:
"ICA sessions are established via a static ICA file and are load balanced using an F5 Big-IP load balancer"
"connections don't actually go through a WI server, instead all ICA sessions target a FQDN that resolves to the VIP of the Big-IP load balancer"
So, this is network load balancing? Can the BIG-IP appliance determine server load (with respect to system counters)? BTW - Which Citrix Load Evaluator are you using?
I take it you're not using the "ica" DNS entry to resolve to the ZDC? (A much better way to evaluate load than network IMO).
I know in the thin client world we configure static ICA files in the FTP root directory of the WI server. We then create an entry with the path to the ICA file in the DHCP Options of the subnet of the computers we wish to configure. http://www.wyse.in/resources/deploying_a_blazer_device.doc This may be useless for your situation, but may help you think of potential options. What about anonymous access to that app, or configuring the Pass-Thru ICA Client?
You said:
"Ideally, I would like to get the Big-IP load balancer out of the equation by implementing a CSG/WI server;"
I agree with you on that one. Are the remote users on dedicated WAN links or connecting over the Internet? If dedicated, I'd skip the CSG part. Use dual WI servers (exactly the same) and NLB those.
Samuel A. Rodriguez Sr. Systems Administrator
Hi Sam,
Thanks for the reply!
The server NICs are already running the latest firmware and driver. We also tried disabling Chimney Offload, TCP Chimney, Receive-Side Scaling, and other "optimizations" in the SNP enhancement MS introduced (concern was with the Broadcom chipset). I also tried disabling TCP checksum offload. The speed/duplex settings have been verified for all interfaces in the infrastructure too. The only layer 1/layer 2 issues we found were overruns on the PIX firewall which we are taking up with Cisco support.
As far as the network design goes, this wasn't architected by me :-)
I agree with your assessment that native load-balancing using Citrix LEs would be preferable to using the Big-IP. The problem is that the client uses a web portal that launches a static ICA file which targets the FQDN of the Big-IP VIP directly, so the XML service on the ZDC isn't queried, nor is the WI involved (although they use the web client). The Big-IP load-balances the traffic using intelligent round robin (no performance counters) through to the CPS servers.
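To make the difference concrete, here is a rough sketch contrasting a rotation that ignores load with a pick based on a per-server load figure. This is only an illustration: it is not the Big-IP's actual algorithm or how Citrix Load Evaluators compute load, and the server names and session counts are hypothetical.

```python
from itertools import cycle
from typing import Dict, Iterator, List


def round_robin(servers: List[str]) -> Iterator[str]:
    """Hand out servers in a fixed rotation, ignoring how busy each one is."""
    return cycle(servers)


def least_loaded(sessions: Dict[str, int]) -> str:
    """Pick the server with the fewest sessions (a stand-in for a load metric)."""
    return min(sessions, key=sessions.get)


# Hypothetical session counts per server.
sessions = {"cps01": 60, "cps02": 25, "cps03": 58}

rr = round_robin(list(sessions))
print([next(rr) for _ in range(4)])  # rotation wraps around regardless of load
print(least_loaded(sessions))        # load-aware pick
```

With plain rotation the next connection can land on the busiest server, whereas a load-aware pick sends it to the least busy one.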
You said:
"I take it you're not using the 'ica' DNS entry to resolve to the ZDC?"
No, the VIP is targeted directly and the Big-IP decides which CPS server gets the load. I would much prefer to use a CSG/WI front-end and do away with the Big-IP, which they will hopefully agree to long term. Users connect across the Internet from all over the place, no dedicated WAN links involved.
I'll take a look at http://www.wyse.in/resources/deploying_a_blazer_device.doc as you suggest - thanks!
Alan Osborne President (MCSE, CCNA, VCP, CCA) VCIT Consulting - Citrix/Terminal Services Remote Desktop Solutions for SMB p: 604-288-7325 c: 778-836-8025 web: http://www.vcit.ca blog: http://www.vcit.ca/wordpress
How intriguing,
I don't see why CPS 3.0 would perform better than 4.5 (all else being equal). There must be a configuration difference between the two farms. Sorry, wish I had something more substantial.
I know how server administrators can be protective of their environment, especially when they're comfortable with something, but they have to remember that our world is a fluid environment and change is inevitable. IMO one should embrace concepts that make better sense.
Rather than get hung-up on "our 3.0 farm worked fine in this configuration", I would take advantage of this migration (and your expertise) to review mainstream best practices and see how best to leverage the new features of their new 4.5 farm.
With respect to the performance issues, I would get someone from the infrastructure team to go out to one of the remote sites and work closely with you, so that you have real-time feedback on what you're seeing in Wireshark, while at the same time running an instance of Spotlight-on-Windows on the servers http://www.quest.com/spotlight-on-windows/. I was thinking, could this possibly be an MTU issue?
FWIW, In parallel with your efforts to find your performance issue and put that to bed, I'd try and talk them into a proof-of-concept trial integrating WI with their web portal (which is documented better, takes advantage of native Citrix LE's and with CSG, quite secure).
Good luck Alan, I'm sure you'll find your culprit. Please share with us what you find.
Best regards,
Samuel A. Rodriguez Sr. Systems Administrator
Hi,
Just in case someone comes across a similar issue in future, I'm following up with the resolution to this issue.
After further troubleshooting, it turned out that the F5 Big-IP default TCP profile was the single biggest contributor to the problem. On the advice of F5 Support, we implemented a custom TCP profile with more generous timeout values which virtually eliminated disconnects and performance issues.
The only other issue was an apparent conflict between DEP and SpeedScreen. Even though DEP was enabled for essential Windows services and programs only, we found that disabling DEP altogether and disabling SpeedScreen eliminated all remaining disconnect issues (sometimes users would click on particular buttons within the core published app and get disconnected). My client has chosen not to re-enable SpeedScreen at this time, so I'm not sure whether DEP was causing the published app to tank or whether DEP was causing issues with SpeedScreen.
The only remaining issue is with the Citrix UPD and 0x3EB errors, despite having confirmed the SMA user account perms and experimenting with running the "Citrix Print Manager Service" under the local system account (made no difference). Fortunately, the printer mapping issues are infrequent. Having the user logoff and log back on results in the client printers getting mapped - weird.
Hope someone finds this useful in future.
Cheers,
Alan Osborne President (MCSE, CCNA, VCP, CCA) VCIT Consulting - Citrix/Terminal Services Remote Desktop Solutions for SMB p: 604-288-7325 c: 778-836-8025 web: http://www.vcit.ca blog: http://www.vcit.ca/wordpress
Hi Alan,
See http://support.citrix.com/article/CTX107148 as an example of issues with DEP and zlc_api.dll. The not so funny thing is that there was a hotfix that fixed this issue with CPS 3.0 but it isn't available for CPS 4.x. Makes you wonder, doesn't it?
AMD and Intel based systems behave slightly differently as well.
DEP and zlc_api.dll can be a bad combination with the latest O.S. service packs, which have changed DEP behaviour. However, completely disabling DEP isn't such a good idea either because it does protect you from buffer overflows, nor is disabling SpeedScreen likely to be good from a performance viewpoint. You've got a number of other options:
1. Add the executable to the DEP exclusion list (control panel > system > advanced > performance > DEP).
2. Add an application compatibility shim, disableNX to the executable using the Application Compatibility Administration tool.
3. Change a registry setting and add a new one for the executable you want to exclude. This does exactly the same thing as option 1, e.g. change HKLM\Software\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\NoExecuteState from 14012 (decimal) to 14013, and add the name of the executable as a value name under HKLM\Software\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Layers, e.g. appname.exe, REG_SZ, DisableNXShowUI.
The last option lets you script the changes you need or even make an unmanaged group policy to roll out DEP exclusions to all the servers in a farm.
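As a minimal sketch of scripting option 3, the snippet below only builds the text of a .reg file that could then be reviewed and imported on each server. It assumes NoExecuteState is a DWORD value directly under the AppCompatFlags key (the description above does not spell that out); the function name is hypothetical and appname.exe is the placeholder used above. Test on a non-production server first.

```python
from typing import Iterable

APPCOMPAT_KEY = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT"
    r"\CurrentVersion\AppCompatFlags"
)


def build_dep_exclusion_reg(executables: Iterable[str],
                            no_execute_state: int = 14013) -> str:
    """Return .reg file text that sets NoExecuteState and adds one
    DisableNXShowUI entry per executable under the Layers key."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{APPCOMPAT_KEY}]",
        # Assumption: NoExecuteState is a DWORD value under AppCompatFlags.
        f'"NoExecuteState"=dword:{no_execute_state:08x}',
        "",
        f"[{APPCOMPAT_KEY}\\Layers]",
    ]
    for exe in executables:
        # .reg syntax requires backslashes and quotes in names to be escaped.
        escaped = exe.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'"{escaped}"="DisableNXShowUI"')
    return "\r\n".join(lines) + "\r\n"


print(build_dep_exclusion_reg(["appname.exe"]))
```

Generating a file rather than writing to the registry directly keeps the change reviewable before it is rolled out across a farm.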
regards,
Rick
Ulrich Mack Quest Software Provision Networks Division
My .02 is that DEP on 2003 is largely useless anyway. It tends to cause more problems than it solves. If you seriously want to tackle buffer overflows, you should be deploying Server 2008 with ASLR.
@Alan - Thanks for the nice writeup on your issues and what the resolution was.
Shawn
___________ http://www.shawnbass.com
Hi Rick,
Thanks for the link to CTX107148, I hadn't come across that one yet.
I was surprised that we had issues with DEP and SpeedScreen given the fact that DEP was set to "Turn on DEP for essential Windows programs and services only" as that should limit DEP protection to Windows system binaries only (at least that's my interpretation of the setting).
Thanks also for the registry settings RE DEP - I'll add them to the vault for future reference as scripting this stuff would be a better option for larger farms. I typically roll out cloned CPS servers with DEP set to "Turn on DEP for essential Windows programs and services only" which works 99% of the time.
Cheers,
Alan Osborne President (MCSE, CCNA, VCP, CCA) VCIT Consulting - Citrix/Terminal Services Remote Desktop Solutions for SMB p: 604-288-7325 c: 778-836-8025 web: http://www.vcit.ca blog: http://www.vcit.ca/wordpress