Best practices for virtualizing terminal servers, from Project VRC Phase 2

Ruben and I are getting close to the release of the Project VRC phase 2 white paper. (Here's a primer on Project VRC if you haven't heard of it.) Although we've been discussing and presenting the best practices from the first paper around the world, it's still important to understand Project VRC’s methodology and how the results should be interpreted. This article is a preview of the next VRC whitepaper:

Virtual CPUs (vCPUs)

A Terminal Server is similar to a highway--it's shared by its users. And just as a highway with multiple lanes is more efficient than a single lane, so too is a Terminal Server with multiple vCPUs. Besides the obvious advantage of increasing capacity, adding a second lane means that the impact of an accident or slowdown is greatly reduced as traffic can still get through via a free lane. In other words, having more than one lane means a single slowdown doesn't impact everyone.

This is not fundamentally different with Terminal Server workloads. Configuring each VM with a single vCPU could theoretically be more efficient (or faster). But in the real world of Terminal Server workloads, this is highly undesirable. Such workloads vary greatly in resource utilization and are typified by hundreds or even thousands of threads in a single VM, so a dual-vCPU setup gives Terminal Server users much better protection against congestion issues than a single vCPU.

This means that a minimum of two vCPUs per VM was configured for all Project VRC tests, even though a single vCPU has been proven to be more efficient. This best practice is also valid for practically all real-world Terminal Server-based workloads. (This also applies to Terminal Servers running third-party add-ons such as Citrix XenApp.)
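
As a quick illustration of this rule, the following minimal sketch checks a VM inventory against the two-vCPU minimum. The VM names and counts are hypothetical examples for illustration only and are not part of the VRC test setup or tooling.

```python
# Minimal sketch: flag Terminal Server VMs configured with fewer than
# two vCPUs. The inventory below is hypothetical example data.
MIN_VCPUS = 2

vm_inventory = {
    "TS-VM-01": 1,   # vCPUs currently assigned
    "TS-VM-02": 2,
    "TS-VM-03": 4,
}

for vm_name, vcpus in vm_inventory.items():
    if vcpus < MIN_VCPUS:
        print(f"{vm_name}: {vcpus} vCPU -> raise to {MIN_VCPUS} (the 'highway' principle)")
    else:
        print(f"{vm_name}: {vcpus} vCPUs -> OK")
```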

vCPU Overcommit

Another important best practice, valid for all tested hypervisors, is not to overcommit the total number of vCPUs in relation to the available logical processors on the system. For example, on a system with eight logical cores, no more than eight vCPUs should be assigned in total to all VMs running on the host. (Well, technically this is only important when the primary goal is to maximize user density.)

Various tests in phase 1 of Project VRC have proven that overcommitting vCPUs negatively affects performance. This is not completely surprising, since overcommitting means that multiple VMs must share individual logical processors, which creates additional overhead.

As a result, Project VRC will not overcommit vCPUs in any tests, since maximizing user density is the primary goal. But it's important to understand that overcommitting vCPUs is not prohibited in every case. For example, when the main goal is to maximize the number of TS/XenApp VMs (good old-fashioned server consolidation), overcommitting vCPUs is no problem at all. (In these cases, however, configuring two vCPUs per VM is still recommended because the "highway" principle still applies.)
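
The distinction between the two goals boils down to a simple vCPU-to-logical-processor ratio. The sketch below shows the arithmetic; the host and VM figures are hypothetical examples, not VRC test numbers.

```python
# Minimal sketch: check the total vCPU count against the host's logical
# processors. All figures here are hypothetical examples.
logical_processors = 8      # e.g. two quad-core sockets
vcpus_per_vm = 2            # the VRC minimum of two vCPUs per VM
vm_count = 4                # VMs planned on this host

total_vcpus = vcpus_per_vm * vm_count
ratio = total_vcpus / logical_processors

if ratio <= 1.0:
    print(f"{total_vcpus} vCPUs on {logical_processors} logical processors "
          f"(ratio {ratio:.2f}): no overcommit, best for maximum user density")
else:
    print(f"{total_vcpus} vCPUs on {logical_processors} logical processors "
          f"(ratio {ratio:.2f}): overcommitted, acceptable for consolidation "
          f"but not for maximum user density")
```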

Transparent Page Sharing

vSphere’s ability to overcommit VM memory and de-duplicate memory through transparent page sharing (TPS) is highly useful for consolidating many VMs onto a single server. Nevertheless, one of the older Terminal Server best practices floating around the Internet communities was to disable TPS. And in fact, Project VRC phase 1 showed that disabling TPS actually improved performance by 5-10%. This makes sense, since TPS works through a background process which scans and reallocates memory, consuming a modest amount of CPU in the process.

When the primary objective is to maximize the number of users with Terminal Server workloads and enough physical memory is available, we still recommend disabling TPS. As a result, all Project VRC tests were conducted with TPS disabled, unless stated otherwise.

However, this VRC recommendation should not be understood as an overall recommendation to disable TPS. For instance, when the main goal is to maximize the number of VMs (which is quite common, e.g. in VDI and typical server consolidation efforts), TPS can be very helpful and is recommended.
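
To make that trade-off concrete, here is a back-of-the-envelope sketch of the memory headroom TPS can create. The host size, VM size, and sharing rate are hypothetical assumptions for illustration only, not Project VRC measurements.

```python
# Minimal sketch: rough VM headroom with and without TPS.
# All figures (host RAM, VM size, sharing rate) are hypothetical.
host_ram_gb = 64
vm_ram_gb = 8
assumed_sharing_rate = 0.25   # assume 25% of guest memory can be de-duplicated

vms_without_tps = host_ram_gb // vm_ram_gb
effective_vm_ram_gb = vm_ram_gb * (1 - assumed_sharing_rate)
vms_with_tps = int(host_ram_gb // effective_vm_ram_gb)

print(f"Without TPS: roughly {vms_without_tps} VMs fit in {host_ram_gb} GB")
print(f"With TPS (assumed {assumed_sharing_rate:.0%} sharing): roughly {vms_with_tps} VMs")
```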

Interpreting Project VRC Results

Lastly, Project VRC uses the product-independent Login Consultants VSI 2.1 benchmark to review, compare, and analyze desktop workloads on TS and VDI solutions. The primary purpose of VSImax is to allow sensible and easy to understand comparisons between different configurations.

The data found within Project VRC is therefore only representative of VDI and TS workloads. Project VRC results cannot and should never be translated to other workloads like SQL, IIS, Linux, Unix, Domain Controllers, networking, etc.

Also, the “VSImax” results (the maximum number of VSI users) should never be directly interpreted as real-world results. The VSI workload has been made as realistic as possible, but it always remains a synthetic benchmark with a specific desktop workload. Real-world TS and VDI performance is completely dependent on the specific application set and how and when these applications are used. To include specific applications or customize the VSI 2.1 workload, VSI PRO must be used.
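
For readers unfamiliar with the idea of a saturation point, the sketch below shows the general concept: finding the session count at which measured response times exceed a threshold. It is a deliberately simplified illustration with made-up numbers and a made-up threshold, not the actual Login Consultants VSI 2.1 algorithm.

```python
# Minimal sketch of a VSImax-style saturation point: the session count
# at which measured response times exceed a threshold. The data and the
# threshold below are made up; this is NOT the real VSI 2.1 calculation.
response_times_ms = {   # active sessions -> average response time (ms)
    20: 900, 40: 1100, 60: 1400, 80: 2100, 100: 4200, 120: 6500,
}
THRESHOLD_MS = 2000

saturation_point = None
for sessions, rt in sorted(response_times_ms.items()):
    if rt > THRESHOLD_MS:
        saturation_point = sessions
        break

if saturation_point is not None:
    print(f"Response time first exceeds {THRESHOLD_MS} ms at ~{saturation_point} sessions")
else:
    print("Threshold never exceeded in this test run")
```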

It is important to stress that no benchmark or load-testing tool can predict the real-world capacity of an environment with 100% accuracy, even when the specific applications are used. Real-world users are simply not 100% predictable in how and when they use applications.

Do you virtualize TS/XenApp workloads?

Nowadays, there are many reasons to virtualize TS/XenApp workloads, and there are also reasons not to. Do you agree with the Project VRC best practices? Do you virtualize TS/XenApp? Why (not)? Have you been successful with it?

Join the conversation

10 comments


Regarding CPU over-commit - I've always wondered where the service console fits into this. In the VRC docs, you've always stated that 4x2 vCPU VMs is acceptable (optimal?) on an 8-core system, but where does that leave the service console when it comes to things like backups, snapshots, etc.?



caustic386,


From my experience, the hypervisor itself will use some resources, but normally they would be negligible and will not have a significant impact on the VMs.


Especially if you use ESXi.



Jeroen,


You may have already covered this elsewhere. I’d like to better understand your view on CPU fair sharing within the Citrix stack for XA on a hypervisor. My experience has shown it’s better to turn it off and let the hypervisor scheduler handle the fair sharing. That said, I do lose something that helped make the OS feel a lot more responsive to users.


As for your methodology: I agree it’s a good thing to benchmark for max density, but it's important to keep people informed that max density is not always the desired goal, versus the predictability of a less stressed system. I’ve virtualized mostly x64 workloads, as they were the newer machines in my farm for which I had a use case. On a quad dual-core box, I’ve had no issues running 8 XA servers with one virtual CPU each, 16GB memory per core (so a 64GB server), running at peak 35 sessions per XA instance. So that’s 280 concurrent sessions on the physical server, running mixed workloads from light to heavy once I turned off the CPU fair sharing. I could have got more, but my test rig is only so good. On a standalone physical server I could go higher, but that didn’t help me with my server consolidation goal. I got 350 sessions NP before my test rig died. The hypervisor overhead was relatively small in my view, although I did not spend much time measuring it.


I’m sure I would be fine with x32 as well, as max density is not my goal. For me, the biggest headache is management of the hypervisor environment at scale.



Appdetective,


I got similar results, but on x86 VMs. Each ESX host had 4 dual-processor VMs (to match the 8 cores available), around 30 sessions per VM, 120 per host (on average). The limiting factor here was the CPU usage.


Users were running standard office apps (MS Office, Adobe Reader, Internet Explorer), a mix of light, average, and heavy users (~2000 users total, ~70 VMs).


Besides a 5- to 10-second increase in logon times, we managed to match the exact same performance as physical after the session was established.


This is on some old dual-core AMD processors; I think we could go much further on new hardware (Shanghai, Nehalem).



We're in the middle of migrating to virtualized XA now. Our internal load testing has validated what you have stated above. We deliver an EMR system that gets 4,000 concurrent connections a day, and based on the application's workload we had to go 4 vCPU to match real-world performance. We also found we could not overcommit physical cores. What we haven't tried yet is whether hyperthreading would provide us with increased capacity (by allowing more XA workloads per host). Has anyone tested hyperthreading in this scenario?



@ Appdetective


This is what Ruben and I say about fairshare: it should not make any difference under normal operating conditions.


When apps behave and the load is "normal", fairshare makes absolutely zero difference. The value of fairshare specifically comes into play with rogue/CPU-dominant apps or extreme congestion (logon spike). Fairshare can basically regulate process priority, but it should do this only when it is required (CPU is at 100%). Fairshare has been a life saver in many environments, but it can also negatively affect performance of an app when you really need the CPU cycles.


Even on a hypervisor, within a guest the OS is responsible for the thread scheduling. The hypervisor manages individual VMs and schedules their vCPUs as a whole: or at least, this is my uneducated assumption… So from my perspective fairshare also makes sense when running on a hypervisor.


Just out of curiosity: do you run Office/IE (for personal usage) on that farm, or just specific business apps? The thing is, especially with IE and Flash installed, users browse sites with a s**tload of Flash advertising, and this eats CPU like crazy. (Just check out CPU utilization during the lunch break.)



@Jeroen


I agree with you that at normal load there is no impact, and I understand the compromise of turning it on. However, I think it helps to break up the use cases into SBC and VDI. Let me try to explain what I mean.


SBC. This means users are sharing an operating system, and one user can impact another by, say, watching a Flash video. In this case I get that fair sharing can help.


VDI (let’s just pretend for a second that XD can do this). If I have 8 VMs, one VM per core as an example, and I turn on fair sharing, then here’s how I think about it. In this case it’s a 1-1-1 user-to-OS-to-core mapping. Fair sharing can certainly help apps within that single user session at peak load with the compromises that you mention. This does mean that fair sharing is only happening for the core the machine sees.


If instead I say let the Hypervisor figure it out, and show each VM all the COREs, and assume that not all VMs will be sucking CPU at peak, then overall would that give more available CPU cycles to any app running in any of the VMs? I believe that in this case, if you turn on fair sharing, you will be fighting with the hypervisor's scheduler, which will result in poor performance and reduced scalability.


If I apply this same logic back to the SBC case, then my testing to date shows that turning fair sharing off for XA and showing all cores to the XA VMs allows me to get better predictability, but I do lose the in-OS sharing that helped with user experience. I also expect that by turning off fair sharing I could get greater density of XA VMs running in a more reliable manner.


My test setup is nowhere near as extensive as yours, so at least for me it would be great to have further clarity around this topic from yourself and Ruben, who do great stuff BTW! If I am wrong and fair sharing in XA is better left turned on, then I would ask the question: 1-1 XA for VDI, does it scale better than regular VDI, and is the user experience more predictable?



@AppDetective


I think you have pretty much hit the nail on the head. If you schedule a scheduler, you are likely to create some kind of latency. With no control over the timing of either, you are at the mercy of the gods and must disable one or the other.


Unfortunately, I've seen production environments where exposing 8 vCPUs to a XA server ended up with few users per server and silly unpredictability.


In my experience, the QoS of the XA server environment always wins, and you will keep fair sharing turned on with no vCPU overcommit in favor of user experience. All it takes is one user running an end-of-month report from their enterprise app for any given server to become pretty poor.


In the VDI space we have seen some very nice results, as the AppSense scheduler can be disabled or enabled for sets of applications or processes. While the jury is still out, by actually reducing the aggressiveness of the share-factor algorithm inside the OS we may be able to reduce the conflict that two schedulers create when they mess with CPU clocks.



@appdetective


I think you are wrong about the hypervisor's scheduling.


"If instead I say let the Hypervisor figure it out, and show each VM all the COREs, and assume that not all VMs will be sucking CPU at peak, then overall would that give more available CPU cycles to any app running in any of the VMs?"


The best case is that the vCPU gets 100% of a core (actually the equivalent as it may move around cores).


There are issues with co-scheduling multi-vCPU VMs, as all vCPUs need to be scheduled together; see communities.vmware.com/.../DOC-4960. In this case the 1-1 mapping is more significant, as there is always available CPU.



@appdetective


Re:"Fair sharing can certainly help apps within that single user session at peak load with the compromises that you mention. This does mean that fair sharing is only happening for the core the machine sees."


No, fair sharing is applied to the vCPU, which may be moving around the real cores. I don't know why you think switching off fair sharing in a XA setup would make for a better level of service; think about what happens when a vCPU spikes at 100%.


When you talk about VMs "seeing" cores, are you talking about processor affinity? (I am less familiar with XenServer jargon than VMware's.)


