Teradici CTO Randy Groves weighs in on the PC-over-IP / RemoteFX "custom chip" vs. "GPU" discussion

Customers may *want* to install GPUs in their servers, however, GPUs will not physically fit into most servers.

Yesterday I wrote an article explaining how Microsoft's forthcoming RemoteFX enhancements to RDP can use GPUs or custom chip plug-in cards on the host to render the graphics. I compared Microsoft's use of GPUs to Teradici's use of custom chips, and I questioned whether VMware's Mike Coleman was accurate in his own blog post when he claimed that customers wouldn't want to put GPUs in their servers.

A lively discussion ensued with great points being made on all sides of the conversation. But one comment is so important and insightful it deserves highlighting in its own post. The following was posted as a comment yesterday by Randy Groves, Teradici's CTO:

Let me clarify Mike Coleman’s comments about GPUs in servers. Customers may *want* to install GPUs in their servers, however, GPUs will not physically fit into most servers. For example, the GPU that Microsoft used for their demonstration was an NVIDIA FX5800. This card is a double-wide x16 PCIe card that consumes 189 Watts. The vast majority of virtualized servers are either Blade servers or 2U rack servers. If you survey the PCIe slots available in these servers from the major server OEMs, none of them have a double-wide PCIe slot (i.e. no place to plug this card). In addition, the FX5800 retails for about $3,000. A Dell 2970 server with dual, hex-core Opterons is available on their website for less than $2,000, so it will be less expensive to buy another server than to use an FX5800 as a RemoteFX offload engine.

The slot size and cost are not even the biggest issues. Rack and Blade servers typically support only 25 Watts per slot due to the cooling capacity of these dense servers. An NVIDIA FX370LP is only 25 Watts so could be used in 2U servers that have a x16 slot (none have more than one x16 slot). The FX370LP has only 8 “CUDA cores” compared to the FX5800 which has 240 “CUDA cores”. This means it will support about 30 times fewer displays when offloading RemoteFX.
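As a rough sanity check, the "about 30 times fewer displays" figure is simply the core-count ratio of the two cards, under the stated assumption that RemoteFX offload capacity scales roughly linearly with CUDA cores:

```python
# Back-of-the-envelope check of the "~30x fewer displays" claim.
# Assumption (not a benchmark): displays supported scales roughly
# linearly with CUDA core count.
fx5800_cores = 240   # NVIDIA FX5800, 189 W double-wide card
fx370lp_cores = 8    # NVIDIA FX370LP, 25 W low-profile card

ratio = fx5800_cores / fx370lp_cores
print(ratio)  # 30.0
```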

Between now and the availability of RemoteFX, the GPU vendors could release some newer 25 Watt GPUs on x8 PCIe cards so that more than one GPU can fit into a server, but the 25 Watt power limit will significantly constrain the number of “CUDA cores” that can be included and, therefore, significantly limit the number of displays that can be supported. The other alternative is to put all the GPUs into a PCIe expansion chassis which adds additional cost and takes up valuable space in the datacenter. Neither option is very attractive.

The fundamental reason that GPUs are not a good choice for image compression is that they were designed and optimized for a completely different purpose: rendering pixels for CAD, animation, and gaming. Image compression primarily works with 32-bit pixels which allows fixed-point arithmetic to be used. GPUs must use 32-bit floating-point arithmetic to render 3D images. Floating-point arithmetic requires massively more gates in the silicon than fixed-point arithmetic. This makes GPUs power- and cost-inefficient when doing image compression (in fact, using the multimedia instructions in the server CPU for software image compression is a less costly approach).
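To illustrate the fixed-point point with a sketch of my own (this is not Teradici's actual codec, just an example of the kind of math involved): operations on packed 32-bit RGBA pixels can be done entirely with integer shifts and masks, with no floating-point hardware needed. For example, averaging two pixels channel-wise:

```python
# Illustrative only: image-compression math on packed 32-bit pixels
# can stay in fixed-point (integer) arithmetic. Here, a channel-wise
# average of two 0xAARRGGBB pixels using only shifts and masks.
def avg_pixels_fixed(p0: int, p1: int) -> int:
    """Average two packed 32-bit pixels per 8-bit channel.

    Halving each pixel first (and masking off bits that would carry
    across channel boundaries) avoids per-channel overflow; the low
    bit of each channel is dropped, as in many fast fixed-point codecs.
    """
    return ((p0 >> 1) & 0x7F7F7F7F) + ((p1 >> 1) & 0x7F7F7F7F)

print(hex(avg_pixels_fixed(0xFF202020, 0xFF404040)))  # 0xfe303030
```

The same trick generalizes to the transforms and filters a compressor runs per pixel, which is why fixed-point silicon can do this work with far fewer gates than a floating-point pipeline.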

Don’t get me wrong, the vGPU capability that Microsoft pre-announced will be of value to workstation customers that want to share a beefy GPU between multiple users of 3D applications. In fact, it’s interesting that the “server” Tad was using in your video is actually a Dell Precision Workstation. However, the sharing of a GPU between virtual machines will not likely be unique to Microsoft.

Finally, you stated,

“And VMware's software-based host encoder for PC-over-IP is CPU intensive and affects user density in a big way. (i.e. If you enable it, you get fewer users per server)”.

This is absolutely not consistent with benchmarking we have done. VMware View 4 is achieving the same server consolidation ratios as VMware View 3 did with RDP while XenDesktop 4 consolidation ratios are 20-30% lower than either.

Thanks Randy for the wonderful education!

I guess this means that Microsoft will probably end up pushing customers towards the custom chip option? Although really at this point it's anyone's guess, and realistically this is all just an academic exercise until RemoteFX comes out.

Lastly, I'm curious about the user density on VMware View 4. I know that VMware initially claimed to double user density with View 4 (from 8 to 16 per core), although when we pushed them we learned that a lot of that was due to ESX 4 and Nehalem. So I wonder if the same is true here, where the software PC-over-IP host is affecting user density and keeping it the same as it was with ESX 3 / View 3 when it could have otherwise doubled? Then again, maybe it's an expectation change too? Maybe I was just so excited to experience remoted multimedia via View that I ended up pushing the host harder than I did before?

Join the conversation




Take a look at this VMworld session:


There's a detailed analysis of PCoIP (in beta at that time), ICA, and RDP under various conditions, including CPU overhead on the host.

PCoIP generates *less* CPU overhead than RDP, and similar or better overhead than ICA in most cases.

Take a look at it!



Thanks for highlighting this. It is true that user density is significantly higher on Nehalem servers than on prior generations, but that is independent of View 3 and View 4. The presentation that VMguy points to was done on vSphere (ESX4) and is some of the benchmarking I mentioned (see slide 43). One subtlety to notice is that the PCoIP blue line finishes the benchmark ~10% faster than RDP or ICA. This is because host-side rendering does not hold up an application waiting for client-side handshaking. This is particularly interesting given that this benchmark was done on a low-latency LAN.

I suspect your "CPU-intensive" experience came from viewing something like an HD video which can generate almost 50 million pixel changes per second. That does require the PCoIP encoder to be very busy and is not really a typical use case for VDI (480p and smaller is what most web content is). If a user does need to view HD videos frequently and the videos are not in an MMR-friendly format, a dual vCPU VM should be used. HD video doesn't work very well on a single core physical desktop and virtual CPUs are no different. Decoding HD video requires a lot of CPU power independent of the remoting protocol being used. For this reason, we find that users who need to watch high-fidelity HD video all day are choosing workstations with PCoIP hardware-accelerated solutions most of the time.
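The "almost 50 million pixel changes per second" figure is consistent with 1080p video at 24 fps, if essentially every pixel changes each frame (my reconstruction of the arithmetic, not Teradici's published methodology):

```python
# Rough derivation of the "~50 million pixel changes per second" figure.
# Assumption: 1920x1080 video at 24 fps where every pixel changes each frame.
width, height, fps = 1920, 1080, 24

pixel_changes_per_second = width * height * fps
print(pixel_changes_per_second)  # 49766400, i.e. just under 50 million
```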



This presentation is only available to VMworld 2009 attendees.


Brian - I'm reposting this comment, which I made on your previous post on this topic.


I'm not sure that representing GPUs as just rendering engines is accurate. GPUs should really be considered floating-point math processors that can also be used as very effective graphics processors. You only have to look at the non-graphics uses of high-end GPUs (computational chemistry, molecular dynamics, data mining and analytics, etc.) to see that. NVIDIA is pushing its Fermi architecture GPUs as high-performance computing systems with only a passing reference to their graphics-processing roots, claiming that they "... deliver equivalent performance at 1/20th the power consumption and 1/10th the cost" of the latest quad-core CPUs.

Having said that, I agree with your points on the difficulty of shoehorning GPUs into blade systems; there is neither the space, power, nor cooling capacity to get anything but the smallest of GPUs into a blade server. However, if vendors see an opportunity here, then we can be assured that they will find ways to overcome the problem (dedicated GPU servers/blades may be a possibility here). This may also turn to your advantage: if Cisco were to offer a "RemoteFX Compression Blade" for UCS, there is every reason to expect that a PCoIP Compression Blade would also be offered.

We must also remember that GPU-based RemoteFX is only one option. If Teradici and VMware can create a software implementation of the PCoIP engine that does not impact server consolidation ratios, then there is no reason to think that Microsoft can't do so as well. Personally, I find it hard to accept that anyone can develop a software solution that does not impact server scalability to some degree, and I would suggest that comparing the performance of View 3 with View 4 + PCoIP is not valid. Perhaps you should share the View 4 RDP and View 4 PCoIP performance results instead.





Great post, and it is very interesting to hear of the real-world implications with respect to the cost, space, cooling, and power requirements of these solution options for providing true PC-like graphics functionality to remote users.

Among the recent announcements in this area, one that you may find interesting came from NComputing, who announced a $20 chip that will not only support RemoteFX (when available), but more significantly supports similar functionality right now through their UXP protocol, without requiring any specialized server GPU and its related issues. See details and a demo video at: www.ncomputing.com/numo


Why are we assuming that all of us already have blade servers and that shoehorning a GPU into them will be a problem? What about new customers in this area who are just looking to buy a server solution for virtualization?

You can already get a 1U server with four Tesla GPUs from NVIDIA that is considerably more powerful than a CPU-based solution, at a fraction of the cost. With Fermi around the corner, packing 512 CUDA cores each, it's only going to get more powerful.

I for one am interested in a GPU-based solution for virtualization. Anyone saying otherwise is just protecting their own butt.