Yesterday I wrote an article explaining how Microsoft's forthcoming RemoteFX enhancements to RDP can use GPUs or custom chip plug-in cards on the host to render the graphics. I compared Microsoft's use of GPUs to Teradici's use of custom chips, and I questioned whether VMware's Mike Coleman was accurate in his own blog post when he claimed that customers wouldn't want to put GPUs in their servers.
A lively discussion ensued, with great points made on all sides. But one comment was so important and insightful that it deserves highlighting in its own post. The following was posted as a comment yesterday by Randy Groves, Teradici's CTO:
Let me clarify Mike Coleman’s comments about GPUs in servers. Customers may *want* to install GPUs in their servers; however, GPUs will not physically fit into most servers. For example, the GPU that Microsoft used for their demonstration was an NVIDIA FX5800. This card is a double-wide x16 PCIe card that consumes 189 Watts. The vast majority of virtualized servers are either blade servers or 2U rack servers. If you survey the PCIe slots available in these servers from the major server OEMs, none of them have a double-wide PCIe slot (i.e., no place to plug this card in). In addition, the FX5800 retails for about $3,000. A Dell 2970 server with dual hex-core Opterons is available on their website for less than $2,000, so it will be less expensive to buy another server than to use an FX5800 as a RemoteFX offload engine.
The slot size and cost are not even the biggest issues. Rack and blade servers typically support only 25 Watts per slot due to the cooling capacity of these dense servers. An NVIDIA FX370LP is only 25 Watts, so it could be used in 2U servers that have an x16 slot (none have more than one x16 slot). The FX370LP has only 8 “CUDA cores,” compared to the FX5800's 240. This means it will support about 30 times fewer displays when offloading RemoteFX.
Between now and the availability of RemoteFX, the GPU vendors could release newer 25 Watt GPUs on x8 PCIe cards so that more than one GPU can fit into a server, but the 25 Watt power limit will significantly constrain the number of “CUDA cores” that can be included and, therefore, significantly limit the number of displays that can be supported. The other alternative is to put all the GPUs into a PCIe expansion chassis, which adds cost and takes up valuable space in the datacenter. Neither option is very attractive.
The fundamental reason that GPUs are not a good choice for image compression is that they were designed and optimized for a completely different purpose: rendering pixels for CAD, animation, and gaming. Image compression primarily works with 32-bit pixels, which allows fixed-point arithmetic to be used. GPUs must use 32-bit floating-point arithmetic to render 3D images. Floating-point arithmetic requires massively more gates in the silicon than fixed-point arithmetic. This makes GPUs power- and cost-inefficient when doing image compression (in fact, using the multimedia instructions in the server CPU for software image compression is a less costly approach).
Don’t get me wrong: the vGPU capability that Microsoft pre-announced will be of value to workstation customers who want to share a beefy GPU between multiple users of 3D applications. In fact, it’s interesting that the “server” Tad was using in your video is actually a Dell Precision Workstation. However, the sharing of a GPU between virtual machines will not likely be unique to Microsoft.
Finally, you stated,
“And VMware's software-based host encoder for PC-over-IP is CPU intensive and affects user density in a big way. (i.e. If you enable it, you get fewer users per server)”.
This is absolutely not consistent with the benchmarking we have done. VMware View 4 is achieving the same server consolidation ratios as VMware View 3 did with RDP, while XenDesktop 4 consolidation ratios are 20-30% lower than either.
Thanks, Randy, for the wonderful education!
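To make Randy's numbers a bit more concrete, here's a quick back-of-envelope calculation in Python. Keep in mind that the assumption that RemoteFX display capacity scales linearly with CUDA core count is Randy's simplification, not anything Microsoft or NVIDIA has published:

```python
# Back-of-envelope math using the figures from Randy's comment.
# ASSUMPTION: displays scale linearly with "CUDA cores" -- Randy's
# simplification, not a published spec.

FX5800_CORES = 240      # NVIDIA FX5800 "CUDA cores"
FX5800_PRICE = 3000     # approximate retail, USD
FX5800_WATTS = 189      # double-wide x16 PCIe card

FX370LP_CORES = 8       # NVIDIA FX370LP, a 25 Watt card
SLOT_LIMIT_WATTS = 25   # typical 2U / blade PCIe slot cooling budget

SERVER_PRICE = 2000     # Dell 2970, dual hex-core Opterons (per Randy)

print(FX5800_CORES / FX370LP_CORES)    # 30.0  -> "about 30 times fewer displays"
print(FX5800_PRICE / SERVER_PRICE)     # 1.5   -> the card costs 1.5x a whole server
print(FX5800_WATTS / SLOT_LIMIT_WATTS) # 7.56  -> ~7.5x over the slot power budget
```

However you slice those three ratios, the big card doesn't pencil out for dense virtualization hosts, which is exactly Randy's point.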
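And his fixed-point versus floating-point argument is worth a concrete illustration too. Image codecs typically do per-pixel math, like the standard BT.601 luma conversion, with integer fixed-point coefficients and a shift, while a GPU's shader pipeline naturally works in 32-bit floats. The sketch below is a generic illustration of that technique, not RemoteFX's actual codec (which Microsoft hasn't published):

```python
# Fixed-point vs. floating-point luma (BT.601) from 8-bit RGB.
# A generic codec-style illustration -- NOT the actual RemoteFX codec.

def luma_float(r, g, b):
    # What a GPU shader would naturally do: 32-bit float multiply-adds.
    return 0.299 * r + 0.587 * g + 0.114 * b

def luma_fixed(r, g, b):
    # Same math in 8.8 fixed point: integer multiplies plus one shift.
    # 77 + 150 + 29 == 256, so the >> 8 renormalizes exactly.
    return (77 * r + 150 * g + 29 * b) >> 8

for rgb in [(255, 255, 255), (255, 0, 0), (18, 52, 86)]:
    print(rgb, luma_float(*rgb), luma_fixed(*rgb))
```

Integer multiply-and-shift units are far cheaper in silicon gates (and watts) than IEEE floating-point units, which is why a purpose-built compression chip, or even the CPU's SIMD multimedia instructions, can beat a GPU on power and cost for this one job.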
I guess this means that Microsoft will probably end up pushing customers toward the custom chip option? Although really, at this point it's anyone's guess, and realistically this is all just an academic exercise until RemoteFX actually ships.
Lastly, I'm curious about the user density of VMware View 4. I know that VMware initially claimed View 4 would double user density (from 8 to 16 users per core), although when we pushed them we learned that a lot of that gain was actually due to ESX 4 and Nehalem. So I wonder whether the same is true here: is the software PC-over-IP host encoder affecting user density and keeping it the same as ESX 3 / View 3, when it otherwise could have doubled? Then again, maybe it's an expectation change too? Maybe I was just so excited to experience remoted multimedia via View that I ended up pushing the host harder than I did before?