How the hidden "SLA bump" can kill your VDI project: You'd better know your desktop SLAs going in!

This week's conversation about whether you should use local or SAN-based storage for VDI has been one of the best conversations we've ever had on any topic. What's clear after reading the 50+ comments is that people fall roughly into two camps:

  • Those who believe that IT services delivered from the datacenter—including VDI—should have a high level of reliability, and
  • Those who think, "meh... they're just desktops. How reliable do they need to be?"

(At this point I should mention that it's possible to leverage local storage AND have high availability, but that's not the point of today's article.) The conversation centers around how reliable "desktops" need to be. Obviously this is something that varies from company to company and depends on many things, including why the company is doing VDI in the first place. And while we can't resolve the "How reliable do desktops need to be?" question here, we can set the context for how you can answer that question in your own environment.

Whenever the question of desktop reliability comes up, the first thing people point out is that traditional desktops and laptops are neither reliable nor redundant. Individual desktops have no redundancy and are actually not that reliable—a typical one might break every 2,000 hours (hard drive, fan, memory, whatever). The saving grace, of course, is that a single desktop failure only takes out a single user at a time—from an SLA perspective, traditional desktops only break one at a time.

Compare that to VDI desktops running on servers in your datacenter. Your servers—with redundant drives, power, memory, and fans—might only fail every 200,000 hours instead of every 2,000. Even with 40 or 50 users per server, you still have a statistically lower per-user failure rate with server-based desktops, but in a practical sense it's a bigger deal, since VDI means you're dealing with 40 users down at once instead of just one. (Not that it really matters in the context of this conversation, but a far bigger cause of failure than hardware is human error—dumb users with traditional desktops and dumb admins with VDI.)
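The back-of-the-envelope math here is worth writing down. Everything below uses this article's rough figures (2,000- and 200,000-hour MTBFs, 50 users per server), which are illustrative estimates, not measured data:

```python
# Per-user failure rate vs. blast radius, using this article's rough
# MTBF estimates (illustrative numbers, not measurements).

desktop_mtbf = 2_000     # hours between failures for one traditional PC
server_mtbf = 200_000    # hours between failures for a redundant VDI host
users_per_server = 50

# How often does a hardware failure knock any given user offline?
desktop_failures_per_user_hour = 1 / desktop_mtbf   # 1 per 2,000 hours
vdi_failures_per_user_hour = 1 / server_mtbf        # 1 per 200,000 hours

# The server-based desktop is statistically 100x safer per user...
assert vdi_failures_per_user_hour < desktop_failures_per_user_hour

# ...but each VDI failure has a 50x bigger blast radius:
users_down_per_desktop_failure = 1
users_down_per_server_failure = users_per_server
```

Per user, the VDI host fails 100x less often, but each failure takes out 50 users at once instead of one—which is exactly why the two camps talk past each other.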

Of course the actual impact of a single server or desktop failure really depends on what it means to be "down." In the perfect world where all user data and personality are on the network, your "downtime" is only as long as it takes to get a new machine in front of the user. Swapping out a laptop might be a few hours for a traditional user, but with VDI, reconnecting to a functional desktop is just a matter of re-clicking the "connect" button.

Just having a few seconds of downtime, though, only works when your data and personality are on the network. And while we all know that the shared-master-with-dynamically-layered-desktop is the long-term goal of VDI, the reality is that most of today's VDI deployments are based on persistent disks, where each user "owns" his or her own disk image. (Right?) And if that persistent image is stored on a single server, and that server goes down, then you're looking at a longer outage period as you move drives, restore, and/or rebuild those users' desktops.

But wait, isn't that the way traditional desktops work? If a user loses a hard drive, they're down until you restore their data from backup. The amount of time that takes is directly related to how you've planned for this. Are you backing up laptops? Are you only backing up data that's on the network in their profile and home drive? Are you providing a service like Dropbox? Do you back up the entire desktop?

At the end of the day, the money you spend backing up and protecting traditional desktops is directly related to how important your company perceives that desktop to be. And moving the desktops to the datacenter doesn't change that. (Well, again this depends on the reason you're moving your desktops to the datacenter. If it's truly about availability and not cost savings, then yeah, boot them from a SAN and get live migration and everything. But that's not the reason that a lot of people do VDI.)

The level of redundancy you build into your desktop solution—be it traditional or VDI, local or SAN—is a matter of company policy. The technology is merely a tool to deliver on a certain SLA.

So what are your desktop SLAs currently? If a user's laptop fails, what's the agreed upon time to provide a replacement? When that replacement is booted up, what's the expectation of what's on it? Are the apps reinstalled? Is the profile back? Is all the user data there?

Why SLAs matter even more for VDI

One of the interesting "side effects" of VDI (and desktop virtualization in general) is that it forces you to "formalize" all aspects of your desktop environment. I'm willing to bet that a lot of companies don't even have written SLAs when it comes to desktops. Sure, you could estimate what your "approximate" SLA is—within 4 business hours for example—but it's probably not formal.

The reason this is important with VDI is that higher SLAs mean higher costs. 24/7 four-hour response is more expensive to deliver than 8/5 four-hour. 99.999% uptime is more expensive than 99.99%. We all know that.

The problem with VDI is that there's often a hidden "SLA bump" that happens when you move your desktops from the desktop to the datacenter. (The "bump" is that you move your desktops from an environment with a low SLA to an environment with a high SLA.) You get a bump in SLA without a corresponding plan to bump the costs. This is a problem because companies don't usually account for it as they're working out the cost models and financial justification for VDI. (You know, if you believe in that kind of thing.) Somehow these companies hope to magically increase the SLA for their desktops while expecting the management costs to be cheaper?!?

What does this mean for you?

In order to avoid falling victim to the unintentional SLA bump, you should first formalize the existing SLA that you have for your desktops. Even if this is something you do on your own, figure out what the expectation is for desktop support and write it down. If you're just in the planning stages for your VDI project, ask around and see what people's expectations are around support and uptime. If they're higher for VDI, point out that fact and maybe ask whether there's extra budget for that.

And remember, desktops and servers are not the same. Just moving your desktops into the datacenter doesn't automatically mean that you have to change your desktop SLA. It's ok for desktop users to be second-class datacenter citizens. If your company wants to upgrade your users to first class, remember that it always costs money.

Join the conversation



coming from the old world of Terminal Server / MetaFrame / XenApp, this type of risk is well known...

if desktop is down : reconnect to session from any device

if network is down : reconnect to session from other network

if server is down : immediately launch a new session

users get educated... and it is well accepted.

of course, this is a connected mode ;-)

but for most customers and applications, data resides outside of the desktop/front-end server devices... on the file server, in the database, and you only lose the last unsaved content (not a lot, but it could be important based on your business)...


Moving desktops to the datacenter with desktop virtualization can't happen without rethinking your desktop SLA! Thanks Brian, agreed.


Good point about the SLA bump. This is exactly what we see with customers: as soon as we engage on a VDI project in the data center, the existing server SLAs (or just expectations) are automatically assumed to apply.

Then the whole process becomes much more complex and expensive, because we are designing for ultimate uptime, HA, redundancy, etc.

Since VDI is relatively immature, this leads to environments that become overly complex and may not really be necessary. Good suggestion: go back and first establish your SLAs/expectations for the desktop before starting a VDI project!


Nah, this is just a quick follow-up on the 60 or so comments on Local Storage vs. SAN. Of course one would expect any serious VDI talk to be considered in terms of its relative strength, contrasted against the 90-or-so-percent dominance of the PC. pDesk, the April 1 joke of 2009, seems to be very capable..

So Brian throws this Mean Time Between Failure on the fire (as it burns so nicely). Quite expected, and typed on the very form of his pDesk. For remember, the rules do not apply justly.

In my opinion I do indeed see the merits and justifications for the different sorts of remote delivery in their niche cases – not just apps but desktops. OK, fine. I still consider VDI for what it is: a rare niche, for years to come. If or when I end up on the losing side of that slippery slope, I will cry my eyes out for the loss of creativity, the barren ground of swept-away individuality and play. No laughter to be heard.


You seem to miss/overlook that from the user's perspective, unless both their desktop device and the VDI solution are up, they can't do work. No matter how reliable your VDI solution is, overall it will make the user's experience less reliable. If you have physical desktops at 98.8% uptime (from the article: 1 - 24/2000) and your VDI solution at 99.988% uptime (from the article: 1 - 24/200000), the end user only experiences 98.788% uptime. Not a big change, but worse nonetheless.

Secondly, when VDI is down there is nothing a user can do about it, so they feel out of control. When their physical desktop is down, at least they can mess with it and see if they can resolve the issue themselves. This is hard to quantify, but, at least to me, it's very apparent.
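The commenter's serial-availability point is easy to check. Here's a quick sketch using the article's MTBF figures and an assumed 24-hour repair window (the exact percentages depend entirely on those assumptions):

```python
# Serial availability: the user needs BOTH the endpoint device and the
# VDI back end up at once. The 24-hour repair window is an assumption
# implied by the comment above, not a figure from the article.

repair_hours = 24

endpoint_uptime = 1 - repair_hours / 2_000     # 0.988   (98.8%)
vdi_uptime = 1 - repair_hours / 200_000        # 0.99988 (99.988%)

# Availabilities in series multiply, so the chain is always a little
# worse than its weakest link:
combined_uptime = endpoint_uptime * vdi_uptime
assert combined_uptime < endpoint_uptime
print(f"{combined_uptime:.3%}")   # ~98.788%
```

In other words, no matter how available the back end is, the user's effective uptime can never rise above that of the endpoint they connect from.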