VMware Update 2 wrap-up, Plus - what can we do in the future?

It's been a few days now since VMware's ESX 3.5 and ESXi 3.5 Update 2 problems, so I thought it prudent to wrap up what's gone on. Those of you directly affected are probably already well aware of what went down.

As of Wednesday morning, a new, full version of Update 2 was still unavailable, but an Express Patch had been released to fix existing installations. The Express Patch was made available Tuesday night, as announced in a blog post by new (and probably really tired) CEO Paul Maritz.

In the blog post, dated August 12 at 8:26 PM, Maritz writes:

In remedying the situation, we've already released an express patch for those customers that have installed/upgraded to ESX or ESXi 3.5 Update 2. Within the next 24 hours, we also expect to issue a full replacement for Update 2, which should be used by customers who want to perform fresh installs of ESX or ESXi.

That's a pretty quick turnaround time, even though this should have never, ever happened in the first place. Maritz went on, owning up to the problem and noting specific areas where failures occurred:

I am sure you're wondering how this could happen. We failed in two areas:

  • Not disabling the code in the final release of Update 2; and
  • Not catching it in our quality assurance process.

We are doing everything in our power to make sure this doesn't happen again. VMware prides itself on the quality and reliability of our products, and this incident has prompted a thorough self-examination of how we create and deliver products to our customers. We have kicked off a comprehensive, in-depth review of our QA and release processes, and will quickly make the needed changes.

I really feel that VMware handled this as well as they could have, and that while this situation shouldn't have happened, they've done an admirable job of stepping up to the plate. It's not as if this happens all the time, with patch after patch breaking things. I think it's been a humbling and eye-opening experience for them, and hopefully that will make them better in the long run.

I feel bad for the folks that waited a week before installing, just to make sure that everything was OK, then got burned anyway. I don't think this is an early-adopter problem or a lack of testing problem at all. Nobody outside of VMware could've seen this one coming, and if they didn't see it, administrators shouldn't be held responsible.

I've been trying to think of ways to avoid this type of thing in the future. In conventional data centers, if ten servers go down, that's bad because ten servers went down. In a virtualized world, if ten hosts go down, that could mean 100 servers went down, or 400 workstations! "Wait longer before rolling out an update!" will surely come up as a way to avoid it, but that's not really a solution. The drop-dead date for Update 2 could've been December 12 instead of August 12, and you wouldn't expect folks to wait that long to test, would you?

It's naive to think that something like this won't happen again (with VMware or anyone else, for that matter). A comment in the blog post I made the other day mentioned using multiple hypervisors in the data center as a redundant solution, but at this point I'm not sure that's a viable thing. Maybe if your two platforms were Hyper-V and XenServer, since Microsoft and Citrix have a partnership, but otherwise I just don't see it happening.

Some might say that simply splitting the virtualized servers between two or three hypervisors is enough, since you're only shouldering half the outage, but in reality that makes you two or three times more likely to be hit by a failure somewhere. An airplane with two engines is twice as likely to experience an engine failure as an airplane with one engine, and the same applies here.
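The engine analogy is just the arithmetic of independent failures. A quick sketch (assuming, purely for illustration, that each platform has the same independent chance p of shipping a bad patch):

```python
# Chance that at least one of n independent platforms fails,
# each with per-platform failure probability p.
def p_any_failure(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

# With a hypothetical 1% chance of a bad patch per platform:
print(p_any_failure(0.01, 1))  # ≈ 0.0100 — one hypervisor
print(p_any_failure(0.01, 2))  # ≈ 0.0199 — two: roughly twice as likely
print(p_any_failure(0.01, 3))  # ≈ 0.0297 — three: roughly three times
```

For small p this is approximately n times p, which is exactly the "two engines, twice the failures" intuition; the trade-off is that each individual failure now takes out only part of your estate instead of all of it.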

So, putting all your eggs in one basket is bad, as is splitting them between a few baskets. Maybe it's time to think about putting all your eggs in multiple baskets. Maybe a third party "synchronizer" that can automatically maintain two completely separate environments? I can't imagine the storage required, but if the systems are important enough to use two virtualization platforms then storage might not be a problem.

So what do you think? Is it necessary? Can it be done in an automated way? Got any other ideas?


It just reminds me of a training course I attended on risk management...
One of the main topics was: "when you perform a modification, how do you get back to the previous stable state in case of trouble?" Everybody has faced this (I'm sure)...

We probably just have to extend this to: "you have to find a way to get back to any earlier N-x state at all times"...

It means an upgrade project will include a step-back procedure... and you will have to keep that step-back procedure up and running for each modification (if N-1 didn't work, try N-2, then N-3...).
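That "try N-2, then N-3" loop is simple enough to sketch. The `restore` and `healthy` hooks below are hypothetical placeholders for whatever your platform actually provides, not any real VMware or Citrix API:

```python
# Walk back through saved states (newest first) until one passes a health check.
def roll_back(states, restore, healthy):
    """states: known-good candidates, newest first (N-1, N-2, ...).
    restore(state): hypothetical hook that reverts to a saved state.
    healthy(): hypothetical hook that verifies the environment works."""
    for state in states:
        restore(state)
        if healthy():
            return state  # first earlier state that works again
    raise RuntimeError("no earlier working state found")
```

The point of keeping every N-x state "up and running" is exactly so this loop has somewhere left to go when N-1 turns out to be broken too.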

That's also where OVF comes in, by allowing (in the near future) a common, compatible VM format... It won't automate the solution, but it will allow you to place VMs on another platform...

Very true, but the problem posed by VMware patching is that there is no easy rollback procedure.  So if an upgrade went belly up, I would assume (and correct me if I am wrong) that it would basically be a restore of the affected ESX host, and not a simple uninstall.  And to make it even worse, it is best practice to have all your ESX servers at the same patch level.  So what do you do?  Conundrum indeed!

Pull the mirrored drive of the ESX host, or just rebuild from scratch without the update. This is of course assuming you have SAN snapshots running frequently. Still downtime though, which blows. Two clusters at different patch levels? Hmmm, lots of $ for licensing. Rock and a hard place if you ask me.


!!!without any advertising!!!

That's where I like solutions like Provisioning Server from Citrix.

  • You reboot, you get back to the "initial compliant configuration"
  • You can choose to boot from an "out of test" image (not an upgraded production version)
  • You can choose to boot back to another configuration (N-x capable).

The "time to upgrade" is 30 minutes (the time to reboot). The "time to roll back" is 30 minutes (the time to reboot). The "time to get back to configuration xyz" is 30 minutes (the time to reboot). The risk of rolling back is down to a minimum.

That's the end of the wonderful story... In reality, it may be different... Without talking about pricing, I'm not sure all servers are eligible for this solution. I don't know if you can stream ESX, Hyper-V or XenServer servers. Could it do that with an Oracle DB? I just tested it with a "blank Windows server" and it is great... I'm now testing it with my Pre... XenApp servers and it seems to work as it should...

Probably a technology worth investigating.

VMware should lower its prices. This demonstrates they are not 100% reliable, so they cannot charge the highest prices. Microsoft will be the end of them.

How many shops actually perform full host-based recovery? I still think the hypervisor et al. needs to be viewed as a disposable commodity: if it dies, big deal, there should be tested procedures around its re-birth. I think too many of us, myself included at times, look at BCP and DRP as something you do afterwards. It should be something that is built into the project or product at the beginning... This analogy may be a bit extreme, but it's not like we started jumping out of planes before we realized that, hey, a parachute may be a good idea. The same should be said for the enterprise systems that we hold very dear to the business.



This makes sense to me.  I'd not heard of OVF before...thanks for the tip.  And as to the airplane analogy, I would say that you are right that an airplane with two engines is twice as likely to have an engine failure.  But if an engine fails, the plane can still fly...it's difficult, but at least it won't crash.

I could just be nuts but would this work?... 

To supply the VMware ESX OS to the hardware, use Provisioning Server to roll back to the last known good image, i.e. the one from before you applied the patch.



I'm in full agreement as to the benefit of using Provisioning Server as a rapid provisioning/roll-back service, but would it have worked here?

I know that Provisioning Server can manage the OS and above and deliver that to bare metal or to a hypervisor, but can it deliver the full stack including the hypervisor? Sorry if this is a stupid question, I'm not quite awake yet.

And building on that concept, I'll have to look at VisionApp to see if their management tool would play a part in managing an App/OS/hypervisor stack.

Provisioning Server can be used to provision the hypervisor and the hosts on the server it has provisioned. It is demo'd in this video: http://mfile.akamai.com/8296/wmv/citrix.download.akamai.com/8296/iForum07/Demos/ProvisioningServerDemo.asx.

Re > Maybe if your two platforms were Hyper-V and XenServer, since Microsoft and Citrix have a partnership, but otherwise I just don't see it happening.

Not today, but move fwd a couple of years.

You don't think twice about running a single service on Intel and AMD processors; they are interchangeable commodities. Why shouldn't the hypervisor be exactly the same? In a couple of years, it will be.


It has to be said, ESX has an alarming number of patches.

You don't see Intel and AMD releasing patches to their platforms that often, nor Citrix XenServer or the open-source Xen platform.



thanks - problem solved, buy Provisioning server and use it to deploy the hypervisor.

I think there's a wonderful marketing opportunity for Citrix here :)


You currently cannot stream the Hypervisor itself, the demo included a lot of custom work at the backend.

Support for Hyper-V will be added in the next release of PVS (due out this month). XenServer streaming is planned for early next year. And after this we are prioritizing support for ESX :-)


A Citrix Employee....Posting on here as a guest....can it be....

Maybe this explains why there are so many posts as a guest from people trashing VMware / Quest / Provision... in fact any Citrix competitor.  Does someone @ Citrix monitor these boards... are people getting rewarded... is it a points-based system :)

If you're referring to Intel and AMD from a CPU perspective, they don't release patches as you can't reprogram a CPU.  If you're talking about them from a motherboard reference design, yes they release new BIOS chips, but that's a little different than what we're talking about here.  What exactly are you trying to say?

The point being that we have assigned a level of confidence to the hypervisor that is not backed up by reality.  The subliminal marketing message is that the hypervisor and the physical platform are interchangeable, that what works on one will work on the other, and that the hypervisor does not detract from (and in many cases improves on) the reliability/availability of the physical platform.

This assumption has now been demonstrated to be wrong.

Processor vendors, knowing that they can't just issue a patch, have to work much harder to get it right before they release a new product.  Either the hypervisor developer has to make changes to ensure they deliver a product that is as stable and bug free as the processor OR we have to assume that it is going to fail at some point and protect ourselves from this reality.

Wow, y'all are making this sound like a huge deal, when I don't think it was.  So the folks at VMware made a mistake.  It's shocking because it hasn't happened before, and we've all seen the screenies or movies of VMware hosts that have been online and running for over 3 years.  That's a pretty good track record!  I think MS and Citrix have to add a whole lot of functionality to get even close to the feature set VMware offers in ESX.  Heck, even the free ESXi has more features than Hyper-V.  In the end, the very flexibility that VMware's brand of virtualization brings to the table is the thing that saved it.

Read the message will you, he's saying that Provision Server will NOT solve the problem.

You should be thanking him for correcting a mis-assumption, not knocking him for posting as a guest. 

 <shakes head in disbelief>


If you read the guy's comments about Citrix employees posting on here as a guest, he has a very valid point.  I don't see him criticizing the contents of the Citrite's post... just that when you see comments like the one posted further down (about ESX patching and how great XenServer is) that are so biased against any Citrix competitor, it makes you wonder... how many other comments are from Citrix employees...

Do you ride the short bus?

Sounds good, but doesn't it just move the point of failure?  Yeah, it can be HA'd, but if PVS has a similar time bomb in the code, things would not be any better.  Don't get me wrong, I really like the idea of streaming the hypervisor, but it only moves the potential problem, and it is not as simple as just streaming a disk.  There are integration issues that require cooperation from the OS (or hypervisor).  This is likely going to happen with Hyper-V and Xen, but I would be surprised to see it with ESX.  That said, I think this was a huge issue, as I believe VMware does as well.  It’s similar to a surgeon sewing you back up with a sponge left inside.  However, the concepts of “do no harm” and malpractice do not apply to software vendors.  The real problem is that it feeds the fears (both rational and irrational) of upper management regarding virtualization.


It's worth remembering that this was a licensing bug, and it didn't cause any downtime for machines already running. This is the way VMware designed the system: if the licence server goes down, everything else continues running. Admittedly you couldn't cold boot a VM, and you couldn't VMotion, but I think most datacentres can live without these for two days. I mean, how often do you cold boot a server?!

No, I don't work for VMware, but I am a VCP. Anon because I still haven't registered :-)


This is just a question of lifecycle... The basis of your infrastructure should be the system you upgrade the least... The true point is: a bug with licence expiration could not have been avoided by anybody...


I reboot about 150 servers per night, so I guess I have a slightly different perspective from most people on the significance of this incident.

My takeaway from this incident is the importance of knowing the difference between running on a hypervisor and running on a processor, and acting accordingly.  As stated above:

" Either the hypervisor developer has to make changes to ensure they deliver a product that is as stable and bug free as the processor OR we have to assume that it is going to fail at some point and protect ourselves from this reality."


cold boot or reboot?


Please google "microcode update" and then join the rest of us in the 21st century.

Since you already made the analogy, though, this looks a LOT like the Pentium 60/66 division error to me.

Do you cry wolf or work for Citrix?