It's been a few days now since VMware's VI 3.5 and ESXi 3.5 Update 2 problems, so I thought it prudent to wrap up what's happened. Those of you directly affected are probably already aware of the details.
As of Wednesday morning, a new, full version of Update 2 was not yet available, but an Express Patch had been released to fix existing problems. The Express Patch was made available Tuesday night, as announced in a blog post by new (and probably really tired) CEO Paul Maritz.
In the blog post, dated August 12 at 8:26 PM, Maritz writes:
In remedying the situation, we've already released an express patch for those customers that have installed/upgraded to ESX or ESXi 3.5 Update 2. Within the next 24 hours, we also expect to issue a full replacement for Update 2, which should be used by customers who want to perform fresh installs of ESX or ESXi.
That's a pretty quick turnaround time, even though this should have never, ever happened in the first place. Maritz went on, owning up to the problem and noting specific areas where failures occurred:
I am sure you're wondering how this could happen. We failed in two areas:
- Not disabling the code in the final release of Update 2; and
- Not catching it in our quality assurance process.
We are doing everything in our power to make sure this doesn't happen again. VMware prides itself on the quality and reliability of our products, and this incident has prompted a thorough self-examination of how we create and deliver products to our customers. We have kicked off a comprehensive, in-depth review of our QA and release processes, and will quickly make the needed changes.
I really feel that VMware handled this as well as they could have, and that while this situation shouldn't have happened, they've done an admirable job of stepping up to the plate. It's not as if this happens all the time, or as if patch after patch keeps breaking things. I think it's been a bit of a humbling and eye-opening experience for them, and hopefully that will make them better in the long run.
I feel bad for the folks who waited a week before installing, just to make sure that everything was OK, then got burned anyway. I don't think this is an early-adopter problem or a lack-of-testing problem at all. Nobody outside of VMware could've seen this one coming, and if VMware didn't see it, administrators shouldn't be held responsible.
I've been trying to think of ways to avoid this type of thing in the future. In conventional data centers, if ten servers go down, that's bad because ten servers went down. In a virtualized world, if ten hosts go down, that could mean 100 servers went down, or 400 workstations! "Wait longer before rolling out an update!" will surely come up as a way to avoid it, but that's not really a solution. The drop-dead date for Update 2 could've been December 12 instead of August 12, and you wouldn't expect folks to wait that long to test, would you?
It's naive to think that something like this won't happen again (with VMware or anyone else, for that matter). A comment on the blog post I made the other day mentioned using multiple hypervisors in the data center as a redundant solution, but at this point I'm not sure that's a viable option. Maybe if your two platforms were Hyper-V and XenServer, since Microsoft and Citrix have a partnership, but otherwise I just don't see it happening.
Some might say that simply splitting the virtualized servers between two or three hypervisors is enough, since you're only shouldering half the outage, but in reality that just makes the environment two or three times more prone to a failure somewhere. An airplane with two engines is roughly twice as likely to experience an engine failure as an airplane with one engine, and the same applies here.
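To put rough numbers on the airplane analogy, here's a quick sketch in Python. The assumptions are mine and purely illustrative: each platform has the same made-up chance `p` of shipping a breaking update in some period, platforms fail independently of each other, and the VMs are split evenly between them.

```python
def p_any_failure(p: float, n: int) -> float:
    """Chance that at least one of n independent platforms fails."""
    return 1 - (1 - p) ** n


def expected_fraction_down(p: float, n: int) -> float:
    """Expected share of VMs down when VMs are split evenly across n platforms."""
    # Each platform carries 1/n of the VMs and fails with probability p.
    return sum(p * (1 / n) for _ in range(n))


if __name__ == "__main__":
    p = 0.01  # hypothetical failure chance per platform, not a measured figure
    for n in (1, 2, 3):
        print(
            f"{n} platform(s): "
            f"P(some outage) = {p_any_failure(p, n):.4f}, "
            f"expected share of VMs down = {expected_fraction_down(p, n):.4f}"
        )
```

Under those assumptions, the chance of having some outage to deal with grows almost linearly with the number of platforms, while the expected share of VMs that are down stays the same. Splitting trades one big, rare outage for smaller, more frequent ones; it doesn't buy any redundancy.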
So, putting all your eggs in one basket is bad, as is splitting them between a few baskets. Maybe it's time to think about putting all your eggs in multiple baskets. Maybe a third-party "synchronizer" that can automatically maintain two completely separate environments? I can't imagine the storage required, but if the systems are important enough to justify two virtualization platforms, then storage might not be a problem.
So what do you think? Is it necessary? Can it be done in an automated way? Got any other ideas?