Brian Madden
Your independent source for application and desktop virtualization.

Gabe Knuth's Blog

Oops! VMware's VI 3.5 Update 2 locks out customers, pulls Update 2 download.

Written on Aug 12 2008


by Gabe Knuth

Word on the street is that VMware Infrastructure 3.5 Update 2 and ESXi 3.5 Update 2 are affected by a licensing bug left over from development. The bug makes it impossible to power on a virtual machine that is powered off, to resume one that is suspended, or to migrate one with VMotion, and the only message the admin sees is a "general error." Searching through the logs reveals the real problem:

Aug 12 10:40:10.792: vmx| [msg.License.product.expired] This product has expired.
Aug 12 10:40:10.792: vmx| Be sure that your host machine's date and time are set correctly.
Aug 12 10:40:10.792: vmx| There is a more recent version available at the VMware Web site: "http://www.vmware.com/info?id=4".
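Because the console only reports a "general error," the log is where the real cause shows up. Below is a minimal sketch for scanning VM logs for this message, assuming each VM's vmware.log can be read as plain text with one message per line. The function names and the idea of passing in a list of log paths are hypothetical illustrations; the only detail taken from the excerpt above is the msg.License.product.expired message ID.

```python
import re
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

# Message ID taken from the log excerpt; the regex matches it whether or
# not the log wraps it in square brackets.
EXPIRED_MSG = re.compile(r"msg\.License\.product\.expired")


def find_license_expiry(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, line) for every line that mentions the expiry message."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if EXPIRED_MSG.search(line)
    ]


def scan_logs(paths: Iterable[Path]) -> Dict[Path, List[Tuple[int, str]]]:
    """Map each log file to its matching lines, skipping files with no hits."""
    results: Dict[Path, List[Tuple[int, str]]] = {}
    for path in paths:
        # errors="replace" keeps a stray non-UTF-8 byte from aborting the scan.
        with open(path, encoding="utf-8", errors="replace") as fh:
            hits = find_license_expiry(fh)
        if hits:
            results[path] = hits
    return results


if __name__ == "__main__":
    sample = [
        "Aug 12 10:40:10.792: vmx| [msg.License.product.expired] This product has expired.\n",
        "Aug 12 10:40:10.792: vmx| Be sure that your host machine's date and time are set correctly.\n",
    ]
    print(find_license_expiry(sample))
```

Matching on the message ID rather than the English text keeps the check independent of how the surrounding log line is formatted.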

The workaround involves disabling NTP and setting the host's clock back to August 10; full details can be found here.
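Why does turning back the clock help? A time-bombed license check trusts the host clock, so any date before the cutoff passes. Here's a hypothetical model of that behavior. It is not VMware's code: the cutoff date is an assumption inferred only from the day the failures started, and the function name is made up for illustration.

```python
from datetime import datetime

# Hypothetical cutoff, inferred only from the date the failures began.
# This is NOT VMware's actual constant or logic.
BUILD_EXPIRY = datetime(2008, 8, 12)


def can_power_on(host_clock: datetime, expiry: datetime = BUILD_EXPIRY) -> bool:
    """Model of a date-based license check that trusts the host clock."""
    return host_clock < expiry


# Before the cutoff everything works, so a test cycle that finished
# earlier would never have tripped the check.
print(can_power_on(datetime(2008, 8, 9)))          # True
# On the morning of August 12 the same build refuses to power on VMs.
print(can_power_on(datetime(2008, 8, 12, 10, 40)))  # False
# The workaround: stop NTP (which would otherwise resync the clock forward)
# and set the clock back to August 10.
print(can_power_on(datetime(2008, 8, 10)))          # True
```

This also shows why NTP has to be disabled as part of the workaround: as soon as the clock is resynced past the cutoff, the check fails again.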

According to the KB article at VMware.com, engineering "will reissue the various upgrade media including the ESX 3.5 Update 2 ISO, ESXi 3.5 Update 2 ISO, ESX 3.5 Update 2 upgrade tar and zip files by noon, PST on August 13. These will be available from the page: http://www.vmware.com/download/vi. Until then, VMware advises against upgrading to ESX/ESXi 3.5 Update 2."

Let's hope it actually gets fixed in that amount of time; otherwise, people will be resetting their VI 3.5 servers' clocks as part of nightly maintenance.

Update 2 was released on July 28th, and I can't begin to guess how many organizations this has affected. Update 2 brought many important and popular new features, like Windows Server 2008 support, enhanced VMotion compatibility, live cloning of virtual machines, hot extension of virtual disks, 10GbE NFS and iSCSI support, and much more. All these features may have led admins to deploy the update a little earlier than normal (I'm speaking to the wait-and-see folks like myself).

The people who truly use all the functionality of virtualization are probably the ones losing the most here. Imagine having your systems dynamically reallocate virtualized hardware resources based on the time of day or day of the week. Maybe the weekends are spent running batch jobs on 80% of your servers, while during the week it's just 20%. In that case, the weekend batch VMs fired up Friday night and shut down early Monday morning, just like normal. But with the licensing bug, the VMs needed to get through the week (on a full 60% of the servers!) can't be powered back on.

VMware really messed up here, but they don't need me to tell them that. I'm pretty sure the ringers in their phones wore out a few hours ago. It will be interesting to see how this pans out and how people using Update 2 in production were affected. It's sure to be a humbling experience for VMware, and how they handle it should speak volumes about the state of the company and its sense of responsibility to its customers.

 



Comments

Guest wrote Huge issue
on 08-12-2008 11:05 AM
Many of our customers were hit by this today. It is a huge issue, and one that is not going away quickly enough. The problem is that once VMware releases the hotfix, you can't VMotion the VMs, since the hotfix hasn't been applied yet. One possible workaround is to have a swing server: patch that server, VMotion the VMs over to it, then go from there. Still, you are talking about an outage. VMware royally screwed up on this one.
Guest wrote Conspiracy theory...
on 08-12-2008 12:19 PM

Where did the guy who's running VMware use to work?... I'm only joking... VMware's rivals would never do something like that... would they? :)

BTW, the bug also exists in the free standalone ESX product they have just released... shoot yourself in the foot, why don't you.

Guest wrote August 12th, Diane Greene's Birthday ?? maybe?
on 08-12-2008 12:45 PM

 

Guest wrote To Make things worse....
on 08-12-2008 12:47 PM
Today is Patch Tuesday, when many Windows machines get patched and rebooted... huh... wonder how that's going to go...
Guest wrote Testing Results
on 08-12-2008 1:17 PM
Guess that will further delay the testing results we are waiting on in the Qumranet, Citrix, VMware "not bakeoff" :-)
Guest wrote Enterprise Virtualisation?
on 08-12-2008 2:27 PM

I think it is the bug of the year. There is no information from VMware because the site is down (is it running on 3.5U2? :-)). If you call support, they tell you they will fix it in 36 hours (or so).

If this is VMware's statement of Enterprise Virtualisation, I will look more at XenServer!

Gabe Knuth wrote Re: To Make things worse....
on 08-12-2008 2:51 PM
I came across that very thing when I was looking into the issue.  It appears that rebooting a VM is ok, it's just the act of actually powering on a machine that triggers it.
Guest wrote Re: Enterprise Virtualisation?
on 08-12-2008 3:45 PM

Wondered how long it would take for the first Citrix fanboy to raise their head. If you think Citrix is gonna be such a huge improvement, I suggest you haven't suffered the pain of Presentation Server / Access Gateway / AAC over the past few years.

Guest wrote Re: Enterprise Virtualisation?
on 08-12-2008 3:46 PM

To be fair, I don't remember anything of this magnitude happening to VMware before, and I still feel this product is pretty solid. You can bet at least this won't happen again. Let's not be too critical and start tossing stones, unless you can tell me about another time a foul-up like this has happened.

 ~Gabriel Medrano

Shanetech wrote Re: Enterprise Virtualisation?
on 08-12-2008 4:15 PM

Agreed. We had our VMware rep call us first thing this morning and let us know about this issue so that we could get the information out to our global customer base. Are we happy? Of course not. However, it is software; like someone else said, it's not like Citrix hasn't had something like this happen, or even Microsoft. Geez, do any of you guys remember NT 4 Service Pack hell?

Guest wrote Zero-tolerance policy with VMware ?
on 08-12-2008 9:56 PM

Citrix has had problems with hotfixes, and constantly retires hotfixes when they have problems. Not sure why you guys tend to be so critical of VMware and so kind to Citrix, which is a newcomer to the virtualization world.

MS Exchange 5.5 SP3 messed up the Exchange database. This is the only time we've heard of something like this from VMware.

Also, how many people are imprudent enough to patch ALL hosts at the same time without going through a validation process first?

Why this "zero-tolerance" policy with VMware?
Jason Boche wrote Re: Enterprise Virtualisation?
on 08-12-2008 10:17 PM
Truer words were never spoken.
Jason Boche wrote Re: Zero-tolerance policy with VMware ?
on 08-12-2008 10:18 PM
Also very good points. Nobody from the Citrix world should be thumping their chests about their track record of QA and software updates.
Jason Conomos wrote Perception
on 08-12-2008 11:09 PM

I think some of the issue here is that it is not a Microsoft patch that takes out a server or a few, nor a Citrix patch that screwed up a server, but a virtualisation platform that takes out whole groups of servers, in some instances with no feasible workaround.

So why are people taking this zero-tolerance approach? Well, quite simple. It is because people have realised that even if you implement HA and DR with a virtualisation platform, you still have a single point of failure (SPOF), which is... your virtualisation platform vendor.

So the question now is, what to do about it? Should companies now consider running a blend of hypervisors so that if this occurs again (and it probably will, let's be honest) they could somehow perform a manual failover and conversion to another virtualisation platform?

Food for thought

Guest wrote Re: Perception
on 08-13-2008 12:26 AM

The answer to this is to have a good BC/DR plan. You cannot account for everything, so you prepare for every possibility. I think having a blend of hypervisors is totally the wrong approach; you need a good testing methodology that will sift out issues like this.

 

~Gabe M.

Guest wrote What happened to having a SEPARATE small test farm to test updates before applying?
on 08-13-2008 4:22 AM

 

Guest wrote Re: Zero-tolerance policy with VMware ?
on 08-13-2008 8:23 AM

Because they claimed to be the only enterprise-ready virtualization platform. They bragged about their stability based on the number of years they have been around and the number of systems they have been deployed on. And they charge twice as much as the nearest competitor. Exchange was hardly the most expensive e-mail solution at the time of SP3; it was not cheap, but it was not the most expensive. When you pay good money for something, you do so because you expect it to be a better product. How many of Citrix's buggy hotfixes shut down entire systems or crippled all of their advanced (pricey) functionality?

Guest wrote Re: Perception
on 08-13-2008 8:30 AM
So every customer should have been expected to test for something nobody could have expected? They don't have access to the code, so they could not have known there was going to be a time-based kill switch in production code they had a license for. It would be like buying Windows Server 2003 for your system, running it for months, and one day it won't turn on because a date has passed. No admin would ever have thought of testing for that. Now, if it was one of those 180-day trial copies, then they would know that the day would come when their system would cease to function. Putting the blame on the customer for lack of testing is pure bull@hit. VMware shoulders all the blame for not catching this bug. Surely someone at the company intentionally placed this code in the product, but then everyone forgot about it before they released it to the entire world, and not as a beta, but as production-ready. Yeah, if everyone had waited a month before deploying it to all their systems, it would not have impacted anyone. Then again, if nobody had deployed it, who would have uncovered the problem? And what if it had taken effect on September 12th, or October 12th?
Guest wrote Re: What happened to having a SEPARATE small test farm to test updates before applying?
on 08-13-2008 8:38 AM
Don't be retarded. We are not talking about a buggy NIC driver, system instability under high load, or an advanced feature that screwed up a vdisk. If I had a test farm, put this code on it for a week, and hammered away at it, I would not have seen this issue if I had completed my testing before August 11th. So no testing plan or test environment would have changed the results. VMware needs to make sure they include an uninstall feature in their ESX software. If I install an upgrade and it totally tanks my environment, let me uninstall it and roll back to the previous version. That ability would have saved everybody in this situation, except for those who deployed this new version right off the bat.
Matt Dean wrote Re: What happened to having a SEPARATE small test farm to test updates before applying?
on 08-13-2008 8:41 AM

Yeah, small test environment....

Software is released on July 28th, I install it in my test farm on August 2nd, test for a full week, and then install it into production on August 9th. Then, on August 12th, it stops working... all because I wasn't savvy enough to roll my clocks into the future to test for time-based licensing code errors.

Good call.

 

Fortunately, this didn't happen to me, but it's a perfectly reasonable scenario for someone who was waiting on those new features.

Shanetech wrote Re: What happened to having a SEPARATE small test farm to test updates before applying?
on 08-13-2008 8:48 AM

I would hope anyone in an enterprise environment would allow for more than 30 days of testing before deployment into production; hell, we can't even get out of qualification in 45 days. I think this probably had a bigger impact on those who rolled out Update 2 right out of the gate, and on smaller shops with point-and-click-happy update & defrag engineers.

Guest wrote Wow
on 08-13-2008 8:49 AM

Seriously. Who applies patches immediately without testing? Especially a large update to the core of your infrastructure. Basic administration, folks...

For most engineers, you wait for the user community to test non-critical patches. Thanks, everyone, for keeping me from installing Update 2!