Thursday, August 14, 2008

VM infrastructure and disaster recovery

VMware's ESX/ESXi Update 2 contained what they describe as a "build timeout" which caused patched machines to expire their licenses on August 12th. This meant that VMs could not be powered on or resumed on updated machines, and that VMotion couldn't be used to move VMs to those systems. VMware has released a patch and a letter to their customers notifying them of the issue, and has flagged the issue as an alert in their knowledgebase. The corrected patch requires a reboot of the VMware host, which may force off-cycle maintenance on the systems that were affected.

As more infrastructure moves to a VM environment, we create the potential for greater failures when the VM host systems have issues. In this case, a single patch could prevent DR from occurring if all of your VMware systems were patched and a failure occurred. Workarounds were relatively easy if you knew what was wrong - system date and time could be changed in the short term, or, if necessary, a pre-patch backup could be restored to the system.

How can we best plan to handle issues like this? In many ways, the same processes that system administrators have used for years to test patches will continue to serve us, but we need to have plans in place for what to do when an issue affects all VM hosts at a given patch level. This reminds me of Hoff's talk about VM infrastructure at BlackHat - we're more vulnerable than we think we are with VMs, and this patch issue is a great, relatively low-cost reminder.
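
One small piece of that planning can be automated: checking whether too many of your hosts sit on the same patch level at once. The following is only a rough sketch in Python. The host names, build labels, and the 50% threshold are all made up for illustration, and the inventory is a plain dictionary rather than anything pulled from a real VMware API.

```python
from collections import Counter


def patch_level_risk(hosts, max_share=0.5):
    """Flag patch levels that cover too large a share of the hosts.

    hosts     -- dict mapping host name to a patch/build label
                 (hypothetical inventory data, not a VMware API result)
    max_share -- assumed policy threshold: the largest fraction of hosts
                 allowed to share one patch level

    Returns a sorted list of (patch_level, share) tuples for every patch
    level whose share of hosts is strictly greater than max_share.
    """
    if not hosts:
        return []
    counts = Counter(hosts.values())
    total = len(hosts)
    return sorted(
        (level, count / total)
        for level, count in counts.items()
        if count / total > max_share
    )


if __name__ == "__main__":
    # Hypothetical inventory: three of four hosts already on the new patch.
    inventory = {
        "esx01": "update2",
        "esx02": "update2",
        "esx03": "update2",
        "esx04": "update1",
    }
    for level, share in patch_level_risk(inventory):
        print(f"{level}: {share:.0%} of hosts share this patch level")
```

Keeping at least some hosts one patch level behind until the new level has run for a while means a bad patch can't take out every system you would fail over to.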

So - how are you planning to handle VM infrastructure outages?
