Wednesday, February 17, 2010

Server 2008 R2 Hyper-V and Intel Xeon 5500 Series Processor Problem

So I have been helping our internal IT team implement Hyper-V to consolidate our server infrastructure. We implemented a two-node Server 2008 R2 Hyper-V cluster on Dell R710 servers. Everything was going pretty well until this past Sunday. Our IT Manager converted the second of two SQL Servers to a VM using the SCVMM 2008 R2 P2V process. The conversion went pretty well. On Monday, all our consultants are required to submit their timesheets for the previous week through a custom application developed and hosted on SharePoint. This puts a bit of a load on the front-end and SQL back-end servers in the morning, as everyone is trying to submit their timesheets and our internal workers are pulling reports and getting invoices ready to be sent out (need to get paid for the great work we all do!).
Well, this Monday became challenging, as now the entire SharePoint environment, plus Exchange and several other servers, are hosted VMs on the cluster. The challenge came from the fact that our Hyper-V host servers began to crash and reboot, failing the cluster and all VMs running on them. Initially it appeared to be a network problem, so we decided to move the last SQL box we converted to a dedicated vSwitch and physical NIC (we have a three-NIC team set up to a single vSwitch for all of the VMs on both hosts). This was done thinking that the network errors we were seeing on the hosts were caused by the newly added SQL server. This did not solve the problem. We also saw some storage issues and reviewed everything to ensure that it was sound (fiber-attached storage on the hosts; most VMs' OS drives are provided by a couple of Cluster Shared Volumes (CSV) through the fiber connections, with their data drives using pass-through disks).
So with the problem still being sporadic, we looked to VMM, removed the cluster from VMM, and rebooted the hosts separately to check things out. This seemed to help for a bit (Tuesday, no issues). Today we had crashes again. Our IT Manager walked through the crash dumps with a fine-tooth comb and found the culprit: a 0x00000101 - CLOCK_WATCHDOG_TIMEOUT stop error. It didn't take him long to find this KB article documenting the problem: http://support.microsoft.com/kb/975530
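If you want to check quickly whether your own hosts are hitting the same stop code, here is a minimal Python sketch that scans exported event log text for bugcheck codes. The function name, the sample message, and the "bugcheck was:" wording it searches for are assumptions for illustration (they are not taken from our actual logs), so adjust the pattern to whatever your logs really contain.

```python
import re

# Stop code and name as reported in the crash dumps described in this post.
KNOWN_STOP_CODES = {0x101: "CLOCK_WATCHDOG_TIMEOUT"}

# Assumed message wording: "... The bugcheck was: 0x00000101 (...)".
BUGCHECK_RE = re.compile(r"bugcheck was:\s*(0x[0-9a-fA-F]+)", re.IGNORECASE)


def find_stop_codes(log_text):
    """Return a list of (code, name) pairs for every bugcheck code found."""
    results = []
    for match in BUGCHECK_RE.finditer(log_text):
        code = int(match.group(1), 16)
        results.append((code, KNOWN_STOP_CODES.get(code, "UNKNOWN")))
    return results


# Hypothetical sample line, made up for this example.
sample = (
    "The computer has rebooted from a bugcheck.  "
    "The bugcheck was: 0x00000101 (0x19, 0x0, 0x0, 0x2)."
)

for code, name in find_stop_codes(sample):
    print(f"0x{code:08X} {name}")
```

Any line reported as 0x00000101 would be worth comparing against the KB article; anything marked UNKNOWN is simply a code this sketch does not know about.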
There is a known issue with Intel Xeon 5500 series (Nehalem) processors. The problem happens sporadically, so initial identification is a little tough. There is a hotfix for this, as well as some workarounds documented in the KB article.
This problem occurs only with the new Nehalem processors and Server 2008 R2 with the Hyper-V role installed. Intel documents the problem as well: http://www.intel.com/assets/pdf/specupdate/321324.pdf

Hope this information helps. Would be very interested to see who else has experienced this problem.
