Virtual Machine Fails to Respond to Network Traffic

The following issue occurred in an environment running a Windows Server 2012 failover cluster. The cluster had three nodes, each running four VMs, and had been stable for months.

Suddenly, one afternoon, the virtual machines on one of the nodes stopped responding to network traffic. In Failover Cluster Manager, they showed a status of “Running – locked”. Then the VMs on that node began shutting down, and they did not fail over to another node. The Cluster Shared Volume could not be browsed in Windows Explorer from the affected node, even though the iSCSI Initiator showed that the server’s connection to the SAN was up.

The Failover Cluster Manager showed the following critical errors: “Cluster Shared Volume ‘Volume1’ (‘name’) is no longer available on this node because of ‘STATUS_IO_TIMEOUT(c00000b5)’. All I/O will temporarily be queued until a path to the volume is reestablished.”

Earlier that afternoon, I had created a new virtual machine on that node and then started backing it up with Microsoft DPM 2012. The failure occurred while that backup was running. After a few minutes of research, I found that this action, backing up a clustered virtual machine, had caused the problem. I stopped the DPM backup (which was hung anyway), deleted the job, and restarted the affected cluster node. The VMs ran normally after that.

According to Microsoft, the problem occurs in the following scenario:

  • You enable the Cluster Shared Volumes (CSV) feature on a Windows Server 2012-based failover cluster.
  • You create a virtual machine on a CSV volume on a cluster node.
  • You start the virtual machine.
  • You try to create a backup of the virtual machine on the CSV volume by using Microsoft System Center Data Protection Manager (DPM).

In this scenario, one of the following issues occurs:

  • The backup is created, and the virtual machine enters a paused state.
  • The CSV volume goes offline. Therefore, the virtual machine goes offline, and the backup is not created.

Errors: Software snapshot creation on Cluster Shared Volume(s) (‘volume location’) with snapshot set id ‘snapshot id’ failed with error ‘HrError(0x80042308)(2147754760)’. Please check the state of the CSV resources and the system events of the resource owner nodes.
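
The two numbers inside HrError(...) are the same error code, written once in hexadecimal and once as an unsigned decimal. Knowing this helps when searching logs or documentation that show only one form. The following is a minimal Python sketch of the conversion; the function name hresult_forms is made up for illustration and is not part of any Microsoft tool.

```python
def hresult_forms(hex_code: str) -> dict:
    """Return the hex, unsigned decimal, and signed decimal forms of a 32-bit HRESULT."""
    unsigned = int(hex_code, 16)  # int() accepts an optional "0x" prefix with base 16
    if not 0 <= unsigned <= 0xFFFFFFFF:
        raise ValueError("an HRESULT must fit in 32 bits")
    # Some tools print HRESULTs as signed 32-bit integers, so compute that form too.
    signed = unsigned - 0x100000000 if unsigned & 0x80000000 else unsigned
    return {"hex": f"0x{unsigned:08X}", "unsigned": unsigned, "signed": signed}


forms = hresult_forms("0x80042308")
print(forms["hex"], forms["unsigned"], forms["signed"])
```

The unsigned value is the 2147754760 shown in the error text. Because the high bit is set, the same code appears as a negative number in tools that print it as a signed integer.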

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: Date and time
Event ID: 5120
Task Category: Cluster Shared Volume
Level: Error
Keywords:
User: SYSTEM
Computer: Computer name

Description: Cluster Shared Volume ‘Volume1’ (‘name’) is no longer available on this node because of ‘STATUS_IO_TIMEOUT(c00000b5)’. All I/O will temporarily be queued until a path to the volume is reestablished.
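
To spot this event in exported log text, you can split the record into fields and check the source, event ID, and status name. The following is a rough Python sketch that assumes the event has been copied out as plain "Field: value" lines like the record above. The sample text and the helper names (parse_event, is_csv_io_timeout, ntstatus_code) are illustrative assumptions, not an official log format or API.

```python
import re

# Sample record, abbreviated from the event shown above (straight quotes for simplicity).
SAMPLE_EVENT = """\
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Event ID: 5120
Level: Error
Description: Cluster Shared Volume 'Volume1' ('name') is no longer available on this node because of 'STATUS_IO_TIMEOUT(c00000b5)'. All I/O will temporarily be queued until a path to the volume is reestablished.
"""


def parse_event(text: str) -> dict:
    """Split 'Field: value' lines into a dict, splitting only on the first colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_csv_io_timeout(fields: dict) -> bool:
    """True for event 5120 from FailoverClustering that reports STATUS_IO_TIMEOUT."""
    return (
        fields.get("Source") == "Microsoft-Windows-FailoverClustering"
        and fields.get("Event ID") == "5120"
        and "STATUS_IO_TIMEOUT" in fields.get("Description", "")
    )


def ntstatus_code(description: str):
    """Extract the 8-digit hex status code from 'STATUS_NAME(xxxxxxxx)', or None."""
    match = re.search(r"STATUS_\w+\(([0-9a-fA-F]{8})\)", description)
    return int(match.group(1), 16) if match else None


fields = parse_event(SAMPLE_EVENT)
status = ntstatus_code(fields["Description"])
print(is_csv_io_timeout(fields), hex(status), "severity bits:", status >> 30)
```

The top two bits of an NTSTATUS value encode severity, and a value of 3 means error, which matches the "Level: Error" field of the event.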

Cause: The virtual machine enters a paused state because the Ntfs.sys driver incorrectly reports the available space on the CSV volume when the backup software tries to create a snapshot of the CSV volume. Additionally, the CSV volume goes offline because it does not resume from a paused state after an I/O delay issue or error occurs.

Resolution: Install the hotfix described here. Please read through the hotfix information carefully, and consult Microsoft Support if you have any issues or questions.
