Problem
The cluster appears to be working fine, but every 15 minutes or so the following events are logged on the node that owns the quorum witness disk:
Source: Microsoft-Windows-Ntfs
Event ID: 98
Level: Information
Description: Volume WitnessDisk: (\Device\HarddiskVolumeNN) is healthy. No action is needed.

Event ID: 1558
Source: Microsoft-Windows-FailoverClustering
Level: Warning
Description: The cluster service detected a problem with the witness resource. The witness resource will be failed over to another node within the cluster in an attempt to reestablish access to cluster configuration data.

Log Name: System
Event ID: 1069
Level: Error
Description: Cluster resource 'Witness' of type 'Physical Disk' in clustered role 'Cluster Group' failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
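To see how often these events recur, the System log can be queried from PowerShell. This is a minimal sketch, assuming it is run on the node that currently owns the witness disk. The 24-hour window is an arbitrary choice, and the event IDs are simply the ones quoted above.

```powershell
# Event IDs quoted above: 98 (Ntfs), 1558 and 1069 (FailoverClustering).
$filter = @{
    LogName   = 'System'
    Id        = 98, 1558, 1069
    StartTime = (Get-Date).AddHours(-24)   # arbitrary look-back window
}

# Get-WinEvent raises an error when nothing matches; SilentlyContinue hides it.
Get-WinEvent -FilterHashtable $filter -ErrorAction SilentlyContinue |
    Sort-Object TimeCreated |
    Format-Table TimeCreated, Id, ProviderName, LevelDisplayName -AutoSize
```

The ProviderName column helps separate the witness-related entries from unrelated events that happen to share an ID.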
Analysis
Some digging in the event log identified a disk error incident during a failover of the virtual machine:
Log Name: System
Event ID: 1557
Level: Error
Description: Cluster service failed to update the cluster configuration data on the witness resource. Please ensure that the witness resource is online and accessible.

Log Name: System
Source: Microsoft-Windows-Ntfs
Event ID: 140
Description: The system failed to flush data to the transaction log. Corruption may occur in VolumeId: WitnessDisk:, DeviceName: \Device\HarddiskVolumeNN. ({Device Busy} The device is currently busy.)

And ultimately:

Log Name: System
Source: Ntfs
Level: Warning
Description: {Delayed Write Failed} Windows was unable to save all the data for the file . The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.
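To locate the original incident rather than the recurring symptoms, one option is to list the oldest occurrences of events 1557 and 140 and compare their timestamps with known failovers. A sketch, assuming the System log still reaches back far enough; the limit of five entries is arbitrary.

```powershell
# Oldest occurrences of the witness update failure (1557) and the NTFS flush failure (140).
$incident = Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = 1557, 140 } -ErrorAction SilentlyContinue |
    Sort-Object TimeCreated |
    Select-Object -First 5

$incident | Format-List TimeCreated, Id, ProviderName, Message
```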
It appears that the witness disk had a non-responsive period during the failover of the VM. This caused an update to the cluster database to fail, leaving the copy of the cluster database on the witness disk corrupt. The disk itself is fine, so there are no faults in the cluster resource status and everything appears hunky-dory. There could be other causes leading to the same situation, but in this case the issue correlates with a VM failover.
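The resource status can also be checked from PowerShell. This sketch assumes the FailoverClusters module is installed and that the witness resource is named 'Witness', as in the event 1069 text above. In a situation like this one, the resource would be expected to report as Online between the failures.

```powershell
Import-Module FailoverClusters

# Show the current quorum configuration, including which resource acts as witness.
Get-ClusterQuorum | Format-List *

# Show the state and owner of the witness disk resource
# ('Witness' is the resource name taken from the event text).
Get-ClusterResource -Name 'Witness' |
    Format-List Name, State, OwnerNode, OwnerGroup, ResourceType
```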
We need to replace the defective database with a fresh copy from one of the nodes.
Solution
The usual warning: If this procedure is new to you, seek help before attempting it in production. If your cluster has other issues, messing with the quorum setup may land you in serious trouble. And if you have any doubts whatsoever about the integrity of the drive/LUN, replace it with a new one.
Warnings aside, this procedure is usually safe, and as long as the cluster is otherwise healthy you can do this live without scheduling downtime.
Action plan
- Remove the quorum witness from the cluster.
- Check that the disk is listed as available storage and online.
- Take ownership of the defective “cluster” folder on the root of the quorum witness drive.
- Rename it to “oldCluster” in case we need to extract some data.
- Add the disk back as a quorum witness.
- Wait to check that the error messages do not re-appear.
- If they do re-appear:
  - Order a new LUN.
  - Add it to the cluster.
  - Use the new LUN as a quorum witness.
  - Remove the old LUN from the cluster.
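The action plan above can be sketched in PowerShell roughly as follows. This is an illustration under assumptions, not a tested script. It assumes Windows Server 2012 R2 or later (older versions use -NodeMajority and -NodeAndDiskMajority instead of -NoWitness and -DiskWitness), a witness resource named 'Witness' as in the event text, and a hypothetical drive letter Q: for the witness volume. Run it from an elevated prompt on the node that owns the disk, one step at a time.

```powershell
Import-Module FailoverClusters

$witnessName = 'Witness'      # resource name from the event 1069 text
$clusterDir  = 'Q:\Cluster'   # hypothetical drive letter; adjust to your witness volume

# 1. Remove the quorum witness from the cluster.
Set-ClusterQuorum -NoWitness

# 2. Check that the disk is now in Available Storage and online.
Get-ClusterResource -Name $witnessName | Format-List Name, State, OwnerGroup, OwnerNode
# If it is offline, bring it online before continuing:
# Start-ClusterResource -Name $witnessName

# 3. Take ownership of the defective folder, including its contents.
takeown.exe /F $clusterDir /R /D Y
# Granting the local Administrators group (SID S-1-5-32-544) full control
# may also be needed before the rename succeeds.
icacls.exe $clusterDir /grant '*S-1-5-32-544:(OI)(CI)F' /T

# 4. Rename the folder in case some data needs to be extracted later.
Rename-Item -LiteralPath $clusterDir -NewName 'oldCluster'

# 5. Add the disk back as the quorum witness.
Set-ClusterQuorum -DiskWitness $witnessName

# 6. Confirm the configuration, then watch the System log for the events to return.
Get-ClusterQuorum | Format-List *

# Fallback if the errors re-appear and a new LUN has been presented to all nodes.
# 'Cluster Disk N' is a placeholder for the name the cluster assigns to the new disk.
# Get-ClusterAvailableDisk | Add-ClusterDisk
# Set-ClusterQuorum -DiskWitness 'Cluster Disk N'
# Remove-ClusterResource -Name $witnessName
```

The fallback lines are left commented out because they only make sense once the new LUN exists and is visible to every node.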