This is just a quick note to remember the solution and EventIDs.
The VM fails to start complaining about failed resources or resource not available in Failover Cluster manager. Analysis of the event log reveals messages related to VirtualFC:
- EventID 32110 from Hyper-V-SynthFC: ‘VMName’: NPIV virtual port operation on virtual port (WWN) failed with an error: The world wide port name already exists on the fabric. (Virtual machine ID ID)
- EventID 32265 from Hyper-V-SynthFC: ‘VMName’: Virtual port (WWN) creation failed with a NPIV error(Virtual machine ID ID).
- EventID 32100 from Hyper-V-VMMS: ‘VMNAME’: NPIV virtual port operation on virtual port (WWN) failed with an unknown error. (Virtual machine ID ID)
- EventID 1205 from Microsoft-Windows-FailoverClustering: The Cluster service failed to bring clustered role ‘SCVMM VM Name Resources’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.
The events point in the direction of Virtual Fibre Channel or Fibre Channel issues. After a while we realised that one of the nodes in the cluster did not release the WWN when a VM migrated away from it. Further analysis revealed that the FC driver versions were different.
- Make sure all cluster nodes are running the exact same driver and firmware for the SAN and network adapters. This is crucial for failovers to function smoothly.
- To “release” the stuck WWNs you have to reboot the offending node. To figure out which node is holding the WWN you have to consult the FC Switch logs. Or you could just do a rolling restart and restart all nodes until it starts working.
- I have successfully worked around the problem by removing and re-adding the virtual FC adapters n the VM that is not working. I do not know why this resolved the problem.
- Another workaround would be to change the WWN on the virtual FC adapters. You would of course have to make this change at the SAN side as well.