Script to migrate VMs back to their preferred node


You have a set of VMs that are not running on their preferred node, due to maintenance or some kind of outage triggering an unscheduled migration. You have set one (and just one) preferred host for each of your VMs. You have done this because you want to balance your VMs manually to guarantee a certain minimum of performance for each VM. Automatic load balancing cannot do that, by the way: there will always be a lag between a usage spike and the load balancing that follows it. But I digress. The point is, the VMs are not running where they should, and you have not enabled automatic failback because you are afraid of node flapping or other inconveniences that could create problems. Hopefully though, you have some kind of monitoring in place to tell you that the VMs are restless and that you need to fix the problem and subsequently corral them into their designated hosts. Oh, and you are using Virtual Machine Manager. You could do this at the individual cluster level as well, but that would be another script for another day.

If you understand the scenario above and self-identify with at least parts of it, this script is for you. If not, this script could cause all kinds of trouble.

Script Notes

  • You can skip “Connect to VMM Server” and “Add VMM Cmdlets” if you are running this script from the SCVMM PowerShell window.
  • The MoveVMs variable can be set to $false to get a list instead of migrating anything. This could be a smart choice for your first run.
  • The script ignores VMs that are not clustered and VMs that do not have a preferred server set.
  • I do not know what will happen if you have more than one preferred server set.

The script

#   _____     __     ______     ______     __  __     ______     ______     _____     ______     ______     ______    #
#  /\  __-.  /\ \   /\___  \   /\___  \   /\ \_\ \   /\  == \   /\  __ \   /\  __-.  /\  ___\   /\  ___\   /\  == \   #
#  \ \ \/\ \ \ \ \  \/_/  /__  \/_/  /__  \ \____ \  \ \  __<   \ \  __ \  \ \ \/\ \ \ \ \__ \  \ \  __\   \ \  __<   #
#   \ \____-  \ \_\   /\_____\   /\_____\  \/\_____\  \ \_____\  \ \_\ \_\  \ \____-  \ \_____\  \ \_____\  \ \_\ \_\ #
#    \/____/   \/_/   \/_____/   \/_____/   \/_____/   \/_____/   \/_/\/_/   \/____/   \/_____/   \/_____/   \/_/ /_/ #
#                                                                                                                     #
#                                                                                            #
#                                          -----=== Elevation required ===----                                        #
# Purpose: List VMs that are not running at their preferred host, and migrate them to the correct host.               #
#                                                                                                                     #
# Notes:                                                                                                              #
# There is an option to disable VM migration. If migration is disabled, a list is returned of VMs that are running at #
# the wrong host.                                                                                                     #
#                                                                                                                     #
$CaptureTime = (Get-Date -Format "yyyy-MM-dd HH:mm:ss")
Write-Host "-----$CaptureTime-----`n"
# Add the VMM cmdlets to the PowerShell session
Import-Module -Name "virtualmachinemanager"
# Connect to the VMM server
Get-VMMServer -ComputerName VMM.Server.Name | Select-Object Name
$HostGroup = "All Hosts\SQLMGMT\*" #End this with a star. You can go down to an individual VM. All Hosts\Hostgroup\VM.
$MoveVMs = $true #If you set this to $true, we will try to migrate VMs to their preferred host.
#List VMs in the host group
$VMs = Get-SCVirtualMachine | where { $_.IsHighlyAvailable -eq $true -and $_.ClusterPreferredOwner -ne $null -and $_.HostGroupPath -like $HostGroup }
# Process
Foreach ($VM in $VMs) {
    # Get the Preferred Owner and the Current Owner
    $Preferred = Get-SCVirtualMachine $VM.Name | Select-Object -ExpandProperty ClusterPreferredOwner
    $Current = $VM.HostName
    # List discrepancies
    If ($Preferred -ne $Current) {
        Write-Host "VM $VM should be running at $Preferred but is running at $Current." -ForegroundColor Yellow
        If ($MoveVMs -eq $true) {
            $NewHost = Get-SCVMHost -ComputerName $Preferred.Name
            Write-Host "We are trying to move $VM from $Current to $NewHost." -ForegroundColor Green
            Move-SCVirtualMachine -VM $VM -VMHost $NewHost | Select-Object ComputerNameString, HostName
        }
    }
}

Hyper-V VM with VirtualFC fails to start


This is just a quick note to remember the solution and EventIDs.

The VM fails to start complaining about failed resources or resource not available in Failover Cluster manager. Analysis of the event log reveals messages related to VirtualFC:

  • EventID 32110 from Hyper-V-SynthFC: ‘VMName’: NPIV virtual port operation on virtual port (WWN) failed with an error: The world wide port name already exists on the fabric. (Virtual machine ID ID)
  • EventID 32265 from Hyper-V-SynthFC: ‘VMName’: Virtual port (WWN) creation failed with a NPIV error(Virtual machine ID ID).
  • EventID 32100 from Hyper-V-VMMS: ‘VMNAME’: NPIV virtual port operation on virtual port (WWN) failed with an unknown error. (Virtual machine ID ID)
  • EventID 1205 from Microsoft-Windows-FailoverClustering: The Cluster service failed to bring clustered role ‘SCVMM VM Name Resources’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.


The events point in the direction of Virtual Fibre Channel or Fibre Channel issues. After a while we realised that one of the nodes in the cluster did not release the WWN when a VM migrated away from it. Further analysis revealed that the FC driver versions were different.




  • Make sure all cluster nodes are running the exact same driver and firmware for the SAN and network adapters. This is crucial for failovers to function smoothly.
  • To “release” the stuck WWNs you have to reboot the offending node. To figure out which node is holding the WWN you have to consult the FC Switch logs. Or you could just do a rolling restart and restart all nodes until it starts working.
  • I have successfully worked around the problem by removing and re-adding the virtual FC adapters in the VM that is not working. I do not know why this resolved the problem.
  • Another workaround would be to change the WWN on the virtual FC adapters. You would of course have to make this change at the SAN side as well.
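The last two workarounds can be scripted with the standard Hyper-V cmdlets. The sketch below assumes a VM named VM01 and a virtual SAN named FabricA; adjust both to your environment, power the VM off first, and remember to update zoning and masking on the SAN side if the WWNs change.

```powershell
# Run on the Hyper-V host that owns the VM. The VM must be powered off.
# List the virtual FC adapters and their current WWNs
Get-VMFibreChannelHba -VMName "VM01" |
    Format-Table SanName, WorldWidePortNameSetA, WorldWidePortNameSetB

# Workaround 1: remove and re-add the virtual FC adapter
Get-VMFibreChannelHba -VMName "VM01" | Remove-VMFibreChannelHba
Add-VMFibreChannelHba -VMName "VM01" -SanName "FabricA"

# Workaround 2: keep the adapter but generate a new set of WWNs
Get-VMFibreChannelHba -VMName "VM01" | Set-VMFibreChannelHba -GenerateWwn
```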

New CSV Volume does not work, Event ID 5120


A newly converted Cluster Shared Volume refuses to come online. Cluster Validation passed with flying colours pre-conversion. Looking in the event log you find this:

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 05.06.2016 15:01:31
Event ID: 5120
Task Category: Cluster Shared Volume
Level: Error
Computer: HyperVHostname
Cluster Shared Volume ‘Volume1’ (‘VMStore1’) has entered a paused state because of ‘(c00000be)’. All I/O will temporarily be queued until a path to the volume is reestablished.

The event is repeated on all nodes.


The crafty SAN admins have probably enabled some kind of fancy SAN mirroring on your LUN. If you check, you will probably find twice the usual number of storage paths. A typical SAN has 4 connections per LUN, and thus you may see 8 paths. Be aware that your results may vary; the point is that you now have more than usual. The problem is that you cannot use all of the paths simultaneously. Half of them are for the SAN mirror, and your LUNs are offline at the mirror location. If a failover is triggered at the SAN side, your primary paths go down and your secondary paths come alive. Your poor server knows nothing about this though; it is only able to register that some of the paths do not work even if they claim to be operative. This confuses Failover Clustering. And if there is one thing Failover Clustering does not like, it is getting confused. As a result the CSV volume is put in a paused state while it waits for the confusion to disappear.


You have to give MPIO permission to verify the claims made by the SAN as to whether or not a path is active. Run the following PowerShell command on all cluster nodes. Be aware that this is a system-wide setting and is activated for all MPIO connections that use the Microsoft DSM.

Set-MPIOSetting -NewPathVerificationState Enabled

Then reboot the nodes and all should be well in the realm again.
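To check the state before and after the change (and to count your paths), something like the following should work; mpclaim.exe ships with the built-in MPIO feature:

```powershell
# Show the current MPIO settings, including PathVerificationState
Get-MPIOSetting

# List MPIO disks and their path counts
mpclaim.exe -s -d
```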

Windows Server 2012R2 stuck at “Updating your system”

Update 2017.01.24: Several people have reported that adding a blank pending.xml file helps, so I have added it to the list.


After installing updates through Microsoft Update, the reboot never completes. You can wait for several days, but nothing happens, the process is stuck at X%:



I will try to give a somewhat chronological approach to get your server running again. I do experience this issue from time to time, but thankfully it is pretty rare. That makes it a bit harder to troubleshoot though.

Warning: this post contains last-ditch attempts and other dangerous stuff that could destroy your server. Use at your own risk. If you do not understand how to perform the procedures listed below, you should not attempt them on your own. Especially not in production.

First you wait, then you wait some more

Some updates may take a very long time to complete. More so if the server is an underpowered VM. Thus, it is worth letting it roll overnight just in case it is really slow. Another trick is to send a Ctrl+Alt+Del to the server. Sometimes that will cancel the stuck update, allowing the boot sequence to continue.

Then you poke around in the hardware

Hardware errors can cause all kinds of issues during the update process. If you are experiencing this issue on a physical server, check any relevant ILO/IDRAC/IMM/BMC logs, and visit the server to check for warning lights. A quick memory test would also be good, as memory failures are one of the most prevalent physical causes of such problems.

If that does not help, blame the software

It was Windows that got us into this mess in the first place, so surely now is the time to point the finger of blame at the software side?

Try booting into Safe Mode. If you are lucky the updates will finish installing in safe mode, and all you have to do is reboot. If you are unlucky, there are a few ways to make Windows try to roll back the updates:

  • Delete C:\Windows\WinSxS\pending.xml
  • Create a blank pending.xml file in C:\Windows\WinSxS
  • Run DISM /image:C:\ /cleanup-image /revertpendingactions
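From an elevated safe-mode prompt, the options above could look like this (paths assume a standard C:\Windows install). Note that pending.xml is protected by TrustedInstaller, so you may have to take ownership of it first with takeown and icacls:

```powershell
# Option 1: delete the pending actions file
Remove-Item C:\Windows\WinSxS\pending.xml -Force

# Option 2: replace it with a blank file instead
New-Item C:\Windows\WinSxS\pending.xml -ItemType File -Force

# Option 3: let DISM revert the pending actions
dism /image:C:\ /cleanup-image /revertpendingactions
```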


Then reboot. If a boot is successful, see if installing the updates in batches works better. Or just do not patch. Ever. Until you are hacked or something breaks. Just kidding, patching is a necessary evil.

Up a certain creek without a paddle…

If you are unable to enter Safe Mode, chances are the OS is pooched. I have experienced this once on Win2012R2. No matter what I did, the system refused to boot. From what I could tell, a pending change was waiting for a roll-back that required a reboot, and was thus unable to complete the cycle, ergo preventing the server from booting before it had rebooted. If that sounds crazy, well, it is. Time to re-image and restore from backup. The No. 1 suspect in my case was KB3000850, which is a composite “Roll-Up” containing lots of other updates. This may cause conflicts when Windows Update tries to install the same update twice in the same run, first as a part of the Roll-Up, and then as a stand-alone update. This is supposed to work, but it doesn’t always.

You could try the rollback methods listed above in the recovery console. If that does not work, try running sfc /scannow /offbootdir=c:\ /offwindir=c:\windows from the recovery console. Maybe you will get lucky, but most likely you won’t…



On a side note, KB3000850 has been a general irksome pain in the butt. It is best installed from an offline .exe by itself in a dark room at midnight on the day before a full moon while you walk around the console in counter-clockwise circles dressed in a Techno-Mage outfit chanting “Who do you serve, and who do you trust?“.

Slow startup on VMs with Virtual Fibre Channel


After you enable Virtual Fibre Channel on a Hyper-V VM it takes forever to start up. In this instance, forever equals about two minutes. The VM is stuck in the Starting… state for most of this period.


During startup, the Hypervisor waits for LUNs to appear on the virtual HBA before the VM is allowed to boot the OS. When there are no LUNs defined for a VM, i.e. when you are deploying a new VM, the Hypervisor patiently waits for 90 seconds before it gives up. Thus startup of the VM is delayed by 90 seconds if there are no LUNs presented to the VM, or if the SAN is down or misconfigured. Event ID 32213 from Hyper-V-SynthFC is logged:



Depending on your specific cause, one of the following should do the trick:

  • Present some LUNs
  • Remove the Virtual HBA adapters if they are not in use
  • Correct the SAN config to make sure the VM is able to talk to the SAN

The search for the missing armor plate

One dark and stormy evening, The Knights of HyperV had trouble getting a new host in contact with The Wasteland of Nexus. One of the armor plates to which the connections were attached was not responding. Actually, it appeared to be missing entirely. Something which was odd, as it had been installed by a trusted minion just days earlier. The Knights sent an expedition to the gates of Hell (otherwise known as Dell CMC) to investigate. The envoys tried to open the gates, but instead of open doors, they were greeted with an error message.

After a lengthy discussion with the insane gatekeeper of LDAP-Auth, the envoys were finally granted access into The Ocean of Known Bugs that lay beyond the gates. A boat carried them over to The island of iDRAC. The journey was bumpy, and once ashore the envoys wasted quite some time recovering from seasickness and the general discomfort caused by the putrid smell of the bugs.

Once recovered, they demanded access to the scrolls of system inventory for the server in question. To the horror of the envoys, the inventory only listed one armor plate, instead of the expected two. Luckily, the existing armor plate was made of the stable Intel-alloy as expected, but the second plate was missing. Could it have been stolen? Perhaps by one of the competing service team minions that dwelled in The Cursed Forests of Sharepoint? They would have to venture into the physical realm of the Hypervisors to find out for sure. The Knights tracked down the minion responsible for the armor plate installation and interrogated her for details. She insisted that the plate had to be there still, pleading to avoid another trip down the long and dangerous road to The Physical Realm of the Hypervisors, and suggested that the problem may be a curse. A spell from the book of dark forbidden magic, putting a veil over the labyrinth of UEFI and thus preventing the armor plate from being seen or used by the server.

Could it be? The knights snorted in disbelief, but as they had no other ideas at the time, they traveled to The Wizard of Badgerville and beseeched him to remove the curse if there was any. The wizard demanded an offering of three sausages and some boiled rice (as he was hungry). After devouring the food, the wizard started walking in circles around the remote console, muttering incomprehensible incantations that were somehow transmitted to the host without him ever touching the keyboard. A long time passed as the knights watched the wizard. At first, they watched in awe. But as time went by, awe turned to glances, glances turned to boredom, until finally the knights were all sound asleep. Sometime later, whether hours or days we do not know, they were snatched away from slumber land. This annoyed The Knights, as they were all awoken from pleasant dreams about conquering the realm of VMWare.

“The deed is done!” declared The Wizard and vanished into a puff of smoke. The Knights staggered over to the console and were amazed to find that the server was not only able to see the missing armor plate, but it was already connected to the spirit world and jabbering happily with the domain controller. However, who had cursed the server? And why? Could it be the witch with the wardrobe of broken firmware patches? Or the unholy dark riders of VMWare? We may never find out, but if we are lucky, one day another tale will be told about the adventures of The Knights of HyperV…

Failed to allocate VMQ


Event ID 113 from Hyper-V-VmSwitch is logged each time a VM is started on a host:

Failed to allocate VMQ for NIC [GUID] (Friendly Name: [VM name]) on switch [GUID] (Friendly Name: [Switch name]). Reason – The OID failed. Status = An invalid parameter was passed to a service or function.


This is caused by a bug. Download and install KB3031598 to fix the problem.
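To verify whether the hotfix is installed on a host, and to see which adapters actually have VMQ enabled, something like this should do:

```powershell
# Is the fix installed on this host?
Get-HotFix -Id KB3031598

# Which adapters have VMQ enabled, and how are the queues spread?
Get-NetAdapterVmq | Where-Object Enabled |
    Format-Table Name, BaseVmqProcessor, MaxProcessors, NumberOfReceiveQueues
```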

Cluster Shared Volumes password policy


Failover Cluster validation generates a warning in the Storage section under “Validate CSV Settings”. The error message states:

Failure while setting up to run Cluster Shared Volumes support testing on node [FQDN]: The password does not meet the password policy requirements. Check the minimum password length, password complexity and password history requirements.

No failure audits in the security log, and no significant error messages detected elsewhere.


This error was logged shortly after a change in the password policy for the Windows AD domain the cluster is a member of. The current minimum password length was set to 14 (max) and complexity requirements were enabled:


This is a fairly standard setup, as written security policies usually mandate a password length far exceeding 14 characters for IT staff. Thus, I already knew that the problem was not related to the user initiating the validation, as the length of his/her password already exceeded 14 characters before the enforcement policy change.

Lab tests verified that the problem was related to the Default domain password policy. Setting the policy as above makes the cluster validation fail. The question is why. Further lab tests revealed that the limit is 12 characters. That is, if you set the Minimum length to 12 characters the test will pass with flying colors as long as there are no other problems related to CSV. I still wondered why though. The problem is with the relation between the local and domain security policies of a domain joined computer. To understand this, it helps to be aware of the fact that Failover Cluster Validation creates a local user called CliTest2 on all nodes during the CSV test:


The local user store on a domain joined computer is subject to the same password policies as are defined in the Default Domain GPO. Thus, when the domain policy is changed this will also affect any local accounts on all domain joined computers. As far as I can tell, the Failover Cluster validation process creates the CliTest2 user with a 12 character password. This has few security ramifications, as the user is deleted as soon as the validation process ends.


The solution is relatively simple to describe. You have to create a separate password policy for your failover cluster nodes where Minimum Password Length is set to 12 or less. This requires that you keep your cluster nodes in a separate Organizational Unit from your user and service accounts. That is a good thing to do anyway, but be aware that moving servers from one OU to another may have adverse effects.

You then create and link a GPO to the cluster node OU and set the Minimum Password Length to 12 in the new GPO. That is the only setting that should be defined in this GPO. Then check the Link order for the OU and make sure that your new GPO has a link order 1, or at least a lower link order than the Default Domain policy. Then you just have to run GPUPDATE /Force on all cluster nodes and try the cluster validation again.
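Creating and linking the GPO can be sketched with the Group Policy module as below. The OU path is an example only, and there is no cmdlet for the password length itself; you still set Minimum Password Length under Computer Configuration, Policies, Windows Settings, Security Settings, Account Policies in the GPO editor:

```powershell
Import-Module GroupPolicy

# Create the GPO and link it to the cluster node OU with link order 1
New-GPO -Name "Cluster Node Password Policy"
New-GPLink -Name "Cluster Node Password Policy" `
    -Target "OU=ClusterNodes,OU=Servers,DC=contoso,DC=com" -Order 1

# Then, on each cluster node:
gpupdate /force
```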

If the above description sounds like a foreign language, please ask for help before you try implementing it. Group Policies may be a fickle fiend, and small changes may lead to huge unforeseen consequences.

DNS Operation refused on Cluster Aware Updating resource


On one of my Hyper-V clusters, Event ID 1196 from FailoverClustering is logged in the system log every fifteen minutes. The event lists the name of the resource and the error message “DNS operation refused”. What it is trying to tell me is that the cluster is unable to register a network name resource in DNS due to a DNS 9005 status code. A 9005 status code translates to “Operation refused”. In this case it was a CAU network name resource which is a part of Cluster Aware Updating.


Networks, teaming and heartbeats for clusters


In this guide, a fabric is a separate network infrastructure, be it SAN, WAN or LAN. A network may or may not be connected to a dedicated fabric. Some fabrics have more than one network.

The cluster nodes should be connected to each other over at least two independent networks/fabrics. The more independent the better. Ideally, the networks should share no components at all, but as a minimum they should be connected to separate NICs in the server. Ergo, if you want to use NIC teaming you should have at least 4 physical network ports on at least two separate NICs. The more the merrier, but be aware that as with all other forms of redundancy, higher redundancy equals higher complexity.

If you do not have more than one network port or only one network team, do not add an additional virtual network adapter/vlan for “heartbeat purposes”. The most prevalent network faults today are caused by someone unplugging the wrong cable, deactivating the wrong switch port or other user errors. Having separate vlans over the same physical infrastructure rarely offers any protection from this. You are better off just using the one adapter/team.

Previously, each Windows cluster needed a separate heartbeat network used to detect node failures. From Windows 2008 and newer (and maybe also on 2003) the “heartbeat” traffic is sent over all available networks between the cluster nodes unless we manually block it on specific cluster networks. Thus, we no longer need a separate dedicated heartbeat network, but adding a second network ensures that the cluster will survive failures on the primary network. Some cluster roles such as Hyper-V require multiple networks, so check what the requirements are for your specific implementation.
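You can inspect how the cluster classifies its networks with the failover clustering module. A Role of 3 means cluster and client traffic, 1 means cluster (heartbeat) traffic only, and 0 means the network is blocked for cluster use:

```powershell
# List cluster networks and the traffic allowed on each of them
Get-ClusterNetwork | Format-Table Name, Role, Address, State
```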

Quick takeaway

If you are designing a cluster and need a quick no-nonsense guideline regarding networks, here it comes:

  • If you use shared storage, you need at least 3 separate fabrics
  • If you use local storage, you need at least 2 separate fabrics

All but a few clusters I have been troubleshooting have had serious shortcomings and design failures in the networking department. The top problems:

  • Way too few fabrics
  • Mixing storage and network traffic on the same fabric
  • Mixing internal and external traffic on the same fabric
  • Outdated or faulty NIC firmware and drivers
  • Bad, poorly designed NICs from Qlogic and Emulex
  • Converged networking

Do not set yourself up for failure.


If you haven’t implemented IPv6 yet in your datacenter, you should disable IPv6 on all cluster nodes. If you don’t, you run a high risk of unnecessary failovers due to IPv6-to-IPv4 conversion mishaps on the failover cluster virtual adapter. As long as IPv6 is active on the server, the failover cluster virtual adapter will use IPv6, even if none of the cluster networks have a valid IPv6 address. This causes all heartbeat traffic to be converted to/from IPv4 on the fly, which sometimes will fail. If you want to use IPv6, make sure all cluster nodes and domain controllers have a valid IPv6 address that is not link-local (fe80::), and make sure you have routers, switches and firewalls that support IPv6 and are configured properly. You will also need IPv6 DNS in the Active Directory domain.

Disabling IPv6

Do NOT disable IPv6 on the network adapters. The protocol binding for IPv6 should be enabled:


Instead, use the DisabledComponents registry setting. See Disable IPv6 for details.
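The registry change amounts to the following. A value of 0xFF disables all IPv6 components except the loopback interface, and a reboot is required for it to take effect:

```powershell
# Disable IPv6 components without unbinding the protocol from the adapters
New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters" `
    -Name "DisabledComponents" -PropertyType DWord -Value 0xFF -Force
Restart-Computer
```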


Storage networks

If you use IP-based storage like ISCSI, SMB or FCOE, make sure you do not mix it with other traffic. Dedicated physical adapters should always be used for storage traffic. Moreover, if you are one of the unlucky few using FCOE you should seriously consider converting to FC or SMB3.

Hyper-V networks

In a perfect world, you should have six or more separate networks/fabrics for Hyper-V clusters. Sadly though, the world is seldom perfect. The absolute minimum for production clusters is two networks. Using only one network in production will cause nothing but trouble, so please do not try. Determining whether or not to use teaming also complicates matters further. As a general guide, I would strongly recommend that you always have a dedicated storage fabric with HA, that is teaming or MPIO, unless you use local storage on the cluster nodes. The storage connection is the most important one in any form of cluster. If the storage connection fails, everything else falls apart in seconds. For the other networks, throughput is more important than high availability. If you have to make a choice between HA and separate fabrics, choose separate fabrics for all networks other than the storage network.

7 Physical networks/fabrics

  • Internal/Cluster/CSV (if local)/Heartbeat
  • Public network for VMs
  • VM Host management
  • Live Migration
  • 2*Storage (ISCSI, FC, SMB3)
  • Backup

5 Physical networks/fabrics

  • Internal/Cluster/CSV (if local)/Heartbeat/Live Migration
  • Public network for VMs, VM guest management
  • VM Host management
  • 2*Storage (ISCSI, FC, SMB3)

4 Physical networks/fabrics

  • Internal/Live Migration
  • Public & Management
  • 2*Storage



Most blade server chassis today have a total of six fabric backplanes, grouped in three groups where each group connects to a separate adapter in the blade. Thus, each network adapter or FC HBA is connected to two separate fabrics/backplanes. The groups could be named A, B and C, with the fabrics named A1, A2, B1 and so on. Each group should have identical backplanes, that is, the backplane in A1 should be the same model as the backplane in A2.

If we have Fibrechannel (FC) backplanes in group A, and 10G Ethernet backplanes in group B & C, we have several possible implementations. Group A will always be storage in this example, as FC is a dedicated storage network.


Here, we have teaming implemented on both B and C. Thus, we use the 4 networks configuration from above, splitting our traffic in Internal and Public/Management. This implementation may generate some conflicts during Live Migrations, but in return we get High Availability for all groups.


By splitting group B and C in two single ports, we get 5 fabrics and a more granulated separation of traffic at the cost of High Availability.

Hyper-V trunk adapters/teams on 2012

If you are using Hyper-V virtual switches bound to a physical port or team on your Hyper-V hosts, Hyper-V Extensible Virtual Switch should be the only bound protocol. Note: Do not change these settings manually; Hyper-V Manager will change the settings automatically when you configure the virtual switch. If you bind the Hyper-V Extensible Virtual Switch protocol manually, creation of the virtual switch may fail.


Teaming in Windows 2012

In Windows 2012 we finally got native support for NIC teaming. You access the NIC teaming dialog from Server Manager. You can find a detailed description in Windows Server 2012 NIC Teaming (LBFO) Deployment and Management.

Native teaming support rids us of some of the problems related to unstable vendor teaming drivers, and makes setup of NIC teaming a unified experience no matter what NICs you are using. Note: never use NIC teaming on ISCSI networks. Use MPIO instead.
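Creating a team from PowerShell could look like this; the team and adapter names are examples, and on plain Windows 2012 the available load balancing algorithms are TransportPorts, IPAddresses, MacAddresses and HyperVPort:

```powershell
# Switch-independent team of two adapters, suitable for most cluster networks
New-NetLbfoTeam -Name "Team1" -TeamMembers "NIC1","NIC2" `
    -TeamingMode SwitchIndependent -LoadBalancingAlgorithm TransportPorts
```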

A note on Active/Active teaming

It is possible to use active/active teaming, thus aggregating the bandwidth of two or more adapters to support higher throughput. This is a fantastic technology, especially on 1G Ethernet adapters where bandwidth congestion can become a problem. There is, however, a snag: a lot of professional datacenters have a complete ban on active/active teaming due to years of teaming problems. I have myself been a victim of unstable active/active teams, so I know this to be a real issue. I do think this is less of a problem in Windows 2012 than it was on previous versions, but there may still be configurations that just do not work. The more complex your network infrastructure is, the less likely active/active teaming is to work. Connecting all members in the team to the same switch increases the chance of success. This also makes the team dependent on a single switch of course, but if the alternative is bandwidth congestion or no teaming at all, it does not really matter.

I recommend talking to your local network specialist about teaming before creating a design dependent on active/active teaming.

Using multiple vlans per adapter or team

It has become common practice to use more than one vlan per team, or even more than one vlan per adapter. I do not recommend this for clusters, with the exception of adapters/teams connected to a Hyper-V switch. An especially stupid thing to do is mixing ISCSI traffic with other traffic on the same physical adapter. I have dealt with the aftermath of such a setup, and it does not look pretty unless data corruption is your kind of fun. And if you create a second vlan just to get an internal network for cluster heartbeat traffic on the same physical adapters you are using for client connections, you are not really achieving anything other than making your cluster more complex. The cluster validation report will even warn you about this, as it will detect more than one interface with the same MAC address.