Cluster node fails to join cluster after boot

Problem

During a maintenance window, one of five Hyper-v cluster nodes failed to come out of maintenance mode after a reboot. SCVMM was used to facilitate maintenance mode. The system log shows the following error messages repeatedly:

Service Control Manager Event ID 7024

The Cluster Service service terminated with the following service-specific error:
Cannot create a file when that file already exists.

FailoverClustering Event ID 1070

The node failed to join failover cluster [clustername] due to error code ‘183’.

Service Control Manager Event ID 7031

The Cluster Service service terminated with the following service-specific error:
The Cluster Service service terminated unexpectedly.  It has done this 6377 time(s).  The following corrective action will be taken in 60000 milliseconds: Restart the service.

SNAGHTML115b89e

SNAGHTML1170717

SNAGHTML117b921

Analysis

I had high hopes of a quick fix. The cluster is relatively new, and we had recently changed the network architecture by adding a second switch. Thus, I instructed the tech who discovered the fault to try rebooting and checking that the server was able to reach the other nodes on all its interfaces. That river ran dry quickly though, as the local network was proved to be working perfectly.

Looking through the Windows Update log and the KBs for the installed updates did not reveal any clues. Making it even more suspicious, a cluster validation ensured me that all nodes had the same updates. Almost. Except for one. Hopefully I looked closer, but of course, this was some .Net framework update rollup missing on a different node.

I decided to give up all hope of an impressive five minute fix and venture into the realm of the cluster log. It is possible to read the cluster log in the event log system, but I highly recommend generating the text file and opening it in notepad++ or some other editor with search highlighting. I find it a lot easier on the eyes and mind. (Oh, and if you use the event log reader, DO NOT filter out information messages. For some reason, the eureka moment is often hidden in an information message.) The cluster log is a bit like the forbidden forest; it looks scary in daylight and even scarier in the dark during an unscheduled failover. It is easy to get lost down a track interpreting hundreds of “strange” messages, only to discover that they were benign. To make it worse, they covered a timespan of about half a second. The wrong second of course, not the one where the problem actually occurred. To say it mildly, the cluster service is very talkative. Especially so when something is wrong. As event 7031 told us, the cluster service is busy trying to start once a minute. Each try spews out thousands of log messages. The log I was working with had 574 942 lines and covered a timespan 68 minutes. That is about 8450 lines per service start.

Anyway, into the forbidden forest I went with nothing but a couple of event log messages to guide me. After a while, I isolated one cluster service startup attempt with correlated system event log data.   I discovered the following error:

ERR   mscs::GumAgent::ExecuteHandlerLocally: AlreadyExists(183)’ because of ‘already exists'([Servername]- LiveMigration Vlan 197)

The sneaky cluster service had tried fooling me into believing that the fault was file system related, when in fact it was a networking mess after all! You may recognize error 183 from system event id 1070 above. Thus we can conclude with fairly high probability that we have found the culprit. But what does it mean? I checked the name of the adapter in question and its teaming configuration. I ventured even further and checked for disconnected and missing network adapters, but none were to be found, neither in the device manager or in the registry. Then it struck med. The problem was staring me straight in the eye. The line above the error message, an innocent looking INFO message, was referring to the network adapter by an almost identical but nevertheless different name:

INFO [ClNet] Adapter Intel(R) Ethernet 10G 2P X520-k bNDC #2 – VLAN : VLAN197 LiveMigration is still attached to network VLAN197 Live Migration

A short but efficient third degree interrogation of the tech revealed that the network names had been changed some weeks prior to make them consistent on all cluster nodes. Ensuring network name consistency is in itself a noble task, but one that should be completed before the cluster is formed. It should of course be possible to change network names at any time, but for some reason the cluster service has not been able to persist the network configuration properly. As long as the node is kept running this poses no problem. However, when the node is rebooted for whatever reason, the cluster service gets lost running around in the forest looking for network adapters. Network adapters that are present but silent as the cluster service is calling them by the wrong name. I have not been able to figure out exactly what happens, not to mention what caused it in the first place, but I can guess. My guess is that different parts of the persisted cluster configuration came out of sync. This probably links network adapters to their cluster network names:

SNAGHTML151cc90

I have found this fault now on two clusters in as many days, and those are my first encounters with it. I suspect the fault is caused by a bug or “feature” introduced in a recent update to Win2012R2.

Solution

The solution is simple. Or at least I hope it is simple. A usual I strongly encourage you to seek assistance if you do not fully understand the steps below and their implications.

  • First you have to disable the cluster service to keep it out of the way. You can do this in srvmgr.msc, PowerShell, cmd.exe or through telepathy.
  • Wait for the cluster service to stop if it was running (or kill it manually if you are impatient).
  • Change the name of ALL network adapter that have an ipv4 and/or ipv6 address and are used by the cluster service. Changing the name of only the troublesome adapter mentioned in the log may not be enough. Make sure you do not mess around with any physical teaming members, SAN HBAs, virtual cluster adapters or anything else that is not directly used as a cluster network.
  • Before:image After: image
  • Enable the cluster service
  • Wait for the cluster service to start, or start it manually
  • The node should now be able to join the cluster
  • Stop the cluster service again  (properly this time, do not kill it)
  • Change the network adapter names back
  • Start the cluster service again
  • Verify that the node joins the cluster successfully
  • Restart the node to verify that the settings are persisted to disk.

Author: DizzyBadger

SQL DBA Principal Analyst

7 thoughts on “Cluster node fails to join cluster after boot”

  1. Saved my life! Thank you! In our case we removed the VMware tools and that took the NIC drivers with it. When we put them back I renamed the NIC different than before, so the cluster service on that computer would not start. After reading your post, I named them like before and voila! It works. Why would a cluster service be so dependent on such small things??

  2. Thank you! In our case, we stopped the cluster service, renamed the NIC Teaming name and restarted the cluster service.

  3. Thank you! In our case Wndows 2012 R2 cluster with VMWare we stopped the Cluster Service en renamed the NIC’s. After re-joining the cluster we repeated the first step and changed the NIC’s names back.

Leave a Reply