Node unable to join cluster because the cluster db is out of date

Scenario

  • Complex 3-node AOAG cluster
  • 2 nodes in room A form a SQL Server failover cluster instance (FCI)
  • 1 node in room B is a stand-alone instance
  • The FCI and node 3 form an always on availability group (AOAG)
  • All nodes are in the same Windows failover cluster
  • All nodes run Windows Server 2022

Problem

The problem was reported as follows: node 3 is unable to join the cluster, and the AOAG sync has stopped. A look at the cluster events tab in Failover Cluster Manager revealed the following error being repeated, FailoverClustering Event ID 5398:

Cluster failed to start. The latest copy of cluster configuration data was not available within the set of nodes attempting to start the cluster. Changes to the cluster occurred while the set of nodes were not in membership and as a result were not able to receive configuration data updates.

And also FailoverClustering Event ID 1652:

Cluster node 'Node 3' failed to join the cluster. A UDP connection could not be established to node(s) 'Node 1'. Verify network connectivity and configuration of any network firewalls

Analysis

Just looking at the error messages listed, one might be inclined to believe that something is seriously wrong with the cluster. Cluster database paxos (version) tag mismatch problems often lead to having to evict and re-join the complaining nodes. But experience, especially with multi-room clusters, has taught me that this is seldom necessary. The cluster configuration database issue is just a symptom of the underlying network issue. What it is trying to say is that the consistency of the database across the nodes cannot be verified, or that one of the nodes is unable to download the current version from one of the other nodes, perhaps due to an even number of active nodes, or not enough nodes online to form a majority.
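If you want to compare the cluster database versions directly, the paxos tag of each node's local copy can be read from the registry. A minimal sketch, assuming PowerShell remoting is available; the node names are placeholders for the three nodes in this scenario:

# Compare the cluster database version (paxos tag) across the nodes.
$nodes = "Node1", "Node2", "Node3"
Invoke-Command -ComputerName $nodes -ScriptBlock {
    [pscustomobject]@{
        Node     = $env:COMPUTERNAME
        # The paxos tag identifies the version of the local cluster database copy.
        PaxosTag = (Get-ItemProperty -Path "HKLM:\Cluster" -Name PaxosTag).PaxosTag
    }
}

If the tags differ, the node with the older tag is usually the one waiting to download the newer configuration from the rest of the cluster.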

A cluster validation run (Network only) was started, and indeed, there was a complete breakdown in communication between node 1 and node 3, in both directions. Quote from the validation report:

Node 1 is not reachable from node 3. It is necessary that each cluster node can communicate each other cluster node by a minimum of one network path (though multiple paths are recommended to avoid a single point of failure). Please verify that existing networks are configured properly or add additional networks.

If the communication works in one direction, you will only get one such message. In this case, we also have the corresponding message indicating a two-way issue:

Node 3 is not reachable from node 1. It is necessary that each cluster node can communicate each other cluster node by a minimum of one network path (though multiple paths are recommended to avoid a single point of failure). Please verify that existing networks are configured properly or add additional networks.

Be aware that a two-way issue does not necessarily indicate a problem with more than one node. It does however point to a problem located near or at the troublesome node, whereas a one-way issue points more in the direction of a firewall issue.
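For reference, a network-only validation can be kicked off from PowerShell instead of the Failover Cluster Manager GUI. A sketch, with placeholder node names:

# Run only the network tests of the cluster validation.
Test-Cluster -Node "Node1", "Node2", "Node3" -Include "Network"

The resulting report is the same one you get from the GUI wizard.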

Thus, a search of the central firewall log repository was started. It failed to shed any light on the matter, though. Sadly, that is not uncommon in these cases. A targeted search performed directly on the networking devices, in combination with a review of relevant firewall rules and routing policies by a network engineer, is often needed to root out the issue.

The cluster had been running without any changes or issues for quite some time, but a similar problem had occurred at least once before. Last time it was due to a network change, and we knew that a change to parts of the network infrastructure had recently been performed. But still, something smelled fishy. And as we could not agree on where the smell came from, we chose to analyse a bit more before we summoned the network people.

The funny thing was that communication between node 2 and node 3 was working. One would think that the problem should be located on the interlink between room A and B, but if that were the case, why did it only affect node 1? We triggered a restart of the cluster service on node 3. The result was that the cluster, and thereby the AOAG listener and databases, went down; quorum was re-arbitrated, and node 1 was kicked out. The FCI and AOAG primary failed over to node 2, node 3 joined the cluster and began to sync changes to the databases, and node 1 was offline.
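For completeness, the cluster service restart on node 3 amounts to something like the sketch below. The node name is a placeholder, and be aware that restarting the service makes the node leave and re-join the cluster, so expect a quorum re-arbitration:

# Restart the cluster service on node 3 only.
Invoke-Command -ComputerName "Node3" -ScriptBlock {
    Restart-Service -Name ClusSvc -Force
}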

So, the hunt was refocused. This time we were searching diligently for something wrong on node 1. Another validation report was triggered, this time not just for networking. It revealed several interesting things, of which two became crucial to solving the problem.

1: Cluster networking was now not working at all on node 1, and as a result network validation failed completely. Quote from the validation report:

An error occurred while executing the test.
There was an error initializing the network tests.

There was an error creating the server side agent (CPrepSrv).

Retrieving the COM class factory for remote component with CLSID {E1568352-586D-43E4-933F-8E6DC4DE317A} from machine Node 1 failed due to the following error: 80080005 Node 1.

2: There was a pending reboot. On all nodes. Quote:

The following servers have updates applied which are pending a reboot to take effect. It is recommended to reboot the servers to complete the patching process.
Node 1
Node 2
Node 3

Now, it is important to note that patching and installation of software updates on these servers is very tightly regulated due to SLA concerns. Such work should always end with a reboot, and there are fixed service windows to adhere to, with the exception of emergency updates of critical patches. No such updates had been applied recently.

Rummaging around in the registry, two potential culprits were discovered: Microsoft XPS print spooler drivers and Microsoft Edge browser updates. Now, why these updates were installed, and why they should kill DCOM and by extension failover clustering, I do not know. But they did. After restarting node 1, everything started working as expected again. Findings in the application installation log would indicate that Microsoft Edge was the problem, but this has not been verified. It does however make more sense than the XPS print spooler.

Solution

If you have this problem and you believe that the network/firewall/routing is not the issue, run a cluster validation report and look for pending reboots. You will find them under “Validate Software Update Levels” in the “System Configuration” section.
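If you prefer to check from PowerShell instead of reading the full validation report, the usual pending-reboot markers can be queried directly. A sketch, assuming PowerShell remoting and placeholder node names:

# Check the common pending-reboot indicators on each node.
$nodes = "Node1", "Node2", "Node3"
Invoke-Command -ComputerName $nodes -ScriptBlock {
    [pscustomobject]@{
        Node          = $env:COMPUTERNAME
        CBSReboot     = Test-Path "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Component Based Servicing\RebootPending"
        WUReboot      = Test-Path "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\WindowsUpdate\Auto Update\RebootRequired"
        PendingRename = $null -ne (Get-ItemProperty "HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager" -Name PendingFileRenameOperations -ErrorAction SilentlyContinue)
    }
}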

If you have some pending reboots, restart the nodes one by one. The problem should vanish.
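A drain-and-restart loop does the trick without taking the whole cluster down. A sketch; node names are placeholders, and you should verify that the FCI and AOAG roles land where you want them between each iteration (with three nodes, restarting one at a time keeps quorum intact):

# Restart the nodes one by one, draining roles off each node first.
foreach ($node in "Node1", "Node2", "Node3") {
    Suspend-ClusterNode -Name $node -Drain -Wait
    Restart-Computer -ComputerName $node -Wait -For PowerShell -Force
    Resume-ClusterNode -Name $node -Failback Immediate
}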

If you do not have any pending reboots, try rebooting the nodes anyway. If that does not help, hunt down someone from networking and have them look for traffic. Successful traffic is usually not logged, but logging can be turned on temporarily. Capturing the network traffic on all of the nodes and running a Wireshark analysis would be my next action point if the networking people are unable to find anything.
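A built-in alternative to installing capture tools on the nodes is netsh trace; the resulting .etl files can be converted (for example with Microsoft's etl2pcapng) and opened in Wireshark. A sketch; the output path and node names are illustrative:

# Start a capped network capture on every node, reproduce the problem, then stop.
Invoke-Command -ComputerName "Node1", "Node2", "Node3" -ScriptBlock {
    netsh trace start capture=yes maxsize=512 tracefile=C:\Temp\cluster.etl
}
# ...reproduce the join failure...
Invoke-Command -ComputerName "Node1", "Node2", "Node3" -ScriptBlock {
    netsh trace stop
}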


Primary replica is not joined to the Availability group, or clusters past morphing into clusters present

Problem

I was upgrading an Availability group from SQL 2012 on Win 2012R2 to SQL 2016 on Win2016. I had expected to create the new AOAG as a separate cluster and move the data manually, but the users are always complaining when I want to use my allotted downtime quotas, so I decided to try a rolling upgrade instead. This post is a journal of some of the perils I encountered along the way, and how I overcame them. There were countless others, but most of them were related to crappy hardware, wrong hardware being delivered, missing LUNs on the SAN, delusional people who believe they can lock out DBAs from supporting systems, dragons, angry badgers, solar flares and whichever politician you dislike the most. Anyways, on with the tale of clusters past morphing into clusters present…

I started with adding the new node to the failover cluster. This went surprisingly well, in spite of the old servers being at least two generations older than my new rack servers. Sadly, both the new and the old servers are made by the evil wizards behind the silver slanted E due to factors outside of my control. But I digress. The cluster join went flawlessly. There were some yellow complaints about the nodes not having the same OS version in the cluster validation scroll, but everything worked.
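The cluster join itself is a one-liner; a sketch with illustrative cluster and node names (the OS version warnings show up when you re-run validation afterwards):

# Add the new Win2016 node to the existing cluster, then re-validate.
Add-ClusterNode -Cluster "OldCluster" -Name "NewNode1"
Test-Cluster -Node (Get-ClusterNode -Cluster "OldCluster").Name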

Then came adding the new server as a replica in the availability group. This is done from the primary replica, and I just uttered a previously prepared spell from the book of disaster recovery belonging to this cluster, adding the name of the new node. As far as I can remember this is just the result of the standard “Add replica” wizard. The spell ran without complaints, and my new node was online.
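I do not have the exact spell at hand, but the standard wizard output boils down to an ALTER AVAILABILITY GROUP … ADD REPLICA statement run on the primary. A rough sketch, with the AG name, node name, endpoint URL and instance names as placeholders, wrapped in Invoke-Sqlcmd from the SqlServer module:

# Add the new node as a replica; run against the current primary replica.
$addReplica = @"
ALTER AVAILABILITY GROUP [MyAg]
ADD REPLICA ON 'NEWNODE' WITH (
    ENDPOINT_URL      = 'TCP://newnode.contoso.local:5022',
    AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
    FAILOVER_MODE     = MANUAL
);
"@
Invoke-Sqlcmd -ServerInstance "OLDPRIMARY" -Query $addReplica

After this, the new replica still has to join the availability group from its own side and have the databases restored and joined, which is the part that apparently failed silently in my case.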

This is the point where it all went to heck in a small hand-basket carried by an angry badger. I noticed a yellow warning next to the new node in the AOAG dashboard. But as the databases were all in the synchronizing state on the new replica, I believed this to be a note complaining about the OS version. I was wrong. In my ignorance, I failed over to the new node and had the application team minions run some tests. They came back positive, so I removed the old nodes in preparation for adding the last one. I even ran the Update-ClusterFunctionalLevel Powershell command without issues. But the warning persisted. This is the contents of the warning:

Availability replica not joined.


And it was no longer a lone warning; the AOAG dashboard did not look pretty, as both of the old nodes refused to accept the new node as their new primary replica.

Analysis

As far as I can tell, the join AOAG script failed in some way. It did not report any errors, but still, there is no doubt that something did go wrong.

The solution as reported by MSDN is simple: just join the availability group by casting the “alter availability group groupname join” spell from the secondary replica that is not joined. The attentive reader has probably already realized that this is the primary replica, and as you probably suspect, the aforementioned command fails.
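In PowerShell terms, the prescribed fix is a one-liner run against the replica that is missing from the join list. A sketch with placeholder instance and AG names; in my case it fails because that replica is now the primary:

# Run on the replica that is not joined (per MSDN).
Invoke-Sqlcmd -ServerInstance "NEWNODE" -Query "ALTER AVAILABILITY GROUP [MyAg] JOIN;"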

Casting the following spell lists the replicas and their join state: “select join_state, join_state_desc from sys.dm_hadr_availability_replica_cluster_states”. This is the result:

[Screenshot: query output showing the new primary replica in a not-joined state]
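The same query with the replica names included makes the output easier to read. A sketch, assuming the SqlServer module; the instance name is a placeholder:

# List every replica and its cluster join state.
Invoke-Sqlcmd -ServerInstance "NEWNODE" -Query @"
SELECT replica_server_name, join_state, join_state_desc
FROM sys.dm_hadr_availability_replica_cluster_states;
"@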

In some way I have put the node in an invalid state. It still works perfectly, but I guess it is only a question of when, not if, this issue will grow into a bigger problem.

Solution

With such an elaborate backstory, you would not be wrong to expect an equally elaborate solution. Whether or not it is, is really in the eye of the beholder.

Just the usual note of warning first: If you are new to availability groups, and all this cluster stuff sounds like the dark magic it is, I would highly suggest that you do not try to travel down the same path as me. Rather, you should turn around at the entrance and run as fast as you can into the relative safety of creating another cluster alongside the old one. Then migrate the data by backing up on the old cluster and restoring on the new cluster. And if backups and restores on availability groups sounds equally scary, then ask yourself whether or not you are ready to run AOAG in production. In contrast to what is often said in marketing materials and at conferences, AOAG is difficult and scary to the beginner. But there are lots of nice training resources out there, even some free ones.

Now, with the warnings out of the way, here is what ended up working for me. I tried a lot of different solutions, but I was bound by the following limitation: the service had to stay online. That translates to no reboots, no AOAG destroy and re-create, no cluster rebuilds and so on. A combination of those would probably have solved the problem in less than an hour of downtime. But I was allowed none, so this is what I did (a condensed PowerShell sketch of the key steps follows the list):

  • Remove any remaining nodes and replicas that are not Win2016 SQL2016.
  • Run the Powershell command Update-ClusterFunctionalLevel to make sure that the cluster is running in Win2016 mode.
  • Build another Win 2016 SQL 2016 node
  • Join the new node to the cluster
  • Make sure that the cluster validation scroll seems reasonable. This is a fluffy point, I know, but there are way too many variables to make an exhaustive list. https://lokna.no/?p=1687 mentions some of the issues you may encounter.
  • Join the new node to the availability group as a secondary replica.
  • Fail the availability group over to the new node (make sure you are in synchronous commit mode for this step).
  • Everything is OK.


  • Fail back to the first node
  • Change back to asynchronous commit (if that is your default mode, otherwise leave it as synchronous).
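As promised, a condensed PowerShell sketch of the key steps above. All names are placeholders, the SqlServer module is assumed for the T-SQL calls, and the commit-mode juggling assumes asynchronous commit is your normal mode:

# 1. With only Win2016/SQL2016 nodes left, raise the cluster functional level.
Update-ClusterFunctionalLevel

# 2. Add the extra Win2016/SQL2016 node to the cluster
#    (the AG replica add and join are done as shown earlier in this post).
Add-ClusterNode -Name "NEWNODE2"

# 3. Make sure both replicas are in synchronous commit, then fail over to the new one.
Invoke-Sqlcmd -ServerInstance "NEWNODE1" -Query @"
ALTER AVAILABILITY GROUP [MyAg]
MODIFY REPLICA ON 'NEWNODE1' WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);
ALTER AVAILABILITY GROUP [MyAg]
MODIFY REPLICA ON 'NEWNODE2' WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);
"@
Invoke-Sqlcmd -ServerInstance "NEWNODE2" -Query "ALTER AVAILABILITY GROUP [MyAg] FAILOVER;"

# 4. Fail back to the first node; revert the commit mode afterwards if needed (not shown).
Invoke-Sqlcmd -ServerInstance "NEWNODE1" -Query "ALTER AVAILABILITY GROUP [MyAg] FAILOVER;"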

 

Thus I have successfully upgraded a 2-node AOAG cluster from Win2012R2 and SQL 2012 to Win2016 and SQL 2016 with three failovers as the only downtime. In QA. Production may become an interesting journey, IF the change request is approved. There may be an update if I survive the process…

 

Update and final notes

I have now been through the same process in production, with similar results. I do not recommend doing this in production; the normal migration to a new cluster is far preferable, especially when you are crossing two SQL Server versions on the way. Then again, if the reduced downtime is worth the risk…

Be aware that a failover to a new node is a one-way process. Once the SQL 2016 node becomes the primary replica, the database is upgraded to the latest file format, currently 852, whereas SQL 2012 uses 706. And as far as I can tell from the log, there is a significant number of upgrades to be made. See http://sqlserverbuilds.blogspot.no/2014/01/sql-server-internal-database-versions.html for a list of version numbers.


Error 3930 installing SQL 2012 SP2 with CU3 in cluster

Problem

I was patching one of my clusters to SQL 2012 SP2 and SP2 CU3 when something bad happened. This particular cluster is a 3-node cluster with an FCI Primary AOAG replica instance on nodes 1 and 2, and a stand-alone Secondary AOAG replica instance on node 3. Node 3 is used for HADR when the shared storage or other shared infrastructure has an outage.

The update passed QAT with flying colors, but sadly that does not always guarantee a successful production run. My standard patch procedure for this cluster:

  • Patch node 3
  • Patch node 2 (passive FCI node)
  • AOAG failover to node 3, node 3 becomes AOAG Primary
  • FCI failover from node 1 to node 2
  • Patch node 1
  • FCI failover to node 1
  • AOAG failover to node 1

When I tried to fail over the FCI to node 2 (step 4 above), the instance failed. At first, I was worried that the SP2 upgrade process might be lengthy enough to trigger the FCI timeouts. An inspection of the SQL Server error log revealed that this was not the case. Instead, I was the victim of a dreaded master database failure:

2015-01-12 01:28:02.82 spid7s      Database 'master' is upgrading script 'msdb110_upgrade.sql' from level 184552836 to level 184554932.
2015-01-12 01:28:02.82 spid7s      ----------------------------------
2015-01-12 01:28:02.82 spid7s      Starting execution of PRE_MSDB.SQL
2015-01-12 01:28:02.82 spid7s      ----------------------------------
2015-01-12 01:28:02.96 spid7s      Error: 3930, Severity: 16, State: 1.
2015-01-12 01:28:02.96 spid7s      The current transaction cannot be committed and cannot support operations that write to the log file. Roll back the transaction.
2015-01-12 01:28:02.96 spid7s      Error: 912, Severity: 21, State: 2.
2015-01-12 01:28:02.96 spid7s      Script level upgrade for database 'master' failed because upgrade step 'msdb110_upgrade.sql' encountered error 3930, state 1, severity 16. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the 'master' database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion.
2015-01-12 01:28:02.97 spid7s      Error: 3417, Severity: 21, State: 3.
2015-01-12 01:28:02.97 spid7s      Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online.
2015-01-12 01:28:02.97 spid7s      SQL Server shutdown has been initiated
2015-01-12 01:28:02.97 spid7s      SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.

Analysis

In case misbehaving SQL Server instances are able to smell fear, I am glad I was located several miles away from the datacenter at this point in time. While a rebuild of master is certainly doable even in a complex setup such as this, it is not something you want to do at 2am without a detailed plan if you don’t have to. Thus, I tried failing the instance back to node 1 (running SP1 CU 11). To my amazement, it came online straight away. I have seen similar issues reduce clustered instances to an unrecognizable puddle of zeros and ones in a corner on the SAN drive, so this was a welcome surprise. Feeling lucky, I tried another failover to node 2, only to be greeted with another failure and the exact same errors in the log. A quick search revealed several similar issues, but no exact matches and no feasible solutions. The closest was a suggestion to disable replication during the upgrade. As you probably know, AOAG is just replication in a fancy dress, so I went looking for my Disaster Recovery Runbook, which contains ready-made scripts and plans for disabling and re-enabling AOAG. My only problem was that disabling AOAG would take down the AOAG listener, thus disconnecting all clients. Such antics result in grumpy client systems, web service downtime and a lot of paperwork for instance reviews, and are therefore something to avoid if at all possible. Just for the fun of it, I decided to try making node 2 the AOAG Primary during the upgrade. To my astonishment, this worked like a charm. Crisis (and paperwork) averted.

Solution

You have to promote the FCI to AOAG Primary during the upgrade from SP1 to SP2. The upgrade is triggered by failing the FCI over from a node running SP1 to a node running SP2, in my case the failover from node 1 to node 2 after patching node 2.
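In practice, that means doing the AOAG failover to the FCI before moving the FCI role to the freshly patched node. A sketch; the FCI network name, cluster role name and AG name are illustrative, and a planned AOAG failover assumes synchronous commit with a synchronized secondary:

# Make the FCI the AOAG primary first (run against the FCI network name).
Invoke-Sqlcmd -ServerInstance "FCINETNAME" -Query "ALTER AVAILABILITY GROUP [MyAg] FAILOVER;"
# Then fail the FCI over to the patched node; this triggers the script upgrade.
Move-ClusterGroup -Name "SQL Server (MSSQLSERVER)" -Node "Node2"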

Sadly, there is no fixed procedure for patching failover cluster instances. Some patches will only install on the active FCI node, and will then continue to patch all nodes automatically. But most patches follow the recipe above, where the passive node(s) are patched first.

This issue will probably not affect “clean” AOAG or FCI clusters where you only apply one technology. If you use FCI with replication on the other hand, you may experience the same issue.

Definitions

AOAG = Always On Availability Group

FCI = Failover Cluster Instance

HADR = High Availability / Disaster Recovery