Node unable to join cluster because the cluster db is out of date

Scenario

  • Complex 3-node AOAG cluster
  • 2 nodes in room A form a SQL Server failover cluster instance (FCI)
  • 1 node in room B is a stand-alone instance
  • The FCI and node 3 form an Always On availability group (AOAG)
  • All nodes are in the same Windows failover cluster
  • All nodes run Windows Server 2022

Problem

The problem was reported as follows: Node 3 is unable to join the cluster, and the AOAG sync has stopped. A look at the cluster events tab in Failover Cluster Manager revealed the following error being repeated, FailoverClustering Event ID 5398:

Cluster failed to start. The latest copy of cluster configuration data was not available within the set of nodes attempting to start the cluster. Changes to the cluster occurred while the set of nodes were not in membership and as a result were not able to receive configuration data updates.

And also FailoverClustering Event ID 1652:

Cluster node 'Node 3' failed to join the cluster. A UDP connection could not be established to node(s) 'Node 1'. Verify network connectivity and configuration of any network firewalls

Analysis

Just looking at the error messages listed, one might be inclined to believe that something is seriously wrong with the cluster. Cluster database paxos tag (version) mismatch problems often lead to having to evict and re-join the complaining nodes. But experience, especially with multi-room clusters, has taught me that this is seldom necessary. The cluster configuration database issue is just a symptom of the underlying network issue. What it is trying to say is that the consistency of the database across the nodes cannot be verified, or that one of the nodes is unable to download the current version from one of the other nodes, perhaps due to an even number of active nodes, or not enough nodes online to form a majority.

A cluster validation run (Network only) was started, and indeed, there was a complete breakdown in communication between nodes 1 and 3, in both directions. Quote from the validation report:

Node 1 is not reachable from node 3. It is necessary that each cluster node can communicate each other cluster node by a minimum of one network path (though multiple paths are recommended to avoid a single point of failure). Please verify that existing networks are configured properly or add additional networks.

If the communication works in one direction, you will only get one such message. In this case, we also have the corresponding message indicating a two-way issue:

Node 3 is not reachable from node 1. It is necessary that each cluster node can communicate each other cluster node by a minimum of one network path (though multiple paths are recommended to avoid a single point of failure). Please verify that existing networks are configured properly or add additional networks.

Be aware that a two-way issue does not necessarily indicate a problem with more than one node. It does however point to a problem located near or at the troublesome node, whereas a one-way issue points more in the direction of a firewall issue.
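For reference, such a network-only validation can also be triggered from PowerShell instead of clicking through Failover Cluster Manager. A minimal sketch, where the node names and report path are placeholders:

Import-Module FailoverClusters

# Run only the network tests against the nodes involved and save the report.
Test-Cluster -Node "Node1", "Node2", "Node3" -Include "Network" -ReportName "C:\Temp\NetworkValidation"

# Quick reachability probe against the cluster service port (3343). ICMP may be blocked
# even when cluster traffic works, so treat the result as a hint rather than proof.
Test-NetConnection -ComputerName "Node1" -Port 3343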

Thus, a search of the central firewall log repository was started. It failed to shine a light on the matter though. Sadly, that is not uncommon in these cases. A targeted search performed directly on the networking devices in combination with a review of relevant firewall rules and routing policies by a network engineer is often needed to root out the issue.

The cluster had been running without any changes or issues for quite some time, but a similar problem had occurred at least once before. Last time it was due to a network change, and we knew that a change to parts of the network infrastructure had recently been performed. But still, something smelled fishy. And as we could not agree on where the smell came from, we chose to analyse a bit more before we summoned the network people.

The funny thing was that communication between node 2 and node 3 was working. One would think that the problem should be located on the interlink between room A and B, but if that was the case, why did it only affect node 1? We triggered a restart of the cluster service on node 3. The result was that the cluster, and thereby the AOAG listener and databases, went down, quorum was re-arbitrated, and node 1 was kicked out. The FCI and the AOAG primary failed over to node 2, node 3 joined the cluster and began to sync changes to the databases, and node 1 was offline.
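For reference, restarting the cluster service on a single node can be scripted as sketched below (the node name is a placeholder). As the paragraph above shows, be prepared for quorum to be re-arbitrated and for cluster roles to move or go offline.

Import-Module FailoverClusters

Stop-ClusterNode -Name "Node3"    # stops the cluster service on Node3 (does not drain roles)
Start-ClusterNode -Name "Node3"   # starts it again and lets the node attempt to rejoin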

So, the hunt was refocused. This time we were searching diligently for something wrong on node 1. Another validation report was triggered, this time not just for networking. It revealed several interesting things, of which two became crucial to solving the problem.

1: Cluster networking was now not working at all on node 1, and as a result network validation failed completely. Quote from the validation report:

An error occurred while executing the test.
There was an error initializing the network tests.

There was an error creating the server side agent (CPrepSrv).

Retrieving the COM class factory for remote component with CLSID {E1568352-586D-43E4-933F-8E6DC4DE317A} from machine Node 1 failed due to the following error: 80080005 Node 1.

2: There was a pending reboot. On all nodes. Quote:

The following servers have updates applied which are pending a reboot to take effect. It is recommended to reboot the servers to complete the patching process.
Node 1
Node 2
Node 3

Now, it is important to note that patching and installation of software updates on these servers is very tightly regulated due to SLA concerns. Such work should always end with a reboot, and there are fixed service windows to adhere to, with the exception of emergency updates for critical patches. No such updates had been applied recently.

Rummaging around in the registry, two potential culprits were discovered: Microsoft XPS print spooler drivers and Microsoft Edge browser updates. Now, why these updates were installed, and why they should kill DCOM and by extension failover clustering, I do not know. But they did. After restarting node 1, everything started working as expected again. Findings in the application installation log would indicate that Microsoft Edge was the problem, but this has not been verified. It does however make more sense than the XPS print spooler.

Solution

If you have this problem and you believe that the network/firewall/routing is not the issue, run a cluster validation report and look for pending reboots. You find them under “Validate Software Update Levels” in the “System Configuration” section.

If you have some pending reboots, restart the nodes one by one. The problem should vanish.

If you do not have any pending reboots, try rebooting the nodes anyway. If that does not help, hunt down someone from networking and have them look for traffic. Successful traffic is usually not logged, but logging can be turned on temporarily. Capturing the network traffic on all of the nodes and running a Wireshark analysis would be my next action point if the networking people are unable to find anything.
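A quick way to check the usual pending-reboot markers on all nodes at once is sketched below. It assumes PowerShell remoting is enabled; the node names are placeholders, and the two registry keys are the most common markers, not an exhaustive list.

$nodes = "Node1", "Node2", "Node3"
$keys  = @(
    "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Component Based Servicing\RebootPending"
    "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\WindowsUpdate\Auto Update\RebootRequired"
)

Invoke-Command -ComputerName $nodes -ScriptBlock {
    foreach ($key in $using:keys) {
        [pscustomobject]@{
            Node          = $env:COMPUTERNAME
            Key           = $key
            RebootPending = Test-Path $key
        }
    }
}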


Error 87 when installing SQL Server Failover Cluster instance

Problem

When trying to install a SQL Server Failover Cluster instance, the installer fails with error 87. Full error message from the summary log:

Feature: Database Engine Services
Status: Failed
Reason for failure: An error occurred during the setup process of the feature.
Next Step: Use the following information to resolve the error, uninstall this feature, and then run the setup process again.
Component name: SQL Server Database Engine Services Instance Features
Component error code: 87
Error description: The parameter is incorrect.


Analysis

This is an error I have not come across before.

The SQL Server instance is kind of installed, but it is not working, so you have to uninstall it and remove it from the cluster before you can try again. This is rather common when a clustered installation fails, though.

As luck would have it, the culprit was easy to identify. A quick look in the cluster log (the one for the SQL Server installation, not the actual cluster log) revealed that the IP address supplied for the SQL Server instance was invalid. The cluster in question was located in a very small subnet, a /26. The IP allocated was .63. A /26 subnet contains 64 addresses. As you may or may not know, the first and last addresses in a subnet are reserved: the first address (.0) is the network address, and the last address (yes, that would be .63) is reserved as the broadcast address. It is also common to reserve the second or second-to-last address for a gateway, which would be .1 or .62 in this case.
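To spell out the arithmetic, here is a small sketch (the numbers hold for any /26; the actual subnet does not matter):

$prefixLength = 26
$hostBits     = 32 - $prefixLength          # 6 host bits
$subnetSize   = [math]::Pow(2, $hostBits)   # 64 addresses: .0 through .63

# .0  = network address  (reserved)
# .63 = broadcast address (reserved) - the address handed to the SQL Server instance
# .1 or .62 is commonly reserved for the gateway
"{0} addresses in total, usable host range .1-.62 (minus the gateway)" -f $subnetSize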


Solution

Allocate a different IP address. In our case that meant moving the cluster to a different subnet, as the existing subnet was filled to the brim.

Action plan:

  • Replace the IP on node 1
  • Wait for the cluster to arbitrate using the heartbeat VLAN or direct attach crossover cable.
  • Add an IP to the Cluster Group resource group and give it an address in the new subnet (see the sketch after this list).
  • Bring the new IP online
  • Replace the IP on node 2
  • Wait for the cluster to achieve quorum
  • Remove the old IP from the Cluster Group resource group
  • Validate the cluster
  • Re-try the SQL Server Failover Cluster instance installation.
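A hedged sketch of the “add an IP” steps above in PowerShell. The resource names, address and subnet mask are placeholders; check the real resource names with Get-ClusterResource before running anything like this.

# Add a new IP Address resource to the Cluster Group and configure it.
Add-ClusterResource -Name "Cluster IP Address (new)" -ResourceType "IP Address" -Group "Cluster Group"

Get-ClusterResource "Cluster IP Address (new)" |
    Set-ClusterParameter -Multiple @{ Address = "10.0.2.10"; SubnetMask = "255.255.255.192"; EnableDhcp = 0 }

# Let the cluster name come online on either IP while the migration is in progress.
Set-ClusterResourceDependency -Resource "Cluster Name" `
    -Dependency "[Cluster IP Address] or [Cluster IP Address (new)]"

Start-ClusterResource "Cluster IP Address (new)"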

Cluster validation fails: Culture is not supported

Problem

The Failover Cluster validation report shows an error for Inventory, List Operating System Information:

An error occurred while executing the test. There was an error getting information about the operating systems on the node. Culture is not supported. Parameter name: culture 3072 (0x0c00) is an invalid culture identifier.

If you look at the summary for the offending node, you will find that Locale and Pagefiles are missing.

Analysis

Clearly there is something wrong with the locale settings on the offending node. As the sample shows, the locale is set to nb-NO for Norwegian (Norway). I immediately suspected that to be the culprit. Most testing is done on en-US, and the rest of us who want to see a sane 24-hour clock without Latin abbreviations, and a date with the month where it should be located, usually have to suffer.

I was unable to determine exactly where the badger was buried, but the solution was simple enough.

Solution

Part 1

  • Make sure that the Region & language and Date & Time settings (modern settings) are set correctly on all nodes. Be aware of differences between the offending node and working nodes.
  • Make sure that the System Locale is set correctly in the Control Panel, Region, Administrative window (a quick way to compare settings across nodes is sketched after this list).
  • Make sure that Windows Update works and is enabled on all nodes.
  • Check the Languages list under Region & Language (modern settings). If it flashes “Windows update” under one or more of the languages, you still have a Windows Update problem or an internet access problem.
  • Try to validate the cluster again. If the error still appears, go to the next line.
  • Run Windows Update and Microsoft Update on all cluster nodes.
  • Restart all cluster nodes.
  • Try to validate the cluster again. If the error still appears, go to part 2.
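A hedged sketch for comparing the relevant locale settings across the nodes (node names are placeholders, and PowerShell remoting is assumed to be enabled):

Invoke-Command -ComputerName "Node1", "Node2", "Node3" -ScriptBlock {
    [pscustomobject]@{
        Node         = $env:COMPUTERNAME
        SystemLocale = (Get-WinSystemLocale).Name   # Control Panel, Region, Administrative
        UserCulture  = (Get-Culture).Name           # Region, Formats
        OsLocale     = (Get-CimInstance Win32_OperatingSystem).Locale
    }
}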

Part 2

  • Make sure that you have completed part 1.
  • Go to Control Panel, Region.
  • In the Region dialog, locate Formats, Format.
  • Change the format to English (United States). Be sure to select the correct English locale.
  • Click OK.
  • Run the validation again. It should be successful.
  • Change the format back.
  • Run the validation again. It should be successful.

If you are still not successful, there is something seriously wrong with your operating system. I have not yet had a case where the above steps do not resolve the problem, but I suspect that running chkdsk, sfc /scannow or a full node re-installation would be next. Look into the rest of the validation report for clues to other problems.

Event 20501 Hyper-V-VMMS

Problem

The following event is logged non-stop in the Hyper-V High Availability log:

Log Name:      Microsoft-Windows-Hyper-V-High-Availability-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          27.07.2017 12.59.35
Event ID:      20501
Task Category: None
Level:         Warning
Description:
Failed to register cluster name in the local user groups: A new member could not be added to a local group because the member has the wrong account type. (0x8007056C). Hyper-V will retry the operation.


Analysis

This came in as an error report on a new Windows Server 2016 Hyper-V cluster that I had not built myself. I ran a full cluster validation report, and it returned this warning:

Validating network name resource Name: [Cluster resource] for Active Directory issues.

Validating create computer object permissions was not run because the Cluster network name is part of the local Administrators group. Ensure that the Cluster Network name has “Create Computer Object” permissions.

I then checked AD, and found that the cluster object did in fact have the Create Computer Object permissions mentioned in the message.

The event log error refers to the cluster computer object being a member of the local admins group. I checked, and found that this was the case. The nodes themselves were also added as local admins on all cluster nodes. That is, the computer objects for node 1, 2 and so on were members of the local admins group on all nodes. My records show that this practice was necessary when using SOFS storage in 2012. It is not necessary for Hyper-V clusters using FC-based shared storage.

The permissions needed to create a cluster in AD

  • Local admin on all the cluster nodes
  • Create computer objects on the Computers container, the default container for new computers in AD. This could be changed, in which case you need permissions in the new container.
  • Read all properties permissions in the Computers container.
  • If you specify a specific OU for the cluster object, you need permissions in this OU in addition to the new objects container.
  • If your nodes are located in a specific OU, and not the Computers container, you will also need permissions in that OU, as the cluster object will be created in the OU where the nodes reside.

See Grant create computer object permissions to the cluster for more details.
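As a hedged sketch, the create computer objects permission can also be granted from the command line with dsacls. The OU distinguished name and the cluster account below are placeholders for the real values.

$ou  = "OU=HyperVClusters,DC=contoso,DC=com"
$cno = 'CONTOSO\HVCLUSTER01$'

# "CC;computer" = create child objects of type computer.
dsacls $ou /G "${cno}:CC;computer"

# Verify that the permission is listed.
dsacls $ou | Select-String -SimpleMatch $cno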

Solution

As usual, a warning: If you do not understand these tasks and their possible ramifications, seek help from someone that does before you continue.

Solution 1, low impact

If it is difficult to destroy the cluster because it requires the VMs to be removed from the cluster temporarily, you can try this method. We do not know if there are other detrimental effects caused by not having the proper permissions when creating the cluster.

  • Remove the cluster computer object from the local Administrators group on all cluster nodes (see the sketch after this list).
  • Remove the cluster node computer objects from the local Administrators group on all nodes.
  • Make sure that the cluster object has create computer objects permissions on the OU in which the cluster object and nodes are located
  • Make sure that the cluster object and the cluster node computer objects are all located in the same OU.
  • Validate the cluster and make sure that it is all green.
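A sketch of the first two steps; domain, cluster and node names are placeholders, and PowerShell remoting is assumed:

Invoke-Command -ComputerName "HV01", "HV02" -ScriptBlock {
    # List the current members of the local Administrators group for review.
    Get-LocalGroupMember -Group "Administrators" |
        Select-Object @{ n = "Node"; e = { $env:COMPUTERNAME } }, Name, ObjectClass

    # Remove the cluster computer object and the node computer objects, if present.
    'CONTOSO\HVCLUSTER01$', 'CONTOSO\HV01$', 'CONTOSO\HV02$' | ForEach-Object {
        Remove-LocalGroupMember -Group "Administrators" -Member $_ -ErrorAction SilentlyContinue
    }
}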

Solution 2, high impact

A shotgun approach that also removes any collateral damage from failed attempts at fixing the problem.

  • Migrate any VMs away from the cluster
  • Remove the cluster from VMM if it is a member.
  • Remove the “Create computer objects” permissions for the cluster object
  • Destroy the cluster.
  • Delete the cluster object from AD
  • Re-create the cluster with the same name and IP, using a domain admin account.
  • Add create computer objects and read all properties permissions to the new cluster object in the current OU. 
  • Validate the cluster and make sure it is all green.
  • Add the server to VMM if necessary.
  • Migrate the VMs back.

Cluster disk resource XX contains an invalid mount point

Problem

During cluster startup or failover, the following event is logged in the system event log:


Event-ID 1208 from Physical Disk Resource: Cluster disk resource ‘[Resource name]’ contains an invalid mount point. Both the source and target disks associated with the mount point must be clustered disks, and must be members of the same group.
Mount point ‘[Mount path]’ for volume ‘\\?\Volume{[GUID]}\’ references an invalid target disk. Please ensure that the target disk is also a clustered disk and in the same group as the source disk (hosting the mount point).

Cause and investigation

The cause could of course be the fact that the base drive is not a clustered disk as the event message states. If that is the case, read a book about WFC (Windows failover clustering) and try again. If not, I have found the following causes:

  • If the mount point path is C:\$Recycle.bin\[guid], it is caused by replacing a SAN drive with another one at the same drive letter or mount point but with a different LUN. This confuses the recycle bin.
  • If the clustered drive for either the mount point or the volume being mounted is in maintenance mode and/or currently running autochk/chkdsk. This can happen so quickly that you are unable to detect it, and when you come back to check, the services are already up and running. Unless you disable it, WFC will run autochk/chkdsk when a drive with the dirty bit set is brought online. This is probably logged somewhere, but I have yet to determine in which log. Look in the application event log for Chkdsk events or something like this:

Event 17207 from MSSQL[instance]

Event 1066 from FailoverClustering

Resolution

  • If it is the recycle bin folder, make sure you have a backup of your data and delete the mount point folder under C:\$Recycle.bin. You might have to take ownership of the folder to be able to complete this task. If files are in use, take all cluster resources offline and try again.
  • If you suspect a corrupt mount point or drive, run chkdsk on ALL clustered drives. See https://lokna.no/?p=1194 for details.

Check C:\Windows\Cluster\Reports (default location) for files titled ChkDSK_[volume].txt, indicating that the cluster service has triggered an automatic chkdsk on a drive.
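Both checks can be scripted; a sketch where the report path is the default and the drive letter and volume GUID are placeholders:

# Look for chkdsk reports left behind by the cluster service.
Get-ChildItem "C:\Windows\Cluster\Reports\ChkDsk*.txt" -ErrorAction SilentlyContinue |
    Sort-Object LastWriteTime -Descending

# Query the dirty bit for each clustered drive and mount point.
fsutil dirty query D:
fsutil dirty query '\\?\Volume{00000000-0000-0000-0000-000000000000}\'   # GUID from the event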

Run disk maintenance on a failover cluster mountpoint

Problem

“Validate this cluster” or another tool tells you that the dirty bit is set for a cluster shared volume, and taking the disk offline and online again (to trigger autochk) does not help.


Continue reading “Run disk maintenance on a failover cluster mountpoint”

Permission error installing Failover Cluster instance

Problem

While testing out MSSQL 2012 Always On Failover Clustering in my lab, I stumbled upon a strange error I had never seen before: “Updating permission settings for file [Shared drive]\[Mountpoint]\System Volume Information\ResumeKeyFilter.store failed”. This happened for all drives that were mounted as a mount point (folder) instead of a drive letter.

Continue reading “Permission error installing Failover Cluster instance”