Failover Cluster Archives - BlackCat Reasearch Facility

Node unable to join cluster because the cluster db is out of date

Scenario

Complex 3-node AOAG cluster
2 nodes in room A form a SQL Server failover cluster instance (FCI)
1 node in room B is a stand-alone instance
The FCI and node 3 form an always on availability group (AOAG)
Alle nodes are in the same Windows failover cluster
All nodes run Windows Server 2022

Problem

The problem was reported as such: Node 3 is unable to join the cluster, and the AOAG sync has stopped. A look in the cluster events tab in Failover Cluster Manager revealed the following error being repeated; FailoverClustering Event ID 5398:

Cluster failed to start. The latest copy of cluster configuration data was not available within the set of nodes attempting to start the cluster. Changes to the cluster occurred while the set of nodes were not in membership and as a result were not able to receive configuration data updates.

And also; FailoverClustering Event ID 1652:

Cluster node 'Node 3' failed to join the cluster. A UDP connection could not be established to node(s) 'Node 1'. Verify network connectivity and configuration of any network firewalls

Analysis

Just looking at the error messages listed, one might be inclined to believe that something is seriously wrong with the cluster. Cluster database paxos (version) tag mismatch problems often lead to having to evict and re-join the complaining nodes. But experience, and especially with multi-room clusters, has taught me that this is seldom necessary. The cluster configuration database issue is just a symptom of the underlying network issue. What it is trying to say is that the concurrency of the database across the nodes cannot be verified, or that one of the nodes are unable to download the current version from one of the other nodes. Maybe due to an even number of active nodes, or not enough nodes online to form a majority.

A cluster validation run (Network only) was started, and indeed, there was a complete breakdown in communication between node 1 and 3. In both directions. Quote from the validation report:

Node 1 is not reachable from node 3. It is necessary that each cluster node can
communicate each other cluster node by a minimum of one network path (though multiple paths
are recommended to avoid a single point of failure). Please verify that existing networks
are configured properly or add additional networks.

If the communication works in one direction, you will only get one such message. In this case, we also have the corresponding message indicating a to-way issue:

Node  is not reachable from node 1. It is necessary that each cluster node can communicate each other cluster node by a minimum of one network path (though multiple paths are recommended to avoid a single point of failure). Please verify that existing networks are configured properly or add additional networks.

Be aware that a two-way issue does not necessarily indicate a problem with more than one node. It does however point to a problem located near or at the troublesome node, whereas a one-way issue points more in the direction of a firewall issue.

Thus, a search of the central firewall log repository was started. It failed to shine a light on the matter though. Sadly, that is not uncommon in these cases. A targeted search performed directly on the networking devices in combination with a review of relevant firewall rules and routing policies by a network engineer is often needed to root out the issue.

The cluster had been running without any changes or issues for quite some time, but a similar problem had occurred at least once before. Last time it was due to a network change, and we knew that a change to parts of the network infrastructure had recently been performed. But stil, something smelled fishy. And as we could not agree on where the smell came from, we chose to analyse a bit more before we summoned the network people.

The funny thing was that communication between node 2 and node 3 was working. One would think that the problem should be located on the interlink between room A and B, but if that was the case, why did it only affect node 1? We triggered a restart of the cluster service on node 3. The result was that the cluster, and thereby the AOAG listener and databases went down, quorum was re-arbitrated, and node 1 was kicked out. The FCI and AOAG primary failed over to node 2, node 3 joined the cluster and began to sync changes to the databases, and node 1 was offline.

So, the hunt was refocused. This time we were searching diligently for something wrong on node 1. Another validation report was triggered, this time not just for networking. It revealed several interesting things, whereof two became crucial to solving the problem.

1: Cluster networking was now not working at all on node 1, and as a result network validation failed completely. Quote from the validation report:

An error occurred while executing the test.
There was an error initializing the network tests.

There was an error creating the server side agent (CPrepSrv).

Retrieving the COM class factory for remote component with CLSID {E1568352-586D-43E4-933F-8E6DC4DE317A} from machine Node 1 failed due to the following error: 80080005 Node 1.

2: There was a pending reboot. On all nodes. Quote:

The following servers have updates applied which are pending a reboot to take effect. It is recommended to reboot the servers to complete the patching process.
Node 1
Node 2
Node 2

Now, it is important to note that patching and installation of software updates on these servers are very tightly regulated due to SLA-concerns. Such work should always end with a reboot, and there are fixed service windows to adhere to with the exception of emergency updates of critical patches. No such updates had been applied recently.

Rummaging around in the registry, two potential culprits were discovered: Microsoft xps print spooler drivers and Microsoft Edge browser updates. Now why these updates were installed, and why they should kill DCOM and by extension failover clustering I do not now. But they did. After restarting node 1, everything started working as expected again. Findings in the application installation log would indicate that Microsoft Edge was the problem, but this has not been verified. It does however make more sense than the XPS print spooler.

Solution

If you have this problem and you believe that the network/firewall/routing is not the issue, run as cluster validation report and look for pending reboots. You find them under “Validate Software Update Levels” in the “System Configuration” section.

If you have some pending reboots, restart the nodes one by one. The problem should vanish.

If you do not have any pending reboots, try rebooting the nodes anyway. If that does not help, hunt down someone from networking and have them look for traffic. Successful traffic is usually not logged, but you can turn it on temporarily. Capturing the network traffic on all of the nodes and running a wireshark analysis would be my next action point if the networking people are unable to find anything.

Images

Last edit: Wednesday, November 1, 2023

Error 87 when installing SQL Server Failover Cluster instance

Problem

When trying to install a SQL Server Failover Cluster instance, the installer fails with error 87. Full error message from the summary log:

Feature: Database Engine Services
Status: Failed
Reason for failure: An error occurred during the setup process of the feature.
Next Step: Use the following information to resolve the error, uninstall this feature, and then run the setup process again.
Component name: SQL Server Database Engine Services Instance Features
Component error code: 87
Error description: The parameter is incorrect.

And from the GUI:

Analysis

This is an error I have not come across before.

The SQL Server instance is kind of installed, but it is not working, so you have to uninstall it and remove it from the cluster to try again. This is rather common when a clustered installation fails though.

As luck would have it, the culprit was easy to identify. A quick look in the cluster log (the one for the SQL Server installation, not the actual cluster log) revealed that the IP address supplied for the SQL Server instance was invalid. The cluster in question was located in a very small subnet, a /26 subnet. The IP allocated was .63. A /26 subnet contains 64 addresses. As you may or may not know, the first and last addresses in a subnet are reserved. The first address (.0) is the network address, and the last address (yes, that would be .63) is reserved as the broadcast address. It is also common to reserve the second or second to last address for a gateway, that would be .1 or .62 in this case.

Snippet from the log:

Solution

Allocate a different IP address. In our case that meant moving the cluster to a different subnet. as the subnet was completely filled to the brim.

Action plan:

Replace the IP on node 1
Wait for the cluster to arbitrate using the heartbeat VLAN or direct attach crossover cable.
Add an IP to the cluster goup resource group and give it an address in the new subnet.
Bring the new IP online
Replace the IP on node 2
Wait for the cluster to achieve quorum
Remove the old IP from the Cluster Group resource group
Validate the cluster
Re-try the SQL Server Failover Cluster instance installation.

Last edit: Monday, October 24, 2022

Quorum witness is online but does not work

Problem

The cluster appears to be working fine, but every 15 minutes or so the following events are logged on the node that owns the quorum witness disk:

Source:        Microsoft-Windows-Ntfs
Event ID:      98
Level:         Information
Description:
Volume WitnessDisk: (\Device\HarddiskVolumeNN) is healthy.  No action is needed.


Event ID:      1558
Source:        Microsoft-Windows-FailoverClustering
Level:         Warning
Description:
The cluster service detected a problem with the witness resource. The witness resource will be failed over to another node within the cluster in an attempt to reestablish access to cluster configuration data.


Log Name:      System
Event ID:      1069
Level:         Error
Description:
Cluster resource 'Witness' of type 'Physical Disk' in clustered role 'Cluster Group' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

Analysis

Some digging in the event log identified a disk error incident during a failover of the virtual machine:

Log Name:      System
Event ID:      1557
Level:         Error
Description:
Cluster service failed to update the cluster configuration data on the witness resource. Please ensure that the witness resource is online and accessible.


Log Name:      System
Source:        Microsoft-Windows-Ntfs
Event ID:      140
Description:
The system failed to flush data to the transaction log. Corruption may occur in VolumeId: WitnessDisk:, DeviceName: \Device\HarddiskVolumeNN.
({Device Busy}
The device is currently busy.)

And ultimately

Log Name:      System
Source:        Ntfs
Level:         Warning
Description:
{Delayed Write Failed} Windows was unable to save all the data for the file . The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.

It appears that the witness disk had a non-responsive period during the failover of the VM that caused an update to the cluster database to fail, thus rendering the copy of the cluster database contained on the witness disk corrupt. The disk itself is fine, thus there are no faults in the cluster resource status. everything appears hunky dory. There could be other causes leading to the same situation, but in this case the issue corelates with a VM failover.

We need to replace the defective database with a fresh copy from one of the nodes.

Solution

The usual warning: If this procedure is new to you, seek help before attempting to do this in production. If your cluster has other issues, messing with the quorum setup may end you in serious trouble. And if you have any doubts what so ever about the integrity of the drive/LUN, replace it with a new one.

Warnings aside, this procedure is usually safe, and as long as the cluster is otherwise healthy you can do this live without scheduling downtime.

Action plan

Remove the quorum witness from the cluster.
Check that the disk is listed as available storage and online.
Take ownership of the defective “cluster” folder on the root ofr the quorum witness drive.
Rename it to “oldCluster” in case we need to extract some data.
Add the disk back as a quorum witness
Wait to check that the error messages does not re-appear.
If they do re-appear
- Order a new LUN
- Add it to the cluster
- Use the new LUN as a quorum witness
- Remove the old LUN from the cluster

Last edit: Thursday, February 24, 2022

Cluster validation fails: Culture is not supported

Problem

The Failover Cluster validation report shows an error for Inventory, List Operating System Information:

An error occurred while executing the test. There was an error getting information about the operating systems on the node. Culture is not supported. Parameter name: culture 3072 (0x0c00) is an invalid culture identifier.

If you look at the summary for the offending node, you will find that Locale and Pagefiles are missing.

Analysis

Clearly there is something wrong with the locale settings on the offending node. As the sample shows the locale is set to nb-NO for Norwegian, Norway. I immediately suspected that to be the culprit. Most testing is done on en-US, and the rest of us that want to see a sane 24-hour clock without latin abbreviations and a date with the month where it should be located usually have to suffer.

I was unable to determine exactly where the badger was buried, but the solution was simple enough.

Solution

Part 1

Make sure that the Region & language and Date & Time settings (modern settings) are set correctly on all nodes. Be aware of differences between the offending node and working nodes.
Make sure that the System Locale is set correctly in the Control Panel, Region, Administrative window.
Make sure that Windows Update works and is enabled on all nodes.
Check the Languages list under Region & Language (modern settings). If it flashes “Windows update” under one or more of the languages, you still have a Windows Update problem or an internet access problem.
Try to validate the cluster again. If the error still appears, go to the next line.
Run Windows Update and Microsoft Update on all cluster nodes.
Restart all cluster nodes.
Try to validate the cluster again. If the error still appears, go to part 2.

Part 2

Make sure that you have completed part 1.
Go to Control Panel, Region.
In the Region dialog, locate Formats, Format.
Change the format to English (United States). Be sure to select the correct English locale.
Click OK.
Run the validation again. It should be successful.
Change the format back.
Run the validation again. It should be successful.

If you are still not successful, there is something seriously wrong with you operating system. I have not yet had a case where the above steps does not resolve the problem, but I suspect that running chkdsk, sfc -scannow or a full node re-installation would be next. Look into the rest of the validation report for clues to other problems.

Last edit: Friday, October 2, 2020

Reading the cluster log with Message Analyzer

Microsoft Message Analyzer, the successor to Network Monitor, has the ability to read a lot more than just network captures. In this post I will show how you can open set of cluster logs from a SQL Server Failover Cluster instance. If you are new to Message Analyzer I recommend that you glance at the Microsoft Message Analyzer operating guide while you read this post for additional information.

Side quest: Basic cluster problem remediation

Remember that the cluster log is a debug log used for analyzing what went wrong AFTER you get it all working again. In most cases your cluster should self-heal, and all you have to do is figure out what went wrong and what you should do different to prevent it from happening again. If your cluster is still down and you are reading this post, you are on the wrong path.

Below you will find a simplified action plan for getting your cluster back online. I will assume that you have exhausted you normal troubleshooting process to no avail, that your cluster is down and that you do not know why. The type of Failover Cluster is somewhat irrelevant for this action plan.

If your cluster has shared storage, call your SAN person and verify that all nodes can access the storage, and that there are no gremlins in the storage and fabric logs.
If something works and something does not, restart all nodes one by one. If you cannot restart a node, power cycle it.
If nothing works, shut down all nodes, then start one node. Just one.
- Verify that it has a valid connection to the rest of your environment, both networking and storage if applicable.
- If you have more than two nodes, start enough nodes to establish quorum. Usually n/2.
Verify that your hardware is working. Check OOB logs and blinking lights.
If the cluster is still not working, run a full cluster validation an correct any errors. If you had errors in the validation report BEFORE the cluster went down, your configuration is not supported and this is probably the reason for your predicament. Rectify all errors and try again.
If you gave warnings in your cluster validation report, check each one and make a decision whether or not to correct it. Some clusters will have warnings by design.
If your nodes are virtual, make sure that you are not using VMWare Raw Device Mapping. If you are, this is the probable cause of all your problems, both on this cluster and any personal problems you may have. Make the necessary changes to remove RDM.
If your nodes are virtual, make sure there are no snapshots/checkpoints. If you find any, remove them. Snapshots/checkpoints left running for > 12 hours may destroy a production cluster.
If the cluster is still not working, reformat, reinstall and restore.

Prerequisites and test environment

A running failover cluster. Any type of cluster will do, but I will use a SQL Server Failover Cluster Instance as a sample.
A workstation or server running Microsoft Message Analyzer 1.4 with all the current patches and updates as of march 2019.
The cluster nodes in the lab are named SLQ19-1 and SQL19-2 and are running Windows Server 2019 with a SQL Server 2019 CTP 2.2 Failover Cluster Instance.
To understand this post you need an understanding about how a Windows Failover Cluster works. If you have never looked at a cluster log before, this post will not teach you how to interpret the log. https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-2000-server/cc961673(v=technet.10) contains additional information about the cluster log. It is very old but still relevant, and at the time of writing the best source of information I could find. There is also an old article about the Resource Hosting Subsystem that may be of use here.

Obtaining the cluster log

To get the current cluster log, execute Get-ClusterLog -Destination C:\TEMP –SkipClusterState in an administrative PowerShell windows on one of the cluster nodes.
Be aware that the time zone in the log file will be Zulu time/GMT. MA should compensate for this.
The SkipClusterState option removes a lot of unparseable information from the file. If you are investigating a serious problem you may want to run a separate export without this option.
The TimeSpan option limits the log timespan. I used it to get a smaller sample set for this lab, and so should you if you know what timespan you want to investigate. You can also add a pre-filter in MA to limit the timespan.
You should now have one file for each cluster node in C:\Temp.
Copy the files to the machine running Message Analyzer.

Starting Message Analyzer and loading the logs

Open Message Analyzer.
Click New Session.
Enter a session name.
Click the Files-button.
Add the .log files.
Select the Cluster text log configuration.
Click Start to start parsing the files.
Wait while MA is parsing the files. Parsing time is determined by machine power and the size of the log, but it should normally take tens of minutes, not hours unless the file is hundreds of megabytes or more.

Filtering unparseable data

After MA is done parsing the file, the list looks a little disconcerting. All you see are red error messages:
Not to worry though, what you are looking at is just blank lines and other unparseable data from the file. You can read the unparseable data in the Details pane:
It is usually log data that is split on multiple lines in the log file and headers dividing different logs included in the file. A similar message as the sample above looks like this in the log file:
We can filter out these messages by adding #Timestamp to the filter pane and clicking Apply. This will filter out all messages without a timestamp.

Saving the session

To make the data load faster next time, we can save the parsed data and filter as a session. This will retain the workspace as we left it.

Click Save.
Select All Messages.
Click Save As.
Save the .matp file.

Looking for problems

The sample log files contains an incident where the iSCSI storage disappeared. This was triggered by a SAN reboot during a firmware update on a SAN without HA. I will go through some analysis of this issue to show how we can use MA to navigate the cluster logs.

To make it easier to read the log, we will add a Grouping Viewer. Click New Viewer, Grouping, Cluster Logs:
This will give you a Grouping pane on the left. Stat by clicking the Collapse All button:
Then expand the ERR group and start with the messages without a subcomponent tag. The hexadecimal numbers are the ProcessId of the process writing the error to the log. Usually this is a resource hosting subsystem process.
It is pretty clear that we have a storage problem:
To check which log contains one of these messages, select one message and look in the Details pane, Properties mode. Scroll down until you find the TraceSource property:
To read other messages logged at the same time, switch the Grouping viewer from Filter to Select mode:
If we click the same ERR group again, the Analysis Grid view will scroll to the first message in this group and mark all messages in the group.
The WARN InfoLevel for the RES SubComponent is also a good place to look for root causes:
If you want to see results from one log file only, add *TraceSource == “filename” to the grouping filter.

Last edit: Thursday, March 28, 2019

Failover Cluster: access to update the secure DNS Zone was denied.

Problem

After you have built a cluster, the Cluster Events page fills up with Event ID 1257 From FailoverClustering complaining about not being able to write to the DNS records in AD:

“Cluster network name resource failed registration of one or more associated DNS names(s) because the access to update the secure DNS Zone was denied.

Cluster Network name: X
DNS Zone: Y

Ensure that cluster name object (CNO) is granted permissions to the Secure DNS Zone.”

Solution

There may be other root cause scenarios, but in my case the problem was a static DNS reservation on the domain controller.

As usual, if you do not understand the action plan below, seek help or get educated before you continue. Your friendly local search engine is a nice place to start if you do not have a local cluster expert. This action plan includes actions that will take down parts of your cluster momentarily, so do not perform these steps on a production cluster during peak load. Schedule a maintenance window.

Identify the source of the static reservation and apply public shaming and/or pain as necessary to ensure that this does not happen again. Cluster DNS records should be dynamic.
Identify the static DNS record in your Active Directory Integrated DNS forward lookup zone. Ask for help from your DNS or AD team if necessary.
Delete the static record
Take the Cluster Name Object representing the DNS record offline in Failover Cluster manager (or by powershell). Be aware that any dependent resources will also go offline.
Bring everything back online. This should trigger a new DNS registration attempt. You could also wait for the cluster to attempt this automatically, but client connections may fail while you are waiting.
Verify that the DNS record is created as a dynamic record. It should have a current Timestamp.

Last edit: Thursday, February 7, 2019

iSCSI in the LAB

I sometimes run internal Windows Failover Clustering training, and I am sometimes asked “How can I test this at home when I do not have a SAN?”. As you may know, even though clusters without shared storage are in deed possible in the current year, some cluster types still rely heavily on shared storage. When it comes to SQL Server clusters for instance, a Failover Cluster instance (which relies on shared storage) is a lot easier to operate compared to an AOAG cluster which does not rely on shared storage. There are a lot of possible solutions to this problem, you could for instance use your home NAS as an iSCSI “SAN”, as many popular NAS boxes have iSCSI support. In this post however, I will focus on how to build a Windows Server 2019 vm with an iSCSI target for LAB usage. This is NOT intended as a guide for creating a production iSCSI server. It is most definitely a bad iSCSI server with poor performance and not suited for anything requiring production level performance.

“What will I need to do this?” I hear you ask. You will need a Windows Server 2019 VM with some spare storage mounted as a secondary drive, a domain controller VM and some cluster VMs. You could also use a physical machine if you want, but usually this kind of setup involves one physical machine running a set of VMs. In this setup I have four VMs:

DC03, the domain controller
SQL19-1 and SQL19-2, the cluster nodes for a SQL 2019 CTP 2.2 failover cluster instance
iSCSI19, the iSCSI server.

The domain controller and SQL Servers are not covered in this post. See my Failover Cluster series for information on how to build a cluster.

Action plan

Make sure that your domain controller and iSCSI client VMs are running.
Make sure that Failover Cluster features are installed on the client VMs.
Enable the iSCSI service on your cluster nodes that are going to use the iSCSI server. All you have to do is to start the iSCSI initiator program, it will ask you to enable the iSCSI service:

Create a Windows 2019 Server VM or physical server.
Add it to your lab domain.
Set a static IP, both v4 and v6. This is important for stability, iSCSI does not play well with DHCP. In fact, all servers involved in this should have static IPs to ensure that the iSCSI storage is reconnected properly at boot.
Install the iSCSI target feature using powershell.
- Install-WindowsFeature FS-iSCSITarget-Server
Add a virtual hard drive to serve as storage for the iSCSI volumes.
Initialize the drive, add a volume and format it. Assign the drive letter I:

Open Server Manager and navigate to File and Storage Services, iSCSI
Start the “New iSCSI virtual disk wizard”
Select the I: drive as a location for the new virtual disk

Select a name for you disk. Here, I have called it SQL19_Backup01

Set a disk size in accordance with your needs. As this is for a LAB environment, select a Dynamically expanding disk type.

Create a new target. I used SQL19Cluster as the name for the target.
Add the cluster nodes as initiators. You should have enabled iSCSI on the cluster nodes before this step (see above). The cluster nodes also has to be online for this to be successful.

As this is a lab we skip the authentication setup

Done!

On the cluster nodes, start the iSCSI initator.
Input the name of the iSCSI target and click the quick connect button.

Initialize, online and format the disk on one of the cluster nodes

Run storage validation in failover cluster manager to verify your setup. Make sure that your disk/disks are part of the storage validation. You may have to add the disk to the cluster and set it offline for validation to work.
Check that the disk fails over properly to the other node. It should appear with the same drive letter or mapped folder on both nodes, but only on one node at the time unless you convert it to a Cluster Shared Volume. (Cluster Shared Volumes are for Hyper-V Clusters).

Last edit: Saturday, February 2, 2019

Static IPv6 Cluster address

Problem

If you install a Windows Failover cluster using IPv6, you will get a dynamic IPv6 address as the standard failover clustering management tools only support dynamic IPv6 addresses. If you live in Microsoft Wonderland (a mythical land without any firewalls) this is fine, but for the rest of us, this is kind of impractical. We need a static address.

Solution

Run this script to add a static address. Then you have to find and remove the dynamic address. You can do this in Failover Cluster Manager or through powershell, but you have to identify it first so I cannot script it for you.

The script defaults to the Cluster Group resource group, but you can do the same for say a SQL Server instance or any other cluster group with IP addresses and network names.

#Add a static IPv6 address to the cluster group and Network name specified
$IPv6="STATIC ADDRESS HERE"
$IPv6Name="IP Address $IPv6" #Do not change this line
$Group= "Cluster Group"
$NetworkName = "Cluster Name"
                               
Add-ClusterResource -name $IPv6Name -ResourceType "IPv6 Address" -Group $Group
Get-ClusterResource -Name $IPv6Name | Get-ClusterParameter
Get-ClusterResource -Name $IPv6Name | Set-ClusterParameter -Multiple @{"PrefixLength" = "64";"Address"= $IPv6}
Get-ClusterResource -Name $IPv6Name | Start-ClusterResource
Stop-ClusterResource $NetworkName
Add-ClusterResourceDependency $IPv6Name -Resource $NetworkName
Start-ClusterResource $NetworkName

Last edit: Sunday, January 27, 2019

Interface Metric for cluster nodes

In place of the hopefully well-known binding order setting, Windows Server 2016 requires us to set the interface metric to prioritize traffic on multi-homed servers. This is especially relevant on cluster servers with a separate Internal/Heartbeat interface or LiveMigration interface.

First, list the interfaces and their current metric:

Then, change the metric of your domain-facing interface to a lower value than any other interface. In the example, we will set the metric for Public (index 10) to 14.

Set-NetIPInterface -ifIndex 10 -InterfaceMetric 14

Make sure that the domain-facing adapter has the lowest metric, and it should always be first on the list whenever an application sends network packets without a specified interface.

More about the default values

Last edit: Friday, December 14, 2018

Failover Cluster Checklist, Windows Server 2019

Introduction

This post was originally written for Windows 2012R2. This is a rework with updates for Windows 2019. It is currently a work in progress.

OK, so you want to install a cluster? This is not a “Should I build a cluster?” post, this is a “How to build a proper cluster” post. I like checklists, so I made a Windows Failover Cluster installation checklist. Some of the points have their own post, some are just a short sentence. I will add more details as time allows. The goal is to share my knowledge about how to build stable clusters. I may disagree with other best practices out there, but this list is based on my experience, what works in production and what does not. I use it to build new clusters, as well as troubleshooting clusters made by others. Clustering is so easy that anyone can build a working cluster these days, but building a stable production worthy cluster may still be like finding you way out of a maze. A difficult maze filled with ghosts, trolls and angry badgers.

There are some things you need to know about this post before you continue reading:

This list is made for production clusters. There is nothing stopping you from building a lab using this list, but if you do as I say, you will build a very expensive lab.
I work with SQL Server, Hyper-V and File clusters. This list may work for other kinds of clusters as well, but I have not tested it on recent versions.
This list was originally published in 2014 for Windows 2008R2 up until Windows 2012R2. It os now updated for Windows Server 2019. I will try to add version specific instructions when necessary.
This list is for physical clusters. I dislike virtual clusters, because most organizations are not clever enough to create functioning virtual production clusters that won’t fail miserably due to user error someday. (By “virtual clusters” I mean cluster nodes on top of hypervisors, not clustered hypervisors). It is however entirely possible to build virtual clusters using this list, especially if you employ technologies such as Virtual FC.
This is my checklist. I have spent more than a decade honing it, and it works very well for me. That does not guarantee that it will work for you. I welcome any comments on alternative approaches, but don’t expect me to agree with you.
This list is mostly written in a “How to do it” manner, and may be lacking in the “But why should I do it” department. This is due to several reasons, but mostly a lack of time on my part. I do however want you to know that there are several hours, if not days of work behind each point.
Updates will be made as I discover new information.
The list is chronological. That is, start at the top and make your way down the list. If you jump back and forth, you will not achieve the desired result.
This list is based on the LTSB (Long-term Servicing Branch) GUI version of Windows Server, not Core. You can build clusters on Core, but I do not recommend it. Clusters may be very finicky to troubleshoot when things go wrong, and doing so on Windows Core is like trying to paint a room through the keyhole. So unless you have the infrastructure and budget necessary to treat your physical servers as throw-away commodities I recommend installing the “Desktop Experience”. To elaborate, if you have trouble with a core server, you remove it and deploy a replacement server. All automated of course.
Understanding this list requires knowledge of Active Directory and basic knowledge of Failover Clustering.
There are many special cases not covered. This list is for the basic 2-10 node single datacenter cluster. The basic rules still apply though, even if you have nodes in four datacenters and use a hybrid cloud setup.

The design phase

In the design phase, there are a lot of decisions you have to make BEFORE you start building the cluster. These are just a few of them:

How many nodes do you need? Remember you need at least one standby node for HA (High Availability). Depending on the total number of nodes you may need several standby nodes. Some managers will complain about the extra nodes just sitting there unused, but they forget that they are there to provide HA. No matter the number of nodes, make sure the hardware is as equal as possible. I don’t care what the manual says, having cluster nodes with different hardware in them is a recipe for disaster. If possible, all nodes should be built on the same day by the same persons and have consecutive serial numbers.
How many network fabrics do you need? And how many can you afford? See Networks, teaming and heartbeats for clusters for more information. This is where most troublesome clusters fail.
Will you use shared storage? And what kind of shared storage? In short: FCOE is bad for you, ISCSI is relatively cheap, SMB3 is complicated and may be cheap, shared DAS/SAS is very cheap, FC is the enterprise norm and infiniband is for those who want very high performance at any cost. Note that the deployment cost for Infiniband in small deployments has fallen significantly in the last couple of years. In most cases you will have to use what is already in place in your datacenter though. And it is usually better to have something your storage guys are used to supporting. Just remember that storage is very important for your overall performance, no matter what kind of cluster. For file clusters, high throughput is important. For SQL Server, low latency is key and you should use FC or Infiniband.
What kind of hardware should you use in your cluster nodes? These are my opinions, based on my personal experience to date. My opinions on this change frequently as new generations are released, but here goes:
- Emulex should stop making any kind of hardware. It is all bad for you and bad for your cluster. If you are having trouble with cluster stability and you have Emulex made parts in your nodes, remove them at once.
- QLogic make good FC HBAs. If you have a FC SAN, QLogic HBAs are highly recommended. If you have QLogic network adapters on the other hand, use them for target practice.
- Broadcom network adapters used to be good, but the drivers for Windows are getting worse by the minute.
- Intel X560 is my current favorite network adapter. It is sold under many names, so check what chip is actually used on the cards offered by your server manufacturer.
- Use Brocade FC switches only. They are sold under many other brand names as well, I have seen them with both HP and IBM stickers.
- Use Cisco or HP ProCurve network switches, but do not use them for FC traffic.
- Make sure your nodes have local disk controllers with battery or flash backed cache. Entry level disk controllers are not worth the cardboard box they are delivered in and may slow down the most hard-core cluster.
- Intel Xeon CPUs currently reigns supreme for most applications. There are however some edge cases for SQL Server where AMD CPUs will perform better. I recommend reading Glenn Berry’s blogs for up to date SQL Server CPU information.
- HP, Lenovo and Dell all make reasonably good servers for clustering. Or, I should say equally bad, but better than the alternatives.
RACK or Blade?
- RACK servers
  - are easier to troubleshoot
  - are versatile
  - give you a lot of expansion options
  - are cheaper to buy
- Blade servers are
  - space efficient
  - cheaper to maintain if you rent rack space
  - easier to install
  - limited in terms of expansion options
Where should your nodes be located physically? I do not recommend putting them all in the same rack. The best solution is to put them in separate rooms within sub-millisecond network distance. You can also place them in separate data centers with a long distance between them if you do not use shared storage or use some kind of hybrid solution. I do not recommend SAN synchronization to data centers far, far away though, it is better to have synchronization higher up in the stack. If you only have one datacenter, place the nodes in different racks and make sure they have redundant power supplies.
Talking about power, your redundant power supplies should be connected to separate power circuits, preferably with each connected to an independent UPS.
What domain should your servers be member of, and which organizational unit should you use? Failover clustering will not work without Active Directory. No domain clusters are supported from W2019 but not recommended. You probably need AD for other stuff anyway.
The Active Directory role should NOT be installed on the cluster nodes. You should have at least two domain controllers, one of which should be a dedicated physical machine. I know that MS now supports virtualizing all your domain controllers, but that does not mean that you should do it, or that it is smart to do so. I would also recommend creating a separate OU for each cluster.
What account should you use to run the installation? Previously a separate cluster installation account was recommended, but with newer versions it is usually no problem using a regular sysadmin account. The account should be a domain administrator to make everything easy, but this checklist will work as long as you have local admin on the cluster nodes. (Be aware that some points require som form of AD write access).
And then there are a lot of product and project specifics, such as storage requirements, CPU and memory sizing and so on, all of which may affect your cluster design.

The actual checklist

All list items should be performed on each node in the cluster unless specified otherwise. You can do one node at the time or all at once until you get to cluster validation. All nodes should be ready when you run cluster validation. I find it easiest to remember everything by doing one list item for each node before I move on to the next, making notes along the way.

Mount the hardware
Set BIOS/UEFI settings as required by your environment. Remember to enable High Performance mode, otherwise you will be chasing performance gremlins.
If your cluster nodes are virtual machines, make sure that they are not allowed to be hosted by the same host. How you configure this will depend on your virtualization platform.
Install Windows Server
Copy any required media, drivers etc. to a folder on each node
Static or reserved IP addresses are recommended, bot IPv4 and IPv6.
If you are not able to use IPv6 to talk to your domain controllers, disable IPv6 completely in registry. See How to disable IPv6

Make sure all your drivers are installed using Device Manager.
Make sure you are running current BIOS, Remote Access, RAID, HBA and Network firmware according with your patch regime. If in doubt, use the latest available version from your server vendor. Do NOT download drivers and firmware from the chip vendor unless you are troubleshooting a specific problem.
Make sure your drivers are compatible with the firmware mentioned above.
Check whether the server is a member of the domain, and add it to the domain if necessary.
Activate a machine proxy if you use a proxy server to access the internet. See Proxy for cluster nodes for more information.
Activate RDP.
Create a firewall rule to allow ICMP (ping) on all interfaces regardless of profile.

New-NetFirewallRule -DisplayName "Allow ICMP all profiles IPv4" -Direction Inbound -Protocol ICMPv4  -Action Allow
New-NetFirewallRule -DisplayName "Allow ICMP all profiles IPv6" -Direction Inbound -Protocol ICMPv6  -Action Allow

Select the High performance power plan.
If virtual node, enable VRSS. If physical, enable RSS. If you are creating a Hyper-V cluster, enable VMQ as well. See https://lokna.no/?p=2464 for details.

Make sure that your nodes are located in the correct OU. The default “Computers” container is not the correct OU.
Add the failover cluster features:

Install-WindowsFeature -Name Failover-Clustering –IncludeManagementTools

Check the interface metric. Your domain facing team/adapter should have the lowest metric. See https://lokna.no/?p=2637
Disable NICs that are not in use
Install any prerequisites required by your shared storage. Check with your SAN admin for details.
Change page file settings according to Page file defaults
Install PSWindowsupdate and run it for Microsoft update.
Install cluster hotfixes. See Does your cluster have the recommended hotfixes?
If you are using shared storage, verify storage connections and MPIO in accordance with guidelines from your SAN vendor. Most SAN vendors have specific guidelines/whitepapers for Failover Clustering.
Make sure that you are connected to your shared storage on all nodes and have at least one LUN (drive) presented for validation.
Validate the configuration: Validating a Failover Cluster. Do not continue until your cluster passes validation. I have yet to see a production cluster without validation warnings, but you should document why you have each warning before you continue.
Create the cluster: Creating a Failover Cluster
Verify the Quorum configuration. Make sure dynamic quorum is enabled. You should always have a quorum witness drive (even if you don’t use it). The create cluster wizard will without fail select another quorum witness drive than the one you intended to use, so make sure to correct this as well.
Grant create computer object permissions to the cluster. This is necessary for installation of most clustered roles, and this is why each cluster should have its own OU.

Last edit: Wednesday, January 23, 2019

Scenario

Problem

Analysis

Solution

Images

Share this:

Like this:

Problem

Analysis

Solution

Share this:

Like this:

Problem

Analysis

Solution

Action plan

Share this:

Like this:

Problem

Analysis

Solution

Part 1

Part 2

Share this:

Like this:

Side quest: Basic cluster problem remediation

Prerequisites and test environment

Obtaining the cluster log

Starting Message Analyzer and loading the logs

Filtering unparseable data

Saving the session

Looking for problems

Share this:

Like this:

Problem

Solution

Share this:

Like this:

Action plan

Share this:

Like this:

Problem

Solution

Share this:

Like this:

Share this:

Like this:

Introduction

The design phase

The actual checklist

Share this:

Like this: