Removing a drive from a cluster group moves all cluster group resources to Available Storage

Problem

During routine maintenance on a SQL Server cluster we were planning to remove one of the clustered drives. We had previously replaced the SAN, and this disk was backed by an old storage unit that we wanted to decommission. So we made sure that there were no dependencies, right-clicked the drive in Failover Cluster Manager under the SQL Server role and pressed “Remove from SQL Server”. Promptly the drive vanished from view, together with all other cluster resources associated with the role…

After a slightly panicky check to make sure that the SQL Server instance was still running (it was), we started to wonder about what was happening. Running Get-ClusterResource in PowerShell revealed that all our missing resources had been moved to the “Available Storage” resource group.

image

We did a failover to verify that the instance was still working, and it gladly failed over with the Available Storage group. There is a total of 4 instances of SQL Server on the sample cluster pictured above.

Solution

The usual warning: Performing this procedure may result in an outage. If you do not understand the commands, read up on them before you try.

Move the resources back to the SQL Server resource group. If you move the SQL Server resource, that is the resource with the ResourceType SQL Server, all other dependent resources should follow. If your dependency settings are not configured correctly, you may have to move some of the resources independently.

Command: Get-ClusterResource “SQL Server (instance)”|Move-ClusterResource –Group “SQL Server (instance)”

Just replace Instance with the name of your SQL Server instance.

Then, run Get-ClusterResource|Sort-Object OwnerGroup, ResourceType to verify that all you resources are associated with the correct resource group. The result should look something like this. As a minimum, you should have an IP address, a network name, SQL Server, SQL Server Agent and one ore more Physical disk drives.

image

Event 20501 Hyper-V-VMMS

Problem

The following event is logged non-stop in the Hyper-V High Availability log:

Log Name:      Microsoft-Windows-Hyper-V-High-Availability-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          27.07.2017 12.59.35
Event ID:      20501
Task Category: None
Level:         Warning
Description:
Failed to register cluster name in the local user groups: A new member could not be added to a local group because the member has the wrong account type. (0x8007056C). Hyper-V will retry the operation.

image

Analysis

I got this in as an error report on a new Windows Server 2016 Hyper-V cluster that I had not built myself. I ran a full cluster validation report, and it returned this warning:

Validating network name resource Name: [Cluster resource] for Active Directory issues.

Validating create computer object permissions was not run because the Cluster network name is part of the local Administrators group. Ensure that the Cluster Network name has “Create Computer Object” permissions.

I then checked AD, and found that the cluster object did in fact have the Create Computer Object permissions mentioned in the message.

The event log error refers to the cluster computer object being a member of the local admins group. I checked, and found that it was the case. The nodes themselves were also added as local admins on all cluster nodes. That is, the computer objects for node 1, 2 and so on was a member of the local admins group on all nodes. My records show that this practice was necessary when using SOFS storage in 2012. It is not necessary for Hyper-V clusters using FC-based shared storage.

The permissions needed to create a cluster in AD

  • Local admin on all the cluster nodes
  • Create computer objects on the Computers container, the default container for new computers in AD. This could be changes, in which case you need permissions in the new container.
  • Read all properties permissions in the Computers container.
  • If you specify a specific OU for the cluster object, you need permissions in this OU in addition to the new objects container.
  • If your nodes are located in a specific OU, and not the Computers OU, you will also need permissions in the specific OU as the cluster object will be created in the OU where the nodes reside.

See Grant create computer object permissions to the cluster for more details.

Solution

As usual, a warning: If you do not understand these tasks and their possible ramifications, seek help from someone that does before you continue.

Solution 1, low impact

If it is difficult to destroy the cluster as it requires the VMs to be removed from the cluster temporarily, you can try this method. We do not know if there are other detrimental effects caused by not having the proper permissions when creating the cluster.

  • Remove the cluster object from the local admin on all cluster nodes.
  • Remove the cluster nodes from the local admin group on all nodes.
  • Make sure that the cluster object has create computer objects permissions on the OU in which the cluster object and nodes are located
  • Make sure that the cluster object and the cluster node computer objects are all located in the same OU.
  • Validate the cluster and make sure that it is all green.

Solution 2, high impact

Shotgun approach, removes any collateral damage from failed attempts at fixing the problem.

  • Migrate any VMs away from the cluster
  • Remove the cluster from VMM if it is a member.
  • Remove the “Create computer objects” permissions for the cluster object
  • Destroy the cluster.
  • Delete the cluster object from AD
  • Re-create the cluster with the same name and IP, using a domain admin account.
  • Add create computer objects and read all properties permissions to the new cluster object in the current OU. 
  • Validate the cluster and make sure it is all green.
  • Add the server to VMM if necessary.
  • Migrate the VMs back.

Power profile drift

Problem

I received an alarm from one of my SQL Servers about IO stall time measured in seconds and went to investigate. We have had trouble with HBA Firmware causing FC stalls previously, so I suspected another storage error. The server in question was running virtual FC, and a cascading error among the other servers on the same host seemed to confirm my initial hypothesis about a HBA problem on the host.

Analysis

The kernel mode CPU time on the host was high (the red part of the graph in Process Explorer), something that is also a pointer in the direction of storage problems. The storage minions found no issue on the SAN though. Yet another pointer towards a problem on the server itself. We restarted it twice, and the situation seemed to normalize. It was all written off as collateral damage from a VMWare fault that flooded the SAN with invalid packet some time ago. I moved one of the VMs back and let it simmer overnight. I felt overly cautious not moving them all back, but the next morning the test VM was running 80% PCU without getting anything done, and the CPU load on the host was  about 50%, running a 3 cpu vm on a 2×12 core host…

25C33835

I failed the test vm back to the spare host, and the load on the VM went down immediately:

image

At this point I was ready to take a trip to the room of servers and visit the host in person, and I was already planning a complete re-imaging of the node in my head. But then I decided to run CPU-Z first, and suddenly it all became clear.

image

 

The host is equipped with Intel Xeon E5-2690 v3 CPUs. Intel Ark informs me that the base clock is indeed 2,6GHz as reported by CPU-Z, and the turbo frequency is as high as 3,5GHz. A core speed of 1195MHz as shown in CPU-Z is usually an indicator of one of two things. Either someone has fiddled with the power saving settings, or there is something seriously wrong with the hardware.

A quick check of the power profile revealed that the server was running in the so called “balanced” mode, a mode that should be called “run-around-in-circles-and-do-nothing-mode” on servers. The question then becomes, why did this setting change?

image

My server setup checklist clearly states that server should run in High performance mode. And I had installed this particular server myself, so I know it was set correctly. The culprit was found to be a firmware upgrade installed some months back. It had the effect of resetting the power profile both in the BIOS and in Windows to the default setting. There was even a change to the fan profile, causing the server to become very hot. The server in question is a HP ProLiant DL380 Gen 9, and the ROM version is P89 v2.30 (09/13/2016).

Solution

  • First you should change the power profile to High performance in the control panel. This change requires a reboot.
  • While you are rebooting the server, enter the BIOS settings and check the power profile. I recommend Maximum Performance mode for production servers.
  • image
  • Then, check the Fan Profile
  • image
  • Try increased cooling. If your servers still get exceedingly hot, there is a maximum cooling mode, but this basically runs all the fans at maximum all the time.

This is how CPU-Z looks after the change:

image

And the core utilization on the host, this time with 8 active SQL Server VMs:

image

Cluster Quorum witness

Introduction

Since W2012R2 it is recommended that all clusters have a quorum witness regardless of the number of cluster nodes. As you may know, the purpose of the cluster witness is to ensure a majority vote in the cluster. If you have 2 nodes with one vote each and add a cluster witness you create a possibility for a majority vote. If you have 3 nodes on the other hand, adding a witness will remove the majority vote as you have 4 votes total and a possible stalemate.

If as stalemate occurs, the cluster nodes may revolt and you are unable to get it working without a force quorum, or you could take a node out behind the barn and end its misery. Not a nice situation at all. W2012R2 solves this predicament by dynamic vote assignments. As long as a quorum has been established, if votes disappear due to nodes going offline, it will turn the witness vote on and off to make sure that you always have a possibility for node majority. As long as you HAVE a disk witness that is.

There are three types of disk witnesses:

  • A SAN-connected shared witness disk, usually FC or iSCSI. Recommended for clusters that use shared SAN-based cluster disks for other purposes, otherwise not recommended. If this sounds like gibberish to you, you should use another type of witness.
  • A File share witness. Just a file share. Any type of file share would do, as long as it resides on a Windows server in the same domain as the cluster nodes. SOFS shares are recommended, but not necessary. DO NOT build a SOFS cluster for this purpose alone. You could create a VM for cluster witnesses, as each cluster witness is only about 5MiB, but it is best to find an existing physical server with a high uptime requirement in the same security zone as the cluster and create some normal SMB-shares there. I recommend a physical server because a lot of virtual servers are Hyper-V based, and having the disk witness on a vm in the cluster it is a witness for is obviously a bad idea.
  • Cloud Witness. New in W2016. If you have an Azure storage account and are able to allow the cluster nodes a connection to Azure, this is a good alternative. Especially for stretch clusters that are split between different rooms.

How to set up a simple SMB File share witness

  • Select a server to host the witness, or create one if necessary.
  • Create a folder somewhere on the server and give it a name that denotes its purpose:
  • image
  • Open the Advanced Sharing dialog
  • image
  • Enable sharing and change the permissions. Make sure that everyone is removed, and add the cluster computer object. Give the cluster computer object full control permissions
  • image
  • Open Failover Cluster manager and connect to the cluster
  • Select “Configure Cluster Quorum Settings:
  • image
  • Chose Select The Quorum Witness
    image

  • Select File Share Witness

  • image

  • Enter the path to the files share as \\server\share

  • image

  • Finish the wizard

  • Make sure the cluster witness is online:

  • image

  • Done!

DNS Operation refused on Cluster Aware Updating resource

Problem

On one of my Hyper-V clusters, Event ID 1196 from FailoverClustering is logged in the system log every fifteen minutes. The event lists the name of the resource and the error message “DNS operation refused”. What it is trying to tell me is that the cluster is unable to register a network name resource in DNS due to a DNS 9005 status code. A 9005 status code translates to “Operation refused”. In this case it was a CAU network name resource which is a part of Cluster Aware Updating.

Continue reading “DNS Operation refused on Cluster Aware Updating resource”

Cluster validation complains about KB3005628

Problem

You run failover cluster validation, and the report claims that one or more of the nodes are missing update KB3005628:

SNAGHTML338020e6

You try running Windows Update, but KB3005628 is not listed as an available update. You try downloading and installing it manually, but the installer quits without installing anything.

Analysis

KB3005628 is a fix for .Net framework 3.5, correcting a bug in KB2966827 and KB2966828. The problem is that the cluster node in question does not have the .Net framework 3.5 feature installed. It did however have  KB2966828 installed. As this is also a .Net 3.5 update, I wonder how it got installed in the first place. After reading more about KB3005628, it seems that KB2966828 could get installed even if .Net framework 3.5.1 is not installed.

So far, no matter what I do the validation report lists KB3005628 as missing on one of the cluster nodes. This may be a bug in the Failover Cluster validator.

Workaround

If the .Net Framework 3.5 feature is not installed, remove KB2966827 and KB2966828 manually from the affected node if they are installed. The validation report will still list KB3005628 as missing, but as the only function of KB3005628 is to remove KB2966827 and KB2966828 this poses no problem.

Cluster node fails to join cluster after boot

Problem

During a maintenance window, one of five Hyper-v cluster nodes failed to come out of maintenance mode after a reboot. SCVMM was used to facilitate maintenance mode. The system log shows the following error messages repeatedly:

Service Control Manager Event ID 7024

The Cluster Service service terminated with the following service-specific error:
Cannot create a file when that file already exists.

FailoverClustering Event ID 1070

The node failed to join failover cluster [clustername] due to error code ‘183’.

Service Control Manager Event ID 7031

The Cluster Service service terminated with the following service-specific error:
The Cluster Service service terminated unexpectedly.  It has done this 6377 time(s).  The following corrective action will be taken in 60000 milliseconds: Restart the service.

SNAGHTML115b89e

SNAGHTML1170717

SNAGHTML117b921

Analysis

I had high hopes of a quick fix. The cluster is relatively new, and we had recently changed the network architecture by adding a second switch. Thus, I instructed the tech who discovered the fault to try rebooting and checking that the server was able to reach the other nodes on all its interfaces. That river ran dry quickly though, as the local network was proved to be working perfectly.

Looking through the Windows Update log and the KBs for the installed updates did not reveal any clues. Making it even more suspicious, a cluster validation ensured me that all nodes had the same updates. Almost. Except for one. Hopefully I looked closer, but of course, this was some .Net framework update rollup missing on a different node.

I decided to give up all hope of an impressive five minute fix and venture into the realm of the cluster log. It is possible to read the cluster log in the event log system, but I highly recommend generating the text file and opening it in notepad++ or some other editor with search highlighting. I find it a lot easier on the eyes and mind. (Oh, and if you use the event log reader, DO NOT filter out information messages. For some reason, the eureka moment is often hidden in an information message.) The cluster log is a bit like the forbidden forest; it looks scary in daylight and even scarier in the dark during an unscheduled failover. It is easy to get lost down a track interpreting hundreds of “strange” messages, only to discover that they were benign. To make it worse, they covered a timespan of about half a second. The wrong second of course, not the one where the problem actually occurred. To say it mildly, the cluster service is very talkative. Especially so when something is wrong. As event 7031 told us, the cluster service is busy trying to start once a minute. Each try spews out thousands of log messages. The log I was working with had 574 942 lines and covered a timespan 68 minutes. That is about 8450 lines per service start.

Anyway, into the forbidden forest I went with nothing but a couple of event log messages to guide me. After a while, I isolated one cluster service startup attempt with correlated system event log data.   I discovered the following error:

ERR   mscs::GumAgent::ExecuteHandlerLocally: AlreadyExists(183)’ because of ‘already exists'([Servername]- LiveMigration Vlan 197)

The sneaky cluster service had tried fooling me into believing that the fault was file system related, when in fact it was a networking mess after all! You may recognize error 183 from system event id 1070 above. Thus we can conclude with fairly high probability that we have found the culprit. But what does it mean? I checked the name of the adapter in question and its teaming configuration. I ventured even further and checked for disconnected and missing network adapters, but none were to be found, neither in the device manager or in the registry. Then it struck med. The problem was staring me straight in the eye. The line above the error message, an innocent looking INFO message, was referring to the network adapter by an almost identical but nevertheless different name:

INFO [ClNet] Adapter Intel(R) Ethernet 10G 2P X520-k bNDC #2 – VLAN : VLAN197 LiveMigration is still attached to network VLAN197 Live Migration

A short but efficient third degree interrogation of the tech revealed that the network names had been changed some weeks prior to make them consistent on all cluster nodes. Ensuring network name consistency is in itself a noble task, but one that should be completed before the cluster is formed. It should of course be possible to change network names at any time, but for some reason the cluster service has not been able to persist the network configuration properly. As long as the node is kept running this poses no problem. However, when the node is rebooted for whatever reason, the cluster service gets lost running around in the forest looking for network adapters. Network adapters that are present but silent as the cluster service is calling them by the wrong name. I have not been able to figure out exactly what happens, not to mention what caused it in the first place, but I can guess. My guess is that different parts of the persisted cluster configuration came out of sync. This probably links network adapters to their cluster network names:

SNAGHTML151cc90

I have found this fault now on two clusters in as many days, and those are my first encounters with it. I suspect the fault is caused by a bug or “feature” introduced in a recent update to Win2012R2.

Solution

The solution is simple. Or at least I hope it is simple. A usual I strongly encourage you to seek assistance if you do not fully understand the steps below and their implications.

  • First you have to disable the cluster service to keep it out of the way. You can do this in srvmgr.msc, PowerShell, cmd.exe or through telepathy.
  • Wait for the cluster service to stop if it was running (or kill it manually if you are impatient).
  • Change the name of ALL network adapter that have an ipv4 and/or ipv6 address and are used by the cluster service. Changing the name of only the troublesome adapter mentioned in the log may not be enough. Make sure you do not mess around with any physical teaming members, SAN HBAs, virtual cluster adapters or anything else that is not directly used as a cluster network.
  • Before:image After: image
  • Enable the cluster service
  • Wait for the cluster service to start, or start it manually
  • The node should now be able to join the cluster
  • Stop the cluster service again  (properly this time, do not kill it)
  • Change the network adapter names back
  • Start the cluster service again
  • Verify that the node joins the cluster successfully
  • Restart the node to verify that the settings are persisted to disk.

Error 3930 installing SQL 2012 SP2 with CU3 in cluster

Problem

I was patching one of my clusters to SQL2012 SP2 and SP2 CU3 when something bad happened. This particular cluster is a 3 node cluster with a FCI Primary AOAG replica instance on node 1 and 2, and a stand alone Secondary AOAG replica instance on node 3. Node 3 is used for HADR when the shared storage or other shared infrastructure has an outage.

The update passed QAT with flying colors, but sadly that does not always guarantee a successful production run. My standard patch procedure for this cluster:

  • Patch node 3
  • Patch node 2 (passive FCI node)
  • AOAG failover to node 3, node 3 becomes AOAG Primary
  • FCI failover from node 1 to node 2
  • Patch node 1
  • FCI failover to node 1
  • AOAG failover to node 1

When I tried to fail over the FCI to node 2 (step 4 above), the instance failed. First, I was worried that the SP2 upgrade process may be very lengthy or slow and triggering the FCI timeouts. An inspection of the SLQ Server error log revealed that this was not the case. Instead, I was the victim of a dreaded master database failure:

015-01-12 01:28:02.82 spid7s      Database 'master' is upgrading script 'msdb110_upgrade.sql' from level 184552836 to level 184554932.
2015-01-12 01:28:02.82 spid7s      ----------------------------------
2015-01-12 01:28:02.82 spid7s      Starting execution of PRE_MSDB.SQL
2015-01-12 01:28:02.82 spid7s      ----------------------------------
2015-01-12 01:28:02.96 spid7s      Error: 3930, Severity: 16, State: 1.
2015-01-12 01:28:02.96 spid7s      The current transaction cannot be committed and cannot support operations that write to the log file. Roll back the transaction.
2015-01-12 01:28:02.96 spid7s      Error: 912, Severity: 21, State: 2.
2015-01-12 01:28:02.96 spid7s      Script level upgrade for database 'master' failed because upgrade step 'msdb110_upgrade.sql' encountered error 3930, state 1, severity 16. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the 'master' database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion.
2015-01-12 01:28:02.97 spid7s      Error: 3417, Severity: 21, State: 3.
2015-01-12 01:28:02.97 spid7s      Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online.
2015-01-12 01:28:02.97 spid7s      SQL Server shutdown has been initiated
2015-01-12 01:28:02.97 spid7s      SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.

Analysis

In case misbehaving SQL Server instances are able to smell fear, I am glad I was located several miles away from the datacenter at this point in time. While a rebuild of master is certainly doable even in a complex setup such as this, it is not something you want to do at 2am without a detailed plan if you don’t have to. Thus, I tried failing the instance back to node 1 (running SP1 CU 11). To my amazement it came online straight away. I have seen similar issues reduce clustered instances to an unrecognizable puddle of zeros and ones in a corner on the SAN drive, so this was a welcome surprise. Feeling lucky, I tried another failover to node 2, only to be greeted with another failure and the exact same errors in the log. A quick search revealed several similar issues, but no exact matches and no feasible solutions. The closest was a suggestion to disable replication during the upgrade. As you probably know, AOAG is just replication in a fancy dress, so I went looking for my Disaster Recovery Runbook that contains ready made scripts and plans for disabling and re-enabling AOAG. My only problem is that disabling AOAG will take down the AOAG listener, thus disconnecting all clients. Such antics results in grumpy client systems, web service downtimes and a lot of paperwork for instance reviews, and is therefore something to avoid if at all possible. Just for the fun of it, I decided to try making Node 2 the AOAG Primary during the upgrade. To my astonishment, this worked like a charm. Crisis (and paperwork) averted.

Solution

You have to promote the FCI to AOAG Primary during the upgrade from SP2 to SP1. The upgrade is triggered by failing the FCI over from a node running SP1 to a node running SP2, in my case the failover from node 1 to node 2 after patching node 2.

Sadly, there is no fixed procedure for patching failover cluster instances. Some patches will only install on the active FCI node, and will then continue to patch all nodes automatically. But most patches follow the recipe above, where the passive node(s) are patched first.

This issue will probably not affect “clean” AOAG or FCI clusters where you only apply one technology. If you use FCI with replication on the other hand, you may experience the same issue.

Definitions

AOAG = Always On Availability Group

FCI = Failover cluster Instance

HADR = High Availability / Disaster Recovery

What SMB version is actually used?

To verify which SMB version is in use for a specific fileshare/connection, run the following powershell command:

Get-SmbConnection |select ShareName, Dialect

You can run this command on both the client and the server. A client/server connection will use the highest version supported by both client and server. If the client supports up to v3.02, but the server is only able to support v3.00, v3.00 will be used for the connection.

The Get-Smbconnection commandlet supports a several other outputs, use select * to list them all.

Sample output

SNAGHTML3f409e4d

This is from a Win2012R2 client, connected to a share on a Win2012 cluster with multichannel support.

Grant create computer object permissions to the cluster

This post is part of the Failover Cluster Checklist series.

 

The Failover Cluster computer object needs to be granted the appropriate permissions necessary to create cluster resource objects (computers). Some resource objects can be staged, others cannot be staged. This depends on the OS version and resource type. The easiest solution is to place each cluster in a separate OU, and give the cluster permissions to create objects in that OU only.

How to do it

  • If necessary, create a new OU and move all cluster nodes and cluster resource objects to the new OU.
  • Enable view advanced features in ADUaC.

clip_image001

  • Open the Advanced Security Settings for the OU.

clip_image002

  • Add the cluster name machine object, and grant the Create Computer objects permission.

clip_image003

  • Make sure the cluster machine Object has been granted the Read all Properties permission.
    image