Error 87 when installing SQL Server Failover Cluster instance

Problem

When trying to install a SQL Server Failover Cluster instance, the installer fails with error 87. Full error message from the summary log:

Feature: Database Engine Services
Status: Failed
Reason for failure: An error occurred during the setup process of the feature.
Next Step: Use the following information to resolve the error, uninstall this feature, and then run the setup process again.
Component name: SQL Server Database Engine Services Instance Features
Component error code: 87
Error description: The parameter is incorrect.


Analysis

This is an error I have not come across before.

The SQL Server instance is sort of installed, but it does not work, so you have to uninstall it and remove it from the cluster before you can try again. This is rather common when a clustered installation fails, though.

As luck would have it, the culprit was easy to identify. A quick look in the cluster log (the one for the SQL Server installation, not the actual cluster log) revealed that the IP address supplied for the SQL Server instance was invalid. The cluster in question was located in a very small subnet, a /26. The IP allocated was .63, and a /26 subnet contains 64 addresses. As you may or may not know, the first and last addresses in a subnet are reserved: the first address (.0) is the network address, and the last address (yes, that would be .63) is reserved as the broadcast address. It is also common to reserve the second or second to last address for a gateway; that would be .1 or .62 in this case.
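
If you want to double-check the math, here is a quick PowerShell sketch; the numbers are generic and not tied to the real subnet:

# A /26 prefix leaves 32 - 26 = 6 host bits, i.e. 2^6 = 64 addresses per subnet.
$prefixLength = 26
$addresses    = [math]::Pow(2, 32 - $prefixLength)    # 64
"Total addresses in the subnet : $addresses"
"Usable host addresses         : $($addresses - 2)"   # .0 is the network address, .63 the broadcast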


Solution

Allocate a different IP address. In our case that meant moving the cluster to a different subnet, as the existing subnet was filled to the brim.

Action plan:

  • Replace the IP on node 1
  • Wait for the cluster to arbitrate using the heartbeat VLAN or direct attach crossover cable.
  • Add an IP address resource to the Cluster Group resource group and give it an address in the new subnet (see the PowerShell sketch after this list).
  • Bring the new IP online
  • Replace the IP on node 2
  • Wait for the cluster to achieve quorum
  • Remove the old IP from the Cluster Group resource group
  • Validate the cluster
  • Re-try the SQL Server Failover Cluster instance installation.
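
A hedged sketch of the middle steps using the FailoverClustering module; the resource names, address and subnet mask below are placeholders, not values from this incident:

# Add a new IP Address resource to the core Cluster Group and point it at the new subnet
Add-ClusterResource -Name "Cluster IP (new subnet)" -ResourceType "IP Address" -Group "Cluster Group"
Get-ClusterResource "Cluster IP (new subnet)" |
    Set-ClusterParameter -Multiple @{ Address = "192.0.2.10"; SubnetMask = "255.255.255.0"; EnableDhcp = 0 }

# Let the cluster name depend on either the old or the new address, then bring the new one online
Set-ClusterResourceDependency -Resource "Cluster Name" -Dependency "[Cluster IP Address] or [Cluster IP (new subnet)]"
Start-ClusterResource "Cluster IP (new subnet)"

# When both nodes are on the new subnet and the cluster is healthy, remove the old address
Remove-ClusterResource "Cluster IP Address"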

Quorum witness is online but does not work

Problem

The cluster appears to be working fine, but every 15 minutes or so the following events are logged on the node that owns the quorum witness disk:

Source:        Microsoft-Windows-Ntfs
Event ID:      98
Level:         Information
Description:
Volume WitnessDisk: (\Device\HarddiskVolumeNN) is healthy.  No action is needed.


Event ID:      1558
Source:        Microsoft-Windows-FailoverClustering
Level:         Warning
Description:
The cluster service detected a problem with the witness resource. The witness resource will be failed over to another node within the cluster in an attempt to reestablish access to cluster configuration data.


Log Name:      System
Event ID:      1069
Level:         Error
Description:
Cluster resource 'Witness' of type 'Physical Disk' in clustered role 'Cluster Group' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

Analysis

Some digging in the event log identified a disk error incident during a failover of the virtual machine:

Log Name:      System
Event ID:      1557
Level:         Error
Description:
Cluster service failed to update the cluster configuration data on the witness resource. Please ensure that the witness resource is online and accessible.


Log Name:      System
Source:        Microsoft-Windows-Ntfs
Event ID:      140
Description:
The system failed to flush data to the transaction log. Corruption may occur in VolumeId: WitnessDisk:, DeviceName: \Device\HarddiskVolumeNN.
({Device Busy}
The device is currently busy.)

And ultimately

Log Name:      System
Source:        Ntfs
Level:         Warning
Description:
{Delayed Write Failed} Windows was unable to save all the data for the file . The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.

It appears that the witness disk had a non-responsive period during the failover of the VM, which caused an update to the cluster database to fail and left the copy of the cluster database on the witness disk corrupt. The disk itself is fine, so there are no faults in the cluster resource status; everything appears hunky dory. There could be other causes leading to the same situation, but in this case the issue correlates with a VM failover.

We need to replace the defective database with a fresh copy from one of the nodes.

Solution

The usual warning: If this procedure is new to you, seek help before attempting to do this in production. If your cluster has other issues, messing with the quorum setup may land you in serious trouble. And if you have any doubts whatsoever about the integrity of the drive/LUN, replace it with a new one.

Warnings aside, this procedure is usually safe, and as long as the cluster is otherwise healthy you can do this live without scheduling downtime.

Action plan

  • Remove the quorum witness from the cluster.
  • Check that the disk is listed as available storage and online.
  • Take ownership of the defective “cluster” folder on the root of the quorum witness drive.
  • Rename it to “oldCluster” in case we need to extract some data.
  • Add the disk back as a quorum witness (see the PowerShell sketch after this list).
  • Wait and check that the error messages do not re-appear.
  • If they do re-appear
    • Order a new LUN
    • Add it to the cluster
    • Use the new LUN as a quorum witness
    • Remove the old LUN from the cluster
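
The same procedure as a rough PowerShell sketch; the witness disk resource name “Witness” is an assumption, check yours with Get-ClusterResource first:

Set-ClusterQuorum -NoWitness                # remove the disk witness; the cluster runs on node majority for now
Get-ClusterResource "Witness"               # verify that the disk is online in Available Storage
# ...take ownership of the \Cluster folder on the witness volume and rename it to \oldCluster here...
Set-ClusterQuorum -DiskWitness "Witness"    # add the disk back; a fresh copy of the cluster database is written to it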

Reading the cluster log with Message Analyzer

Microsoft Message Analyzer, the successor to Network Monitor, has the ability to read a lot more than just network captures. In this post I will show how you can open a set of cluster logs from a SQL Server Failover Cluster instance. If you are new to Message Analyzer, I recommend that you glance at the Microsoft Message Analyzer operating guide while you read this post for additional information.

Side quest: Basic cluster problem remediation

Remember that the cluster log is a debug log used for analyzing what went wrong AFTER you get it all working again. In most cases your cluster should self-heal, and all you have to do is figure out what went wrong and what you should do differently to prevent it from happening again. If your cluster is still down and you are reading this post, you are on the wrong path.

Below you will find a simplified action plan for getting your cluster back online. I will assume that you have exhausted your normal troubleshooting process to no avail, that your cluster is down and that you do not know why. The type of Failover Cluster is somewhat irrelevant for this action plan. A few PowerShell commands that are handy during this kind of triage are listed after the action plan.

  • If your cluster has shared storage, call your SAN person and verify that all nodes can access the storage, and that there are no gremlins in the storage and fabric logs.
  • If something works and something does not, restart all nodes one by one. If you cannot restart a node, power cycle it.
  • If nothing works, shut down all nodes, then start one node. Just one.
    • Verify that it has a valid connection to the rest of your environment, both networking and storage if applicable.
    • If you have more than two nodes, start enough nodes to establish quorum. Usually n/2 + 1, or n/2 if you have a witness.
  • Verify that your hardware is working. Check OOB logs and blinking lights.
  • If the cluster is still not working, run a full cluster validation and correct any errors. If you had errors in the validation report BEFORE the cluster went down, your configuration is not supported and this is probably the reason for your predicament. Rectify all errors and try again.
  • If you have warnings in your cluster validation report, check each one and make a decision whether or not to correct it. Some clusters will have warnings by design.
  • If your nodes are virtual, make sure that you are not using VMWare Raw Device Mapping. If you are, this is the probable cause of all your problems, both on this cluster and any personal problems you may have. Make the necessary changes to remove RDM.
  • If your nodes are virtual, make sure there are no snapshots/checkpoints. If you find any, remove them. Snapshots/checkpoints left running for > 12 hours may destroy a production cluster.
  • If the cluster is still not working, reformat, reinstall and restore.
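
A few commands that help during this kind of triage, assuming the FailoverClustering PowerShell module is available on at least one node:

Get-ClusterNode                                         # which nodes are Up, Down or Paused
Get-ClusterGroup                                        # state of the roles, including the core Cluster Group
Get-ClusterResource | Where-Object State -ne "Online"   # any failed or offline resources
Get-ClusterLog -Destination C:\TEMP                     # collect the debug log for later analysis
Test-Cluster                                            # full cluster validation, as mentioned in the list above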

Prerequisites and test environment

  • A running failover cluster. Any type of cluster will do, but I will use a SQL Server Failover Cluster Instance as a sample.
  • A workstation or server running Microsoft Message Analyzer 1.4 with all the current patches and updates as of March 2019.
  • The cluster nodes in the lab are named SQL19-1 and SQL19-2 and are running Windows Server 2019 with a SQL Server 2019 CTP 2.2 Failover Cluster Instance.
  • To understand this post you need an understanding of how a Windows Failover Cluster works. If you have never looked at a cluster log before, this post will not teach you how to interpret the log. https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-2000-server/cc961673(v=technet.10) contains additional information about the cluster log. It is very old but still relevant, and at the time of writing the best source of information I could find. There is also an old article about the Resource Hosting Subsystem that may be of use here.

Obtaining the cluster log

  • To get the current cluster log, execute Get-ClusterLog -Destination C:\TEMP -SkipClusterState in an administrative PowerShell window on one of the cluster nodes.
  • Be aware that the time zone in the log file will be Zulu time/GMT. MA should compensate for this.
  • The SkipClusterState option removes a lot of unparseable information from the file. If you are investigating a serious problem you may want to run a separate export without this option.
  • The TimeSpan option limits the log timespan. I used it to get a smaller sample set for this lab, and so should you if you know what timespan you want to investigate. You can also add a pre-filter in MA to limit the timespan. See the example after this list.
  • You should now have one file for each cluster node in C:\Temp.
  • Copy the files to the machine running Message Analyzer.
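
For reference, here is a combined example of the options mentioned above; the 240-minute time span is arbitrary:

Get-ClusterLog -Destination C:\TEMP -TimeSpan 240 -SkipClusterState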

Starting Message Analyzer and loading the logs

  • Open Message Analyzer.
  • Click New Session.
  • Enter a session name.
  • Click the Files button.
  • Add the .log files.
  • Select the Cluster text log configuration.
  • Click Start to start parsing the files.
  • Wait while MA parses the files. Parsing time is determined by machine power and the size of the log, but it should normally take tens of minutes, not hours, unless the file is hundreds of megabytes or more.

Filtering unparseable data

  • After MA is done parsing the files, the list looks a little disconcerting: all you see are red error messages.
  • Not to worry though, what you are looking at is just blank lines and other unparseable data from the file. You can read the unparseable data in the Details pane.
  • It is usually log data that is split across multiple lines in the log file, and headers dividing the different logs included in the file.
  • We can filter out these messages by adding #Timestamp to the filter pane and clicking Apply. This will filter out all messages without a timestamp.

Saving the session

To make the data load faster next time, we can save the parsed data and filter as a session. This will retain the workspace as we left it.

  • Click Save.
  • Select All Messages.
  • Click Save As.
  • Save the .matp file.

Looking for problems

The sample log files contain an incident where the iSCSI storage disappeared. This was triggered by a SAN reboot during a firmware update on a SAN without HA. I will go through some analysis of this issue to show how we can use MA to navigate the cluster logs.

  • To make it easier to read the log, we will add a Grouping Viewer: click New Viewer, Grouping, Cluster Logs.
  • This will give you a Grouping pane on the left. Start by clicking the Collapse All button.
  • Then expand the ERR group and start with the messages without a subcomponent tag. The hexadecimal numbers are the ProcessId of the process writing the error to the log. Usually this is a resource hosting subsystem process.
  • It is pretty clear that we have a storage problem.
  • To check which log contains one of these messages, select one message and look in the Details pane, Properties mode. Scroll down until you find the TraceSource property.
  • To read other messages logged at the same time, switch the Grouping viewer from Filter to Select mode.
  • If we click the same ERR group again, the Analysis Grid view will scroll to the first message in this group and mark all messages in the group.
  • The WARN InfoLevel for the RES SubComponent is also a good place to look for root causes.
  • If you want to see results from one log file only, add *TraceSource == “filename” to the grouping filter.

Failover Cluster: access to update the secure DNS Zone was denied.

Problem

After you have built a cluster, the Cluster Events page fills up with Event ID 1257 from FailoverClustering, complaining about not being able to write to the DNS records in AD:

“Cluster network name resource failed registration of one or more associated DNS name(s) because the access to update the secure DNS Zone was denied.


Cluster Network name: X
DNS Zone: Y


Ensure that cluster name object (CNO) is granted permissions to the Secure DNS Zone.”


Solution

There may be other root cause scenarios, but in my case the problem was a static DNS reservation on the domain controller.

As usual, if you do not understand the action plan below, seek help or get educated before you continue. Your friendly local search engine is a nice place to start if you do not have a local cluster expert. This action plan includes actions that will take down parts of your cluster momentarily, so do not perform these steps on a production cluster during peak load. Schedule a maintenance window.

  • Identify the source of the static reservation and apply public shaming and/or pain as necessary to ensure that this does not happen again. Cluster DNS records should be dynamic.
  • Identify the static DNS record in your Active Directory Integrated DNS forward lookup zone. Ask for help from your DNS or AD team if necessary.
  • Delete the static record
  • Take the Cluster Name Object representing the DNS record offline in Failover Cluster Manager (or with PowerShell). Be aware that any dependent resources will also go offline.
  • Bring everything back online. This should trigger a new DNS registration attempt (see the PowerShell sketch after this list). You could also wait for the cluster to attempt this automatically, but client connections may fail while you are waiting.
  • Verify that the DNS record is created as a dynamic record. It should have a current Timestamp.
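
A hedged PowerShell sketch of the offline/online cycle; “Cluster Name” is the default name of the core network name resource, substitute your own resource name if it differs:

Stop-ClusterResource "Cluster Name"                      # resources that depend on the network name go offline as well
Start-ClusterResource "Cluster Name"                     # triggers a new dynamic DNS registration attempt
Update-ClusterNetworkNameResource -Name "Cluster Name"   # or force a re-registration without an offline/online cycle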

iSCSI in the LAB

I sometimes run internal Windows Failover Clustering training, and I am sometimes asked “How can I test this at home when I do not have a SAN?”. As you may know, even though clusters without shared storage are indeed possible in the current year, some cluster types still rely heavily on shared storage. When it comes to SQL Server clusters for instance, a Failover Cluster instance (which relies on shared storage) is a lot easier to operate than an AOAG cluster, which does not rely on shared storage. There are a lot of possible solutions to this problem; you could for instance use your home NAS as an iSCSI “SAN”, as many popular NAS boxes have iSCSI support. In this post however, I will focus on how to build a Windows Server 2019 VM with an iSCSI target for LAB usage. This is NOT intended as a guide for creating a production iSCSI server. It is most definitely a bad iSCSI server with poor performance and not suited for anything requiring production level performance.

“What will I need to do this?” I hear you ask. You will need a Windows Server 2019 VM with some spare storage mounted as a secondary drive, a domain controller VM and some cluster VMs. You could also use a physical machine if you want, but usually this kind of setup involves one physical machine running a set of VMs. In this setup I have four VMs:

  • DC03, the domain controller
  • SQL19-1 and SQL19-2, the cluster nodes for a SQL 2019 CTP 2.2 failover cluster instance
  • iSCSI19, the iSCSI server.

The domain controller and SQL Servers are not covered in this post. See my Failover Cluster series for information on how to build a cluster.

Action plan

  • Make sure that your domain controller and iSCSI client VMs are running.
  • Make sure that Failover Cluster features are installed on the client VMs.
  • Enable the iSCSI service on your cluster nodes that are going to use the iSCSI server. All you have to do is start the iSCSI Initiator program; it will ask you to enable the iSCSI service.

  • Create a Windows 2019 Server VM or physical server.
  • Add it to your lab domain.
  • Set a static IP, both v4 and v6. This is important for stability; iSCSI does not play well with DHCP. In fact, all servers involved in this should have static IPs to ensure that the iSCSI storage is reconnected properly at boot.
  • Install the iSCSI target feature using PowerShell.
    • Install-WindowsFeature FS-iSCSITarget-Server
  • Add a virtual hard drive to serve as storage for the iSCSI volumes.
  • Initialize the drive, add a volume and format it. Assign the drive letter I:


  • Open Server Manager and navigate to File and Storage Services, iSCSI
  • Start the “New iSCSI virtual disk wizard”
  • Select the I: drive as a location for the new virtual disk


  • Select a name for your disk. Here, I have called it SQL19_Backup01.

  • Set a disk size in accordance with your needs. As this is for a LAB environment, select a Dynamically expanding disk type.

  • Create a new target. I used SQL19Cluster as the name for the target.
  • Add the cluster nodes as initiators. You should have enabled iSCSI on the cluster nodes before this step (see above). The cluster nodes also have to be online for this to be successful.

  • As this is a lab, we skip the authentication setup.

  • Done! A rough PowerShell equivalent of the wizard and initiator steps is sketched after this action plan.

  • On the cluster nodes, start the iSCSI Initiator.
  • Input the name of the iSCSI target and click the quick connect button.

  • Initialize, online and format the disk on one of the cluster nodes

  • Run storage validation in Failover Cluster Manager to verify your setup. Make sure that your disk/disks are part of the storage validation. You may have to add the disk to the cluster and set it offline for validation to work.
  • Check that the disk fails over properly to the other node. It should appear with the same drive letter or mapped folder on both nodes, but only on one node at a time unless you convert it to a Cluster Shared Volume. (Cluster Shared Volumes are for Hyper-V Clusters.)
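
For those who prefer a script, here is a rough PowerShell equivalent of the wizard and initiator steps above; every name, path, size and IQN below is a lab placeholder:

# On the iSCSI target server (iSCSI19)
Install-WindowsFeature FS-iSCSITarget-Server
New-IscsiVirtualDisk -Path "I:\iSCSIVirtualDisks\SQL19_Backup01.vhdx" -SizeBytes 50GB   # dynamically expanding by default
New-IscsiServerTarget -TargetName "SQL19Cluster" -InitiatorIds `
    "IQN:iqn.1991-05.com.microsoft:sql19-1.lab.local", "IQN:iqn.1991-05.com.microsoft:sql19-2.lab.local"
Add-IscsiVirtualDiskTargetMapping -TargetName "SQL19Cluster" -Path "I:\iSCSIVirtualDisks\SQL19_Backup01.vhdx"

# On each cluster node
Set-Service MSiSCSI -StartupType Automatic
Start-Service MSiSCSI
New-IscsiTargetPortal -TargetPortalAddress "iscsi19.lab.local"
Get-IscsiTarget | Connect-IscsiTarget -IsPersistent $true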

Hyper-V VM with VirtualFC fails to start

Problem

This is just a quick note to remember the solution and EventIDs.

The VM fails to start complaining about failed resources or resource not available in Failover Cluster manager. Analysis of the event log reveals messages related to VirtualFC:

  • EventID 32110 from Hyper-V-SynthFC: ‘VMName’: NPIV virtual port operation on virtual port (WWN) failed with an error: The world wide port name already exists on the fabric. (Virtual machine ID ID)
  • EventID 32265 from Hyper-V-SynthFC: ‘VMName’: Virtual port (WWN) creation failed with a NPIV error(Virtual machine ID ID).
  • EventID 32100 from Hyper-V-VMMS: ‘VMNAME’: NPIV virtual port operation on virtual port (WWN) failed with an unknown error. (Virtual machine ID ID)
  • EventID 1205 from Microsoft-Windows-FailoverClustering: The Cluster service failed to bring clustered role ‘SCVMM VM Name Resources’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

Analysis

The events point in the direction of Virtual Fibre Channel or Fibre Channel issues. After a while we realised that one of the nodes in the cluster did not release the WWN when a VM migrated away from it. Further analysis revealed that the FC driver versions were different.


Solution

  • Make sure all cluster nodes are running the exact same driver and firmware for the SAN and network adapters. This is crucial for failovers to function smoothly.
  • To “release” the stuck WWNs you have to reboot the offending node. To figure out which node is holding the WWN you have to consult the FC Switch logs. Or you could just do a rolling restart and restart all nodes until it starts working.
  • I have successfully worked around the problem by removing and re-adding the virtual FC adapters on the VM that is not working (see the sketch after this list). I do not know why this resolved the problem.
  • Another workaround would be to change the WWN on the virtual FC adapters. You would of course have to make this change at the SAN side as well.
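
A hedged sketch of the remove/re-add workaround; “VMName” and “FabricA” are placeholders for the VM and the virtual SAN name, and the zoning/masking on the SAN side must be updated for the new WWNs afterwards:

Get-VMFibreChannelHba -VMName "VMName"                       # note the current WWNs first
Get-VMFibreChannelHba -VMName "VMName" | Remove-VMFibreChannelHba
Add-VMFibreChannelHba -VMName "VMName" -SanName "FabricA"    # new WWNs are generated for the new adapter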

Primary replica is not joined to the Availability group, or clusters past morphing into clusters present

Problem

I was upgrading an Availability group from SQL 2012 on Win 2012R2 to SQL 2016 on Win2016. I had expected to create the new AOAG as a separate cluster and move the data manually, but the users are always complaining when I want to use my allotted downtime quotas, so I decided to try a rolling upgrade instead. This post is a journal of some of the perils I encountered along the way, and how I overcame them. There were countless others, but most of them were related to crappy hardware, wrong hardware being delivered, missing LUNS on the SAN, delusional people who believe they can lock out DBAs from supporting systems, dragons, angry badgers, solar flares and whichever politician you dislike the most. Anyways, on with the tale of clusters past morphing into clusters present…

I started with adding the new node to the failover cluster. This went surprisingly well, in spite of the old servers being at least two generations older than my new rack servers. Sadly, both the new and the old servers are made by the evil wizards behind the silver slanted E due to factors outside of my control. But I digress. The cluster join went flawlessly. There were some yellow complaints in the cluster validation scroll about the nodes not having the same OS version, but everything worked.

Then came adding the new server as a replica in the availability group. This is done from the primary replica, and I just uttered a previously prepared spell from the book of disaster recovery belonging to this cluster, adding the name of the new node. As far as I can remember this is just the result of the standard “Add replica” wizard. The spell ran without complaints, and my new node was online.

This is the point where it all went to heck in a small hand-basket carried by an angry badger. I noticed a yellow warning next to the new node in the AOAG dashboard. But as the databases were all in the synchronizing state on the new replica, I believed this to be a note complaining about the OS version. I was wrong. In my ignorance, I failed over to the new node and had the application team minions run some tests. They came back positive, so I removed the old nodes in preparation for adding the last one. I even ran the Update-ClusterFunctionalLevel PowerShell command without issues. But the warning persisted. This is the content of the warning:

Availability replica not joined.


And it was no longer a lone warning; the AOAG dashboard did not look pretty, as both the old nodes refused to accept the new node as their new primary replica.

Analysis

As far as I can tell, the join AOAG script failed in some way. It did not report any errors, but still, there is no doubt that something did go wrong.

The solution as reported by MSDN is simple: just join the availability group by casting the "alter availability group groupname join" spell from the secondary replica that is not joined. The attentive reader has probably already realized that this is the primary replica, and as you probably suspect, the aforementioned command fails.

Casting the following spell lists the replicas and their join state: “select join_state, join_state_desc from sys.dm_hadr_availability_replica_cluster_states”. This is the result:


In some way I have put the node in an invalid state. It still works perfectly, but I guess it is only a question of when, not if, this issue will grow into a bigger problem.

Solution

With such an elaborate backstory, you would not be wrong to expect an equally elaborate solution. Whether or not it is, is really in the eye of the beholder.

Just the usual note of warning first: If you are new to availability groups, and all this cluster stuff sounds like the dark magic it is, I would highly suggest that you do not try to travel down the same path as me. Rather, you should turn around at the entrance and run as fast as you can into the relative safety of creating another cluster alongside the old one. Then migrate the data by backing up on the old cluster and restoring on the new cluster. And if backups and restores on availability groups sounds equally scary, then ask yourself whether or not you are ready to run AOAG in production. In contrast to what is often said in marketing materials and at conferences, AOAG is difficult and scary to the beginner. But there are lots of nice training resources out there, even some free ones.

Now, with the warnings out of the way, here is what ended up working for me. I tried a lot of different solutions, but I was bound by the following limitation: the service has to be online. That translates to no reboots, no AOAG destroy and recreate, no cluster rebuilds and so on, a combination of which would probably have solved the problem in less than an hour of downtime. But I was allowed none, so this is what I did:

  • Remove any remaining nodes and replicas that are not Win2016 SQL2016.
  • Run the Powershell command Update-ClusterFunctionalLevel to make sure that the cluster is running in Win2016 mode.
  • Build another Win 2016 SQL 2016 node
  • Join the new node to the cluster
  • Make sure that the cluster validation scroll seems reasonable. This is a fluffy point I know, but there are way too many variables to make an exhaustive list. https://lokna.no/?p=1687 mentions some of the issues you may encounter.
  • Join the new node to the availability group as a secondary replica (see the hedged PowerShell sketch after this list).
  • Fail the availability group over to the new node (make sure you are in synchronous commit mode for this step).
  • Everything is OK.


  • Fail back to the first node
  • Change back to asynchronous commit (if that is your default mode, otherwise leave it as synchronous).
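
For reference, a hedged sketch of the join and failover steps using the SQL Server PowerShell module; the instance path and availability group name are placeholders, and the T-SQL spells quoted earlier in this post work just as well:

# Run against the new secondary replica (NEWNODE\DEFAULT is a placeholder)
Join-SqlAvailabilityGroup -Path "SQLSERVER:\SQL\NEWNODE\DEFAULT" -Name "MyAvailabilityGroup"

# Manual failover: run against the replica that is about to become the new primary (requires synchronous commit)
Switch-SqlAvailabilityGroup -Path "SQLSERVER:\SQL\NEWNODE\DEFAULT\AvailabilityGroups\MyAvailabilityGroup"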

 

Thus I have successfully upgraded a 2-node AOAG cluster from Win2012R2 and SQL 2012 to Win2016 and SQL 2016 with three failovers as the only downtime. In QA. Production may become an interesting journey, IF the change request is approved. There may be an update if I survive the process…

 

Update and final notes

I have now been through the same process in production, with similar results. I do not recommend doing this in production; the normal migration to a new cluster is far preferable, especially when you are crossing two SQL Server versions on the way. Then again, if the reduced downtime is worth the risk…

Be aware that a failover to a new node is a one-way process. Once the SQL 2016 node becomes the primary replica, the database is updated to the latest file format, currently 852, whereas SQL 2012 uses 706. And as far as I can tell from the log there is a significant number of upgrades to be made. See http://sqlserverbuilds.blogspot.no/2014/01/sql-server-internal-database-versions.html for a list of version numbers.


Failover Cluster manager fails to start

Problem

When trying to start Failover Cluster manager you get an error message: “Microsoft Management Console has stopped working”


Inspection of the application event log reveals an error with event ID 1000, also known as an application error, with the following text:

Faulting application name: mmc.exe, version: 6.3.9600.17415, time stamp: 0x54504e26
Faulting module name: clr.dll, version: 4.6.1055.0, time stamp: 0x563c12de
Exception code: 0xc0000409


 

Solution

As usual, this is a .NET Framework debacle. Remove KB 3102467 (Update for .NET Framework 4.6.1), or wait for a fix.
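
If you prefer the command line, the update can likely be removed with wusa.exe; a reboot will still be required afterwards:

wusa.exe /uninstall /kb:3102467 /norestart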


The SQL 2012 SP3 update killed my cluster

Problem

After an upgrade to SQL 2012 SP3, a clustered instance fails to start, listing errors 33009, 912 and 3417:

2016-04-28 14:32:23.39 spid14s     Error: 33009, Severity: 16, State: 2.
2016-04-28 14:32:23.39 spid14s     The database owner SID recorded in the master database differs from the database owner SID recorded in database 'msdb'. You should correct this situation by resetting the owner of database 'msdb' using the ALTER AUTHORIZATION statement.
2016-04-28 14:32:23.39 spid14s     Error: 912, Severity: 21, State: 2.
2016-04-28 14:32:23.39 spid14s     Script level upgrade for database 'master' failed because upgrade step 'msdb110_upgrade.sql' encountered error 33009, state 2, severity 16. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the 'master' database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion.
2016-04-28 14:32:23.39 spid14s     Error: 3417, Severity: 21, State: 3.
2016-04-28 14:32:23.39 spid14s     Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online.

Something went down the steep road to oblivion during the upgrade process, and the instance is as dead as an overrun squirrel. To be more specific, the master database does not like the looks of the owner attribute on the msdb database, and is downright refusing to complete the upgrade process.

Analysis

  • Start the instance with trace flag 902 to disable the upgrade scripts and get the server back online. You can do this by command line if you wish, but for clustered instances it is easier to use SQL Server Configuration Manager. There will be some existing startup parameters. Do not mess with those; just add a new one (-T902) using the handy “Add” button provided.
  • Wait for the instance to start, and log in using SSMS.
  • To find out which SID is where, run the following script:
--Sid in master
 Select SID From master..sysdatabases Where Name = 'msdb'
--Sid in database
 Select [SID] From msdb.sys.database_principals Where Name = 'DBO'

It will probably show 0x01 (sa) for one of the results, and an Active Directory SID for the other one.


If you want to know what username SQL Server has stored for the SID, use this command, replacing 0x01 with the SID from the previous result:

 --Sid to username
 SELECT Name as [LoginName] FROM master..syslogins Where SID = 0x01

Be aware that if the username has been changed in AD, let's say you changed DOMAIN\johnsmith to DOMAIN\jsmith1, SQL Server will not necessarily be aware of this. You can validate the username in AD using PowerShell. See https://technet.microsoft.com/en-us/library/ff730940.aspx. And just to make the point clear, do not mess around with account names in AD if they are linked to SQL Server. It may lead to login failures, and possibly cows being abducted by aliens at your local dairy farm.

Solution

Change the database owner to match the value stored in master. Remember to use brackets for AD users.

Alter Authorization on Database::msdb To [Domain\user]

Then remove the trace flag and restart the instance to try the upgrade scripts again. If the current owner is not the owner you want, allow the upgrade to finish and change it again.

Cluster Shared Volumes password policy

Problem

Failover Cluster validation generates a warning in the Storage section under “Validate CSV Settings”. The error message states:

Failure while setting up to run Cluster Shared Volumes support testing on node [FQDN]: The password does not meet the password policy requirements. Check the minimum password length, password complexity and password history requirements.

No failure audits in the security log, and no significant error messages detected elsewhere.

Analysis

This error was logged shortly after a change in the password policy for the Windows AD domain the cluster is a member of. The current minimum password length was set to 14 (max) and complexity requirements were enabled:


This is a fairly standard setup, as written security policies usually mandate a password length far exceeding 14 characters for IT staff. Thus, I already knew that the problem was not related to the user initiating the validation, as the length of his/her password already exceeded 14 characters before the enforcement policy change.

Lab tests verified that the problem was related to the Default Domain password policy. Setting the policy as above makes the cluster validation fail. The question is why. Further lab tests revealed that the limit is 12 characters. That is, if you set the minimum length to 12 characters, the test will pass with flying colors as long as there are no other problems related to CSV. I still wondered why, though. The problem is the relation between the local and domain security policies of a domain joined computer. To understand this, it helps to be aware of the fact that Failover Cluster Validation creates a local user called CliTest2 on all nodes during the CSV test.


The local user store on a domain joined computer is subject to the same password policies as are defined in the Default Domain GPO. Thus, when the domain policy is changed this will also affect any local accounts on all domain joined computers. As far as I can tell, the Failover Cluster validation process creates the CliTest2 user with a 12 character password. This has few security ramifications, as the user is deleted as soon as the validation process ends.

Solution

The solution is relatively simple to describe. You have to create a separate password policy for your failover cluster nodes where Minimum Password Length is set to 12 or less. This requires that you keep your cluster nodes in a separate Organizational Unit from your user and service accounts. That is a good thing to do anyway, but be aware that moving servers from one OU to another may have adverse effects.

You then create and link a GPO to the cluster node OU and set the Minimum Password Length to 12 in the new GPO. That is the only setting that should be defined in this GPO. Then check the link order for the OU and make sure that your new GPO has a link order of 1, or at least a lower link order than the Default Domain Policy. Then you just have to run GPUPDATE /Force on all cluster nodes and try the cluster validation again. A rough PowerShell outline is sketched below.
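
A rough outline in PowerShell, assuming the nodes live in an OU named OU=ClusterNodes,DC=contoso,DC=com and are called NODE1 and NODE2 (all placeholders); the Minimum Password Length setting itself still has to be configured in the GPO editor:

Import-Module GroupPolicy
New-GPO -Name "Cluster node password policy"
New-GPLink -Name "Cluster node password policy" -Target "OU=ClusterNodes,DC=contoso,DC=com" -Order 1
# Edit the GPO: Computer Configuration -> Policies -> Windows Settings -> Security Settings ->
# Account Policies -> Password Policy -> Minimum password length = 12 (or less), then refresh the nodes:
Invoke-GPUpdate -Computer "NODE1" -Force
Invoke-GPUpdate -Computer "NODE2" -Force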

If the above description sounds like a foreign language, please ask for help before you try implementing it. Group Policies may be a fickle fiend, and small changes may lead to huge unforeseen consequences.