Removing a drive from a cluster group moves all cluster group resources to Available Storage

Problem

During routine maintenance on a SQL Server cluster we were planning to remove one of the clustered drives. We had previously replaced the SAN, and this disk was backed by an old storage unit that we wanted to decommission. So we made sure that there were no dependencies, right-clicked the drive in Failover Cluster Manager under the SQL Server role and pressed “Remove from SQL Server”. Promptly the drive vanished from view, together with all other cluster resources associated with the role…

After a slightly panicky check to make sure that the SQL Server instance was still running (it was), we started to wonder about what was happening. Running Get-ClusterResource in PowerShell revealed that all our missing resources had been moved to the “Available Storage” resource group.
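
For reference, this is a quick way to list exactly what ended up in the Available Storage group (a sketch; run in a PowerShell session with the FailoverClusters module loaded):

#List every resource currently owned by the Available Storage group
Get-ClusterGroup "Available Storage" | Get-ClusterResource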

[Screenshot: Get-ClusterResource output showing the missing resources under the Available Storage group]

We did a failover to verify that the instance was still working, and it gladly failed over together with the Available Storage group. There are a total of four SQL Server instances on the sample cluster pictured above.

Solution

The usual warning: Performing this procedure may result in an outage. If you do not understand the commands, read up on them before you try.

Move the resources back to the SQL Server resource group. If you move the SQL Server resource, that is, the resource with ResourceType SQL Server, all other dependent resources should follow. If your dependencies are not configured correctly, you may have to move some of the resources individually.

Command: Get-ClusterResource "SQL Server (instance)" | Move-ClusterResource -Group "SQL Server (instance)"

Just replace Instance with the name of your SQL Server instance.
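
As an example, for a hypothetical instance named PROD01 with the default resource and role names, the command would be:

Get-ClusterResource "SQL Server (PROD01)" | Move-ClusterResource -Group "SQL Server (PROD01)"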

Then, run Get-ClusterResource | Sort-Object OwnerGroup, ResourceType to verify that all your resources are associated with the correct resource group. The result should look something like this. As a minimum, you should have an IP address, a network name, SQL Server, SQL Server Agent and one or more Physical Disk resources.

[Screenshot: Get-ClusterResource output with all resources back in their SQL Server resource groups]

The SQL 2012 SP3 update killed my cluster

Problem

After an upgrade to SQL 2012 SP3, a clustered instance fails to start, listing errors 33009, 912 and 3417:

2016-04-28 14:32:23.39 spid14s     Error: 33009, Severity: 16, State: 2.
2016-04-28 14:32:23.39 spid14s     The database owner SID recorded in the master database differs from the database owner SID recorded in database 'msdb'. You should correct this situation by resetting the owner of database 'msdb' using the ALTER AUTHORIZATION statement.
2016-04-28 14:32:23.39 spid14s     Error: 912, Severity: 21, State: 2.
2016-04-28 14:32:23.39 spid14s     Script level upgrade for database 'master' failed because upgrade step 'msdb110_upgrade.sql' encountered error 33009, state 2, severity 16. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the 'master' database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion.
2016-04-28 14:32:23.39 spid14s     Error: 3417, Severity: 21, State: 3.
2016-04-28 14:32:23.39 spid14s     Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online.

Something went down the steep road to oblivion during the upgrade process, and the instance is as dead as an overrun squirrel. To be more specific, the master database does not like the looks of the owner attribute on the msdb database, and is downright refusing to complete the upgrade process.

Analysis

  • Start the instance with trace flag 902 to disable the upgrade scripts and get the server back online. You can do this from the command line if you wish, but for clustered instances it is easier to use SQL Server Configuration Manager. There will be some existing startup parameters. Do not mess with those, just add a new one (-T902) using the handy “Add” button provided.
  • [Screenshot: adding the -T902 startup parameter in SQL Server Configuration Manager]
  • Wait for the instance to start, and log in using SSMS.
  • To find out which SID is stored where, run the following script:
--SID recorded in master
SELECT sid FROM master..sysdatabases WHERE name = 'msdb';
--SID recorded in the database itself
SELECT [sid] FROM msdb.sys.database_principals WHERE name = 'dbo';

It will probably show 0x01 (sa) for one of the results, and an Active Directory SID for the other one:

[Screenshot: query results showing 0x01 for one SID and an Active Directory SID for the other]

If you want to know what username SQL Server has stored for the SID, use this command, replacing 0x01 with the SID from the previous result:

--SID to username
SELECT name AS [LoginName] FROM master..syslogins WHERE sid = 0x01;

Be aware that if the username has been changed in AD, let's say you changed DOMAIN\johnsmith to DOMAIN\jsmith1, SQL Server will not necessarily be aware of this. You can validate the username in AD using PowerShell, see https://technet.microsoft.com/en-us/library/ff730940.aspx. And just to make the point clear: do not mess around with account names in AD if they are linked to SQL Server. It may lead to login failures, and possibly cows being abducted by aliens at your local dairy farm.
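
For a quick sanity check, something like the following should do, assuming the ActiveDirectory PowerShell module (RSAT) is installed and jsmith1 is the account name you want to verify:

#Confirm that the account exists in AD and is enabled, and show its SID
Import-Module ActiveDirectory
Get-ADUser -Identity jsmith1 | Select-Object SamAccountName, Enabled, SID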

Solution

Change the database owner to match the value stored in master. Remember to use brackets for AD users.

ALTER AUTHORIZATION ON DATABASE::msdb TO [Domain\user]

Then remove the trace flag and restart the instance to try the upgrade scripts again. If the current owner is not the owner you want, allow the upgrade to finish and change it again.
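
If sa is the owner you want to end up with, the post-upgrade change is just another ALTER AUTHORIZATION:

--Run after the upgrade scripts have completed
ALTER AUTHORIZATION ON DATABASE::msdb TO [sa]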

LCK_M_U Wait troubleshooting

Problem

I got a callout from our monitoring system, alerting me to a high Average Wait Time on one of my production servers:

[Screenshot: monitoring graph showing the spike in Average Wait Time]

The baseline average wait value for this server is around 9ms, so an average in the 900ms range is a definitive sign of something. Whether it is a sign of something bad or not is for the investigation to decide. All other operating parameters were in the normal range, both for the SQL Server and the connected application servers. Usually, such warnings are accompanied by a complaint from the application team about slow application performance, but not today.

Analysis

I started out running Mr. Randal's brilliant wait stats scripts. As the graph from the monitoring system contained a lot of jitter, I went with http://www.sqlskills.com/blogs/paul/capturing-wait-statistics-period-time/ and let it run for the 30-minute default period while the waits were high to see what was going on. The result revealed that LCK_M_U had joined the usual band of ASYNC_NETWORK_IO, TRACEWRITE and CXPACKET waits:

[Screenshot: wait statistics for the sample period, with LCK_M_U among the top waits]

LCK_M_U indicates that someone somewhere is waiting for an update lock. This particular system has a very high ratio of reads to writes, and most writes are done during nightly update tasks. As such, this wait type was not expected and further analysis was needed.
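
If the waits are happening while you are looking, a quick peek at sys.dm_os_waiting_tasks will also show who is waiting and who is blocking them. This is just a point-in-time sketch, separate from the collection approach described below:

--Sessions currently waiting for an update lock, and the session blocking each of them
SELECT session_id, blocking_session_id, wait_duration_ms, resource_description
FROM sys.dm_os_waiting_tasks
WHERE wait_type = 'LCK_M_U';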

For collecting data over time, I found an old article by Kendra Little describing how to use sp_WhoIsActive for exactly that: http://www.littlekendra.com/2011/02/01/whoisactive/. Using her example as a guide, I created a script to collect some data in my monitoring database. To use this script, you need a user database that you can play around with and an up to date version of Adam Machanic's sp_WhoIsActive. I tried using the @find_block_leaders option, but I ran into a known recursion level limitation so I had to remove it. The script has three parts:

Create a table

This part creates a table tagged with today's date to make it easy to separate different troubleshooting sessions. You could also schedule daily automatic runs of the entire script to collect data at specific times each day.

DECLARE @destination_table VARCHAR(4000) ;
SET @destination_table = 'WhoIsActive_' + CONVERT(VARCHAR, GETDATE(), 112) ;

DECLARE @schema VARCHAR(4000) ;

--Ask sp_WhoIsActive for the CREATE TABLE statement matching the chosen options
EXEC dbo.sp_WhoIsActive
    @get_transaction_info = 1,
    @get_plans = 1,
    @find_block_leaders = 0,
    @get_locks = 1,
    @return_schema = 1,
    @schema = @schema OUTPUT;

--Insert the dated table name and create the table
SET @schema = REPLACE(@schema, '<table_name>', @destination_table) ;

PRINT @schema ;
EXEC(@schema) ;
GO

Collect data

You can run this part several times, and the intervals can be adjusted to suit your needs. Just be aware that if you cross midnight, you either have to create another table for the new day, or modify the collection script to log to yesterday's table.

DECLARE @destination_table VARCHAR(4000) ;
SET @destination_table = 'WhoIsActive_' + CONVERT(VARCHAR, GETDATE(), 112) ;
DECLARE @msg NVARCHAR(1000) ;
DECLARE @numberOfRuns INT ;
SET @numberOfRuns = 100 ;

WHILE @numberOfRuns > 0
    BEGIN;
        EXEC dbo.sp_WhoIsActive
            @get_transaction_info = 1,
            @get_plans = 1,
            @find_block_leaders = 0,
            @get_locks = 1,
            @destination_table = @destination_table ;

        SET @numberOfRuns = @numberOfRuns - 1 ;

        IF @numberOfRuns > 0
            BEGIN
                SET @msg = CONVERT(CHAR(19), GETDATE(), 121) + ': ' +
                 'Logged info. Waiting...'
                RAISERROR(@msg,0,0) WITH nowait ;

                WAITFOR DELAY '00:00:05'
            END
        ELSE
            BEGIN
                SET @msg = CONVERT(CHAR(19), GETDATE(), 121) + ': ' + 'Done.'
                RAISERROR(@msg,0,0) WITH nowait ;
            END

    END ;
GO

The default is 100 runs at 5 second intervals. I had to run the collection script three times to get the data I wanted, so don’t give up if you don’t catch it on the first run.

Analyze the data

Two sample analysis scripts. The first one gives you all the collected data sorted by collection time; the second shows only the LCK_M_U waits and has a hardcoded table name, as I did not finish the collection on the first day.

DECLARE @destination_table NVARCHAR(2000), @dSQL NVARCHAR(4000) ;
SET @destination_table = 'WhoIsActive_' + CONVERT(VARCHAR, GETDATE(), 112) ;

--All collected data, newest first
SET @dSQL = N'SELECT collection_time, * FROM dbo.' +
    QUOTENAME(@destination_table) + N' ORDER BY 1 DESC' ;
PRINT @dSQL ;
EXEC sp_executesql @dSQL ;
GO

--Only the LCK_M_U waits, excluding session 51, from a hardcoded table name
SELECT collection_time, *
FROM JKL_Log.dbo.WhoIsActive_20150616
WHERE session_id <> 51 AND wait_info LIKE '%LCK_M_U%'
ORDER BY 1 DESC;

The result clearly identified the waiting code. This code is not necessarily the culprit; it could be another slow task hogging locks on the data.

[Screenshot: sp_WhoIsActive results showing the waiting statements]
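
To see what the blockers were doing, rather than just who was waiting, the collected table can be joined back to itself on blocking_session_id. A sketch against the hardcoded table name used above, assuming the default sp_WhoIsActive column set (which includes blocking_session_id and sql_text):

--Pair each waiting session with whatever its blocker was running at the same collection time
SELECT w.collection_time, w.session_id, w.wait_info,
       b.session_id AS blocking_session, b.sql_text AS blocking_sql
FROM JKL_Log.dbo.WhoIsActive_20150616 AS w
JOIN JKL_Log.dbo.WhoIsActive_20150616 AS b
  ON b.collection_time = w.collection_time
 AND b.session_id = w.blocking_session_id
WHERE w.wait_info LIKE '%LCK_M_U%'
ORDER BY w.collection_time DESC;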

Solution

With the troublesome T-SQL code identified, the next step would be debugging the application. In this case, the code was from an Entity Framework app, and I had to get the developers of said application to look into the issue.

Error 3930 installing SQL 2012 SP2 with CU3 in cluster

Problem

I was patching one of my clusters to SQL 2012 SP2 and SP2 CU3 when something bad happened. This particular cluster is a three-node cluster with an FCI Primary AOAG replica instance on nodes 1 and 2, and a stand-alone Secondary AOAG replica instance on node 3. Node 3 is used for HADR when the shared storage or other shared infrastructure has an outage.

The update passed QAT with flying colors, but sadly that does not always guarantee a successful production run. My standard patch procedure for this cluster:

  • Patch node 3
  • Patch node 2 (passive FCI node)
  • AOAG failover to node 3, node 3 becomes AOAG Primary
  • FCI failover from node 1 to node 2
  • Patch node 1
  • FCI failover to node 1
  • AOAG failover to node 1

When I tried to fail over the FCI to node 2 (step 4 above), the instance failed. At first, I was worried that the SP2 upgrade process was lengthy enough to trigger the FCI timeouts. An inspection of the SQL Server error log revealed that this was not the case. Instead, I was the victim of a dreaded master database failure:

2015-01-12 01:28:02.82 spid7s      Database 'master' is upgrading script 'msdb110_upgrade.sql' from level 184552836 to level 184554932.
2015-01-12 01:28:02.82 spid7s      ----------------------------------
2015-01-12 01:28:02.82 spid7s      Starting execution of PRE_MSDB.SQL
2015-01-12 01:28:02.82 spid7s      ----------------------------------
2015-01-12 01:28:02.96 spid7s      Error: 3930, Severity: 16, State: 1.
2015-01-12 01:28:02.96 spid7s      The current transaction cannot be committed and cannot support operations that write to the log file. Roll back the transaction.
2015-01-12 01:28:02.96 spid7s      Error: 912, Severity: 21, State: 2.
2015-01-12 01:28:02.96 spid7s      Script level upgrade for database 'master' failed because upgrade step 'msdb110_upgrade.sql' encountered error 3930, state 1, severity 16. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the 'master' database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion.
2015-01-12 01:28:02.97 spid7s      Error: 3417, Severity: 21, State: 3.
2015-01-12 01:28:02.97 spid7s      Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online.
2015-01-12 01:28:02.97 spid7s      SQL Server shutdown has been initiated
2015-01-12 01:28:02.97 spid7s      SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.

Analysis

In case misbehaving SQL Server instances are able to smell fear, I am glad I was located several miles away from the datacenter at this point in time. While a rebuild of master is certainly doable even in a complex setup such as this, it is not something you want to do at 2am without a detailed plan if you don't have to. Thus, I tried failing the instance back to node 1 (running SP1 CU11). To my amazement it came online straight away. I have seen similar issues reduce clustered instances to an unrecognizable puddle of zeros and ones in a corner on the SAN drive, so this was a welcome surprise. Feeling lucky, I tried another failover to node 2, only to be greeted with another failure and the exact same errors in the log.

A quick search revealed several similar issues, but no exact matches and no feasible solutions. The closest was a suggestion to disable replication during the upgrade. As you probably know, AOAG is just replication in a fancy dress, so I went looking for my Disaster Recovery Runbook, which contains ready-made scripts and plans for disabling and re-enabling AOAG. My only problem was that disabling AOAG would take down the AOAG listener, thus disconnecting all clients. Such antics result in grumpy client systems, web service downtime and a lot of paperwork for instance reviews, and are therefore something to avoid if at all possible.

Just for the fun of it, I decided to try making node 2 the AOAG Primary during the upgrade. To my astonishment, this worked like a charm. Crisis (and paperwork) averted.

Solution

You have to promote the FCI to AOAG Primary during the upgrade from SP1 to SP2. The upgrade is triggered by failing the FCI over from a node running SP1 to a node running SP2, in my case the failover from node 1 to node 2 after patching node 2.
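
The promotion itself is an ordinary planned availability group failover issued on the FCI instance, i.e. the replica that is about to become primary. A minimal sketch, assuming an availability group named AG1:

--Run on the FCI instance (the target primary) before triggering the FCI failover
ALTER AVAILABILITY GROUP [AG1] FAILOVER;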

Sadly, there is no fixed procedure for patching failover cluster instances. Some patches will only install on the active FCI node, and will then continue to patch all nodes automatically. But most patches follow the recipe above, where the passive node(s) are patched first.

This issue will probably not affect “clean” AOAG or FCI clusters where you only apply one technology. If you use FCI with replication, on the other hand, you may experience the same issue.

Definitions

AOAG = Always On Availability Group

FCI = Failover Cluster Instance

HADR = High Availability / Disaster Recovery

SQL agent jobs cause Audit failure in security log

Problem

When you audit your security log, something you are of course doing every day, you discover that SQL Server is causing an audit failure fairly often:

[Screenshot: Security log with repeated Audit Failure events from SQL Server]

Audit failure event 4625, The user account has expired:

[Screenshot: details of event 4625, “The user account has expired”]

Closer inspection reveals that the event is triggered by the SQL Server engine service account. You could easily be led to believe that the engine account itself has expired, as it is the only account mentioned by name in the error message. That is not the case here, as the “Account for which logon failed” is a null SID, also known as S-1-0-0 or nobody. This is Windows’ way of telling you that something failed, without saying exactly what it was.

Analysis

The error appeared about every hour on the hour, so agent jobs were the top suspect. But the job logs disagreed: every job was working like a charm. Analysis of the central log repository revealed that the error had appeared suddenly, without any changes being made on the server at the time.

I discovered this on a cluster, so I tried failing over to another node. That didn't change anything; the error followed sqlserver.exe to the other node. At least I had proven beyond reasonable doubt that the error was directly related to something SQL Server does. I was unable to find any fault on the server apart from the audit failure. I checked all SQL Server service accounts, none of which had an expiry date set. I drifted back to the agent jobs again, as agent job ownership has given me a hard time before. For some reason, if the person creating the jobs/maintenance plans is a sysadmin by group membership and not by direct membership, all agent jobs fail to execute unless you use a proxy account. I have blogged about this before (https://lokna.no/?p=1267), so I always check for this when a new instance is installed and correct it if necessary.

Then it struck me: there is a best practice somewhere stating that SQL Server installs are supposed to be executed as a special account, and not as an account associated with the person performing the installation. This is in case that person quits and his/her account has to be deleted. The agent jobs for backup, DBCC CHECKDB and such are usually created using another account though, but maybe there was an anomaly. I ran a quick check, and yes, the agent jobs were owned by the setup user. And this user had since been marked as expired for security reasons, as it is only used during setup and as a way into the server in case someone removes all other sysadmins and locks themselves out. I know there are other ways into a SQL Server you don't have access to, but this is a lot easier as long as you have access to the domain.

To list job owners:

SELECT name, owner_sid FROM msdb.dbo.sysjobs

[Screenshot: job names and owner SIDs returned by the query]

0x01 is sa; all the other jobs were owned by the setup user. These SIDs are in hex format. To convert a SID to a username, run this, replacing 0xHEXSID with the SID from the previous result:

SELECT SUSER_SNAME(0xHEXSID)

Solution

Change the job and/or maintenance plan owner. See https://lokna.no/?p=325
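
A minimal sketch of changing the owner of a single job with msdb.dbo.sp_update_job; the job name here is hypothetical, and maintenance plan subplans are handled the same way once you know the name of the job behind them:

--Change the owner of one agent job to sa
EXEC msdb.dbo.sp_update_job
    @job_name = N'DatabaseBackup - FULL',
    @owner_login_name = N'sa' ;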