I was patching one of my clusters to SQL2012 SP2 and SP2 CU3 when something bad happened. This particular cluster is a 3 node cluster with a FCI Primary AOAG replica instance on node 1 and 2, and a stand alone Secondary AOAG replica instance on node 3. Node 3 is used for HADR when the shared storage or other shared infrastructure has an outage.
The update passed QAT with flying colors, but sadly that does not always guarantee a successful production run. My standard patch procedure for this cluster:
- Patch node 3
- Patch node 2 (passive FCI node)
- AOAG failover to node 3, node 3 becomes AOAG Primary
- FCI failover from node 1 to node 2
- Patch node 1
- FCI failover to node 1
- AOAG failover to node 1
When I tried to fail over the FCI to node 2 (step 4 above), the instance failed. First, I was worried that the SP2 upgrade process may be very lengthy or slow and triggering the FCI timeouts. An inspection of the SLQ Server error log revealed that this was not the case. Instead, I was the victim of a dreaded master database failure:
015-01-12 01:28:02.82 spid7s Database 'master' is upgrading script 'msdb110_upgrade.sql' from level 184552836 to level 184554932. 2015-01-12 01:28:02.82 spid7s ---------------------------------- 2015-01-12 01:28:02.82 spid7s Starting execution of PRE_MSDB.SQL 2015-01-12 01:28:02.82 spid7s ---------------------------------- 2015-01-12 01:28:02.96 spid7s Error: 3930, Severity: 16, State: 1. 2015-01-12 01:28:02.96 spid7s The current transaction cannot be committed and cannot support operations that write to the log file. Roll back the transaction. 2015-01-12 01:28:02.96 spid7s Error: 912, Severity: 21, State: 2. 2015-01-12 01:28:02.96 spid7s Script level upgrade for database 'master' failed because upgrade step 'msdb110_upgrade.sql' encountered error 3930, state 1, severity 16. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the 'master' database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion. 2015-01-12 01:28:02.97 spid7s Error: 3417, Severity: 21, State: 3. 2015-01-12 01:28:02.97 spid7s Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online. 2015-01-12 01:28:02.97 spid7s SQL Server shutdown has been initiated 2015-01-12 01:28:02.97 spid7s SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.
In case misbehaving SQL Server instances are able to smell fear, I am glad I was located several miles away from the datacenter at this point in time. While a rebuild of master is certainly doable even in a complex setup such as this, it is not something you want to do at 2am without a detailed plan if you don’t have to. Thus, I tried failing the instance back to node 1 (running SP1 CU 11). To my amazement it came online straight away. I have seen similar issues reduce clustered instances to an unrecognizable puddle of zeros and ones in a corner on the SAN drive, so this was a welcome surprise. Feeling lucky, I tried another failover to node 2, only to be greeted with another failure and the exact same errors in the log. A quick search revealed several similar issues, but no exact matches and no feasible solutions. The closest was a suggestion to disable replication during the upgrade. As you probably know, AOAG is just replication in a fancy dress, so I went looking for my Disaster Recovery Runbook that contains ready made scripts and plans for disabling and re-enabling AOAG. My only problem is that disabling AOAG will take down the AOAG listener, thus disconnecting all clients. Such antics results in grumpy client systems, web service downtimes and a lot of paperwork for instance reviews, and is therefore something to avoid if at all possible. Just for the fun of it, I decided to try making Node 2 the AOAG Primary during the upgrade. To my astonishment, this worked like a charm. Crisis (and paperwork) averted.
You have to promote the FCI to AOAG Primary during the upgrade from SP2 to SP1. The upgrade is triggered by failing the FCI over from a node running SP1 to a node running SP2, in my case the failover from node 1 to node 2 after patching node 2.
Sadly, there is no fixed procedure for patching failover cluster instances. Some patches will only install on the active FCI node, and will then continue to patch all nodes automatically. But most patches follow the recipe above, where the passive node(s) are patched first.
This issue will probably not affect “clean” AOAG or FCI clusters where you only apply one technology. If you use FCI with replication on the other hand, you may experience the same issue.
AOAG = Always On Availability Group
FCI = Failover cluster Instance
HADR = High Availability / Disaster Recovery