sql_text returns *password------

Problem

While hunting for performance bottlenecks using Spotlight some months ago I came across some strange statements listed only as *password------ (the dashes continue for a couple of pages). This was new to me, and even a search on the pesky trackers-with-integrated-search came up blank. We passed it off to the developers and never heard back, and I forgot about the whole thing. Then, during a stress test in QA it popped up on the radar again, and this time I decided to dig into it myself.

Analysis

Being a good boy, I was using Extended Events to track down the problem. I do not know why, but Profiler still feels more like home. Anyhow, I set up a session triggering on sql_text like N'%*password--%', looking something like this:

CREATE EVENT SESSION [pwdstar] ON SERVER
ADD EVENT sqlserver.rpc_starting(
    ACTION(sqlserver.client_app_name,sqlserver.client_connection_id,sqlserver.client_hostname,sqlserver.client_pid,sqlserver.database_name,sqlserver.session_id,sqlserver.sql_text,sqlserver.tsql_stack,sqlserver.username)
    WHERE ([package0].[greater_than_uint64]([sqlserver].[database_id],(4))
        AND [package0].[equal_boolean]([sqlserver].[is_system],(0))
        AND [sqlserver].[like_i_sql_unicode_string]([sqlserver].[sql_text],N'%*password--%')
        AND [sqlserver].[equal_i_sql_unicode_string]([sqlserver].[database_name],N'DBNAME')))
ADD TARGET package0.event_file(SET filename=N'pwdstar',max_file_size=(200))
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=ON,STARTUP_STATE=OFF)
GO
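
The session is scripted with STARTUP_STATE=OFF, so remember to start it after creating it:

ALTER EVENT SESSION [pwdstar] ON SERVER STATE = START;
GO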

I came back an hour later to look at the results, and was surprised to find that the tsql_stack returned a more normal looking statement:

[Screenshot: the tsql_stack action showing the actual statement]

Back to the trackers I went, this time trying to figure out why sql_text differed from the statement. This produced far better results, and I came across this article: https://www.sqlskills.com/blogs/jonathan/understanding-the-sql_text-action-in-extended-events/. Following the links produced the following statement from Mr. Kehayias:

If the sql_text Action returns:

<action name="sql_text" package="sqlserver">
<type name="unicode_string" package="package0" >
<value>*password------------------------------</value>
<text></text>
</type></action>

the text was stripped by the engine because it contained a password reference in it.  This functionality occurs in SQL Trace as well and is a security measure to prevent passwords from being returned in the trace results.

As I said, this was news to me, but the behavior seems to be an old one not specific to Extended Events. From what I can tell, Extended Events does, however, use a more aggressive masking algorithm than Profiler.

Solution

If you come across such queries in your Profiler or Extended Events results, you are dealing with something SQL Server believes is password related, and the text is therefore hidden. The easiest way to get at the statement is to capture tsql_stack as well. In SQL 2016 you can capture the statement directly:

[Screenshot: the statement captured directly with SQL Server 2016 Extended Events]
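
If you capture to a file target like the session above, you can read the results back with T-SQL as well. A minimal sketch, assuming the pwdstar session; the path is a stand-in, point it at wherever the event_file target writes its .xel files:

SELECT  ev.value('(@timestamp)[1]', 'datetime2')                           AS event_time,
        ev.value('(action[@name="sql_text"]/value)[1]',   'nvarchar(max)') AS sql_text,   --masked: *password----
        ev.value('(action[@name="tsql_stack"]/value)[1]', 'nvarchar(max)') AS tsql_stack  --frames with sql_handles and offsets
FROM    sys.fn_xe_file_target_read_file(N'C:\SQLLogs\pwdstar*.xel', NULL, NULL, NULL) AS f --stand-in path
CROSS APPLY (SELECT CAST(f.event_data AS xml)) AS x(doc)
CROSS APPLY x.doc.nodes('/event') AS e(ev);

The handles in the tsql_stack frames can then be resolved with sys.dm_exec_sql_text, which is how you get from the masked sql_text to the real statement.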

Microsoft Update with PSWindowsUpdate

Preface

Most of my Windows servers are patched by WSUS, SCCM or a similar automated patch management solution at regular intervals. But not all. Some servers are just too important to be autopatched. This is a combination of SLA requirements making downtime difficult to schedule and the sheer impact of a botched patch run on backend servers. Thus, a more hands-on approach is needed. In W2012R2 and earlier this was easily achieved by running the manual Windows Update application. I ran through the process in QA, let it simmer for a while and went on to repeat the process in production if no ill effects were found during testing. Some systems even have three or more staging levels. It is a very manual process, but it works, and as we are required to hand-hold the servers during the update anyway, it does not really cost anything. Then along came Windows Server 2016. Or Windows 10 I should really say, as the Update module in W2016 is carbon copied from W10 without changes. It is even trying to convince me to install the W10 Creators Update on my servers…

[Screenshot: Windows Update offering the Windows 10 Creators Update on a server]

In Windows Server 2016 the lazy bastards at Microsoft just could not be bothered to implement the functionality from W2012R2 WU. It is no longer possible to defer specific updates I do not want, such as the stupid Silverlight mess. If I want Microsoft Update, then I have to take it all. And if I should become slightly insane and suddenly decide I want driver updates from WU, the only way to do that is to go through Device Manager and check every single device for updates. Or install WUMT, a shady custom WU client of unknown origin.

I could of course use WSUS or SCCM to push just the updates I want, but then I would have to guess which updates each server needs and add them to an ever-growing number of target groups, every time I have a patch run. Now that is expensive. If I had enough of the “special needs” servers to justify the manpower cost, I would have done so long ago. Thus, another solution was needed…

PSWindowsUpdate to the rescue. PSWindowsUpdate is a PowerShell module written by a user called MichalGajda on the TechNet Gallery that enables management of Windows Update through PowerShell. In this post I go through how to install the module and use it to run Microsoft Update in a way that resembles the functionality from W2012R2. You could tell the module to install a certain list of updates, but I found it easier to hide the unwanted updates. That also ensures they are not added by mistake during the next round of patching.

Getting started

(See the following chapters for details.)

  • You should of course start by installing the module. This should be a one-time deal, unless a new version has been released since last time you used it. New versions of the module should of course be tested in QA like any other software.
  • Then, make sure that Microsoft Update is active.
  • Check for updates to get a list of available patches.
  • Hide any unwanted patches.
  • Install the updates.
  • Re-check for updates to make sure there are no “round-two” patches to install.


No Microsoft Update

Problem

I was preparing to roll out SQL Server 2016 and Windows Server 2016 and had deployed the first server in production. I suddenly noticed that even if I selected “Check online for updates from Microsoft Update” in the horrible new update dialog, I never got any of the additional updates. By the way, this link/button only appears when you have an internal SCCM or WSUS server configured. Clicking the normal Check for updates button will get updates from WSUS.

[Screenshot: the Windows Server 2016 Windows Update dialog]

 

Analysis

This was working as expected in the lab, but the lab does not have the fancy System Center Configuration Manager and WSUS systems. So of course I blamed SCCM and uninstalled the agent. But to no avail, still no updates. I poked around the update dialog and found that the “Give me updates for other Microsoft products…” option was grayed out and disabled. I am sure that I checked this box during installation, as I remember looking for its location. But it was no longer selected; it was even grayed out.

[Screenshot: the “Give me updates for other Microsoft products” option grayed out and disabled]

This smells of GPOs. But I also remembered trying to get this option checked by a GPO to save time during installation, and that it was not possible to do so in Win2012R2. Into the Group Policy Manager of the lab DC I went…

It appears that GPO management of the Microsoft Update option has been added in Win2016:

[Screenshot: the “Install updates for other Microsoft products” option in the Configure Automatic Updates GPO setting]

This option is not available in Win2012R2, but as we have a GPO that defines “Configure Automatic Updates”, it defaults to disabled.

Solution

Alternative 1: Upgrade your domain controllers to Win2016.

Alternative 2: Install the Win2016 .admx files on all your domain controllers and administrative workstations.

Then, change the GPO to ensure that “Install updates for other Microsoft products” is enabled. Selecting “3 – Auto download and notify for install” used to be a safe setting.

Alternative 3: Remove the GPO or set “Configure Automatic Updates” to “Not Configured”, thus allowing local configuration.

The SQL 2012 SP3 update killed my cluster

Problem

After an upgrade to SQL 2012 SP3, a clustered instance fails to start, listing errors 33009, 912 and 3417:

2016-04-28 14:32:23.39 spid14s     Error: 33009, Severity: 16, State: 2.
2016-04-28 14:32:23.39 spid14s     The database owner SID recorded in the master database differs from the database owner SID recorded in database 'msdb'. You should correct this situation by resetting the owner of database 'msdb' using the ALTER AUTHORIZATION statement.
2016-04-28 14:32:23.39 spid14s     Error: 912, Severity: 21, State: 2.
2016-04-28 14:32:23.39 spid14s     Script level upgrade for database 'master' failed because upgrade step 'msdb110_upgrade.sql' encountered error 33009, state 2, severity 16. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the 'master' database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion.
2016-04-28 14:32:23.39 spid14s     Error: 3417, Severity: 21, State: 3.
2016-04-28 14:32:23.39 spid14s     Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online.

Something went down the steep road to oblivion during the upgrade process, and the instance is as dead as an overrun squirrel. To be more specific, the master database does not like the looks of the owner attribute on the msdb database, and is downright refusing to complete the upgrade process.

Analysis

  • Start the instance with trace flag 902 to disable the upgrade scripts and get the server back online. You can do this from the command line if you wish, but for clustered instances it is easier to use SQL Server Configuration Manager. There will be some existing startup parameters. Do not mess with those, just add a new one using the handy “Add” button provided.
  • [Screenshot: adding the -T902 startup parameter in SQL Server Configuration Manager]
  • Wait for the instance to start, and log in using SSMS.
  • To find out which SID is stored where, run the following script:
--Sid in master
 SELECT sid FROM master..sysdatabases WHERE name = 'msdb'
--Sid in database
 SELECT sid FROM msdb.sys.database_principals WHERE name = 'dbo'

It will probably show 0x01 (sa) for one of the results, and an Active Directory SID for the other one:

[Screenshot: query results showing 0x01 for one SID and an Active Directory SID for the other]

If you want to know what username SQL Server has stored for the SID, use this command, replacing 0x01 with the SID from the previous result:

 --Sid to username
 SELECT  Name AS [LoginName]  FROM   master..syslogins  WHERE  SID = 0x01

Be aware that if the username has been changed in AD, let's say you changed DOMAIN\johnsmith to DOMAIN\jsmith1, SQL Server will not necessarily be aware of this. You can validate the username in AD using PowerShell. See https://technet.microsoft.com/en-us/library/ff730940.aspx. And just to make the point clear, do not mess around with account names in AD if they are linked to SQL Server. It may lead to login failures, and possibly cows being abducted by aliens at your local dairy farm.
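
If you prefer, SUSER_SNAME() will do the SID-to-name translation for you; 0x01 is sa:

 --Sid to username, the short way
 SELECT SUSER_SNAME(0x01) AS [LoginName]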

Solution

Change the database owner to match the value stored in master. Remember to use brackets for AD users.

ALTER AUTHORIZATION ON DATABASE::msdb TO [DOMAIN\USER]

Then remove the trace flag and restart the instance to try the upgrade scripts again. If the current owner is not the owner you want, allow the upgrade to finish and change it again.

LCK_M_U Wait troubleshooting

Problem

I got a callout from our monitoring system, alerting me of a high Average Wait Time on one of my production servers:

[Screenshot: Average Wait Time alert from the monitoring system]

The baseline average wait value for this server is around 9ms, so an average in the 900ms range is a definitive sign of something. Whether it is a sign of something bad or not is for the investigation to decide. All other operating parameters were in the normal range, both for the SQL Server and the connected application servers. Usually, such warnings are accompanied by a complaint from the application team about slow application performance, but not today.

Analysis

I started out running Mr. Randal's brilliant wait stats scripts. As the graph from the monitoring system contained a lot of jitter, I went with http://www.sqlskills.com/blogs/paul/capturing-wait-statistics-period-time/ and let it run for the 30 minute default period while the waits were high to see what was going on. The result revealed that LCK_M_U had joined the usual band of ASYNC_NETWORK_IO, TRACEWRITE and CXPACKET waits:

[Screenshot: wait stats for the period, with LCK_M_U among the top waits]

LCK_M_U indicates that someone somewhere is waiting for an update lock. This particular system has a very high ratio of reads to writes, and most writes are done by nightly update tasks. As such, this wait type was not expected and further analysis was needed. I found an old article by Kendra Little describing how to collect data using sp_WhoIsActive: http://www.littlekendra.com/2011/02/01/whoisactive/

Using her example as a guide, I created a script to collect some data in my monitoring database. To use this script, you need a user database that you can play around with and an up-to-date version of Adam Machanic's sp_WhoIsActive. I tried using the @find_block_leaders option, but I ran into a known recursion level limitation, so I had to remove it. The script has three parts:

Create a table

This part creates a table tagged with today's date to make it easy to separate different troubleshooting sessions. You could also schedule daily automatic runs of the entire script to collect data at specific times each day.

DECLARE @destination_table VARCHAR(4000) ;
SET @destination_table = 'WhoIsActive_' + CONVERT(VARCHAR, GETDATE(), 112) ;
 
DECLARE @schema VARCHAR(4000) ;
EXEC dbo.sp_WhoIsActive
@get_transaction_info = 1,
@get_plans = 1,
@find_block_leaders = 0,
@get_locks = 1,
@RETURN_SCHEMA = 1,
@SCHEMA = @schema OUTPUT;
 
SET @schema = REPLACE(@schema, '<table_name>', @destination_table) ;
 
PRINT @schema
EXEC(@schema) ;
GO

Collect data

You can run this part several times, and the intervals can be adjusted to suit your needs. Just be aware that if you cross midnight, you either have to create another table for the new day, or modify the collection script to log to yesterday's table.

DECLARE @destination_table VARCHAR(4000) ;
SET @destination_table = 'WhoIsActive_' + CONVERT(VARCHAR, GETDATE(), 112) ;
DECLARE @msg NVARCHAR(1000) ;
DECLARE @numberOfRuns INT ;
SET @numberOfRuns = 100 ;
 
WHILE @numberOfRuns > 0
    BEGIN;
        EXEC dbo.sp_WhoIsActive 
		@get_transaction_info = 1, 
		@get_plans = 1, 
		@find_block_leaders = 0, 
		@get_locks = 1,
		@DESTINATION_TABLE = @destination_table ;
 
        SET @numberOfRuns = @numberOfRuns - 1 ;
 
        IF @numberOfRuns > 0
            BEGIN
                SET @msg = CONVERT(CHAR(19), GETDATE(), 121) + ': ' +
                 'Logged info. Waiting...'
                RAISERROR(@msg,0,0) WITH nowait ;
 
                WAITFOR DELAY '00:00:05'
            END
        ELSE
            BEGIN
                SET @msg = CONVERT(CHAR(19), GETDATE(), 121) + ': ' + 'Done.'
                RAISERROR(@msg,0,0) WITH nowait ;
            END
 
    END ;
GO

The default is 100 runs at 5 second intervals. I had to run the collection script three times to get the data I wanted, so don’t give up if you don’t catch it on the first run.

Analyze the data

Two sample analysis scripts. The first one gives you all the collected data sorted by collection time; the second filters out only the LCK_M_U waits and has a hardcoded table name, as I did not finish on the first day.

DECLARE @destination_table NVARCHAR(2000), @dSQL NVARCHAR(4000) ;
SET @destination_table = 'WhoIsActive_' + CONVERT(VARCHAR, GETDATE(), 112) ;
SET @dSQL = N'SELECT collection_time, * FROM dbo.' +
 QUOTENAME(@destination_table) + N' order by 1 desc' ;
print @dSQL
EXEC sp_executesql @dSQL
GO
SELECT collection_time, * 
FROM JKL_Log.dbo.WhoIsActive_20150616
WHERE session_id NOT LIKE 51 AND wait_info LIKE '%LCK_M_U%'
ORDER BY 1 DESC;

The result clearly identified the waiting code. This code is not necessarily the culprit; it could be another slow task hogging locks on the data.

[Screenshot: collected sp_WhoIsActive rows showing the waiting statement and its LCK_M_U wait info]
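
If you manage to look while the waits are actually happening, the waiting tasks DMV will also point out who is blocking whom. A small sketch, not part of the original collection scripts:

SELECT  wt.session_id, wt.blocking_session_id, wt.wait_type,
        wt.wait_duration_ms, wt.resource_description, st.text AS waiting_sql
FROM    sys.dm_os_waiting_tasks AS wt
JOIN    sys.dm_exec_requests    AS r ON r.session_id = wt.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
WHERE   wt.wait_type = N'LCK_M_U';

The blocking_session_id can then be looked up in the collected sp_WhoIsActive data to see what the blocker was running at the time.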

Solution

With the troublesome T-SQL code identified, the next step would be debugging the application. In this case, the code was from an Entity Framework app, and I had to get the developers of said application to look into the issue.

Error 3930 installing SQL 2012 SP2 with CU3 in cluster

Problem

I was patching one of my clusters to SQL 2012 SP2 and SP2 CU3 when something bad happened. This particular cluster is a 3-node cluster with an FCI primary AOAG replica instance on nodes 1 and 2, and a stand-alone secondary AOAG replica instance on node 3. Node 3 is used for HA/DR when the shared storage or other shared infrastructure has an outage.

The update passed QAT with flying colors, but sadly that does not always guarantee a successful production run. My standard patch procedure for this cluster:

  • Patch node 3
  • Patch node 2 (passive FCI node)
  • AOAG failover to node 3, node 3 becomes AOAG Primary
  • FCI failover from node 1 to node 2
  • Patch node 1
  • FCI failover to node 1
  • AOAG failover to node 1

When I tried to fail the FCI over to node 2 (step 4 above), the instance failed. At first, I was worried that the SP2 upgrade process was very lengthy or slow and was triggering the FCI timeouts. An inspection of the SQL Server error log revealed that this was not the case. Instead, I was the victim of a dreaded master database failure:

2015-01-12 01:28:02.82 spid7s      Database 'master' is upgrading script 'msdb110_upgrade.sql' from level 184552836 to level 184554932.
2015-01-12 01:28:02.82 spid7s      ----------------------------------
2015-01-12 01:28:02.82 spid7s      Starting execution of PRE_MSDB.SQL
2015-01-12 01:28:02.82 spid7s      ----------------------------------
2015-01-12 01:28:02.96 spid7s      Error: 3930, Severity: 16, State: 1.
2015-01-12 01:28:02.96 spid7s      The current transaction cannot be committed and cannot support operations that write to the log file. Roll back the transaction.
2015-01-12 01:28:02.96 spid7s      Error: 912, Severity: 21, State: 2.
2015-01-12 01:28:02.96 spid7s      Script level upgrade for database 'master' failed because upgrade step 'msdb110_upgrade.sql' encountered error 3930, state 1, severity 16. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the 'master' database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion.
2015-01-12 01:28:02.97 spid7s      Error: 3417, Severity: 21, State: 3.
2015-01-12 01:28:02.97 spid7s      Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online.
2015-01-12 01:28:02.97 spid7s      SQL Server shutdown has been initiated
2015-01-12 01:28:02.97 spid7s      SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.

Analysis

In case misbehaving SQL Server instances are able to smell fear, I am glad I was located several miles away from the datacenter at this point in time. While a rebuild of master is certainly doable even in a complex setup such as this, it is not something you want to do at 2am without a detailed plan if you don't have to. Thus, I tried failing the instance back to node 1 (running SP1 CU11). To my amazement it came online straight away. I have seen similar issues reduce clustered instances to an unrecognizable puddle of zeros and ones in a corner on the SAN drive, so this was a welcome surprise. Feeling lucky, I tried another failover to node 2, only to be greeted with another failure and the exact same errors in the log. A quick search revealed several similar issues, but no exact matches and no feasible solutions. The closest was a suggestion to disable replication during the upgrade. As you probably know, AOAG is just replication in a fancy dress, so I went looking for my Disaster Recovery Runbook, which contains ready-made scripts and plans for disabling and re-enabling AOAG. My only problem is that disabling AOAG will take down the AOAG listener, thus disconnecting all clients. Such antics result in grumpy client systems, web service downtime and a lot of paperwork for incident reviews, and are therefore something to avoid if at all possible. Just for the fun of it, I decided to try making node 2 the AOAG primary during the upgrade. To my astonishment, this worked like a charm. Crisis (and paperwork) averted.

Solution

You have to promote the FCI to AOAG primary during the upgrade from SP1 to SP2. The upgrade is triggered by failing the FCI over from a node running SP1 to a node running SP2, in my case the failover from node 1 to node 2 after patching node 2.
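
For reference, a planned AOAG failover is a one-liner. A sketch, assuming an availability group named MyAG and a synchronized, synchronous-commit replica; run it on the instance that is to become the new primary:

ALTER AVAILABILITY GROUP [MyAG] FAILOVER;

If the replica runs in asynchronous commit, you would have to switch it to synchronous commit and wait for it to synchronize first, and remember to fail back once the patching is done.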

Sadly, there is no fixed procedure for patching failover cluster instances. Some patches will only install on the active FCI node, and will then continue to patch all nodes automatically. But most patches follow the recipe above, where the passive node(s) are patched first.

This issue will probably not affect “clean” AOAG or FCI clusters where you only apply one technology. If you use FCI with replication on the other hand, you may experience the same issue.

Definitions

AOAG = Always On Availability Group

FCI = Failover Cluster Instance

HADR = High Availability / Disaster Recovery

Asynchronous AOAG down after restart of primary node

Background

This article is the result of a long day in the woods on a SAR mission turning into an even longer night due to a difficult cluster.

I have many clusters. One of them is a cluster with 3 nodes. 2 of them are running a regular Failover Cluster Instance with shared storage, while the third node has local storage and serves as an AOAG replica for the most critical databases. The Failover Cluster Instance is the primary replica as long as we do not have an HA/DR scenario such as the SAN being down or massive hardware issues. This is a very specific setup, but I would not be surprised if this problem could be triggered by restarting the primary replica on any asynchronous AOAG.

Problem

 

During scheduled maintenance, I failed over the instance containing the primary replica from one node to another manually. This is what usually happens:

  • SQL instance is manually moved from node A to node B
  • The AOAG listener cluster resources fail on node A
  • The AOAG listener cluster resources are automatically moved to node B
  • The AOAG listener cluster resources come on line on node B

But not this time. This time, the AOAG listener objects came online on node A. Such a thing is not supposed to happen. In my experience, AOAG listeners always stay with the primary node.

Note: I do not recommend this procedure as a standard maintenance procedure. It is always best to make sure that the instance you are restarting is NOT the primary replica of any AOAG.

Anyway, the end result is that SQL Server patiently waits for the AOAG listeners to come online on the correct server. Or perhaps patiently is not the correct word. It spews angry error messages in the logs, and the AOAG dashboard is all red.

[Screenshot: the AOAG dashboard showing the availability group in an unhealthy state]

“Availability replica is disconnected”

“Availability replica does not have a healthy role”

[Screenshot: the availability replica error details]
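
When the dashboard is in this state, the underlying replica status can also be read straight from the DMVs; a minimal sketch:

SELECT  ag.name AS availability_group, ar.replica_server_name,
        ars.role_desc, ars.connected_state_desc, ars.synchronization_health_desc
FROM    sys.dm_hadr_availability_replica_states AS ars
JOIN    sys.availability_replicas AS ar ON ar.replica_id = ars.replica_id
JOIN    sys.availability_groups   AS ag ON ag.group_id   = ars.group_id;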

Solution

The solution is quite simple. It does however require crossing over to the dark side, performing unspeakable dark magic. Magic that should never be performed in the presence of an AOAG listener, much less be performed on the listener itself. But simple it is, as long as you are the kind of person who knows your way around the part of Windows where Failover Cluster Manager dwells. Or is familiar with the sky blue realm of the PowerShell cluster commandlets. Both will suffice. However, if you are not such a person, if some of the sentences in this article sound like dark incantations overheard in a shady tavern, please seek assistance from someone who is before you proceed down this path in production. Please note that normally, AOAG listener resources should never be manipulated manually. Trying to do so usually just makes the situation even more dire. But here goes:

Take the AOAG listener resources offline manually, and then bring them back online. Doing so should make the resources realize the errors of their ways and promptly enter the failed state. The cluster and SQL Server should detect this and take action. All failed listeners should be whisked over to the primary replica node and brought online without any need for further input from you.

However, if the listeners are still offline, active on the wrong cluster node, or even worse locked in a failed state, you still have options. Suggestions are listed in order of increasing estimated time consumption.

  • Move the listener resources manually to the node containing the primary replica.
  • Shut down all the nodes, then start just one and let SQL Server initialize completely before you start the others.
  • Look for underlying domain and network issues preventing the listener from starting.
  • Destroy and re-create the availability group and listeners.

The server cannot accept TCP connections after disabling named pipes

Problem

I always disable the named pipes protocol on my SQL Servers if they are going to accept network connections from clients. I am not sure if named pipes still poses a threat to my network in a 10GB Ethernet world, but I have spent a considerable amount of time troubleshooting network latencies caused by named pipes previously. For some background on TCP/IP vs NP, check out this article by devpro. Going into SQL Server Configuration Manager and turning it off has become part of my routine, so imagine my surprise when one of the junior DBAs at my workplace told me that this caused a new cluster instance he was installing to fail. I could of course have dismissed it as some strange fault during installation, but a quick survey revealed that he had followed my SQL Server cluster installation manual to the letter. And this happened only in production, not in the identical QA environment. The only error message logged during SQL startup was this one:

“Error: 26058, Severity: 16, State: 1. A TCP provider is enabled, but there are no TCP listening ports configured. The server cannot accept TCP connections.”

The failover cluster service does not like SQL Server instances that are not network connectable, so it promptly shut down the instance, followed by a slew of error messages in the event log. If I remember correctly, the default timeout for a SQL Server instance to come online after it is started is 3 minutes. That is, it has to be ready for client connections and respond to the cluster service’s health checks in 3 minutes. This does not include any time needed to recover user databases, so in a physical environment with a functional storage subsystem this poses no problem at all. If you have slow hardware or other problems causing a slow startup you can increase the timeout through failover cluster manager, but I have personally never had the need to increase it.
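
As an aside, if you want to know whether anything actually connects over named pipes before you disable it, existing sessions reveal their transport:

--Which protocols are current sessions actually using?
SELECT net_transport, COUNT(*) AS connections
FROM sys.dm_exec_connections
GROUP BY net_transport;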

So far, I have only seen this issue in the following scenario:

  • MSSQL 2012 SP1
  • Windows 2012
  • Failover cluster
  • >1 instance of SQL Server in the cluster

But I have been unable to reproduce the problem consistently in my lab. I have otherwise identical cluster environments where some have this issue and some do not.

Steps to reproduce:

  • Install instance 1
  • Apply patches to Instance 1
  • Test failover of instance 1
  • Disable named pipes on instance 1
  • Restart Instance 1
  • Install instance 2
  • Apply patches to instance 2
  • Test Failover of instance 2
  • Disable named pipes on instance 2
  • Restart instance 2
  • Instance 2 fails

Analysis

We spent quite a lot of time comparing the working cluster with the failing one, down to firmware versions, drivers and patches applied. Everything was equal. I even considered a local disk fault, but failing over to another node didn't change anything. The registry settings also looked OK. On a cluster, the registry settings are only available on the node currently owning the instance. The network settings are found in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL11.[Instance name]\MSSQLServer\SuperSocketNetLib\.

[Screenshot: the SuperSocketNetLib registry settings for the instance, showing Listen All enabled]

WARNING: NEVER try to change anything directly in the registry on a clustered instance, especially one that is not working. See Technet for a guide to changing clustered registry settings on an offline SQL instance.

As the sample shows, we even had Listen All enabled.

Workaround

After a lot of tinkering we came up empty and resorted to re-enabling named pipes. If anyone finds or knows of a solution to this, please leave a comment below.

Note: the following guide is vague by design. It is very difficult to provide a proper step-by-step guide for this workaround covering all alternatives. If you have questions, leave a comment including details about your situation, and I will try to clarify.

To enable named pipes again, we just started sqlservr.exe directly from the command line on the active node. This will usually bring it online. If not, stop the cluster service while you do this. Then we enabled named pipes in Configuration Manager and stopped the instance again. After that, we were able to start the instance as expected, albeit with named pipes enabled.

Warning: If you stop the cluster service on an active node, the cluster may try to fail over the instance to another node if there are other active nodes. Make sure the instance is listed as Failed before you try this. If all else fails, you may have to shut down the other nodes as well, thus taking down all instances in the cluster.

Clustered MSDTC fails

Problem

While setting up a new clustered Distributed Transaction Coordinator for a SQL Server FCI, it failed to come online when restarted. This time it happened after I enabled network DTC access, but I have had this happen a lot during patching and cluster failover. Usually, I would just remove and reinstall, but that did not seem to help this time. No matter what I did, Failover Cluster Manager would just list it as failed:

[Screenshot: Failover Cluster Manager listing the DTC resource as Failed]

Analysis

Looking in Services, I could see the DTC service was disabled:

[Screenshot: the clustered DTC service shown as Disabled in Services]

The GUID in the service name can be matched to the cluster resource in the registry. This is useful if you have more than one DTC in your cluster, as the cluster will not allow several DTC resources with the same name. Additional DTCs are named “New Distributed Transaction Coordinator (1)”, (2) and so on.

[Screenshot: the matching DTC resource GUID in the cluster registry hive]

I tried to enable the service, only to be greeted with a snarky “This service is marked for deletion” message.

[Screenshot: the “This service is marked for deletion” error message]

Then, I tried removing the resource and adding a new one, as this is my standard MO whenever I have trouble with a clustered MSDTC. Doing that I ended up with another “Marked for deletion” DTC service. My next idea was to fail over the instance, but then I thought, what if this had been in production? Thus, I kept on searching for another solution. And the solution turned out to be a simple one…

Solution

Log out ALL user sessions from the active node. This means all, yourself and any disconnected others included. Then log back in again, and bring the DTC resource online.

[Screenshot: the DTC resource back online in Failover Cluster Manager]

And by the way, remember to change the policy for the DTC resource to make sure that such errors don't take down and fail over the entire instance. That could have solved the problem, but it could just as easily have led to the instance failing back and forth until it failed completely. Adding a script or policy that automatically logs out inactive users from the cluster nodes once a day is also a good idea.

[Screenshot: the cluster resource policy settings for the DTC resource]

Event 324 from SQLAgent OpenCluster (reason: 5).

Problem

Overzealous monitoring alerts you to an error logged during a cluster failover, more specifically Event ID 324 from SQLAgent$InstanceName:

[Screenshot: Event ID 324 from SQLAgent$InstanceName, OpenCluster (reason: 5)]

Analysis

As mentioned, this happens during failover, one that otherwise may pass without incident. Further analysis of the Application log shows that recovery isn't finished at the time. The next messages in the log are related to the server starting up and running recovery on the new node. For some reason this takes longer than expected. Maybe there were a lot of transactions in flight at the time of failover, maybe the server or storage is too slow, or maybe you were in the process of installing an update to SQL Server, which may lead to extensive recovery times. Or it may be something completely different. Whatever it was, it caused the cluster service to try to start the SQL Agent before the node was ready. Reason 5 is probably access denied. Thus, the issue could be related to lack of permissions. I have yet to catch one of these early enough to have a cluster debug log containing the time of the error. Analysis of the cluster in question revealed another access related error at about the same time, ESENT Event ID 490:

[Screenshot: ESENT Event ID 490]

This error is related to lack of permissions for the SQL Server engine and Agent runas accounts. Whether or not these accounts should have local admin permissions on the node is a never-ending discussion. I have found, though, that granting the permissions causes far less trouble in a clustered environment than not doing so. There is always another issue, always another patch or feature requiring yet another explicit permission. From a security standpoint, it is easy to argue that the data served by the SQL Server is far more valuable than the local server it runs on. If an attacker is able to gain access to the runas accounts, he already has access to read and change/delete the data. What happens to the local server after that is to me rather irrelevant. But security regulations aren't guaranteed to be either logical or sane.
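
If you are unsure which accounts an instance actually runs as, sys.dm_server_services (SQL Server 2008 R2 SP1 and later) will tell you without a trip to the Services console:

--Service accounts for the engine, agent and related services
SELECT servicename, service_account, status_desc
FROM sys.dm_server_services;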

Solution/Workaround

To solve the permission issue, you can either:

  • Add the necessary local permissions for the runas accounts as discussed in KB2811566 and wait for the next “feature” requiring you to add even more permissions to something else. Also, make sure the Agent account has the proper permissions to your backup folders and make sure you are able to create new databases. Not being able to do so may be caused by the engine account not having the proper permissions to your data/log folders.
  • Add the SQL Server Engine and Agent runas accounts to the local administrators group on the server.

Do NOT grant the runas accounts Domain Admin permissions. Ever.

Regarding the open cluster error:

On the servers I have analyzed with this issue, the log always shows the agent starting successfully within two minutes of the error, and it only happens during failover. I have yet to find it on servers where the permissions issue is not solved (using either method), but I am not 100% sure that they are related. I can however say that the message can safely be ignored as long as the Agent account is able to start successfully after the message.