The server cannot accept TCP connections after disabling named pipes

Problem

I always disable the named pipes protocol on my SQL Servers if they are going to accept network connections from clients. I am not sure whether named pipes still poses a threat to my network in a 10GbE world, but I have previously spent a considerable amount of time troubleshooting network latencies caused by named pipes. For some background on TCP/IP vs named pipes, check out this article by devpro. Going into SQL Server Configuration Manager and turning it off has become part of my routine, so imagine my surprise when one of the junior DBAs at my workplace told me that this caused a new cluster instance he was installing to fail. I could of course have dismissed it as some strange fault during installation, but a quick survey revealed that he had followed my SQL Server cluster installation manual to the letter. And this happened only in production, not in the identical QA environment. The only error message logged during SQL Server startup was this one:

“Error: 26058, Severity: 16, State: 1. A TCP provider is enabled, but there are no TCP listening ports configured. The server cannot accept TCP connections.”

The failover cluster service does not like SQL Server instances that are not network connectable, so it promptly shut down the instance, followed by a slew of error messages in the event log. If I remember correctly, the default timeout for a SQL Server instance to come online after it is started is 3 minutes. That is, it has to be ready for client connections and respond to the cluster service's health checks within 3 minutes. This does not include any time needed to recover user databases, so in a physical environment with a functional storage subsystem this poses no problem at all. If you have slow hardware or other problems causing a slow startup, you can increase the timeout through Failover Cluster Manager (or PowerShell, see the sketch below), but I have personally never had the need to increase it.
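If you would rather script it, the timeout can be read and changed with the FailoverClusters PowerShell module. A minimal sketch, assuming an example resource name "SQL Server (INST1)"; PendingTimeout is expressed in milliseconds:

Import-Module FailoverClusters
# Show the current pending timeout (the default is 180000 ms = 3 minutes)
$resource = Get-ClusterResource "SQL Server (INST1)"   # example resource name
$resource.PendingTimeout
# Raise it to 5 minutes
$resource.PendingTimeout = 300000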

So far, I have only seen this issue in the following scenario:

  • MSSQL 2012 SP1
  • Windows 2012
  • Failover cluster
  • >1 instance of SQL Server in the cluster

But I have been unable to reproduce the problem consistently in my lab. I have otherwise identical cluster environments where some have this issue and some do not.

Steps to reproduce:

  • Install instance 1
  • Apply patches to Instance 1
  • Test failover of instance 1
  • Disable named pipes on instance 1 (through SQL Server Configuration Manager; see the scripted alternative after this list)
  • Restart Instance 1
  • Install instance 2
  • Apply patches to instance 2
  • Test Failover of instance 2
  • Disable named pipes on instance 2
  • Restart instance 2
  • Instance 2 fails
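For the record, we did the disabling through SQL Server Configuration Manager, but the same change can be scripted through the SMO WMI provider. A rough sketch, assuming SQL Server 2012 (SMO 11.0) and an example instance name INST1:

# Load the SQL Server WMI management assembly (SQL Server 2012 / SMO 11.0)
[void][System.Reflection.Assembly]::LoadWithPartialName("Microsoft.SqlServer.SqlWmiManagement")
$wmi = New-Object Microsoft.SqlServer.Management.Smo.Wmi.ManagedComputer
# INST1 is an example instance name
$np = $wmi.ServerInstances["INST1"].ServerProtocols["Np"]
$np.IsEnabled = $false
$np.Alter()
# The instance must be restarted for the change to take effect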

Analysis

We spent quite a lot of time comparing the working cluster with the failing one, down to firmware versions, drivers and patches applied. Everything was identical. I even considered a local disk fault, but failing over to another node didn't change anything. The registry settings also looked OK. On a cluster, the registry settings are only available on the node currently owning the instance. The network settings are found in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL11.[Instance name]\MSSQLServer\SuperSocketNetLib\.

[Screenshot: the SuperSocketNetLib registry settings on the failing instance]

WARNING: NEVER try to change anything directly in the registry on a clustered instance, especially one that is not working. See TechNet for a guide to changing clustered registry settings on an offline SQL Server instance.

As the sample shows, we even had Listen All enabled.
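If you want to inspect (not change!) these values from PowerShell on the owning node, a read-only query along these lines should be safe. MSSQL11.INST1 is an example instance ID, and ListenOnAllIPs is the registry value behind the Listen All setting in Configuration Manager:

$base = 'HKLM:\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL11.INST1\MSSQLServer\SuperSocketNetLib'
# Read-only inspection of the network protocol settings
Get-ItemProperty -Path "$base\Tcp" | Select-Object Enabled, ListenOnAllIPs
Get-ItemProperty -Path "$base\Np"  | Select-Object Enabled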

Workaround

After a lot of tinkering we came up empty and resorted to re-enabling named pipes. If anyone finds or knows of a solution to this, please leave a comment below.

Note: the following guide is vague by design. It is very difficult to provide a proper step-by-step guide for this workaround that covers all alternatives. If you have questions, leave a comment including details about your situation, and I will try to clarify.

To enable named pipes again, we just started sqlservr.exe directly from the command line on the active node. This will usually bring the instance online. If not, stop the cluster service while you do this. Then we enabled named pipes in Configuration Manager and stopped the instance again. After that, we were able to start the instance through the cluster as expected, albeit with named pipes enabled.
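Roughly, the command-line start looks like this. This is a sketch only; the path and the instance name INST1 are examples, and for a default instance you would drop the -s argument:

# Run from an elevated prompt on the node currently owning the disk resources
cd "C:\Program Files\Microsoft SQL Server\MSSQL11.INST1\MSSQL\Binn"
.\sqlservr.exe -sINST1    # runs in the console window; Ctrl+C shuts it down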

Warning: If you stop the cluster service on an active node, the cluster may try to fail over the instance to another node if there are other active nodes. Make sure the instance is listed as Failed before you try this. If all else fails, you may have to shut down the other nodes as well, thus taking down all instances in the cluster.

Event 324 from SQLAgent OpenCluster (reason: 5)

Problem

Overzealous monitoring alerts you to an error logged during a cluster failover, more specifically Event ID 324 from SQLAgent$InstanceName:

[Screenshot: Event ID 324 from SQLAgent$InstanceName, OpenCluster (reason: 5)]

Analysis

As mentioned, this happens during a failover that otherwise may pass without incident. Further analysis of the Application log shows that recovery isn't finished at the time. The next messages in the log are related to the server starting up and running recovery on the new node. For some reason this takes longer than expected. Maybe there were a lot of transactions in flight at the time of failover, maybe the server or storage is too slow, or maybe you were in the process of installing an update to SQL Server, which may lead to extensive recovery times. Or it may be something completely different. Whatever it was, it caused the cluster service to try to start the SQL Agent before the node was ready. Reason 5 is probably access denied, so the issue could be related to lack of permissions. I have yet to catch one of these early enough to have a cluster debug log covering the time of the error. Analysis of the cluster in question revealed another access related error at about the same time, ESENT Event ID 490:

[Screenshot: ESENT Event ID 490, access denied]

This error is related to lack of permissions for the SQL Server engine and Agent runas accounts. Whether or not these accounts should have local admin permissions on the node is a never ending discussion. I have found, though, that granting the permissions causes far less trouble in a clustered environment than not doing so. There is always another issue, always another patch or feature requiring yet another explicit permission. From a security standpoint, it is easy to argue that the data served by the SQL Server is far more valuable than the local server it runs on. If an attacker is able to gain access to the runas accounts, he already has access to read and change/delete the data. What happens to the local server after that is to me rather irrelevant. But security regulations aren't guaranteed to be either logical or sane.
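Regarding the cluster debug log: it is generated on demand from the cluster's trace buffers, so you have to dump it before the window of interest rotates out. Something like this should do the trick; -TimeSpan is in minutes:

Import-Module FailoverClusters
# Dump the cluster debug log for the last 15 minutes from every node to C:\Temp
Get-ClusterLog -TimeSpan 15 -Destination C:\Temp -UseLocalTime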

Solution/Workaround

To solve the permission issue, you can either:

  • Add the necessary local permissions for the runas accounts as discussed in KB2811566, and wait for the next “feature” requiring you to add even more permissions to something else. Also, make sure the Agent account has the proper permissions to your backup folders, and make sure you are able to create new databases. Not being able to do so may be caused by the engine account lacking permissions to your data/log folders.
  • Add the SQL Server engine and Agent runas accounts to the local Administrators group on the server (see the sketch below).

Do NOT grant the runas accounts Domain Admin permissions. Ever.
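If you go the local Administrators route, remember that the change must be made on every node in the cluster. A minimal sketch, using hypothetical runas account names (CONTOSO\sqlengine and CONTOSO\sqlagent):

# Run elevated on each node in the cluster
net localgroup Administrators "CONTOSO\sqlengine" /add
net localgroup Administrators "CONTOSO\sqlagent" /add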

Regarding the open cluster error:

On the servers I have analyzed with this issue, the log always shows the agent starting successfully within two minutes of the error, and it only happens during failover. I have yet to find it on servers where the permissions issue has been solved (using either method), but I am not 100% sure that they are related. I can however say that the message can safely be ignored as long as the Agent is able to start successfully after the message.

Does your cluster have the recommended hotfixes?

From time to time, the good people at Microsoft publish a list of problems with failover clustering that have been resolved. This list, as all such bad/good news, comes in the form of a KB, namely KB2784261. I check this list from time to time. Some of the fixes relate to a specific issue, while others are more of the go-install-them-at-once type. As a general rule, I recommend installing ALL the hotfixes, regardless of the attached warning telling you to only install them if you experience a particular problem. In my experience, hotfixes are at least as stable as regular patches, if not better. That being said, sooner or later you will run across patches or hotfixes that will make a mess and give you a bad or very bad day. But then again, that is why cluster admins always fight for funding of a proper QA/test environment. Preferably one that is equal to the production system in every way possible.

Anyways, this results in having to check all my servers to see if they have the hotfixes installed. Luckily some are included in Microsoft Update, but some you have to install manually. To simplify this process, I made the following PowerShell script. It takes a list of hotfixes and returns a list of the ones that are missing from the system. The script could easily be adapted to run against several servers at once (see the sketch after the script), but I have to battle way too many internal firewalls to attempt such dark magic. Be aware that some hotfixes have multiple KB numbers and may not show up even if they are installed. This usually happens when updates are bundled together as a cumulative package or superseded by a new version. The best way to test if patch/hotfix X needs to be installed is to try to install it. The installer will tell you whether or not the patch is applicable.

Edit: Since the original, I have added KB lists for 2008 R2 and 2012 R2 based clusters. All you have to do is replace the ”$recommendedPatches =” list with the one you need. Links to the correct KB list are included for each OS. I have also discovered that some of the hotfixes are available through the Microsoft Update Catalog, thus bypassing the captcha/email hurdle.

2012 version

$menucolor = [System.ConsoleColor]::Gray
write-host "╔═══════════════════════════════════════════════════════════════════════════════════════════╗" -ForegroundColor $menucolor
write-host "║                              Identify missing patches                                      ║" -ForegroundColor $menucolor
write-host "║                              Jan Kåre Lokna - lokna.no                                     ║" -ForegroundColor $menucolor
write-host "║                                       v 1.2                                                ║" -ForegroundColor $menucolor
write-host "║                                  Requires elevation: No                                    ║" -ForegroundColor $menucolor
write-host "╚═══════════════════════════════════════════════════════════════════════════════════════════╝" -ForegroundColor $menucolor
#List your patches here. Updated list of patches at http://support.microsoft.com/kb/2784261
$recommendedPatches = "KB2916993", "KB2929869", "KB2913695", "KB2878635", "KB2894464", "KB2838043", "KB2803748", "KB2770917"

$missingPatches = @()
foreach ($patch in $recommendedPatches) {
    # -ErrorAction SilentlyContinue suppresses the red error for hotfixes that are not installed
    if (!(Get-HotFix -Id $patch -ErrorAction SilentlyContinue)) {
        $missingPatches += $patch
    }
}
$intMissing = $missingPatches.Count
$intRecommended = $recommendedPatches.Count
Write-Host "$env:COMPUTERNAME is missing $intMissing of $intRecommended patches:"
$missingPatches
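As mentioned, a multi-server variant is straightforward if the firewalls cooperate. An untested sketch reusing the $recommendedPatches list from above; NODE1 and NODE2 are example server names, and Get-HotFix must be able to reach them over DCOM/WMI:

$servers = "NODE1", "NODE2"
foreach ($server in $servers) {
    # Collect the recommended patches that are not installed on the remote server
    $missing = @($recommendedPatches | Where-Object { !(Get-HotFix -Id $_ -ComputerName $server -ErrorAction SilentlyContinue) })
    Write-Host "$server is missing $($missing.Count) of $($recommendedPatches.Count) patches:"
    $missing
}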

2008R2 Version

A list of recommended patches for Win 2008 R2 can be found here: KB2545685

$recommendedPatches = "KB2531907", "KB2550886","KB2552040", "KB2494162", "KB2524478", "KB2520235"

2012 R2 Version

A list of recommended patches for Win 2012 R2 can be found here: KB2920151

#All clusters
$recommendedPatches = "KB3130944", "KB3137691", "KB3139896", "KB3130939", "KB3123538", "KB3091057", "KB3013769", "KB3000850", "KB2919355"
#Hyper-V Clusters
$recommendedPatches = "KB3130944", "KB3137691", "KB3139896", "KB3130939", "KB3123538", "KB3091057", "KB3013769", "KB3000850", "KB2919355", "KB3090343", "KB3060678", "KB3063283", "KB3072380"

Hyper-V

If you are using the Hyper-V role, you can find additional fixes for 2012 R2 in KB2920151, listed below the general cluster hotfixes. If you use NVGRE, look at this list as well: KB2974503

Sample output (computer name redacted)

[Screenshot: sample output of the script]

Edit:

I have finally updated my script to remove those pesky red error messages seen in the sample above.

Limit the size of the error log

Problem

Something bad happens, and the error log is bloated with error messages very fast, triggering low disk space alarms and eventually filling the drive to the brim. For a concrete example, see https://lokna.no/?p=1263.

Analysis

Even though I have spent considerable time searching for a solution to this, so far I have been at a loss. My best bet has been limiting the number of files. If the default limit of 6 isn't changed, there is no corresponding entry in the registry. On clustered instances, errors such as the ones mentioned above can cause a failover roundabout, and I have seen several hundred log files being generated (on SQL Server 2008 R2). Changing the limit adds a registry value, and this seems to make SQL Server enforce the limit.
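To check whether the values are set on a given instance, a read-only query like this should work. MSSQL11.INST1 is an example instance ID; if NumErrorLogs is absent, the default of 6 applies, and as far as I know ErrorLogSizeInKb is only honored by SQL Server 2012 and later:

# Read the instance-specific error log settings (both values may be absent)
$key = 'HKLM:\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL11.INST1\MSSQLServer'
Get-ItemProperty -Path $key | Select-Object NumErrorLogs, ErrorLogSizeInKb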

This doesn’t solve the most pressing issue though. If some sort of error starts pushing several log messages every second, a small base volume with, say, 10GiB of free space is filled to the brim in hours if not minutes. Thus, when you are woken up by the low disk space alarm it is already too late to stop it, and you run the risk of NTFS corruption. In my experience, the chance of NTFS corruption rises sharply once you pass 5% free space on a volume. 0 bytes of free space with processes still trying to write more data to the volume more or less guarantees some kind of NTFS corruption unless you catch it quickly.

Continue reading “Limit the size of the error log”

Weak event created

Problem

In Windows Failover Cluster Manager, clicking on a node in the tree will raise the following error:

[Screenshot: the weak event error message]

This is one of the biggest error messages I have ever seen in a Microsoft product, that is, regarding the size of the message box :). The error states: “A weak event was created and it lives on the wrong object, there is a very high chance this will fail, please review your code to prevent the issue” and goes on with a .NET call stack.

Analysis

This feels like a .NET Framework issue to me, and a quick search rustled up the following post on TechNet: http://blogs.technet.com/b/askcore/archive/2013/01/14/error-in-failover-cluster-manager-after-install-of-kb2750149.aspx, stating that this is a bug caused by KB2750149.

Solution

Install KB2803748, available here: http://support.microsoft.com/kb/2803748

To check if either or both of the patches are installed, run the following PowerShell commands:

Get-HotFix -id KB2750149
Get-HotFix -id KB2803748

How SQL Server 2012 Service Pack 1 destroyed my life

Or to be more exact: seven days of my life, with every waking hour spent troubleshooting and finally reinstalling servers.

Problem

Some time after you install MSSQL 2012 SP1, strange things start to happen on the server. Maybe you are unable to start Management Studio, or if you’re really unfortunate you can’t log in to the server at all. Maybe random services start to fail, Windows Update refuses to apply updates, and if you boot the server it might refuse to start SQL Server at all, or just greet you with a friendly BSOD. But a couple of days before, everything was hunky-dory, and you are absolutely 100% sure nothing has changed since then. Yet the server fails, and a couple of days later another one goes down with the same symptoms. It’s as if they have contracted ebola or the swine flu or some other strange virus. It seems to be attacking the entire server all at once, and the only common denominator is that they are all running MSSQL 2012 SP1.

Continue reading “How SQL Server 2012 Service Pack 1 destroyed my life”

Orchestrator bug kills database server

Update

20131031: Code to limit the size of the error log files added to the solution. So far only tested on SQL 2012 SP1 CU2.

Problem

The database server/operating system starts complaining about low disk space. Analysis reveals that the volume containing the default MSSQL log folder is very low on space, maybe even 0 bytes free. Further analysis shows that the ERRORLOG files are taking up all remaining space on the volume.

Analysis

The ERRORLOG files are filled to the brim with messages like this one:

Date		07.01.2013 21:09:09
Log		SQL Server (Current - 07.01.2013 14:57:00)

Source		spid29s

Message
The activated proc '[Microsoft.SystemCenter.Orchestrator.Maintenance].[MaintenanceWorker]' running on queue 'OrchestratorDB.Microsoft.SystemCenter.Orchestrator.Maintenance.MaintenanceServiceQueue' output the following:  'Error: 9002, The transaction log for database 'OrchestratorDB' is full. To find out why space in the log cannot be reused, see the log_reuse_wait_desc column in sys.databases'


Orchestrator has a maintenance process that does not handle a situation where the database transaction log is unable to grow, either due to restrictions or due to the fact that the disk is full. When this happens, it bloats the ERRORLOG with the messages mentioned above. This happens very quickly, as it keeps retrying several times each second, so even a 10GB volume is filled in hours. If this situation is left unchecked for a couple of days, the risk of NTFS volume corruption is high.

Solution

  • Disable access to the Orchestrator database or stop the Orchestrator servers to prevent further bloating.
  • Stop the database instance
  • Delete the bloated error log to free up space
  • Restart the instance
  • Change the number of error logs SQL Server is allowed to create (default is 6, the sample limits it to 10) and limit the size of the error log files to 50MB each. This will give you a total of 11 error log files (the current log plus 10 archives), each with a max size of about 50 000 KB, for a total of 550MB.
    USE [master]
    GO
    EXEC xp_instance_regwrite N'HKEY_LOCAL_MACHINE',
     N'Software\Microsoft\MSSQLServer\MSSQLServer',
     N'NumErrorLogs', REG_DWORD, 10
    GO
    EXEC xp_instance_regwrite N'HKEY_LOCAL_MACHINE',
     N'Software\Microsoft\MSSQLServer\MSSQLServer', 
     N'ErrorLogSizeInKb', REG_DWORD, 50000
    GO
  • Solve the problem regarding the Orchestrator DB transaction log. Usually such problems are caused by either a failed backup job (thus no log truncation) or the simple fact that you need to allocate more space for the data. If a failed backup job is the problem, consider shrinking the transaction logs back down to their normal size. See https://lokna.no/?p=349 for more information about shrinking troublesome logs (Norwegian version only).
  • Enable access to the Orchestrator database.
  • Monitor the error logs and see if the values you chose for log size and number of logs fit your workload (see the sketch after this list). If the logs contain at least a day of messages, but not more than a month, you are in the butter zone.
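A quick way to eyeball this is to list the log files with their timestamps and sizes. A sketch, assuming the default log directory of an example instance ID MSSQL11.INST1:

# List the ERRORLOG files, oldest first, with size in KB
$logDir = 'C:\Program Files\Microsoft SQL Server\MSSQL11.INST1\MSSQL\Log'
Get-ChildItem "$logDir\ERRORLOG*" |
    Sort-Object LastWriteTime |
    Select-Object Name, LastWriteTime, @{ n = 'SizeKB'; e = { [int]($_.Length / 1KB) } }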

Permission error installing Failover Cluster instance

Problem

While testing out MSSQL 2012 AlwaysOn Failover Clustering in my lab, I stumbled upon a strange error which I have never seen before: “Updating permission settings for file [Shared drive]\[Mountpoint]\System Volume Information\ResumeKeyFilter.store failed”. This happened for all drives that were mounted as a mount point (folder) instead of a drive letter.

Continue reading “Permission error installing Failover Cluster instance”