DNS Operation refused on Cluster Aware Updating resource

Problem

On one of my Hyper-V clusters, Event ID 1196 from FailoverClustering is logged in the system log every fifteen minutes. The event lists the name of the resource and the error message “DNS operation refused”. It is trying to tell me that the cluster is unable to register a network name resource in DNS because the DNS server returned status code 9005, which translates to “Operation refused”. In this case the resource was the CAU network name resource that is part of Cluster Aware Updating.
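If you want a quick look at the resource before digging further, something like the following PowerShell sketch lists the cluster’s network name resources and checks whether a name resolves in DNS. The resource name CAU-MYCLUSTER is just an illustration; use the name from the event.

# List all network name resources in the cluster, including the CAU resource
Get-ClusterResource | Where-Object { $_.ResourceType -like "Network Name" } |
    Format-Table Name, State, OwnerGroup

# Check whether a given name (here a made-up CAU resource name) resolves in DNS
Resolve-DnsName -Name "CAU-MYCLUSTER" -Type A -ErrorAction SilentlyContinue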


Cluster validation complains about KB3005628

Problem

You run failover cluster validation, and the report claims that one or more of the nodes are missing update KB3005628:

[Screenshot: cluster validation report listing KB3005628 as missing]

You try running Windows Update, but KB3005628 is not listed as an available update. You try downloading and installing it manually, but the installer quits without installing anything.

Analysis

KB3005628 is a fix for .NET Framework 3.5, correcting a bug in KB2966827 and KB2966828. The problem is that the cluster node in question does not have the .NET Framework 3.5 feature installed. It did, however, have KB2966828 installed. As this is also a .NET 3.5 update, I wondered how it got installed in the first place. After reading more about KB3005628, it seems that KB2966828 could get installed even if .NET Framework 3.5.1 is not installed.

So far, no matter what I do, the validation report lists KB3005628 as missing on one of the cluster nodes. This may be a bug in the Failover Cluster validator.

Workaround

If the .NET Framework 3.5 feature is not installed, remove KB2966827 and KB2966828 manually from the affected node if they are installed. The validation report will still list KB3005628 as missing, but as the only function of KB3005628 is to remove KB2966827 and KB2966828, this poses no problem.
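If you prefer doing this from PowerShell, a sketch along these lines checks for the two updates and removes them; the KB numbers are the ones discussed above, and a reboot may still be required afterwards.

# Check whether the problematic .NET 3.5 updates are present on this node
Get-HotFix -Id KB2966827, KB2966828 -ErrorAction SilentlyContinue

# Remove them and wait for each uninstall to complete
Start-Process -FilePath "wusa.exe" -ArgumentList "/uninstall /kb:2966827 /quiet /norestart" -Wait
Start-Process -FilePath "wusa.exe" -ArgumentList "/uninstall /kb:2966828 /quiet /norestart" -Wait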

Cluster node fails to join cluster after boot

Problem

During a maintenance window, one of five Hyper-V cluster nodes failed to come out of maintenance mode after a reboot. SCVMM was used to facilitate maintenance mode. The system log shows the following error messages repeatedly:

Service Control Manager Event ID 7024

The Cluster Service service terminated with the following service-specific error:
Cannot create a file when that file already exists.

FailoverClustering Event ID 1070

The node failed to join failover cluster [clustername] due to error code ‘183’.

Service Control Manager Event ID 7031

The Cluster Service service terminated with the following service-specific error:
The Cluster Service service terminated unexpectedly.  It has done this 6377 time(s).  The following corrective action will be taken in 60000 milliseconds: Restart the service.

[Screenshots: the three event log entries quoted above]

Analysis

I had high hopes of a quick fix. The cluster is relatively new, and we had recently changed the network architecture by adding a second switch. Thus, I instructed the tech who discovered the fault to try rebooting and to check that the server was able to reach the other nodes on all its interfaces. That river ran dry quickly though, as the local network proved to be working perfectly.

Looking through the Windows Update log and the KBs for the installed updates did not reveal any clues. Making it even more suspicious, a cluster validation assured me that all nodes had the same updates. Almost. Except for one. Hopeful, I looked closer, but of course, this was some .NET Framework update rollup missing on a different node.

I decided to give up all hope of an impressive five-minute fix and venture into the realm of the cluster log. It is possible to read the cluster log in the event log system, but I highly recommend generating the text file and opening it in Notepad++ or some other editor with search highlighting. I find it a lot easier on the eyes and mind. (Oh, and if you use the event log reader, DO NOT filter out information messages. For some reason, the eureka moment is often hidden in an information message.) The cluster log is a bit like the forbidden forest; it looks scary in daylight and even scarier in the dark during an unscheduled failover. It is easy to get lost down a track interpreting hundreds of “strange” messages, only to discover that they were benign. To make it worse, they covered a timespan of about half a second. The wrong second of course, not the one where the problem actually occurred. To put it mildly, the cluster service is very talkative. Especially so when something is wrong. As event 7031 told us, the cluster service is busy trying to start once a minute. Each try spews out thousands of log messages. The log I was working with had 574 942 lines and covered a timespan of 68 minutes. That is about 8450 lines per service start.
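For reference, the text file can be generated with the Get-ClusterLog cmdlet; the time span and destination folder below are just examples matching my case.

# Generate cluster.log files for every node, covering the last 68 minutes,
# and collect them in C:\Temp on the node where the command is run
Get-ClusterLog -TimeSpan 68 -Destination C:\Temp -UseLocalTime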

Anyway, into the forbidden forest I went with nothing but a couple of event log messages to guide me. After a while, I isolated one cluster service startup attempt with correlated system event log data. I discovered the following error:

ERR   mscs::GumAgent::ExecuteHandlerLocally: AlreadyExists(183)’ because of ‘already exists'([Servername]- LiveMigration Vlan 197)

The sneaky cluster service had tried fooling me into believing that the fault was file system related, when in fact it was a networking mess after all! You may recognize error 183 from system event ID 1070 above. Thus we can conclude with fairly high probability that we have found the culprit. But what does it mean? I checked the name of the adapter in question and its teaming configuration. I ventured even further and checked for disconnected and missing network adapters, but none were to be found, neither in Device Manager nor in the registry. Then it struck me. The problem was staring me straight in the eye. The line above the error message, an innocent-looking INFO message, was referring to the network adapter by an almost identical but nevertheless different name:

INFO [ClNet] Adapter Intel(R) Ethernet 10G 2P X520-k bNDC #2 – VLAN : VLAN197 LiveMigration is still attached to network VLAN197 Live Migration

A short but efficient third-degree interrogation of the tech revealed that the network names had been changed some weeks prior to make them consistent on all cluster nodes. Ensuring network name consistency is in itself a noble task, but one that should be completed before the cluster is formed. It should of course be possible to change network names at any time, but for some reason the cluster service has not been able to persist the network configuration properly. As long as the node is kept running, this poses no problem. However, when the node is rebooted for whatever reason, the cluster service gets lost running around in the forest looking for network adapters. Network adapters that are present but silent, as the cluster service is calling them by the wrong name. I have not been able to figure out exactly what happens, not to mention what caused it in the first place, but I can guess. My guess is that different parts of the persisted cluster configuration came out of sync. This registry key probably links network adapters to their cluster network names:

[Screenshot: the cluster network configuration in the registry]

I have found this fault now on two clusters in as many days, and those are my first encounters with it. I suspect the fault is caused by a bug or “feature” introduced in a recent update to Win2012R2.

Solution

The solution is simple. Or at least I hope it is simple. As usual, I strongly encourage you to seek assistance if you do not fully understand the steps below and their implications. A PowerShell sketch of the procedure follows the list.

  • First you have to disable the cluster service to keep it out of the way. You can do this in services.msc, PowerShell, cmd.exe or through telepathy.
  • Wait for the cluster service to stop if it was running (or kill it manually if you are impatient).
  • Change the name of ALL network adapters that have an IPv4 and/or IPv6 address and are used by the cluster service. Changing the name of only the troublesome adapter mentioned in the log may not be enough. Make sure you do not mess around with any physical teaming members, SAN HBAs, virtual cluster adapters or anything else that is not directly used as a cluster network.
  • [Screenshots: adapter names before and after renaming]
  • Enable the cluster service
  • Wait for the cluster service to start, or start it manually
  • The node should now be able to join the cluster
  • Stop the cluster service again  (properly this time, do not kill it)
  • Change the network adapter names back
  • Start the cluster service again
  • Verify that the node joins the cluster successfully
  • Restart the node to verify that the settings are persisted to disk.
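A rough PowerShell sketch of the list above, with made-up adapter names and the rename-back steps left out for brevity:

# Keep the cluster service out of the way
Set-Service -Name ClusSvc -StartupType Disabled
Stop-Service -Name ClusSvc -ErrorAction SilentlyContinue

# Rename every adapter the cluster service uses (names are placeholders)
Rename-NetAdapter -Name "VLAN197 LiveMigration" -NewName "VLAN197 LiveMigration tmp"
Rename-NetAdapter -Name "Management" -NewName "Management tmp"

# Re-enable and start the cluster service, then verify that the node joins
Set-Service -Name ClusSvc -StartupType Automatic
Start-Service -Name ClusSvc
Get-ClusterNode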

Error 3930 installing SQL 2012 SP2 with CU3 in cluster

Problem

I was patching one of my clusters to SQL 2012 SP2 and SP2 CU3 when something bad happened. This particular cluster is a 3-node cluster with an FCI Primary AOAG replica instance on nodes 1 and 2, and a stand-alone Secondary AOAG replica instance on node 3. Node 3 is used for HADR when the shared storage or other shared infrastructure has an outage.

The update passed QAT with flying colors, but sadly that does not always guarantee a successful production run. My standard patch procedure for this cluster:

  • Patch node 3
  • Patch node 2 (passive FCI node)
  • AOAG failover to node 3, node 3 becomes AOAG Primary
  • FCI failover from node 1 to node 2
  • Patch node 1
  • FCI failover to node 1
  • AOAG failover to node 1

When I tried to fail over the FCI to node 2 (step 4 above), the instance failed. At first, I was worried that the SP2 upgrade process might be lengthy enough to trigger the FCI timeouts. An inspection of the SQL Server error log revealed that this was not the case. Instead, I was the victim of a dreaded master database failure:

2015-01-12 01:28:02.82 spid7s      Database 'master' is upgrading script 'msdb110_upgrade.sql' from level 184552836 to level 184554932.
2015-01-12 01:28:02.82 spid7s      ----------------------------------
2015-01-12 01:28:02.82 spid7s      Starting execution of PRE_MSDB.SQL
2015-01-12 01:28:02.82 spid7s      ----------------------------------
2015-01-12 01:28:02.96 spid7s      Error: 3930, Severity: 16, State: 1.
2015-01-12 01:28:02.96 spid7s      The current transaction cannot be committed and cannot support operations that write to the log file. Roll back the transaction.
2015-01-12 01:28:02.96 spid7s      Error: 912, Severity: 21, State: 2.
2015-01-12 01:28:02.96 spid7s      Script level upgrade for database 'master' failed because upgrade step 'msdb110_upgrade.sql' encountered error 3930, state 1, severity 16. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the 'master' database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion.
2015-01-12 01:28:02.97 spid7s      Error: 3417, Severity: 21, State: 3.
2015-01-12 01:28:02.97 spid7s      Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online.
2015-01-12 01:28:02.97 spid7s      SQL Server shutdown has been initiated
2015-01-12 01:28:02.97 spid7s      SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.

Analysis

In case misbehaving SQL Server instances are able to smell fear, I am glad I was located several miles away from the datacenter at this point in time. While a rebuild of master is certainly doable even in a complex setup such as this, it is not something you want to do at 2 AM without a detailed plan if you don’t have to. Thus, I tried failing the instance back to node 1 (running SP1 CU11). To my amazement it came online straight away. I have seen similar issues reduce clustered instances to an unrecognizable puddle of zeros and ones in a corner on the SAN drive, so this was a welcome surprise. Feeling lucky, I tried another failover to node 2, only to be greeted with another failure and the exact same errors in the log. A quick search revealed several similar issues, but no exact matches and no feasible solutions. The closest was a suggestion to disable replication during the upgrade. As you probably know, AOAG is just replication in a fancy dress, so I went looking for my Disaster Recovery Runbook, which contains ready-made scripts and plans for disabling and re-enabling AOAG. The only problem is that disabling AOAG will take down the AOAG listener, thus disconnecting all clients. Such antics result in grumpy client systems, web service downtime and a lot of paperwork for instance reviews, and are therefore something to avoid if at all possible. Just for the fun of it, I decided to try making node 2 the AOAG Primary during the upgrade. To my astonishment, this worked like a charm. Crisis (and paperwork) averted.

Solution

You have to promote the FCI to AOAG Primary during the upgrade from SP1 to SP2. The upgrade is triggered by failing the FCI over from a node running SP1 to a node running SP2, in my case the failover from node 1 to node 2 after patching node 2.
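In practice, the promotion is just a manual availability group failover executed on the FCI before the FCI itself is moved. A minimal sketch, assuming the SQL PowerShell module is available and that the replica is synchronized in synchronous-commit mode (the instance and availability group names are made up):

# Make the FCI the AOAG primary before failing the FCI over to the freshly patched node
Invoke-Sqlcmd -ServerInstance "FCIVNN\PROD" -Query "ALTER AVAILABILITY GROUP [MyAG] FAILOVER;"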

Sadly, there is no fixed procedure for patching failover cluster instances. Some patches will only install on the active FCI node, and will then continue to patch all nodes automatically. But most patches follow the recipe above, where the passive node(s) are patched first.

This issue will probably not affect “clean” AOAG or FCI clusters where you only apply one technology. If you use FCI with replication on the other hand, you may experience the same issue.

Definitions

AOAG = Always On Availability Group

FCI = Failover Cluster Instance

HADR = High Availability / Disaster Recovery

Networks, teaming and heartbeats for clusters

Introduction

In this guide, a fabric is a separate network infrastructure, be it SAN, WAN or LAN. A network may or may not be connected to a dedicated fabric. Some fabrics have more than one network.

The cluster nodes should be connected to each other over at least two independent networks/fabrics. The more independent the better. Ideally, the networks should share no components at all, but as a minimum they should be connected to separate NICs in the server. Ergo, if you want to use NIC teaming you should have at least 4 physical network ports on at least two separate NICs. The more the merrier, but be aware that as with all other forms of redundancy, higher redundancy equals higher complexity.

If you do not have more than one network port or only one network team, do not add an additional virtual network adapter/vlan for “heartbeat purposes”. The most prevalent network faults today are caused by someone unplugging the wrong cable, deactivating the wrong switch port or other user errors. Having separate vlans over the same physical infrastructure rarely offers any protection from this. You are better off just using the one adapter/team.

Previously, each Windows cluster needed a separate heartbeat network used to detect node failures. From Windows 2008 and newer (and maybe also on 2003) the “heartbeat” traffic is sent over all available networks between the cluster nodes unless we manually block it on specific cluster networks. Thus, we no longer need a separate dedicated heartbeat network, but adding a second network ensures that the cluster will survive failures on the primary network. Some cluster roles such as Hyper-V require multiple networks, so check what the requirements are for your specific implementation.

Quick takeaway

If you are designing a cluster and need a quick no-nonsense guideline regarding networks, here it comes:

  • If you use shared storage, you need at least 3 separate fabrics
  • If you use local storage, you need at least 2 separate fabrics

All but a few clusters I have been troubleshooting have had serious shortcomings and design failures in the networking department. The top problems:

  • Way too few fabrics
  • Mixing storage and network traffic on the same fabric
  • Mixing internal and external traffic on the same fabric
  • Outdated or faulty NIC firmware and drivers
  • Bad, poorly designed NICs from Qlogic and Emulex
  • Converged networking

Do not set yourself up for failure.

IPv6

If you haven’t implemented IPv6 yet in your datacenter, you should disable IPv6 on all cluster nodes. If you don’t, you run a high risk of unnecessary failovers due to IPv6-to-IPv4 conversion mishaps on the failover cluster virtual adapter. As long as IPv6 is active on the server, the failover cluster virtual adapter will use IPv6, even if none of the cluster networks have a valid IPv6 address. This causes all heartbeat traffic to be converted to/from IPv4 on the fly, which sometimes will fail. If you want to use IPv6, make sure all cluster nodes and domain controllers have a valid IPv6 address that is not link-local (fe80::), and make sure you have routers, switches and firewalls that support IPv6 and are configured properly. You will also need IPv6 DNS in the Active Directory domain.

Disabling IPv6

Do NOT disable IPv6 on the network adapters. The protocol binding for IPv6 should be enabled:

[Screenshot: IPv6 protocol binding left enabled on the network adapter]

Instead, use the DisabledComponents registry setting. See Disable IPv6 for details.

[Screenshot: the DisabledComponents registry value]
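A sketch of the registry change from PowerShell; the value 0xFF disables all IPv6 components except the loopback interface, and a reboot is required for it to take effect.

# Disable IPv6 components via DisabledComponents instead of unbinding the protocol
New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters" `
    -Name "DisabledComponents" -PropertyType DWord -Value 0xFF -Force
# Reboot the node afterwards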

Storage networks

If you use IP-based storage like ISCSI, SMB or FCOE, make sure you do not mix it with other traffic. Dedicated physical adapters should always be used for storage traffic. Moreover, if you are one of the unlucky few using FCOE you should seriously consider converting to FC or SMB3.

Hyper-V networks

In a perfect world, you should have six or more separate networks/fabrics for Hyper-V clusters. Sadly though, the world is seldom perfect. The absolute minimum for production clusters is two networks. Using only one network in production will cause nothing but trouble, so please do not try. Determining whether or not to use teaming complicates matters further. As a general guide, I would strongly recommend that you always have a dedicated storage fabric with HA, that is teaming or MPIO, unless you use local storage on the cluster nodes. The storage connection is the most important one in any form of cluster. If the storage connection fails, everything else falls apart in seconds. For the other networks, throughput is more important than high availability. If you have to make a choice between HA and separate fabrics, choose separate fabrics for all networks other than the storage network.

7 Physical networks/fabrics

· Internal/Cluster/CSV (if local)/Heartbeat

· Public network for VMs

· VM Host management

· Live Migration

· 2*Storage (ISCSI, FC, SMB3)

· Backup

5 Physical networks/fabrics

· Internal/Cluster/CSV (if local)/Heartbeat/Live Migration

· Public network for VMs, VM guest management

· VM Host management

· 2*Storage (ISCSI, FC, SMB3)

4 Physical networks/fabrics

· Internal/Live Migration

· Public & Management

· 2*Storage

Example

[Diagram: blade chassis fabric groups and backplanes]

Most blade server chassis today have a total of six fabric backplanes, grouped in three groups where each group connects to a separate adapter in the blade. Thus, each network adapter or FC HBA is connected to two separate fabrics/backplanes. The groups could be named A, B and C, with the fabrics named A1, A2, B1 and so on. Each group should have identical backplanes, that is, the backplane in A1 should be the same model as the backplane in A2.

If we have Fibre Channel (FC) backplanes in group A, and 10G Ethernet backplanes in groups B and C, we have several possible implementations. Group A will always be storage in this example, as FC is a dedicated storage network.

[Diagram: groups B and C teamed, giving 4 fabrics]

Here, we have teaming implemented on both B and C. Thus, we use the 4 networks configuration from above, splitting our traffic into Internal and Public/Management. This implementation may generate some conflicts during Live Migrations, but in return we get High Availability for all groups.

[Diagram: groups B and C split into single ports, giving 5 fabrics]

By splitting groups B and C into two single ports each, we get 5 fabrics and a more granular separation of traffic at the cost of High Availability.

Hyper-V trunk adapters/teams on 2012

If you are using Hyper-V virtual switches bound to a physical port or team on your Hyper-V hosts, Hyper-V Extensible Virtual Switch should be the only bound protocol. Note: Do not change these settings manually; Hyper-V Manager will change the settings automatically when you configure the virtual switch. If you bind the Hyper-V Extensible Virtual Switch protocol manually, creation of the virtual switch may fail.

[Screenshot: protocol bindings for a Hyper-V trunk adapter]

Teaming in Windows 2012

In Windows 2012 we finally got native support for NIC teaming. You access the NIC teaming dialog from Server Manager. You can find a short description of the features here: http://technet.microsoft.com/en-us/library/hh831648.aspx, and a more detailed one here: Windows Server 2012 NIC Teaming (LBFO) Deployment and Management.

Native teaming support rids us of some of the problems related to unstable vendor teaming drivers, and makes setting up NIC teaming a unified experience no matter which NICs you are using. Note: never use NIC teaming on ISCSI networks. Use MPIO instead.
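For illustration, a team can also be created from PowerShell. The adapter names below are placeholders, and you should verify which teaming mode and load balancing algorithm fit your switches before copying this.

# Create a switch-independent team from two physical adapters (names are examples)
New-NetLbfoTeam -Name "Team-Public" -TeamMembers "NIC1", "NIC2" `
    -TeamingMode SwitchIndependent -LoadBalancingAlgorithm TransportPorts

# Verify the team and its members
Get-NetLbfoTeam
Get-NetLbfoTeamMember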

A note on Active/Active teaming

It is possible to use active/active teaming, thus aggregating the bandwidth of two or more adapters to support higher throughput. This is a fantastic technology, especially on 1G Ethernet adapters where bandwidth congestion can become a problem. There is, however, a snag; a lot of professional datacenters have a complete ban on active/active teaming due to years of teaming problems. I have myself been a victim of unstable active/active teams, so I know this to be a real issue. I do think this is less of a problem in Windows 2012 than it was on previous versions, but there may still be configurations that just do not work. The more complex your network infrastructure is, the less likely active/active teaming is to work. Connecting all members in the team to the same switch increases the chance of success. This also makes the team dependent on a single switch of course, but if the alternative is bandwidth congestion or no teaming at all, it does not really matter.

I recommend talking to your local network specialist about teaming before creating a design dependent on active/active teaming.

Using multiple VLANs per adapter or team

It has become common practice to use more than one VLAN per team, or even more than one VLAN per adapter. I do not recommend this for clusters, with the exception of adapters/teams connected to a Hyper-V switch. An especially stupid thing to do is mixing ISCSI traffic with other traffic on the same physical adapter. I have dealt with the aftermath of such a setup, and it does not look pretty unless data corruption is your kind of fun. And if you create a second VLAN just to get an internal network for cluster heartbeat traffic on the same physical adapters you are using for client connections, you are not really achieving anything other than making your cluster more complex. The cluster validation report will even warn you about this, as it will detect more than one interface with the same MAC address.

Verify SMB3 Multichannel on your cluster

To ensure maximum throughput for file clusters and Hyper-V clusters with cluster shared volumes, ensure that SMB multichannel is working. Without it, your file transfers may be running on a single thread/cpu and be less resilient to network problems. See http://blogs.technet.com/b/josebda/archive/2012/05/13/the-basics-of-smb-multichannel-a-feature-of-windows-server-2012-and-smb-3-0.aspx for more background information. SMB multichannel requires Windows 2012 or newer.

SMB multichannel is on by default, but that does not necessarily translate to “works like a charm” by default. The underlying network infrastructure and network adapters have to be configured to support it. In short, you need at least one of the following:

· Multiple NICs

· RSS-capable NICs

· RDMA-capable NICs

· Network teaming

Verify NIC capability detection

Run the following PowerShell command on the client:

Get-SmbClientNetworkInterface

[Screenshot: Get-SmbClientNetworkInterface output]

In this sample output, we have five RSS-enabled interfaces and no RDMA-enabled interfaces. Check that the interfaces you are planning to use for SMB are listed. Teamed interfaces show up in this list as virtual NICs, while the physical NICs that are part of the team are hidden. This behavior is expected.

On the server, use the following PowerShell command. For Hyper-V cluster nodes with CSV, run both the server and client commands.

Get-SmbServerNetworkInterface

[Screenshot: Get-SmbServerNetworkInterface output]

Again, make sure the adapters and IP addresses you have dedicated to SMB traffic are shown in the list with the expected capabilities.

Verify multiple connections

The PowerShell cmdlet Get-SmbMultichannelConnection lists active SMB multichannel connections on the client. You may have to start a large file copy operation before you run the command to get any data. If you add the -IncludeNotSelected option, possible connections that are not selected for use are also listed. In the sample below, you can see that one of the possible connections involves crossing a gateway/firewall from 10.x to 192.x, and is therefore not used.

[Screenshot: Get-SmbMultichannelConnection output]

If you are unable to get any data, run Get-SmbConnection to verify that you have active SMB connections.
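Put together, a quick verification pass could look like this sketch; the file paths are placeholders, and the copy just exists to generate SMB traffic while you look at the connections.

# Kick off a large copy in the background to generate SMB traffic (paths are placeholders)
Start-Job { Copy-Item "\\fileserver\share\big.vhdx" "D:\Temp\big.vhdx" }

# Confirm that SMB connections exist, then inspect the multichannel connections,
# including candidates that were detected but not selected
Get-SmbConnection
Get-SmbMultichannelConnection -IncludeNotSelected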

Enable multichannel in failover cluster manager

For SMB multichannel to be active on a clustered role, be it a scale-out file server or the old-fashioned file server role, client connections have to be enabled on all participating networks. It is best practice to disable client connections on all non-client-facing cluster networks, but if you want to use SMB multichannel on an internal cluster network, for instance for a Hyper-V cluster, you have to enable client connections on the internal network(s). It is also good practice not to have a default gateway on cluster internal networks, unless you are deploying a stretched cluster where the internal cluster traffic also has to cross a gateway. Thus, clients outside the internal cluster network should not be able to access this network anyway due to routing and/or firewall restrictions. That being said, if you are deploying a cluster where the clients are supposed to connect to the clustered file server, you should also create multiple networks accessible from the outside of the cluster. But cluster network design is a huge topic outside the scope of this post. Anyway, make sure Allow clients to connect through this network is enabled in Failover Cluster Manager.

[Screenshot: the cluster network properties dialog with client connections enabled]
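The same setting can be checked and changed from PowerShell. The network name below is a placeholder; Role 3 means cluster and client, 1 means cluster only, 0 means none.

# Show the current cluster network roles
Get-ClusterNetwork | Format-Table Name, Role, Address

# Allow client (SMB) connections on the internal cluster network
(Get-ClusterNetwork -Name "Internal").Role = 3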

Asynchronous AOAG down after restart of primary node

Background

This article is the result of a long day in the woods on a SAR mission turning into an even longer night due to a difficult cluster.

I have many clusters. One of them is a cluster with 3 nodes. 2 of them are running a regular Failover Cluster Instance with shared storage, while the third node has local storage and serves as an AOAG replica for the most critical databases. The Failover Cluster Instance is the primary replica as long as we don’t have an HA/DR scenario where the SAN is down or we have massive hardware issues. This is a very specific setup, but I would not be surprised if this problem could be triggered by restarting the primary replica on any asynchronous AOAG.

Problem

 

During scheduled maintenance, I failed over the instance containing the primary replica from one node to another manually. This is what usually happens:

  • SQL instance is manually moved from node A to node B
  • The AOAG listener cluster resources fail on node A
  • The AOAG listener cluster resources are automatically moved to node B
  • The AOAG listener cluster resources come online on node B

But not this time. This time, the AOAG listener objects came online on node A. Such a thing is not supposed to happen. In my experience, AOAG listeners always stay with the primary node.

Note: I do not recommend this procedure as a standard maintenance procedure. It is always best to make sure that the instance you are restarting is NOT the primary replica of any AOAG.

Anyway, the end result is that SQL Server patiently waits for the AOAG listeners to come online on the correct server. Or perhaps patiently is not the correct word. It spews angry error messages in the logs, and the AOAG dashboard is all red.

[Screenshot: the AOAG dashboard showing errors]

“Availability replica is disconnected”

“Availability replica does not have a healthy role”

[Screenshot: availability replica error details]

Solution

The solution is quite simple. It does, however, require crossing over to the dark side, performing unspeakable dark magic. Magic that should never be performed in the presence of an AOAG listener, much less be performed on the listener itself. But simple it is, as long as you are the kind of person who knows your way around the part of Windows where Failover Cluster Manager dwells. Or is familiar with the sky blue realm of the PowerShell cluster cmdlets. Both will suffice. However, if you are not such a person, if some of the sentences in this article sound like dark incantations overheard in a shady tavern, please seek assistance from someone who is before you proceed down this path in production. Please note that normally, AOAG listener resources should never be manipulated manually. Trying to do so usually just makes the situation even more dire. But here goes:

Take the AOAG listener resources offline manually, and then bring them back online. Doing so should make the resources realize the error of their ways and promptly enter the failed state. The cluster and SQL Server should detect this and take action. All failed listeners should be whisked over to the primary replica node and brought online without any need for further input from you.
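In PowerShell terms, the off/on bounce looks roughly like this; the listener resource name is made up, so look it up with Get-ClusterResource first.

# Find the listener resource(s)
Get-ClusterResource | Where-Object { $_.Name -like "*listener*" }

# Bounce the listener resource (name is illustrative)
Stop-ClusterResource -Name "MyAG_MyListener"
Start-ClusterResource -Name "MyAG_MyListener"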

However, if the listeners are still offline, active on the wrong cluster node, or even worse, locked in a failed state, you still have options. Suggestions are listed by increasing estimated time consumption.

  • Move the listener resources manually to the node containing the primary replica.
  • Shut down all the nodes, then start just one and let SQL Server initialize completely before you start the others.
  • Look for underlying domain and network issues preventing the listener from starting.
  • Destroy and re-create the availability group and listeners.

The server cannot accept TCP connections after disabling named pipes

Problem

I always disable the named pipes protocol on my SQL Servers if they are going to accept network connections from clients. I am not sure if named pipes still poses a threat to my network in a 10Gb Ethernet world, but I have spent a considerable amount of time troubleshooting network latencies caused by named pipes previously. For some background on TCP/IP vs named pipes, check out this article by devpro. Going into SQL Server Configuration Manager and turning it off has become part of my routine, so imagine my surprise when one of the junior DBAs at my workplace told me that this caused a new cluster instance he was installing to fail. I could of course have dismissed it as some strange fault during installation, but a quick survey revealed that he had followed my SQL Server cluster installation manual to the letter. And this happened only in production, not in the identical QA environment. The only error message logged during SQL Server startup was this one:

“Error: 26058, Severity: 16, State: 1. A TCP provider is enabled, but there are no TCP listening ports configured. The server cannot accept TCP connections.”

The failover cluster service does not like SQL Server instances that are not network connectable, so it promptly shut down the instance, followed by a slew of error messages in the event log. If I remember correctly, the default timeout for a SQL Server instance to come online after it is started is 3 minutes. That is, it has to be ready for client connections and respond to the cluster service’s health checks within 3 minutes. This does not include any time needed to recover user databases, so in a physical environment with a functional storage subsystem this poses no problem at all. If you have slow hardware or other problems causing a slow startup, you can increase the timeout through Failover Cluster Manager, but I have personally never had the need to increase it.
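If you ever do need more headroom, the timeout is the PendingTimeout property on the SQL Server cluster resource, specified in milliseconds. A sketch with a made-up instance name:

# Show the current pending timeout (180000 ms = 3 minutes by default)
Get-ClusterResource -Name "SQL Server (INST1)" | Format-List Name, PendingTimeout

# Raise it to 5 minutes
(Get-ClusterResource -Name "SQL Server (INST1)").PendingTimeout = 300000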

So far, I have only seen this issue in the following scenario:

  • MSSQL 2012 SP1
  • Windows 2012
  • Failover cluster
  • >1 instance of SQL Server in the cluster

But I have been unable to reproduce the problem consistently in my lab. I have otherwise equal cluster environments, where some have this issue and some do not.

Steps to reproduce:

  • Install instance 1
  • Apply patches to Instance 1
  • Test failover of instance 1
  • Disable named pipes on instance 1
  • Restart Instance 1
  • Install instance 2
  • Apply patches to instance 2
  • Test Failover of instance 2
  • Disable named pipes on instance 2
  • Restart instance 2
  • Instance 2 fails

Analysis

We spent quite a lot of time comparing the working cluster with the failing one, down to firmware versions, drivers and patches applied. Everything was equal. I even considered a local disk fault, but failing over to another node didn’t change anything. The registry settings also looked OK. On a cluster, the registry settings are only available on the node currently owning the instance. The network settings are found in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL11.[Instance name]\MSSQLServer\SuperSocketNetLib\.

[Screenshot: the SuperSocketNetLib registry settings]

WARNING: NEVER try to change anything directly in the registry on a clustered instance, especially one that is not working. See Technet for a guide to changing clustered registry settings on an offline SQL instance.

As the sample shows, we even had Listen All enabled.
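For the record, the values can be inspected read-only from PowerShell on the node that currently owns the instance; the instance ID below is a placeholder, and per the warning above you should not write to these keys.

# Read-only inspection of the network protocol settings for a clustered instance
$base = "HKLM:\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL11.INST1\MSSQLServer\SuperSocketNetLib"
Get-ChildItem $base | ForEach-Object { $_.Name; Get-ItemProperty $_.PSPath }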

Workaround

After a lot of tinkering we came up empty and resorted to re-enabling Named Pipes. If anyone finds or knows of a solution to this, please leave a comment below.

Note: the following guide is vague by design. It is very difficult to provide a proper step by step guide for this workaround covering all alternatives. If you have questions, leave a comment including details about your situation, and I will try to clarify.

To enable named pipes again, we just started SQL Server directly from the command line on the active node. This will usually bring it online. If not, stop the cluster service while you do this. Then we enabled named pipes in Configuration Manager and stopped the instance again. After that, we were able to start the instance as expected, albeit with named pipes enabled.

Warning: If you stop the cluster service on an active node, the cluster may try to fail over the instance to another node if there are other active nodes. Make sure the instance is listed as Failed before you try this. If all else fails, you may have to shut down the other nodes as well, thus taking down all instances in the cluster.
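For reference, starting the instance manually from the command line looks something like this; the installation path and instance name are assumptions, so adjust them to your environment.

# Start the clustered instance manually in the foreground; stop it with Ctrl+C when done
& "C:\Program Files\Microsoft SQL Server\MSSQL11.INST2\MSSQL\Binn\sqlservr.exe" -sINST2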

Clustered MSDTC fails

Problem

While setting up a new clustered Distributed Transaction Coordinator for a SQL Server FCI, it fails to come online when restarted. This time it happened after I enabled network DTC access, but I have had this happen a lot during patching and cluster failover. Usually, I would just remove and reinstall, but that didn’t seem to help this time. No matter what I did, Failover Cluster Manager would just list it as failed:

[Screenshot: the DTC resource listed as Failed in Failover Cluster Manager]

Analysis

Looking in Services, I could see the DTC service was disabled:

[Screenshot: the DTC service shown as Disabled in Services]

The GUID in the service name can be matched to the cluster resource in the registry. This is useful if you have more than one DTC in your cluster, as Failover Cluster Manager will not allow you to have several DTC resources with the same name. Additional DTCs are named “New Distributed Transaction Coordinator (1)”, (2) and so on.

[Screenshot: matching the service GUID to the cluster resource in the registry]
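A hedged way of lining up the GUIDs from PowerShell; the exact service naming can vary, so treat this as a starting point rather than a recipe.

# Clustered DTC resources and their GUIDs
Get-ClusterResource | Where-Object { $_.ResourceType -like "Distributed Transaction Coordinator" } |
    Format-List Name, Id, State

# Locally installed clustered DTC service instances (the GUID is part of the name)
Get-Service -Name "MSDTC*" | Format-List Name, DisplayName, Status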

I tried to enable the service, only to be greeted with a snarky “This service is marked for deletion” message.

[Screenshot: the “This service is marked for deletion” error message]

Then I tried removing the resource and adding a new one, as this is my standard MO whenever I have trouble with a clustered MSDTC. Doing that, I ended up with another “marked for deletion” DTC service. My next idea was to fail over the instance, but then I thought, what if this had been in production? Thus, I kept on searching for another solution. And the solution turned out to be a simple one…

Solution

Log out ALL user sessions from the active node. This means all, yourself and any disconnected others included. Then log back in again, and bring the DTC resource online.

[Screenshot: the DTC resource back online]
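Something like this sketch lists the sessions, logs them off and brings the resource back online; the session ID and resource name are illustrative, so check the quser output and your own resource name first.

# List all sessions on the node, including disconnected ones
quser

# Log off a session by its ID (replace 2 with an ID from the quser output)
logoff 2

# Bring the DTC resource back online (resource name is illustrative)
Start-ClusterResource -Name "New Distributed Transaction Coordinator"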

And by the way, remember to change the policy for the DTC resource to make sure that such errors don’t take down and fail over the entire instance. Failing over could have solved the problem, but it could just as easily lead to the instance failing back and forth until it fails completely. Adding a script or policy that automatically logs out inactive users from the cluster nodes once a day is also a good idea.

[Screenshot: the resource policies dialog]
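The policy can also be inspected from PowerShell. The sketch below only reads the relevant properties; which values you end up setting is a judgment call, so verify them against the Failover Cluster Manager dialog shown above.

# Inspect the restart/failover policy of the clustered DTC resource (name is illustrative)
Get-ClusterResource -Name "New Distributed Transaction Coordinator" |
    Format-List Name, RestartAction, RestartDelay, RestartPeriod, RestartThreshold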

Event 324 from SQLAgent OpenCluster (reason: 5).

Problem

Overzealous monitoring alerts you to an error logged during a cluster failover, more specifically Event ID 324 from SQLAgent$InstanceName:

[Screenshot: Event ID 324 from SQLAgent]

Analysis

As mentioned, this happens during failover, one that otherwise may pass without incident. Further analysis of the Application log shows that recovery isn’t done at the time. The next messages in the log are related to the server starting up and running recovery on the new node. For some reason this takes longer than expected. Maybe there were a lot of transactions in flight at the time of failover, maybe the server or storage is too slow, or maybe you were in the process of installing an update to SQL Server, which may lead to extensive recovery times. Or it may be something completely different. Whatever it was, it caused the cluster service to try to start the SQL Agent before the node was ready. Reason 5 is probably access denied. Thus, the issue could be related to lack of permissions. I have yet to catch one of these early enough to have a cluster debug log containing the time of the error. Analysis of the cluster in question revealed another access-related error at about the same time, ESENT Event ID 490:

[Screenshot: ESENT Event ID 490]

This error is related to lack of permissions for the SQL Server engine and Agent runas accounts. Whether or not these accounts should have local admin permissions on the node is a never-ending discussion. I have found, though, that granting the permissions causes far less trouble in a clustered environment than not doing so. There is always another issue, always another patch or feature requiring yet another explicit permission. From a security standpoint, it is easy to argue that the data served by the SQL Server is far more valuable than the local server it runs on. If an attacker is able to gain access to the runas accounts, he already has access to read and change/delete the data. What happens to the local server after that is to me rather irrelevant. But security regulations aren’t guaranteed to be either logical or sane.

Solution/Workaround

To solve the permission issue, you can either:

  • Add the necessary local permissions for the runas accounts as discussed in KB2811566, and wait for the next “feature” requiring you to add even more permissions to something else. Also, make sure the Agent account has the proper permissions to your backup folders, and make sure you are able to create new databases. Not being able to do so may be caused by the engine account not having the proper permissions to your data/log folders.
  • Add the SQL Server Engine and Agent runas accounts to the local administrators group on the server.

Do NOT grant the runas accounts Domain Admin permissions. Ever.
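If you go for the second option, the group membership can be added like this; the account names are placeholders.

# Add the SQL Server engine and Agent runas accounts to the local Administrators group
net localgroup Administrators "DOMAIN\svc-sqlengine" /add
net localgroup Administrators "DOMAIN\svc-sqlagent" /add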

Regarding the open cluster error:

On the servers I have analyzed with this issue, the log always shows the agent starting successfully within two minutes of the error, and it only happens during failover. I have yet to find it on servers where the permissions issue is not solved (using either method), but I am not 100% sure that they are related. I can, however, say that the message can safely be ignored as long as the Agent is able to start successfully after the message.