What SMB version is actually used?

To verify which SMB version is in use for a specific file share or connection, run the following PowerShell command:

Get-SmbConnection | Select-Object ShareName, Dialect

You can run this command on both the client and the server. A client/server connection will use the highest version supported by both ends: if the client supports up to v3.02 but the server only supports up to v3.00, v3.00 will be used for the connection.

The Get-SmbConnection cmdlet returns several other properties; use Select-Object * to list them all.
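For reference, a couple of variations; the property names below are those returned by Get-SmbConnection on 2012 R2:

# List every property of the current SMB connections
Get-SmbConnection | Select-Object *
# A slightly more focused view
Get-SmbConnection | Select-Object ServerName, ShareName, UserName, Dialect, NumOpens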

Sample output

[Screenshot: Get-SmbConnection output showing ShareName and Dialect]

This is from a Win2012R2 client, connected to a share on a Win2012 cluster with multichannel support.

Networks, teaming and heartbeats for clusters

Introduction

In this guide, a fabric is a separate network infrastructure, be it SAN, WAN or LAN. A network may or may not be connected to a dedicated fabric. Some fabrics have more than one network.

The cluster nodes should be connected to each other over at least two independent networks/fabrics. The more independent the better. Ideally, the networks should share no components at all, but as a minimum they should be connected to separate NICs in the server. Ergo, if you want to use NIC teaming you should have at least 4 physical network ports on at least two separate NICs. The more the merrier, but be aware that as with all other forms of redundancy, higher redundancy equals higher complexity.

If you do not have more than one network port or only one network team, do not add an additional virtual network adapter/vlan for “heartbeat purposes”. The most prevalent network faults today are caused by someone unplugging the wrong cable, deactivating the wrong switch port or other user errors. Having separate vlans over the same physical infrastructure rarely offers any protection from this. You are better off just using the one adapter/team.

Previously, each Windows cluster needed a separate heartbeat network used to detect node failures. On Windows 2008 and newer (and possibly also on 2003), the “heartbeat” traffic is sent over all available networks between the cluster nodes unless we manually block it on specific cluster networks. Thus, we no longer need a separate dedicated heartbeat network, but adding a second network ensures that the cluster will survive failures on the primary network. Some cluster roles such as Hyper-V require multiple networks, so check what the requirements are for your specific implementation.
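If you want a quick overview of the networks a cluster knows about and what traffic is allowed on each of them, here is a small PowerShell sketch (requires the FailoverClusters module on the node):

Import-Module FailoverClusters
# Role 0 = no cluster traffic, 1 = cluster traffic only, 3 = cluster and client traffic
Get-ClusterNetwork | Format-Table Name, Role, Address, State -AutoSize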

Quick takeaway

If you are designing a cluster and need a quick no-nonsense guideline regarding networks, here it comes:

  • If you use shared storage, you need at least 3 separate fabrics
  • If you use local storage, you need at least 2 separate fabrics

All but a few of the clusters I have been called in to troubleshoot have had serious shortcomings and design failures in the networking department. The top problems:

  • Way too few fabrics
  • Mixing storage and network traffic on the same fabric
  • Mixing internal and external traffic on the same fabric
  • Outdated faulty NIC firmware and drivers
  • Bad, poorly designed NICs from Qlogic and Emulex
  • Converged networking

Do not set yourself up for failure.

IPv6

If you haven’t implemented IPv6 yet in your datacenter, you should disable IPv6 on all cluster nodes. If you don’t, you run a high risk of unnecessary failovers due to IPv6 to IPv4 conversion mishaps on the failover cluster virtual adapter. As long as IPv6 is active on the server, the failover cluster virtual adapter will use IPv6, even if none of the cluster networks have a valid IPv6 address. This causes all heartbeat traffic to be converted to/from IPv4 on the fly, which sometimes fails. If you want to use IPv6, make sure all cluster nodes and domain controllers have a valid IPv6 address that is not link local (fe80::), and make sure you have routers, switches and firewalls that support IPv6 and are configured properly. You will also need IPv6 DNS in the Active Directory domain.

Disabling IPv6

Do NOT disable IPv6 on the network adapters. The protocol binding for IPv6 should be enabled:

[Screenshot: network adapter properties with the IPv6 protocol binding enabled]

Instead, use the DisabledComponents registry setting. See Disable IPv6 for details.
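As a sketch, the value can also be set with PowerShell; the key, the value name and the 0xFF setting (disable all IPv6 components except the loopback interface) are the ones documented in the Microsoft guidance referenced above, and a reboot is required afterwards:

# Disable all IPv6 components except loopback (0xFF), then reboot
New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters" `
    -Name "DisabledComponents" -PropertyType DWord -Value 0xFF -Force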


Storage networks

If you use IP-based storage like ISCSI, SMB or FCOE, make sure you do not mix it with other traffic. Dedicated physical adapters should always be used for storage traffic. Moreover, if you are one of the unlucky few using FCOE you should seriously consider converting to FC or SMB3.

Hyper-V networks

In a perfect world, you should have six or more separate networks/fabrics for Hyper-V clusters. Sadly though, the world is seldom perfect. The absolute minimum for production clusters is two networks. Using only one network in production will cause nothing but trouble, so please do not try. Determining whether or not to use teaming complicates matters further. As a general guide, I would strongly recommend that you always have a dedicated storage fabric with HA, that is teaming or MPIO, unless you use local storage on the cluster nodes. The storage connection is the most important one in any form of cluster. If the storage connection fails, everything else falls apart in seconds. For the other networks, throughput is more important than high availability. If you have to make a choice between HA and separate fabrics, choose separate fabrics for all networks other than the storage network.

7 Physical networks/fabrics

  • Internal/Cluster/CSV (if local)/Heartbeat
  • Public network for VMs
  • VM Host management
  • Live Migration
  • 2*Storage (ISCSI, FC, SMB3)
  • Backup

5 Physical networks/fabrics

  • Internal/Cluster/CSV (if local)/Heartbeat/Live Migration
  • Public network for VMs, VM guest management
  • VM Host management
  • 2*Storage (ISCSI, FC, SMB3)

4 Physical networks/fabrics

  • Internal/Live Migration
  • Public & Management
  • 2*Storage

Example

[Diagram: example fabric layout for a blade chassis]

Most blade server chassis today have a total of six fabric backplanes, grouped in three groups where each group connects to a separate adapter in the blade. Thus, each network adapter or FC HBA is connected to two separate fabrics/backplanes. The groups could be named A, B and C, with the fabrics named A1, A2, B1 and so on. Each group should have identical backplanes, that is, the backplane in A1 should be the same model as the backplane in A2.

If we have Fibre Channel (FC) backplanes in group A, and 10G Ethernet backplanes in groups B and C, we have several possible implementations. Group A will always be storage in this example, as FC is a dedicated storage network.

[Diagram: teaming across groups B and C]

Here, we have teaming implemented on both B and C. Thus, we use the 4 networks configuration from above, splitting our traffic in Internal and Public/Management. This implementation may generate some conflicts during Live Migrations, but in return we get High Availability for all groups.

[Diagram: groups B and C split into single ports]

By splitting groups B and C into single ports, we get 5 fabrics and a more granular separation of traffic at the cost of High Availability.

Hyper-V trunk adapters/teams on 2012

If you are using Hyper-V virtual switches bound to a physical port or team on your Hyper-V hosts, Hyper-V Extensible Virtual Switch should be the only bound protocol. Note: Do not change these settings manually; Hyper-V Manager will change the settings automatically when you configure the virtual switch. If you bind the Hyper-V Extensible Virtual Switch protocol manually, creation of the virtual switch may fail.

[Screenshot: adapter bindings with only Hyper-V Extensible Virtual Switch enabled]
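For reference, a minimal sketch of how such a switch is typically created; the team name Team1 and switch name VMSwitch1 are examples, and Hyper-V adjusts the protocol bindings for you:

# Create an external virtual switch bound to the team, not shared with the management OS
New-VMSwitch -Name "VMSwitch1" -NetAdapterName "Team1" -AllowManagementOS $false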

Teaming in Windows 2012

In Windows 2012 we finally got native support for nic teaming. You access the nic teaming dialog from Server Manager. You can find a short description of the features here: http://technet.microsoft.com/en-us/library/hh831648.aspx, and a more detailed one here: Windows Server 2012 NIC Teaming (LBFO) Deployment and Management.

Native teaming support rids us of some of the problems related to unstable vendor teaming drivers, and makes setup of nic teaming a unified experience no matter what nics you are using. Note: never use nic teaming on ISCSI networks. Use MPIO instead.
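As a sketch, creating a team with the native cmdlets could look something like this; the team and adapter names are examples, and other teaming modes and load balancing algorithms may fit your environment better:

# Switch independent team of two adapters, load balancing on TCP/UDP ports
New-NetLbfoTeam -Name "Team1" -TeamMembers "NIC1","NIC2" `
    -TeamingMode SwitchIndependent -LoadBalancingAlgorithm TransportPorts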

A note on Active/Active teaming

It is possible to use active/active teaming, thus aggregating the bandwidth of two or more adapters to support higher throughput. This is a fantastic technology, especially on 1G Ethernet adapters where bandwidth congestion can become a problem. There is, however, a snag; a lot of professional datacenters have a complete ban on active/active teaming due to years of teaming problems. I have myself been a victim of unstable active/active teams, so I know this to be a real issue. I do think this is less of a problem in Windows 2012 than it was on previous versions, but there may still be configurations that just do not work. The more complex your network infrastructure is, the less likely active/active teaming is to work. Connecting all members of the team to the same switch increases the chance of success. This also makes the team dependent on a single switch of course, but if the alternative is bandwidth congestion or no teaming at all, it does not really matter.

I recommend talking to your local network specialist about teaming before creating a design dependent on active/active teaming.

Using multiple vlans per adapter or team

It has become common practice to use more than one vlan per team, or even more than one vlan per adapter. I do not recommend this for clusters, with the exception of adapters/teams connected to a Hyper-V switch. An especially stupid thing to do is mixing ISCSI traffic with other traffic on the same physical adapter. I have dealt with the aftermath of such a setup, and it does not look pretty unless data corruption is your kind of fun. And if you create a second vlan just to get an internal network for cluster heartbeat traffic on the same physical adapters you are using for client connections, you are not really achieving anything other than making your cluster more complex. The cluster validation report will even warn you about this, as it will detect more than one interface with the same MAC address.

Verify SMB3 Multichannel on your cluster

To ensure maximum throughput for file clusters and Hyper-V clusters with cluster shared volumes, ensure that SMB multichannel is working. Without it, your file transfers may be running on a single thread/cpu and be less resilient to network problems. See http://blogs.technet.com/b/josebda/archive/2012/05/13/the-basics-of-smb-multichannel-a-feature-of-windows-server-2012-and-smb-3-0.aspx for more background information. SMB multichannel requires Windows 2012 or newer.

SMB multichannel is on by default, but that does not necessarily translate to “works like a charm” by default. The underlying network infrastructure and network adapters have to be configured to support it. In short, you need at least one of the following:

  • multiple nics
  • RSS capable nics
  • RDMA capable nics
  • network teaming

Verify nic capability detection

Run the following PowerShell command on the client:

Get-SmbClientNetworkInterface

[Screenshot: Get-SmbClientNetworkInterface output]

In this sample output, we have five RSS enabled interfaces, and no RDMA enabled interfaces. Check that the interfaces you are planning to use for SMB are listed. Teamed interfaces show up in this list as virtual nics, but the physical nics that are part of the team are hidden. This behavior is expected.

On the server, use this PowerShell command. For Hyper-V cluster nodes with CSV, run both the server and client commands.

Get-SmbServerNetworkInterface

[Screenshot: Get-SmbServerNetworkInterface output]

Again, make sure the adapters and IP addresses you have dedicated to SMB traffic are shown in the list with the expected capabilities.

Verify multiple connections

The PowerShell cmdlet Get-SmbMultichannelConnection lists active SMB multichannel connections on the client. You may have to start a large file copy operation before you run the command to get any data. If you add the -IncludeNotSelected option, possible connections that are not selected for use are also listed. In the sample below, you will see that one of the possible connections involves crossing a gateway/firewall from 10.x to 192.x, and is therefore not used.
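The commands mentioned above, for reference:

# Active multichannel connections
Get-SmbMultichannelConnection
# Also list candidate connections that were detected but not selected
Get-SmbMultichannelConnection -IncludeNotSelected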

[Screenshot: Get-SmbMultichannelConnection output]

If you are unable to get any data, run Get-SmbConnection to verify that you have active SMB connections.

Enable multichannel in failover cluster manager

For SMB multichannel to be active on a clustered role, be it a scale-out file server or the old-fashioned file server role, client connections have to be enabled on all participating networks. It is best practice to disable client connections on all non-client-facing cluster networks, but if you want to use SMB multichannel on an internal cluster network, for a Hyper-V cluster for instance, you have to enable client connections on the internal network(s). It is also good practice not to have a default gateway on cluster-internal networks, unless you are deploying a stretched cluster where the internal cluster traffic also has to cross a gateway. Thus, clients outside the internal cluster network should not be able to reach it anyway due to routing and/or firewall restrictions. That being said, if you are deploying a cluster where clients are supposed to connect to the clustered file server, you should also create multiple networks accessible from outside the cluster. Cluster network design is a huge topic outside the scope of this post, but whatever you do, make sure “Allow clients to connect through this network” is enabled in Failover Cluster Manager.

[Screenshot: cluster network properties in Failover Cluster Manager]
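If you prefer PowerShell, the same setting can be changed on the cluster network object. A sketch, where “Internal” is an example network name:

# Role 3 = allow both cluster and client traffic on this network
(Get-ClusterNetwork -Name "Internal").Role = 3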

EventID 1004 from IPMIDRV

Post originally from 2010, updated 2014.04.04 and superseded by EventID 1004 from IPMIDRV v2 in 2016

Problem

The system event log is overflowing with EventID 1004 from IPMIDRV: “The IPMI device driver attempted to communicate with the IPMI BMC device during normal operation. However the operation failed due to a timeout.”

The frequency may vary from a couple of messages per day upwards to several messages per minute.

Analysis

The BMC (Baseboard Management Controller) is a component found on most server motherboards. It is a microcontroller responsible for communication between the motherboard and management software; see Wikipedia for more information. The BMC is also used for communication between the motherboard and dedicated out-of-band management boards such as Dell iDRAC. I have seen these error messages on systems from several suppliers, most notably on IBM and Dell blade servers, but most server motherboards have a BMC.

As the error message states, you can resolve this error by increasing the timeout, and this is usually sufficient. I have found that the Windows default settings for the timeouts may cause conflicts, especially on blade servers, so an increase in the timeout values may be in order as described on TechNet. Lately though, I have found this error to be a symptom of more serious problems. To understand this, we have to look at what is actually happening.

If you have some kind of monitoring agent running on the server, such as SCOM or similar, the error could be triggered by said agent trying to read the current voltage levels on the motherboard. If such operations fail routinely during the day, it is a sign of a conflict. This could be competing monitoring agents querying data too frequently, an issue with the BMC itself, or an issue with the out-of-band management controller. In my experience, this issue is more frequent on blade servers than on rack-based servers. This makes sense, as most blade servers have a local out-of-band controller that is continuously talking to a chassis management controller to provide a central overview of the chassis.

If the out-of-band controllers have a problem, this can and will affect the BMC, which in turn may affect the motherboard. Monitoring of server status is the most frequently used feature, but the BMC controller is also used for remote power control and is able to affect the state of internal components on the motherboard. We recently had an issue on a Dell M820 blade server where a memory leak in iDrac resulted in the mezzanine slots on the server being intermittently switched off. In this case it was the Fibre Channel HBA. Further research revealed this to be a recurring issue. This forum thread from 2011 describes a similar issue: http://en.community.dell.com/techcenter/blades/f/4436/t/19415896.aspx.

As the iDrac versions in question are different (1.5.3 in 2010 and 1.51.51 in 2014), I theorize that the issue is related to iDrac memory leaks in general and not a specific bug. Thus, any iDrac firmware bug resulting in a memory leak may cause these issues.

Solution

Low error frequency

Increase the timeout values as described on technet. I have used the following values with success on IBM servers:

[Screenshot: IPMI timeout values in the registry]

Under HKLM\SYSTEM\CurrentControlSet\Control\IPMI there are five values controlling the IPMI driver: BusyWaitTimeoutPeriod, BusyWaitPeriod, IpmbWaitTimeoutPeriod, CommandWaitTimeoutPeriod and SlaveAddressDetectionMethod. On IBM blades, I have used 60 (decimal) for BusyWaitPeriod and 9 000 000 (decimal) for the rest of the timeout values. Changing these settings requires a restart of the server.
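A sketch of the same change in PowerShell, using the values mentioned above; run it from an elevated prompt and restart the server afterwards:

$ipmi = "HKLM:\SYSTEM\CurrentControlSet\Control\IPMI"
Set-ItemProperty -Path $ipmi -Name BusyWaitPeriod -Value 60 -Type DWord
Set-ItemProperty -Path $ipmi -Name BusyWaitTimeoutPeriod -Value 9000000 -Type DWord
Set-ItemProperty -Path $ipmi -Name IpmbWaitTimeoutPeriod -Value 9000000 -Type DWord
Set-ItemProperty -Path $ipmi -Name CommandWaitTimeoutPeriod -Value 9000000 -Type DWord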

High error frequency

Further analysis will be necessary for each case. Try to identify what program is triggering the timeouts. A blanket upgrade of BIOS, out-of-band management firmware and system drivers may be successful, but it could also introduce new problems and thus further complicate the troubleshooting. Looking for other, seemingly unrelated problems in the event log could also be a good idea, and check for other servers with similar problems. I have found that removing the server from the chassis and reseating it may remove the fault for a couple of days before it returns. This is a symptom of a memory leak. And talking about memory leaks, check for kernel mode driver memory leaks in the operating system.

If it is a Dell server, try running the following command:

racadm getsysinfo


If the result is “ERROR: Unable to perform the requested operation”, something is seriously wrong with the out-of-band controller. Get in touch with Dell support for further help. You will need a new version of the iDrac firmware without the memory leak, or an older version and a downgrade.

If the command is successful, it should return a lot of information about the server:

[Screenshot: successful racadm getsysinfo output]

A successful result points to an issue with monitoring software or drivers. Or maybe you just need to increase the timeouts in the registry.

Event 324 from SQLAgent OpenCluster (reason: 5).

Problem

Overzealous monitoring alerts you to an error logged during a cluster failover, more specifically Event ID 324 from SQLAgent$InstanceName:

[Screenshot: Event ID 324 from SQLAgent]

Analysis

As mentioned, this happens during a failover that otherwise may pass without incident. Further analysis of the Application log shows that recovery isn’t finished at the time. The next messages in the log are related to the server starting up and running recovery on the new node. For some reason this takes longer than expected. Maybe there were a lot of transactions in flight at the time of failover, maybe the server or storage is too slow, or maybe you were in the process of installing an update to SQL Server, which may lead to extensive recovery times. Or it may be something completely different. Whatever it was, it caused the cluster service to try to start the SQL Agent before the node was ready. Reason 5 is probably access denied. Thus, the issue could be related to lack of permissions. I have yet to catch one of these early enough to have a cluster debug log containing the time of the error. Analysis of the cluster in question revealed another access related error at about the same time, ESENT Event ID 490:

[Screenshot: ESENT Event ID 490]

This error is related to lack of permissions for the SQL Server engine and Agent runas accounts. Whether or not these accounts should have local admin permissions on the node is a never-ending discussion. I have found though, that granting the permissions causes far less trouble in a clustered environment than not doing so. There is always another issue, always another patch or feature requiring yet another explicit permission. From a security standpoint, it is easy to argue that the data served by the SQL Server is far more valuable than the local server it runs on. If an attacker is able to gain access to the runas accounts, he already has access to read and change/delete the data. What happens to the local server after that is to me rather irrelevant. But security regulations aren’t guaranteed to be either logical or sane.

Solution/Workaround

To solve the permission issue, you can either:

  • Add the necessary local permissions for the runas accounts as discussed in KB2811566 and wait for the next “feature” requiring you to add even more permissions to something else. Also, make sure the Agent account has the proper permissions to your backup folders and make sure you are able to create new databases. Not being able to do so may be caused by the engine account not having the proper permissions to your data/log folders.
  • Add the SQL Server Engine and Agent runas accounts to the local administrators group on the server.

Do NOT grant the runas accounts Domain Admin permissions. Ever.

Regarding the open cluster error:

On the servers I have analyzed with this issue, the log always shows the agent starting successfully within two minutes of the error, and it only happens during failover. I have yet to find it on servers where the permissions issue is not solved (using either method), but I am not 100% sure that they are related. I can however say that the message can safely be ignored as long as the Agent account is able to start successfully after the message.

Unable to add shares to Windows 2012 File Cluster

Problem


When you try to add a share to a newly formed (and perhaps also an existing) Windows 2012 File Server Cluster, you get an error message stating that you are unable to do so due to lack of WinRM communication between the cluster nodes. Additionally, you may spot event id 49 from WinRM MI Operation in the Windows Remote Management operational event log with the following message:

“The WinRM protocol operation failed due to the following error: The WinRM client sent a request to an HTTP server and got a response saying the requested HTTP URL was not available. This is usually returned by a HTTP server that does not support the WS-Management protocol..”


Or the following text for Event 49:

“The WinRM protocol operation failed due to the following error: The connection to the specified remote host was refused. Verify that the WS-Management service is running on the remote host and configured to listen for requests on the correct port and HTTP URL..”

And event id 142 from Windows Remote management stating

“WSMan operation Enumeration failed, error code 2150859027”


Other possible events:

EventID 0 from FileServices-Manager.Eventprovider


“ Exception: Caught exception Microsoft.Management.Infrastructure.CimException: The WinRM client received an HTTP status code of 502 from the remote WS-Management service.
   at Microsoft.Management.Infrastructure.Internal.Operations.CimSyncEnumeratorBase`1.MoveNext()
   at Microsoft.FileServer.Management.Plugin.Services.FSCimSession.PerformQuery(String cimNamespace, String queryString)
   at Microsoft.FileServer.Management.Plugin.Services.ClusterEnumerator.RetrieveClusterConnections(ComputerName serverName, ClusterMemberTypes memberTypeToQuery)”

Error code 504 has also been detected.

Analysis

The problem is clearly related to Windows Remote Management. What was even more peculiar in this case was the fact that when I failed over to another node, the error message disappeared. Thus I knew that the error was isolated to the one node. But even though I spent hours comparing settings on the nodes, all I was able to establish was the fact that they were exactly alike. Then I remembered something from my Exchange admin days: in earlier versions of Windows, WinRM could be removed and reinstalled from the system. I remember this because Exchange 2010 relied heavily on WinRM and remote PowerShell, both of which could be a major pain to get working properly. In Win2012, remote management is heavily integrated in Server Manager, and I was unable to find a way to remove it. I did however find a way to turn it off and on again.

Update 2016.11.24:

I found another version of this problem where solution one did not work. It was still a WinRM problem, but this time it was proxy-related. You may need an explicit proxy exception for the local domain.

Solution one

Disable and enable WinRM. There are of course multiple ways to achieve this. I used PowerShell, but there is an option in the GUI, and the command works in CMD.EXE as well. Beware, you have to use an elevated PowerShell prompt. Come to think of it, most things that are worth doing seem to require an elevated shell.

Configure-SMRemoting -disable
Configure-SMRemoting -enable

That is it; no need to reboot or anything, just run the two commands and wait for them to finish. If you get a message that remoting is enforced by Group Policy, look for this GPO:

[Screenshot: the relevant Group Policy setting]

It has to be set as Not configured to allow you to disable and enable WinRM. If it is enforced by a domain policy, you have to block said policy temporarily while you fix this.

Enabling and disabling should also make sure that the necessary firewall settings are enabled. If you have a proxy server defined, make sure you have exceptions added for your local servers as this could also block WinRM, albeit with other error messages.

Solution two

Make sure you have an exception in your proxy definition for the local domain. For system proxy setups:

netsh winhttp set proxy [proxyserveraddress]:[proxy port] bypass-list="*.ADDomain.local;<local>"

For other proxy configs, ask your proxy admin.
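To verify what the system proxy configuration ended up as:

netsh winhttp show proxy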

Testing WinRM with PowerShell

You can use the Invoke-Command PowerShell cmdlet to test PowerShell remoting connections:

Invoke-Command -ComputerName Lab-DC -ScriptBlock { Get-ChildItem c:\ } -credential lab\sauser

This command will output a directory listing of c:\ on the computer Lab-DC. The command will be executed with the lab\sauser account, and PowerShell prompts for the account password on execution. Sample output:

[Screenshot: Invoke-Command directory listing output]

SQL agent jobs cause Audit failure in security log

Problem

When you audit your security log, something you are doing every day of course, you discover that SQL server is causing an audit failure fairly often:

[Screenshot: repeated audit failures in the security log]

Audit failure event 4625, The user account has expired:

[Screenshot: Event 4625 details, “The user account has expired”]

Closer inspection reveals that the event is triggered by the SQL Server engine service account. You could easily be led to believe that the engine account itself has expired, as it is the only account mentioned by name in the error message. That is not the case here, as the “Account for which logon failed” is a null SID, also known as S-1-0-0 or nobody. This is Windows’ way of saying “something failed, but I’m not going to tell you exactly what it was.”

Analysis

The error appeared about every hour on the hour, so agent jobs were the top suspect. But the job logs disagreed; every job was working like a charm. Analysis of the central log repository revealed that the error had appeared suddenly, without any changes being made on the server at the time.

I discovered this on a cluster, so I tried failing over to another node. That didn’t change anything; the error followed sqlserver.exe to the other node. At least I had proven beyond reasonable doubt that the error was directly related to something SQL Server does. I was unable to find any fault on the server apart from the audit failure. I checked all SQL Server service accounts, none of which had an expiry date set. I drifted back to the agent jobs again, as agent job ownership has given me a hard time before. For some reason, if the person creating the jobs/maintenance plans is a sysadmin by group membership and not by direct membership, all agent jobs fail to execute unless you use a proxy account. I have blogged about this before (https://lokna.no/?p=1267), so I always check for this when a new instance is installed and correct it if necessary.

Then it struck me: there is a best practice somewhere stating that SQL Server installs are supposed to be executed as a special account, and not an account associated with the person performing the installation, in case that person quits and we have to delete his/her account. The agent jobs for backups, DBCC CHECKDB and such are usually created using another account though, but maybe there was an anomaly. I ran a quick check, and yes, the agent jobs were owned by the setup user. And this user had since been marked as expired for security reasons, as it is only used during setup and as a way into the server in case someone removes all other sysadmins and locks themselves out. I know there are other ways into a SQL Server you don’t have access to, but this is a lot easier as long as you have access to the domain.

To list job owners:

select name, owner_sid from msdb.dbo.sysjobs

[Screenshot: query output listing job names and owner SIDs]

0x1 is sa, all the other jobs were owned by the setup user. These SIDs are in hex format. To convert to a username, run this:

SELECT SUSER_SNAME (0xHEXSID)
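If you want the job names and the owner names in one go, the two queries above can be combined into a single one:

SELECT name, SUSER_SNAME(owner_sid) AS owner_name
FROM msdb.dbo.sysjobs
ORDER BY owner_name;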

Solution

Change the job and/or maintenance plan owner. See https://lokna.no/?p=325

Multiple default gateways

Update 2018-03-07

I have verified this as an issue on Windows 2016 as well. Sometimes if a network adapter has been configured with a default gateway before it is added to a NIC Team, you will get multiple default gateways.

Problem

While troubleshooting a network teaming issue on a cluster, someone sent me a link to this article about multiple default gateways on Win 2012 native teaming: http://www.concurrency.com/blog/bug-in-nic-teaming-wizard-makes-duplicate-default-routes-in-server-2012/. The post discusses a pretty specific scenario that we didn’t have on our clusters (most of them are on 2008R2), but I discovered several nodes with more than one default route in route print:

[Screenshot: route print showing multiple default routes]

The issue I was looking into was another, but I remembered a problem from a weekend some months ago that might be related: When a failover was triggered on a SQL cluster, the cluster lost communication with the outside world. To be specific: no traffic passed through the default gateway. As all cluster nodes were on the same subnet the cluster itself was content with the situation, but none of the webservers were able to communicate with the clustered SQL server as they were in a different subnet. This made the webservers sad and the webmaster angry, so we had to fix it. As this happened in production over the weekend, the focus was on a quick fix and we were unable to find a root cause at the time. A reboot of the cluster nodes did the trick, and we just wrote it off as fallout from the storage issue that triggered the failover. The discovery of multiple default gateways on the other hand prompted a more thorough investigation.

Analysis

The article mentioned above talks exclusively about Windows 2012’s native teaming software, but this cluster is running Windows 2008 R2 and is relying on teaming software provided by the NIC manufacturer (Qlogic). We have had quite a lot of problems with the Qlogic network adapters on this cluster, so I immediately suspected them to be the rotten apple. I am not sure if this problem is caused by a bug in Windows itself that is present in both 2012 and 2008R2, or if both MS and Qlogic are unable to produce a functioning NIC teaming driver, but the following is clear:

If your adapters have a default gateway when you add them to a team, there is a chance that this default gateway will not get removed from the system. This happens regardless of whether the operating system is Windows 2012 or Windows 2008 R2. I am not sure if gateway addresses configured by DHCP also trigger this behavior. It doesn’t happen every time, and I have yet to figure out if there are any specific triggers, as I haven’t been able to reproduce the problem at will.

Solution A

To resolve this issue, follow the recommendations in  http://www.concurrency.com/blog/bug-in-nic-teaming-wizard-makes-duplicate-default-routes-in-server-2012/:

First you have to issue a command to delete all static routes to 0.0.0.0. NB! This will disconnect you from the server if you are connected remotely from outside the subnet.
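A sketch of the commands involved; route delete 0.0.0.0 removes every default route on the server, so heed the warning above:

route print -4
route delete 0.0.0.0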


Configure the default gateway for the team using IP properties on the virtual team adapter:

[Screenshot: TCP/IP properties for the virtual team adapter]

Do a route print to make sure you have only one default gateway under persistent routes.

Solution B

If solution A doesn’t work, issue a netsh interface ip reset command to reset the IP configuration and reboot the server. Be prepared to re-enter the IP information for all adapters if necessary.

What not to do

Do not configure the default gateway using route add, as this will result in a static route. If the computer is a node in a cluster, the gateway will be disabled at failover and isolate the server on the local subnet. See http://support.microsoft.com/kb/2161341 for information about how to configure static routes on clusters if you absolutely have to use a static route.

Does your cluster have the recommended hotfixes?

From time to time, the good people at Microsoft publish a list of problems with failover clustering that have been resolved. This list, like all such bad/good news, comes in the form of a KB, namely KB2784261. I check this list from time to time. Some of the fixes relate to a specific issue, while others are more of the go-install-them-at-once type. As a general rule, I recommend installing ALL of the hotfixes regardless of the attached warning telling you to only install them if you experience a particular problem. In my experience, hotfixes are at least as stable as regular patches, if not better. That being said, sooner or later you will run across patches or hotfixes that will make a mess and give you a bad or very bad day. But then again, that is why cluster admins always fight for funding of a proper QA/T environment. Preferably one that is equal to the production system in every way possible.

Anyways, this results in having to check all my servers to see if they have the hotfixes installed. Luckily some are included in Microsoft Update, but some you have to install manually. To simplify this process, I made the following PowerShell script. It takes a list of hotfixes and returns a list of the ones that are missing from the system. This script could easily be adapted to run against several servers at once, but I have to battle way too many internal firewalls to attempt such dark magic. Be aware that some hotfixes have multiple KB numbers and may not show up even if they are installed. This usually happens when updates are bundled together as a cumulative package or superseded by a new version. The best way to test if patch/hotfix X needs to be installed is to try to install it. The installer will tell you whether or not the patch is applicable.

Edit: Since the original post, I have added KB lists for 2008 R2 and 2012 R2 based clusters. All you have to do is replace the $recommendedPatches list with the one you need. Links to the correct KB list are included for each OS. I have also discovered that some of the hotfixes are available through the Microsoft Update Catalog, thus bypassing the captcha email hurdle.

2012 version

$menucolor = [System.ConsoleColor]::gray
write-host "╔═══════════════════════════════════════════════════════════════════════════════════════════╗"-ForegroundColor $menucolor
write-host "║                              Identify missing patches                                     ║"-ForegroundColor $menucolor
write-host "║                              Jan Kåre Lokna - lokna.no                                    ║"-ForegroundColor $menucolor
write-host "║                                       v 1.2                                               ║"-ForegroundColor $menucolor
write-host "║                                  Requires elevation: No                                   ║"-ForegroundColor $menucolor
write-host "╚═══════════════════════════════════════════════════════════════════════════════════════════╝"-ForegroundColor $menucolor
#List your patches here. Updated list of patches at http://support.microsoft.com/kb/2784261
$recommendedPatches = "KB2916993", "KB2929869","KB2913695", "KB2878635", "KB2894464", "KB2838043", "KB2803748", "KB2770917"
 
$missingPatches = @()
foreach($patch in $recommendedPatches){
    if (!(Get-HotFix -Id $patch -ErrorAction SilentlyContinue)) {
        $missingPatches += $patch
    }
}
$intMissing = $missingPatches.Count
$intRecommended = $recommendedpatches.count
Write-Host "$env:COMPUTERNAME is missing $intMissing of $intRecommended patches:" 
$missingPatches

2008R2 Version

A list of recommended patches for Win 2008 R2 can be found here:  KB2545685

$recommendedPatches = "KB2531907", "KB2550886","KB2552040", "KB2494162", "KB2524478", "KB2520235"

2012 R2 Version

A list of recommended patches for Win 2012 R2 can be found here:  KB2920151

#All clusters
$recommendedPatches = "KB3130944", "KB3137691", "KB3139896", "KB3130939", "KB3123538", "KB3091057", "KB3013769", "KB3000850", "KB2919355"
#Hyper-V Clusters
$recommendedPatches = "KB3130944", "KB3137691", "KB3139896", "KB3130939", "KB3123538", "KB3091057", "KB3013769", "KB3000850", "KB2919355", "KB3090343", "KB3060678", "KB3063283", "KB3072380"

Hyper-V

If you are using the Hyper-V role, you can find additional fixes for 2012 R2 in KB2920151 below the general cluster hotfixes. If you use NVGRE, look at this list as well: KB2974503

Sample output (computer name redacted)

[Screenshot: script output listing missing patches]

Edit:

I have finally updated my script to remove those pesky red error messages seen in the sample above.

Kernel memory leak analysis

I have had several issues in the past year involving kernel memory leaks, so I decided to make a separate blog post about general kernel memory leak analysis. In this post I mostly use the amazing Sysinternals tools for troubleshooting. You also need Poolmon.exe, a small utility currently part of the Windows Driver Kit. Sadly, this 35k self-contained .exe is not available as a separate download; you have to download and install the entire 500+ MiB WDK somewhere to extract it. You only have to do this once though, as there is no need to install the WDK on every system you analyze. You can just copy the executable from the machine where you installed the WDK.

Problem

Something is causing the kernel paged or non paged pools to rise uncontrollably. Screenshot from Process Explorer’s System Information dialog:

[Screenshot: Process Explorer System Information, kernel memory section]

In this sample, the non-paged pool has grown to an unhealthy 2.2 GB, and continues to grow. Even though the pool limit is 128 GiB and the server has a whopping 256 GiB of RAM, the kernel memory pools are usually way below the 1 GiB mark. You should of course baseline this to make sure you actually have an issue, but generally speaking, every time I find a kernel memory value above 1 GiB I go hunting for the cause.

Note: To show the pool limits, you have to enable symbols in Process Explorer. Scott Hanselman has blogged about that here: http://www.hanselman.com/blog/SetUpYourSystemToUseMicrosoftsPublicSymbolServer.aspx

Analysis

Kernel leaks are usually caused by a driver. Kernel leaks in the OS itself are very rare, unless you are running some sort of beta version of Windows. To investigate further, you have to fire up poolmon.exe.

[Screenshot: poolmon.exe default view]

Poolmon has a lot of shortcuts. From KB177415:

P – Sorts tag list by Paged, Non-Paged, or mixed. Note that P cycles through each one.
B – Sorts tags by max byte usage.
M – Sorts tags by max byte allocation.
T – Sort tags alphabetically by tag name.
E – Display Paged, Non-paged total across bottom. Cycles through.
A – Sorts tags by allocation size.
F – Sorts tags by “frees”.
S – Sorts tags by the differences of allocs and frees.
Q – Quit.

The important ones are “P”, to view either paged or non-paged pool tags, and “B”, to list the ones using the most of it at the top. The same view as above, after pressing “P” and “B”:

[Screenshot: poolmon sorted by non-paged pool bytes]

The “Cont” tag relates to “Contiguous physical memory allocations for device drivers”, and is usually the largest tag on a normal system.

And this screenshot is from the server with the non-paged leak:

[Screenshot: poolmon on the leaking server, LkaL tag at the top]

As you can see, the LkaL tag is using more than 1 GiB on its own, accounting for half of the pool. We have identified the pool tag; now we have to look for the driver that owns it. To do that, I use one of two methods:

1: Do an internet search for the pool tag.

http://blogs.technet.com/b/yongrhee/archive/2009/06/24/pool-tag-list.aspx contains a large list of tags.

2: Use Sysinternals strings together with findstr.

Most kernel mode drivers are located in “%systemroot%\System32\drivers”. First you have to start an elevated command prompt.  Make sure the Sysinternals suite is installed somewhere on the path, and enter the following commands:

  • cd “%systemroot%\System32\drivers”
  • strings * | findstr [tag]

Example:

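Using the LkaL tag from the sample above, the search would look something like this:

cd "%systemroot%\System32\drivers"
strings * | findstr LkaL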

Hopefully, you should now have the name of the offending driver. To gather more intel about it, use Sysinternals sigcheck:
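A hedged example of the call; the driver file name is a placeholder, substitute the name returned by the strings search:

sigcheck -a "%systemroot%\System32\drivers\<drivername>.sys"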

[Screenshot: sigcheck output for the offending driver]

In this case, the offending driver is part of Diskeeper.

Solution

You have to contact the manufacturer of the offending driver and check for an updated version. In the example, I got a version of Diskeeper without the offending driver, but a good place to start is the manufacturer’s website. And remember, if you already have the newest version, you can always try downgrading to a previous version. If you present your findings as detailed above, most vendors are very interested in identifying the cause of the leak and will work with you to resolve the issue.