Windows Server 2012R2 NIC Teaming

This is an attempt at giving a technical overview of how the native network teaming in Windows 2012R2 works, and how I would recommend using it. From time to time I am presented with problems “caused” by network teaming, so figuring out how it all works has been essential. Compared to the days of old, when teaming was NIC vendor dependent, today’s Windows native teaming is a delight, but it is not necessarily trouble free.

Sources

Someone at Microsoft has written an excellent guide called Windows Server 2012 R2 NIC Teaming (LBFO) Deployment and Management, available here. It gives a detailed technical guide to all the available options. I have added my field experience to the mix to create this guide.

Definitions

  • NIC: Network Interface Card. Also known as Network Adapter.
  • vNIC/virtual NIC: a team adapter on a host or another computer (virtual or physical) that uses teaming.
  • Physical NIC/adapter: An adapter port that is a member of a team. Usually a physical NIC, but could be a virtual NIC if someone has made a complicated setup with teaming on a virtual machine.
  • vSwitch: A virtual switch, usually a Hyper-V switch.
  • Team member: a NIC that is a member of a team.
  • LACP: Link Aggregation Control Protocol, also known as IEEE 802.3ad. See https://en.wikipedia.org/wiki/Link_aggregation#Link_Aggregation_Control_Protocol

Active-Active vs Active-Passive


If none of the adapters are set as standby, you are running an Active-Active config. If one is standby and you have a total of two adapters, you are running an Active-Passive config. If you have more than two team members, you may be running a mixed Active-Active-Passive config (standby adapter set), or an Active-Active config without a standby adapter.
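
A standby member is configured on the team member itself. A minimal sketch using the built-in LBFO cmdlets, where the team and adapter names ("Team1", "NIC2") are placeholders:

    # Mark one member as standby to get an Active-Passive team
    Set-NetLbfoTeamMember -Name "NIC2" -Team "Team1" -AdministrativeMode Standby

    # Set it back to Active to return to an Active-Active team
    Set-NetLbfoTeamMember -Name "NIC2" -Team "Team1" -AdministrativeMode Active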

If you are using a configuration with more than one active team member on a 10G infrastructure, my recommendation is to make sure that both members are connected to the same physical switch and in the same module. If not, be prepared to sink literally hundreds, if not thousands of hours into troubleshooting that could otherwise be avoided. There are far too many problems related to the switch teaming protocols used on 10G, especially with the Cisco Nexus platform. And it is not that they do not work, it is usually an implementation problem. A particularly nasty kind of device is something Cisco refers to as a FEX or fabric extender. Again, it is not that it cannot work. It’s just that when you connect it to the main switch with a long cable run it usually works fine for a couple of months. And then it starts dropping packets and pretends nothing happened. So if you connect one of your team members to a FEX, and another to a switch, you are setting yourself up for failure.

Due to the problems mentioned above and similar troubles, many IT operations have a ban on Active-Active teaming. It is just not worth the hassle. If you really want to try it out, I recommend one of the following configurations:

  • Switch independent, Hyper-V load balancing. Naturally for vSwitch connected teams only. No, do not use Dynamic.
  • LACP with Address Hash or Hyper-V load balancing. Again, do not use Dynamic mode.
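
If you do go down the Active-Active route, both of the configurations above can be created with the built-in LBFO cmdlets. A hedged sketch, assuming two adapters per team with placeholder names; adjust names, modes and the vSwitch settings to your environment:

    # Switch independent + Hyper-V port load balancing, for a team backing a vSwitch
    New-NetLbfoTeam -Name "Team1" -TeamMembers "NIC1","NIC2" `
        -TeamingMode SwitchIndependent -LoadBalancingAlgorithm HyperVPort
    New-VMSwitch -Name "vSwitch1" -NetAdapterName "Team1" -AllowManagementOS $false

    # LACP + Address Hash, for a host team (requires a matching port channel on the switch)
    New-NetLbfoTeam -Name "Team2" -TeamMembers "NIC3","NIC4" `
        -TeamingMode Lacp -LoadBalancingAlgorithm TransportPorts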

Team members

I do not recommend using more than two team members in Switch Independent teaming due to artifacts in load distribution. Your servers and switches may handle everything correctly, but the rest of the network may not. For switch dependent teaming, you should be OK, provided that all team members are connected to the same switch module. I do not recommend using more than four team members though, as it seems to be the breaking point between added redundancy and too much complexity.

Make sure all team members are using the exact same network adapter with the exact same firmware and driver versions. Mixing them up will work, but even if base jumping is legal you don’t have to go jumping. NICs are cheap, so fork over the cash for a proper Intel card.
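
To verify that the members actually match, compare the driver details before you team them up. A quick sketch using Get-NetAdapter (adapter names are placeholders, and firmware versions usually have to be checked with the vendor's own tools):

    Get-NetAdapter -Name "NIC1","NIC2" |
        Format-Table Name, InterfaceDescription, DriverVersionString, DriverDate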

Load distribution algorithms

Be aware that the load distribution algorithm primarily affects outbound connections only. The behavior of inbound connections and routing for switch independent mode is described for each algorithm. In switch dependent mode (either LACP or static) the switch will determine where to send the inbound packets.

Address hash

Using parts of the address components, a hash is created for each load/connection. There are three different modes available, but only the default (hashing on TCP/UDP ports and IP addresses) is exposed in the GUI, and it is the one mostly used. The other alternatives are IP addresses only and MAC addresses only. For traffic that does not support the default method, one of the others is used as a fallback.
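
Since only the default mode is exposed in the GUI, the other hash modes have to be selected with PowerShell. A sketch with a placeholder team name:

    # Default: hash on TCP/UDP ports and IP addresses
    Set-NetLbfoTeam -Name "Team1" -LoadBalancingAlgorithm TransportPorts

    # Alternatives: IP addresses only, or MAC addresses only
    Set-NetLbfoTeam -Name "Team1" -LoadBalancingAlgorithm IPAddresses
    Set-NetLbfoTeam -Name "Team1" -LoadBalancingAlgorithm MacAddresses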

Address hash creates a very granular distribution of traffic initiated at the VM, as each packet/connection is load balanced independently. The hash is kept for the duration of the connection, as long as the active team members are the same. If a failover occurs, or if you add or remove a team member, the connections are rebalanced. The total outbound load from one source is limited by the total outbound capacity of the team and the distribution.

[Figure: Address hash load distribution]

Inbound connections

The IP address for the vNIC is bound to the so-called primary team member, which is selected from the available team members when the team goes online. Thus, everything that uses this team will share one inbound interface. Furthermore, the inbound route may be different from the outbound route. If the primary adapter goes offline, a new primary adapter is selected from the remaining team members.

Recommended usage

  • Active/passive teams with two members
  • Never ever use this for a Virtual Switch
  • Using more than two team members with this algorithm is highly discouraged. Do not do it.

MS recommends this for VM teaming, but you should never create teams in a VM. I have yet to hear a good reason to do so in production. What you do in your lab is between you and your therapist.

Hyper-V mode

Each vNIC, be it on a VM or on the host, is assigned to a team adapter and stays connected to it as long as it is online. The advantage is a predictable network path; the disadvantage is poor load balancing. As adapters are assigned in a round-robin fashion, all your high-bandwidth traffic may overload one team adapter while the other team adapters have no traffic. There is no rebalancing of traffic. The outbound capacity for each vNIC is limited to the capacity of the Physical NIC it is attached to.

This algorithm supports VMQ.

[Figure: Hyper-V port load distribution example with three connections (red, green, blue)]

It may be the case that the red connection in the example above is saturating the physical NIC, thus causing trouble for the green connection. The load will not be rebalanced as long as both physical NICs are online, even if the blue connection is completely idle.

The upside is that the connection is attached to a physical NIC, and thus incoming traffic is routed to the same NIC as outbound traffic.

Inbound connections

Inbound connections for VMs are routed to the Physical NIC assigned to the vNIC. Inbound connections to the host are routed to the primary team member (see address hash). Thus inbound load is balanced for VMs, and we are able to utilize VMQ for better performance. For host inbound connections, this mode (like Dynamic) has the same load balancing problems as Address hash.

Recommended use

Not recommended for use on 2012R2, as Dynamic will offer better performance in all scenarios. But, if you need MAC address stability for VMs on a Switch Independent team, Hyper-V load distribution mode may offer a solution.

On 2012, recommended for teams that are connected to a vSwitch.

Dynamic

Dynamic is a mix between Hyper-V and Address hash. It is an attempt to create a best-of-both-worlds scenario by distributing outbound loads using the address hash algorithms and inbound loads as in Hyper-V mode, that is, each vNIC is assigned one physical NIC for inbound traffic. Outbound loads are rebalanced in real time. The team detects breaks in the communication stream where no traffic is sent; the traffic between two such breaks is called a flowlet. After each flowlet the team will rebalance the load if deemed necessary, expecting that the next flowlet will be similar to the previous one.

The teaming algorithm will also trigger a rebalancing of outbound streams if the total load becomes very unbalanced, a team member fails, or some other hidden, magic black-box heuristic determines that immediate rebalancing is required.

This mode supports VMQ.

[Figure: Dynamic load distribution]

Inbound connections

Inbound connections are mapped to one specific Physical NIC for each workload, be it a VM or a workload originating on a host. Thus, the inbound path may differ from the outbound path, just as with address hash.

Recommended use

MS recommends this mode for all teams with the following exceptions:

  • Teams inside a VM (which I do not recommend, no matter what).
  • LACP Switch dependent teaming
  • Active/Passive teams

I will add the following exception: If your network contains load balancers that do not employ proper routing, e.g. an F5 BigIP with the “Auto Last Hop” option enabled to work around the problem, routing will not work together with this teaming algorithm. Use Hyper-V or Address Hash Active/Passive instead.
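
If you run into this, the load distribution mode of an existing team can be changed on the fly. A minimal sketch with a placeholder team name:

    # Switch an existing team from Dynamic to Hyper-V port load balancing
    Set-NetLbfoTeam -Name "Team1" -LoadBalancingAlgorithm HyperVPort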

Source MAC address in Switch independent mode

Outbound packets from a VM that exit the host through the primary adapter will use the MAC address of the VM as the source address. Outbound packets that use a different physical adapter to exit the host will get another MAC address as the source address to avoid triggering a MAC flapping alert on the physical switches. This is done to ensure that one MAC address is only present at one physical NIC at any one point in time. The MAC assigned to the packet is the MAC of the physical NIC in question.

To try to clarify, for Address Hash:

  • If a packet from a VM exits through the primary team member, the MAC of the vNIC on the VM is kept as source MAC address in the packet.
  • If a packet from a VM exits through (one of) the secondary team members, the source MAC address is changed to the MAC address of the secondary team member.

for Hyper-V:

  • Every vSwitch port is assigned to a physical NIC/team member. If you use this for host teaming (no vSwitch), you have 1 vSwitch port and all inbound traffic is assigned to one physical NIC.
  • Every packet uses this team member until a failover occurs for any reason.

for Dynamic:

  • Every vSwitch port is assigned to a physical NIC. If you use this for host teaming (no vSwitch), you have 1 vSwitch port and all inbound traffic is assigned to one physical NIC.
  • Outbound traffic will be balanced. MAC address will be changed for packets on secondary adapters.

For Hyper-V and Dynamic, the primary is not the team primary but the assigned team member. It will thus be different for each VM.

For host teaming without a vSwitch the behavior is similar. One of the team members’ MACs is chosen as the primary for host traffic, and the MAC replacement rules apply as for VMs. Remember, you should not use Hyper-V load balancing mode for host teaming. Use Address hash or Dynamic.

Algorithm      Source MAC on primary   Source MAC on secondary adapters
Address hash   Unchanged               MAC of the secondary in use
Hyper-V        Unchanged               Not used
Dynamic        Unchanged               MAC of the secondary in use

Source MAC address in switch dependent mode

No MAC replacement is performed on outbound packets. To be overly specific:

Algorithm             Source MAC on primary   Source MAC on secondary adapters
Static Address hash   Unchanged               Unchanged
Static Hyper-V        Unchanged               Unchanged
Static Dynamic        Unchanged               Unchanged
LACP Address hash     Unchanged               Unchanged
LACP Hyper-V          Unchanged               Unchanged
LACP Dynamic          Unchanged               Unchanged

Cluster validation complains about KB3005628

Problem

You run failover cluster validation, and the report claims that one or more of the nodes are missing update KB3005628:

[Screenshot: failover cluster validation report flagging KB3005628 as missing]

You try running Windows Update, but KB3005628 is not listed as an available update. You try downloading and installing it manually, but the installer quits without installing anything.

Analysis

KB3005628 is a fix for .Net Framework 3.5, correcting a bug in KB2966827 and KB2966828. The problem is that the cluster node in question does not have the .Net Framework 3.5 feature installed. It did, however, have KB2966828 installed. As this is also a .Net 3.5 update, I wondered how it got installed in the first place. After reading more about KB3005628, it seems that KB2966828 could get installed even if .Net Framework 3.5.1 is not installed.

So far, no matter what I do the validation report lists KB3005628 as missing on one of the cluster nodes. This may be a bug in the Failover Cluster validator.

Workaround

If the .Net Framework 3.5 feature is not installed, remove KB2966827 and KB2966828 manually from the affected node if they are installed. The validation report will still list KB3005628 as missing, but as the only function of KB3005628 is to remove KB2966827 and KB2966828 this poses no problem.
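
A sketch of the check and the removal, run on the affected node (wusa.exe will ask for a reboot afterwards):

    # Check whether the offending updates are present
    Get-HotFix -Id KB2966827, KB2966828 -ErrorAction SilentlyContinue

    # Remove them if the .Net Framework 3.5 feature is not installed
    wusa.exe /uninstall /kb:2966827 /norestart
    wusa.exe /uninstall /kb:2966828 /norestart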

Cluster node fails to join cluster after boot

Problem

During a maintenance window, one of five Hyper-V cluster nodes failed to come out of maintenance mode after a reboot. SCVMM was used to facilitate maintenance mode. The system log shows the following error messages repeatedly:

Service Control Manager Event ID 7024

The Cluster Service service terminated with the following service-specific error:
Cannot create a file when that file already exists.

FailoverClustering Event ID 1070

The node failed to join failover cluster [clustername] due to error code ‘183’.

Service Control Manager Event ID 7031

The Cluster Service service terminated with the following service-specific error:
The Cluster Service service terminated unexpectedly.  It has done this 6377 time(s).  The following corrective action will be taken in 60000 milliseconds: Restart the service.


Analysis

I had high hopes of a quick fix. The cluster is relatively new, and we had recently changed the network architecture by adding a second switch. Thus, I instructed the tech who discovered the fault to try rebooting and checking that the server was able to reach the other nodes on all its interfaces. That river ran dry quickly though, as the local network proved to be working perfectly.

Looking through the Windows Update log and the KBs for the installed updates did not reveal any clues. Making it even more suspicious, a cluster validation assured me that all nodes had the same updates. Almost. Except for one. Hopeful, I looked closer, but of course, this was some .Net Framework update rollup missing on a different node.

I decided to give up all hope of an impressive five minute fix and venture into the realm of the cluster log. It is possible to read the cluster log in the event log system, but I highly recommend generating the text file and opening it in notepad++ or some other editor with search highlighting. I find it a lot easier on the eyes and mind. (Oh, and if you use the event log reader, DO NOT filter out information messages. For some reason, the eureka moment is often hidden in an information message.) The cluster log is a bit like the forbidden forest; it looks scary in daylight and even scarier in the dark during an unscheduled failover. It is easy to get lost down a track interpreting hundreds of “strange” messages, only to discover that they were benign. To make it worse, they covered a timespan of about half a second. The wrong second of course, not the one where the problem actually occurred. To put it mildly, the cluster service is very talkative. Especially so when something is wrong. As event 7031 told us, the cluster service is busy trying to start once a minute. Each try spews out thousands of log messages. The log I was working with had 574 942 lines and covered a timespan of 68 minutes. That is about 8450 lines per service start.
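
For reference, this is how I generate the text file version of the log; the destination folder and time span are just examples:

    # Write cluster.log files (one per node) for the last 70 minutes to C:\Temp,
    # using local timestamps instead of UTC
    Get-ClusterLog -Destination C:\Temp -TimeSpan 70 -UseLocalTime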

Anyway, into the forbidden forest I went with nothing but a couple of event log messages to guide me. After a while, I isolated one cluster service startup attempt with correlated system event log data. I discovered the following error:

ERR   mscs::GumAgent::ExecuteHandlerLocally: AlreadyExists(183)’ because of ‘already exists'([Servername]- LiveMigration Vlan 197)

The sneaky cluster service had tried fooling me into believing that the fault was file system related, when in fact it was a networking mess after all! You may recognize error 183 from system event id 1070 above. Thus we can conclude with fairly high probability that we have found the culprit. But what does it mean? I checked the name of the adapter in question and its teaming configuration. I ventured even further and checked for disconnected and missing network adapters, but none were to be found, neither in the device manager nor in the registry. Then it struck me. The problem was staring me straight in the eye. The line above the error message, an innocent looking INFO message, was referring to the network adapter by an almost identical but nevertheless different name:

INFO [ClNet] Adapter Intel(R) Ethernet 10G 2P X520-k bNDC #2 – VLAN : VLAN197 LiveMigration is still attached to network VLAN197 Live Migration

A short but efficient third degree interrogation of the tech revealed that the network names had been changed some weeks prior to make them consistent on all cluster nodes. Ensuring network name consistency is in itself a noble task, but one that should be completed before the cluster is formed. It should of course be possible to change network names at any time, but for some reason the cluster service has not been able to persist the network configuration properly. As long as the node is kept running this poses no problem. However, when the node is rebooted for whatever reason, the cluster service gets lost running around in the forest looking for network adapters. Network adapters that are present but silent as the cluster service is calling them by the wrong name. I have not been able to figure out exactly what happens, not to mention what caused it in the first place, but I can guess. My guess is that different parts of the persisted cluster configuration came out of sync. This probably links network adapters to their cluster network names:

[Screenshot: registry keys linking network adapters to cluster network names]

I have found this fault now on two clusters in as many days, and those are my first encounters with it. I suspect the fault is caused by a bug or “feature” introduced in a recent update to Win2012R2.

Solution

The solution is simple. Or at least I hope it is simple. As usual I strongly encourage you to seek assistance if you do not fully understand the steps below and their implications. (A PowerShell sketch of the same steps follows the list.)

  • First you have to disable the cluster service to keep it out of the way. You can do this in services.msc, PowerShell, cmd.exe or through telepathy.
  • Wait for the cluster service to stop if it was running (or kill it manually if you are impatient).
  • Change the name of ALL network adapters that have an IPv4 and/or IPv6 address and are used by the cluster service. Changing the name of only the troublesome adapter mentioned in the log may not be enough. Make sure you do not mess around with any physical teaming members, SAN HBAs, virtual cluster adapters or anything else that is not directly used as a cluster network.
  • Enable the cluster service
  • Wait for the cluster service to start, or start it manually
  • The node should now be able to join the cluster
  • Stop the cluster service again (properly this time, do not kill it)
  • Change the network adapter names back
  • Start the cluster service again
  • Verify that the node joins the cluster successfully
  • Restart the node to verify that the settings are persisted to disk.
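
The same procedure as a hedged PowerShell sketch. The adapter name is taken from the log example above and must be replaced with the cluster network adapters on your own node; run it step by step rather than as one blind script:

    # 1. Disable and stop the cluster service
    Set-Service -Name ClusSvc -StartupType Disabled
    Stop-Service -Name ClusSvc -ErrorAction SilentlyContinue

    # 2. Rename every adapter used as a cluster network
    Rename-NetAdapter -Name "VLAN197 LiveMigration" -NewName "VLAN197 LiveMigration temp"

    # 3. Re-enable and start the cluster service, then verify that the node joins
    Set-Service -Name ClusSvc -StartupType Automatic
    Start-Service -Name ClusSvc
    Get-ClusterNode

    # 4. Stop the service properly, rename the adapters back, start the service again,
    #    verify the join, and finally reboot the node to confirm that the settings persist
    Stop-Service -Name ClusSvc
    Rename-NetAdapter -Name "VLAN197 LiveMigration temp" -NewName "VLAN197 LiveMigration"
    Start-Service -Name ClusSvc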

Clustered MSDTC fails

Problem

While setting up a new clustered Distributed Transaction Coordinator for a SQL Server FCI, it fails to come online when restarted. This time it happened after I enabled network DTC access, but I have had this happen a lot during patching and cluster failovers. Usually, I would just remove and reinstall, but that doesn’t seem to help this time. No matter what I did, FC Manager would just list it as failed:

[Screenshot: Failover Cluster Manager listing the DTC resource as Failed]

Analysis

Looking in Services, I could see the DTC service was disabled:

[Screenshot: the clustered DTC service listed as Disabled in Services]

The GUID in the service name can be matched to the cluster resource in the registry. This is useful if you have more than one DTC in your cluster, as FCI will not allow you to have several DTC resources with the same name. Additional DTCs are named “New Distributed Transaction Coordinator (1)”, “(2)” and so on.


I tried to enable the service, only to be greeted with a snarky “This service is marked for deletion” message.


Then, I tried removing the resource and adding a new one, as this is my standard MO whenever I have trouble with a clustered MSDTC. Doing that I ended up with another “Marked for deletion” DTC service. My next idea was to fail over the instance, but then I thought, what if this had been in production? Thus, I kept on searching for another solution. And the solution turned out to be a simple one…

Solution

Log out ALL user sessions from the active node. This means all, yourself and any disconnected others included. Then log back in again, and bring the DTC resource online.
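
On the active node, the sessions can be listed and logged off from an elevated prompt; a minimal sketch:

    quser        # list all sessions, including disconnected ones
    logoff 2     # log off the session with ID 2; repeat for every session listed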


And by the way, remember to change the policy for the DTC resource to make sure that such errors don’t take down and fail over the entire instance. A failover could have solved the problem here, but it could just as easily lead to the instance failing back and forth until it fails for good. Adding a script or policy that automatically logs out inactive users from the cluster nodes once a day is also a good idea.
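
A sketch of such a cleanup script: it logs off every disconnected session and can be wired up as a daily scheduled task. The quser parsing is intentionally simple, so test it before letting it loose on a production node:

    # Log off all disconnected sessions on this node
    quser 2>$null | ForEach-Object {
        if ($_ -match '\s(\d+)\s+Disc\s') {
            logoff $Matches[1]
        }
    }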


Limit the size of the error log

Problem

Something bad happens, and the error log bloats with error messages very fast, triggering low disk space alarms and eventually filling the drive to the brim. For a concrete example, see https://lokna.no/?p=1263.

Analysis

Even though I have spent considerable time searching for a solution to this, so far I have been at a loss. My best bet has been limiting the number of files. If the default limit of 6 isn’t changed, there seems to be no entry in the registry. On clustered instances such errors as mentioned above can cause a failover roundabout, and I have seen several hundred log files being generated (on SQL Server 2008R2). Changing the limit adds a registry value, and this seems to make SQL Server enforce the limit.
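
For reference, this is the registry value in question. The instance key below (MSSQL12.MSSQLSERVER, a default SQL Server 2014 instance) is just an example and must be adjusted to your version and instance name; the same limit can be set in SSMS under SQL Server Logs > Configure:

    # Keep at most 12 error log files (creates or updates the NumErrorLogs value)
    Set-ItemProperty `
        -Path "HKLM:\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQLServer" `
        -Name NumErrorLogs -Value 12 -Type DWord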

This doesn’t solve the most pressing issue though. If some sort of error starts pushing several log messages every second, a small base volume with say 10GiB of free space is filled to the brim in hours if not minutes. Thus, when you are woken up by the low disk space alarm it is already to late to stop it, and you run the chance of NTFS corruption. In my experience, the chance of NTFS corruption is increasing on a logarithmic scale when you pass 5% free space on a volume. 0 bytes of free space with processes still trying to write more data to the volume more or less guarantees some kind of NTFS corruption unless you catch it quickly.


Annoying default settings

I have never quite liked the way Microsoft wants me to use Windows Explorer. The standard settings are quite annoying to me, but I understand why they are as they are on end user versions of Windows. Joe User is stupid, usually more so than you might imagine possible, so it is important to protect him against himself. On a server on the other hand, I would think we should anticipate some minimal knowledge about the file system. A server user should be able to look at a system file without thinking: “Hmm, bootmgr is a file I haven’t seen before. I should probably delete it. And that big windows folder just contains a lot of strange files I never use. I’m deleting some of those too, it will leave more room for pictures of my cat!”. But no, it has the same stupid defaults as the home editions. Because of this, I have had to create a list of all the stuff I have to remember to change whenever I log on to a new server, lest I go insane and maul the next poor user who wants me to recover the database he “forgot” to back up before the disk crashed.

 
