Configure VMQ and RSS on physical servers

Introduction

Samples below are collected from Windows Server 2016

The primary objective is to avoid weighing down Core 0 with networking traffic. This is the first core on the first NUMA node, and this core is responsible for a lot of kernel processing. If this core suffers from contention, a wild blue screen of death will appear. Thus, we want our network adapters to use other cores to process their traffic. We can achieve this in three ways, depending on what we use the adapter for:

  • Enable Receive Side Scaling (RSS) and configure it to use specific cores.
  • Enable Virtual Message Queueing and configure it to use specific cores
  • Set the preferred NUMA node

For physical machines

On network adapters used for generic traffic, we should enable RSS and disable VMQ. On adapters that are part of a virtual switch, we should disable RSS and enable VMQ. The preferred NUMA node should be configured for all physical adapters.

For virtual machines

If the machine has more than one CPU, enable vRSS.

Investigating the NUMA architecture

Sockets and NUMA nodes

Sysinternals coreinfo -s -n will show the relationship between logical processors, sockets and NUMA nodes. In the example below we have a CPU with two sockets and four nodes.

clip_image002

Closest NUMA node

Each PCIE adapter is physically connected to a specific NUMA node. If possible, RSS / VMQ should be mapped to cores on the same NUMA node that the NIC is connected to. Get-NetadApterRss will show you which NUMA node is closest for each adapter. The one in the sample is connected to/closest to NUMA node 0 as the NUMA distance for cores in group 0 is 0. We can also see that the NUMA distance to node 1 for this particular port is lower than the distance to nodes 2 and 3. This is caused by the fact that node 0 and 1 are on the same physical CPU, whereas node 2 and 3 are on another physical CPU.

clip_image004 Continue reading “Configure VMQ and RSS on physical servers”

Windows Server 2012R2 NIC Teaming

This is an attempt at giving a technical overview of how the native network teaming in Windows 2012R2 works, and how I would recommend using it. From time to time I am presented with problems “caused” by network teaming, so figuring out how it all works has been essential. Compared to the days of old, where teaming was NIC vendor dependent, todays Windows native teaming is a delight, but it is not necessarily trouble free.

Sources

Someone at Microsoft has written an excellent guide called Windows Server 2012 R2 NIC Teaming (LBFO) Deployment and Management, available at here. It gives a detailed technical guide to all the available options. I have added my field experience to the mix to create this guide.

Definitions

  • NIC: Network Interface Card. Also known as Network Adapter.
  • vNIC/virtual NIC: a team adapter on a host or another computer (virtual or physical) that use teaming.
  • Physical NIC/adapter: An adapter port that is a member of a team. Usually a physical NIC, but could be a virtual NIC if someone has made a complicated setup with teaming on a virtual machine.
  • vSwitch: A virtual switch, usually a Hyper-V switch.
  • Team member: a NIC that is a member of a team.
  • LACP: Link Aggregation Control Protocol, also IEE 802.3ad. See https://en.wikipedia.org/wiki/Link_aggregation#Link_Aggregation_Control_Protocol

Active-Active vs Active-Passive

image

If none of the adapters are set as standby, you are running an Active-Active config. If one is standby and you have a total of two adapters, you are running an Active-Passive config. If you have more than two team members, you may be running a mixed Active-Active-Passive config (strandby adapter set), or an Active-Active config without a standby adapter.

If you are using a configuration with more than one active team member on a 10G infrastructure, my recommendation is to make sure that both members are connected to the same physical switch and in the same module. If not, be prepared to sink literally hundreds, if not thousands of hours into troubleshooting that could otherwise be avoided. There are far too many problems related to the switch teaming protocols used on 10G, especially with the Cisco Nexus platform. And it is not that they do not work, it is usually an implementation problem. A particularly nasty kind of device is something Cisco refers to as a FEX or fabric extender. Again, it is not that it cannot work. It’s just that when you connect it to the main switch with a long cable run it usually works fine for a couple of months. And then it starts dropping packets and pretends nothing happened. So if you connect one of your team members to a FEX, and another to a switch, you are setting yourself up for failure.

Due to the problems mentioned above and similar troubles, many IT operations have a ban on Active-Active teaming. It is just not worth the hassle. If you really want to try it out, I recommend one of the following configurations:

  • Switch independent, Hyper-V load balancing. Naturally for vSwitch connected teams only. No, do not use Dynamic.
  • LACP with Address Hash or Hyper-V load balancing. Again, do not use Dynamic mode.

Team members

I do not recommend using more than two team members in Switch Independent teaming due to artifacts in load distribution. Your servers and switches may handle everything correctly, but the rest of the network may not. For switch dependent teaming, you should be OK, provided that all team members are connected to the same switch module. I do not recommend using more than four team members though, as it seems to be the breaking point between added redundancy and too much complexity.

Make sure all team members are using the exact same network adapter with the exact same firmware and driver versions. Mixing them up will work, but even if base jumping is legal you don’t have to go jumping. NICs are cheap, so fork over the cash for a proper Intel card.

Load distribution algorithms

Be aware that the load distribution algorithm primarily affects outbound connections only. The behavior of inbound connections and routing for switch independent mode is described for each algorithm. In switch dependent mode (either LACP or static) the switch will determine where to send the inbound packets.

Address hash

Using parts of the address components, a hash is created for each load/connection. There are three different modes available, but the default one available in the GUI (Port and IP) is mostly used. The other alternatives are IP only and MAC only. For traffic that does not support the default method, one of the others is used as fallback.

Address hash creates a very granular distribution of traffic initiated at the VM, as each packet/connection is load balanced independently. The hash is kept for the duration of the connection, as long as the active team members are the same. If a failover occurs, or if you add or remove a team member, the connections are rebalanced. The total outbound load from one source is limited by the total outbound capacity of the team and the distribution.

clip_image003[5]

Inbound connections

The IP address for the vNIC is bound to the so called primary team member, which is selected from the available team members when the team goes online. Thus, everything that use this team will share one inbound interface. Furthermore, the inbound route may be different from the outbound route. If the primary adapter goes offline, a new primary adapter is selected from the remaining team members.

Recommended usage
  • Active/passive teams with two members
  • Never ever use this for a Virtual Switch
  • Using more than two team members with this algorithm is highly discouraged. Do not do it.

MS recommends this for VM teaming, but you should never create teams in a VM. I have yet to hear a good reason to do so in production. What you do in you lab is between you and your therapist.

Hyper-V mode

Each vNIC, be it on a VM or on the host, is assigned to a team adapter and stays connected to this as long as it is online. The advantage is a predictable network path, the disadvantage is poor load balancing. As adapters are assigned in a round robin fashion, all your high bandwidth usage may overload one team adapter while the other team adapters have no traffic. There is no rebalancing of traffic. The outbound capacity for each vNIC is limited to the capacity of the Physical NIC it is attached to.

This algorithm supports VMQ.

clip_image004[5]

It may be the case that the red connection in the example above is saturating the physical NIC, thus causing trouble for the green connection. The load will not be rebalanced as long as both physical NICs are online, even if the blue connection is completely idle.

The upside is that the connection is attached to a physical NIC, and thus incoming traffic is routed to the same NIC as outbound traffic.

Inbound connections

Inbound connections for VMs are routed to the Physical NIC assigned to the vNIC. Inbound connections to a host is routed to the primary team member (see address hash). Thus inbound load is balanced for VMs, and we are able to utilize VMQ for better performance. Dynamic has the same inbound load balancing problems as Address hash for host inbound connections.

Recommended use

Not recommended for use on 2012R2, as Dynamic will offer better performance in all scenarios. But, if you need MAC address stability for VMs on a Switch Independent team, Hyper-V load distribution mode may offer a solution.

On 2012, recommended for teams that are connected to a vSwitch.

Dynamic

Dynamic is a mix between Hyper-V and Address hash. It is an attempt to create a best of both worlds-scenario by distributing outbound loads using address hash algorithms and inbound load as Hyper-V, that is each vNIC is assigned one physical NIC for inbound traffic. Outbound loads are rebalanced in real time. The team detects breaks in the communication stream where no traffic is sent. The period between two such breaks are called flowlets. After each flowlet the team will rebalance the load if deemed necessary, expecting that the next flowlet will be equal to the previous one.

The teaming algorithm will also trigger a rebalancing of outbound streams if the total load becomes very unbalanced, a team member fails or other hidden magic black-box settings should determine that immediate rebalancing is required.

This mode supports VMQ.

clip_image005

Inbound connections

Inbound connections are mapped to one specific Physical Nic for each workload, be it a VM or a workload originating on a host. Thus, the inbound path may differ from the outbound path as in address hash.

Recommended use

MS recommends this mode for all teams with the following exceptions:

  • Teams inside a VM (which I do not recommend that you do no matter what).
  • LACP Switch dependent teaming
  • Active/Passive teams

I will add the following exception: If your network contains load balancers that do not employ proper routing, e.g. F5 BigIP with the “Auto Last Hop” option enabled to overcome the problems, routing will not work together with this teaming algorithm. Use Hyper-V or Address Hash Active/passive instead.

Source MAC address in Switch independent mode

Outbound packets from a VM that is exiting the host through the Primary adapter will use the MAC address of the VM as source address. Outbound packets that are using a different physical adapter to exit the host will get another MAC address as source address to avoid triggering a MAC flapping alert on the physical switches. This is done to ensure that one MAC address is only present at one physical NIC at any one point in time. The MAC assigned to the packet is the MAC of the Physical NIC in question.

To try to clarify, for Address Hash:

  • If a packet from a VM exits through the primary team member, the MAC of the vNIC on the VM is kept as source MAC address in the packet.
  • If a packet from a VM exits through (one of) the secondary team members, the source MAC address is changed to the MAC address of the secondary team member.

for Hyper-V:

  • Every vSwitch port is assigned to a physical NIC/team member. If you use this for host teaming (no vSwitch), you have 1 vSwitch port and all inbound traffic is assigned to one physical NIC.
  • Every packet use this team member until a failover occurs for any reason

for Dynamic:

  • Every vSwitch port is assigned to a physical NIC. If you use this for host teaming (no vSwitch), you have 1 vSwitch port and all inbound traffic is assigned to one physical NIC.
  • Outbound traffic will be balanced. MAC address will be changed for packets on secondary adapters.

For Hyper-V and Dynamic, the primary is not the team primary but the assigned team member. It will thus be different for each VM.

For Host teaming without a vSwitch the behavior is similar. One of the team members’ MAC is chosen as the primary for host traffic, and MAC replacement rules applies as for VMs. Remember, you should not use Hyper-V load balancing mode for host teaming. Use Address hash or Dynamic.

Algorithm Source MAC on primary Source MAC on secondary adapters
Address hash Unchanged MAC of the secondary in use
Hyper-V Unchanged Not used
Dynamic Unchanged MAC of the secondary in use

Source MAC address in switch dependent mode

No MAC replacement is performed on outbound packets. To be overly specific:

Algorithm Source MAC on primary Source MAC on secondary adapters
Static Address hash Unchanged Unchanged
Static Hyper-V Unchanged Unchanged
Static Dynamic Unchanged Unchanged
LACP Address hash Unchanged Unchanged
LACP Hyper-V Unchanged Unchanged
LACP Dynamic Unchanged Unchanged

Multiple default gateways

Update 2018-0307

I have verified this as an issue on Windows 2016 as well. Sometimes if a network adapter has been configured with a default gateway before it is added to a NIC Team, you will get multiple default gateways.

Problem

While troubleshooting a networking teaming issue on a cluster, someone sent me the a link to this article about multiple default gateways on Win 2012 native teaming: http://www.concurrency.com/blog/bug-in-nic-teaming-wizard-makes-duplicate-default-routes-in-server-2012/. The post discusses a pretty specific scenario that we didn’t have on our clusters (most of them are on 2008R2), but I discovered several nodes with more than one default route in route print:

image

The issue I was looking into was another, but I remembered a problem from a weekend some months ago that might be related: When a failover was triggered on a SQL cluster, the cluster lost communication with the outside world. To be specific: no traffic passed through the default gateway. As all cluster nodes were on the same subnet the cluster itself was content with the situation, but none of the webservers were able to communicate with the clustered SQL server as they were in a different subnet. This made the webservers sad and the webmaster angry, so we had to fix it. As this happened in production over the weekend, the focus was on a quick fix and we were unable to find a root cause at the time. A reboot of the cluster nodes did the trick, and we just wrote it off as fallout from the storage issue that triggered the failover. The discovery of multiple default gateways on the other hand prompted a more thorough investigation.

Analysis

The article mentioned above talks exclusively about Windows 2012’s native teaming software, but this cluster is running Windows 2008 R2 and is relying on teaming software provided by the NIC manufacturer (Qlogic). We have had quite a lot of problems with the Qlogic network adapters on this cluster, so I immediately suspected them to be the rotten apple. I am not sure if this problem is caused by a bug in Windows itself that is present in both 2012 and 2008R2, or if both MS and Qlogic are unable to produce a functioning NIC teaming driver, but the following is clear:

If your adapters have a default gateway when you add them to a team, there is a chance that this default gateway will not get removed from the system. This happens regardless if the operating system is Windows 2012 or Windows 2008 R2. I am not sure if gateway addresses configured by DHCP also triggers this behavior. It doesn’t happen every time, and I have yet to figure out if there are any specific triggers as I haven’t been able to reproduce the problem at will.

Solution A

To resolve this issue, follow the recommendations in  http://www.concurrency.com/blog/bug-in-nic-teaming-wizard-makes-duplicate-default-routes-in-server-2012/:

First you have to issue a command to delete all static routes to 0.0.0.0. NB! This will disconnect you from the server if you are connected remotely from outside the subnet.

image

Configure the default gateway for the team using IP properties on the virtual team adapter:

image

Do a route print to make sure you have only one default gateway under persistent routes.

Solution b

If solution A doesn’t work, issue a netsh interface ip reset command to reset the ip configuration and reboot the server. Be prepared to re-enter the ip information for all adapters if necessary.

What not to do

Do not configure the default gateway using route add, as this will result in a static route. If the computer is a node in a cluster, the gateway will be disabled at failover and isolate the server on the local subnet. See http://support.microsoft.com/kb/2161341 for information about how to configure static routes on clusters if you absolutely have to use a static route.