Since W2012R2 it is recommended that all clusters have a quorum witness regardless of the number of cluster nodes. As you may know, the purpose of the cluster witness is to ensure a majority vote in the cluster. If you have 2 nodes with one vote each and add a cluster witness you create a possibility for a majority vote. If you have 3 nodes on the other hand, adding a witness will remove the majority vote as you have 4 votes total and a possible stalemate.
If a stalemate occurs, the cluster nodes may revolt and you are unable to get it working without a force quorum, or you could take a node out behind the barn and end its misery. Not a nice situation at all. W2012R2 solves this predicament by dynamic vote assignments. As long as a quorum has been established, if votes disappear due to nodes going offline, it will turn the witness vote on and off to make sure that you always have a possibility for node majority. As long as you HAVE a witness, that is.
There are three types of witnesses:
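The vote arithmetic can be sketched in a few lines of Python. This is only a simplified model of the idea described above, assuming one vote per node and a witness vote that is counted only when the node votes are even. The function names are made up for illustration and have nothing to do with how the cluster service is actually implemented.

```python
def witness_vote_needed(node_votes: int) -> bool:
    """The witness vote is only useful when the node votes are even (a tie is possible)."""
    return node_votes % 2 == 0


def total_votes(node_votes: int, has_witness: bool) -> int:
    """Total votes when the witness vote is switched on and off dynamically."""
    if has_witness and witness_vote_needed(node_votes):
        return node_votes + 1
    return node_votes


def votes_for_majority(total: int) -> int:
    """Smallest number of votes that is strictly more than half of the total."""
    return total // 2 + 1


if __name__ == "__main__":
    # Print node count, total votes and the votes required for a majority.
    for nodes in range(1, 6):
        total = total_votes(nodes, has_witness=True)
        print(nodes, total, votes_for_majority(total))
```

With a witness in place the total always ends up odd in this model, so a tie cannot happen. Without one, an even number of nodes leaves the stalemate open.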
- A SAN-connected shared witness disk, usually FC or iSCSI. Recommended for clusters that use shared SAN-based cluster disks for other purposes, otherwise not recommended. If this sounds like gibberish to you, you should use another type of witness.
- A File share witness. Just a file share. Any type of file share would do, as long as it resides on a Windows server in the same domain as the cluster nodes. SOFS shares are recommended, but not necessary. DO NOT build a SOFS cluster for this purpose alone. You could create a VM for cluster witnesses, as each cluster witness is only about 5MiB, but it is best to find an existing physical server with a high uptime requirement in the same security zone as the cluster and create some normal SMB-shares there. I recommend a physical server because a lot of virtual servers are Hyper-V based, and having the witness on a VM in the cluster it is a witness for is obviously a bad idea.
- Cloud Witness. New in W2016. If you have an Azure storage account and are able to allow the cluster nodes a connection to Azure, this is a good alternative. Especially for stretch clusters that are split between different rooms.
How to set up a simple SMB File share witness
- Select a server to host the witness, or create one if necessary.
- Create a folder somewhere on the server and give it a name that denotes its purpose:
- Open the Advanced Sharing dialog
- Enable sharing and change the permissions. Make sure that Everyone is removed, and add the cluster computer object. Give the cluster computer object Full Control permissions.
- Open Failover Cluster manager and connect to the cluster
- Select “Configure Cluster Quorum Settings”:
- Select File Share Witness
- Enter the path to the file share as \\server\share
- Finish the wizard
- Make sure the cluster witness is online:
Sometimes when I restart one of my Windows 10 computers the network never gets online. I have to disable/enable the network to get it back. The reason seems to be an IP conflict with the address 0.0.0.0. This computer has a fixed IP, no DHCP is involved. The NIC is an Intel I219-V.
Event ID 4199, TCPIP: The system detected an address conflict for IP address 0.0.0.0 with the system having network hardware address 20-4C-9E-49-38-8A.
A quick check in the tracking system revealed this article from Cisco: http://www.cisco.com/c/en/us/support/docs/ios-nx-os-software/8021x/116529-problemsolution-product-00.html. It talks about a conflict between the IP conflict detection system in Windows and an ARP Probe sent by the switch as part of IP Device Tracking. I am no Cisco expert, but I would like to have a chat with whoever thought that IP conflict detection should start BEFORE the NIC has an IP set…
As far as I can tell the IP Tracking function on the switch is enabled by default from IOS version 15.2.
Turn off IP Device Tracking at the switch
https://supportforums.cisco.com/discussion/11960461/ip-device-tracking talks about running the following commands on the switch:
switch(config)# int range gig1/0/1 - 24
switch(config-if)# nmsp attach suppress
This is supposed to turn off the IP Device tracking on a per switch basis. I do not have access to my switching infrastructure, so I have not tested this. I will update this post if I get the opportunity to test it.
Turn off the Gratuitous ARP Function
Refer to this ancient KB: https://support.microsoft.com/en-us/kb/219374. It is written for NT4, but it still works. Be aware, this basically turns off IP Conflict detection completely.
Upgrade your NIC driver
And hope that it helps…
I was spending Christmas with relatives on the western coast of Norway. A part of Norway where foul weather is no stranger and heavy rain is the norm. Where the Vikings learned to handle the waves of the Atlantic ocean next door. Thus we are no strangers to voltage spikes. Power outages due to Thor’s angry electrons are quite common. It can be pretty, but such evening skies are usually a harbinger of bad weather.
And surely it was. The next morning we were hammered by gale force winds. I would rate it as a medium storm, but the meteorologists gave it an official name (Urd, an old Viking female name) and called it extreme weather. The second night of the storm I was awakened by a loud crack from the direction of the intake breaker box located in the guest room, followed by thunder. The howling from the UPS in the server closet revealed a power outage. I waited for about 10 minutes, but all I could hear was the storm. The reason for waiting is this: If there is one place in the house you do not want to be during a lightning strike, it is with your nose in the breaker box trying to get the power back on.
After a while the lack of electrical heating won over my concerns for further strikes, and I went to look at the main breaker box. I was expecting one of the breakers reduced to a pile of rubble, but everything looked OK. If you wonder how I found my way through the darkness, let us just say that you do not grow up in this part of Norway without learning how to find one of your many flashlights in the dark. As all seemed OK in the intake box, on I went to investigate the distribution box in the next room. This is where the main distribution breaker, residual current device and surge protectors are located. Both the surge protectors and the residual current device were triggered, along with several circuit breakers. I primed the residual current device and switched it back on. Then I reset the circuit breakers, verified that the electric heater in my room was working and went back to sleep.
I was raised from my slumber by the users (i.e. my relatives) a couple of hours later. They complained about missing internet service and beseeched me to investigate. And investigate I did. For a normal residential house they have quite the advanced setup (I might be to blame for this), but it is built that way to be resilient. After the installation of extra surge protectors some years back, the culprit is usually the ADSL modem. There are sadly no phone line surge protectors available that are powerful enough to resist the onslaught, so when the angry electrons enter the house through the phone lines they usually end up killing the modem. The ISDN phone connected to the same line has survived for more than a decade, but that is a German made ancient Siemens device unlike the chinesium crap the ISP calls an ADSL modem that is usually replaced twice per year.
A quick look at the modem revealed not a hint of status LEDs. Hoping for a quick fix in form of a power supply replacement I took it down from its mounting bracket, only to discover the unmistakable rattle of destroyed components from within. The VPN box was also dead.
I called the ISP and convinced them to send over a new modem. Due to this still being Christmas, it would take two days. Which isn’t bad, but usually we could get one the same day.
A quick summary of the components: The Wireless AP is there to provide a consistent WLAN. The modem has one built in, but each time it is replaced, the settings change. The VPN box is placed there by me to facilitate remote support. The DSL splitter is connected to the outside line and sends one signal to the ISDN NT1 and another signal to the modem. The NT1 is located in another room. The box on the lower right is supplied by the satellite TV supplier, and its function is unclear. It has some kind of wireless function, and I suspect it is a dedicated WLAN for the satellite decoder to call home.
But back to the modem. When you remove the top dark-out cover it looks like this:
It identifies itself as a ZyXEL P8702N, which as far as I can tell is an ISP special, that is, only sold to ISPs. The hardware supports both an internal DSL modem and an external modem/adapter.
I was curious as to which components produced the rattle, so I removed the top cover. No screws, only fidgety plastic clips. Does not look like it is designed to be serviced. First glance revealed three separate confirmed problems.
1 – MNC G4804DG
This chip is a dual port gigabit Ethernet line transformer. There are two of them, which correlates to the four LAN ports. There is also a WAN port connected to the G1806DG on the left. There are clear signs of carbon on the board, evidence of a blue smoke leak. And as we all know, if the magic blue smoke gets out of the chip it stops working. This should have prompted me to investigate further at the other end of port 3, but more on that later.
2 – DSL line “protection”
The designers have tried to protect the modem from angry electrons by connecting the DSL line to a gas discharge tube and two in-line capacitors. As the picture clearly tells, this was not enough. As far as I could find out the capacitors are low quality chinesium, and I guess that goes for most of this box.
A close-up reveals further damage, even carbon on the connector itself. I would guess that the two capacitors were the source of the loud noise.
3 – Unknown chip
This could be the “modem” part, but it was too small and damaged to identify with the equipment I had available. The board shows a trail of destruction from the DSL-port down to this chip.
All I could find is a broken 3-pin part, probably some kind of transistor. The power supply was luckily all that was broken, and a retrofit universal model from the local supplier brought it back to life. Local as in 50 clicks away, but I digress.
I promised to return to case 1 from the modem. The one about carbon on the network interface transformer. After replacing the modem I quickly discovered that the server was no longer accessible. This is your typical small-business setup with one box running file, print, AD and accounting software connected to a couple of clients. There was sadly no time for pictures, but to sum it up, the angry electrons killed a HP Procurve switch and a network adapter in one of the computers.
All this from a single lightning strike far away. The angry electrons of Thor are not to be scoffed at.
First, a friendly warning: this post details procedures for messing with the time service on domain controllers. As always, if you do not understand the commands or their consequences, seek guidance.
I have been upgrading my lab to Windows Server 2016 in preparation for a production rollout. Some may feel I am late to the game, but I have always been reluctant to roll out new server operating systems quickly. I prefer to have a good baseline of other people’s problems to look for in your friendly neighborhood tracking service (AKA search engine) when something goes wrong.
Anyways, some weeks ago I rolled out 2016 on my domain controller. When I came back to upgrade the Hyper-V hosts, I noticed time was off by 126 seconds between the DC and the client. As the clock on the DC was correct, I figured the problem was client related. Into the abyss of w32tm we go.
The Windows Time Service is not exactly known for its user friendliness, so I just started with the normal shotgun approach at the client:
net stop w32time
w32tm /config /syncfromflags:domhier
net start w32time
These commands, if executed at an administrative command prompt, will remind the client to get its time from the domain time sync hierarchy, in other words one of the DCs. If possible. Otherwise it will just let the clock drift until it passes the time delta maximum, at which time it will not be able to talk to the DC any more. This is usually the point when your friendly local monitoring system will alert you to the issue. Or your users will complain. But I digress.
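To put the 126 seconds in perspective, here is a small Python sketch of the tolerance check. It assumes the default Kerberos clock tolerance of 5 minutes, which may well be configured differently in your domain, and the timestamps are invented for the example. It is not how Windows performs the check, just the arithmetic behind it.

```python
from datetime import datetime, timedelta

# Assumption: the default Kerberos clock tolerance of 5 minutes is in effect.
MAX_SKEW = timedelta(minutes=5)


def skew(dc_time: datetime, client_time: datetime) -> timedelta:
    """Absolute clock difference between the DC and the client."""
    return abs(client_time - dc_time)


def within_tolerance(dc_time: datetime, client_time: datetime,
                     max_skew: timedelta = MAX_SKEW) -> bool:
    """True as long as the client clock has not drifted past the tolerance."""
    return skew(dc_time, client_time) <= max_skew


if __name__ == "__main__":
    dc = datetime(2017, 1, 1, 12, 0, 0)        # hypothetical DC clock
    client = dc + timedelta(seconds=126)       # the offset seen in this post
    print(skew(dc, client), within_tolerance(dc, client))
```

Under that assumption a drift of 126 seconds is still inside the tolerance, which would explain why nothing had started failing yet.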
Issuing a w32tm /resync command afterwards should guarantee an attempt to sync, and hopefully a successful result. At least in my dreams. In reality though, it just produced another nasty error: 0x800705B4. The tracking service indicated that it translates to “operation timed out”.
The next step was to try a stripchart. The stripchart option instructs w32tm to query a given computer and show the time delta between the local and remote computer. Kind of like ping for time servers. The result should look something like this:
But unfortunately, this is what I got:
I shall spare you the details of all the head-scratching and ancient Viking rituals performed at the poor client to no avail. Suffice it to say that I finally realized the problem had to be related to the DC upgrade. I tried running the stripchart from the DC itself against localhost, and that failed as well. That should have been a clue that something was wrong with Time Service itself. But as troubleshooting the Time Service involves decoding its registry keys, I went to confirm the firewall rules instead. Which of course were hunky-dory.
I then ran dcdiag /test:advertising /v to check if the server was set to advertise as a time server:
The next step was to reset the configuration for the Time Service. The official procedure is as follows:
net stop w32time
w32tm /unregister
w32tm /register
net start w32time
This procedure usually ends with some error message complaining about the service being unable to start due to some kind of permission issue with the service. I seem to remember error 1902 is one of the options. If this happens, first try 2 consecutive reboots. Yes, two. Not one. Don’t ask why, no one knows. If that does not help, try again but this time with a reboot after the unregister command.
The procedure ran flawlessly this time, but it did not solve the problem.
Time to don the explorer’s hat and venture into the maze of the registry. The Time Service hangs out in HKLM\System\CurrentControlSet\Services\W32Time. After some digging around, I found that the NTP Server Enabled key was set to 0. Which would suggest that it was turned off. I mean, registry settings are tricksy, but there are limits. I tried changing it to 1 and restarted the service.
Suddenly, everything works. The question is why… Not why it started working, but why the setting was changed to 0. I am positive time sync was working fine prior to the upgrade. Back to the tracking service I went. Could there be a new method for time sync in Windows 2016? Was it all a big conspiracy caused by Russian hackers in league with Trump? Of course not. As usual the culprits are the makers of the code.
My scenario is not a complete match, but in KB3201265 Microsoft admits to having botched the upgrade process for Windows Time Service in both Windows Server 2016 and the corresponding Windows 10 1607. Basically, it applies default registry settings for a non-domain-joined server. Optimistic as always they tell you to export the registry settings for the service PRIOR to upgrading. As if I have the time to read every KB they publish. Anyways, it also details a couple of other possible solutions, such as how to get the previous registry settings out of Windows.old.
My recommendation is as such: Do not upgrade your domain controllers. Especially not in production. I only did it in the lab because I wanted to save time.
If you, like me, have put yourself in this situation, and honestly, why else would you have read this far, I recommend following method 3 in KB3201265. Unless you feel comfortable exploring the registry and fixing it manually.
The event log fills up with Event ID 2 from Kernel-EventTracing stating Session “” failed to start with the following error: 0xC0000022.
If you look into the system data for one of the events, you will find the associated ProcessID and ThreadID:
If the event is relatively current, the Process ID should still be registered by the offending process. Open Process Explorer and list processes by PID:
We can clearly see that the culprit is one of those pesky WMI-processes. The ThreadID is a lot more fluctuating than the ProcessID, but we can always take a chance and see if it will reveal more data. I spent a few minutes writing this, and in that time it had already disappeared. I waited for another event, and immediately went to Process Explorer to look for thread 18932. Sadly though, this didn’t do me any good. For someone more versed in kernel API calls the data might make some sense, but not to me.
I had more luck rummaging around in the ad-profile generator (Google search). It pointed me in the direction of KB3087042. It talks about WMI calls to the LBFO teaming (Windows 2012 native network teaming) and conflicts with third-party WMI providers. Some more digging around indicated that the third-party WMI provider in question is HP WBEM. HP WBEM is a piece of software used on HP servers to facilitate centralized server management (HP Insight). As KB3087042 states, the third-party provider is not the culprit. That implies a fault in Windows itself, but one must not admit such things publicly of course.
In their infinite wisdom (or as an attempt to compensate for their lack thereof), the good people of Microsoft have also provided a manual workaround for the issue. It is a bit difficult to understand, so I will provide my own version below.
As usual, if the following looks to you as something that belongs in a Harry Potter charms class, please seek assistance before you implement this in production. You will be messing with central operating system files, and a slip of the hand may very well end up with a defective server. You have been warned.
But let us get on with the fix. First, you have to get yourself an administrative command prompt. The good old fashioned black cmd.exe (or any of the 16 available colors). There is no reason why this would not work in one of those fancy new blue PowerShell thingies as well, but why take unnecessary risks?
Then, we have a list of four incantations – uh.., commands to run through. Be aware that if for some reason your system drive is not C:, you will have to take that into account. And then spend five hours repenting and trying to come up with a good excuse for why you did it in the first place. Or perhaps spend the time looking for the person who did it and give them a good talking to. But I digress. The commands to run from the administrative command prompt are as follows:
Takeown /f c:\windows\inf
icacls c:\windows\inf /grant "NT AUTHORITY\NETWORK SERVICE":"(OI)(CI)(F)"
icacls c:\windows\inf\netcfgx.0.etl /grant "NT AUTHORITY\NETWORK SERVICE":F
icacls c:\windows\inf\netcfgx.1.etl /grant "NT AUTHORITY\NETWORK SERVICE":F
The first command takes ownership of the Windows\Inf folder. This is done to make sure that you are able to make the changes. The three icacls commands grant permissions to the NETWORK SERVICE system account on the INF-folder and two ETL-files. The result should look something like this:
To test if you were successful, run this command:
And look for the highlighted result:
Should you want to learn more about the icacls command, this is a good starting point.
This point is very important. If you do not hand over ownership of Windows\Inf back to the system, bad things will happen in your life.
This time, you only need a normal file explorer window. Open it, and navigate to C:\Windows. Then open the advanced security dialog for the Inf folder.
Next to the name of the current owner (should be your account) click the change button/link.
Then, select the Local Computer as location and NT SERVICE\TrustedInstaller as object name. Click Check Names to make sure you entered everything correctly. If you did, the object name changes to TrustedInstaller (underlined).
Click OK twice to get back to the file explorer window. If you did not get any error messages, you are done.
It IS possible to script the ownership transfer as well, but in my experience the failure rate is way too high. I guess the writers of the KB agree, as they have only given a manual approach.
For some reason, the Store Icon comes back to haunt you every time you restart. That is, it stays pinned to the task bar no matter what, and if you un-pin it, like a zombie it will rise from the grave as soon as you reboot…
This is probably a scheme to make us buy more of those stupid “modern” apps. Not that there aren’t useful apps, but they are few and far between. Anyways, the point is to get rid of the icon. I could of course disable the store altogether, but I just want it out of my way and off my lawn –eh, taskbar.
The good people of Microsoft have finally given us a proper option to get rid of it. Salvation comes in the form of a GPO called “Do not allow pinning Store app to the Taskbar”. The wording is such as to make us believe that it is all our fault to begin with, but no matter, let’s just remove it.
The GPO is hidden in User Configuration under Policies, Administrative Templates, Start Menu and Taskbar:
Set it as enabled and deploy it to your users as best fits you. If you are looking to make this change on your own local computer without a domain, just start gpedit.msc to edit your local policy.
When trying to start Failover Cluster manager you get an error message: “Microsoft Management Console has stopped working”
Inspection of the application event log reveals an error event id 1000, also known as an application error with the following text:
Faulting application name: mmc.exe, version: 6.3.9600.17415, time stamp: 0x54504e26
Faulting module name: clr.dll, version: 4.6.1055.0, time stamp: 0x563c12de
Exception code: 0xc0000409
As usual, this is a .NET Framework debacle. Remove KB 3102467 (Update for .NET Framework 4.6.1), or wait for a fix.
It was nearing the end of summer, but most of the Knights of Hyper-V were still on vacation. There was of course always one knight on call, but the others were lazily roaming the countryside, or lounging along the bank of a river pretending to be on a fishing trip. Some even went on expeditions to far away realms looking for trouble, relaxation, fancy fishing gear, VMWare-proof armor, or new riding boots. The all-seeing monitors however, were not on vacation. To be honest we do not even know if they ever sleep, they just seem to take turns going into hibernation mode. Instead they had spent the summer installing new crystal orbs, automated all-seeing eyes and such. One of their new contraptions was some kind of network enabled spooky ghost detector. Its purpose was to send probes into The Wasteland of Nexus and attempt to locate signs of the ghosts of forgotten VMs and other security problems.
This came about as flaws had been discovered in the procedure for disposal of outdated VMs. The minions responsible for dealing with outdated VM disposal had gotten increasingly bureaucratic, spending most of their time hassling others with demands of forms filled in triplicate to update documentation. And such tasks are of course important, but the most important thing is to actually dispose of the old VM. The result was a number of undocumented (as the documentation had been updated) VMs roaming The Wasteland of Nexus without updated security software, making the entire realm vulnerable to outside attacks from beyond the wall. Firewall that is.
Such was the back-story, when one dark and gloomy midsummer morning, a trouble ticket landed in the inbox of the knight on call with a loud boom. It was another list of suspect activities detected in the wasteland. A couple of probes had returned during the night, complaining about servers without patches several years old. To add a little spice to the mix, these were ghost servers. If you knocked on the right door they would answer, but they were not listed anywhere. Not in the labyrinthine CMDB, and certainly not in any of the address books. For all intents and purposes they did not exist. Except of course for the undeniable fact that they most certainly did. This was something that could provide days, if not months of confused contemplation for social studies majors, human resources, project managers and others of similar ilk. But the knight was an engineer and simply scoffed at such irrelevancies. To him this was simply a problem looking for a solution. But which solution? The available information pointed to an ancient server from 2010. That is a very long time ago, and at least two documentation systems have been sent off to Valhalla by way of funeral pyre in the meantime. The current buzzword-friendly variant was named after the Chinese philosopher Confucius. He was the inventor of the term “Do not do to others what you do not want done to yourself”, but if such terms were to be enforced in documentation systems, violent outbreaks would be the norm, as most documentation can be interpreted as a form of torture. Anyways, no trace of the ghosts was found in the current system, and the old ones were burned. There was always a faint hope that someone had kept a personal log mentioning the ghosts’ former names, but no such luck was to be had this time around.
The knight went back to the all-seeing monitors and requested more information to aid him in his search. Another probe was dispatched into the spirit world, this time with instructions to look for identifying marks instead of fuzzing about missing security updates, foul stenches and gates left open. While waiting for the probes to return, the knight identified an old long forgotten storage system. The storage minions swore it had been properly decommissioned and disposed of years ago, but it was found to be chugging along under a desk, consuming power and collecting dust.
Another sub-quest expedition to the physical realm of Hyper-V hosts revealed that someone had been re-inserting old decommissioned servers that were kept around for spare parts into the magic cabinet of the silver slanted ‘E’. Or it could of course be that they had never been removed in the first place due to bureaucratic loops and lost scrolls of Todo. Anyways, the knight had bagged two ghosts.
We rejoin our knight the next morning. For once it was a good morning. The sun was shining, and the success of yesterday’s sub-quests was still lingering in the knight’s mind. Sadly, that would soon change. The probes were back, and they were happily reporting that the former names of the ghosts had been decoded. This identified the responsible service team, but the service team minions were all relatively new and had never heard of these old ghosts. Armed with new knowledge the knight went straight to the VMM daemons to demand an explanation. But to his great alarm, he found that the VMM daemons too had never heard of these ghosts. Feverishly the knight searched the scrolls of physical servers, in a vain hope that the servers nevertheless were physical beings, but no. No such server had ever existed. With that, only one possible solution remained: the ghosts were located in the realm of VMWare!
There was no choice other than to beseech the man with the crowbar to borrow his Hazard Suit and plan an expedition to the toxic fields of vCenter. Once there, the ghosts were immediately detected. On a closer (but hasty) inspection of the remaining area, the knight also identified two other ghosts. He quickly filled out a scroll identifying the ghosts, and went back to more pleasing surroundings. He then updated the trouble ticket and forwarded it to the unholy riders of VMWare, hoping that he wouldn’t have to go back for a long, long time.
This is an attempt at giving a technical overview of how the native network teaming in Windows 2012R2 works, and how I would recommend using it. From time to time I am presented with problems “caused” by network teaming, so figuring out how it all works has been essential. Compared to the days of old, where teaming was NIC vendor dependent, today’s Windows native teaming is a delight, but it is not necessarily trouble free.
Someone at Microsoft has written an excellent guide called Windows Server 2012 R2 NIC Teaming (LBFO) Deployment and Management, available here. It gives a detailed technical guide to all the available options. I have added my field experience to the mix to create this guide.
- NIC: Network Interface Card. Also known as Network Adapter.
- vNIC/virtual NIC: a team adapter on a host or another computer (virtual or physical) that uses teaming.
- Physical NIC/adapter: An adapter port that is a member of a team. Usually a physical NIC, but could be a virtual NIC if someone has made a complicated setup with teaming on a virtual machine.
- vSwitch: A virtual switch, usually a Hyper-V switch.
- Team member: a NIC that is a member of a team.
- LACP: Link Aggregation Control Protocol, also IEEE 802.3ad. See https://en.wikipedia.org/wiki/Link_aggregation#Link_Aggregation_Control_Protocol
Active-Active vs Active-Passive
If none of the adapters are set as standby, you are running an Active-Active config. If one is standby and you have a total of two adapters, you are running an Active-Passive config. If you have more than two team members, you may be running a mixed Active-Active-Passive config (standby adapter set), or an Active-Active config without a standby adapter.
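The rules above are easy to express as a small decision function. The Python sketch below is just my way of writing them down. The function name and its two parameters are made up for illustration, and it assumes a team with at least two members.

```python
def teaming_mode(members: int, has_standby: bool) -> str:
    """Classify a team by member count and whether a standby adapter is set.

    Assumption: teams with fewer than two members are outside the scope
    of this classification.
    """
    if members < 2:
        raise ValueError("this classification assumes at least two team members")
    if not has_standby:
        return "Active-Active"
    if members == 2:
        return "Active-Passive"
    return "Active-Active-Passive"


if __name__ == "__main__":
    for members, standby in [(2, False), (2, True), (4, True), (4, False)]:
        print(members, standby, teaming_mode(members, standby))
```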
If you are using a configuration with more than one active team member on a 10G infrastructure, my recommendation is to make sure that both members are connected to the same physical switch and in the same module. If not, be prepared to sink literally hundreds, if not thousands of hours into troubleshooting that could otherwise be avoided. There are far too many problems related to the switch teaming protocols used on 10G, especially with the Cisco Nexus platform. And it is not that they do not work, it is usually an implementation problem. A particularly nasty kind of device is something Cisco refers to as a FEX or fabric extender. Again, it is not that it cannot work. It’s just that when you connect it to the main switch with a long cable run it usually works fine for a couple of months. And then it starts dropping packets and pretends nothing happened. So if you connect one of your team members to a FEX, and another to a switch, you are setting yourself up for failure.
Due to the problems mentioned above and similar troubles, many IT operations have a ban on Active-Active teaming. It is just not worth the hassle. If you really want to try it out, I recommend one of the following configurations:
- Switch independent, Hyper-V load balancing. Naturally for vSwitch connected teams only. No, do not use Dynamic.
- LACP with Address Hash or Hyper-V load balancing. Again, do not use Dynamic mode.
I do not recommend using more than two team members in Switch Independent teaming due to artifacts in load distribution. Your servers and switches may handle everything correctly, but the rest of the network may not. For switch dependent teaming, you should be OK, provided that all team members are connected to the same switch module. I do not recommend using more than four team members though, as it seems to be the breaking point between added redundancy and too much complexity.
Make sure all team members are using the exact same network adapter with the exact same firmware and driver versions. Mixing them up will work, but even if base jumping is legal you don’t have to go jumping. NICs are cheap, so fork over the cash for a proper Intel card.
Load distribution algorithms
Be aware that the load distribution algorithm primarily affects outbound connections only. The behavior of inbound connections and routing for switch independent mode is described for each algorithm. In switch dependent mode (either LACP or static) the switch will determine where to send the inbound packets.
Address hash
Using parts of the address components, a hash is created for each load/connection. There are three different modes available, but the default one in the GUI (Port and IP) is the one mostly used. The other alternatives are IP only and MAC only. For traffic that does not support the default method, one of the others is used as fallback.
Address hash creates a very granular distribution of traffic initiated at the VM, as each packet/connection is load balanced independently. The hash is kept for the duration of the connection, as long as the active team members are the same. If a failover occurs, or if you add or remove a team member, the connections are rebalanced. The total outbound load from one source is limited by the total outbound capacity of the team and the distribution.
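To make the idea concrete, here is a minimal sketch of hashing the Port and IP components of a connection onto a team member. The actual hash function used by the teaming driver is not described here, so SHA-256 is just a stand-in, and the function name, NIC names and addresses are hypothetical.

```python
import hashlib


def pick_team_member(src_ip, src_port, dst_ip, dst_port, members):
    """Map one connection to a team member by hashing its address components.

    SHA-256 is a stand-in; the real teaming hash is an unknown here.
    """
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(members)
    return members[index]


members = ["NIC1", "NIC2"]  # hypothetical team members
conn = ("10.0.0.5", 49152, "10.0.1.20", 443)  # hypothetical connection

# The same connection always hashes to the same member while the team is unchanged.
first = pick_team_member(*conn, members)
again = pick_team_member(*conn, members)

# After a failover the member list changes, so the connection is mapped again.
after_failover = pick_team_member(*conn, ["NIC2"])

# Many connections from one source spread over the available members.
used = {
    pick_team_member("10.0.0.5", port, "10.0.1.20", 443, members)
    for port in range(49152, 49252)
}
```

The point of the sketch is that the mapping is stable per connection but depends on the set of active members, which is why a failover or a membership change triggers a rebalance.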
The IP address for the vNIC is bound to the so-called primary team member, which is selected from the available team members when the team goes online. Thus, everything that uses this team will share one inbound interface. Furthermore, the inbound route may be different from the outbound route. If the primary adapter goes offline, a new primary adapter is selected from the remaining team members.
Recommendations for Address hash:
- Use it for Active/passive teams with two members.
- Never ever use this for a Virtual Switch.
- Using more than two team members with this algorithm is highly discouraged. Do not do it.
MS recommends this algorithm for teaming inside a VM, but you should never create teams in a VM. I have yet to hear a good reason to do so in production. What you do in your lab is between you and your therapist.
Hyper-V port
Each vNIC, be it on a VM or on the host, is assigned to a team adapter and stays connected to this as long as it is online. The advantage is a predictable network path, the disadvantage is poor load balancing. As adapters are assigned in a round robin fashion, all your high bandwidth usage may overload one team adapter while the other team adapters have no traffic. There is no rebalancing of traffic. The outbound capacity for each vNIC is limited to the capacity of the physical NIC it is attached to.
This algorithm supports VMQ.
It may be the case that the red connection in the example above is saturating the physical NIC, thus causing trouble for the green connection. The load will not be rebalanced as long as both physical NICs are online, even if the blue connection is completely idle.
The upside is that the connection is attached to a physical NIC, and thus incoming traffic is routed to the same NIC as outbound traffic.
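The round robin assignment and the resulting imbalance can be sketched as follows. This is a simplified model, not the actual assignment logic of the team; the VM names and bandwidth figures are invented for the example.

```python
from itertools import cycle


def assign_ports(vnics, members):
    """Assign each vNIC to a team member in round robin order."""
    member_cycle = cycle(members)
    return {vnic: next(member_cycle) for vnic in vnics}


def load_per_member(assignment, demand_mbps, members):
    """Sum the outbound demand that lands on each physical NIC."""
    totals = {member: 0 for member in members}
    for vnic, member in assignment.items():
        totals[member] += demand_mbps[vnic]
    return totals


members = ["NIC1", "NIC2"]  # hypothetical 10G team members
# Hypothetical outbound demand per vNIC in Mbit/s.
demand_mbps = {"vm-a": 9000, "vm-b": 100, "vm-c": 4000, "vm-d": 100}

assignment = assign_ports(list(demand_mbps), members)
load = load_per_member(assignment, demand_mbps, members)
```

With these numbers both busy VMs happen to land on `NIC1`, which is asked to carry 13000 Mbit/s while `NIC2` carries 200 Mbit/s. Nothing moves until a team member goes offline.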
Inbound connections for VMs are routed to the physical NIC assigned to the vNIC. Inbound connections to a host are routed to the primary team member (see address hash). Thus inbound load is balanced for VMs, and we are able to utilize VMQ for better performance. Dynamic has the same inbound load balancing problems as Address hash for host inbound connections.
Not recommended for use on 2012R2, as Dynamic will offer better performance in all scenarios. But, if you need MAC address stability for VMs on a Switch Independent team, Hyper-V load distribution mode may offer a solution.
On 2012, recommended for teams that are connected to a vSwitch.
Dynamic
Dynamic is a mix between Hyper-V and Address hash. It is an attempt to create a best-of-both-worlds scenario by distributing outbound loads using address hash algorithms and inbound load as Hyper-V, that is, each vNIC is assigned one physical NIC for inbound traffic. Outbound loads are rebalanced in real time. The team detects breaks in the communication stream where no traffic is sent. The period between two such breaks is called a flowlet. After each flowlet the team will rebalance the load if deemed necessary, expecting that the next flowlet will be equal to the previous one.
The teaming algorithm will also trigger a rebalancing of outbound streams if the total load becomes very unbalanced, a team member fails or other hidden magic black-box settings should determine that immediate rebalancing is required.
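The flowlet idea can be illustrated with a few lines of code that split a stream of packet send times at idle gaps. The gap threshold and the timestamps are assumptions for the example; how the team actually detects breaks and decides whether to rebalance is not modeled here.

```python
def split_into_flowlets(timestamps, gap):
    """Group packet send times into flowlets separated by idle gaps of at least `gap`."""
    flowlets = []
    current = []
    for t in timestamps:
        # A long enough pause since the previous packet closes the current flowlet.
        if current and t - current[-1] >= gap:
            flowlets.append(current)
            current = []
        current.append(t)
    if current:
        flowlets.append(current)
    return flowlets


# Hypothetical send times in milliseconds for one outbound stream.
send_times_ms = [0, 1, 2, 3, 50, 51, 52, 120, 121]
flowlets = split_into_flowlets(send_times_ms, gap=20)
```

Each boundary between two flowlets is a point where the stream could be moved to another team member without reordering packets that are already in flight.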
This mode supports VMQ.
Inbound connections are mapped to one specific physical NIC for each workload, be it a VM or a workload originating on a host. Thus, the inbound path may differ from the outbound path as in address hash.
MS recommends this mode for all teams with the following exceptions:
- Teams inside a VM (which I do not recommend, no matter what).
- LACP Switch dependent teaming
- Active/Passive teams
I will add the following exception: If your network contains load balancers that do not employ proper routing, e.g. F5 BigIP with the “Auto Last Hop” option enabled to work around routing problems, routing will not work with this teaming algorithm. Use Hyper-V or Address Hash Active/passive instead.
Source MAC address in Switch independent mode
Outbound packets from a VM that exit the host through the Primary adapter will use the MAC address of the VM as source address. Outbound packets that use a different physical adapter to exit the host will get another MAC address as source address to avoid triggering a MAC flapping alert on the physical switches. This is done to ensure that one MAC address is only present at one physical NIC at any one point in time. The MAC assigned to the packet is the MAC of the physical NIC in question.
To try to clarify, for Address Hash:
- If a packet from a VM exits through the primary team member, the MAC of the vNIC on the VM is kept as source MAC address in the packet.
- If a packet from a VM exits through (one of) the secondary team members, the source MAC address is changed to the MAC address of the secondary team member.
For Hyper-V:
- Every vSwitch port is assigned to a physical NIC/team member. If you use this for host teaming (no vSwitch), you have 1 vSwitch port and all inbound traffic is assigned to one physical NIC.
- Every packet uses this team member until a failover occurs for any reason.
For Dynamic:
- Every vSwitch port is assigned to a physical NIC. If you use this for host teaming (no vSwitch), you have 1 vSwitch port and all inbound traffic is assigned to one physical NIC.
- Outbound traffic will be balanced. The MAC address will be changed for packets on secondary adapters.
For Hyper-V and Dynamic, the primary is not the team primary but the assigned team member. It will thus be different for each VM.
For Host teaming without a vSwitch the behavior is similar. The MAC of one of the team members is chosen as the primary for host traffic, and the MAC replacement rules apply as for VMs. Remember, you should not use Hyper-V load balancing mode for host teaming. Use Address hash or Dynamic.
|Algorithm||Source MAC on primary||Source MAC on secondary adapters|
|Address hash||Unchanged||MAC of the secondary in use|
|Dynamic||Unchanged||MAC of the secondary in use|
Source MAC address in switch dependent mode
No MAC replacement is performed on outbound packets. To be overly specific:
|Algorithm||Source MAC on primary||Source MAC on secondary adapters|
|Static Address hash||Unchanged||Unchanged|
|LACP Address hash||Unchanged||Unchanged|
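The source MAC rules from the two tables can be condensed into one small function. This is only a sketch of the rules as described above; the function name, the mode labels and the MAC addresses are hypothetical.

```python
def source_mac(teaming_mode, vm_mac, exit_nic_mac, exits_via_primary):
    """Return the source MAC a physical switch sees on an outbound VM packet.

    teaming_mode is "switch_independent" or "switch_dependent" (labels made up
    for this sketch). For Hyper-V and Dynamic, "primary" means the team member
    assigned to the vNIC, not the team primary.
    """
    if teaming_mode == "switch_dependent":
        # Static and LACP teams: no MAC replacement is performed.
        return vm_mac
    if teaming_mode == "switch_independent":
        # The VM MAC is only kept on the primary; other members use their own MAC.
        return vm_mac if exits_via_primary else exit_nic_mac
    raise ValueError(f"unknown teaming mode: {teaming_mode}")


vm_mac = "00-15-5D-00-00-01"   # hypothetical vNIC MAC
nic2_mac = "A0-36-9F-00-00-02"  # hypothetical MAC of a secondary team member

on_primary = source_mac("switch_independent", vm_mac, nic2_mac, exits_via_primary=True)
on_secondary = source_mac("switch_independent", vm_mac, nic2_mac, exits_via_primary=False)
lacp = source_mac("switch_dependent", vm_mac, nic2_mac, exits_via_primary=False)
```

In switch independent mode the VM MAC is only ever seen on one physical NIC, which is what keeps the switches from raising MAC flapping alerts.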