Windows 2012 Nic Teaming pooched

Problem

Something bad happened during a firmware/driver update on one of my servers, which left the network adapters unavailable for teaming in the native NIC Teaming module. The server had Broadcom NetXtreme II adapters, and I suspect that the BACS application is the culprit, since it also supports teaming. The problem presented the following symptoms:

  • Constant reinstallation of drivers in device manager for several minutes
  • The adapters were missing from Network Connections, but visible in device manager
  • No adapters were available for teaming in Windows Nic Teaming

Solution

  • Open Device Manager and enable Show hidden devices:
    image
  • Look for greyed-out devices, that is, devices that are not connected. Delete/uninstall all of them. You will probably be left with at least one “Microsoft Network Adapter Multiplexor Driver #n” that you are not able to uninstall.
  • Uninstall ALL Broadcom drivers/software and reboot.
  • Open Device Manager again, and get the GUID of the stubborn multiplexor adapters:
    image
  • Go to the HKLM\System\CurrentControlSet\Services\NdisImPlatform\Parameters\Teams registry key:
    image
  • Delete the entire team key that corresponds to the key you found above, that is the key labeled {GUID} under Teams:
    image
  • Reinstall the latest Broadcom drivers without BACS and BASP:
    image
  • Reboot, and re-create the teams using NIC Teaming
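
The registry and teaming steps above can also be sketched in PowerShell. This is a rough outline, not a tested procedure: the service name is spelled NdisImPlatform (capital I) on the assumption that this is the key the steps refer to, and the GUID, team name and adapter names are placeholders. Whether the orphaned multiplexor adapters show up in Get-NetAdapter at all is also an assumption; Device Manager remains the fallback.

```powershell
# Sketch only; run from an elevated PowerShell prompt on Windows Server 2012.
# Assumption: this is the Teams key described in the steps above.
$teamsKey = 'HKLM:\SYSTEM\CurrentControlSet\Services\NdisImPlatform\Parameters\Teams'

# List multiplexor adapters, including hidden ones, together with their GUIDs.
Get-NetAdapter -IncludeHidden |
    Where-Object { $_.InterfaceDescription -like 'Microsoft Network Adapter Multiplexor Driver*' } |
    Select-Object Name, InterfaceDescription, InterfaceGuid

# List the team keys that are still registered.
Get-ChildItem -Path $teamsKey | Select-Object PSChildName

# Hypothetical GUID; replace it with the one found above.
$staleGuid = '{00000000-0000-0000-0000-000000000000}'

# -WhatIf only prints what would be deleted; remove it to actually delete the key.
Remove-Item -Path (Join-Path $teamsKey $staleGuid) -Recurse -WhatIf

# After reinstalling the drivers and rebooting, re-create a team.
# 'Team1', 'NIC1' and 'NIC2' are placeholder names.
New-NetLbfoTeam -Name 'Team1' -TeamMembers 'NIC1', 'NIC2'
```

Export the Teams key before deleting anything, so the change can be rolled back.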

Lifecycle Controller update required on Dell server

Problem

For some reason, the OS deployment fails, and afterwards this message appears at boot: “Lifecycle Controller update required”. Installing the OS manually and then updating the Lifecycle Controller firmware doesn’t help. Any attempt to enter the Lifecycle Controller results in the system ignoring your request and booting normally.

image

Solution

  • Press F2 during boot to enter System Setup
  • Find the iDRAC Settings menu
  • Enter it, and browse down to the Lifecycle Controller option
    image
  • Select Yes for the Cancel Lifecycle Controller Actions option.
    image
  • Finish, save settings and reboot.

If this doesn’t solve the problem, there is a Lifecycle Controller Repair Package available for download over at the Dell support site. I have yet to figure out how that thing works though, as the release notes are not available for download at the moment. I would suggest opening a support ticket if you have to go down this route.

Windows Automatic Maintenance triggers AEAPPINVW8 crash

Problem

Each time Windows Automatic Maintenance (hereafter known as automaint) is triggered, the following message appears in the application event log shortly thereafter: Event ID 1001 from Windows Error Reporting.

image

This happens on several of my servers.

Analysis

I know it is triggered by automaint because it appears every night at 03:00, which is the time automaint is scheduled. I also tried triggering automaint manually, and the error message promptly appeared in the event log. The scheduled task that triggers the error is called Program Data Update, which is part of the Customer Experience Improvement Program. This task collects information about software installations, uninstalls and such. Analysis so far shows that this affects all of my Win2012 servers, as well as some Windows 8 and Windows 8.1 clients, but it has yet to cause any adverse effects other than the error message. I have tried to figure out exactly what it is failing at, so far to no avail, but I post this as a pointer to others who wonder what is causing the event log message. I will update this post if and when I find a solution.
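
To reproduce the error without waiting for 03:00, something like the following could be used. It is a sketch under assumptions: the wildcard task name is a guess at how the Program Data Update task is registered, and the 60 second wait is arbitrary.

```powershell
# Find the Customer Experience Improvement Program task. The wildcard is an
# assumption, since the exact task name may differ between Windows versions.
$task = Get-ScheduledTask -TaskName 'ProgramData*'
$task | Select-Object TaskPath, TaskName, State

# Run the task on demand instead of waiting for the nightly maintenance window.
$task | Start-ScheduledTask

# Give it a moment, then list the most recent Event ID 1001 entries.
Start-Sleep -Seconds 60
Get-WinEvent -FilterHashtable @{
    LogName      = 'Application'
    ProviderName = 'Windows Error Reporting'
    Id           = 1001
} -MaxEvents 5 | Format-List TimeCreated, Message
```

If the wildcard matches more than one task, narrow it down before piping to Start-ScheduledTask.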

The system event log is bloated with WMI Performance Adapter messages

Problem

A couple of times each minute, the WMI Performance Adapter service is started and stopped, resulting in an informational message in the system event log (event 7036 from Service Control Manager, to be exact). This not only fills the log, but also puts pressure on the system due to the constant starting and stopping of the service. Most of my Win 2012 servers exhibit the issue. For some reason my 2008R2 servers have been spared, but I have read reports from others: http://serverfault.com/questions/108829/why-is-my-system-event-log-full-of-wmi-performance-adapter-messages.

image

Analysis

The root cause of this is usually SCOM, Splunk or similar agents that collect performance data from the server. The issue is not a problem per se; it is just a result of the monitoring agents running a WMI query now and then. The problem is log readability: the messages can mask other errors and push them out of the event log “window”, that is, the amount of data the event log is allowed to contain at any point in time. I had a 20MiB max log size on one server, and it was only able to hold log data for about four days.

image
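
To get a feel for how much of the log these messages occupy, a query along these lines might help. It is only a sketch: the message filter assumes an English-language system, and on a large log the query can take a while.

```powershell
# Collect the 7036 events that mention the WMI Performance Adapter.
# The message text filter assumes English event messages.
$events = Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Service Control Manager'
    Id           = 7036
} | Where-Object { $_.Message -like '*WMI Performance Adapter*' }

# Count them per day.
$events |
    Group-Object { $_.TimeCreated.Date.ToString('yyyy-MM-dd') } |
    Sort-Object Name |
    Select-Object Name, Count

# Show the configured maximum size and current record count of the System log.
Get-WinEvent -ListLog System |
    Select-Object LogName, MaximumSizeInBytes, RecordCount
```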

Solution

The solution is quite simple: set the startup type for the WMI Performance Adapter service to Automatic:

image

Thus, you ensure that the service is kept running instead of starting and stopping a couple of times each minute. I have yet to see any adverse effects of this, but all the servers I have tested this on are physical database servers with tons of resources. The WMI Performance Adapter service (wmiapsrv.exe) is only using about 7 MB of RAM on my servers. The WMI Provider Host, which is also heavily utilized by SCOM/Splunk, is much more of a resource hog:

image
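
The startup type change described above can also be made from PowerShell. A minimal sketch, assuming the service is registered under its usual short name wmiApSrv:

```powershell
# Keep the WMI Performance Adapter running instead of starting on demand.
Set-Service -Name 'wmiApSrv' -StartupType Automatic
Start-Service -Name 'wmiApSrv'

# Confirm that the service is now running.
Get-Service -Name 'wmiApSrv' | Select-Object Name, DisplayName, Status

# Compare the memory footprint of the adapter and the WMI provider host.
Get-Process -Name 'WmiApSrv', 'WmiPrvSE' -ErrorAction SilentlyContinue |
    Select-Object Name, Id,
        @{ Name = 'WorkingSetMB'; Expression = { [math]::Round($_.WorkingSet64 / 1MB, 1) } }
```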

Generating and reading cluster logs

NOTE: This post was originally written in 2010. It was updated for Win2012 in August 2013.

If you want to read the cluster log on Windows 2008/2012 failover clusters, you have to generate it first. This log is considered sort of a debug level log, so it is not written to disk in a readable format by default. The log is however stored on disk as a circular .etl file, and it can be written out to a readable cluster.log file on demand. There are two ways you can create this file, by using cluster.exe or by PowerShell. Windows 2008/2008R2 supports both, while Windows 2003 is so old that it only supports the .log text file format and thus creates a readable log by default. Windows 2012 on the other hand considers cluster.exe to be too “old-school”, so it supports PowerShell only.

Be aware that readable might be an undeserving description of the cluster.log file. It is not for the faint of heart, and it should NOT be your first entry point when troubleshooting cluster issues. I usually access it only as a last resort when all else fails, or when I try to decipher why the cluster had issues AFTER I have solved the problem at hand.
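
As a rough sketch of the PowerShell route, the following generates the log for the last hour and pulls out error lines. The destination folder and the 60 minute window are arbitrary choices, and the ' ERR ' search pattern is an assumption about how error lines are marked in the log.

```powershell
Import-Module FailoverClusters

# Hypothetical destination folder for the generated logs.
$destination = 'C:\Temp\ClusterLogs'
New-Item -ItemType Directory -Path $destination -Force | Out-Null

# Write a cluster.log per node, covering the last 60 minutes.
Get-ClusterLog -Destination $destination -TimeSpan 60

# Show the first 20 lines that look like errors.
Select-String -Path (Join-Path $destination '*.log') -Pattern ' ERR ' |
    Select-Object -First 20

# Windows 2008/2008R2 alternative using cluster.exe:
# cluster log /g
```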


PERCSAS2 Event ID 129

Problem

Event ID 129 from percsas2 shows up in the system event log several times a day, stating “Reset to device, \Device\RaidPort4, was issued.”

image

I suddenly noticed this event in the log on four of my servers (Dell M820 blades). This is usually a bad tiding, foreboding imminent disk failure or a system wide badger infestation. As these servers are all quite new though and still running fine, I suspected the problem may be located elsewhere. The other culprit is usually drivers or firmware. Amazing as it may sound, it actually does happen that vendor support engineers are correct in demanding you update everything and the kitchen sink.
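
To see how often the resets occur and which driver version is in play, something like this could be run on each server. It is a sketch: the '*PERC*' device name filter is an assumption about how the controller is named on your hardware.

```powershell
# Count Event ID 129 from percsas2 per day (latest 500 events at most).
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'percsas2'
    Id           = 129
} -MaxEvents 500 |
    Group-Object { $_.TimeCreated.Date.ToString('yyyy-MM-dd') } |
    Sort-Object Name |
    Select-Object Name, Count

# Show the installed driver version for the RAID controller.
# The device name filter is an assumption.
Get-WmiObject -Class Win32_PnPSignedDriver |
    Where-Object { $_.DeviceName -like '*PERC*' } |
    Select-Object DeviceName, DriverVersion, DriverDate
```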


Creating firewall rules for SQL server using Powershell

On Win2012/PowerShell 3 there is a cmdlet called “New-NetFirewallRule” that allows for scripted creation of firewall rules. This makes it a lot easier to get the rules right. I have previously used GPO to push them to my SQL servers, but sadly I have discovered that it does not always work. For some reason, servers don’t like to have their firewall rules pushed by GPO. This meant I had to check them every time anyway, so I just resorted to creating them manually. But now, thanks to the wonders of PowerShell 3, maybe I won’t have to do that again :)

More information about the cmdlet can be found here: http://technet.microsoft.com/en-us/library/jj554908.aspx

Sample code

This code creates rules to allow the SQL server browser (UDP 1434), the standard engine port for two instances (TCP 1433 and 1434) and the default port for AOAG endpoints (TCP 5022).

# SQL Server Browser service
New-NetFirewallRule -DisplayName "MSSQL BROWSER UDP" -Direction Inbound -LocalPort 1434 -Protocol UDP -Action Allow
# Database engine, two instances
New-NetFirewallRule -DisplayName "MSSQL ENGINE TCP" -Direction Inbound -LocalPort 1433-1434 -Protocol TCP -Action Allow
# AlwaysOn Availability Group endpoint
New-NetFirewallRule -DisplayName "MSSQL AOAG EP TCP" -Direction Inbound -LocalPort 5022 -Protocol TCP -Action Allow
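
Since the whole point is not having to check the rules by hand, a quick verification could look like the sketch below. It assumes the rules were created with display names starting with MSSQL, as in the sample above.

```powershell
# List the MSSQL rules together with their protocol and port settings.
Get-NetFirewallRule -DisplayName 'MSSQL*' | ForEach-Object {
    $port = $_ | Get-NetFirewallPortFilter
    [pscustomobject]@{
        DisplayName = $_.DisplayName
        Enabled     = $_.Enabled
        Direction   = $_.Direction
        Action      = $_.Action
        Protocol    = $port.Protocol
        LocalPort   = $port.LocalPort -join ','
    }
} | Format-Table -AutoSize
```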

Aligning dynamic disks

If you are using dynamic disks for some reason, and please avoid using them if you don’t have to, the partitions on them are likely to be misaligned for SQL server. I discovered this while trying to configure an AO Availability Group where one of the replicas was using local SSD drives configured with software RAID 1.

Problem

Since this was a new setup for me, I ran the old “wmic partition get BlockSize, StartingOffset, Name, Index” just to make sure everything was in order. To my astonishment, it was not:

image

For some reason, the partition is using the old Win 2003 31.5 KB offset! To make it worse, I discovered this AFTER I had installed SQL server. Since dynamic disks and software RAID basically suck, information about this on the great interweb was sparse. But after some searching I found a cure, at least for volumes without RAID, at http://blogs.utexas.edu/alex/2013/04/04/windows-aligning-dynamic-disks/. (Link dead as of 2016.10)
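
The same check can be done in PowerShell with the arithmetic spelled out. This is a sketch that assumes 64 KB is the boundary you care about, matching the 64K allocation unit size used later in this post. A 31.5 KB offset is 32256 bytes, which is not a multiple of 65536, so such a partition shows up as not aligned.

```powershell
# Same data as the wmic command, plus a computed alignment check.
Get-WmiObject -Class Win32_DiskPartition |
    Select-Object Name, Index, BlockSize, StartingOffset,
        @{ Name = 'OffsetKB';   Expression = { $_.StartingOffset / 1KB } },
        @{ Name = 'Aligned64K'; Expression = { ($_.StartingOffset % 64KB) -eq 0 } } |
    Format-Table -AutoSize
```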

Solution

Based on the above mentioned blog post, with my comments and changes for RAID. Be aware, this process may be destructive. This guide assumes that you, as I did, already have an active mirror with the wrong alignment. If you have fresh drives, just ignore the parts about breaking the mirror and moving data.

  • Make sure you have a valid backup
  • Be prepared to do a clean install if necessary
  • Break the mirror
  • Give both drives new drive-letters and restart the server to make sure no active application/service is using the drive
  • Run diskpart, and select one of the drives that was part of the mirror. If you don’t know how to do this, STOP and ask someone to help you or read up on diskpart BEFORE you continue.
  • Execute the following diskpart commands against the selected drive. This guide expects that you want one volume to fill the entire drive. If you don’t want this, think long and hard about why and consider changing your mind.
      • clean
      • online disk
      • attributes disk clear readonly
      • convert gpt
      • select part 1
      • delete part override
      • create partition msr size=128
      • convert dynamic
      • create volume simple align=1024
  • Screenshots image image
  • Then, format the new partition with 64K allocation unit size
  • Move the data from the other partition that was part of the mirror, and hope this will work
  • Run the diskpart commands against the other disk. Be aware, this will delete the data, so make sure the move command was successful. And this time, skip the last create volume command.
      • Select the second disk
      • clean
      • online disk
      • attributes disk clear readonly
      • convert gpt
      • select part 1
      • delete part override
      • create partition msr size=128
      • convert dynamic
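
The diskpart commands for the first disk could be collected in a script file, roughly as sketched below. This is destructive and untested as a whole: disk number 1 is a hypothetical placeholder, the size=128 spelling of the msr command is my reading of the diskpart syntax, and in script mode diskpart stops at the first failing command, so commands that may fail harmlessly (for example online disk on a disk that is already online) might need noerr appended.

```powershell
# DESTRUCTIVE sketch: 'clean' wipes the selected disk.
# Disk number 1 is a hypothetical placeholder; confirm the real number
# with "list disk" in diskpart before using this.
$diskNumber = 1

# Same command sequence as in the list above, for the first disk.
$script = @"
select disk $diskNumber
clean
online disk
attributes disk clear readonly
convert gpt
select part 1
delete part override
create partition msr size=128
convert dynamic
create volume simple align=1024
"@

# Save the script so it can be reviewed before anything is executed.
$scriptPath = Join-Path $env:TEMP 'align-dynamic-disk.txt'
Set-Content -Path $scriptPath -Value $script -Encoding Ascii
Get-Content -Path $scriptPath

# After reviewing the file, run it from an elevated prompt:
# diskpart /s $scriptPath
```

For the second disk, drop the last create volume line, as described in the steps above.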

You should now have two dynamic disks, one with the data and one unallocated. Now, to add the mirror. I discovered that I was unable to add the second drive back as a mirror, the add mirror option was grayed out. I solved this by first shrinking the original by 50MB, and then creating the mirror. I didn’t test this extensively, but I would guess that 5 or 10 megabytes of free space would have been enough.

Weak event created

Problem

In Windows Failover Cluster Manager, clicking on a node in the tree will raise the following error:

image

This is one of the biggest error messages I have ever seen in a Microsoft product, that is, regarding the size of the message box :). The error states: “A weak event was created and it lives on the wrong object, there is a very high chance this will fail, please review your code to prevent the issue” and goes on with a .NET call stack.

Analysis

This feels like a .NET Framework issue to me, and a quick search rustled up the following post on TechNet: http://blogs.technet.com/b/askcore/archive/2013/01/14/error-in-failover-cluster-manager-after-install-of-kb2750149.aspx, stating that this is a bug caused by KB 2750149.

Solution

Install KB2803748, available here: http://support.microsoft.com/kb/2803748

To check whether either of the patches is installed, run the following PowerShell commands:

Get-HotFix -id KB2750149
Get-HotFix -id KB2803748
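
Get-HotFix raises an error when a patch is not installed, which can be confusing. A small sketch that reports both patches in one go, without the error noise:

```powershell
$patches = 'KB2750149', 'KB2803748'

# Collect the IDs of the patches that are actually installed.
$installed = Get-HotFix -Id $patches -ErrorAction SilentlyContinue |
    Select-Object -ExpandProperty HotFixID

# Report the state of each patch.
foreach ($kb in $patches) {
    $state = if ($installed -contains $kb) { 'installed' } else { 'not found' }
    '{0}: {1}' -f $kb, $state
}
```

If KB2750149 is installed and KB2803748 is not found, you are most likely looking at the bug described above.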

Annoying default settings

I have never quite liked the way Microsoft wants me to use Windows Explorer. The standard settings are quite annoying to me, but I understand why they are as they are on end user versions of Windows. Joe User is stupid, usually more so than you might imagine possible, so it is important to protect him against himself. On a server on the other hand, I would think we should anticipate some minimal knowledge about the file system. A server user should be able to look at a system file without thinking: “Hmm, bootmgr is a file I haven’t seen before. I should probably delete it. And that big windows folder just contains a lot of strange files I never use. I’m deleting some of those too, it will leave more room for pictures of my cat!”. But no, it has the same stupid defaults as the home editions. Because of this, I have had to create a list of all the stuff I have to remember to change whenever I log on to a new server, lest I go insane and maul the next poor user who wants me to recover the database he “forgot” to back up before the disk crashed. :P

 
