Post originally from 2010, updated 2014.04.04 and superseded by EventID 1004 from IPMIDRV v2 in 2016


The system event log is overflowing with EventID 1004 from IPMIDRV: “The IPMI device driver attempted to communicate with the IPMI BMC device during normal operation. However the operation failed due to a timeout.”

The frequency may vary from a couple of messages per day upwards to several messages per minute.


The BMC (Baseboard Management Controller) is a component found on most server motherboards. It is a microcontroller responsible for communication between the motherboard and management software. See wikipedia for more information. The BMC is also used for communication between the motherboard and dedicated out of band management boards such as Dell iDRAC. I have seen these error messages on systems from several suppliers, most notably on IBM and Dell blade servers, but most server motherboards have a BMC.

As the error message states, you can resolve this error by increasing the timeout, and this is usually sufficient. I have found that the Windows default settings for the timeouts may cause conflicts, especially on blade servers. Thus an increase in the timeout values may be in order as described on technet.

Lately though, I have found this error to be a symptom of more serious problems. To understand this, we have to look at what is actually happening. If you have some kind of monitoring agent running on the server, such as SCOM or similar, the error could be triggered by said agent trying to read the current voltage levels on the motherboard. If such operations fail routinely during the day, it is a sign of a conflict. This could be competing monitoring agents querying data too frequently, an issue with the BMC itself, or an issue with the out of band management controller.

In my experience, this issue is more frequent on blade servers than rack-based servers. This makes sense, as most blade servers have a local out of band controller that is continuously talking to a chassis management controller to provide a central overview of the chassis.

If the out of band controllers have a problem, this can and will affect the BMC, which in turn may affect the motherboard. Monitoring of server status is the most frequently used feature, but the BMC is also used for remote power control and is able to affect the state of internal components on the motherboard. We recently had an issue on a Dell M820 blade server where a memory leak in iDrac resulted in the mezzanine slots on the server being intermittently switched off. In this case it was the FibreChannel HBA. Further research revealed this to be a recurring issue. This forum thread from 2011 describes a similar issue:

As the iDrac versions in question are different (1.5.3 in 2010 and 1.51.51 in 2014), I theorize that the issue is related to iDrac memory leaks in general and not a specific bug. Thus, any iDrac firmware bug resulting in a memory leak may cause these issues.


Low error frequency

Increase the timeout values as described on technet. I have used the following values with success on IBM servers:


Under HKLM\SYSTEM\CurrentControlSet\Control\IPMI are five values controlling the IPMI driver: BusyWaitTimeoutPeriod, BusyWaitPeriod, IpmbWaitTimeoutPeriod, CommandWaitTimeoutPeriod, and SlaveAddressDetectionMethod. On IBM blades, I have used BusyWaitPeriod 60 (decimal) and 9 000 000 (decimal) for the rest. Changing these settings requires a restart of the server.
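A minimal PowerShell sketch of the registry change described above. The value names and numbers come from the text. Storing them as DWORDs and leaving SlaveAddressDetectionMethod untouched (it is not a timeout) are assumptions of this sketch, so verify against the technet article first. The function names are hypothetical.

```powershell
# Sketch only: the DWORD type and skipping SlaveAddressDetectionMethod are assumptions.
function Get-IpmiTimeoutSetting {
    # Name/value pairs (decimal) as used on the IBM blades mentioned above.
    @{
        BusyWaitPeriod           = 60
        BusyWaitTimeoutPeriod    = 9000000
        IpmbWaitTimeoutPeriod    = 9000000
        CommandWaitTimeoutPeriod = 9000000
    }
}

function Set-IpmiTimeout {
    [CmdletBinding(SupportsShouldProcess = $true)]
    param([string]$Path = 'HKLM:\SYSTEM\CurrentControlSet\Control\IPMI')

    $settings = Get-IpmiTimeoutSetting
    foreach ($name in $settings.Keys) {
        # ShouldProcess makes -WhatIf show the change without writing it.
        if ($PSCmdlet.ShouldProcess("$Path\$name", "Set to $($settings[$name])")) {
            Set-ItemProperty -Path $Path -Name $name -Value $settings[$name] -Type DWord
        }
    }
}
```

Run Set-IpmiTimeout -WhatIf from an elevated prompt to preview the changes, then run it without -WhatIf and restart the server.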

High error frequency

Further analysis will be necessary for each case. Try to identify what program is triggering the timeouts. A blanket upgrade of BIOS, out of band management and system drivers may be successful, but it could also introduce new problems and thus further complicate the troubleshooting. Looking for other, seemingly unrelated problems in the event log could also be a good idea. Also check for other servers with similar problems. I have found that removing the server from the chassis and reseating it may remove the fault for a couple of days before it returns. This is a symptom of a memory leak. And talking about memory leaks, check for kernel mode driver memory leaks in the operating system.

If it is a Dell server, try running the following command:

racadm getsysinfo


If the result is ERROR: Unable to perform the requested operation, something is seriously wrong with the out of band controller. Get in touch with Dell support for further help. You will need a new version of the iDrac firmware without the memory leak, or an older version and a downgrade.

If the command is successful, it should return a lot of information about the server:


A successful result points to an issue with monitoring software or drivers. Or maybe you just need to increase the timeouts in the registry.
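The racadm check above can be scripted roughly as follows. This is a sketch that only looks for the "ERROR:" prefix quoted above. It assumes racadm is on the path, and Test-RacadmOutput is a hypothetical helper name.

```powershell
function Test-RacadmOutput {
    param([string[]]$Output)
    # Keep only non-empty lines.
    $lines = @($Output | Where-Object { $_ -and $_.Trim() })
    # No output at all is treated as a failure.
    if ($lines.Count -eq 0) { return $false }
    # Healthy only if no line starts with "ERROR:".
    return -not [bool]($lines | Where-Object { $_ -match '^\s*ERROR:' })
}

# Usage on the server itself (racadm must be installed):
#   if (-not (Test-RacadmOutput (racadm getsysinfo 2>&1))) {
#       Write-Warning 'Out of band controller is not responding properly'
#   }
```

A result of $false points toward the out of band controller. $true points toward monitoring software, drivers or the registry timeouts.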

Overzealous monitoring alerts you to an error logged during a cluster failover, more specifically Event ID 324 from SQLAgent$InstanceName:



As mentioned, this happens during failover, one that otherwise may pass without incident. Further analysis of the Application log shows that recovery isn’t done at the time. The next messages in the log are related to the server starting up and running recovery on the new node. For some reason this takes longer than expected. Maybe there were a lot of transactions in flight at the time of failover, maybe the server or storage is too slow, or maybe you were in the process of installing an update to SQL Server, which may lead to extensive recovery times. Or it may be something completely different. Whatever it was, it caused the cluster service to try to start the SQLAgent before the node was ready. Reason 5 is probably access denied. Thus, the issue could be related to lack of permissions. I have yet to catch one of these early enough to have a cluster debug log containing the time of the error. Analysis of the cluster in question revealed another access related error at about the same time, ESENT Event ID 490:


This error is related to lack of permissions for the SQL Server engine and Agent runas accounts. Whether or not these accounts should have Local Admin permissions on the node is a never ending discussion. I have found though, that granting the permissions causes far less trouble in a clustered environment than not doing so. There is always another issue, always another patch or feature requiring yet another explicit permission. From a security standpoint, it is easy to argue that the data served by the SQL Server is far more valuable than the local server it runs on. If an attacker is able to gain access to the runas accounts, he already has access to read and change/delete the data. What happens to the local server after that is to me rather irrelevant. But security regulations aren’t guaranteed to be either logical or sane.


To solve the permission issue, you can either:

  • Add the necessary local permissions for the runas accounts as discussed in KB2811566 and wait for the next “feature” requiring you to add even more permissions to something else. Also, make sure the Agent account has the proper permissions to your backup folders and make sure you are able to create new databases. Not being able to do so may be caused by the engine account not having the proper permissions to your data/log folders.
  • Add the SQL Server Engine and Agent runas accounts to the local administrators group on the server.

Do NOT grant the runas accounts Domain Admin permissions. Ever.

Regarding the open cluster error:

On the servers I have analyzed with this issue, the log always shows the agent starting successfully within two minutes of the error, and it only happens during failover. I have yet to find it on servers where the permissions issue is not solved (using either method), but I am not 100% sure that they are related. I can however say that the message can safely be ignored as long as the Agent account is able to start successfully after the message.

When you try to add a share to a newly formed (and perhaps also an existing) Windows 2012 File Server Cluster, you get an error message stating that you are unable to do so due to lack of WinRM communication between the cluster nodes. Additionally, you may spot event id 49 from WinRM MI Operation in the Windows Remote Management operational event log with the following message:

“The WinRM protocol operation failed due to the following error: The WinRM client sent a request to an HTTP server and got a response saying the requested HTTP URL was not available. This is usually returned by a HTTP server that does not support the WS-Management protocol..”


Or the following text for Event 49:

“The WinRM protocol operation failed due to the following error: The connection to the specified remote host was refused. Verify that the WS-Management service is running on the remote host and configured to listen for requests on the correct port and HTTP URL..”

And event id 142 from Windows Remote management stating

“WSMan operation Enumeration failed, error code 2150859027”


Other possible events:

EventID 0 from FileServices-Manager.Eventprovider


“ Exception: Caught exception Microsoft.Management.Infrastructure.CimException: The WinRM client received an HTTP status code of 502 from the remote WS-Management service.
   at Microsoft.Management.Infrastructure.Internal.Operations.CimSyncEnumeratorBase`1.MoveNext()
   at Microsoft.FileServer.Management.Plugin.Services.FSCimSession.PerformQuery(String cimNamespace, String queryString)
   at Microsoft.FileServer.Management.Plugin.Services.ClusterEnumerator.RetrieveClusterConnections(ComputerName serverName, ClusterMemberTypes memberTypeToQuery)”

Error code 504 has also been detected.


The problem is clearly related to Windows Remote Management. What was even more peculiar in this case was the fact that when I failed over to another node, the error message disappeared. Thus I knew that the error was isolated to the one node. But even though I spent hours comparing settings on the nodes, all I was able to establish was the fact that they were exactly alike. Then I remembered something from my Exchange admin days: in earlier versions of Windows, WinRM could be removed and reinstalled from the system. I remember this because Exchange 2010 relied heavily on WinRM and remote PowerShell, both of which could be a major pain to get working properly. In Win2012, remote management is heavily integrated in Server Manager, and I was unable to find a way to remove it. I did however find a way to turn it off and on again.

Update 2016.11.24:

I found another version of this problem where solution one did not work. It was still a WinRM-problem, but this time it was proxy-related. You may need an explicit  proxy exception for the local domain.

Solution one

Disable and enable WinRM. There are of course multiple ways to achieve this. I used PowerShell, but there is an option in the GUI, and the command works in CMD.EXE as well. Beware, you have to use an elevated PowerShell prompt. When I come to think of it, most things that are worth doing seem to require an elevated shell.

Configure-SMRemoting -disable
Configure-SMRemoting -enable

That is it. No need to reboot or anything, just run the two commands and wait for them to finish. If you get a message that remoting is enforced by Group Policy, look for this GPO:


It has to be set as Not configured to allow you to disable and enable WinRM. If it is enforced by a domain policy, you have to block said policy temporarily while you fix this.

Enabling and disabling should also make sure that the necessary firewall settings are enabled. If you have a proxy server defined, make sure you have exceptions added for your local servers as this could also block WinRM, albeit with other error messages.

Solution two

Make sure you have an exception in your proxy definition for the local domain. For system proxy setups:

netsh winhttp set proxy [proxyserveraddress]:[proxy port] bypass-list="*.ADDomain.local;<local>"

For other proxy configs, ask your proxy admin.

Testing winrm with powershell

You can use the Invoke-Command PowerShell cmdlet to test PowerShell remote connections:

Invoke-Command -ComputerName Lab-DC -ScriptBlock { Get-ChildItem c:\ } -credential lab\sauser

This command will output a directory listing of c:\ on the computer Lab-DC. The command will be executed with the lab\sauser account. PowerShell prompts for the account password on execution.
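For a lighter check that only verifies that the WinRM listener answers, the built-in Test-WSMan cmdlet can be wrapped as sketched below. The function name and the node names in the usage comment are hypothetical. The sketch does not test credentials or what the remote session is allowed to do.

```powershell
function Test-WinRMEndpoint {
    param([Parameter(Mandatory = $true)][string[]]$ComputerName)

    foreach ($computer in $ComputerName) {
        try {
            # Test-WSMan sends an identify request and throws if the listener cannot be reached.
            Test-WSMan -ComputerName $computer -ErrorAction Stop | Out-Null
            $ok = $true
            $message = ''
        } catch {
            $ok = $false
            $message = $_.Exception.Message
        }
        # One result object per computer, including the error text on failure.
        New-Object PSObject -Property @{
            ComputerName = $computer
            WinRM        = $ok
            Error        = $message
        }
    }
}

# Usage with hypothetical cluster node names:
#   Test-WinRMEndpoint -ComputerName Node1, Node2 | Format-Table ComputerName, WinRM, Error
```

Running this from each cluster node against the other nodes should show which direction fails, and the error text helps tell a proxy problem from a refused connection.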

When you audit your security log, something you are doing every day of course, you discover that SQL server is causing an audit failure fairly often:


Audit failure event 4625, The user account has expired:


Closer inspection reveals that the event is triggered by the SQL Server engine service account. You could easily be led to believe that the engine account itself has expired, as it is the only account mentioned by name in the error message. That is not the case here, as the “Account for which logon failed” is a null SID, also known as S-1-0-0 or nobody. This is Windows’ way of telling you that something failed, but I’m not going to say exactly what it was.


The error appeared about every hour on the hour, so agent jobs were the top suspect. But the job logs disagreed: every job was working like a charm. Analysis of the central log repository revealed that the error appeared suddenly without any changes being made on the server at the time.

I discovered this on a cluster, so I tried failing over to another node. That didn’t change anything, the error followed sqlserver.exe to the other node. At least I had proven beyond reasonable doubt that the error was directly related to something SQL does. I was unable to find any fault on the server except for the audit failure. I checked all SQL Server service accounts, none of which had an expiry date set. I drifted back to the agent jobs again, as agent job ownership has given me a hard time earlier. For some reason, if the person creating the jobs/maintenance plans is a sysadmin by group membership and not by direct membership, all agent jobs fail to execute unless you use a proxy account. I have blogged about this before though, so I always check for this when a new instance is installed and correct if necessary.

Then it struck me: there is a best practice somewhere stating that SQL Server installs are supposed to be executed as a special account, and not an account associated with the person that performs the installation. This is in case that person quits and we should delete his/her account. The agent jobs for backup, dbcc checkdb and such are usually created using another account though, but maybe there was an anomaly. I ran a quick check, and yes, the agent jobs were owned by the setup user. And this user had since been marked as expired for security reasons, as it is only used during setup and as a way into the server in case someone removes all other sysadmins and locks themselves out. I know there are other ways into a SQL Server you don’t have access to, but this is a lot easier as long as you have access to the domain.

To list job owners:

SELECT name, owner_sid FROM msdb.dbo.sysjobs


0x1 is sa, all the other jobs were owned by the setup user. These SIDs are in hex format. To convert to a username, run this:
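One way to resolve the SIDs, sketched here with the built-in T-SQL function SUSER_SNAME and wrapped in PowerShell. This is not necessarily the query originally shown at this point. The function name Get-AgentJobOwner and the instance name in the usage comment are hypothetical, and Invoke-Sqlcmd requires the SQL Server PowerShell module.

```powershell
# Sketch: list Agent jobs with the owner SID resolved to a login name.
$jobOwnerQuery = @'
SELECT j.name AS job_name,
       j.owner_sid,
       SUSER_SNAME(j.owner_sid) AS owner_login
FROM msdb.dbo.sysjobs AS j
ORDER BY owner_login, job_name;
'@

function Get-AgentJobOwner {
    param([Parameter(Mandatory = $true)][string]$ServerInstance)
    # Runs the query above against the given instance using Windows authentication.
    Invoke-Sqlcmd -ServerInstance $ServerInstance -Database msdb -Query $jobOwnerQuery
}

# Usage with a hypothetical instance name:
#   Get-AgentJobOwner -ServerInstance 'SQLCLUSTER01\INST1' | Format-Table -AutoSize
```

SUSER_SNAME returns NULL for a SID it cannot resolve, so jobs with an empty owner_login deserve a closer look as well.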



Change the job and/or maintenance plan owner. See

While troubleshooting a network teaming issue on a cluster, someone sent me a link to this article about multiple default gateways on Win 2012 native teaming. The post discusses a pretty specific scenario that we didn’t have on our clusters (most of them are on 2008R2), but I discovered several nodes with more than one default route in route print:


The issue I was looking into was another, but I remembered a problem from a weekend some months ago that might be related: When a failover was triggered on a SQL cluster, the cluster lost communication with the outside world. To be specific: no traffic passed through the default gateway. As all cluster nodes were on the same subnet the cluster itself was content with the situation, but none of the webservers were able to communicate with the clustered SQL server as they were in a different subnet. This made the webservers sad and the webmaster angry, so we had to fix it. As this happened in production over the weekend, the focus was on a quick fix and we were unable to find a root cause at the time. A reboot of the cluster nodes did the trick, and we just wrote it off as fallout from the storage issue that triggered the failover. The discovery of multiple default gateways on the other hand prompted a more thorough investigation.


The article mentioned above talks exclusively about Windows 2012’s native teaming software, but this cluster is running Windows 2008 R2 and is relying on teaming software provided by the NIC manufacturer (Qlogic). We have had quite a lot of problems with the Qlogic network adapters on this cluster, so I immediately suspected them to be the rotten apple. I am not sure if this problem is caused by a bug in Windows itself that is present in both 2012 and 2008R2, or if both MS and Qlogic are unable to produce a functioning NIC teaming driver, but the following is clear:

If your adapters have a default gateway when you add them to a team, there is a chance that this default gateway will not get removed from the system. This happens regardless of whether the operating system is Windows 2012 or Windows 2008 R2. I am not sure if gateway addresses configured by DHCP also trigger this behavior. It doesn’t happen every time, and I have yet to figure out if there are any specific triggers as I haven’t been able to reproduce the problem at will.

Solution A

To resolve this issue, follow the recommendations in

First you have to issue a command to delete all static default gateway routes. NB! This will disconnect you from the server if you are connected remotely from outside the subnet.


Configure the default gateway for the team using IP properties on the virtual team adapter:


Do a route print to make sure you have only one default gateway under persistent routes.
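The route print check can be scripted along these lines. The sketch parses the text output of route print, so it assumes an English-language Windows where the section headers read "Active Routes:" and "Persistent Routes:". Get-DefaultRoute is a hypothetical helper name.

```powershell
function Get-DefaultRoute {
    param([string[]]$RoutePrintLines)

    $section = 'None'
    foreach ($line in $RoutePrintLines) {
        # Track which section of the route print output we are in.
        if ($line -match '^\s*Active Routes:')     { $section = 'Active';     continue }
        if ($line -match '^\s*Persistent Routes:') { $section = 'Persistent'; continue }

        # A default route has destination 0.0.0.0 and netmask 0.0.0.0; the third column is the gateway.
        if ($line -match '^\s*0\.0\.0\.0\s+0\.0\.0\.0\s+(\S+)') {
            New-Object PSObject -Property @{ Section = $section; Gateway = $Matches[1] }
        }
    }
}

# Usage on the server itself:
#   Get-DefaultRoute (route print -4) | Group-Object Section | Select-Object Name, Count
```

More than one entry in either section means the node still has an extra default gateway.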

Solution B

If solution A doesn’t work, issue a netsh interface ip reset command to reset the IP configuration and reboot the server. Be prepared to re-enter the IP information for all adapters if necessary.

What not to do

Do not configure the default gateway using route add, as this will result in a static route. If the computer is a node in a cluster, the gateway will be disabled at failover and isolate the server on the local subnet. See for information about how to configure static routes on clusters if you absolutely have to use a static route.

From time to time, the good people at Microsoft publish a list of problems with failover clustering that have been resolved. This list, like all such bad/good news, comes in the form of a KB, namely KB2784261. I check this list from time to time. Some of the hotfixes relate to a specific issue, while others are more of the go-install-them-at-once type. As a general rule, I recommend installing ALL hotfixes regardless of the attached warning telling you to only install them if you experience a particular problem. In my experience, hotfixes are at least as stable as regular patches, if not better. That being said, sooner or later you will run across patches or hotfixes that will make a mess and give you a bad or very bad day. But then again, that is why cluster admins always fight for funding of a proper QA/T environment. Preferably one that is equal to the production system in every way possible.

Anyways, this results in having to check all my servers to see if they have the hotfixes installed. Luckily some are included in Microsoft Update, but some you have to install manually. To simplify this process, I made the following PowerShell script. It takes a list of hotfixes, and returns a list of the ones that are missing from the system. This script could easily be adapted to run against several servers at once, but I have to battle way too many internal firewalls to attempt such dark magic. Be aware that some hotfixes have multiple KB numbers and may not show up even if they are installed. This usually happens when updates are bundled together as a cumulative package or superseded by a new version. The best way to test if patch/hotfix X needs to be installed is to try to install it. The installer will tell you whether or not the patch is applicable.

Edit: Since the original, I have added KB lists for 2008 R2 and 2012 R2 based clusters. All you have to do is replace the ” $recommendedPatches = ” list with the one you need. Links to the correct KB list are included for each OS. I have also discovered that some of the hotfixes are available through Microsoft Update-Catalog, thus bypassing the captcha email hurdle.

2012 version

$menucolor = [System.ConsoleColor]::gray
write-host "╔═══════════════════════════════════════════════════════════════════════════════════════════╗"-ForegroundColor $menucolor
write-host "║                              Identify missing patches                                     ║"-ForegroundColor $menucolor
write-host "║                              Jan Kåre Lokna -                                    ║"-ForegroundColor $menucolor
write-host "║                                       v 1.2                                               ║"-ForegroundColor $menucolor
write-host "║                                  Requires elevation: No                                   ║"-ForegroundColor $menucolor
write-host "╚═══════════════════════════════════════════════════════════════════════════════════════════╝"-ForegroundColor $menucolor
#List your patches here. Updated list of patches at
$recommendedPatches = "KB2916993", "KB2929869","KB2913695", "KB2878635", "KB2894464", "KB2838043", "KB2803748", "KB2770917"
$missingPatches = @()
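foreach ($patch in $recommendedPatches) {
    # SilentlyContinue hides the error Get-HotFix raises when a KB is not installed
    if (!(Get-HotFix -Id $patch -ErrorAction SilentlyContinue)) {
        $missingPatches += $patch
    }
}
$intMissing = $missingPatches.Count
$intRecommended = $recommendedPatches.Count
Write-Host "$env:COMPUTERNAME is missing $intMissing of $intRecommended patches:"
#Output the missing patches below the summary line
$missingPatches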

2008R2 Version

A list of recommended patches for Win 2008 R2 can be found here:  KB2545685

$recommendedPatches = "KB2531907", "KB2550886","KB2552040", "KB2494162", "KB2524478", "KB2520235"

2012 R2 Version

A list of recommended patches for Win 2012 R2 can be found here:  KB2920151

#All clusters
$recommendedPatches = "KB3130944", "KB3137691", "KB3139896", "KB3130939", "KB3123538", "KB3091057", "KB3013769", "KB3000850", "KB2919355"
#Hyper-V Clusters
$recommendedPatches = "KB3130944", "KB3137691", "KB3139896", "KB3130939", "KB3123538", "KB3091057", "KB3013769", "KB3000850", "KB2919355", "KB3090343", "KB3060678", "KB3063283", "KB3072380"


If you are using the Hyper-V role, you can find additional fixes for 2012 R2 in KB2920151 below the general cluster hotfixes. If you use NVGRE, look at this list as well: KB2974503

Sample output (computer name redacted)



I have finally updated my script to remove those pesky red error messages seen in the sample above.
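As a sketch of the multi-server adaptation mentioned earlier: Get-HotFix has a -ComputerName parameter, so the installed list can be fetched once per server and compared locally. The function names and the server names in the usage comment are hypothetical, and the remote call assumes the firewalls between you and the servers allow WMI/RPC traffic.

```powershell
function Get-MissingPatch {
    param([string[]]$RecommendedPatches, [string[]]$InstalledPatches)
    # Every recommended KB that is not in the installed list (comparison is case-insensitive).
    @($RecommendedPatches | Where-Object { $InstalledPatches -notcontains $_ })
}

function Get-MissingPatchReport {
    param([string[]]$ComputerName, [string[]]$RecommendedPatches)

    foreach ($computer in $ComputerName) {
        # One remote query per server; returns the IDs of all installed hotfixes.
        $installed = @(Get-HotFix -ComputerName $computer -ErrorAction Stop |
            Select-Object -ExpandProperty HotFixID)
        New-Object PSObject -Property @{
            ComputerName = $computer
            Missing      = Get-MissingPatch -RecommendedPatches $RecommendedPatches -InstalledPatches $installed
        }
    }
}

# Usage with hypothetical node names:
#   Get-MissingPatchReport -ComputerName Node1, Node2 -RecommendedPatches $recommendedPatches
```

Fetching the full list once and comparing locally also avoids one error per missing KB. The caveat about bundled or superseded hotfixes still applies.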

I have had several issues in the past year involving kernel memory leaks, so I decided to make a separate blog post about general kernel memory leak analysis. In this post I mostly use the amazing Sysinternals tools for troubleshooting. You also need Poolmon.exe, a small utility currently part of the Windows Driver Kit. Sadly, this 35k self contained .exe is not available as a separate download, you have to download and install the entire 500+MiB WDK somewhere to extract it. You only have to do this once though, as there is no need to install the WDK on every system you analyze. You can just copy the executable from the machine where you installed the WDK.


Something is causing the kernel paged or non paged pools to rise uncontrollably. Screenshot from Process Explorer’s System Information dialog:


In this sample, the non paged pool has grown to an unhealthy 2.2GB, and continues to grow. Even though the pool limit is 128GiB and the server has a whopping 256GiB of RAM, the kernel memory pools are usually way below the 1GiB mark. You should of course baseline this to make sure you actually have an issue, but generally speaking, every time I find a Kernel memory value above 1GiB I go hunting for the cause.

Note: To show the pool limits, you have to enable symbols in Process Explorer. Scott Hanselman has blogged about that here:


Kernel leaks are usually caused by a driver. Kernel leaks in the OS itself are very rare, unless you are running some sort of beta version of Windows. To investigate further, you have to fire up poolmon.exe.


Poolmon has a lot of shortcuts. From KB177415:

P – Sorts tag list by Paged, Non-Paged, or mixed. Note that P cycles through each one.
B – Sorts tags by max byte usage.
M – Sorts tags by max byte allocation.
T – Sort tags alphabetically by tag name.
E – Display Paged, Non-paged total across bottom. Cycles through.
A – Sorts tags by allocation size.
F – Sorts tags by “frees”.
S – Sorts tags by the differences of allocs and frees.
Q – Quit.

The important ones are “P”, to view either paged or non-paged pool tags, and “B”, to list the ones using the most of it at the top. The same view as above, after pressing “P” and “B”:


The “Cont” tag relates to “Contiguous physical memory allocations for device drivers”, and is usually the largest tag on a normal system.

And this screenshot is from the server with the non-paged leak:


As you can see, the LkaL tag is using more than 1GiB on its own, accounting for half of the pool. We have identified the pool tag, now we have to look for the driver that owns it. To do that, I use one of two methods:

1: Do an internet search for the pool tag. Large lists of known tags are available online.

2: Use Sysinternals strings together with findstr.

Most kernel mode drivers are located in “%systemroot%\System32\drivers”. First you have to start an elevated command prompt.  Make sure the Sysinternals suite is installed somewhere on the path, and enter the following commands:

  • cd "%systemroot%\System32\drivers"
  • strings * | findstr [tag]
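If the Sysinternals suite is not at hand, roughly the same search can be sketched in PowerShell with Select-String. This is an alternative to the strings/findstr method above, not a replacement for it: it reads the driver files as text, so treat a miss as inconclusive. Find-PoolTagOwner is a hypothetical helper name.

```powershell
function Find-PoolTagOwner {
    param(
        [Parameter(Mandatory = $true)][string]$Tag,
        [string]$DriverPath = (Join-Path $env:SystemRoot 'System32\drivers')
    )
    # -SimpleMatch treats the tag literally, -CaseSensitive matters because pool tags are case sensitive,
    # and -List stops at the first hit in each file.
    Get-ChildItem -Path $DriverPath -Filter '*.sys' |
        Select-String -Pattern $Tag -SimpleMatch -CaseSensitive -List |
        Select-Object -ExpandProperty Path
}

# Usage from an elevated prompt, with the tag from the example above:
#   Find-PoolTagOwner -Tag 'LkaL'
```

The result is a list of driver files containing the tag, ready to be passed to sigcheck.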



Hopefully, you should now have the name of the offending driver. To gather more intel about it, use Sysinternals sigcheck:


In this case, the offending driver is part of Diskeeper.


You have to contact the manufacturer of the offending driver and check for an updated version. In the example, I got a version of Diskeeper without the offending driver, but a good place to start is the manufacturer’s website. And remember, if you already have the newest version, you can always try downgrading to a previous version. If you present your findings as detailed above, most vendors are very interested in identifying the cause of the leak and will work with you to resolve the issue.

Something bad happened during a firmware/driver update on one of my servers, which resulted in the network adapters being unavailable for teaming in the native nic teaming module. The server had Broadcom Netextreme II adapters, and we suspect that the BACS application is the culprit, since it also supports teaming. The problem presented the following symptoms:

  • Constant reinstallation of drivers in device manager for several minutes
  • The adapters were missing from Network Connections, but visible in device manager
  • No adapters were available for teaming in Windows Nic Teaming


  • First you enter device manager and enable Show hidden devices:
  • Look for greyed out devices, that is devices that are not connected. Delete/uninstall all of them. You will probably be left with at least one “Microsoft Network Adapter Multiplexor Driver #n” that you are not able to uninstall.
  • Uninstall ALL Broadcom drivers/software and reboot.
  • Open device manager again, and get the guid of the stubborn multiplexor adapters:
  • Go to the HKLM\System\CurrentControlSet\Services\NdisImPlatform\Parameters\Teams registry key:
  • Delete the entire team key that corresponds to the key you found above, that is the key labeled {GUID} under Teams:
  • Reinstall the latest Broadcom drivers without BACS and BASP:
  • Reboot, and re-create the teams using Nic Teaming

For some reason, the OS deploy fails and afterwards this message appears at boot: “Lifecycle Controller update required”. Manual install of OS and subsequent Lifecycle controller firmware update doesn’t help. Any attempt to enter the Lifecycle Controller results in the system ignoring your request and booting normally.



  • First, you press F2 to enter system setup
  • Then, go looking for the iDrac settings menu
  • Enter it, and browse down to the Lifecycle Controller option
  • Select Yes for the Cancel Lifecycle Controller Actions option.
  • Finish, save settings and reboot.

If this doesn’t solve the problem, there is a Lifecycle Controller Repair Package available for download over at the Dell support site. I have yet to figure out how that thing works though, as the release notes are not available for download at the moment. I would suggest opening a support ticket if you have to go down this route.

Each time Windows Automatic Maintenance (hereafter known as automaint) is triggered, the following message appears in the application event log shortly thereafter: Event ID 1001 from Windows Error Reporting.


This happens on several of my servers.


I know it is triggered by automaint only because it appears every night at 03:00, which is the time automaint is scheduled. I tried triggering automaint manually, and the error message promptly appeared in the event log. The scheduled task that triggers the error is called Program Data Update, which is part of the Customer Experience Improvement Program. This is a task that collects information about software installations, uninstalls and such. Analysis so far shows that this affects all of my Win2012 servers, as well as some Windows 8 and Windows 8.1 clients, but it has yet to cause any adverse effects other than the error message. I have tried to figure out exactly what it is failing at, so far to no avail, but I post this as a pointer to others who wonder what is causing the event log message. I will update this post when and if I find a solution.
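To check whether the events really line up with the 03:00 maintenance window on a given machine, the timestamps can be grouped by hour. A minimal sketch: Group-EventByHour is a hypothetical helper, and the Get-WinEvent filter in the usage comment assumes the provider name matches the event source shown above.

```powershell
function Group-EventByHour {
    param([object[]]$InputEvent)
    # Count events per hour of day, based on each event's TimeCreated property.
    $InputEvent |
        Group-Object { $_.TimeCreated.Hour } |
        Sort-Object Count -Descending |
        Select-Object @{ Name = 'Hour'; Expression = { [int]$_.Name } }, Count
}

# Usage on an affected machine:
#   $events = Get-WinEvent -FilterHashtable @{
#       LogName = 'Application'; ProviderName = 'Windows Error Reporting'; Id = 1001 }
#   Group-EventByHour -InputEvent $events
```

If nearly all events fall in hour 3, automaint is the likely trigger. A wider spread suggests something else is also raising the event.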

