EventID 1004 from IPMIDRV v2

Post originally from 2010, updated  2016.06.20. IPMI seems to be an endless source of “entertainment”…

Original post: https://lokna.no/?p=409

Problem

imageThe system event log is overflowing with EventID 1004 from IPMIDRV. “The IPMI device driver attempted to communicate with the IPMI BMC device during normal operation. However the operation failed due to a timeout.”

The frequency may vary from a couple of messages per day upwards to several messages per minute.

Analysis

The BMC (Baseboard Management Controller) is a component found on most server motherboards. It is a microcontroller responsible for communication between the motherboard and management software. See wikipedia for more information. The BMC is also used for communication between the motherboard and dedicated out of band management boards such as Dell iDRAC. I have seen these error messages on systems from several suppliers, most notably on IBM and Dell blade servers, but most server motherboards have a BMC. As the error message states, you can resolve this error by increasing the timeout, and this is usually sufficient. I have found that the Windows default settings for the timeouts may cause conflicts, especially on blade servers. Thus an increase in the timeout values may be in order as described on technet. Lately though, I have found this error to be a symptom of more serious problems. To understand this, we have to look at what is actually happening. If you have some kind of monitoring agent running on the server, such as SCOM or similar, the error could be triggered by said agent trying to read the current voltage levels on the motherboard. If such operations fail routinely during the day, it is a sign of a conflict. This could be competing monitoring agents querying data to frequently, an issue with the BMC itself, or an issue with the out of band management controller. In my experience, this issue is more frequent on blade servers than rack-based servers. This makes sense, as most blade servers have a local out of band controller that is continuously talking to a chassis management controller to provide a central overview of the chassis.

Continue reading “EventID 1004 from IPMIDRV v2”

Known “good” iDRAC firmware

Dell seems to be unable to create a working iDRAC firmware for iDRAC 7. Every time there is an update, new issues occur. See this post about IPMI errors for a recap of some of my adventures down in firmware la la land. Recent troubles have prompted me to create this list of known “good” and bad versions. I can in no way promise that they wont cause issues in your systems, I just know that ours are fairly stable with the good versions, and unstable with the bad ones. And note, when I say stable, I refer to the state of the server, not the state of the iDRAC or CMC system. As an example; iDRAC 1.57.57 is delivered with a list of no less than 82 known issues. I lost my patience at no. 20, but most of them seems to be related to problems with the iDRAC system itself and functions that does not work as expected on different web-browsers. You can download it from Dell here if you have a hankering for some light reading.

Known “good” iDRAC firmware:

  • 1.40.40
  • 1.46.45
  • 1.56.55
  • 1.66.65

1.46.45 and 1.66.65 in particular has been very stable.

Known bad iDRAC firmware:

  • All versions from 1.46.46 to 1.56.54
  • 2.10.10 (incompatibility with some CMC versions)
  • 2.21.21.21 (IPMI 1004 error)

1.51.51 (and maybe some adjacent builds) has a particularly nasty memory leak that can wreak havoc on your system, randomly turning parts of the server like HBA or network cards on and off.

iDRAC firmware under investigation:

1.57.57

So far, I have found that this version triggers around 4 IPMIDRV event id 1004 warning messages in Windows 2008R2 when you install it on a running system. Furthermore, I can confirm that it actually removes some false Link tuning errors. From the Release note list of fixes: “Link tuning error occurs when enabling ports 3 and 4 through BIOS attribute.”

image

I suspect this to be a bug that appeared in 1.56.56, but I am not certain as this error has been prominent in previous builds as well. Conclusion: Use 1.66.65 instead.

EventID 1004 from IPMIDRV

Post originally from 2010, updated 2014.04.04 and superseded by EventID 1004 from IPMIDRV v2 in 2016

Problem

imageThe system event log is overflowing with EventID 1004 from IPMIDRV. “The IPMI device driver attempted to communicate with the IPMI BMC device during normal operation. However the operation failed due to a timeout.”

The frequency may vary from a couple of messages per day upwards to several messages per minute.

Analysis

The BMC (Baseboard Management Controller) is a component found on most server motherboards. It is a microcontroller responsible for communication between the motherboard and management software. See wikipedia for more information. The BMC is also used for communication between the motherboard and dedicated out of band management boards such as Dell iDRAC. I have seen these error messages on systems from several suppliers, most notably on IBM and Dell blade servers, but most server motherboards have a BMC. As the error message states, you can resolve this error by increasing the timeout, and this is usually sufficient. I have found that the Windows default settings for the timeouts may cause conflicts, especially on blade servers. Thus an increase in the timeout values may be in order as described on technet. Lately though, I have found this error to be a symptom of more serious problems. To understand this, we have to look at what is actually happening. If you have some kind of monitoring agent running on the server, such as SCOM or similar, the error could be triggered by said agent trying to read the current voltage levels on the motherboard. If such operations fail routinely during the day, it is a sign of a conflict. This could be competing monitoring agents querying data to frequently, an issue with the BMC itself, or an issue with the out of band management controller. In my experience, this issue is more frequent on blade servers than rack-based servers. This makes sense, as most blade servers have a local out of band controller that is continuously talking to a chassis management controller to provide a central overview of the chassis.

If the out of band controllers have a problem, this can and will affect the BMC, which in turn may affect the motherboard. Monitoring of server status is the most frequently used feature, but the BMC controller is also used for remote power control and is able to affect the state of internal components on the motherboard. We recently had an issue on a Dell M820 blade server where a memory leak in iDrac resulted in the mezzanine slots on the server being intermittently switched off. In this case it was the FibreChannel HBA. Further research revealed this to be a returning issue. This forum thread from 2011 describes a similar issue: http://en.community.dell.com/techcenter/blades/f/4436/t/19415896.aspx.

As the iDrac versions in question are different (1.5.3 in 2010 and 1.51.51 in 2014), I theorize that the issue is related to iDrac memory leaks in general and not a specific bug. Thus, any iDrac firmware bug resulting in a memory leak may cause these issues.

Solution

Low error frequency

Increase the timeout values as described on technet. I have used the following values with success on IBM servers:

image

Under HKLM\SYSTEM\CurrentControlSet\Control\IPMI are four values controlling the IPMI driver: BusyWaitTimeoutPeriod, BusyWaitPeriod, IpmbWaitTimeoutPeriod, CommandWaitTimeoutPeriod, and SlaveAddressDetectionMethod. On IBM blades, I have used BusyWaitPeriod 60(desimal) and 9 000 000 (desimal) for the rest. Changing these settings require a restart of the server.

High error frequency

Further analysis will be necessary for each case. Try to identify what program is triggering the timeouts. A blanket upgrade of Bios, Out of band Management and system drivers may be successful, but it could also introduce new problems and thus further complicate the troubleshooting. Looking for other, seemingly unrelated problems in the event log could also be a good idea. And check for other servers with similar problems. I have found that removing the server from the chassis and reseating it may remove the fault for a couple of days before it returns. This is a symptom of a memory leak. And talking about memory leaks, check for kernel mode driver memory leaks in the operating system.

If it is a dell server, try running the following command:

racadm getsysinfo

image

If the result is ERROR: Unable to perform the requested operation, something is seriously wrong with the out of band controller. Get in touch with Dell support for further help. You will need a new version of the iDrac firmware without the memory leak, or an older version and a downgrade.

If the command is successful, it should return a lot of information about the server:

image

A successful result points to an issue with monitoring software or drivers. Or maybe you just need to increase the timeouts in the registry.