EventID 1004 from IPMIDRV v2

Post originally from 2010, updated 2016.06.20. IPMI seems to be an endless source of “entertainment”…

Original post: https://lokna.no/?p=409

Problem

The system event log is overflowing with EventID 1004 from IPMIDRV. “The IPMI device driver attempted to communicate with the IPMI BMC device during normal operation. However the operation failed due to a timeout.”

The frequency may vary from a couple of messages per day upwards to several messages per minute.
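To decide which end of that range a server is on, it helps to put a number on it. The following is a minimal sketch, assuming you have already extracted the timestamps of the 1004 events from the System log by some other means. The function names and the sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime


def events_per_hour(timestamps):
    """Count events in each clock hour; returns {hour_start: count}."""
    buckets = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    return dict(buckets)


def peak_rate(timestamps):
    """Return the highest number of events seen in any single clock hour."""
    counts = events_per_hour(timestamps)
    return max(counts.values()) if counts else 0


# Hypothetical sample: timestamps of EventID 1004 entries from one server.
sample = [
    datetime(2016, 6, 20, 9, 5),
    datetime(2016, 6, 20, 9, 40),
    datetime(2016, 6, 20, 14, 12),
]
print(peak_rate(sample))
```

A peak of a few events per hour sits at the low end of the range, while several events per minute would show up as a peak in the hundreds.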

Analysis

The BMC (Baseboard Management Controller) is a component found on most server motherboards. It is a microcontroller responsible for communication between the motherboard and management software. See Wikipedia for more information. The BMC is also used for communication between the motherboard and dedicated out-of-band management boards such as Dell iDRAC. I have seen these error messages on systems from several suppliers, most notably on IBM and Dell blade servers, but most server motherboards have a BMC.

As the error message states, you can resolve this error by increasing the timeout, and this is usually sufficient. I have found that the Windows default settings for the timeouts may cause conflicts, especially on blade servers. Thus an increase in the timeout values may be in order, as described on TechNet.

Lately though, I have found this error to be a symptom of more serious problems. To understand this, we have to look at what is actually happening. If you have some kind of monitoring agent running on the server, such as SCOM or similar, the error could be triggered by said agent trying to read the current voltage levels on the motherboard. If such operations fail routinely during the day, it is a sign of a conflict. This could be competing monitoring agents querying data too frequently, an issue with the BMC itself, or an issue with the out-of-band management controller. In my experience, this issue is more frequent on blade servers than rack-based servers. This makes sense, as most blade servers have a local out-of-band controller that is continuously talking to a chassis management controller to provide a central overview of the chassis.

If the out-of-band controllers have a problem, this can and will affect the BMC, which in turn may affect the motherboard. Monitoring of server status is the most frequently used feature, but the BMC is also used for remote power control and is able to affect the state of internal components on the motherboard. We recently had an issue on a Dell M820 blade server where a memory leak in iDRAC resulted in the mezzanine slots on the server being intermittently switched off. In this case it was the FibreChannel HBA. Further research revealed this to be a recurring issue. This forum thread from 2011 describes a similar issue: http://en.community.dell.com/techcenter/blades/f/4436/t/19415896.aspx.

As the iDRAC versions in question are different (1.5.3 in 2010 and 1.51.51 in 2014), I theorize that the issue is related to iDRAC memory leaks in general and not a specific bug. Thus, any iDRAC firmware bug resulting in a memory leak may cause these issues.

2016: Analysis addendum

The IPMIDRV monster reared its ugly head again. I have grown increasingly wary of IPMI error messages after the incident with the HBA power fluctuations, and we are now at a point where a single error message in production triggers a full inquisition. This time the victims were about 20 Dell M620 and M820 servers upgraded from iDRAC 1.x to iDRAC 2.x that displayed daily 1004 events. All attempts to get a solution from The Cult of the Slanted E were unsuccessful. As usual we were left to figure it out on our own. Several minions were tasked with digging around in logs, and we consulted the Splunk Wizards to try a coordinated attack. We had high hopes for iDRAC 2.2.x, but though it reduced the number of events, in the end it failed to stop the onslaught completely. Thus, we kept on digging. We were about to chalk it up as yet another iDRAC memory leak, but one sunny day a minion stumbled across something odd. In iDRAC 2.x there is a new option called Host OS. Upon clicking it, a strange new warning appeared: RAC0690 The iDRAC Service Module is not installed on the operating system.

[Screenshot: the RAC0690 warning in iDRAC]

Encouraged, the minion dug on and found RAC0654 No operations can be performed on the iDRAC service module.

[Screenshot: the RAC0654 warning in iDRAC]

Neither of these messages was listed in the iDRAC logs, and if we are to believe The Cult of the Slanted E, they are not passed on to support when you run the log collection tools either. Following the principle of correcting known but seemingly irrelevant errors when stuck without better options, we went in search of a solution. Rummaging around in the Dell driver download maze we found a file identified as OM-iSM-Dell-Web-X64-2.2.0_A00.exe. This rather cryptic name decodes to Dell iDRAC Service Module 2.2. I won’t provide a link to this as such links are rapidly outdated. Just enter the service tag of your server on the download site, and the driver or whatever it is should be listed. By the time you read this a new version is likely to have materialized, but hopefully the name “iDRAC Service Module” will survive.

We installed the driver on a test server, and lo and behold, the errors vanished without a trace. Six weeks later we have yet to spot another IPMIDRV event. The Service Module tab in iDRAC is also populated:

[Screenshot: the populated Service Module tab in iDRAC]

As is the Host OS Network Interfaces tab:

[Screenshot: the populated Host OS Network Interfaces tab in iDRAC]

We postulate the following: iDRAC 2.x adds functionality that communicates with the host OS. The host OS is unable to respond to these messages without the aforementioned driver. iDRAC is unable to interpret the fact that the OS is not responding (as it does not understand the request), and is thus pestering it to check if we have installed the driver. It does however neglect to mention this in the iDRAC log, a place where we would check for such messages and take action. As a result it took months to connect the dots, and The Cult of the Slanted E is still none the wiser, as they gave up all hope a long time ago and suggested we replace the servers. Which we will do eventually, but in the meantime they have to function properly.

Solution

Dell iDRAC 2.0 or higher

Make sure the Dell iDRAC Service Module driver is installed on the server.

Low error frequency

Increase the timeout values as described on TechNet. I have used the following values with success on IBM servers:

[Screenshot: the IPMI registry values]

Under HKLM\SYSTEM\CurrentControlSet\Control\IPMI are five values controlling the IPMI driver: BusyWaitTimeoutPeriod, BusyWaitPeriod, IpmbWaitTimeoutPeriod, CommandWaitTimeoutPeriod, and SlaveAddressDetectionMethod. On IBM blades, I have used BusyWaitPeriod 60 (decimal) and 9 000 000 (decimal) for the three timeout values. Changing these settings requires a restart of the server.
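If you need to roll these values out to more than one server, scripting the change beats clicking around in regedit. Here is a minimal sketch that only builds the `reg add` command lines and prints them for review; it does not touch the registry. It makes two assumptions: that the values are of type REG_DWORD, and that 9 000 000 applies to the three values with "Timeout" in their name. SlaveAddressDetectionMethod is deliberately left alone. The function and variable names are hypothetical.

```python
IPMI_KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\IPMI"

# Values used on IBM blades in the text. Assumption: 9 000 000 applies to
# the three timeout values only; SlaveAddressDetectionMethod is not changed.
IBM_BLADE_VALUES = {
    "BusyWaitPeriod": 60,
    "BusyWaitTimeoutPeriod": 9_000_000,
    "IpmbWaitTimeoutPeriod": 9_000_000,
    "CommandWaitTimeoutPeriod": 9_000_000,
}


def reg_add_commands(values, key=IPMI_KEY):
    """Build one 'reg add' command line per value.

    REG_DWORD is an assumption; check the existing value types first.
    """
    return [
        f'reg add "{key}" /v {name} /t REG_DWORD /d {data} /f'
        for name, data in values.items()
    ]


# Print the commands so they can be reviewed before anyone runs them.
for line in reg_add_commands(IBM_BLADE_VALUES):
    print(line)
```

Printing instead of executing is intentional: you get to compare the output with the current registry content before committing to a change that needs a restart to take effect.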

High error frequency

Further analysis will be necessary for each case. Try to identify what program is triggering the timeouts. A blanket upgrade of BIOS, out-of-band management and system drivers may be successful, but it could also introduce new problems and thus further complicate the troubleshooting. Looking for other, seemingly unrelated problems in the event log could also be a good idea, as could checking for other servers with similar problems. I have found that removing the server from the chassis and reseating it may remove the fault for a couple of days before it returns. This is a symptom of a memory leak. And speaking of memory leaks, check for kernel mode driver memory leaks in the operating system.

If it is a Dell server, try running the following command:

racadm getsysinfo

[Screenshot: output of racadm getsysinfo]

If the result is ERROR: Unable to perform the requested operation, something is seriously wrong with the out-of-band controller. Get in touch with Dell support for further help. You will need a new version of the iDRAC firmware without the memory leak, or a downgrade to an older version.

If the command is successful, it should return a lot of information about the server:

[Screenshot: successful racadm getsysinfo output]

A successful result points to an issue with monitoring software or drivers. Or maybe you just need to increase the timeouts in the registry.
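With more than a handful of servers, the racadm check is worth scripting. The sketch below runs the command locally and maps the output to the two outcomes described above. It assumes racadm is on the PATH and that the error text matches the one quoted in this post. The function names, the result strings and the 120 second limit are hypothetical.

```python
import subprocess

# Error text as quoted in this post; other firmware versions may differ.
ERROR_MARKER = "ERROR: Unable to perform the requested operation"


def classify_getsysinfo(output):
    """Map racadm getsysinfo output to the next troubleshooting step."""
    if ERROR_MARKER in output:
        return "out-of-band controller fault: contact Dell support"
    if output.strip():
        return "controller responds: check monitoring software, drivers or timeouts"
    return "no output: result inconclusive"


def run_getsysinfo():
    """Run racadm locally and return stdout and stderr combined."""
    result = subprocess.run(
        ["racadm", "getsysinfo"], capture_output=True, text=True, timeout=120
    )
    return result.stdout + result.stderr


if __name__ == "__main__":
    try:
        print(classify_getsysinfo(run_getsysinfo()))
    except (OSError, subprocess.TimeoutExpired) as exc:
        # racadm is missing, not executable, or did not answer in time.
        print(f"could not run racadm: {exc}")
```

Keeping the classification separate from the command execution means you can also feed it output collected by other means, such as a log file from a remote session.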

Author: DizzyBadger

SQL Server DBA, Cluster expert, Principal Analyst
