EventID 1004 from IPMIDRV v2

Post originally from 2010, updated 2016.06.20. IPMI seems to be an endless source of “entertainment”…

Original post: https://lokna.no/?p=409

Problem

The system event log is overflowing with EventID 1004 from IPMIDRV: “The IPMI device driver attempted to communicate with the IPMI BMC device during normal operation. However the operation failed due to a timeout.”

The frequency varies from a couple of messages per day up to several messages per minute.

Analysis

The BMC (Baseboard Management Controller) is a component found on most server motherboards. It is a microcontroller responsible for communication between the motherboard and management software; see Wikipedia for more information. The BMC is also used for communication between the motherboard and dedicated out-of-band management boards such as Dell iDRAC. I have seen these error messages on systems from several suppliers, most notably on IBM and Dell blade servers, but most server motherboards have a BMC.

As the error message states, you can resolve this error by increasing the timeout, and this is usually sufficient. I have found that the Windows default timeout settings may cause conflicts, especially on blade servers, so an increase in the timeout values may be in order as described on TechNet. Lately, though, I have found this error to be a symptom of more serious problems. To understand this, we have to look at what is actually happening. If you have some kind of monitoring agent running on the server, such as SCOM or similar, the error could be triggered by that agent trying to read the current voltage levels on the motherboard. If such operations fail routinely during the day, it is a sign of a conflict: competing monitoring agents querying data too frequently, an issue with the BMC itself, or an issue with the out-of-band management controller. In my experience, this issue is more frequent on blade servers than on rack servers. This makes sense, as most blade servers have a local out-of-band controller that continuously talks to a chassis management controller to provide a central overview of the chassis.

Continue reading “EventID 1004 from IPMIDRV v2”

EventID 1004 from IPMIDRV

Post originally from 2010, updated 2014.04.04 and superseded by EventID 1004 from IPMIDRV v2 in 2016

Problem

The system event log is overflowing with EventID 1004 from IPMIDRV: “The IPMI device driver attempted to communicate with the IPMI BMC device during normal operation. However the operation failed due to a timeout.”

The frequency varies from a couple of messages per day up to several messages per minute.

Analysis

The BMC (Baseboard Management Controller) is a component found on most server motherboards. It is a microcontroller responsible for communication between the motherboard and management software; see Wikipedia for more information. The BMC is also used for communication between the motherboard and dedicated out-of-band management boards such as Dell iDRAC. I have seen these error messages on systems from several suppliers, most notably on IBM and Dell blade servers, but most server motherboards have a BMC.

As the error message states, you can resolve this error by increasing the timeout, and this is usually sufficient. I have found that the Windows default timeout settings may cause conflicts, especially on blade servers, so an increase in the timeout values may be in order as described on TechNet. Lately, though, I have found this error to be a symptom of more serious problems. To understand this, we have to look at what is actually happening. If you have some kind of monitoring agent running on the server, such as SCOM or similar, the error could be triggered by that agent trying to read the current voltage levels on the motherboard. If such operations fail routinely during the day, it is a sign of a conflict: competing monitoring agents querying data too frequently, an issue with the BMC itself, or an issue with the out-of-band management controller. In my experience, this issue is more frequent on blade servers than on rack servers. This makes sense, as most blade servers have a local out-of-band controller that continuously talks to a chassis management controller to provide a central overview of the chassis.
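
As a quick way to see this for yourself, the sketch below performs the same kind of in-band sensor read a monitoring agent would do, and times it. It is not from the original post and it assumes the ipmiutil utility is installed and on the PATH; on a healthy system the reads should complete quickly and consistently.

# Sketch: repeat an in-band IPMI sensor read (like a monitoring agent would)
# and time it. Assumes the ipmiutil utility is installed and on the PATH.
import subprocess
import time

for attempt in range(5):
    start = time.monotonic()
    result = subprocess.run(["ipmiutil", "sensor"], capture_output=True, text=True)
    elapsed = time.monotonic() - start
    status = "ok" if result.returncode == 0 else f"failed (rc={result.returncode})"
    print(f"attempt {attempt + 1}: {status} in {elapsed:.1f} s")
    time.sleep(5)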

If the out-of-band controllers have a problem, this can and will affect the BMC, which in turn may affect the motherboard. Monitoring of server status is the most frequently used feature, but the BMC is also used for remote power control and is able to affect the state of internal components on the motherboard. We recently had an issue on a Dell M820 blade server where a memory leak in the iDRAC resulted in the mezzanine slots on the server being intermittently switched off; in this case it was the Fibre Channel HBA. Further research revealed this to be a recurring issue. This forum thread from 2011 describes a similar problem: http://en.community.dell.com/techcenter/blades/f/4436/t/19415896.aspx.

As the iDRAC versions in question are different (1.5.3 in 2010 and 1.51.51 in 2014), I theorize that the issue is related to iDRAC memory leaks in general and not to a specific bug. Thus, any iDRAC firmware bug resulting in a memory leak may cause these issues.

Solution

Low error frequency

Increase the timeout values as described on TechNet. I have used the following values with success on IBM servers:


Under HKLM\SYSTEM\CurrentControlSet\Control\IPMI there are four timeout values controlling the IPMI driver (BusyWaitTimeoutPeriod, BusyWaitPeriod, IpmbWaitTimeoutPeriod and CommandWaitTimeoutPeriod) as well as SlaveAddressDetectionMethod. On IBM blades I have used 60 (decimal) for BusyWaitPeriod and 9 000 000 (decimal) for the rest of the timeouts. Changing these settings requires a restart of the server.
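
If you prefer to script the change, a minimal sketch in Python using the standard winreg module is shown below. It assumes an elevated prompt and uses the values from this post; treat it as an illustration rather than a polished tool.

# Sketch: set the IPMI driver timeout values described above.
# Assumes Windows, an elevated prompt, and that the IPMI key exists.
# 60 / 9 000 000 (decimal) are the values I have used on IBM blades.
import winreg

IPMI_KEY = r"SYSTEM\CurrentControlSet\Control\IPMI"

values = {
    "BusyWaitPeriod": 60,
    "BusyWaitTimeoutPeriod": 9000000,
    "IpmbWaitTimeoutPeriod": 9000000,
    "CommandWaitTimeoutPeriod": 9000000,
}

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, IPMI_KEY, 0, winreg.KEY_SET_VALUE) as key:
    for name, data in values.items():
        winreg.SetValueEx(key, name, 0, winreg.REG_DWORD, data)

print("IPMI timeout values set. Restart the server for the change to take effect.")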

High error frequency

Further analysis will be necessary for each case. Try to identify what program is triggering the timeouts. A blanket upgrade of the BIOS, the out-of-band management firmware and the system drivers may be successful, but it could also introduce new problems and thus further complicate the troubleshooting. Looking for other, seemingly unrelated problems in the event log is also a good idea, as is checking whether other servers show similar problems. I have found that removing the server from the chassis and reseating it may remove the fault for a couple of days before it returns; this is a symptom of a memory leak. And speaking of memory leaks, check for kernel-mode driver memory leaks in the operating system.
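
One way to identify a pattern is to tally the timeouts per hour and compare the result with the polling intervals of your monitoring agents. The sketch below is not from the original post; it assumes the built-in wevtutil tool and simply counts EventID 1004 from IPMIDRV in the System log.

# Sketch: count EventID 1004 from IPMIDRV per hour to look for a pattern,
# e.g. a spike matching a monitoring agent's polling schedule.
# Assumes Windows and the built-in wevtutil tool.
import re
import subprocess
from collections import Counter

query = "*[System[Provider[@Name='IPMIDRV'] and (EventID=1004)]]"
xml = subprocess.run(
    ["wevtutil", "qe", "System", f"/q:{query}", "/f:xml", "/c:1000"],
    capture_output=True, text=True, check=True,
).stdout

# TimeCreated looks like SystemTime='2016-06-20T10:15:32.123456700Z'
hours = Counter(m.group(1)[:13] for m in re.finditer(r"SystemTime=['\"]([^'\"]+)['\"]", xml))

for hour, count in sorted(hours.items()):
    print(f"{hour}:00  {count} timeouts")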

If it is a Dell server, try running the following command:

racadm getsysinfo


If the result is “ERROR: Unable to perform the requested operation”, something is seriously wrong with the out-of-band controller. Get in touch with Dell support for further help. You will need either a new version of the iDRAC firmware without the memory leak, or a downgrade to an older version.

If the command is successful, it should return a lot of information about the server:

[Screenshot: racadm getsysinfo output]

A successful result points to an issue with monitoring software or drivers. Or maybe you just need to increase the timeouts in the registry.
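
The check above is easy to script if you have many servers to look at. The sketch below is just an illustration of that: it runs racadm getsysinfo locally and looks for the error string quoted earlier, assuming racadm is installed and on the PATH.

# Sketch: run racadm getsysinfo and flag an unresponsive out-of-band controller.
# Assumes the Dell racadm utility is installed and on the PATH.
import subprocess

result = subprocess.run(["racadm", "getsysinfo"], capture_output=True, text=True)
output = result.stdout + result.stderr

if "ERROR: Unable to perform the requested operation" in output:
    print("iDRAC is not responding - contact Dell support about the firmware.")
else:
    print("iDRAC responded - look at monitoring software, drivers or the registry timeouts instead.")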

Cluster disks on the passive node and WWN on the HBA

While troubleshooting some strange cluster problems, I suddenly started wondering how cluster disks are represented on the passive node in a Windows 2008 cluster. They were nowhere to be seen in Disk Management, and attempts to fail over the disk resources failed spectacularly and completely. A little research quickly showed that yes, they should have been listed roughly like this:

[Screenshot: Disk Management on the passive node]

So the fault had to be somewhere else. A bit more pondering revealed that we had updated the BIOS and the management processor on this cluster a short while ago, and we started to get an idea of where the culprit was hiding. For some reason the server had not been assigned the correct WWN at reboot. This is an IBM HS22 blade with a Qlogic HBA that gets its WWNs from the management module, and a quick check in SanSurfer revealed that the WWN did not match what was defined in the SAN. This calls for a cold restart of the management module in the blade, which means powering off the server, pulling the blade out of the chassis, waiting a minute and inserting it again. You then have to wait some five minutes while the IMM boots. You can tell it has finished when the power light on the front of the blade starts blinking at a noticeably slower rate. Only then will the blade allow itself to be powered on.

Update: In some cases it has also worked to do a full shutdown of the blade, wait two minutes and then start it again. It is worth a try, as it saves you a trip to the server room.

I have previously seen the same situation on HP BL460 blades. These usually have Emulex cards that like to come up with a blank WWN (00-00-00-…), or with a WWN starting with 1 instead of 5, which is a bit easier to spot than a number that looks plausible. The solution is the same: pull the blade to cold-start the management module.
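
To check the WWNs from inside Windows without going into SanSurfer or the Emulex BIOS, you can read the Fibre Channel HBA attributes through WMI. The sketch below is not from the original post; it assumes the third-party Python wmi package (pip install wmi) and an HBA driver that exposes the standard MSFC_* classes in the root\WMI namespace.

# Sketch: list the node WWN of each Fibre Channel HBA so it can be compared
# with the zoning defined in the SAN. Assumes the third-party "wmi" package
# and a driver that exposes MSFC_FCAdapterHBAAttributes in root\WMI.
import wmi

c = wmi.WMI(namespace=r"root\WMI")

for hba in c.MSFC_FCAdapterHBAAttributes():
    # NodeWWN is an array of 8 bytes; format it as hex pairs.
    wwn = "-".join(f"{b:02x}" for b in hba.NodeWWN)
    print(f"{hba.Model}: {wwn}")

A WWN starting with something other than 5, or a string of zeros, is the telltale sign described above.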


If the fault appears on a brand new server or a new host bus adapter, the cause is usually HBA firmware that is outdated or otherwise does not match the firmware of the blade's management module. Emulex likes to split this into several parts, an HBA BIOS and an HBA firmware; make sure you update all of them, and update the blade firmware at the same time. It is recommended to use firmware from the blade vendor rather than from Qlogic/Emulex. Also make sure you update the firmware of the central management module on the IBM BladeCenter at the same time; it is called the AMM. On HP C7000 chassis you often have Virtual Connect fibre modules instead of fibre switches from e.g. Brocade. Technically there are big differences between the AMM and Virtual Connect, but in this context the most important thing is that the versions match.