Unexpected sense, Sense key B code 41

Problem

The system event log is overflowing with Event ID 2095 from Server Administrator: “Unexpected sense. SCSI sense data: Sense key: B Sense code: 41 Sense qualifier: 0: Physical Disk 0:1:20 Controller 1, Connector 0”.

At its worst, several events per second.


Analysis

https://en.wikipedia.org/wiki/Key_Code_Qualifier lists the common sense key codes. Sense key B translates to Aborted Command. Sense code 41 is not listed, but it is very likely that something is not as it should be with disk 20 on controller 1. OMSA can tell you which physical disk this is, and which arrays it is part of.
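To gauge how bad the flood is, you can count the events from PowerShell. A minimal sketch, assuming the events land in the System log with Server Administrator as the provider name (as they do here):

    # Count Event ID 2095 entries from Server Administrator over the last hour.
    $events = Get-WinEvent -FilterHashtable @{
        LogName      = 'System'
        ProviderName = 'Server Administrator'
        Id           = 2095
        StartTime    = (Get-Date).AddHours(-1)
    } -ErrorAction SilentlyContinue

    "{0} Event ID 2095 entries in the last hour" -f @($events).Count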


OMSA claims that the disk is working fine, but unless the drive is trying to tell me that it has found some missing common sense, I have to respectfully disagree. Such faults are usually not a good sign, especially when they are as prevalent as in this case. I therefore performed a firmware/driver upgrade, as that will often provide some insight. In this case, SUU 1809 replaced SUU 1803, a six-month span in revisions.

The upgrade resulted in a new error:

Event ID 2335, Controller event log: “PD 14(e0x20/s20) is not a certified drive: Controller 1 (PERC H730P Adapter)”.


OMSA tells me that the disk is in a failed state.


Time to register a support case with the evil wizards of the silver slanted E.

Solution

The disk was replaced. OMSA still complains about an error, specifically an enclosure error, but the iDRAC shows a green storage status.


After restarting all the OMSA services and the iDRAC Service Module service, the status returned to green in OMSA as well.
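For reference, a minimal PowerShell sketch of that restart step. The display name patterns (“DSM SA *” for the OMSA services, “iDRAC Service Module” for iSM) are assumptions that may vary between OMSA/iSM versions, so verify with Get-Service first:

    # Restart all OMSA services; their display names typically start with "DSM SA".
    Get-Service -DisplayName 'DSM SA *' | Restart-Service -Force

    # Restart the iDRAC Service Module service.
    Get-Service -DisplayName '*iDRAC Service Module*' | Restart-Service -Force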


EventID 1004 from IPMIDRV v2

Post originally from 2010, updated 2016.06.20. IPMI seems to be an endless source of “entertainment”…

Original post: https://lokna.no/?p=409

Problem

The system event log is overflowing with EventID 1004 from IPMIDRV. “The IPMI device driver attempted to communicate with the IPMI BMC device during normal operation. However the operation failed due to a timeout.”

The frequency may vary from a couple of messages per day upwards to several messages per minute.

Analysis

The BMC (Baseboard Management Controller) is a component found on most server motherboards. It is a microcontroller responsible for communication between the motherboard and management software. See Wikipedia for more information. The BMC is also used for communication between the motherboard and dedicated out-of-band management boards such as the Dell iDRAC. I have seen these error messages on systems from several suppliers, most notably on IBM and Dell blade servers, but most server motherboards have a BMC.

As the error message states, you can resolve this error by increasing the timeout, and this is usually sufficient. I have found that the Windows default settings for the timeouts may cause conflicts, especially on blade servers. Thus an increase in the timeout values may be in order, as described on TechNet.

Lately though, I have found this error to be a symptom of more serious problems. To understand this, we have to look at what is actually happening. If you have some kind of monitoring agent running on the server, such as SCOM or similar, the error could be triggered by said agent trying to read the current voltage levels on the motherboard. If such operations fail routinely during the day, it is a sign of a conflict. This could be competing monitoring agents querying data too frequently, an issue with the BMC itself, or an issue with the out-of-band management controller.

In my experience, this issue is more frequent on blade servers than rack-based servers. This makes sense, as most blade servers have a local out-of-band controller that is continuously talking to a chassis management controller to provide a central overview of the chassis.

Continue reading “EventID 1004 from IPMIDRV v2”

Known “good” iDRAC firmware

Dell seems to be unable to create a working iDRAC firmware for iDRAC 7. Every time there is an update, new issues occur. See this post about IPMI errors for a recap of some of my adventures down in firmware la-la land. Recent troubles have prompted me to create this list of known “good” and bad versions. I can in no way promise that they won’t cause issues in your systems; I just know that ours are fairly stable with the good versions, and unstable with the bad ones. And note: when I say stable, I refer to the state of the server, not the state of the iDRAC or CMC system. As an example, iDRAC 1.57.57 is delivered with a list of no less than 82 known issues. I lost my patience at no. 20, but most of them seem to be related to problems with the iDRAC system itself and functions that do not work as expected in different web browsers. You can download it from Dell here if you have a hankering for some light reading.

Known “good” iDRAC firmware:

  • 1.40.40
  • 1.46.45
  • 1.56.55
  • 1.66.65

1.46.45 and 1.66.65 in particular have been very stable.

Known bad iDRAC firmware:

  • All versions from 1.46.46 to 1.56.54
  • 2.10.10 (incompatibility with some CMC versions)
  • 2.21.21.21 (IPMI 1004 error)

1.51.51 (and maybe some adjacent builds) has a particularly nasty memory leak that can wreak havoc on your system, randomly turning parts of the server, such as HBAs or network cards, on and off.
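To check which firmware an iDRAC is currently running, you can query racadm from the host OS. A sketch using the getsysinfo command shown later in this post; the exact “Firmware Version” label in the output is an assumption that may differ between racadm versions, so inspect the raw output first:

    # getsysinfo prints, among other things, the running iDRAC firmware version.
    racadm getsysinfo | Select-String -Pattern 'Firmware Version'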

iDRAC firmware under investigation:

1.57.57

So far, I have found that this version triggers around four IPMIDRV Event ID 1004 warning messages in Windows 2008 R2 when you install it on a running system. Furthermore, I can confirm that it actually removes some false link tuning errors. From the release-note list of fixes: “Link tuning error occurs when enabling ports 3 and 4 through BIOS attribute.”


I suspect this to be a bug that appeared in 1.56.56, but I am not certain, as this error has been prominent in previous builds as well. Conclusion: use 1.66.65 instead.

EventID 1004 from IPMIDRV

Post originally from 2010, updated 2014.04.04 and superseded by EventID 1004 from IPMIDRV v2 in 2016

Problem

The system event log is overflowing with EventID 1004 from IPMIDRV. “The IPMI device driver attempted to communicate with the IPMI BMC device during normal operation. However the operation failed due to a timeout.”

The frequency may vary from a couple of messages per day upwards to several messages per minute.

Analysis

The BMC (Baseboard Management Controller) is a component found on most server motherboards. It is a microcontroller responsible for communication between the motherboard and management software. See Wikipedia for more information. The BMC is also used for communication between the motherboard and dedicated out-of-band management boards such as the Dell iDRAC. I have seen these error messages on systems from several suppliers, most notably on IBM and Dell blade servers, but most server motherboards have a BMC.

As the error message states, you can resolve this error by increasing the timeout, and this is usually sufficient. I have found that the Windows default settings for the timeouts may cause conflicts, especially on blade servers. Thus an increase in the timeout values may be in order, as described on TechNet.

Lately though, I have found this error to be a symptom of more serious problems. To understand this, we have to look at what is actually happening. If you have some kind of monitoring agent running on the server, such as SCOM or similar, the error could be triggered by said agent trying to read the current voltage levels on the motherboard. If such operations fail routinely during the day, it is a sign of a conflict. This could be competing monitoring agents querying data too frequently, an issue with the BMC itself, or an issue with the out-of-band management controller.

In my experience, this issue is more frequent on blade servers than rack-based servers. This makes sense, as most blade servers have a local out-of-band controller that is continuously talking to a chassis management controller to provide a central overview of the chassis.

If the out-of-band controllers have a problem, this can and will affect the BMC, which in turn may affect the motherboard. Monitoring of server status is the most frequently used feature, but the BMC is also used for remote power control and is able to affect the state of internal components on the motherboard. We recently had an issue on a Dell M820 blade server where a memory leak in the iDRAC resulted in the mezzanine slots on the server being intermittently switched off. In this case it was the Fibre Channel HBA. Further research revealed this to be a recurring issue. This forum thread from 2011 describes a similar issue: http://en.community.dell.com/techcenter/blades/f/4436/t/19415896.aspx.

As the iDRAC versions in question are different (1.5.3 in 2010 and 1.51.51 in 2014), I theorize that the issue is related to iDRAC memory leaks in general and not a specific bug. Thus, any iDRAC firmware bug resulting in a memory leak may cause these issues.

Solution

Low error frequency

Increase the timeout values as described on TechNet. I have used the following values with success on IBM servers.

Under HKLM\SYSTEM\CurrentControlSet\Control\IPMI are four timeout values controlling the IPMI driver: BusyWaitTimeoutPeriod, BusyWaitPeriod, IpmbWaitTimeoutPeriod, and CommandWaitTimeoutPeriod, as well as SlaveAddressDetectionMethod. On IBM blades, I have used 60 (decimal) for BusyWaitPeriod and 9 000 000 (decimal) for the rest. Changing these settings requires a restart of the server.
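A PowerShell sketch of those registry changes. The values below mirror the ones I used on the IBM blades; treat them as a starting point rather than gospel, and remember the mandatory reboot:

    # Increase the IPMI driver timeouts. All values are decimal.
    $key = 'HKLM:\SYSTEM\CurrentControlSet\Control\IPMI'

    Set-ItemProperty -Path $key -Name 'BusyWaitPeriod'           -Value 60      -Type DWord
    Set-ItemProperty -Path $key -Name 'BusyWaitTimeoutPeriod'    -Value 9000000 -Type DWord
    Set-ItemProperty -Path $key -Name 'IpmbWaitTimeoutPeriod'    -Value 9000000 -Type DWord
    Set-ItemProperty -Path $key -Name 'CommandWaitTimeoutPeriod' -Value 9000000 -Type DWord

    # A reboot is required before the new values take effect.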

High error frequency

Further analysis will be necessary for each case. Try to identify which program is triggering the timeouts. A blanket upgrade of BIOS, out-of-band management and system drivers may be successful, but it could also introduce new problems and thus further complicate the troubleshooting. Looking for other, seemingly unrelated problems in the event log could also be a good idea, as could checking for other servers with similar problems. I have found that removing the server from the chassis and reseating it may remove the fault for a couple of days before it returns; this is a symptom of a memory leak. And talking about memory leaks: check for kernel-mode driver memory leaks in the operating system.

If it is a Dell server, try running the following command:

racadm getsysinfo


If the result is “ERROR: Unable to perform the requested operation”, something is seriously wrong with the out-of-band controller. Get in touch with Dell support for further help. You will need a new version of the iDRAC firmware without the memory leak, or an older version and a downgrade.

If the command is successful, it returns a wealth of information about the server.

A successful result points to an issue with monitoring software or drivers. Or maybe you just need to increase the timeouts in the registry.
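A small sketch to automate this check across servers; the error-string match simply mirrors the message quoted above:

    # Run racadm getsysinfo and flag a broken out-of-band controller.
    $result = racadm getsysinfo 2>&1 | Out-String

    if ($result -match 'ERROR: Unable to perform the requested operation') {
        Write-Warning 'The out-of-band controller is not responding; contact Dell support.'
    }
    else {
        Write-Output 'racadm responded normally; investigate agents, drivers or registry timeouts.'
    }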

Windows 2012 NIC Teaming pooched

Problem

Something bad happened during a firmware/driver update on one of my servers, which resulted in the network adapters being unavailable for teaming in the native NIC Teaming module. The server had Broadcom NetXtreme II adapters, and we suspect that the BACS application is the culprit, since it also supports teaming. The problem presented the following symptoms:

  • Constant reinstallation of drivers in Device Manager for several minutes
  • The adapters were missing from Network Connections, but visible in Device Manager
  • No adapters were available for teaming in Windows NIC Teaming

Solution

  • First, open Device Manager and enable Show hidden devices.
  • Look for greyed-out devices, that is, devices that are not connected. Delete/uninstall all of them. You will probably be left with at least one “Microsoft Network Adapter Multiplexor Driver #n” that you are not able to uninstall.
  • Uninstall ALL Broadcom drivers/software and reboot.
  • Open Device Manager again, and get the GUID of the stubborn multiplexor adapters.
  • Go to the HKLM\SYSTEM\CurrentControlSet\Services\NdisImPlatform\Parameters\Teams registry key.
  • Delete the entire team key that corresponds to the GUID you found above, that is, the key labeled {GUID} under Teams (see the sketch after this list).
  • Reinstall the latest Broadcom drivers without BACS and BASP.
  • Reboot, and re-create the teams using NIC Teaming.
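The registry steps can be done from PowerShell as well. A hedged sketch, assuming the team keys sit under the NdisImPlatform service as described above; verify the path and GUID on your own system before deleting anything:

    # List the leftover team keys so you can match them against the multiplexor GUID.
    $teams = 'HKLM:\SYSTEM\CurrentControlSet\Services\NdisImPlatform\Parameters\Teams'
    Get-ChildItem -Path $teams | Select-Object -ExpandProperty PSChildName

    # Once the GUID is confirmed, delete the stale team key (destructive!):
    # Remove-Item -Path "$teams\{YOUR-TEAM-GUID}" -Recurse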

PERCSAS2 Event ID 129

Problem

Event ID 129 from percsas2 shows up in the system event log several times a day, stating “Reset to device, \Device\RaidPort4, was issued.”


I suddenly noticed this event in the log on four of my servers (Dell M820 blades). This is usually a bad tiding, foreboding imminent disk failure or a system-wide badger infestation. As these servers are all quite new, though, and still running fine, I suspected the problem might be located elsewhere. The other usual culprit is drivers or firmware. Amazing as it may sound, it actually does happen that vendor support engineers are correct in demanding you update everything and the kitchen sink.
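To see how widespread the resets are before blaming a single disk, the event log can be summarized per day. A sketch, assuming percsas2 is the provider name exactly as shown in the event:

    # Count percsas2 reset events (ID 129) per day over the last week.
    Get-WinEvent -FilterHashtable @{
        LogName      = 'System'
        ProviderName = 'percsas2'
        Id           = 129
        StartTime    = (Get-Date).AddDays(-7)
    } -ErrorAction SilentlyContinue |
        Group-Object { $_.TimeCreated.Date } |
        Select-Object Name, Count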

Continue reading “PERCSAS2 Event ID 129”

Poor disk performance on Dell servers

Problem

I’ve been managing a lot of Dell servers lately, where the baseline showed very poor performance for local drives connected to PERC (PowerEdge Expandable RAID Controller) controllers. Poor enough to trigger negative marks on an MSSQL RAP. Typically, read and write latency would never get below 11 ms, even with next to no load on a freshly reinstalled server. Even the cheapest laptops with 4500 RPM SATA drives would outperform such stats, and these servers had 10K or 15K RPM SAS drives on a 6 Gbps bus. We have a combination of H200, H700 and H710 PERC controllers on these servers, and the issues didn’t seem to follow a pattern, with one exception: all H200-equipped servers experienced poor performance.
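If you want a quick baseline of your own, the same latency counters perfmon shows can be sampled from PowerShell; a sketch using the standard LogicalDisk counters:

    # Sample average disk read/write latency (seconds per transfer) per volume.
    # Values consistently above ~0.011 (11 ms) on an idle server match the symptom here.
    Get-Counter -Counter @(
        '\LogicalDisk(*)\Avg. Disk sec/Read',
        '\LogicalDisk(*)\Avg. Disk sec/Write'
    ) -SampleInterval 5 -MaxSamples 12 |
        ForEach-Object { $_.CounterSamples } |
        Where-Object { $_.InstanceName -ne '_total' } |
        Select-Object InstanceName, Path, CookedValue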

Analysis

A support ticket with Dell gave the usual response: update your firmware and drivers. We did, and one of the H700-equipped servers got worse. Further inquiries with Dell gave a recommendation to replace the H200 controllers with the more powerful H700. After having a look at the specs for the H200, I fully agree with their assessment, although I do wonder why on earth they sold them in the first place. The H200 doesn’t appear to be worth the price of the cardboard box it is delivered in. It has absolutely no cache whatsoever, and according to its user’s guide it also disables the built-in cache on the drives.

This sounds like something one would use in a print server or a small departmental file server on a very limited budget, not in a four-way database cluster node. And it explains why the connected drives are painfully slow: you are reduced to platter speed.

Note: The H200 has been replaced by the H310 on newer servers. I have yet to test it, but from what the specs tell me, it is just as bad as the H200.

Update: I have since collected test data from an H310-equipped test server doing nothing but displaying the perfmon curve.

Continue reading “Poor disk performance on Dell servers”