Power profile drift


I received an alarm from one of my SQL Servers about IO stall times measured in seconds and went to investigate. We have had trouble with HBA firmware causing FC stalls before, so I suspected another storage error. The server in question was running virtual FC, and a cascading error among the other servers on the same host seemed to confirm my initial hypothesis about an HBA problem on the host.


The kernel mode CPU time on the host was high (the red part of the graph in Process Explorer), which also points in the direction of storage problems. The storage minions found no issue on the SAN though. Yet another pointer towards a problem on the server itself. We restarted it twice, and the situation seemed to normalize. It was all written off as collateral damage from a VMWare fault that flooded the SAN with invalid packets some time ago. I moved one of the VMs back and let it simmer overnight. I felt overly cautious not moving them all back, but the next morning the test VM was running at 80% CPU without getting anything done, and the CPU load on the host was about 50%, running a 3-vCPU VM on a 2×12 core host…


I failed the test VM back to the spare host, and the load on the VM went down immediately:


At this point I was ready to take a trip to the room of servers and visit the host in person, and I was already planning a complete re-imaging of the node in my head. But then I decided to run CPU-Z first, and suddenly it all became clear.



The host is equipped with Intel Xeon E5-2690 v3 CPUs. Intel Ark informs me that the base clock is indeed 2.6 GHz as reported by CPU-Z, and the turbo frequency is as high as 3.5 GHz. A core speed of 1195 MHz as shown in CPU-Z is usually an indicator of one of two things: either someone has fiddled with the power saving settings, or there is something seriously wrong with the hardware.

A quick check of the power profile revealed that the server was running in the so-called “balanced” mode, a mode that should be called “run-around-in-circles-and-do-nothing” mode on servers. The question then becomes: why did this setting change?


My server setup checklist clearly states that servers should run in High performance mode. And I had installed this particular server myself, so I know it was set correctly. The culprit was found to be a firmware upgrade installed some months back. It had the effect of resetting the power profile both in the BIOS and in Windows to the default setting. There was even a change to the fan profile, causing the server to run very hot. The server in question is an HP ProLiant DL380 Gen 9, and the ROM version is P89 v2.30 (09/13/2016).


  • First you should change the power profile to High performance in the control panel. This change requires a reboot.
  • While you are rebooting the server, enter the BIOS settings and check the power profile. I recommend Maximum Performance mode for production servers.
  • Then, check the Fan Profile
  • Try increased cooling. If your servers still get exceedingly hot, there is a maximum cooling mode, but this basically runs all the fans at maximum all the time.
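The Windows side of the first step can also be done from an elevated command prompt with powercfg. A sketch: the GUID below is the stock identifier for the built-in High performance scheme, so verify it against your own powercfg /list output before use.

```shell
rem List the available power schemes and their GUIDs
powercfg /list

rem Activate the built-in High performance scheme (stock GUID, verify with /list first)
powercfg /setactive 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c

rem Confirm which scheme is now active
powercfg /getactivescheme
```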

This is how CPU-Z looks after the change:


And the core utilization on the host, this time with 8 active SQL Server VMs:


Session "" failed to start with the following error: 0xC0000022


The event log fills up with Event ID 2 from Kernel-EventTracing stating Session “” failed to start with the following error: 0xC0000022.



If you look into the system data for one of the events, you will find the associated ProcessID and ThreadID:
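If you want to pull these events out without clicking through the Event Viewer, wevtutil can query the System log directly from an elevated prompt. A sketch; adjust the event count to taste:

```shell
rem Show the five most recent Kernel-EventTracing events with ID 2, newest first, as text
wevtutil qe System /q:"*[System[Provider[@Name='Microsoft-Windows-Kernel-EventTracing'] and (EventID=2)]]" /c:5 /rd:true /f:text
```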


If the event is relatively recent, the Process ID should still be registered to the offending process. Open Process Explorer and list processes by PID:
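If you prefer the command line over Process Explorer, tasklist can do the same lookup. The PID below is a made-up example; substitute the one from your event:

```shell
rem Show the process (and any services it hosts) matching the PID from the event
tasklist /svc /FI "PID eq 1234"
```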


We can clearly see that the culprit is one of those pesky WMI processes. The ThreadID fluctuates a lot more than the ProcessID, but we can always take a chance and see if it will reveal more data. I spent a few minutes writing this, and in that time it had already disappeared. I waited for another event, and immediately went to Process Explorer to look for thread 18932. Sadly, this didn’t do me any good. For someone more versed in kernel API calls the data might make some sense, but not to me.


I had more luck rummaging around in the ad-profile generator (a Google search). It pointed me in the direction of KB3087042, which talks about WMI calls to LBFO teaming (Windows 2012 native network teaming) conflicting with third-party WMI providers. Some more digging indicated that the third-party WMI provider in question is HP WBEM, a piece of software used on HP servers to facilitate centralized server management (HP Insight). As KB3087042 states, the third-party provider is not the culprit. That implies a fault in Windows itself, but one must not admit such things publicly of course.
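If you want to check whether LBFO teaming is actually in use on the server, the teaming configuration can be listed with one command. A sketch, assuming Windows Server 2012 or later where the LBFO cmdlets exist:

```shell
rem From PowerShell on Windows Server 2012 or later: list any configured LBFO teams
powershell -Command "Get-NetLbfoTeam | Format-Table Name,TeamingMode,LoadBalancingAlgorithm"
```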

In their infinite wisdom (or as an attempt to compensate for their lack thereof), the good people of Microsoft have also provided a manual workaround for the issue. It is a bit difficult to understand, so I will provide my own version below.


As usual, if the following looks to you like something that belongs in a Harry Potter charms class, please seek assistance before you implement this in production. You will be messing with central operating system files, and a slip of the hand may very well leave you with a defective server. You have been warned.

The fix

But let us get on with the fix. First, you have to get yourself an administrative command prompt. The good old-fashioned black cmd.exe (or any of the 16 available colors). There is no reason why this would not work in one of those fancy new blue PowerShell thingies as well, but why take unnecessary risks?

Then, we have a list of four incantations – uh.., commands to run through. Be aware that if for some reason your system drive is not C:, you will have to take that into account. And then spend five hours repenting and trying to come up with a good excuse for why you did it in the first place. Or perhaps spend the time looking for the person who did it and give them a good talking to. But I digress. The commands to run from the administrative command prompt are as follows:

Takeown /f c:\windows\inf
icacls c:\windows\inf /grant "NT AUTHORITY\NETWORK SERVICE":"(OI)(CI)(F)"
icacls c:\windows\inf\netcfgx.0.etl /grant "NT AUTHORITY\NETWORK SERVICE":F
icacls c:\windows\inf\netcfgx.1.etl /grant "NT AUTHORITY\NETWORK SERVICE":F

The first command takes ownership of the Windows\Inf folder. This is done to make sure that you are able to make the changes. The three icacls commands grant permissions to the NETWORK SERVICE system account on the Inf folder and the two ETL files. The result should look something like this:


To test if you were successful, run this command:

icacls c:\windows\inf

And look for the highlighted result:


Should you want to learn more about the icacls command, this is a good starting point.

The cleanup

This point is very important. If you do not hand over ownership of Windows\Inf back to the system, bad things will happen in your life.

This time, you only need a normal file explorer window. Open it, and navigate to C:\Windows. Then open the advanced security dialog for the folder.

Next to the name of the current owner (should be your account) click the change button/link.


Then, select the Local Computer as location and NT SERVICE\TrustedInstaller as object name. Click Check Names to make sure you entered everything correctly. If you did, the object name changes to TrustedInstaller (underlined).


Click OK twice to get back to the file explorer window. If you did not get any error messages, you are done.

It IS possible to script the ownership transfer as well, but in my experience the failure rate is way too high. I guess the writers of the KB agree, as they have only given a manual approach.
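For the record, the scripted equivalent of the ownership transfer is a single icacls call. Use it at your own risk, given the failure rate mentioned above:

```shell
rem Hand ownership of the Inf folder back to TrustedInstaller
icacls c:\windows\inf /setowner "NT SERVICE\TrustedInstaller"
```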

Cluster disks on the passive node and HBA WWNs

While troubleshooting some strange cluster problems, I suddenly started wondering how cluster disks are represented on the passive node of a Windows 2008 cluster. They were nowhere to be seen in Disk Management, and attempts to fail over disk resources failed spectacularly and completely. A bit of research quickly showed that yes, they should be listed roughly like this:


So the fault had to lie elsewhere. A bit more pondering revealed that we had updated the BIOS and management processor on this cluster a short time ago, and we began to smell a rat. For some reason the server had not been assigned the correct WWN at reboot. This is an IBM HS22 blade with a Qlogic HBA that gets its WWNs from the management module, and a quick check in SanSurfer revealed that the WWNs did not match what was defined in the SAN. This calls for a cold start of the management module in the blade, which means powering off the server, pulling out the blade, waiting a minute and reinserting it. Then you have to wait some 5 minutes while the IMM boots. You can tell it is done when the power light on the front of the blade starts blinking at a noticeably slower rate. Only then will the blade let itself be powered on.
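As an aside, on Windows Server 2012 and later you can check which WWNs the operating system sees without vendor tools like SanSurfer. A sketch, and an assumption about newer setups, since the box in this story was 2008:

```shell
rem From PowerShell on Windows Server 2012 or later: list HBA initiator ports
rem NodeAddress is the WWNN, PortAddress the WWPN
powershell -Command "Get-InitiatorPort | Format-Table NodeAddress,PortAddress,ConnectionType"
```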

Update: In some cases it has also worked to do a full shutdown of the blade, wait two minutes and then start it again. It is worth a try, as it saves you a trip to the server room.

I have seen the same situation before on HP BL460 blades. These usually have Emulex cards, which like to come up with a blank WWN (00-00-00-…) or with a WWN starting with 1 instead of 5. That is somewhat easier to spot than a number that looks correct at first glance. The solution is the same: out with the blade to cold start the management module.


If the fault occurs on a brand new server or a new host bus adapter, the usual cause is HBA firmware that is outdated or otherwise does not match the firmware of the blade's management module. Emulex likes to split this into several parts, an HBA BIOS and an HBA firmware; make sure you update all of them. And update the blade firmware at the same time. I recommend using firmware from the blade manufacturer rather than from Qlogic/Emulex. Also make sure to update the firmware of the central management module in the IBM BladeCenter at the same time; it is called the AMM. On the HP C7000 you often have Virtual Connect fibre modules instead of fibre switches from e.g. Brocade. Technically, AMM and Virtual Connect are very different, but in this context the important thing is that the versions match.