I received an alarm from one of my SQL Servers about IO stall time measured in seconds and went to investigate. We have had trouble with HBA firmware causing FC stalls previously, so I suspected another storage error. The server in question was running virtual FC, and a cascading error among the other servers on the same host seemed to confirm my initial hypothesis about an HBA problem on the host.
The kernel mode CPU time on the host was high (the red part of the graph in Process Explorer), something that is also a pointer in the direction of storage problems. The storage minions found no issue on the SAN though. Yet another pointer towards a problem on the server itself. We restarted it twice, and the situation seemed to normalize. It was all written off as collateral damage from a VMware fault that flooded the SAN with invalid packets some time ago. I moved one of the VMs back and let it simmer overnight. I felt overly cautious not moving them all back, but the next morning the test VM was running at 80% CPU without getting anything done, and the CPU load on the host was about 50%, running a 3 CPU VM on a 2×12 core host…
I failed the test VM back to the spare host, and the load on the VM went down immediately.
At this point I was ready to take a trip to the room of servers and visit the host in person, and I was already planning a complete re-imaging of the node in my head. But then I decided to run CPU-Z first, and suddenly it all became clear.
The host is equipped with Intel Xeon E5-2690 v3 CPUs. Intel ARK informs me that the base clock is indeed 2.6 GHz as reported by CPU-Z, and the turbo frequency is as high as 3.5 GHz. A core speed of 1195 MHz on a busy server, as shown in CPU-Z, is usually an indicator of one of two things. Either someone has fiddled with the power saving settings, or there is something seriously wrong with the hardware.
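To put the CPU-Z reading in perspective, here is a minimal Python sketch of the arithmetic. The function names and the 90% threshold are my own assumptions for illustration, not anything CPU-Z or Intel defines, and a single reading on an idle server proves nothing on its own.

```python
def clock_ratio(observed_mhz, base_mhz):
    """Return the observed core speed as a fraction of the base clock."""
    if base_mhz <= 0:
        raise ValueError("base clock must be positive")
    return observed_mhz / base_mhz


def looks_throttled(observed_mhz, base_mhz, threshold=0.9):
    """Flag a reading that sits well below base clock.

    The threshold is an arbitrary assumption; pick one that fits your
    own environment and only trust readings taken under load.
    """
    return clock_ratio(observed_mhz, base_mhz) < threshold


# Values from this post: 1195 MHz observed, 2.6 GHz base clock.
ratio = clock_ratio(1195, 2600)
print(f"{ratio:.0%} of base clock")  # 46% of base clock
print(looks_throttled(1195, 2600))   # True
```

In other words, the cores were running at less than half of the base clock, and nowhere near the 3.5 GHz turbo frequency.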
A quick check of the power profile revealed that the server was running in the so-called “balanced” mode, a mode that should be called “run-around-in-circles-and-do-nothing-mode” on servers. The question then becomes, why did this setting change?
My server setup checklist clearly states that servers should run in High performance mode. And I had installed this particular server myself, so I know it was set correctly. The culprit was found to be a firmware upgrade installed some months back. It had the effect of resetting the power profile both in the BIOS and in Windows to the default setting. There was even a change to the fan profile, causing the server to become very hot. The server in question is an HP ProLiant DL380 Gen 9, and the ROM version is P89 v2.30 (09/13/2016).
- First you should change the power profile to High performance in the control panel. This change requires a reboot.
- While you are rebooting the server, enter the BIOS settings and check the power profile. I recommend Maximum Performance mode for production servers.
- Then, check the fan profile.
- Try increased cooling. If your servers still get exceedingly hot, there is a maximum cooling mode, but this basically runs all the fans at maximum all the time.
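Since a firmware upgrade can quietly reset the Windows power plan, it may be worth checking it with a script instead of trusting the checklist. The following is a minimal Python sketch, assuming the built-in `powercfg` tool is on the path and that the server uses the stock Windows power plans with their well-known GUIDs. The helper names are hypothetical, and the script only covers the Windows side; the BIOS power profile and the fan profile still have to be checked by hand.

```python
import re
import subprocess

# Well-known GUIDs of the built-in Windows power plans.
# Custom or vendor-supplied plans will have other GUIDs.
BUILTIN_PLANS = {
    "381b4222-f694-41f0-9685-ff5bb260df2e": "Balanced",
    "8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c": "High performance",
    "a1841308-3541-4fab-bc81-f71556f20b4a": "Power saver",
}
HIGH_PERFORMANCE = "8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c"

# Match the GUID itself, since the surrounding text is localized.
GUID_RE = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def parse_active_scheme(output):
    """Extract the plan GUID from `powercfg /getactivescheme` output."""
    match = GUID_RE.search(output)
    if match is None:
        raise ValueError("no power scheme GUID found in powercfg output")
    return match.group(0).lower()


def get_active_scheme():
    """Ask Windows for the active power plan (only works on Windows)."""
    result = subprocess.run(
        ["powercfg", "/getactivescheme"],
        capture_output=True, text=True, check=True,
    )
    return parse_active_scheme(result.stdout)


def is_high_performance(guid):
    """Return True if the GUID is the built-in High performance plan."""
    return guid.lower() == HIGH_PERFORMANCE
```

On the server itself, `is_high_performance(get_active_scheme())` should return `True`. If it does not, `powercfg /setactive 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c` switches back to the built-in High performance plan from an elevated prompt. Running the check on a schedule, or as part of the post-patching routine, is one way to catch the next firmware upgrade that resets it.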
This is how CPU-Z looks after the change:
And the core utilization on the host, this time with 8 active SQL Server VMs: