OS layer initialization failed while updating an Intel X710

Problem

When trying to update the NVM/firmware on an Intel X710 SFP+ network interface card running on Windows, the process is interrupted with the error message below:

OS layer initialization failed

This card was made by Intel and mounted in an HP ProLiant DL380 Gen9. It was chosen over a genuine HP-approved card due to supply chain issues. I was trying to install version 8.5, the latest available from Intel at the time of writing.

Analysis

Some component that the updater needs is not available. I suspect a hardening issue, as someone installed a ginormous number of hardening GPOs some time ago.

A Process Monitor trace shows that the tool launches pnputil and tries to install some drivers with varying degrees of success. Specifically, it appears to be looking for iqvsw64e in various temporary folders.

It has been a while since I last read a Procmon trace, but as far as I can tell the installation is not successful. The files are included with the update package and self-identify as an “Intel Network Adapter Diagnostic Driver”.

Hypothesis: the NVM updater needs the diagnostic driver to communicate with the adapter, but something blocks the installation.

I do not have access to test this, but I am pretty sure that there is a GPO blocking the installation. I tested the previous version from 2018 that had been installed successfully on the server, and it now fails with the exact same error. The next step would be to start disabling hardening GPOs, but as I do not have access to do that directly on this server, I gave up and started looking for a workaround. Some hours later I found one.

Workaround

As per usual, if you do not fully understand the consequences of the steps in the action plan below, seek help. This could brick your server, which is a nuisance when you have 20, but a catastrophe if you have only one and the support agreement has expired.

Prerequisites

  • HP ProLiant DL380 Gen9 (should work on all currently supported HPE ProLiant DL series servers).
  • Windows Server 2016. Probably compatible with 2012-2022, but I have yet to test that.
  • Other HP-approved Intel SFP+ network adapters mounted in the same server, in my case cards equipped with Intel 560-series chips. Could work with other Intel adapters as well.
  • A copy of a not-too-old Service Pack for ProLiant (SPP) ISO for your server or similar. The SPP for the DL380 Gen10 has been tested, and I can confirm that it works for this purpose even though it will refuse to install any SPP updates.
  • A valid HPE support contract is recommended.
  • A towel is recommended.
  • A logged-in ILO remote console or local console access.
  • The local admin password for the server.

Action plan

NB: If you are not planning an SPP update as part of this process, or if you are unable to obtain one, see the update below for an alternative approach. You need an active support contract to download SPP packages, but individual component packs (cp packages) are available.

  • Install the Intel drivers that correspond to your firmware version. Preferably, both should be the latest available.
  • Be aware that this will disconnect your server from the network if all your adapters are Intel adapters. This is usually only temporary, but if the server does not come back online, you may have to reconfigure your network adapters using ILO or a local console.
  • Reboot the server.
  • Extract the NVM update.
  • In an administrative command shell, navigate to the Winx64 folder.
  • Try running nvmupdatew64e.exe and verify the error message.
  • Mount the SPP ISO.
  • Run launch_sum.bat from the iso as admin.
  • In the web browser that appears, accept the certificate error and start a localhost guided update.
  • While the inventory is running, switch back to the command shell and keep trying to start the nvm update.
  • This will take some time, so do not give up, and remember your towel.
  • Suddenly, the NVM update will start working instead of failing.
  • Update all cards that have an update available.
  • Reboot the server. You may complete the SPP update before you reboot.

Update: an alternative method

As the HP SPP is a fairly large download to haul around, I kept looking for a more lightweight workaround. If you are going to install an SPP anyway, using it makes sense, but if you are patching your servers by other means, it is overkill to use a 10 GB ISO to install a 70 KB temporary driver. Thus a new plan was hatched.

  • Instead of the SPP ISO, get hold of the cpnnnn update package for your HP-approved Intel-based network card. For my 560-series card, the current cp is cp047026.
  • Extract the files to a new folder. I have not tested whether it is possible to extract the package without the card being installed, but it appears to be a branded WinZip self-extractor or similar, so I expect it to work.
  • Inside your new folder you will find a file called ofvpdfixW64e.exe. Run it from an administrative command shell.
  • Wait for it to finish scanning for adapters.
  • You should now be able to start nvmupdatew64e.exe and upgrade your X710.

As we can see, the tool detects both the HP-approved and the original Intel adapters. The tool is designed for a different purpose, but that is not important: all we need is something that will load the diagnostic driver and thus enable our Intel updater to function. The package also contains rfendfixW64e.exe, another fixup tool that will load the driver. The HP-branded firmware update tool (HPNVMUpdate.exe) may also load the driver in some scenarios. In other words, try all of them if one is not working, and make sure to wait for the scan to complete before you start nvmupdatew64e.exe.

Also, make sure to install the PROSet drivers. I have had trouble getting this to work without them.
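To sum up the alternative method as a command sequence (the folder paths are examples only; use wherever you extracted the cp package and the Intel NVM update):

# Load the diagnostic driver using the fixup tool from the HP cp package (example paths)
cd C:\Temp\cp047026
.\ofvpdfixW64e.exe
# Wait for the adapter scan to finish, then run the Intel NVM updater
cd C:\Temp\X710-NVM-Update\Winx64
.\nvmupdatew64e.exe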

Why this works

The HP SPP uses a branded version of the Intel NVM updater. This updater uses the same driver mentioned above, at least for the 560-series chips. It runs in a different host process and is thus able to circumvent the hardening that blocks the driver installation from the Intel tool directly. When the SPP inventory process queries the Intel network adapters, the driver is loaded and keeps running until you reboot the server. You may be able to get this working even without any other Intel adapters, but I have not tested that scenario; it all depends on how the SPP inventory process runs.

Verify the result

You can verify the result using the Intel-supplied PowerShell cmdlets. They are installed together with the PROSet driver package. You load them by running this command:

Import-Module -Name "C:\Program Files\Intel\Wired Networking\IntelNetCmdlets\IntelNetCmdlets"

You can then list the NVM versions by running the next command. Be aware that HP-branded adapters may not respond to this command and will be listed as not supported. These commands can be relatively slow to respond; this is normal.

Get-IntelNetAdapter | ft -Property Name, DriverVersion, ETrackID, NVMVersion

Scheduled export of the security log

If you have trouble with the log being overwritten before you can read it and do not want to increase the size of the log further, you can use a scheduled PowerShell script to create regular exports. The script below creates CSV files that can easily be imported into a database for further analysis.

The account running the scheduled task needs to be a local admin on the computer.

#######################################################################################################################
#   _____     __     ______     ______     __  __     ______     ______     _____     ______     ______     ______    #
#  /\  __-.  /\ \   /\___  \   /\___  \   /\ \_\ \   /\  == \   /\  __ \   /\  __-.  /\  ___\   /\  ___\   /\  == \   #
#  \ \ \/\ \ \ \ \  \/_/  /__  \/_/  /__  \ \____ \  \ \  __<   \ \  __ \  \ \ \/\ \ \ \ \__ \  \ \  __\   \ \  __<   #
#   \ \____-  \ \_\   /\_____\   /\_____\  \/\_____\  \ \_____\  \ \_\ \_\  \ \____-  \ \_____\  \ \_____\  \ \_\ \_\ #
#    \/____/   \/_/   \/_____/   \/_____/   \/_____/   \/_____/   \/_/\/_/   \/____/   \/_____/   \/_____/   \/_/ /_/ #
#                                                                                                                     #
#                                                   http://lokna.no                                                   #
#---------------------------------------------------------------------------------------------------------------------#
#                                          -----=== Elevation required ===----                                        #
#---------------------------------------------------------------------------------------------------------------------#
# Purpose: Export and store the security event log as CSV.                                                            #
#                                                                                                                     #
#=====================================================================================================================#
# Notes: Schedule execution of this script every $captureHrs hours minus the script execution time.                  #
# Test the script to determine the execution time, add 2 minutes for good measure.                                    #
#                                                                                                                     #
# Scheduled task: powershell.exe -ExecutionPolicy ByPass -File ExportSecurityEvents.ps1                               #
#######################################################################################################################

#Config
$path = "C:\log\security\" # Add Path, end with a backslash
$captureHrs = 20 #Capture n hours of data

#Execute
$now = Get-Date
$CaptureTime = Get-Date -Format "yyyyMMddHHmmss"           # timestamp used in the export file name
$CaptureFrom = $now.AddHours(-$captureHrs)                 # start of the capture window
$Filename = $path + $CaptureTime + 'Security_log.csv'
$log = Get-EventLog -LogName Security -After $CaptureFrom  # read the security log for the capture window
$log | Export-Csv $Filename -NoTypeInformation -Delimiter ";"
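For completeness, this is one way to register the scheduled task from PowerShell. It is only a sketch: the task name, script path, interval, and account are placeholders you must adapt.

# Register a task that runs the export script every 20 hours under a local admin account
$action  = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-ExecutionPolicy ByPass -File C:\Scripts\ExportSecurityEvents.ps1"
$trigger = New-ScheduledTaskTrigger -Once -At (Get-Date).Date -RepetitionInterval (New-TimeSpan -Hours 20) -RepetitionDuration (New-TimeSpan -Days 3650)
Register-ScheduledTask -TaskName "Export Security Log" -Action $action -Trigger $trigger -User "DOMAIN\SvcLogExport" -Password "<password>" -RunLevel Highest   # account and password are placeholders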

Logical switch uplink profile gone

Problem

When you try to connect a new VM to a logical switch, you get a lot of strange error messages related to missing ports or no available switch. The errors seem random.

Analysis

If you check the logical switch properties of an affected host, you will notice that the uplink profile is missing:

image

If you look at the network adapter properties of an affected VM, you will notice that the Logical Switch field is blank:

image

This is connected to a WMI problem. Some Windows updates uninstall the VMM WMI MOFs required for the VMM agent to manage the logical switch on the host. See details at MS Tech.

Solution

Mofcomp to the rescue. Run the following commands in an administrative PowerShell prompt. To make VMM pick up the change, refresh the cluster/node afterwards. Note: some versions use a different path to the MOF files, so verify this if the command fails.

 

image

Mofcomp "$env:SystemDrive\Program Files\Microsoft System Center\Virtual Machine Manager\setup\scvmmswitchportsettings.mof"
Mofcomp "$env:SystemDrive\Program Files\Microsoft System Center\Virtual Machine Manager\DHCPServerExtension\VMMDHCPSvr.mof"
Get-CimClass -Namespace root/virtualization/v2 -ClassName *vmm*
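If the default path does not exist in your VMM version, something like this can help you locate the MOF files first (the path glob is an assumption; adjust it to your installation drive):

# Locate the VMM MOF files if the default installation path differs
Get-ChildItem -Path "$env:SystemDrive\Program Files\Microsoft System Center*" -Recurse -Filter "*.mof" -ErrorAction SilentlyContinue | Select-Object -ExpandProperty FullName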

Update-Module fails

Problem

When trying to update a module from PSGallery (PSWindowsUpdate in this case), the package manager claims that the PSGallery repository does not exist, giving one of the following errors:

  • “Unable to find repository ‘https://www.powershellgallery.com/api/v2/’.”
  • “Unable to find repository ‘PSGallery’.”

 

Analysis

There seems to be a problem with the URL for the PSGallery repository missing a trailing slash; I could find a lot of posts about this online. If we run Get-PSRepository and compare it with Get-InstalledModule -Name PSWindowsUpdate | fl, we can see that the URL differs:
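For reference, these are the two commands being compared; the screenshots below show the mismatching output:

# The registered repository URL...
Get-PSRepository
# ...versus the repository recorded for the installed module
Get-InstalledModule -Name PSWindowsUpdate | Format-List *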

 

image

 

image

There is also something wrong with the link between the repository and the package: the Repository line above should say PSGallery, not https://www.powershellgallery.com/api/v2/.

I do not know why or when this happened, but sometime in the second half of 2018 is my best guess, based on the last time we patched PSWindowsUpdate on the servers in question.

The PackageManagement version installed on these servers was rather old, version 1.0.0.1. Come to think of it, PackageManagement itself moved to PSGallery at some point, but this version is not from the gallery, as it is found using Get-Module and not Get-InstalledModule:

image

Solution

After a long and winding path, I have come up with the following action plan:

  • Update the NuGet provider.
  • Uninstall the “problem” module.
  • Unregister the PSGallery repository.
  • Re-register the PSGallery repository.
  • Install the latest version of PowerShellGet from PSGallery.
  • Reinstall the problem module.
  • Reboot to remove the old PackageManagement version.

 

You can find a sample PowerShell script below, using PSWindowsUpdate as the problem module. If you have multiple PSGallery modules installed, you may have to reinstall all of them.

Install-PackageProvider NuGet -Force
Uninstall-Module PSWindowsUpdate
Unregister-PSRepository -Name 'PSGallery'
Register-PSRepository -Default
Get-PSRepository
Install-Module -Name PowerShellGet -Repository PSGallery -Force
Install-Module -Name PSWindowsUpdate -Repository PSGallery -Force -Verbose

Failover Cluster: access to update the secure DNS Zone was denied.

Problem

After you have built a cluster, the Cluster Events page fills up with Event ID 1257 from FailoverClustering, complaining about not being able to write the DNS records in AD:

“Cluster network name resource failed registration of one or more associated DNS name(s) because the access to update the secure DNS Zone was denied.


Cluster Network name: X
DNS Zone: Y


Ensure that cluster name object (CNO) is granted permissions to the Secure DNS Zone.”

image

Solution

There may be other root cause scenarios, but in my case the problem was a static DNS reservation on the domain controller.

As usual, if you do not understand the action plan below, seek help or get educated before you continue. Your friendly local search engine is a nice place to start if you do not have a local cluster expert. This action plan includes actions that will take down parts of your cluster momentarily, so do not perform these steps on a production cluster during peak load. Schedule a maintenance window.

  • Identify the source of the static reservation and apply public shaming and/or pain as necessary to ensure that this does not happen again. Cluster DNS records should be dynamic.
  • Identify the static DNS record in your Active Directory Integrated DNS forward lookup zone. Ask for help from your DNS or AD team if necessary.
  • Delete the static record.
  • Take the cluster network name resource representing the DNS record offline in Failover Cluster Manager (or with PowerShell; see the sketch after this list). Be aware that any dependent resources will also go offline.
  • Bring everything back online. This should trigger a new DNS registration attempt. You could also wait for the cluster to attempt this automatically, but client connections may fail while you are waiting.
  • Verify that the DNS record is created as a dynamic record. It should have a current Timestamp.
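If you prefer PowerShell for the offline/online part, a minimal sketch is shown below. The resource name "Cluster Name" is an example; substitute the network name resource from the error message.

# Take the network name resource offline and back online to trigger a new DNS registration
Stop-ClusterResource -Name "Cluster Name"
Start-ClusterResource -Name "Cluster Name"
# Optionally force a DNS registration attempt directly
Update-ClusterNetworkNameResource -Name "Cluster Name"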

Identify which drive \Device\Harddisk#\DR# represents

Problem

You get an error message stating that there is some kind of problem with \Device\Harddisk#\DR#. For instance, Event ID 11 from Disk: The driver detected a controller error on \Device\Harddisk4\DR4.

image

The disk number referenced in the error message does not necessarily correspond to the disk id numbers in Disk Management. To figure out which disk is referenced, some digging is required.

Solution

There may be several ways to identify the drive. In this post I will show the WinObj method, as it is the only one that has worked consistently for me.

  • First, get a hold of WinObj from http://sysinternals.com 
  • Run WinObj as admin
  • Browse to \Device\Harddisk#. We will use Harddisk4\DR4 as a sample from here on out, but you should of course replace that with the numbers from your error message.

image

  • Look at the SymLink column to identify the entry from the error message.

image

  • Go to the GLOBAL?? folder and sort by the SymLink column.
  • Scroll down to the \Device\Harddisk4\DR4 entries.
  • You will find several entries, some for the drive and some for the volume or volumes.

image

image

  • The most interesting one in this example is the drive letter D for Volume 5 (the only volume on this drive).
  • Looking in Disk Management, we can identify the drive; in this case it was an empty SD card slot. We also see that the disk number and the DR number are both 4, but there is no guarantee that these numbers are equal.

image

image

Most likely, the event was caused by an improper removal of the SD card. As this is a server, and the SD card slot could be a virtual device connected to the out-of-band IPMI/ILO/DRAC/IMM management chip, the message could also be related to a restart or upgrade of the out-of-band chip. In any case, there is no threat to our important data, which are hopefully not located on an SD card.

If you receive this error on a SAN HBA or local RAID controller, check the management software for your device. You may need a firmware upgrade, or you could have a slightly defective drive that is about to go out in flames.
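If you want to cross-reference the Harddisk number from PowerShell, listing the physical disks is a decent starting point. Note that this does not expose the DR number, which is why WinObj remains my preferred method.

# List physical disks with index and model; the index normally matches the Harddisk number
Get-CimInstance -ClassName Win32_DiskDrive |
    Sort-Object Index |
    Select-Object Index, Model, SerialNumber, DeviceID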

Disable automatic loading of Server Manager

Problem

When you log in to a server, the Server Manager window loads automatically. On a small VM this can take some time and waste some resources, especially if you forget to close it and log off.

Solution

Create a GPO to disable the automatic loading of Server Manager.

  • Start the Group Policy Management console.
  • Create and link a new GPO on the OU/OUs where you want to apply it.
  • Find the setting Computer Configuration\Policies\Administrative Templates\System\Server Manager\Do not display Server Manager at logon
  • Enable it.
  • Close the GPO editor and wait for a GPO refresh, or trigger gpupdate /force on the applicable computers.

image
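If you only need this on one or two servers and do not want another GPO, a commonly used per-machine alternative is to disable the scheduled task that launches Server Manager at logon. A minimal sketch, assuming the task lives in the default location:

# Disable the built-in task that auto-starts Server Manager at logon (affects all users on this server)
Get-ScheduledTask -TaskName "ServerManager" -TaskPath "\Microsoft\Windows\Server Manager\" |
    Disable-ScheduledTask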

Hypervisor not running

Problem

After upgrading my lab to VMM 1801, and subsequently VMM 1806 (see https://lokna.no/?p=2519), VMs refused to start on one of my hosts. Event ID 20148 was logged when I tried to create a new VM. I restarted the host in the hope of a quick fix, but the result was that none of the VMs living on this host wanted to boot.

“Virtual machine ‘NAME’ could not be started because the hypervisor is not running.”

image

Solution

For some reason the hypervisor has been disabled. You can check this by running bcdedit in an administrative command prompt: hypervisorlaunchtype should be set to Auto. If it is not, change it by running the following command:

bcdedit /set hypervisorlaunchtype auto
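To verify the setting before the change (and again after the reboot), you can filter the bcdedit output:

# Show the current setting; it should read Auto
bcdedit /enum | Select-String "hypervisorlaunchtype"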


After changing the setting, reboot the host and everything should be running again. Unless, of course, you have a completely different issue preventing your VMs from starting.

image

Configure VMQ and RSS on physical servers

Introduction

The samples below are collected from Windows Server 2016.

The primary objective is to avoid weighing down Core 0 with networking traffic. This is the first core on the first NUMA node, and this core is responsible for a lot of kernel processing. If this core suffers from contention, a wild blue screen of death will appear. Thus, we want our network adapters to use other cores to process their traffic. We can achieve this in three ways, depending on what we use the adapter for:

  • Enable Receive Side Scaling (RSS) and configure it to use specific cores.
  • Enable Virtual Machine Queues (VMQ) and configure them to use specific cores.
  • Set the preferred NUMA node.

For physical machines

On network adapters used for generic traffic, we should enable RSS and disable VMQ. On adapters that are part of a virtual switch, we should disable RSS and enable VMQ. The preferred NUMA node should be configured for all physical adapters.
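A rough PowerShell sketch of the above; the adapter names, base processors, and core counts are examples that you must adapt to your own NUMA layout. On a hyperthreaded system, the usual advice is to stick to even processor numbers, as the odd logical processors are the hyperthread siblings.

# Generic-traffic adapter: RSS on, VMQ off, steered away from core 0
Disable-NetAdapterVmq -Name "NIC1"
Enable-NetAdapterRss -Name "NIC1"
Set-NetAdapterRss -Name "NIC1" -BaseProcessorNumber 2 -MaxProcessors 8

# Virtual switch member: VMQ on, RSS off, using a different core range
Disable-NetAdapterRss -Name "NIC2"
Enable-NetAdapterVmq -Name "NIC2"
Set-NetAdapterVmq -Name "NIC2" -BaseProcessorNumber 10 -MaxProcessors 8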

For virtual machines

If the machine has more than one CPU, enable vRSS.

Investigating the NUMA architecture

Sockets and NUMA nodes

Sysinternals coreinfo -s -n will show the relationship between logical processors, sockets, and NUMA nodes. In the example below we have a system with two sockets and four NUMA nodes.

clip_image002

Closest NUMA node

Each PCIe adapter is physically connected to a specific NUMA node. If possible, RSS/VMQ should be mapped to cores on the same NUMA node that the NIC is connected to. Get-NetAdapterRss will show you which NUMA node is closest for each adapter. The port in the sample is connected to (closest to) NUMA node 0, as the NUMA distance for the cores in group 0 is 0. We can also see that the NUMA distance to node 1 for this particular port is lower than the distance to nodes 2 and 3. This is caused by the fact that nodes 0 and 1 are on the same physical CPU, whereas nodes 2 and 3 are on another physical CPU.
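For reference, the relevant part of the output can be pulled like this; RssProcessorArray lists the available cores with their NUMA distance for each adapter:

# Show the core/NUMA-distance map per adapter
Get-NetAdapterRss | Format-List Name, RssProcessorArray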

clip_image004

Hyper-V VM with VirtualFC fails to start

Problem

This is just a quick note to remember the solution and EventIDs.

The VM fails to start, complaining about failed resources or resources not being available in Failover Cluster Manager. Analysis of the event log reveals messages related to Virtual Fibre Channel:

  • EventID 32110 from Hyper-V-SynthFC: ‘VMName’: NPIV virtual port operation on virtual port (WWN) failed with an error: The world wide port name already exists on the fabric. (Virtual machine ID ID)
  • EventID 32265 from Hyper-V-SynthFC: ‘VMName’: Virtual port (WWN) creation failed with a NPIV error(Virtual machine ID ID).
  • EventID 32100 from Hyper-V-VMMS: ‘VMNAME’: NPIV virtual port operation on virtual port (WWN) failed with an unknown error. (Virtual machine ID ID)
  • EventID 1205 from Microsoft-Windows-FailoverClustering: The Cluster service failed to bring clustered role ‘SCVMM VM Name Resources’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

Analysis

The events point in the direction of Virtual Fibre Channel or Fibre Channel issues. After a while we realised that one of the nodes in the cluster did not release the WWN when a VM migrated away from it. Further analysis revealed that the FC driver versions differed between the nodes.

SNAGHTML66058b10

SNAGHTML660875d4

Solution

  • Make sure all cluster nodes are running the exact same driver and firmware for the SAN and network adapters. This is crucial for failovers to function smoothly.
  • To “release” the stuck WWNs you have to reboot the offending node. To figure out which node is holding the WWN you have to consult the FC Switch logs. Or you could just do a rolling restart and restart all nodes until it starts working.
  • I have successfully worked around the problem by removing and re-adding the virtual FC adapters on the VM that is not working (see the sketch below this list). I do not know why this resolved the problem.
  • Another workaround would be to change the WWN on the virtual FC adapters. You would of course have to make this change at the SAN side as well.
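A minimal sketch of the remove/re-add workaround, assuming a VM named VM01 and a virtual SAN named FabricA (both are examples). The VM must be powered off, and if you let Hyper-V generate new WWNs you will have to update zoning and masking on the SAN side.

# Note the existing WWNs first if you plan to reuse them on the SAN side
Get-VMFibreChannelHba -VMName "VM01"
# Remove the virtual FC adapters and add a new one connected to the virtual SAN
Get-VMFibreChannelHba -VMName "VM01" | Remove-VMFibreChannelHba
Add-VMFibreChannelHba -VMName "VM01" -SanName "FabricA"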