OS layer initialization failed while updating an Intel X710

Problem

When trying to update the NVM/firmware on an Intel X710 SFP+ network interface card running on Windows, the process is interrupted with the error message below:

OS layer initialization failed

This card was made by Intel and mounted in an HP ProLiant DL380 Gen9. It was chosen over a genuine HP-approved card due to supply chain issues. I was trying to install version 8.5, the latest available from Intel at the time of writing.

Analysis

Some component is not available. I suspect a hardening issue, as someone installed a ginormous amount of hardening GPOs some time ago.

A Process Monitor trace shows that the tool launches pnputil and tries to install some drivers with varying degrees of success. Specifically, it appears to be looking for iqvsw64e in miscellaneous temporary folders.

It has been a while since I last read a Procmon trace, but as far as I can tell the process is not successful. The files are included with the package and self-identify as an “Intel Network Adapter Diagnostic Driver”.

Hypothesis: the NVM updater needs the diagnostic driver to communicate with the adapter, but something blocks the installation.
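
If you can test directly on the server, one way to check this hypothesis is to try installing the diagnostic driver manually with pnputil from the folder the package extracts it to. This is just a sketch; the exact .inf name is an assumption based on the file name above, and if a policy blocks driver installation you would expect the command to fail with an error pointing in that direction:

pnputil /add-driver iqvsw64e.inf /install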

I do not have access to test this, but I am pretty sure that there is a GPO blocking the installation. I tested the previous version from 2018 that had been installed successfully on the server, and it now fails with the exact same error. The next step would be to start disabling hardening GPOs, but as I do not have the access to do that directly on this server, I gave up and started looking for a workaround. Some hours later I found one.

Workaround

As per usual, if you do not fully understand the consequences of the steps in the action plan below, seek help. This could brick your server, which is a nuisance when you have 20, but a catastrophe if you have only one and the support agreement has expired.

Prerequisites

  • HP ProLiant DL380 Gen9 (should work on all currently supported HPE ProLiant DL series servers).
  • Windows Server 2016. Probably compatible with 2012-2022, but I have yet to test that.
  • Other HP-approved Intel SFP+ network adapters mounted in the same server, in my case cards equipped with Intel 560-series chips. Could work with other Intel adapters as well.
  • A copy of a not-too-old Service Pack for ProLiant (SPP) ISO for your server or similar. The SPP for the DL380 Gen10 has been tested, and I can confirm that it works for this purpose even though it will refuse to install any SPP updates.
  • A valid HPE support contract is recommended.
  • A towel is recommended.
  • A logged-in ILO remote console or local console access.
  • The local admin password for the server.

Action plan

NB: If you are not planning an SPP update as part of this process, or if you are unable to obtain one, see the update below for an alternative approach. You need an active support contract to download SPP packages, but the individual cp packages are available separately.

  • Install the Intel drivers that correspond to your firmware version. Preferably both should be the latest available.
  • Be aware that this will disconnect your server from the network if all your adapters are Intel adapters. Usually only temporarily, but if the server does not come back online, you may have to reconfigure your network adapters using ILO or a local console.
  • Reboot the server.
  • Extract the NVM update.
  • In an administrative command shell, navigate to the Winx64 folder.
  • Try running nvmupdatew64e.exe and verify the error message.
  • Mount the SPP iso.
  • Run launch_sum.bat from the iso as admin.
  • In the web browser that appears, accept the certificate warning and start a localhost guided update.
  • While the inventory is running, switch back to the command shell and keep trying to start the NVM update (see the sketch after this list).
  • This will take some time, so do not give up, and remember your towel.
  • Suddenly, the updater will start normally instead of throwing the error.
  • Update all cards that have an update available.
  • Reboot the server. You may complete the SPP update before you reboot.
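
For reference, the “keep trying” part of the plan can be scripted along these lines. This is a rough sketch: the paths are placeholders from my environment, and it assumes nvmupdatew64e.exe returns a non-zero exit code as long as it keeps failing.

# Mount the SPP iso (adjust the path), then start launch_sum.bat from the mounted drive as admin
Mount-DiskImage -ImagePath 'C:\Temp\SPP.iso'
# In a second elevated shell, keep retrying the Intel updater until the diagnostic driver is loaded
Set-Location 'C:\Temp\700Series\Winx64'
do {
    .\nvmupdatew64e.exe
    Start-Sleep -Seconds 30
} until ($LASTEXITCODE -eq 0)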

Update: an alternative method

As the HP SPP is a fairly large download to haul around, I kept looking for a more lightweight workaround. If you are going to install an SPP anyway, using it makes sense, but if you are patching your servers by other means, it is overkill to use a 10 GB .iso to install a 70 KB temporary driver. Thus, a new plan was hatched.

  • Instead of the SPP iso, get hold of the cpnnnn update package for your HP-approved Intel-based network card. For my 560-series card, the current cp is cp047026.
  • Extract the files to a new folder. I have not tested whether it is possible to extract the package without the card being installed, but it appears to be a branded WinZip self-extractor or similar, so I expect it to work.
  • Inside your new folder you will find a file called ofvpdfixW64e.exe. Run it from an administrative command shell.
  • Wait for it to finish scanning for adapters.
  • You should now be able to start nvmupdatew64e.exe and upgrade your X710. The sketch after this list sums up the commands.
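
In command form, the alternative method boils down to something like this. The folder paths are placeholders, and cp047026 is the package for my card:

# Run the HPE fixup tool from the extracted cp package; it loads the Intel diagnostic driver
Set-Location 'C:\Temp\cp047026'
.\ofvpdfixW64e.exe
# Wait for the adapter scan to finish, then run the Intel NVM updater from its own package
Set-Location 'C:\Temp\700Series\Winx64'
.\nvmupdatew64e.exe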

As we can see, the tool detects both the HP-approved and the original Intel adapters. The tool is designed for a different purpose, but that is not important; all we need is a tool that loads the diagnostic driver and thus enables our Intel updater to function. The package also contains rfendfixW64e.exe, another fixup tool that will load the driver. The HP-branded firmware update tool (HPNVMUpdate.exe) may also load the driver in some scenarios. I guess what I am saying is: try all of them if one is not working. And make sure to wait for the scan to complete before you start nvmupdatew64e.exe.

Also, make sure to install the PROSet drivers. I have had trouble getting this to work without them.

Why this works

The HP SPP uses a branded version of the Intel NVM updater. This updater uses the same driver mentioned above, at least for the 560-series of chips. It runs in a different host process and is thus able to circumvent the hardening that blocks installation of the driver from the Intel tool directly. When the SPP inventory process queries the Intel network adapters, the driver is loaded and keeps running until you reboot the server. You may be able to get this working even without any other Intel adapters, but I have not tested that scenario. It all depends on how the SPP inventory process runs.
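
If you want to check whether the diagnostic driver is actually loaded before retrying the updater, a query along these lines should do the trick. The name filter is an assumption based on the iqvsw64e file name mentioned earlier:

Get-CimInstance Win32_SystemDriver | Where-Object { $_.Name -like 'iqv*' } | Select-Object Name, State, StartMode, PathName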

Verify the result

You can verify the result using the Intel-supplied PowerShell cmdlets. They are installed together with the PROSet driver package. You activate them by running this command:

Import-Module -Name "C:\Program Files\Intel\Wired Networking\IntelNetCmdlets\IntelNetCmdlets"

And you list the NVM versions by running the next command. Be aware that HP-branded adapters may not respond to this command and will be listed as not supported. These commands may be relatively slow to respond; this is normal.

Get-IntelNetAdapter | ft -Property Name, DriverVersion, ETrackID, NVMVersion

List VM Networks in SCVMM

To list the connected networks and subnets for each VM managed by VMM, run the following in a VMM-connected PowerShell window:

Get-VM | Select-Object -ExpandProperty VirtualNetworkAdapters | Select-Object Name, VMNetwork, VMSubnet, IPV4Subnets, IPv4Addresses | Sort-Object VMNetwork | Format-Table

Upgrade to VMM 2019, another knight’s tale

This is a story in the “Knights of Hyper-V” series, an attempt at humor with actual technical content hidden in the details.

The gremlins of the blue window had been up to their usual antics. In 2018 they promised a semi-annual update channel for System Center Virtual Machine Manager. After a lot of badgering (by angry badgers) the knights had caved and installed SCVMM 1807. (That adventure has been chronicled here). As you are probably aware, the gremlins of the blue window are not to be trusted. Come 2019 they changed their minds and pretended to never have mentioned a semi-annual channel. Thus, the knights were left with a soon-to-be unsupported installation and had to come up with a plan of attack. They could only hope for time to implement it before the gremlins changed the landscape again. Maybe a virtual dragon to be slain next time? Or dark wizards? The head knight shuddered, shrugged it off, and went to study the not so secret scrolls of SCVMM updates. They were written in gremlineese and had to be translated to the common tongue before they could be implemented. The gremlins were of the belief that everyone else was living in a soft and cushy wonderland without any walls of fire, application hobbits or networking orcs, and wrote accordingly. Thus, if you just followed their plans you would sooner or later be stuck in an underground dungeon filled with stinky water, without a flotation spell or propulsion device.


Hypervisor not running

Problem

After upgrading my LAB to VMM 1801, and subsequently VMM 1806 (see https://lokna.no/?p=2519), VMs refuse to start on one of my hosts. Event ID 20148 was logged when I tried to create a new VM. I restarted the host in the hope of a quick fix, but the result was that none of the VMs living on this host wanted to boot.

Virtual machine ‘NAME’ could not be started because the hypervisor is not running.


Solution

For some reason the Hypervisor has been disabled. You can check this by running BCDEDIT in an administrative command prompt. The hypervisorlaunchtype should be set to Auto. If it is not, change it by running the following command:

bcdedit /set hypervisorlaunchtype auto
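
You can check the current value from an elevated PowerShell prompt before and after the change; no output means the setting is not present on any of the listed boot entries:

bcdedit /enum | Select-String -Pattern 'hypervisorlaunchtype'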


After that, reboot the host and everything should be running again. Unless, of course, you have a completely different issue preventing your VMs from starting.


Script to migrate VMs back to their preferred node

Situation

You have a set of VMs that are not running on their preferred node due to maintenance or some kind of outage triggering an unscheduled migration. You have set one (and just one) preferred host for each of your VMs. You have done this because you want to balance your VMs manually to guarantee a certain minimum of performance for each VM. By the way, automatic load balancing cannot do that; there will be a lag between a usage spike and any load balancing it triggers. But I digress. The point is, the VMs are not running where they should, and you have not enabled automatic failback because you are afraid of node flapping or other inconveniences that could create problems. Hopefully, though, you have some kind of monitoring in place to tell you that the VMs are restless, so that you can fix the problem and subsequently corral them back onto their designated hosts. Oh, and you are using Virtual Machine Manager. You could do this at the individual cluster level as well, but that would be another script for another day.

If you understand the scenario above and self-identify with at least parts of it, this script is for you. If not, this script could cause all kinds of trouble.

Script Notes

  • You can skip “Connect to VMM Server” and “Add VMM Cmdlets” if you are running this script from the SCVMM PowerShell window.
  • The MoveVMs variable can be set to $false to get a list only. This could be a smart choice for your first run.
  • The script ignores VMs that are not clustered and VMs that do not have a preferred server set.
  • I do not know what will happen if you have more than one preferred server set.

 

The script

 

#######################################################################################################################
#   _____     __     ______     ______     __  __     ______     ______     _____     ______     ______     ______    #
#  /\  __-.  /\ \   /\___  \   /\___  \   /\ \_\ \   /\  == \   /\  __ \   /\  __-.  /\  ___\   /\  ___\   /\  == \   #
#  \ \ \/\ \ \ \ \  \/_/  /__  \/_/  /__  \ \____ \  \ \  __< \ \  __ \  \ \ \/\ \ \ \ \__ \  \ \  __\   \ \  __<   #
#   \ \____-  \ \_\   /\_____\   /\_____\  \/\_____\  \ \_____\  \ \_\ \_\  \ \____-  \ \_____\  \ \_____\  \ \_\ \_\ #
#    \/____/   \/_/   \/_____/   \/_____/   \/_____/   \/_____/   \/_/\/_/   \/____/   \/_____/   \/_____/   \/_/ /_/ #
#                                                                                                                     #
#                                                   https://lokna.no                                                   #
#---------------------------------------------------------------------------------------------------------------------#
#                                          -----=== Elevation required ===----                                        #
#---------------------------------------------------------------------------------------------------------------------#
# Purpose: List VMs that are not running at their preferred host, and migrate them to the correct host.               #
#                                                                                                                     #
#=====================================================================================================================#
# Notes:                                                                                                              #
# There is an option to disable VM migration. If migration is disabled, a list is returned of VMs that are running at #
# the wrong host.                                                                                                     #
#                                                                                                                     #
#######################################################################################################################



$CaptureTime = (Get-Date -Format "yyyy-MM-dd HH:mm:ss")
Write-Host "-----$CaptureTime-----`n"
# Add the VMM cmdlets to the PowerShell session
Import-Module -Name "virtualmachinemanager"

# Connect to the VMM server (replace VMM.Server.Name with your VMM server)
Get-VMMServer -ComputerName VMM.Server.Name | Select-Object Name

#Options
$HostGroup = "All Hosts\SQLMGMT\*" #End this with a star. You can go down to an individual VM. All Hosts\Hostgroup\VM.
$MoveVMs = $true #If you set this to $true, we will try to migrate VMs to their preferred host.
#List VMS in the host group
$VMs = Get-SCVirtualMachine | where { $_.IsHighlyAvailable -eq $true -and $_.ClusterPreferredOwner -ne $null -and $_.HostGroupPath -like $HostGroup }

# Process
Foreach ($VM in $VMs) 
{
    # Get the Preferred Owner and the Current Owner
    $Preferred = Get-SCVirtualMachine $VM.Name | Select-Object -ExpandProperty clusterpreferredowner
    $Current = $VM.HostName
    
    
    # List discrepancies
    If ($Preferred -ne $Current) 
    {
        Write-Host "VM $VM should be running at $Preferred but is running at $Current." -ForegroundColor Yellow
        If ($MoveVMs -eq $true)
        {
            $NewHost = Get-SCVMHost -ComputerName $Preferred.Name
            Write-Host "We are trying to move $VM from  $Current to $NewHost." -ForegroundColor Green
            Move-SCVirtualMachine -VM $VM -VMHost $NewHost|Select-Object ComputerNameString, HostName
        } 
    }
}


Hyper-V VM with VirtualFC fails to start

Problem

This is just a quick note to remember the solution and EventIDs.

The VM fails to start, complaining about failed resources or a resource not being available in Failover Cluster Manager. Analysis of the event log reveals messages related to VirtualFC:

  • EventID 32110 from Hyper-V-SynthFC: ‘VMName’: NPIV virtual port operation on virtual port (WWN) failed with an error: The world wide port name already exists on the fabric. (Virtual machine ID ID)
  • EventID 32265 from Hyper-V-SynthFC: ‘VMName’: Virtual port (WWN) creation failed with a NPIV error(Virtual machine ID ID).
  • EventID 32100 from Hyper-V-VMMS: ‘VMNAME’: NPIV virtual port operation on virtual port (WWN) failed with an unknown error. (Virtual machine ID ID)
  • EventID 1205 from Microsoft-Windows-FailoverClustering: The Cluster service failed to bring clustered role ‘SCVMM VM Name Resources’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

Analysis

The events point in the direction of Virtual Fibre Channel or Fibre Channel issues. After a while we realised that one of the nodes in the cluster did not release the WWN when a VM migrated away from it. Further analysis revealed that the FC driver versions differed between the cluster nodes.


Solution

  • Make sure all cluster nodes are running the exact same driver and firmware for the SAN and network adapters. This is crucial for failovers to function smoothly.
  • To “release” the stuck WWNs you have to reboot the offending node. To figure out which node is holding the WWN you have to consult the FC Switch logs. Or you could just do a rolling restart and restart all nodes until it starts working.
  • I have successfully worked around the problem by removing and re-adding the virtual FC adapters on the VM that is not working. I do not know why this resolved the problem.
  • Another workaround would be to change the WWNs on the virtual FC adapters. You would of course have to make this change on the SAN side as well. The sketch after this list shows how to list a VM’s current WWNs.
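
To list the WWNs a VM is configured with, so you can compare them against what the fabric reports, the Hyper-V PowerShell module can help. Run this on the host that currently owns the VM ('VMName' is a placeholder):

Get-VMFibreChannelHba -VMName 'VMName' | Select-Object VMName, SanName, WorldWidePortNameSetA, WorldWidePortNameSetB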

New CSV Volume does not work, Event ID 5120

Problem

A newly converted Cluster Shared Volume refuses to come online. Cluster Validation passed with flying colours pre-conversion. Looking in the event log you find this:

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 05.06.2016 15:01:31
Event ID: 5120
Task Category: Cluster Shared Volume
Level: Error
Keywords:
User: SYSTEM
Computer: HyperVHostname
Description:
Cluster Shared Volume ‘Volume1’ (‘VMStore1’) has entered a paused state because of ‘(c00000be)’. All I/O will temporarily be queued until a path to the volume is reestablished.

The event is repeated on all nodes.

Analysis

The crafty SAN admins have probably enabled some kind of fancy SAN mirroring on your LUN. If you check, you will probably find twice the usual number of storage paths. A typical SAN has 4 connections per LUN, so you may see 8 paths. Be aware that your results may vary; the point is that you now have more paths than usual. The problem is that you cannot use all of the paths simultaneously. Half of them are for the SAN mirror, and your LUNs are offline at the mirror location. If a failover is triggered on the SAN side, your primary paths go down and your secondary paths come alive. Your poor server knows nothing about this, though; it is only able to register that some of the paths do not work even if they claim to be operative. This confuses Failover Clustering. And if there is one thing Failover Clustering does not like, it is getting confused. As a result, the CSV volume is put in a paused state while it waits for the confusion to clear.
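
To count the paths on a node, mpclaim gives a quick overview. The disk number in the second line is just an example:

mpclaim -s -d       # list all MPIO disks
mpclaim -s -d 4     # show the paths and their states for MPIO disk 4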

Solution

You have to give MPIO permission to verify the claims made by the SAN as to whether or not a path is active. Run the following PowerShell command on all cluster nodes. Be aware that this is a system-wide setting and is activated for all MPIO connections that use the Microsoft DSM.

Set-MPIOSetting -NewPathVerificationState Enabled
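
You can verify the setting with Get-MPIOSetting from the same MPIO module before you reboot:

Get-MPIOSetting | Select-Object PathVerificationState, PathVerificationPeriod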

Then reboot the nodes and all should be well in the realm again.

Windows Server 2012R2 stuck at “Updating your system”

Update 2017.01.24: Several people have reported that adding a blank pending.xml file helps, so I have added it to the list.

Problem

After installing updates through Microsoft Update, the reboot never completes. You can wait for several days, but nothing happens; the process is stuck at X%.


Troubleshooting

I will try to give a somewhat chronological approach to getting your server running again. I experience this issue from time to time, but thankfully it is pretty rare. That also makes it a bit harder to troubleshoot.

Warning: this post contains last-ditch attempts and other dangerous stuff that could destroy your server. Use at your own risk. If you do not understand how to perform the procedures listed below, you should not attempt them on your own. Especially not in production.

First you wait, then you wait some more

Some updates may take a very long time to complete. More so if the server is an underpowered VM. Thus, it is worth letting it roll overnight just in case it is really slow. Another trick is to send a Ctrl+Alt+Del to the server. Sometimes that will cancel the stuck update, allowing the boot sequence to continue.

Then you poke around in the hardware

Hardware errors can cause all kinds of issues during the update process. If you are experiencing this issue on a physical server, check any relevant ILO/IDRAC/IMM/BMC logs, and visit the server to check for warning lights. A quick memory test would also be good, as memory failures are one of the most prevalent physical causes of such problems.

If that does not help, blame the software

It was Windows that got us into this mess in the first place, so surely now is the time to point the finger of blame at the software side?

Try booting into Safe Mode. If you are lucky, the updates will finish installing in Safe Mode, and all you have to do is reboot. If you are unlucky, there are a few ways to make Windows try to roll back the updates (a command sketch follows the list):

  • Delete C:\Windows\WinSxS\pending.xml
  • Create a blank pending.xml file in C:\Windows\WinSxS
  • Run DISM /image:C:\ /cleanup-image /revertpendingactions
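
For reference, the first two bullets translate to roughly the following from an elevated PowerShell prompt. This is a rough sketch: pending.xml is owned by TrustedInstaller, so you may have to take ownership first, and in the recovery environment you may only have plain cmd available.

takeown /f C:\Windows\WinSxS\pending.xml
icacls C:\Windows\WinSxS\pending.xml /grant Administrators:F
Remove-Item C:\Windows\WinSxS\pending.xml
New-Item -Path C:\Windows\WinSxS\pending.xml -ItemType File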


Then reboot. If a boot is successful, see if installing the updates in batches works better. Or just do not patch. Ever. Until you are hacked or something breaks. Just kidding, patching is a necessary evil.

Up a certain creek without a paddle…

If you are unable to enter Safe Mode, chances are the OS is pooched. I have experienced this once on Windows Server 2012 R2. No matter what I did, the system refused to boot. From what I could tell, a pending change was waiting for a rollback that required a reboot, and was thus unable to complete the cycle, preventing the server from booting before it had rebooted. If that sounds crazy, well, it is. Time to re-image and restore from backup. The No. 1 suspect in my case was KB3000850, a composite roll-up containing lots of other updates. This may cause conflicts when Windows Update tries to install the same update twice in the same run, first as part of the roll-up and then as a stand-alone update. This is supposed to work, but it doesn’t always.

You could try the rollback methods listed above in the recovery console. If that does not work, try running sfc /scannow /offbootdir=c:\ /offwindir=c:\windows from the recovery console. Maybe you will get lucky, but most likely you won’t…


On a side note, KB3000850 has been a general irksome pain in the butt. It is best installed from an offline .exe by itself in a dark room at midnight on the day before a full moon while you walk around the console in counter-clockwise circles dressed in a Techno-Mage outfit chanting “Who do you serve, and who do you trust?“.

Slow startup on VMs with Virtual Fibre Channel

Problem

After you enable Virtual Fibre Channel on a Hyper-V VM, it takes forever to start up. In this instance, forever equals about two minutes. The VM is stuck in the Starting… state for most of this period.

Analysis

During startup, the Hypervisor waits for LUNs to appear on the virtual HBA before the VM is allowed to boot the OS. When there are no LUNs defined for a VM, i.e. when you are deploying a new VM, the Hypervisor patiently waits for 90 seconds before it gives up. Thus, startup of the VM is delayed by 90 seconds if there are no LUNs presented to the VM, or if the SAN is down or misconfigured. Event ID 32213 from Hyper-V-SynthFC is logged.


Solution

Depending on your specific cause, one of the following should do the trick:

  • Present some LUNs
  • Remove the Virtual HBA adapters if they are not in use (see the sketch after this list)
  • Correct the SAN config to make sure the VM is able to talk to the SAN
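
If the virtual HBAs really are unused, the second option can be done from PowerShell on the host ('VMName' is a placeholder):

Get-VMFibreChannelHba -VMName 'VMName' | Remove-VMFibreChannelHba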

The search for the missing armor plate

One dark and stormy evening, The Knights of HyperV had trouble getting a new host in contact with The Wasteland of Nexus. One of the armor plates to which the connections were attached was not responding. Actually, it appeared to be missing entirely. Something which was odd, as it had been installed by a trusted minion just days earlier. The Knights sent an expedition to the gates of Hell (otherwise known as Dell CMC) to investigate. The envoys tried to open the gates, but instead of open doors, they were greeted with an error message.

After a lengthy discussion with the insane gatekeeper of LDAP-Auth, the envoys were finally granted access into The Ocean of Known Bugs that lay beyond the gates. A boat carried them over to The Island of iDRAC. The journey was bumpy, and once ashore the envoys wasted several hours recovering from seasickness and the general discomfort caused by the putrid smell of the bugs.

Once recovered, they demanded access to the scrolls of system inventory for the server in question. To the horror of the envoys, the inventory only listed one armor plate, instead of the expected two. Luckily, the existing armor plate was made of the stable Intel-alloy as expected, but the second plate was missing. Could it have been stolen? Perhaps by one of the competing service team minions that dwelled in The Cursed Forests of Sharepoint? They would have to venture into the physical realm of the Hypervisors to find out for sure. The Knights tracked down the minion responsible for the armor plate installation and interrogated her for details. She insisted that the plate had to be there still, pleading to avoid another trip down the long and dangerous road to The Physical Realm of the Hypervisors, and suggested that the problem may be a curse. A spell from the book of dark forbidden magic, putting a veil over the labyrinth of UEFI and thus preventing the armor plate from being seen or used by the server.

Could it be? The knights snorted in disbelief, but as they had no other ideas at the time, they traveled to The Wizard of Badgerville and beseeched him to remove the curse if there was any. The wizard demanded an offering of three sausages and some boiled rice (as he was hungry). After devouring the food, the wizard started walking in circles around the remote console and muttered incomprehensible incantations that were somehow transmitted to the host without him ever touching the keyboard. A long time passed as the knights watched the wizard. At first, they watched in awe. But as time went by, awe turned to glances, glances turned to boredom, until finally the knights were all sound asleep. Sometime later, whether hours or days we do not know, they were snatched away from slumber land. This annoyed The Knights, as they had all been awoken from pleasant dreams about conquering the realm of VMWare.

“The deed is done!” declared The Wizard and vanished into a puff of smoke. The Knights staggered over to the console and were amazed to find that the server was not only able to see the missing armor plate, but it was already connected to the spirit world and jabbering happily with the domain controller. However, who had cursed the server? And why? Could it be the witch with the wardrobe of broken firmware patches? Or the unholy dark riders of VMWare? We may never find out, but if we are lucky, one day another tale will be told about the adventures of The Knights of HyperV…