Upgrade to VMM 2019, another knight’s tale

This is a story in the “Knights of Hyper-V” series, an attempt at humor with actual technical content hidden in the details.

The gremlins of the blue window had been up to their usual antics. In 2018 they promised a semi-annual update channel for System Center Virtual Machine Manager. After a lot of badgering (by angry badgers) the knights had caved and installed SCVMM 1807. (That adventure has been chronicled here.) As you are probably aware, the gremlins of the blue window are not to be trusted. Come 2019 they changed their minds and pretended never to have mentioned a semi-annual channel. Thus, the knights were left with a soon-to-be unsupported installation and had to come up with a plan of attack. They could only hope for time to implement it before the gremlins changed the landscape again. Maybe a virtual dragon to be slain next time? Or dark wizards? The head knight shuddered, shrugged it off, and went to study the not-so-secret scrolls of SCVMM updates. They were written in Gremlinese and had to be translated to the common tongue before they could be implemented. The gremlins were of the belief that everyone else was living in a soft and cushy wonderland without any walls of fire, application hobbits or networking orcs, and wrote accordingly. Thus, if you just followed their plans you would sooner or later be stuck in an underground dungeon filled with stinky water, without a flotation spell or propulsion device.


Scheduled export of the security log

If you have trouble with the log being overwritten before you can read it and do not want to increase the size of the log further, you can use a scheduled PowerShell script to create regular exports. The script below creates CSV files that can easily be imported into a database for further analysis.

The account running the scheduled task needs to be a local admin on the computer.

#######################################################################################################################
#   _____     __     ______     ______     __  __     ______     ______     _____     ______     ______     ______    #
#  /\  __-.  /\ \   /\___  \   /\___  \   /\ \_\ \   /\  == \   /\  __ \   /\  __-.  /\  ___\   /\  ___\   /\  == \   #
#  \ \ \/\ \ \ \ \  \/_/  /__  \/_/  /__  \ \____ \  \ \  __<   \ \  __ \  \ \ \/\ \ \ \ \__ \  \ \  __\   \ \  __<   #
#   \ \____-  \ \_\   /\_____\   /\_____\  \/\_____\  \ \_____\  \ \_\ \_\  \ \____-  \ \_____\  \ \_____\  \ \_\ \_\ #
#    \/____/   \/_/   \/_____/   \/_____/   \/_____/   \/_____/   \/_/\/_/   \/____/   \/_____/   \/_____/   \/_/ /_/ #
#                                                                                                                     #
#                                                   http://lokna.no                                                   #
#---------------------------------------------------------------------------------------------------------------------#
#                                          -----=== Elevation required ===----                                        #
#---------------------------------------------------------------------------------------------------------------------#
# Purpose: Export and store the security event log as csv.                                                            #
#                                                                                                                     #
#=====================================================================================================================#
# Notes: Schedule execution of this script every $captureHrs hours minus the script execution time.                  #
# Test the script to determine the execution time, and add 2 minutes for good measure.                               #
#                                                                                                                     #
# Scheduled task: powershell.exe -ExecutionPolicy ByPass -File ExportSecurityEvents.ps1                               #
#######################################################################################################################

#Config
$path = "C:\log\security\" # Add Path, end with a backslash
$captureHrs = 20 #Capture n hours of data

#Execute
$now = Get-Date
$CaptureTime = Get-Date -Format "yyyyMMddHHmmss" # Timestamp used in the file name
$CaptureFrom = $now.AddHours(-$captureHrs)       # Start of the capture window
$Filename = $path + $CaptureTime + 'Security_log.csv'
$log = Get-EventLog -LogName Security -After $CaptureFrom
$log | Export-Csv $Filename -NoTypeInformation -Delimiter ";"
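
For reference, here is a minimal sketch of how the task could be registered with PowerShell. The script path, task name, interval and service account below are assumptions; adjust them to match your environment, and remember that the account needs local admin rights as noted above.

# Sketch: register the export script as a repeating scheduled task.
# Path, task name and account are placeholders - adjust as needed.
$action = New-ScheduledTaskAction -Execute 'powershell.exe' -Argument '-ExecutionPolicy ByPass -File C:\Scripts\ExportSecurityEvents.ps1'
$trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Hours 20) -RepetitionDuration ([TimeSpan]::MaxValue)
Register-ScheduledTask -TaskName 'Export Security Log' -Action $action -Trigger $trigger -User 'DOMAIN\svc_logexport' -Password 'PlaceholderPassword' -RunLevel Highest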

Message Analyzer: Error loading VAPR.OPN

Problem

After installing and updating Microsoft Message Analyzer, it complains about an error in VAPR.OPN:

“…VAPR.opn(166,46-166,49):  Invalid literal 2.0 : Input string was not in a correct format.”


There is a reference to line 166, characters 46-49 in the VAPR.opn file. This particular OPN is a parser for the App-V protocol. If we open the file and look at the specified location, it does indeed contain the text “2.0”. As far as I can tell, the code in question (2.0 as decimal) tries to convert the string 2.0 to a decimal value and fails. In a broader context, the line appears to attempt a version check of the protocol, as line 16 refers to the protocol as MS-VAPR version 2.0.

Sadly, I do not have the time required to learn the PEF OPN language structure, but it resembles C#. In C#, the code “as decimal” fails to compile, since decimal is not a nullable type: the “as” keyword tries to cast a value into a specific type and returns null if the conversion fails. OPN seems to have a different approach. Just as an experiment, I tried replacing 2.0 with just 2, and voila, the OPN “compiles” and the error message goes away. I do not have any App-V captures to test with, so I cannot guarantee that this will actually work.

Possible solution

  • Open your VAPR.OPN file and navigate to line 166, column 46
  • Replace “2.0” with just “2”
  • Save the file
  • Restart Message Analyzer

Workaround

Just remove the VAPR.OPN file. You will not receive the error message, and your ability to parse the MS-VAPR protocol will also vanish. Then again, you did not really have that ability anyway, as the parser did not compile.

Logical switch uplink profile gone

Problem

When you try to connect a new VM to a logical switch you get a lot of strange error messages related to missing ports or no available switch. The errors seem random.

Analysis

If you check the logical switch properties of an affected host, you will notice that the uplink profile is missing.

If you look at the network adapter properties of an affected VM, you will notice that the Logical Switch field is blank.

This is connected to a WMI problem. Some Windows updates uninstall the VMM WMI MOFs required for the VMM agent to manage the logical switch on the host. See details at MS Tech.

Solution

MOFCOMP to the rescue. Run the following commands in an administrative PowerShell prompt on the affected host. To update VMM you then have to refresh the cluster/node (see the sketch below). Note: some versions use a different path to the MOF files, so verify the path if the command fails.


# Recompile the VMM MOFs
Mofcomp "$env:SystemDrive\Program Files\Microsoft System Center\Virtual Machine Manager\setup\scvmmswitchportsettings.mof"
Mofcomp "$env:SystemDrive\Program Files\Microsoft System Center\Virtual Machine Manager\DHCPServerExtension\VMMDHCPSvr.mof"
# Verify that the VMM WMI classes are back
Get-CimClass -Namespace root/virtualization/v2 -ClassName *vmm*
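
A hedged sketch of the VMM refresh, assuming the VMM cmdlets are available; the host and cluster names are placeholders:

# Refresh a single host in VMM after recompiling the MOFs
Get-SCVMHost -ComputerName "hyperv-node1" | Read-SCVMHost
# Or refresh the whole cluster
Get-SCVMHostCluster -Name "hvcluster1" | Read-SCVMHostCluster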

Reading the cluster log with Message Analyzer

Microsoft Message Analyzer, the successor to Network Monitor, has the ability to read a lot more than just network captures. In this post I will show how you can open a set of cluster logs from a SQL Server Failover Cluster instance. If you are new to Message Analyzer, I recommend that you glance at the Microsoft Message Analyzer operating guide while you read this post for additional information.

Side quest: Basic cluster problem remediation

Remember that the cluster log is a debug log used for analyzing what went wrong AFTER you get it all working again. In most cases your cluster should self-heal, and all you have to do is figure out what went wrong and what you should do differently to prevent it from happening again. If your cluster is still down and you are reading this post, you are on the wrong path.

Below you will find a simplified action plan for getting your cluster back online. I will assume that you have exhausted your normal troubleshooting process to no avail, that your cluster is down and that you do not know why. The type of Failover Cluster is somewhat irrelevant for this action plan.

  • If your cluster has shared storage, call your SAN person and verify that all nodes can access the storage, and that there are no gremlins in the storage and fabric logs.
  • If something works and something does not, restart all nodes one by one. If you cannot restart a node, power cycle it.
  • If nothing works, shut down all nodes, then start one node. Just one.
    • Verify that it has a valid connection to the rest of your environment, both networking and storage if applicable.
    • If you have more than two nodes, start enough nodes to establish quorum, usually a majority of the votes (n/2 + 1).
  • Verify that your hardware is working. Check OOB logs and blinking lights.
  • If the cluster is still not working, run a full cluster validation and correct any errors (see the PowerShell example after this list). If you had errors in the validation report BEFORE the cluster went down, your configuration is not supported, and this is probably the reason for your predicament. Rectify all errors and try again.
  • If you have warnings in your cluster validation report, check each one and decide whether or not to correct it. Some clusters will have warnings by design.
  • If your nodes are virtual, make sure that you are not using VMware Raw Device Mapping. If you are, this is the probable cause of all your problems, both on this cluster and any personal problems you may have. Make the necessary changes to remove RDM.
  • If your nodes are virtual, make sure there are no snapshots/checkpoints. If you find any, remove them. Snapshots/checkpoints left running for > 12 hours may destroy a production cluster.
  • If the cluster is still not working, reformat, reinstall and restore.
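
Cluster validation can also be run from PowerShell, as referenced above. A minimal sketch using the lab node names from the test environment below:

# Full cluster validation, equivalent to the wizard in Failover Cluster Manager
Test-Cluster -Node SQL19-1, SQL19-2
# Run only the storage tests; take the cluster disks offline first
Test-Cluster -Node SQL19-1, SQL19-2 -Include "Storage"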

Prerequisites and test environment

  • A running failover cluster. Any type of cluster will do, but I will use a SQL Server Failover Cluster Instance as a sample.
  • A workstation or server running Microsoft Message Analyzer 1.4 with all the current patches and updates as of March 2019.
  • The cluster nodes in the lab are named SQL19-1 and SQL19-2 and are running Windows Server 2019 with a SQL Server 2019 CTP 2.2 Failover Cluster Instance.
  • To understand this post you need an understanding of how a Windows Failover Cluster works. If you have never looked at a cluster log before, this post will not teach you how to interpret it. https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-2000-server/cc961673(v=technet.10) contains additional information about the cluster log. It is very old but still relevant, and at the time of writing it was the best source of information I could find. There is also an old article about the Resource Hosting Subsystem that may be of use here.

Obtaining the cluster log

  • To get the current cluster log, execute Get-ClusterLog -Destination C:\TEMP -SkipClusterState in an administrative PowerShell window on one of the cluster nodes.
  • Be aware that the time zone in the log file will be Zulu time/GMT. MA should compensate for this.
  • The SkipClusterState option removes a lot of unparseable information from the file. If you are investigating a serious problem you may want to run a separate export without this option.
  • The TimeSpan option limits the log to the last n minutes. I used it to get a smaller sample set for this lab, and so should you if you know which timespan you want to investigate (see the example after this list). You can also add a pre-filter in MA to limit the timespan.
  • You should now have one file for each cluster node in C:\Temp.
  • Copy the files to the machine running Message Analyzer.
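
For example, to generate logs covering only the last four hours:

# Cluster logs for the last 240 minutes, skipping the unparseable cluster state dump
Get-ClusterLog -Destination C:\TEMP -SkipClusterState -TimeSpan 240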

Starting Message Analyzer and loading the logs

  • Open Message Analyzer.
  • Click New Session.
  • Enter a session name.
  • Click the Files button.
  • Add the .log files.
  • Select the Cluster text log configuration.
  • Click Start to start parsing the files.
  • Wait while MA is parsing the files. Parsing time is determined by machine power and the size of the log, but it should normally take tens of minutes, not hours, unless the file is hundreds of megabytes or more.

Filtering unparseable data

  • After MA is done parsing the files, the list looks a little disconcerting. All you see are red error messages.
  • Not to worry though: what you are looking at is just blank lines and other unparseable data from the file, which you can read in the Details pane.
  • It is usually log data that is split across multiple lines in the log file, and headers dividing the different logs included in the file.
  • We can filter out these messages by adding #Timestamp to the filter pane and clicking Apply. This will filter out all messages without a timestamp.

Saving the session

To make the data load faster next time, we can save the parsed data and filter as a session. This will retain the workspace as we left it.

  • Click Save.
  • Select All Messages.
  • Click Save As.
  • Save the .matp file.

Looking for problems

The sample log files contain an incident where the iSCSI storage disappeared. This was triggered by a SAN reboot during a firmware update on a SAN without HA. I will go through some analysis of this issue to show how we can use MA to navigate the cluster logs.

  • To make it easier to read the log, we will add a Grouping Viewer: click New Viewer, Grouping, Cluster Logs.
  • This will give you a Grouping pane on the left. Start by clicking the Collapse All button.
  • Then expand the ERR group and start with the messages without a subcomponent tag. The hexadecimal numbers are the ProcessId of the process writing the error to the log. Usually this is a resource hosting subsystem process.
  • It is pretty clear that we have a storage problem.
  • To check which log contains one of these messages, select a message and look in the Details pane, Properties mode. Scroll down until you find the TraceSource property.
  • To read other messages logged at the same time, switch the Grouping viewer from Filter to Select mode.
  • If we click the same ERR group again, the Analysis Grid view will scroll to the first message in this group and mark all messages in the group.
  • The WARN InfoLevel for the RES SubComponent is also a good place to look for root causes.
  • If you want to see results from one log file only, add *TraceSource == "filename" to the grouping filter.

Connect SQL Server Management Studio to a different AD domain

Problem

  • SSMS is installed on a management workstation in DomainA.
  • The SQL Server is installed on a server in DomainB.
  • There is no trust between DomainA and DomainB.
  • The server is set to use AD authentication only.

Solution

Use runas /netonly to start SSMS.

The netonly switch will cause SSMS to use the alternate credentials for all remote access. This enables access to remote SQL Servers using the supplied credentials, as long as you are able to authenticate to the domain. Network capture tests indicate that you need network access to a domain controller in DomainB from your management workstation for this to work.

  • Run the following command in the folder where SSMS.EXE is installed:
RUNAS /netonly /user:DomainB\user SSMS.EXE
  • Then connect to the server you want to talk to in DomainB as you would if you were running SSMS from a computer in DomainB.
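
For example, for a default SSMS 18 installation (the exact path is an assumption and varies by version):

# The SSMS 18 default installation path is an assumption - adjust for your version
cd "${env:ProgramFiles(x86)}\Microsoft SQL Server Management Studio 18\Common7\IDE"
runas /netonly /user:DomainB\user .\Ssms.exe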

SSMS will indicate that you are running as DomainA\user, but if you run a SELECT SYSTEM_USER command you will see that your commands are executed as DomainB\user. When you open the Connect to Server dialog, the DomainA user will be shown (and greyed out as usual), but you will actually connect as the specified DomainB user.


Be aware that if you want to connect to SQL Servers in several disjointed domains, you will need one window for each account. All of them will appear to be using the DomainA account, so it can get a bit confusing. I recommend connecting to a server right away; that way you can easily identify which domain each window is connecting to.

Update-module fails

Problem

When trying to update a module from PSGallery (PSWindowsUpdate in this case), the package manager claims that the PSGallery repository does not exist with one of the following errors.

  • “Unable to find repository ‘https://www.powershellgallery.com/api/v2/’.”
  • “Unable to find repository ‘PSGallery’.”


Analysis

There seems to be a problem with the URL for the PSGallery repository missing a trailing slash; I could find a lot of posts about this online. If we run Get-PSRepository and compare the output with Get-InstalledModule -Name PSWindowsUpdate | fl, we can see that the URL differs:
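
The comparison can be run like this (the Format-List property selection below is trimmed for readability):

# Compare the registered repository with what is recorded for the installed module
Get-PSRepository
Get-InstalledModule -Name PSWindowsUpdate | Format-List Name, Version, Repository, RepositorySourceLocation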


There is also something wrong with the link between the repository and the package: the Repository property above should say PSGallery, not https://www.powershellgallery.com/api/v2/.

I do not know why or when this happened, but sometime in the second half of 2018 is my best guess, based on the last time we used PSWindowsUpdate to patch the servers in question.

The PackageManagement version installed on these servers was rather old, version 1.0.0.1. Come to think of it, PackageManagement itself moved to PSGallery at some point, but this version predates that, as it is found using Get-Module and not Get-InstalledModule.


Solution

After a long and winding path I have come up with the following action plan:

  • Update the NuGet provider.
  • Uninstall the “problem” module.
  • Unregister the PSGallery repository.
  • Re-register the PSGallery repository.
  • Install the latest version of PowerShellGet from PSGallery.
  • Reinstall the problem module.
  • Reboot to remove the old PackageManagement version.


You can find a sample PowerShell script below, using PSWindowsUpdate as the problem module. If you have multiple PSGallery modules installed, you may have to reinstall all of them.

Install-PackageProvider NuGet -Force                             # Update the NuGet provider
Uninstall-Module PSWindowsUpdate                                 # Remove the problem module
Unregister-PSRepository -Name 'PSGallery'                        # Remove the broken repository registration
Register-PSRepository -Default                                   # Re-register PSGallery with default settings
Get-PSRepository                                                 # Verify the repository URL
Install-Module -Name PowerShellGet -Repository PSGallery -Force  # Install the latest PowerShellGet
Install-Module -Name PSWindowsUpdate -Repository PSGallery -Force -Verbose  # Reinstall the problem module

Failover Cluster: access to update the secure DNS Zone was denied.

Problem

After you have built a cluster, the Cluster Events page fills up with Event ID 1257 from FailoverClustering, complaining about not being able to write the DNS records to AD:

“Cluster network name resource failed registration of one or more associated DNS name(s) because the access to update the secure DNS Zone was denied.

Cluster Network name: X
DNS Zone: Y

Ensure that cluster name object (CNO) is granted permissions to the Secure DNS Zone.”


Solution

There may be other root cause scenarios, but in my case the problem was a static DNS reservation on the domain controller.

As usual, if you do not understand the action plan below, seek help or get educated before you continue. Your friendly local search engine is a nice place to start if you do not have a local cluster expert. This action plan includes actions that will take down parts of your cluster momentarily, so do not perform these steps on a production cluster during peak load. Schedule a maintenance window.

  • Identify the source of the static reservation and apply public shaming and/or pain as necessary to ensure that this does not happen again. Cluster DNS records should be dynamic.
  • Identify the static DNS record in your Active Directory Integrated DNS forward lookup zone. Ask for help from your DNS or AD team if necessary.
  • Delete the static record.
  • Take the cluster network name resource representing the DNS record offline in Failover Cluster Manager (or with PowerShell; see the sketch after this list). Be aware that any dependent resources will also go offline.
  • Bring everything back online. This should trigger a new DNS registration attempt. You could also wait for the cluster to attempt this automatically, but client connections may fail while you are waiting.
  • Verify that the DNS record is created as a dynamic record. It should have a current Timestamp.
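
A minimal PowerShell sketch of the offline/online cycle. The resource name is an assumption; list the network name resources first to find the actual name.

# Find the network name resource
Get-ClusterResource | Where-Object ResourceType -eq "Network Name"
# Cycle the resource; dependent resources go offline too
Stop-ClusterResource -Name "Cluster Name X"
Start-ClusterResource -Name "Cluster Name X"
# Optionally trigger immediate DNS re-registration
Get-ClusterResource -Name "Cluster Name X" | Update-ClusterNetworkNameResource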

Identify which drive \Device\Harddisk#\DR# represents

Problem

You get an error message stating that there is some kind of problem with \Device\Harddisk#\DR#, for instance Event ID 11 from Disk: The driver detected a controller error on \Device\Harddisk4\DR4.


The disk number referenced in the error message does not necessarily correspond to the disk id numbers in Disk Management. To figure out which disk is referenced, some digging is required.

Solution

There may be several ways to identify the drive. In this post I will show the WinObj-method, as it is the only one that has worked consistently for me.

  • First, get a hold of WinObj from http://sysinternals.com 
  • Run WinObj as admin
  • Browse to \Device\Harddisk#. We will use Harddisk4\DR4 as a sample from here on out, but you should of course replace that with the numbers from your error message.


  • Look at the SymLink column to identify the entry from the error message.


  • Go to the GLOBAL folder and sort by the SymLink column.
  • Scroll down to the \Device\Harddisk4\DR4 entries
  • You will find several entries, some for the drive and some for the volume or volumes.


  • The most interesting one in this example is the drive letter D for Volume 5 (the only volume on this drive).
  • Looking in Disk Management we can identify the drive; in this case it was an empty SD card slot. We also see that the disk number and the DR number are both 4, but there is no definitive guarantee that these numbers are equal.

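As an additional cross-check (not part of the WinObj method), you can list the physical drive objects from PowerShell; \\.\PHYSICALDRIVE# normally corresponds to \Device\Harddisk#:

# List physical drives with index, model and serial number.
# \\.\PHYSICALDRIVE4 would normally correspond to \Device\Harddisk4.
Get-CimInstance Win32_DiskDrive | Sort-Object Index | Select-Object DeviceID, Index, Model, SerialNumber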

Most likely, the event was caused by an improper removal of the SD card. As this is a server, and the SD card slot could be a virtual device connected to the out-of-band IPMI/iLO/DRAC/IMM management chip, the message could also be related to a restart or upgrade of the out-of-band chip. In any case, there is no threat to our important data, which is hopefully not located on an SD card.

If you receive this error on a SAN HBA or local RAID controller, check the management software for your device. You may need a firmware upgrade, or you could have a slightly defective drive that is about to go out in flames.

iSCSI in the LAB

I sometimes run internal Windows Failover Clustering training, and I am sometimes asked “How can I test this at home when I do not have a SAN?”. As you may know, even though clusters without shared storage are indeed possible in the current year, some cluster types still rely heavily on shared storage. When it comes to SQL Server clusters for instance, a Failover Cluster Instance (which relies on shared storage) is a lot easier to operate than an AOAG cluster, which does not rely on shared storage. There are a lot of possible solutions to this problem. You could for instance use your home NAS as an iSCSI “SAN”, as many popular NAS boxes have iSCSI support. In this post, however, I will focus on how to build a Windows Server 2019 VM with an iSCSI target for LAB usage. This is NOT intended as a guide for creating a production iSCSI server. It is most definitely a bad iSCSI server with poor performance, not suited for anything requiring production-level performance.

“What will I need to do this?” I hear you ask. You will need a Windows Server 2019 VM with some spare storage mounted as a secondary drive, a domain controller VM and some cluster VMs. You could also use a physical machine if you want, but usually this kind of setup involves one physical machine running a set of VMs. In this setup I have four VMs:

  • DC03, the domain controller
  • SQL19-1 and SQL19-2, the cluster nodes for a SQL 2019 CTP 2.2 failover cluster instance
  • iSCSI19, the iSCSI server.

The domain controller and SQL Servers are not covered in this post. See my Failover Cluster series for information on how to build a cluster.

Action plan

  • Make sure that your domain controller and iSCSI client VMs are running.
  • Make sure that the Failover Clustering feature is installed on the client VMs.
  • Enable the iSCSI service on the cluster nodes that are going to use the iSCSI server. All you have to do is start the iSCSI Initiator program; it will ask you to enable the iSCSI service.

  • Create a Windows 2019 Server VM or physical server.
  • Add it to your lab domain.
  • Set a static IP, both v4 and v6. This is important for stability; iSCSI does not play well with DHCP. In fact, all servers involved should have static IPs to ensure that the iSCSI storage is reconnected properly at boot.
  • Install the iSCSI target feature using powershell.
    • Install-WindowsFeature FS-iSCSITarget-Server
  • Add a virtual hard drive to serve as storage for the iSCSI volumes.
  • Initialize the drive, add a volume and format it. Assign the drive letter I: (this can also be scripted; see the sketch at the end of this post).


  • Open Server Manager and navigate to File and Storage Services, iSCSI
  • Start the “New iSCSI virtual disk wizard”
  • Select the I: drive as a location for the new virtual disk


  • Select a name for your disk. Here, I have called it SQL19_Backup01.

  • Set a disk size in accordance with your needs. As this is for a LAB environment, select a Dynamically expanding disk type.

  • Create a new target. I used SQL19Cluster as the name for the target.
  • Add the cluster nodes as initiators. You should have enabled iSCSI on the cluster nodes before this step (see above). The cluster nodes also have to be online for this to be successful.

  • As this is a lab, we skip the authentication setup.

  • Done!

  • On the cluster nodes, start the iSCSI Initiator.
  • Input the name of the iSCSI target and click the quick connect button.

  • Initialize, online, and format the disk on one of the cluster nodes.

  • Run storage validation in Failover Cluster Manager to verify your setup. Make sure that your disk or disks are part of the storage validation. You may have to add the disk to the cluster and set it offline for validation to work.
  • Check that the disk fails over properly to the other node. It should appear with the same drive letter or mapped folder on both nodes, but only on one node at a time unless you convert it to a Cluster Shared Volume. (Cluster Shared Volumes are for Hyper-V clusters.)
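
The disk preparation and wizard steps above can also be scripted. Below is a hedged sketch, not a tested recipe: the disk number, file names, initiator IQNs and portal address are assumptions that must be adjusted to your lab.

# On the iSCSI server: prepare the data drive (assumes the new drive is disk 1)
Initialize-Disk -Number 1 -PartitionStyle GPT
New-Partition -DiskNumber 1 -UseMaximumSize -DriveLetter I | Format-Volume -FileSystem NTFS

# Create the virtual disk and target; the names match the lab, the IQNs are assumptions
New-IscsiVirtualDisk -Path "I:\SQL19_Backup01.vhdx" -SizeBytes 100GB
New-IscsiServerTarget -TargetName "SQL19Cluster" -InitiatorIds "IQN:iqn.1991-05.com.microsoft:sql19-1.lab.local","IQN:iqn.1991-05.com.microsoft:sql19-2.lab.local"
Add-IscsiVirtualDiskTargetMapping -TargetName "SQL19Cluster" -Path "I:\SQL19_Backup01.vhdx"

# On each cluster node: connect to the target (portal address is an assumption)
New-IscsiTargetPortal -TargetPortalAddress "iscsi19.lab.local"
Connect-IscsiTarget -NodeAddress (Get-IscsiTarget).NodeAddress -IsPersistent $true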