Cluster validation fails: Culture is not supported

Problem

The Failover Cluster validation report shows an error for Inventory, List Operating System Information:

An error occurred while executing the test. There was an error getting information about the operating systems on the node. Culture is not supported. Parameter name: culture 3072 (0x0c00) is an invalid culture identifier.

If you look at the summary for the offending node, you will find that Locale and Pagefiles are missing.

Analysis

Clearly there is something wrong with the locale settings on the offending node. As the sample shows, the locale is set to nb-NO (Norwegian, Norway), and I immediately suspected that to be the culprit. Most testing is done on en-US, and the rest of us who want a sane 24-hour clock without Latin abbreviations, and a date with the month where it should be, usually have to suffer.

I was unable to determine exactly where the badger was buried, but the solution was simple enough.

Solution

Part 1

  • Make sure that the Region & language and Date & Time settings (modern settings) are set correctly on all nodes. Be aware of differences between the offending node and working nodes.
  • Make sure that the System Locale is set correctly in the Control Panel, Region, Administrative window.
  • Make sure that Windows Update works and is enabled on all nodes.
  • Check the Languages list under Region & Language (modern settings). If it flashes “Windows update” under one or more of the languages, you still have a Windows Update problem or an internet access problem.
  • Try to validate the cluster again. If the error still appears, continue with the next step.
  • Run Windows Update and Microsoft Update on all cluster nodes.
  • Restart all cluster nodes.
  • Try to validate the cluster again. If the error still appears, go to part 2.

Part 2

  • Make sure that you have completed part 1.
  • Go to Control Panel, Region.
  • In the Region dialog, locate Formats, Format.
  • Change the format to English (United States). Be sure to select the correct English locale.
  • Click OK.
  • Run the validation again. It should be successful.
  • Change the format back.
  • Run the validation again. It should be successful.

If you are still not successful, there is something seriously wrong with your operating system. I have not yet had a case where the above steps do not resolve the problem, but I suspect that running chkdsk, sfc /scannow or a full node re-installation would be next. Look into the rest of the validation report for clues to other problems.
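
If you want to compare the settings across nodes before resorting to the GUI, the sketch below may help. It is a minimal example assuming PowerShell remoting is enabled and the International cmdlets (Get-Culture, Get-WinSystemLocale) are available; the node names are placeholders, and the remoting session may not reflect the per-user settings of the console user, so treat the output as a hint rather than the full picture.

# Compare per-user culture and system locale across the cluster nodes (node names are examples)
$nodes = 'ClusterNode1', 'ClusterNode2'
Invoke-Command -ComputerName $nodes -ScriptBlock {
    [pscustomobject]@{
        Node         = $env:COMPUTERNAME
        Culture      = (Get-Culture).Name
        SystemLocale = (Get-WinSystemLocale).Name
    }
} | Format-Table Node, Culture, SystemLocale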

A new (and improved?) wasteland

This is a story in the “Knights of Hyper-V” series, an attempt at humor with actual technical content hidden in the details. This particular one is just for fun though. Any resemblance to actual trademarks, people or events (real or those that can only be found residing inside your mind) is purely coincidental and should be disregarded.

The knights of Hyper-V were doing some spring cleaning. Or, it was actually summer and thus too late in the year to call it spring cleaning anymore. Project setbacks, slow equipment deliveries and the plague-that-shall-not-be-named had severely hampered progress. But finally, the day had arrived to replace some of the hard-working VMs with fresh new ones, running updated software versions glistening in the summer sun. Or covered in the more gloomy, but oh so common, summer rain. And perhaps snow, locusts or other more or less funny local phenomena.

The old servers were not really all that old, but a change in networking politics had ushered in an early swap-over. We were leaving the Wasteland of Nexus for a new and supposedly better (and cheaper) Wasteland. With software-defined wasteland processors or something to that effect. The knights did not really care; all they knew was that new network armor plate connections were required, and that was always a pain in the backside. The application minions would be grumpy, as they would have to write scroll after scroll of requests beseeching for safe paths through the walls of fire.

But enough of the backstory. After a long, looong time the imposed quest for a new wasteland was nearing its end, and it was time for cleanup. Most of the knights were finally on summer vacation, preparing to queue along the congested paths to the beach, waiting in line to look over a cliff, visiting distant relations or hiding in a deep dungeon to escape the aforementioned plague. Only a skeleton crew (not composed of actual skeletons this time) remained to watch over the systems and do the odd cleanup job. A passing minstrel wrote an ode to one of the old servers in exchange for a late breakfast, or early lunch depending on your point of view:

Ode to server sixteen

New servers come in, and old ones get phased out.
It exists now only as a memory
Vanished into thin air
Like a fleeting ghost in the machine
Binary code rearranged to form new beginnings
It will always remain in our hearts

For the time being, all was well in the kingdom. All the VMs were kept in line by the automated all-seeing eye of OM, and it was time to relax, read, and practice dragon slaying if one was so inclined. Till next time, enjoy your life such as it is. Remember, things could always be worse. Before you know it, the roars of a three-headed Application bug dragon and the distant horrified screams of application team minions could wake you from your slumber…

Upgrade to VMM 2019, another knight’s tale

This is a story in the “Knights of Hyper-V” series, an attempt at humor with actual technical content hidden in the details.

The gremlins of the blue window had been up to their usual antics. In 2018 they promised a semi-annual update channel for System Center Virtual Machine Manager. After a lot of badgering (by angry badgers) the knights had caved and installed SCVMM 1807. (That adventure has been chronicled here.) As you are probably aware, the gremlins of the blue window are not to be trusted. Come 2019 they changed their minds and pretended never to have mentioned a semi-annual channel. Thus, the knights were left with a soon-to-be unsupported installation and had to come up with a plan of attack. They could only hope for time to implement it before the gremlins changed the landscape again. Maybe a virtual dragon to be slain next time? Or dark wizards? The head knight shuddered, shrugged it off, and went to study the not so secret scrolls of SCVMM updates. They were written in gremlineese and had to be translated to the common tongue before they could be implemented. The gremlins were of the belief that everyone else was living in a soft and cushy wonderland without any walls of fire, application hobbits or networking orcs, and wrote accordingly. Thus, if you just followed their plans you would sooner or later be stuck in an underground dungeon filled with stinky water without a floatation spell or propulsion device.

Continue reading “Upgrade to VMM 2019, another knight’s tale”

Scheduled export of the security log

If you have trouble with the log being overwritten before you can read it and do not want to increase the size of the log further, you can use a scheduled PowerShell script to create regular exports. The script below creates csv files that can easily be imported to a database for further analysis.

The account running the scheduled task needs to be a local admin on the computer.

#######################################################################################################################
#   _____     __     ______     ______     __  __     ______     ______     _____     ______     ______     ______    #
#  /\  __-.  /\ \   /\___  \   /\___  \   /\ \_\ \   /\  == \   /\  __ \   /\  __-.  /\  ___\   /\  ___\   /\  == \   #
#  \ \ \/\ \ \ \ \  \/_/  /__  \/_/  /__  \ \____ \  \ \  __<   \ \  __ \  \ \ \/\ \ \ \ \__ \  \ \  __\   \ \  __<   #
#   \ \____-  \ \_\   /\_____\   /\_____\  \/\_____\  \ \_____\  \ \_\ \_\  \ \____-  \ \_____\  \ \_____\  \ \_\ \_\ #
#    \/____/   \/_/   \/_____/   \/_____/   \/_____/   \/_____/   \/_/\/_/   \/____/   \/_____/   \/_____/   \/_/ /_/ #
#                                                                                                                     #
#                                                   https://lokna.no                                                   #
#---------------------------------------------------------------------------------------------------------------------#
#                                          -----=== Elevation required ===----                                        #
#---------------------------------------------------------------------------------------------------------------------#
# Purpose:Export and store the security event log as csv.                                                             #
#                                                                                                                     #
#=====================================================================================================================#
# Notes: Schedule execution of this script every ($captureHrs hours - script execution time).                        #
# Test the script to determine the execution time, add 2 minutes for good measure.                                    #
#                                                                                                                     #
# Scheduled task: powershell.exe -ExecutionPolicy ByPass -File ExportSecurityEvents.ps1                               #
#######################################################################################################################

#Config
$path = "C:\log\security\" # Add Path, end with a backslash
$captureHrs = 20 #Capture n hours of data

#Execute
$now = Get-Date
$CaptureTime = Get-Date -Format "yyyyMMddHHmmss"
$CaptureFrom = $now.AddHours(-$captureHrs)
$Filename = $path + $CaptureTime + 'Security_log.csv'
$log = Get-EventLog -LogName Security -After $CaptureFrom
$log | Export-Csv $Filename -NoTypeInformation -Delimiter ";"
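
Below is a hedged sketch of how the export could be registered as a scheduled task using the ScheduledTasks cmdlets. The script path, repetition interval and account are examples only; adjust them to match your environment and the $captureHrs setting above, and remember that the account must be a local admin.

# Register the export script as a scheduled task repeating every 20 hours (matches $captureHrs)
$action = New-ScheduledTaskAction -Execute 'powershell.exe' -Argument '-ExecutionPolicy ByPass -File C:\Scripts\ExportSecurityEvents.ps1'
$trigger = New-ScheduledTaskTrigger -Once -At (Get-Date).Date -RepetitionInterval (New-TimeSpan -Hours 20) -RepetitionDuration (New-TimeSpan -Days 3650)
# Placeholder account and password; use a dedicated service account in practice
Register-ScheduledTask -TaskName 'Export Security Log' -Action $action -Trigger $trigger -User 'DOMAIN\LogExportAccount' -Password 'ChangeMe' -RunLevel Highest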

Message Analyzer: Error loading VAPR.OPN

Problem

After installing and updating Microsoft Message Analyzer, it complains about an error in VAPR.OPN:

“…VAPR.opn(166,46-166,49):  Invalid literal 2.0 : Input string was not in a correct format.”


There is a reference to line 166, character 46-49 in the VAPR.opn file. This particular opn is a parser for the App-V protocol. If we open the file and look at its contents at the specified location, it does indeed contain the text “2.0”. As far as I can tell, the code in question (2.0 as decimal) tries to convert the string 2.0 to a decimal value and fails. In a broader context, the line appears to attempt a version check of the protocol, as line 16 refers to the protocol as MS-VAPR version 2.0.

Sadly I do not have the time required to learn the PEF OPN language structure, but it resembles C#. In C# the code “as decimal” fails to compile, since decimal is not a nullable type; the “as” keyword tries to cast a value into a specific type and returns null if the conversion fails. OPN seems to have a different approach. Just as an experiment, I tried replacing 2.0 with just 2, and voila, the OPN “compiles” and the error message goes away. I do not have any App-V captures to test with, so I cannot guarantee that this will actually work.

Possible solution

  • Open your VAPR.OPN file and navigate to line 166, column 46
  • Replace “2.0” with just “2”
  • Save the file
  • Restart Message Analyzer

Workaround

Just remove the VAPR.OPN file. You will not receive the error message, and your ability to parse the MS-VAPR protocol will also vanish. Though you did not really have that ability anyway as the parser did not compile.
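
If you prefer to script the workaround, a rough sketch is shown below. The Message Analyzer configuration folder under %LocalAppData% is an assumption; locate your copy of VAPR.opn first if it lives somewhere else.

# Find the parser and rename it out of the way instead of deleting it outright (the root path is an assumption)
Get-ChildItem "$env:LOCALAPPDATA\Microsoft\MessageAnalyzer" -Recurse -Filter VAPR.opn |
    Rename-Item -NewName { $_.Name + '.bak' }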

Logical switch uplink profile gone

Problem

When you try to connect a new VM to a logical switch you get a lot of strange error messages related to missing ports or no available switch. The errors seem random.

Analysis

If you check the logical switch properties of an affected host, you will notice that the uplink profile is missing:


If you look at the network adapter properties of an affected VM, you will notice that the Logical Switch field is blank:


This is connected to a WMI problem. Some Windows updates uninstall the VMM WMI MOFs required for the VMM agent to manage the logical switch on the host. See details at MS Tech.

Solution

MOFCOMP to the rescue. Run the following commands in an administrative PowerShell prompt. To update VMM you have to refresh the cluster/node afterwards (see the sketch after the commands). Note: Some VMM versions use a different path to the MOF files, so verify the path if the command fails.

# Recompile the VMM MOFs (the path may differ between VMM versions)
Mofcomp "$env:SystemDrive\Program Files\Microsoft System Center\Virtual Machine Manager\setup\scvmmswitchportsettings.mof"
Mofcomp "$env:SystemDrive\Program Files\Microsoft System Center\Virtual Machine Manager\DHCPServerExtension\VMMDHCPSvr.mof"
# Verify that the VMM WMI classes are back
Get-CimClass -Namespace root/virtualization/v2 -ClassName *vmm*
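
To refresh the hosts from VMM afterwards, something along the lines of the sketch below should do. This is an assumption-laden example that presumes the VirtualMachineManager PowerShell module is available on the VMM server and uses a placeholder cluster name.

# Refresh all hosts in the affected cluster so VMM re-reads the switch configuration
$cluster = Get-SCVMHostCluster -Name 'HyperVCluster01'
Get-SCVMHost -VMHostCluster $cluster | Read-SCVMHost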

Reading the cluster log with Message Analyzer

Microsoft Message Analyzer, the successor to Network Monitor, has the ability to read a lot more than just network captures. In this post I will show how you can open a set of cluster logs from a SQL Server Failover Cluster instance. If you are new to Message Analyzer, I recommend that you glance at the Microsoft Message Analyzer operating guide for additional information while you read this post.

Side quest: Basic cluster problem remediation

Remember that the cluster log is a debug log used for analyzing what went wrong AFTER you get it all working again. In most cases your cluster should self-heal, and all you have to do is figure out what went wrong and what you should do differently to prevent it from happening again. If your cluster is still down and you are reading this post, you are on the wrong path.

Below you will find a simplified action plan for getting your cluster back online. I will assume that you have exhausted your normal troubleshooting process to no avail, that your cluster is down and that you do not know why. The type of Failover Cluster is somewhat irrelevant for this action plan.

  • If your cluster has shared storage, call your SAN person and verify that all nodes can access the storage, and that there are no gremlins in the storage and fabric logs.
  • If something works and something does not, restart all nodes one by one. If you cannot restart a node, power cycle it.
  • If nothing works, shut down all nodes, then start one node. Just one.
    • Verify that it has a valid connection to the rest of your environment, both networking and storage if applicable.
    • If you have more than two nodes, start enough nodes to establish quorum, usually a majority of the nodes (n/2 + 1).
  • Verify that your hardware is working. Check OOB logs and blinking lights.
  • If the cluster is still not working, run a full cluster validation and correct any errors (you can run the validation with Test-Cluster from PowerShell; see the sketch after this list). If you had errors in the validation report BEFORE the cluster went down, your configuration is not supported and this is probably the reason for your predicament. Rectify all errors and try again.
  • If you have warnings in your cluster validation report, check each one and make a decision whether or not to correct it. Some clusters will have warnings by design.
  • If your nodes are virtual, make sure that you are not using VMWare Raw Device Mapping. If you are, this is the probable cause of all your problems, both on this cluster and any personal problems you may have. Make the necessary changes to remove RDM.
  • If your nodes are virtual, make sure there are no snapshots/checkpoints. If you find any, remove them. Snapshots/checkpoints left running for > 12 hours may destroy a production cluster.
  • If the cluster is still not working, reformat, reinstall and restore.
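
For the status checks and the validation step above, a minimal PowerShell sketch is shown below (run it in an administrative prompt on one of the nodes; the FailoverClusters module is assumed to be present):

# Quick triage: node states and resources that are not online
Get-ClusterNode | Format-Table Name, State
Get-ClusterResource | Where-Object State -ne 'Online' | Format-Table Name, State, OwnerGroup
# Full cluster validation; writes an html report to the default report folder
Test-Cluster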

Prerequisites and test environment

  • A running failover cluster. Any type of cluster will do, but I will use a SQL Server Failover Cluster Instance as a sample.
  • A workstation or server running Microsoft Message Analyzer 1.4 with all the current patches and updates as of March 2019.
  • The cluster nodes in the lab are named SQL19-1 and SQL19-2 and are running Windows Server 2019 with a SQL Server 2019 CTP 2.2 Failover Cluster Instance.
  • To understand this post you need an understanding of how a Windows Failover Cluster works. If you have never looked at a cluster log before, this post will not teach you how to interpret it. https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-2000-server/cc961673(v=technet.10) contains additional information about the cluster log. It is very old but still relevant, and at the time of writing it was the best source of information I could find. There is also an old article about the Resource Hosting Subsystem that may be of use here.

Obtaining the cluster log

  • To get the current cluster log, execute Get-ClusterLog -Destination C:\TEMP -SkipClusterState in an administrative PowerShell window on one of the cluster nodes.
  • Be aware that the time zone in the log file will be Zulu time/GMT. MA should compensate for this.
  • The SkipClusterState option removes a lot of unparseable information from the file. If you are investigating a serious problem you may want to run a separate export without this option.
  • The TimeSpan option limits the log timespan. I used it to get a smaller sample set for this lab, and so should you if you know what timespan you want to investigate (a combined example follows this list). You can also add a pre-filter in MA to limit the timespan.
  • You should now have one file for each cluster node in C:\Temp.
  • Copy the files to the machine running Message Analyzer.
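
Putting the options above together, a hedged example of pulling a time-limited log and copying it to the analysis machine could look like the following; the destination path, file share and timespan are examples only, and TimeSpan is specified in minutes.

# Generate cluster logs for the last 4 hours on all nodes, skipping the cluster state dump
Get-ClusterLog -Destination C:\Temp -TimeSpan 240 -SkipClusterState
# Copy the per-node log files to the machine running Message Analyzer (the share is an example)
Copy-Item -Path C:\Temp\*.log -Destination \\MA-Workstation\ClusterLogs\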

Starting Message Analyzer and loading the logs

  • Open Message Analyzer.
  • Click New Session.
  • Enter a session name.
  • Click the Files-button.
  • Add the .log files.
  • Select the Cluster text log configuration.
  • Click Start to start parsing the files.
  • Wait while MA is parsing the files. Parsing time is determined by machine power and the size of the log, but it should normally take tens of minutes, not hours unless the file is hundreds of megabytes or more.

Filtering unparseable data

  • After MA is done parsing the file, the list looks a little disconcerting. All you see are red error messages:
  • Not to worry though, what you are looking at is just blank lines and other unparseable data from the file. You can read the unparseable data in the Details pane:
  • It is usually log data that is split over multiple lines in the log file, and headers dividing the different logs included in the file. A message similar to the sample above looks like this in the log file:
  • We can filter out these messages by adding #Timestamp to the filter pane and clicking Apply. This will filter out all messages without a timestamp.

Saving the session

To make the data load faster next time, we can save the parsed data and filter as a session. This will retain the workspace as we left it.

  • Click Save.
  • Select All Messages.
  • Click Save As.
  • Save the .matp file.

Looking for problems

The sample log files contain an incident where the iSCSI storage disappeared. This was triggered by a SAN reboot during a firmware update on a SAN without HA. I will go through some analysis of this issue to show how we can use MA to navigate the cluster logs.

  • To make it easier to read the log, we will add a Grouping Viewer. Click New Viewer, Grouping, Cluster Logs:
  • This will give you a Grouping pane on the left. Start by clicking the Collapse All button:
  • Then expand the ERR group and start with the messages without a subcomponent tag. The hexadecimal numbers are the ProcessId of the process writing the error to the log. Usually this is a resource hosting subsystem process.
  • It is pretty clear that we have a storage problem:
  • To check which log contains one of these messages, select one message and look in the Details pane, Properties mode. Scroll down until you find the TraceSource property:
  • To read other messages logged at the same time, switch the Grouping viewer from Filter to Select mode:
  • If we click the same ERR group again, the Analysis Grid view will scroll to the first message in this group and mark all messages in the group.
  • The WARN InfoLevel for the RES SubComponent is also a good place to look for root causes:
  • If you want to see results from one log file only, add *TraceSource == "filename" to the grouping filter.

Connect SQL Server Management Studio to a different AD domain

Problem

  • SSMS is installed on a management workstation in DomainA.
  • The SQL Server is installed on a server in DomainB.
  • There is no trust between DomainA and DomainB.
  • The server is set to use AD Authentication only.

Solution

Use runas /netonly to start SSMS.

The netonly switch will cause SSMS to use alternate credentials for all remote access. This will enable access to remote SQL Servers using the supplied credentials, as long as you are able to authenticate to the domain. Network capture tests indicate that you need network access to a domain controller in DomainB from your management workstation for this to work.

  • Run the following command in the folder where SSMS.EXE is installed (or use the full path, as shown after this list):
RUNAS /netonly /user:DomainB\user SSMS.EXE
  • Then connect to the server you want to talk to in DomainB as you would if you were running SSMS from a computer in DomainB.
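
If you would rather not change directory first, the same command works with the full path to SSMS. The path below is the default install location for SSMS 18 and is only an example; adjust it to your version:

RUNAS /netonly /user:DomainB\user "C:\Program Files (x86)\Microsoft SQL Server Management Studio 18\Common7\IDE\Ssms.exe"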

SSMS will indicate that you are running as DomainA\user, but if you run a SELECT SYSTEM_USER command you will see that your commands are executed as DomainB\user. When you open the Connect to Server dialog, the DomainA user will be shown (and greyed out as usual), but you will actually connect as the specified DomainB user.


Be aware that if you want to connect to SQL Servers in several disjointed domains, you will need one window for each account. All of them will appear to be using the DomainA account, so it can get a bit confusing. I recommend connecting to a server immediately in each window; that way you should be able to easily identify which domain the window is connecting to.

Sepura 300-00384 6 by 6 charger water damage

The Sepura 300-00384 is a 12 bay charger for the Sepura STP series of Tetra Radios. It has 6 bays for charging of loose batteries, and 6 bays for direct charging of a radio with battery.

Problem

Someone has spilled water in one of the charging slots, or inserted a very wet battery for charging, and the charging slot is now defective. This model of charger is not very rugged despite its extortionate price of around 850 USD (not including the power cable, of all things). It has a layer of conformal coating on the PCBs, but the chassis is made of a very cheap type of plastic. According to the markings the electronics are made by a company called Powersolve Electronics Ltd. But all things being equal, I manage four of these, and all of them have survived several years of rugged gear-destroying search and rescue volunteers without issue so far, so they must have done something right.

Analysis

The green death and its preceding cousin, white water damage, were detected upon inspection of the battery connector. One of the pins has also been bent out of shape, and the black carbon deposit left on it indicates that it has created a short circuit at some point.


Some disassembly is in order to inspect the rest of the board. The charger is held together by a million screws hidden under some of the abundant rubber feet. That is, there are screws hidden under all the rubber feet, but only some of them can be removed at this stage, so do not bother removing all of the rubber feet. The red arrows on the picture below point to the screws you need to remove to be able to take off the bottom plate. They are standard PH2 coarse-thread screws for plastics, the type you will find in most cheap and some expensive electronic devices.


The screws located underneath the rubber feet that were not removed above cannot be unscrewed until you have removed the bottom plate.

The charging electronics are split across seven circuit boards: six charging circuits, each controlling one battery slot and one radio slot, and one power distribution board (PDB). To remove and inspect our defective slot board we have to loosen the power distribution board and remove the screws holding the charging board. There is a power cable connecting each charging board to the PDB, and it is accessible underneath the PDB when you lift it slightly. Disconnect the power cable BEFORE you remove the charging module. The cable has a connector at both ends, and I found that disconnecting the PDB end was easiest.


I disconnected all of the power cables to inspect the PDB for damage on both sides.


Closer inspection of the faulty module reveals water ingress beyond the externally exposed pins, and what looks like the remains of short-circuit arcing, in addition to what could be detected from the outside. The pins were shorted due to the buildup of green death.

Repair

  • The green death was removed using electronics cleaner (any quick-evaporating type will do), some q-tips and the back of a small knife/scalpel. Be careful not to damage the conformal coating.
  • The carbon deposits on the pin were removed with a scalpel and a small flat-head screwdriver.
  • Some small square-head pliers were used to straighten out the bent pin and make it look like the others. Most small lightweight pliers for electronics use should work, but avoid using large radio pliers as you could easily destroy the pins.
  • The arc damage was polished off with electronics cleaner and a lint free cloth.
  • A small dab of Fluid Film NAS was added to the base of the pins to prevent any further corrosion.

Assembly and testing

Be careful not to bend any of the pins when re-inserting the charging module. Remember you have five pins in total: three battery pins, and two radio pins at the other end of the board.


The image above shows the charger during final testing. Both the battery and radio slots are working properly now.

Update-module fails

Problem

When trying to update a module from PSGallery (PSWindowsUpdate in this case), the package manager claims that the PSGallery repository does not exist with one of the following errors.

  • “Unable to find repository ‘https://www.powershellgallery.com/api/v2/’.”
  • “Unable to find repository ‘PSGallery’.”

 

Analysis

There seems to be a problem with the URL for the PSGallery repository missing a trailing slash, as I could find a lot of posts about this online. If we do a Get-PSRepository and compare it with Get-InstalledModule -Name PSWindowsUpdate | fl, we can see that the URL differs:

 


There is also something wrong with the link between the repository and the package; the repository line above should say PSGallery, not https://www.powershellgallery.com/api/v2/.

I do not know why or when this happened, but sometime in the second half of 2018 is my best guess, based on the last time we patched the servers in question using PSWindowsUpdate.

The PackageManagement version installed on these servers was rather old, version 1.0.0.1. Come to think of it, PackageManagement itself moved to PSGallery at some point, but this version predates that, as it is found using Get-Module and not Get-InstalledModule:


Solution

After a long and winding path I have come up with the following action plan:

  • Update the NuGet provider.
  • Uninstall the “problem” module.
  • Unregister the PSGallery repository.
  • Re-register the PSGallery repository.
  • Install the latest version of PowerShellGet from PSGallery.
  • Reinstall the problem module.
  • Reboot to remove the old PackageManagement version.

 

You can find a sample PowerShell script using PSWindowsUpdate as the problem module below. If you have multiple PSGallery modules installed, you may have to re-install all of them.

# Update the NuGet provider
Install-PackageProvider -Name NuGet -Force
# Uninstall the problem module
Uninstall-Module -Name PSWindowsUpdate
# Unregister and re-register the PSGallery repository, then verify the registration
Unregister-PSRepository -Name 'PSGallery'
Register-PSRepository -Default
Get-PSRepository
# Install the latest PowerShellGet from PSGallery and reinstall the problem module
Install-Module -Name PowerShellGet -Repository PSGallery -Force
Install-Module -Name PSWindowsUpdate -Repository PSGallery -Force -Verbose
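
After the reboot, a quick way to verify the result could look like the lines below; if the repair worked, the module should list PSGallery as its repository.

# Verify that the module is linked to the PSGallery repository again
Get-InstalledModule -Name PSWindowsUpdate | Format-List Name, Version, Repository
# List the available PackageManagement versions; the new one should now take precedence
Get-Module -Name PackageManagement -ListAvailable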