Quorum witness is online but does not work

Problem

The cluster appears to be working fine, but every 15 minutes or so the following events are logged on the node that owns the quorum witness disk:

Source:        Microsoft-Windows-Ntfs
Event ID:      98
Level:         Information
Description:
Volume WitnessDisk: (\Device\HarddiskVolumeNN) is healthy.  No action is needed.


Event ID:      1558
Source:        Microsoft-Windows-FailoverClustering
Level:         Warning
Description:
The cluster service detected a problem with the witness resource. The witness resource will be failed over to another node within the cluster in an attempt to reestablish access to cluster configuration data.


Log Name:      System
Event ID:      1069
Level:         Error
Description:
Cluster resource 'Witness' of type 'Physical Disk' in clustered role 'Cluster Group' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

Analysis

Some digging in the event log identified a disk error incident during a failover of the virtual machine:

Log Name:      System
Event ID:      1557
Level:         Error
Description:
Cluster service failed to update the cluster configuration data on the witness resource. Please ensure that the witness resource is online and accessible.


Log Name:      System
Source:        Microsoft-Windows-Ntfs
Event ID:      140
Description:
The system failed to flush data to the transaction log. Corruption may occur in VolumeId: WitnessDisk:, DeviceName: \Device\HarddiskVolumeNN.
({Device Busy}
The device is currently busy.)

And ultimately

Log Name:      System
Source:        Ntfs
Level:         Warning
Description:
{Delayed Write Failed} Windows was unable to save all the data for the file . The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.

It appears that the witness disk had a non-responsive period during the failover of the VM, which caused an update to the cluster database to fail and left the copy of the cluster database on the witness disk corrupt. The disk itself is fine, so there are no faults in the cluster resource status; everything appears hunky dory. There could be other causes leading to the same situation, but in this case the issue correlates with a VM failover.

We need to replace the defective database with a fresh copy from one of the nodes.

Solution

The usual warning: If this procedure is new to you, seek help before attempting to do this in production. If your cluster has other issues, messing with the quorum setup may land you in serious trouble. And if you have any doubts whatsoever about the integrity of the drive/LUN, replace it with a new one.

Warnings aside, this procedure is usually safe, and as long as the cluster is otherwise healthy you can do this live without scheduling downtime.

Action plan

  • Remove the quorum witness from the cluster.
  • Check that the disk is listed as available storage and online.
  • Take ownership of the defective “cluster” folder on the root of the quorum witness drive.
  • Rename it to “oldCluster” in case we need to extract some data.
  • Add the disk back as a quorum witness (a PowerShell sketch of these steps follows the list).
  • Wait and check that the error messages do not reappear.
  • If they do reappear:
    • Order a new LUN.
    • Add it to the cluster.
    • Use the new LUN as a quorum witness.
    • Remove the old LUN from the cluster.
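
For reference, a minimal PowerShell sketch of the steps above. It assumes the witness disk is mounted as W: on the node you are working on and that the disk resource is named “Witness”; check Get-ClusterResource for the real names before running anything.

# Remove the disk witness. The disk stays in the cluster under Available Storage.
Set-ClusterQuorum -NoWitness

# Confirm that the disk is listed as available storage and is online.
Get-ClusterGroup "Available Storage" | Get-ClusterResource

# Take ownership of the stale cluster database folder and rename it.
# If the rename is denied, grant yourself access first: icacls W:\Cluster /grant Administrators:F /T
takeown /F W:\Cluster /R /D Y
Rename-Item -Path W:\Cluster -NewName oldCluster

# Add the disk back as the disk witness, using the resource name from Get-ClusterResource.
Set-ClusterQuorum -DiskWitness "Witness"

# Verify the new quorum configuration.
Get-ClusterQuorum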

Cluster validation fails: Culture is not supported

Problem

The Failover Cluster validation report shows an error for Inventory, List Operating System Information:

An error occurred while executing the test. There was an error getting information about the operating systems on the node. Culture is not supported. Parameter name: culture 3072 (0x0c00) is an invalid culture identifier.

If you look at the summary for the offending node, you will find that Locale and Pagefiles are missing.

Analysis

Clearly there is something wrong with the locale settings on the offending node. As the sample shows, the locale is set to nb-NO (Norwegian, Norway). I immediately suspected that to be the culprit. Most testing is done on en-US, and the rest of us who want to see a sane 24-hour clock without Latin abbreviations, and a date with the month where it should be located, usually have to suffer.

I was unable to determine exactly where the badger was buried, but the solution was simple enough.
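
Before changing anything, it can be useful to compare what the nodes actually report. A quick sketch, assuming PowerShell remoting is enabled and with NODE1/NODE2 as placeholders for your node names:

$nodes = "NODE1", "NODE2"
Invoke-Command -ComputerName $nodes -ScriptBlock {
    [pscustomobject]@{
        Node         = $env:COMPUTERNAME
        Culture      = (Get-Culture).Name                              # user culture
        SystemLocale = (Get-WinSystemLocale).Name                      # system locale
        WmiLocale    = (Get-CimInstance Win32_OperatingSystem).Locale  # what the validation report reads
    }
}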

Solution

Part 1

  • Make sure that the Region & language and Date & Time settings (modern settings) are set correctly on all nodes. Be aware of differences between the offending node and working nodes.
  • Make sure that the System Locale is set correctly in the Control Panel, Region, Administrative window.
  • Make sure that Windows Update works and is enabled on all nodes.
  • Check the Languages list under Region & Language (modern settings). If it flashes “Windows update” under one or more of the languages, you still have a Windows Update problem or an internet access problem.
  • Try to validate the cluster again. If the error still appears, go to the next line.
  • Run Windows Update and Microsoft Update on all cluster nodes.
  • Restart all cluster nodes.
  • Try to validate the cluster again. If the error still appears, go to part 2.

Part 2

  • Make sure that you have completed part 1.
  • Go to Control Panel, Region.
  • In the Region dialog, locate Formats, Format.
  • Change the format to English (United States). Be sure to select the correct English locale. (A PowerShell alternative is sketched after this list.)
  • Click OK.
  • Run the validation again. It should be successful.
  • Change the format back.
  • Run the validation again. It should be successful.
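
If you prefer PowerShell over the Control Panel, the same format switch can be done per user with Set-Culture. This is a sketch, not the route I used; the change applies to new logon sessions, and nb-NO is an assumption based on the report above.

Set-Culture -CultureInfo en-US   # switch the format to English (United States)
# ... run the validation ...
Set-Culture -CultureInfo nb-NO   # switch the format back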

If you are still not successful, there is something seriously wrong with your operating system. I have not yet had a case where the above steps do not resolve the problem, but I suspect that running chkdsk, sfc /scannow or a full node re-installation would be next. Look into the rest of the validation report for clues to other problems.
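
For completeness, the integrity checks mentioned above, run from an elevated prompt:

sfc /scannow                                     # verify and repair protected system files
dism /Online /Cleanup-Image /RestoreHealth       # repair the component store
chkdsk C: /scan                                  # online scan of the system volume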

Failover Cluster Checklist, Windows Server 2019

Introduction

This post was originally written for Windows 2012R2. This is a rework with updates for Windows 2019. It is currently a work in progress.

OK, so you want to install a cluster? This is not a “Should I build a cluster?” post, this is a “How to build a proper cluster” post. I like checklists, so I made a Windows Failover Cluster installation checklist. Some of the points have their own post, some are just a short sentence. I will add more details as time allows. The goal is to share my knowledge about how to build stable clusters. I may disagree with other best practices out there, but this list is based on my experience, what works in production and what does not. I use it to build new clusters, as well as to troubleshoot clusters made by others. Clustering is so easy that anyone can build a working cluster these days, but building a stable, production-worthy cluster may still be like finding your way out of a maze. A difficult maze filled with ghosts, trolls and angry badgers.

There are some things you need to know about this post before you continue reading:

  • This list is made for production clusters. There is nothing stopping you from building a lab using this list, but if you do as I say, you will build a very expensive lab.
  • I work with SQL Server, Hyper-V and File clusters. This list may work for other kinds of clusters as well, but I have not tested it on recent versions.
  • This list was originally published in 2014 for Windows 2008R2 up until Windows 2012R2. It is now updated for Windows Server 2019. I will try to add version specific instructions when necessary.
  • This list is for physical clusters. I dislike virtual clusters, because most organizations are not clever enough to create functioning virtual production clusters that won’t fail miserably due to user error someday. (By “virtual clusters” I mean cluster nodes on top of hypervisors, not clustered hypervisors). It is however entirely possible to build virtual clusters using this list, especially if you employ technologies such as Virtual FC.
  • This is my checklist. I have spent more than a decade honing it, and it works very well for me. That does not guarantee that it will work for you. I welcome any comments on alternative approaches, but don’t expect me to agree with you.
  • This list is mostly written in a “How to do it” manner, and may be lacking in the “But why should I do it” department. This is due to several reasons, but mostly a lack of time on my part. I do however want you to know that there are several hours, if not days of work behind each point.
  • Updates will be made as I discover new information.
  • The list is chronological. That is, start at the top and make your way down the list. If you jump back and forth, you will not achieve the desired result.
  • This list is based on the LTSC (Long-Term Servicing Channel) GUI version of Windows Server, not Core. You can build clusters on Core, but I do not recommend it. Clusters may be very finicky to troubleshoot when things go wrong, and doing so on Windows Core is like trying to paint a room through the keyhole. So unless you have the infrastructure and budget necessary to treat your physical servers as throw-away commodities, I recommend installing the “Desktop Experience”. To elaborate: if you have trouble with a core server, you remove it and deploy a replacement server. All automated of course.
  • Understanding this list requires knowledge of Active Directory and basic knowledge of Failover Clustering.
  • There are many special cases not covered. This list is for the basic 2-10 node single datacenter cluster. The basic rules still apply though, even if you have nodes in four datacenters and use a hybrid cloud setup.

The design phase

In the design phase, there are a lot of decisions you have to make BEFORE you start building the cluster. These are just a few of them:

  • How many nodes do you need? Remember you need at least one standby node for HA (High Availability). Depending on the total number of nodes you may need several standby nodes. Some managers will complain about the extra nodes just sitting there unused, but they forget that they are there to provide HA. No matter the number of nodes, make sure the hardware is as equal as possible. I don’t care what the manual says, having cluster nodes with different hardware in them is a recipe for disaster. If possible, all nodes should be built on the same day by the same persons and have consecutive serial numbers.
  • How many network fabrics do you need? And how many can you afford? See Networks, teaming and heartbeats for clusters for more information. This is where most troublesome clusters fail.
  • Will you use shared storage? And what kind of shared storage? In short: FCoE is bad for you, iSCSI is relatively cheap, SMB3 is complicated and may be cheap, shared DAS/SAS is very cheap, FC is the enterprise norm and InfiniBand is for those who want very high performance at any cost. Note that the deployment cost for InfiniBand in small deployments has fallen significantly in the last couple of years. In most cases you will have to use what is already in place in your datacenter though. And it is usually better to have something your storage guys are used to supporting. Just remember that storage is very important for your overall performance, no matter what kind of cluster. For file clusters, high throughput is important. For SQL Server, low latency is key and you should use FC or InfiniBand.
  • What kind of hardware should you use in your cluster nodes? These are my opinions, based on my personal experience to date. My opinions on this change frequently as new generations are released, but here goes:
    • Emulex should stop making any kind of hardware. It is all bad for you and bad for your cluster. If you are having trouble with cluster stability and you have Emulex made parts in your nodes, remove them at once.
    • QLogic makes good FC HBAs. If you have an FC SAN, QLogic HBAs are highly recommended. If you have QLogic network adapters on the other hand, use them for target practice.
    • Broadcom network adapters used to be good, but the drivers for Windows are getting worse by the minute.
    • Intel X560 is my current favorite network adapter. It is sold under many names, so check what chip is actually used on the cards offered by your server manufacturer.
    • Use Brocade FC switches only. They are sold under many other brand names as well; I have seen them with both HP and IBM stickers.
    • Use Cisco or HP ProCurve network switches, but do not use them for FC traffic.
    • Make sure your nodes have local disk controllers with battery or flash backed cache. Entry level disk controllers are not worth the cardboard box they are delivered in and may slow down the most hard-core cluster.
    • Intel Xeon CPUs currently reign supreme for most applications. There are however some edge cases for SQL Server where AMD CPUs will perform better. I recommend reading Glenn Berry’s blogs for up-to-date SQL Server CPU information.
    • HP, Lenovo and Dell all make reasonably good servers for clustering. Or, I should say equally bad, but better than the alternatives.
  • Rack or blade?
    • Rack servers
      • are easier to troubleshoot
      • are versatile
      • give you a lot of expansion options
      • are cheaper to buy
    • Blade servers are
      • space efficient
      • cheaper to maintain if you rent rack space
      • easier to install
      • limited in terms of expansion options
  • Where should your nodes be located physically? I do not recommend putting them all in the same rack. The best solution is to put them in separate rooms within sub-millisecond network distance. You can also place them in separate data centers with a long distance between them if you do not use shared storage or use some kind of hybrid solution. I do not recommend SAN synchronization to data centers far, far away though, it is better to have synchronization higher up in the stack. If you only have one datacenter, place the nodes in different racks and make sure they have redundant power supplies.
  • Speaking of power: your redundant power supplies should be connected to separate power circuits, preferably with each connected to an independent UPS.
  • What domain should your servers be a member of, and which organizational unit should you use? Failover clustering will not work without Active Directory. Clusters without a domain are supported from Windows Server 2019, but they are not recommended. You probably need AD for other stuff anyway.
  • The Active Directory role should NOT be installed on the cluster nodes. You should have at least two domain controllers, one of which should be a dedicated physical machine. I know that MS now supports virtualizing all your domain controllers, but that does not mean that you should do it, or that it is smart to do so. I would also recommend creating a separate OU for each cluster (see the sketch after this list).
  • What account should you use to run the installation? Previously a separate cluster installation account was recommended, but with newer versions it is usually no problem using a regular sysadmin account. The account should be a domain administrator to make everything easy, but this checklist will work as long as you have local admin on the cluster nodes. (Be aware that some points require some form of AD write access.)
  • And then there are a lot of product and project specifics, such as storage requirements, CPU and memory sizing and so on, all of which may affect your cluster design.
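
A sketch of the separate-OU-per-cluster recommendation, with placeholder names and paths. It assumes the ActiveDirectory PowerShell module and write access to the parent OU:

# Create an OU for the cluster and move the node computer accounts into it.
New-ADOrganizationalUnit -Name "Cluster01" -Path "OU=Servers,DC=contoso,DC=com"
"NODE1", "NODE2" | ForEach-Object {
    Get-ADComputer $_ | Move-ADObject -TargetPath "OU=Cluster01,OU=Servers,DC=contoso,DC=com"
}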

The actual checklist

All list items should be performed on each node in the cluster unless specified otherwise. You can do one node at a time or all at once until you get to cluster validation. All nodes should be ready when you run cluster validation. I find it easiest to remember everything by doing one list item for each node before I move on to the next, making notes along the way.

  • Mount the hardware
  • Set BIOS/UEFI settings as required by your environment. Remember to enable High Performance mode, otherwise you will be chasing performance gremlins.
  • If your cluster nodes are virtual machines, make sure that they are not allowed to be hosted by the same host. How you configure this will depend on your virtualization platform.
  • Install Windows Server
  • Copy any required media, drivers etc. to a folder on each node
  • Static or reserved IP addresses are recommended, both IPv4 and IPv6.
  • If you are not able to use IPv6 to talk to your domain controllers, disable IPv6 completely in the registry. See How to disable IPv6.
  • Make sure all your drivers are installed using Device Manager.
  • Make sure you are running current BIOS, Remote Access, RAID, HBA and Network firmware in accordance with your patch regime. If in doubt, use the latest available version from your server vendor. Do NOT download drivers and firmware from the chip vendor unless you are troubleshooting a specific problem.
  • Make sure your drivers are compatible with the firmware mentioned above.
  • Check whether the server is a member of the domain, and add it to the domain if necessary.
  • Activate a machine proxy if you use a proxy server to access the internet. See Proxy for cluster nodes for more information.
  • Activate RDP.
  • Create a firewall rule to allow ICMP (ping) on all interfaces regardless of profile.
New-NetFirewallRule -DisplayName "Allow ICMP all profiles IPv4" -Direction Inbound -Protocol ICMPv4  -Action Allow
New-NetFirewallRule -DisplayName "Allow ICMP all profiles IPv6" -Direction Inbound -Protocol ICMPv6  -Action Allow
  • Select the High performance power plan.
  • If the node is virtual, enable vRSS. If physical, enable RSS. If you are creating a Hyper-V cluster, enable VMQ as well (see the sketch after this list). See https://lokna.no/?p=2464 for details.
  • Make sure that your nodes are located in the correct OU. The default “Computers” container is not the correct OU.
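
A sketch of the power plan and RSS/VMQ items above. The adapter names are placeholders; check your NIC vendor's guidance before changing queue settings.

# Select the High performance power plan (SCHEME_MIN is the built-in alias for it).
powercfg /setactive SCHEME_MIN

# Physical node: check and enable RSS on the team/adapter.
Get-NetAdapterRss
Enable-NetAdapterRss -Name "Team-Domain"

# Hyper-V cluster: enable VMQ on the physical adapters used by the virtual switch.
Enable-NetAdapterVmq -Name "NIC-VMSwitch"

# Virtual node: vRSS takes effect when RSS is enabled on the adapter inside the guest
# and VMQ is enabled on the corresponding physical adapter on the host.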

  • Add the failover cluster features:
Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools
  • Check the interface metric. Your domain-facing team/adapter should have the lowest metric. See https://lokna.no/?p=2637
  • Disable NICs that are not in use
  • Install any prerequisites required by your shared storage. Check with your SAN admin for details.
  • Change page file settings according to Page file defaults
  • Install PSWindowsUpdate and run it against Microsoft Update.
  • Install cluster hotfixes. See Does your cluster have the recommended hotfixes?
  • If you are using shared storage, verify storage connections and MPIO in accordance with guidelines from your SAN vendor. Most SAN vendors have specific guidelines/whitepapers for Failover Clustering.
  • Make sure that you are connected to your shared storage on all nodes and have at least one LUN (drive) presented for validation.
  • Validate the configuration: Validating a Failover Cluster. Do not continue until your cluster passes validation. I have yet to see a production cluster without validation warnings, but you should document why you have each warning before you continue.
  • Create the cluster: Creating a Failover Cluster
  • Verify the quorum configuration. Make sure dynamic quorum is enabled. You should always have a quorum witness drive (even if you don’t use it). The Create Cluster wizard will, without fail, select a different quorum witness drive than the one you intended to use, so make sure to correct this as well (see the sketch after this list).
  • Grant create computer object permissions to the cluster. This is necessary for installation of most clustered roles, and this is why each cluster should have its own OU.
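
A sketch of the validation, creation and quorum steps above. Cluster name, node names and the IP address are placeholders.

Test-Cluster -Node NODE1, NODE2                      # run validation and review the report
New-Cluster -Name CLUSTER01 -Node NODE1, NODE2 -StaticAddress 192.0.2.10 -NoStorage
Get-ClusterQuorum                                    # verify the quorum configuration
(Get-Cluster).DynamicQuorum                          # 1 means dynamic quorum is enabled
Set-ClusterQuorum -DiskWitness "Cluster Disk 1"      # point the witness at the intended disk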

Upgrade to VMM 1801, a knights tale

This is a story in the “Knights of Hyper-V” series, an attempt at humor with actual technical content hidden in the details.

A proclamation had been issued several moons ago by the gremlins of the blue window, declaring that a new version of the Virtual Machine Manager had been released. This had mostly been ignored by our merry knights, they were all busy building new systems, putting out fires and slaying dragons. You know, the usual stuff. Thus they had no time to spare for doing such things as maintenance on systems that were chugging along nicely without issues. But when a second proclamation appeared about an even newer version, it was decided to spend some time trying to do an upgrade in the lab, down in the spare dungeon.

Alas, this was not to be an easy task. The lab servers were in dire need of some maintenance as well, and one of the hosts flat out refused to respond to commands. Closer inspection revealed a “No bootable device” error on the local console, the result of a botched patching run a long time ago. For some reason the main partition was no longer marked active, a relatively easy fix in diskpart. But on to the main quest. Rumors had it that there was no in-place upgrade path from SCVMM 2016 to SCVMM 1801. Those rumors were true indeed.

A knight was sent into the maze of documentation to look for answers. He came upon several dead ends and a lot of references to the hidden cat of 404, but he persisted and finally ended up at https://docs.microsoft.com/en-us/system-center/vmm/upgrade-vmm?view=sc-vmm-1801. Just as in the upgrade from SCVMM 2012 to SCVMM 2016, an uninstall/reinstall was required.

A cunning plan is devised

The SCVMM 1801 scroll of system requirements was reviewed to make sure that our systems were supported. The spare dungeon contains a single VM running both SCVMM and SQL Server, and some old hosts. The VMM VM has the following setup:

  • Windows Server 2016
  • SQL Server 2012 SP4
  • SCVMM 2016 4.0.2314.0 (UR5)

After some pondering around the table reading the scroll of upgrade instructions mentioned above, the following plan was agreed upon:

  • Checkpoint/snapshot the VMM server.
  • Create a Copy-Only backup of the VMM database (a sketch of these two steps follows the list).
  • Reboot the VMM Server to make sure there are no pending reboots or other nasty stuff lurking in memory.
  • Uninstall VMM 2016.
  • Restart the server.
  • Install VMM 1801.
  • Upgrade VMM to 1807.
  • Remove the checkpoint.
  • Update the VMM Agent on the hosts.
  • Turn off Diagnostic and Usage data.
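
A sketch of the checkpoint and backup steps in the plan. Host, VM and instance names are placeholders, VirtualManagerDB is the default VMM database name, and Backup-SqlDatabase requires the SqlServer module.

# Checkpoint the VMM VM on its Hyper-V host.
Checkpoint-VM -ComputerName HYPERVHOST -Name VMM01 -SnapshotName "Before VMM 1801 upgrade"

# Copy-only backup of the VMM database.
Backup-SqlDatabase -ServerInstance "VMM01" -Database "VirtualManagerDB" `
    -BackupFile "D:\Backup\VirtualManagerDB_CopyOnly.bak" -CopyOnly

# After a successful upgrade, remove the checkpoint.
Remove-VMSnapshot -ComputerName HYPERVHOST -VMName VMM01 -Name "Before VMM 1801 upgrade"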

Note: If you are running other System Center products, make sure that you review the upgrade sequence. Especially noteworthy is the fact that Operations Manager should be upgraded before VMM.

Continue reading “Upgrade to VMM 1801, a knights tale”

MIM LAB6: The AD MA and Run profiles

This post is part of a series. The chapter index is located here.

In this lab we will configure the first AD management agent and set up run profiles.

Continue reading “MIM LAB6: The AD MA and Run profiles”

MIM LAB5: The MIM Service / Portal Management Agent

This post is part of a series. The chapter index is located here.

In this post we will install and configure the MIM Portal / Service management agent.

Continue reading “MIM LAB5: The MIM Service / Portal Management Agent”

MIM LAB 4: Installing the MIM Portal / MIM Service

This post is part of a series. The chapter index is located here.

In this post we install and configure the MIM Portal / Service.

Be aware that I had to make some changes to things I did in previous labs to make this work. I hope I have included all the details, but I have yet to re-run a complete install to test it. Continue reading “MIM LAB 4: Installing the MIM Portal / MIM Service”

MIM LAB 2: Preparing the first MIM server

This post is part of a series. The chapter index is located here.

In this post we:

  • Create the first MIM VM and join it to AD
  • Install prerequisites
  • Set Local security policies
  • Change IIS authentication mode
  • Install SQL Server
  • Install and configure SharePoint Foundation 2013

Continue reading “MIM LAB 2: Preparing the first MIM server”

MIM LAB 1: Prepare a domain controller

This post is a part of a series. The chapter index is located here.

We will look at:

  • Installing ADDS
  • Creating a domain
  • Configuring DNS
  • Creating a basic OU structure
  • Creating users and groups required for the MIM installation

Continue reading “MIM LAB 1: Prepare a domain controller”