MOMcertimport.exe not found

Scenario

  • You have a computer that is monitored using System Center Operations Manager (SCOM).
  • This computer is located outside of your normal AD structure and therefore relies on certificate authentication. It could be located in:
    • The cloud
    • A DMZ
    • A disjointed domain
    • All of the above
    • A super secret location with way too many firewalls
  • The certificate or part of the certificate chain has expired and needs replacing
  • You are unable to run the MOMCertimport.exe tool that registers the certificate with the SCOM agent.

Solution

Note: I will assume that you have already created and installed a valid certificate on the computer in the correct way. In short:

  • Into the local computer certificate store
  • Including all root and intermediate certificates needed
  • And the private key for the certificate

Now, to make use of said certificate we would normally run MOMCertimport.exe, a tool on the SCOM installation media written for the express purpose of telling the SCOM agent which certificate to use when communicating with the rest of the SCOM infrastructure, usually a gateway server. But maybe you do not have access to it? Or maybe, just maybe, the computer in question is considered so secure that getting approval to use a tool like that would take weeks or even months?

Regedit to the rescue!

You will need the following information:

  • The certificate thumbprint
  • The certificate serial number

Action plan

If any details of this plan are unclear or confusing to you, seek assistance before you start. A PowerShell sketch of the same steps follows the action plan.

  • Open regedit
  • Navigate to HKLM\Software\Microsoft\Microsoft OperationsManager\3.0\Machine Settings
  • Look for the ChannelCertificateSerialNumber value. If it does not exist, create it as a binary value.
  • Input the serial number in reverse byte order. That is, if your serial number is AF 3C 56, input 56 3C AF. Each pair of characters represents one byte in hexadecimal. Do not reverse the digits within a byte, only the byte order, as shown above.
  • Double check the numbers
  • Look for the ChannelCertificateHash value. If it does not exist, create it as a string value.
  • Input the certificate thumbprint into this field. This time, do not reverse the bytes, and remove any spaces. That is, input 99 df a3 as 99dfa3. Use lower case for the hex digits a through f. The thumbprint is usually listed in lower case, whereas the serial number above is usually listed in upper case.
  • Again, double check the numbers
  • Restart the Microsoft Monitoring Agent service
  • Look for event ID 20053 in the Operations Manager event log, confirming that the certificate was valid. An invalid certificate will result in event ID 20066.
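
For reference, here is the same procedure as a PowerShell sketch. The key path is taken verbatim from the action plan above, the thumbprint and serial number are placeholders you must replace with the values from your own certificate, and HealthService is my assumption for the service name behind the Microsoft Monitoring Agent.

$key        = 'HKLM:\SOFTWARE\Microsoft\Microsoft OperationsManager\3.0\Machine Settings'
$thumbprint = '99dfa3...'   # placeholder: your thumbprint, lower case, no spaces
$serial     = 'AF 3C 56'    # placeholder: the serial number as displayed on the certificate

# Convert the serial number to bytes and reverse the byte order
$bytes = [byte[]]($serial -split '\s+' | ForEach-Object { [Convert]::ToByte($_, 16) })
[array]::Reverse($bytes)

New-ItemProperty -Path $key -Name ChannelCertificateSerialNumber -PropertyType Binary -Value $bytes -Force
New-ItemProperty -Path $key -Name ChannelCertificateHash -PropertyType String -Value $thumbprint.ToLower() -Force

# Restart the agent, then check the Operations Manager log for event 20053 (valid) or 20066 (invalid)
Restart-Service HealthService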


CredSSP encryption oracle remediation

Problem

One of my minions contacted me about a strange error message connecting to a server. He was running scheduled maintenance, but he was unable to connect via RDP to one of his servers. The error message looked like this:

image

“An authentication error has occurred. The function requested is not supported.”

“This could be due to CredSSP encryption oracle remediation”

Analysis

Some Microsoft gremlin thought it was a good idea to block remote connections to Windows 2012R2 servers missing the March 2018 CredSSP patch if your client is patched. You know, just to make it extra easy to patch the servers. They even try to blame Oracle for their mess.

According to KB4093492, this fine function was enabled on 2018-05-08. “By default, after this update is installed, patched clients cannot communicate with unpatched servers.” You can override this by creating a GPO and restarting all affected systems, but that would leave you permanently vulnerable to what is in fact a security issue. Moreover, as a reboot is needed for the workaround, it is easier to just patch the servers (which was our initial plan).
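
For the record, the GPO workaround boils down to a single registry value on the patched client. The value name AllowEncryptionOracle and the meaning of the data (2 = “Vulnerable”) are taken from the CVE-2018-0886 guidance; treat the sketch below as a temporary measure and revert it as soon as the servers are patched:

$key = 'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\CredSSP\Parameters'
New-Item -Path $key -Force | Out-Null
New-ItemProperty -Path $key -Name AllowEncryptionOracle -PropertyType DWord -Value 2 -Force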

Solution

Install the patches from this list on your servers: https://portal.msrc.microsoft.com/en-us/security-guidance/advisory/CVE-2018-0886. If you are lucky they are just VMs and you have access to the VM console or some kind of KVM. If you are not lucky, a trip to the server room it is.

Removing a drive from a cluster group moves all cluster group resources to Available Storage

Problem

During routine maintenance on a SQL Server cluster we were planning to remove one of the clustered drives. We had previously replaced the SAN, and this disk was backed by an old storage unit that we wanted to decommission. So we made sure that there were no dependencies, right-clicked the drive in Failover Cluster Manager under the SQL Server role and pressed “Remove from SQL Server”. Promptly the drive vanished from view, together with all other cluster resources associated with the role…

After a slightly panicky check to make sure that the SQL Server instance was still running (it was), we started to wonder about what was happening. Running Get-ClusterResource in PowerShell revealed that all our missing resources had been moved to the “Available Storage” resource group.

image

We did a failover to verify that the instance was still working, and it gladly failed over with the Available Storage group. There is a total of 4 instances of SQL Server on the sample cluster pictured above.

Solution

The usual warning: Performing this procedure may result in an outage. If you do not understand the commands, read up on them before you try.

Move the resources back to the SQL Server resource group. If you move the SQL Server resource, that is the resource with the ResourceType SQL Server, all other dependent resources should follow. If your dependency settings are not configured correctly, you may have to move some of the resources independently.

Command: Get-ClusterResource "SQL Server (instance)" | Move-ClusterResource -Group "SQL Server (instance)"

Just replace Instance with the name of your SQL Server instance.
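
If several resources are stranded, a sketch along these lines should move everything left in Available Storage back in one go. "SQL Server (MYINSTANCE)" is a placeholder, and if other roles legitimately keep disks in Available Storage you will want a more careful filter:

Get-ClusterResource | Where-Object { "$($_.OwnerGroup)" -eq 'Available Storage' } | Move-ClusterResource -Group 'SQL Server (MYINSTANCE)'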

Then, run Get-ClusterResource | Sort-Object OwnerGroup, ResourceType to verify that all your resources are associated with the correct resource group. The result should look something like this. As a minimum, you should have an IP address, a network name, SQL Server, SQL Server Agent and one or more Physical disk drives.

image

Session "" failed to start with the following error: 0xC0000022

Problem

The event log fills up with Event ID 2 from Kernel-EventTracing stating Session “” failed to start with the following error: 0xC0000022.

image

Analysis

If you look into the system data for one of the events, you will find the associated ProcessID and ThreadID:

image

If the event is relatively recent, the Process ID should still belong to the offending process. Open Process Explorer and list processes by PID:

image
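
As an aside, the same lookup can be scripted instead of clicking through Process Explorer. A sketch, where the full provider name is my assumption based on the event source shown above:

# Pull the ProcessID from the newest event and resolve it to a process
$ev  = Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = 2; ProviderName = 'Microsoft-Windows-Kernel-EventTracing' } -MaxEvents 1
$xml = [xml]$ev.ToXml()
Get-Process -Id $xml.Event.System.Execution.ProcessID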

We can clearly see that the culprit is one of those pesky WMI processes. The ThreadID is far more short-lived than the ProcessID, but we can always take a chance and see if it will reveal more data. I spent a few minutes writing this, and in that time it had already disappeared. I waited for another event, and immediately went to Process Explorer to look for thread 18932. Sadly though, this didn’t do me any good. For someone more versed in kernel API calls the data might make some sense, but not to me.

image

I had more luck rummaging around in the ad-profile generator (Google search). It pointed me in the direction of KB3087042, which talks about WMI calls to LBFO teaming (Windows 2012 native network teaming) and conflicts with third-party WMI providers. Some more digging around indicated that the third-party WMI provider in question is HP WBEM, a piece of software used on HP servers to facilitate centralized server management (HP Insight). As KB3087042 states, the third-party provider is not the culprit. That implies a fault in Windows itself, but one must not admit such things publicly of course.

In their infinite wisdom (or as an attempt to compensate for their lack thereof), the good people of Microsoft have also provided a manual workaround for the issue. It is a bit difficult to understand, so I will provide my own version below.

Solution

As usual, if the following looks to you as something that belongs in a Harry Potter charms class, please seek assistance before you implement this in production. You will be messing with central operating system files, and a slip of the hand may very well end up with a defective server. You have been warned.

The fix

But let us get on with the fix. First, you have to get yourself an administrative command prompt. The good old-fashioned black cmd.exe (or any of the 16 available colors). There is no reason why this would not work in one of those fancy new blue PowerShell thingies as well, but why take unnecessary risks?

Then, we have a list of four incantations, er, commands to run through. Be aware that if for some reason your system drive is not C:, you will have to take that into account. And then spend five hours repenting and trying to come up with a good excuse for why you did it in the first place. Or perhaps spend the time looking for the person who did it and give them a good talking-to. But I digress. The commands to run from the administrative command prompt are as follows:

Takeown /f c:\windows\inf
icacls c:\windows\inf /grant "NT AUTHORITY\NETWORK SERVICE":"(OI)(CI)(F)"
icacls c:\windows\inf\netcfgx.0.etl /grant "NT AUTHORITY\NETWORK SERVICE":F
icacls c:\windows\inf\netcfgx.1.etl /grant "NT AUTHORITY\NETWORK SERVICE":F

The first command takes ownership of the Windows\Inf folder. This is done to make sure that you are able to make the changes. The three icacls commands grant permissions to the NETWORK SERVICE system account on the INF folder and the two ETL files. The result should look something like this:

image

To test if you were successful, run this command:

icacls c:\windows\inf

And look for the highlighted result:

image

Should you want to learn more about the icacls command, this is a good starting point.

The cleanup

This point is very important. If you do not hand over ownership of Windows\Inf back to the system, bad things will happen in your life.

This time, you only need a normal file explorer window. Open it and navigate to C:\Windows. Then open the advanced security dialog for the Inf folder.

Next to the name of the current owner (should be your account) click the change button/link.

image

Then, select the Local Computer as location and NT SERVICE\TrustedInstaller as object name. Click Check Names to make sure you entered everything correctly. If you did, the object name changes to TrustedInstaller (underlined).

image

Click OK twice to get back to the file explorer window. If you did not get any error messages, you are done.

It IS possible to script the ownership transfer as well, but in my experience the failure rate is way too high. I guess the writers of the KB agree, as they have only given a manual approach.
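
For completeness, the scripted equivalent of the hand-back would presumably be a single icacls /setowner call like the one below, run from an administrative prompt. As noted above, the manual approach is the safer bet:

icacls c:\windows\inf /setowner "NT SERVICE\TrustedInstaller"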

Failover Cluster manager fails to start

Problem

When trying to start Failover Cluster manager you get an error message: “Microsoft Management Console has stopped working”

image

Inspection of the application event log reveals event ID 1000, also known as an application error, with the following text:

Faulting application name: mmc.exe, version: 6.3.9600.17415, time stamp: 0x54504e26
Faulting module name: clr.dll, version: 4.6.1055.0, time stamp: 0x563c12de
Exception code: 0xc0000409

image

 

Solution

As usual, this is a .NET Framework debacle. Remove KB 3102467 (Update for .NET Framework 4.6.1), or wait for a fix.

image
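
If you prefer the command line, removing the update should look something like this from an elevated prompt (expect a reboot afterwards):

wusa /uninstall /kb:3102467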

Windows Server 2012R2 NIC Teaming

This is an attempt at giving a technical overview of how the native network teaming in Windows 2012R2 works, and how I would recommend using it. From time to time I am presented with problems “caused” by network teaming, so figuring out how it all works has been essential. Compared to the days of old, where teaming was NIC vendor dependent, today’s Windows native teaming is a delight, but it is not necessarily trouble free.

Sources

Someone at Microsoft has written an excellent guide called Windows Server 2012 R2 NIC Teaming (LBFO) Deployment and Management, available here. It gives a detailed technical guide to all the available options. I have added my field experience to the mix to create this guide.

Definitions

  • NIC: Network Interface Card. Also known as Network Adapter.
  • vNIC/virtual NIC: a team adapter on a host or another computer (virtual or physical) that uses teaming.
  • Physical NIC/adapter: An adapter port that is a member of a team. Usually a physical NIC, but could be a virtual NIC if someone has made a complicated setup with teaming on a virtual machine.
  • vSwitch: A virtual switch, usually a Hyper-V switch.
  • Team member: a NIC that is a member of a team.
  • LACP: Link Aggregation Control Protocol, also IEEE 802.3ad. See https://en.wikipedia.org/wiki/Link_aggregation#Link_Aggregation_Control_Protocol

Active-Active vs Active-Passive

image

If none of the adapters are set as standby, you are running an Active-Active config. If one is standby and you have a total of two adapters, you are running an Active-Passive config. If you have more than two team members, you may be running a mixed Active-Active-Passive config (standby adapter set), or an Active-Active config without a standby adapter.
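
For reference, a quick PowerShell sketch for checking the members and turning a two-member team into Active-Passive; the team and NIC names are placeholders:

Get-NetLbfoTeamMember -Team 'Team1' | Select-Object Name, AdministrativeMode, OperationalStatus
Set-NetLbfoTeamMember -Name 'NIC2' -Team 'Team1' -AdministrativeMode Standby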

If you are using a configuration with more than one active team member on a 10G infrastructure, my recommendation is to make sure that both members are connected to the same physical switch, and to the same module. If not, be prepared to sink hundreds, if not thousands, of hours into troubleshooting that could otherwise be avoided. There are far too many problems related to the switch teaming protocols used on 10G, especially with the Cisco Nexus platform. And it is not that they do not work, it is usually an implementation problem. A particularly nasty kind of device is something Cisco refers to as a FEX or fabric extender. Again, it is not that it cannot work. It is just that when you connect it to the main switch with a long cable run, it usually works fine for a couple of months. And then it starts dropping packets and pretending nothing happened. So if you connect one of your team members to a FEX, and another to a switch, you are setting yourself up for failure.

Due to the problems mentioned above and similar troubles, many IT operations have a ban on Active-Active teaming. It is just not worth the hassle. If you really want to try it out, I recommend one of the following configurations (creation sketches follow the list):

  • Switch independent, Hyper-V load balancing. Naturally for vSwitch connected teams only. No, do not use Dynamic.
  • LACP with Address Hash or Hyper-V load balancing. Again, do not use Dynamic mode.
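
Creation sketches for the two configurations above; the team and adapter names are placeholders, and TransportPorts is the PowerShell name for the default port-and-IP address hash:

# Switch independent with Hyper-V load balancing (for vSwitch connected teams)
New-NetLbfoTeam -Name 'VmTeam' -TeamMembers 'NIC1','NIC2' -TeamingMode SwitchIndependent -LoadBalancingAlgorithm HyperVPort
# LACP with Address Hash
New-NetLbfoTeam -Name 'HostTeam' -TeamMembers 'NIC3','NIC4' -TeamingMode Lacp -LoadBalancingAlgorithm TransportPorts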

Team members

I do not recommend using more than two team members in Switch Independent teaming due to artifacts in load distribution. Your servers and switches may handle everything correctly, but the rest of the network may not. For switch dependent teaming, you should be OK, provided that all team members are connected to the same switch module. I do not recommend using more than four team members though, as it seems to be the breaking point between added redundancy and too much complexity.

Make sure all team members are using the exact same network adapter with the exact same firmware and driver versions. Mixing them up will work, but even if BASE jumping is legal, that does not mean you have to jump. NICs are cheap, so fork over the cash for a proper Intel card.

Load distribution algorithms

Be aware that the load distribution algorithm primarily affects outbound connections. The behavior of inbound connections and routing for switch independent mode is described for each algorithm. In switch dependent mode (either LACP or static) the switch will determine where to send the inbound packets.

Address hash

Using parts of the address components, a hash is created for each load/connection. There are three different modes available, but the default one exposed in the GUI (port and IP) is the one mostly used. The other alternatives are IP only and MAC only. For traffic that does not support the default method, one of the others is used as a fallback.

Address hash creates a very granular distribution of traffic initiated at the VM, as each packet/connection is load balanced independently. The hash is kept for the duration of the connection, as long as the active team members are the same. If a failover occurs, or if you add or remove a team member, the connections are rebalanced. The total outbound load from one source is limited by the total outbound capacity of the team and the distribution.

image

Inbound connections

The IP address for the vNIC is bound to the so-called primary team member, which is selected from the available team members when the team goes online. Thus, everything that uses this team will share one inbound interface. Furthermore, the inbound route may be different from the outbound route. If the primary adapter goes offline, a new primary adapter is selected from the remaining team members.

Recommended usage
  • Active/passive teams with two members
  • Never ever use this for a Virtual Switch
  • Using more than two team members with this algorithm is highly discouraged. Do not do it.

MS recommends this for VM teaming, but you should never create teams in a VM. I have yet to hear a good reason to do so in production. What you do in your lab is between you and your therapist.

Hyper-V mode

Each vNIC, be it on a VM or on the host, is assigned to a team adapter and stays connected to it as long as it is online. The advantage is a predictable network path, the disadvantage is poor load balancing. As adapters are assigned in a round-robin fashion, all your high-bandwidth usage may overload one team adapter while the other team adapters have no traffic. There is no rebalancing of traffic. The outbound capacity for each vNIC is limited to the capacity of the physical NIC it is attached to.

This algorithm supports VMQ.

image

It may be the case that the red connection in the example above is saturating the physical NIC, thus causing trouble for the green connection. The load will not be rebalanced as long as both physical NICs are online, even if the blue connection is completely idle.

The upside is that the connection is attached to a physical NIC, and thus incoming traffic is routed to the same NIC as outbound traffic.

Inbound connections

Inbound connections for VMs are routed to the physical NIC assigned to the vNIC. Inbound connections to a host are routed to the primary team member (see Address hash). Thus inbound load is balanced for VMs, and we are able to utilize VMQ for better performance. Dynamic has the same inbound load balancing problems as Address hash for host inbound connections.

Recommended use

Not recommended for use on 2012R2, as Dynamic will offer better performance in all scenarios. But, if you need MAC address stability for VMs on a Switch Independent team, Hyper-V load distribution mode may offer a solution.

On 2012, recommended for teams that are connected to a vSwitch.

Dynamic

Dynamic is a mix between Hyper-V and Address hash. It is an attempt to create a best-of-both-worlds scenario by distributing outbound loads using address hash algorithms and inbound load as in Hyper-V mode, that is, each vNIC is assigned one physical NIC for inbound traffic. Outbound loads are rebalanced in real time. The team detects breaks in the communication stream where no traffic is sent. The period between two such breaks is called a flowlet. After each flowlet the team will rebalance the load if deemed necessary, expecting that the next flowlet will be equal to the previous one.

The teaming algorithm will also trigger a rebalancing of outbound streams if the total load becomes very unbalanced, a team member fails, or other hidden magic black-box settings determine that immediate rebalancing is required.

This mode supports VMQ.

image

Inbound connections

Inbound connections are mapped to one specific physical NIC for each workload, be it a VM or a workload originating on the host. Thus, the inbound path may differ from the outbound path, just as with Address hash.

Recommended use

MS recommends this mode for all teams with the following exceptions:

  • Teams inside a VM (which I do not recommend, no matter what).
  • LACP Switch dependent teaming
  • Active/Passive teams

I will add the following exception: if your network contains load balancers that do not employ proper routing, e.g. F5 BigIP relying on the “Auto Last Hop” option to work around routing problems, routing will not work together with this teaming algorithm. Use Hyper-V or Address Hash Active/Passive instead.

Source MAC address in Switch independent mode

Outbound packets from a VM that exit the host through the primary adapter will use the MAC address of the VM as the source address. Outbound packets that exit the host through a different physical adapter will get another MAC address as the source address, to avoid triggering a MAC flapping alert on the physical switches. This is done to ensure that one MAC address is only present at one physical NIC at any one point in time. The MAC assigned to the packet is the MAC of the physical NIC in question.

To try to clarify, for Address Hash:

  • If a packet from a VM exits through the primary team member, the MAC of the vNIC on the VM is kept as source MAC address in the packet.
  • If a packet from a VM exits through (one of) the secondary team members, the source MAC address is changed to the MAC address of the secondary team member.

for Hyper-V:

  • Every vSwitch port is assigned to a physical NIC/team member. If you use this for host teaming (no vSwitch), you have 1 vSwitch port and all inbound traffic is assigned to one physical NIC.
  • Every packet uses this team member until a failover occurs for any reason

for Dynamic:

  • Every vSwitch port is assigned to a physical NIC. If you use this for host teaming (no vSwitch), you have 1 vSwitch port and all inbound traffic is assigned to one physical NIC.
  • Outbound traffic will be balanced. MAC address will be changed for packets on secondary adapters.

For Hyper-V and Dynamic, the primary is not the team primary but the assigned team member. It will thus be different for each VM.

For host teaming without a vSwitch the behavior is similar. One of the team members’ MACs is chosen as the primary for host traffic, and the MAC replacement rules apply as for VMs. Remember, you should not use Hyper-V load balancing mode for host teaming. Use Address hash or Dynamic.

Algorithm       Source MAC on primary   Source MAC on secondary adapters
Address hash    Unchanged               MAC of the secondary in use
Hyper-V         Unchanged               Not used
Dynamic         Unchanged               MAC of the secondary in use

Source MAC address in switch dependent mode

No MAC replacement is performed on outbound packets. To be overly specific:

Algorithm             Source MAC on primary   Source MAC on secondary adapters
Static Address hash   Unchanged               Unchanged
Static Hyper-V        Unchanged               Unchanged
Static Dynamic        Unchanged               Unchanged
LACP Address hash     Unchanged               Unchanged
LACP Hyper-V          Unchanged               Unchanged
LACP Dynamic          Unchanged               Unchanged

EventID 1004 from IPMIDRV v2

Post originally from 2010, updated  2016.06.20. IPMI seems to be an endless source of “entertainment”…

Original post: https://lokna.no/?p=409

Problem

image

The system event log is overflowing with EventID 1004 from IPMIDRV. “The IPMI device driver attempted to communicate with the IPMI BMC device during normal operation. However the operation failed due to a timeout.”

The frequency may vary from a couple of messages per day upwards to several messages per minute.

Analysis

The BMC (Baseboard Management Controller) is a component found on most server motherboards. It is a microcontroller responsible for communication between the motherboard and management software. See Wikipedia for more information. The BMC is also used for communication between the motherboard and dedicated out-of-band management boards such as Dell iDRAC. I have seen these error messages on systems from several suppliers, most notably on IBM and Dell blade servers, but most server motherboards have a BMC.

As the error message states, you can resolve this error by increasing the timeout, and this is usually sufficient. I have found that the Windows default settings for the timeouts may cause conflicts, especially on blade servers. Thus an increase in the timeout values may be in order, as described on TechNet.

Lately though, I have found this error to be a symptom of more serious problems. To understand this, we have to look at what is actually happening. If you have some kind of monitoring agent running on the server, such as SCOM or similar, the error could be triggered by said agent trying to read the current voltage levels on the motherboard. If such operations fail routinely during the day, it is a sign of a conflict. This could be competing monitoring agents querying data too frequently, an issue with the BMC itself, or an issue with the out-of-band management controller. In my experience, this issue is more frequent on blade servers than on rack-based servers. This makes sense, as most blade servers have a local out-of-band controller that is continuously talking to a chassis management controller to provide a central overview of the chassis.


New CSV Volume does not work, Event ID 5120

Problem

A newly converted Cluster Shared Volume refuses to come online. Cluster Validation passed with flying colours pre-conversion. Looking in the event log you find this:

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 05.06.2016 15:01:31
Event ID: 5120
Task Category: Cluster Shared Volume
Level: Error
Keywords:
User: SYSTEM
Computer: HyperVHostname
Description:
Cluster Shared Volume ‘Volume1’ (‘VMStore1’) has entered a paused state because of ‘(c00000be)’. All I/O will temporarily be queued until a path to the volume is reestablished.

The event is repeated on all nodes.

Analysis

The crafty SAN admins have probably enabled some kind of fancy SAN mirroring on your LUN. If you check, you will probably find twice the number of storage paths compared to your usual amount. A typical SAN has 4 connections per LUN, and thus you may see 8 paths. Be aware that your results may vary; the point is that you now have more than usual. The problem is that you cannot use all of the paths simultaneously. Half of them are for the SAN mirror, and your LUNs are offline at the mirror location. If a failover is triggered at the SAN side, your primary paths go down and your secondary paths come alive. Your poor server knows nothing about this though; it is only able to register that some of the paths do not work even though they claim to be operative. This confuses Failover Clustering. And if there is one thing Failover Clustering does not like, it is getting confused. As a result the CSV volume is put in a paused state while it waits for the confusion to disappear.

Solution

You have to give MPIO permission to verify the claims made by the SAN as to whether or not a path is active. Run the following PowerShell command on all cluster nodes. Be aware that this is a system-wide setting and is activated for all MPIO connections that use the Microsoft DSM.

Set-MPIOSetting -NewPathVerificationState Enabled

Then reboot the nodes and all should be well in the realm again.
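
To confirm that the setting stuck on each node after the reboot, a quick check:

Get-MPIOSetting | Select-Object PathVerificationState, PathVerificationPeriod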

Windows Server 2012R2 stuck at “Updating your system”

Update 2017.01.24: Several people have reported that adding a blank pending.xml file helps, so I have added it to the list.

Problem

After installing updates through Microsoft Update, the reboot never completes. You can wait for several days, but nothing happens, the process is stuck at X%:

image

Troubleshooting

I will try to give a somewhat chronological approach to get your server running again. I do experience this issue from time to time, but thankfully it is pretty rare. That makes it a bit harder to troubleshoot though.

Warning: this post contains last-ditch attempts and other dangerous stuff that could destroy your server. Use at your own risk. If you do not understand how to perform the procedures listed below, you should not attempt them on your own. Especially not in production.

First you wait, then you wait some more

Some updates may take a very long time to complete. More so if the server is an underpowered VM. Thus, it is worth letting it roll overnight just in case it is really slow. Another trick is to send a Ctrl+Alt+Del to the server. Sometimes that will cancel the stuck update, allowing the boot sequence to continue.

Then you poke around in the hardware

Hardware errors can cause all kinds of issues during the update process. If you are experiencing this issue on a physical server, check any relevant ILO/IDRAC/IMM/BMC logs, and visit the server to check for warning lights. A quick memory test would also be good, as memory failures are one of the most prevalent physical causes of such problems.

If that does not help, blame the software

It was Windows that got us into this mess in the first place, so surely now is the time to point the finger of blame at the software side?

Try booting into Safe Mode. If you are lucky the updates will finish installing in safe mode, and all you have to do is reboot. If you are unlucky, there are a few ways to make Windows try to roll back the updates (spelled out as commands after the list):

  • Delete C:\Windows\WinSxS\pending.xml
  • Create a blank pending.xml file in C:\Windows\WinSxS
  • Run DISM /image:C:\ /cleanup-image /revertpendingactions
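
As commands, roughly like this (a sketch for Safe Mode; these are alternatives, so try one at a time, and rename rather than delete so you have a way back):

# Alternative 1: get pending.xml out of the way
Rename-Item C:\Windows\WinSxS\pending.xml pending.xml.bak
# Alternative 2: replace it with an empty file
New-Item C:\Windows\WinSxS\pending.xml -ItemType File
# Alternative 3: let DISM revert the pending actions
DISM /image:C:\ /cleanup-image /revertpendingactions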

image
image

Then reboot. If a boot is successful, see if installing the updates in batches works better. Or just do not patch. Ever. Until you are hacked or something breaks. Just kidding, patching is a necessary evil.

Up a certain creek without a paddle…

If you are unable to enter Safe Mode, chances are the OS is pooched. I have experienced this once on Win2012R2. No matter what I did, the system refused to boot. From what I could tell, a pending change was waiting for a roll-back that required a reboot, and was thus unable to complete the cycle, ergo preventing the server from booting before it had rebooted. If that sounds crazy, well, it is. Time to re-image and restore from backup. The No. 1 suspect in my case was KB3000850, which is a composite “Roll-Up” containing lots of other updates. This may cause conflicts when Windows Update tries to install the same update twice in the same run, first as a part of the Roll-Up, and then as a stand-alone update. This is supposed to work, but it does not always.

You could try the rollback methods listed above in the recovery console. If that does not work, try running sfc /scannow /offbootdir=c:\ /offwindir=c:\windows from the recovery console. Maybe you will get lucky, but most likely you won’t…

image
image

image

On a side note, KB3000850 has been a general irksome pain in the butt. It is best installed from an offline .exe by itself in a dark room at midnight on the day before a full moon while you walk around the console in counter-clockwise circles dressed in a Techno-Mage outfit chanting “Who do you serve, and who do you trust?“.

Cluster Shared Volumes password policy

Problem

Failover Cluster validation generates a warning in the Storage section under “Validate CSV Settings”. The error message states:

Failure while setting up to run Cluster Shared Volumes support testing on node [FQDN]: The password does not meet the password policy requirements. Check the minimum password length, password complexity and password history requirements.

No failure audits in the security log, and no significant error messages detected elsewhere.

Analysis

This error was logged shortly after a change in the password policy for the Windows AD domain the cluster is a member of. The current minimum password length was set to 14 (max) and complexity requirements were enabled:

image

This is a fairly standard setup, as written security policies usually mandate a password length far exceeding 14 characters for IT staff. Thus, I already knew that the problem was not related to the user initiating the validation, as the length of his/her password already exceeded 14 characters before the enforcement policy change.

Lab tests verified that the problem was related to the Default domain password policy. Setting the policy as above makes the cluster validation fail. The question is why. Further lab tests revealed that the limit is 12 characters. That is, if you set the Minimum length to 12 characters the test will pass with flying colors as long as there are no other problems related to CSV. I still wondered why though. The problem is with the relation between the local and domain security policies of a domain joined computer. To understand this, it helps to be aware of the fact that Failover Cluster Validation creates a local user called CliTest2 on all nodes during the CSV test:

image

The local user store on a domain-joined computer is subject to the same password policies as are defined in the Default Domain GPO. Thus, when the domain policy is changed, this will also affect any local accounts on all domain-joined computers. As far as I can tell, the Failover Cluster validation process creates the CliTest2 user with a 12-character password. This has few security ramifications, as the user is deleted as soon as the validation process ends.

Solution

The solution is relatively simple to describe. You have to create a separate Password Policy for your failover cluster nodes where Minimum Password Length is set to 12 or less. This requires that you keep your cluster nodes in a separate Organizational Unit from your user and service accounts. That is a good thing to do anyway, but be aware that moving servers from one OU to another may have adverse effects.

You then create and link a GPO to the cluster node OU and set the Minimum Password Length to 12 in the new GPO. That is the only setting that should be defined in this GPO. Then check the link order for the OU and make sure that your new GPO has link order 1, or at least a lower link order than the Default Domain policy. Then you just have to run GPUPDATE /Force on all cluster nodes and try the cluster validation again.
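
After the GPUPDATE, a quick way to verify that a node actually picked up the relaxed policy is to check the effective account policy on the node itself and look for the “Minimum password length” line in the output:

net accounts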

If the above description sounds like a foreign language, please ask for help before you try implementing it. Group Policies may be a fickle fiend, and small changes may lead to huge unforeseen consequences.