Is your LAPS working as it should?

Intro

So, you have implemented LAPS, and you are wondering whether or not it is working as it should? Or at least, you should wonder about that. You see, LAPS is a solution with quite a few “moving parts”, and all of them have to work for your local administrator passwords to be randomized and rotated automatically. You need a Group Policy Client Side Extension on each and every Server and Workstation (client), you need a GPO using said extension, and you need to extend the schema and set AD permissions. If any of these are not working properly somewhere, LAPS will not work properly. The most usual problems are:

  • The GPO CSE is not deployed to some clients.
  • The GPO is not linked in all OUs where you have clients.

 Detection

We can easily check if LAPS is working for a specific client by reading the contents of the AD attributes associated with LAPS. We need access to read these properties, so all automated and manual tests mentioned henceforth have to be run by an account with permissions to read the properties. The LAPS operations guide details how you should configure the permissions. That being said, you should of course also test the permissions to make sure that only privileged users are able to read said properties. The properties are called:

  • ms-Mcs-AdmPwd stores the password as clear text.
  • ms-Mcs-AdmPwdExpirationTime stores the point in time for the next password change. The GPO checks this value when it is applied and resets the password if the time has passed.

We can use both of these to test if LAPS has been applied to a specific computer object at least once. If you do a manual test by using the Attribute Editor in AD Users and Computers you will see both. I have written PowerShell commands to automate the process based on the value of the ms-Mcs-AdmPwdExpirationTime attribute.
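
For a quick manual check of a single machine, something like this should do (SRV01 is a placeholder computer name, and you need the ActiveDirectory module and read access to the attributes):

# Read the LAPS attributes for one computer. SRV01 is a placeholder name.
Get-ADComputer -Identity SRV01 -Properties "ms-Mcs-AdmPwd", "ms-Mcs-AdmPwdExpirationTime" |
Select-Object Name, @{N='Password'; E={$_."ms-Mcs-AdmPwd"}}, @{N='ExpiryTime'; E={[DateTime]::FromFileTime($_."ms-Mcs-AdmPwdExpirationTime")}}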

List computers without LAPS

This lists all computer objects without a LAPS expiry set. Virtual cluster computer objects are excluded. The results are exported to the file C:\TEMP\NoLaps.csv.

get-adcomputer -Properties Name, operatingSystem, Description, ms-Mcs-AdmPwdExpirationTime `
-LDAPFilter "(&(!ms-Mcs-AdmPwdExpirationTime=*)(operatingSystem=Windows*)(!Description=ClusterAwareUpdate*)(!Description=Failover cluster virtual network name account))"|`
Select Name, operatingSystem, Description, ms-Mcs-AdmPwdExpirationTime| Sort-Object Name | export-csv C:\Temp\NoLaps.csv -Delimiter ";" -NoTypeInformation

List computers with expired LAPS

Lists all computer objects where LAPS has been applied at least once, but where the expiration time has passed. These are usually computers that are no longer powered on, perhaps removed but not properly deleted from AD. The results are exported to the file C:\TEMP\ExpiredLaps.csv.

$now = Get-Date
get-adcomputer -Properties Name, operatingSystem, Description, ms-Mcs-AdmPwdExpirationTime `
-LDAPFilter "(&(ms-Mcs-AdmPwdExpirationTime=*)(operatingSystem=Windows*)(!Description=ClusterAwareUpdate*)(!Description=Failover cluster virtual network name account))"|`
Select Name, operatingSystem, Description, @{N='ExpiryTime'; E={[DateTime]::FromFileTime($_."ms-Mcs-AdmPwdExpirationTime")}}| `
Where-Object ExpiryTime -lt $now| Sort-Object ExpiryTime| export-csv C:\Temp\ExpiredLAPS.csv -Delimiter ";" -NoTypeInformation 

Get LAPS Expiration date for one or more computer(s)

This command lists the expiration time for one or more computers based on an LDAP filter. The sample filter (Name=Badger*) will list all computers whose name starts with Badger. Computers where the expiration time is not set are filtered out. For more information about the LDAP filter syntax, see this link: https://social.technet.microsoft.com/wiki/contents/articles/5392.active-directory-ldap-syntax-filters.aspx

$now = Get-Date
 get-adcomputer -Properties Name, operatingSystem, Description, ms-Mcs-AdmPwdExpirationTime `
-LDAPFilter "(&(ms-Mcs-AdmPwdExpirationTime=*)(Name=Badger*))"|`
Select Name, operatingSystem, Description, @{N='ExpiryTime'; E={[DateTime]::FromFileTime($_."ms-Mcs-AdmPwdExpirationTime")}}| Sort-Object Name

Get LAPS expiration date for one or more computers, including those with no expiry set

Similar to above, but includes computer objects where the expiration time is not set. Those return 01.01.1601 01.00.00 as ExpiryTime because of the conversion of 0 from FileTime to DateTime. To put it another way, if the expiration time is reported as 01.01.1601 01.00.00 it has not been set.

$now = Get-Date
 get-adcomputer -Properties Name, operatingSystem, Description, ms-Mcs-AdmPwdExpirationTime `
-LDAPFilter "(&(!ms-Mcs-AdmPwdExpirationTime=*)(Name=Badger*))"|`
Select Name, operatingSystem, Description, @{N='ExpiryTime'; E={[DateTime]::FromFileTime($_."ms-Mcs-AdmPwdExpirationTime")}}| Sort-Object Name

Securing Windows Active Directory

This is a list of measures you can implement to increase your Windows AD Security. The list is in no way exhaustive, and some of the items overlap. Be aware that security recommendations change over time. This article was originally created 2018.01.22. If that is several years in the past when you read this, I cannot promise that all recommendations are up to date.

LAPS – Local administrator password management

Implementing LAPS ensures that all your domain-joined computers have a unique password that is changed periodically for the local administrator account. It operates as a GPO Client Side Extension, and thus requires you to install and register a DLL on each target computer. You can do this via GPO, in your VM image, or through any other software deployment solution you may use.

On the management computers and/or the DC itself, you have to add management tools and GPO Editor templates. There is a graphical user interface and a PowerShell module. The PowerShell module also includes the commands necessary to extend the AD Schema for storing the passwords and their associated expiry date.

See https://technet.microsoft.com/en-us/mt227395.aspx for details.
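
If you use the PowerShell module, the schema extension and permission setup boils down to something like this. Treat it as a sketch: the OU distinguished name and the group name are placeholders, and the operations guide covers the details.

# AdmPwd.PS ships with the LAPS management tools.
Import-Module AdmPwd.PS
# Extend the AD schema with the two LAPS attributes (run once per forest).
Update-AdmPwdADSchema
# Let computers in the OU write their own password and expiry time.
Set-AdmPwdComputerSelfPermission -OrgUnit "OU=Servers,DC=example,DC=com"
# Grant a dedicated group read access to the passwords.
Set-AdmPwdReadPasswordPermission -OrgUnit "OU=Servers,DC=example,DC=com" -AllowedPrincipals "EXAMPLE\LAPS-Password-Readers"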

Securing the built-in Administrator account

 

The built-in Administrator account in the domain should be secured. Its objectSID always ends in -500, and the account is thus easy to identify even if it has been renamed. The guidance used to be “Disable the Administrator account”, but it has been changed because some recovery scenarios require an active Administrator account. Specifically, the Administrator account is the only account able to log on when no global catalogs are online.

See https://docs.microsoft.com/en-us/windows-server/identity/ad-ds/plan/security-best-practices/appendix-d--securing-built-in-administrator-accounts-in-active-directory for details and an implementation guide. Some highlights are shown below.

Set the DOMAIN\Administrator account as sensitive and require smart card

 

clip_image001

Create a GPO to prevent Domain Admins from logging on to member servers or workstations

I have gone a bit further than the guide here, adding Domain Admins and Guests for good measure. The “Local account and member of Administrators group” is related to denying local administrator accounts access to the computer from the network. More about this below.

Make sure that this GPO does not apply to domain controllers, that is, do not link it at the domain level.

clip_image002

 

Block remote access for local accounts

Add Guests, Local account and member of Administrators group, Domain Admins, Enterprise Admins and Schema Admins to the policy Computer Configuration\Windows Settings\Security Settings\Local Policies\User Rights Assignment\Deny access to this computer from the network.

clip_image003

For details, see https://blogs.technet.microsoft.com/secguide/2014/09/02/blocking-remote-use-of-local-accounts/

Disable weak ciphers for Windows Secure Channel

You can build a GPO to limit the cipher suites used by the Windows Secure Channel API, and by extension IIS. Be aware that this does not in any way limit other usage of weak ciphers. For instance, a Tomcat server running on the same computer may very well use RC4 even if you have removed it from the list of Windows Secure Channel ciphers.

The GPO is located at Computer Configuration\Administrative Templates\Network\SSL Configuration Settings\Cipher Suites.

When you enable this setting, you get a list of all the default ciphers as a long comma separated string. Which ciphers you get is dependent on the Windows version. The easiest way to edit this list is to copy the string into a text editor. You can change the order to change the priority and remove weak ciphers.

clip_image004

clip_image005
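
If you want a starting point for that string, a rough sketch like this can dump the currently enabled suites and drop the obviously weak ones. It assumes Windows 8.1/2012 R2 or later, where the TLS cmdlets are available; review the resulting list before pasting it into the GPO.

# List the cipher suites currently enabled in Schannel.
$suites = (Get-TlsCipherSuite).Name
# Drop RC4, 3DES, DES and NULL suites and build the comma separated string
# expected by the Cipher Suites GPO setting.
($suites | Where-Object { $_ -notmatch 'RC4|3DES|DES_CBC|NULL' }) -join ','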

 

Do not allow local users to run remote elevated sessions

Do not apply this fix: https://support.microsoft.com/en-us/help/951016/description-of-user-account-control-and-remote-restrictions-in-windows

That is, do not create the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\LocalAccountTokenFilterPolicy value, and if it exists, make sure it is set to 0. We could of course create a GPO to enforce this setting.

clip_image006

For details, see https://www.harmj0y.net/blog/redteaming/pass-the-hash-is-dead-long-live-localaccounttokenfilterpolicy/
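
A quick way to check a single server, and to reset the value to 0 only if someone has already created it, might look like this sketch:

$key = 'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System'
# Read the value if it exists; do not create it if it is missing.
$value = Get-ItemProperty -Path $key -Name LocalAccountTokenFilterPolicy -ErrorAction SilentlyContinue
if ($value -and $value.LocalAccountTokenFilterPolicy -ne 0) {
    # 0 keeps the remote UAC restrictions for local accounts enabled.
    Set-ItemProperty -Path $key -Name LocalAccountTokenFilterPolicy -Value 0
}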

Set a password policy and a lockout policy

 

  • Minimum password length: 8 characters. Encourage users to create passwords with a random length between 8 and 20 characters. You want your users to have passwords that vary in length. If you set this limit to 14, chances are all passwords are exactly 14 characters long. This makes it a lot easier to crack them.
  • Complexity not required. If you require complexity, users tend to add numbers and capitals at the start and end of the password.
  • Password history: 10.
  • Maximum password age: 0, that is, passwords never expire. Too frequent password changes may lead to bad password diversity and predictable passwords. Leaked passwords are almost always exploited immediately, so there is no point in forcing a monthly password change. If you must, set the maximum age to one year. Urge users to choose new passwords that are completely different from the previous passwords. That is, do not use MypassWord1, MypasswOrd2 and so on.
  • Do not enable the reversible encryption option. Ever. Just don’t.
  • Lockout policy: Locked for 24 hours after five unsuccessful attempts. A rough PowerShell equivalent of these settings is sketched below the screenshots.

clip_image007

clip_image008
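
The sketch below shows a rough PowerShell equivalent of the settings above, using the placeholder domain name example.com. Note that these settings normally live in the Default Domain Policy GPO; the cmdlet writes directly to the domain object, so a conflicting GPO may overwrite it later. Adjust the values to your own requirements.

# Rough equivalent of the password and lockout settings above.
Set-ADDefaultDomainPasswordPolicy -Identity example.com `
    -MinPasswordLength 8 `
    -ComplexityEnabled $false `
    -PasswordHistoryCount 10 `
    -MaxPasswordAge ([TimeSpan]::Zero) `
    -ReversibleEncryptionEnabled $false `
    -LockoutThreshold 5 `
    -LockoutDuration (New-TimeSpan -Hours 24) `
    -LockoutObservationWindow (New-TimeSpan -Hours 24)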

For background information, see:

Enforce SMB Signing and disable SMB1

 

Enforce signing

You can enforce signing on both the server and client side. The server side is shown below. Be aware that some services require this setting to be disabled. If you have such services, create an overriding GPO for those servers only, leaving SMB signing on in the rest of the domain.

 

clip_image009

See https://technet.microsoft.com/en-us/library/cc731957(v=ws.11).aspx
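
On a single server running Windows Server 2012 or later you can check and enforce the same setting with the SMB cmdlets; a quick sketch (the GPO is still the right tool for domain-wide enforcement):

# Check the current server-side signing configuration.
Get-SmbServerConfiguration | Select-Object EnableSecuritySignature, RequireSecuritySignature
# Require signing on the server side.
Set-SmbServerConfiguration -RequireSecuritySignature $true -Force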

Disable SMB1

You have to create some registry-based GPO settings. Details are at the link below. Be aware that legacy clients like Windows XP will be dependent on SMBv1 on Domain Controllers to access the Sysvol share. The recommendation is still to disable SMBv1 everywhere.

 

clip_image010

 

See https://blogs.technet.microsoft.com/staysafe/2017/05/17/disable-smb-v1-in-managed-environments-with-ad-group-policy/ for details.
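
For a single server running Windows Server 2012 or later, here is a hedged sketch of the same check and cleanup from PowerShell (the GPO/registry method in the link is what scales):

# Is the SMBv1 server still enabled?
Get-SmbServerConfiguration | Select-Object EnableSMB1Protocol
# Turn off the SMBv1 server, and remove the feature entirely where possible
# (Windows Server 2012 R2 and later).
Set-SmbServerConfiguration -EnableSMB1Protocol $false -Force
Uninstall-WindowsFeature -Name FS-SMB1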

Create new computer objects in a separate OU, not in the Computers container

 

Thus you can delegate permissions to manage them, and you can apply GPOs to newly added computers. You redirect the default container with the redircmp command-line tool.

  • Log on to a domain controller.
  • Start an administrative CMD-shell
  • Execute redircmp [distinguished name of the new OU]

clip_image011

You can verify or check this setting using PowerShell:

Get-ADDomain | Select-Object ComputersContainer
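
For example, redirecting new computer objects to a staging OU could look like this. The distinguished name is a placeholder for your own environment.

redircmp "OU=Staging,DC=example,DC=com"
# Re-run the Get-ADDomain check above to confirm that ComputersContainer now points at the new OU.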

Limit the number of domain admins

Domain admin accounts should only be used for domain administration tasks, and you should not have many of them. Do not use service accounts with domain admin access.

A common recommendation is to have no more than five domain admin accounts.

Avoid explicit permissions, prefer group permissions

 

All permissions in AD should preferably be given to groups, not individual users. This makes it a lot easier to manage permissions, and it is also easier to see what permissions a user has based on which groups he is a member of. That is, if you follow this principle. There will always be exceptions, but they should be few and far between.

Limit the number of people with delegated access to AD

 

AD administration tasks can be delegated. For instance, your service desk could be allowed to reset passwords and create users without full domain admin access. It is important to limit these delegations and keep tabs on them.

Use dedicated domain controllers

 

  • Make sure that your domain has at least two domain controllers.
  • If they are virtual, they should not be on the same cluster. Preferably you should have at least one dedicated physical domain controller.
  • Do not install anything on your domain controllers, with the exception of backup agents, antivirus software, monitoring agents and software deployment agents.
  • Do not enable the Hyper-V role on your physical domain controllers to run other software in VMs.
  • Make sure you have a system state backup of your domain controllers.

No trusts between domains

 

Avoid using forests and trusts between domains. Domains that trust each other have to be treated as a single security context (e.g. dev, test, production, management etc.), and thus you only really need one domain per security context unless you want to divide it further based on departments or divisions.

Enforce the Windows firewall

 

Make sure that Windows Firewall is turned on. There are many ways to do this, e.g. SCCM or GPO.

Install antivirus software on all servers and workstations

 

And make sure that it is activated and up to date. SCCM enables you to monitor and manage the default Windows Defender antivirus. Most commercial AntiVirus software comes with some kind of centralized management and monitoring tool.

Log out RDP sessions after 24 hours

 

Remote Desktop server sessions that are idle or disconnected for 24 hours should be logged out automatically. Truly active sessions are allowed to run for 5 days.

The GPO settings are located at:

Computer Configuration\Administrative Templates\Windows Components\Remote Desktop Services\Remote Desktop Session Host\Session Time limits

  • Set time limit for disconnected sessions: 1 day.
  • Set time limit for active but idle RDS sessions: 1 day.
  • Set time limit for active RDS sessions: 5 days.
  • End session when time limits are reached: Enabled
  • Set time limit for logoff of RemoteApp sessions: 1 day.

Group Managed Service accounts and Managed Service Accounts

Enable the domain for group managed service accounts, and encourage their use on supported services.

https://docs.microsoft.com/en-us/windows-server/security/group-managed-service-accounts/group-managed-service-accounts-overview

https://blogs.msdn.microsoft.com/markweberblog/2016/05/25/group-managed-service-accounts-gmsa-and-sql-server-2016/
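
Setting up a group managed service account might look roughly like this. The account, host group and domain names are placeholders, and the KDS root key is a one-time preparation per forest.

# One-time preparation: the forest needs a KDS root key. Backdating it by 10
# hours is a lab shortcut; in production, use -EffectiveImmediately and wait
# for replication before creating accounts.
Add-KdsRootKey -EffectiveTime (Get-Date).AddHours(-10)
# Create the gMSA and allow a group of hosts to retrieve its password.
New-ADServiceAccount -Name svc-example -DNSHostName svc-example.example.com -PrincipalsAllowedToRetrieveManagedPassword "APP-Servers"
# On each server that will run the service:
Install-ADServiceAccount -Identity svc-example
Test-ADServiceAccount -Identity svc-example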

Microsoft Update with PSWindowsUpdate 2.0

Preface

This is an update to my previous post about PSWindowsUpdate located here: https://lokna.no/?p=2132. The content is pretty much the same, but updated for PSWindowsUpdate 2.0.

Most of my Windows servers are patched by WSUS, SCCM or a similar automated patch management solution at regular intervals. But not all. Some servers are just too important to be autopatched. This is a combination of SLA requirements making downtime difficult to schedule and the sheer impact of a botched patch run on backend servers. Thus, a more hands-on approach is needed. In W2012R2 and far back this was easily achieved by running the manual Windows Update application. I ran through the process in QA, let it simmer for a while and went on to repeat the process in production if no nefarious effects were found during testing. Some systems even have three or more staging levels. It is a very manual process, but it works, and as we are required to hand-hold the servers during the update anyway, it does not really cost anything. Then along came Windows Server 2016. Or Windows 10 I should really say, as the Update-module in W2016 is carbon copied from W10 without changes. It is even trying to convince me to install W10 Creators update on my servers…

clip_image001

In Windows Server 2016 the lazy bastards at Microsoft just could not be bothered to implement the functionality from W2012R2 WU. It is no longer possible to defer specific updates I do not want, such as the stupid Silverlight mess. If I want Microsoft update, then I have to take it all. And if I should become slightly insane and suddenly decide I want driver updates from WU, the only way to do that is to go through device manager and check every single device for updates. Or install WUMT, a shady custom WU client of unknown origin.

I could of course use WSUS or SCCM to push just the updates I want, but then I have to magically imagine what updates each server wants and add them to an ever growing number of target groups. Every time I have a patch run. Now that is expensive. If I had enough of the “special needs” servers to justify the manpower-cost, I would have done so long ago. Thus, another solution was needed…

PSWindowsUpdate to the rescue. PSWindowsUpdate is a PowerShell module written by a user called MichalGajda that enables management of Windows Update through PowerShell. You can find it here: https://www.powershellgallery.com/packages/PSWindowsUpdate/2.0.0.0. In this post I go through how to install the module and use it to run Microsoft Update in a way that resembles the functionality from W2012R2. You could tell the module to install a certain list of updates, but I found it easier to hide the unwanted updates. It also ensures that they are not added by mistake with the next round of patches.

Getting started

(See the following chapters for details; a rough sketch of the commands is shown after this list.)

  • You should of course start by installing the module. This should be a one-time deal, unless a new version has been released since last time you used it. New versions of the module should of course be tested in QA like any other software.
  • Then, make sure that Microsoft Update is active.
  • Check for updates to get a list of available patches.
  • Hide any unwanted patches
  • Install the updates
  • Re-check for updates to make sure there are no “round-two” patches to install.
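
A rough sketch of that workflow, assuming PSWindowsUpdate 2.0; check the module documentation for your version, and note that the KB number is just an example placeholder.

# Install the module from the PowerShell Gallery (one-time).
Install-Module PSWindowsUpdate
# Register Microsoft Update as an update source (well-known service ID).
Add-WUServiceManager -ServiceID "7971f918-a847-4430-9279-4a52d1efe18d"
# List available updates from Microsoft Update.
Get-WindowsUpdate -MicrosoftUpdate
# Hide an unwanted update (the KB number is a placeholder).
Hide-WindowsUpdate -KBArticleID KB4010250
# Install everything that is left, then re-check for round-two patches.
Install-WindowsUpdate -MicrosoftUpdate -AcceptAll
Get-WindowsUpdate -MicrosoftUpdate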

Continue reading “Microsoft Update with PSWindowsUpdate 2.0”

Event 20501 Hyper-V-VMMS

Problem

The following event is logged non-stop in the Hyper-V High Availability log:

Log Name:      Microsoft-Windows-Hyper-V-High-Availability-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          27.07.2017 12.59.35
Event ID:      20501
Task Category: None
Level:         Warning
Description:
Failed to register cluster name in the local user groups: A new member could not be added to a local group because the member has the wrong account type. (0x8007056C). Hyper-V will retry the operation.

image

Analysis

I got this in as an error report on a new Windows Server 2016 Hyper-V cluster that I had not built myself. I ran a full cluster validation report, and it returned this warning:

Validating network name resource Name: [Cluster resource] for Active Directory issues.

Validating create computer object permissions was not run because the Cluster network name is part of the local Administrators group. Ensure that the Cluster Network name has “Create Computer Object” permissions.

I then checked AD, and found that the cluster object did in fact have the Create Computer Object permissions mentioned in the message.

The event log error refers to the cluster computer object being a member of the local admins group. I checked, and found that this was the case. The nodes themselves were also added as local admins on all cluster nodes. That is, the computer objects for node 1, 2 and so on were members of the local admins group on all nodes. My records show that this practice was necessary when using SOFS storage in 2012. It is not necessary for Hyper-V clusters using FC-based shared storage.

The permissions needed to create a cluster in AD

  • Local admin on all the cluster nodes
  • Create computer objects on the Computers container, the default container for new computers in AD. This can be changed, in which case you need permissions in the new container.
  • Read all properties permissions in the Computers container.
  • If you specify a specific OU for the cluster object, you need permissions in this OU in addition to the new objects container.
  • If your nodes are located in a specific OU, and not the Computers OU, you will also need permissions in the specific OU as the cluster object will be created in the OU where the nodes reside.

See Grant create computer object permissions to the cluster for more details.

Solution

As usual, a warning: If you do not understand these tasks and their possible ramifications, seek help from someone that does before you continue.

Solution 1, low impact

If it is difficult to destroy the cluster as it requires the VMs to be removed from the cluster temporarily, you can try this method. We do not know if there are other detrimental effects caused by not having the proper permissions when creating the cluster.

  • Remove the cluster object from the local admin on all cluster nodes.
  • Remove the cluster nodes from the local admin group on all nodes. (A PowerShell sketch of these first two steps follows after the list.)
  • Make sure that the cluster object has create computer objects permissions on the OU in which the cluster object and nodes are located
  • Make sure that the cluster object and the cluster node computer objects are all located in the same OU.
  • Validate the cluster and make sure that it is all green.
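
A rough sketch of the group membership cleanup in the first two steps, to be run on every node. Remove-LocalGroupMember is available on Windows Server 2016, and the cluster and node account names are placeholders (note the trailing $ on computer accounts).

# Placeholders for the cluster and node computer accounts.
$members = 'CONTOSO\HVCLUSTER$', 'CONTOSO\HVNODE1$', 'CONTOSO\HVNODE2$'
foreach ($member in $members) {
    # SilentlyContinue in case the account is not a member on this node.
    Remove-LocalGroupMember -Group 'Administrators' -Member $member -ErrorAction SilentlyContinue
}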

Solution 2, high impact

Shotgun approach, removes any collateral damage from failed attempts at fixing the problem.

  • Migrate any VMs away from the cluster
  • Remove the cluster from VMM if it is a member.
  • Remove the “Create computer objects” permissions for the cluster object
  • Destroy the cluster.
  • Delete the cluster object from AD
  • Re-create the cluster with the same name and IP, using a domain admin account.
  • Add create computer objects and read all properties permissions to the new cluster object in the current OU. 
  • Validate the cluster and make sure it is all green.
  • Add the server to VMM if necessary.
  • Migrate the VMs back.

Primary replica is not joined to the Availability group, or clusters past morphing into clusters present

Problem

I was upgrading an Availability group from SQL 2012 on Win 2012R2 to SQL 2016 on Win2016. I had expected to create the new AOAG as a separate cluster and move the data manually, but the users are always complaining when I want to use my allotted downtime quotas, so I decided to try a rolling upgrade instead. This post is a journal of some of the perils I encountered along the way, and how I overcame them. There were countless others, but most of them were related to crappy hardware, wrong hardware being delivered, missing LUNS on the SAN, delusional people who believe they can lock out DBAs from supporting systems, dragons, angry badgers, solar flares and whichever politician you dislike the most. Anyways, on with the tale of clusters past morphing into clusters present…

I started with adding the new node to the failover cluster. This went surprisingly well, in spite of the old servers being at least two generations older than my new rack servers. Sadly, both the new and the old servers are made by the evil wizards behind the silver slanted E due to factors outside of my control. But I digress. The cluster join went flawlessly. There were some yellow complaints about the nodes not having the same OS version in the cluster validation scroll, but everything worked.

Then came adding the new server as a replica in the availability group. This is done from the primary replica, and I just uttered a previously prepared spell from the book of disaster recovery belonging to this cluster, adding the name of the new node. As far as I can remember this is just the result of the standard “Add replica” wizard. The spell ran without complaints, and my new node was online.

This is the point where it all went to heck in a small hand-basket carried by an angry badger. I noticed a yellow warning next to the new node in the AOAG dashboard. But as the databases were all in the synchronizing state on the new replica, I believed this to be a note complaining about the OS-version. I was wrong. In my ignorance, I failed over to the new node and had the application  team minions run some tests. They came back positive, so I removed the old nodes in preparation for adding the last one. I even ran the Update-ClusterFunctionalLevel Powershell command without issues. But the warning persisted. This is the contents of the warning:

Availability replica not joined.

SNAGHTML57bbb24f

And it was no longer a lone warning, the AOAG dashboard did not look pretty as both the old nodes refused to accept the new node as their new primary replica.

Analysis

As far as I can tell, the join AOAG script failed in some way. It did not report any errors, but still, there is no doubt that something did go wrong.

The solution as reported by MSDN is simple, just join the availability group by casting the “alter availability group groupname join” spell from the secondary replica that is not joined. The attentive reader has probably already realized that this is the primary replica, and as you probably suspect, the aforementioned command fails.

Casting the following spell lists the replicas and their join state: “select join_state, join_state_desc from sys.dm_hadr_availability_replica_cluster_states”. This is the result:

image

In some way I have put the node in an invalid state. It still works perfectly, but I guess it is only a question of when, not if, this issue will grow into a bigger problem.

Solution

With such an elaborate backstory, you would not be wrong to expect an equally elaborate solution. Whether or not it is, is really in the eye of the beholder.

Just the usual note of warning first: If you are new to availability groups, and all this cluster stuff sounds like the dark magic it is, I would highly suggest that you do not try to travel down the same path as me. Rather, you should turn around at the entrance and run as fast as you can into the relative safety of creating another cluster alongside the old one. Then migrate the data by backing up on the old cluster and restoring on the new cluster. And if backups and restores on availability groups sounds equally scary, then ask yourself whether or not you are ready to run AOAG in production. In contrast to what is often said in marketing materials and at conferences, AOAG is difficult and scary to the beginner. But there are lots of nice training resources out there, even some free ones.

Now, with the warnings out of the way, here is what ended up working for me. I tried a lot of different solutions, but I was bound by the following limitation: The service has to be online. That translates to no reboots, no AOAG-destroy and recreate, no cluster rebuilds and so on. A combination of which would probably have solved the problem in less than an hour of downtime. But I was allowed none, so this is what I did:

  • Remove any remaining nodes and replicas that are not Win2016 SQL2016.
  • Run the Powershell command Update-ClusterFunctionalLevel to make sure that the cluster is running in Win2016 mode.
  • Build another Win 2016 SQL 2016 node
  • Join the new node to the cluster
  • Make sure that the cluster validation scroll seems reasonable. This is a fluffy point I know, but there are way too many variables to make an exhaustive list. https://lokna.no/?p=1687 mentions some of the issues you may encounter.
  • Join the new node to the availability group as a secondary replica.
  • Fail the availability group over to the new node (make sure you are in synchronous commit mode for this step).
  • Everything is OK.

image

  • Fail back to the first node
  • Change back to asynchronous commit (if that is your default mode, otherwise leave it as synchronous). A rough sketch of the cluster-side commands is shown below.
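
For reference, here is a minimal sketch of the cluster-side steps using the FailoverClusters module and placeholder node names; the availability group work itself is done from SQL Server as described above.

Remove-ClusterNode -Name 'SQLNODE1'   # retire an old Win2012R2 node
Update-ClusterFunctionalLevel          # once all remaining nodes run Win2016
Add-ClusterNode -Name 'SQLNODE3'       # add the new Win2016/SQL2016 node
Test-Cluster                           # validate and review the report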

 

Thus I have successfully upgraded a 2-node AOAG cluster from Win2012R2 and SQL 2012 to Win2016 and SQL 2016 with three failovers as the only downtime. In QA. Production may become an interesting journey, IF the change request is approved. There may be an update if I survive the process…

 

Update and final notes

I have now been through the same process in production, with similar results. I do not recommend doing this in production; the normal migration to a new cluster is far preferable, especially when you are crossing 2 SQL Server versions on the way. Then again, if the reduced downtime is worth the risk…

Be aware that a failover to a new node is a one way process. Once the SQL 2016 node becomes the primary replica, the database is updated to the latest file format, currently 852 whereas SQL 2012 is 706. And as far as I can tell from the log there is a significant number of upgrades to be made. See http://sqlserverbuilds.blogspot.no/2014/01/sql-server-internal-database-versions.html for a list of version numbers.

image

Microsoft Update with PSWindowsUpdate

Preface

Most of my Windows servers are patched by WSUS, SCCM or a similar automated patch management solution at regular intervals. But not all. Some servers are just too important to be autopatched. This is a combination of SLA requirements making downtime difficult to schedule and the sheer impact of a botched patch run on backend servers. Thus, a more hands-on approach is needed. In W2012R2 and far back this was easily achieved by running the manual Windows Update application. I ran through the process in QA, let it simmer for a while and went on to repeat the process in production if no nefarious effects were found during testing. Some systems even have three or more staging levels. It is a very manual process, but it works, and as we are required to hand-hold the servers during the update anyway, it does not really cost anything. Then along came Windows Server 2016. Or Windows 10 I should really say, as the Update-module in W2016 is carbon copied from W10 without changes. It is even trying to convince me to install W10 Creators update on my servers…

clip_image001

In Windows Server 2016 the lazy bastards at Microsoft just could not be bothered to implement the functionality from W2012R2 WU. It is no longer possible to defer specific updates I do not want, such as the stupid Silverlight mess. If I want Microsoft update, then I have to take it all. And if I should become slightly insane and suddenly decide I want driver updates from WU, the only way to do that is to go through device manager and check every single device for updates. Or install WUMT, a shady custom WU client of unknown origin.

I could of course use WSUS or SCCM to push just the updates I want, but then I have to magically imagine what updates each server wants and add them to an ever growing number of target groups. Every time I have a patch run. Now that is expensive. If I had enough of the “special needs” servers to justify the manpower-cost, I would have done so long ago. Thus, another solution was needed…

PSWindowsUpdate to the rescue. PSWindowsUpdate is a PowerShell module written by a user called MichalGajda on the TechNet gallery that enables management of Windows Update through PowerShell. In this post I go through how to install the module and use it to run Microsoft Update in a way that resembles the functionality from W2012R2. You could tell the module to install a certain list of updates, but I found it easier to hide the unwanted updates. It also ensures that they are not added by mistake with the next round of patches.

Getting started

(See the following chapters for details.)

  • You should of course start by installing the module. This should be a one-time deal, unless a new version has been released since last time you used it. New versions of the module should of course be tested in QA like any other software.
  • Then, make sure that Microsoft Update is active.
  • Check for updates to get a list of available patches.
  • Hide any unwanted patches
  • Install the updates
  • Re-check for updates to make sure there are no “round-two” patches to install.

Continue reading “Microsoft Update with PSWindowsUpdate”

No Microsoft Update

Problem

I was preparing to roll out SQL Server 2016 and Windows Server 2016 and had deployed the first server in  production. I suddenly noticed that even if I selected “Check online for updates from Microsoft Update” in the horrible new update dialog, I never got any of the additional updates. Btw, this link/button only appears when you have an internal SCCM or WSUS server configured. Clicking the normal Check For Updates button will get updates from WSUS.

image

 

Analysis

This was working as expected in the lab, but the lab does not have the fancy System Center Configuration Manager and WSUS systems. So of course I blamed SCCM and uninstalled the agent. But to no avail, still no updates. I lurked around the update dialog and found that the “Give me updates for other Microsoft products..” option was grayed out and disabled. I am sure that I checked this box during installation, as I remember looking for its location. But it was no longer selected, it was even grayed out.

image

This smells of GPOs. But I also remembered trying to get this option checked by a GPO to save time during installation, and that it was not possible to do so in Win2012R2. Into the Group Policy Manager of the lab DC I went…

It appears that GPO management of the Microsoft Update option has been added in Win2016:

image

This option is not available in Win2012R2, but as we have a GPO that defines “Configure Automatic Updates”, it defaults to disabled.

Solution

Alternative 1: Upgrade your domain controllers to Win2016.

Alternative 2: Install the Win2016 .admx files on all your domain controllers and administrative workstations.

Then, change the GPO to ensure that “Install updates for other Microsoft products” is enabled. Selecting option 3 (Auto download) used to be a safe setting.

Alternative 3: Remove the GPO or set “Configure Automatic Updates” to “Not Configured”, thus allowing local configuration.

Cluster Quorum witness

Introduction

Since W2012R2 it is recommended that all clusters have a quorum witness regardless of the number of cluster nodes. As you may know, the purpose of the cluster witness is to ensure a majority vote in the cluster. If you have 2 nodes with one vote each and add a cluster witness you create a possibility for a majority vote. If you have 3 nodes on the other hand, adding a witness will remove the majority vote as you have 4 votes total and a possible stalemate.

If a stalemate occurs, the cluster nodes may revolt and you are unable to get the cluster working without a forced quorum, or you could take a node out behind the barn and end its misery. Not a nice situation at all. W2012R2 solves this predicament with dynamic vote assignment. As long as a quorum has been established, if votes disappear due to nodes going offline, it will turn the witness vote on and off to make sure that you always have a possibility for node majority. As long as you HAVE a witness, that is.

There are three types of witnesses:

  • A SAN-connected shared witness disk, usually FC or iSCSI. Recommended for clusters that use shared SAN-based cluster disks for other purposes, otherwise not recommended. If this sounds like gibberish to you, you should use another type of witness.
  • A File share witness. Just a file share. Any type of file share would do, as long as it resides on a Windows server in the same domain as the cluster nodes. SOFS shares are recommended, but not necessary. DO NOT build a SOFS cluster for this purpose alone. You could create a VM for cluster witnesses, as each cluster witness is only about 5MiB, but it is best to find an existing physical server with a high uptime requirement in the same security zone as the cluster and create some normal SMB-shares there. I recommend a physical server because a lot of virtual servers are Hyper-V based, and having the disk witness on a vm in the cluster it is a witness for is obviously a bad idea.
  • Cloud Witness. New in W2016. If you have an Azure storage account and are able to allow the cluster nodes a connection to Azure, this is a good alternative. Especially for stretch clusters that are split between different rooms.

How to set up a simple SMB File share witness

  • Select a server to host the witness, or create one if necessary.
  • Create a folder somewhere on the server and give it a name that denotes its purpose:
  • image
  • Open the Advanced Sharing dialog
  • image
  • Enable sharing and change the permissions. Make sure that everyone is removed, and add the cluster computer object. Give the cluster computer object full control permissions
  • image
  • Open Failover Cluster manager and connect to the cluster
  • Select “Configure Cluster Quorum Settings:
  • image
  • Choose Select The Quorum Witness
    image

  • Select File Share Witness

  • image

  • Enter the path to the file share as \\server\share

  • image

  • Finish the wizard

  • Make sure the cluster witness is online:

  • image

  • Done! (A PowerShell alternative is sketched below.)
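
The same result can be had from PowerShell. Here is a rough sketch with placeholder server, share and cluster names; remember that the cluster computer account (note the trailing $) needs both share and NTFS permissions.

# On the file server: create the folder and share it.
New-Item -Path 'D:\ClusterWitness\CLUSTER01' -ItemType Directory
New-SmbShare -Name 'CLUSTER01-Witness' -Path 'D:\ClusterWitness\CLUSTER01' -FullAccess 'CONTOSO\CLUSTER01$'
# On a cluster node: point the quorum configuration at the share.
Set-ClusterQuorum -Cluster CLUSTER01 -NodeAndFileShareMajority '\\FILESERVER01\CLUSTER01-Witness'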

Running out of time in the lab

First a friendly warning; This post details procedures for messing with the time service on domain controllers. As always, if you do not understand the commands or their consequences; seek guidance.

Problem

I have been upgrading my lab to Windows Server 2016 in preparation for a production rollout. Some may feel I am late to the game, but I have always been reluctant to roll out new server operating systems quickly. I prefer to have a good baseline of other people’s problems to look for in your friendly neighborhood tracking service (AKA search engine) when something goes wrong.

Anyways, some weeks ago I rolled out 2016 on my domain controller. When I came back to upgrade the Hyper-V hosts, I noticed time was off by 126 seconds between the DC and the client. As the clock on the DC was correct, I figured the problem was client related. Into the abyss of w32tm we go.

Analysis

The Windows Time Service is not exactly known for its user friendliness, so I just started with the normal shotgun approach at the client:

net stop w32time
w32tm /config /syncfromflags:domhier
net start w32time

These commands, if executed at an administrative command prompt, will remind the client to get its time from the domain time sync hierarchy, in other words one of the DCs. If possible. Otherwise it will just let the clock drift until it passes the time delta maximum, at which time it will not be able to talk to the DC any more. This is usually the point when your friendly local monitoring system will alert you to the issue. Or your users will complain. But I digress.

Issuing a w32tm /resync command afterwards should guarantee an attempt to sync, and hopefully a successful result. At least in my dreams. In reality though, it just produced another nasty error:  0x800705B4. The tracking service indicated that it translates to “operation timed out”. 

The next step was to try a stripchart. The stripchart option instructs w32tm to query a given computer and show the time delta between the local and remote computer. Kind of like ping for time servers. The result should look something like this:

SNAGHTMLbf7b59

But unfortunately, this is what I got:

image

I shall spare you the details of all the head-scratching and ancient Viking rituals performed at the poor client to no avail. Suffice it to say that I finally realized the problem had to be related to the DC upgrade. I tried running the stripchart from the DC itself against localhost, and that failed as well. That should have been a clue that something was wrong with Time Service itself. But as troubleshooting the Time Service involves decoding its registry keys, I went to confirm the firewall rules instead. Which of course were hunky-dory.

image

I then ran dcdiag /test:advertising /v to check if the server was set to advertise as a time server:

image

 

The next step was to reset the configuration for the Time Service. The official procedure is as follows:

net stop w32time
w32tm.exe /unregister
w32tm.exe /register
net start w32time

This procedure usually ends with some error message complaining about the service being unable to start due to some kind of permission issue with the service. I seem to remember error 1902 is one of the options. If this happens, first try 2 consecutive reboots. Yes, two. Not one. Don’t ask why, no one knows. If that does not help, try again but this time with a reboot after the unregister command.

The procedure ran flawlessly this time, but it did not solve the problem.

Time to don the explorer’s hat and venture into the maze of the registry. The Time Service hangs out in HKLM\System\CurrentControlSet\Services\W32Time. After some digging around, I found that the Enabled value under the NtpServer time provider subkey was set to 0. Which would suggest that the NTP server was turned off. I mean, registry settings are tricksy, but there are limits. I tried changing it to 1 and restarted the service.

image
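
For reference, the same change from PowerShell, assuming (as in my case) that the value lives under the NtpServer time provider subkey:

# Enable the NTP server provider and restart the time service.
$key = 'HKLM:\SYSTEM\CurrentControlSet\Services\W32Time\TimeProviders\NtpServer'
Set-ItemProperty -Path $key -Name Enabled -Value 1
Restart-Service w32time
# Verify that the DC advertises as a time server again.
w32tm /query /configuration
dcdiag /test:advertising /v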

Suddenly, everything works. The question is why… Not why it started working, but why the setting was changed to 0. I am positive time sync was working fine prior to the upgrade. Back to the tracking service I went. Could there be a new method for time sync in Windows 2016? Was it all a big conspiracy caused by Russian hackers in league with Trump? Of course not. As usual the culprits are the makers of the code.

Solution

My scenario is not a complete match, but in KB3201265 Microsoft admits to having botched the upgrade process for Windows Time Service in both Windows Server 2016 and the corresponding Windows 10 1607. Basically, it applies default registry settings for a non-domain-joined server. Optimistic as always they tell you to export the registry settings for the service PRIOR to upgrading. As if I have the time to read every KB they publish. Anyways, it also details a couple of other possible solutions, such as how to get the previous registry settings out of Windows.old.

My recommendation is as such: Do not upgrade your domain controllers. Especially not in production. I only did it in the lab because I wanted to save time.

If you, like me, have put yourself in this situation, and honestly, why else would you have read this far, I recommend following method 3 in KB3201265. Unless you feel comfortable exploring the registry and fixing it manually.

Multiple default gateways

Update 2018-03-07

I have verified this as an issue on Windows 2016 as well. Sometimes if a network adapter has been configured with a default gateway before it is added to a NIC Team, you will get multiple default gateways.

Problem

While troubleshooting a network teaming issue on a cluster, someone sent me a link to this article about multiple default gateways on Win 2012 native teaming: http://www.concurrency.com/blog/bug-in-nic-teaming-wizard-makes-duplicate-default-routes-in-server-2012/. The post discusses a pretty specific scenario that we didn’t have on our clusters (most of them are on 2008R2), but I discovered several nodes with more than one default route in route print:

image

The issue I was looking into was another, but I remembered a problem from a weekend some months ago that might be related: When a failover was triggered on a SQL cluster, the cluster lost communication with the outside world. To be specific: no traffic passed through the default gateway. As all cluster nodes were on the same subnet the cluster itself was content with the situation, but none of the webservers were able to communicate with the clustered SQL server as they were in a different subnet. This made the webservers sad and the webmaster angry, so we had to fix it. As this happened in production over the weekend, the focus was on a quick fix and we were unable to find a root cause at the time. A reboot of the cluster nodes did the trick, and we just wrote it off as fallout from the storage issue that triggered the failover. The discovery of multiple default gateways on the other hand prompted a more thorough investigation.

Analysis

The article mentioned above talks exclusively about Windows 2012’s native teaming software, but this cluster is running Windows 2008 R2 and is relying on teaming software provided by the NIC manufacturer (Qlogic). We have had quite a lot of problems with the Qlogic network adapters on this cluster, so I immediately suspected them to be the rotten apple. I am not sure if this problem is caused by a bug in Windows itself that is present in both 2012 and 2008R2, or if both MS and Qlogic are unable to produce a functioning NIC teaming driver, but the following is clear:

If your adapters have a default gateway when you add them to a team, there is a chance that this default gateway will not get removed from the system. This happens regardless of whether the operating system is Windows 2012 or Windows 2008 R2. I am not sure if gateway addresses configured by DHCP also trigger this behavior. It doesn’t happen every time, and I have yet to figure out if there are any specific triggers, as I haven’t been able to reproduce the problem at will.

Solution A

To resolve this issue, follow the recommendations in  http://www.concurrency.com/blog/bug-in-nic-teaming-wizard-makes-duplicate-default-routes-in-server-2012/:

First you have to issue a command to delete all static routes to 0.0.0.0. NB! This will disconnect you from the server if you are connected remotely from outside the subnet.

image

Configure the default gateway for the team using IP properties on the virtual team adapter:

image

Do a route print to make sure you have only one default gateway under persistent routes.

Solution B

If solution A doesn’t work, issue a netsh interface ip reset command to reset the IP configuration and reboot the server. Be prepared to re-enter the IP information for all adapters if necessary.

What not to do

Do not configure the default gateway using route add, as this will result in a static route. If the computer is a node in a cluster, the gateway will be disabled at failover and isolate the server on the local subnet. See http://support.microsoft.com/kb/2161341 for information about how to configure static routes on clusters if you absolutely have to use a static route.