Cluster Shared Volumes password policy

Problem

Failover Cluster validation genereates a warning in the Storage section under “Validate CSV Settings”. The error message states:

Failure while setting up to run Cluster Shared Volumes support testing on node [FQDN]: The password does not meet the password policy requirements. Check the minimum password length, password complexity and password history requirements.

No failure audits in the security log, and no significant error messages detected elsewhere.

Analysis

This error was logged shortly after a change in the password policy for the Windows AD domain the cluster is a member of. The current minimum password length was set to 14 (max) and complexity requirements were enabled:

image

This is a fairly standard setup, as written security policies usually mandate a password length far exceeding 14 characters for IT staff. Thus, I already knew that the problem was not related to the user initiating the validation, as the length of his/her password already exceeded 14 characters before the enforcement policy change.

Lab tests verified that the problem was related to the Default domain password policy. Setting the policy as above makes the cluster validation fail. The question is why. Further lab tests revealed that the limit is 12 characters. That is, if you set the Minimum length to 12 characters the test will pass with flying colors as long as there are no other problems related to CSV. I still wondered why though. The problem is with the relation between the local and domain security policies of a domain joined computer. To understand this, it helps to be aware of the fact that Failover Cluster Validation creates a local user called CliTest2 on all nodes during the CSV test:

SNAGHTML62b3e8f

The local user store on a domain joined computer is subject to the same password policies as are defined in the Default Domain GPO. Thus, when the domain policy is changed this will also affect any local accounts on all domain joined computers. As far as I can tell, the Failover Cluster validation process creates the CliTest2 user with a 12 character password. This has few security ramifications, as the user is deleted as soon as the validation process ends.

Solution

The solution is relatively simple to describe. You have to create a separate Password Policy for you failover cluster nodes where Minimum Password Length is set to 12 or less. This requires that you keep your cluster nodes in a separate Organizational Unit from your user and service accounts. That is a good thing to do anyway, but be aware that moving servers from one OU to another may have adverse effects.

You then create and link a GPO to the cluster node OU and set the Minimum Password Length to 12 in the new GPO. That is the only setting that should be defined in this GPO. Then check the Link order for the OU and make sure that your new GPO has a link order 1, or at least a lower link order than the Default Domain policy. Then you just have to run GPUPDATE /Force on all cluster nodes and try the cluster validation again.

If the above description sounds like a foreign language, please ask for help before you try implementing it. Group Policies may be a fickle fiend, and small changes may lead to huge unforeseen consequences.

Networks, teaming and heartbeats for clusters

Introduction

In this guide, a fabric is a separate network infrastructure, be it SAN, WAN or LAN. A network may or may not be connected to a dedicated fabric. Some fabrics have more than one network.

The cluster nodes should be connected to each other over at least two independent networks/fabrics. The more independent the better. Ideally, the networks should share no components at all, but as a minimum they should be connected to separate NICs in the server. Ergo, if you want to use NIC teaming you should have at least 4 physical network ports on at least two separate NICs. The more the merrier, but be aware that as with all other forms of redundancy, higher redundancy equals higher complexity.

If you do not have more than one network port or only one network team, do not add an additional virtual network adapter/vlan for “heartbeat purposes”. The most prevalent network faults today are caused by someone unplugging the wrong cable, deactivating the wrong switch port or other user errors. Having separate vlans over the same physical infrastructure rarely offers any protection from this. You are better off just using the one adapter/team.

Previously, each Windows cluster needed a separate heartbeat network used to detect node failures. From Windows 2008 and newer (and maybe also on 2003) the “heartbeat” traffic is sent over all available networks between the cluster nodes unless we manually block it on specific cluster networks. Thus, we no longer need a separate dedicated heartbeat network, but adding a second network ensures that the cluster will survive failures on the primary network. Some cluster roles such as Hyper-V require multiple networks, so check what the requirements are for your specific implementation.

Quick takeaway

If you are designing a cluster and need a quick no-nonsense guideline regarding networks, here it comes:

  • If you use shared storage, you need at least 3 separate fabrics
  • If you use local storage, you need at least 2 separate fabrics

All but a few clusters I have been troubleshooting have had serious shortcomings and design failures in the networking department. The top problems:

  • Way to few fabrics.
  • Mixing storage and network traffic on the same fabric
  • Mixing internal and external traffic on the same fabric
  • Outdated faulty NIC firmware and drivers
  • Bad, poorly designed NICs from Qlogic and Emulex
  • Converged networking

Do not set yourself up for failure.

IPv6

If you haven’t implemented IPv6 yet in your datacenter, you should disable IPv6 on all cluster nodes. If you don’t, you run a high risk of unnecessary failovers due to IPv6 to IPv4 conversion mishaps on the failover cluster virtual adapter. As long as IPv6 is active on the server, the failover cluster virtual adapter will use IPv6, even if none of the cluster networks have a valid IPv6 address. This causes all heartbeat traffic to be converted to/from IPv4 on the fly, which sometimes will fail. If you want to use IPv6, make sure all cluster nodes and domain controllers have a valid IPv6 address that is not link local (fe80:), and make sure you have routers, switches and firewalls that support IPv6 and are configured properly. You will also need IPv6 dns in the active directory domain.

Disabling IPv6

Do NOT disable IPv6 on the network adapters. The protocol binding for IPv6 should be enabled:

clip_image001

Instead, use the DisabledComponents registry setting. See Disable IPv6 for details.

clip_image003

Storage networks

If you use IP-based storage like ISCSI, SMB or FCOE, make sure you do not mix it with other traffic. Dedicated physical adapters should always be used for storage traffic. Moreover, if you are one of the unlucky few using FCOE you should seriously consider converting to FC or SMB3.

Hyper-V networks

In a perfect world, you should have six or more separate networks/fabrics for Hyper-v clusters. Sadly though, the world is seldom perfect. The absolute minimum for production clusters is two networks. Using only one network in production will cause nothing but trouble, so please do not try. Determining whether or not to use teaming also complicates matters further. As a general guide, I would strongly recommend that you always have a dedicated storage fabric with HA, that is teaming or MPIO, unless you use local storage on the cluster nodes. The storage connection is the most important one in any form of cluster. If the storage connection fails, everything else falls apart in seconds. For the other networks, throughput is more important than high availability. If you have to make a choice between HA and separate fabrics, chose separate fabrics for all other networks than the storage network.

7 Physical networks/fabrics

· Internal/Cluster/CSV (if local)/Heartbeat

· Public network for VMs

· VM Host management

· Live Migration

· 2*Storage (ISCSI, FC, SMB3)

· Backup

5 Physical networks/fabrics

· Internal/Cluster/CSV (if local)/Heartbeat/Live Migration

· Public network for vm, VM guest management

· VM Host management

· 2*Storage (ISCSI, FC, SMB3)

4 Physical networks/fabrics

· Internal/Live Migration

· Public & Management

· 2*Storage

Example

clip_image004

Most blade server chasses today have a total of six fabric backplanes, grouped in three groups where each group connects to a separate adapter in the blade. Thus, each network adapter or FC HBA is connected to two separate fabrics/backplanes. The groups could be named A,B and C, with the fabrics named A1, A2, B1 and so on. Each group should have identical backplanes, that is the backplane in A1 should be the same model as the backplane in A2.

If we have Fibrechannel (FC) backplanes in group A, and 10G Ethernet backplanes in group B & C, we have several possible implementations. Group A will always be storage in this example, as FC is a dedicated storage network.

clip_image005

Here, we have teaming implemented on both B and C. Thus, we use the 4 networks configuration from above, splitting our traffic in Internal and Public/Management. This implementation may generate some conflicts during Live Migrations, but in return we get High Availability for all groups.

clip_image006

By splitting group B and C in two single ports, we get 5 fabrics and a more granulated separation of traffic at the cost of High Availability.

Hyper-V trunk adapters/teams on 2012

If you are using Hyper-V virtual switches bound to a physical port or team on you Hyper-V hosts, Hyper-V Extensible Virtual Switch should be the only bound protocol. Note: Do not change these settings manually, Hyper-V manager will change the settings automatically when you configure the virtual switch. If you bind the Hyper-V Extensible virtual switch protocol manually, creation of the virtual switch may fail.

clip_image007

Teaming in Windows 2012

In Windows 2012 we finally got native support for nic teaming. You access the nic teaming dialog from Server Manager. You can find a short description of the features here: http://technet.microsoft.com/en-us/library/hh831648.aspx, and a more detailed one here: Windows Server 2012 NIC Teaming (LBFO) Deployment and Management.

Native teaming support rids us of some of the problems related to unstable vendor teaming drivers, and makes setup of nic teaming a unified experience no matter what nics you are using. Note: never use nic teaming on ISCSI networks. Use MPIO instead.

A note on Active/Active teaming

It is possible to use active active teaming, thus aggregating the bandwidth of two or more adapters to support higher throughput. This is a fantastic technology, especially on 1G ethernet adapters where bandwidth congestion can become a problem. There is, however a snag; a lot of professional datacenters have a complete ban on active/active teaming due to years of teaming problems. I have my self been victim of unstable active/active teams, so I know this to be a real issue. I do think this is less of a problem in Windows 2012 than it was on previous versions, but there may still be configurations that just does not work. The more complex your network infrastructure is, the less likely active/active teaming is to work. Connecting all members in the team to the same switch increases the chance of success. This also makes the team dependent on a single switch of course, but if the alternative is bandwidth congestion or no teaming at all, it does not really matter.

I recommend talking to your local network specialist about teaming before creating a design dependent on active/active teaming.

Using multiple vlans per adapter or team

It has become common practice to use more than one vlan per team, or even more than one vlan per adapter. I do not recommend this for clusters, with the exception of adapters/teams connected to a Hyper-V switch. An especially stupid thing to do is mixing ISCSI traffic with other traffic on the same physical adapter. I have dealt with the aftermath of such a setup, and it does not look pretty unless data corruption is your kind of fun. And if you create a second vlan just to get an internal network for cluster heartbeat traffic on the same physical adapters you are using for client connections, you are not really achieving anything other than making your cluster more complex. The cluster validation report will even warn you about this, as it will detect more than one interface with the same MAC address.

SQLIO results parser

Ever since I first learned about it I have been a happy user of a powershell script for parsing the results of a SQLIO test run into Excel made by Mr. Jonathan Kehayias (blog). It takes the otherwise difficult to read results file and reduces it to an excel file including a couple of “management-friendly” charts  Smilefjes.

The original script can be found here.

Continue reading “SQLIO results parser”

Count fragmented indexes

DECLARE @AVG_FRAG float
DECLARE @MIN_PAGES int 
SET @AVG_FRAG = 10
SET @MIN_PAGES = 1000

SELECT db.name, ips.index_type_desc, COUNT(*) as FragmentedIndexes
FROM sys.dm_db_index_physical_stats(null, null, NULL, NULL, NULL) ips
left join sys.databases db on ips.database_id = db.database_id
WHERE avg_fragmentation_in_percent > @AVG_FRAG and
ips.page_count > @MIN_PAGES
group by db.name, index_type_desc
order by db.name, index_type_desc

Lists the databases containing indexes and heaps with an average fragmentation percentage larger than @AVG_FRAG and how many indexes in the given database that are fragmented. Used to identify index issues at the instance/server level. @MIN_PAGES define the minimum size of the index. Small indexes are usually not affected by fragmentation and you can safely ignore them. The default limit in the script is set to 1000 pages (8000KiB), based on the recommendations from http://technet.microsoft.com/en-us/library/cc966523.aspx .

Datofeltenes forbannelse

Av og til må man hente ut data fra en tabell basert på dato eller tid. Slike tabeller har gjerne en egen tidsmerkekolonne, eller timestamp som det egentlig heter. Problemet er at datatypen som passer i de aller fleste databasemotorer også har en tidskomponent, slik at det er forskjell på 20100830153312 og 20100830073658 selv om de begge er fra samme dag. Man kan selvsagt jukse og sette denne til null, slik at alt blir merket med dato sånn og sånn klokken 00:00:00, men det er ikke så praktisk og krever at man tar hensyn til det hver gang man legger inn data. Og for å gjøre dette morsommere finnes det knapt noen datofunksjoner som er standardiserte på tvers av databaseplattformer, så man må finne de som passer til den versjonen man har. Dette har gitt meg mye “glede” opp gjennom årene, og siden jeg jobber med sånt forholdsvis sjelden forsøker jeg nå å dokumentere noe av det jeg stadig må finne opp på nytt.

Continue reading “Datofeltenes forbannelse”

Kopiere en MSSQL-database

Bakgrunn

Prosedyren beskriver to metoder for å ta en fullstendig kopi (snapshot) av en database på filnivå. En slik backup gir hurtig restore, og kan med fordel brukes før man tester oppgraderinger eller om man ønsker en kopi av databasen for testformål. Eksemplene er fra en MSSQL 2008 server, men prosedyren kan brukes på de fleste MSSQL versjoner. Grensesnittet er varierer på de forskjellige versjonene, men det bør ikke være så vanskelig å finne igjen bildene på en nyere eller eldre utgave. Continue reading “Kopiere en MSSQL-database”