Failover Cluster


Problem

I received an alarm from one of my SQL Servers about IO stall times measured in seconds and went to investigate. We have had trouble with HBA firmware causing FC stalls before, so I suspected another storage error. The server in question was running virtual FC, and a cascading error among the other servers on the same host seemed to confirm my initial hypothesis of an HBA problem on the host.

Analysis

The kernel-mode CPU time on the host was high (the red part of the graph in Process Explorer), which also points in the direction of storage problems. The storage minions found no issue on the SAN though, yet another pointer towards a problem on the server itself. We restarted it twice, and the situation seemed to normalize. It was all written off as collateral damage from a VMware fault that flooded the SAN with invalid packets some time ago. I moved one of the VMs back and let it simmer overnight. I felt overly cautious not moving them all back, but the next morning the test VM was running at 80% CPU without getting anything done, and the CPU load on the host was about 50%, running a single 3-vCPU VM on a 2×12-core host…

image

I failed the test VM back to the spare host, and the load on the VM went down immediately:

image

At this point I was ready to take a trip to the room of servers and visit the host in person, and I was already planning a complete re-imaging of the node in my head. But then I decided to run CPU-Z first, and suddenly it all became clear.

image

 

The host is equipped with Intel Xeon E5-2690 v3 CPUs. Intel Ark informs me that the base clock is indeed 2.6 GHz as reported by CPU-Z, and the turbo frequency is as high as 3.5 GHz. A core speed of 1195 MHz as shown in CPU-Z is usually an indicator of one of two things: either someone has fiddled with the power saving settings, or there is something seriously wrong with the hardware.

A quick check of the power profile revealed that the server was running in the so-called “Balanced” mode, a mode that should be called “run-around-in-circles-and-do-nothing” mode on servers. The question then becomes: why did this setting change?

image

My server setup checklist clearly states that servers should run in High performance mode, and I had installed this particular server myself, so I know it was set correctly. The culprit turned out to be a firmware upgrade installed some months back. It had the effect of resetting the power profile both in the BIOS and in Windows to the default setting. There was even a change to the fan profile, causing the server to become very hot. The server in question is an HP ProLiant DL380 Gen9, and the ROM version is P89 v2.30 (09/13/2016).

Solution

  • First, change the power profile to High performance in the Control Panel (see the powercfg sketch after this list). This change requires a reboot.
  • While you are rebooting the server, enter the BIOS settings and check the power profile. I recommend Maximum Performance mode for production servers.
  • image
  • Then, check the Fan Profile
  • image
  • Try the Increased cooling fan profile first. If your servers still get exceedingly hot, there is a Maximum cooling mode, but that basically runs all the fans at full speed all the time.
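
If you prefer the command line, the Windows part of the change can be checked and applied with powercfg, which is built into Windows. A minimal sketch; SCHEME_MIN is the built-in alias for the High performance plan:

# Show the currently active power plan
powercfg /getactivescheme
# Activate the High performance plan
powercfg /setactive SCHEME_MIN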

This is how CPU-Z looks after the change:

image

And the core utilization on the host, this time with 8 active SQL Server VMs:

image


Introduction

Since W2012R2 it is recommended that all clusters have a quorum witness regardless of the number of cluster nodes. As you may know, the purpose of the cluster witness is to ensure a majority vote in the cluster. If you have 2 nodes with one vote each and add a cluster witness, you create the possibility of a majority vote. If you have 3 nodes on the other hand, adding a witness would remove the majority, as you then have 4 votes in total and a possible stalemate.

If a stalemate occurs, the cluster nodes may revolt and you are unable to get the cluster working without forcing quorum, or you could take a node out behind the barn and end its misery. Not a nice situation at all. W2012R2 solves this predicament with dynamic vote assignment. As long as a quorum has been established, if votes disappear due to nodes going offline, it will turn the witness vote on and off to make sure that you always have the possibility of a node majority. As long as you HAVE a witness, that is.

There are three types of quorum witnesses:

  • A SAN-connected shared witness disk, usually FC or iSCSI. Recommended for clusters that use shared SAN-based cluster disks for other purposes, otherwise not recommended. If this sounds like gibberish to you, you should use another type of witness.
  • A File share witness. Just a file share. Any type of file share will do, as long as it resides on a Windows server in the same domain as the cluster nodes. SOFS shares are recommended, but not necessary. DO NOT build a SOFS cluster for this purpose alone. You could create a VM for cluster witnesses, as each cluster witness is only about 5MiB, but it is better to find an existing physical server with a high uptime requirement in the same security zone as the cluster and create some normal SMB shares there. I recommend a physical server because a lot of virtual servers are Hyper-V based, and having the file share witness on a VM running on the very cluster it is a witness for is obviously a bad idea.
  • Cloud Witness. New in W2016. If you have an Azure storage account and are able to allow the cluster nodes a connection to Azure, this is a good alternative. Especially for stretch clusters that are split between different rooms.

How to set up a simple SMB File share witness

  • Select a server to host the witness, or create one if necessary.
  • Create a folder somewhere on the server and give it a name that denotes its purpose:
  • image
  • Open the Advanced Sharing dialog
  • image
  • Enable sharing and change the permissions. Make sure that the Everyone group is removed, and add the cluster computer object. Give the cluster computer object full control permissions.
  • image
  • Open Failover Cluster manager and connect to the cluster
  • Select “Configure Cluster Quorum Settings”:
  • image
  • Choose “Select the quorum witness”
    image

  • Select File Share Witness

  • image

  • Enter the path to the file share as \\server\share

  • image

  • Finish the wizard

  • Make sure the cluster witness is online:

  • image

  • Done! (A PowerShell alternative to these steps is sketched below.)
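
If you prefer PowerShell, the same result can be achieved with a few commands. A minimal sketch; the folder, share, domain, host and cluster names below are examples only:

# Create the folder and share on the witness host, granting the cluster computer object (CNO) full control
New-Item -ItemType Directory -Path 'D:\ClusterWitness' -Force | Out-Null
New-SmbShare -Name 'ClusterWitness$' -Path 'D:\ClusterWitness' -FullAccess 'CONTOSO\CLUSTER01$'
# Point the cluster at the new file share witness
Set-ClusterQuorum -Cluster CLUSTER01 -FileShareWitness '\\WITNESSHOST\ClusterWitness$'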


Problem

On one of my Hyper-V clusters, Event ID 1196 from FailoverClustering is logged in the system log every fifteen minutes. The event lists the name of the resource and the error message “DNS operation refused”. What it is trying to tell me is that the cluster is unable to register a network name resource in DNS due to a DNS 9005 status code, which translates to “Operation refused”. In this case it was a CAU network name resource, which is part of Cluster-Aware Updating.
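
To inspect the failing resource without waiting for the next fifteen-minute cycle, you can list the network name resources and trigger a manual DNS registration from PowerShell. A quick sketch; the resource name below is an example, and Update-ClusterNetworkNameResource is part of the FailoverClusters module:

# List network name resources and their state
Get-ClusterResource | Where-Object ResourceType -eq 'Network Name' | Select-Object Name, State, OwnerGroup
# Attempt a manual DNS registration for the CAU network name (resource name is an example)
Update-ClusterNetworkNameResource -Name 'CAU CLUSTER01'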



Problem

You run failover cluster validation, and the report claims that one or more of the nodes are missing update KB3005628:

SNAGHTML338020e6

You try running Windows Update, but KB3005628 is not listed as an available update. You try downloading and installing it manually, but the installer quits without installing anything.

Analysis

KB3005628 is a fix for .NET Framework 3.5, correcting a bug in KB2966827 and KB2966828. The problem is that the cluster node in question does not have the .NET Framework 3.5 feature installed. It did, however, have KB2966828 installed. As this is also a .NET 3.5 update, I wonder how it got installed in the first place. After reading more about KB3005628, it seems that KB2966828 could get installed even if .NET Framework 3.5.1 is not installed.

So far, no matter what I do, the validation report lists KB3005628 as missing on one of the cluster nodes. This may be a bug in the Failover Cluster validator.

Workaround

If the .NET Framework 3.5 feature is not installed, remove KB2966827 and KB2966828 manually from the affected node if they are installed. The validation report will still list KB3005628 as missing, but as the only function of KB3005628 is to remove KB2966827 and KB2966828, this poses no problem.
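
A hedged sketch of the check and removal from an elevated prompt; wusa.exe will prompt for a restart unless you suppress it:

# Check whether the offending updates are present
Get-HotFix -Id KB2966827, KB2966828 -ErrorAction SilentlyContinue
# Remove them silently, deferring the restart
wusa.exe /uninstall /kb:2966827 /quiet /norestart
wusa.exe /uninstall /kb:2966828 /quiet /norestart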


Problem

During a maintenance window, one of five Hyper-V cluster nodes failed to come out of maintenance mode after a reboot. SCVMM was used to facilitate maintenance mode. The system log shows the following error messages repeatedly:

Service Control Manager Event ID 7024

The Cluster Service service terminated with the following service-specific error:
Cannot create a file when that file already exists.

FailoverClustering Event ID 1070

The node failed to join failover cluster [clustername] due to error code ‘183’.

Service Control Manager Event ID 7031

The Cluster Service service terminated with the following service-specific error:
The Cluster Service service terminated unexpectedly.  It has done this 6377 time(s).  The following corrective action will be taken in 60000 milliseconds: Restart the service.

SNAGHTML115b89e

SNAGHTML1170717

SNAGHTML117b921

Analysis

I had high hopes of a quick fix. The cluster is relatively new, and we had recently changed the network architecture by adding a second switch. Thus, I instructed the tech who discovered the fault to try rebooting and checking that the server was able to reach the other nodes on all its interfaces. That river ran dry quickly though, as the local network proved to be working perfectly.

Looking through the Windows Update log and the KBs for the installed updates did not reveal any clues. Making it even more suspicious, a cluster validation assured me that all nodes had the same updates. Almost. Except for one. Hopeful, I looked closer, but of course, it was some .NET Framework update rollup missing on a different node.

I decided to give up all hope of an impressive five-minute fix and venture into the realm of the cluster log. It is possible to read the cluster log in the event log system, but I highly recommend generating the text file and opening it in Notepad++ or some other editor with search highlighting. I find it a lot easier on the eyes and mind. (Oh, and if you use the event log reader, DO NOT filter out information messages. For some reason, the eureka moment is often hidden in an information message.) The cluster log is a bit like the forbidden forest; it looks scary in daylight and even scarier in the dark during an unscheduled failover. It is easy to get lost down a track interpreting hundreds of “strange” messages, only to discover that they were benign and, to make it worse, covered a timespan of about half a second. The wrong second of course, not the one where the problem actually occurred. To put it mildly, the cluster service is very talkative, especially so when something is wrong. As event 7031 told us, the cluster service was busy trying to start once a minute, and each try spews out thousands of log messages. The log I was working with had 574,942 lines and covered a timespan of 68 minutes. That is about 8,450 lines per service start.

Anyway, into the forbidden forest I went with nothing but a couple of event log messages to guide me. After a while, I isolated one cluster service startup attempt with correlated system event log data. I discovered the following error:

ERR   mscs::GumAgent::ExecuteHandlerLocally: AlreadyExists(183)’ because of ‘already exists'([Servername]- LiveMigration Vlan 197)

The sneaky cluster service had tried fooling me into believing that the fault was file system related, when in fact it was a networking mess after all! You may recognize error 183 from system event ID 1070 above. Thus we can conclude with fairly high probability that we have found the culprit. But what does it mean? I checked the name of the adapter in question and its teaming configuration. I ventured even further and checked for disconnected and missing network adapters, but none were to be found, neither in Device Manager nor in the registry. Then it struck me. The problem was staring me straight in the eye. The line above the error message, an innocent-looking INFO message, was referring to the network adapter by an almost identical but nevertheless different name:

INFO [ClNet] Adapter Intel(R) Ethernet 10G 2P X520-k bNDC #2 – VLAN : VLAN197 LiveMigration is still attached to network VLAN197 Live Migration

A short but efficient third-degree interrogation of the tech revealed that the network names had been changed some weeks prior to make them consistent on all cluster nodes. Ensuring network name consistency is in itself a noble task, but one that should be completed before the cluster is formed. It should of course be possible to change network names at any time, but for some reason the cluster service has not been able to persist the network configuration properly. As long as the node is kept running this poses no problem. However, when the node is rebooted for whatever reason, the cluster service gets lost running around in the forest looking for network adapters. Network adapters that are present but silent, as the cluster service is calling them by the wrong name. I have not been able to figure out exactly what happens, not to mention what caused it in the first place, but I can guess. My guess is that different parts of the persisted cluster configuration came out of sync. The registry data shown below probably links network adapters to their cluster network names:

SNAGHTML151cc90

I have found this fault now on two clusters in as many days, and those are my first encounters with it. I suspect the fault is caused by a bug or “feature” introduced in a recent update to Win2012R2.

Solution

The solution is simple. Or at least I hope it is simple. As usual, I strongly encourage you to seek assistance if you do not fully understand the steps below and their implications.

  • First you have to disable the cluster service to keep it out of the way. You can do this in services.msc, PowerShell, cmd.exe or through telepathy.
  • Wait for the cluster service to stop if it was running (or kill it manually if you are impatient).
  • Change the name of ALL network adapters that have an IPv4 and/or IPv6 address and are used by the cluster service (see the PowerShell sketch after this list). Changing the name of only the troublesome adapter mentioned in the log may not be enough. Make sure you do not mess around with any physical teaming members, SAN HBAs, virtual cluster adapters or anything else that is not directly used as a cluster network.
  • Before:image After: image
  • Enable the cluster service
  • Wait for the cluster service to start, or start it manually
  • The node should now be able to join the cluster
  • Stop the cluster service again  (properly this time, do not kill it)
  • Change the network adapter names back
  • Start the cluster service again
  • Verify that the node joins the cluster successfully
  • Restart the node to verify that the settings are persisted to disk.
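
For reference, the steps above translate roughly to the following PowerShell. This is a sketch only; the adapter name is an example, and you should repeat the rename for every cluster-facing adapter:

# Disable and stop the cluster service
Set-Service -Name ClusSvc -StartupType Disabled
Stop-Service -Name ClusSvc -ErrorAction SilentlyContinue
# Rename every cluster-facing adapter to a temporary name
Rename-NetAdapter -Name 'VLAN197 LiveMigration' -NewName 'VLAN197 LiveMigration tmp'
# Re-enable and start the cluster service, then verify that the node joins the cluster
Set-Service -Name ClusSvc -StartupType Automatic
Start-Service -Name ClusSvc
# Once the node has joined: stop the service properly, rename the adapters back and start it again
Stop-Service -Name ClusSvc
Rename-NetAdapter -Name 'VLAN197 LiveMigration tmp' -NewName 'VLAN197 LiveMigration'
Start-Service -Name ClusSvc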

Problem

I was patching one of my clusters to SQL 2012 SP2 and SP2 CU3 when something bad happened. This particular cluster is a 3-node cluster with an FCI Primary AOAG replica instance on nodes 1 and 2, and a stand-alone Secondary AOAG replica instance on node 3. Node 3 is used for HADR when the shared storage or other shared infrastructure has an outage.

The update passed QAT with flying colors, but sadly that does not always guarantee a successful production run. My standard patch procedure for this cluster:

  • Patch node 3
  • Patch node 2 (passive FCI node)
  • AOAG failover to node 3, node 3 becomes AOAG Primary
  • FCI failover from node 1 to node 2
  • Patch node 1
  • FCI failover to node 1
  • AOAG failover to node 1

When I tried to fail the FCI over to node 2 (step 4 above), the instance failed to come online. At first, I was worried that the SP2 upgrade process might be very lengthy or slow and was triggering the FCI timeouts. An inspection of the SQL Server error log revealed that this was not the case. Instead, I was the victim of a dreaded master database failure:

2015-01-12 01:28:02.82 spid7s      Database 'master' is upgrading script 'msdb110_upgrade.sql' from level 184552836 to level 184554932.
2015-01-12 01:28:02.82 spid7s      ----------------------------------
2015-01-12 01:28:02.82 spid7s      Starting execution of PRE_MSDB.SQL
2015-01-12 01:28:02.82 spid7s      ----------------------------------
2015-01-12 01:28:02.96 spid7s      Error: 3930, Severity: 16, State: 1.
2015-01-12 01:28:02.96 spid7s      The current transaction cannot be committed and cannot support operations that write to the log file. Roll back the transaction.
2015-01-12 01:28:02.96 spid7s      Error: 912, Severity: 21, State: 2.
2015-01-12 01:28:02.96 spid7s      Script level upgrade for database 'master' failed because upgrade step 'msdb110_upgrade.sql' encountered error 3930, state 1, severity 16. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the 'master' database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion.
2015-01-12 01:28:02.97 spid7s      Error: 3417, Severity: 21, State: 3.
2015-01-12 01:28:02.97 spid7s      Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online.
2015-01-12 01:28:02.97 spid7s      SQL Server shutdown has been initiated
2015-01-12 01:28:02.97 spid7s      SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.

Analysis

In case misbehaving SQL Server instances are able to smell fear, I am glad I was located several miles away from the datacenter at this point in time. While a rebuild of master is certainly doable even in a complex setup such as this, it is not something you want to do at 2am without a detailed plan if you don’t have to. Thus, I tried failing the instance back to node 1 (running SP1 CU11). To my amazement it came online straight away. I have seen similar issues reduce clustered instances to an unrecognizable puddle of zeros and ones in a corner on the SAN drive, so this was a welcome surprise. Feeling lucky, I tried another failover to node 2, only to be greeted with another failure and the exact same errors in the log. A quick search revealed several similar issues, but no exact matches and no feasible solutions. The closest was a suggestion to disable replication during the upgrade. As you probably know, AOAG is just replication in a fancy dress, so I went looking for my Disaster Recovery Runbook, which contains ready-made scripts and plans for disabling and re-enabling AOAG. My only problem is that disabling AOAG will take down the AOAG listener, thus disconnecting all clients. Such antics result in grumpy client systems, web service downtime and a lot of paperwork for instance reviews, and are therefore something to avoid if at all possible. Just for the fun of it, I decided to try making node 2 the AOAG Primary during the upgrade. To my astonishment, this worked like a charm. Crisis (and paperwork) averted.

Solution

You have to promote the FCI to AOAG Primary during the upgrade from SP1 to SP2. The upgrade is triggered by failing the FCI over from a node running SP1 to a node running SP2, in my case the failover from node 1 to node 2 after patching node 2.

Sadly, there is no fixed procedure for patching failover cluster instances. Some patches will only install on the active FCI node, and will then continue to patch all nodes automatically. But most patches follow the recipe above, where the passive node(s) are patched first.

This issue will probably not affect “clean” AOAG or FCI clusters where you only apply one technology. If you use FCI with replication on the other hand, you may experience the same issue.
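
For reference, a planned AOAG failover to the FCI replica can be scripted with the SQLPS module. A minimal sketch; the server, instance and availability group names are examples, and the command must be run against the replica that is to become Primary:

Import-Module SQLPS -DisableNameChecking
# Fail the availability group over to the FCI replica (run against the target replica)
Switch-SqlAvailabilityGroup -Path 'SQLSERVER:\SQL\FCINETNAME\INST1\AvailabilityGroups\MyAG'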

Definitions

AOAG = Always On Availability Group

FCI = Failover cluster Instance

HADR = High Availability / Disaster Recovery


To verify which SMB version is in use for a specific file share/connection, run the following PowerShell command:

Get-SmbConnection | Select-Object ShareName, Dialect

You can run this command on both the client and the server. A client/server connection will use the highest version supported by both client and server. If the client supports up to v3.02, but the server is only able to support v3.00, v3.00 will be used for the connection.
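
On the file server side, you can also list the dialect negotiated for each client session. A quick sketch; Get-SmbSession is part of the SmbShare module on Win2012 and later:

# Show the SMB dialect negotiated per client session
Get-SmbSession | Select-Object ClientComputerName, ClientUserName, Dialect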

The Get-SmbConnection cmdlet returns several other properties; use Select-Object * to list them all.

Sample output

SNAGHTML3f409e4d

This is from a Win2012R2 client, connected to a share on a Win2012 cluster with multichannel support.


This post is part of the Failover Cluster Checklist series.

 

The Failover Cluster computer object needs to be granted the permissions necessary to create cluster resource objects (computers). Some resource objects can be staged, others cannot; this depends on the OS version and resource type. The easiest solution is to place each cluster in a separate OU, and give the cluster permission to create objects in that OU only.

How to do it

  • If necessary, create a new OU and move all cluster nodes and cluster resource objects to the new OU.
  • Enable “Advanced Features” under the View menu in Active Directory Users and Computers.

clip_image001

  • Open the Advanced Security Settings for the OU.

clip_image002

  • Add the cluster name computer object, and grant it the Create Computer objects permission (a command-line alternative is sketched after this list).

clip_image003

  • Make sure the cluster computer object has been granted the Read all properties permission.
    image
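
If you prefer the command line, the same permission can be granted with dsacls. This is a sketch only; the OU path, domain and cluster name (CNO) below are examples:

# Grant the cluster computer object the right to create computer objects in the OU
dsacls.exe 'OU=Cluster01,OU=Servers,DC=contoso,DC=com' /G 'CONTOSO\CLUSTER01$:CC;computer'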

Introduction

OK, so you want to install a cluster? This is not a “Should I build a cluster?” post, this is a “How to build a proper cluster” post. I like checklists, so I made a Windows Failover Cluster installation checklist. Some of the points have their own post, some are just a short sentence. I will add more details as time allows. The goal is to share my knowledge about how to build stable clusters. I may disagree with other best practices out there, but this list is based on my experience, on what works in production and what does not. I use it to build new clusters, as well as to troubleshoot clusters made by others. Clustering is so easy that anyone can build a working cluster these days, but building a stable, production-worthy cluster may still be like finding your way out of a maze. A difficult maze filled with ghosts, trolls and angry badgers.

There are some things you need to know about this post before you continue reading:

  • This list is made for production clusters. There is nothing stopping you from building a lab using this list, but if you do as I say, you will build a very expensive lab.
  • I work with SQL Server, Hyper-V and File clusters. This list may work for other kinds of clusters as well, but I have not tested it on recent versions.
  • At the time of writing (fall 2014), this list is for Windows 2008R2 up to Windows 2012R2. Version-specific instructions are given when necessary.
  • This list is for physical clusters. I dislike virtual clusters, because most organizations are not clever enough to create functioning virtual production clusters that won’t fail miserably due to user error someday. (By “virtual clusters” I mean cluster nodes on top of hypervisors, not clustered hypervisors).
  • This is MY checklist. I have spent several years honing it, and it works very well for me. That does not guarantee that it will work for you. I welcome any comments on alternative approaches, but don’t expect me to agree with you.
  • This list is mostly written in a “How to do it” manner, and may be lacking in the “But why should I do it” department. This is due to several reasons, but mostly a lack of time on my part. I do however want you to know that there are several hours, if not days of work behind each point.
  • Updates will be made as I discover new information.
  • The list is chronological. That is, start at the top and make your way down the list. If you jump back and forth, you will not achieve the desired result.
  • This list is based on the GUI version of Windows Server, not Core
  • Understanding this list requires knowledge of Active Directory and basic knowledge of Failover Clustering.

The design phase

In the design phase, there are a lot of decisions you have to make BEFORE you start building the cluster. These are just a few of them:

  • How many nodes do you need? Remember you need at least one standby node for HA (High Availability). Depending on the total number of nodes you may need several standby nodes. Some managers will complain about the extra nodes just sitting there unused, but they forget that they are there to provide HA. No matter the number of nodes, make sure the hardware is as equal as possible. I don’t care what the manual says, having cluster nodes with different hardware in them is a recipe for disaster. If possible, all nodes should be built on the same day by the same people and have consecutive serial numbers.
  • How many network fabrics do you need? And how many can you afford? See Networks, teaming and heartbeats for clusters for more information. This is where most troublesome clusters fail.
  • Will you use shared storage? And what kind of shared storage? In short: FCoE is bad for you, iSCSI is relatively cheap, SMB3 is complicated and may be cheap, shared DAS/SAS is very cheap, FC is the enterprise norm and InfiniBand is for those who want very high performance at any cost. In most cases you will have to use what is already in place in your datacenter though. And it is usually better to have something your storage guys are used to supporting. Just remember that storage is very important for your overall performance, no matter what kind of cluster. For file clusters, high throughput is important. For SQL Server, low latency is key and you should use FC or InfiniBand.
  • What kind of hardware should you use in your cluster nodes? Currently, these are my opinions, based on my personal experience. As mentioned above, these are my opinions, you may come to other conclusions. My opinions on this change frequently as new generations are released, but here goes:
    • Emulex should stop making any kind of hardware. It is all bad for you and bad for your cluster. If you are having trouble with cluster stability and you have Emulex made parts in your nodes, remove them at once.
    • QLogic make good FC HBAs. If you have a FC SAN, QLogic HBAs are highly recommended. If you have QLogic network adapters on the other hand, use them for target practice.
    • Broadcom network adapters used to be good, but the drivers for Windows are getting worse by the minute.
    • Intel X520 is my current favorite network adapter.
    • Use Brocade FC switches only. They are sold under many other brand names as well, I have seen them with both HP and IBM stickers.
    • Use Cisco or HP network switches, but do not use them for FC traffic.
    • Make sure your nodes have local disk controllers with battery or flash backed cache. Entry level disk controllers are not worth the cardboard box they are delivered in.
    • Intel Xeon CPUs currently reign supreme for most applications. There are however some edge cases for SQL Server where AMD CPUs will perform better. I recommend reading Glenn Berry’s blogs for up-to-date SQL Server CPU information.
    • HP, IBM and Dell all make reasonably good servers for clustering. Or, I should say equally bad, but better than the alternatives.
  • RACK or Blade?
    • RACK servers
      • are easier to troubleshoot
      • are versatile
      • give you a lot of expansion options
      • are cheaper to buy
    • Blade servers are
      • space efficient
      • cheaper to maintain if you rent rack space
      • easier to install
      • limited in terms of expansion options
  • Where should your nodes be located physically? I do not recommend putting them all in the same rack. The best solution is to put them in separate rooms within sub-millisecond network distance. You can also place them in separate data centers with a long distance between them if you do not use shared storage or use some kind of hybrid solution. I do not recommend SAN synchronization to data centers far, far away though, it is better to have synchronization higher up in the stack. If you only have one datacenter, place the nodes in different racks and make sure they have redundant power supplies.
  • Talking about power, your redundant power supplies should be connected to separate power circuits, preferably with each connected to an independent UPS.
  • What domain should your servers be member of, and which organizational unit should you use? Failover clustering will not work without Active Directory. The Active Directory role should NOT be installed on the cluster nodes. You should have at least two domain controllers, one of which should be a dedicated physical machine. I know that MS now supports virtualizing all your domain controllers, but that does not mean that you should do it, or that it is smart to do so. I would also recommend creating a separate OU for each cluster.
  • What account should you use to run the installation? I would recommend using a special cluster setup account, as some cluster roles latch on to the account used during installation and become unstable if that account is deleted at a later date. The account should be a domain administrator, and should be set to automatically deactivate at some point in the near future after you are done with the cluster setup. You can then re-activate it for the next cluster installation by changing the expiration date and password.
  • And then there are a lot of product and project specifics, such as storage requirements, CPU and memory sizing and so on, all of which may affect your cluster design.

The actual checklist

All list items should be performed on each node in the cluster unless specified otherwise. You can do one node at a time, or all at once until you get to cluster validation. All nodes should be ready when you run cluster validation. I find it easiest to remember everything by doing one list item on each node before I move on to the next, making notes along the way.

  • Install Windows Server
  • Copy any required media, drivers etc. to a folder on each node
  • Activate a machine proxy if you use a proxy server to access the internet. See Proxy for cluster nodes for more information.
  • Check whether the server is a member of the domain, and add it to the domain if necessary
  • Make sure all your drivers are installed using Device Manager
  • Make sure you are running current BIOS, Remote Access, RAID, HBA and Network firmware in accordance with your patch regime. If in doubt, use the latest available version from your server vendor. Do NOT download drivers and firmware from the chip vendor unless you are troubleshooting a specific problem.
  • Make sure your drivers are compatible with the firmware mentioned above.
  • Add the failover cluster features:
    Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools
  • If this is a Hyper-V host, install the Hyper-V role
    Install-WindowsFeature -Name Hyper-V -IncludeManagementTools -Restart
  • Verify your network setup. Networks, teaming and heartbeats for clusters
  • Check network adapter names and binding order. The public interface (the one facing the domain controllers) should be at the top of the binding order, and adapters should have the same name on each cluster node.
  • Disable IPv6. See How to disable IPv6
  • Remove duplicate persistent routes. Details
  • Disable NICs that are not in use
  • Install any prerequisites required by your shared storage. Check with your SAN admin for details.
  • Change page file settings according to Page file defaults
  • Activate Microsoft Update http://update.microsoft.com/microsoftupdate
  • Run Windows update
  • Install cluster hotfixes. See Does your cluster have the recommended hotfixes?
  • Select the High Performance power plan, both in Windows and in the BIOS/UEFI
  • Verify automount settings for Failover Clustering
  • If you are using shared storage, verify storage connections and MPIO in accordance with guidelines from your SAN vendor. Most SAN vendors have specific guidelines/whitepapers for Failover Clustering.
  • If you are creating a Hyper-V cluster, this is the time to create a virtual switch
  • Validate the configuration: Validating a Failover Cluster. Do not continue until your cluster passes validation. I have yet to see a production cluster without validation warnings, but you should document why you have each warning before you continue.
  • Create the cluster: Creating a Failover Cluster (see the PowerShell sketch after this list)
  • Verify the Quorum configuration. If you are using Windows 2012, make sure dynamic quorum is enabled. If you use shared storage, you should always have a quorum witness drive (even if you don’t use it). The Create Cluster wizard will without fail select a different quorum witness drive from the one you intended to use, so make sure to correct this as well.
  • Grant create computer object permissions to the cluster. This is necessary for installation of most clustered roles, and this is why each cluster should have its own OU.
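
The validation and creation steps can also be run from PowerShell. A minimal sketch; node names and the cluster IP address are examples:

# Validate the configuration, then create the cluster without adding any storage
Test-Cluster -Node NODE01, NODE02
New-Cluster -Name CLUSTER01 -Node NODE01, NODE02 -StaticAddress 10.0.10.100 -NoStorage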

This post is part of the Failover Cluster Checklist series.

Create cluster

Start the Create Cluster wizard from Failover Cluster Manager or from the Validate Cluster wizard results page. First, specify the servers that will be nodes in the cluster. Then you need to supply a valid static IPv4 or IPv6 address and a virtual computer name. The IP address has to be in a network that has access to a writable domain controller, and all nodes need to have a NIC with an IP in this subnet (unless you are creating a multi-subnet cluster with nodes in different subnets, which makes everything more difficult and is not covered in this post at this time).

SNAGHTML6fe91b

Set disk names

In Failover Cluster Manager, any shared storage will be listed as Cluster Disk 1, Cluster Disk 2 and so on

SNAGHTML755673

You should change this name to reflect the volume name, or the drive name if you for some strange reason chose to have more than one volume per drive.
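
The rename can also be done from PowerShell by setting the Name property on the cluster resource. A quick sketch; the disk and volume names are examples:

# Rename a clustered disk resource to match its volume name
(Get-ClusterResource -Name 'Cluster Disk 1').Name = 'SQL Data'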

Set network names

The same goes for network names: Failover Clustering will just name them Network 1, Network 2 and so on. I like using VLAN numbers and functional tags like Public, Internal, Live Migration and so on. The important part is that you should instantly know which network is which. You should also define which networks are available for client connections. The cluster will assume that networks with a default gateway are client facing, while networks without a gateway are for internal use. If you use iSCSI, disable all cluster traffic on the iSCSI networks.

image
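
Cluster network names and client-access roles can be set from PowerShell as well. A sketch; the network names are examples, and the Role values are 0 (no cluster traffic), 1 (cluster only) and 3 (cluster and client):

# Rename a cluster network and restrict it to internal cluster traffic
(Get-ClusterNetwork -Name 'Cluster Network 1').Name = 'VLAN197 LiveMigration'
(Get-ClusterNetwork -Name 'VLAN197 LiveMigration').Role = 1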

