Unable to add shares to Windows 2012 File Cluster

Problem

image

When you try to add a share to a newly formed (and perhaps also an existing) Windows 2012 File Server Cluster, you get an error message stating that the you are unable to do so due to lack of WinRM communication between the cluster nodes. Additionally, you may spot event id 49 from WinRM MI Operation in the Windows Remote Management operational event log with the following message:

“The WinRM protocol operation failed due to the following error: The WinRM client sent a request to an HTTP server and got a response saying the requested HTTP URL was not available. This is usually returned by a HTTP server that does not support the WS-Management protocol..”

SNAGHTML6084577

Or the following text for Event 49:

“The WinRM protocol operation failed due to the following error: The connection to the specified remote host was refused. Verify that the WS-Management service is running on the remote host and configured to listen for requests on the correct port and HTTP URL..”

And event id 142 from Windows Remote management stating

“WSMan operation Enumeration failed, error code 2150859027”

SNAGHTML609f7a6

Other possible events:

EventID 0 from FileServices-Manager.Eventprovider

image

“ Exception: Caught exception Microsoft.Management.Infrastructure.CimException: The WinRM client received an HTTP status code of 502 from the remote WS-Management service.
   at Microsoft.Management.Infrastructure.Internal.Operations.CimSyncEnumeratorBase`1.MoveNext()
   at Microsoft.FileServer.Management.Plugin.Services.FSCimSession.PerformQuery(String cimNamespace, String queryString)
   at Microsoft.FileServer.Management.Plugin.Services.ClusterEnumerator.RetrieveClusterConnections(ComputerName serverName, ClusterMemberTypes memberTypeToQuery)”

Error code 504 has also been detected.

Analysis

The problem is clearly related to windows Remote Management. What was even more peculiar in this case, was the fact that when I failed over to another node, the error message disappeared. Thus I knew that the error was isolated to the one node. But even though I spent hours comparing settings on the nodes, all I was able to establish was the fact that they were exactly alike. Then I remembered something from my Exchange admin days; In earlier versions of Windows, WinRM could be removed and reinstalled from the system. I remember this because Exchange 2010 relied heavily on WinRM and remote powershell, bot of which could be a major pain to get working properly. In Win2012, remote management is heavily integrated in server manager, and I was unable to find a way to remove it. I did however find a way to turn it off an on again.

Update 2016.11.24:

I found another version of this problem where solution one did not work. It was still a WinRM-problem, but this time it was proxy-related. You may need an explicit  proxy exception for the local domain.

Solution one

Disable and enable WinRM. There are of course multiple ways to achieve this. I used powershell, but there is an option in the gui, and the command works in CMD.EXE as well. Beware, you have to use an elevated powershell prompt. When I come to think of it, most things that are worh doing seems to require an elevated shell.

Configure-SMRemoting -disable
Configure-SMRemoting -enable

That is it. no need to reboot or anything, just run the two commands and wait for them to finish. If you get a message that remoting is enforced by Group Policy, look for this GPO:

image

It has to be set as Not configured to allow you to disable and enable WinRM. If it is enforced by a domain policy, you have to block said policy temporarily while you fix this.

Enabling and disabling should also make sure that the necessary firewall settings are enabled. If you have a proxy server defined, make sure you have exceptions added for your local servers as this could also block WinRM, albeit with other error messages.

Solution two

Make sure you have an exception in your proxy definition for the local domain. For system proxy setups:

netsh winhttp set proxy [proxyserveraddress]:[proxy port] bypass-list=”*.ADDomain.local;<local>”

For other proxy configs, ask your proxy admin.

Testing winrm with powershell

You can use the Invoke-command powershell command to test powershell remote connections:

Invoke-Command -ComputerName Lab-DC -ScriptBlock { Get-ChildItem c:\ } -credential lab\sauser

This command will output a directory listing of c:\ on the computer Lab-DC. The command will be executed with the lab\sauser account. Powershell prompts for account password on execution. Sample output:

07-05-2014 11-57-04

Multiple default gateways

Update 2018-0307

I have verified this as an issue on Windows 2016 as well. Sometimes if a network adapter has been configured with a default gateway before it is added to a NIC Team, you will get multiple default gateways.

Problem

While troubleshooting a networking teaming issue on a cluster, someone sent me the a link to this article about multiple default gateways on Win 2012 native teaming: http://www.concurrency.com/blog/bug-in-nic-teaming-wizard-makes-duplicate-default-routes-in-server-2012/. The post discusses a pretty specific scenario that we didn’t have on our clusters (most of them are on 2008R2), but I discovered several nodes with more than one default route in route print:

image

The issue I was looking into was another, but I remembered a problem from a weekend some months ago that might be related: When a failover was triggered on a SQL cluster, the cluster lost communication with the outside world. To be specific: no traffic passed through the default gateway. As all cluster nodes were on the same subnet the cluster itself was content with the situation, but none of the webservers were able to communicate with the clustered SQL server as they were in a different subnet. This made the webservers sad and the webmaster angry, so we had to fix it. As this happened in production over the weekend, the focus was on a quick fix and we were unable to find a root cause at the time. A reboot of the cluster nodes did the trick, and we just wrote it off as fallout from the storage issue that triggered the failover. The discovery of multiple default gateways on the other hand prompted a more thorough investigation.

Analysis

The article mentioned above talks exclusively about Windows 2012’s native teaming software, but this cluster is running Windows 2008 R2 and is relying on teaming software provided by the NIC manufacturer (Qlogic). We have had quite a lot of problems with the Qlogic network adapters on this cluster, so I immediately suspected them to be the rotten apple. I am not sure if this problem is caused by a bug in Windows itself that is present in both 2012 and 2008R2, or if both MS and Qlogic are unable to produce a functioning NIC teaming driver, but the following is clear:

If your adapters have a default gateway when you add them to a team, there is a chance that this default gateway will not get removed from the system. This happens regardless if the operating system is Windows 2012 or Windows 2008 R2. I am not sure if gateway addresses configured by DHCP also triggers this behavior. It doesn’t happen every time, and I have yet to figure out if there are any specific triggers as I haven’t been able to reproduce the problem at will.

Solution A

To resolve this issue, follow the recommendations in  http://www.concurrency.com/blog/bug-in-nic-teaming-wizard-makes-duplicate-default-routes-in-server-2012/:

First you have to issue a command to delete all static routes to 0.0.0.0. NB! This will disconnect you from the server if you are connected remotely from outside the subnet.

image

Configure the default gateway for the team using IP properties on the virtual team adapter:

image

Do a route print to make sure you have only one default gateway under persistent routes.

Solution b

If solution A doesn’t work, issue a netsh interface ip reset command to reset the ip configuration and reboot the server. Be prepared to re-enter the ip information for all adapters if necessary.

What not to do

Do not configure the default gateway using route add, as this will result in a static route. If the computer is a node in a cluster, the gateway will be disabled at failover and isolate the server on the local subnet. See http://support.microsoft.com/kb/2161341 for information about how to configure static routes on clusters if you absolutely have to use a static route.

Does your cluster have the recommended hotfixes?

From time to time, the good people at Microsoft publish a list of problems with failover clustering that has been resolved. This list, as all such bad/good news comes in the form of a KB, namely KB2784261. I check this list from time to time. Some of them relate to a specific issue, while others are more of the go-install-them-at-once type. As a general rule, I recommend installing ALL hotfixes regardless of the attached warning telling you to only install them if you experience a particular problem. In my experience, hotfixes are at least as stable as regular patches, if not better. That being said, sooner or later you will run across patches or hotfixes that will make a mess and give you a bad or very bad day. But then again, that is why cluster admins always fight for funding of a proper QA/T environment. Preferably one that is equal to the production system in every way possible.

Anyways, this results in having to check all my servers to see if they have the hotfixes installed. Luckily some are included in Microsoft Update, but some you have to install manually. To simplify this process, I made the following powershell script. It takes a list of hotfixes, and returns a list of the ones who are missing from the system. This script could easily be adapted to run against several servers at once, but I have to battle way to many internal firewalls to attempt such dark magic. Be aware that some hotfixes have multiple KB numbers and may not show up even if they are installed. This usually happens when updates are bundled together as a cummulative package or superseded by a new version. The best way to test if patch/hotfix X needs to be installed is to try to install it. The installer will tell you whether or not the patch is applicable.

Edit: Since the original, I have added KB lists for 2008 R2 and 2012 R2 based clusters. All you have to do is replace the ” $recommendedPatches = ” list with the one you need. Links to the correct KB list is included for each OS. I have also discovered that some of the hotfixes are available through Microsoft Update-Catalog, thus bypassing the captcha email hurdle.

2012 version

$menucolor = [System.ConsoleColor]::gray
write-host "╔═══════════════════════════════════════════════════════════════════════════════════════════╗"-ForegroundColor $menucolor
write-host "║                              Identify missing patches                                     ║"-ForegroundColor $menucolor
write-host "║                              Jan Kåre Lokna - lokna.no                                    ║"-ForegroundColor $menucolor
write-host "║                                       v 1.2                                               ║"-ForegroundColor $menucolor
write-host "║                                  Requires elevation: No                                   ║"-ForegroundColor $menucolor
write-host "╚═══════════════════════════════════════════════════════════════════════════════════════════╝"-ForegroundColor $menucolor
#List your patches here. Updated list of patches at http://support.microsoft.com/kb/2784261
$recommendedPatches = "KB2916993", "KB2929869","KB2913695", "KB2878635", "KB2894464", "KB2838043", "KB2803748", "KB2770917"
 
$missingPatches = @()
foreach($_ in $recommendedPatches){
    if (!(get-hotfix -id $_ -ea:0)) { 
        $missingPatches += $_ 
    }    
}
$intMissing = $missingPatches.Count
$intRecommended = $recommendedpatches.count
Write-Host "$env:COMPUTERNAME is missing $intMissing of $intRecommended patches:" 
$missingPatches

2008R2 Version

A list of recommended patches for Win 2008 R2 can be found here:  KB2545685

$recommendedPatches = "KB2531907", "KB2550886","KB2552040", "KB2494162", "KB2524478", "KB2520235"

2012 R2 Version

A list of recommended patches for Win 2012 R2 can be found here:  KB2920151

#All clusters
$recommendedPatches = "KB3130944", "KB3137691", "KB3139896", "KB3130939", "KB3123538", "KB3091057", "KB3013769", "KB3000850", "KB2919355"
#Hyper-V Clusters
$recommendedPatches = "KB3130944", "KB3137691", "KB3139896", "KB3130939", "KB3123538", "KB3091057", "KB3013769", "KB3000850", "KB2919355", "KB3090343", "KB3060678", "KB3063283", "KB3072380"

Hyper-V

If you are using the Hyper-V role, you can find additional fixes for 2012 R2 in KB2920151 below the general cluster hotfixes. If you use NVGRE, look at this list as well: KB2974503

Sample output (computer name redacted)

SNAGHTML2af111ec

Edit:

I have finally updated my script to remove those pesky red error messages seen in the sample above.

Generating and reading cluster logs

NOTE: this post was originally from 2010, it was updated for win2012 in august 2013.

If you want to read the cluster log on Windows 2008/2012 failover clusters, you have to generate it first. This log is considered sort of a debug level log, so it is not written to disk in a readable format by default. The log is however stored on disk as a circular .etl file, and it can be written out to a readable cluster.log file on demand. There are two ways you can create this file, by using cluster.exe or by PowerShell. Windows 2008/2008R2 supports both, while Windows 2003 is so old that it only supports the .log text file format and thus creates a readable log by default. Windows 2012 on the other hand considers cluster.exe to be too “old-school”, so it supports PowerShell only.

Be aware that readable might be an undeserving description of the cluster.log file. It is not for the faint of heart, and it should NOT be your first entry point when troubleshooting cluster issues. I usually access it only as a last resort when all else fails, or when I try to decipher why the cluster had issues AFTER I have solved the problem at hand.

Continue reading “Generating and reading cluster logs”

Weak event created

Problem

In windows failover cluster manager, clicking on an node in the tree will raise the following error:

SNAGHTML106567d

This is one of the biggest error messages I have ever seen in a Microsoft product, that is, regarding the size of the message box Smilefjes. The error states: “A weak event was created and I lives on the wrong object, there is a very high chance this will fail, pleas review your code to prevent the issue” and goes on with a .net call stack.

Analysis

This feels like a .Net framework issue to me, and a quick search rustled up the following post on Technet: http://blogs.technet.com/b/askcore/archive/2013/01/14/error-in-failover-cluster-manager-after-install-of-kb2750149.aspx, stating that this is a bug caused by KB 2750149.

Solution

Install KB2803748, available here: http://support.microsoft.com/kb/2803748

To check if both or any of the patches are installed, run the following powershell commands:

Get-HotFix -id KB2750149
Get-HotFix -id KB2803748

Cluster validation fails on missing drive letter

Problem

Failover Cluster validation fails with the following error: “Failed to prepare storage for testing on node X status 87”:

SNAGHTML314451ca

Analysis

This error was caused by how the operating system was installed on the server. For some reason there was a separate hidden boot partition with no name:

SNAGHTML3145d2ac

I suspect that this is the remains of a Dell OEM installation CD, as there is also an OEM partition on the drive. I didn’t install the OS on this server though, so I was unable to figure out exactly what had happened.

Solution

The solution is simple enough, just assign a drive letter to the boot partition:

image

You could also reinstall the node or create a valid boot loader on the C: drive, but just assigning a drive letter to the partition is way quicker.

Exchange 2010 Dag won’t go online

Problem

After installation of Exchange 2010 SP1 Rollup5 I discovered that one of my networks was listed as partitioned in EMC. I ran several test without being able to find a cause for this. I am able to ping the addresses both ways, and there are no routes from network 1 to network 2. I even ran “Validate this cluster” (network only) successfully. Further investigation established that the cluster core resources were offline as well:

SNAGHTML5c562012

After a lot of fruitless searching I came across this Technet blog: http://blogs.technet.com/b/timmcmic/archive/2010/05/12/cluster-core-resources-fail-to-come-online-on-some-exchange-2010-database-availability-group-dag-nodes.aspx , who describes this situation and how to resolve it. But it also claims that the error is fixed in SP1. Since I’ve been at SP1 since installation I was skeptical, but I tried it anyway. As I suspected it didn’t help, but a comment from MattP_75 put me on the right track to a solution.

The problem is related to one or both of the networks not allowing client connections. I even found it in the eventlog, event 1223 from FailoverClustering.

SNAGHTML5c5c3660image

When I tried to change it in Failover Cluster management as the event suggested, it just bounced back to not allowing client access. To get it to stick, I had to change it in the registry.

I have no idea why this happens, but several of the comments on the article mentioned above talk about backup agents, mostly backup exec which I don’t have on my servers.

Solution

This is what I did to resolve the issue:

  • Shut down one of the nodes to ensure quorum
  • Change the role value to 3 on all cluster networks
    image
  • Get the core resources online
  • Restart the other node
  • Check the registry on both nodes to verify

I have had this happen again when I restart the cluster node that is hosting the Public Folders database, but it doesn’t happen every time.

Update 2011.11.23:

Microsoft recently released a hotfix which might be related to this error, kb2550886. According to the Exchange Team blog this is highly recommended for Exchange DAG’s running on Windows 2008R2, and they describe a scenario resembling the problem mentioned above. I have not verified that this update solves the problem permanently, but it is most certainly worth installing at your earliest convenience if you haven’t done so already.

Cluster disker på passiv node og WWN på HBA

Ved feilsøking av noen rare clusterproblemer begynte jeg plutselig å lure på hvordan clusterdisker var representert på den passive noden i et Windows 2008 cluster. De var nemlig ikke å se i Disk Management, og forsøk på failover av diskressurser feilet spektakulert og fullstendig. Litt forskning viste kjapt at joda, de skulle vært listet omtrent som dette:

image

Dermed måtte feilen være et annet sted. Litt mer fundering avslørte at vi hadde oppdatert bios og managementprosessor på dette clusteret for kort tid siden, og vi begynte å ane hvor grevlingen var. Av en eller annen grunn hadde ikke serveren fått satt riktig WWN ved omstart. Dette er et IBM HS22 blade med Qlogic HBA som får sine WWN fra managementmodulen, og en rask sjekk i SanSurfer avslørte at WWN ikke stemte overens med det som var definert i SANet. Dermed trengs en kaldstart av managementmodueln i bladet, noe som betyr å slå av serveren, trekke ut bladet, vente et minutt og sette det inn igjen. Deretter må man vente en 5 minutters tid mens IMM booter. Man kan se at den er ferdig ved at powerlyset i front av bladet begynner å blinke med betydelig langsommere takt. Først da vil bladet la seg slå på.

Update: Det har også fungert i enkelte tilfeller å ta en full shutdown av bladet, vente to minutter og så starte det igjen. Det er verdt et forsøk, da man sparer en tur til serverrommet.

Samme situasjon har jeg tidligere opplevd på HP BL460 blade. Disse har vanligvis Emulex-kort som liker å komme opp med blank WWN (00-00-00-…), eller med en WWN som begynner på 1 istedenfor 5. Dette er noe som er litt lettere å oppdage enn et nummer som er tilsynelatende riktig. Løsningen er den samme, ut med bladet for å få kaldstartet managementmodulen.

SNAGHTML109f57a

Dersom feilen oppstår på en helt ny server/nytt HostBusAdapter er det vanligvis firmware på HBA som er utdatert eller på annen måte ikke samsvarer med firmware til bladets managementmodul. Emulex liker å splitte denne i flere deler, en HBA bios og en HBA firmware, pass på at du oppdaterer alle. Og oppdater firmware på bladet samtidig. Det anbefales å bruke firmware fra bladeprodusenten og ikke fra Qlogic/Emulex. Pass også på at du oppdaterer firmwaren til den sentrale managementmodulen på IBM Bladecenter samtidig. Denne kalles AMM. På HP C7000 har man ofte Virtual Connect fibermoduler istedenfor fiberswitcher fra feks Brocade. Rent teknisk er det store forskjeller på AMM og Virtual Connect, men i denne sammenhengen er det viktigste at versjonene stemmer overens.