DNS Operation refused on Cluster Aware Updating resource

Problem

On one of my Hyper-V clusters, Event ID 1196 from FailoverClustering is logged in the system log every fifteen minutes. The event lists the name of the resource and the error message “DNS operation refused”. It means that the cluster is unable to register a network name resource in DNS due to a DNS 9005 status code, which translates to “Operation refused”. In this case the resource was a CAU network name, which is part of Cluster Aware Updating.

 

Analysis

The funny thing about the Cluster Aware Updating (CAU) resource group is that it is hidden, even from PowerShell. There is a CAU cluster group, but the Get-ClusterGroup cmdlet does not list it. You have to know the name of the resource group and query it directly:

[Screenshot: querying the CAU resource group directly by name]

The resource is listed though, if you list all resources. You can use this to identify the resource group name.

[Screenshot: list of all cluster resources, including the CAU resource]

In this case, the troublesome resource is a Distributed Network Name. A DNN is a network name that is shared by more than one cluster node, and there should be a corresponding A-Record in DNS for each node. As this is an AD-integrated DNS zone, the records should be updated by the DNS clients from time to time. The DNS client in this case is the Failover Cluster computer object, the virtual computer representing the Failover Cluster in AD. This virtual computer is impersonated by the cluster service on the nodes in the cluster. For some reason, the DNS servers are refusing the operation. The node responsible for the update is the node that currently owns the CAU resource group, in this case node 05. I decided to dive into the operational failover cluster event log on node 05 to uncover what was going on.

The operation starts with a couple of event 2049 entries:

[RES] Distributed Network Name <CAU***>: Dns: HealthCheck: CAU****

[RES] Distributed Network Name <CAU***>: Dns: End of Slow Operation, state: Initialized/Reading, prevWorkState: Reading
[RES] Distributed Network Name <CAU***>: Dns: Stopping health check because DNS recheck configuration is needed

[RES] Distributed Network Name <CAU***>: Dns: Dns – Rechecking config internally

Then all network adapters are enumerated to identify DNS servers (event 2052).

[Screenshot: event 2052, enumeration of network adapters and DNS servers]

When it is done, the following event 2049 is logged:

[Screenshot: event 2049 logged after the adapter enumeration]

Then the cluster service looks up the IP addresses of all nodes in this network. When it is finished, the following event 2049 is logged:

[Screenshots: event 2049 logged after the node IP lookup]

The cluster service then detects that a record exists and should be updated:

Event 2049: [RES] Network Name: [NNLIB] DNS Record there for CAU***, proceeding

Then comes the error message. It tries both DNS servers in case one is down, thus giving two error messages.

Event 2051: [RES] Network Name: [NNLIB] Error 9005 on DNS DnsReplaceRecordSetW for A records, name CAU*** (ipv4Count 5, ipv6Count 0)

Event 2049: [RES] Network Name: [NNLIB] Second Phase for DNS Network: drift.nhn.no (2 DNS Servers)

Event 2049: [RES] Network Name:  [NN] IdentityLocal End Impersonating

Event 2051: [RES] Network Name: [NNLIB] Error 9005 on DNS DnsReplaceRecordSetW for A records, name CAU*** (ipv4Count 5, ipv6Count 0)
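If you export the log and want to pick out these failures without reading every entry, the 2051 message can be parsed with a regular expression. The following is a minimal sketch in Python, assuming the message text has exactly the shape quoted above. The names `parse_dns_update_error` and `DnsUpdateError` are made up for illustration and are not part of any cluster tooling.

```python
import re
from typing import NamedTuple, Optional

# Status codes worth naming. 9005 is the "operation refused" code from this post.
DNS_STATUS = {9005: "DNS_ERROR_RCODE_REFUSED (DNS operation refused)"}

# Assumption: the message layout matches the event 2051 entries quoted above.
PATTERN = re.compile(
    r"Error (?P<code>\d+) on DNS (?P<api>\w+) for (?P<rtype>\w+) records, "
    r"name (?P<name>\S+) \(ipv4Count (?P<v4>\d+), ipv6Count (?P<v6>\d+)\)"
)


class DnsUpdateError(NamedTuple):
    code: int
    api: str
    record_type: str
    name: str
    ipv4_count: int
    ipv6_count: int


def parse_dns_update_error(message: str) -> Optional[DnsUpdateError]:
    """Return the parsed fields of a DNS update error, or None if the line is something else."""
    match = PATTERN.search(message)
    if match is None:
        return None
    return DnsUpdateError(
        code=int(match["code"]),
        api=match["api"],
        record_type=match["rtype"],
        name=match["name"],
        ipv4_count=int(match["v4"]),
        ipv6_count=int(match["v6"]),
    )


if __name__ == "__main__":
    line = ("[RES] Network Name: [NNLIB] Error 9005 on DNS DnsReplaceRecordSetW "
            "for A records, name CAU*** (ipv4Count 5, ipv6Count 0)")
    error = parse_dns_update_error(line)
    if error is not None:
        print(error.name, DNS_STATUS.get(error.code, "unknown status"), error.ipv4_count)
```

Note the `ipv4Count 5` field: the cluster service is trying to register five IPv4 addresses, one per node, which is worth comparing with the number of A-records actually present in DNS.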

The issue then has to be related to the existing DNS records themselves, as there are no authentication errors or other indicators that the cluster service is unable to talk to the DNS servers. I suddenly remembered having had similar issues before, after reinstalling a cluster node in a SQL cluster. The solution is quite simple, which strikes me as odd, as my cluster troubleshooting usually ends with a hideously complex solution promising doom to the uninitiated. Let me also note that Cluster Aware Updating worked perfectly fine in spite of the missing DNS records. Maybe I was lucky…

Solution

  • First, you have to delete the existing A-Records from the AD-integrated DNS servers. In my case there were only three records for a five-node cluster, and node 05 was one of the missing nodes, indicating that it was not operating when the records were last updated.
  • Then you wait for the DNS servers to sync. Check all of them to make sure that the changes have propagated.
  • When the records are gone from all DNS servers, wait for the next update cycle. The cluster service will retry the update every 15 minutes.

If you are successful, the following event 2049 should show up in the cluster log:

[Screenshot: event 2049 logged after a successful DNS update]

And inspection of the DNS server should show one CAU A-record for each cluster node IP:

[Screenshot: DNS zone showing one CAU A-record for each cluster node IP]
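The same check can be scripted instead of eyeballing the DNS console. Below is a rough Python sketch that compares the node IPs you expect against the A-records a name resolves to. The function names are hypothetical, the addresses in the example come from the 192.0.2.0/24 documentation range, and `resolve_a_records` only asks the local resolver, so it does not tell you whether every DNS server has synced.

```python
import socket
from typing import Iterable, Set, Tuple


def compare_a_records(node_ips: Iterable[str], dns_ips: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (missing, unexpected): node IPs without an A-record, and A-records matching no node."""
    expected, actual = set(node_ips), set(dns_ips)
    return expected - actual, actual - expected


def resolve_a_records(name: str) -> Set[str]:
    """Resolve a name to its IPv4 addresses via the local resolver; empty set if it does not resolve."""
    try:
        _, _, addresses = socket.gethostbyname_ex(name)
    except OSError:
        return set()
    return set(addresses)


if __name__ == "__main__":
    # Made-up addresses mirroring the situation in this post:
    # five nodes, but only three A-records present in DNS.
    nodes = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4", "192.0.2.5"]
    records = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
    missing, unexpected = compare_a_records(nodes, records)
    print("missing:", sorted(missing))
    print("unexpected:", sorted(unexpected))
```

In real use you would pass `resolve_a_records("<your CAU network name>")` as the second argument. After a successful update both sets should be empty.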
