Asynchronous AOAG down after restart of primary node

Background

This article is the result of a long day in the woods on a SAR mission turning into an even longer night due to a difficult cluster.

I have many clusters. One of them is a cluster with 3 nodes. 2 of them are running a regular Failover Cluster Instance with shared storage, while the third node has local storage and serves as an AOAG replica for the most critical databases. The Failover cluster instance is the primary replica as long as we don’t have a HA/DR scenario where the SAN is down or massive hardware issues. This is a very specific setup, but I would not be surprised if this problem could be caused by restarting the primary replica on any async AOAG.

Problem

 

During scheduled maintenance, I failed over the instance containing the primary replica from one node to another manually. This is what usually happens:

  • SQL instance is manually moved from node A to node B
  • The AOAG listener cluster resources fail on node A
  • The AOAG listener cluster resources are automatically moved to node B
  • The AOAG listener cluster resources come on line on node B

But not this time. This time, the AOAG listener objects came online on node A. Such a thing is not supposed to happen. In my experience, AOAG Listeners always stays with the primary node.

Note: I do not recommend this procedure as a standard maintenance procedure. It is always best to make sure that the instance you are restarting is NOT the primary replica of any AOAG.

Anyway, the end result is that SQL Server patiently waits for the AOAG listeners to come on line on the correct server. Or, perhaps patiently is not the correct word. It spews angry error messages in the logs, and the AOAG dashboard is all red.

SNAGHTMLd170a1

“Availability replica is disconnected”

“Availability replica does not have a healthy role”

SNAGHTMLd1f856

Solution

The solution is quite simple. It does however it require crossing over to the dark side, performing unspeakable dark magic. Magic that should never be performed in the presence of an AOAG listener, much less be performed on the listener itself. But simple it is, as long as you are the kind of person who knows your way around the part of Windows where Failover Cluster Manager dwells. Or is familiar with the sky blue realm of the Powershell cluster commandlets. Both will suffice. However, if you are not such a person, if some of the sentences in this article sounds like dark incantations overheard in a shady tavern; please seek assistance from someone who is before you proceed down this path to solution in production. Please note that normally, AOAG listener resources should never be manipulated manually. Trying to do so usually just makes the situation even more dire. But here goes:

Take the AOAG listener resources offline manually, and then bring them back online. Doing so should make the resources realize the errors of their ways and promptly enter the failed state. The cluster and SQL Server should detect this and take action. All failed listeners should be whisked over to the primary replica node and brought online without any need for further input from you.

However, if the listeners are still offline, active on the wrong cluster node, or even worse locked in a failed state, you still have options. Suggestions listed by increasing approximated time consumption.

  • Move the listener resources manually to the node containing the primary replica.
  • Shut down all the nodes, then start just one and let SQL Server initialize completely before you start the others.
  • Look for underlying domain and network issues preventing the listener from starting.
  • Destroy and re-create the availability group and listeners.

Author: DizzyBadger

SQL Server DBA, Cluster expert, Principal Analyst

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.