Asynchronous AOAG down after restart of primary node

Background

This article is the result of a long day in the woods on a SAR mission turning into an even longer night due to a difficult cluster.

I have many clusters. One of them is a cluster with 3 nodes. 2 of them are running a regular Failover Cluster Instance with shared storage, while the third node has local storage and serves as an AOAG replica for the most critical databases. The Failover cluster instance is the primary replica as long as we don’t have a HA/DR scenario where the SAN is down or massive hardware issues. This is a very specific setup, but I would not be surprised if this problem could be caused by restarting the primary replica on any async AOAG.

Problem

 

During scheduled maintenance, I failed over the instance containing the primary replica from one node to another manually. This is what usually happens:

  • SQL instance is manually moved from node A to node B
  • The AOAG listener cluster resources fail on node A
  • The AOAG listener cluster resources are automatically moved to node B
  • The AOAG listener cluster resources come on line on node B

But not this time. This time, the AOAG listener objects came online on node A. Such a thing is not supposed to happen. In my experience, AOAG Listeners always stays with the primary node.

Note: I do not recommend this procedure as a standard maintenance procedure. It is always best to make sure that the instance you are restarting is NOT the primary replica of any AOAG.

Anyway, the end result is that SQL Server patiently waits for the AOAG listeners to come on line on the correct server. Or, perhaps patiently is not the correct word. It spews angry error messages in the logs, and the AOAG dashboard is all red.

SNAGHTMLd170a1

“Availability replica is disconnected”

“Availability replica does not have a healthy role”

SNAGHTMLd1f856

Solution

The solution is quite simple. It does however it require crossing over to the dark side, performing unspeakable dark magic. Magic that should never be performed in the presence of an AOAG listener, much less be performed on the listener itself. But simple it is, as long as you are the kind of person who knows your way around the part of Windows where Failover Cluster Manager dwells. Or is familiar with the sky blue realm of the Powershell cluster commandlets. Both will suffice. However, if you are not such a person, if some of the sentences in this article sounds like dark incantations overheard in a shady tavern; please seek assistance from someone who is before you proceed down this path to solution in production. Please note that normally, AOAG listener resources should never be manipulated manually. Trying to do so usually just makes the situation even more dire. But here goes:

Take the AOAG listener resources offline manually, and then bring them back online. Doing so should make the resources realize the errors of their ways and promptly enter the failed state. The cluster and SQL Server should detect this and take action. All failed listeners should be whisked over to the primary replica node and brought online without any need for further input from you.

However, if the listeners are still offline, active on the wrong cluster node, or even worse locked in a failed state, you still have options. Suggestions listed by increasing approximated time consumption.

  • Move the listener resources manually to the node containing the primary replica.
  • Shut down all the nodes, then start just one and let SQL Server initialize completely before you start the others.
  • Look for underlying domain and network issues preventing the listener from starting.
  • Destroy and re-create the availability group and listeners.

Misaligned IOs on SSD storage

Problem

The SQL server error log on one of my secondary replicas is riddled with error messages like this one:

“There have been 271104 misaligned log IOs which required falling back to synchronous IO.  The current IO is on file ….”

image

Analysis

The AOAG is running on Win 2012, and I have been lulled into thinking that drive misalignment was a thing of the past, expelled from reality with my last Win2003R2 server. Seems maybe I was wrong. Although, analysis of the drives show that they are properly aligned. This particular replica is running with local SSD drives, whereas the primary replica is connected to a SAN. I thought “maybe the SAN volumes are misaligned?”, but that to was in order. Turns out the culprit was the SSD drive itself. The SAN volume is using a 512 bytes per sector physical setup, but the new SSD drive is set up to use 4096 bytes.

SSD:

image

SAN:

image

(Mountpoints are in use, thus a path instead of a driveletter.)

This was the only dividing factor I could find.

Even though it seemed futile, I tried to find a way to change bytes Per physical Sector. I mean, the “physical” in the name makes it sound very final. But, by converting the SSD to dynamic, mirroring it and making sure it was aligned I finally got it anyway:

image

I suppose it is still 4096 under the covers, but the cluster has been running for a week, and no new error mesasges have been logged so far.

Solution

Get drives with identical Bytes Per Physical Sector for all your AOAG nodes. I have heard rumors about SANs being able to define this on a per volume basis, but I consider this a poor workaround as it will probably add latency. It sounds better to upgrade all replicas to the new fancy 4096 bytes per sector drives, as they are probably faster for SQL server anyway.