AOAG: Local disks are set offline

Problem

After a reboot, the local disks other than the boot disk are offline. Disk Management reports the following status:

THE DISK IS OFFLINE BECAUSE OF POLICY SET BY AN ADMINISTRATOR

The SQL Server instance fails to start because the drives containing the database files are offline.

Information about the system where this fault was detected:

  • SQL Server 2019
  • Windows Server 2022
  • Three nodes
  • One node is a stand-alone AOAG replica with local storage
  • Two nodes form an AOFCI instance using shared SAN storage
  • The AOFCI instance is participating in an AOAG together with the third node
  • Multiple subnets are in use
  • Most disks are mounted to a folder, not a drive letter
  • Intel Xeon Gold processors
  • Physical servers made in 2021/22

After setting the disks online and restarting the node, the drives are online and the SQL Server instance starts. Subsequent reboots do not reveal a pattern. Sometimes all drives are offline, sometimes half of the drives are offline.
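For reference, a quick way to list the affected disks and the reason Windows gives for taking them offline. This is just a convenience sketch, assuming the Storage module cmdlets that ship with Windows Server 2012 and later; OfflineReason should read “Policy” for the message above.

    # List all offline disks and the reason Windows reports for each
    Get-Disk | Where-Object IsOffline |
        Select-Object Number, FriendlyName, OperationalStatus, OfflineReason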

Analysis

SAN policy

The policy referenced in the message is probably the SAN policy, which is set from diskpart.

The alternatives are Offline Shared (the default), Online All, Offline All, and Offline Internal. Offline Shared sets all shared storage offline by default, and it has to be brought online explicitly. Usually that is the cluster service changing the state of shared drives in accordance with the state of cluster resources. If you ask your not-so-friendly search engine and spy, you will find a lot of references telling you to just change the policy to Online All. In this case, that would probably be OK. But if you try to mount a shared disk on multiple nodes of an AOFCI cluster, for instance, you may end up in a sad world of disk corruption. The node with the problem, however, is not connected to a SAN or any other form of shared storage and would handle Online All without problems.
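The policy can be inspected and changed from PowerShell as well. A minimal sketch, assuming the Get-StorageSetting/Set-StorageSetting cmdlets available on recent Windows Server versions:

    # Show the current SAN policy (equivalent to "san" in diskpart)
    Get-StorageSetting | Select-Object NewDiskPolicy

    # Change it (equivalent to "san policy=OnlineAll" in diskpart).
    # Only safe on nodes with no shared storage, as discussed above.
    Set-StorageSetting -NewDiskPolicy OnlineAll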

Disk signatures

A look at the failover cluster validation report reveals that the cluster service identifies all of the “problem disks” as eligible for failover cluster validation.

Looking further down, the drives are identified as only existing on one node. This is important, as different scenarios may create local drives with the same signature on multiple nodes. This is especially a problem on virtual machines and when using cloning software to install physical machines. If duplicate disk signatures had existed in the cluster, the disks would have been validated, and failover clustering would have tried to add them to the cluster.

Luckily that was not the case here. All the local drives had a unique signature.
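If you want to check for yourself, something along these lines should surface duplicates across nodes. A sketch only, with NODE1 through NODE3 as hypothetical stand-ins for the actual cluster nodes; MBR disks carry a Signature, GPT disks a Guid:

    # Collect disk identifiers from every node and line them up for comparison.
    # PSComputerName is added automatically by PowerShell remoting.
    Invoke-Command -ComputerName NODE1, NODE2, NODE3 {
        Get-Disk | Select-Object Number, FriendlyName, PartitionStyle, Signature, Guid
    } | Sort-Object Signature, Guid |
        Format-Table PSComputerName, Number, FriendlyName, PartitionStyle, Signature, Guid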

Add all eligible storage to the cluster

When you add a node to an existing cluster or form a new cluster, the cluster wizard will add all eligible storage to the cluster by default.

Your not-so-friendly search engine will list numerous reports of SQL Server disks disappearing when someone builds an AOAG cluster and forgets to uncheck this option. Whether or not that was the case here is unknown. What I do know from the validation reports is that the drives were not formatted when the node with local storage was added to the cluster. Anyway, the solution reported by many internet patrons is to just online the drives in Disk Management and restart/start SQL Server. I have yet to find reports of intermittent problems.
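If you script the node join instead of clicking through the wizard, the pitfall can be avoided explicitly. A hedged sketch using the FailoverClusters module, with MYCLUSTER and NODE3 as hypothetical names:

    # -NoStorage skips the "add all eligible storage to the cluster" step
    Add-ClusterNode -Cluster MYCLUSTER -Name NODE3 -NoStorage

    # Eligible disks can still be listed, and added deliberately, afterwards
    Get-ClusterAvailableDisk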

Hypothesis

After applying the tentative solution listed below I have yet to reproduce the error. That in no way guarantees a solution, especially as I have not been able to determine the root cause with 100% certainty. Maybe not even 50/50. But here goes:

  • The “Add all eligible storage to the cluster” option was not unchecked.
  • Cluster validation has not been executed since the drives were formatted and SQL Server was installed.
  • The disk controller, an HPE SR932-p Gen10+, is doing something it should not.
  • The drives are all NVMe based, but RAID is still being used.
  • The result: the disk automount service believes that the local drives are shared (the current setting can be inspected as shown below).
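diskpart has no direct PowerShell counterpart for the automount setting, but the command can be piped through. A sketch only; verify on a test box first:

    # Show whether automatic mounting of new volumes is enabled
    'automount' | diskpart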

Tentative solution

I do not know if this is the final solution. I do not know why it worked. I will update this post if something changes.

As usual, make sure that you understand this plan before you attempt to implement it. A PowerShell sketch of the steps follows the list.

  • Online all disks that are offline.
  • Move the “Available Storage” cluster resource group to the problematic node. It does not matter if the group is offline.
  • Run a cluster validation with storage validation included.
  • Make sure that there are no disk signature conflicts in the report.
  • Restart the node.
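A PowerShell sketch of the plan above, assuming the FailoverClusters module and using NODE3 as a placeholder for the problematic node. Be aware that the storage tests in cluster validation will take the disks in Available Storage offline while they run:

    # 1. Bring every offline disk online (and clear read-only if it is set)
    Get-Disk | Where-Object IsOffline | Set-Disk -IsOffline $false
    Get-Disk | Where-Object IsReadOnly | Set-Disk -IsReadOnly $false

    # 2. Move the Available Storage group to the problematic node
    Move-ClusterGroup -Name "Available Storage" -Node NODE3

    # 3. Run validation, including the storage tests
    Test-Cluster -Include "Inventory", "System Configuration", "Storage"

    # 4. Inspect the report for disk signature conflicts, then restart the node
    Restart-Computer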

Author: DizzyBadger

SQL Server DBA, Cluster expert, Principal Analyst
