AOAG: Local disks are set offline

Problem

After a reboot, the local disks that are not the boot disk are offline. Disk manager reports the following status:

THE DISK IS OFFLINE BECAUSE OF POLICY SET BY AN ADMINISTRATOR

The SQL Server instance fails as the drives containing the database files are offline.

Information about the system where this fault was detected:

  • SQL Server 2019
  • Windows Server 2022
  • Three nodes
  • One node is a standalone AOAG replica with local storage
  • Two nodes form an AOFCI instance using shared SAN storage
  • The AOFCI instance is participating in an AOAG together with the third node
  • Multiple subnets are in use
  • Most disks are mounted to a folder, not a drive letter
  • Intel Xeon Gold
  • Physical servers made in 2021/22

After setting the disks online and restarting the node, the drives are online and the SQL Server instance starts. Subsequent reboots do not reveal a pattern. Sometimes all drives are offline, sometimes half of the drives are offline.

Analysis

SAN policy

The policy referenced in the message is probably the SAN policy from diskpart.

The alternatives are Offline Shared (default), Online All, Offline All and Offline Internal. Offline Shared leaves all shared storage offline by default, and it has to be brought online explicitly. Usually that will be the cluster service changing the state of shared drives in accordance with the state of cluster resources. If you ask your not-so-friendly search engine and spy, you will find a lot of references asking you to just change the policy to Online All. And in this case, that would probably be OK. If you try to mount a shared disk on multiple nodes of an AOFCI cluster, for instance, you may end up in a sad world of disk corruption. However, the node with the problem is not connected to a SAN or other forms of shared storage and would handle Online All without problems.
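To make the four policies easier to compare, here is a minimal Python sketch of how each one decides whether a newly discovered disk comes online. It is a simplified model, not the actual Windows logic: the `Disk` type, the `on_shared_bus` flag and the `comes_online` function are names made up for illustration, and treating "internal" as "not on a shared bus" is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class SanPolicy(Enum):
    ONLINE_ALL = "OnlineAll"
    OFFLINE_SHARED = "OfflineShared"
    OFFLINE_ALL = "OfflineAll"
    OFFLINE_INTERNAL = "OfflineInternal"


@dataclass(frozen=True)
class Disk:
    # Hypothetical model of a disk as the operating system sees it.
    name: str
    is_boot: bool = False
    # What the OS believes about the bus, not what is physically true.
    on_shared_bus: bool = False


def comes_online(policy: SanPolicy, disk: Disk) -> bool:
    """Return True if the disk would be brought online under the given policy."""
    if disk.is_boot:
        return True  # the boot disk is never held offline in this model
    if policy is SanPolicy.ONLINE_ALL:
        return True
    if policy is SanPolicy.OFFLINE_ALL:
        return False
    if policy is SanPolicy.OFFLINE_SHARED:
        return not disk.on_shared_bus
    # OFFLINE_INTERNAL: assumed here to mean "disks not on a shared bus stay offline"
    return disk.on_shared_bus


if __name__ == "__main__":
    local_disk = Disk("data1")
    same_disk_misdetected = Disk("data1", on_shared_bus=True)
    for disk in (local_disk, same_disk_misdetected):
        print(disk, comes_online(SanPolicy.OFFLINE_SHARED, disk))
```

In this model, a local disk that is wrongly believed to sit on a shared bus stays offline under Offline Shared but comes online under Online All, which matches the symptoms described above.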

Disk signatures

A look in the failover cluster validation report reveals that the cluster service identifies all the “problem disks” as eligible for failover cluster validation.

Looking further down, the drives are identified as only existing on one node. This is important, as different scenarios may create local drives with the same signature on multiple nodes. This is especially a problem on virtual machines and when using cloning software to install physical machines. If duplicate disk signatures had existed in the cluster, the disks would have been validated, and failover clustering would have tried to add them to the cluster.

Luckily that was not the case here. All the local drives had a unique signature.
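If you have collected the disk signatures per node yourself, the duplicate check can be sketched in a few lines of Python. This is an illustration only, not how the validation wizard works internally; the function name, the node names and the signature values are all made up.

```python
from collections import defaultdict


def find_signature_conflicts(signatures_by_node):
    """Map each signature seen on more than one node to the sorted list of those nodes."""
    nodes_by_signature = defaultdict(set)
    for node, signatures in signatures_by_node.items():
        for signature in signatures:
            # Compare case-insensitively, since hex values may be written either way.
            nodes_by_signature[signature.upper()].add(node)
    return {
        signature: sorted(nodes)
        for signature, nodes in nodes_by_signature.items()
        if len(nodes) > 1
    }


if __name__ == "__main__":
    # Hypothetical inventory: NODE-A and NODE-C were cloned from the same image.
    inventory = {
        "NODE-A": ["1A2B3C4D", "5E6F7A8B"],
        "NODE-B": ["9C0D1E2F"],
        "NODE-C": ["1a2b3c4d"],
    }
    print(find_signature_conflicts(inventory))
```

An empty result means every signature exists on one node only, which is the situation described above.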

Add all eligible storage to the cluster

When you add a node to an existing cluster or form a new cluster, the cluster wizard will add all eligible storage to the cluster by default.

Your not-so-friendly search engine will list numerous reports of SQL Server disks disappearing when someone is building an AOAG cluster and forgets to uncheck this option. Whether or not that was the case here is unknown. What I do know from the validation reports is that the drives were not formatted when the node with local storage was added to the cluster. Anyway, the solution reported by many internet patrons is to just online the drives in disk manager and restart/start SQL Server. I have yet to find reports of intermittent problems.

Hypothesis

After applying the tentative solution listed below I have yet to reproduce the error. That does in no way guarantee a solution, especially as I have not been able to determine the root cause with 100% certainty. Maybe not even 50/50. But here goes:

  • The “Add all eligible storage..” option was not unchecked
  • Cluster validation has not been executed since the drives were formatted and SQL Server was installed.
  • The disk controller HPE SR932-p Gen10+ is doing something it should not.
  • The drives are all NVMe based but RAID is still being used.
  • Resulting in the disk automount service believing that the local drives are shared.

Tentative solution

I do not know if this is the final solution. I do not know why it worked. I will update if something changes.

As usual, make sure that you understand this plan before you attempt to implement it.

  • Online all disks that are offline
  • Move the “Available Storage” cluster resource group to the problematic node. It does not matter if it is offline.
  • Run a cluster validation with storage validation
  • Make sure that there are no disk signature conflicts in the report.
  • Restart the node
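After the final restart it is worth checking whether any data disks came up offline again. Below is a minimal Python sketch that reads disk state exported as JSON. It assumes the export was made with something like `Get-Disk | Select-Object Number, IsOffline, IsBoot | ConvertTo-Json`; the function name and the sample data are made up for illustration.

```python
import json


def offline_data_disks(get_disk_json):
    """Return the numbers of non-boot disks that are reported as offline."""
    disks = json.loads(get_disk_json)
    if isinstance(disks, dict):
        # A single disk is exported as a bare object rather than a list.
        disks = [disks]
    return sorted(
        disk["Number"]
        for disk in disks
        if disk.get("IsOffline") and not disk.get("IsBoot")
    )


if __name__ == "__main__":
    # Hypothetical export from a node with one boot disk and two data disks.
    sample = """
    [
      {"Number": 0, "IsOffline": false, "IsBoot": true},
      {"Number": 1, "IsOffline": true,  "IsBoot": false},
      {"Number": 2, "IsOffline": false, "IsBoot": false}
    ]
    """
    print(offline_data_disks(sample))  # prints [1]
```

Running this after each of several reboots gives a simple record of which disks, if any, are still going offline.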

More about password managers and security

In response to comments on my post about 1Password and cloud security, this is an update about other password managers and my way out. It is highly recommended that you read the previous post first to understand where I am going with this. I was looking for a password manager for a specific purpose, that is cloud sync ability with something that at least looks like true encryption without back doors, and my comments are written from that mindset.

Why I would not use Dropbox for my passwords

Or Google drive. Or Onedrive for that matter.

And note the “for my passwords” part of the sentence. It is not that I do not use such services, but I consider all of them insecure locations. They are just too juicy a target for wrongdoers, both intelligence agencies and the other kind of cyber criminals.

About “Free” cloud services

Such as Facebook, or maybe LastPass is a better example in this case. There is no free lunch. You are either a customer or a product. You are either the farmer or the pig. Pigs have no rights to privacy. And before you consider using any Google-based service, read this: Street View cars slurping wi-fi. This affair was for me the first warning that something was rotten in the house of Google. And it has not become any better since. I guess the “Don’t be evil” company motto should have been a warning as well…

As such I would never install a password manager that contains passwords for other systems on an Android or iOS-based phone. They are way too easy to hack.

And just as a note, LastPass and LogMeOnce were not considered due to their lack of a desktop client.

KeePass

I have used KeePass in the past, but it is mostly a local solution only, and is therefore out of scope. At least initially. I am also slightly worried about the fact that it is free, as in free of charge. I have nothing against open source software, but for some needs I prefer to have access to someone to complain to if it all goes wrong. Someone who is paid to listen to my ails and complaints. That being said, KeePass is a nice product for local password management.

Keeper

Had a brief look at it, discovered it was dependent on Java, uninstalled immediately. Also, on revisiting, it seems to be “moving to the cloud”.

DashLane

A solid security policy. I ran the demo for some time, but the “Modern” UI was horrible, if not as horrible as the 1Password 6 beta. I wanted to like it, but after continuously having to click buttons two or three times for them to do anything, I gave up. I may re-test it in the future. This is sadly a complaint I have about most “Modern” UI applications: they do not respond to mouse clicks consistently. I also could not get the browser plugin to work properly in Vivaldi.

StickyPassword

This has been the best contender so far. I got as far as testing the synchronization, and I used it for the full trial period without a rage-uninstall. What actually stopped me from going for it in the end was its login dialog constantly popping up when I wasn’t trying to use it. It became a nuisance. I also did not like that it wants to run all the time, instead of when I actually want to use it. The chance of someone snagging access to it while walking by if I forgot to lock my screen is of course a danger, but primarily I stopped using it because it became irksome.

What I ended up doing

I stuck with what I had, a simple local-only password manager not to be named. Because the password manager itself is not important, as long as you have control of the data locally.  It is how you control the synchronization of data that is important. And as I do not really trust any of the “public cloud” alternatives, I decided to make my own. I installed Resilio Sync, a file synchronization application based on the BitTorrent protocol, and used it to keep my encrypted password store in sync across my computers.

This allows me to keep the data in sync and, to a certain degree, actually know where my data is physically located. It could still be hacked or intercepted of course, but that would have to be a much more directed attack than the usual “let’s archive everything that was ever stored on Dropbox in case we need it some time” behavior we have come to expect from the people who are supposedly working to keep the world “safe”. I may come across as rather paranoid in this post, but such are the times.

1Password and cloud security

Posting reviews of software is not something that I do every day. Or every year for that matter. But something unexplainable about the incident foretold in this post made me write it. You have been warned…

I have been on the lookout for a new password manager, especially one with “secure” cloud sync capabilities, and someone recommended 1Password (name withheld to protect the guilty). What piqued my interest was the claim that no one but me would be able to decrypt the data.

This is in stark contrast to most cloud solutions. Let us use DigiPost.no as an example. It is touted as a completely secure way to receive digital documents from the Norwegian government and anyone else willing to pay for sender-access to the system. For instance, several brick and mortar stores in Norway are able to send you receipts and warranty-certificates over the system. But is it secure? Their FAQ claims that it is as safe as your bank. And maybe it is. But my bank does not aggregate data about me from other sources, at least not to my knowledge. Browsing further down the FAQ reveals the following quote: “Et fåtall sikkerhetsklarerte medarbeidere er autorisert til å vedlikeholde og korrigere kundeopplysninger.” Sadly this is in Norwegian only, but it basically says that some employees have the security clearance necessary to view or alter your data to perform “maintenance”. Images of underpaid outsourcing employees from Asia looking to make a quick buck on the side datamining flashed before my inner eye, but even if these people are all highly trustworthy, that is beside the point. The point is that someone other than me and the sender can access these data without me giving them the key. And then it is not really safer than regular email. And if you still believe that your emails, cloud storage and facebook messages are not stored, tagged and analyzed automatically by at least two governments beside your own, please stop reading. You are outside the target demographic and should keep your current post-it-under-the-keyboard password manager.

But I digress. I was supposed to write about password managers, more specifically 1Password from AgileBits.

I registered and downloaded a trial of the subscription-based “family” version, as it came so highly recommended by the website and was the only version targeted at end users with internet sync that didn’t include a known NSA-infected third party.

I was surprised to find that there was no stable version of the Windows Application available, only a beta, but I was feeling adventurous and downloaded the desktop version. The Modern/Metro version reports itself as an Alpha version in the Windows Store and was thus left alone.

Next, I attempted to import my existing data. The online help directed me to a community-built Perl script and a PDF at https://github.com/AgileBits/onepassword-utilities. I went through the Perl script maze and ended up with a 1pif file, which I was to import into the main program. 1pif is some form of intermediate proprietary import/export format. All that remained was importing it into 1Password. To my astonishment, there was no import button to be found. Not even the File menu, where the import button is supposed to be located according to the PDF, was available. The app is almost completely bereft of buttons and menus. I tried inputting data manually, but the fancy modern UI is not exactly user friendly, so I gave that up. Inputting 200+ entries manually at the pace the UI allowed was out of the question. There may be a hidden import function there somewhere, but I was unable to find it.

Rummaging around the 1Password website I found the stable 4.x version. This is the one that only supports Dropbox sync or similar. It has the aforementioned import button (which worked), but after the data was imported and I tried opening the resulting vault in v6 (beta), the vault was locked and could not be reopened. After a second try with another file I got it going, and I was able to access the data in v6 through some kind of legacy function whose location I forgot to screenshot. I was about to move the data over to the “cloud” part, but I stopped… Glancing at my main monitor, I noticed it was filling up with security warnings complaining about unsafe access to system resources. By 1Password v6. See screenshot below. Sadly, it is written in Norwegian, but it is basically a warning against invalid code signing certificates.

[Screenshot: security warnings about invalid code signing certificates, in Norwegian]

I have once before lost data due to poorly managed updates to a password manager, and here I am about to put my trust in beta software? Remembering the non-decryptable data from some years back and the time spent recovering the lost data, I was not feeling safe at all. If the claim that I am the only one with the encryption keys is true, is it then even possible to restore from a backup if a botched software update garbles the data? Are there in fact any backups at all? The documentation talks about a password history, indicating that delete means tag as deleted but keep in database, but says nothing about a restore function as far as I can tell.

There are stable clients for most other platforms though. I realized that most if not all screenshots on the 1Password site are from the Mac version, so I guess they just couldn’t be bothered to build a proper Windows client before they launched V6 for Mac. A stroll down the memory lane of blog.agilebits.com confirmed my suspicions. In May 2016 they launched 1Password 6.3 for Mac. 6.0 was launched in January, with several updates in-between. The most recent post I can find about the stable Windows version is from July 2015, and as far as I can tell it just confirms that the current stable 4.6 version is compatible with Windows 10. Almost a year ago to the day.

I seriously considered reaching out to AgileBits support, but at this point I doubt there is anything they can tell me that will convince me to move my data to 1Password Families. The 4.6 product looks a lot better, but I guess it is the old stuff now, as there does not seem to be any development on it. The MD5 signature on the current download as of July 2016 is from February 23, 2016. It does not support the kind of sync I was looking for either, and if the horrible UI of Families v6 is a sign of what is to come, I am out.

I have since moved on to somewhat greener pastures, and I am currently testing another similar product. If that results in another horrible experience, maybe there will be another review…

Update 2017.03.16

In response to comments, I have written another post here: https://lokna.no/?p=2113