Error 3930 installing SQL 2012 SP2 with CU3 in cluster

Problem

I was patching one of my clusters to SQL2012 SP2 and SP2 CU3 when something bad happened. This particular cluster is a 3 node cluster with a FCI Primary AOAG replica instance on node 1 and 2, and a stand alone Secondary AOAG replica instance on node 3. Node 3 is used for HADR when the shared storage or other shared infrastructure has an outage.

The update passed QAT with flying colors, but sadly that does not always guarantee a successful production run. My standard patch procedure for this cluster:

  • Patch node 3
  • Patch node 2 (passive FCI node)
  • AOAG failover to node 3, node 3 becomes AOAG Primary
  • FCI failover from node 1 to node 2
  • Patch node 1
  • FCI failover to node 1
  • AOAG failover to node 1
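
For reference, steps 3 and 4 of this procedure can be scripted. This is a minimal sketch, assuming the FailoverClusters and SQLPS modules and hypothetical names for the nodes, instance and availability group:

Import-Module FailoverClusters
Import-Module SQLPS -DisableNameChecking

# Step 3: make the stand-alone replica on node 3 the AOAG primary.
# For an asynchronous replica, switch it to synchronous commit first or add -AllowDataLoss.
Switch-SqlAvailabilityGroup -Path "SQLSERVER:\SQL\NODE3\INST1\AvailabilityGroups\MyAG"

# Step 4: move the FCI from node 1 to the freshly patched node 2.
Move-ClusterGroup -Name "SQL Server (INST1)" -Node "NODE2"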

When I tried to fail over the FCI to node 2 (step 4 above), the instance failed. At first I was worried that the SP2 upgrade process was so lengthy or slow that it was triggering the FCI timeouts. An inspection of the SQL Server error log revealed that this was not the case. Instead, I was the victim of a dreaded master database failure:

2015-01-12 01:28:02.82 spid7s      Database 'master' is upgrading script 'msdb110_upgrade.sql' from level 184552836 to level 184554932.
2015-01-12 01:28:02.82 spid7s      ----------------------------------
2015-01-12 01:28:02.82 spid7s      Starting execution of PRE_MSDB.SQL
2015-01-12 01:28:02.82 spid7s      ----------------------------------
2015-01-12 01:28:02.96 spid7s      Error: 3930, Severity: 16, State: 1.
2015-01-12 01:28:02.96 spid7s      The current transaction cannot be committed and cannot support operations that write to the log file. Roll back the transaction.
2015-01-12 01:28:02.96 spid7s      Error: 912, Severity: 21, State: 2.
2015-01-12 01:28:02.96 spid7s      Script level upgrade for database 'master' failed because upgrade step 'msdb110_upgrade.sql' encountered error 3930, state 1, severity 16. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the 'master' database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion.
2015-01-12 01:28:02.97 spid7s      Error: 3417, Severity: 21, State: 3.
2015-01-12 01:28:02.97 spid7s      Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online.
2015-01-12 01:28:02.97 spid7s      SQL Server shutdown has been initiated
2015-01-12 01:28:02.97 spid7s      SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.

Analysis

In case misbehaving SQL Server instances are able to smell fear, I am glad I was located several miles away from the datacenter at this point in time. While a rebuild of master is certainly doable even in a complex setup such as this, it is not something you want to do at 2am without a detailed plan if you don't have to. Thus, I tried failing the instance back to node 1 (still running SP1 CU11). To my amazement it came online straight away. I have seen similar issues reduce clustered instances to an unrecognizable puddle of zeros and ones in a corner of the SAN drive, so this was a welcome surprise. Feeling lucky, I tried another failover to node 2, only to be greeted with another failure and the exact same errors in the log.

A quick search revealed several similar issues, but no exact matches and no feasible solutions. The closest was a suggestion to disable replication during the upgrade. As you probably know, AOAG is just replication in a fancy dress, so I went looking for my Disaster Recovery Runbook, which contains ready-made scripts and plans for disabling and re-enabling AOAG. My only problem is that disabling AOAG will take down the AOAG listener, thus disconnecting all clients. Such antics result in grumpy client systems, web service downtime and a lot of paperwork for instance reviews, and are therefore something to avoid if at all possible.

Just for the fun of it, I decided to try making node 2 the AOAG primary during the upgrade. To my astonishment, this worked like a charm. Crisis (and paperwork) averted.

Solution

You have to promote the FCI to AOAG primary before triggering the upgrade from SP1 to SP2. The upgrade is triggered by failing the FCI over from a node running SP1 to a node running SP2, in my case the failover from node 1 to node 2 after patching node 2.
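
Reusing the hypothetical names from the sketch in the problem section (and assuming SQLFCI is the FCI's virtual network name), the workaround boils down to reordering the failovers: make the FCI the AOAG primary first, then move it to the patched node to trigger the script upgrade.

# Make the FCI the AOAG primary again before the upgrade is triggered.
Switch-SqlAvailabilityGroup -Path "SQLSERVER:\SQL\SQLFCI\INST1\AvailabilityGroups\MyAG"

# Then fail the FCI over to the freshly patched node. The script upgrade of
# master/msdb now runs while the instance holds the AOAG primary role.
Move-ClusterGroup -Name "SQL Server (INST1)" -Node "NODE2"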

Sadly, there is no fixed procedure for patching failover cluster instances. Some patches will only install on the active FCI node, and will then continue to patch all nodes automatically. But most patches follow the recipe above, where the passive node(s) are patched first.

This issue will probably not affect “clean” AOAG or FCI clusters where you only apply one technology. If you use FCI with replication on the other hand, you may experience the same issue.

Definitions

AOAG = Always On Availability Group

FCI = Failover Cluster Instance

HADR = High Availability / Disaster Recovery

Asynchronous AOAG down after restart of primary node

Background

This article is the result of a long day in the woods on a SAR mission turning into an even longer night due to a difficult cluster.

I have many clusters. One of them is a cluster with 3 nodes: two of them run a regular Failover Cluster Instance with shared storage, while the third node has local storage and serves as an AOAG replica for the most critical databases. The Failover Cluster Instance is the primary replica as long as we don't have an HA/DR scenario where the SAN is down or there are massive hardware issues. This is a very specific setup, but I would not be surprised if this problem could be caused by restarting the primary replica of any asynchronous AOAG.

Problem

During scheduled maintenance, I failed over the instance containing the primary replica from one node to another manually. This is what usually happens:

  • SQL instance is manually moved from node A to node B
  • The AOAG listener cluster resources fail on node A
  • The AOAG listener cluster resources are automatically moved to node B
  • The AOAG listener cluster resources come on line on node B

But not this time. This time, the AOAG listener objects came online on node A. Such a thing is not supposed to happen. In my experience, AOAG listeners always stay with the primary node.

Note: I do not recommend this procedure as a standard maintenance procedure. It is always best to make sure that the instance you are restarting is NOT the primary replica of any AOAG.

Anyway, the end result is that SQL Server patiently waits for the AOAG listeners to come online on the correct server. Or perhaps patiently is not the correct word. It spews angry error messages into the logs, and the AOAG dashboard is all red.

[Screenshot: AOAG dashboard showing the replicas in an error state]

“Availability replica is disconnected”

“Availability replica does not have a healthy role”

Solution

The solution is quite simple. It does, however, require crossing over to the dark side, performing unspeakable dark magic. Magic that should never be performed in the presence of an AOAG listener, much less be performed on the listener itself. But simple it is, as long as you are the kind of person who knows your way around the part of Windows where Failover Cluster Manager dwells, or is familiar with the sky blue realm of the PowerShell cluster cmdlets. Either will suffice. However, if you are not such a person, if some of the sentences in this article sound like dark incantations overheard in a shady tavern, please seek assistance from someone who is before you proceed down this path in production. Please note that normally, AOAG listener resources should never be manipulated manually. Trying to do so usually just makes the situation even more dire. But here goes:

Take the AOAG listener resources offline manually, and then bring them back online. Doing so should make the resources realize the errors of their ways and promptly enter the failed state. The cluster and SQL Server should detect this and take action. All failed listeners should be whisked over to the primary replica node and brought online without any need for further input from you.
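
A minimal sketch of that off/on bounce, using the FailoverClusters module. The listener resource name is hypothetical; check Get-ClusterResource for the real one (it is usually on the form AGName_ListenerName).

Import-Module FailoverClusters
# Bounce the listener resource; a failed restart lets the cluster and SQL Server take over.
Get-ClusterResource -Name "MyAG_MyListener" | Stop-ClusterResource
Get-ClusterResource -Name "MyAG_MyListener" | Start-ClusterResource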

However, if the listeners are still offline, active on the wrong cluster node, or, even worse, locked in a failed state, you still have options. The suggestions below are listed in order of increasing estimated time consumption.

  • Move the listener resources manually to the node containing the primary replica.
  • Shut down all the nodes, then start just one and let SQL Server initialize completely before you start the others.
  • Look for underlying domain and network issues preventing the listener from starting.
  • Destroy and re-create the availability group and listeners.

The server cannot accept TCP connections after disabling named pipes

Problem

I always disable the named pipes protocol on my SQL Servers if they are going to accept network connections from clients. I am not sure if named pipes still poses a threat to my network in a 10Gb Ethernet world, but I have previously spent a considerable amount of time troubleshooting network latencies caused by named pipes. For some background on TCP/IP vs NP, check out this article by devpro. Going into SQL Server Configuration Manager and turning it off has become part of my routine, so imagine my surprise when one of the junior DBAs at my workplace told me that this caused a new cluster instance he was installing to fail. I could of course have dismissed it as some strange fault during installation, but a quick survey revealed that he had followed my SQL Server cluster installation manual to the letter. And this happened only in production, not in the identical QA environment. The only error message logged during SQL startup was this one:

“Error: 26058, Severity: 16, State: 1. A TCP provider is enabled, but there are no TCP listening ports configured. The server cannot accept TCP connections.”

The failover cluster service does not like SQL Server instances that are not network connectable, so it promptly shut down the instance, followed by a slew of error messages in the event log. If I remember correctly, the default timeout for a SQL Server instance to come online after it is started is 3 minutes. That is, it has to be ready for client connections and respond to the cluster service’s health checks in 3 minutes. This does not include any time needed to recover user databases, so in a physical environment with a functional storage subsystem this poses no problem at all. If you have slow hardware or other problems causing a slow startup you can increase the timeout through failover cluster manager, but I have personally never had the need to increase it.
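
The timeout in question is the PendingTimeout property of the SQL Server cluster resource. A minimal sketch for inspecting and raising it, assuming a hypothetical resource name:

Import-Module FailoverClusters
# Default is 180000 ms (3 minutes); raise it only if the instance legitimately needs longer to start.
(Get-ClusterResource -Name "SQL Server (INST1)").PendingTimeout
(Get-ClusterResource -Name "SQL Server (INST1)").PendingTimeout = 300000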

So far, I have only seen this issue in the following scenario:

  • MSSQL 2012 SP1
  • Windows 2012
  • Failover cluster
  • >1 instance of SQL Server in the cluster

But I have been unable to reproduce the problem consistently in my lab. I have otherwise equal cluster environments where some have this issue and some do not.

Steps to reproduce:

  • Install instance 1
  • Apply patches to Instance 1
  • Test failover of instance 1
  • Disable named pipes on instance 1
  • Restart Instance 1
  • Install instance 2
  • Apply patches to instance 2
  • Test Failover of instance 2
  • Disable named pipes on instance 2
  • Restart instance 2
  • Instance 2 fails

Analysis

We spent quite a lot of time comparing the working cluster with the failing one. Down to firmware versions, drivers and patches applied. Everything was equal. I even considered a local disk fault, but failing over to another node didn’t change anything. The registry settings also looked ok. On a cluster, the registry settings are only available on the node currently owning the instance.  The network settings are found in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL11.[Instance name]\MSSQLServer\SuperSocketNetLib\.

[Screenshot: SuperSocketNetLib registry settings for the failing instance]

WARNING: NEVER try to change anything directly in the registry on a clustered instance, especially one that is not working. See Technet for a guide to changing clustered registry settings on an offline SQL instance.

As the sample shows, we even had Listen All enabled.
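
If you want to do the same comparison without opening regedit, a read-only check like this sketch works from PowerShell on the node that owns the instance. The instance ID MSSQL11.INST1 is hypothetical; adjust it to match your installation.

$base = 'HKLM:\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL11.INST1\MSSQLServer\SuperSocketNetLib'
Get-ItemProperty "$base\Tcp"       | Select-Object Enabled, ListenOnAllIPs   # TCP protocol settings
Get-ItemProperty "$base\Np"        | Select-Object Enabled                   # Named pipes
Get-ItemProperty "$base\Tcp\IPAll" | Select-Object TcpPort, TcpDynamicPorts  # Port configuration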

Workaround

After a lot of tinkering we came up empty and resorted to re-enabling Named Pipes. If anyone finds or knows of a solution to this, please leave a comment below.

Note: the following guide is vague by design. It is very difficult to provide a proper step by step guide for this workaround covering all alternatives. If you have questions, leave a comment including details about your situation, and I will try to clarify.

To enable named pipes again, we just started sqlservr.exe directly from the command line on the active node. This will usually bring it online. If not, stop the cluster service while you do this. Then we enabled named pipes in Configuration Manager and stopped the instance again. After that, we were able to start the instance as expected, albeit with named pipes enabled.
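
A rough outline of that sequence, with hypothetical paths and instance names, run from an elevated prompt on the node owning the SQL Server resources:

cd "C:\Program Files\Microsoft SQL Server\MSSQL11.INST1\MSSQL\Binn"
.\sqlservr.exe -sINST1       # starts the instance in the console window
# ...enable Named Pipes in SQL Server Configuration Manager...
# Ctrl+C in the console window shuts the instance down again; after that the
# cluster should be able to bring the instance online as usual.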

Warning: If you stop the cluster service on an active node, the cluster may try to fail over the instance to another node if there are other active nodes. Make sure the instance is listed as Failed before you try this. If all else fails, you may have to shut down the other nodes as well, thus taking down all instances in the cluster.

Clustered MSDTC fails

Problem

While setting up a new clustered Distributed Transaction Coordinator for a SQL Server FCI, it fails to come online when restarted. This time it happened after I enabled network DTC access, but I have had this happen a lot during patching and cluster failover. Usually I would just remove and reinstall, but that didn't seem to help this time. No matter what I did, Failover Cluster Manager would just list it as failed:

[Screenshot: Failover Cluster Manager listing the DTC resource as Failed]

Analysis

Looking in Services, I could see the DTC service was disabled:

[Screenshot: the clustered MSDTC service listed as Disabled in Services]

The GUID in the service name can be matched to the cluster resource in the registry. This is useful if you have more than one DTC in your cluster, as the cluster will not allow you to have several DTC resources with the same name. Additional DTCs are named “New Distributed Transaction Coordinator (1)”, (2) and so on.

[Screenshot: matching the GUID from the service name to the cluster resource in the registry]
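
As a sketch of the same matching exercise from PowerShell, you can list the DTC cluster resources with their IDs and compare them with the GUID in the MSDTC$... service name:

Import-Module FailoverClusters
Get-ClusterResource |
    Where-Object { $_.ResourceType.Name -eq 'Distributed Transaction Coordinator' } |
    Format-List Name, Id, State, OwnerGroup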

I tried to enable the service, only to be greeted with a snarky “This service is marked for deletion” message.

[Screenshot: “This service is marked for deletion”]

Then, I tried removing the resource and adding a new one, as this is my standard MO whenever I have trouble with a clustered MSDTC. Doing that I ended up with another “Marked for deletion” DTC service. My next idea was to fail over the instance, but then I thought, what if this had been in production? Thus, I kept on searching for another solution. And the solution turned out to be a simple one…

Solution

Log out ALL user sessions from the active node. This means all, yourself and any disconnected others included. Then log back in again, and bring the DTC resource online.
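
A sketch of the cleanup using built-in tools, run on the active node. The DTC resource name is hypothetical; use the one shown in Failover Cluster Manager.

quser            # lists all sessions on the node, including disconnected ones
logoff 2         # logs off session ID 2 - repeat for every other session listed
logoff           # finally log off your own session, then log back in
Import-Module FailoverClusters
Start-ClusterResource -Name "New Distributed Transaction Coordinator"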

[Screenshot: the DTC resource back online]

And by the way, remember to change the policy for the DTC resource to make sure that such errors don't take down and fail over the entire instance. It could have solved the problem, but it could just as easily have led to the instance failing back and forth until it ends up in a failed state. Adding a script or policy that automatically logs out inactive users from the cluster nodes once a day is also a good idea.

[Screenshot: the DTC resource policy settings]

Event 324 from SQLAgent OpenCluster (reason: 5).

Problem

Overzealous monitoring alerts you to an error logged during a cluster failover, more specifically Event ID 324 from SQLAgent$InstanceName:

[Screenshot: Event ID 324 from SQLAgent$InstanceName - OpenCluster (reason: 5)]

Analysis

As mentioned, this happens during a failover that otherwise may pass without incident. Further analysis of the Application log shows that recovery hasn't finished at that point. The next messages in the log are related to the server starting up and running recovery on the new node. For some reason this takes longer than expected. Maybe there were a lot of transactions in flight at the time of failover, maybe the server or storage is too slow, or maybe you were in the process of installing an update to SQL Server, which may lead to extensive recovery times. Or it may be something completely different. Whatever it was, it caused the cluster service to try to start the SQL Agent before the node was ready. Reason 5 is probably access denied, so the issue could be related to a lack of permissions. I have yet to catch one of these early enough to have a cluster debug log covering the time of the error. Analysis of the cluster in question revealed another access-related error at about the same time, ESENT Event ID 490:

[Screenshot: ESENT Event ID 490]

This error is related to lack of permissions for the SQL Server engine and Agent runas accounts. Whether or not these accounts should have Local Admin permissions on the node is a never-ending discussion. I have found, though, that granting the permissions causes far less trouble in a clustered environment than not doing so. There is always another issue, always another patch or feature requiring yet another explicit permission. From a security standpoint, it is easy to argue that the data served by the SQL Server is far more valuable than the local server it runs on. If an attacker is able to gain access to the runas accounts, he already has access to read and change/delete the data. What happens to the local server after that is to me rather irrelevant. But security regulations aren't guaranteed to be either logical or sane.

Solution/Workaround

To solve the permission issue, you can either:

  • Add the necessary local permissions for the runas accounts as discussed in KB2811566 and wait for the next “feature” requiring you to add even more permissions to something else. Also, make sure the Agent account has the proper permissions to your backup folders and make sure you are able to create new databases. Not being able to do so may be caused by the engine account not having the proper permissions to your data/log folders.
  • Add the SQL Server Engine and Agent runas accounts to the local administrators group on the server.

Do NOT grant the runas accounts Domain Admin permissions. Ever.
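
If you go for the second option, a quick sketch run on each node does the trick. The account names are hypothetical; use your own engine and Agent service accounts, and keep them out of any domain-level admin groups.

net localgroup Administrators "CONTOSO\svc-sqlengine" /add
net localgroup Administrators "CONTOSO\svc-sqlagent" /add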

Regarding the open cluster error:

On the servers I have analyzed with this issue, the log always shows the agent starting successfully within two minutes of the error, and it only happens during failover. I have yet to find it on servers where the permissions issue is not solved (using either method), but I am not 100% sure that they are related. I can however say that the message can safely be ignored as long as the Agent account is able to start successfully after the message.

SQL agent jobs cause Audit failure in security log

Problem

When you audit your security log, something you are doing every day of course, you discover that SQL server is causing an audit failure fairly often:

[Screenshot: recurring audit failures in the security log]

Audit failure event 4625, The user account has expired:

[Screenshot: event 4625 details]

Closer inspection reveals that the event is triggered by the SQL Server engine service account. You could easily be led to believe that the engine account itself has expired, as it is the only account mentioned by name in the error message. That is not the case here, as the “Account for which logon failed” is a null SID, also known as S-1-0-0 or Nobody. This is Windows' way of telling you that something failed, but I'm not going to say exactly what it was.

Analysis

The error appeared about every hour on the hour, so agent jobs were the top suspect. But the job logs disagreed; every job was working like a charm. Analysis of the central log repository revealed that the error appeared suddenly, without any changes being made on the server at the time.

I discovered this on a cluster, so I tried failing over to another node. That didn't change anything; the error followed sqlserver.exe to the other node. At least I had proven beyond reasonable doubt that the error was directly related to something SQL Server does. I was unable to find any fault on the server apart from the audit failure. I checked all SQL Server service accounts, none of which had an expiry date set. I drifted back to the agent jobs again, as agent job ownership has given me a hard time before. For some reason, if the person creating the jobs/maintenance plans is a sysadmin by group membership and not by direct membership, all agent jobs fail to execute unless you use a proxy account. I have blogged about this before though (https://lokna.no/?p=1267), so I always check for this when a new instance is installed and correct it if necessary.

Then it struck me: there is a best practice somewhere stating that SQL Server installs are supposed to be executed as a special account, and not an account associated with the person who performs the installation. This is in case that person quits and his/her account has to be deleted. The agent jobs for backup, DBCC CHECKDB and such are usually created using another account though, but maybe there was an anomaly. I ran a quick check, and yes, the agent jobs were owned by the setup user. And this user had since been marked as expired for security reasons, as it is only used during setup and as a way into the server in case someone removes all other sysadmins and locks themselves out. I know there are other ways into a SQL Server you don't have access to, but this is a lot easier as long as you have access to the domain.

To list job owners:

select name, owner_sid from msdb.dbo.sysjobs

[Screenshot: query output listing job names and owner SIDs]

0x1 is sa, all the other jobs were owned by the setup user. These SIDs are in hex format. To convert to a username, run this:

SELECT SUSER_SNAME (0xHEXSID)

Solution

Change the job and/or maintenance plan owner. See https://lokna.no/?p=325
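
A sketch of both steps from PowerShell using Invoke-Sqlcmd (SQLPS module). The instance name, the job name and the choice of sa as the new owner are assumptions; adjust them to your own standard.

Import-Module SQLPS -DisableNameChecking
$instance = 'SQLFCI\INST1'

# List all jobs with the owner SID resolved to a login name.
Invoke-Sqlcmd -ServerInstance $instance -Query "
    SELECT name, SUSER_SNAME(owner_sid) AS owner FROM msdb.dbo.sysjobs;"

# Change the owner of a single job.
Invoke-Sqlcmd -ServerInstance $instance -Query "
    EXEC msdb.dbo.sp_update_job @job_name = N'MyBackupJob', @owner_login_name = N'sa';"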

Does your cluster have the recommended hotfixes?

From time to time, the good people at Microsoft publish a list of problems with failover clustering that have been resolved. This list, like all such good/bad news, comes in the form of a KB, namely KB2784261. I check this list from time to time. Some of the fixes relate to a specific issue, while others are more of the go-install-them-at-once type. As a general rule, I recommend installing ALL the hotfixes regardless of the attached warning telling you to only install them if you experience a particular problem. In my experience, hotfixes are at least as stable as regular patches, if not better. That being said, sooner or later you will run across patches or hotfixes that will make a mess and give you a bad or very bad day. But then again, that is why cluster admins always fight for funding of a proper QA/T environment. Preferably one that is equal to the production system in every way possible.

Anyways, this results in having to check all my servers to see if they have the hotfixes installed. Luckily some are included in Microsoft Update, but some you have to install manually. To simplify this process, I made the following PowerShell script. It takes a list of hotfixes and returns a list of the ones that are missing from the system. This script could easily be adapted to run against several servers at once (see the sketch below the 2012 version of the script), but I have to battle way too many internal firewalls to attempt such dark magic. Be aware that some hotfixes have multiple KB numbers and may not show up even if they are installed. This usually happens when updates are bundled together as a cumulative package or superseded by a new version. The best way to test if patch/hotfix X needs to be installed is to try to install it. The installer will tell you whether or not the patch is applicable.

Edit: Since the original post, I have added KB lists for 2008 R2 and 2012 R2 based clusters. All you have to do is replace the "$recommendedPatches = " list with the one you need. Links to the correct KB list are included for each OS. I have also discovered that some of the hotfixes are available through the Microsoft Update Catalog, thus bypassing the captcha email hurdle.

2012 version

$menucolor = [System.ConsoleColor]::gray
write-host "╔═══════════════════════════════════════════════════════════════════════════════════════════╗"-ForegroundColor $menucolor
write-host "║                              Identify missing patches                                     ║"-ForegroundColor $menucolor
write-host "║                              Jan Kåre Lokna - lokna.no                                    ║"-ForegroundColor $menucolor
write-host "║                                       v 1.2                                               ║"-ForegroundColor $menucolor
write-host "║                                  Requires elevation: No                                   ║"-ForegroundColor $menucolor
write-host "╚═══════════════════════════════════════════════════════════════════════════════════════════╝"-ForegroundColor $menucolor
#List your patches here. Updated list of patches at http://support.microsoft.com/kb/2784261
$recommendedPatches = "KB2916993", "KB2929869","KB2913695", "KB2878635", "KB2894464", "KB2838043", "KB2803748", "KB2770917"
 
$missingPatches = @()
foreach($patch in $recommendedPatches){
    if (!(Get-HotFix -Id $patch -ErrorAction SilentlyContinue)) { 
        $missingPatches += $patch 
    }    
}
$intMissing = $missingPatches.Count
$intRecommended = $recommendedpatches.count
Write-Host "$env:COMPUTERNAME is missing $intMissing of $intRecommended patches:" 
$missingPatches
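
Here is the multi-server variant mentioned above, as a sketch assuming PowerShell remoting is enabled between you and the nodes (the node names are made up):

$nodes = "NODE1", "NODE2", "NODE3"
Invoke-Command -ComputerName $nodes -ScriptBlock {
    $recommendedPatches = "KB2916993", "KB2929869", "KB2913695", "KB2878635",
                          "KB2894464", "KB2838043", "KB2803748", "KB2770917"
    $missing = $recommendedPatches | Where-Object { !(Get-HotFix -Id $_ -ErrorAction SilentlyContinue) }
    "{0} is missing {1} of {2} patches: {3}" -f $env:COMPUTERNAME, $missing.Count, $recommendedPatches.Count, ($missing -join ", ")
}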

2008R2 Version

A list of recommended patches for Win 2008 R2 can be found here:  KB2545685

$recommendedPatches = "KB2531907", "KB2550886","KB2552040", "KB2494162", "KB2524478", "KB2520235"

2012 R2 Version

A list of recommended patches for Win 2012 R2 can be found here:  KB2920151

#All clusters
$recommendedPatches = "KB3130944", "KB3137691", "KB3139896", "KB3130939", "KB3123538", "KB3091057", "KB3013769", "KB3000850", "KB2919355"
#Hyper-V Clusters
$recommendedPatches = "KB3130944", "KB3137691", "KB3139896", "KB3130939", "KB3123538", "KB3091057", "KB3013769", "KB3000850", "KB2919355", "KB3090343", "KB3060678", "KB3063283", "KB3072380"

Hyper-V

If you are using the Hyper-V role, you can find additional fixes for 2012 R2 in KB2920151 below the general cluster hotfixes. If you use NVGRE, look at this list as well: KB2974503

Sample output (computer name redacted)

[Screenshot: sample script output with the computer name redacted]

Edit:

I have finally updated my script to remove those pesky red error messages seen in the sample above.

Kernel memory leak analysis

I have had several issues in the past year involving kernel memory leaks, so I decided to write a separate blog post about general kernel memory leak analysis. In this post I mostly use the amazing Sysinternals tools for troubleshooting. You also need Poolmon.exe, a small utility currently part of the Windows Driver Kit. Sadly, this 35k self-contained .exe is not available as a separate download; you have to download and install the entire 500+ MiB WDK somewhere to extract it. You only have to do this once though, as there is no need to install the WDK on every system you analyze. You can just copy the executable from the machine where you installed the WDK.

Problem

Something is causing the kernel paged or non-paged pools to rise uncontrollably. Screenshot from Process Explorer's System Information dialog:

[Screenshot: Process Explorer System Information dialog showing kernel memory usage]

In this sample, the non-paged pool has grown to an unhealthy 2.2 GB and continues to grow. Even though the pool limit is 128 GiB and the server has a whopping 256 GiB of RAM, the kernel memory pools usually stay well below the 1 GiB mark. You should of course baseline this to make sure you actually have an issue, but generally speaking, every time I find a kernel memory value above 1 GiB I go hunting for the cause.
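
For baselining you don't even need Process Explorer; the built-in performance counters will do. A quick sketch:

# Report the current paged and non-paged pool sizes in GiB.
Get-Counter '\Memory\Pool Paged Bytes', '\Memory\Pool Nonpaged Bytes' |
    Select-Object -ExpandProperty CounterSamples |
    Select-Object Path, @{ Name = 'GiB'; Expression = { [math]::Round($_.CookedValue / 1GB, 2) } }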

Note: To show the pool limits, you have to enable symbols in Process Explorer. Scott Hanselman has blogged about that here: http://www.hanselman.com/blog/SetUpYourSystemToUseMicrosoftsPublicSymbolServer.aspx

Analysis

Kernel leaks are usually caused by a driver. Kernel leaks in the OS itself are very rare, unless you are running some sort of beta version of Windows. To investigate further, you have to fire up poolmon.exe.

[Screenshot: poolmon.exe default view]

Poolmon has a lot of shortcuts. From KB177415:

P – Sorts tag list by Paged, Non-Paged, or mixed. Note that P cycles through each one.
B – Sorts tags by max byte usage.
M – Sorts tags by max byte allocation.
T – Sort tags alphabetically by tag name.
E – Display Paged, Non-paged total across bottom. Cycles through.
A – Sorts tags by allocation size.
F – Sorts tags by “frees”.
S – Sorts tags by the differences of allocs and frees.
Q – Quit.

The important ones are “P”, to view either paged or non-paged pool tags, and “B”, to list the ones using the most of it at the top. The same view as above, after pressing “P” and “B”:

[Screenshot: poolmon after pressing “P” and “B”]

The “Cont” tag relates to “Contiguous physical memory allocations for device drivers”, and is usually the largest tag on a normal system.

And this screenshot is from the server with the non-paged leak:

[Screenshot: poolmon on the affected server, with the LkaL tag at the top]

As you can see, the LkaL tag is using more than 1 GiB on its own, accounting for half of the pool. We have identified the pool tag; now we have to look for the driver that owns it. To do that, I use one of two methods:

1: Do an internet search for the pool tag.

http://blogs.technet.com/b/yongrhee/archive/2009/06/24/pool-tag-list.aspx contains a large list of tags.

2: Use Sysinternals strings together with findstr.

Most kernel mode drivers are located in “%systemroot%\System32\drivers”. First you have to start an elevated command prompt.  Make sure the Sysinternals suite is installed somewhere on the path, and enter the following commands:

  • cd “%systemroot%\System32\drivers”
  • strings * | findstr [tag]

Example:

[Screenshot: strings piped to findstr identifying the driver file containing the tag]

Hopefully, you should now have the name of the offending driver. To gather more intel about it, use Sysinternals sigcheck:

[Screenshot: sigcheck output for the offending driver]

In this case, the offending driver is part of Diskeeper.
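
For reference, the whole lookup condensed into a few lines, assuming strings.exe and sigcheck.exe are on the PATH. LkaL is the tag from this example, and the driver file name is a placeholder for whatever the findstr step reports on your system.

cd "$env:SystemRoot\System32\drivers"
strings * | findstr LkaL       # prints the file name(s) containing the tag
sigcheck suspectdriver.sys     # hypothetical file name - use the file reported above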

Solution

You have to contact the manufacturer of the offending driver and check for an updated version. In the example, I got a version of Diskeeper without the offending driver, but a good place to start is the manufacturer's website. And remember, if you already have the newest version, you can always try downgrading to a previous version. If you present your findings as detailed above, most vendors are very interested in identifying the cause of the leak and will work with you to resolve the issue.

Limit the size of the error log

Problem

Something bad happens, and the error log is bloated with error messages very fast, triggering low disk space alarms and eventually filling the drive to the brim. For a concrete example, see https://lokna.no/?p=1263.

Analysis

Even though I have spent considerable time searching for a solution to this, so far I have come up empty. My best bet has been limiting the number of error log files. If the default limit of 6 isn't changed, there seems to be no corresponding entry in the registry. On clustered instances, errors such as the one mentioned above can cause a failover roundabout, and I have seen several hundred log files being generated (on SQL Server 2008 R2). Changing the limit adds a registry value, and this seems to force SQL Server to enforce the limit.
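
The registry value in question is NumErrorLogs under the instance's MSSQLServer key, and the easiest supported way to set it is through SSMS or xp_instance_regwrite. A sketch, assuming a hypothetical instance name and a limit of 12 files:

Import-Module SQLPS -DisableNameChecking
Invoke-Sqlcmd -ServerInstance 'SQLFCI\INST1' -Query "
    EXEC xp_instance_regwrite N'HKEY_LOCAL_MACHINE',
         N'Software\Microsoft\MSSQLServer\MSSQLServer',
         N'NumErrorLogs', REG_DWORD, 12;"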

This doesn't solve the most pressing issue though. If some sort of error starts pushing several log messages every second, a small base volume with, say, 10 GiB of free space is filled to the brim in hours if not minutes. Thus, when you are woken up by the low disk space alarm it is already too late to stop it, and you run the risk of NTFS corruption. In my experience, the chance of NTFS corruption increases rapidly once you pass 5% free space on a volume. Zero bytes of free space with processes still trying to write more data to the volume more or less guarantees some kind of NTFS corruption unless you catch it quickly.


Misaligned IOs on SSD storage

Problem

The SQL server error log on one of my secondary replicas is riddled with error messages like this one:

“There have been 271104 misaligned log IOs which required falling back to synchronous IO.  The current IO is on file ….”

[Screenshot: misaligned log IO warnings in the SQL Server error log]

Analysis

The AOAG is running on Windows 2012, and I had been lulled into thinking that drive misalignment was a thing of the past, expelled from reality with my last Win2003R2 server. It seems I was wrong. Although, analysis of the drives shows that they are properly aligned. This particular replica is running with local SSD drives, whereas the primary replica is connected to a SAN. I thought “maybe the SAN volumes are misaligned?”, but that too was in order. It turns out the culprit was the SSD drive itself. The SAN volume uses 512 bytes per physical sector, but the new SSD drive is set up with 4096 bytes.
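
To check this yourself, fsutil will report the value for each volume (run from an elevated prompt; the mount point path below is just an example). Compare the “Bytes Per Physical Sector” line across the replicas - here the SAN volume reported 512 and the SSD 4096.

fsutil fsinfo ntfsinfo C:\MountPoints\Data01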

SSD:

[Screenshot: the SSD volume reporting 4096 bytes per physical sector]

SAN:

[Screenshot: the SAN volume reporting 512 bytes per physical sector]

(Mount points are in use, hence a path instead of a drive letter.)

This was the only dividing factor I could find.

Even though it seemed futile, I tried to find a way to change Bytes Per Physical Sector. I mean, the “physical” in the name makes it sound very final. But by converting the SSD to a dynamic disk, mirroring it and making sure it was aligned, I finally got it anyway:

[Screenshot: the mirrored volume now reporting matching bytes per physical sector]

I suppose it is still 4096 under the covers, but the cluster has been running for a week, and no new error messages have been logged so far.

Solution

Get drives with identical Bytes Per Physical Sector for all your AOAG nodes. I have heard rumors about SANs being able to define this on a per-volume basis, but I consider that a poor workaround as it will probably add latency. It sounds better to upgrade all replicas to the new fancy 4096 bytes per sector drives, as they are probably faster for SQL Server anyway.