Event 324 from SQLAgent OpenCluster (reason: 5).

Problem

Overzealous monitoring alerts you to an error logged during a cluster failover, more specifically Event ID 324 from SQLAgent$InstanceName:

[Screenshot: Event ID 324 from SQLAgent$InstanceName]

Analysis

As mentioned, this happens during a failover that otherwise may pass without incident. Further analysis of the Application log shows that recovery is not yet finished at the time of the error. The next messages in the log relate to the server starting up and running recovery on the new node. For some reason this takes longer than expected. Maybe there were a lot of transactions in flight at the time of failover, maybe the server or storage is too slow, or maybe you were in the process of installing an update to SQL Server, which may lead to extensive recovery times. Or it may be something completely different. Whatever it was, it caused the cluster service to try to start the SQL Server Agent before the node was ready. Reason 5 is probably access denied, so the issue could be related to lack of permissions. I have yet to catch one of these early enough to have a cluster debug log covering the time of the error. Analysis of the cluster in question revealed another access-related error at about the same time, ESENT Event ID 490:

[Screenshot: ESENT Event ID 490]

This error is related to lack of permissions for the SQL Server engine and Agent run-as accounts. Whether or not these accounts should have local admin permissions on the node is a never-ending discussion. I have found, though, that granting the permissions causes far less trouble in a clustered environment than not doing so. There is always another issue, always another patch or feature requiring yet another explicit permission. From a security standpoint, it is easy to argue that the data served by the SQL Server is far more valuable than the local server it runs on. If an attacker is able to gain access to the run-as accounts, he already has access to read and change/delete the data. What happens to the local server after that is to me rather irrelevant. But security regulations aren't guaranteed to be either logical or sane.

Solution/Workaround

To solve the permission issue, you can either:

  • Add the necessary local permissions for the run-as accounts as discussed in KB2811566, and wait for the next “feature” requiring you to add even more permissions to something else. Also, make sure the Agent account has the proper permissions to your backup folders, and make sure you are able to create new databases. Not being able to do so may be caused by the engine account not having the proper permissions to your data/log folders.
  • Add the SQL Server engine and Agent run-as accounts to the local Administrators group on the server (a scripted example follows after the warning below).

Do NOT grant the runas accounts Domain Admin permissions. Ever.
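If you go for the local Administrators route, it can be scripted from an elevated prompt on each node. This is a minimal sketch; the account names are placeholders for your actual run-as accounts:

# Replace CONTOSO\svc-sqlengine and CONTOSO\svc-sqlagent with your engine and agent run-as accounts.
# net.exe works on any Windows version; Add-LocalGroupMember requires PowerShell 5.1 or later.
net localgroup Administrators "CONTOSO\svc-sqlengine" /add
net localgroup Administrators "CONTOSO\svc-sqlagent" /add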

Regarding the open cluster error:

On the servers I have analyzed with this issue, the log always shows the agent starting successfully within two minutes of the error, and it only happens during failover. I have yet to find it on servers where the permissions issue is not solved (using either method), but I am not 100% sure that they are related. I can, however, say that the message can safely be ignored as long as the agent is able to start successfully after the message.
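If you want to check this on your own cluster, something like the following pulls the recent agent events from the Application log so you can see Event 324 followed by the normal startup messages. The provider name is a placeholder for your own instance:

# Replace SQLAgent$INSTANCE with your agent provider name, e.g. SQLAgent$PROD01.
Get-WinEvent -FilterHashtable @{ LogName = 'Application'; ProviderName = 'SQLAgent$INSTANCE' } -MaxEvents 20 |
    Sort-Object TimeCreated |
    Format-Table TimeCreated, Id, Message -AutoSize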

SQL Agent jobs cause audit failures in the security log

Problem

When you audit your security log, something you are of course doing every day, you discover that SQL Server is causing an audit failure fairly often:

[Screenshot: recurring audit failures in the security log]

Audit failure event 4625, The user account has expired:

[Screenshot: Event 4625 details]

Closer inspection reveals that the event is triggered by the SQL Server engine service account. You could easily be led to believe that the engine account itself has expired, as it is the only account mentioned by name in the error message. That is not the case here: the “Account for which logon failed” is a null SID, also known as S-1-0-0 or nobody. This is Windows’ way of telling you that something failed without saying exactly what.

Analysis

The error appeared about every hour on the hour, so agent jobs were the top suspect. But the job logs disagreed; every job was working like a charm. Analysis of the central log repository revealed that the error had appeared suddenly, without any changes being made on the server at the time.

I discovered this on a cluster, so I tried failing over to another node. That didn’t change anything; the error followed sqlserver.exe to the other node. At least I had proven beyond reasonable doubt that the error was directly related to something SQL Server does. I was unable to find any fault on the server apart from the audit failure. I checked all SQL Server service accounts, none of which had an expiry date set. I drifted back to the agent jobs again, as agent job ownership has given me a hard time before. For some reason, if the person creating the jobs/maintenance plans is a sysadmin by group membership and not by direct membership, all agent jobs fail to execute unless you use a proxy account. I have blogged about this before (https://lokna.no/?p=1267), so I always check for this when a new instance is installed and correct it if necessary.
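Checking expiry dates account by account in AD Users and Computers gets tedious; with the ActiveDirectory module (RSAT) installed, a quick sketch like this lists them. The account names are placeholders:

# Requires the ActiveDirectory PowerShell module (RSAT).
# Replace the names below with your actual service and setup accounts.
'svc-sqlengine', 'svc-sqlagent', 'svc-sqlsetup' |
    ForEach-Object { Get-ADUser -Identity $_ -Properties AccountExpirationDate } |
    Select-Object SamAccountName, Enabled, AccountExpirationDate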

Then it struck me: there is a best practice somewhere stating that SQL Server installs are supposed to be executed as a special account, and not as an account associated with the person performing the installation, in case that person quits and his/her account has to be deleted. The agent jobs for backup, DBCC CHECKDB and such are usually created using another account, but maybe there was an anomaly. I ran a quick check, and yes, the agent jobs were owned by the setup user. And this user had since been marked as expired for security reasons, as it is only used during setup and as a way into the server in case someone removes all other sysadmins and locks themselves out. I know there are other ways into a SQL Server you don’t have access to, but this is a lot easier as long as you have access to the domain.

To list job owners:

SELECT name, owner_sid FROM msdb.dbo.sysjobs

[Screenshot: query results listing job names and owner SIDs]

0x1 is sa; all the other jobs were owned by the setup user. These SIDs are in hex format. To convert one to a username, run this:

SELECT SUSER_SNAME (0xHEXSID)

Solution

Change the job and/or maintenance plan owner. See https://lokna.no/?p=325
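If you want to script the change for plain agent jobs, a rough sketch using Invoke-Sqlcmd (from the SQL Server PowerShell modules) and msdb.dbo.sp_update_job could look like this. The instance name and the target owner (sa) are assumptions you should adjust, and maintenance plans may still need the treatment described in the linked post:

# Reassign every agent job not already owned by sa. Adjust $instance and the owner to taste.
# Note: job names containing single quotes would need extra escaping.
$instance = 'SERVER\INSTANCE'   # placeholder
$jobs = Invoke-Sqlcmd -ServerInstance $instance -Database msdb -Query "
    SELECT name FROM dbo.sysjobs WHERE SUSER_SNAME(owner_sid) <> 'sa'"
foreach ($job in $jobs) {
    Invoke-Sqlcmd -ServerInstance $instance -Database msdb -Query "
        EXEC dbo.sp_update_job @job_name = N'$($job.name)', @owner_login_name = N'sa'"
}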

Does your cluster have the recommended hotfixes?

From time to time, the good people at Microsoft publish a list of problems with failover clustering that have been resolved. This list, as all such bad/good news, comes in the form of a KB, namely KB2784261. I check this list from time to time. Some of the fixes relate to a specific issue, while others are more of the go-install-them-at-once type. As a general rule, I recommend installing ALL the hotfixes, regardless of the attached warning telling you to only install them if you experience a particular problem. In my experience, hotfixes are at least as stable as regular patches, if not better. That being said, sooner or later you will run across patches or hotfixes that will make a mess and give you a bad or very bad day. But then again, that is why cluster admins always fight for funding of a proper QA/test environment, preferably one that is equal to the production system in every way possible.

Anyway, this results in having to check all my servers to see if they have the hotfixes installed. Luckily some are included in Microsoft Update, but some you have to install manually. To simplify this process, I made the following PowerShell script. It takes a list of hotfixes and returns a list of the ones that are missing from the system. The script could easily be adapted to run against several servers at once, but I have to battle way too many internal firewalls to attempt such dark magic. Be aware that some hotfixes have multiple KB numbers and may not show up even if they are installed. This usually happens when updates are bundled together as a cumulative package or superseded by a new version. The best way to test whether patch/hotfix X needs to be installed is to try to install it; the installer will tell you whether or not the patch is applicable.

Edit: Since the original post, I have added KB lists for 2008 R2 and 2012 R2 based clusters. All you have to do is replace the $recommendedPatches list with the one you need. Links to the correct KB list are included for each OS. I have also discovered that some of the hotfixes are available through the Microsoft Update Catalog, thus bypassing the captcha/email hurdle.

2012 version

$menucolor = [System.ConsoleColor]::gray
write-host "╔═══════════════════════════════════════════════════════════════════════════════════════════╗"-ForegroundColor $menucolor
write-host "║                              Identify missing patches                                     ║"-ForegroundColor $menucolor
write-host "║                              Jan Kåre Lokna - lokna.no                                    ║"-ForegroundColor $menucolor
write-host "║                                       v 1.2                                               ║"-ForegroundColor $menucolor
write-host "║                                  Requires elevation: No                                   ║"-ForegroundColor $menucolor
write-host "╚═══════════════════════════════════════════════════════════════════════════════════════════╝"-ForegroundColor $menucolor
#List your patches here. Updated list of patches at http://support.microsoft.com/kb/2784261
$recommendedPatches = "KB2916993", "KB2929869","KB2913695", "KB2878635", "KB2894464", "KB2838043", "KB2803748", "KB2770917"
 
$missingPatches = @()
foreach($patch in $recommendedPatches){
    if (!(Get-HotFix -Id $patch -ErrorAction SilentlyContinue)) { 
        $missingPatches += $patch 
    }    
}
$intMissing = $missingPatches.Count
$intRecommended = $recommendedpatches.count
Write-Host "$env:COMPUTERNAME is missing $intMissing of $intRecommended patches:" 
$missingPatches

2008 R2 version

A list of recommended patches for Win 2008 R2 can be found here: KB2545685

$recommendedPatches = "KB2531907", "KB2550886","KB2552040", "KB2494162", "KB2524478", "KB2520235"

2012 R2 version

A list of recommended patches for Win 2012 R2 can be found here: KB2920151

#All clusters
$recommendedPatches = "KB3130944", "KB3137691", "KB3139896", "KB3130939", "KB3123538", "KB3091057", "KB3013769", "KB3000850", "KB2919355"
#Hyper-V Clusters
$recommendedPatches = "KB3130944", "KB3137691", "KB3139896", "KB3130939", "KB3123538", "KB3091057", "KB3013769", "KB3000850", "KB2919355", "KB3090343", "KB3060678", "KB3063283", "KB3072380"

Hyper-V

If you are using the Hyper-V role, you can find additional fixes for 2012 R2 in KB2920151 below the general cluster hotfixes. If you use NVGRE, look at this list as well: KB2974503

Sample output (computer name redacted)

[Screenshot: sample script output]

Edit:

I have finally updated my script to remove those pesky red error messages seen in the sample above.

Kernel memory leak analysis

I have had several issues in the past year involving kernel memory leaks, so I decided to make a separate blog post about general kernel memory leak analysis. In this post I mostly use the amazing Sysinternals tools for troubleshooting. You also need Poolmon.exe, a small utility currently part of the Windows Driver Kit. Sadly, this 35 KB self-contained .exe is not available as a separate download; you have to download and install the entire 500+ MiB WDK somewhere to extract it. You only have to do this once though, as there is no need to install the WDK on every system you analyze. You can just copy the executable from the machine where you installed the WDK.
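The copy itself can be scripted. The WDK typically drops poolmon.exe under the Windows Kits folder; the exact source path varies with the WDK version, so both paths below are assumptions to verify:

# Copy poolmon.exe from the machine where the WDK is installed to the server being analyzed.
# Source path is an example from a WDK 10 installation; check your own installation folder.
Copy-Item 'C:\Program Files (x86)\Windows Kits\10\Tools\x64\poolmon.exe' '\\SERVER01\c$\Tools\'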

Problem

Something is causing the kernel paged or non-paged pools to rise uncontrollably. Screenshot from Process Explorer’s System Information dialog:

[Screenshot: Process Explorer System Information dialog]

In this sample, the non-paged pool has grown to an unhealthy 2.2 GiB, and it continues to grow. Even though the pool limit is 128 GiB and the server has a whopping 256 GiB of RAM, the kernel memory pools are usually way below the 1 GiB mark. You should of course baseline this to make sure you actually have an issue, but generally speaking, every time I find a kernel memory value above 1 GiB I go hunting for the cause.

Note: To show the pool limits, you have to enable symbols in Process Explorer. Scott Hanselman has blogged about that here: http://www.hanselman.com/blog/SetUpYourSystemToUseMicrosoftsPublicSymbolServer.aspx
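If you want to baseline the pool sizes without keeping Process Explorer open, the built-in performance counters can be sampled; a minimal sketch:

# Sample the kernel pool sizes (in bytes) every 5 minutes, 12 samples = 1 hour.
Get-Counter -Counter '\Memory\Pool Nonpaged Bytes', '\Memory\Pool Paged Bytes' -SampleInterval 300 -MaxSamples 12 |
    ForEach-Object { $_.CounterSamples | Select-Object TimeStamp, Path, CookedValue }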

Analysis

Kernel leaks are usually caused by a driver. Kernel leaks in the OS itself are very rare, unless you are running some sort of beta version of Windows. To investigate further, you have to fire up poolmon.exe.

[Screenshot: poolmon.exe default view]

Poolmon has a lot of shortcuts. From KB177415:

P – Sorts tag list by Paged, Non-Paged, or mixed. Note that P cycles through each one.
B – Sorts tags by max byte usage.
M – Sorts tags by max byte allocation.
T – Sort tags alphabetically by tag name.
E – Display Paged, Non-paged total across bottom. Cycles through.
A – Sorts tags by allocation size.
F – Sorts tags by “frees”.
S – Sorts tags by the differences of allocs and frees.
Q – Quit.

The important ones are “P”, to view either paged or non-paged pool tags, and “B”, to list the ones using the most of it at the top. The same view as above, after pressing “P” and “B”:

[Screenshot: poolmon sorted by non-paged bytes on a healthy system]

The “Cont” tag relates to “Contiguous physical memory allocations for device drivers”, and is usually the largest tag on a normal system.

And this screenshot is from the server with the non-paged leak:

[Screenshot: poolmon on the leaking server, LkaL tag at the top]

As you can see, the LkaL tag is using more than 1 GiB on its own, accounting for half of the pool. We have identified the pool tag; now we have to look for the driver that owns it. To do that, I use one of two methods:

1: Do an internet search for the pool tag.

http://blogs.technet.com/b/yongrhee/archive/2009/06/24/pool-tag-list.aspx contains a large list of tags.

2: Use Sysinternals strings together with findstr.

Most kernel mode drivers are located in “%systemroot%\System32\drivers”. First, start an elevated command prompt. Make sure the Sysinternals suite is installed somewhere on the path, and enter the following commands:

  • cd “%systemroot%\System32\drivers”
  • strings * | findstr [tag]

Example:

[Screenshot: strings piped to findstr locating the tag]
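For reference, a PowerShell equivalent of the same search for the LkaL tag from this example (assuming strings.exe from the Sysinternals suite is on the path):

# Search all driver binaries for the pool tag identified by poolmon.
Set-Location "$env:SystemRoot\System32\drivers"
strings.exe * | Select-String 'LkaL'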

Hopefully, you should now have the name of the offending driver. To gather more intel about it, use Sysinternals sigcheck:

[Screenshot: sigcheck output for the offending driver]
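A minimal sketch of the sigcheck step; the driver file name below is a made-up placeholder, use whichever file the strings search pointed you to:

# -a lists extended version information (company, product, version) for the file.
sigcheck.exe -a "$env:SystemRoot\System32\drivers\exampledriver.sys"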

In this case, the offending driver is part of Diskeeper.

Solution

You have to contact the manufacturer of the offending driver and check for an updated version. In this example I got a version of Diskeeper without the offending driver, but a good place to start is the manufacturer’s website. And remember, if you already have the newest version, you can always try downgrading to a previous one. If you present your findings as detailed above, most vendors are very interested in identifying the cause of the leak and will work with you to resolve the issue.

Limit the size of the error log

Problem

Something bad happens, and the error log is bloated with error messages very fast, triggering low disk space alarms and eventually filling the drive to the brim. For a concrete example, see https://lokna.no/?p=1263.

Analysis

Even though I have spent considerable time searching for a solution to this, so far I have been at a loss. My best bet has been limiting the number of error log files. If the default limit of 6 isn’t changed, there seems to be no entry in the registry. On clustered instances, errors such as the one mentioned above can cause a failover roundabout, and I have seen several hundred log files being generated (on SQL Server 2008 R2). Changing the limit adds a registry value, and this seems to force SQL Server to enforce the limit.

This doesn’t solve the most pressing issue though. If some sort of error starts pushing several log messages every second, a small base volume with, say, 10 GiB of free space is filled to the brim in hours if not minutes. Thus, when you are woken up by the low disk space alarm it is already too late to stop it, and you run the risk of NTFS corruption. In my experience, the chance of NTFS corruption increases sharply once you pass 5% free space on a volume. Zero bytes of free space with processes still trying to write more data to the volume more or less guarantees some kind of NTFS corruption unless you catch it quickly.
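Changing the limit can be scripted. The registry value is NumErrorLogs (REG_DWORD) under the instance’s MSSQLServer key; the instance folder below is just an example, so adjust it to your own installation (or set the limit in SSMS under SQL Server Logs > Configure):

# Limit the instance to 30 error log files. The instance folder (MSSQL10_50.MSSQLSERVER) is a placeholder.
$key = 'HKLM:\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQLServer'
New-ItemProperty -Path $key -Name NumErrorLogs -Value 30 -PropertyType DWord -Force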


Misaligned IOs on SSD storage

Problem

The SQL server error log on one of my secondary replicas is riddled with error messages like this one:

“There have been 271104 misaligned log IOs which required falling back to synchronous IO.  The current IO is on file ….”

[Screenshot: misaligned IO messages in the SQL Server error log]

Analysis

The AOAG is running on Windows Server 2012, and I had been lulled into thinking that drive misalignment was a thing of the past, expelled from reality with my last Win 2003 R2 server. It seems I was wrong. However, analysis of the drives showed that they were properly aligned. This particular replica is running on local SSD drives, whereas the primary replica is connected to a SAN. I thought “maybe the SAN volumes are misaligned?”, but that too was in order. It turns out the culprit was the SSD drive itself: the SAN volume is using a 512 bytes per sector physical setup, but the new SSD drive is set up with 4096 bytes.

SSD:

[Screenshot: sector sizes reported for the SSD volume]

SAN:

[Screenshot: sector sizes reported for the SAN volume]

(Mount points are in use, hence a path instead of a drive letter.)

This was the only dividing factor I could find.
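To compare the values yourself, fsutil or Get-Disk (Windows Server 2012 and later) will report both the logical and the physical sector size. The mount point path below is a placeholder:

# Shows "Bytes Per Sector" and "Bytes Per Physical Sector" for the volume behind the mount point.
fsutil fsinfo ntfsinfo C:\MountPoints\Data01
# Or per disk:
Get-Disk | Select-Object Number, FriendlyName, LogicalSectorSize, PhysicalSectorSize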

Even though it seemed futile, I tried to find a way to change Bytes Per Physical Sector. I mean, the “physical” in the name makes it sound very final. But by converting the SSD to dynamic, mirroring it and making sure it was aligned, I finally got it anyway:

[Screenshot: the changed Bytes Per Physical Sector value]

I suppose it is still 4096 under the covers, but the cluster has been running for a week, and no new error messages have been logged so far.

Solution

Get drives with identical Bytes Per Physical Sector for all your AOAG nodes. I have heard rumors about SANs being able to define this on a per-volume basis, but I consider that a poor workaround as it will probably add latency. It sounds better to upgrade all replicas to the new fancy 4096 bytes per sector drives, as they are probably faster for SQL Server anyway.

Creating firewall rules for SQL server using Powershell

On Win 2012/PowerShell 3 there is a cmdlet called “New-NetFirewallRule” that allows for scripted creation of firewall rules. This makes it a lot easier to get the rules right. I have previously used GPO to push this to my SQL servers, but sadly I have discovered that it does not always work. For some reason, servers don’t like to have their firewall rules pushed by GPO. This meant I had to check them every time anyway, so I just resorted to creating them manually. But now, thanks to the wonders of PowerShell 3, maybe I won’t have to do that again. :)

More information about the cmdlet can be found here: http://technet.microsoft.com/en-us/library/jj554908.aspx

Sample code

This code creates rules to allow the SQL server browser (UDP 1434), the standard engine port for two instances (TCP 1433 and 1434) and the default port for AOAG endpoints (TCP 5022).

New-NetFirewallRule -DisplayName "MSSQL BROWSER UDP" -Direction Inbound -LocalPort 1434 -Protocol UDP -Action Allow
New-NetFirewallRule -DisplayName "MSSQL ENGINE TCP" -Direction Inbound -LocalPort 1433-1434 -Protocol TCP -Action Allow
New-NetFirewallRule -DisplayName "MSSQL AOAG EP TCP" -Direction Inbound -LocalPort 5022 -Protocol TCP -Action Allow

Aligning dynamic disks

If you are using dynamic disks for some reason (and please avoid using them if you don’t have to), the partitions on them are likely to be misaligned for SQL Server. I discovered this while trying to configure an AO Availability Group where one of the replicas was using local SSD drives configured with software RAID 1.

Problem

Since this was a new setup for me, I ran the old “wmic partition get BlockSize, StartingOffset, Name, Index” just to make sure everything was in order. To my astonishment, it was not:

[Screenshot: wmic output showing the misaligned offset]

For some reason, the partition is using the old Win 2003 31.5 KB offset! To make it worse, I discovered this AFTER I had installed SQL Server. Since dynamic disks and software RAID basically suck, information about this on the great interweb was sparse. But after some searching I found a cure, at least for volumes without RAID, at http://blogs.utexas.edu/alex/2013/04/04/windows-aligning-dynamic-disks/. (Link dead as of 2016.10)

Solution

Based on the above-mentioned blog post, with my comments and changes for RAID. Be aware that this process may be destructive. This guide assumes that you, as I did, already have an active mirror with the wrong alignment. If you have fresh drives, just ignore the parts about breaking the mirror and moving data.

  • Make sure you have a valid backup
  • Be prepared to do a clean install if necessary
  • Break the mirror
  • Give both drives new drive-letters and restart the server to make sure no active application/service is using the drive
  • Run diskpart, and select one of the drives that was part of the mirror. If you don’t know how to do this, STOP and ask someone to help you or read up on diskpart BEFORE you continue.
  • Execute the following diskpart commands against the selected drive. This guide expects that you want one volume to fill the entire drive. If you don’t want this, think long and hard about why and consider changing your mind.
  • clean
  • online disk
  • attributes disk clear readonly
  • convert gpt
  • select part 1
  • delete part override
  • create partition msr size 128
  • convert dynamic
  • create volume simple align=1024
  • Screenshots: [diskpart session screenshots]
  • Then, format the new partition with 64K allocation unit size
  • Move the data from the other partition that was part of the mirror, and hope this will work
  • Run the diskpart commands against the other disk. Be aware, this will delete the data, so make sure the move command was successful. And this time, skip the last create volume command.
  • Select second disk
  • clean
  • online disk
  • attributes disk clear readonly
  • convert gpt
  • select part 1
  • delete part override
  • create partition msr size 128
  • convert dynamic

You should now have two dynamic disks, one with the data and one unallocated. Now, to add the mirror. I discovered that I was unable to add the second drive back as a mirror; the add mirror option was grayed out. I solved this by first shrinking the original volume by 50 MB and then creating the mirror. I didn’t test this extensively, but I would guess that 5 or 10 megabytes of free space would have been enough.
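To verify the result, the partition offsets can be checked again; a starting offset that divides evenly by 1 MiB means the partition is aligned:

# List partition offsets and flag the ones starting on a 1 MiB boundary.
Get-CimInstance Win32_DiskPartition |
    Select-Object Name, Index, BlockSize, StartingOffset,
        @{Name='AlignedTo1MiB'; Expression={ $_.StartingOffset % 1MB -eq 0 }}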

Weak event created

Problem

In Windows Failover Cluster Manager, clicking on a node in the tree raises the following error:

[Screenshot: the weak event error dialog]

This is one of the biggest error messages I have ever seen in a Microsoft product, that is, regarding the size of the message box. :) The error states: “A weak event was created and it lives on the wrong object, there is a very high chance this will fail, please review your code to prevent the issue” and goes on with a .NET call stack.

Analysis

This feels like a .NET Framework issue to me, and a quick search rustled up the following post on TechNet: http://blogs.technet.com/b/askcore/archive/2013/01/14/error-in-failover-cluster-manager-after-install-of-kb2750149.aspx, stating that this is a bug caused by KB2750149.

Solution

Install KB2803748, available here: http://support.microsoft.com/kb/2803748

To check if both or any of the patches are installed, run the following PowerShell commands:

Get-HotFix -id KB2750149
Get-HotFix -id KB2803748

Health Service Script Event 4001

Problem

The event log is littered with regular copies of the following event:

[Screenshot: Health Service Script Event 4001]

“GetSQL2008SPNState.vbs : The Query ‘SELECT ProtocolName FROM ServerNetworkProtocol where Enabled = true and InstanceName = ‘INSTANCE” did not return any valid instances.  Please check to see if this is a valid WMI Query.. Invalid class”

I do not know what impact this has on SCOM (if any), but the warning is escalated to the Administrative Events view and is thus making it harder to spot other errors, as it occurs very frequently. On one of my servers this error occurred every 10 minutes or so. Furthermore, I suspect the cause of the problem to be changes in RunAs accounts in SCOM, but I am not certain.

Solution

I found this on the TechNet forums: http://social.technet.microsoft.com/Forums/en-US/operationsmanagermgmtpacks/thread/01eff618-1087-4b6a-9d3f-9f1402ddf3f4

Re-register the SQL management object (MOF) data. I have yet to figure out exactly what MOF data is, but SCOM uses it to monitor the SQL Server instance(s) on the server. Perform the following operations:

  • Start an administrative command prompt
  • Browse to “C:\Program Files (x86)\Microsoft SQL Server\100\Shared”
  • Run mofcomp sqlmgmproviderxpsp2up.mof
  • Restart the SCOM agent service (System Center Management)

[Screenshot: mofcomp output]
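The whole procedure can be scripted roughly like this from an elevated PowerShell prompt. The path is for the SQL Server 2008/2008 R2 shared folder (“100”), and the service name HealthService for the System Center Management service is an assumption worth verifying on your agent version:

# Recompile the SQL Server MOF data and restart the SCOM agent.
Set-Location 'C:\Program Files (x86)\Microsoft SQL Server\100\Shared'
mofcomp.exe .\sqlmgmproviderxpsp2up.mof
Restart-Service -Name HealthService   # display name: System Center Management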