MS SQL

You are currently browsing the archive for the MS SQL category.

From time to time, the good people at Microsoft publish a list of problems with failover clustering that has been resolved. This list, as all such bad/good news comes in the form of a KB, namely KB2784261. I check this list from time to time. Some of them relate to a specific issue, while others are more of the go-install-them-at-once type. As a general rule, I recommend installing ALL hotfixes regardless of the attached warning telling you to only install them if you experience a particular problem. In my experience, hotfixes are at least as stable as regular patches, if not better. That being said, sooner or later you will run across patches or hotfixes that will make a mess and give you a bad or very bad day. But then again, that is why cluster admins always fight for funding of a proper QA/T environment. Preferably one that is equal to the production system in every way possible.

Anyways, this results in having to check all my servers to see if they have the hotfixes installed. Luckily some are included in Microsoft Update, but some you have to install manually. To simplify this process, I made the following powershell script. It takes a list of hotfixes, and returns a list of the ones who are missing from the system. This script could easily be adapted to run against several servers at once, but I have to battle way to many internal firewalls to attempt such dark magic. Be aware that some hotfixes have multiple KB numbers and may not show up even if they are installed. This usually happens when updates are bundled together as a cummulative package or superseded by a new version. The best way to test if patch/hotfix X needs to be installed is to try to install it. The installer will tell you whether or not the patch is applicable.

Edit: Since the original, I have added KB lists for 2008 R2 and 2012 R2 based clusters. All you have to do is replace the ” $recommendedPatches = ” list with the one you need. Links to the correct KB list is included for each OS. I have also discovered that some of the hotfixes are available through Microsoft Update-Catalog, thus bypassing the captcha email hurdle.

2012 version

$menucolor = [System.ConsoleColor]::gray
write-host "╔═══════════════════════════════════════════════════════════════════════════════════════════╗"-ForegroundColor $menucolor
write-host "║                              Identify missing patches                                     ║"-ForegroundColor $menucolor
write-host "║                              Jan Kåre Lokna - lokna.no                                    ║"-ForegroundColor $menucolor
write-host "║                                       v 1.2                                               ║"-ForegroundColor $menucolor
write-host "║                                  Requires elevation: No                                   ║"-ForegroundColor $menucolor
write-host "╚═══════════════════════════════════════════════════════════════════════════════════════════╝"-ForegroundColor $menucolor
#List your patches here. Updated list of patches at http://support.microsoft.com/kb/2784261
$recommendedPatches = "KB2916993", "KB2929869","KB2913695", "KB2878635", "KB2894464", "KB2838043", "KB2803748", "KB2770917"
 
$missingPatches = @()
foreach($_ in $recommendedPatches){
    if (!(get-hotfix -id $_ -ea:0)) { 
        $missingPatches += $_ 
    }    
}
$intMissing = $missingPatches.Count
$intRecommended = $recommendedpatches.count
Write-Host "$env:COMPUTERNAME is missing $intMissing of $intRecommended patches:" 
$missingPatches

2008R2 Version

A list of recommended patches for Win 2008 R2 can be found here:  KB2545685

$recommendedPatches = "KB2531907", "KB2550886","KB2552040", "KB2494162", "KB2524478", "KB2520235"

2012 R2 Version

A list of recommended patches for Win 2012 R2 can be found here:  KB2920151

#All clusters
$recommendedPatches = "KB3130944", "KB3137691", "KB3139896", "KB3130939", "KB3123538", "KB3091057", "KB3013769", "KB3000850", "KB2919355"
#Hyper-V Clusters
$recommendedPatches = "KB3130944", "KB3137691", "KB3139896", "KB3130939", "KB3123538", "KB3091057", "KB3013769", "KB3000850", "KB2919355", "KB3090343", "KB3060678", "KB3063283", "KB3072380"

Hyper-V

If you are using the Hyper-V role, you can find additional fixes for 2012 R2 in KB2920151 below the general cluster hotfixes. If you use NVGRE, look at this list as well: KB2974503

Sample output (computer name redacted)

SNAGHTML2af111ec

Edit:

I have finally updated my script to remove those pesky red error messages seen in the sample above.

Print This Post Print This Post

Tags: , ,

I have had several issues in the past year involving kernel memory leaks, so I decided to make a separate blog post about general kernel memory leak analysis. In this post I mostly use the amazing Sysinternals tools for troubleshooting. You also need Poolmon.exe, a small utility currently part of the Windows Driver Kit. Sadly, this 35k self contained .exe is not available as a separate download, you have to download and install the entire 500+MiB WDK somewhere to extract it. You only have to do this once though, as there is no need to install the WDK on every system you analyze. You can just copy the executable from the machine where you installed the WDK.

Problem

Something is causing the kernel paged or non paged pools to rise uncontrollably. Screenshot from Process Explorer’s System Information dialog:

image

In this sample, the non paged pool has grown to an unhealthy 2,2GB, and continues to grow. Even though the pool limit is 128GIB and the server has a whopping  256GIB of RAM, the kernel memory pools are usually way below the 1GiB mark. You should of course baseline this to make sure you actually have an issue, but generally speaking, every time I find a Kernel memory value above 1GiB I go hunting for the cause.

Note: To show the pool limits, you have to enable symbols in Process Explorer. Scott Hanselman has blogged about that here: http://www.hanselman.com/blog/SetUpYourSystemToUseMicrosoftsPublicSymbolServer.aspx

Analysis

Kernel leaks are usually caused by a driver. Kernel leaks in the OS itself are very rare, unless you are running some sort of beta version of Windows. To investigate further, you have to fire up poolmon.exe.

image

Poolmon has a lot of shortcuts. From KB177415:

P – Sorts tag list by Paged, Non-Paged, or mixed. Note that P cycles through each one.
B – Sorts tags by max byte usage.
M – Sorts tags by max byte allocation.
T – Sort tags alphabetically by tag name.
E – Display Paged, Non-paged total across bottom. Cycles through.
A – Sorts tags by allocation size.
F – Sorts tags by “frees”.
S – Sorts tags by the differences of allocs and frees.
E – Display Paged, Non-paged total across bottom. Cycles through.
Q – Quit.

The important ones are “P”, to view either paged or non-paged pool tags, and “B”, to list the ones using the most of it at the top. The same view as above, after pressing “P” and “B”:

image

The “Cont” tag relates to “Contiguous physical memory allocations for device drivers”, and is usually the largest tag on a normal system.

And this screenshot is from the server with the non-paged leak:

image

As you can see, the LkaL tag is using more than 1GiB on its own, accounting for half of the pool. we have identified the pool tag, now we have to look for the driver that owns it. To do that, I use one of two methods:

1: Do an internet search for the pool tag.

http://blogs.technet.com/b/yongrhee/archive/2009/06/24/pool-tag-list.aspx contains a large list of tags.

2: Use Sysinternals strings together with findstr.

Most kernel mode drivers are located in “%systemroot%\System32\drivers”. First you have to start an elevated command prompt.  Make sure the Sysinternals suite is installed somewhere on the path, and enter the following commands:

  • cd “%systemroot%\System32\drivers”
  • strings * | findstr [tag]

Example:

image

Hopefully, you should now have the name of the offending driver. To gather more intel about it, use Sysinternals sigcheck:

image

In this case, the offending driver is part of Diskeeper.

Solution

You have to contact the manufacturer of the offending driver and check for an updated version. In the example, I got a version of Diskeeper without the offending driver, but a good place to start is the manufacturers website. And remember, if you already have the newest version, you can always try downgrading to a previous version. If you present your findings as detailed above, most vendors are very interested in identifying the cause of the leak and will work with you to resolve the issue.

Print This Post Print This Post

Problem

Something bad happens, and the error log is  bloated with error messages very fast, triggering low disk space alarms and eventually filling the drive to the brim. For a concrete example, see http://lokna.no/?p=1263.

Analysis

Even though I have spent considerable time searching for a solution to this, so far I have been at a loss. My best bet has been limiting the number of files. If the default limit of 6 isn’t changed, there seems to be no entry in the registry. On clustered instances such errors as mentioned above can cause a failover roundabout, and I have seen several hundred log files being generated (on SQL Server 2008R2). Changing the limit adds a registry value, and this seems to force SQL Server to enforce the limit.

This doesn’t solve the most pressing issue though. If some sort of error starts pushing several log messages every second, a small base volume with say 10GiB of free space is filled to the brim in hours if not minutes. Thus, when you are woken up by the low disk space alarm it is already to late to stop it, and you run the chance of NTFS corruption. In my experience, the chance of NTFS corruption is increasing on a logarithmic scale when you pass 5% free space on a volume. 0 bytes of free space with processes still trying to write more data to the volume more or less guarantees some kind of NTFS corruption unless you catch it quickly.

Read the rest of this entry »

Print This Post Print This Post

Tags: ,

Problem

The SQL server error log on one of my secondary replicas is riddled with error messages like this one:

“There have been 271104 misaligned log IOs which required falling back to synchronous IO.  The current IO is on file ….”

image

Analysis

The AOAG is running on Win 2012, and I have been lulled into thinking that drive misalignment was a thing of the past, expelled from reality with my last Win2003R2 server. Seems maybe I was wrong. Although, analysis of the drives show that they are properly aligned. This particular replica is running with local SSD drives, whereas the primary replica is connected to a SAN. I thought “maybe the SAN volumes are misaligned?”, but that to was in order. Turns out the culprit was the SSD drive itself. The SAN volume is using a 512 bytes per sector physical setup, but the new SSD drive is set up to use 4096 bytes.

SSD:

image

SAN:

image

(Mountpoints are in use, thus a path instead of a driveletter.)

This was the only dividing factor I could find.

Even though it seemed futile, I tried to find a way to change bytes Per physical Sector. I mean, the “physical” in the name makes it sound very final. But, by converting the SSD to dynamic, mirroring it and making sure it was aligned I finally got it anyway:

image

I suppose it is still 4096 under the covers, but the cluster has been running for a week, and no new error mesasges have been logged so far.

Solution

Get drives with identical Bytes Per Physical Sector for all your AOAG nodes. I have heard rumors about SANs being able to define this on a per volume basis, but I consider this a poor workaround as it will probably add latency. It sounds better to upgrade all replicas to the new fancy 4096 bytes per sector drives, as they are probably faster for SQL server anyway.

Print This Post Print This Post

Tags:

On Win2012/Powershell 3 there is a commandlet called “New-NetFirewallRule” that allows for scripted creation of firewall rules. This makes it a lot easier to get them rules right. I have previously used GPO to push this to my SQL servers, but sadly I have discovered that it does not always work. For some reason, servers don’t like to have their firewall rules pushed by GPO. This meant I had to check them every time anyway, so I just resorted to creating them manually. But now, thanks to the wonders of Powershell 3, maybe I won’t have to do that again Smilefjes

More information about the commandlet can be found here: http://technet.microsoft.com/en-us/library/jj554908.aspx

Sample code

This code creates rules to allow the SQL server browser (UDP 1434), the standard engine port for two instances (TCP 1433 and 1434) and the default port for AOAG endpoints (TCP 5022).

New-NetFirewallRule -DisplayName "MSSQL BROWSER UDP" -Direction Inbound -LocalPort 1434 -Protocol UDP -Action Allow
New-NetFirewallRule -DisplayName "MSSQL ENGINE TCP" -Direction Inbound -LocalPort 1433-1434 -Protocol TCP -Action Allow
New-NetFirewallRule -DisplayName "MSSQL AOAG EP TCP" -Direction Inbound -LocalPort 5022 -Protocol TCP -Action Allow
Print This Post Print This Post

Tags:

If you are using dynamic disks for some reason, and please avoid using them if you don’t have to, the partitions on it are likely to be misaligned for SQL server. I discovered this while trying to configure an AO Availability Group where one of the replicas were using local SSD drives configured with software RAID 1.

Problem

Since this was a new setup for me, I ran the old “wmic partition get BlockSize, StartingOffset, Name, Index”  just to make sure everything was in order. To my astonishment, it was not: image For some reason, the partition is using the old Win 2003 31,5 KB offset! To make it worse, I discovered this AFTER I had installed SQL server. Since dynamic disk and software raid basically sucks, information about this on the great interweb was sparse. But after some searching I found a cure, at least for volumes without RAID, at http://blogs.utexas.edu/alex/2013/04/04/windows-aligning-dynamic-disks/. (Link dead as of 2016.10)

Solution

Based on the above mentioned blog post, with my comments and changes for RAID. Be aware, this process may be destructive. This guide assumes that you, as I did, already have an active mirror with the wrong alignment. If you have fresh drives, just ignore the parts about breaking the mirror and moving data.

  • Make sure you have a valid backup
  • Be prepared to do a clean install if necessary
  • Break the mirror
  • Give both drives new drive-letters and restart the server to make sure no active application/service is using the drive
  • Run diskpart, and select one of the drives that was part of the mirror. If you don’t know how to do this, STOP and ask someone to help you or read up on diskpart BEFORE you continue.
  • Execute the following diskpart commands against the selected drive. This guide expects that you want one volume to fill the entire drive. If you don’t want this, think long and hard about why and consider changing your mind.
  • clean
  • online disk
  • attributes disk clear readonly
  • convert gpt
  • select part 1
  • delete part override
  • create partition msr size 128
  • convert dynamic
  • create volume simple align=1024
  • Screenshots image image
  • Then, format the new partition with 64K allocation unit size
  • Move the data from the other partition that was part of the mirror, and hope this will work
  • Run the diskpart commands against the other disk. Be aware, this will delete the data, so make sure the move command was successful. And this time, skip the last create volume command.
  • Select second disk
  • clean
  • online disk
  • attributes disk clear readonly
  • convert gpt
  • select part 1
  • delete part override
  • create partition msr size 128
  • convert dynamic

You should now have two dynamic disks, one with the data and one unallocated. Now, to add the mirror. I discovered that I was unable to add the second drive back as a mirror, the add mirror option was grayed out. I solved this by first shrinking the original by 50MB, and then creating the mirror. I didn’t test this extensively, but I would guess that 5 or 10 megabytes of free space would have been enough.

Print This Post Print This Post

Problem

In windows failover cluster manager, clicking on an node in the tree will raise the following error:

SNAGHTML106567d

This is one of the biggest error messages I have ever seen in a Microsoft product, that is, regarding the size of the message box Smilefjes. The error states: “A weak event was created and I lives on the wrong object, there is a very high chance this will fail, pleas review your code to prevent the issue” and goes on with a .net call stack.

Analysis

This feels like a .Net framework issue to me, and a quick search rustled up the following post on Technet: http://blogs.technet.com/b/askcore/archive/2013/01/14/error-in-failover-cluster-manager-after-install-of-kb2750149.aspx, stating that this is a bug caused by KB 2750149.

Solution

Install KB2803748, available here: http://support.microsoft.com/kb/2803748

To check if both or any of the patches are installed, run the following powershell commands:

Get-HotFix -id KB2750149
Get-HotFix -id KB2803748
Print This Post Print This Post

Tags: ,

Problem

The event log is littered with regular copies of the following event:

SNAGHTML23ad5ff5

“GetSQL2008SPNState.vbs : The Query ‘SELECT ProtocolName FROM ServerNetworkProtocol where Enabled = true and InstanceName = ‘INSTANCE” did not return any valid instances.  Please check to see if this is a valid WMI Query.. Invalid class”

I do not know what impact this has on SCOM  (if any), but the warning is escalated to the Administrative Events view and is thus making it harder to spot other errors as it occurs very frequently. On one of my servers this error occurred every 10 minutes or so. Furthermore, I suspect the cause of the problem to be changes in RunAS accounts in SCOM, but I am not certain.

Solution

I found this on the Technet forums http://social.technet.microsoft.com/Forums/en-US/operationsmanagermgmtpacks/thread/01eff618-1087-4b6a-9d3f-9f1402ddf3f4:

Reregister SQL management object (MOF) data. I have yet to figure out exactly what MOF data is, but SCOM is using it to monitor the SQL server instance(s) on the server. Perform the following operations:

  • Start an administrative command prompt
  • Browse to “C:\Program Files (x86)\Microsoft SQL Server\100\Shared”
  • Run mofcomp sqlmgmproviderxpsp2up.mof
  • Restart the SCOM agent service (System Center Management)

image

Print This Post Print This Post

Tags: ,

Problem

I’ve been managing a lot of Dell servers lately, where the baseline showed very poor performance for local drives connected to PERC (PowerEdge Expandable RAID Controller) controllers. Poor enough to trigger negative marks on a MSSQL RAP. Typically, read and write latency would never get below 11ms, even with next to no load on a freshly reinstalled server. Even the cheapest laptops with 4500 RPM SATA drives would outperform such stats, and these servers had 10 or 15K RPM SAS drives on a 6Gbps bus. We have a combination of H200, H700 and H710 PERC controllers on these servers, and the issues didn’t seem to follow a pattern, with one exception: all H200 equipped servers experienced poor performance.

Analysis

A support ticket with Dell gave the usual response: update your firmware and drivers. We did, and one of the H700 equipped servers got worse. Further inquiries with Dell gave a recommendation to replace the H200 controllers with the more powerful H700. After having a look at the specs for the H200 I fully agree with their assessment, although I do wonder why on earth they sold them in the first place. The H200 doesn’t appear to be worth the price of the cardboard box it is delivered in. It has absolutely no cache whatsoever, and it also disables the built in cache on the drives. Snap from the H200 users guide:

image

This sounds like something one would use in a print server or small departmental file server in a very limited budget, not in a four-way database cluster node. And it explains why the connected drives are painfully slow, you are reduced to platter speed.

Note: The H200 is replaced by the H310 on newer servers. I have yet to test it, but from what the specs tell me it is just as bad as the H200.

Update: Test data from a H310 equipped test server doing nothing but displaying the perfmon curve:

SNAGHTMLf6bef00

Read the rest of this entry »

Print This Post Print This Post

Tags: , ,

Problem

Failover Cluster validation fails with the following error: “Failed to prepare storage for testing on node X status 87”:

SNAGHTML314451ca

Analysis

This error was caused by how the operating system was installed on the server. For some reason there was a separate hidden boot partition with no name:

SNAGHTML3145d2ac

I suspect that this is the remains of a Dell OEM installation CD, as there is also an OEM partition on the drive. I didn’t install the OS on this server though, so I was unable to figure out exactly what had happened.

Solution

The solution is simple enough, just assign a drive letter to the boot partition:

image

You could also reinstall the node or create a valid boot loader on the C: drive, but just assigning a drive letter to the partition is way quicker.

Print This Post Print This Post

Tags:

« Older entries § Newer entries »

%d bloggers like this: