Creating firewall rules for SQL server using Powershell

On Win2012/Powershell 3 there is a commandlet called “New-NetFirewallRule” that allows for scripted creation of firewall rules. This makes it a lot easier to get them rules right. I have previously used GPO to push this to my SQL servers, but sadly I have discovered that it does not always work. For some reason, servers don’t like to have their firewall rules pushed by GPO. This meant I had to check them every time anyway, so I just resorted to creating them manually. But now, thanks to the wonders of Powershell 3, maybe I won’t have to do that again Smilefjes

More information about the commandlet can be found here: http://technet.microsoft.com/en-us/library/jj554908.aspx

Sample code

This code creates rules to allow the SQL server browser (UDP 1434), the standard engine port for two instances (TCP 1433 and 1434) and the default port for AOAG endpoints (TCP 5022).

New-NetFirewallRule -DisplayName "MSSQL BROWSER UDP" -Direction Inbound -LocalPort 1434 -Protocol UDP -Action Allow
New-NetFirewallRule -DisplayName "MSSQL ENGINE TCP" -Direction Inbound -LocalPort 1433-1434 -Protocol TCP -Action Allow
New-NetFirewallRule -DisplayName "MSSQL AOAG EP TCP" -Direction Inbound -LocalPort 5022 -Protocol TCP -Action Allow

Aligning dynamic disks

If you are using dynamic disks for some reason, and please avoid using them if you don’t have to, the partitions on it are likely to be misaligned for SQL server. I discovered this while trying to configure an AO Availability Group where one of the replicas were using local SSD drives configured with software RAID 1.

Problem

Since this was a new setup for me, I ran the old “wmic partition get BlockSize, StartingOffset, Name, Index”  just to make sure everything was in order. To my astonishment, it was not: image For some reason, the partition is using the old Win 2003 31,5 KB offset! To make it worse, I discovered this AFTER I had installed SQL server. Since dynamic disk and software raid basically sucks, information about this on the great interweb was sparse. But after some searching I found a cure, at least for volumes without RAID, at http://blogs.utexas.edu/alex/2013/04/04/windows-aligning-dynamic-disks/. (Link dead as of 2016.10)

Solution

Based on the above mentioned blog post, with my comments and changes for RAID. Be aware, this process may be destructive. This guide assumes that you, as I did, already have an active mirror with the wrong alignment. If you have fresh drives, just ignore the parts about breaking the mirror and moving data.

  • Make sure you have a valid backup
  • Be prepared to do a clean install if necessary
  • Break the mirror
  • Give both drives new drive-letters and restart the server to make sure no active application/service is using the drive
  • Run diskpart, and select one of the drives that was part of the mirror. If you don’t know how to do this, STOP and ask someone to help you or read up on diskpart BEFORE you continue.
  • Execute the following diskpart commands against the selected drive. This guide expects that you want one volume to fill the entire drive. If you don’t want this, think long and hard about why and consider changing your mind.
  • clean
  • online disk
  • attributes disk clear readonly
  • convert gpt
  • select part 1
  • delete part override
  • create partition msr size 128
  • convert dynamic
  • create volume simple align=1024
  • Screenshots image image
  • Then, format the new partition with 64K allocation unit size
  • Move the data from the other partition that was part of the mirror, and hope this will work
  • Run the diskpart commands against the other disk. Be aware, this will delete the data, so make sure the move command was successful. And this time, skip the last create volume command.
  • Select second disk
  • clean
  • online disk
  • attributes disk clear readonly
  • convert gpt
  • select part 1
  • delete part override
  • create partition msr size 128
  • convert dynamic

You should now have two dynamic disks, one with the data and one unallocated. Now, to add the mirror. I discovered that I was unable to add the second drive back as a mirror, the add mirror option was grayed out. I solved this by first shrinking the original by 50MB, and then creating the mirror. I didn’t test this extensively, but I would guess that 5 or 10 megabytes of free space would have been enough.

Weak event created

Problem

In windows failover cluster manager, clicking on an node in the tree will raise the following error:

SNAGHTML106567d

This is one of the biggest error messages I have ever seen in a Microsoft product, that is, regarding the size of the message box Smilefjes. The error states: “A weak event was created and I lives on the wrong object, there is a very high chance this will fail, pleas review your code to prevent the issue” and goes on with a .net call stack.

Analysis

This feels like a .Net framework issue to me, and a quick search rustled up the following post on Technet: http://blogs.technet.com/b/askcore/archive/2013/01/14/error-in-failover-cluster-manager-after-install-of-kb2750149.aspx, stating that this is a bug caused by KB 2750149.

Solution

Install KB2803748, available here: http://support.microsoft.com/kb/2803748

To check if both or any of the patches are installed, run the following powershell commands:

Get-HotFix -id KB2750149
Get-HotFix -id KB2803748

Health Service Script Event 4001

Problem

The event log is littered with regular copies of the following event:

SNAGHTML23ad5ff5

“GetSQL2008SPNState.vbs : The Query ‘SELECT ProtocolName FROM ServerNetworkProtocol where Enabled = true and InstanceName = ‘INSTANCE” did not return any valid instances.  Please check to see if this is a valid WMI Query.. Invalid class”

I do not know what impact this has on SCOM  (if any), but the warning is escalated to the Administrative Events view and is thus making it harder to spot other errors as it occurs very frequently. On one of my servers this error occurred every 10 minutes or so. Furthermore, I suspect the cause of the problem to be changes in RunAS accounts in SCOM, but I am not certain.

Solution

I found this on the Technet forums http://social.technet.microsoft.com/Forums/en-US/operationsmanagermgmtpacks/thread/01eff618-1087-4b6a-9d3f-9f1402ddf3f4:

Reregister SQL management object (MOF) data. I have yet to figure out exactly what MOF data is, but SCOM is using it to monitor the SQL server instance(s) on the server. Perform the following operations:

  • Start an administrative command prompt
  • Browse to “C:\Program Files (x86)\Microsoft SQL Server\100\Shared”
  • Run mofcomp sqlmgmproviderxpsp2up.mof
  • Restart the SCOM agent service (System Center Management)

image

Poor disk performance on Dell servers

Problem

I’ve been managing a lot of Dell servers lately, where the baseline showed very poor performance for local drives connected to PERC (PowerEdge Expandable RAID Controller) controllers. Poor enough to trigger negative marks on a MSSQL RAP. Typically, read and write latency would never get below 11ms, even with next to no load on a freshly reinstalled server. Even the cheapest laptops with 4500 RPM SATA drives would outperform such stats, and these servers had 10 or 15K RPM SAS drives on a 6Gbps bus. We have a combination of H200, H700 and H710 PERC controllers on these servers, and the issues didn’t seem to follow a pattern, with one exception: all H200 equipped servers experienced poor performance.

Analysis

A support ticket with Dell gave the usual response: update your firmware and drivers. We did, and one of the H700 equipped servers got worse. Further inquiries with Dell gave a recommendation to replace the H200 controllers with the more powerful H700. After having a look at the specs for the H200 I fully agree with their assessment, although I do wonder why on earth they sold them in the first place. The H200 doesn’t appear to be worth the price of the cardboard box it is delivered in. It has absolutely no cache whatsoever, and it also disables the built in cache on the drives. Snap from the H200 users guide:

image

This sounds like something one would use in a print server or small departmental file server in a very limited budget, not in a four-way database cluster node. And it explains why the connected drives are painfully slow, you are reduced to platter speed.

Note: The H200 is replaced by the H310 on newer servers. I have yet to test it, but from what the specs tell me it is just as bad as the H200.

Update: Test data from a H310 equipped test server doing nothing but displaying the perfmon curve:

SNAGHTMLf6bef00

Continue reading “Poor disk performance on Dell servers”

Cluster validation fails on missing drive letter

Problem

Failover Cluster validation fails with the following error: “Failed to prepare storage for testing on node X status 87”:

SNAGHTML314451ca

Analysis

This error was caused by how the operating system was installed on the server. For some reason there was a separate hidden boot partition with no name:

SNAGHTML3145d2ac

I suspect that this is the remains of a Dell OEM installation CD, as there is also an OEM partition on the drive. I didn’t install the OS on this server though, so I was unable to figure out exactly what had happened.

Solution

The solution is simple enough, just assign a drive letter to the boot partition:

image

You could also reinstall the node or create a valid boot loader on the C: drive, but just assigning a drive letter to the partition is way quicker.

How SQL Server 2012 Service Pack 1 destroyed my life

Or to be more exact: seven days of my life, with every waking hour spent troubleshooting and  finally reinstalling servers.

Problem

Some time after you install MSSQL 2012 SP 1 strange things start to happen on the server. Maybe you are unable to start management studio, or if you’re really unfortunate you can’t log in to the server at all. Maybe random services start to fail, Windows Update refuses to apply updates, and if you boot the server it might refuse to start SQL server at all or just greet you with a friendly BSOD. But a couple of days before everything was hunky-dory, and you are absolutely 100% sure nothing has changed since then. Yet the server fails, and a couple of days later another one goes down with the same symptoms. It’s as if they have contracted ebola or the swine flu or some other strange virus. It seems to be attacking the entire server all at once, and the only common denominator is they are all running MS SQL 2012 SP1.

Continue reading “How SQL Server 2012 Service Pack 1 destroyed my life”

DCOM error 10016 on agent job execution

Problem

Every time a maintenance plan generated SQL agent job tries to execute, the following error occurs in the system event log:

Log Name:      System
Source:        Microsoft-Windows-DistributedCOM
Date:          02.02.2013 16:00:02
Event ID:      10016
Task Category: None
Level:         Error
Keywords:      Classic
User:          SQL AGENT Service account
Computer:      SQL Server
Description:
The application-specific permission settings do not grant Local Launch permission for the COM Server application with CLSID
{46063B1E-BE4A-4014-8755-5B377CD462FC}
and APPID
{FAAFC69C-F4ED-4CCA-8849-7B882279EDBE}
to the user “SQL AGENT Service account ”SID (S-1-5-21-) from address LocalHost (Using LRPC). This security permission can be modified using the Component Services administrative tool.

image

Analysis

Searching the registry for the CLSID revealed that the DCOM application in question is MsDtsServer100:

image

A quick peek in the settings reveals that only local administrators and SYSTEM are granted the Local Launch permission by default:

image

During further analysis I discovered that this error only occurred on one of the cluster nodes (the server in question is part of a two node failover cluster). I then found that the SQLAgent service account was added to local admins on the other cluster node, but not on this one. Whether or not the SQL Agent service account should be a member of local admins or not is debatable, but it sure gives you a lot of gripe if it isn’t.

Solution

1: Add the SQL Agent service account to the local administrators group on the server.

2: Or give the SQL Agent service account explicit local launch permissions on MsDtsServer100 using dcomcnfg:

SNAGHTML7d74b4

Permission denied restoring imported backup

Problem

Trying to restore a backup of a database on a different server than the one were the backup originates generates the following error message: “The operating system returned the error ‘5(failed to retrieve text for this error. Reason: 15105)’ while attempting ‘RestoreContainer::ValidateTargetForCreation’”:

SNAGHTML375e27

 

Analysis

Error 5 is as always access denied. I don’t know why text retrieval fails, but that is a problem for another day Smilefjes.Restores of backups originating from the destination server on the destination server works as expected, and the SQLServer service account have the required permissions granted on both the .bak file and the target folder. Restore of the same backup file on other servers during staging worked as expected. I thus concluded that there had to be some differences between the staging and production servers causing the issue.

  • staging servers are virtual and stand-alone servers, production is a physical cluster
  • staging servers use vm drives as data volumes, production has SAN disks attached to mount points
  • staging has a couple of cores and > 20gigs of ram, production has 16 cores and > 200 gigs

Then it hit me, mount points are renowned for causing strange permission issues due to the fact that the permissions for the mount point and the actual volume mounted are stored separately in different ACLs:

SNAGHTML47effa

Further investigation revealed that the service account had full control permissions on both ACLs, but the volume permissions on the data and transaction log volumes were granted via group membership, while the mount point permissions on those volumes were granted explicitly. This doesn’t seem to be a problem for other operations, but when you try to restore a database to a different destination than were it came from, the server account needs explicit permissions on the destination folder(s). That is, when original file name and Restore As are not the same.

SNAGHTML4f865f

solution

Grant the SQLServer service account explicit permissions on both the volume and the mount point. If you have different mount points for transaction log and data files, you have to do this on both folders. Furthermore, I would guess similar errors could occur if the service account lacks access to the source bak file.

SQL Server Agent jobs fails, permission denied

Problem

SQL Server agent jobs fail with the following message: “Non-SysAdmins have been denied permission to run DTS Execution job steps without a proxy account” even though both the job owner, current user and agent user are members of the sysadmin role.

SNAGHTML16831ef

Workaround

  • Change the job owner to sa.
  • Or if it is a part of a maintenance plan, change the plan owner. See https://lokna.no/?p=325
  • Define a proxy account.

solution

Make sure the job owner is a direct member of the sysadmin role. If the sysadmin role membership is granted to an AD group, you get the error mentioned above when you try to run jobs who require sysadmin membership.