Poor disk performance on Dell servers

Problem

I’ve been managing a lot of Dell servers lately, and the baseline showed very poor performance for local drives connected to PERC (PowerEdge Expandable RAID Controller) controllers. Poor enough to trigger negative marks on an MSSQL RAP. Typically, read and write latency would never get below 11 ms, even with next to no load on a freshly reinstalled server. Even the cheapest laptops with 4500 RPM SATA drives would outperform such figures, and these servers had 10K or 15K RPM SAS drives on a 6 Gbps bus. We have a combination of H200, H700 and H710 PERC controllers on these servers, and the issues didn’t seem to follow a pattern, with one exception: all H200-equipped servers experienced poor performance.

Analysis

A support ticket with Dell gave the usual response: update your firmware and drivers. We did, and one of the H700-equipped servers got worse. Further inquiries with Dell gave a recommendation to replace the H200 controllers with the more powerful H700. After having a look at the specs for the H200 I fully agree with their assessment, although I do wonder why on earth they sold it in the first place. The H200 doesn’t appear to be worth the price of the cardboard box it is delivered in. It has absolutely no cache whatsoever, and it also disables the built-in cache on the drives. Snippet from the H200 user’s guide:

[screenshot]

This sounds like something one would use in a print server or a small departmental file server on a very limited budget, not in a four-way database cluster node. And it explains why the connected drives are painfully slow: you are reduced to platter speed.

Note: The H200 has been replaced by the H310 on newer servers. I have yet to test it, but from what the specs tell me it is just as bad as the H200.

Update: Test data from an H310-equipped test server doing nothing but displaying the perfmon curve:

[screenshot]
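For reference, the same latency counters can be sampled from PowerShell instead of watching the perfmon GUI. A minimal sketch, assuming the standard LogicalDisk counter set is available:

    # Sample average disk latency (seconds per read/write) for all logical disks,
    # one sample every 5 seconds for a minute. Sustained values above ~0.011 (11 ms)
    # on an idle server match the baseline problem described above.
    Get-Counter -Counter '\LogicalDisk(*)\Avg. Disk sec/Read',
                         '\LogicalDisk(*)\Avg. Disk sec/Write' `
                -SampleInterval 5 -MaxSamples 12 |
        ForEach-Object { $_.CounterSamples | Select-Object Path, CookedValue }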


Cluster validation fails on missing drive letter

Problem

Failover Cluster validation fails with the following error: “Failed to prepare storage for testing on node X status 87”:

[screenshot]

Analysis

This error was caused by how the operating system was installed on the server. For some reason there was a separate hidden boot partition with no name and no drive letter assigned:

[screenshot]

I suspect this is a leftover from a Dell OEM installation CD, as there is also an OEM partition on the drive. I didn’t install the OS on this server myself, though, so I was unable to figure out exactly what had happened.

Solution

The solution is simple enough: just assign a drive letter to the boot partition:

[screenshot]

You could also reinstall the node or create a valid boot loader on the C: drive, but just assigning a drive letter to the partition is way quicker.
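If you prefer the command line, here is a minimal sketch of the same fix from an elevated PowerShell prompt, assuming Windows Server 2012 or later where the Storage module is available (on older systems, diskpart’s assign command does the same). The disk and partition numbers below are examples and must be matched to your own layout:

    # Locate partitions without a drive letter (the hidden boot partition will be among them)
    Get-Partition | Where-Object { -not $_.DriveLetter } |
        Select-Object DiskNumber, PartitionNumber, Size, Type

    # Assign a free drive letter (here Y:) to the partition identified above
    Set-Partition -DiskNumber 0 -PartitionNumber 1 -NewDriveLetter Y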

How SQL Server 2012 Service Pack 1 destroyed my life

Or to be more exact: seven days of my life, with every waking hour spent troubleshooting and finally reinstalling servers.

Problem

Some time after you install MSSQL 2012 SP1, strange things start to happen on the server. Maybe you are unable to start Management Studio, or if you’re really unfortunate you can’t log in to the server at all. Maybe random services start to fail, Windows Update refuses to apply updates, and if you reboot the server it might refuse to start SQL Server at all, or just greet you with a friendly BSOD. But a couple of days earlier everything was hunky-dory, and you are absolutely 100% sure nothing has changed since then. Yet the server fails, and a couple of days later another one goes down with the same symptoms. It’s as if they have contracted Ebola or the swine flu or some other strange virus. It seems to attack the entire server all at once, and the only common denominator is that they are all running MSSQL 2012 SP1.


DCOM error 10016 on agent job execution

Problem

Every time a maintenance plan-generated SQL Agent job tries to execute, the following error occurs in the system event log:

Log Name:      System
Source:        Microsoft-Windows-DistributedCOM
Date:          02.02.2013 16:00:02
Event ID:      10016
Task Category: None
Level:         Error
Keywords:      Classic
User:          SQL AGENT Service account
Computer:      SQL Server
Description:
The application-specific permission settings do not grant Local Launch permission for the COM Server application with CLSID
{46063B1E-BE4A-4014-8755-5B377CD462FC}
and APPID
{FAAFC69C-F4ED-4CCA-8849-7B882279EDBE}
to the user “SQL AGENT Service account” SID (S-1-5-21-) from address LocalHost (Using LRPC). This security permission can be modified using the Component Services administrative tool.

[screenshot]

Analysis

Searching the registry for the CLSID revealed that the DCOM application in question is MsDtsServer100:

[screenshot]
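The same lookup can be scripted. A small sketch, using the CLSID and APPID quoted in the event text; the registry paths are the standard COM registration locations:

    # Resolve the CLSID from the 10016 event to a friendly name and its AppID
    Get-ItemProperty 'HKLM:\SOFTWARE\Classes\CLSID\{46063B1E-BE4A-4014-8755-5B377CD462FC}' |
        Select-Object '(default)', AppID

    # Look up the corresponding DCOM application (MsDtsServer100 in this case)
    Get-ItemProperty 'HKLM:\SOFTWARE\Classes\AppID\{FAAFC69C-F4ED-4CCA-8849-7B882279EDBE}' |
        Select-Object '(default)'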

A quick peek in the settings reveals that only local administrators and SYSTEM are granted the Local Launch permission by default:

[screenshot]

During further analysis I discovered that this error only occurred on one of the cluster nodes (the server in question is part of a two-node failover cluster). I then found that the SQL Agent service account had been added to the local administrators group on the other cluster node, but not on this one. Whether the SQL Agent service account should be a member of local admins is debatable, but it sure gives you a lot of grief if it isn’t.

Solution

1: Add the SQL Agent service account to the local administrators group on the server.

2: Or give the SQL Agent service account explicit local launch permissions on MsDtsServer100 using dcomcnfg:

[screenshot]
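For option 1, a quick way to add the account from an elevated prompt on the affected node; the account name below is a placeholder for your SQL Agent service account:

    # Add the SQL Agent service account to the local Administrators group.
    # net.exe works on all Windows versions; newer systems can use Add-LocalGroupMember instead.
    net localgroup Administrators "DOMAIN\svc-sqlagent" /add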

Permission denied restoring imported backup

Problem

Trying to restore a backup of a database on a different server than the one where the backup originated generates the following error message: “The operating system returned the error ‘5(failed to retrieve text for this error. Reason: 15105)’ while attempting ‘RestoreContainer::ValidateTargetForCreation’”:

[screenshot]

 

Analysis

Error 5 is, as always, access denied. I don’t know why text retrieval fails, but that is a problem for another day. Restores of backups originating from the destination server work as expected on that same server, and the SQLServer service account has the required permissions granted on both the .bak file and the target folder. Restoring the same backup file on other servers during staging also worked as expected. I thus concluded that there had to be some difference between the staging and production servers causing the issue.

  • Staging servers are virtual, stand-alone servers; production is a physical cluster.
  • Staging servers use VM drives as data volumes; production has SAN disks attached to mount points.
  • Staging has a couple of cores and > 20 GB of RAM; production has 16 cores and > 200 GB.

Then it hit me: mount points are notorious for causing strange permission issues, due to the fact that the permissions for the mount point and for the actual mounted volume are stored separately, in different ACLs:

[screenshot]

Further investigation revealed that the service account had full control permissions in both ACLs, but the volume permissions on the data and transaction log volumes were granted via group membership, while the mount point permissions on those volumes were granted explicitly. This doesn’t seem to be a problem for other operations, but when you try to restore a database to a different destination than where it came from, the service account needs explicit permissions on the destination folder(s). That is, when the original file name and Restore As are not the same.

[screenshot]

Solution

Grant the SQLServer service account explicit permissions on both the volume and the mount point. If you have different mount points for the transaction log and data files, you have to do this on both folders. Furthermore, I would guess similar errors could occur if the service account lacks access to the source .bak file.
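A sketch of what that could look like from PowerShell; the path and the service account name are placeholders and must be adjusted to your environment. Granting the permission through the mounted path updates the ACL of the mounted volume’s root; the separate ACL on the mount point folder itself is easiest to verify in the GUI as shown above:

    # Inspect the ACL as seen through the mounted path
    Get-Acl 'E:\MSSQL\Data' | Format-List Path, AccessToString

    # Grant the service account explicit full control, inherited by subfolders and files
    icacls 'E:\MSSQL\Data' /grant 'DOMAIN\svc-sqlserver:(OI)(CI)F'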

SQL Server Agent jobs fail, permission denied

Problem

SQL Server Agent jobs fail with the following message: “Non-SysAdmins have been denied permission to run DTS Execution job steps without a proxy account”, even though the job owner, the current user and the agent service account are all members of the sysadmin role.

[screenshot]

Workaround

  • Change the job owner to sa.
  • Or if it is a part of a maintenance plan, change the plan owner. See https://lokna.no/?p=325
  • Define a proxy account.

Solution

Make sure the job owner is a direct member of the sysadmin role. If sysadmin membership is granted via an AD group, you get the error mentioned above when you try to run jobs that require sysadmin membership.
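A minimal sketch of checking direct sysadmin membership and changing a job owner to sa, assuming Invoke-Sqlcmd (from the SQL Server PowerShell module) is available; the instance and job names are placeholders:

    # List logins that are direct members of the sysadmin fixed server role
    $checkSysadmin = "
        SELECT m.name
        FROM sys.server_role_members rm
        JOIN sys.server_principals r ON rm.role_principal_id = r.principal_id
        JOIN sys.server_principals m ON rm.member_principal_id = m.principal_id
        WHERE r.name = N'sysadmin';"
    Invoke-Sqlcmd -ServerInstance 'MYSERVER\MYINSTANCE' -Query $checkSysadmin

    # Change the owner of the failing job to sa
    Invoke-Sqlcmd -ServerInstance 'MYSERVER\MYINSTANCE' -Database msdb `
        -Query "EXEC dbo.sp_update_job @job_name = N'MyMaintenanceJob', @owner_login_name = N'sa';"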

Orchestrator bug kills database server

Update

20131031: Code to limit the size of the error log files added to the solution. So far only tested on SQL 2012 SP1 CU2.

Problem

The database server or operating system starts complaining about low disk space. Analysis reveals that the volume containing the default MSSQL log folder is very low on space, maybe even 0 bytes free. Further analysis shows that the ERRORLOG files are taking up all the remaining space on the volume.

Analysis

The ERRORLOG files are filled to the brim with messages like this one:

Date		07.01.2013 21:09:09
Log		SQL Server (Current - 07.01.2013 14:57:00)

Source		spid29s

Message
The activated proc '[Microsoft.SystemCenter.Orchestrator.Maintenance].[MaintenanceWorker]' running on queue 'OrchestratorDB.Microsoft.SystemCenter.Orchestrator.Maintenance.MaintenanceServiceQueue' output the following:  'Error: 9002, The transaction log for database 'OrchestratorDB' is full. To find out why space in the log cannot be reused, see the log_reuse_wait_desc column in sys.databases'

[screenshot]

Orchestrator has a maintenance process that does not handle a situation where the database transaction log is unable to grow, either due to growth restrictions or simply because the disk is full. When this happens, it bloats the ERRORLOG with the messages mentioned above. This happens very quickly, as it keeps retrying several times each second, so even a 10 GB volume is filled in hours. If this situation is left unchecked for a couple of days, the risk of NTFS volume corruption is high.

Solution

  • Disable access to the Orchestrator database or stop the Orchestrator servers to prevent further bloating.
  • Stop the database instance
  • Delete the bloated error log to free up space
  • Restart the instance
  • Change the number of error logs SQL Server is allowed to create (the default is 6, the sample below limits it to 10) and limit the size of the error log files to 50 MB each. This will give you a total of 11 error log files, each with a max size of about 50 000 KB, for a total of 550 MB.
    USE [master]
    GO
    EXEC xp_instance_regwrite N'HKEY_LOCAL_MACHINE',
     N'Software\Microsoft\MSSQLServer\MSSQLServer',
     N'NumErrorLogs', REG_DWORD, 10
    GO
    EXEC xp_instance_regwrite N'HKEY_LOCAL_MACHINE',
     N'Software\Microsoft\MSSQLServer\MSSQLServer', 
     N'ErrorLogSizeInKb', REG_DWORD, 50000
    GO
  • Solve the problem regarding the Orchestrator DB transaction log. Usually such problems are caused by either a failed backup job (thus no log truncation) or the simple fact that you need to allocate more space for the data. If a failed backup job is the problem, consider shrinking the transaction log back down to its normal size; a query sketch follows after this list. See https://lokna.no/?p=349 for more information about shrinking troublesome logs (Norwegian version only).
  • Enable access to the Orchestrator database.
  • Monitor the error logs and see if the values you chose for log size and number of logs fit your workload. If the logs contain at least a day of messages, but not more than a month, you are in the butter zone.
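A sketch of the transaction log checks mentioned above, again assuming Invoke-Sqlcmd is available; the instance name, the logical log file name and the target size are placeholders, and the shrink should only be done once the cause of the log growth has been addressed:

    # Why can't the OrchestratorDB transaction log be truncated?
    Invoke-Sqlcmd -ServerInstance 'MYSERVER\MYINSTANCE' -Query `
        "SELECT name, log_reuse_wait_desc FROM sys.databases WHERE name = N'OrchestratorDB';"

    # Find the logical name of the log file, then shrink it back to a sane size (here 1 GB)
    Invoke-Sqlcmd -ServerInstance 'MYSERVER\MYINSTANCE' -Database OrchestratorDB -Query `
        "SELECT name, type_desc, size FROM sys.database_files;"
    Invoke-Sqlcmd -ServerInstance 'MYSERVER\MYINSTANCE' -Database OrchestratorDB -Query `
        "DBCC SHRINKFILE (N'OrchestratorDB_log', 1024);"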

SQLIO results parser

Ever since I first learned about it, I have been a happy user of a PowerShell script by Mr. Jonathan Kehayias (blog) for parsing the results of a SQLIO test run into Excel. It takes the otherwise difficult-to-read results file and reduces it to an Excel file, including a couple of “management-friendly” charts.

The original script can be found here.


Cluster disk resource XX contains an invalid mount point

Problem

During cluster startup or failover, one of the following events is logged in the system event log:

[screenshots]

Event-ID 1208 from Physical Disk Resource: Cluster disk resource ‘[Resource name]’ contains an invalid mount point. Both the source and target disks associated with the mount point must be clustered disks, and must be members of the same group.
Mount point ‘[Mount path]’ for volume ‘\\?\Volume{[GUID]}\’ references an invalid target disk. Please ensure that the target disk is also a clustered disk and in the same group as the source disk (hosting the mount point).

Cause and investigation

The cause could of course be that the base drive is not a clustered disk, as the event message states. If that is the case, read a book about WFC (Windows failover clustering) and try again. If not, I have found the following causes:

  • If the mount point path is C:\$Recycle.bin\[guid], it is caused by replacing a SAN drive with another one at the same drive letter or mount point but with a different LUN. This confuses the recycle bin.
  • If the clustered drive for either the mount point or the volume being mounted is in maintenance mode and/or currently running autochk/chkdsk. This can happen so quickly that you are unable to detect it, and when you come back to check, the services are already up and running. Unless you disable it, WFC will run autochk/chkdsk when a drive with the dirty bit set is brought online. This is probably logged somewhere, but I have yet to determine in which log. Look in the application event log for Chkdsk events or something like this:

Event 17207 from MSSQL[instance]:

Event 1066 from FailoverClustering

 

Resolution

  • If it is the recycle bin folder, make sure you have a backup of your data and delete the mount point folder under C:\$Recycle.bin. You might have to take ownership of the folder to be able to complete this task. If files are in use, take all cluster resources offline and try again.
  • If you suspect a corrupt mount point or drive, run chkdsk on ALL clustered drives. See https://lokna.no/?p=1194 for details.

Check C:\Windows\Cluster\Reports (default location) for files titled ChkDSK_[volume].txt, indicating that the cluster service has triggered an automatic chkdsk on a drive.
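A one-liner to look for such reports from PowerShell, assuming the default report location:

    # List cluster-initiated chkdsk reports, newest first
    Get-ChildItem 'C:\Windows\Cluster\Reports' -Filter 'ChkDsk_*.txt' |
        Sort-Object LastWriteTime -Descending |
        Select-Object Name, LastWriteTime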

Run disk maintenance on a failover cluster mountpoint

Problem

“Validate this cluster” or another tool tells you that the dirty bit is set for a cluster shared volume, and taking the disk offline and online again (to trigger autochk) does not help. The error message from “Validate this cluster” looks like this:

[screenshot]
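Before digging further, the dirty bit can be queried from an elevated PowerShell prompt with fsutil; a drive letter works, and a mounted folder path should also be accepted (the paths below are placeholders):

    # Query the dirty bit on the volume; the output states whether the volume is dirty
    fsutil dirty query D:
    fsutil dirty query E:\MountPoints\Data01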

 
