Health Service Script Event 4001

Problem

The event log is littered with regular copies of the following event:

SNAGHTML23ad5ff5

“GetSQL2008SPNState.vbs : The Query ‘SELECT ProtocolName FROM ServerNetworkProtocol where Enabled = true and InstanceName = ‘INSTANCE” did not return any valid instances.  Please check to see if this is a valid WMI Query.. Invalid class”

I do not know what impact this has on SCOM  (if any), but the warning is escalated to the Administrative Events view and is thus making it harder to spot other errors as it occurs very frequently. On one of my servers this error occurred every 10 minutes or so. Furthermore, I suspect the cause of the problem to be changes in RunAS accounts in SCOM, but I am not certain.

Solution

I found this on the Technet forums http://social.technet.microsoft.com/Forums/en-US/operationsmanagermgmtpacks/thread/01eff618-1087-4b6a-9d3f-9f1402ddf3f4:

Reregister SQL management object (MOF) data. I have yet to figure out exactly what MOF data is, but SCOM is using it to monitor the SQL server instance(s) on the server. Perform the following operations:

  • Start an administrative command prompt
  • Browse to “C:\Program Files (x86)\Microsoft SQL Server\100\Shared”
  • Run mofcomp sqlmgmproviderxpsp2up.mof
  • Restart the SCOM agent service (System Center Management)

image

DCOM error 10016 on agent job execution

Problem

Every time a maintenance plan generated SQL agent job tries to execute, the following error occurs in the system event log:

Log Name:      System
Source:        Microsoft-Windows-DistributedCOM
Date:          02.02.2013 16:00:02
Event ID:      10016
Task Category: None
Level:         Error
Keywords:      Classic
User:          SQL AGENT Service account
Computer:      SQL Server
Description:
The application-specific permission settings do not grant Local Launch permission for the COM Server application with CLSID
{46063B1E-BE4A-4014-8755-5B377CD462FC}
and APPID
{FAAFC69C-F4ED-4CCA-8849-7B882279EDBE}
to the user “SQL AGENT Service account ”SID (S-1-5-21-) from address LocalHost (Using LRPC). This security permission can be modified using the Component Services administrative tool.

image

Analysis

Searching the registry for the CLSID revealed that the DCOM application in question is MsDtsServer100:

image

A quick peek in the settings reveals that only local administrators and SYSTEM are granted the Local Launch permission by default:

image

During further analysis I discovered that this error only occurred on one of the cluster nodes (the server in question is part of a two node failover cluster). I then found that the SQLAgent service account was added to local admins on the other cluster node, but not on this one. Whether or not the SQL Agent service account should be a member of local admins or not is debatable, but it sure gives you a lot of gripe if it isn’t.

Solution

1: Add the SQL Agent service account to the local administrators group on the server.

2: Or give the SQL Agent service account explicit local launch permissions on MsDtsServer100 using dcomcnfg:

SNAGHTML7d74b4

Permission denied restoring imported backup

Problem

Trying to restore a backup of a database on a different server than the one were the backup originates generates the following error message: “The operating system returned the error ‘5(failed to retrieve text for this error. Reason: 15105)’ while attempting ‘RestoreContainer::ValidateTargetForCreation’”:

SNAGHTML375e27

 

Analysis

Error 5 is as always access denied. I don’t know why text retrieval fails, but that is a problem for another day Smilefjes.Restores of backups originating from the destination server on the destination server works as expected, and the SQLServer service account have the required permissions granted on both the .bak file and the target folder. Restore of the same backup file on other servers during staging worked as expected. I thus concluded that there had to be some differences between the staging and production servers causing the issue.

  • staging servers are virtual and stand-alone servers, production is a physical cluster
  • staging servers use vm drives as data volumes, production has SAN disks attached to mount points
  • staging has a couple of cores and > 20gigs of ram, production has 16 cores and > 200 gigs

Then it hit me, mount points are renowned for causing strange permission issues due to the fact that the permissions for the mount point and the actual volume mounted are stored separately in different ACLs:

SNAGHTML47effa

Further investigation revealed that the service account had full control permissions on both ACLs, but the volume permissions on the data and transaction log volumes were granted via group membership, while the mount point permissions on those volumes were granted explicitly. This doesn’t seem to be a problem for other operations, but when you try to restore a database to a different destination than were it came from, the server account needs explicit permissions on the destination folder(s). That is, when original file name and Restore As are not the same.

SNAGHTML4f865f

solution

Grant the SQLServer service account explicit permissions on both the volume and the mount point. If you have different mount points for transaction log and data files, you have to do this on both folders. Furthermore, I would guess similar errors could occur if the service account lacks access to the source bak file.

Orchestrator bug kills database server

Update

20131031: Code to limit the size of the error log files added to the solution. So far only tested on SQL 2012 SP1 CU2.

Problem

The Database server/operating system starts complaining about low disk space. Analysis reveals that the volume containing the default MSSQL log folder is very low on space, maybe even 0 bytes free. Further analysis shows that the ERRORLOG files are taking up all remaining space on the volume.

Analysis

The ERROROLOG files is filled to the brim with messages like this one:

Date		07.01.2013 21:09:09
Log		SQL Server (Current - 07.01.2013 14:57:00)

Source		spid29s

Message
The activated proc '[Microsoft.SystemCenter.Orchestrator.Maintenance].[MaintenanceWorker]' running on queue 'OrchestratorDB.Microsoft.SystemCenter.Orchestrator.Maintenance.MaintenanceServiceQueue' output the following:  'Error: 9002, The transaction log for database 'OrchestratorDB' is full. To find out why space in the log cannot be reused, see the log_reuse_wait_desc column in sys.databases'

image

image

Orchestrator has a maintenance process that does not handle a situation were the database transaction log is unable to grow, either due to restrictions or due to the fact that the disk is full. When this happens, it bloats the ERRORLOG with the messages mentioned above. This happens very quickly as it keeps retrying several times each second, so even a 10GB volume is filled in hours. If this situation is left unchecked for a couple of days, the risk of volume NTFS corruption is high.

solution

  • Disable access to the Orchestrator database or stop the Orchestrator servers to prevent further bloating.
  • Stop the database instance
  • Delete the bloated error log to free up space
  • Restart the instance
  • Change the number of error logs SQL server is allowed to create (default is 6, sample limits to 10) and limit the size of the error log files to 50MB each. This will give you a total of 11 error logfiles, each with a max size of about 50 000 KB for a total of 550MB.
    USE [master]
    GO
    EXEC xp_instance_regwrite N'HKEY_LOCAL_MACHINE',
     N'Software\Microsoft\MSSQLServer\MSSQLServer',
     N'NumErrorLogs', REG_DWORD, 10
    GO
    EXEC xp_instance_regwrite N'HKEY_LOCAL_MACHINE',
     N'Software\Microsoft\MSSQLServer\MSSQLServer', 
     N'ErrorLogSizeInKb', REG_DWORD, 50000
    GO
  • Solve the problem regarding the Orchestrator DB transaction log. Usually such problems are caused by either a failed backup job (thus no log truncation) or the simple fact that you need to allocate more space for the data. If a failed backup job is the problem, consider shrinking the transaction logs back down to their normal size. See https://lokna.no/?p=349 for more information about shrinking troublesome logs (Norwegian version only).
  • Enable access to the Orchestrator database.
  • Monitor the error logs and see if the values you chose for log size and number of logs fits your workload. If the logs contain at least a day of messages, but not more than a month, you are in the butter zone.

Maintenance plans–change owner

Problem

Sometimes you might want to change the owner of a job connected to a maintenance plan. For instance, the current owner could be a deleted account, something that will cause the jobs to fail with not so funny error messages regarding permissions or missing accounts. It is possible to change the job owner using TSQL or Management Studio, but each time the plan is updated, the owner is overwritten by the owner of the maintenance plan. To change this, you have to use TSQL AFAIK.

 

Solution

NB: Running this is dangerous if you don’t know what you are doing! The msdb.dbo.ssispackages table contains system objects as well as your manually created plans, and could also contain SSIS packages as the name indicates. Always take a backup of MSDB before making changes.

Changing the maintenance plan owner on MSSQL 2000:

update msdb.dbo.sysdbmaintplans
set
owner=’SQL Systembruker’
where owner like ‘Feil bruker’

On MSSQL 2008R2:

UPDATE msdb.dbo.sysssispackages
SET
ownersid=suser_sid(’New User’)
WHERE ownersid = suser_sid(’Old user’)

or just set sa as the owner of everything:

UPDATE msdb.dbo.sysssispackages
SET
ownersid=0x01
WHERE ownersid != 0x01

Afterwards, either save the plan to update job ownership or run the following to change the job owner as well.

update msdb.dbo.sysjobs
set
owner_sid=suser_sid(’SQL Systembruker’)
where suser_sname(owner_sid) like ‘Feil bruker’

or just set sa as the owner of everything here as well:

update msdb.dbo.sysjobs
set
owner_sid=0x01
WHERE owner_sid != 0x01

If it doesn’t seem to work as expected, restart the agent service.