I got a callout from our monitoring system, alerting me of a high Average Wait Time on one of my production servers:
The baseline average wait value for this server is around 9ms, so an average in the 900ms range is a definitive sign of something. Whether it is a sign of something bad or not is for the investigation to decide. All other operating parameters where in the normal range, both for the SQL Server and the connected application servers. Usually, such warnings are accompanied by a complaint from the application team about slow application performance, but not today.
I started out running Mr. Randal’s brilliant wait stats scripts. As the graph from the monitoring system contained a lot of jitter, I went with http://www.sqlskills.com/blogs/paul/capturing-wait-statistics-period-time/ and let it run for the 30 minute default period when the waits were high to se what was going on. The result revealed that LCK_M_U had joined the usual band of ASYNC_NETWORK_IO, TRACEWRITE and CXPACKET waits:
LCK_M_U indicates that someone somewhere is waiting for an update lock. This particular system has a very high Read ratio versus Writes, and most writes are done during nightly update tasks. As such, this wait type was not expected and further analysis was needed. I found an old article by Kendra Little describing how to collect data using sp_WhoIsactive: http://www.littlekendra.com/2011/02/01/whoisactive/
Using her example as a guide, I created a script to collect some data in my monitoring database. To use this script, you need a user database that you can play around with and an up to date version of Adam Machanics sp_WhoIsactive. I tried using the @find_block_leaders option, but I ran in to a known recursion level limitation so I had to remove it. The script has three parts:
Create a table
This part creates a table tagged with todays date to make it easy to separate different troubleshooting sessions. You could also schedule daily automatic runs of the entire script to collect data at specific times each day.
DECLARE @destination_table VARCHAR(4000) ; SET @destination_table = 'WhoIsActive_' + CONVERT(VARCHAR, GETDATE(), 112) ; DECLARE @schema VARCHAR(4000) ; EXEC dbo.sp_WhoIsActive @get_transaction_info = 1, @get_plans = 1, @find_block_leaders = 0, @get_locks=TRUE, @RETURN_SCHEMA = 1, @SCHEMA = @schema OUTPUT; SET @schema = REPLACE(@schema, '<table_name>', @destination_table) ; PRINT @schema EXEC(@schema) ; GO </table_name>
You can run this part several times, and the intervals can be adjusted to suit your needs. Just be aware that if you cross midnight, you either have to create another table for the new day, or modify the collection script to log to yesterdays table.
DECLARE @destination_table VARCHAR(4000) ; SET @destination_table = 'WhoIsActive_' + CONVERT(VARCHAR, GETDATE(), 112) ; DECLARE @msg NVARCHAR(1000) ; DECLARE @numberOfRuns INT ; SET @numberOfRuns = 100 ; WHILE @numberOfRuns > 0 BEGIN; EXEC dbo.sp_WhoIsActive @get_transaction_info = 1, @get_plans = 1, @find_block_leaders = 0, @get_locks=TRUE, @DESTINATION_TABLE = @destination_table ; SET @numberOfRuns = @numberOfRuns - 1 ; IF @numberOfRuns > 0 BEGIN SET @msg = CONVERT(CHAR(19), GETDATE(), 121) + ': ' + 'Logged info. Waiting...' RAISERROR(@msg,0,0) WITH nowait ; WAITFOR DELAY '00:00:05' END ELSE BEGIN SET @msg = CONVERT(CHAR(19), GETDATE(), 121) + ': ' + 'Done.' RAISERROR(@msg,0,0) WITH nowait ; END END ; GO
The default is 100 runs at 5 second intervals. I had to run the collection script three times to get the data I wanted, so don’t give up if you don’t catch it on the first run.
Analyze the data
Two sample analysis scripts. The first one gives you all the collected data sorted by collection time, the second filters out the LCK_M_U waits only and has a hardcoded table name as I didn’t finish the first day.
DECLARE @destination_table NVARCHAR(2000), @dSQL NVARCHAR(4000) ; SET @destination_table = 'WhoIsActive_' + CONVERT(VARCHAR, GETDATE(), 112) ; SET @dSQL = N'SELECT collection_time, * FROM dbo.' + QUOTENAME(@destination_table) + N' order by 1 desc' ; print @dSQL EXEC sp_executesql @dSQL GO SELECT collection_time, * FROM JKL_Log.dbo.WhoIsActive_20150616 WHERE session_id NOT LIKE 51 AND wait_info LIKE '%LCK_M_U%' ORDER BY 1 DESC;
The result clearly identified the waiting code. This code is not necessarily the culprit, it could be another slow task hogging locks on the data.
With the troublesome T-SQL code identified, the next step would be debugging the application. In this case, the code was from an Entity Framework app, and I have to get the developers of said application to look into the issue.