Something bad happens, and the error log is bloated with error messages very fast, triggering low disk space alarms and eventually filling the drive to the brim. For a concrete example, see http://lokna.no/?p=1263.
Even though I have spent considerable time searching for a solution to this, so far I have been at a loss. My best bet has been limiting the number of files. If the default limit of 6 isn’t changed, there seems to be no entry in the registry. On clustered instances such errors as mentioned above can cause a failover roundabout, and I have seen several hundred log files being generated (on SQL Server 2008R2). Changing the limit adds a registry value, and this seems to force SQL Server to enforce the limit.
This doesn’t solve the most pressing issue though. If some sort of error starts pushing several log messages every second, a small base volume with say 10GiB of free space is filled to the brim in hours if not minutes. Thus, when you are woken up by the low disk space alarm it is already to late to stop it, and you run the chance of NTFS corruption. In my experience, the chance of NTFS corruption is increasing on a logarithmic scale when you pass 5% free space on a volume. 0 bytes of free space with processes still trying to write more data to the volume more or less guarantees some kind of NTFS corruption unless you catch it quickly.
Sidetrack about sAN storage with 0 bytes free space
If you find your self in such a situation, DO NOT fail over the instance (or restart it if it isn’t clustered). This will only make it worse and increase the chance of unfixable NTFS corruption.
What I would do:
- Don’t panic
- Stop SQL Server and any other software responsible for filling the volume to release the locks on the active file(s)
- Free up some space by deleting something. Do NOT try to copy/move something to another drive before you have at least a few MiB of free space. Likewise, make sure to use [shit]+[del] to force delete instead of moving it to the recycle bin. You will have to find some expendable files to delete. Failing to do this correctly on the first try could destroy the volume.
- Delete or move some more files, ensuring you have at least 20% free space
- Run chkdsk /f on the volume several times until you get two consecutive runs without any error messages. If this I not possible, the volume is damaged beyond safe repair. Try to salvage what you can, reformat and reinstall.
- If chkdsk reports any physical errors (bad sectors), you should get a new SAN volume/lun as well.
- Restart, or in the worst case reinstall the SQL Server instance
- Resolve the root cause before it fills up the drive again.
Further analysis: limit the error log file size
Lab environment: SQL Server 11.0.2100.60 – RTM (Enterprise Edition (64-bit)) instance running on a virtual Windows 2012 two-node failover cluster backed by iSCSI shared storage.
Finally, I found a reference to an ErrorLogSizeInKb registry key at http://sqlserverscribbles.com/tag/limiting-sqlserver-errorlog-size/. Sadly though, I was unable to get the method referenced at that site to work in my lab. This might be because my lab is clustered, so I decided to just try adding it to the same registry key where the NumErrorLogs value that is used to control the number of previous error logs is stored. On a clustered instance, you can set this option by running the following TSQL code:
USE [master] GO EXEC xp_instance_regwrite N'HKEY_LOCAL_MACHINE', N'Software\Microsoft\MSSQLServer\MSSQLServer', N'NumErrorLogs', REG_DWORD, 10 GO
In my lab, this creates the following value:
(This setting can also be changed via the GUI in SSMS by right clicking the SQL Server Error Logs folder).
So I ran this code:
USE [master] GO EXEC xp_instance_regwrite N'HKEY_LOCAL_MACHINE', N'Software\Microsoft\MSSQLServer\MSSQLServer', N'ErrorLogSizeInKb', REG_DWORD, 500 GO
And hey presto, this value is created:
I restarted the instance and generated some error log messages by turning traceflag 3004 and 3605 on and off a few thousand times. These flags are used in another test to see if instant file initialization is turned on, more on that later, but the point is just to generate a bunch of log messages. The test confirmed that Sql Server will honor the limitations, as I quickly got 10 ERRORLOG.n files at 502KiB each:
I tried changing the size, and it seems to always add 2KiB to the set limit.
Solution – Limit SQL Server Error log size
To limit the size of the SQL Server error log files, you have to add a registry key and restart the instance. You should also consider changing the default number of error logs to make sure SQL Server is enforcing the limit when something bad happens. I would suggest setting a value based on the amount of space available for log messages. Make sure you account for future file growth on the volume where the logs are located. In a failure scenario I would like as much data as possible in the logs without endangering the stability of the entire volume.
If the instance is part of a cluster, don’t edit the registry directly, as the registry is replicated between nodes by the cluster service. Instead, use xp_instance_regwrite to ensure the value is created/updated on the active node and replicated to passive nodes.
Sample: 10 error logs and 50 000KiB per file for a total maximum error log size of 500 020KiB excluding the active log file.
USE [master] GO EXEC xp_instance_regwrite N'HKEY_LOCAL_MACHINE', N'Software\Microsoft\MSSQLServer\MSSQLServer', N'NumErrorLogs', REG_DWORD, 10 GO EXEC xp_instance_regwrite N'HKEY_LOCAL_MACHINE', N'Software\Microsoft\MSSQLServer\MSSQLServer', N'ErrorLogSizeInKb', REG_DWORD, 50000 GO
You could also create an agent job to recycle the log on a daily basis, thus ensuring that you get one log per day under normal circumstances. See http://technet.microsoft.com/en-us/library/ms182512.aspx.