Or to be more exact: seven days of my life, with every waking hour spent troubleshooting and finally reinstalling servers.
Some time after you install MSSQL 2012 SP 1 strange things start to happen on the server. Maybe you are unable to start management studio, or if you’re really unfortunate you can’t log in to the server at all. Maybe random services start to fail, Windows Update refuses to apply updates, and if you boot the server it might refuse to start SQL server at all or just greet you with a friendly BSOD. But a couple of days before everything was hunky-dory, and you are absolutely 100% sure nothing has changed since then. Yet the server fails, and a couple of days later another one goes down with the same symptoms. It’s as if they have contracted ebola or the swine flu or some other strange virus. It seems to be attacking the entire server all at once, and the only common denominator is they are all running MS SQL 2012 SP1.
- The Kernel memory paged pool is bloated. Normally for SQL servers of this size (128-256G RAM) this metric is somewhere between 256MB and 512MB, servers experiencing this problem have a value of 1024-3000MB:
- Event ID 1101 from .NET Runtime Optimization, failed to compile something in the 32 bit SQL server installation folder:
- Unable to start SSMS:
Visual Studio Shell Isolated Cannot find one or more components. Please reinstall the application.
- MSIEXEC.EXE still running two weeks after last installation and returns after reboot:
- Unable to log in because “The User Profile Service service failed the logon. User profile cannot be loaded.”
- ..or, logging you in with a temporary profile.
- Event 1001 from MSInstaller, Detection of product ‘SQL somthing’ failed repeats in the event log:
- Windows Module Installer Service fails repeatedly and generates errors in the system event log.
and a lot of other errors and warnings that I didn’t document or simply forgot.
The bloated kernel memory pool seemed to me like the best bet, so I started looking closer at that problem. In the beginning I got sidetracked by the fact that all of the servers experiencing the problem were missing the correct chipset drivers. They were all the same make and model, only a couple of weeks old, and they differed only in RAM. I quickly established that installing the drivers didn’t help, but I still believe that the missing drivers was somehow related to which servers ended up with the biggest problems. I started with one server with problems, but so far I have detected the issue to some degree on all servers running MSSQL 2012 SP1.
I spent half a day downloading the Windows Driver Development Kit (WDK) into our environment to get a hold of Poolmon.exe. (I won’t bother listing a URL to this, as MS changes the location every 6 months. Just bing or google poolmon.exe). Running poolmon on one of the troubled servers gave this result:
For more information on tracing pool leaks, see this post on Mark Russinovich’s blog: http://blogs.technet.com/b/markrussinovich/archive/2009/03/26/3211216.aspx
CM31 seems to be the tag using all my paged pool memory. But has it reached a limit? There seemed to be a 2048MB roof that was never reached, until I checked another server were the paged pool was way beyond 3000MB. Research shows that in theory the limit is something like 50% of the physical RAM or 128GB, none of which have been reached on this particular server containing 256gigs of RAM. Process explorer is able to identify the current limit, but that requires symbols and I didn’t have the time or energy to get the necessary firewall and proxy exceptions approved to download symbols on my servers. Anyway, I followed the path of Mr. Russinovich and used the strings.exe from Sysinternals to try to identify the culprit. Sadly it pointed back at the kernel itself:
Of course, finding a kernel leak in Windows would be fun, but I seriously doubted this to be the issue. I concluded it had to be something else asking the kernel to reserve memory, but what? It took a lot of googling and asking around to identify it, and finally it was a developer who put me on the right track: CM is Configuration Manager, which is another name for the registry or at least registry access related. A quick look in the corresponding folder on the servers revealed that the software hive was dangerously close to 2048MB on the affected servers. Dangerously because the fixed limit for all registry hives except system is 2048MB. This is usually not a problem, as most registry hives can be measured in tens of megabytes, not thousands.
What happens is that every operation trying to do something in the registry that would require the HKLM software hive to grow a bit fails miserably. Knowing that such a simple act as starting word could trigger tens of thousands of registry operations, I dare not think about what happens when the system boots and a user tries to open his profile.
Further analysis identified the following registry keys (and their subkeys) to be the culprits:
Each entry has an ImageList binary value containing what appears to be a link to the file Image and a checksum of sorts:
Each one of these takes up a fair bit of room in the registry. The key shown above have only one subkey, the problematic keys have thousands of subkeys, each containing an identical imagelist value thus compounding the size.
These values are used by the Microsoft .Net Framework NGEN services as it tries to optimize .Net code for each specific server. This process is corrupted for some reason during the installation of MSSQL 2012 SP 1, so each time this process is triggered another subkey is added, which triggers another one, and another one, and so on.
This issue is linked to the following MSSQL bug report: https://connect.microsoft.com/SQLServer/feedback/details/770630/msiexec-exe-processes-keep-running-after-installation-of-sql-server-2012-sp1#tabs
Check that thread for updates before applying my workaround. As of this moment (2012.12.19) there is no fix available.
Update 2013.01.03: KB2793634 released (se fix below)
If your system hive has reached 2GB
There is a way to compact bloated registry hives, but to do this you would first have to delete the excess registry keys and reduce the active size of the hive to <100MB. Otherwise the compression will fail. And ask yourself: do you want to live with a productions server that has experienced serious registry errors? In my opinion reinstallation is the only answer until when and if our friends at MS provides us with another option. If your servers are clustered, you could reinstall one node at the time as long as you are able to log on to at least one of the affected nodes. I have done this successfully on a cluster were both nodes had a 2GB software hive.
Furthermore I have no idea what so ever as to what happens if you delete the keys. I have tried asking MS Premier support, and so far I have yet to receive an answer.
Stop it from growing further
The following has been successful for me so far:
Just stop and disable ALL Microsoft .Net Framework NGEN services, and make sure you kill any lingering msiexec.exe processes.
Remove management tools
If you are able to control you SQL servers from another server or your own management computer, you could remove management tools from the server. I am not 100% certain that this will remove the problem, but others have tried it successfully. Be aware that this is not possible on a cluster afaik. You have to remove and reinstall the instance to remove components.
Update 2013.02.08: http://support.microsoft.com/kb/2790947 prevents the problem from occurring (SQL 2012 SP1 CU2). I have tested this extensively, and it worked as promised on all the systems I have tested it on. Make sure you enable the Microsoft .Net Framework NGEN services again if you followed my workaround and disabled them.
To remove the excessive registry keys, let the Ngen services run for 24 hours or more. This doesn’t compress the hive though. As the method described at http://support.microsoft.com/kb/2498915 only works on hives that are very small to begin with, another approach is needed (see below).
To check if all excess registry keys have been removed, execute the following commands:
%SystemRoot%\Microsoft.Net\Framework\v4.0.30319\ngen eqi %SystemRoot%\Microsoft.Net\Framework64\v4.0.30319\ngen eqi
If you still have excessive keys, this will remove them provided you have installed KB2790947. This process can take several days to complete, especially on virtual machines with limited resources. On a dual CPU Xeon E5-2680 with 128GB ram it took less than 18 hours (the time that passed from I ran the commands and until I came back to check on it).
Compress registry hive
There are several ways to do this. The first method described below worked for me, but I have added another from the connect article. I recommend trying method 1 first and then move on to method 2 if method 1 fails, as method 1 is easier to implement.
This method was developed with help from from Microsoft Support .
Make sure you have installed SP1 CU2 and removed the excess registry keys (see above)
Make sure the computer has at least 8GB of RAM, otherwise this process might fail. You can remove the additional memory after you are done. “ERROR: Insufficient system resources..” is returned if you are low on RAM.
- Create a compacted hive and replace the bloated hive:
- Start windows in repair mode (F8 on boot)
- Open a command prompt
- Identify your Windows installation and it’s repair mode drive letter, usually NOT C:.
- cd [drive]:\Windows\System32\config
- Execute the following command to load the bloated hive under HKLM
reg load HKLM\Bloated SOFTWARE
- Start regedit
- Right click on “HKEY_LOCAL_MACHINE\Bloated” and click on Export. Save the file as “Registry Hive File” to C:\windows\System32\config\SOFTWARENew to differentiate it from the original hive.
- Close the regedit
- Execute the following command to unload the bloated registry hive from the registry
reg unload HKLM\Bloated
- Rename C:\windows\System32\config\SOFTWARE to C:\windows\System32\config\SOFTWAREOld
- Rename C:\Windows\System32\config\SOFTWARENew to C:\Windows \System32\config\SOFTWARE
- Close the command prompt and restart the machine
I haven’t had time to test this myself yet, but nibbler2 posted a solution at https://connect.microsoft.com/SQLServer/feedback/details/770630/msiexec-exe-processes-keep-running-after-installation-of-sql-server-2012-sp1#tabs. I quote:
Posted by nibbler2 on 11.01.2013 at 14:45
@DizzyBadger. You put me on the right track, and I’ve solved the problem. Thanks!
Here are the steps.
-2) Read and understand http://support.microsoft.com/kb/2498915
-1) Make sure you’ve cleaned out the junk keys from the registry.
0) Boot into Windows PE
1) copy the huge SOFTWARE hive file to a backup file for safekeeping
2) load the bloated software hive in regedit
3) export the Microsoft/Windows and the Microsoft/.net framework/2.0xx/Ngen/Roots keys into new files as hive files.
4) Delete the two keys exported above. The point here is to reduce the size of the remaining keys in the software hive. Exporting these two keys worked for me, you may have to experiment.
5) export the bloated software hive as a hive file. If you’ve deleted enough in steps 3 and 4, this will work.
6) Exit regedit, load regedit, and unload the bloated hive.
7) Load the newly saved hive that you saved in step 5.
8) Create the Microsoft/Windows key and import the exported Windows hive file into it.
9) Create the Microsoft/.net framework/…./Roots key and import the exported Roots hive into it.
10) Exit regedit, load regedit, and unload the hive we are working with.
11) Copy the newly fixed up hive overtop of SOFTWARE
12) Reboot into Windows
13) run sfc /scannow