How SQL Server 2012 Service Pack 1 destroyed my life

Or to be more exact: seven days of my life, with every waking hour spent troubleshooting and  finally reinstalling servers.

Problem

Some time after you install MSSQL 2012 SP 1 strange things start to happen on the server. Maybe you are unable to start management studio, or if you’re really unfortunate you can’t log in to the server at all. Maybe random services start to fail, Windows Update refuses to apply updates, and if you boot the server it might refuse to start SQL server at all or just greet you with a friendly BSOD. But a couple of days before everything was hunky-dory, and you are absolutely 100% sure nothing has changed since then. Yet the server fails, and a couple of days later another one goes down with the same symptoms. It’s as if they have contracted ebola or the swine flu or some other strange virus. It seems to be attacking the entire server all at once, and the only common denominator is they are all running MS SQL 2012 SP1.

Findings

  • The Kernel memory paged pool is bloated. Normally for SQL servers of this size (128-256G RAM) this metric is somewhere between 256MB and 512MB, servers experiencing this problem have a value of 1024-3000MB:

image

 

  • Event ID 1101 from .NET Runtime Optimization, failed to compile something in the 32 bit SQL server installation folder:

image

  • Unable to start SSMS:

Visual Studio Shell Isolated Cannot find one or more components. Please reinstall the application.

image

 

  • MSIEXEC.EXE still running two weeks after last installation and returns after reboot:

image

  • Unable to log in because “The User Profile Service service failed the logon. User profile cannot be loaded.”
  • ..or, logging you in with a temporary profile.
  • Event 1001 from MSInstaller, Detection of product ‘SQL somthing’ failed repeats in the event log:

image

  • Windows Module Installer Service fails repeatedly and generates errors in the system event log.

and a lot of other errors and warnings that I didn’t document or simply forgot.

Analysis

The bloated kernel memory pool seemed to me like the best bet, so I started looking closer at that problem. In the beginning I got sidetracked by the fact that all of the servers experiencing the problem were missing the correct chipset drivers. They were all the same make and model, only a couple of weeks old, and they differed only in RAM. I quickly established that installing the drivers didn’t help, but I still believe that the missing drivers was somehow related to which servers ended up with the biggest problems. I started with one server with problems, but so far I have detected the issue to some degree on all servers running MSSQL 2012 SP1.

I spent half a day downloading the Windows Driver Development Kit (WDK) into our environment to get a hold of Poolmon.exe. (I won’t bother listing a URL to this, as MS changes the location every 6 months. Just bing or google poolmon.exe). Running poolmon on one of the troubled servers gave this result:

image

For more information on tracing pool leaks, see this post on Mark Russinovich’s blog: http://blogs.technet.com/b/markrussinovich/archive/2009/03/26/3211216.aspx

CM31 seems to be the tag using all my paged pool memory. But has it reached a limit? There seemed to be a 2048MB roof that was never reached, until I checked another server were the paged pool was way beyond 3000MB. Research shows that in theory the limit is something like 50% of the physical RAM or 128GB, none of which have been reached on this particular server containing 256gigs of RAM. Process explorer is able to identify the current limit, but that requires symbols and I didn’t have the time or energy to get the necessary firewall and proxy exceptions approved to download symbols on my servers. Anyway, I followed the path of Mr. Russinovich and used the strings.exe from Sysinternals to try to identify the culprit. Sadly it pointed back at the kernel itself:

image

Of course, finding a kernel leak in Windows would be fun, but I seriously doubted this to be the issue. I concluded it had to be something else asking the kernel to reserve memory, but what? It took a lot of googling and asking around to identify it, and finally it was a developer who put me on the right track: CM is Configuration Manager, which is another name for the registry or at least registry access related. A quick look in the corresponding folder on the servers revealed that the software hive was dangerously close to 2048MB on the affected servers. Dangerously because the fixed limit for all registry hives except system is 2048MB. This is usually not a problem, as most registry hives can be measured in tens of megabytes, not thousands.

image

What happens is that every operation trying to do something in the registry that would require the HKLM software hive to grow a bit fails miserably. Knowing that such a simple act as starting word could trigger tens of thousands of registry operations, I dare not think about what happens when the system boots and a user tries to open his profile.

Further analysis identified the following registry keys (and their subkeys) to be the culprits:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\.NETFramework\v2.0.50727\NGENService\Roots
HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\.NETFramework\v2.0.50727\NGENService\Roots

Each entry has an ImageList binary value containing what appears to be a link to the file Image and a checksum of sorts:

image

Each one of these takes up a fair bit of room in the registry. The key shown above have only one subkey, the problematic keys have thousands of subkeys, each containing an identical imagelist value thus compounding the size.

These values are used by the Microsoft .Net Framework NGEN services as it tries to optimize .Net code for each specific server. This process is corrupted for some reason during the installation of MSSQL 2012 SP 1, so each time this process is triggered another subkey is added, which triggers another one, and another one, and so on.

Workaround

This issue is linked to the following MSSQL bug report: https://connect.microsoft.com/SQLServer/feedback/details/770630/msiexec-exe-processes-keep-running-after-installation-of-sql-server-2012-sp1#tabs

Check that thread for updates before applying my workaround. As of this moment (2012.12.19) there is no fix available.

Update 2013.01.03: KB2793634 released (se fix below)

If your system hive has reached 2GB

There is a way to compact bloated registry hives, but to do this you would first have to delete the excess registry keys and reduce the active size of the hive to <100MB. Otherwise the compression will fail. And ask yourself: do you want to live with a productions server that has experienced serious registry errors? In my opinion reinstallation is the only answer until when and if our friends at MS provides us with another option. If your servers are clustered, you could reinstall one node at the time as long as you are able to log on to at least one of the affected nodes. I have done this successfully on a cluster were both nodes had a 2GB software hive.

Furthermore I have no idea what so ever as to what happens if you delete the keys. I have tried asking MS Premier support, and so far I have yet to receive an answer.

Stop it from growing further

The following has been successful for me so far:

Just stop and disable ALL Microsoft .Net Framework NGEN services, and make sure you kill any lingering msiexec.exe processes.

image

Remove management tools

If you are able to control you SQL servers from another server or your own management computer, you could remove management tools from the server. I am not 100% certain that this will remove the problem, but others have tried it successfully. Be aware that this is not possible on a cluster afaik. You have to remove and reinstall the instance to remove components.

Fix

Update 2013.02.08: http://support.microsoft.com/kb/2790947 prevents the problem from occurring (SQL 2012 SP1 CU2). I have tested this extensively, and it worked as promised on all the systems I have tested it on. Make sure you enable the Microsoft .Net Framework NGEN services again if you followed my workaround and disabled them.

To remove the excessive registry keys, let the Ngen services run for 24 hours or more. This doesn’t compress the hive though. As the method described at http://support.microsoft.com/kb/2498915 only works on hives that are very small to begin with, another approach is needed (see below).

To check if all excess registry keys have been removed, execute the following commands:

%SystemRoot%\Microsoft.Net\Framework\v4.0.30319\ngen eqi
%SystemRoot%\Microsoft.Net\Framework64\v4.0.30319\ngen eqi

If you still have excessive keys, this will remove them provided you have installed KB2790947. This process can take several days to complete, especially on virtual machines with limited resources. On a dual CPU Xeon E5-2680 with 128GB ram it took less than 18 hours (the time that passed from I ran the commands and until I came back to check on it).

Compress registry hive

There are several ways to do this. The first method described below worked for me, but I have added another from the connect article. I recommend trying method 1 first and then move on to method 2 if method 1 fails, as method 1 is easier to implement.

Method 1

This method was developed with help from from Microsoft Support Smilefjes.

  1. Make sure you have installed SP1 CU2 and removed the excess registry keys (see above)

  2. Make sure the computer has at least 8GB of RAM, otherwise this process might fail. You can remove the additional memory after you are done. “ERROR: Insufficient system resources..” is returned if you are low on RAM. image

  3. Create a compacted hive and replace the bloated hive:

  1. Start windows in repair mode (F8 on boot)
  2. Open a command prompt
  3. Identify your Windows installation and it’s repair mode drive letter, usually NOT C:.
    image
  4. cd [drive]:\Windows\System32\config
  5. Execute the following command to load the bloated hive under HKLM

reg load HKLM\Bloated SOFTWARE

  1. Start regedit
  2. Right click on “HKEY_LOCAL_MACHINE\Bloated” and click on Export. Save the file as “Registry Hive File” to C:\windows\System32\config\SOFTWARENew to differentiate it from the original hive.
  3. Close the regedit
    image
  4. Execute the following command to unload the bloated registry hive from the registry
    reg unload HKLM\Bloated
  5. Rename C:\windows\System32\config\SOFTWARE to C:\windows\System32\config\SOFTWAREOld
  6. Rename C:\Windows\System32\config\SOFTWARENew to C:\Windows \System32\config\SOFTWARE
    image
  7. Close the command prompt and restart the machine

Method 2

I haven’t had time to test this myself yet, but nibbler2 posted a solution at https://connect.microsoft.com/SQLServer/feedback/details/770630/msiexec-exe-processes-keep-running-after-installation-of-sql-server-2012-sp1#tabs. I quote:

Posted by nibbler2 on 11.01.2013 at 14:45
@DizzyBadger. You put me on the right track, and I’ve solved the problem. Thanks!

Here are the steps.
-2) Read and understand
http://support.microsoft.com/kb/2498915
-1) Make sure you’ve cleaned out the junk keys from the registry.
0) Boot into Windows PE
1) copy the huge SOFTWARE hive file to a backup file for safekeeping
2) load the bloated software hive in regedit
3) export the Microsoft/Windows and the Microsoft/.net framework/2.0xx/Ngen/Roots keys into new files as hive files.
4) Delete the two keys exported above. The point here is to reduce the size of the remaining keys in the software hive. Exporting these two keys worked for me, you may have to experiment.
5) export the bloated software hive as a hive file. If you’ve deleted enough in steps 3 and 4, this will work.
6) Exit regedit, load regedit, and unload the bloated hive.
7) Load the newly saved hive that you saved in step 5.
8) Create the Microsoft/Windows key and import the exported Windows hive file into it.
9) Create the Microsoft/.net framework/…./Roots key and import the exported Roots hive into it.
10) Exit regedit, load regedit, and unload the hive we are working with.
11) Copy the newly fixed up hive overtop of SOFTWARE
12) Reboot into Windows
13) run sfc /scannow

Author: DizzyBadger

SQL DBA Principal Analyst

9 thoughts on “How SQL Server 2012 Service Pack 1 destroyed my life”

  1. Great write-up! I wish they’d escalate this issue as this one appears to have some serious long term effects. I can only assume that the problem is more widespread and may sit undetected until it’s too late.

    1. Thx 🙂
      The incubation period of the problem seems to vary wildly, thus the sneaky long term effects. Some servers grow the registry by a couple of entries a week, some with several each minute. I suspect the growth rate is linked to at what point in the installation you apply updates for .Net framework, and to SSMS sessions left open in disconnected or active RDP connections but I have yet to find conclusive proof of this.
      Furthermore, I won’t be surprised if the problem is actually due to a bug in .Net framework itself, and not in SQL server.

  2. Thanks a lot for writing this blog. I have experienced the same on two servers, luckily development servers. Finding your blog, saved me for doing some of the research you have done. We also found that the software registry segment reached 2 GB, and after some research, I found your article here 🙂
    I have tried to apply the fix of 2013.01.03. Hopefully this will prevent this error occurring at new servers, but it did note clean up an fix the corrupted servers. One one of the servers, it failed to run, due to WMI service cannont be started, and because it says the server is clustered (it’s not), and it cannot find the nodes. On the other server, the fix ran through, but the server still seems to be corrupted.

    1. In my experience so far, the fix clears out the bloated registry entries if you let the NGen services run for 24 hours. Sadly though, this does not reduce the size of the registry hive file. You can live with this, provided you are okay with the fact that almost 2GiB of ram is reserved for whitespace in the registry. Regarding your errors installing the fixed, this is because you have reached the 2GiB hive size limit. When this happens, the machine is by all practical means destroyed and has to be reinstalled.

  3. I’ve just experienced the same registry bloat problem following installation of SQL 2012 SP1, and managed to resolve it with the help of your fine article. Thanks. Here’s a few notes from my particular scenario that may help others …

    1. After installing SQL 2012 SP1 CU2, I tried to run NGEN.EXE directly to get it to process the queue and remove the duplicate registry entries. This was failing because of a lack of system resources. To solve this, in the Services control panel I changed all the SQL Services from “Automatic” to “Manual” start up, and then rebooted the server. After the reboot I was able run
    %SystemRoot%\Microsoft.Net\Framework\v4.0.30319\ngen eqi
    and
    %SystemRoot%\Microsoft.Net\Framework64\v4.0.30319\ngen eqi
    to remove the duplicate registry entries. This took about 20 minutes to run.

    2. My server is “Windows Server 2008 Standard” – to get to a command prompt in order to compress the registry file I had to
    a) change the BIOS settings to enable boot from CD/DVD drive
    b) Mount the WIndows install media
    c) Boot from the DVD and choose the “Repair” option, then open the command prompt

    3. Having got to the command prompt in repair mode, when I tried to export the bloated registry hive using regedit, this was failing with an error message “There may be a disk or file system error”. To get round this I had to use a variation of the method posted by nibbler2, which is to export the hive in multiple steps and then rejoin it back together. I found that I needed to export/delete just the section of the hive where the duplicate entries had been, which in my case was “HKLM\Bloated\WOW6432Node\Microsoft\.NETFramework\v2.0.50727\NGENService”. Here’s the specific steps I followed. (I used d: drive for temporary storage, as I had plenty of space there.)

    * copy c:\windows\system32\config\software d:\software.bak
    * reg load HKLM\Bloated c:\windows\system32\config\software
    * Run regedit, and export HKLM\Bloated\WOW6432Node\Microsoft\.NETFramework\v2.0.50727\NGENService node as a hive file to d:\NGENSERVICE
    * Delete node HKLM\Bloated\WOW6432Node\Microsoft\.NETFramework\v2.0.50727\NGENService
    * Export HKLM\Bloated\ node as a hive file to d:\SOFTWARENEW
    * Exit regedit
    * reg unload HKLM\Bloated
    * reg load HKLM\SNEW d:\SOFTWARENEW
    * run regedit and create node HKLM\SNEW\WOW6432Node\Microsoft\.NETFramework\v2.0.50727\NGENService
    * Import d:\NGENSERVICE into the newly created node
    * Exit regedit
    * reg unload HKLM\SNEW
    * copy d:\SOFTWARENEW c:\windows\system32\config\SOFTWARE
    * Reboot the server

    Once everything was determined to be good, I re-enabled the SQL services back to “Automatic” startup.

    1. Nice to hear it all worked out for you in the end 🙂
      Just because I’m curious, what amount of resources did this server have memory and cpu wise?

      1. It was virtual server with 8GB RAM, 4 cpus, and the repair took place out of business hours when there was no workload on the server.

  4. Thanks so much for posting your experiences, it was invaluable to me! I had 3 affected servers but they did not crash and software hive was only mildly bloated. We have since rebuilt them (they are virtual servers, used for development, so it was relatively painless). Anyone have any idea why Microsoft took so long to post the fix?

    1. Thanks for sharing you experience as well 🙂
      Regarding the slow fix, from what I understand the bug was both difficult to reproduce in their lab due to the long “incubation period” and difficult to fix. I suspect they had to work around some exotic .net bug, but that is pure speculation at my end.

Leave a Reply