File server memory leak?

Problem

On two of our file servers (Windows 2008R2) we noticed an increase in memory usage over time. It would start out at, say, 1.5GiB after a boot, and then slowly work its way up to 6GiB, which was the amount of memory allocated to the server (VMware). Since this is a busy file server, hosting our user profiles for Citrix, we tried increasing the memory allocation to 8GiB. Sadly, the only effect was that it took longer after a reboot to reach 99% memory usage; after a day or two it would be back up. Further investigation revealed that performance was affected as well: backing up 800GiB took 18 hours, and once in a while the backup would simply give up. Testing also revealed that profile access was sometimes slow.


Analysis

Process Explorer is always my first resort in these cases; in fact, it was the reason we discovered the problem in the first place. The picture shows a “normal” system commit of 2.8GiB and a physical memory usage of 6.9GiB.

[Image: Process Explorer showing system commit and physical memory usage]

The only reason physical memory isn’t even higher is that we implemented a scheduled task limiting the file system cache working set (http://www.uwe-sieber.de/ntcacheset_e.html). This tool was actually written for Server 2003 and Vista, but it has a 64-bit version and does what it says, with the exception of making the limit permanent. We have a scheduled task running it at every boot, and thus we make sure that there is ample memory available for other system and user processes such as antivirus and backup. This improved performance somewhat and enabled backups to complete, if given a day or two, which bought me some time to identify the actual problem. Since it is a resource problem, blaming the virtualization environment is always an option. This didn’t bring the usual satisfaction, though: tiresome reading of VMware resource usage statistics and real-time monitoring of VM vitals didn’t reveal anything out of the ordinary.
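For the curious, ntcacheset presumably drives the Win32 SetSystemFileCacheSize API under the hood. The sketch below shows how such a hard cap could be applied directly; it is only an illustration, assuming an elevated Python on Windows, and the 512 MiB limit is a made-up example value, not the limit we actually run.

# Minimal sketch: put a hard upper bound on the system file cache working set
# via SetSystemFileCacheSize. Assumptions: run elevated on Windows; the 512 MiB
# cap below is purely illustrative.
import ctypes
from ctypes import wintypes

advapi32 = ctypes.WinDLL("advapi32", use_last_error=True)
kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)

TOKEN_ADJUST_PRIVILEGES = 0x0020
TOKEN_QUERY = 0x0008
SE_PRIVILEGE_ENABLED = 0x0002
FILE_CACHE_MAX_HARD_ENABLE = 0x0001  # enforce MaximumFileCacheSize as a hard limit
ERROR_NOT_ALL_ASSIGNED = 1300

class LUID(ctypes.Structure):
    _fields_ = [("LowPart", wintypes.DWORD), ("HighPart", wintypes.LONG)]

class LUID_AND_ATTRIBUTES(ctypes.Structure):
    _fields_ = [("Luid", LUID), ("Attributes", wintypes.DWORD)]

class TOKEN_PRIVILEGES(ctypes.Structure):
    _fields_ = [("PrivilegeCount", wintypes.DWORD),
                ("Privileges", LUID_AND_ATTRIBUTES * 1)]

def enable_increase_quota_privilege():
    # SetSystemFileCacheSize requires SeIncreaseQuotaPrivilege enabled in our token.
    token = wintypes.HANDLE()
    if not advapi32.OpenProcessToken(kernel32.GetCurrentProcess(),
                                     TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY,
                                     ctypes.byref(token)):
        raise ctypes.WinError(ctypes.get_last_error())
    luid = LUID()
    if not advapi32.LookupPrivilegeValueW(None, "SeIncreaseQuotaPrivilege",
                                          ctypes.byref(luid)):
        raise ctypes.WinError(ctypes.get_last_error())
    tp = TOKEN_PRIVILEGES(1, (LUID_AND_ATTRIBUTES * 1)(
        LUID_AND_ATTRIBUTES(luid, SE_PRIVILEGE_ENABLED)))
    if not advapi32.AdjustTokenPrivileges(token, False, ctypes.byref(tp), 0, None, None):
        raise ctypes.WinError(ctypes.get_last_error())
    if ctypes.get_last_error() == ERROR_NOT_ALL_ASSIGNED:  # typically: not elevated
        raise PermissionError("SeIncreaseQuotaPrivilege not held; run elevated")
    kernel32.CloseHandle(token)

def cap_file_system_cache(max_bytes):
    # Set a hard maximum (and a token 1 MiB minimum) on the file system cache working set.
    kernel32.SetSystemFileCacheSize.argtypes = [ctypes.c_size_t, ctypes.c_size_t,
                                                wintypes.DWORD]
    kernel32.SetSystemFileCacheSize.restype = wintypes.BOOL
    if not kernel32.SetSystemFileCacheSize(1 * 1024 * 1024, max_bytes,
                                           FILE_CACHE_MAX_HARD_ENABLE):
        raise ctypes.WinError(ctypes.get_last_error())

if __name__ == "__main__":
    enable_increase_quota_privilege()
    cap_file_system_cache(512 * 1024 * 1024)  # illustrative 512 MiB cap

Note that, just like ntcacheset, a cap set this way does not survive a reboot, hence the scheduled task.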

The system also runs a program that sets and updates user file quotas at regular intervals, according to a database. I tried disabling it and letting the server run without it for a couple of days, but nothing changed. I even considered running Process Monitor, but sifting through hundreds of file operations per second in search of the ones causing trouble was not tempting. Antivirus software is another typical culprit on file servers, but we have several other file servers, all running the same version of NOD32. Having previously replaced Trend because it was filling kernel memory (leading to an unresponsive system), I felt it was unlikely to be the cause. Still, I tried disabling it and running without it for a couple of hours, to no avail. Then I came across RamMap, another one of the excellent Sysinternals tools. Using it, I could get a better picture of the system’s memory usage:

[Image: RamMap memory overview showing Metafile usage]

This revealed that it was in fact not mapped file memory but Metafile that was exhausting the RAM. According to TechNet, Metafile memory is defined as follows:

“Metafile is part of the system cache and consists of NTFS metadata. NTFS metadata includes the MFT as well as the other various NTFS metadata files. In the MFT each file attribute record takes 1k and each file has at least one attribute record. Add to this the other NTFS metadata files and you can see why the Metafile category can grow quite large on servers with lots of files.”
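As a rough back-of-the-envelope check, the Metafile footprint therefore scales with file count rather than data volume. A minimal sketch, assuming only the one-or-more 1 KiB file records per file described above:

# Lower-bound estimate of MFT/Metafile size from file count, per the TechNet
# description above: every file owns at least one 1 KiB file record. Files with
# many attributes or heavy fragmentation use additional records, so real MFTs
# come out somewhat larger.
def estimate_mft_bytes(file_count, record_size=1024):
    return file_count * record_size

for files in (1_000_000, 8_000_000, 14_000_000):
    gib = estimate_mft_bytes(files) / 2**30
    print(f"{files:>12,} files -> at least {gib:.1f} GiB of file records")

A few million user-profile files are enough to push this into multi-gigabyte territory, which is exactly what the measurements below show.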

This prompted an analysis of the MFT on the affected servers. Using ntfsinfo, I discovered that the data volumes on the servers had MFT sizes of 7,960MB and 13,635MB respectively. If the OS expects to load the entire MFT into RAM, I guess 8GiB won’t be enough.

[Image: ntfsinfo output showing the MFT size of a data volume]

Further MFT analysis using Diskeeper shows the MFT as being 90% used, which means it is actually full of data. Running defrag -a -v [Drive]: gives the same data if you don’t have Diskeeper. On a “normal” server, having 1GiB allocated for Metafile should be more than enough; I ran RamMap on several other high-throughput servers, and the biggest value I found was 1.5GiB.

For background information about the MFT, see MSDN Definition of MFT and Master File Table zone reservation.

Solution (tentative)

I have scheduled downtime to add more memory to the servers, and I will update this post with the results. It would seem that 2008R2 file servers should have enough RAM to load the entire MFT of all connected drives into RAM, and then some. I would recommend at least 8GiB plus the combined MFT size of all local and SAN drives. And keep in mind that the MFT size is primarily influenced by the number of files, not the amount of data stored on the drive.
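To make the rule of thumb concrete, here is a minimal sizing sketch. The server names are made up; the MFT figures are the two measured with ntfsinfo above, and the 8GiB baseline is the recommendation from this post, not an official figure.

# Rule-of-thumb RAM sizing: a baseline for the OS, antivirus, backup and user
# processes, plus room to hold the combined MFT of every attached volume.
def recommended_ram_gib(mft_sizes_mb, baseline_gib=8):
    return baseline_gib + sum(mft_sizes_mb) / 1024

# Hypothetical server names; MFT sizes are the ones measured above.
for server, mft_mb in (("fileserver1", 7960), ("fileserver2", 13635)):
    print(f"{server}: at least {recommended_ram_gib([mft_mb]):.1f} GiB RAM")

For the two servers discussed here that works out to roughly 16GiB and 21GiB respectively, well above the 8GiB they were running on.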

Update 2011.11.23: Adding memory to the servers seems to have done the trick. The RamMap values look a lot better; there is even memory to spare:

[Image: RamMap after adding memory, showing ample free memory]

Author: DizzyBadger

SQL Server DBA, Cluster expert, Principal Analyst
