Be warned: This will be a long one with a lot of text and few images. I never planned on doing a write-up on this issue, so I did not take a lot of pictures.
I have been troubleshooting this issue on and off for two years, and I was on the brink of giving up several times. I pride myself on finding solutions where others only find stress and hair loss, and do so routinely, but sadly there are still nuts I cannot crack. This issue was believed to be such a nut. But I was wrong. The solution had been staring me straight in the eyes for quite some time, but we must not get ahead of ourselves. Let us start at the beginning.
Problem
SMB sessions are invalidated, such that it is impossible to reconnect. This happens only on Windows 10 clients. Windows 7 (and possibly Windows 8) clients running SMBv2.* can still reconnect as normal.
User story:
- The user opens a file explorer window and navigates to a folder on a fileserver containing documents the user wants to read and/or edit.
- This works without issue 100% of the time as long as the client computer has a network connection to the file server.
- After a period of inactivity the SMB session is suspended. The user does not detect this, everything is still ok.
- Some time later, the user will either
  - Try to save a file
  - Try to open a new file using the same File Explorer window
- Possible outcomes
  - Everything works as expected
  - It is impossible to save the file to the server, it has to be saved locally.
  - The File Explorer window is gone. The user has to re-open the window and navigate back to the folder in question.
- Thus, the user gets annoyed and complains about the stupid Windows 10 upgrade, which is understandable.
Relevant Event IDs: 30807 from SMBClient and 1016 from SMBServer.
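If you want to check whether your own machines log the same events, something like the sketch below could pull them out with the built-in wevtutil tool. The two channel names are my assumption of where these events usually land, so verify them in Event Viewer first. The helper names are made up for this example.

```python
import subprocess

# Assumed channel names. Verify them in Event Viewer on your own systems.
SMB_CLIENT_LOG = "Microsoft-Windows-SmbClient/Connectivity"
SMB_SERVER_LOG = "Microsoft-Windows-SMBServer/Operational"


def build_event_query(log_name, event_id, max_events=20):
    """Return a wevtutil command line that lists the newest matching events."""
    xpath = f"*[System[(EventID={event_id})]]"
    return [
        "wevtutil", "qe", log_name,
        f"/q:{xpath}",       # XPath filter on the event ID
        f"/c:{max_events}",  # maximum number of events to return
        "/rd:true",          # read newest events first
        "/f:text",           # human-readable output instead of XML
    ]


def query_events(log_name, event_id, max_events=20):
    """Run the query on a Windows machine and return the text output."""
    result = subprocess.run(
        build_event_query(log_name, event_id, max_events),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a client you would call query_events(SMB_CLIENT_LOG, 30807), and on the file server query_events(SMB_SERVER_LOG, 1016). Comparing the timestamps of the two gives you a feel for how often the sessions are torn down.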
Characters in our story
Fileserver1
- Currently running Windows 2016
- Created sometime around 2012
- Domain controller
- FSMO master
The astute reader may observe that the server is way older than Windows 2016. It has been through at least one in-place upgrade and a move from physical to virtual. It is old, but it is still chugging along and doing its job.
Windows 10 clients running SMB v3.1.1
Both laptops and VDI clients for remote work. Both types display similar symptoms.
Windows 7 clients running SMB v2.*
Old computers kept alive because they do not have this issue.
Fileserver2
Created for testing purposes, running Windows Server 2019.
Analysis
As the problem was isolated to SMBv3.1.1 clients running Windows 10, everything pointed to a client problem. The problem appeared to be random. It could happen after an hour, and it could go into hiding for days before it reared its ugly head again.
The issue did not happen simultaneously on multiple clients, cementing our belief that this was indeed a client issue. The server configuration had not been altered lately, with the exception of one thing: a change of antivirus agent from something else to F-Secure. Removal of the antivirus agent on both the server and some test clients did not make any difference though. A deep dive into server settings in the entire domain revealed a couple of issues that were rectified and tested:
- Physical network adapter issues on the Hyper-V hosts.
- The DNS server list on DHCP included a decommissioned DNS server.
- All servers were set to install updates and restart simultaneously, including the DCs.
- The AD-integrated DNS zones contained several references to decommissioned servers, so all entries were verified manually and corrected as needed. Yes, the entire zones.
All were rectified, and the situation did improve. But the issue did not disappear.
Settings that were changed:
- SMB protocol security signatures were turned off, no dice, so back on again.
- Network discovery was turned on, both client and server side.
- NetBios over TCP was disabled on the clients.
- The active Power plan was set to High Performance on the clients.
After all this back and forth, the number of errors that occurred when saving open documents was greatly reduced, but the File Explorer windows were still disappearing at random times. Still only on the Windows 10 clients. Extensive research did not reveal any solutions, but we kept digging. In the end, both Wireshark and ProcMon captures were successfully gathered from several events, but they did not reveal anything that the EventLog had not already told us:
When the client tries to reconnect, the server responds that the session ID is invalid. Thus, the File Explorer window closes. If you open a new File Explorer window, it connects immediately every time. In retrospect, the ProcMon trace gave the answer. But it was not the answer I was looking for, I was not even looking in the same neighborhood, so I went straight past it. But again, let us not get ahead of ourselves.
We eventually created a new VM to host a test share, Fileserver2. The hypothesis was that the long history of Fileserver1 had created issues we could not find, hidden somewhere in the registry. And that the reason the bug only affected Windows 10 was that it came with a revamped File Explorer. Several internet opinions referring to bugs in the W10 File Explorer were found. Not that it helped, as staying on W7 and downgrading all clients was not really an option.
After a week of testing Fileserver2, the conclusive answer was that it was just as bad as Fileserver1, and we were back at square one. Or maybe not…
The client had noticed that while all shared folders displayed the same problematic behavior, the home folders did not. I found this to be odd, but initially dismissed it as another peculiarity. But it kept nagging me. Then it hit me: Home folders are assigned through the Active Directory, but shared folders are not.
The network shares are mapped through a GPO. A GPO that has not been changed for four years or more, at least not the drive mapping part. Could the GPO have been damaged at some point? It was probably created when the domain was still running in W2008R2 mode. Then another realization hits me like a freight train: Back in 2020, when I was running a ProcMon trace, a lot of messages about Group Policy related activity appeared in the trace. Some more research revealed a maddening scenario:
Sometime before the release of W10, something was changed about how drive mapping GPOs are applied on the client. Previously, such mappings were only updated at login. Thus, it was customary to set the mapping action to “Replace” to make sure that any old or user-created mappings were removed and replaced by the “proper” mappings as defined in the GPO. Which was fine as long as this only happened at login. Which still is the recommended way to do it if the internet is to be believed. It is just that GPO drive mappings do not work like that anymore. Especially on GPOs that have been made without the Administrative Templates for W10 installed.
W10 clients will try to re-apply the GPO while the system is running, through background GPO refreshes that happen at certain intervals while the user is logged in. According to https://docs.microsoft.com/en-us/previous-versions/windows/desktop/Policy/background-refresh-of-group-policy this happens every 90-120 minutes, but not always for every GPO object. The document is sort of fuzzy, and the fact that our issue may appear at random intervals of up to 48 hours during testing does not make it less fuzzy. Further research pointed me in the direction of this forum post: https://social.technet.microsoft.com/Forums/azure/en-US/06c53d39-4807-4c5c-b37b-b0f39e4bf79d/group-policy-user-drive-mapping-is-set-to-update-how-to-disconnect-while-keeping-update-setting?forum=winserverGP and a comment by user “Appleoddity” towards the bottom.
In short, the default processing mode for drive mappings was changed to allow background refreshes at some point. Some say in W8.1, some say in W10. This was not(?) as intended, but it was not fixed. Instead, the GPO profiles for W10 got a new setting enabling you to change the behavior back to the old way. I have not been able to find an official statement about what happened from Microsoft, but at this point I no longer care. I have at long last found a solution that will be chronicled below.
What happens is this: If the drive mapping is set to the recommended “Replace” action, the drive mapping is deleted and re-created no matter what when the GPO background process runs drive mapping GPOs. Whenever that is. This invalidates any existing sessions, even if they are to the exact same fileshare. Thus, File Explorer and any other processes that have an open handle cannot re-use it and have to create a new one to the new fileshare connection.
I realize that this analysis can be difficult to follow and is lacking in references and pictures. This is partly because of the time that has passed, and partly because I am writing a lot of this from memory. Therefore I add a TLDR before we move on to the solution. To entice everyone to read the TLDR, it will contain information not mentioned above.
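To make the mechanism a bit more tangible, here is a toy model in Python. It is not how Windows implements anything. It only mirrors the behavior described above: Replace always tears down and re-creates the mapping, while Update leaves an existing mapping alone. The class, the function and the share name are invented for the illustration.

```python
import itertools

# Each new mapping gets a fresh connection ID, standing in for a new SMB session.
_connection_ids = itertools.count(1)


class MappedDrive:
    """Toy model of one mapped drive letter on a client."""

    def __init__(self, letter, path):
        self.letter = letter
        self.path = path
        self.connection_id = next(_connection_ids)

    def open_handle(self):
        """An application remembers which connection its handle belongs to."""
        return self.connection_id

    def handle_is_valid(self, handle):
        """A handle only works against the connection it was opened on."""
        return handle == self.connection_id


def apply_mapping(drive, wanted_path, action):
    """Apply one drive mapping item during a (background) GPO refresh."""
    if action == "Replace":
        # Delete and re-create no matter what: always a new connection,
        # even when the target share is exactly the same.
        return MappedDrive(drive.letter, wanted_path)
    if action == "Update":
        # Simplification based on the behavior described in this post:
        # an existing mapping is left in place.
        return drive
    raise ValueError(f"unknown action: {action}")
```

Open a handle on a drive, run apply_mapping with "Replace" against the same path, and the old handle is no longer valid on the returned drive. Do the same with "Update" and it still is. That is the whole bug in two function calls.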
Analysis TLDR
- The problem is not SMB version related, it is a GPO problem.
- Windows 7 will only process drive mappings from GPOs at login and startup.
- Windows 10 will also process drive mappings in the background at random intervals while you are logged in.
- I do not have any W8 clients to test, so I do not know what they do at GPO background refresh time. Some say that W8 performs like W7, some say the snafu from W10 was backported to W8.1.
- If the drive mapping action is set to Replace, the drive mapping is disconnected and reconnected, thus invalidating all SMB sessions related to the client.
- If you change the drive mapping action to Update, the problem disappears, but if you need to change the mapping later, you will have to delete it before you add the new one.
- If you change the “Make my drive mapping GPO work on W10” setting to “Yes”, the problem goes away and the GPO works as intended. The default setting is “No”. Details below.
Solution
The usual warnings: If you do not understand the solution, seek help. The size of the problem is directly or logarithmically linked to the number of clients you affect if it does not work as intended. Knowledge of GPO drive mapping is assumed, and those details are not explained.
Prerequisite: You need to update your GPO Administrative Templates. See https://docs.microsoft.com/en-us/troubleshoot/windows-client/group-policy/create-and-manage-central-store for the official guide. You may also need to run a certain domain and forest functional level. I have only tested this on W2016.
You may choose to go the way of changing your drive mapping actions to Update. That path is not detailed below, but should be simple enough.
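Whichever path you choose, it helps to know which mappings actually use Replace. The sketch below assumes that your drive mappings are Group Policy Preferences items stored as a Drives.xml file under the GPO folder in SYSVOL (typically User\Preferences\Drives\Drives.xml), and that each Properties element carries an action attribute with the codes C, R, U or D. Check one of your own files before trusting it. The function names are mine.

```python
import xml.etree.ElementTree as ET

# Assumed single-letter action codes used in Drives.xml.
ACTION_NAMES = {"C": "Create", "R": "Replace", "U": "Update", "D": "Delete"}


def list_drive_actions(drives_xml):
    """Return (letter, path, action) for each mapping in a Drives.xml document."""
    root = ET.fromstring(drives_xml)
    found = []
    for drive in root.iter("Drive"):
        props = drive.find("Properties")
        if props is None:
            continue
        code = props.get("action")
        # Keep unknown or missing codes visible instead of guessing a default.
        action = ACTION_NAMES.get(code, code or "unspecified")
        found.append((props.get("letter", ""), props.get("path", ""), action))
    return found


def replace_mappings(drives_xml):
    """Return only the mappings that use the Replace action."""
    return [m for m in list_drive_actions(drives_xml) if m[2] == "Replace"]
```

Read the file in binary mode, for example open(path, "rb").read(), and pass the result to replace_mappings. Anything it returns is a mapping that gets torn down on every background refresh.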
Make my drive mapping GPO work on Windows 10
- Make sure you have control of the prerequisites mentioned above.
- Backups rock and testing is king.
- Open Group Policy Manager.
- Open your Drive Mapping GPO in editing mode.
- Navigate to Computer Configuration\Policies\Administrative Templates\System\Group Policy
- Find the “Configure Drive Maps preference extension policy processing” setting.
- Enable it
- Check the “Do not apply during periodic background processing” option. The values of the other settings do not interfere with this one, so leave them as is.
- Save (close) the GPO.
- Wait for a domain GPO sync to occur, usually around 15 minutes.
- Restart the clients.
- Restart the clients again to make sure the new setting is active. Or you can just wait a couple of days if you enforce daily restarts.
- The setting should now be applied. If not, run gpupdate /force on the clients.
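If you would rather verify than hope, the sketch below reads the policy values on a client. It assumes the setting ends up under HKLM\Software\Policies\Microsoft\Windows\Group Policy, in a subkey named after the Drive Maps client-side extension GUID, and that the checkbox is stored as a NoBackgroundPolicy value of 1. Both are assumptions on my part, so compare with a client where you know the policy is applied.

```python
# Assumed location of the Drive Maps policy processing settings on a client.
DRIVE_MAPS_POLICY_KEY = (
    r"Software\Policies\Microsoft\Windows\Group Policy"
    r"\{5794DAFD-BE60-433f-88A2-1A31939AC01F}"
)


def background_refresh_disabled(values):
    """True when the policy values say drive maps skip background refresh."""
    # Assumed value name for the "do not apply in background" checkbox.
    return values.get("NoBackgroundPolicy") == 1


def read_policy_values():
    """Read all values under the policy key. Returns {} if the key is absent."""
    import winreg  # Windows only, so it is imported lazily

    try:
        key = winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, DRIVE_MAPS_POLICY_KEY)
    except FileNotFoundError:
        return {}
    values = {}
    with key:
        index = 0
        while True:
            try:
                name, data, _value_type = winreg.EnumValue(key, index)
            except OSError:
                break  # no more values under this key
            values[name] = data
            index += 1
    return values
```

On a client, background_refresh_disabled(read_policy_values()) should return True once the GPO has landed. If it returns False after the restarts, the list of things to consider further down is where I would start.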
A note about GPO troubleshooting
In most cases the setting will be applied after the first restart, and even a simple logout/login cycle may be enough. But sometimes it is not. When I am called in to troubleshoot GPO settings, very often the problem is related to not waiting long enough before giving up and thinking that the change you made didn’t solve the problem. The larger the environment, the longer you may have to wait. And do not forget that your GPO structure or Domain Controller sync may be pooched. You will not necessarily notice this until you try to make a change.
Below is a list of things to consider when a GPO change does not go your way.
- The change has to be synced to the Domain Controller your test client is talking to. This may not be the DC you think it is talking to.
- Synchronization between the domain controllers has to work properly.
- All your domain controllers have to work properly.
- Any pesky firewalls in between your Domain Controllers and clients have to be configured correctly. This is not trivial, proper domain functionality requires a lot of ports.
- If there is a conflicting GPO, yours needs to be dominant.
- Your test client needs to be affected by the GPO.
- The change has to be applied correctly. Some GPO changes are pushed but do not change the behavior of the client until the next restart. Thus the two restarts above. One to make sure the setting is applied, and another to make sure it is activated. You can test your specific environment to figure out what is needed, but in my experience a shotgun approach generates less fuss when waiting 15 minutes and running gpupdate /force does not help.
GPO management is a large topic, and in poorly managed complex environments with hidden errors and conflicting GPOs it may require considerable effort just to add one new setting. Like weeks of work.
I wish you the best of luck if you too are a victim of this nastiness.