MIM 2016 Lab series

Introduction

 

These are the notes from my adventures installing a MIM 2016 SP1 lab/dev environment. I am using https://docs.microsoft.com/en-us/microsoft-identity-manager/microsoft-identity-manager-deploy as a guide, so make sure you read it as well. A certain proficiency with Windows 2016 and AD is a prerequisite to understand this series. All servers in this lab are running Windows 2016 Datacenter with GUI, and I am using SQL Server 2016. Thus, I deviate from the guide mentioned above which is based on Win 2012 R2 and SQL 2014.

 

My plan is to use three VMs:

  • DC01, the domain controller.
  • IM01: SQL Server, MIM Sync, MIM Service.
  • IM02: Exchange and other stuff that should not be co-located with the software on IM01. I considered naming it Exchange, but there may come a time when I will install other stuff on it related to MIM.

I am creating a separate domain in a new forest for this lab, aptly named mim.local, as I am going to test multi-forest MIM deployments. This will certainly add some spice to the process…

Be aware, this is a lab. Do not use this setup as-is in production. I have tried to add remarks for what I would do different in production along the way.

Chapters

The will be a series, and new chapters will be added as I get time. The chapters are listed below.

Event 20501 Hyper-V-VMMS

Problem

The following event is logged non-stop in the Hyper-V High Availability log:

Log Name:      Microsoft-Windows-Hyper-V-High-Availability-Admin
Source:        Microsoft-Windows-Hyper-V-VMMS
Date:          27.07.2017 12.59.35
Event ID:      20501
Task Category: None
Level:         Warning
Description:
Failed to register cluster name in the local user groups: A new member could not be added to a local group because the member has the wrong account type. (0x8007056C). Hyper-V will retry the operation.

image

Analysis

I got this in as an error report on a new Windows Server 2016 Hyper-V cluster that I had not built myself. I ran a full cluster validation report, and it returned this warning:

Validating network name resource Name: [Cluster resource] for Active Directory issues.

Validating create computer object permissions was not run because the Cluster network name is part of the local Administrators group. Ensure that the Cluster Network name has “Create Computer Object” permissions.

I then checked AD, and found that the cluster object did in fact have the Create Computer Object permissions mentioned in the message.

The event log error refers to the cluster computer object being a member of the local admins group. I checked, and found that it was the case. The nodes themselves were also added as local admins on all cluster nodes. That is, the computer objects for node 1, 2 and so on was a member of the local admins group on all nodes. My records show that this practice was necessary when using SOFS storage in 2012. It is not necessary for Hyper-V clusters using FC-based shared storage.

The permissions needed to create a cluster in AD

  • Local admin on all the cluster nodes
  • Create computer objects on the Computers container, the default container for new computers in AD. This could be changes, in which case you need permissions in the new container.
  • Read all properties permissions in the Computers container.
  • If you specify a specific OU for the cluster object, you need permissions in this OU in addition to the new objects container.
  • If your nodes are located in a specific OU, and not the Computers OU, you will also need permissions in the specific OU as the cluster object will be created in the OU where the nodes reside.

See Grant create computer object permissions to the cluster for more details.

Solution

As usual, a warning: If you do not understand these tasks and their possible ramifications, seek help from someone that does before you continue.

Solution 1, low impact

If it is difficult to destroy the cluster as it requires the VMs to be removed from the cluster temporarily, you can try this method. We do not know if there are other detrimental effects caused by not having the proper permissions when creating the cluster.

  • Remove the cluster object from the local admin on all cluster nodes.
  • Remove the cluster nodes from the local admin group on all nodes.
  • Make sure that the cluster object has create computer objects permissions on the OU in which the cluster object and nodes are located
  • Make sure that the cluster object and the cluster node computer objects are all located in the same OU.
  • Validate the cluster and make sure that it is all green.

Solution 2, high impact

Shotgun approach, removes any collateral damage from failed attempts at fixing the problem.

  • Migrate any VMs away from the cluster
  • Remove the cluster from VMM if it is a member.
  • Remove the “Create computer objects” permissions for the cluster object
  • Destroy the cluster.
  • Delete the cluster object from AD
  • Re-create the cluster with the same name and IP, using a domain admin account.
  • Add create computer objects and read all properties permissions to the new cluster object in the current OU. 
  • Validate the cluster and make sure it is all green.
  • Add the server to VMM if necessary.
  • Migrate the VMs back.

sql_text returns *password—–

Problem

While hunting for performance bottlenecks using Spotlight some months ago I came across some strange statements listed only as *password—– (the – continues for a couple of pages). This was new to me, and even a search on the pesky trackers-with-integrated-search came up blank. We passed it off to the developers and never heard back, and I forgot about the whole thing. Then, during a stress test in QA it popped up on the radar again, and this time I decided to dig into it myself.

Analysis

Being a good boy, I was using extended events to trace down the problem. I do not know why, but profiler still feels more at home. Anyhow, I set up a session triggering on sql_text like “%*password—%. Looking something like this:

CREATE EVENT SESSION [pwdstar] ON SERVER
ADD EVENT sqlserver.rpc_starting( ACTION(sqlserver.client_app_name,sqlserver.client_connection_id,sqlserver.client_hostname,sqlserver.client_pid,sqlserver.database_name,sqlserver.session_id,sqlserver.sql_text,sqlserver.tsql_stack,sqlserver.username)
WHERE ([package0].[greater_than_uint64]([sqlserver].[database_id],(4)) AND [package0].[equal_boolean]([sqlserver].[is_system],(0)) AND ([sqlserver].[like_i_sql_unicode_string]([sqlserver].[sql_text],N'%*password--%') AND [sqlserver].[equal_i_sql_unicode_string]([sqlserver].[database_name],N'DBNAME'))))
ADD TARGET package0.event_file(SET filename=N'pwdstar',max_file_size=(200))
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=ON,STARTUP_STATE=OFF
GO

I came back an hour later to look at the results, and was surprised to find that the tsql_stack returned a more normal looking statement:

image

Back to the trackers I went, this time trying to figure out why sql_text differed from statement. This produced far better results, and I came across this article: https://www.sqlskills.com/blogs/jonathan/understanding-the-sql_text-action-in-extended-events/, and following the links produced the following statement from Mr. Keheyias:

· If the sql_text Action returns:

<action name="sql_text" package="sqlserver">
<type name="unicode_string" package="package0" >
<value>*password------------------------------</value>
<text></text>
</type></action>

the text was stripped by the engine because it contained a password reference in it.  This functionality occurs in SQL Trace as well and is a security measure to prevent passwords from being returned in the trace results.

As I said, this was news to me, but the function seems to be an old one not necessarily related to xEvents. Extended Events does however use a more aggressive masking algorithm than what is used in profiler from what I can tell.

Solution

If you come across such queries in your profiler or xEvents results, you are dealing with something SQL server believes is password related. The results are thus hidden. The easiest way to get the statement is capturing tsql_stack as well. In SQL 2016 you can capture the statement directly:

image

Power profile drift

Problem

I received an alarm from one of my SQL Servers about IO stall time measured in seconds and went to investigate. We have had trouble with HBA Firmware causing FC stalls previously, so I suspected another storage error. The server in question was running virtual FC, and a cascading error among the other servers on the same host seemed to confirm my initial hypothesis about a HBA problem on the host.

Analysis

The kernel mode CPU time on the host was high (the red part of the graph in Process Explorer), something that is also a pointer in the direction of storage problems. The storage minions found no issue on the SAN though. Yet another pointer towards a problem on the server itself. We restarted it twice, and the situation seemed to normalize. It was all written off as collateral damage from a VMWare fault that flooded the SAN with invalid packet some time ago. I moved one of the VMs back and let it simmer overnight. I felt overly cautious not moving them all back, but the next morning the test VM was running 80% PCU without getting anything done, and the CPU load on the host was  about 50%, running a 3 cpu vm on a 2×12 core host…

25C33835

I failed the test vm back to the spare host, and the load on the VM went down immediately:

image

At this point I was ready to take a trip to the room of servers and visit the host in person, and I was already planning a complete re-imaging of the node in my head. But then I decided to run CPU-Z first, and suddenly it all became clear.

image

 

The host is equipped with Intel Xeon E5-2690 v3 CPUs. Intel Ark informs me that the base clock is indeed 2,6GHz as reported by CPU-Z, and the turbo frequency is as high as 3,5GHz. A core speed of 1195MHz as shown in CPU-Z is usually an indicator of one of two things. Either someone has fiddled with the power saving settings, or there is something seriously wrong with the hardware.

A quick check of the power profile revealed that the server was running in the so called “balanced” mode, a mode that should be called “run-around-in-circles-and-do-nothing-mode” on servers. The question then becomes, why did this setting change?

image

My server setup checklist clearly states that server should run in High performance mode. And I had installed this particular server myself, so I know it was set correctly. The culprit was found to be a firmware upgrade installed some months back. It had the effect of resetting the power profile both in the BIOS and in Windows to the default setting. There was even a change to the fan profile, causing the server to become very hot. The server in question is a HP ProLiant DL380 Gen 9, and the ROM version is P89 v2.30 (09/13/2016).

Solution

  • First you should change the power profile to High performance in the control panel. This change requires a reboot.
  • While you are rebooting the server, enter the BIOS settings and check the power profile. I recommend Maximum Performance mode for production servers.
  • image
  • Then, check the Fan Profile
  • image
  • Try increased cooling. If your servers still get exceedingly hot, there is a maximum cooling mode, but this basically runs all the fans at maximum all the time.

This is how CPU-Z looks after the change:

image

And the core utilization on the host, this time with 8 active SQL Server VMs:

image

Primary replica is not joined to the Availability group, or clusters past morphing into clusters present

Problem

I was upgrading an Availability group from SQL 2012 on Win 2012R2 to SQL 2016 on Win2016. I had expected to create the new AOAG as a separate cluster and move the data manually, but the users are always complaining when I want to use my allotted downtime quotas, so I decided to try a rolling upgrade instead. This post is a journal of some of the perils I encountered along the way, and how I overcame them. There were countless others, but most of them were related to crappy hardware, wrong hardware being delivered, missing LUNS on the SAN, delusional people who believe they can lock out DBAs from supporting systems, dragons, angry badgers, solar flares and whichever politician you dislike the most. Anyways, on with the tale of clusters past morphing into clusters present…

I started with adding the new node to the failover cluster. This went surprisingly well, in spite of the old servers being at least two generations older than my new rack servers. Sadly, both the new and the old servers are made by the evil wizards behind the silver slanted E due to factors outside of my control. But I digress. The cluster join went flawlessly. There was some yellow complaints about the nodes not having the same OS version in the cluster validation scroll, but everything worked.

Then came adding the new server as a replica in the availability group. This is done from the primary replica, and I just uttered a previously prepared spell from the book of disaster recovery belonging to this cluster, adding the name of the new node. As far as I can remember this is just the result of the standard “Add replica” wizard. The spell ran without complaints, and my new node was online.

This is the point where it all went to heck in a small hand-basket carried by an angry badger. I noticed a yellow warning next to the new node in the AOAG dashboard. But as the databases were all in the synchronizing state on the new replica, I believed this to be a note complaining about the OS-version. I was wrong. In my ignorance, I failed over to the new node and had the application  team minions run some tests. They came back positive, so I removed the old nodes in preparation for adding the last one. I even ran the Update-ClusterFunctionalLevel Powershell command without issues. But the warning persisted. This is the contents of the warning:

Availability replica not joined.

SNAGHTML57bbb24f

And it was no longer a lone warning, the AOAG dashboard did not look pretty as both the old nodes refused to accept the new node as their new primary replica.

Analysis

As far as I can tell, the join AOAG script failed in some way. It did not report any errors, but still, there is no doubt that something did go wrong.

The solution as reported by MSDN is simple, just join the availability group by casting the “alter availability group groupname join” spell from the secondary replica that is not joined. The attentive reader has probably already realized that this is the primary replica, and as you probably suspect, the aforementioned command fails.

Casting the following spell lists the replicas and their join state: “select join_state, join_state_desc from sys.dm_hadr_availability_replica_cluster_states”. This is the result:

image

In some way I have put the node in an invalid state. It still works perfectly, but I guess there is only a question about when, not if this issue is about to grow into a bigger problem.

Solution

With such an elaborate backstory, you would not be wrong to expect an equally elaborate solution. Whether or not it is, is really in the eye of the beholder.

Just the usual note of warning first: If you are new to availability groups, and all this cluster stuff sounds like the dark magic it is, I would highly suggest that you do not try to travel down the same path as me. Rather, you should turn around at the entrance and run as fast as you can into the relative safety of creating another cluster alongside the old one. Then migrate the data by backing up on the old cluster and restoring on the new cluster. And if backups and restores on availability groups sounds equally scary, then ask yourself whether or not you are ready to run AOAG in production. In contrast to what is often said in marketing materials and at conferences, AOAG is difficult and scary to the beginner. But there are lots of nice training resources out there, even some free ones.

Now, with the warnings out of the way, here is what ended up working for me. I tried a lot of different solutions, but I was bound by the following limitation: The service has to be online. That translates to no reboots, no AOAG-destroy and recreate, no cluster rebuilds and so on. A combination of which would probably have solved the problem in less than an hour of downtime. But I was allowed none, so this is what I did:

  • Remove any remaining nodes and replicas that are not Win2016 SQL2016.
  • Run the Powershell command Update-ClusterFunctionalLevel to make sure that the cluster is running in Win2016 mode.
  • Build another Win 2016 SQL 2016 node
  • Join the new node to the cluster
  • Make sure that the cluster validation scroll seems reasonable. This is a fluffy point I know, but there are way to many variables to make an exhaustive list. https://lokna.no/?p=1687 mentions some of the issues you may encounter.
  • Join the new node to the availability group as a secondary replica.
  • Fail the availability group over to the new node (make sure you are in synchronous commit mode for this step).
  • Everything is OK.

image

  • Fail back to the first node
  • Change back to asynchronous commit (if that is you default mode, otherwise leave it as synchronous).

 

Thus I have successfully upgraded a 2-node AOAG cluster from Win2012R2 and SQL 2012 to Win2016 and SQL 2016 with three failovers as the only downtime. In QA. Production may become an interesting journey, IF the change request is approved. There may be an update if I survive the process…

 

Update and final notes

I have now been through the same process in production, with similar results. I do not recommend doing this in production, the normal migration to a new cluster is far preferable, especially when you are crossing 2 SQL Server versions on the way. Then again, if the reduced downtime is worth the risk…

Be aware that a failover to a new node is a one way process. Once the SQL 2016 node becomes the primary replica, the database is updated to the latest file format, currently 852 whereas SQL 2012 is 706. And as far as I can tell from the log there is a significant number of upgrades to be made. See http://sqlserverbuilds.blogspot.no/2014/01/sql-server-internal-database-versions.html for a list of version numbers.

image

Microsoft Update with PSWindowsUpdate

Preface

Most of my Windows servers are patched by WSUS, SCCM or a similar automated patch management solution at regular intervals. But not all. Some servers are just too important to be autopatched. This is a combination of SLA requirements making downtime difficult to schedule and the sheer impact of a botched patch run on backend servers. Thus, a more hands-on approach is needed. In W2012R2 and far back this was easily achieved by running the manual Windows Update application. I ran through the process in QA, let it simmer for a while and went on to repeat the process in production if no nefarious effects were found during testing. Some systems even have three or more staging levels. It is a very manual process, but it works, and as we are required to hand-hold the servers during the update anyway, it does not really cost anything. Then along came Windows Server 2016. Or Windows 10 I should really say, as the Update-module in W2016 is carbon copied from W10 without changes. It is even trying to convince me to install W10 Creators update on my servers…

clip_image001

In Windows Server 2016 the lazy bastards at Microsoft just could not be bothered to implement the functionality from W2012R2 WU. It is no longer possible to defer specific updates I do not want, such as the stupid Silverlight mess. If I want Microsoft update, then I have to take it all. And if I should become slightly insane and suddenly decide I want driver updates from WU, the only way to do that is to go through device manager and check every single device for updates. Or install WUMT, a shady custom WU client of unknown origin.

I could of course use WSUS or SCCM to push just the updates I want, but then I have to magically imagine what updates each server wants and add them to an ever growing number of target groups. Every time I have a patch run. Now that is expensive. If I had enough of the “special needs” servers to justify the manpower-cost, I would have done so long ago. Thus, another solution was needed…

PSWindowsUpdate to the rescue. PSWindUpdate is a Powershell module written by a user called MichalGajda on the technet gallery enabling management of Windows Update through Powershell. In this post I go through how to install the module and use it to run Microsoft Update in a way that resembles the functionality from W2012R2. You could tell the module to install a certain list of updates, but I found it easier to hide the unwanted updates. It also ensures that they are not added by mistake with the next round of patches.

Getting started

(See the following chapters for details.)

  • You should of course start by installing the module. This should be a one-time deal, unless a new version has been released since last time you used it. New versions of the module should of course be tested in QA like any other software.
  • Then, make sure that Microsoft Update is active.
  • Check for updates to get a list of available patches.
  • Hide any unwanted patches
  • Install the updates
  • Re-check for updates to make sure there are no “round-two” patches to install.

Continue reading “Microsoft Update with PSWindowsUpdate”

More about password managers and security

In response to comments on my post about 1Password and cloud security, this is an update about other password managers and my way out. It is highly recommended that you read the previous post first to understand where I am going with this. I was looking for a password manager for a specific purpose, that is cloud sync ability with something that at least looks like true encryption without back doors, and my comments are written from that mindset.

Why I would not use Dropbox for my passwords

Or Google drive. Or Onedrive for that matter.

And note the “for my passwords” part of the sentence. It is not that I do not use such services, but I consider all of them as an insecure location. They are just to juicy a target for wrongdoers, both intelligence agencies and the other kind of cyber criminals.

About “Free” cloud services

Such as Facebook, or maybe Lastpass is a better example in this case. There is no free lunch. You are either a customer or a product. You are either the farmer or the pig. Pigs have no rights to privacy. And before you consider using any google-based service, read this: Street View cars slurping wi-fi. This case complex was for me the first warning that something was rotten in the house of google. And it has not become any better since. I guess the “Don’t be evil” company motto should have been a warning as well…

As such I would never install a password manager on an Android or IOS-based phone that contains passwords for other systems. They are way to easy to hack.

And just as a note, LastPass and LogMeOnce was not considered due to their lack of a desktop client.

KeePass

I have used KeePass in the past, but it is mostly a local solution only, and is therefore out of scope. At least initially. I am also slightly worried about the fact that it is free, as in free of charge. I have nothing against open source software, but for some needs I prefer to have access to someone to complain to if it all goes wrong. Someone who is paid to listen to my ails and complaints. That being said, KeePass is a nice product for local password management.

Keeper

Had a brief look at it, discovered it was dependent on java, uninstalled immediately. Also, at revisiting it seems to be “moving to the cloud”.

DashLane

A solid security policy. I ran the demo for some time, but the “Modern” UI was horrible, if not as horrible as the 1Password 6 beta. I wanted to like it, but after continuously having to click buttons two or three times for them to do anything, I gave up. I may re-test it in the future. This is sadly a complaint I have about most “Modern” UI applications, they do not respond to mouse clicks consistently. I also could not get the browser plugin to work properly in Vivaldi.

StickyPassword

This has been the best contender so far. I got as far as testing the synchronization, and I used it for the full trial period without a rage-uninstall. What actually stopped me from going for it in the end was it’s login dialog constantly popping up when I wasn’t trying to use it. It became a nuisance. I also did not like that it wants to run all the time, instead of when I actually want to use it. The chance of someone snagging access to it while walking by if I forgot to lock my screen is of course a danger, but primarily I stopped using it because it became irksome.

What I ended up doing

I stuck with what I had, a simple local-only password manager not to be named. Because the password manager itself is not important, as long as you have control of the data locally.  It is how you control the synchronization of data that is important. And as I do not really trust any of the “public cloud” alternatives, I decided to make my own. I installed Resilio Sync, a file synchronization application based on the BitTorrent protocol, and used it to keep my encrypted password store in sync across my computers.

This allows me to keep the data in sync and, to a certain degree, actually know where my data is physically located. It could still be hacked or intercepted of course, but that had to be a much more directed attack than the usual “lets  archive everything that was ever stored on Dropbox in case we need it some time” behavior we have come to expect from the people who are supposedly working to keep the world “safe”. I may come across as rather paranoid in this post, but such are the times.

No Microsoft Update

Problem

I was preparing to roll out SQL Server 2016 and Windows Server 2016 and had deployed the first server in  production. I suddenly noticed that even if I selected “Check online for updates from Microsoft Update” in the horrible new update dialog, I never got any of the additional updates. Btw, this link/button only appears when you have an internal SCCM or WSUS server configured. Clicking the normal Check For Updates button will get updates from WSUS.

image

 

Analysis

This was working as expected in the lab, but the lab does not have the fancy System Center Configuration Manager and WSUS systems. So of course I blamed SCCM and uninstalled the agent. But to no avail, still no updates. I lurked around the update dialog and found that the “Give me updates for other Microsoft products..” option was grayed out and disabled. I am sure that I checked this box during installation, as I remember looking for its location. But it was no longer selected, it was even grayed out.

image

This smells of GPOs. But I also remembered trying to get this option checked by a GPO to save time during installation, and that it was not possible to do so in Win2012R2. Into the Group Policy Manager of the lab DC I went…

It appears that GPO management of the Microsoft Update option has been added in Win2016:

image

This option is not available in Win2012R2, but as we have a GPO that defines “Configure Automatic Updates”, it defaults to disabled.

solution

Alternative 1: Upgrade your domain controllers to Win2016.

Alternative 2: Install the Win2016 .admx files on all your domain controllers and administrative workstations.

Then, change the GPO ensuring that “Install updates for other Microsoft products is enabled. Selecting 3 – Auto download used to be a safe setting.

Alternative 3: Remove the GPO or set “Configure Automatic Updates” to “Not Configured”, thus allowing local configuration.

Cluster Quorum witness

Introduction

Since W2012R2 it is recommended that all clusters have a quorum witness regardless of the number of cluster nodes. As you may know, the purpose of the cluster witness is to ensure a majority vote in the cluster. If you have 2 nodes with one vote each and add a cluster witness you create a possibility for a majority vote. If you have 3 nodes on the other hand, adding a witness will remove the majority vote as you have 4 votes total and a possible stalemate.

If as stalemate occurs, the cluster nodes may revolt and you are unable to get it working without a force quorum, or you could take a node out behind the barn and end its misery. Not a nice situation at all. W2012R2 solves this predicament by dynamic vote assignments. As long as a quorum has been established, if votes disappear due to nodes going offline, it will turn the witness vote on and off to make sure that you always have a possibility for node majority. As long as you HAVE a disk witness that is.

There are three types of disk witnesses:

  • A SAN-connected shared witness disk, usually FC or iSCSI. Recommended for clusters that use shared SAN-based cluster disks for other purposes, otherwise not recommended. If this sounds like gibberish to you, you should use another type of witness.
  • A File share witness. Just a file share. Any type of file share would do, as long as it resides on a Windows server in the same domain as the cluster nodes. SOFS shares are recommended, but not necessary. DO NOT build a SOFS cluster for this purpose alone. You could create a VM for cluster witnesses, as each cluster witness is only about 5MiB, but it is best to find an existing physical server with a high uptime requirement in the same security zone as the cluster and create some normal SMB-shares there. I recommend a physical server because a lot of virtual servers are Hyper-V based, and having the disk witness on a vm in the cluster it is a witness for is obviously a bad idea.
  • Cloud Witness. New in W2016. If you have an Azure storage account and are able to allow the cluster nodes a connection to Azure, this is a good alternative. Especially for stretch clusters that are split between different rooms.

How to set up a simple SMB File share witness

  • Select a server to host the witness, or create one if necessary.
  • Create a folder somewhere on the server and give it a name that denotes its purpose:
  • image
  • Open the Advanced Sharing dialog
  • image
  • Enable sharing and change the permissions. Make sure that everyone is removed, and add the cluster computer object. Give the cluster computer object full control permissions
  • image
  • Open Failover Cluster manager and connect to the cluster
  • Select “Configure Cluster Quorum Settings:
  • image
  • Chose Select The Quorum Witness
    image

  • Select File Share Witness

  • image

  • Enter the path to the files share as \\server\share

  • image

  • Finish the wizard

  • Make sure the cluster witness is online:

  • image

  • Done!

IP Address conflict on 0.0.0.0

Problem

Sometimes when I restart one of my Windows 10 computers the network never gets online. I have to disable/enable the network to get it back. the reason seems to be an IP conflict with the address 0.0.0.0. This computer has a fixed IP, no DHCP is involved. The NIC is an Intel I219-V.

Analysis

SNAGHTML2393ee4f

Event ID 4199, TCPIP: The system detected an address conflict for IP address 0.0.0.0 with the system having network hardware address 20-4C-9E-49-38-8A.

A quick check in the tracking system revealed this article from Cisco: http://www.cisco.com/c/en/us/support/docs/ios-nx-os-software/8021x/116529-problemsolution-product-00.html. It talks about a conflict between the IP conflict detection system in Windows and an ARP Probe sent by the switch as part of IP Device Tracking. I am no Cisco expert, but I would like to have a chat with whoever thought that IP conflict detection should start BEFORE the nic has an IP set…

As far as I can tell the IP Tracking function on the switch is enabled by default from IOS version 15.2.

Workaround

Turn off IP Device Tracking at the switch

https://supportforums.cisco.com/discussion/11960461/ip-device-tracking talks about running the following commands on the switch:

switch(config)# int range gig1/0/1 – 24
switch(config-if)# nmsp attach suppress
end

This is supposed to turn off the IP Device tracking on a  per switch basis. I do not have access to my switching infrastructure, so I have not tested this. I will update this post if I get the opportunity to test it.

Turn off the Gratuitous ARP Function

Refer to this ancient KB: https://support.microsoft.com/en-us/kb/219374. It is written for NT4, but it still works. Be aware, this basically turns off IP Conflict detection completely.

Upgrade your NIC driver

And hope that it helps…