Problem

While hunting for performance bottlenecks using Spotlight some months ago, I came across some strange statements listed only as *password------ (the dashes continue for a couple of pages). This was new to me, and even a search on the pesky trackers-with-integrated-search came up blank. We passed it off to the developers and never heard back, and I forgot about the whole thing. Then, during a stress test in QA, it popped up on the radar again, and this time I decided to dig into it myself.

Analysis

Being a good boy, I was using Extended Events to track down the problem. I do not know why, but Profiler still feels more like home. Anyhow, I set up a session triggering on sql_text like '%*password--%', looking something like this:

CREATE EVENT SESSION [pwdstar] ON SERVER
ADD EVENT sqlserver.rpc_starting(
    ACTION(sqlserver.client_app_name, sqlserver.client_connection_id, sqlserver.client_hostname,
           sqlserver.client_pid, sqlserver.database_name, sqlserver.session_id,
           sqlserver.sql_text, sqlserver.tsql_stack, sqlserver.username)
    WHERE ([package0].[greater_than_uint64]([sqlserver].[database_id], (4))
       AND [package0].[equal_boolean]([sqlserver].[is_system], (0))
       AND [sqlserver].[like_i_sql_unicode_string]([sqlserver].[sql_text], N'%*password--%')
       AND [sqlserver].[equal_i_sql_unicode_string]([sqlserver].[database_name], N'DBNAME')))
ADD TARGET package0.event_file(SET filename=N'pwdstar', max_file_size=(200))
WITH (MAX_MEMORY=4096 KB, EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS, MAX_DISPATCH_LATENCY=30 SECONDS,
      MAX_EVENT_SIZE=0 KB, MEMORY_PARTITION_MODE=NONE, TRACK_CAUSALITY=ON, STARTUP_STATE=OFF)
GO

I came back an hour later to look at the results, and was surprised to find that the tsql_stack returned a more normal-looking statement:

image

Back to the trackers I went, this time trying to figure out why sql_text differed from the statement. This produced far better results, and I came across this article: https://www.sqlskills.com/blogs/jonathan/understanding-the-sql_text-action-in-extended-events/. Following the links produced the following statement from Mr. Kehayias:

If the sql_text Action returns:

<action name="sql_text" package="sqlserver">
<type name="unicode_string" package="package0" >
<value>*password------------------------------</value>
<text></text>
</type></action>

the text was stripped by the engine because it contained a password reference in it.  This functionality occurs in SQL Trace as well and is a security measure to prevent passwords from being returned in the trace results.

As I said, this was news to me, but the feature seems to be an old one, not specific to Extended Events. From what I can tell, however, Extended Events uses a more aggressive masking algorithm than Profiler.

Solution

If you come across such queries in your Profiler or Extended Events results, you are dealing with something SQL Server believes is password related, and the text is therefore hidden. The easiest way to get at the statement is to capture tsql_stack as well. In SQL 2016 you can capture the statement directly:
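Once the session has collected data, the captured events (including the tsql_stack action) can be read back from the event file target with sys.fn_xe_file_target_read_file. A minimal sketch, assuming the session above wrote its files to the default LOG directory:

```sql
-- Read the captured events back from the pwdstar event file target.
-- Adjust the file name pattern/path if the .xel files are not in the
-- default LOG directory.
SELECT CAST(event_data AS xml) AS event_xml
FROM sys.fn_xe_file_target_read_file(N'pwdstar*.xel', NULL, NULL, NULL);
```

The tsql_stack frames are then found inside the returned XML under the action nodes.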

image



Problem

I received an alarm from one of my SQL Servers about IO stall times measured in seconds and went to investigate. We have had trouble with HBA firmware causing FC stalls previously, so I suspected another storage error. The server in question was running virtual FC, and a cascading error among the other servers on the same host seemed to confirm my initial hypothesis about an HBA problem on the host.

Analysis

The kernel mode CPU time on the host was high (the red part of the graph in Process Explorer), which is also a pointer in the direction of storage problems. The storage minions found no issue on the SAN though; yet another pointer towards a problem on the server itself. We restarted it twice, and the situation seemed to normalize. It was all written off as collateral damage from a VMware fault that flooded the SAN with invalid packets some time ago. I moved one of the VMs back and let it simmer overnight. I felt overly cautious not moving them all back, but the next morning the test VM was running at 80% CPU without getting anything done, and the CPU load on the host was about 50%, running a 3-CPU VM on a 2×12 core host…

image

I failed the test vm back to the spare host, and the load on the VM went down immediately:

image

At this point I was ready to take a trip to the room of servers and visit the host in person, and I was already planning a complete re-imaging of the node in my head. But then I decided to run CPU-Z first, and suddenly it all became clear.

image

 

The host is equipped with Intel Xeon E5-2690 v3 CPUs. Intel ARK informs me that the base clock is indeed 2.6 GHz as reported by CPU-Z, and the turbo frequency is as high as 3.5 GHz. A core speed of 1195 MHz as shown in CPU-Z is usually an indicator of one of two things: either someone has fiddled with the power saving settings, or there is something seriously wrong with the hardware.

A quick check of the power profile revealed that the server was running in the so-called “Balanced” mode, a mode that should be called “run-around-in-circles-and-do-nothing” mode on servers. The question then becomes: why did this setting change?

image

My server setup checklist clearly states that servers should run in High performance mode, and I had installed this particular server myself, so I know it was set correctly. The culprit was found to be a firmware upgrade installed some months back. It had the effect of resetting the power profile, both in the BIOS and in Windows, to the default setting. There was even a change to the fan profile, causing the server to become very hot. The server in question is an HP ProLiant DL380 Gen 9, and the ROM version is P89 v2.30 (09/13/2016).

Solution

  • First you should change the power profile to High performance in the Control Panel. This change requires a reboot.
  • While you are rebooting the server, enter the BIOS settings and check the power profile. I recommend Maximum Performance mode for production servers.
  • image
  • Then, check the Fan Profile
  • image
  • Try increased cooling. If your servers still get exceedingly hot, there is a maximum cooling mode, but this basically runs all the fans at maximum all the time.
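The Windows side of the first step can also be done from an elevated prompt with powercfg; the GUID below is the well-known ID of the built-in High performance plan:

```
:: Activate the High performance power plan (well-known GUID on all Windows versions)
powercfg /setactive 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c

:: Verify which plan is now active
powercfg /getactivescheme
```

This is handy when you need to fix a whole batch of servers remotely instead of clicking through the Control Panel on each one.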

This is how CPU-Z looks after the change:

image

And the core utilization on the host, this time with 8 active SQL Server VMs:

image



Problem

I was upgrading an Availability group from SQL 2012 on Win 2012R2 to SQL 2016 on Win2016. I had expected to create the new AOAG as a separate cluster and move the data manually, but the users are always complaining when I want to use my allotted downtime quotas, so I decided to try a rolling upgrade instead. This post is a journal of some of the perils I encountered along the way, and how I overcame them. There were countless others, but most of them were related to crappy hardware, wrong hardware being delivered, missing LUNS on the SAN, delusional people who believe they can lock out DBAs from supporting systems, dragons, angry badgers, solar flares and whichever politician you dislike the most. Anyways, on with the tale of clusters past morphing into clusters present…

I started by adding the new node to the failover cluster. This went surprisingly well, in spite of the old servers being at least two generations older than my new rack servers. Sadly, both the new and the old servers are made by the evil wizards behind the silver slanted E due to factors outside of my control. But I digress. The cluster join went flawlessly. There were some yellow complaints in the cluster validation scroll about the nodes not having the same OS version, but everything worked.

Then came adding the new server as a replica in the availability group. This is done from the primary replica, and I just uttered a previously prepared spell from the book of disaster recovery belonging to this cluster, adding the name of the new node. As far as I can remember this is just the result of the standard “Add replica” wizard. The spell ran without complaints, and my new node was online.

This is the point where it all went to heck in a small hand-basket carried by an angry badger. I noticed a yellow warning next to the new node in the AOAG dashboard, but as the databases were all in the synchronizing state on the new replica, I believed this to be a note complaining about the OS version. I was wrong. In my ignorance, I failed over to the new node and had the application team minions run some tests. They came back positive, so I removed the old nodes in preparation for adding the last one. I even ran the Update-ClusterFunctionalLevel PowerShell command without issues. But the warning persisted. This is the contents of the warning:

Availability replica not joined.

image

And it was no longer a lone warning, the AOAG dashboard did not look pretty as both the old nodes refused to accept the new node as their new primary replica.

Analysis

As far as I can tell, the join AOAG script failed in some way. It did not report any errors, but still, there is no doubt that something did go wrong.

The solution as reported by MSDN is simple: just join the availability group by casting the “alter availability group groupname join” spell from the secondary replica that is not joined. The attentive reader has probably already realized that this is the primary replica, and as you probably suspect, the aforementioned command fails.
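For reference, this is the join spell as it would be cast on the replica reported as not joined (the availability group name is a placeholder):

```sql
-- Run on the secondary replica that sys.dm_hadr_availability_replica_cluster_states
-- reports as not joined. [MyAvailabilityGroup] is a placeholder name.
ALTER AVAILABILITY GROUP [MyAvailabilityGroup] JOIN;
```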

Casting the following spell lists the replicas and their join state: “select join_state, join_state_desc from sys.dm_hadr_availability_replica_cluster_states”. This is the result:

image

In some way I have put the node in an invalid state. It still works perfectly, but I guess it is only a question of when, not if, this issue will grow into a bigger problem.

Solution

With such an elaborate backstory, you would not be wrong to expect an equally elaborate solution. Whether or not it is, is really in the eye of the beholder.

Just the usual note of warning first: If you are new to availability groups, and all this cluster stuff sounds like the dark magic it is, I would highly suggest that you do not try to travel down the same path as me. Rather, you should turn around at the entrance and run as fast as you can into the relative safety of creating another cluster alongside the old one. Then migrate the data by backing up on the old cluster and restoring on the new cluster. And if backups and restores on availability groups sounds equally scary, then ask yourself whether or not you are ready to run AOAG in production. In contrast to what is often said in marketing materials and at conferences, AOAG is difficult and scary to the beginner. But there are lots of nice training resources out there, even some free ones.

Now, with the warnings out of the way, here is what ended up working for me. I tried a lot of different solutions, but I was bound by the following limitation: The service has to be online. That translates to no reboots, no AOAG-destroy and recreate, no cluster rebuilds and so on. A combination of which would probably have solved the problem in less than an hour of downtime. But I was allowed none, so this is what I did:

  • Remove any remaining nodes and replicas that are not Win2016 SQL2016.
  • Run the Powershell command Update-ClusterFunctionalLevel to make sure that the cluster is running in Win2016 mode.
  • Build another Win 2016 SQL 2016 node
  • Join the new node to the cluster
  • Make sure that the cluster validation scroll seems reasonable. This is a fluffy point, I know, but there are way too many variables to make an exhaustive list. https://lokna.no/?p=1687 mentions some of the issues you may encounter.
  • Join the new node to the availability group as a secondary replica.
  • Fail the availability group over to the new node (make sure you are in synchronous commit mode for this step).
  • Everything is OK.

image

  • Fail back to the first node
  • Change back to asynchronous commit (if that is your default mode, otherwise leave it as synchronous).
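The cluster-side steps in the list above can be sketched in PowerShell; node names are placeholders, and the availability group steps are done from Management Studio as described:

```powershell
# Sketch of the cluster-side steps; OLDNODE2/NEWNODE2 are placeholder names.
Remove-ClusterNode -Name "OLDNODE2"   # remove any remaining down-level nodes
Update-ClusterFunctionalLevel         # switch the cluster to Win2016 mode
Add-ClusterNode -Name "NEWNODE2"      # join the freshly built Win2016 node
Test-Cluster                          # re-run validation and read the scroll carefully
```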

 

Thus I have successfully upgraded a 2-node AOAG cluster from Win2012R2 and SQL 2012 to Win2016 and SQL 2016 with three failovers as the only downtime. In QA. Production may become an interesting journey, IF the change request is approved. There may be an update if I survive the process…

 

Update and final notes

I have now been through the same process in production, with similar results. I do not recommend doing this in production, the normal migration to a new cluster is far preferable, especially when you are crossing 2 SQL Server versions on the way. Then again, if the reduced downtime is worth the risk…

Be aware that a failover to a new node is a one way process. Once the SQL 2016 node becomes the primary replica, the database is updated to the latest file format, currently 852 whereas SQL 2012 is 706. And as far as I can tell from the log there is a significant number of upgrades to be made. See http://sqlserverbuilds.blogspot.no/2014/01/sql-server-internal-database-versions.html for a list of version numbers.

image



Preface

Most of my Windows servers are patched by WSUS, SCCM or a similar automated patch management solution at regular intervals. But not all. Some servers are just too important to be autopatched. This is a combination of SLA requirements making downtime difficult to schedule, and the sheer impact of a botched patch run on backend servers. Thus, a more hands-on approach is needed. In W2012R2 and earlier this was easily achieved by running the manual Windows Update application. I ran through the process in QA, let it simmer for a while, and went on to repeat the process in production if no nefarious effects were found during testing. Some systems even have three or more staging levels. It is a very manual process, but it works, and as we are required to hand-hold the servers during the update anyway, it does not really cost anything. Then along came Windows Server 2016. Or Windows 10 I should really say, as the Update module in W2016 is carbon copied from W10 without changes. It is even trying to convince me to install the W10 Creators Update on my servers…

image

In Windows Server 2016 the lazy bastards at Microsoft just could not be bothered to implement the functionality from W2012R2 WU. It is no longer possible to defer specific updates I do not want, such as the stupid Silverlight mess. If I want Microsoft update, then I have to take it all. And if I should become slightly insane and suddenly decide I want driver updates from WU, the only way to do that is to go through device manager and check every single device for updates. Or install WUMT, a shady custom WU client of unknown origin.

I could of course use WSUS or SCCM to push just the updates I want, but then I have to magically imagine what updates each server wants and add them to an ever growing number of target groups. Every time I have a patch run. Now that is expensive. If I had enough of the “special needs” servers to justify the manpower-cost, I would have done so long ago. Thus, another solution was needed…

PSWindowsUpdate to the rescue. PSWindowsUpdate is a PowerShell module written by a user called MichalGajda on the TechNet Gallery, enabling management of Windows Update through PowerShell. In this post I go through how to install the module and use it to run Microsoft Update in a way that resembles the functionality from W2012R2. You could tell the module to install a certain list of updates, but I found it easier to hide the unwanted updates. It also ensures that they are not added by mistake in the next round of patches.

Getting started

(See the following chapters for details.)

  • You should of course start by installing the module. This should be a one-time deal, unless a new version has been released since the last time you used it. New versions of the module should of course be tested in QA like any other software.
  • Then, make sure that Microsoft Update is active.
  • Check for updates to get a list of available patches.
  • Hide any unwanted patches
  • Install the updates
  • Re-check for updates to make sure there are no “round-two” patches to install.
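As a rough sketch, the list above maps to the following PSWindowsUpdate commands. Cmdlet and parameter names should be verified with Get-Help against the module version you install, and the Silverlight filter is just an example:

```powershell
Install-Module PSWindowsUpdate               # one-time install of the module
Get-WUServiceManager                         # check that Microsoft Update is registered
Get-WindowsUpdate -MicrosoftUpdate           # list available patches
Hide-WindowsUpdate -Title "Silverlight"      # hide unwanted patches (example filter)
Get-WindowsUpdate -MicrosoftUpdate -Install -AcceptAll   # install the rest
Get-WindowsUpdate -MicrosoftUpdate           # re-check for "round-two" patches
```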

Read the rest of this entry »


In response to comments on my post about 1Password and cloud security, this is an update about other password managers and my way out. It is highly recommended that you read the previous post first to understand where I am going with this. I was looking for a password manager for a specific purpose, that is cloud sync ability with something that at least looks like true encryption without back doors, and my comments are written from that mindset.

Why I would not use Dropbox for my passwords

Or Google drive. Or Onedrive for that matter.

And note the “for my passwords” part of the sentence. It is not that I do not use such services, but I consider all of them insecure locations. They are just too juicy a target for wrongdoers, both intelligence agencies and the other kind of cyber criminals.

About “Free” cloud services

Such as Facebook, or maybe LastPass is a better example in this case. There is no free lunch. You are either a customer or a product. You are either the farmer or the pig. Pigs have no right to privacy. And before you consider using any Google-based service, read this: Street View cars slurping wi-fi. This case was, for me, the first warning that something was rotten in the house of Google. And it has not become any better since. I guess the “Don’t be evil” company motto should have been a warning as well…

As such, I would never install a password manager that contains passwords for other systems on an Android or iOS-based phone. They are way too easy to hack.

And just as a note, LastPass and LogMeOnce were not considered due to their lack of a desktop client.

KeePass

I have used KeePass in the past, but it is mostly a local-only solution, and is therefore out of scope, at least initially. I am also slightly worried about the fact that it is free, as in free of charge. I have nothing against open source software, but for some needs I prefer to have access to someone to complain to if it all goes wrong. Someone who is paid to listen to my ails and complaints. That being said, KeePass is a nice product for local password management.

Keeper

I had a brief look at it, discovered it was dependent on Java, and uninstalled it immediately. Also, on revisiting it, it seems to be “moving to the cloud”.

DashLane

A solid security policy. I ran the demo for some time, but the “Modern” UI was horrible, if not as horrible as the 1Password 6 beta. I wanted to like it, but after continuously having to click buttons two or three times for them to do anything, I gave up. I may re-test it in the future. This is sadly a complaint I have about most “Modern” UI applications: they do not respond to mouse clicks consistently. I also could not get the browser plugin to work properly in Vivaldi.

StickyPassword

This has been the best contender so far. I got as far as testing the synchronization, and I used it for the full trial period without a rage-uninstall. What actually stopped me from going for it in the end was its login dialog constantly popping up when I was not trying to use it. It became a nuisance. I also did not like that it wants to run all the time, instead of only when I actually want to use it. The chance of someone snagging access to it while walking by if I forgot to lock my screen is of course a danger, but primarily I stopped using it because it became irksome.

What I ended up doing

I stuck with what I had: a simple local-only password manager not to be named. The password manager itself is not important, as long as you have control of the data locally. It is how you control the synchronization of that data that matters. And as I do not really trust any of the “public cloud” alternatives, I decided to make my own. I installed Resilio Sync, a file synchronization application based on the BitTorrent protocol, and used it to keep my encrypted password store in sync across my computers.

This allows me to keep the data in sync and, to a certain degree, actually know where my data is physically located. It could still be hacked or intercepted of course, but that would have to be a much more directed attack than the usual “let’s archive everything that was ever stored on Dropbox in case we need it some time” behavior we have come to expect from the people who are supposedly working to keep the world “safe”. I may come across as rather paranoid in this post, but such are the times.



Problem

I was preparing to roll out SQL Server 2016 and Windows Server 2016 and had deployed the first server in production. I suddenly noticed that even if I selected “Check online for updates from Microsoft Update” in the horrible new update dialog, I never got any of the additional updates. By the way, this link/button only appears when you have an internal SCCM or WSUS server configured; clicking the normal Check For Updates button will get updates from WSUS.

image

 

Analysis

This was working as expected in the lab, but the lab does not have the fancy System Center Configuration Manager and WSUS systems. So of course I blamed SCCM and uninstalled the agent. But to no avail: still no updates. I lurked around the update dialog and found that the “Give me updates for other Microsoft products…” option was grayed out and disabled. I am sure that I checked this box during installation, as I remember looking for its location. But it was no longer selected; it was even grayed out.

image

This smells of GPOs. But I also remembered trying to get this option checked by a GPO to save time during installation, and that it was not possible to do so in Win2012R2. Into the Group Policy Manager of the lab DC I went…

It appears that GPO management of the Microsoft Update option has been added in Win2016:

image

This option is not available in Win2012R2, but as we have a GPO that defines “Configure Automatic Updates”, it defaults to disabled.

Solution

Alternative 1: Upgrade your domain controllers to Win2016.

Alternative 2: Install the Win2016 .admx files on all your domain controllers and administrative workstations.

Then, change the GPO to ensure that “Install updates for other Microsoft products” is enabled. Selecting “3 – Auto download” used to be a safe setting.

Alternative 3: Remove the GPO or set “Configure Automatic Updates” to “Not Configured”, thus allowing local configuration.
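Whichever alternative you pick, you can verify on a node what the policy actually set by reading the corresponding registry values. The value names below are my reading of the WindowsUpdate ADMX; AllowMUUpdateService=1 should correspond to the Microsoft Update option, and AUOptions=3 to “Auto download”:

```powershell
# Inspect the Windows Update policy values written by the GPO.
# Value names are assumptions based on the WindowsUpdate ADMX files.
Get-ItemProperty "HKLM:\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU" |
    Select-Object AllowMUUpdateService, AUOptions
```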



Introduction

Since W2012R2 it is recommended that all clusters have a quorum witness regardless of the number of cluster nodes. As you may know, the purpose of the cluster witness is to ensure a majority vote in the cluster. If you have 2 nodes with one vote each and add a cluster witness you create a possibility for a majority vote. If you have 3 nodes on the other hand, adding a witness will remove the majority vote as you have 4 votes total and a possible stalemate.

If a stalemate occurs, the cluster nodes may revolt and you are unable to get the cluster working again without forcing quorum; or you could take a node out behind the barn and end its misery. Not a nice situation at all. W2012R2 solves this predicament with dynamic vote assignment. As long as a quorum has been established, if votes disappear due to nodes going offline, it will turn the witness vote on and off to make sure that you always have the possibility of a node majority. As long as you HAVE a witness, that is.

There are three types of quorum witnesses:

  • A SAN-connected shared witness disk, usually FC or iSCSI. Recommended for clusters that use shared SAN-based cluster disks for other purposes, otherwise not recommended. If this sounds like gibberish to you, you should use another type of witness.
  • A file share witness. Just a file share. Any type of file share will do, as long as it resides on a Windows server in the same domain as the cluster nodes. SOFS shares are recommended, but not necessary. DO NOT build a SOFS cluster for this purpose alone. You could create a VM for cluster witnesses, as each cluster witness is only about 5 MiB, but it is better to find an existing physical server with a high uptime requirement in the same security zone as the cluster and create some normal SMB shares there. I recommend a physical server because a lot of virtual servers are Hyper-V based, and having the witness on a VM in the very cluster it is a witness for is obviously a bad idea.
  • Cloud Witness. New in W2016. If you have an Azure storage account and are able to allow the cluster nodes a connection to Azure, this is a good alternative. Especially for stretch clusters that are split between different rooms.

How to set up a simple SMB File share witness

  • Select a server to host the witness, or create one if necessary.
  • Create a folder somewhere on the server and give it a name that denotes its purpose:
  • image
  • Open the Advanced Sharing dialog
  • image
  • Enable sharing and change the permissions. Make sure that everyone is removed, and add the cluster computer object. Give the cluster computer object full control permissions
  • image
  • Open Failover Cluster manager and connect to the cluster
  • Select “Configure Cluster Quorum Settings”
  • image
  • Choose “Select the Quorum Witness”
    image

  • Select File Share Witness

  • image

  • Enter the path to the file share as \\server\share

  • image

  • Finish the wizard

  • Make sure the cluster witness is online:

  • image

  • Done!
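The wizard steps above can also be done in one line of PowerShell from a cluster node; the share path is a placeholder:

```powershell
# Point the cluster quorum at the SMB share created earlier (placeholder path).
Set-ClusterQuorum -FileShareWitness "\\witnessserver\ClusterWitness"

# Verify the resulting quorum configuration
Get-ClusterQuorum
```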


Problem

Sometimes when I restart one of my Windows 10 computers, the network never comes online, and I have to disable/enable the network adapter to get it back. The reason seems to be an IP conflict with the address 0.0.0.0. This computer has a fixed IP; no DHCP is involved. The NIC is an Intel I219-V.

Analysis

image

Event ID 4199, TCPIP: The system detected an address conflict for IP address 0.0.0.0 with the system having network hardware address 20-4C-9E-49-38-8A.

A quick check in the tracking system revealed this article from Cisco: http://www.cisco.com/c/en/us/support/docs/ios-nx-os-software/8021x/116529-problemsolution-product-00.html. It talks about a conflict between the IP conflict detection system in Windows and an ARP probe sent by the switch as part of IP Device Tracking. I am no Cisco expert, but I would like to have a chat with whoever thought that IP conflict detection should start BEFORE the NIC has an IP set…

As far as I can tell the IP Tracking function on the switch is enabled by default from IOS version 15.2.

Workaround

Turn off IP Device Tracking at the switch

https://supportforums.cisco.com/discussion/11960461/ip-device-tracking talks about running the following commands on the switch:

switch(config)# int range gig1/0/1 - 24
switch(config-if)# nmsp attach suppress
switch(config-if)# end

This is supposed to turn off IP Device Tracking on a per-switch basis. I do not have access to my switching infrastructure, so I have not tested this. I will update this post if I get the opportunity to test it.

Turn off the Gratuitous ARP Function

Refer to this ancient KB: https://support.microsoft.com/en-us/kb/219374. It is written for NT4, but it still works. Be aware that this basically turns off IP conflict detection completely.
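Per that KB, the change is a single registry value followed by a reboot. A sketch in PowerShell (again: this disables IP conflict detection entirely, as noted):

```powershell
# ArpRetryCount = 0 disables gratuitous ARP / IP conflict detection (per KB 219374).
# A reboot is required for the change to take effect.
Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" `
    -Name ArpRetryCount -Value 0 -Type DWord
```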

Upgrade your NIC driver

And hope that it helps…


Background

I was spending Christmas with relatives on the western coast of Norway, a part of Norway where foul weather is no stranger and heavy rain is the norm, where the Vikings learned to handle the waves of the Atlantic Ocean next door. Thus we are no strangers to voltage spikes either: power outages due to Thor’s angry electrons are quite common. It can be pretty, but such evening skies are usually a harbinger of bad weather.

image

And surely it was. The next morning we were hammered by gale force winds. I would rate it as a medium storm, but the meteorologists gave it an official name (Urd, an old Viking female name) and called it extreme weather. The second night of the storm I was awakened by a loud crack from the direction of the intake breaker box located in the guest room, followed by thunder. The howling from the UPS in the server closet revealed a power outage. I waited for about 10 minutes, but all I could hear was the storm. The reason for waiting is this: if there is one place in the house you do not want to be during a lightning strike, it is with your nose in the breaker box trying to get the power back on.

After a while the lack of electrical heating won over my concerns for further strikes, and I went to look at the main breaker box. I was expecting to find one of the breakers reduced to a pile of rubble, but everything looked OK. If you wonder how I found my way through the darkness, let us just say that you do not grow up in this part of Norway without learning how to find one of your many flashlights in the dark. As all seemed OK in the intake box, on I went to investigate the distribution box in the next room. This is where the main distribution breaker, residual current device and surge protectors are located. Both the surge protectors and the residual current device were triggered, along with several circuit breakers. I primed the residual current device and switched it back on. Then I reset the circuit breakers, verified that the electric heater in my room was working, and went back to sleep.

I was raised from my slumber by the users (i.e. my relatives) a couple of hours later. They complained about missing internet service and beseeched me to investigate. And investigate I did. For a normal residential house they have quite the advanced setup (I might be to blame for this), but it is made such to be resilient. After the installation of extra surge protectors some years back, the culprit is usually the ADSL modem. There are sadly no phone line surge protectors available that are powerful enough to resist the onslaught, so when the angry electrons enter the house through the phone lines they usually end up killing the modem. The ISDN phone connected to the same line has survived for more than a decade, but that is an ancient German-made Siemens device, unlike the chinesium crap the ISP calls an ADSL modem, which is usually replaced twice per year.

A quick look at the modem revealed not a hint of status LEDs. Hoping for a quick fix in the form of a power supply replacement, I took it down from its mounting bracket, only to discover the unmistakable rattle of destroyed components from within. The VPN box was also dead.

image

I called the ISP and convinced them to send over a new modem. Due to this still being Christmas, it would take two days. Which isn’t bad, but usually we could get one the same day.

A quick summary of the components: The Wireless AP is there to provide a consistent WLAN. The modem has one built in, but each time it is replaced, the settings change. The VPN box is placed there by me to facilitate remote support. The DSL splitter is connected to the outside line and sends one signal to the ISDN NT1 and another signal to the modem. The NT1 is located in another room. The box on the lower right is supplied by the satellite TV supplier, and its function is unclear. It has some kind of wireless function, and I suspect it is a dedicated WLAN for the satellite decoder to call home.

Zyxel P8702N

But back to the modem. When you remove the top dark-out cover it looks like this:

image

It identifies itself as a ZyXEL P8702N, which as far as I can tell is an ISP special, that is, only sold to ISPs. The hardware supports both an internal DSL modem and an external modem/adapter.

I was curious as to which components had produced the rattle, so I removed the top cover. No screws, only fidgety plastic clips; it does not look like it is designed to be serviced. A first glance revealed three separate confirmed problems.

image

1 – MNC G4804DG

This chip is a dual-port gigabit ethernet line transformer. There are two of them, which correlates to the four LAN ports. There is also a WAN port connected to the G1806DG on the left. There are clear signs of carbon on the board, evidence of a blue smoke leak. And as we all know, if the magic blue smoke gets out of the chip, it stops working. This should have prompted me to investigate further at the other end of port 3, but more on that later.

2 – DSL line “protection”

image

The designers have tried to protect the modem from angry electrons by connecting the DSL line to a gas discharge tube and two in-line capacitors. As the picture clearly shows, this was not enough. As far as I could find out the capacitors are low-quality chinesium, and I guess that goes for most of this box.

image

A close-up reveals further damage, even carbon on the connector itself. I would guess that the two capacitors were the source of the loud noise.

3 – Unknown chip

image

This could be the “modem” part, but it was too small and damaged to identify with the equipment I had available. The board shows a trail of destruction from the DSL port down to this chip.

VPN Power

image

image

All I could find was a broken 3-pin part, probably some kind of transistor. The power supply was luckily all that was broken, and a retrofit universal model from the local supplier brought the box back to life. Local as in 50 clicks away, but I digress.

Network meltdown

I promised to return to case 1 from the modem. The one about carbon on the network interface transformer. After replacing the modem I quickly discovered that the server was no longer accessible. This is your typical small-business setup with one box running file, print, AD and accounting software, connected to a couple of clients. There was sadly no time for pictures, but to sum it up, the angry electrons killed an HP ProCurve switch and a network adapter in one of the computers.

Summary

All this from a single lightning strike far away. The angry electrons of Thor are not to be scoffed at.

First, a friendly warning: this post details procedures for messing with the time service on domain controllers. As always, if you do not understand the commands or their consequences, seek guidance.

Problem

I have been upgrading my lab to Windows Server 2016 in preparation for a production rollout. Some may feel I am late to the game, but I have always been reluctant to roll out new server operating systems quickly. I prefer to have a good baseline of other people’s problems to look for in your friendly neighborhood tracking service (AKA search engine) when something goes wrong.

Anyways, some weeks ago I rolled out 2016 on my domain controller. When I came back to upgrade the Hyper-V hosts, I noticed time was off by 126 seconds between the DC and the client. As the clock on the DC was correct, I figured the problem was client related. Into the abyss of w32tm we go.

Analysis

The Windows Time Service is not exactly known for its user friendliness, so I just started with the normal shotgun approach at the client:

net stop w32time
w32tm /config /syncfromflags:domhier
net start w32time

These commands, if executed at an administrative command prompt, will remind the client to get its time from the domain time sync hierarchy, in other words from one of the DCs. If possible. Otherwise it will just let the clock drift until it passes the maximum time delta (five minutes by default for Kerberos), at which point it will no longer be able to authenticate against the DC. This is usually the point when your friendly local monitoring system will alert you to the issue. Or your users will complain. But I digress.

Issuing a w32tm /resync command afterwards should guarantee an attempt to sync, and hopefully a successful result. At least in my dreams. In reality, though, it just produced another nasty error: 0x800705B4. The tracking service indicated that it translates to “operation timed out”.

The next step was to try a stripchart. The stripchart option instructs w32tm to query a given computer and show the time delta between the local and remote computer. Kind of like ping for time servers. The result should look something like this:

image
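For reference, a typical stripchart invocation looks something like the following. The computer name is hypothetical; substitute your own DC:

```shell
rem Query the target every 2 seconds, take 5 samples, and print only the
rem measured offset between the local clock and the remote computer.
w32tm /stripchart /computer:dc01.contoso.local /period:2 /samples:5 /dataonly
```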

But unfortunately, this is what I got:

image

I shall spare you the details of all the head-scratching and ancient Viking rituals performed at the poor client to no avail. Suffice it to say that I finally realized the problem had to be related to the DC upgrade. I tried running the stripchart from the DC itself against localhost, and that failed as well. That should have been a clue that something was wrong with the Time Service itself. But as troubleshooting the Time Service involves decoding its registry keys, I went to confirm the firewall rules instead. Which of course were hunky-dory.

image
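For a quick firewall sanity check without clicking through the GUI, NTP traffic arrives on UDP port 123, and on a domain controller it is normally covered by a built-in inbound rule. A sketch, assuming the default rule name (it may differ on your systems):

```shell
rem Show the predefined inbound NTP rule on a domain controller
netsh advfirewall firewall show rule name="Active Directory Domain Controller - W32Time (NTP-UDP-In)"
```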

I then ran dcdiag /test:advertising /v to check if the server was set to advertise as a time server:

image

 

The next step was to reset the configuration for the Time Service. The official procedure is as follows:

net stop w32time
w32tm.exe /unregister
w32tm.exe /register
net start w32time

This procedure usually ends with some error message complaining about the service being unable to start due to some kind of permission issue with the service. I seem to remember error 1902 is one of the options. If this happens, first try 2 consecutive reboots. Yes, two. Not one. Don’t ask why, no one knows. If that does not help, try again but this time with a reboot after the unregister command.

The procedure ran flawlessly this time, but it did not solve the problem.

Time to don the explorer’s hat and venture into the maze of the registry. The Time Service hangs out in HKLM\System\CurrentControlSet\Services\W32Time. After some digging around, I found that the Enabled value under the TimeProviders\NtpServer subkey was set to 0. Which would suggest that the NTP server was turned off. I mean, registry settings are tricksy, but there are limits. I tried changing it to 1 and restarted the service.

image
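If you prefer the command line to regedit, the same check and fix can be sketched like this (same registry path, values as documented for the Windows Time Service):

```shell
rem Inspect the NTP server provider setting
reg query "HKLM\SYSTEM\CurrentControlSet\Services\W32Time\TimeProviders\NtpServer" /v Enabled

rem Re-enable the NTP server provider and restart the service
reg add "HKLM\SYSTEM\CurrentControlSet\Services\W32Time\TimeProviders\NtpServer" /v Enabled /t REG_DWORD /d 1 /f
net stop w32time
net start w32time
```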

Suddenly, everything worked. The question is why… Not why it started working, but why the setting was changed to 0. I am positive time sync was working fine prior to the upgrade. Back to the tracking service I went. Could there be a new method for time sync in Windows 2016? Was it all a big conspiracy caused by Russian hackers in league with Trump? Of course not. As usual the culprits are the makers of the code.

Solution

My scenario is not a complete match, but in KB3201265 Microsoft admits to having botched the upgrade process for the Windows Time Service in both Windows Server 2016 and the corresponding Windows 10 1607. Basically, the upgrade applies default registry settings for a non-domain-joined server. Optimistic as always, they tell you to export the registry settings for the service PRIOR to upgrading. As if I have the time to read every KB they publish. Anyways, it also details a couple of other possible solutions, such as how to get the previous registry settings out of Windows.old.
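If you still have an upgrade ahead of you, the pre-upgrade backup they suggest boils down to a single export. A sketch; the file name and path are my own choices:

```shell
rem Back up the entire Windows Time Service configuration before upgrading
reg export "HKLM\SYSTEM\CurrentControlSet\Services\W32Time" C:\Backup\w32time-pre-upgrade.reg

rem After the upgrade, the settings can be restored with:
rem reg import C:\Backup\w32time-pre-upgrade.reg
```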

My recommendation is as such: Do not upgrade your domain controllers. Especially not in production. I only did it in the lab because I wanted to save time.

If you, like me, have put yourself in this situation (and honestly, why else would you have read this far?), I recommend following method 3 in KB3201265, unless you feel comfortable exploring the registry and fixing it manually.
