Node unable to join cluster because the cluster db is out of date

Scenario

  • Complex 3-node AOAG cluster
  • 2 nodes in room A form a SQL Server failover cluster instance (FCI)
  • 1 node in room B is a stand-alone instance
  • The FCI and node 3 form an always on availability group (AOAG)
  • All nodes are in the same Windows failover cluster
  • All nodes run Windows Server 2022

Problem

The problem was reported as follows: Node 3 is unable to join the cluster, and the AOAG sync has stopped. A look at the Cluster Events tab in Failover Cluster Manager revealed the following error being repeated, FailoverClustering Event ID 5398:

Cluster failed to start. The latest copy of cluster configuration data was not available within the set of nodes attempting to start the cluster. Changes to the cluster occurred while the set of nodes were not in membership and as a result were not able to receive configuration data updates.

And also FailoverClustering Event ID 1652:

Cluster node 'Node 3' failed to join the cluster. A UDP connection could not be established to node(s) 'Node 1'. Verify network connectivity and configuration of any network firewalls

Analysis

Just looking at the error messages listed, one might be inclined to believe that something is seriously wrong with the cluster. Cluster database paxos (version) tag mismatch problems often lead to having to evict and re-join the complaining nodes. But experience, especially with multi-room clusters, has taught me that this is seldom necessary. The cluster configuration database issue is just a symptom of the underlying network issue. What it is trying to say is that the consistency of the database across the nodes cannot be verified, or that one of the nodes is unable to download the current version from one of the other nodes. Maybe due to an even number of active nodes, or not enough nodes online to form a majority.

A cluster validation run (Network only) was started, and indeed, there was a complete breakdown in communication between nodes 1 and 3. In both directions. Quote from the validation report:

Node 1 is not reachable from node 3. It is necessary that each cluster node can communicate each other cluster node by a minimum of one network path (though multiple paths are recommended to avoid a single point of failure). Please verify that existing networks are configured properly or add additional networks.

If the communication works in one direction, you will only get one such message. In this case, we also have the corresponding message indicating a two-way issue:

Node 3 is not reachable from node 1. It is necessary that each cluster node can communicate each other cluster node by a minimum of one network path (though multiple paths are recommended to avoid a single point of failure). Please verify that existing networks are configured properly or add additional networks.

Be aware that a two-way issue does not necessarily indicate a problem with more than one node. It does however point to a problem located near or at the troublesome node, whereas a one-way issue points more in the direction of a firewall issue.

Thus, a search of the central firewall log repository was started. It failed to shine a light on the matter though. Sadly, that is not uncommon in these cases. A targeted search performed directly on the networking devices in combination with a review of relevant firewall rules and routing policies by a network engineer is often needed to root out the issue.

The cluster had been running without any changes or issues for quite some time, but a similar problem had occurred at least once before. Last time it was due to a network change, and we knew that a change to parts of the network infrastructure had recently been performed. But still, something smelled fishy. And as we could not agree on where the smell came from, we chose to analyse a bit more before we summoned the network people.

The funny thing was that communication between node 2 and node 3 was working. One would think that the problem should be located on the interlink between room A and B, but if that was the case, why did it only affect node 1? We triggered a restart of the cluster service on node 3. The result was that the cluster, and thereby the AOAG listener and databases, went down, quorum was re-arbitrated, and node 1 was kicked out. The FCI and AOAG primary failed over to node 2, node 3 joined the cluster and began to sync changes to the databases, and node 1 was offline.

So, the hunt was refocused. This time we were searching diligently for something wrong on node 1. Another validation report was triggered, this time not just for networking. It revealed several interesting things, of which two became crucial to solving the problem.

1: Cluster networking was now not working at all on node 1, and as a result network validation failed completely. Quote from the validation report:

An error occurred while executing the test.
There was an error initializing the network tests.

There was an error creating the server side agent (CPrepSrv).

Retrieving the COM class factory for remote component with CLSID {E1568352-586D-43E4-933F-8E6DC4DE317A} from machine Node 1 failed due to the following error: 80080005 Node 1.

2: There was a pending reboot. On all nodes. Quote:

The following servers have updates applied which are pending a reboot to take effect. It is recommended to reboot the servers to complete the patching process.
Node 1
Node 2
Node 3

Now, it is important to note that patching and installation of software updates on these servers is very tightly regulated due to SLA concerns. Such work should always end with a reboot, and there are fixed service windows to adhere to, with the exception of emergency updates of critical patches. No such updates had been applied recently.

Rummaging around in the registry, two potential culprits were discovered: Microsoft XPS print spooler drivers and Microsoft Edge browser updates. Now, why these updates were installed, and why they should kill DCOM and by extension failover clustering, I do not know. But they did. After restarting node 1, everything started working as expected again. Findings in the application installation log would indicate that Microsoft Edge was the problem, but this has not been verified. It does however make more sense than the XPS print spooler.

Solution

If you have this problem and you believe that the network/firewall/routing is not the issue, run a cluster validation report and look for pending reboots. You will find them under “Validate Software Update Levels” in the “System Configuration” section.

If you have some pending reboots, restart the nodes one by one. The problem should vanish.

If you do not have any pending reboots, try rebooting the nodes anyway. If that does not help, hunt down someone from networking and have them look for traffic. Successful traffic is usually not logged, but you can turn it on temporarily. Capturing the network traffic on all of the nodes and running a Wireshark analysis would be my next action point if the networking people are unable to find anything.
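
If you prefer PowerShell over the GUI, here is a minimal sketch of both checks. The registry keys below only cover servicing and Windows Update, so treat the reboot check as quick-and-dirty rather than exhaustive.

# Run the validation categories that include "Validate Software Update Levels"
Test-Cluster -Include "Inventory", "Network", "System Configuration"

# Quick pending-reboot check on the local node
$rebootKeys = @(
    "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Component Based Servicing\RebootPending",
    "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\WindowsUpdate\Auto Update\RebootRequired"
)
if ($rebootKeys | Where-Object { Test-Path $_ }) {
    "Reboot pending on $env:COMPUTERNAME"
} else {
    "No pending reboot detected on $env:COMPUTERNAME"
}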


MOMcertimport.exe not found

Scenario

  • You have a computer that is monitored using System Center Operations Manager (SCOM).
  • This computer is located outside of your normal AD structure, and as such is relying on certificate authentication. It could be located in:
    • The cloud
    • A DMZ
    • A disjointed domain
    • A super secret location with way too many firewalls
    • All of the above
  • The certificate or part of the certificate chain has expired and needs replacing
  • You are unable to run the MOMCertimport.exe tool that registers the certificate with the SCOM agent.

Solution

Note: I will assume that you have already created and installed a valid certificate on the computer in the correct way. In short:

  • Into the local computer certificate store
  • Including all root and intermediate certificates needed
  • And the private key for the certificate

Now, to make use of said certificate we would normally run MOMCertimport.exe. It is a tool located on the SCOM installation media that is written for the express purpose of informing the SCOM Agent as to which certificate it is supposed to use for communicating with the rest of the SCOM infrastructure, usually a gateway server. But maybe you do not have access to it? Or maybe, just maybe the computer in question is considered so secure that getting approval for using a tool like that will take weeks or even months?

Regedit to the rescue!

You will need the following information:

  • The certificate thumbprint
  • The certificate serial number

Action plan

If any details of this plan are unclear or confusing to you, seek assistance before you start.

  • Open regedit
  • Navigate to HKLM\Software\Microsoft\Microsoft OperationsManager\3.0\Machine Settings
  • Look for the ChannelCertificateSerialNumber value. If it does not exist, create it as a binary value.
  • Input the binary value in reverse. That is, if your serial number is AF 3C 56, input 56 3C AF. The pairs of numbers each represent a byte in hexadecimal format. Do not reverse the hex numbers, only the byte order as shown above.
  • Double check the numbers
  • Look for the ChannelCertificateHash value. If it does not exist, create it as a string value.
  • Input the certificate thumbprint into this field. This time, do not reverse the bytes. Also, remove any spaces. That is, input 99 df a3 as 99dfa3. Use lower case for the letters a through f. The thumbprint will usually be listed in lower case, whereas the binary value above will be listed in upper case.
  • Again, double check the numbers
  • Restart the Microsoft Monitoring Agent service
  • Look for event id 20053 in the Operations Manager event log, confirming that the certificate was valid. An invalid certificate will result in event id 20066.
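
If you would rather script the change than click through regedit, the following PowerShell sketch performs the same steps. The key path is taken verbatim from the action plan above, and the serial number and thumbprint are placeholders that must be replaced with the values from your own certificate.

# Placeholders - replace with your certificate's serial number and thumbprint (no spaces)
$serialNumber = "AF3C56"
$thumbprint   = "99dfa3"

$key = "HKLM:\Software\Microsoft\Microsoft OperationsManager\3.0\Machine Settings"

# Build the serial number as a byte array in reversed byte order (AF 3C 56 becomes 56 3C AF)
$bytes = @()
for ($i = $serialNumber.Length - 2; $i -ge 0; $i -= 2) {
    $bytes += [Convert]::ToByte($serialNumber.Substring($i, 2), 16)
}

New-ItemProperty -Path $key -Name ChannelCertificateSerialNumber -Value ([byte[]]$bytes) -PropertyType Binary -Force | Out-Null
New-ItemProperty -Path $key -Name ChannelCertificateHash -Value $thumbprint.ToLower() -PropertyType String -Force | Out-Null

# Restart the Microsoft Monitoring Agent (service name HealthService) and check the
# Operations Manager log for event id 20053
Restart-Service HealthService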


AOAG: Local disks are set offline

Problem

After a reboot, the local disks that are not the boot disk are offline. Disk manager reports the following status:

THE DISK IS OFFLINE BECAUSE OF POLICY SET BY AN ADMINISTRATOR

The SQL Server instance fails as the drives containing the database files are offline.

Information about the system where this fault was detected:

  • SQL Server 2019
  • Windows Server 2022
  • Three nodes
  • One node is a stand-alone AOAG replica with local storage
  • Two nodes form an AOFCI instance using shared SAN storage
  • The AOFCI instance is participating in an AOAG together with the third node
  • Multiple subnets are in use
  • Most disks are mounted to a folder, not a drive letter
  • Intel Xeon gold
  • Physical servers made in 2021/22

After setting the disks online and restarting the node, the drives are online and the SQL Server instance starts. Subsequent reboots do not reveal a pattern. Sometimes all drives are offline, sometimes half of the drives are offline.

Analysis

SAN policy

The policy referenced in the message is probably the SAN policy from diskpart:

The alternatives are Offline Shared (the default), Online All, Offline All and Offline Internal. Offline Shared sets all shared storage as offline by default, and it has to be brought online. Usually that will be the cluster service changing the state of shared drives in accordance with the state of cluster resources. If you ask your not-so-friendly search engine and spy, you will find a lot of references telling you to just change the policy to Online All. In this case, that would probably be OK, but it is not without risk: if you end up mounting a shared disk on multiple nodes of an AOFCI cluster, you may end up in a sad world of disk corruption. However, the node with the problem is not connected to a SAN or other forms of shared storage and would handle Online All without problems.
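
You can check the current policy with the san command in diskpart or, as sketched below, with the Storage module. Treat the cmdlet names as an assumption on my part and verify them in your own environment before changing anything.

# Show the current SAN policy (the same setting diskpart reports with "san")
Get-StorageSetting | Select-Object NewDiskPolicy

# Only change the policy if you are certain the node has no shared storage attached
# Set-StorageSetting -NewDiskPolicy OnlineAll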

Disk signatures

A look in the failover cluster validation report reveals that the cluster service identifies all the “problem disks” as eligible for failover cluster validation:

Looking further down, the drives are identified as only existing on one node. This is important, as different scenarios may create local drives with the same signature on multiple nodes. This is especially a problem on virtual machines and when using cloning software to install physical machines. If duplicate disk signatures had existed in the cluster, the disks would have been validated, and failover clustering would have tried to add them to the cluster.

Luckily that was not the case here. All the local drives had a unique signature:
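
If you want to compare disk signatures yourself instead of digging through the validation report, here is a quick PowerShell sketch. Run it on each node and compare the output; the FailoverClusters module is needed for the last command.

# List local disk signatures and GUIDs on this node
Get-Disk | Select-Object Number, FriendlyName, Signature, Guid, IsOffline

# List disks the cluster considers eligible for clustering
Get-ClusterAvailableDisk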

Add all eligible storage to the cluster

When you add a node to an existing cluster or form a new cluster, the cluster wizard will add all eligible storage to the cluster by default.

Your not-so-friendly search engine will list numerous reports of SQL Server disks disappearing when someone is building an AOAG cluster and forgets to uncheck this option. Whether or not that was the case here is unknown. What I do know from the validation reports is that the drives were not formatted when the node with local storage was added to the cluster. Anyway, the solution reported by many internet patrons is to just online the drives in disk manager and restart/start SQL Server. I have yet to find reports of intermittent problems.
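
If you script the node addition instead of using the wizard, you can sidestep this behaviour entirely. A minimal sketch with hypothetical cluster and node names:

# Add the node without adding its eligible storage to the cluster
Add-ClusterNode -Cluster "SQLCLUSTER01" -Name "Node3" -NoStorage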

Hypothesis

After applying the tentative solution listed below I have yet to reproduce the error. That in no way guarantees a solution, especially as I have not been able to determine the root cause with 100% certainty. Maybe not even 50/50. But here goes:

  • The “Add all eligible storage…” option was not unchecked
  • Cluster validation has not been executed since the drives were formatted and SQL Server was installed.
  • The disk controller HPE SR932-p Gen10+ is doing something it should not.
  • The drives are all NVME based but RAID is still being used.
  • Resulting in the disk automount service believing that the local drives are shared.

Tentative solution

I do not know if this is the final solution. I do not know why it worked. I will update if something changes.

As usual, make sure that you understand this plan before you attempt to implement it.

  • Online all disks that are offline
  • Move the “Available Storage” cluster resource group to the problematic node. It does not matter if it is offline.
  • Run a cluster validation with storage validation
  • Make sure that there are no disk signature conflicts in the report.
  • Restart the node
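
The same plan expressed as a rough PowerShell sketch, assuming the FailoverClusters module and a placeholder node name. Review the validation report manually before the final restart.

# Bring offline disks online (double check that none of them are shared disks)
Get-Disk | Where-Object IsOffline | Set-Disk -IsOffline $false

# Move the Available Storage group to the problematic node and validate storage
Move-ClusterGroup -Name "Available Storage" -Node "Node3"
Test-Cluster -Include "Inventory", "Storage", "System Configuration"

# Check the report for disk signature conflicts, then restart the node
Restart-Computer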

Error 87 when installing SQL Server Failover Cluster instance

Problem

When trying to install a SQL Server Failover Cluster instance, the installer fails with error 87. Full error message from the summary log:

Feature: Database Engine Services
Status: Failed
Reason for failure: An error occurred during the setup process of the feature.
Next Step: Use the following information to resolve the error, uninstall this feature, and then run the setup process again.
Component name: SQL Server Database Engine Services Instance Features
Component error code: 87
Error description: The parameter is incorrect.

And from the GUI:

Analysis

This is an error I have not come across before.

The SQL Server instance is kind of installed, but it is not working, so you have to uninstall it and remove it from the cluster to try again. This is rather common when a clustered installation fails though.

As luck would have it, the culprit was easy to identify. A quick look in the cluster log (the one for the SQL Server installation, not the actual cluster log) revealed that the IP address supplied for the SQL Server instance was invalid. The cluster in question was located in a very small subnet, a /26. The IP allocated was .63. A /26 subnet contains 64 addresses. As you may or may not know, the first and last addresses in a subnet are reserved: the first address (.0) is the network address, and the last address (yes, that would be .63) is reserved as the broadcast address. It is also common to reserve the second or second-to-last address for a gateway, which would be .1 or .62 in this case.
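
For reference, the subnet arithmetic as a small PowerShell sketch:

# A /26 network contains 2^(32-26) = 64 addresses.
# Offset 0 is the network address and offset 63 is the broadcast address,
# leaving .1 through .62 as possible host addresses.
$prefixLength = 26
$addressCount = [math]::Pow(2, 32 - $prefixLength)
"{0} addresses, usable host range .1 - .{1}" -f $addressCount, ($addressCount - 2)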

Snippet from the log:

Solution

Allocate a different IP address. In our case that meant moving the cluster to a different subnet, as the existing subnet was filled to the brim.

Action plan:

  • Replace the IP on node 1
  • Wait for the cluster to arbitrate using the heartbeat VLAN or direct attach crossover cable.
  • Add an IP to the Cluster Group resource group and give it an address in the new subnet.
  • Bring the new IP online
  • Replace the IP on node 2
  • Wait for the cluster to achieve quorum
  • Remove the old IP from the Cluster Group resource group
  • Validate the cluster
  • Re-try the SQL Server Failover Cluster instance installation.
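
For the cluster IP steps, a hedged PowerShell sketch. The resource name, address and subnet mask are placeholders, not values from this cluster, and you should add the new IP to the cluster name resource's OR dependency before removing the old one.

# Add a new IP Address resource to the Cluster Group and point it at the new subnet
Add-ClusterResource -Name "Cluster IP (new subnet)" -Group "Cluster Group" -ResourceType "IP Address"
Get-ClusterResource "Cluster IP (new subnet)" | Set-ClusterParameter -Multiple @{
    Address = "192.0.2.10"; SubnetMask = "255.255.255.192"; EnableDhcp = 0 }
Start-ClusterResource "Cluster IP (new subnet)"

# When the new address is online and the cluster name resource depends on it,
# the IP resource for the old subnet can be removed:
# Remove-ClusterResource "Cluster IP Address" -Force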

Thor is messing with my UPS

Or: Why are my battery status LEDs blinking all the time?

This post is related to a post from 2017 titled The Tale of Thor’s angry electrons. Related in that it takes place at the same location in the western part of Norway where my family lives. Most of the equipment referenced in the previous post has been replaced by now. Most notably, the old ADSL internet line has been replaced by a long-distance fiber line. That reduced the number of internet outages considerably. The power line and transformers were also updated at some point. I do not remember if this was before or after the incident chronicled in 2017, but it gave a massive improvement in power delivery. Multi-day outages were not uncommon during the rainy season. And in this part of Norway, the rainy season never ends, unless it is replaced by a short-lived snowstorm or a massive heat-wave, that is, a couple of days with temperatures above 20 degrees C.

But back to the internet. There will be no internet without power. But wait, we have mobile phones and laptops, I hear you say. Well, the mobile phone talks to a base station. This base station requires power to operate. So even if we have a handy dead dinosaur converter that creates enough electricity to keep the fish frozen and the laptops charged, without power to the base station and the local internet distribution point there will not be any internet. Neither the magic floating wireless internet nor the more traditional and stable wired variety coming out of the wall.

As you would know if you have read the previous chronicle, we have employed several measures to ensure a stable Internet connection (and power delivery). One of those measures is an APC SmartUPS 1500. It makes sure that the core network components receive clean power, and it provides backup power for at least 30 minutes. As a line-interactive UPS it is definitely a massive overkill for a residential building, but it has done everything asked of it without complaining since it was made in 2008.

I chose the APC SmartUPS series because I have only ever seen one that was utterly destroyed. It was connected to a network switch in the engine compartment of a massive cargo ship and had been subjected to “a small amount of water”. It still tried its best though. It didn’t care that the batteries had expanded inside the battery compartment and had to be removed using a crowbar and a hazard suit. Fitted with a new-ish battery it provided output, but the charging circuit was destroyed. Sadly no pictures, this was a long time ago.

Continue reading “Thor is messing with my UPS”

Quorum witness is online but does not work

Problem

The cluster appears to be working fine, but every 15 minutes or so the following events are logged on the node that owns the quorum witness disk:

Source:        Microsoft-Windows-Ntfs
Event ID:      98
Level:         Information
Description:
Volume WitnessDisk: (\Device\HarddiskVolumeNN) is healthy.  No action is needed.


Event ID:      1558
Source:        Microsoft-Windows-FailoverClustering
Level:         Warning
Description:
The cluster service detected a problem with the witness resource. The witness resource will be failed over to another node within the cluster in an attempt to reestablish access to cluster configuration data.


Log Name:      System
Event ID:      1069
Level:         Error
Description:
Cluster resource 'Witness' of type 'Physical Disk' in clustered role 'Cluster Group' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

Analysis

Some digging in the event log identified a disk error incident during a failover of the virtual machine:

Log Name:      System
Event ID:      1557
Level:         Error
Description:
Cluster service failed to update the cluster configuration data on the witness resource. Please ensure that the witness resource is online and accessible.


Log Name:      System
Source:        Microsoft-Windows-Ntfs
Event ID:      140
Description:
The system failed to flush data to the transaction log. Corruption may occur in VolumeId: WitnessDisk:, DeviceName: \Device\HarddiskVolumeNN.
({Device Busy}
The device is currently busy.)

And ultimately

Log Name:      System
Source:        Ntfs
Level:         Warning
Description:
{Delayed Write Failed} Windows was unable to save all the data for the file . The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.

It appears that the witness disk had a non-responsive period during the failover of the VM that caused an update to the cluster database to fail, thus rendering the copy of the cluster database contained on the witness disk corrupt. The disk itself is fine, thus there are no faults in the cluster resource status; everything appears hunky dory. There could be other causes leading to the same situation, but in this case the issue correlates with a VM failover.

We need to replace the defective database with a fresh copy from one of the nodes.

Solution

The usual warning: If this procedure is new to you, seek help before attempting to do this in production. If your cluster has other issues, messing with the quorum setup may land you in serious trouble. And if you have any doubts whatsoever about the integrity of the drive/LUN, replace it with a new one.

Warnings aside, this procedure is usually safe, and as long as the cluster is otherwise healthy you can do this live without scheduling downtime.

Action plan

  • Remove the quorum witness from the cluster.
  • Check that the disk is listed as available storage and online.
  • Take ownership of the defective “cluster” folder on the root of the quorum witness drive.
  • Rename it to “oldCluster” in case we need to extract some data.
  • Add the disk back as a quorum witness
  • Wait to check that the error messages do not re-appear.
  • If they do re-appear
    • Order a new LUN
    • Add it to the cluster
    • Use the new LUN as a quorum witness
    • Remove the old LUN from the cluster
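
The quorum changes themselves can be scripted; a minimal sketch, assuming the witness disk resource is called "Witness" as in the events above and that the FailoverClusters module is loaded:

# Remove the disk witness; the cluster runs without a witness while we work
Set-ClusterQuorum -NoWitness

# Take ownership of and rename the Cluster folder on the witness drive, then add the disk back
Set-ClusterQuorum -DiskWitness "Witness"

# Watch the event log to verify that events 1557 and 1069 do not re-appear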

SMBv3.1.1 disconnects and fails to reconnect on Windows 10

Be warned: This will be a long one with a lot of text and few images. I never planned on doing a write-up on this issue, so I did not take a lot of pictures.

I have been troubleshooting this issue on and off for two years, and I was on the brink of giving up several times. I pride myself on finding solutions where others only find stress and hair-loss, and do so routinely, but sadly there are still nuts I cannot crack. This issue was believed to be such a nut. But I was wrong. The solution had been staring me straight in the eyes for quite some time, but we must not get ahead of ourselves. Let us start at the beginning.

Problem

SMB sessions are invalidated, such that it is impossible to reconnect. This happens only on Windows 10 clients; Windows 7 and 8 clients running SMBv2.* can still reconnect as normal.

User story:

  • The user opens a file explorer window and navigates to a folder on a fileserver containing documents the user wants to read and/or edit.
  • This works without issue 100% of the time as long as the client computer has a network connection to the file server.
  • After a period of inactivity the SMB session is suspended. The user does not detect this, everything is still ok.
  • Some time later, the user will either
    • Try to save a file
    • Try to open a new file using the same File Explorer window
  • Possible outcomes
    • Everything works as expected
    • It is impossible to save the file to the server, it has to be saved locally.
    • The File Explorer window is gone. The user has to re-open the window and navigate back to the folder in question.
  • Thus, the user gets annoyed and complains about the stupid Windows 10 upgrade, which is understandable.

Relevant Event IDs: 30807 from SMBClient and 1016 from SMBServer.
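
If you want to track how often this happens on a client, you can pull the events with Get-WinEvent. The log channel below is an assumption on my part; adjust it if event 30807 lives in a different SMBClient channel on your build.

# List SMB client connectivity events with id 30807 from the last 7 days
Get-WinEvent -FilterHashtable @{
    LogName   = "Microsoft-Windows-SmbClient/Connectivity"
    Id        = 30807
    StartTime = (Get-Date).AddDays(-7)
} | Select-Object TimeCreated, Id, Message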

Continue reading “SMBv3.1.1 disconnects and fails to reconnect on Windows 10”

OS layer initialization failed while updating an Intel X710

Problem

When trying to update the NVM/firmware on an Intel X710 SFP+ network interface card running on Windows, the process is interrupted with the error message below:

OS layer initialization failed

This card was made by Intel and mounted in an HP ProLiant DL380 Gen9. It was chosen over a genuine HP-approved card due to supply chain issues. I was trying to install version 8.5, the latest available from Intel at the time of writing.

Analysis

There is some component that is not available. I suspect a hardening issue, as someone installed a ginormous amount of hardening GPOs some time ago.

A process monitor trace shows that the tool loads pnputil and tries to install some drivers with varying degrees of success. Specifically, it appears to be looking for iqvsw64e in miscellaneous temporary folders.

It has been a while since last time I read a Procmon output, but as far as I can tell the process is not successful. The files are included with the package and self-identify as an “Intel Network Adapter Diagnostic Driver”.

Hypothesis: the NVM updater needs the diagnostic drivers to communicate with the adapter, but something blocks the installation.

I do not have access to test this, but I am pretty sure that there is a GPO blocking the installation. I tested the previous version from 2018 that had been installed successfully on the server, and it now fails with the exact same error. The next step would be to start disabling hardening GPOs, but as I do not have the access to do that directly on this server, I gave up and started looking for a workaround. Some hours later I found one.

Workaround

As per usual, if you do not fully understand the consequences of the steps in the action plan below, seek help. This could brick your server, which is a nuisance when you have 20, but a catastrophe if you have only one and the support agreement has expired.

Prerequisites

  • HP ProLiant DL380 Gen9 (should work on all currently supported HPE ProLiant DL series servers).
  • Windows Server 2016. Probably compatible with 2012-2022, but I have yet to test that.
  • Other HP-approved Intel SFP+ network adapters mounted in the same server, in my case cards equipped with Intel 560-series chips. Could work with other Intel adapters as well.
  • A copy of a not too old Service Pack for ProLiant (SPP) ISO for your server or similar. The SPP for DL380 Gen10 has been tested, and I can confirm that it works for this purpose even though it will refuse to install any SPP updates.
  • A valid HPE support contract is recommended
  • A towel is recommended.
  • A logged-in ILO remote console or local console access.
  • The local admin password for the server.

Action plan

NB: If you are not planning an SPP update as part of this process, or if you are unable to obtain one, see the update below for an alternative approach. You need an active support contract to download SPP packages, but individual cp packages are available.

  • Install the Intel drivers that correspond with your firmware version. Preferably both should be the latest available.
  • Be aware that this will disconnect your server from the network if all your adapters are Intel adapters. Usually only temporarily, but if it does not come back online, you may have to reconfigure your network adapters using ILO or a local console.
  • Reboot the server.
  • Extract the NVM update
  • In an administrative command shell, navigate to the Winx64 folder.
  • Try running nvmupdatew64e.exe and verify the error message.
  • Mount the SPP iso.
  • Run launch_sum.bat from the iso as admin.
  • In the web browser that appears, accept the cert error and start a localhost guided update:
  • While the inventory is running, switch back to the command shell and keep trying to start the nvm update.
  • This will take some time, so do not give up and remember your towel.
  • Suddenly, this will happen:
  • Update all cards that have an update available.
  • Reboot the server. You may complete the SPP update before you reboot.

Update: an alternative method

As the HP SPP is a fairly large download to haul around, I kept looking for a more lightweight workaround. If you are going to install an SPP anyway, using it makes sense, but if you are using other methods for patching your servers it is a bit overkill to use a 10GB ISO to install a 70kB temporary driver. Thus a new plan was hatched.

  • Instead of the SPP iso, get a hold of the cpnnnn update package for your HP-approved Intel-based network card. For my x560 card, the current cp is cp047026.
  • Extract the files to a new folder. I have not tested whether or not it is possible to extract the package without the card being installed, but it appears to be a branded winzip self-extractor or similar so I expect it to work.
  • Inside your new folder you will find a file called ofvpdfixW64e.exe. Run it from an administrative command shell.
  • Wait for it to finish scanning for adapters.
  • You should now be able to start nvmupdatew64e.exe and upgrade your X710.

As we can see, the tool detects both the HP approved and Intel original adapters. The tool is designed for a different purpose, but that is not important. All we need is a tool that will load the diagnostic drivers and thus enable our Intel updater to function. The package also contains rfendfixW64e.exe, another fixup tool that will load the driver. The HP branded firmware update tool (HPNVMUpdate.exe) may also load the driver in some scenarios. I guess what I am saying is try all of them if one is not working. And make sure to wait for the scan to complete before you try starting nvmupdatew64e.exe.

Also, make sure to install the PROSet drivers. I have had trouble getting this to work without them.

Why this works

The HP SPP is using a branded version of the Intel NVM updater. This updater is using the same driver mentioned above, at least for the 560-series of chips. It is running in a different host process and is thus able to circumvent the hardening that blocks the installation of the driver from the intel tool directly. When the SPP inventory process is querying Intel network adapters, the driver is loaded and keeps running until you reboot the server. You may be able to get this working even without any other Intel adapters, but I have not tested this scenario. It all depends on how the SPP inventory process runs.

Verify the result

You can verify the result using the Intel-supplied powershell commandlets. They are installed together with the PROSet driver package. You activate them by running this command:

Import-Module -Name "C:\Program Files\Intel\Wired Networking\IntelNetCmdlets\IntelNetCmdlets"

And you list the NVM versions by running the next command. Be aware that HP branded adapters may not respond to this command and will be listed as not supported. These commands may be relatively slow to respond; this is normal.

Get-IntelNetAdapter | ft -Property Name, DriverVersion, ETrackID, NVMVersion

LIND CF-LNDDC120 Toughbook car adapter: Replacing the 12V socket

Problem

The 12VDC plug is broken, and the connection is unstable. Thus, the power to the ToughBook is intermittent, especially when driving.

Broken plug

The broken plug is shown in the image above, and the spare part is shown below. We can clearly see that something is missing in the middle. The plug was working, provided you held the cable at a certain angle.

Lumberg 1613 11

Solution

The solution should be quite simple. Basic soldering skills and a relatively high-powered soldering iron are required, plus some other basic tools of course, but this is not rocket surgery. The problem was, at least in my case, acquiring the spare part. You see, I live up north in the frozen plains of Norway, and this part appears to be made of at least 25% unobtanium. As the picture above shows, the two sockets on this device have opposing genders. The internet is fairly divided as to which one is male and which is female, making the search for the part number an epic battle worthy of its own post. Suffice it to say, I am unable to find one that is not produced by Lumberg, the company that is the OEM as far as my research shows. And do you think a Norwegian Lumberg stockist that stocks this particular part exists? And one that is willing to sell me one, or maybe in a pinch five pieces, for a non-extortionate price? No way.

Thus I am left with importing one myself. Due to recent tax walls set up by the supposedly conservative government in cahoots with the local socialists and communists, this will be expensive. What would previously have been a 5USD part in a 10USD envelope became a 35USD ordeal. A new power supply is around 150USD, so still feasible but not cheap. Had the part been of a more standard variety, Winnie the Pooh and his cohorts in Candy Mountain would gladly have sold me a bag of 20pcs for that price and smuggled it into the country.

But enough ranting. A couple of weeks later the part finally arrived. It was time to don the HEV-suit and venture into the plague-ridden frozen wastelands and look for the post office. And I was lucky, the package was actually ready for pickup at the location listed in the tracking data. My local post office was destroyed in the great Norwegian postal wars of 2019, so I sometimes have to traverse the city several times to hunt down a package. For reference, the Lumberg part number is 1613 11. Yes, there is a space in the part number, and it is supposed to be there.

Some tips for the soldering work

  • Use a large flat screwdriver to open the case. Be patient and you will avoid breakage. Or use glue afterwards.
  • This is power electronics. That means big terminals.
  • The terminals are bent on the back of the PCB.
  • The terminals are covered with some kind of conformal coating that makes it difficult to heat up the solder. Scrape it off or use high heat + flux.
  • You should have a 60W+ soldering iron.
  • I found it easiest to just cut apart the old socket with a pair of snips. Just make sure to leave enough of the metal parts behind that you have something to grab and pull at while you are heating it up.
  • Some kind of solder removal equipment is necessary to remove enough solder to allow the new part to fit through the holes.
  • Make sure that the new part is soldered snugly against the PCB, otherwise it will not fit through the hole in the case.

Pictures

Use a flat head screwdriver to open the case
End cap removed
These pins are the ones to remove
Ready for soldering

List VM Networks in SCVMM

To list the connected networks and subnets for each VM running in VMM, run the following in a VMM-connected PowerShell window:

Get-VM | Select-Object -ExpandProperty VirtualNetworkAdapters |
    Select-Object Name, VMNetwork, VMSubnet, IPV4Subnets, IPv4Addresses |
    Sort-Object VMNetwork | Format-Table