Node unable to join cluster because the cluster db is out of date

Scenario

  • Complex 3-node AOAG cluster
  • 2 nodes in room A form a SQL Server failover cluster instance (FCI)
  • 1 node in room B is a stand-alone instance
  • The FCI and node 3 form an always on availability group (AOAG)
  • All nodes are in the same Windows failover cluster
  • All nodes run Windows Server 2022

Problem

The problem was reported as follows: Node 3 is unable to join the cluster, and the AOAG sync has stopped. A look at the Cluster Events tab in Failover Cluster Manager revealed the following error being repeated, FailoverClustering Event ID 5398:

Cluster failed to start. The latest copy of cluster configuration data was not available within the set of nodes attempting to start the cluster. Changes to the cluster occurred while the set of nodes were not in membership and as a result were not able to receive configuration data updates.

And also FailoverClustering Event ID 1652:

Cluster node 'Node 3' failed to join the cluster. A UDP connection could not be established to node(s) 'Node 1'. Verify network connectivity and configuration of any network firewalls

Analysis

Just looking at the error messages listed, one might be inclined to believe that something is seriously wrong with the cluster. Cluster database paxos (version) tag mismatch problems often lead to having to evict and re-join the complaining nodes. But experience, especially with multi-room clusters, has taught me that this is seldom necessary. The cluster configuration database issue is just a symptom of the underlying network issue. What it is trying to say is that the consistency of the database across the nodes cannot be verified, or that one of the nodes is unable to download the current version from one of the other nodes, perhaps due to an even number of active nodes or not enough nodes online to form a majority.

A cluster validation run (Network only) was started, and indeed, there was a complete breakdown in communication between node 1 and 3. In both directions. Quote from the validation report:

Node 1 is not reachable from node 3. It is necessary that each cluster node can
communicate each other cluster node by a minimum of one network path (though multiple paths
are recommended to avoid a single point of failure). Please verify that existing networks
are configured properly or add additional networks.

If the communication works in one direction, you will only get one such message. In this case, we also have the corresponding message, indicating a two-way issue:

Node 3 is not reachable from node 1. It is necessary that each cluster node can communicate each other cluster node by a minimum of one network path (though multiple paths are recommended to avoid a single point of failure). Please verify that existing networks are configured properly or add additional networks.

Be aware that a two-way issue does not necessarily indicate a problem with more than one node. It does however point to a problem located near or at the troublesome node, whereas a one-way issue points more in the direction of a firewall issue.
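For reference, a network-only validation run like the one described above can be started from PowerShell. A minimal sketch, assuming placeholder node names and report path:

# Sketch: run a network-only cluster validation (node names and report path are placeholders).
Import-Module FailoverClusters
Test-Cluster -Node 'Node1','Node2','Node3' -Include 'Network' -ReportName 'C:\Temp\NetworkValidation'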

Thus, a search of the central firewall log repository was started. It failed to shine a light on the matter though. Sadly, that is not uncommon in these cases. A targeted search performed directly on the networking devices in combination with a review of relevant firewall rules and routing policies by a network engineer is often needed to root out the issue.

The cluster had been running without any changes or issues for quite some time, but a similar problem had occurred at least once before. Last time it was due to a network change, and we knew that a change to parts of the network infrastructure had recently been performed. But still, something smelled fishy. And as we could not agree on where the smell came from, we chose to analyse a bit more before we summoned the network people.

The funny thing was that communication between node 2 and node 3 was working. One would think that the problem should be located on the interlink between room A and B, but if that was the case, why did it only affect node 1? We triggered a restart of the cluster service on node 3. The result was that the cluster, and thereby the AOAG listener and databases, went down, quorum was re-arbitrated, and node 1 was kicked out. The FCI and AOAG primary failed over to node 2, node 3 joined the cluster and began to sync changes to the databases, and node 1 was offline.

So, the hunt was refocused. This time we were searching diligently for something wrong on node 1. Another validation report was triggered, this time not just for networking. It revealed several interesting things, of which two became crucial to solving the problem.

1: Cluster networking was now not working at all on node 1, and as a result network validation failed completely. Quote from the validation report:

An error occurred while executing the test.
There was an error initializing the network tests.

There was an error creating the server side agent (CPrepSrv).

Retrieving the COM class factory for remote component with CLSID {E1568352-586D-43E4-933F-8E6DC4DE317A} from machine Node 1 failed due to the following error: 80080005 Node 1.

2: There was a pending reboot. On all nodes. Quote:

The following servers have updates applied which are pending a reboot to take effect. It is recommended to reboot the servers to complete the patching process.
Node 1
Node 2
Node 3

Now, it is important to note that patching and installation of software updates on these servers is very tightly regulated due to SLA concerns. Such work should always end with a reboot, and there are fixed service windows to adhere to, with the exception of emergency updates of critical patches. No such updates had been applied recently.

Rummaging around in the registry, two potential culprits were discovered: Microsoft XPS print spooler drivers and Microsoft Edge browser updates. Now, why these updates were installed, and why they should kill DCOM and by extension failover clustering, I do not know. But they did. After restarting node 1, everything started working as expected again. Findings in the application installation log would indicate that Microsoft Edge was the problem, but this has not been verified. It does, however, make more sense than the XPS print spooler.

Solution

If you have this problem and you believe that the network/firewall/routing is not the issue, run a cluster validation report and look for pending reboots. You find them under “Validate Software Update Levels” in the “System Configuration” section.
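If you want a quick scripted check in addition to the validation report, the usual registry markers for a pending reboot can be queried directly. A minimal sketch to run on each node; the two registry paths are the standard Component Based Servicing and Windows Update markers:

# Sketch: check the common registry markers for a pending reboot.
$markers = @(
    'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Component Based Servicing\RebootPending',
    'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\WindowsUpdate\Auto Update\RebootRequired'
)
$pending = $markers | Where-Object { Test-Path $_ }
if ($pending) { "Reboot pending:`n" + ($pending -join "`n") } else { 'No pending reboot detected' }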

If you have some pending reboots, restart the nodes one by one. The problem should vanish.

If you do not have any pending reboots, try rebooting the nodes anyway. If that does not help, hunt down someone from networking and have them look for traffic. Successful traffic is usually not logged, but logging can be turned on temporarily. Capturing the network traffic on all of the nodes and running a Wireshark analysis would be my next action point if the networking people are unable to find anything.
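If Wireshark is not readily available on the nodes, the built-in netsh trace facility can capture the traffic for later analysis. A minimal sketch; the file path and size limit are placeholders:

# Sketch: capture network traffic on a node with the built-in tracing facility.
netsh trace start capture=yes tracefile=C:\Temp\node1-capture.etl maxsize=512
# ...reproduce the join attempt, then stop the capture...
netsh trace stop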


MOMcertimport.exe not found

Scenario

  • You have a computer that is monitored using System Center Operations Manager (SCOM).
  • This computer is located outside of your normal AD structure, and as such is relying on certificate authentication. It could be located in:
    • The cloud
    • In a DMZ
    • In a disjointed domain
    • All of the above.
    • In a super secret location with way too many firewalls
  • The certificate or part of the certificate chain has expired and needs replacing
  • You are unable to run the MOMCertimport.exe tool that registers the certificate with the SCOM agent.

Solution

Note: I will assume that you have already created and installed a valid certificate on the computer in the correct way. In short:

  • Into the local computer certificate store
  • Including all root and intermediate certificates needed
  • And the private key for the certificate

Now, to make use of said certificate we would normally run MOMCertimport.exe. It is a tool located on the SCOM installation media that is written for the express purpose of informing the SCOM Agent as to which certificate it is supposed to use for communicating with the rest of the SCOM infrastructure, usually a gateway server. But maybe you do not have access to it? Or maybe, just maybe the computer in question is considered so secure that getting approval for using a tool like that will take weeks or even months?

Regedit to the rescue!

You will need the following information:

  • The certificate thumbprint
  • The certificate serial number

Action plan

If any details of this plan are unclear or confusing to you, seek assistance before you start. The same steps are also sketched as a PowerShell snippet after the list.

  • Open regedit
  • Navigate to HKLM\Software\Microsoft\Microsoft Operations Manager\3.0\Machine Settings
  • Look for the ChannelCertificateSerialNumber value. If it does not exist, create it as a binary value.
  • Enter the serial number with the byte order reversed. That is, if your serial number is AF 3C 56, enter 56 3C AF. Each pair of characters represents one byte in hexadecimal format. Do not reverse the digits within each byte, only the byte order, as shown above.
  • Double check the numbers
  • Look for the ChannelCertificateHash value. If it does not exist, create it as a string value.
  • Enter the certificate thumbprint into this value. This time, do not reverse the bytes, and remove any spaces. That is, enter 99 df a3 as 99dfa3. Use lower case for the hex digits a-f. The thumbprint is usually listed in lower case, whereas the serial number above is usually listed in upper case.
  • Again, double check the numbers
  • Restart the Microsoft Monitoring Agent service
  • Look for event id 20053 in the Operations Manager event log, confirming that the certificate was valid. An invalid certificate will result in event id 20066.
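For reference, here is a minimal PowerShell sketch of the same steps. The thumbprint and serial number below are placeholders; substitute the values from your own certificate, and note that the service name HealthService corresponds to the Microsoft Monitoring Agent:

# Sketch: set the SCOM agent certificate registry values directly.
$thumbprint   = '0123456789abcdef0123456789abcdef01234567'   # placeholder - 40 hex characters, lower case, no spaces
$serialNumber = '00AF3C56' -replace ' ', ''                   # placeholder - serial number as shown on the certificate

$key = 'HKLM:\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Machine Settings'

# Convert the serial number to bytes and reverse the byte order
[byte[]]$bytes = for ($i = 0; $i -lt $serialNumber.Length; $i += 2) {
    [Convert]::ToByte($serialNumber.Substring($i, 2), 16)
}
[array]::Reverse($bytes)

New-ItemProperty -Path $key -Name 'ChannelCertificateSerialNumber' -PropertyType Binary -Value $bytes -Force
New-ItemProperty -Path $key -Name 'ChannelCertificateHash' -PropertyType String -Value $thumbprint.ToLower() -Force

# Restart the agent and check the Operations Manager log for event 20053
Restart-Service -Name HealthService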


Error 87 when installing SQL Server Failover Cluster instance

Problem

When trying to install a SQL Server Failover Cluster instance, the installer fails with error 87. Full error message from the summary log:

Feature: Database Engine Services
Status: Failed
Reason for failure: An error occurred during the setup process of the feature.
Next Step: Use the following information to resolve the error, uninstall this feature, and then run the setup process again.
Component name: SQL Server Database Engine Services Instance Features
Component error code: 87
Error description: The parameter is incorrect.

And from the GUI:

Analysis

This is an error I have not come across before.

The SQL Server instance is kind of installed, but it is not working, so you have to uninstall it and remove it from the cluster before trying again. This is rather common when a clustered installation fails, though.

As luck would have it, the culprit was easy to identify. A quick look in the cluster log (the one for the SQL Server installation, not the actual cluster log) revealed that the IP address supplied for the SQL Server instance was invalid. The cluster in question was located in a very small subnet, a /26. The IP allocated was .63. A /26 subnet contains 64 addresses. As you may or may not know, the first and last addresses in a subnet are reserved. The first address (.0) is the network address, and the last address (yes, that would be .63) is reserved as the broadcast address. It is also common to reserve the second or second-to-last address for a gateway; that would be .1 or .62 in this case.
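The arithmetic spelled out as a quick sketch, with the /26 prefix as the only input:

# Sketch: subnet maths for a /26.
$prefix    = 26
$hostBits  = 32 - $prefix                 # 6 host bits
$blockSize = [math]::Pow(2, $hostBits)    # 64 addresses in the block
$broadcast = $blockSize - 1               # .63 is the broadcast address, .0 is the network address
$usable    = $blockSize - 2               # 62 usable host addresses (.1 - .62)
"{0} addresses, .0 and .{1} reserved, {2} usable hosts" -f $blockSize, $broadcast, $usable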

Snippet from the log:

Solution

Allocate a different IP address. In our case that meant moving the cluster to a different subnet, as the existing subnet was filled to the brim.

Action plan:

  • Replace the IP on node 1
  • Wait for the cluster to arbitrate using the heartbeat VLAN or direct attach crossover cable.
  • Add an IP to the Cluster Group resource group and give it an address in the new subnet (see the PowerShell sketch after this list).
  • Bring the new IP online
  • Replace the IP on node 2
  • Wait for the cluster to achieve quorum
  • Remove the old IP from the Cluster Group resource group
  • Validate the cluster
  • Re-try the SQL Server Failover Cluster instance installation.
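The Cluster Group IP part of the plan can also be scripted. A minimal sketch, assuming the default 'Cluster Name' and 'Cluster IP Address' resource names and placeholder addresses:

# Sketch: add a cluster IP in the new subnet, then retire the old one.
Import-Module FailoverClusters

Add-ClusterResource -Name 'Cluster IP Address New' -ResourceType 'IP Address' -Group 'Cluster Group'
Get-ClusterResource -Name 'Cluster IP Address New' |
    Set-ClusterParameter -Multiple @{ Address = '10.0.1.10'; SubnetMask = '255.255.255.192'; EnableDhcp = 0 }

# Let the cluster name depend on either address while both exist, then bring the new one online
Set-ClusterResourceDependency -Resource 'Cluster Name' -Dependency '[Cluster IP Address] or [Cluster IP Address New]'
Start-ClusterResource -Name 'Cluster IP Address New'

# When the new address is online, make it the only dependency and remove the old resource
Set-ClusterResourceDependency -Resource 'Cluster Name' -Dependency '[Cluster IP Address New]'
Remove-ClusterResource -Name 'Cluster IP Address' -Force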

Thor is messing with my UPS

Or: Why are my battery status LEDs blinking all the time?

This post is related to a post from 2017 titled The Tale of Thor’s angry electrons. Related in that it takes place at the same location in the western part of Norway where my family lives. Most of the equipment referenced in the previous post has been replaced by now. Most notably, the old ADSL internet line has been replaced by a long-distance fiber line. That reduced the number of internet outages considerably. The power line and transformers were also updated at some point. I do not remember if this was before or after the incident chronicled in 2017, but it gave a massive improvement in power delivery. Multi-day outages were not uncommon during the rainy season. And in this part of Norway, the rainy season never ends, unless it is replaced by a short-lived snowstorm or a massive heat wave, that is, a couple of days with temperatures above 20 degrees C.

But back to the internet. There will be no internet without power. But wait, we have mobile phones and laptops, I hear you say. Well, the mobile phone talks to a base station. This base station requires power to operate. So even if we have a handy dead-dinosaur converter that creates enough electricity to keep the fish frozen and the laptops charged, without power to the base station and the local internet distribution point there will not be any internet. Neither the magic floating wireless internet nor the more traditional and stable wired variety coming out of the wall.

As you would know if you have read the previous chronicle, we have employed several measures to ensure a stable Internet connection (and power delivery). One of those measures is an APC SmartUPS 1500. It makes sure that the core network components receive clean power, and it provides backup power for at least 30 minutes. As a line-interactive UPS it is definitely massive overkill for a residential building, but it has done everything asked of it without complaining since it was made in 2008.

I chose the APC SmartUPS series because I have only ever seen one that was utterly destroyed. It was connected to a network switch in the engine compartment of a massive cargo ship and had been subjected to “a small amount of water”. It still tried its best though. It didn’t care that the batteries had expanded inside the battery compartment and had to be removed using a crowbar and a hazard suit. Fitted with a new-ish battery it provided output, but the charging circuit was destroyed. Sadly no pictures, this was a long time ago.

Continue reading “Thor is messing with my UPS”

Quorum witness is online but does not work

Problem

The cluster appears to be working fine, but every 15 minutes or so the following events are logged on the node that owns the quorum witness disk:

Source:        Microsoft-Windows-Ntfs
Event ID:      98
Level:         Information
Description:
Volume WitnessDisk: (\Device\HarddiskVolumeNN) is healthy.  No action is needed.


Event ID:      1558
Source:        Microsoft-Windows-FailoverClustering
Level:         Warning
Description:
The cluster service detected a problem with the witness resource. The witness resource will be failed over to another node within the cluster in an attempt to reestablish access to cluster configuration data.


Log Name:      System
Event ID:      1069
Level:         Error
Description:
Cluster resource 'Witness' of type 'Physical Disk' in clustered role 'Cluster Group' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

Analysis

Some digging in the event log identified a disk error incident during a failover of the virtual machine:

Log Name:      System
Event ID:      1557
Level:         Error
Description:
Cluster service failed to update the cluster configuration data on the witness resource. Please ensure that the witness resource is online and accessible.


Log Name:      System
Source:        Microsoft-Windows-Ntfs
Event ID:      140
Description:
The system failed to flush data to the transaction log. Corruption may occur in VolumeId: WitnessDisk:, DeviceName: \Device\HarddiskVolumeNN.
({Device Busy}
The device is currently busy.)

And ultimately

Log Name:      System
Source:        Ntfs
Level:         Warning
Description:
{Delayed Write Failed} Windows was unable to save all the data for the file . The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.

It appears that the witness disk had a non-responsive period during the failover of the VM, which caused an update to the cluster database to fail, thus rendering the copy of the cluster database contained on the witness disk corrupt. The disk itself is fine, so there are no faults in the cluster resource status; everything appears hunky dory. There could be other causes leading to the same situation, but in this case the issue correlates with a VM failover.

We need to replace the defective database with a fresh copy from one of the nodes.

Solution

The usual warning: If this procedure is new to you, seek help before attempting to do this in production. If your cluster has other issues, messing with the quorum setup may land you in serious trouble. And if you have any doubts whatsoever about the integrity of the drive/LUN, replace it with a new one.

Warnings aside, this procedure is usually safe, and as long as the cluster is otherwise healthy you can do this live without scheduling downtime.

Action plan

  • Remove the quorum witness from the cluster (a PowerShell sketch of the witness swap follows after this list).
  • Check that the disk is listed as available storage and online.
  • Take ownership of the defective “cluster” folder at the root of the quorum witness drive.
  • Rename it to “oldCluster” in case we need to extract some data.
  • Add the disk back as a quorum witness
  • Wait and check that the error messages do not re-appear.
  • If they do re-appear
    • Order a new LUN
    • Add it to the cluster
    • Use the new LUN as a quorum witness
    • Remove the old LUN from the cluster
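For reference, the witness removal and re-add can be done with PowerShell. A minimal sketch, assuming the witness disk resource is named 'Witness' as in the events above; the drive letter is a placeholder:

# Sketch: drop the disk witness, rename the old Cluster folder, then re-add the witness.
Import-Module FailoverClusters

Set-ClusterQuorum -NodeMajority            # removes the witness; the disk moves to Available Storage

# From the node that owns the disk, rename the old folder (drive letter is a placeholder):
# Rename-Item -Path 'W:\Cluster' -NewName 'oldCluster'

Set-ClusterQuorum -DiskWitness 'Witness'   # re-adds the witness; a fresh copy of the cluster database is written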

SMBv3.1.1 disconnects and fails to reconnect on Windows 10

Be warned: This will be a long one with a lot of text and few images. I never planned on doing a write-up on this issue, so I did not take a lot of pictures.

I have been troubleshooting this issue on and off for two years, and I was on the brink of giving up several times. I pride myself on finding solutions where others only find stress and hair loss, and I do so routinely, but sadly there are still nuts I cannot crack. This issue was believed to be such a nut. But I was wrong. The solution had been staring me straight in the eyes for quite some time, but we must not get ahead of ourselves. Let us start at the beginning.

Problem

SMB sessions are invalidated, such that it is impossible to reconnect. This happens only on Windows 10 clients; Windows 7 and 8 clients running SMBv2.* can still reconnect as normal.

User story:

  • The user opens a file explorer window and navigates to a folder on a fileserver containing documents the user wants to read and/or edit.
  • This works without issue 100% of the time as long as the client computer has a network connection to the file server.
  • After a period of inactivity, the SMB session is suspended. The user does not notice this; everything is still ok.
  • Some time later, the user will either
    • Try to save a file
    • Try to open a new file using the same File Explorer window
  • Possible outcomes
    • Everything works as expected
    • It is impossible to save the file to the server; it has to be saved locally.
    • The File Explorer window is gone. The user has to re-open the window and navigate back to the folder in question.
  • Thus, the user gets annoyed and complains about the stupid Windows 10 upgrade, which is understandable.

Relevant Event IDs: 30807 from SMBClient and 1016 from SMBServer.
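A quick way to check whether a client is affected is to query the event log for that ID. A minimal sketch; the provider name is how the events are labelled on my systems, adjust if yours differs:

# Sketch: look for the SMB client disconnect events mentioned above.
Get-WinEvent -FilterHashtable @{ ProviderName = 'Microsoft-Windows-SMBClient'; Id = 30807 } -MaxEvents 50 |
    Select-Object TimeCreated, Id, Message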

Continue reading “SMBv3.1.1 disconnects and fails to reconnect on Windows 10”

OS layer initialization failed while updating an Intel X710

Problem

When trying to update the NVM/firmware on an Intel X710 SFP+ network interface card running on Windows, the process is interrupted with the error message below:

OS layer initialization failed

This card was made by Intel and mounted in an HP ProLiant DL380 Gen9. It was chosen over a genuine HP-approved card due to supply chain issues. I was trying to install version 8.5, the latest available from Intel at the time of writing.

Analysis

Some component that is needed is not available. I suspect a hardening issue, as someone installed a ginormous number of hardening GPOs some time ago.

A Process Monitor trace shows that the tool loads pnputil and tries to install some drivers with varying degrees of success. Specifically, it appears to be looking for iqvsw64e in miscellaneous temporary folders.

It has been a while since I last read a Procmon output, but as far as I can tell the process is not successful. The files are included with the package and self-identify as an “Intel Network Adapter Diagnostic Driver”.

Hypothesis: the NVM updater needs the diagnostic drivers to communicate with the adapter, but something blocks the installation.

I do not have access to test this, but I am pretty sure that there is a GPO blocking the installation. I tested the previous version from 2018 that had been installed successfully on the server, and it now fails with the exact same error. The next step would be to start disabling hardening GPOs, but as I do not have the access to do that directly on this server, I gave up and started looking for a workaround. Some hours later I found one.
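Before disabling GPOs, it is worth checking whether the diagnostic driver is registered and loaded at all. A minimal sketch; the driver name is taken from the Procmon trace above:

# Sketch: check whether the Intel diagnostic driver is registered and running.
Get-CimInstance -ClassName Win32_SystemDriver |
    Where-Object { $_.Name -like 'iqv*' } |
    Select-Object Name, State, StartMode, PathName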

Workaround

As per usual, if you do not fully understand the consequences of the steps in the action plan below, seek help. This could brick your server, which is a nuisance when you have 20, but a catastrophe if you have only one and the support agreement has expired.

Prerequisites

  • HP ProLiant DL380 Gen9 (should work on all currently supported HPE ProLiant DL series servers).
  • Windows Server 2016. Probably compatible with 2012-2022, but I have yet to test that.
  • Other HP-approved Intel SFP+ network adapters mounted in the same server, in my case cards equipped with Intel 560-series chips. Could work with other Intel adapters as well.
  • A copy of a not-too-old Service Pack for ProLiant (SPP) iso for your server or similar. The SPP for DL380 Gen10 has been tested, and I can confirm that it works for this purpose even though it will refuse to install any SPP updates.
  • A valid HPE support contract is recommended
  • A towel is recommended.
  • A logged-in ILO remote console or local console access.
  • The local admin password for the server.

Action plan

NB: If you are not planning an SPP update as part of this process, or if you are unable to obtain one, see the update below for an alternative approach. You need an active support contract to download SPP packages, but individual cp packages are available.

  • Install the Intel drivers that correspond with your firmware version. Preferably both should be the latest available.
  • Be aware that this will disconnect your server from the network if all your adapters are Intel adapters. Usually only temporarily, but if it does not come back online, you may have to reconfigure your network adapters using ILO or a local console.
  • Reboot the server.
  • Extract the NVM update
  • In an administrative command shell, navigate to the Winx64 folder.
  • Try running nvmupdatew64e.exe and verify the error message.
  • Mount the SPP iso.
  • Run launch_sum.bat from the iso as admin.
  • In the web browser that appears, accept the cert error and start a localhost guided update:
  • While the inventory is running, switch back to the command shell and keep trying to start the nvm update.
  • This will take some time, so do not give up, and remember your towel.
  • Suddenly, this will happen:
  • Update all cards that have an update available.
  • Reboot the server. You may complete the SPP update before you reboot.

Update: an alternative method

As the HP SPP is a fairly large download to haul around, I kept looking for a more lightweight workaround. If you are going to install an SPP anyway, using it makes sense, but if you are using other methods for patching your servers, it is a bit overkill to use a 10 GB .iso to install a 70 kB temporary driver. Thus, a new plan was hatched.

  • Instead of the SPP iso, get hold of the cpnnnn update package for your HP-approved Intel-based network card. For my x560 card, the current cp is cp047026.
  • Extract the files to a new folder. I have not tested whether or not it is possible to extract the package without the card being installed, but it appears to be a branded WinZip self-extractor or similar, so I expect it to work.
  • Inside your new folder you will find a file called ofvpdfixW64e.exe. Run it from an administrative command shell.
  • Wait for it to finish scanning for adapters.
  • You should now be able to start nvmupdatew64e.exe and upgrade your X710.

As we can see, the tool detects both the HP approved and Intel original adapters. The tool is designed for a different purpose, but that is not important. All we need is a tool that will load the diagnostic drivers and thus enable our Intel updater to function. The package also contains rfendfixW64e.exe, another fixup tool that will load the driver. The HP branded firmware update tool (HPNVMUpdate.exe) may also load the driver in some scenarios. I guess what I am saying is try all of them if one is not working. And make sure to wait for the scan to complete before you try starting nvmupdatew64e.exe.

Also, make sure to install the PROSet drivers. I have had trouble getting this to work without them.

Why this works

The HP SPP is using a branded version of the Intel NVM updater. This updater is using the same driver mentioned above, at least for the 560-series of chips. It is running in a different host process and is thus able to circumvent the hardening that blocks the installation of the driver from the Intel tool directly. When the SPP inventory process is querying Intel network adapters, the driver is loaded and keeps running until you reboot the server. You may be able to get this working even without any other Intel adapters, but I have not tested this scenario. It all depends on how the SPP inventory process runs.

Verify the result

You can verify the result using the Intel-supplied PowerShell cmdlets. They are installed together with the PROSet driver package. You activate them by running this command:

Import-Module -Name "C:\Program Files\Intel\Wired Networking\IntelNetCmdlets\IntelNetCmdlets"

And you list the NVM versions by running the next command. Be aware that HP-branded adapters may not respond to this command and will be listed as not supported. These commands may be relatively slow to respond; this is normal.

Get-IntelNetAdapter | ft -Property Name, DriverVersion, ETrackID, NVMVersion

List VM Networks in SCVMM

To list the connected networks and subnets for each VM running in VMM, run the following in a VMM-connected PowerShell window:

Get-VM | Select-Object -ExpandProperty VirtualNetworkAdapters |
    Select-Object Name, VMNetwork, VMSubnet, IPV4Subnets, IPv4Addresses |
    Sort-Object VMNetwork | Format-Table

Cluster validation fails: Culture is not supported

Problem

The Failover Cluster validation report shows an error for Inventory, List Operating System Information:

An error occurred while executing the test. There was an error getting information about the operating systems on the node. Culture is not supported. Parameter name: culture 3072 (0x0c00) is an invalid culture identifier.

If you look at the summary for the offending node, you will find that Locale and Pagefiles are missing.

Analysis

Clearly there is something wrong with the locale settings on the offending node. As the sample shows, the locale is set to nb-NO for Norwegian (Norway). I immediately suspected that to be the culprit. Most testing is done on en-US, and the rest of us who want to see a sane 24-hour clock without Latin abbreviations and a date with the month where it should be located usually have to suffer.

I was unable to determine exactly where the badger was buried, but the solution was simple enough.

Solution

Part 1

  • Make sure that the Region & language and Date & Time settings (modern settings) are set correctly on all nodes. Be aware of differences between the offending node and working nodes.
  • Make sure that the System Locale is set correctly in the Control Panel, Region, Administrative window.
  • Make sure that Windows Update works and is enabled on all nodes.
  • Check the Languages list under Region & Language (modern settings). If it flashes “Windows update” under one or more of the languages, you still have a Windows Update problem or an internet access problem.
  • Try to validate the cluster again. If the error still appears, go to the next line.
  • Run Windows Update and Microsoft Update on all cluster nodes.
  • Restart all cluster nodes.
  • Try to validate the cluster again. If the error still appears, go to part 2.

Part 2

  • Make sure that you have completed part 1.
  • Go to Control Panel, Region.
  • In the Region dialog, locate Formats, Format.
  • Change the format to English (United States). Be sure to select the correct English locale (see the PowerShell sketch after this list).
  • Click OK.
  • Run the validation again. It should be successful.
  • Change the format back.
  • Run the validation again. It should be successful.
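The same format switch can be done from PowerShell instead of the Region dialog. A minimal sketch, with 'nb-NO' being the locale from this example; sign out and back in (or start a new session) for the change to take full effect:

# Sketch: switch the format locale to en-US, validate, then switch back.
Get-Culture                        # show the current format locale
Set-Culture -CultureInfo 'en-US'
# ...run the cluster validation...
Set-Culture -CultureInfo 'nb-NO'   # change the format back afterwards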

If you are still not successful, there is something seriously wrong with your operating system. I have not yet had a case where the above steps do not resolve the problem, but I suspect that running chkdsk, sfc /scannow or a full node re-installation would be next. Look into the rest of the validation report for clues to other problems.

A new (and improved?) wasteland

This is a story in the “Knights of Hyper-V” series, an attempt at humor with actual technical content hidden in the details. This particular one is just for fun though. Any resemblance to actual trademarks, people or events (real or those that can only be found residing inside your mind) is purely coincidental and should be disregarded.

The knights of Hyper-V were doing some spring cleaning. Or rather, it was actually summer and thus too late in the year to call it spring cleaning anymore. Project setbacks, slow equipment deliveries and the plague-that-shall-not-be-named had severely hampered progress. But finally, the day had arrived to replace some of the hard-working VMs with fresh new ones, running updated software versions glistening in the summer sun. Or covered in the more gloomy, but oh so common, summer rain. And perhaps snow, locusts or other more or less funny local phenomena.

The old servers were not really all that old, but a change in networking politics had ushered in an early swap-over. We were leaving the Wasteland of Nexus for a new and supposedly better (and cheaper) Wasteland. With software-defined wasteland processors or something to that effect. The knights did not really care; all they knew was that new network armor plate connections were required, and that was always a pain in the backside. The application minions would be grumpy, as they would have to write scroll after scroll of requests beseeching for safe paths through the walls of fire.

But enough of the backstory. After a long, looong time, the imposed quest for a new wasteland was nearing its end, and it was time for cleanup. Most of the knights were finally on summer vacation, preparing to queue along the congested paths to the beach, waiting in line to look over a cliff, visiting distant relations or hiding in a deep dungeon to escape the aforementioned plague. Only a skeleton crew (not composed of actual skeletons this time) remained to watch over the systems and do the odd cleanup job. A passing minstrel wrote an ode to one of the old servers in exchange for a late breakfast, or early lunch, depending on your point of view:

Ode to server sixteen

New servers come in, and old ones get phased out.
It exists now only as a memory
Vanished into thin air
Like a fleeting ghost in the machine
Binary code rearranged to form new beginnings
It will always remain in our hearts

For the time being, all was well in the kingdom. All the VMs were kept in line by the automated all-seeing eye of OM, and it was time to relax, read, and practice dragon slaying if one was so inclined. Till next time, enjoy your life such as it is. Remember, things could always be worse. Before you know it, the roars of a three-headed Application bug dragon and the distant horrified screams of application team minions could wake you from your slumber…