Root Cause Analysis Archives - BlackCat Reasearch Facility

Catching the ghosts of forgotten VMs

This is a story in the “Knights of Hyper-V” series, an attempt at humor with actual technical content hidden in the details.

It was nearing the end of summer, but most of the Knights of Hyper-V were still on vacation. There was of course always one knight on call, but the others were lazily roaming the countryside, or lounging along the bank of a river pretending to be on a fishing trip. Some even went on expeditions to faraway realms looking for trouble, relaxation, fancy fishing gear, VMWare-proof armor, or new riding boots. The all-seeing monitors, however, were not on vacation. To be honest, we do not even know if they ever sleep; they just seem to take turns going into hibernation mode. Instead, they had spent the summer installing new crystal orbs, automated all-seeing eyes and such. One of their new contraptions was some kind of network-enabled spooky ghost detector. Its purpose was to send probes into The Wasteland of Nexus and attempt to locate signs of the ghosts of forgotten VMs and other security problems.

This came about as flaws had been discovered in the procedure for disposal of outdated VMs. The minions responsible for dealing with outdated VM disposal had gotten increasingly bureaucratic, spending most of their time hassling others with demands of forms filled in triplicate to update documentation. And such tasks are of course important, but the most important thing is to actually dispose of the old VM. The result was a number of undocumented (as the documentation had been updated) VMs roaming The Wasteland of Nexus without updated security software, making the entire realm vulnerable to outside attacks from beyond the wall. Firewall that is.

Such was the back-story, when one dark and gloomy midsummer morning, a trouble ticket landed in the inbox of the knight on call with a loud boom. It was another list of suspect activities detected in the wasteland. A couple of probes had returned during the night, complaining about servers that had gone without patches for several years. To add a little spice to the mix, these were ghost servers. If you knocked on the right door they would answer, but they were not listed anywhere. Not in the labyrinthine CMDB, and certainly not in any of the address books. For all intents and purposes they did not exist. Except of course for the undeniable fact that they most certainly did. This was something that could provide days, if not months, of confused contemplation for social studies majors, human resources, project managers and others of similar ilk. But the knight was an engineer and simply scoffed at such irrelevancies. To him this was simply a problem looking for a solution. But which solution? The available information pointed to an ancient server from 2010. That is a very long time ago, and at least two documentation systems have been sent off to Valhalla by way of funeral pyre in the meantime. The current buzzword-friendly variant was named after the Chinese philosopher Confucius. He is credited with the saying “Do not do to others what you do not want done to yourself”, but if such a rule were to be enforced in documentation systems, violent outbreaks would be the norm, as most documentation can be interpreted as a form of torture. Anyways, no trace of the ghosts was found in the current system, and the old ones were burned. There was always a faint hope that someone had kept a personal log mentioning the ghosts’ former names, but no such luck was to be had this time around.

The knight went back to the all-seeing monitors and requested more information to aid him in his search. Another probe was dispatched into the spirit world, this time with instructions to look for identifying marks instead of fussing about missing security updates, foul stenches and gates left open. While waiting for the probes to return, the knight identified an old, long-forgotten storage system. The storage minions swore it had been properly decommissioned and disposed of years ago, but it was found to be chugging along under a desk, consuming power and collecting dust.

Another sub-quest expedition to the physical realm of Hyper-V hosts revealed that someone had been re-inserting old decommissioned servers that were kept around for spare parts into the magic cabinet of the silver slanted ‘E’. Or it could of course be that they had never been removed in the first place due to bureaucratic loops and lost scrolls of Todo. Anyways, the knight had bagged two ghosts.

We rejoin our knight the next morning. For once it was a good morning. The sun was shining, and the success of yesterday’s sub-quests was still lingering in the knight’s mind. Sadly, that would soon change. The probes were back, and they were happily reporting that the former names of the ghosts had been decoded. This identified the responsible service team, but the service team minions were all relatively new and had never heard of these old ghosts. Armed with new knowledge the knight went straight to the VMM daemons to demand an explanation. But to his great alarm, he found that the VMM daemons too had never heard of these ghosts. Feverishly the knight searched the scrolls of physical servers, in a vain hope that the servers nevertheless were physical beings, but no. No such server had ever existed. With that, only one possible solution remained: the ghosts were located in the realm of VMWare…

There was no choice other than to beseech the man with the crowbar to borrow his Hazard Suit and plan an expedition to the toxic fields of vCenter. Once there, the ghosts were immediately detected. On a closer (but hasty) inspection of the remaining area, the knight also identified two other ghosts. He quickly filled out a scroll identifying the ghosts, and went back to more pleasing surroundings. He then updated the trouble ticket and forwarded it to the unholy riders of VMWare, hoping that he wouldn’t have to go back for a long, long time.
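
Stripped of knights and hazard suits, the hunt above boils down to comparing what answers on the network with what each inventory claims to know. The following is a minimal Python sketch of that comparison, not the tooling used in the story. All host names and inventory names are made up for illustration, and a real run would load probe results and inventory exports instead of hard-coded sets.

```python
def find_ghosts(responding, inventories):
    """Return hosts that answered a probe but appear in no inventory.

    responding:  iterable of host names that answered the scan.
    inventories: mapping of inventory name -> set of host names it knows.
    """
    known = set()
    for hosts in inventories.values():
        known.update(hosts)
    return sorted(set(responding) - known)


def locate(host, inventories):
    """List the inventories that know a host (an empty list means ghost)."""
    return sorted(name for name, hosts in inventories.items() if host in hosts)


# Hypothetical sample data, invented for this example.
responding = ["app01", "web02", "legacy2010", "sparebox"]
inventories = {
    "cmdb": {"app01"},
    "vmm": {"app01", "web02"},
    "physical": set(),
}

# Hosts that respond but are unknown to every inventory checked so far.
ghosts = find_ghosts(responding, inventories)
print("ghosts:", ghosts)

# Checking one more inventory (here a made-up vCenter export) may explain
# some of the ghosts, as it did for the knight.
with_vcenter = dict(inventories, vcenter={"legacy2010"})
print("still unexplained:", find_ghosts(responding, with_vcenter))
```

The point of keeping the inventories separate rather than merging them up front is that `locate` can then tell you which team to forward the trouble ticket to.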

Last edit: Friday, March 6, 2020

Dark magic at the backup site

This is a story in the “Knights of Hyper-V” series, an attempt at humor with actual technical content hidden in the details.

It was a nice Friday afternoon, and the Knights of Hyper-V were holding an end-of-week council meeting. The mood was light, as cunning and not so cunning plans were laid for the days ahead. There were misbehaving servers in need of an attitude adjustment, and misbehaving minions in need of a reality check. Just as the head knight was delivering a lengthy speech on the proper use of Virtual Machine Queue incantations and the correct grammar for Receive Side Scaling spells of power, a stressed-out envoy from the all-seeing monitors arrived and demanded an immediate audience. They had lost all communication with one of the ghosts in the offsite backup dungeon. “Not another network armor issue!” exclaimed one of the Knights. She knew that some of the hosts at the offsite dungeon still used the inferior (and highly unstable) Broadcom plates as a connection to The Wasteland of Nexus.

The Knights leapt into action and went to interrogate the hosts. To their big surprise the hosts were in uproar. The cluster event scroll was gushing with blood red critical alerts, and host1 was completely unreachable. Undoubtedly, some dark magic was veiling this information from the gaze of the all-seeing monitors. The Knights bade the VMM daemons to put the missing host in maintenance mode. This proved to be a great mistake. As the VMM daemons tried to herd all the ghosts over to hosts 2 and 3, the cluster log cried out in red agony once more, but this time with storage problems on host3 as well. Thus the cluster was left with only one working node, and perhaps a cursed node spewing corrupt data into the storage. The Knights had no other choice but to take it all down and put all the ghosts to sleep. What started out as a quiet afternoon had suddenly turned into a frenzy of unruly runaway hosts and buzzards circling in the sky above the backup site.

Continue reading “Dark magic at the backup site”

Last edit: Friday, March 6, 2020

The search for the missing armor plate

One dark and stormy evening, The Knights of HyperV had trouble getting a new host in contact with The Wasteland of Nexus. One of the armor plates to which the connections were attached was not responding. Actually, it appeared to be missing entirely. Something which was odd, as it had been installed by a trusted minion just days earlier. The Knights sent an expedition to the gates of Hell (otherwise known as Dell CMC) to investigate. The envoys tried to open the gates, but instead of open doors, they were greeted with an error message.

After a lengthy discussion with the insane gatekeeper of LDAP-Auth, the envoys were finally granted access into The Ocean of Known Bugs that lay beyond the gates. A boat carried them over to The Island of iDRAC. The journey was bumpy, and once ashore the envoys wasted some time recovering from seasickness and the general discomfort caused by the putrid smell of the bugs.

Once recovered, they demanded access to the scrolls of system inventory for the server in question. To the horror of the envoys, the inventory only listed one armor plate, instead of the expected two. Luckily, the existing armor plate was made of the stable Intel-alloy as expected, but the second plate was missing. Could it have been stolen? Perhaps by one of the competing service team minions that dwelled in The Cursed Forests of Sharepoint? They would have to venture into the physical realm of the Hypervisors to find out for sure. The Knights tracked down the minion responsible for the armor plate installation and interrogated her for details. She insisted that the plate had to be there still, pleading to avoid another trip down the long and dangerous road to The Physical Realm of the Hypervisors, and suggested that the problem may be a curse. A spell from the book of dark forbidden magic, putting a veil over the labyrinth of UEFI and thus preventing the armor plate from being seen or used by the server.

Could it be? The knights snorted in disbelief, but as they had no other ideas at the time, they traveled to The Wizard of Badgerville and beseeched him to remove the curse if there was any. The wizard demanded an offering of three sausages and some boiled rice (as he was hungry). After devouring the food, the wizard started walking in circles around the remote console and muttered incomprehensible incantations that were somehow transmitted to the host without him ever touching the keyboard. A long time passed as the knights watched the wizard. At first, they watched in awe. But as the time went by, awe turned to glances, glances turned to boredom, until finally the knights were all sound asleep. Sometime later, whether hours or days we do not know, they were snatched away from slumber land. This annoyed The Knights, as they were all awoken from pleasant dreams about conquering the realm of VMWare.

“The deed is done!” declared The Wizard and vanished into a puff of smoke. The Knights staggered over to the console and were amazed to find that the server was not only able to see the missing armor plate, but it was already connected to the spirit world and jabbering happily with the domain controller. However, who had cursed the server? And why? Could it be the witch with the wardrobe of broken firmware patches? Or the unholy dark riders of VMWare? We may never find out, but if we are lucky, one day another tale will be told about the adventures of The Knights of HyperV…

Last edit: Friday, November 20, 2015

The quest for the source of the foul stench, part 2

Part 1: https://lokna.no/?p=1837

On a rainy day, one adventurous minion traveled the dark and lonesome road to the physical realm of the troublesome hypervisor with the itchy connection to The Wastelands of Nexus. The minion scratched and scratched the hypervisor’s itchy back, until the rotten residue from the horrors of Nexus was all gone. He then presented the troublesome hypervisor with a new and shiny armor plate, and it was yet again prepared to rejoin The Wastelands of Nexus with a fresh and healthy connection.

Alas, the minion’s work had all been in vain. When the poor hypervisor on host 2 tried sending data through The Wasteland of Nexus and into the spirit world, it was unable to get a response. The ghost within was still restless, and upon investigation the foul stench came back and seemed to originate from the pipeline better known as HyperV1.

I borrowed a hazardous environment suit from a man with a red crowbar and went to investigate. A foul-smelling substance oozed out of the pipeline, and I wondered how the poor servers were able to survive in such an environment. I plugged the pipeline, and as soon as I had washed off the foul-smelling substance and placed the rags in an airtight yellow hazmat container, the stench subsided and I was able to breathe freely. The virtual servers were able to get a stable connection to The Wasteland of Nexus through the spare pipeline known as HyperV2, and the ghosts rejoiced in happiness, as they were again able to communicate with the spirit world.

But the adventure was not yet over. We were no closer to identifying the source of the smelly substance. All we knew was that it entered the hypervisor in the physical realm outside the host itself. Could it be that the connections going into the physical server had gone bad? I would have to dispatch another minion into the physical realm of the hypervisors to look for answers…

Armed with this knowledge, an infrastructure minion scurried over to the physical realm of the hypervisor once again. This time she tried changing the place of Host 2 with Host 4. As feared, the failure followed the slot over to Host 4. This meant one of two things. Could it be that the people of The Wasteland of Nexus had deceived us, and the connection was actually faulty at their end after all? Or could it be even worse: a problem in the magic cabinet of the silver slanted E, into which all these hosts had to be inserted? The mere thought was enough to make the minions shudder, as everything branded with the mark of the silver slanted E was notoriously difficult to troubleshoot.

At long last, the Knights of Hyper-V decided that enough was enough. They ordered the minions to move host 4 into another functioning slot, and vitrify both the slot in the magic cabinet and the pipe going to The Wasteland of Nexus to make sure they were never used again. They further swore to expel anyone caught trying to bring more equipment bearing the mark of the silver slanted E into the physical domain. At last, the ghosts within the hypervisors were happy again, and the minions rejoiced for a couple of minutes before they went on to the next trouble ticket.

Last edit: Wednesday, July 15, 2020

The quest for the source of the foul stench, part 1

Late one cloudy evening, I was alerted by one of the helmet-clad application team minions, who claimed that one of their spirits was not responding. It was time for an adventure!

Venturing into the land of the hypervisors, I noticed a foul stench of failure radiating from one of the hosts. Something was definitely rotten in the kingdom of Hyper-V host 2. I tried communicating with the ghosts within, but my calls went unanswered. The connection to the spirit world of the virtual servers residing in host 2 was down. I worried that the servers had run off to greener pastures, but The Wizard of VMM insisted that they were still in place. The Wizard also adamantly claimed that there was no problem whatsoever, and snorted offended that the foul smell had to originate elsewhere.

The horror struck me: I may have to go down to The Wasteland of Nexus to investigate. Such an adventure would require a companion fluent in IOS, the almost incomprehensible gibberish spoken by the dark-greenish inhabitants. Could it be that the host was working fine as The Wizard claimed, but that the data just disappeared into the vast nothingness of DEV/NULL? I teleported into the host console to interrogate the servers locally. After waiting in front of the server for a long time, the doors suddenly sprang open, and I was let inside. The poor thing was clearly in turmoil. The event scroll was running red with error messages complaining about everything from time sync to database connections. Some meditation in front of the scroll revealed that most of the problems were caused by a poor connection to the physical realm. This had led to an identity crisis; the server was not even sure of who it was anymore.

Just to make sure, I beseeched The Wizard of VMM to recite the incantations required to move the server out of host 2 and into a temporary host. After some time, the ghosts were once again responding to calls. I commanded the application minions to perform some tests. They sprang into action, and were soon rejoicing, happy that their application was once again fully redundant. However, the quest was not over yet. Why were the ghosts so unhappy at host 2 and what was the source of the foul stench?

To be continued…

Part 2: https://lokna.no/?p=1842

Last edit: Wednesday, July 15, 2020

Lokna’s laws of system operations

1. Respect the unknown. The fact that reality seems to defy logic or even the laws of physics only proves that there is at least one unknown factor. Expecting systems to behave as the manuals and your knowledge of the system dictate is only logical. Refusing to adapt to the fact that the system didn’t work as expected is not. If an unknown factor or factors caused an error once, it is likely to happen again under similar circumstances.

2. Respect what you don’t know. As the saying goes, the more you learn, the more you realize that you don’t know.

3. Only a fool deals in absolutes. Pity the fool, but don’t cater to his delusions.

4. Reading books, blogs and forums can only bring you so far. They are full of faults and misconceptions, and what is correct is often only correct from the viewpoint of the author. You have to adapt it to your situation.

5. Set up a personal lab and use it. Build it yourself, and build a lab reflecting what you use or want to learn.

6. You cannot succeed being on both the dev- and the ops-team. They may or may not share the same goal, but the path to reach the goal is rarely the same for both teams. Trying will only split your focus. Focus on what you do best, and let others focus on the rest.

7. Your lack of planning is not my emergency. Trying to make it so will only anger me.

8. Some people are otters, some people are rocks. Figure out which you are, live with it, or work on a way to change it.

9. Be responsible. YOU are at the top of your pyramid, and responsible for all of it, down to the foundations. Expect others to do their part, but be prepared in case they don’t.

10. Use independent testing software, and understand the test scripts. Your results will never be worth more than the test scripts behind the results.

11. There is no way to test production performance in QA or AT. No, there really is not. Until someone invents a way to create parallel universes for testing purposes, it will be impossible to accurately predict production performance.

12. Do not expect code to be delivered on schedule.

13. Do not expect servers and infrastructure to be delivered on schedule either.

14. Expect that the dev-team and customers will demand that you deploy on schedule anyway. Refer to 7.

15. If you are not on the DBA-team, do not waste time asking for admin access to the SQL Server. This will only anger the DBAs, and trust me, you do NOT want to deal with angry DBAs later on.

16. Do not try to use virtual machines to do the job of physical machines. A four core VM may perform well at low load, but don’t expect it to do the job of a 16 core physical machine at peak load. I know marketing will have us believe that this is possible, but it isn’t.

17. When you size servers, design for peak load and add 30% if you can afford it. If you cannot afford it, expect outages and abysmal performance at peak load. A 20% average CPU load does not mean that you can add 4 times your current number of VMs.

18. Adding more VMs without adding more physical hardware will NOT increase performance, it will reduce it. Doing so will only increase the overhead on the already strained physical hardware. There are no exceptions to this law. If you still think there are, keep on thinking. Every time you add a VM, you have to recalculate your oversubscription rate and add hardware as necessary. Increasing CPU oversubscription above 200% is asking for trouble.

19. If you work for one of those companies that think you can just add ever increasing numbers of VMs to the same hardware hosts; RUN. Virtualization allows you to better utilize your hardware, but in spite of the price tag it is not magicware that creates resources out of thin air.

20. Do not believe the VM admin when he talks about max 5% hypervisor overhead. He forgot to add the 10% complexity overhead and the 10% general ops stupidity overhead.

21. Everybody lies. The question you should ask yourself is what they lie about.

22. There will always be someone that is smarter than you. There will also be people that are a lot dumber. Sometimes the two are hard to tell apart.

23. If you delegate a task to someone else, don’t expect them to perform the task exactly like you would have done it. However, do expect them to deliver according to your instructions.

24. Some people like to argue for the sake of argument. They are usually very skilled at it. Trying to win an argument with such people is futile. That doesn’t mean you have to give in. There are other ways of winning than winning the argument.

25. If someone is yelling at you, they are usually part of the problem. This most certainly also applies to your boss, your boss’s boss and the president of the company. That being said, telling them that they are a part of the problem is not always a good idea.

26. Make sure to leave a paper trail. Covering your ass is a key survival technique, and there is nothing like a good subtle “I told you so” further down the line to brighten your day.

27. If you become the go-to guy, people will be bothering you all the time.

28. It is OK to believe everybody else is an idiot, but smart people keep such opinions to themselves.

29. Don’t play the blame game. Learn to tell the difference between finding the error and finding out who is responsible. As long as there isn’t any evidence of malice or criminal behavior, focus on preventing the error from happening again. Give people the opportunity to learn from their mistakes.

30. Shit happens. People make mistakes. Hardware will fail. Software will fail. Be prepared when it does.

31. Data loss is inevitable. Plan for that as well. Do not refuse to plan for data loss because you have the best backup strategy in the world, or because you think your RAID is infallible.

32. Backups alone are useless. Restores are all that matters.

33. Do not make excuses. Make changes instead.

34. There is no such thing as a fool-proof system. As soon as someone invents a supposedly fool-proof system, someone else will invent a better fool. Do not underestimate the collective stupidity of a large group of people.
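
The arithmetic behind laws 17 and 18 is simple enough to script. Below is a minimal Python sketch under one loud assumption: “oversubscription” is read here as allocated vCPUs as a percentage of physical cores, so 200% means two vCPUs per core. The laws do not spell out the exact definition, and the function names and sample numbers are invented for illustration.

```python
import math


def oversubscription_pct(vcpus_allocated, physical_cores):
    """Allocated vCPUs as a percentage of physical cores (100 = one to one).

    Assumption: this is one common reading of the term, not a definition
    taken from the laws themselves.
    """
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return 100.0 * vcpus_allocated / physical_cores


def cores_for_peak(peak_cores_needed, headroom=0.30):
    """Law 17: size for peak load, then add headroom (30% by default)."""
    return math.ceil(peak_cores_needed * (1 + headroom))


def can_add_vm(vcpus_allocated, new_vm_vcpus, physical_cores, limit_pct=200.0):
    """Law 18: recalculate the rate for every new VM and compare to the limit."""
    new_rate = oversubscription_pct(vcpus_allocated + new_vm_vcpus, physical_cores)
    return new_rate <= limit_pct


# Hypothetical host: 32 physical cores with 56 vCPUs already handed out.
print(oversubscription_pct(56, 32))  # current rate in percent
print(can_add_vm(56, 8, 32))         # lands exactly on the 200% limit
print(can_add_vm(56, 12, 32))        # pushes the host past the limit
print(cores_for_peak(16))            # cores to buy for a 16-core peak
```

Note that none of this looks at average CPU load. That is deliberate: as law 17 points out, a low average says very little about what happens at peak.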

Last edit: Saturday, December 10, 2016

How SQL Server 2012 Service Pack 1 destroyed my life

Or to be more exact: seven days of my life, with every waking hour spent troubleshooting and finally reinstalling servers.

Problem

Some time after you install MSSQL 2012 SP1, strange things start to happen on the server. Maybe you are unable to start Management Studio, or if you’re really unfortunate you can’t log in to the server at all. Maybe random services start to fail, Windows Update refuses to apply updates, and if you reboot the server it might refuse to start SQL Server at all or just greet you with a friendly BSOD. But a couple of days earlier everything was hunky-dory, and you are absolutely 100% sure nothing has changed since then. Yet the server fails, and a couple of days later another one goes down with the same symptoms. It’s as if they have contracted Ebola or the swine flu or some other strange virus. It seems to be attacking the entire server all at once, and the only common denominator is that they are all running MSSQL 2012 SP1.

Continue reading “How SQL Server 2012 Service Pack 1 destroyed my life”

Last edit: Tuesday, October 25, 2016

Redundancy versus Single Points of Failure

There seems to be a widespread misconception in the IT community regarding Single Points of Failure: as long as you have N+1 redundancy in all your components, you no longer have a single point of failure. This is not necessarily correct, and can lead to a very bad day when you discover that your “bulletproof” datacenter or system design turns out to be one big basket with all your eggs in it. The fact of the matter is that adding redundancy to a component will only reduce the chance of failure; it won’t make it impossible for the component to fail. Take an MSSQL failover cluster for instance, be it Active-Active or the more common Active-Passive. Compared to a stand-alone server it offers far better redundancy, and it will limit maintenance downtime to a bare minimum. But on its own it is still a single point of failure. In fact, it has several single points of failure: shared network/IP, shared storage and the cluster service itself, to mention a few. I have seen all of the above fail in production, resulting in complete failure of the cluster. Especially on Win2003 and earlier, a poorly configured cluster could easily cause more problems than a stand-alone server ever would, but even if everything is set up and maintained properly, bad things will happen sooner or later.
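
A back-of-the-envelope calculation makes the point concrete. The Python sketch below models a two-node failover cluster as redundant nodes in parallel, chained in series with the shared components named above. It assumes that failures are independent, which real incidents often violate, and every availability figure is invented for illustration rather than measured.

```python
def parallel(*availabilities):
    """Redundant components: the group is up if at least one member is up."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


def series(*availabilities):
    """A chain of dependencies: up only if every component is up."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total


# Hypothetical availabilities, assumed independent.
node = 0.99             # one cluster node on its own
shared_storage = 0.999
shared_network = 0.999
cluster_service = 0.999

nodes = parallel(node, node)  # two redundant nodes
cluster = series(nodes, shared_storage, shared_network, cluster_service)

print(f"redundant nodes: {nodes:.6f}")
print(f"whole cluster:   {cluster:.6f}")
```

Under these made-up numbers the redundant node pair is more available than any single shared component, so the shared parts end up dominating the result. Adding a third node would barely move the cluster figure.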

Continue reading “Redundancy versus Single Points of Failure”

Last edit: Tuesday, June 12, 2012

Systematic analysis and investigation of failure situations in ICT systems

A practical approach

Introduction

This article defines a framework intended to make it easier to find the cause of failure situations, often referred to as Root Cause Analysis, with a focus on Windows-based systems. The target audience is third-line support and operations teams. I define an analysis process and a set of tools. The analysis process should be generic enough to fit most environments, while the tools are meant more as a toolbox from which you pick whatever you think best suits a given problem.

I have chosen to focus on a methodical process, as I know from experience how easy it is to end up wearing blinkers once you start digging into a problem. This happens regardless of whether you have an overall troubleshooting process or not, as there are very few defined processes for the actual troubleshooting. Such processes are often too high-level or focused on individual tools. The goal is not to remove the blinkers entirely, but to know when to take them off and when to keep them on.

The idea came one evening after a long troubleshooting session in which we finally got a system back up after starting over and changing our angle of approach. I started discussing methodology with a colleague, and we concluded that the methods we succeed best with actually have their roots far outside IT. I draw inspiration from methods that are really used for criminal investigation, first aid and searching for missing persons, fields that are rarely associated with IT. I decided to write down and publish these thoughts, and hope they are enjoyable and/or useful to others.

All examples are taken from a Windows world, as that is what I work with. The process itself should nevertheless be adaptable to any IT system where you deliver or operate a service.

This is a document in progress, and I update it now and then as I finish writing more chapters or to correct errors and ambiguities.

Continue reading “Systematisk analyse og etterforskning av feilsituasjoner i IKT-systemer”

Last edit: Tuesday, June 12, 2012
