There seems to be a widespread misconception in the IT community regarding Single Points of Failure: as long as you have N+1 redundancy in all your components, you no longer have a single point of failure. This is not necessarily correct, and it can lead to a very bad day when you discover that your “bullet-proof” datacenter or system design turns out to be one big basket with all your eggs in it. The fact of the matter is that adding redundancy to a component only reduces the chance of failure; it does not make it impossible for the component to fail. Take an MSSQL failover cluster, for instance, be it Active-Active or the more common Active-Passive. Compared to a stand-alone server it offers far better redundancy, and it will limit maintenance downtime to a bare minimum. But on its own it is still a single point of failure. In fact it has several single points of failure: the shared network/IP, the shared storage and the cluster service itself, to mention a few. I have seen all of the above fail in production, resulting in complete failure of the cluster. Especially on Win2003 and earlier, a poorly configured cluster could easily cause more problems than a stand-alone server ever would, but even if everything is set up and maintained properly, bad things will happen sooner or later.
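To put rough numbers on that claim, here is a small back-of-the-envelope calculation in Python. The probabilities are invented for illustration, and the model assumes the two nodes fail independently of each other, which the shared components of a real cluster quietly undermine:

```python
# Invented failure probabilities for some period, purely for illustration.
p_node = 0.05            # chance of a single node failing
p_shared_storage = 0.01  # chance of the shared storage failing

p_standalone = p_node
p_both_nodes = p_node ** 2  # both (assumed independent) nodes down at once
p_cluster = 1 - (1 - p_both_nodes) * (1 - p_shared_storage)

print(f"stand-alone server: {p_standalone:.4f}")  # 0.0500
print(f"two-node cluster:   {p_cluster:.4f}")     # ~0.0125 - better, but not zero
```

The second node takes the node term from 5% down to 0.25%, but the shared storage alone keeps the cluster's overall chance of failure just above 1%.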
“Virtualization solves the problem” is the mantra you usually hear if you voice such opinions. I will leave the debate about whether it is smart to virtualize a SQL cluster for another post, but running VMWare, Hyper-V or similar hypervisors with support for high availability and clustering is certainly better than stand-alone servers. That being said, a clustered hypervisor is also a single point of failure. My experience is mostly with VMWare, but I have no faith in the other hypervisors out there being any better; if anything, I suspect they are worse. I am no expert at configuring hypervisors, but I have seen VMWare clusters configured by highly regarded experts and consultants fail miserably and stay down for extended periods of time. The culprit is usually shared storage or networking errors. Cluster technology, be it Windows failover clustering, redundant firewall appliances or VMWare HA, relies on some form of heartbeat, usually network and/or storage based, to find out which nodes are in good working order. This can and will fail, sometimes resulting in a split-brain cluster where either all of the nodes or none of them believe they are the master node; both outcomes mean some degree of cluster failure that requires manual intervention to fix. And if the shared storage becomes completely unavailable for some reason, there is no data to be served, and thus no database if we stick with the MSSQL example, or no virtual hosts to run if it is a hypervisor cluster.
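To illustrate how fragile that foundation is, here is a deliberately simplified sketch of majority-based quorum voting in Python. The node names and reachability maps are invented; real cluster services are of course far more involved:

```python
# Simplified heartbeat/quorum model: a node only claims quorum if it can reach
# a strict majority of the cluster (itself included). Everything here is
# invented for illustration.
NODES = ["node-a", "node-b", "node-c"]

def has_quorum(node, reachable):
    votes = 1 + len(reachable[node])  # itself plus the peers it can still reach
    return votes > len(NODES) / 2     # strict majority required

healthy = {n: {p for p in NODES if p != n} for n in NODES}
partitioned = {"node-a": set(), "node-b": {"node-c"}, "node-c": {"node-b"}}

for label, state in [("healthy", healthy), ("partitioned", partitioned)]:
    print(label, "->", [n for n in NODES if has_quorum(n, state)])
# healthy     -> all three nodes see a majority
# partitioned -> only node-b and node-c do; node-a must stop serving

# With just two nodes and no third vote, a partition leaves each side with one
# vote out of two: neither has a majority and the cluster halts, or, if the
# majority rule is relaxed, both carry on as master - a split brain.
```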
In my opinion, the only way to eliminate all single points of failure is to establish a secondary datacenter in a parallel dimension. That way you can truly have two or more of everything, even the users.
Just having a secondary and tertiary datacenter (in our dimension) isn’t enough. There will always be a single point of failure somewhere in the chain between the data and the user, and even if you should complete the unimaginable task of eliminating them all, what if both your primary and secondary unit fail at the same time somewhere in the chain? What it all comes down to is how many levels of redundancy you can justify from a cost/benefit point of view. It is always possible to add another layer. Just remember that each layer of redundancy adds complexity, increasing the chance of failures caused by the redundancy itself.
If you fail to see the single points of failure in your own design, I suggest that you take a couple of steps back, zoom out and take another look. They are lurking in there somewhere…
Failure domains
The road to improvement is to make your failure domains truly independent, and to push the remaining single points of failure as high up and as low down in the stack as possible. If you are left with the network edge load balancers and the local power substation as the only single points of failure for a datacenter, you are a lot better off than most. This of course requires technologies that support truly independent failure domains, but you can gain a lot just from thinking about this when you design your system. Lately, I have seen several technologies make huge advances in this area, and it is no longer reserved for high-end enterprise systems only. That being said, the cost of uptime is still on a logarithmic scale, and moving from 97% to 98% can easily cost as much as getting from 0 to 97% did in the first place.
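A rough way to see why the last few percent are so expensive is to multiply the availability of every layer a request has to pass through. The layers and figures below are invented for illustration:

```python
# Availability of a serial chain is the product of the layers' availabilities.
# All figures are invented for illustration.
layers = {
    "power substation":   0.999,
    "edge load balancer": 0.999,
    "network":            0.9995,
    "hypervisor cluster": 0.9995,
    "database":           0.999,
}

total = 1.0
for availability in layers.values():
    total *= availability

print(f"end-to-end availability: {total:.4%}")                        # ~99.60%
print(f"expected downtime: {(1 - total) * 365 * 24:.0f} hours/year")  # ~35
```

Even with every layer in the high nines, the chain as a whole loses the better part of a nine, and the weakest remaining single point of failure dominates the result.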
So, what are the requirements for an independent failure domain? The point where most designs fall short is physical independence. For instance, you may have independent networks and top-of-rack switches for your failure domains. This is all well and good, but if they are all controlled by a central core switch/router cluster, they are interdependent, and thereby not truly independent. I have seen server room switches grind to a halt due to misconfiguration of VLANs in client networks configured on the same core switch. And what about blade servers? Well, if you mix servers from different failure domains in the same chassis, you are fresh out of luck when the chassis fails. And if you think that your chassis cannot possibly fail because it has redundant components like power supplies and switches, think again.
A good test is to make a list of which physical components make up your failure domains for a single service or system. If one physical component pops up in more than one failure domain, it is a single point of failure (a small scripted version of this check is sketched after the list below). When you have your list, perform the following thought experiment: imagine destroying each and every component in one failure domain and ask yourself the following questions:
- would my system still be able to service users?
- would any other systems be affected?
- could this affect components in other failure domains? And note that I wrote could, not should. If it is connected to something else in another failure domain somehow, it is usually interdependent.
- what happens when the components are restored?
Then repeat this for each failure domain associated with the same system/service. If the failure of one or more of the failure domains leads to service failure, you still have a single point of failure.
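If the list gets long, the first part of the test, checking whether any physical component shows up in more than one failure domain, is easy to script. The inventory below is obviously invented; the point is simply counting overlaps:

```python
from collections import Counter

# Physical components behind each failure domain of one service.
# The inventory is invented for illustration; substitute your own.
failure_domains = {
    "domain-a": {"rack 1 ToR switch", "blade chassis 1", "SAN-1", "core switch"},
    "domain-b": {"rack 2 ToR switch", "blade chassis 2", "SAN-1", "core switch"},
}

counts = Counter(component
                 for components in failure_domains.values()
                 for component in components)

spofs = sorted(component for component, n in counts.items() if n > 1)
print("single points of failure:", spofs)  # ['SAN-1', 'core switch']
```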
Failure domain examples
Let’s look at the server and storage level of an MSSQL 2008 database stack using physical servers. This is not an exhaustive list of HA options and configurations for MSSQL; for more information, see http://msdn.microsoft.com/en-us/library/ms190202(v=sql.105).aspx
With just one single server, there is only one failure domain and no redundancy at all if the server fails. The physical server itself can of course have redundant components such as dual power supplies, dual processors and RAID storage, but most people will agree that this does not ensure the server will run forever.
With failover clustering the chance of failure is reduced significantly, as we are no longer dependent on a single physical server: we have a secondary server. This also reduces downtime associated with maintenance, as it is possible to initiate a manual failover and service one of the nodes. Such a cluster can be active/passive or active/active, and may have more than two nodes depending on licensing, but clusters are usually designed so that it is possible to lose one server and still process the workload. It is still a single failure domain, however, because the shared storage is a single point of failure. A properly designed (and expensive) SAN has a pretty high MTBF, and some of the really expensive ones even have a small RTO, but they are not perfect and will fail sooner or later.
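The same kind of back-of-the-envelope math as earlier shows why the shared storage sets a ceiling: no matter how many nodes you add, the cluster is never more available than the SAN they all depend on. The figures are invented:

```python
# Invented availability figures for one cluster node and for the shared SAN.
a_node = 0.995
a_san = 0.9995

for nodes in (1, 2, 3):
    a_nodes = 1 - (1 - a_node) ** nodes  # probability that at least one node is up
    print(f"{nodes} node(s): {a_nodes * a_san:.5f}")
# 1 node(s): 0.99450
# 2 node(s): 0.99948
# 3 node(s): 0.99950  <- converging on the SAN's 0.9995, never above it
```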
Mirroring supports independent failure domains with automatic failover if you add a quorum server. It is possible to combine mirroring with failover clustering, increasing availability further at the cost of more hardware, as it requires at least two independent clusters. The role of the quorum server (which is also an MSSQL server) is to determine which of the nodes (if any) is serviceable, enabling the secondary server to take over in case the primary (principal) fails. This method also provides some protection against data corruption, as it is able to recover a corrupt page from the secondary node if one is discovered at the primary. Mirroring with automatic failover adds overhead to the system, as transactions have to be shipped to the secondary server before they are committed. This implies that the servers should be located in close vicinity to each other from a networking point of view to keep the overhead to a minimum. Furthermore, mirroring requires client software that supports it: for a client to reconnect after a catastrophic failure of the primary node, it needs to know the name/IP of the secondary.
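On the client side this typically means specifying the failover partner in the connection string. Below is a minimal sketch using Python and pyodbc; the server names, database and driver are placeholders, and the exact keyword (Failover_Partner here, Failover Partner in ADO.NET) depends on the driver, so check its documentation:

```python
import pyodbc

# Sketch only: placeholder names, assuming a SQL Server ODBC driver that
# understands the Failover_Partner keyword for database mirroring.
conn_str = (
    "Driver={SQL Server Native Client 11.0};"
    "Server=sql-principal.example.local;"
    "Failover_Partner=sql-mirror.example.local;"
    "Database=MyDatabase;"
    "Trusted_Connection=yes;"
)

conn = pyodbc.connect(conn_str)
# If the principal cannot be reached, the driver can try the mirror instead -
# but only because the client was told the secondary's name up front.
```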
If the quorum server is in the same failure domain as one of the nodes, e.g. in the same blade chassis or behind the same switch, and that node happens to be the primary (principal) at the moment of failure, the secondary server loses its connection to both the quorum server and the primary at the same time. It then has no way of knowing that it is safe to take over, so the service fails and requires manual intervention to restart.
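In terms of the same majority arithmetic as in the heartbeat sketch earlier: the secondary may only promote itself if it can still see a majority of the three servers involved, and co-locating the quorum server with the principal is exactly what takes that majority away when the shared component fails. A small invented illustration:

```python
# Majority rule over the three mirroring roles: principal, secondary, quorum.
def can_fail_over(servers_secondary_can_see):
    return servers_secondary_can_see > 3 / 2  # needs 2 of 3, itself included

# Quorum server shares a chassis/switch with the principal and both disappear:
print(can_fail_over(1))  # False - secondary only sees itself, manual intervention

# Quorum server placed in a third, independent failure domain:
print(can_fail_over(2))  # True - secondary plus quorum form a majority
```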
If the primary and secondary servers use the same SAN storage, interdependent SAN storage and/or the same SAN fabrics, a failure of the SAN will affect both nodes and cause the service to fail until the SAN is restored.