Microsoft Message Analyzer, the successor to Network Monitor, can read a lot more than just network captures. In this post I will show how you can open a set of cluster logs from a SQL Server Failover Cluster Instance. If you are new to Message Analyzer, I recommend glancing at the Microsoft Message Analyzer Operating Guide for additional information while you read this post.
Side quest: Basic cluster problem remediation
Remember that the cluster log is a debug log used for analyzing what went wrong AFTER you get it all working again. In most cases your cluster should self-heal, and all you have to do is figure out what went wrong and what you should do differently to prevent it from happening again. If your cluster is still down and you are reading this post, you are on the wrong path.
Below you will find a simplified action plan for getting your cluster back online. I will assume that you have exhausted your normal troubleshooting process to no avail, that your cluster is down and that you do not know why. The type of Failover Cluster is somewhat irrelevant for this action plan.
- If your cluster has shared storage, call your SAN person and verify that all nodes can access the storage, and that there are no gremlins in the storage and fabric logs.
- If something works and something does not, restart all nodes one by one. If you cannot restart a node, power cycle it.
- If nothing works, shut down all nodes, then start one node. Just one.
- Verify that it has a valid connection to the rest of your environment, both networking and storage if applicable.
- If you have more than two nodes, start enough nodes to establish quorum, usually a majority of the votes (n/2 + 1 nodes, or n/2 nodes plus a witness).
- Verify that your hardware is working. Check OOB logs and blinking lights.
- If the cluster is still not working, run a full cluster validation and correct any errors (see the PowerShell sketch after this list). If you had errors in the validation report BEFORE the cluster went down, your configuration is not supported and this is probably the reason for your predicament. Rectify all errors and try again.
- If you have warnings in your cluster validation report, check each one and decide whether or not to correct it. Some clusters will have warnings by design.
- If your nodes are virtual, make sure that you are not using VMware Raw Device Mapping. If you are, this is the probable cause of all your problems, both on this cluster and any personal problems you may have. Make the necessary changes to remove RDM.
- If your nodes are virtual, make sure there are no snapshots/checkpoints. If you find any, remove them. Snapshots/checkpoints left running for > 12 hours may destroy a production cluster.
- If the cluster is still not working, reformat, reinstall and restore.
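If you prefer to script the validation and checkpoint checks above, a minimal sketch might look like the one below. It assumes an elevated PowerShell prompt, uses the lab node names from this post (substitute your own), and the checkpoint check assumes the nodes are Hyper-V guests and that you can run the Hyper-V module on the host.

```powershell
# Run a full cluster validation and write the report to C:\TEMP.
# Node names are the lab nodes from this post; replace with your own.
Test-Cluster -Node SQL19-1, SQL19-2 -ReportName C:\TEMP\ValidationReport

# If the nodes are Hyper-V guests, list any forgotten checkpoints from the host.
# (Requires the Hyper-V PowerShell module on the host.)
Get-VM | Get-VMSnapshot | Select-Object VMName, Name, CreationTime
```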
Prerequisites and test environment
- A running failover cluster. Any type of cluster will do, but I will use a SQL Server Failover Cluster Instance as a sample.
- A workstation or server running Microsoft Message Analyzer 1.4 with all the current patches and updates as of March 2019.
- The cluster nodes in the lab are named SQL19-1 and SQL19-2 and are running Windows Server 2019 with a SQL Server 2019 CTP 2.2 Failover Cluster Instance.
- To understand this post you need an understanding of how a Windows Failover Cluster works. If you have never looked at a cluster log before, this post will not teach you how to interpret the log. https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-2000-server/cc961673(v=technet.10) contains additional information about the cluster log. It is very old but still relevant, and at the time of writing it was the best source of information I could find. There is also an old article about the Resource Hosting Subsystem that may be of use here.
Obtaining the cluster log
- To get the current cluster log, execute Get-ClusterLog -Destination C:\TEMP -SkipClusterState in an administrative PowerShell window on one of the cluster nodes.
- Be aware that the timestamps in the log file will be in UTC (Zulu time/GMT). MA should compensate for this.
- The SkipClusterState option removes a lot of unparseable information from the file. If you are investigating a serious problem you may want to run a separate export without this option.
- The TimeSpan option limits the log timespan. I used it to get a smaller sample set for this lab, and so should you if you know what timespan you want to investigate. You can also add a pre-filter in MA to limit the timespan.
- You should now have one file for each cluster node in C:\Temp.
- Copy the files to the machine running Message Analyzer, as shown in the sketch below.
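Putting the list above together, a minimal sketch of the export could look like this. The five-minute timespan and the destination paths are just examples, and \\MA-workstation\C$\Temp is a placeholder for wherever Message Analyzer is running.

```powershell
# Export the cluster log from all nodes to C:\TEMP on the local node.
# -SkipClusterState drops the unparseable cluster state dump, and
# -TimeSpan limits the export to the last 5 minutes (adjust as needed).
# Add -UseLocalTime if you want local timestamps instead of UTC.
Get-ClusterLog -Destination C:\TEMP -SkipClusterState -TimeSpan 5

# Copy the resulting .log files to the Message Analyzer machine
# (the destination path is a placeholder).
Copy-Item -Path C:\TEMP\*.log -Destination \\MA-workstation\C$\Temp
```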
Starting Message Analyzer and loading the logs
- Open Message Analyzer.
- Click New Session.
- Enter a session name.
- Click the Files button.
- Add the .log files.
- Select the Cluster text log configuration.
- Click Start to start parsing the files.
- Wait while MA parses the files. Parsing time depends on machine power and the size of the log, but it should normally be measured in tens of minutes, not hours, unless the file is hundreds of megabytes or more.
Filtering unparseable data
- After MA is done parsing the files, the list looks a little disconcerting. All you see are red error messages:
- Not to worry though; what you are looking at is just blank lines and other unparseable data from the file. You can read the unparseable data in the Details pane:
- It is usually log data that is split across multiple lines in the log file, and headers dividing the different logs included in the file. A message similar to the sample above looks like this in the log file:
- We can filter out these messages by adding #Timestamp to the filter pane and clicking Apply. This will filter out all messages without a timestamp.
Saving the session
To make the data load faster next time, we can save the parsed data and filter as a session. This will retain the workspace as we left it.
Looking for problems
The sample log files contain an incident where the iSCSI storage disappeared. It was triggered by a SAN reboot during a firmware update on a SAN without HA. I will go through some analysis of this issue to show how we can use MA to navigate the cluster logs.
- To make it easier to read the log, we will add a Grouping Viewer. Click New Viewer, Grouping, Cluster Logs:
- This will give you a Grouping pane on the left. Start by clicking the Collapse All button:
- Then expand the ERR group and start with the messages without a subcomponent tag. The hexadecimal numbers are the ProcessId of the process writing the error to the log, usually a Resource Hosting Subsystem process (see the small PowerShell sketch after this list for translating them).
- It is pretty clear that we have a storage problem:
- To check which log contains one of these messages, select one message and look in the Details pane, Properties mode. Scroll down until you find the TraceSource property:
- To read other messages logged at the same time, switch the Grouping viewer from Filter to Select mode:
- If we click the same ERR group again, the Analysis Grid view will scroll to the first message in this group and mark all messages in the group.
- The WARN InfoLevel for the RES SubComponent is also a good place to look for root causes:
- If you want to see results from one log file only, add *TraceSource == "filename" to the grouping filter.
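As a small aid when working with the ProcessId values mentioned above, the sketch below converts the hexadecimal number from a log line to a decimal PID and looks it up. The value 1a2c is made up for illustration, and the lookup only works if you run it on the node while the process is still alive; after a reboot or failover the PID is history.

```powershell
# Convert the hexadecimal ProcessId from the cluster log to a decimal PID.
$hexProcessId = '1a2c'   # made-up example taken from a log line
$decimalProcessId = [Convert]::ToInt32($hexProcessId, 16)

# Look up the process on the node; for cluster resources this is
# usually a Resource Hosting Subsystem (rhs.exe) process.
Get-Process -Id $decimalProcessId | Select-Object Id, ProcessName, Path
```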