Hyper-V Failover Cluster Troubleshooting and Recovery Part 1, Start with the Logs

When designed and implemented correctly, Windows failover clusters can be one of the most resilient server architectures available for your datacenter. Unfortunately, that does not mean they never have issues. When they do run into trouble, diagnostics and recovery can be challenging because of the complexity of cluster resources.

I generally start the troubleshooting process by having Windows test and report the status of all the components. Generating the report is much faster than logging into each component individually and the report frequently points directly to the cause of an outage. To get started, open an elevated PowerShell console on any of the cluster node servers and run the following.

Get-Cluster | Test-Cluster

The one-liner should start a cluster validation report (CVR). The default CVR process is not invasive; no systems will be rebooted or otherwise heavily impacted. Invasive disk tests are skipped in the default report (at the time of this writing). If you didn’t redirect the output, the results will be located in %SystemRoot%\Cluster\Reports. Three files are generated in the folder.
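If you would rather not hunt through the reports folder by hand, something like the following sketch may help. It lists the available validation tests, captures the object that Test-Cluster returns (a file object pointing at the report, as far as I know), and then shows the newest files in the default reports folder. The variable names are my own, and you should confirm the behavior with Get-Help Test-Cluster on your version of Windows Server.

```powershell
# Requires the FailoverClusters module and an elevated console on a cluster node.
Import-Module FailoverClusters

# List the validation tests that are available, without running any of them.
Test-Cluster -List

# Run the default validation and keep the returned report object.
# Assumption: Test-Cluster returns a file object for the generated report.
$report = Get-Cluster | Test-Cluster
$report | Select-Object -Property FullName, LastWriteTime

# Show the most recently written files in the default reports folder.
Get-ChildItem -Path "$env:SystemRoot\Cluster\Reports" |
    Sort-Object -Property LastWriteTime -Descending |
    Select-Object -First 5 -Property Name, LastWriteTime, Length
```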

Please check the official Microsoft documentation yourself. Microsoft has been known to change a PowerShell cmdlet’s default behavior through updates and upgrades.

Open the .htm file in a web browser and look for items that do not report success. Click their links to see more information about the alert. Resolve any issues and reboot the cluster nodes or affected hardware, if you know it is safe to do so.

Rebooting a cluster node or other components can cause a cluster failure or make an existing one worse. Hardware failure, data loss, and the interruption of services from multiple IT assets and business IT processes can and do occur. If you are not comfortable with, or are not authorized to make, these types of decisions, stop here and find someone who is.

To make this decision you should be fully confident in all aspects of the underpinning network, data-storage platforms, machine-level hardware, the operating systems of all involved equipment and their configurations and interactions. You should also know how to restore the cluster, and anything hosted on it, from backups.

Any action you take is, of course, at your own risk. The idea of this article is to get you pointed in the right direction. Deciding what to do, and the results of any action or inaction, are on you!

Standard cluster events populate the Event Viewer, and you should review them. One of the easiest ways is to open the Failover Cluster Manager console and look for cluster events on the dashboard.

Events that mention a malfunction of the failover cluster database, but have no corresponding all-clear message further down the log stream, are one indication of a possible cluster failure.
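The same events can also be pulled from PowerShell, which is handy when the console itself will not open. The sketch below is only one way to do it: it assumes the nodes allow remote event log queries, and the 24-hour window is an arbitrary choice you should adjust to cover the time of the outage.

```powershell
Import-Module FailoverClusters

# Look back 24 hours; adjust to cover the time of the outage.
$since = (Get-Date).AddHours(-24)

# Levels 1, 2, and 3 are Critical, Error, and Warning.
$filter = @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Level        = 1, 2, 3
    StartTime    = $since
}

# Query every node so events logged on only one host are not missed.
# Nodes that are unreachable or have no matching events are skipped quietly.
Get-ClusterNode | ForEach-Object {
    Get-WinEvent -ComputerName $_.Name -FilterHashtable $filter -ErrorAction SilentlyContinue
} | Sort-Object -Property TimeCreated |
    Format-Table -Property TimeCreated, MachineName, Id, LevelDisplayName, Message -Wrap
```

Sorting by time across all nodes makes it easier to spot a failure message that never gets a matching all-clear.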

In addition to the Cluster Validation Report and events, Windows clusters also include node-level logging facilities. To generate them we need to switch back over to an elevated PowerShell console.

Get-ClusterLog

Running the cmdlet will export the running log for each of the nodes to a file. These logs can take much longer to compile than the CVR did. The reason for checking them is that they will often compile even when the cluster, or its components, have suffered a major failure. The files created by the cmdlet will also be located in C:\Windows\Cluster\Reports, assuming you did not specify a destination.

If you run Get-ClusterLog without specifying a time span, the resulting files can be very large. Sizes in the hundreds of megabytes, or even hundreds of gigabytes, are not uncommon, depending on the cluster’s number of nodes, number of roles, and logging level configuration. You may actually struggle to open the files, depending on your computer and its available software options.

The -TimeSpan parameter takes a number of minutes, so Get-ClusterLog -TimeSpan 5 would pull logs for the last 5 minutes.
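To keep the files manageable and in one place, you might combine the time span with a destination folder. This is a minimal sketch: the folder C:\Temp\ClusterLogs and the 15-minute window are hypothetical choices of mine, not anything the cmdlet requires.

```powershell
Import-Module FailoverClusters

# Hypothetical destination folder; any local path with enough free space should work.
$destination = 'C:\Temp\ClusterLogs'
New-Item -Path $destination -ItemType Directory -Force | Out-Null

# Collect the last 15 minutes of log data from every node into that folder.
# -UseLocalTime writes timestamps in local time instead of UTC, which can make
# it easier to line entries up with what you saw in the event log.
Get-ClusterLog -Destination $destination -TimeSpan 15 -UseLocalTime
```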

As for what to do with the files, alas, that too is out of scope for a blog post. A good place to start is by searching them for keywords like “error”. If you’ve gotten this far and haven’t found a solution, you might consider contacting a professional with Hyper-V cluster experience, or opening a support ticket with Microsoft’s Hyper-V support team.
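If the files are too big to open comfortably, searching them from PowerShell is one option. The sketch below assumes the logs were exported to a hypothetical C:\Temp\ClusterLogs folder and that error entries contain either the word "error" or a standalone ERR token. The ERR token is my assumption about the log format, so adjust the pattern to match what your files actually contain.

```powershell
# Hypothetical folder holding the exported cluster logs.
$logFolder = 'C:\Temp\ClusterLogs'

# Select-String scans each file line by line, so the logs never have to be opened in an editor.
# The pattern matches the word "error" or a standalone ERR token (case-insensitive by default).
$hits = Select-String -Path (Join-Path $logFolder '*.log') -Pattern '\berror\b|\bERR\b'

# Summarize how many matching lines each file contains.
$hits | Group-Object -Property Filename |
    Sort-Object -Property Count -Descending |
    Select-Object -Property Name, Count

# Show the first 20 matches with their file name and line number.
$hits | Select-Object -First 20 -Property Filename, LineNumber, Line
```

The per-file counts give a quick hint about which node was the noisiest, which is often a reasonable place to start reading.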

This may also be a good point to restore the hosts’ Windows system state backups, or the hosts themselves. Either restoration option should let you return the cluster to a working state. The catch is that restoration is itself an expert-level decision and action, one with its own separate sets of requirements and consequences that you need to fully understand before committing to it.

If you don’t have backups, or the restore didn’t work, you’re going to want to stay tuned for my next post in this series.

Further Reading:

Hyper-V Failover Cluster Troubleshooting and Recovery Part 2, How to Rebuild It.
