Channel: High Availability (Clustering) forum

How best to get system out of current "Removing From Pool" and "Repair" states?


Basic Context:

  • Windows Server 2019 Datacenter
  • 2-node Hyper-V cluster
  • Storage Spaces Direct
  • 2x Dell PowerEdge R740
  • 2x QLogic FastLinQ 41262 Dual Port 25Gb SFP28 Adapter
  • 2x 2x 800GB SAS SSD (2 per node)
  • 2x 10x 1.2TB 10K SAS HDD (10 per node)
  • File Share Witness on NAS

Summary:

Two days ago, I discovered that the cluster group and IP address were offline and could not be brought back online. Hoping a reboot of the nodes would resolve the problem, I unfortunately went through the wrong steps for shutting down the VMs and restarting the cluster. When the nodes came back up, the failover cluster services could no longer communicate, and much appeared broken. (It turns out that it was probably a simple matter of fixing broken firewall rules, but I did not discover this until later.) My colleague and I then followed some guidance that, it turned out, did not apply to our problem (namely, removing the drives marked "Communication Lost"), and I now find myself with a degraded cluster. I don't wish to make any more missteps in bringing it back to a healthy state, so I turn to the wiser community for guidance.
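For completeness, this is roughly how the firewall-rule problem can be checked (a sketch, not the exact commands we ran; "Failover Clusters" is the display group name on Server 2019, and it may differ on other builds or locales):

```powershell
# List the failover-cluster firewall rules and whether each is enabled.
Get-NetFirewallRule -DisplayGroup "Failover Clusters" |
    Format-Table DisplayName, Enabled, Direction, Action

# If any are disabled, re-enabling the whole group restores cluster traffic.
Enable-NetFirewallRule -DisplayGroup "Failover Clusters"
```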

Here is how things stand, according to the commands I've seen recommended—I have an image of this, but I can't post it until my account is verified:

  • The command Get-PhysicalDisk shows the 12 drives in the node currently owning the cluster with OperationalStatus "OK" and HealthStatus "Healthy". The 12 drives on the once-disconnected node have OperationalStatus "Removing From Pool, OK" and HealthStatus "Healthy".
  • The command Get-VirtualDisk shows the OperationalStatus of our four volumes as "Degraded" and the HealthStatus as "Warning".
  • The command Get-StorageJob shows each volume with "-Repair" appended to the name; IsBackgroundTask fluctuates between "True" and "False", and JobState shows as "Suspended" for all four repair jobs. The elapsed time increases, but there is no progress.
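For reference, here is roughly how I pulled the status above, run from an elevated PowerShell session on the node that owns the cluster (a sketch; the property names are from the cmdlets' own output, but the exact columns you select may vary):

```powershell
# Per-drive state: the once-disconnected node's drives show
# OperationalStatus "Removing From Pool, OK".
Get-PhysicalDisk |
    Sort-Object FriendlyName |
    Format-Table FriendlyName, OperationalStatus, HealthStatus, Usage

# Per-volume state: all four volumes show "Degraded" / "Warning".
Get-VirtualDisk |
    Format-Table FriendlyName, OperationalStatus, HealthStatus

# Repair jobs: all four sit in JobState "Suspended" with no progress.
Get-StorageJob |
    Format-Table Name, JobState, IsBackgroundTask, PercentComplete, ElapsedTime
```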

As you can hopefully see, there are a few things going on here, or that should be: the removal of the once-disconnected node's drives from the clustered storage pool, and the repair of the cluster volumes. But I don't see progress on either front.

What do you think is the best way to proceed from this state? What commands will put things on the right track, and in which order should they be issued?

Many thanks in advance for considering my problem and contributing to the solution!


