Hi!
We currently run multiple Hyper-V 2016 clusters in our company, all with SAN LUNs attached.
This morning we apparently had a technical issue at the storage level; the root cause is still being investigated.
From 6:37:29 until 6:37:35, all our LUNs, on all our clusters and nodes, reported the following event (5120):
Cluster Shared Volume 'LUNNAME' ('LUNNAME') has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.
After the connection to every LUN was lost, we received the following critical event (1135), again from every node:
Cluster node '99-001-203-s052' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
And finally, logically, this was followed by event 1177:
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
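To make the order of events easier to compare across nodes, here is a minimal Python sketch of how such a timeline could be built from exported event records. Everything in it is an assumption for illustration: the record layout, the node names (node-a, node-b), the helper names build_timeline and first_seen, and the timestamps for events 1135 and 1177 (only the 5120 window of 6:37:29 to 6:37:35 comes from our logs). Event 7036 is included only as an unrelated event that should be filtered out.

```python
from datetime import datetime

# Event IDs discussed in this post, with a short label for each.
CLUSTER_EVENT_IDS = {
    5120: "CSV entered paused state",
    1135: "node removed from cluster membership",
    1177: "cluster service stopped, quorum lost",
}

# Hypothetical sample records; node names and the 1135/1177 times are made up.
EVENTS = [
    {"time": "06:37:41", "node": "node-b", "id": 1177},
    {"time": "06:37:29", "node": "node-a", "id": 5120},
    {"time": "06:37:40", "node": "node-a", "id": 1135},
    {"time": "06:37:35", "node": "node-b", "id": 5120},
    {"time": "06:37:30", "node": "node-a", "id": 7036},  # unrelated event
]


def build_timeline(events):
    """Keep only the cluster events of interest, sorted by time, then node."""
    relevant = [e for e in events if e["id"] in CLUSTER_EVENT_IDS]
    return sorted(
        relevant,
        key=lambda e: (datetime.strptime(e["time"], "%H:%M:%S"), e["node"]),
    )


def first_seen(events):
    """Return the earliest timestamp at which each event ID appeared."""
    first = {}
    for e in build_timeline(events):
        first.setdefault(e["id"], e["time"])
    return first


if __name__ == "__main__":
    for e in build_timeline(EVENTS):
        print(e["time"], e["node"], e["id"], CLUSTER_EVENT_IDS[e["id"]])
```

Running this against the real exports from all nodes would show whether 5120 really precedes 1135 on every node, or whether some nodes lost membership first.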
However, since no network-related events were found, I do not understand why the cluster nodes went down after losing their storage. The only explanation I can think of for this behavior is a loss of authentication towards the domain controllers.
We have 2 physical and 2 virtual DCs, so this would mean that at that moment all of the Hyper-V hosts (32 hosts spread over 5 clusters) were authenticated against the virtual DCs and did not manage to re-authenticate against a physical one in a timely manner.
Moreover, it was my understanding that the combined use of the Network Service and CLIUSR accounts drastically reduced (negated?) the need for DCs to stay available in an emergency.
Can anybody shed some light on this situation?