Clustered ONTAP Data Availability in the Event of 'Going Out of Quorum'

Something to be aware of in Clustered ONTAP (and pretty much any other clustered file system) is that if your cluster goes out of quorum, you lose data availability until quorum is restored. The following post demonstrates this.

The Demonstration

We have a 4-node cluster running Data ONTAP 8.1.2P4. As is normal with a new cluster, epsilon resides on the first node that was added to the cluster.

Note 1: All epsilon does is add voting weight to the node that holds it, so that - in normal circumstances - there will always be a victor in any election for quorum ownership. The epsilon node has a voting weight of, say, 1.1, and all the others 1.
Note 2: Epsilon can be moved; contact NetApp Global Support (NGS) to arrange this.
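
(As an aside: on later Data ONTAP releases, epsilon can be reassigned yourself at the advanced privilege level with something like the below - treat this as a sketch and check the documentation for your version; on 8.1.2 as used here, go via NGS.)

clust::*> cluster modify -node clust-01 -epsilon false
clust::*> cluster modify -node clust-02 -epsilon true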

To see where Epsilon is, drop down into the advanced privilege level and run cluster show:

clust::> set -priv advanced
clust::*> cluster show
Node     Health Eligibility Epsilon
-------- ------ ----------- -------
clust-01 true   true        true
clust-02 true   true        false
clust-03 true   true        false
clust-04 true   true        false

Also, run cluster ring show to display each cluster node's replication rings:

clust::cluster*> cluster ring show
Node      UnitName Epoch DB Epoch DB Trnxs Master
--------- -------- ----- -------- -------- --------
clust-01  mgmt     5     5        259      clust-01
clust-01  vldb     3     3        60       clust-01
clust-01  vifmgr   3     3        145      clust-01
clust-01  bcomd    4     4        35       clust-01
clust-02  mgmt     5     5        259      clust-01
clust-02  vldb     3     3        60       clust-01
clust-02  vifmgr   3     3        145      clust-01
clust-02  bcomd    4     4        35       clust-01
clust-03  mgmt     5     5        259      clust-01
clust-03  vldb     3     3        60       clust-01
clust-03  vifmgr   3     3        145      clust-01
clust-03  bcomd    4     4        35       clust-01
clust-04  mgmt     5     5        259      clust-01
clust-04  vldb     3     3        60       clust-01
clust-04  vifmgr   3     3        145      clust-01
clust-04  bcomd    4     4        35       clust-01

In our test lab, we simply have two NFS volumes - one on clust-03 and one on clust-04 - presented to an ESXi host as NFS datastores.
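
To confirm where each volume lives and which data LIFs serve the NFS exports, something like the following can be used (the Vserver name vs1 is just an example for illustration):

clust::*> volume show -vserver vs1 -fields node,junction-path
clust::*> network interface show -vserver vs1 -role data -fields address,curr-node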

Image: NFS datastores available (active)

Power Down of Epsilon Node and Partner

After powering down the Epsilon node - clust-01 - and clust-02, we get:

clust::*> cluster show
Node     Health Eligibility Epsilon
-------- ------ ----------- -------
clust-01 false  true        true
clust-02 false  true        false
clust-03 false  true        false
clust-04 false  true        false

clust::*> cluster ring show
Node     UnitName Epoch DB Epoch DB Trnxs Master
-------- -------- ----- -------- -------- ------
Warning: Unable to list entries on node clust-01. RPC: Port mapper failure - RPC: Timed out
Warning: Unable to list entries on node clust-02. RPC: Port mapper failure - RPC: Timed out
clust-03 mgmt     0     5        277      -
clust-03 vldb     0     3        60       -
clust-03 vifmgr   0     3        145      -
clust-03 bcomd    0     4        35       -
clust-04 mgmt     0     5        277      -
clust-04 vldb     0     3        60       -
clust-04 vifmgr   0     3        145      -
clust-04 bcomd    0     4        35       -

Note: You might lose connection to the cluster management LIF if it is homed on one of those two nodes, and it won't fail over while the cluster is out of quorum. Connect via one of the surviving nodes' node management interfaces instead.
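
As a quick sketch (addresses and LIF names will differ per environment), the node management LIFs on the surviving nodes can be listed once connected directly to one of them:

clust::*> network interface show -role node-mgmt -fields address,curr-node,status-oper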

The Result

The NFS Datastores are unavailable.

Image: NFS datastores unavailable (inactive)

In this situation, the fix would simply be to bring up one of the downed nodes, thus restoring quorum, or to contact NGS to move Epsilon.
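
Once one of the downed nodes is powered back on, the same commands used above will confirm that quorum has been re-established - Health returns to true and each replication ring again reports a Master:

clust::*> cluster show
clust::*> cluster ring show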

Note: This only applies to clusters with more than 2 nodes. In a 2-node cluster, cluster HA is enabled with the command:

clust::*> cluster ha modify -configured true

This must be set to -configured false for clusters with more than 2 nodes.
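
To check the current setting, run cluster ha show (the output below is an illustrative sketch):

clust::*> cluster ha show

High Availability Configured: true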

