Recovering an Out-of-Quorum Node After an ARL Headswap (Missing/Unmapped Cluster Ports)

I’ve done a few ARL headswaps, and each time I worked out a strategy to ensure at least one cluster port maps correctly to the new head. For example, in a FAS3220 to FAS6220 ARL headswap, I moved one of the FAS3220’s 10 GbE cards to slot 3 and moved the cluster ports onto it, then took the card across to slot 3 of the FAS6220, and corrected the cluster ports to follow best practice after the ARL. I was wondering what I’d do if I absolutely couldn’t come up with a cunning plan to map a cluster port from the old controller to the new controller. Below, I demonstrate what I’d do, using a 2-node SIM cluster.

Note: This is of course all unofficial stuff. On production systems, only NetApp support/personnel with a valid support case should be using commands to manipulate the CDB like we do here. Only the absolute minimum required amount of CDB modification is done to get the node back into quorum.

To demonstrate this, I have a 2-node cluster (C91) with nodes C91-01 and C91-02. I first ensure epsilon is on node 2 (C91-02), then shut down C91-01 and remove the two network adapters (5 and 6, which map to e0e and e0f) used for the cluster ports.

Image: Removing cluster ports e0e and e0f from the simulator

1) Gathering a few outputs and halting node 1


C91::*> version

NetApp Release 9.1: Thu Dec 22 23:05:58 UTC 2016

C91::*> cluster show

Node   Health  Eligibility Epsilon
------ ------- ----------- -------
C91-01 true    true        false
C91-02 true    true        true

C91::*> net int show -role cluster

Logical    Status     Network          Current Current Is
Interface  Admin/Oper Address/Mask     Node    Port    Home
---------- ---------- ---------------- ------- ------- ----
C91-01_clus1 up/up    169.254.94.74/16 C91-01  e0e     true
C91-01_clus2 up/up    169.254.94.84/16 C91-01  e0f     true
C91-02_clus1 up/up    169.254.47.70/16 C91-02  e0e     true
C91-02_clus2 up/up    169.254.47.80/16 C91-02  e0f     true

C91::*> net port show -role cluster

Node: C91-01
                                 Speed(Mbps) Health
Port IPspace Broadcast Link MTU  Admin/Oper  Status
---- ------- --------- ---- ---- ----------- -------
e0e  Cluster Cluster   up   1500 auto/1000   healthy
e0f  Cluster Cluster   up   1500 auto/1000   healthy

Node: C91-02
                                 Speed(Mbps) Health
Port IPspace Broadcast Link MTU  Admin/Oper  Status
---- ------- --------- ---- ---- ----------- -------
e0e  Cluster Cluster   up   1500 auto/1000   healthy
e0f  Cluster Cluster   up   1500 auto/1000   healthy

C91::*> halt -node C91-01
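
The outputs gathered above are what you would compare against the port list of the new head. As a rough sketch (not an official tool), the check boils down to a set intersection. The function name is hypothetical; the port names are the ones from this walkthrough, with the second list being the ports left on node 1 once adapters 5 and 6 are gone.

```python
def surviving_cluster_ports(old_cluster_ports, new_head_ports):
    """Return the old cluster port names that also exist on the new head."""
    return sorted(set(old_cluster_ports) & set(new_head_ports))


# Port names taken from the outputs in this walkthrough.
old_cluster_ports = ["e0e", "e0f"]             # cluster ports before the swap
new_head_ports = ["e0a", "e0b", "e0c", "e0d"]  # ports present after the swap

survivors = surviving_cluster_ports(old_cluster_ports, new_head_ports)
if survivors:
    print("Cluster ports that carry over:", ", ".join(survivors))
else:
    print("No cluster port carries over; expect the node to boot out of quorum.")
```

An empty result is exactly the situation this post deals with.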


2) Remove network adapters 5 and 6 from the simulator

3) Power up node 1

4) Check cluster quorum

On node 1 (C91-01), notice that the node is out of quorum: it reports cluster health = false for both nodes.


C91::*> node show local -fields node
node
------
C91-01

C91::*> cluster show
Node   Health  Eligibility Epsilon
------ ------- ----------- -------
C91-01 false   true        false
C91-02 false   true        true


On node 2 (C91-02), notice that the node is in quorum (its own cluster health = true), while C91-01 shows as unhealthy.


C91::*> node show local -fields node
node
------
C91-02

C91::*> cluster show
Node   Health  Eligibility Epsilon
------ ------- ----------- -------
C91-01 false   true        false
C91-02 true    true        true
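
If you capture the `cluster show` output to a text file, a small script can flag the unhealthy nodes for you. This is only a sketch that assumes the four-column layout shown above; the function names are made up for illustration and nothing here talks to ONTAP.

```python
def parse_cluster_show(text):
    """Parse captured `cluster show` output into a list of dicts."""
    rows = []
    seen_separator = False
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # The dashed line separates the header from the data rows.
        if set(stripped) <= {"-", " "}:
            seen_separator = True
            continue
        fields = stripped.split()
        if not seen_separator or len(fields) != 4:
            continue
        node, health, eligibility, epsilon = fields
        rows.append({
            "node": node,
            "health": health == "true",
            "eligibility": eligibility == "true",
            "epsilon": epsilon == "true",
        })
    return rows


def unhealthy_nodes(rows):
    """Return the names of nodes whose Health column is not true."""
    return [row["node"] for row in rows if not row["health"]]


# The view from node 2, copied from the output above.
NODE2_VIEW = """\
Node   Health  Eligibility Epsilon
------ ------- ----------- -------
C91-01 false   true        false
C91-02 true    true        true
"""

if __name__ == "__main__":
    print(unhealthy_nodes(parse_cluster_show(NODE2_VIEW)))
```

Run against the node 2 view, this reports only C91-01; run against the node 1 view, it would report both nodes.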


5) Fix the problem

We modify ports e0a and e0b to be cluster ports.
Then we modify the cluster LIFs to be on e0a and e0b.
Finally we reboot node 1.


C91::*> net port show -role cluster

There are no entries matching your query.

C91::*> net int show -role cluster

Logical    Status     Network          Current Current Is
Interface  Admin/Oper Address/Mask     Node    Port    Home
---------- ---------- ---------------- ------- ------- ----
C91-01_clus1 up/down  169.254.94.74/16 C91-01  e0e     true
C91-01_clus2 up/down  169.254.94.84/16 C91-01  e0f     true

C91::*> broadcast-domain show

Error: show failed: Cannot run this command because the system is not fully initialized. Wait a few minutes, and then try the command again.

C91::*> set diag

C91::*> network ipspace cdb show

IPspace ID
------- -------
Cluster
        4294967294
Default
        4294967295

C91::*> network port cdb show

                           Auto-Neg Duplex Speed Flowcontrol
Node   Port Role      MTU  Admin    Admin  Admin Admin
------ ---- --------- ---- -------- ------ ----- -----------
C91-01
       e0a  data      1500 true     auto   auto  full
       e0b  data      1500 true     auto   auto  full
       e0c  node-mgmt 1500 true     auto   auto  full
       e0d  data      1500 true     auto   auto  full

Warning: Unable to list entries on node C91-02. RPC: Couldn't make connection

C91::*> network interface cdb show

                     Status Network                   Valid
Node   ID    Name    Admin  Address       Netmask     Id
------ ----- ------- ------ ------------- ----------- -----
C91-01
       1023  C91-01_ up     169.254.94.84 255.255.0.0 true
             clus2
       1024  C91-01_ up     169.254.94.74 255.255.0.0 true
             clus1

Warning: Unable to list entries on node C91-02. RPC: Couldn't make connection

C91::*> net port cdb modify -port e0a -node C91-01 -role cluster -mtu 1500 -flowcontrol-admin none -ipspace-id 4294967294
C91::*> net port cdb modify -port e0b -node C91-01 -role cluster -mtu 1500 -flowcontrol-admin none -ipspace-id 4294967294

C91::*> net port cdb show

                           Auto-Neg Duplex Speed Flowcontrol
Node   Port Role      MTU  Admin    Admin  Admin Admin
------ ---- --------- ---- -------- ------ ----- -----------
C91-01
       e0a  cluster   1500 true     auto   auto  none
       e0b  cluster   1500 true     auto   auto  none
       e0c  node-mgmt 1500 true     auto   auto  full
       e0d  data      1500 true     auto   auto  full

Warning: Unable to list entries on node C91-02. RPC: Couldn't make connection

C91::*> net int cdb show

                     Status Network                   Valid
Node   ID    Name    Admin  Address       Netmask     Id
------ ----- ------- ------ ------------- ----------- -----
C91-01
       1023  C91-01_ up     169.254.94.84 255.255.0.0 true
             clus2
       1024  C91-01_ up     169.254.94.74 255.255.0.0 true
             clus1

Warning: Unable to list entries on node C91-02. RPC: Couldn't make connection

C91::*> net int cdb modify -lif-id 1024 -node C91-01 -home-port e0a -home-node C91-01 -curr-port e0a -curr-node C91-01
C91::*> net int cdb modify -lif-id 1023 -node C91-01 -home-port e0b -home-node C91-01 -curr-port e0b -curr-node C91-01

C91::*> net int show -role cluster

Logical    Status     Network          Current Current Is
Interface  Admin/Oper Address/Mask     Node    Port    Home
---------- ---------- ---------------- ------- ------- ----
C91-01_clus1 up/down  169.254.94.74/16 C91-01  e0a     true
C91-01_clus2 up/down  169.254.94.84/16 C91-01  e0b     true

C91::*> node show local -fields node
node
------
C91-01

C91::*> reboot local


Notice above that even once the cluster LIFs are on the correct ports, they are still marked as operationally down; this is why we have to reboot, so that the CDB can reload correctly.
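
To avoid typos when there are several LIFs to remap, the four `cdb modify` commands used above can be generated from a small mapping. This sketch only builds the command strings, using exactly the options that appear in this walkthrough; it does not run anything. The function name is hypothetical, and the ipspace ID is the Cluster ID reported by `network ipspace cdb show` on this simulator, so check your own output rather than assuming it matches.

```python
# Cluster ipspace ID as reported by `network ipspace cdb show` in this walkthrough.
CLUSTER_IPSPACE_ID = 4294967294


def build_remap_commands(node, lif_to_port, ipspace_id=CLUSTER_IPSPACE_ID, mtu=1500):
    """Build the port and LIF `cdb modify` command strings for one node.

    lif_to_port maps a LIF ID (from `network interface cdb show`) to the
    port that should become its home and current port.
    """
    commands = []
    # First turn each target port into a cluster port.
    for port in sorted(set(lif_to_port.values())):
        commands.append(
            f"net port cdb modify -port {port} -node {node} -role cluster "
            f"-mtu {mtu} -flowcontrol-admin none -ipspace-id {ipspace_id}"
        )
    # Then point each cluster LIF at its new port.
    for lif_id, port in lif_to_port.items():
        commands.append(
            f"net int cdb modify -lif-id {lif_id} -node {node} "
            f"-home-port {port} -home-node {node} "
            f"-curr-port {port} -curr-node {node}"
        )
    return commands


if __name__ == "__main__":
    # LIF IDs and ports from the example above: clus1 (1024) to e0a, clus2 (1023) to e0b.
    for command in build_remap_commands("C91-01", {1024: "e0a", 1023: "e0b"}):
        print(command)
```

Review the printed commands by hand before pasting them into a diag-level session, and keep in mind the note at the top of this post about who should be running them on production systems.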

6) Check everything is okay


C91::*> net int show -role cluster

Logical    Status     Network          Current Current Is
Interface  Admin/Oper Address/Mask     Node    Port    Home
---------- ---------- ---------------- ------- ------- ----
C91-01_clus1 up/up    169.254.94.74/16 C91-01  e0b     true
C91-01_clus2 up/up    169.254.94.84/16 C91-01  e0a     true
C91-02_clus1 up/up    169.254.47.70/16 C91-02  e0e     true
C91-02_clus2 up/up    169.254.47.80/16 C91-02  e0f     true

C91::*> cluster show

Node   Health Eligibility Epsilon
------ ------ ----------- -------
C91-01 true   true        false
C91-02 true   true        true

C91::*> network port broadcast-domain show -ipspace Cluster

IPspace Broadcast                   Update
Name    Domain Name MTU  Port List  Status
------- ----------- ---- ---------- ------
Cluster Cluster     1500
                         C91-01:e0a complete
                         C91-01:e0b complete
                         C91-02:e0e complete
                         C91-02:e0f complete
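
The same kind of check can be scripted for the cluster LIFs. The sketch below assumes captured `net int show -role cluster` output in the six-column layout shown in this post, and lists any LIF that is not up/up on its home port. The function names are illustrative only.

```python
def parse_lif_table(text):
    """Parse captured `net int show -role cluster` output into a list of dicts."""
    lifs = []
    seen_separator = False
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # Skip the two header lines; data rows start after the dashed line.
        if set(stripped) <= {"-", " "}:
            seen_separator = True
            continue
        fields = stripped.split()
        if not seen_separator or len(fields) != 6:
            continue
        name, status, address, node, port, is_home = fields
        admin, _, oper = status.partition("/")
        lifs.append({
            "name": name,
            "admin": admin,
            "oper": oper,
            "address": address,
            "node": node,
            "port": port,
            "is_home": is_home == "true",
        })
    return lifs


def lifs_not_ready(lifs):
    """Return the names of LIFs that are not up/up on their home port."""
    return [
        lif["name"]
        for lif in lifs
        if not (lif["admin"] == "up" and lif["oper"] == "up" and lif["is_home"])
    ]


# Output from node 1 before the reboot, copied from step 5.
BEFORE_REBOOT = """\
Logical    Status     Network          Current Current Is
Interface  Admin/Oper Address/Mask     Node    Port    Home
---------- ---------- ---------------- ------- ------- ----
C91-01_clus1 up/down  169.254.94.74/16 C91-01  e0a     true
C91-01_clus2 up/down  169.254.94.84/16 C91-01  e0b     true
"""

if __name__ == "__main__":
    print(lifs_not_ready(parse_lif_table(BEFORE_REBOOT)))
```

Fed the pre-reboot output it lists both node 1 LIFs; fed the post-reboot output from this step it should list nothing.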


THE END

UPDATE!
This should be easier in ONTAP 9.0 and later, but the above still works. Check out this KB article:

Headswap mapping ports steps fail with "Error: Cannot run this command because the system is not fully initialized."
https://kb.netapp.com/app/answers/answer_view/a_id/1084223/loc/en_US

Also related:

Cluster LIFs not visible after headswap
https://kb.netapp.com/app/answers/answer_view/a_id/1087737/loc/en_US
