What Happens when Ports Go Missing ...

In this post we cover five scenarios that you might encounter in the field when removing network cards, doing headswaps (non-disruptive ARL, or disruptive), and so on. The version of NetApp ONTAP used here is 8.3.2.

1) After an Ethernet card removal, we have lost a port that was the home port for a data LIF.
2) After an Ethernet card removal, we have lost one port that was part of an IFGRP.
3) After Ethernet card removals, we have lost both ports that were part of an IFGRP.
4) After a head swap, we have lost both cluster ports on the Epsilon node (2-node cluster).
5) After a head swap, we have lost both cluster ports on the non-Epsilon/out-of-quorum node (2-node cluster).

I’m using a 2-node simulator cluster to demonstrate these. The cluster is called CLU, and the two nodes are CLU-01 and CLU-02.

1) Lost e0j on CLU-01

Initial Setup:


CLU::> network interface show SVM1_NFS1

        Logical    Status     Network      Current Current Is
Vserver Interface  Admin/Oper Address/Mask Node    Port    Home
------- ---------- ---------- ------------ ------- ------- ----
SVM1    SVM1_NFS1    up/up    10.3.6.1/8   CLU-01  e0j     true

CLU::> network port show -node CLU-01 -port e0j -fields node,port,ipspace,broadcast-domain,link,mtu,speed-admin,speed-oper

node   port link mtu  speed-admin speed-oper ipspace broadcast-domain
------ ---- ---- ---- ----------- ---------- ------- ----------------
CLU-01 e0j  up   1500 auto        1000       Default Default


After halting the system, removing the port, and booting back up, this is what we have:


CLU::> network port show -node CLU-01 -port e0j -fields node,port,ipspace,broadcast-domain,link,mtu,speed-admin,speed-oper

node   port link mtu speed-admin speed-oper ipspace broadcast-domain
------ ---- ---- --- ----------- ---------- ------- ----------------
CLU-01 e0j  -    -   auto        -          Default Default

CLU::> network interface show SVM1_NFS1

        Logical    Status     Network      Current Current Is
Vserver Interface  Admin/Oper Address/Mask Node    Port    Home
------- ---------- ---------- ------------ ------- ------- ----
SVM1    SVM1_NFS1    up/up    10.3.6.1/8   CLU-01  e0c     false


To tidy up/resolve:


CLU::> set adv
CLU::*> network port delete -node CLU-01 -port e0j

Error: command failed: Operation can't be completed because port is either the home port or failover target of a LIF.

CLU::*> network interface modify -lif SVM1_NFS1 -vserver SVM1 -home-node CLU-01 -home-port e0c
CLU::*> net port delete -node CLU-01 -port e0j
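
A missing port is easy to spot in the -fields output above: its link, mtu and speed-oper columns all read "-". The following is a minimal Python sketch of that check, assuming the CLI output has been captured as text (for example from a session log). The function name missing_ports is hypothetical, and the parser only understands the whitespace-separated table layout shown in this post.

```python
SAMPLE = """\
CLU::> network port show -node CLU-01 -port e0j -fields node,port,ipspace,broadcast-domain,link,mtu,speed-admin,speed-oper

node   port link mtu speed-admin speed-oper ipspace broadcast-domain
------ ---- ---- --- ----------- ---------- ------- ----------------
CLU-01 e0j  -    -   auto        -          Default Default
"""


def missing_ports(show_output):
    """Return (node, port) pairs whose link column reads '-'.

    Expects text in the 'network port show -fields ...' layout used in
    this post: one header row, one dashed separator row, then data rows.
    """
    header = None
    missing = []
    for line in show_output.splitlines():
        cols = line.split()
        if not cols:
            continue  # blank line
        if header is None:
            # The header row is the first line naming both 'node' and 'link'.
            if "node" in cols and "link" in cols:
                header = cols
            continue
        if set(line.strip()) <= {"-", " "}:
            continue  # dashed separator row
        if len(cols) != len(header):
            continue  # not a data row in the expected layout
        row = dict(zip(header, cols))
        if row["link"] == "-":
            missing.append((row["node"], row["port"]))
    return missing


if __name__ == "__main__":
    print(missing_ports(SAMPLE))
```

Any port this reports is a candidate for the rehome-then-delete steps shown above.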


2) Lost e0i on CLU-01 (e0h is also in the ifgrp)

Initial setup:


CLU::> network interface show SVM1_NFS1

        Logical    Status     Network      Current Current Is
Vserver Interface  Admin/Oper Address/Mask Node    Port    Home
------- ---------- ---------- ------------ ------- ------- ----
SVM1    SVM1_NFS1    up/up    10.3.6.1/8   CLU-01  a0a     true

CLU::> ifgrp show -node CLU-01 -ifgrp a0a -fields ports

node   ifgrp ports
------ ----- -------
CLU-01 a0a   e0h,e0i

CLU::> network port show -node CLU-01 -port e0i -fields node,port,ipspace,broadcast-domain,link,mtu,speed-admin,speed-oper

node   port link mtu  speed-admin speed-oper ipspace broadcast-domain
------ ---- ---- ---- ----------- ---------- ------- ----------------
CLU-01 e0i  up   1500 auto        1000       Default -


After halting the system, removing the port, and booting back up, this is what we have:


CLU::> network interface show SVM1_NFS1

        Logical    Status     Network      Current Current Is
Vserver Interface  Admin/Oper Address/Mask Node    Port    Home
------- ---------- ---------- ------------ ------- ------- ----
SVM1    SVM1_NFS1    up/up    10.3.6.1/8   CLU-01  a0a     true

CLU::> ifgrp show -node CLU-01 -ifgrp a0a -fields ports

node   ifgrp ports
------ ----- -------
CLU-01 a0a   e0h,e0i

CLU::> network port show -node CLU-01 -port e0i -fields node,port,ipspace,broadcast-domain,link,mtu,speed-admin,speed-oper

node   port link mtu speed-admin speed-oper ipspace broadcast-domain
------ ---- ---- --- ----------- ---------- ------- ----------------
CLU-01 e0i  -    -   auto        -          Default -


To tidy up/resolve:


CLU::> set adv
CLU::*> ifgrp remove-port -ifgrp a0a -node CLU-01 -port e0i

Error: command failed: Port already has a lif bound.

CLU::*> net int modify -lif SVM1_NFS1 -home-node CLU-01 -home-port e0c -vserver SVM1
CLU::*> net int revert -lif SVM1_NFS1 -vserver SVM1
CLU::*> ifgrp remove-port -ifgrp a0a -node CLU-01 -port e0i
CLU::*> net port delete -node CLU-01 -port e0i


3) Lost e0i + e0j on CLU-02 (both ports in the ifgrp)

Initial Setup:


CLU::> network interface show SVM1_NFS1

        Logical    Status     Network      Current Current Is
Vserver Interface  Admin/Oper Address/Mask Node    Port    Home
------- ---------- ---------- ------------ ------- ------- ----
SVM1    SVM1_NFS1    up/up    10.3.6.1/8   CLU-02  a0a     true

CLU::> ifgrp show -node CLU-02 -ifgrp a0a -fields ports

node   ifgrp ports
------ ----- -------
CLU-02 a0a   e0i,e0j

CLU::> network port show -node CLU-02 -port e0i,e0j -fields node,port,ipspace,broadcast-domain,link,mtu,speed-admin,speed-oper

node   port link mtu  speed-admin speed-oper ipspace broadcast-domain
------ ---- ---- ---- ----------- ---------- ------- ----------------
CLU-02 e0i  up   1500 auto        1000       Default -
CLU-02 e0j  up   1500 auto        1000       Default -


After halting the system, removing the ports, and booting back up, this is what we have:


CLU::> network interface show SVM1_NFS1

        Logical    Status     Network      Current Current Is
Vserver Interface  Admin/Oper Address/Mask Node    Port    Home
------- ---------- ---------- ------------ ------- ------- ----
SVM1    SVM1_NFS1    up/up    10.3.6.1/8   CLU-01  e0c     false

CLU::> ifgrp show -node CLU-02 -ifgrp a0a -fields ports

node   ifgrp ports
------ ----- -------
CLU-02 a0a   e0i,e0j

CLU::> network port show -node CLU-02 -port e0i,e0j -fields node,port,ipspace,broadcast-domain,link,mtu,speed-admin,speed-oper

node   port link mtu speed-admin speed-oper ipspace broadcast-domain
------ ---- ---- --- ----------- ---------- ------- ----------------
CLU-02 e0i  -    -   auto        -          Default -
CLU-02 e0j  -    -   auto        -          Default -


To tidy up/resolve:


CLU::> set adv
CLU::*> net int modify -lif SVM1_NFS1 -home-node CLU-02 -home-port e0c -vserver SVM1
CLU::*> net int revert -lif SVM1_NFS1 -vserver SVM1
CLU::*> ifgrp delete -node CLU-02 -ifgrp a0a
CLU::*> net port delete -node CLU-02 -port e0i
CLU::*> net port delete -node CLU-02 -port e0j
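
Scenarios 2 and 3 differ only in whether any ifgrp member survives: if one does, the lost port is removed from the ifgrp; if none does, the whole ifgrp is deleted. The Python sketch below builds the same command sequences as text, so they can be reviewed before being pasted into an advanced-mode session. It is an illustration under assumptions, not a tool: the function name and its parameters are hypothetical, and it only emits the commands already used in this post.

```python
def ifgrp_cleanup_commands(node, ifgrp, members, lost, lif, vserver, new_home_port):
    """Build the clean-up command sequence used in scenarios 2 and 3.

    members: ports currently listed in the ifgrp.
    lost:    ports that no longer exist on the node.
    Returns an empty list when no ifgrp member has been lost.
    """
    lost_members = [p for p in members if p in lost]
    if not lost_members:
        return []

    cmds = [
        # Rehome the LIF first. In scenario 2, 'ifgrp remove-port' failed
        # with "Port already has a lif bound" until this was done.
        f"net int modify -lif {lif} -vserver {vserver} "
        f"-home-node {node} -home-port {new_home_port}",
        f"net int revert -lif {lif} -vserver {vserver}",
    ]

    if len(lost_members) == len(members):
        # Scenario 3: every member is gone, so delete the ifgrp itself.
        cmds.append(f"ifgrp delete -node {node} -ifgrp {ifgrp}")
    else:
        # Scenario 2: a member survives, so remove only the lost ports.
        cmds += [
            f"ifgrp remove-port -ifgrp {ifgrp} -node {node} -port {p}"
            for p in lost_members
        ]

    # Finally delete the stale port entries.
    cmds += [f"net port delete -node {node} -port {p}" for p in lost_members]
    return cmds


if __name__ == "__main__":
    for cmd in ifgrp_cleanup_commands(
        "CLU-02", "a0a", ["e0i", "e0j"], {"e0i", "e0j"}, "SVM1_NFS1", "SVM1", "e0c"
    ):
        print(cmd)
```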


Preliminaries for 4 and 5:

Cluster, Cluster LIFs, and Cluster Ports setup:

CLU::*> cluster show

Node   Health  Eligibility   Epsilon
------ ------- ------------  -------
CLU-01 true    true          true
CLU-02 true    true          false

CLU::*> network interface show -role cluster

        Logical    Status     Network            Current Current Is
Vserver Interface  Admin/Oper Address/Mask       Node    Port    Home
------- ---------- ---------- ------------------ ------- ------- ----
Cluster
        CLU-01_clus1 up/up    169.254.76.193/16  CLU-01  e0g     true
        CLU-01_clus2 up/up    169.254.126.4/16   CLU-01  e0h     true
        CLU-02_clus1 up/up    169.254.33.108/16  CLU-02  e0g     true
        CLU-02_clus2 up/up    169.254.130.213/16 CLU-02  e0h     true

CLU::*> network port show -role cluster
                                                        Speed (Mbps)
Node   Port      IPspace Broadcast Domain Link   MTU    Admin/Oper
------ --------- ------- ---------------- ----- ------- ------------
CLU-01
       e0g       Cluster Cluster          up       1500  auto/1000
       e0h       Cluster Cluster          up       1500  auto/1000
CLU-02
       e0g       Cluster Cluster          up       1500  auto/1000
       e0h       Cluster Cluster          up       1500  auto/1000


Then we halt both nodes in the cluster:


CLU::*> halt !local -inhi -igno -skip
CLU::*> halt local -inhi -igno -skip


4) Lost e0g,e0h on CLU-01 (Node had Epsilon prior to 2-node cluster shutdown)

What we have:

CLU::> set adv
CLU::*> cluster show

Node   Health  Eligibility   Epsilon
------ ------- ------------  -------
CLU-01 true    true          true
CLU-02 false   true          false

CLU::*> network interface show -role cluster

        Logical    Status     Network            Current Current Is
Vserver Interface  Admin/Oper Address/Mask       Node    Port    Home
------- ---------- ---------- ------------------ ------- ------- ----
Cluster
        CLU-01_clus1 up/down  169.254.76.193/16  CLU-01  e0g     true
        CLU-01_clus2 up/down  169.254.126.4/16   CLU-01  e0h     true
        CLU-02_clus1 up/-     169.254.33.108/16  CLU-02  e0g     true
        CLU-02_clus2 up/-     169.254.130.213/16 CLU-02  e0h     true

CLU::*> network port show -role cluster
                                                   Speed (Mbps)
Node   Port IPspace Broadcast Domain Link   MTU    Admin/Oper
------ ---- ------- ---------------- ----- ------- ------------
CLU-01
       e0g  Cluster Cluster          -           -  auto/-
       e0h  Cluster Cluster          -           -  auto/-

Warning: Unable to list entries for vifmgr on node "CLU-02": RPC: Port mapper failure - RPC: Unable to send.


To fix (we are connected via CLU-01's node management LIF):


CLU::*> broadcast-domain add-ports -broadcast-domain Cluster -IPspace Cluster -ports CLU-01:e0a
CLU::*> broadcast-domain add-ports -broadcast-domain Cluster -IPspace Cluster -ports CLU-01:e0b
CLU::*> net int modify -lif CLU-01_clus1 -vserver Cluster -home-port e0a -home-node CLU-01
CLU::*> net int modify -lif CLU-01_clus2 -vserver Cluster -home-port e0b -home-node CLU-01
CLU::*> net int revert -lif CLU-01_clus1 -vserver Cluster
CLU::*> net int revert -lif CLU-01_clus2 -vserver Cluster
CLU::*> net port delete -port e0g -node CLU-01
CLU::*> net port delete -port e0h -node CLU-01


After the fix, this is what we have:


CLU::*>  cluster show

Node   Health  Eligibility   Epsilon
------ ------- ------------  -------
CLU-01 true    true          true
CLU-02 false   true          false

CLU::*> network interface show -role cluster

        Logical    Status     Network            Current Current Is
Vserver Interface  Admin/Oper Address/Mask       Node    Port    Home
------- ---------- ---------- ------------------ ------- ------- ----
Cluster
        CLU-01_clus1 up/up    169.254.76.193/16  CLU-01  e0a     true
        CLU-01_clus2 up/up    169.254.126.4/16   CLU-01   e0b     true
        CLU-02_clus1 up/-     169.254.33.108/16  CLU-02  e0g     true
        CLU-02_clus2 up/-     169.254.130.213/16 CLU-02  e0h     true

CLU::*> network port show -role cluster
                                                  Speed (Mbps)
Node   Port IPspace Broadcast Domain Link  MTU    Admin/Oper
------ ---- ------- ---------------- ----- ------ ------------
CLU-01
       e0a  Cluster Cluster          up    1500  auto/1000
       e0b  Cluster Cluster          up    1500  auto/1000

Warning: Unable to list entries for vifmgr on node "CLU-02": RPC: Port mapper failure - RPC: Timed out.
2 entries were displayed.


5) Lost e0g,e0h on CLU-02 (Node didn't have Epsilon prior to 2-node cluster shutdown)

What we have:

CLU::> set adv

CLU::*> cluster show

Node   Health  Eligibility   Epsilon
------ ------- ------------  -------
CLU-01 false   true          true
CLU-02 false   true          false

CLU::*> network interface show -role cluster

        Logical    Status     Network            Current Current Is
Vserver Interface  Admin/Oper Address/Mask       Node    Port    Home
------- ---------- ---------- ------------------ ------- ------- ----
Cluster
        CLU-02_clus1 up/down  169.254.33.108/16  CLU-02  e0g     true
        CLU-02_clus2 up/down  169.254.130.213/16 CLU-02  e0h     true

CLU::*> network port show -role cluster
                                                   Speed (Mbps)
Node   Port IPspace Broadcast Domain Link   MTU    Admin/Oper
------ ---- ------- ---------------- ----- ------- ------------
CLU-02
       e0g  Cluster -                -        1500  auto/-
       e0h  Cluster -                -        1500  auto/-


To fix (we are connected via CLU-02's node management LIF):


CLU::*> broadcast-domain add-ports -broadcast-domain Cluster -IPspace Cluster -ports CLU-02:e0a

Error: command failed: Cannot run this command because the system is not fully initialized. Wait a few minutes, and then try the command again.


OH SH*T!

The point of the post was to show how crucial the cluster ports are. If you’ve performed a headswap (ARL or disruptive) and haven’t fully considered how the cluster ports are going to work on the new platform, you’ll be stuck with an out-of-quorum node on which you can’t make any changes. At this point it is either a support case (support may have some secret diag commands to fix it), or you physically restore the ports. For example, if you were doing a headswap from a FAS32XX with cluster ports on e1a and e2a to a FAS80XX with cluster ports on e0a and e0c, you could move the 10 GbE cards from the FAS32XX to the FAS80XX.
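
The planning step can be reduced to one question per node: does at least one current cluster LIF home port still exist after the swap? Below is a minimal Python sketch of that check. It assumes that one surviving cluster port per node is enough, that you supply the port inventories yourself, and that port names map one-to-one between platforms. The function name and the example inventories are hypothetical.

```python
def nodes_without_cluster_port(cluster_lif_home_ports, ports_after_swap):
    """Return the nodes whose cluster LIF home ports all disappear.

    cluster_lif_home_ports: {node: [home ports of its cluster LIFs]}
    ports_after_swap:       {node: set of port names on the new hardware}
    """
    at_risk = []
    for node, home_ports in cluster_lif_home_ports.items():
        available = ports_after_swap.get(node, set())
        surviving = [p for p in home_ports if p in available]
        if not surviving:
            at_risk.append(node)
    return sorted(at_risk)


if __name__ == "__main__":
    # Hypothetical inventories: cluster LIFs on e1a/e2a before the swap,
    # and a new platform that only offers onboard e0a-e0d.
    before = {"CLU-01": ["e1a", "e2a"], "CLU-02": ["e1a", "e2a"]}
    after = {
        "CLU-01": {"e0a", "e0b", "e0c", "e0d"},
        "CLU-02": {"e0a", "e0b", "e0c", "e0d", "e1a"},
    }
    print(nodes_without_cluster_port(before, after))
```

Any node the check returns needs its cluster LIFs rehomed to a surviving port before the swap, while the node can still accept changes.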


Comments

  1. Hi, I just read through your procedure. Very nice tests and helpful information.

    I will do a headswap this way next week. As far as I understand cDOT and this guide, it is sufficient to have one working cluster network port after the swap. Before the swap, I would migrate the cluster LIF from the port that will disappear (e4a) to a port that will survive (e1a). That should do the trick, and the nodes will be able to form quorum with the two LIFs on the single port.

    Would you agree with that?

    Kind regards
    Christian

    Replies
    1. Hi Christian, yes, 1 cluster port is perfectly fine. As long as you have one that maps, you're good. Cheers, VC

  2. Hi Vidad,
    I followed your guide and did a disruptive headswap from a FAS3240 to a FAS8040. I ran into the issue where I was not able to move the cluster ports from e1a and e2a ahead of time, because the FAS8040 did not have the PCI card. It turned out that the 10 Gb PCI card in the FAS3240 was not compatible with the FAS8040, so we could not move the card over. We had no choice but to proceed, knowing that one of the nodes was going to be out of quorum due to these cluster ports. For anyone going through this: we got through it by adding the e0a and e0c ports to the Cluster broadcast domain and creating new LIFs in the Cluster SVM on both nodes after almost completing the headswap. We waited a couple of minutes and both nodes were in quorum, which we verified with the "cluster show" and "cluster ring show" commands. So it is possible to fix up your cluster towards the end of the headswap process, assuming you follow the instructions correctly and make it that far.
