Disruptive cDOT Headswap: Step-by-Step Walkthrough

Caveat Lector: Unofficial information!
This post is a guide to ‘Performing a Disruptive Clustered Data ONTAP Headswap’. At a high level, the steps are:

1) Planning and Preparation
2) Decommissioning Old Controllers
3) Power Off and Re-Cable
4) Re-Assigning Disks and First Boot
5) Commissioning New Controllers

Note: This doesn’t cover all scenarios (e.g. V-Series, Storage Encryption, ...)

1) Planning and Preparation

1.1) Cluster Ports

Plan how Cluster ports are going to map to ports on the new controllers.
Note: This is critical! You may need to move the Cluster ports to suitable common ports before starting the headswap (even physically re-arrange cards).
Note: See What Happens when Ports Go Missing ... and scenario 5 for why.
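
One way to sanity-check the port plan is to diff the old and new Cluster port lists. The Python sketch below is only an illustration and not part of the original procedure: the function name and the result labels are hypothetical, and the port names are the ones from the example in the comments at the end of this post.

```python
def plan_cluster_ports(old_cluster_ports, new_cluster_ports):
    """Classify Cluster ports by what has to happen to them around the headswap."""
    old, new = set(old_cluster_ports), set(new_cluster_ports)
    return {
        # Cluster ports on both platforms: Cluster LIFs can stay here.
        "common": sorted(old & new),
        # Cluster ports only on the old platform: re-home Cluster LIFs off these first.
        "vacate_before_swap": sorted(old - new),
        # Cluster ports only on the new platform: sort these out after the swap.
        "add_after_swap": sorted(new - old),
    }


plan = plan_cluster_ports(
    ["e0a", "e0b", "e0c", "e0d"],  # old controllers
    ["e0c", "e0d", "e0e", "e0f"],  # new controllers
)
for action, ports in plan.items():
    print(f"{action}: {', '.join(ports) or '-'}")
```

If "common" comes back empty, that is the case where you may need to physically re-arrange cards before starting.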

1.2) Physical Cabling Plan

Create a cabling schedule to map shelf, ACP, and data connections, from the old controllers, to ports on the new controllers.

1.3) Source Controller Version

Check in Hardware Universe that the new controllers support the ONTAP version on the old controllers.
If not, the old controllers must be upgraded (if this is more than a 2-node cluster, then the whole cluster must be upgraded.)

1.4) Destination Controller Version

The destination controllers must be on the same ONTAP version as the source controllers (I’d go to the exact same P release.)
If not, you can upgrade the destination controllers with no disks attached, using the Boot Menu and Option 7 “Install new software first”.
Note: There is a benefit to running Option 7 twice to make sure that both images (there are 2 boot images) are the same, especially if you’re still on 8.2.x (since the 8.2 -> 8.3 upgrade validation script won’t trigger if you’ve already got an 8.3+ image.)
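
As a rough illustration of the "same version, same P release, both boot images" check, here is a small Python sketch. It is not an official tool: the function names are hypothetical, and the regular expression only assumes simple version strings of the form 8.3.2P5 or 9.1P8 (RC and other release labels are not handled).

```python
import re

# Assumed shape: major.minor[.maintenance][P<patch>], e.g. "8.3.2P5" or "9.1P8".
_VERSION = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?(?:P(\d+))?$")


def parse_version(text):
    """Turn a version string into a comparable (major, minor, maint, patch) tuple."""
    match = _VERSION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    # Missing maintenance or patch numbers are treated as 0.
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def versions_match(source_version, destination_images):
    """True only if every destination boot image equals the source version."""
    wanted = parse_version(source_version)
    return all(parse_version(image) == wanted for image in destination_images)


print(versions_match("8.3.2P5", ["8.3.2P5", "8.3.2P5"]))
print(versions_match("8.3.2P5", ["8.3.2P5", "8.2.4P1"]))
```

The second call models a new controller where Option 7 was only run once, so the two boot images differ.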

Image: Boot Menu Option 7 “Install new software first”
1.5) Re-purposing Controllers

If you are re-purposing controllers that have previously been in a Cluster, you must run wipeconfig. Start from the LOADER prompt:


LOADER> set-defaults
LOADER> setenv bootarg.init.boot_clustered true
LOADER> boot_ontap prompt


And from the boot menu type “wipeconfig”, press enter, and follow the prompts.

You know it’s successful when on reboot you get:

*******************************
*                             *
* Press Ctrl-C for Boot Menu. *
*                             *
*******************************
The boot device has changed. System configuration information
could be lost. Use option (6) to restore the system configuration, or
option (4) to initialize all disks and setup a new system.
Normal Boot is prohibited.

1.6) (Optional) Disabling Autoboot

An optional step that I like to do is to disable autoboot on the new controllers, since it gives you more control (especially if it’s two controllers in one chassis). You can add these lines above, before the boot_ontap.


LOADER> setenv AUTOBOOT false
LOADER> saveenv


1.7) New Controller Licenses

Acquire licenses for the new Clustered Data ONTAP controllers, and apply in advance using Clustershell commands::>


license add {LICENSE_CODE}


1.8) Additional Checks

- Config Advisor: Run Config Advisor against the old controllers and resolve any issues.
- ASUP: View the AutoSupport Health Summary and resolve any issues.
- ONTAP Release Notes: Check to be aware of any known issues.
- Official Documentation: Read and understand!

IMPORTANT NOTE: Switchless Clusters
If you’re headswapping a switchless cluster, be mindful of the bootvar setting -
bootarg.init.switchless_cluster.enable
- it defaults to false, so if this is not set to true on both replacement heads (in the switchless cluster) prior to boot, it’s a support case.

2) Decommissioning Old Controllers

Note: In the following commands, {NODE-A} and {NODE-B} are the nodes being headswapped.

2.1) Epsilon

If this is a 4-node or larger cluster, make sure Epsilon is not on the 2 nodes being downed in the headswap::>


set adv
cluster show -epsilon true
cluster modify -node {EPSILON_NODE} -epsilon false
cluster modify -node {NEW_EPSILON_NODE} -epsilon true
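
The Epsilon decision can be sketched in a few lines of Python. This is an illustration only: the function and the node names are made up, and it simply picks the first node that is staying up as the new Epsilon holder, then prints the same commands as above.

```python
def epsilon_commands(cluster_nodes, epsilon_node, swap_nodes):
    """Return the commands needed to move Epsilon off a node being headswapped."""
    if epsilon_node not in swap_nodes:
        return []  # Epsilon is already on a node that stays up: nothing to do
    survivors = [node for node in cluster_nodes if node not in swap_nodes]
    if not survivors:
        raise ValueError("no node outside the headswap is available to hold Epsilon")
    new_holder = survivors[0]  # arbitrary choice: first node that stays up
    return [
        "set adv",
        f"cluster modify -node {epsilon_node} -epsilon false",
        f"cluster modify -node {new_holder} -epsilon true",
    ]


# Hypothetical 4-node cluster where Epsilon sits on a node being swapped.
nodes = ["clu1-01", "clu1-02", "clu1-03", "clu1-04"]
for command in epsilon_commands(nodes, "clu1-01", ["clu1-01", "clu1-02"]):
    print(command)
```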


2.2) Some Health Checks

Note: This is far from an exhaustive list - please feel free to add more checks as you wish.

For the 2 controllers to be head swapped:
- Verify Cluster communication is working correctly
- Verify the software image
- Check for missing and broken disks.


cluster ping-cluster -node {NODE-A}
cluster ping-cluster -node {NODE-B}
storage failover show -fields local-missing-disks,partner-missing-disks
system node image show -node {NODE-A},{NODE-B} -iscurrent true
storage disk show -nodelist {NODE-A},{NODE-B} -broken


Note: Remove any broken disks from the system.

2.3) Collect SYSIDs of Old Controllers


system node show -node {NODE-A},{NODE-B} -instance


2.4) Record Service-Processor Information


service-processor network show -node {NODE-A}
service-processor network show -node {NODE-B}


2.6) Take Backups

Note: This is a precautionary step - we hope not to use them.

The following commands backup varfs and node configuration::>


security login unlock -username diag
security login password -username diag
set d
systemshell -node {NODE-A}
cp /mroot/etc/varfs.tgz /mroot/etc/varfs.bak
exit
systemshell -node {NODE-B}
cp /mroot/etc/varfs.tgz /mroot/etc/varfs.bak
exit
system configuration backup create -node {NODE-A} -backup-type node -backup-name Aheadswap
system configuration backup create -node {NODE-B} -backup-type node -backup-name Bheadswap


If this is a 4-Node or larger cluster, then take a cluster backup on one of the other nodes - otherwise use both nodes::>


system configuration backup create -node {NODE} -backup-type cluster -backup-name clusbackup


And wait for the backup jobs to complete::>


job show -name *backup*


2.7) ASUPs

Take autosupports and wait for them to send::>


system node autosupport invoke -node {NODE-A} -type all -message "Starting Headswap"
system node autosupport invoke -node {NODE-B} -type all -message "Starting Headswap"
system node autosupport history show -node {NODE-A}
system node autosupport history show -node {NODE-B}


2.8) Cluster LIFs

As per 1.1 above, you may need to move Cluster LIFs before continuing with the headswap.

The Clustershell commands below illustrate how to move one Cluster LIF in 8.3+ from {PORT_A} to {PORT_B}::>


broadcast-domain remove-ports -broadcast-domain default -ports {NODE}:{PORT_B}
net port modify -node {NODE} -port {PORT_B} -flowcontrol-admin none -mtu 9000
broadcast-domain add-ports -broadcast-domain Cluster -ipspace Cluster -port {NODE}:{PORT_B}
net int modify -vserver Cluster -lif {CLUS_LIF} -home-port {PORT_B} -home-node {NODE}


Physically re-cable {NODE}:{PORT_A} to {PORT_B} - the {CLUS_LIF} LIF will automatically fail over.


broadcast-domain remove-ports -broadcast-domain Cluster -ipspace Cluster -port {NODE}:{PORT_A}
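
Since the same sequence gets repeated per Cluster LIF (and possibly again on the new platform), it can help to render the commands from a template and review them before pasting. The Python sketch below does only that: it fills in the placeholders of the commands shown above and talks to nothing. The function name, node name, and LIF name are hypothetical.

```python
def cluster_lif_move_commands(node, port_a, port_b, lif):
    """Return (before_recable, after_recable) command lists for one Cluster LIF move."""
    if port_a == port_b:
        raise ValueError("source and destination ports must differ")
    before_recable = [
        f"broadcast-domain remove-ports -broadcast-domain default -ports {node}:{port_b}",
        f"net port modify -node {node} -port {port_b} -flowcontrol-admin none -mtu 9000",
        f"broadcast-domain add-ports -broadcast-domain Cluster -ipspace Cluster -port {node}:{port_b}",
        f"net int modify -vserver Cluster -lif {lif} -home-port {port_b} -home-node {node}",
    ]
    # Run only once the cable has been moved and the LIF has failed over.
    after_recable = [
        f"broadcast-domain remove-ports -broadcast-domain Cluster -ipspace Cluster -port {node}:{port_a}",
    ]
    return before_recable, after_recable


before, after = cluster_lif_move_commands("clu1-01", "e0a", "e0e", "clus1")
print("\n".join(before))
print("-- physically re-cable e0a to e0e here --")
print("\n".join(after))
```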


2.9) LIFs

Note: Common port = port common to both old and new platforms.
Note: SAN LIFs are considered later.

- If this is a 4-Node or larger cluster, rehome data LIFs to other nodes
- If this is a 2-Node cluster, rehome data LIFs to a common port
- If this is a 4-Node or larger cluster, rehome cluster management to one of the other nodes
- If this is a 2-Node cluster, rehome cluster management to a common port
- For node LIFs (intercluster, node-mgmt), rehome to a common port

Typical commands::>


network interface show -curr-node {NODE}
net int modify -vserver {VSERVER} -lif {LIF_NAME} -home-port {PORT} -home-node {NODE}
net int revert *
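
The re-homing rules in the list above boil down to a small decision table. Here it is as a hedged Python sketch: the function name and the returned strings are mine, and the LIF roles are passed in as plain strings.

```python
def rehome_target(lif_role, cluster_size):
    """Where to re-home a LIF before the headswap, following the rules above."""
    if lif_role in ("intercluster", "node-mgmt"):
        # Node-scoped LIFs cannot leave their node, so they go to a common port.
        return "common port"
    if lif_role in ("data", "cluster-mgmt"):
        # With 4 or more nodes there are other nodes to host the LIF.
        return "other node" if cluster_size >= 4 else "common port"
    raise ValueError(f"no pre-swap re-homing rule for role {lif_role!r}")


for role in ("data", "cluster-mgmt", "intercluster", "node-mgmt"):
    print(f"{role}: 2-node -> {rehome_target(role, 2)}, 4-node -> {rehome_target(role, 4)}")
```

SAN LIFs are deliberately left out, since they are downed later rather than re-homed.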


2.10) STOP APPLICATION DATA ACCESS!

2.11) SAN LIFs

Down any SAN LIFs on the nodes being replaced::>


net int modify -vserver {VSERVER} -lif {SAN_LIF_NAME} -status-admin down


2.12) Cluster HA / Storage Failover

Verify cluster status::>


cluster show


(2-Node Cluster) Disable Cluster HA::>


cluster ha modify -configured false


Disable storage failover::>


storage failover show
storage failover modify -node {NODE-A} -enabled false
storage failover show


Note: If this is a 4 Node cluster, you may want to disable SFO on all nodes in the cluster.

2.13) Halt


halt -node {NODE-A} -inhi -igno -skip
halt -node {NODE-B} -inhi -igno -skip


3) Power Off and Re-Cable

3.1) Power Off Old Controllers

Once the old controllers are at the LOADER> prompt, they can be powered off.

3.2) Transfer any Cards (as required)

3.3) Install New Controllers

3.4) Re-Cable New Controllers

4) Re-Assigning Disks and First Boot

Do these steps (4.1 and 4.2) for both controllers (you can do them in tandem).

4.1) Reassigning Disks

Power on and boot the node into Maintenance Mode.


LOADER> boot_ontap prompt


Ctrl+C to access Boot Menu

Select 5 for Maintenance Mode

In maintenance mode, run the following commands to find out the new controller’s SYSID, and verify disk multi-pathing>


disk show -a
storage show disk -p


For node A (triple check before running)>


disk reassign -s {OLD_NODEA_SYSID} -d {NEW_NODEA_SYSID}


For node B (triple check before running)>


disk reassign -s {OLD_NODEB_SYSID} -d {NEW_NODEB_SYSID}


For both nodes>


disk show -a
mailbox destroy local
mailbox destroy partner
halt
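
Because a wrong SYSID here is painful, one option for the "triple check" is to build the reassign commands offline and let a script catch obvious mix-ups first. The Python sketch below is an illustration only: the function names are hypothetical, the all-digits check is an assumption about SYSID format, and the SYSIDs are made up. The optional -p argument reflects what the comments on this post report for systems with advanced disk partitioning; which partner SYSID to pass is left to you.

```python
def disk_reassign_command(old_sysid, new_sysid, partner_sysid=None):
    """Build one maintenance-mode 'disk reassign' command after basic checks."""
    for label, value in (("old", old_sysid), ("new", new_sysid)):
        if not value.isdigit():  # assumption: SYSIDs are purely numeric
            raise ValueError(f"{label} SYSID {value!r} is not numeric")
    if old_sysid == new_sysid:
        raise ValueError("old and new SYSID are identical")
    command = f"disk reassign -s {old_sysid} -d {new_sysid}"
    if partner_sysid is not None:
        command += f" -p {partner_sysid}"
    return command


def headswap_reassign_plan(old_a, new_a, old_b, new_b):
    """Return the reassign command per node, refusing duplicated SYSIDs."""
    if len({old_a, new_a, old_b, new_b}) != 4:
        raise ValueError("the four SYSIDs must all be different")
    return {
        "node A": disk_reassign_command(old_a, new_a),
        "node B": disk_reassign_command(old_b, new_b),
    }


# Made-up SYSIDs, for illustration only.
for node, command in headswap_reassign_plan(
    "1873000001", "0536000011", "1873000002", "0536000012"
).items():
    print(f"{node}: {command}")
```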


4.2) Update Flash from Backup Config

Boot the node to the Boot Menu.


LOADER> boot_ontap prompt


Ctrl+C to access Boot Menu

Select 6 for Update Flash from Backup Config

Note: It is critical that you catch this on the first boot after assigning disks, otherwise you could end up with a messy support case.


Update Flash from Backup Config:
- WATCH the process
- It will take several minutes and the controller might reboot a few times
- Accept any warning about mis-matched sysid
- You should see something like:

ontap_varfs: restore using /mroot/etc/varfs.tgz
Rebooting to load the new varfs
Abandoned in-memory /var file system


5) Commissioning New Controllers

5.1) Re-Enable SFO::>


storage failover show
storage failover modify -enabled true -node {NODE-A}


(2-Node Cluster) Verify Cluster HA has been re-enabled::>


cluster ha show


5.2) Some Health Checks

Note: This is far from an exhaustive list - please feel free to add more checks as you wish.


cluster show
set adv
cluster ring show -unitname mgmt
cluster ring show -unitname vldb
cluster ring show -unitname vifmgr
cluster ring show -unitname bcomd
cluster ring show -unitname crs



5.3) Cluster LIFs

If Cluster LIFs need to be moved on the new platform, the Clustershell commands below illustrate how to move one Cluster LIF in 8.3+ from {PORT_A} to {PORT_B}::>


broadcast-domain remove-ports -broadcast-domain default -ports {NODE}:{PORT_B}
net port modify -node {NODE} -port {PORT_B} -flowcontrol-admin none -mtu 9000
broadcast-domain add-ports -broadcast-domain Cluster -ipspace Cluster -port {NODE}:{PORT_B}
net int modify -vserver Cluster -lif {CLUS_LIF} -home-port {PORT_B} -home-node {NODE}


Physically re-cable {NODE}:{PORT_A} to {PORT_B} - the {CLUS_LIF} LIF will automatically fail over.


broadcast-domain remove-ports -broadcast-domain Cluster -ipspace Cluster -port {NODE}:{PORT_A}


5.4) LIFs

Re-home LIFs to their correct home port.

Typical commands::>


network interface show -curr-node {NODE}
net int modify -vserver {VSERVER} -lif {LIF_NAME} -home-port {PORT} -home-node {NODE}
net int revert *


At this stage you should also look at:

- Tidy up/correct Failover Groups
- Tidy up/correct Broadcast Domains

5.5) SAN LIFs

Restore any SAN LIFs::>


net int modify -vserver {VSERVER} -lif {SAN_LIF_NAME} -status-admin up


5.6) (Optional - Recommended) Test Failover

Do first for {NODE-A}, then repeat for {NODE-B}::>

storage failover show
storage failover takeover -ofnode {NODE}
storage failover show-takeover
storage failover giveback -ofnode {NODE}
storage failover show-giveback
cluster show
net int show -is-home false
net int revert *


5.7) RESTORE APPLICATION DATA ACCESS!

5.8) Configure Service-Processors::>


service-processor network modify -node {NODE} -address-type IPv4 -enable true -dhcp none -ip-address {ADDRESS} -netmask {MASK} -gateway {GATEWAY}
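
If you recorded the Service-Processor details in 2.4, a quick offline check before re-applying them can catch a mistyped mask or gateway. A minimal Python sketch, assuming IPv4 only: the function name, node name, and addresses are illustrative, and the script only builds the command string shown above.

```python
import ipaddress


def sp_modify_command(node, address, netmask, gateway):
    """Validate recorded SP settings and return the matching modify command."""
    # Raises ValueError if the address or netmask is malformed.
    interface = ipaddress.IPv4Interface(f"{address}/{netmask}")
    if ipaddress.IPv4Address(gateway) not in interface.network:
        raise ValueError(f"gateway {gateway} is outside {interface.network}")
    return (
        f"service-processor network modify -node {node} -address-type IPv4 "
        f"-enable true -dhcp none -ip-address {address} "
        f"-netmask {netmask} -gateway {gateway}"
    )


# Illustrative values only.
print(sp_modify_command("clu1-01", "192.168.10.21", "255.255.255.0", "192.168.10.1"))
```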


5.9) Tidy up licenses::>


license clean-up -unused true -simulate
license clean-up -unused true


5.10) ASUPs

Send autosupports::>


system node autosupport invoke -node {NODE-A} -type all -message "Finished Headswap"
system node autosupport invoke -node {NODE-B} -type all -message "Finished Headswap"


5.11) (Optional) Re-enable AUTOBOOT

If you disabled AUTOBOOT earlier, re-enable::>


set d
debug kenv modify -node {NODE-A} -variable AUTOBOOT -value true -persist true
debug kenv modify -node {NODE-B} -variable AUTOBOOT -value true -persist true


THE END!

Comments

  1. Hi Vidad,
    Do you have instruction on "non-disruptive" 2-node cDOT Headswap: Step-by-Step Walkthrough?
    Regards,
    Jackson

    1. Hi Jackson. "non-disruptive" ARL headswap information starts here:
      http://www.cosonok.com/2016/12/arl-headswap-part-14-preparation.html

    2. Thanks Vidad, question on this Disruptive cDot Headswap. If the OLD controllers cluster ports are e0a/e0b/e0c/e0d and the new controllers cluster ports will be e0c/e0d/e0e/e0f.
      e0e and e0f are already assigned to a0a on the old controllers?
      When should I move the Cluster LIFs e0a to e0e and e0b to e0f? on section 2.8 or section 5.3 above?

    3. Also on the DATA lifs - the old controllers are using a0a-vlanID (with LACP port e0e/e0f/e0g/e0h), the new controllers are using a0a-VLANID (with port e7a/e7b/e13a/e13b) - can I treat they are in the common port a0a?

    4. Hi Jackson,
      The way I'd do it based on the information you've provided for this disruptive headswap:
      1) Pre-headswap remove e0a and e0b from the cluster broadcast-domain (re-home the cluster LIFs that are on them first to e0c and e0d)
      2) Pre-headswap modify the a0a ifgrp and remove the e0e, e0f, e0g, and e0h ports
      3) Post-headswap, add e0e and e0f to the cluster broadcast-domain and sort out the cluster LIFs.
      4) Post-headswap, add the correct ports to a0a.

    5. Thanks Vidad for your quick response and your advice, much appreciated.

  2. Hi Vidad,
    I followed your guide and did a disruptive headswap from a FAS3240 over to a FAS8040. I ran into the issue where i was not able to move the CLUSTER ports from e1a and e2a ahead of time due to not having the PCI card in the FAS8040. Come to find out that the 10gb PCI card in the FAS3240 was not compatible with the FAS8040, so was not able to Move the card over. We had no choice but to proceed, knowing that one of the nodes was going to be OUT OF QUORUM due to these cluster ports. But just wanted to point out to anyone going through this, that the way we got through it was by adding the e0a and e0c ports to the CLUSTER broadcast domain and creating new LIFS in the CLUSTER SVM on both nodes after almost completing the headswap. Waited a couple minutes and both nodes were now in quorum. Verified with the "Cluster Show" and "Cluster ring show" commands. So it is possible to fix up your cluster towards the end of the headswap process.. assuming you follow the instructions correctly and make it that far.
    Thanks for making this guide !,

    Milther

  3. Hi Vidad,

    thanks to your guide, I've successfully performed a head swap in ONTAP 9.1. There was a difference with disk reassign, because option -p was mandatory, more precisely:

    disk reassign -s {OLD_NODEA_SYSID} -d {NEW_NODEA_SYSID} -p {OLD_NODEA_SYSID}
    disk reassign -s {OLD_NODEB_SYSID} -d {NEW_NODEB_SYSID} -p {NEW_NODEA_SYSID}

    It was weird the systems knew the right partner sysid to apply, because there was a check on the -p parameter.

    Do you have any experience on this?

    Cheers

    Lorenzo

    1. Hi Lorenzo,
      The -p is correct where you have advanced disk partitioning. This article was written before that time. So with ADP it's:
      disk reassign -s NODE-A_SYSID -d NODE-C_SYSID -p NODE-B_SYSID
      Cheers, VC

  4. UPDATE: I happened to use this procedure for a disruptive headswap and shelf reallocation recently (ONTAP 9.1). A few lessons learned/remembered:
    1) If you're relocating shelves, it's very nice to purchase new rail kits!
    2) Remember to set date on the new controllers.
    3) Post headswap, set the storage failover hwassist IP.
    4) If you've set the SP beforehand, it gets overwritten by the original SP in the headswap.
    5) Finally, you might want to update the SP firmware after doing the headswap.

  5. When I wrote this there wasn't an official document that covered the disruptive cDOT headswap. There is now (dates back to July 2019):
    "Upgrading Controller Hardware by Moving Storage"
    https://library.netapp.com/ecm/ecm_download_file/ECMLP2540637

