Caveat Lector: Unofficial information!
This post is a guide to ‘Performing a Disruptive Clustered Data ONTAP Headswap’. At a high level, the steps are:
1) Planning and Preparation
2) Decommissioning Old Controllers
3) Power Off and Re-Cable
4) Re-Assigning Disks and First Boot
5) Commissioning New Controllers
Note: This doesn’t cover all scenarios (e.g. V-Series, Storage Encryption, ...)
1) Planning and Preparation
1.1) Cluster Ports
Plan how the Cluster ports are going to map to ports on the new controllers.
Note: This is critical! You may need to move the Cluster ports to suitable common ports before starting the headswap (even physically re-arranging cards).
Note: See What Happens when Ports Go Missing ... and scenario 5 for why.
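To see where the Cluster ports and LIFs currently live before planning the mapping, a quick check (in 8.3+ the Cluster ports sit in the Cluster IPspace; pre-8.3 you can filter on -role cluster instead)::>
network port show -ipspace Cluster
network interface show -vserver Cluster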
1.2) Physical Cabling Plan
Create a cabling schedule mapping shelf, ACP, and data connections from the old controllers to ports on the new controllers.
1.3) Source Controller Version
Check in Hardware Universe that the new controllers support the ONTAP version on the old controllers.
If not, the old controllers must be upgraded (if this is more than a 2-node cluster, then the whole cluster must be upgraded).
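A quick way to confirm what the old controllers are actually running before checking Hardware Universe (version is always available; the -fields query assumes your release exposes those fields)::>
version
system node image show -fields version,iscurrent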
1.4) Destination Controller Version
The destination controllers must be running the same ONTAP version as the source controllers (I’d go to the exact same P release).
If not, you can upgrade the destination controllers with no disks attached, using the Boot Menu and Option 7 “Install new software first”.
Note: There is a benefit to running Option 7 twice, to make sure that both boot images (there are 2) are the same, especially if you’re still on 8.2.x (since the 8.2 -> 8.3 upgrade validation script won’t trigger if you’ve already got an 8.3+ image).
Image: Boot Menu Option 7 “Install new software first”
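For orientation, the Option 7 flow looks roughly like the below - a sketch from memory, so exact prompts vary by release, and the web server URL is a hypothetical placeholder:
Selection (1-8)? 7
What is the URL for the package? http://{WEB_SERVER}/{ONTAP_PACKAGE}.tgz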
1.5) Re-purposing Controllers
If you are re-purposing controllers that have previously been in a Cluster, you must run wipeconfig. Start from the LOADER prompt:
LOADER> set-defaults
LOADER> setenv bootarg.init.boot_clustered true
LOADER> boot_ontap prompt
And from the boot menu type “wipeconfig”, press enter, and follow the prompts.
You know it’s successful when on reboot you get:
*******************************
*                             *
* Press Ctrl-C for Boot Menu. *
*                             *
*******************************

The boot device has changed. System configuration information
could be lost. Use option (6) to restore the system configuration, or
option (4) to initialize all disks and setup a new system.

Normal Boot is prohibited.
1.6) (Optional) Disabling Autoboot
An optional step that I like to do, since it gives you more control (especially if it’s two controllers in one chassis), is to disable autoboot on the new controllers (you can add these lines before the boot_ontap above).
LOADER> setenv AUTOBOOT false
LOADER> saveenv
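You can confirm the setting took before powering off (printenv is a standard LOADER command):
LOADER> printenv AUTOBOOT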
1.7) New Controller Licenses
Acquire licenses for the new Clustered Data ONTAP controllers, and apply in advance using Clustershell commands::>
license add {LICENSE_CODE}
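And verify they’ve applied::>
license show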
1.8) Additional Checks
- Config Advisor: Run Config Advisor against the old controllers and resolve any issues.
- ASUP: View the AutoSupport Health Summary and resolve any issues.
- ONTAP Release Notes: Check to be aware of any known issues.
- Official Documentation: Read and understand!
IMPORTANT NOTE: Switchless Clusters
If you’re headswapping a switchless cluster, be mindful of the bootvar setting bootarg.init.switchless_cluster.enable - it defaults to false, so if this is not set to true on both replacement heads (in the switchless cluster) prior to boot, it’s a support case. See the LOADER commands below.
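A minimal sketch of setting it at the LOADER prompt on each replacement head (same convention as the wipeconfig steps in 1.5):
LOADER> setenv bootarg.init.switchless_cluster.enable true
LOADER> saveenv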
2) Decommissioning Old Controllers
Note: With the following commands, {NODE-A} and {NODE-B} are the nodes being headswapped.
2.1) Epsilon
If this is a 4-node or larger cluster, make sure Epsilon is not on the 2 nodes being downed in the headswap::>
set adv
cluster show -epsilon true
cluster modify -node {EPSILON_NODE} -epsilon false
cluster modify -node {NEW_EPSILON_NODE} -epsilon true
2.2) Some Health Checks
Note: This is far from an exhaustive list - please feel free to add more checks as you wish.
For the 2 controllers to be headswapped:
- Verify Cluster communication is working correctly
- Verify the software image
- Check for missing and broken disks.
cluster ping-cluster -node {NODE-A}
cluster ping-cluster -node {NODE-B}
storage failover show -fields local-missing-disks,partner-missing-disks
system node image show -node {NODE-A},{NODE-B} -iscurrent true
storage disk show -nodelist {NODE-A},{NODE-B} -broken
Note: Remove any broken disks from the system.
2.3) Collect SYSIDs of Old Controllers
system node show -node {NODE-A},{NODE-B} -instance
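If you just want the System IDs without wading through the full -instance output, a narrower query (assuming your release exposes the systemid field)::>
system node show -node {NODE-A},{NODE-B} -fields systemid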
2.4) Record Service-Processor Information
service-processor network show -node {NODE-A}
service-processor network show -node {NODE-B}
2.6) Take Backups
Note: This is a precautionary step - we hope not to use them.
The following commands back up the varfs and node configuration::>
security login unlock -username diag
security login password -username diag
set d
systemshell -node {NODE-A}
cp /mroot/etc/varfs.tgz /mroot/etc/varfs.bak
exit
systemshell -node {NODE-B}
cp /mroot/etc/varfs.tgz /mroot/etc/varfs.bak
exit
system configuration backup create -node {NODE-A} -backup-type node -backup-name Aheadswap
system configuration backup create -node {NODE-B} -backup-type node -backup-name Bheadswap
If this is a 4-node or larger cluster, then take a cluster backup on one of the other nodes - otherwise use both nodes::>
system configuration backup create -node {NODE} -backup-type cluster -backup-name clusbackup
And wait for the backup jobs to complete::>
job show -name *backup*
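It’s also worth confirming the backups actually exist before halting anything::>
system configuration backup show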
2.7) ASUPs
Take autosupports and wait for them to send::>
system node autosupport invoke -node {NODE-A} -type all -message "Starting Headswap"
system node autosupport invoke -node {NODE-B} -type all -message "Starting Headswap"
system node autosupport history show -node {NODE-A}
system node autosupport history show -node {NODE-B}
2.8) Cluster LIFs
As per 1.1 above, you may need to move Cluster LIFs before continuing with the headswap.
The Clustershell commands below illustrate how to move one Cluster LIF in 8.3+ from {PORT_A} to {PORT_B}::>
broadcast-domain remove-ports -broadcast-domain Default -ports {NODE}:{PORT_B}
net port modify -node {NODE} -port {PORT_B} -flowcontrol-admin none -mtu 9000
broadcast-domain add-ports -broadcast-domain Cluster -ipspace Cluster -port {NODE}:{PORT_B}
net int modify -vserver Cluster -lif {CLUS_LIF} -home-port {PORT_B} -home-node {NODE}
Physically re-cable {NODE}:{PORT_A} to {PORT_B} - the {CLUS_LIF} LIF will automatically fail over.
broadcast-domain remove-ports -broadcast-domain Cluster -ipspace Cluster -port {NODE}:{PORT_A}
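A quick sanity check after the move - both commands appear elsewhere in this guide::>
net int show -vserver Cluster
cluster ping-cluster -node {NODE}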
2.9) LIFs
Note: Common port = a port common to both old and new platforms.
Note: SAN LIFs are considered later.
- If this is a 4-node or larger cluster, rehome data LIFs to other nodes
- If this is a 2-node cluster, rehome data LIFs to a common port
- If this is a 4-node or larger cluster, rehome the cluster management LIF to one of the other nodes
- If this is a 2-node cluster, rehome the cluster management LIF to a common port
- For node LIFs (intercluster, node-mgmt), rehome to a common port
Typical commands::>
network interface show -curr-node {NODE}
net int modify -vserver {VSERVER} -lif {LIF_NAME} -home-port {PORT} -home-node {NODE}
net int revert *
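After the revert, confirm nothing is left off its home port (the same check is used in 5.6 below)::>
net int show -is-home false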
2.10) STOP APPLICATION DATA ACCESS!
2.11) SAN LIFs
Down any SAN LIFs on the nodes being replaced::>
net int modify -vserver {VSERVER} -lif {SAN_LIF_NAME} -status-admin down
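And confirm they’re down::>
net int show -vserver {VSERVER} -fields status-admin,status-oper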
2.12) Cluster HA / Storage Failover
Verify cluster status::>
cluster show
(2-Node Cluster) Disable Cluster HA::>
cluster ha modify -configured false
Disable storage failover::>
storage failover show
storage failover modify -node {NODE-A} -enabled false
storage failover show
Note: If this is a 4-node cluster, you may want to disable SFO on all nodes in the cluster.
2.13) Halt
halt -node {NODE-A} -inhi -igno -skip
halt -node {NODE-B} -inhi -igno -skip
Note: -inhi, -igno, and -skip are abbreviated forms of -inhibit-takeover, -ignore-quorum-warnings, and -skip-lif-migration-before-shutdown (exact parameter names vary slightly by release).
3) Power Off and Re-Cable
3.1) Power Off Old Controllers
Once the old controllers are at the LOADER> prompt, they can be powered off.
3.2) Transfer any Cards (as required)
3.3) Install New Controllers
3.4) Re-Cable New Controllers
4) Re-Assigning Disks and First Boot
Do these steps (4.1 and 4.2) for both controllers (you can do them in tandem).
4.1) Reassigning Disks
Power on and boot the node into Maintenance Mode.
LOADER> boot_ontap prompt
Ctrl+C to access the Boot Menu
Select 5 for Maintenance Mode
In Maintenance Mode, run the following commands to find out the new controller’s SYSID, and verify disk multipathing>
disk show -a
storage show disk -p
For node A (triple check before running)>
disk reassign -s {OLD_NODEA_SYSID} -d {NEW_NODEA_SYSID}
For node B (triple check before running)>
disk reassign -s {OLD_NODEB_SYSID} -d {NEW_NODEB_SYSID}
For both nodes>
disk show -a
mailbox destroy local
mailbox destroy partner
halt
4.2) UPDATE FLASH FROM BACKUP CONFIG
Boot the node to the Boot Menu.
LOADER> boot_ontap prompt
Ctrl+C to access the Boot Menu
Select 6 for Update Flash from Backup Config
Note: It is critical that you catch this on the first boot after assigning disks, otherwise you could end up with a messy support case.
Update Flash from Backup Config:
- WATCH the process
- It will take several minutes and the controller might reboot a few times
- Accept any warning about mis-matched sysid
- You should see something like:
ontap_varfs: restore using /mroot/etc/varfs.tgz
Rebooting to load the new varfs
Abandoned in-memory /var file system
5) Commissioning New Controllers
5.1) Re-Enable SFO::>
storage failover show
storage failover modify -enabled true -node {NODE-A}
(2-Node Cluster) Verify that Cluster HA has re-enabled::>
cluster ha show
5.2) Some Health Checks
Note: This is far from an exhaustive list - please feel free to add more checks as you wish.
cluster show
set adv
cluster ring show -unitname mgmt
cluster ring show -unitname vldb
cluster ring show -unitname vifmgr
cluster ring show -unitname bcomd
cluster ring show -unitname crs
5.3) Cluster LIFs
If Cluster LIFs need to be moved on the new platform, the Clustershell commands below illustrate how to move one Cluster LIF in 8.3+ from {PORT_A} to {PORT_B}::>
broadcast-domain remove-ports -broadcast-domain Default -ports {NODE}:{PORT_B}
net port modify -node {NODE} -port {PORT_B} -flowcontrol-admin none -mtu 9000
broadcast-domain add-ports -broadcast-domain Cluster -ipspace Cluster -port {NODE}:{PORT_B}
net int modify -vserver Cluster -lif {CLUS_LIF} -home-port {PORT_B} -home-node {NODE}
Physically re-cable {NODE}:{PORT_A} to {PORT_B} - the {CLUS_LIF} LIF will automatically fail over.
broadcast-domain remove-ports -broadcast-domain Cluster -ipspace Cluster -port {NODE}:{PORT_A}
5.4) LIFs
Re-home LIFs to their correct home port.
Typical commands::>
network interface show -curr-node {NODE}
net int modify -vserver {VSERVER} -lif {LIF_NAME} -home-port {PORT} -home-node {NODE}
net int revert *
At this stage you should also look at:
- Tidy up/correct Failover Groups
- Tidy up/correct Broadcast Domains (example show commands below)
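Example show commands for that review (both exist in 8.3+)::>
broadcast-domain show
network interface failover-groups show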
5.5) SAN LIFs
Restore any SAN LIFs::>
net int modify -vserver {VSERVER} -lif {SAN_LIF_NAME} -status-admin up
5.6) (Optional - Recommended) Test Failover
Do first for {NODE-A}, then repeat for {NODE-B}::>
storage failover show
storage failover takeover -ofnode {NODE}
storage failover show-takeover
storage failover giveback -ofnode {NODE}
storage failover show-giveback
cluster show
net int show -is-home false
net int revert *
5.7) RESTORE APPLICATION DATA ACCESS!
5.8) Configure Service-Processors::>
service-processor network modify -node {NODE} -address-type IPv4 -enable true -dhcp none -ip-address {ADDRESS} -netmask {MASK} -gateway {GATEWAY}
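Then verify the settings against the values recorded in 2.4::>
service-processor network show -node {NODE}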
5.9) Tidy up licenses::>
license clean-up -unused true -simulate
license clean-up -unused true
5.10) ASUPs
Send autosupports::>
system node autosupport invoke -node {NODE-A} -type all -message "Finished Headswap"
system node autosupport invoke -node {NODE-B} -type all -message "Finished Headswap"
5.11) (Optional) Re-enable AUTOBOOT
If you disabled AUTOBOOT earlier, re-enable::>
set d
debug kenv modify -node {NODE-A} -variable AUTOBOOT -value true -persist true
debug kenv modify -node {NODE-B} -variable AUTOBOOT -value true -persist true
THE END!
Hi Vidad,
Do you have instructions on a "non-disruptive" 2-node cDOT Headswap: Step-by-Step Walkthrough?
Regards,
Jackson
Hi Jackson. "non-disruptive" ARL headswap information starts here:
http://www.cosonok.com/2016/12/arl-headswap-part-14-preparation.html
Thanks Vidad, question on this Disruptive cDOT Headswap. If the OLD controllers' cluster ports are e0a/e0b/e0c/e0d and the new controllers' cluster ports will be e0c/e0d/e0e/e0f - e0e and e0f are already assigned to a0a on the old controllers?
When should I move the Cluster LIFs (e0a to e0e and e0b to e0f)? In section 2.8 or section 5.3 above?
Also on the DATA LIFs - the old controllers are using a0a-vlanID (with LACP ports e0e/e0f/e0g/e0h), the new controllers are using a0a-VLANID (with ports e7a/e7b/e13a/e13b) - can I treat them as being on the common port a0a?
Hi Jackson,
The way I'd do it based on the information you've provided for this disruptive headswap:
1) Pre-headswap remove e0a and e0b from the cluster broadcast-domain (re-home the cluster LIFs that are on them first to e0c and e0d)
2) Pre-headswap, modify the a0a ifgrp and remove the e0e, e0f, e0g, and e0h ports.
3) Post-headswap, add e0e and e0f to the cluster broadcast-domain and sort out the cluster LIFs.
4) Post-headswap, add the correct ports to a0a.
Thanks Vidad for your quick response and your advice, much appreciated.
Hi Vidad,
I followed your guide and did a disruptive headswap from a FAS3240 over to a FAS8040. I ran into the issue where I was not able to move the CLUSTER ports from e1a and e2a ahead of time due to not having the PCI card in the FAS8040. Come to find out that the 10gb PCI card in the FAS3240 was not compatible with the FAS8040, so was not able to move the card over. We had no choice but to proceed, knowing that one of the nodes was going to be OUT OF QUORUM due to these cluster ports. But just wanted to point out to anyone going through this, that the way we got through it was by adding the e0a and e0c ports to the CLUSTER broadcast domain and creating new LIFs in the CLUSTER SVM on both nodes after almost completing the headswap. Waited a couple minutes and both nodes were now in quorum. Verified with the "cluster show" and "cluster ring show" commands. So it is possible to fix up your cluster towards the end of the headswap process... assuming you follow the instructions correctly and make it that far.
Thanks for making this guide!
Milther
Hi Vidad,
Thanks to your guide, I've successfully performed a headswap in ONTAP 9.1. There was a difference with disk reassign, because option -p was mandatory; more precisely:
disk reassign -s {OLD_NODEA_SYSID} -d {NEW_NODEA_SYSID} -p {OLD_NODEA_SYSID}
disk reassign -s {OLD_NODEB_SYSID} -d {NEW_NODEB_SYSID} -p {NEW_NODEA_SYSID}
It was weird that the systems knew the right partner sysid to apply, because there was a check on the -p parameter.
Do you have any experience with this?
Cheers
Lorenzo
Hi Lorenzo,
The -p is correct where you have advanced disk partitioning. This article was written before that time. So with ADP it's:
disk reassign -s NODE-A_SYSID -d NODE-C_SYSID -p NODE-B_SYSID
Cheers, VC
UPDATE: I happened to use this procedure for a disruptive headswap and shelf reallocation recently (ONTAP 9.1). A few lessons learned/remembered:
1) If you're relocating shelves, it's very nice to purchase new rail kits!
2) Remember to set the date on the new controllers.
3) Post-headswap, set the storage failover hwassist IP.
4) If you've set the SP beforehand, it gets overwritten by the original SP in the headswap.
5) Finally, you might want to update the SP firmware after doing the headswap.
When I wrote this there wasn't an official document that covered the disruptive cDOT headswap. There is now (dates back to July 2019):
ReplyDelete"Upgrading Controller Hardware by Moving Storage"
https://library.netapp.com/ecm/ecm_download_file/ECMLP2540637