Monday, 12 December 2016

Disruptive cDOT Headswap: Step-by-Step Walkthrough

Caveat Lector: Unofficial information!
This post is a guide to ‘Performing a Disruptive Clustered Data ONTAP Headswap’. At a high level, the steps are:

1) Planning and Preparation
2) Decommissioning Old Controllers
3) Power Off and Re-Cable
4) Re-Assigning Disks and First Boot
5) Commissioning New Controllers

Note: This doesn’t cover all scenarios (e.g. V-Series, Storage Encryption, ...)

1) Planning and Preparation

1.1) Cluster Ports

Plan how Cluster ports are going to map to ports on the new controllers.
Note: This is critical! You may need to move the Cluster ports to suitable common ports before starting the headswap (even physically re-arrange cards).
Note: See What Happens when Ports Go Missing ... and scenario 5 for why.
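
To make the port planning concrete, here is a minimal sketch (not an official tool) that compares the port lists of the old and new platforms and flags Cluster ports that have no like-named port on the new controllers. The port names are hypothetical examples; take the real lists from your own systems and Hardware Universe.

```python
def plan_cluster_ports(old_ports, new_ports, cluster_ports):
    """Split the current Cluster ports into those that also exist on the
    new platform and those that need moving before the headswap."""
    common = set(old_ports) & set(new_ports)
    keep = sorted(p for p in cluster_ports if p in common)
    move = sorted(p for p in cluster_ports if p not in common)
    return sorted(common), keep, move


# Hypothetical port names, for illustration only.
old_ports = ["e0a", "e0b", "e0c", "e0d", "e1a", "e2a"]
new_ports = ["e0a", "e0b", "e0c", "e0d", "e0e", "e0f"]
cluster_ports = ["e1a", "e2a"]

common, keep, move = plan_cluster_ports(old_ports, new_ports, cluster_ports)
print("Common ports:", common)
print("Cluster ports fine as-is:", keep)
print("Cluster ports to move first:", move)
```

Anything in the "move first" list is a candidate for the Cluster LIF move in 2.8 (or for physically re-arranging cards).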

1.2) Physical Cabling Plan

Create a cabling schedule that maps shelf, ACP, and data connections from the old controllers to ports on the new controllers.
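
One way to keep the cabling schedule honest is to hold it as rows and check that no port on a new controller is booked twice. The sketch below is only an illustration: the column names and the example rows are my own assumptions, not a NetApp format.

```python
import csv
import io

# Column names are an assumption for this sketch.
FIELDS = ["node", "connection", "far_end", "old_port", "new_port"]


def cabling_schedule_csv(rows):
    """Render cabling rows as CSV text, rejecting double-booked new ports."""
    used = set()
    for row in rows:
        key = (row["node"], row["new_port"])
        if key in used:
            raise ValueError(f"new port used twice: {key[0]}:{key[1]}")
        used.add(key)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical rows, for illustration only.
rows = [
    {"node": "A", "connection": "SAS", "far_end": "shelf 1 IOM A",
     "old_port": "0a", "new_port": "0a"},
    {"node": "A", "connection": "ACP", "far_end": "shelf 1 IOM A",
     "old_port": "e0P", "new_port": "e0P"},
    {"node": "A", "connection": "data", "far_end": "switch 1 port 12",
     "old_port": "e1a", "new_port": "e0e"},
]
print(cabling_schedule_csv(rows))
```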

1.3) Source Controller Version

Check in Hardware Universe that the new controllers support the ONTAP version on the old controllers.
If not, the old controllers must be upgraded first (in a cluster of more than 2 nodes, the whole cluster must be upgraded).

1.4) Destination Controller Version

The destination controllers must be on the same ONTAP version as the source controllers (I’d go to the exact same P release).
If not, you can upgrade the destination controllers with no disks attached, using the Boot Menu and Option 7 “Install new software first”.
Note: There is a benefit to running Option 7 twice to make sure that both images (there are 2 boot images) are the same, especially if you’re still on 8.2.x (since the 8.2 -> 8.3 upgrade validation script won’t trigger if you’ve already got an 8.3+ image.)

Image: Boot Menu Option 7 “Install new software first”
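
If you want to sanity-check the version match rather than eyeball it, a rough sketch follows. It assumes release strings shaped like 8.3.2P5 (dotted numbers plus an optional suffix); other formats would need their own handling, and the labels it returns are just for this example.

```python
import re


def parse_release(version):
    """Split a release string such as '8.3.2P5' into (numbers, suffix)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)*)(\w*)", version.strip())
    if not match:
        raise ValueError(f"unrecognised release string: {version!r}")
    numbers = tuple(int(n) for n in match.group(1).split("."))
    return numbers, match.group(2).upper()


def compare_releases(source, destination):
    """Classify how closely the destination release matches the source."""
    src, dst = parse_release(source), parse_release(destination)
    if src == dst:
        return "match"
    if src[0] == dst[0]:
        return "same release, different patch level"
    return "mismatch"


print(compare_releases("8.3.2P5", "8.3.2P5"))
```

Anything other than "match" means running Option 7 on the destination controllers before going further.
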

1.5) Re-purposing Controllers

If you are re-purposing controllers that have previously been in a Cluster, you must run wipeconfig. Start from the LOADER prompt:


LOADER> set-defaults
LOADER> setenv bootarg.init.boot_clustered true
LOADER> boot_ontap prompt


And from the boot menu type “wipeconfig”, press enter, and follow the prompts.

You know it’s successful when on reboot you get:

*******************************
*                             *
* Press Ctrl-C for Boot Menu. *
*                             *
*******************************
The boot device has changed. System configuration information
could be lost. Use option (6) to restore the system configuration, or
option (4) to initialize all disks and setup a new system.
Normal Boot is prohibited.

1.6) (Optional) Disabling Autoboot

An optional step that I like to do is to disable autoboot on the new controllers, since it gives you more control (especially if it’s two controllers in one chassis). You can add these lines before the boot_ontap in 1.5 above.


LOADER> setenv AUTOBOOT false
LOADER> saveenv


1.7) New Controller Licenses

Acquire licenses for the new Clustered Data ONTAP controllers, and apply in advance using Clustershell commands::>


license add {LICENSE_CODE}


1.8) Additional Checks

- Config Advisor: Run Config Advisor against the old controllers and resolve any issues.
- ASUP: View the AutoSupport Health Summary and resolve any issues.
- ONTAP Release Notes: Check to be aware of any known issues.
- Official Documentation: Read and understand!

IMPORTANT NOTE: Switchless Clusters
If you’re headswapping a switchless cluster, be mindful of the bootvar setting:
bootarg.init.switchless_cluster.enable
It defaults to false, so if it is not set to true on both replacement heads (in the switchless cluster) prior to boot, it’s a support case.
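
As a small pre-boot checklist aid, the sketch below takes boot environment values that you have noted down by hand for each replacement head and reports any head where the switchless variable is not true. The dictionary layout and node names are assumptions for the example; nothing here talks to a controller.

```python
SWITCHLESS_VAR = "bootarg.init.switchless_cluster.enable"


def heads_not_ready(boot_env_by_node):
    """Return the heads whose recorded boot environment does not have
    the switchless variable set to true (unset counts as false)."""
    return sorted(
        node
        for node, env in boot_env_by_node.items()
        if env.get(SWITCHLESS_VAR, "false").strip().lower() != "true"
    )


# Hypothetical values noted from each new head, for illustration only.
recorded = {
    "new-head-a": {SWITCHLESS_VAR: "true"},
    "new-head-b": {},
}
print("Heads still needing the setting:", heads_not_ready(recorded))
```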

2) Decommissioning Old Controllers

Note: In the following commands, {NODE-A} and {NODE-B} are the nodes being headswapped.

2.1) Epsilon

If this is a 4-node or larger cluster, make sure Epsilon is not on either of the 2 nodes being downed in the headswap::>


set adv
cluster show -epsilon true
cluster modify -node {EPSILON_NODE} -epsilon false
cluster modify -node {NEW_EPSILON_NODE} -epsilon true
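
The decision behind those commands can be sketched as follows: only move Epsilon if it currently sits on one of the headswap nodes, and give it to a node outside the pair. Node names are hypothetical, and picking the first available node is just a choice made for this example.

```python
def epsilon_commands(nodes, epsilon_node, headswap_nodes):
    """Return the Clustershell lines needed to move Epsilon off the
    headswap pair, or an empty list if no move is needed."""
    if epsilon_node is None or epsilon_node not in headswap_nodes:
        return []
    candidates = [n for n in nodes if n not in headswap_nodes]
    if not candidates:
        raise ValueError("no node outside the headswap pair to take Epsilon")
    new_holder = candidates[0]  # arbitrary pick for this sketch
    return [
        "set adv",
        f"cluster modify -node {epsilon_node} -epsilon false",
        f"cluster modify -node {new_holder} -epsilon true",
    ]


# Hypothetical 4-node cluster, for illustration only.
nodes = ["clu-01", "clu-02", "clu-03", "clu-04"]
for line in epsilon_commands(nodes, "clu-01", {"clu-01", "clu-02"}):
    print(line)
```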


2.2) Some Health Checks

Note: This is far from an exhaustive list - please feel free to add more checks as you wish.

For the 2 controllers to be head swapped:
- Verify Cluster communication is working correctly
- Verify the software image
- Check for missing and broken disks.


cluster ping-cluster -node {NODE-A}
cluster ping-cluster -node {NODE-B}
storage failover show -fields local-missing-disks,partner-missing-disks
system node image show -node {NODE-A},{NODE-B} -iscurrent true
storage disk show -nodelist {NODE-A},{NODE-B} -broken


Note: Remove any broken disks from the system.

2.3) Collect SYSIDs of Old Controllers


system node show -node {NODE-A},{NODE-B} -instance


2.4) Record Service-Processor Information


service-processor network show -node {NODE-A}
service-processor network show -node {NODE-B}


2.6) Take Backups

Note: This is a precautionary step - we hope not to use them.

The following commands back up varfs and node configuration::>


security login unlock -username diag
security login password -username diag
set d
systemshell -node {NODE-A}
cp /mroot/etc/varfs.tgz /mroot/etc/varfs.bak
exit
systemshell -node {NODE-B}
cp /mroot/etc/varfs.tgz /mroot/etc/varfs.bak
exit
system configuration backup create -node {NODE-A} -backup-type node -backup-name Aheadswap
system configuration backup create -node {NODE-B} -backup-type node -backup-name Bheadswap


If this is a 4-node or larger cluster, then take a cluster backup on one of the other nodes - otherwise use both nodes::>


system configuration backup create -node {NODE} -backup-type cluster -backup-name clusbackup


And wait for the backup jobs to complete::>


job show -name *backup*


2.7) ASUPs

Take autosupports and wait for them to send::>


system node autosupport invoke -node {NODE-A} -type all -message "Starting Headswap"
system node autosupport invoke -node {NODE-B} -type all -message "Starting Headswap"
system node autosupport history show -node {NODE-A}
system node autosupport history show -node {NODE-B}


2.8) Cluster LIFs

As per 1.1 above, you may need to move Cluster LIFs before continuing with the headswap.

The Clustershell commands below illustrate how to move one Cluster LIF in 8.3+ from {PORT_A} to {PORT_B}::>


broadcast-domain remove-ports -broadcast-domain default -ports {NODE}:{PORT_B}
net port modify -node {NODE} -port {PORT_B} -flowcontrol-admin none -mtu 9000
broadcast-domain add-ports -broadcast-domain Cluster -ipspace Cluster -port {NODE}:{PORT_B}
net int modify -vserver Cluster -lif {CLUS_LIF} -home-port {PORT_B} -home-node {NODE}


Physically re-cable {NODE}:{PORT_A} to {PORT_B} - the {CLUS_LIF} LIF will automatically fail over.


broadcast-domain remove-ports -broadcast-domain Cluster -ipspace Cluster -port {NODE}:{PORT_A}
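
Because the order matters here (prepare {PORT_B}, re-home the LIF, re-cable, then clean up {PORT_A}), it can help to write the sequence out per node before the change window. This sketch only fills the placeholders in the commands above and marks the manual step; the node, LIF, and port names are hypothetical.

```python
def cluster_lif_move_steps(node, lif, port_a, port_b):
    """Return the ordered (kind, text) steps for moving one Cluster LIF,
    using the command templates from this section."""
    if port_a == port_b:
        raise ValueError("source and target port are the same")
    return [
        ("cli", f"broadcast-domain remove-ports -broadcast-domain default -ports {node}:{port_b}"),
        ("cli", f"net port modify -node {node} -port {port_b} -flowcontrol-admin none -mtu 9000"),
        ("cli", f"broadcast-domain add-ports -broadcast-domain Cluster -ipspace Cluster -port {node}:{port_b}"),
        ("cli", f"net int modify -vserver Cluster -lif {lif} -home-port {port_b} -home-node {node}"),
        ("manual", f"Physically re-cable {node}:{port_a} to {port_b}"),
        ("cli", f"broadcast-domain remove-ports -broadcast-domain Cluster -ipspace Cluster -port {node}:{port_a}"),
    ]


# Hypothetical names, for illustration only.
for kind, text in cluster_lif_move_steps("clu-01", "clus1", "e1a", "e0a"):
    print(f"[{kind}] {text}")
```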


2.9) LIFs

Note: Common port = port common to both old and new platforms.
Note: SAN LIFs are considered later.

- If this is a 4-Node or larger cluster, rehome data LIFs to other nodes
- If this is a 2-Node cluster, rehome data LIFs to a common port
- If this is a 4-Node or larger cluster, rehome cluster management to one of the other nodes
- If this is a 2-Node cluster, rehome cluster management to a common port
- For node LIFs (intercluster, node-mgmt), rehome to a common port
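
The rules above boil down to a small lookup on LIF role and cluster size. Here is a sketch of that logic; the role names follow the 8.3-style LIF roles (data, cluster-mgmt, node-mgmt, intercluster), and the returned wording is my own shorthand.

```python
def rehome_target(lif_role, node_count):
    """Map a LIF role and cluster size to the rehoming rule listed above."""
    if lif_role in ("intercluster", "node-mgmt"):
        # Node-scoped LIFs stay on their node, on a common port.
        return "common port"
    if lif_role in ("data", "cluster-mgmt"):
        return "another node" if node_count >= 4 else "common port"
    raise ValueError(f"no rehoming rule for role {lif_role!r}")


for role in ("data", "cluster-mgmt", "node-mgmt", "intercluster"):
    print(role, "->", rehome_target(role, 2), "/", rehome_target(role, 4))
```

SAN LIFs are deliberately left out, since they are handled in 2.11.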

Typical commands::>


network interface show -curr-node {NODE}
net int modify -vserver {VSERVER} -lif {LIF_NAME} -home-port {PORT} -home-node {NODE}
net int revert *


2.10) STOP APPLICATION DATA ACCESS!

2.11) SAN LIFs

Down any SAN LIFs on the nodes being replaced::>


net int modify -vserver {VSERVER} -lif {SAN_LIF_NAME} -status-admin down


2.12) Cluster HA / Storage Failover

Verify cluster status::>


cluster show


(2-Node Cluster) Disable Cluster HA::>


cluster ha modify -configured false


Disable storage failover::>


storage failover show
storage failover modify -node {NODE-A} -enabled false
storage failover show


Note: If this is a 4-node cluster, you may want to disable SFO on all nodes in the cluster.

2.13) Halt


halt -node {NODE-A} -inhi -igno -skip
halt -node {NODE-B} -inhi -igno -skip


3) Power Off and Re-Cable

3.1) Power Off Old Controllers

Once the old controllers are at the LOADER> prompt, they can be powered off.

3.2) Transfer any Cards (as required)

3.3) Install New Controllers

3.4) Re-Cable New Controllers

4) Re-Assigning Disks and First Boot

Do these steps (4.1 and 4.2) for both controllers (you can do them in tandem).

4.1) Reassigning Disks

Power on and boot the node into Maintenance Mode.


LOADER> boot_ontap prompt


Ctrl+C to access Boot Menu

Select 5 for Maintenance Mode

In Maintenance Mode, run the following commands to find the new controller’s SYSID and verify disk multi-pathing>


disk show -a
storage show disk -p


For node A (triple check before running)>


disk reassign -s {OLD_NODEA_SYSID} -d {NEW_NODEA_SYSID}


For node B (triple check before running)>


disk reassign -s {OLD_NODEB_SYSID} -d {NEW_NODEB_SYSID}


For both nodes>


disk show -a
mailbox destroy local
mailbox destroy partner
halt
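
Since a wrong SYSID in disk reassign is painful to undo, I find it worth building the two command lines from the recorded values and letting a few basic checks run first. This is a minimal sketch: it assumes SYSIDs are digit-only strings, and the values shown are made up.

```python
def disk_reassign_commands(old_sysids, new_sysids):
    """Build one 'disk reassign' line per node after basic sanity checks."""
    if set(old_sysids) != set(new_sysids):
        raise ValueError("old and new SYSID tables list different nodes")
    all_ids = list(old_sysids.values()) + list(new_sysids.values())
    if not all(s.isdigit() for s in all_ids):
        raise ValueError("SYSIDs are expected to be digits only")
    if len(set(all_ids)) != len(all_ids):
        raise ValueError("the same SYSID appears more than once")
    return {
        node: f"disk reassign -s {old_sysids[node]} -d {new_sysids[node]}"
        for node in sorted(old_sysids)
    }


# Made-up SYSIDs, for illustration only.
old = {"A": "0536870001", "B": "0536870002"}
new = {"A": "0537990001", "B": "0537990002"}
for node, command in disk_reassign_commands(old, new).items():
    print(f"node {node}: {command}")
```

The duplicate check catches the classic mistake of pasting node A's SYSID into node B's line.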


4.2) UPDATE FLASH FROM BACKUP CONFIG

Boot the node to the Boot Menu.


LOADER> boot_ontap prompt


Ctrl+C to access Boot Menu

Select 6 for Update Flash from Backup Config

Note: It is critical that you catch this on the first boot after assigning disks, otherwise you could end up with a messy support case.


Update Flash from Backup Config:
- WATCH the process
- It will take several minutes and the controller might reboot a few times
- Accept any warning about mis-matched sysid
- You should see something like:

ontap_varfs: restore using /mroot/etc/varfs.tgz
Rebooting to load the new varfs
Abandoned in-memory /var file system
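
If you log the console session to a file, a quick check like the one below can confirm that the three lines above turned up. It is only a sketch: the marker strings are copied from the sample output, and real console output may differ between releases.

```python
# Marker lines taken from the sample output above.
EXPECTED_MARKERS = [
    "ontap_varfs: restore using /mroot/etc/varfs.tgz",
    "Rebooting to load the new varfs",
    "Abandoned in-memory /var file system",
]


def missing_restore_markers(console_text):
    """Return the expected marker lines that do not appear in the log."""
    return [m for m in EXPECTED_MARKERS if m not in console_text]


# Stand-in for a captured console log, for illustration only.
sample_log = "\n".join(["...boot messages..."] + EXPECTED_MARKERS)
print("Missing markers:", missing_restore_markers(sample_log))
```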


5) Commissioning New Controllers

5.1) Re-Enable SFO::>


storage failover show
storage failover modify -enabled true -node {NODE-A}


(2-Node Cluster) Verify Cluster HA has been re-enabled::>


cluster ha show


5.2) Some Health Checks

Note: This is far from an exhaustive list - please feel free to add more checks as you wish.


cluster show
set adv
cluster ring show -unitname mgmt
cluster ring show -unitname vldb
cluster ring show -unitname vifmgr
cluster ring show -unitname bcomd
cluster ring show -unitname crs



5.3) Cluster LIFs

If Cluster LIFs need to be moved on the new platform, the Clustershell commands below illustrate how to move one Cluster LIF in 8.3+ from {PORT_A} to {PORT_B}::>


broadcast-domain remove-ports -broadcast-domain default -ports {NODE}:{PORT_B}
net port modify -node {NODE} -port {PORT_B} -flowcontrol-admin none -mtu 9000
broadcast-domain add-ports -broadcast-domain Cluster -ipspace Cluster -port {NODE}:{PORT_B}
net int modify -vserver Cluster -lif {CLUS_LIF} -home-port {PORT_B} -home-node {NODE}


Physically re-cable {NODE}:{PORT_A} to {PORT_B} - the {CLUS_LIF} LIF will automatically fail over.


broadcast-domain remove-ports -broadcast-domain Cluster -ipspace Cluster -port {NODE}:{PORT_A}


5.4) LIFs

Re-home LIFs to their correct home port.

Typical commands::>


network interface show -curr-node {NODE}
net int modify -vserver {VSERVER} -lif {LIF_NAME} -home-port {PORT} -home-node {NODE}
net int revert *


At this stage you should also look at:

- Tidy up/correct Failover Groups
- Tidy up/correct Broadcast Domains

5.5) SAN LIFs

Restore any SAN LIFs::>


net int modify -vserver {VSERVER} -lif {SAN_LIF_NAME} -status-admin up


5.6) (Optional - Recommended) Test Failover

Do first for {NODE-A}, then repeat for {NODE-B}::>


storage failover show
storage failover takeover -ofnode {NODE}
storage failover show-takeover
storage failover giveback -ofnode {NODE}
storage failover show-giveback
cluster show
net int show -is-home false
net int revert *


5.7) RESTORE APPLICATION DATA ACCESS!

5.8) Configure Service-Processors::>


service-processor network modify -node {NODE} -address-type IPv4 -enable true -dhcp none -ip-address {ADDRESS} -netmask {MASK} -gateway {GATEWAY}
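
To avoid typos when re-entering the Service-Processor details recorded in 2.4, the command can be built from the recorded values with a basic check that the gateway sits in the same subnet. A sketch using Python's standard ipaddress module; the node name and addresses are documentation examples, not real values.

```python
import ipaddress


def sp_network_modify(node, address, netmask, gateway):
    """Fill in the command template above after validating the addresses."""
    iface = ipaddress.IPv4Interface(f"{address}/{netmask}")
    gw = ipaddress.IPv4Address(gateway)
    if gw not in iface.network:
        raise ValueError(f"gateway {gw} is outside {iface.network}")
    return (
        f"service-processor network modify -node {node} "
        "-address-type IPv4 -enable true -dhcp none "
        f"-ip-address {iface.ip} -netmask {iface.netmask} -gateway {gw}"
    )


# Example values only (192.0.2.0/24 is a documentation range).
print(sp_network_modify("clu-01", "192.0.2.10", "255.255.255.0", "192.0.2.1"))
```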


5.9) Tidy up licenses::>


license clean-up -unused true -simulate
license clean-up -unused true


5.10) ASUPs

Send autosupports::>


system node autosupport invoke -node {NODE-A} -type all -message "Finished Headswap"
system node autosupport invoke -node {NODE-B} -type all -message "Finished Headswap"


5.11) (Optional) Re-enable AUTOBOOT

If you disabled AUTOBOOT earlier, re-enable::>


set d
debug kenv modify -node {NODE-A} -variable AUTOBOOT -value true -persist true
debug kenv modify -node {NODE-B} -variable AUTOBOOT -value true -persist true


THE END!
