Caveat Lector: Unofficial information!
This post is a guide to ‘Performing a Disruptive Clustered Data ONTAP Headswap’. At a high level, the steps are:
1) Planning and Preparation
2) Decommissioning Old Controllers
3) Power Off and Re-Cable
4) Re-Assigning Disks and First Boot
5) Commissioning New Controllers
Note: This doesn’t cover all scenarios (e.g. V-Series, Storage Encryption, ...)
1) Planning and Preparation
1.1) Cluster Ports
Plan how the Cluster ports are going to map to ports on the new controllers.
Note: This is critical! You may need to move the Cluster ports to suitable common ports before starting the headswap (even physically re-arranging cards).
Note: See What Happens when Ports Go Missing ... and scenario 5 for why.
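To see where the Cluster ports and LIFs currently live before planning the mapping, a quick check (in 8.3+ the Cluster ports sit in the Cluster IPspace; pre-8.3 you can filter on -role cluster instead)::>
network port show -ipspace Cluster
network interface show -vserver Cluster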
1.2) Physical Cabling Plan
Create a cabling schedule mapping shelf, ACP, and data connections from the old controllers to ports on the new controllers.
1.3) Source Controller Version
Check in Hardware Universe that the new controllers support the ONTAP version on the old controllers.
If not, the old controllers must be upgraded (if this is more than a 2-node cluster, then the whole cluster must be upgraded).
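A quick way to confirm what the old controllers are actually running before checking Hardware Universe (version is always available; the -fields query assumes your release exposes those fields)::>
version
system node image show -fields version,iscurrent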
1.4) Destination Controller Version
The destination controllers must be running the same ONTAP version as the source controllers (I’d go to the exact same P release).
If not, you can upgrade the destination controllers with no disks attached, using the Boot Menu and Option 7 “Install new software first”.
Note: There is a benefit to running Option 7 twice, to make sure that both boot images (there are 2) are the same, especially if you’re still on 8.2.x (since the 8.2 -> 8.3 upgrade validation script won’t trigger if you’ve already got an 8.3+ image).
Image: Boot Menu Option 7 “Install new software first”
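For orientation, the Option 7 flow looks roughly like the below - a sketch from memory, so exact prompts vary by release, and the web server URL is a hypothetical placeholder:
Selection (1-8)? 7
What is the URL for the package? http://{WEB_SERVER}/{ONTAP_PACKAGE}.tgz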
1.5) Re-purposing Controllers
If you are re-purposing controllers that have previously been in a Cluster, you must run wipeconfig. Start from the LOADER prompt:
LOADER> set-defaults
LOADER> setenv bootarg.init.boot_clustered true
LOADER> boot_ontap prompt
And from the boot menu type “wipeconfig”, press enter, and follow the prompts.
You know it’s successful when on reboot you get:
*******************************
*                             *
* Press Ctrl-C for Boot Menu. *
*                             *
*******************************

The boot device has changed. System configuration information
could be lost. Use option (6) to restore the system configuration, or
option (4) to initialize all disks and setup a new system.

Normal Boot is prohibited.
1.6) (Optional) Disabling Autoboot
An optional step that I like to do, since it gives you more control (especially if it’s two controllers in one chassis), is to disable autoboot on the new controllers (you can add these lines before the boot_ontap above).
LOADER> setenv AUTOBOOT false
LOADER> saveenv
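You can confirm the setting took before powering off (printenv is a standard LOADER command):
LOADER> printenv AUTOBOOT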
1.7) New Controller Licenses
Acquire licenses for the new Clustered Data ONTAP controllers, and apply in advance using Clustershell commands::>
license add {LICENSE_CODE}
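And verify they’ve applied::>
license show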
1.8) Additional Checks
- Config Advisor: Run Config Advisor against the old controllers and resolve any issues.
- ASUP: View the AutoSupport Health Summary and resolve any issues.
- ONTAP Release Notes: Check to be aware of any known issues.
- Official Documentation: Read and understand!
IMPORTANT NOTE: Switchless Clusters
If you’re headswapping a switchless cluster, be mindful of the bootvar setting bootarg.init.switchless_cluster.enable - it defaults to false, so if this is not set to true on both replacement heads (in the switchless cluster) prior to boot, it’s a support case. See the LOADER commands below.
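A minimal sketch of setting it at the LOADER prompt on each replacement head (same convention as the wipeconfig steps in 1.5):
LOADER> setenv bootarg.init.switchless_cluster.enable true
LOADER> saveenv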
2) Decommissioning Old Controllers
Note: With the following commands, {NODE-A} and {NODE-B} are the nodes being headswapped.
2.1) Epsilon
If this is a 4-node or larger cluster, make sure Epsilon is not on the 2 nodes being downed in the headswap::>
set adv
cluster show -epsilon true
cluster modify -node {EPSILON_NODE} -epsilon false
cluster modify -node {NEW_EPSILON_NODE} -epsilon true
2.2) Some Health Checks
Note: This is far from an exhaustive list - please feel free to add more checks as you wish.
For the 2 controllers to be headswapped:
- Verify Cluster communication is working correctly
- Verify the software image
- Check for missing and broken disks.
cluster ping-cluster -node {NODE-A}
cluster ping-cluster -node {NODE-B}
storage failover show -fields local-missing-disks,partner-missing-disks
system node image show -node {NODE-A},{NODE-B} -iscurrent true
storage disk show -nodelist {NODE-A},{NODE-B} -broken
Note: Remove any broken disks from the system.
2.3) Collect SYSIDs of Old Controllers
system node show -node {NODE-A},{NODE-B} -instance
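If you just want the System IDs without wading through the full -instance output, a narrower query (assuming your release exposes the systemid field)::>
system node show -node {NODE-A},{NODE-B} -fields systemid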
2.4) Record Service-Processor Information
service-processor network show -node {NODE-A}
service-processor network show -node {NODE-B}
2.6) Take Backups
Note: This is a precautionary step - we hope not to use them.
The following commands back up the varfs and node configuration::>
security login unlock -username diag
security login password -username diag
set d
systemshell -node {NODE-A}
cp /mroot/etc/varfs.tgz /mroot/etc/varfs.bak
exit
systemshell -node {NODE-B}
cp /mroot/etc/varfs.tgz /mroot/etc/varfs.bak
exit
system configuration backup create -node {NODE-A} -backup-type node -backup-name Aheadswap
system configuration backup create -node {NODE-B} -backup-type node -backup-name Bheadswap
If this is a 4-node or larger cluster, then take a cluster backup on one of the other nodes - otherwise use both nodes::>
system configuration backup create -node {NODE} -backup-type cluster -backup-name clusbackup
And wait for the backup jobs to complete::>
job show -name *backup*
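It’s also worth confirming the backups actually exist before halting anything::>
system configuration backup show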
2.7) ASUPs
Take autosupports and wait for them to send::>
system node autosupport invoke -node {NODE-A} -type all -message "Starting Headswap"
system node autosupport invoke -node {NODE-B} -type all -message "Starting Headswap"
system node autosupport history show -node {NODE-A}
system node autosupport history show -node {NODE-B}
2.8) Cluster LIFs
As per 1.1 above, you may need to move Cluster LIFs before continuing with the headswap.
The Clustershell commands below illustrate how to move one Cluster LIF in 8.3+ from {PORT_A} to {PORT_B}::>
broadcast-domain remove-ports -broadcast-domain Default -ports {NODE}:{PORT_B}
net port modify -node {NODE} -port {PORT_B} -flowcontrol-admin none -mtu 9000
broadcast-domain add-ports -broadcast-domain Cluster -ipspace Cluster -port {NODE}:{PORT_B}
net int modify -vserver Cluster -lif {CLUS_LIF} -home-port {PORT_B} -home-node {NODE}
Physically re-cable {NODE}:{PORT_A} to {PORT_B} - the {CLUS_LIF} LIF will automatically fail over.
broadcast-domain remove-ports -broadcast-domain Cluster -ipspace Cluster -port {NODE}:{PORT_A}
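A quick sanity check after the move - both commands appear elsewhere in this guide::>
net int show -vserver Cluster
cluster ping-cluster -node {NODE}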
2.9) LIFs
Note: Common port = a port common to both old and new platforms.
Note: SAN LIFs are considered later.
- If this is a 4-node or larger cluster, rehome data LIFs to other nodes
- If this is a 2-node cluster, rehome data LIFs to a common port
- If this is a 4-node or larger cluster, rehome the cluster management LIF to one of the other nodes
- If this is a 2-node cluster, rehome the cluster management LIF to a common port
- For node LIFs (intercluster, node-mgmt), rehome to a common port
Typical commands::>
network interface show -curr-node {NODE}
net int modify -vserver {VSERVER} -lif {LIF_NAME} -home-port {PORT} -home-node {NODE}
net int revert *
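After the revert, confirm nothing is left off its home port (the same check is used in 5.6 below)::>
net int show -is-home false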
2.10) STOP APPLICATION DATA ACCESS!
2.11) SAN LIFs
Down any SAN LIFs on the nodes being replaced::>
net int modify -vserver {VSERVER} -lif {SAN_LIF_NAME} -status-admin down
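And confirm they’re down::>
net int show -vserver {VSERVER} -fields status-admin,status-oper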
2.12) Cluster HA / Storage Failover
Verify cluster status::>
cluster show
(2-Node Cluster) Disable Cluster HA::>
cluster ha modify -configured false
Disable storage failover::>
storage failover show
storage failover modify -node {NODE-A} -enabled false
storage failover show
Note: If this is a 4-node cluster, you may want to disable SFO on all nodes in the cluster.
2.13) Halt
halt -node {NODE-A} -inhi -igno -skip
halt -node {NODE-B} -inhi -igno -skip
Note: -inhi, -igno, and -skip are abbreviated forms of -inhibit-takeover, -ignore-quorum-warnings, and -skip-lif-migration-before-shutdown (exact parameter names vary slightly by release).
3) Power Off and Re-Cable
3.1) Power Off Old Controllers
Once the old controllers are at the LOADER> prompt, they can be powered off.
3.2) Transfer any Cards (as required)
3.3) Install New Controllers
3.4) Re-Cable New Controllers
4) Re-Assigning Disks and First Boot
Do these steps (4.1 and 4.2) for both controllers (you can do them in tandem).
4.1) Reassigning Disks
Power on and boot the node into Maintenance Mode.
LOADER> boot_ontap prompt
Ctrl+C to access the Boot Menu
Select 5 for Maintenance Mode
In Maintenance Mode, run the following commands to find out the new controller’s SYSID, and verify disk multipathing>
disk show -a
storage show disk -p
For node A (triple check before running)>
disk reassign -s {OLD_NODEA_SYSID} -d {NEW_NODEA_SYSID}
For node B (triple check before running)>
disk reassign -s {OLD_NODEB_SYSID} -d {NEW_NODEB_SYSID}
For both nodes>
disk show -a
mailbox destroy local
mailbox destroy partner
halt
4.2) UPDATE FLASH FROM BACKUP CONFIG
Boot the node to the Boot Menu.
LOADER> boot_ontap prompt
Ctrl+C to access the Boot Menu
Select 6 for Update Flash from Backup Config
Note: It is critical that you catch this on the first boot after assigning disks, otherwise you could end up with a messy support case.
Update Flash from Backup Config:
- WATCH the process
- It will take several minutes and the controller might reboot a few times
- Accept any warning about mis-matched sysid
- You should see something like:
ontap_varfs: restore using /mroot/etc/varfs.tgz
Rebooting to load the new varfs
Abandoned in-memory /var file system
5) Commissioning New Controllers
5.1) Re-Enable SFO::>
storage failover show
storage failover modify -enabled true -node {NODE-A}
(2-Node Cluster) Verify that Cluster HA has re-enabled::>
cluster ha show
5.2) Some Health Checks
Note: This is far from an exhaustive list - please feel free to add more checks as you wish.
cluster show
set adv
cluster ring show -unitname mgmt
cluster ring show -unitname vldb
cluster ring show -unitname vifmgr
cluster ring show -unitname bcomd
cluster ring show -unitname crs
5.3) Cluster LIFs
If Cluster LIFs need to be moved on the new platform, the Clustershell commands below illustrate how to move one Cluster LIF in 8.3+ from {PORT_A} to {PORT_B}::>
broadcast-domain remove-ports -broadcast-domain Default -ports {NODE}:{PORT_B}
net port modify -node {NODE} -port {PORT_B} -flowcontrol-admin none -mtu 9000
broadcast-domain add-ports -broadcast-domain Cluster -ipspace Cluster -port {NODE}:{PORT_B}
net int modify -vserver Cluster -lif {CLUS_LIF} -home-port {PORT_B} -home-node {NODE}
Physically re-cable {NODE}:{PORT_A} to {PORT_B} - the {CLUS_LIF} LIF will automatically fail over.
broadcast-domain remove-ports -broadcast-domain Cluster -ipspace Cluster -port {NODE}:{PORT_A}
5.4) LIFs
Re-home LIFs to their correct home port.
Typical commands::>
network interface show -curr-node {NODE}
net int modify -vserver {VSERVER} -lif {LIF_NAME} -home-port {PORT} -home-node {NODE}
net int revert *
At this stage you should also look at:
- Tidy up/correct Failover Groups
- Tidy up/correct Broadcast Domains (example show commands below)
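Example show commands for that review (both exist in 8.3+)::>
broadcast-domain show
network interface failover-groups show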
5.5) SAN LIFs
Restore any SAN LIFs::>
net int modify -vserver {VSERVER} -lif {SAN_LIF_NAME} -status-admin up
5.6) (Optional - Recommended) Test Failover
Do first for {NODE-A}, then repeat for {NODE-B}::>
storage failover show
storage failover takeover -ofnode {NODE}
storage failover show-takeover
storage failover giveback -ofnode {NODE}
storage failover show-giveback
cluster show
net int show -is-home false
net int revert *
5.7) RESTORE APPLICATION DATA ACCESS!
5.8) Configure Service-Processors::>
service-processor network modify -node {NODE} -address-type IPv4 -enable true -dhcp none -ip-address {ADDRESS} -netmask {MASK} -gateway {GATEWAY}
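Then verify the settings against the values recorded in 2.4::>
service-processor network show -node {NODE}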
5.9) Tidy up licenses::>
license clean-up -unused true -simulate
license clean-up -unused true
5.10) ASUPs
Send autosupports::>
system node autosupport invoke -node {NODE-A} -type all -message "Finished Headswap"
system node autosupport invoke -node {NODE-B} -type all -message "Finished Headswap"
5.11) (Optional) Re-enable AUTOBOOT
If you disabled AUTOBOOT earlier, re-enable::>
set d
debug kenv modify -node {NODE-A} -variable AUTOBOOT -value true -persist true
debug kenv modify -node {NODE-B} -variable AUTOBOOT -value true -persist true
THE END!
Hi Vidad,
Do you have instructions on a "non-disruptive" 2-node cDOT Headswap: Step-by-Step Walkthrough?
Regards,
Jackson
Hi Jackson. "non-disruptive" ARL headswap information starts here:
http://www.cosonok.com/2016/12/arl-headswap-part-14-preparation.html
Thanks Vidad, question on this Disruptive cDOT Headswap. If the OLD controllers' cluster ports are e0a/e0b/e0c/e0d and the new controllers' cluster ports will be e0c/e0d/e0e/e0f - e0e and e0f are already assigned to a0a on the old controllers?
When should I move the Cluster LIFs (e0a to e0e and e0b to e0f)? In section 2.8 or section 5.3 above?
Also on the DATA LIFs - the old controllers are using a0a-vlanID (with LACP ports e0e/e0f/e0g/e0h), the new controllers are using a0a-VLANID (with ports e7a/e7b/e13a/e13b) - can I treat them as being on the common port a0a?
Hi Jackson,
The way I'd do it based on the information you've provided for this disruptive headswap:
1) Pre-headswap remove e0a and e0b from the cluster broadcast-domain (re-home the cluster LIFs that are on them first to e0c and e0d)
2) Pre-headswap, modify the a0a ifgrp and remove the e0e, e0f, e0g, and e0h ports.
3) Post-headswap, add e0e and e0f to the cluster broadcast-domain and sort out the cluster LIFs.
4) Post-headswap, add the correct ports to a0a.
Thanks Vidad for your quick response and your advice, much appreciated.
Hi Vidad,
I followed your guide and did a disruptive headswap from a FAS3240 over to a FAS8040. I ran into the issue where I was not able to move the CLUSTER ports from e1a and e2a ahead of time due to not having the PCI card in the FAS8040. Come to find out that the 10gb PCI card in the FAS3240 was not compatible with the FAS8040, so was not able to move the card over. We had no choice but to proceed, knowing that one of the nodes was going to be OUT OF QUORUM due to these cluster ports. But just wanted to point out to anyone going through this, that the way we got through it was by adding the e0a and e0c ports to the CLUSTER broadcast domain and creating new LIFs in the CLUSTER SVM on both nodes after almost completing the headswap. Waited a couple minutes and both nodes were now in quorum. Verified with the "cluster show" and "cluster ring show" commands. So it is possible to fix up your cluster towards the end of the headswap process... assuming you follow the instructions correctly and make it that far.
Thanks for making this guide!
Milther
Hi Vidad,
Thanks to your guide, I've successfully performed a headswap in ONTAP 9.1. There was a difference with disk reassign, because option -p was mandatory; more precisely:
disk reassign -s {OLD_NODEA_SYSID} -d {NEW_NODEA_SYSID} -p {OLD_NODEA_SYSID}
disk reassign -s {OLD_NODEB_SYSID} -d {NEW_NODEB_SYSID} -p {NEW_NODEA_SYSID}
It was weird that the systems knew the right partner sysid to apply, because there was a check on the -p parameter.
Do you have any experience with this?
Cheers
Lorenzo
Hi Lorenzo,
The -p is correct where you have advanced disk partitioning. This article was written before that time. So with ADP it's:
disk reassign -s NODE-A_SYSID -d NODE-C_SYSID -p NODE-B_SYSID
Cheers, VC
UPDATE: I happened to use this procedure for a disruptive headswap and shelf reallocation recently (ONTAP 9.1). A few lessons learned/remembered:
1) If you're relocating shelves, it's very nice to purchase new rail kits!
2) Remember to set the date on the new controllers.
3) Post-headswap, set the storage failover hwassist IP.
4) If you've set the SP beforehand, it gets overwritten by the original SP in the headswap.
5) Finally, you might want to update the SP firmware after doing the headswap.
When I wrote this there wasn't an official document that covered the disruptive cDOT headswap. There is now (dates back to July 2019):
ReplyDelete"Upgrading Controller Hardware by Moving Storage"
https://library.netapp.com/ecm/ecm_download_file/ECMLP2540637