Performing a Non-Disruptive ARL (Aggregate Relocation) Headswap for NetApp Clustered Data ONTAP - Step-by-Step Walkthrough Series
Caveat Lector: Unofficial information!
0) Introduction
At a high level, the steps involved in the ARL process are:
1) Due Diligence
2) Prepare New Controllers
3) Prepare Existing Cluster and Controllers
4) Use ARL to Relocate Aggregates from NODE-A to NODE-B
5) Record NODE-A info
6) Migrate Data LIFs off NODE-A
7) Handle NODE-A’s Other LIFs and Networking
8) Disable SFO
9) Retire NODE-A
10) Replace NODE-A with NODE-C
11) Return Data LIFs to NODE-C
12) Handle NODE-C’s Other LIFs
13) Use ARL to Relocate Aggregates from NODE-B to NODE-C
14) Record NODE-B info
15) Migrate Data LIFs off NODE-B
16) Handle NODE-B’s Other LIFs and Networking
17) Retire NODE-B
18) Replace NODE-B with NODE-D
19) Return Data LIFs to NODE-D
20) Handle NODE-D’s Other LIFs
21) Use ARL to Relocate (Selected) Aggregates to NODE-D
22) Re-Enable SFO
23) ARL Finishing Touches
24) Test Failover
Image: ARL Headswap Process
Note 1: This series doesn’t cover all scenarios (e.g. V-Series, Storage Encryption ...)
Note 2: To keep this post short I only include key commands. Advanced ONTAP skills are a prerequisite for performing the ARL headswap.
1) Due Diligence
1.1) Information Gathering and Planning
Gather the following information:
- Existing Cluster Information
- New Controllers Information
- Platform Mixing Rules Verification (from HWU) - for 4+ node clusters
- On-Board Ports for Old Controllers (from HWU)
- On-Board Ports for New Controllers (from HWU)
- Cards in Old Controllers
- Slot Configuration for New Controllers
- Mapping Physical Ports on Old Controllers to Physical Ports on New Controllers
IMPORTANT: At least one cluster port needs to map (at least temporarily) from Old to New Controllers (see Scenario 5 of When Ports Go Missing)
- Physical Networking on Old Controllers (ports, IFGRPs, VLANs)
- Logical Networking on Old Controllers (failover-groups, broadcast domains)
- LIFs (Logical Interfaces) on Old Controllers
- Licenses on Old Controllers
- Licenses for New Controllers
- Service Processor Info
- Non-Root (Data) Aggregates
Note: This is not an exhaustive list.
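Most of the cluster-side information can be collected from the clustershell. A hedged sketch of the kind of commands involved (scope and field selections are suggestions only; hardware details such as on-board ports, cards and platform mixing rules come from the Hardware Universe rather than the CLI)::>
network port show
network port ifgrp show
network port vlan show
network port broadcast-domain show
network interface failover-groups show
network interface show
system license show
system node service-processor show
storage aggregate show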
1.2) Verify AutoSupports and perform any required remediation(s)
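For example (a hedged sketch - your AutoSupport checks may go further than this)::>
system node autosupport show -node * -fields state,transport,mail-hosts
system node autosupport history show -node NODE-A
system node autosupport history show -node NODE-B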
1.3) Verify Config Advisor outputs and perform any required remediation(s)
1.4) Obtain software images as required
Note: Recommend using the exact same version (including P/D release) on the new controllers.
1.5) Obtain official ARL headswap documentation
Read and understand the process!
1.6) Obtain FAS/AFF controller documentation set
1.7) Obtain ONTAP documentation set
1.8) Site Verification (via customer/site-survey)
- Rack space for new controllers
- Access to cabs
- Available power
- Cable lengths, SFPs, I/O cards...
Note: This is not an exhaustive list.
2) Prepare New Controllers
2.1) Power on New Controllers
Interrupt the boot process by pressing Ctrl-C to access the LOADER> environment.
IMPORTANT: If you get a warning “The (NVRAM) battery is unfit to retain data”, allow the battery to charge (do not override).
On both controllers, with no disks attached, run from LOADER>
set-defaults
setenv bootarg.init.boot_clustered true
setenv AUTOBOOT false
saveenv
boot_ontap prompt
Note: We set AUTOBOOT to false to give more control over the boot process (we set it back to true after the ARL). To boot normally type> boot_ontap
2.2) System Images
Upgrade/downgrade both software images using boot menu selection 7 “Install new software first.”
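Option 7 pulls the image over the network, so the new controllers need a reachable web server hosting the software image. A hedged sketch of the interaction (the URL is a placeholder and the exact prompts vary by release):
Selection (1-8)? 7
... provide the package URL when prompted, e.g. http://<webserver>/ontap_image.tgz ...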
2.3) Wipeconfig (Used Controllers Only)
If either of the controllers was previously used, run wipeconfig from the boot menu:
Selection (1-8)? wipeconfig
Let the controller reboot. The wipeconfig is successful if you see:
The boot device has changed. System configuration information could be lost. Use option (6) to restore the system configuration, or option (4) to initialize all disks and setup a new system.
Normal Boot is prohibited.
Note: If you are upgrading to a system with both nodes in the same chassis, install both nodes in the chassis. Both nodes can be left on the LOADER> prompt.
IMPORTANT NOTE: Switchless Clusters
If you’re headswapping a switchless cluster, be mindful of the bootvar bootarg.init.switchless_cluster.enable - it defaults to false, so if it is not set to true on both replacement heads (in the switchless cluster) prior to boot, it’s a support case.
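To check and set it, something along these lines at the LOADER prompt on each replacement head (a hedged sketch, following the same pattern as section 2.1):
printenv bootarg.init.switchless_cluster.enable
setenv bootarg.init.switchless_cluster.enable true
saveenv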
3) Prepare Existing Cluster and Controllers
3.1) Install licenses for new controllers
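For example (a hedged sketch - the license keys are placeholders tied to the new controllers’ serial numbers)::>
system license add -license-code <NODE-C_KEY>,<NODE-D_KEY>
system license show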
3.2) Verify Storage Encryption is disabled*
*storage encryption is not covered in this guide
(pre-8.3.1)::> node run NODENAME -command disk encrypt show
(post-8.3.1)::> security key-manager show -status
3.3) (For 4+ node cluster) Verify Epsilon not on the HA-pair being ARL-ed
If it is then move epsilon::>
set -c off; set adv; cluster show
cluster modify -node NODE_WITH_EPSILON -epsilon false
cluster modify -node NODE_NOT_BEING_ARL-ed -epsilon true
3.4) Verify Cluster and Nodes::>
cluster show
set -c off; set adv
cluster ping-cluster -node NODE-A
cluster ping-cluster -node NODE-B
version
system node image show -node NODE-A,NODE-B -iscurrent true
3.5) Verify Storage Failover::>
storage failover show
3.6) (For 2 node cluster) Verify Cluster HA::>
cluster ha show
3.7) Verify aggregates are owned by their home node::>
storage aggregate show -nodes NODE-A -is-home false -fields owner-name,home-name,state
storage aggregate show -nodes NODE-B -is-home false -fields owner-name,home-name,state
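If anything is reported here (i.e. an aggregate is not on its home node), send it home before starting. A hedged example using aggregate relocation, assuming an aggregate homed on NODE-A is currently owned by NODE-B (the aggregate name is a placeholder)::>
storage aggregate relocation start -node NODE-B -destination NODE-A -aggregate-list aggr_not_home
storage aggregate relocation show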
3.8) Verify disks::>
storage failover show -node NODE-A,NODE-B -fields local-missing-disks,partner-missing-disks
storage disk show -nodelist NODE-A,NODE-B -broken
3.9) Data Collection
Note: This is far from an exhaustive list. The idea is to capture a snapshot of information which can be used for comparison purposes later on.
storage aggregate show -node NODE-A -state online
storage aggregate show -node NODE-B -state online
volume show -node NODE-A -state offline
volume show -node NODE-B -state offline
ucadmin show -node NODE-A
ucadmin show -node NODE-B
system node service-processor show -node NODE-A -instance
system node service-processor show -node NODE-B -instance
event log show -messagename scsiblade.*
3.10) Node and Cluster Backups::>
security login unlock -username diag
security login password -username diag
set d
systemshell -node NODE-A
cp /mroot/etc/varfs.tgz /mroot/etc/varfs.bak
exit
systemshell -node NODE-B
cp /mroot/etc/varfs.tgz /mroot/etc/varfs.bak
exit
system configuration backup create -node NODE-A -backup-type node -backup-name ACheadswap
system configuration backup create -node NODE-B -backup-type node -backup-name BDheadswap
system configuration backup create -node NODE-A -backup-type cluster -backup-name clusbackup
job show -name *backup*
Wait for the backups to complete.
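To confirm the backups are present once the jobs finish (a hedged example)::>
system configuration backup show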
3.11) Send AutoSupports and Verify Sent Successfully::>
system node autosupport invoke -node NODE-A -type all -message "Starting ARL process"
system node autosupport invoke -node NODE-B -type all -message "Starting ARL process"
system node autosupport history show -node NODE-A
system node autosupport history show -node NODE-B
Comments
David: Hi Vidad, we have some FAS8080EX systems which require a head swap out to FAS9000 units. Would you happen to know what needs to be done, in a non-disruptive head swap, with the ifgrps when the physical ports differ between the source and destination heads?
VC: Hello, anything but cluster ports is fairly easy to handle. You’ll have to move all the LIFs from those ifgrps onto another node in the cluster (host any node-local LIFs on an appropriate port - this can just be temporary), destroy the ifgrp, perform the headswap of the node, recreate the ifgrps, move the LIFs back, and repeat. Cheers, VC
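A hedged sketch of the kind of commands that workflow involves (the SVM, LIF, ifgrp and port names are all placeholders, and any VLANs or broadcast-domain memberships riding on the ifgrp need the same remove/recreate treatment)::>
network interface migrate -vserver SVM1 -lif lif1 -destination-node NODE-B -destination-port a0a
network interface modify -vserver SVM1 -lif lif1 -home-node NODE-B -home-port a0a
network port ifgrp delete -node NODE-A -ifgrp a0a
... headswap NODE-A to NODE-C ...
network port ifgrp create -node NODE-C -ifgrp a0a -distr-func ip -mode multimode_lacp
network port ifgrp add-port -node NODE-C -ifgrp a0a -port e0e
network port ifgrp add-port -node NODE-C -ifgrp a0a -port e0f
network interface modify -vserver SVM1 -lif lif1 -home-node NODE-C -home-port a0a
network interface revert -vserver SVM1 -lif lif1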
David: Thanks for the reply Vidad. Is there any particular document on the process with more detailed steps?
VC: Hello David, the official guide is here: https://library.netapp.com/ecm/ecm_download_file/ECMLP2659356 Cheers, VC
David: Thanks Vidad, much appreciated.
csdragon43: Issue I am having with FAS9000 is the 40G cluster ports. If you have a switch with only 10G ports (i.e. CN1610 or NX5596), and use breakout cables from the FAS9000, you must then use 8 switch ports per node (according to HWU). It is for a 4 node cluster, so I end up using the remaining ports only because the 4 FAS8060s are only using 2 of the 4 cluster ports (e0a/e0c). After two of the nodes are upgraded, additional disk shelves are added to it, and volumes are all moved to FAS9000, I actually need to then make this switched 4 node cluster into a 2 node switchless cluster. This will require switching the FAS9000s back to 40G during the process which appears to require a complete outage to enter the command from maintenance mode. Any ideas on avoiding this outage, assuming that the end requirement is to have a 2-node switchless cluster? At this time, it seems like the only option to make this non-disruptive is to purchase/borrow a 3132 switch from NetApp.
VC: Hi csdragon43. That’s a good point. How many 40GbE cards do you have per controller? If you have 2 per controller (4 ports), 1 card can be 40GbE, the other card can be 10GbE (I can’t remember if you can have 1 port 4x10 and the other 40GbE), and you could move your cluster LIFs across from 10GbE to 40GbE as part of the switched-to-switchless conversion. Will need a TO/GB (takeover/giveback) to sort the ports to the bandwidth you want. Cheers, VC
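For reference, once the 40GbE ports are up and in the Cluster IPspace, moving a cluster LIF across is typically done at advanced privilege along these lines (a hedged sketch - the LIF and port names are placeholders and exact syntax may vary by ONTAP release)::>
set adv
network interface migrate -vserver Cluster -lif NODE-C_clus1 -destination-node NODE-C -destination-port e4a
network interface modify -vserver Cluster -lif NODE-C_clus1 -home-port e4a
network interface revert -vserver Cluster -lif NODE-C_clus1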