ARL Headswap: Part 1/4 - Preparation

Performing a non-disruptive ARL (Aggregate Relocation) Headswap for NetApp Clustered Data ONTAP - Step-by-Step Walkthrough Series.


Caveat Lector: Unofficial information!

0) Introduction

At a high level, the steps involved in the ARL process are:

1) Due Diligence
2) Prepare New Controllers
3) Prepare Existing Cluster and Controllers

4) Use ARL to Relocate Aggregates from NODE-A to NODE-B
5) Record NODE-A info
6) Migrate Data LIFs off NODE-A
7) Handle NODE-A’s Other LIFs and Networking
8) Disable SFO
9) Retire NODE-A

10) Replace NODE-A with NODE-C
11) Return Data LIFs to NODE-C
12) Handle NODE-C’s Other LIFs
13) Use ARL to Relocate Aggregates from NODE-B to NODE-C
14) Record NODE-B info
15) Migrate Data LIFs off NODE-B
16) Handle NODE-B’s Other LIFs and Networking
17) Retire NODE-B

18) Replace NODE-B with NODE-D
19) Return data LIFs to NODE-D
20) Handle NODE-D’s Other LIFs
21) Use ARL to Relocate (Selected) Aggregates to NODE-D
22) Re-Enable SFO

23) ARL Finishing Touches
24) Test Failover

Image: ARL Headswap Process
Note 1: This series doesn’t cover all scenarios (e.g. V-Series, Storage Encryption ...)
Note 2: To keep this post short, only key commands are included. Advanced ONTAP skills are a prerequisite for performing an ARL headswap.

1) Due Diligence

1.1) Information Gathering and Planning
Gather the following information (a few sample collection commands follow this list):
- Existing Cluster Information
- New Controllers Information
- Platform Mixing Rules Verification (from HWU) - for 4+ node clusters
- On-Board Ports for Old Controllers (from HWU)
- On-Board Ports for New Controllers (from HWU)
- Cards in Old Controllers
- Slot Configuration for New Controllers
- Mapping Physical Ports on Old Controllers to Physical Ports on New Controllers
IMPORTANT: At least one cluster port needs to map (at least temporarily) from Old to New Controllers (see Scenario 5 of When Ports Go Missing)
- Physical Networking on Old Controllers (ports, IFGRPs, VLANs)
- Logical Networking on Old Controllers (failover-groups, broadcast domains)
- LIFs (Logical Interfaces) on Old Controllers
- Licenses on Old Controllers
- Licenses for New Controllers
- Service Processor Info
- Non-Root (Data) Aggregates
Note: This is not an exhaustive list.
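
Much of the cluster-side information above can be gathered with standard clustershell commands - a non-exhaustive sketch (broadcast domains apply to 8.3 and later; the HWU items come from the NetApp Hardware Universe)::>

cluster show
system node show -instance
network port show
network port ifgrp show
network port vlan show
network port broadcast-domain show
network interface show
network interface failover-groups show
system license show
system node service-processor show
storage aggregate show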

1.2) Verify AutoSupports and perform any required remediation(s)
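For example (a quick sketch - look for failed sends in the history)::>

system node autosupport show -fields state
system node autosupport history show -node NODE-A
system node autosupport history show -node NODE-B
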
1.3) Verify Config Advisor outputs and perform any required remediation(s)

1.4) Obtain software images as required
Note: It is recommended to use the exact same version (including P/D release) on the new controllers.

1.5) Obtain official ARL headswap documentation
Read and understand the process!

1.6) Obtain FAS/AFF controller documentation set
1.7) Obtain ONTAP documentation set

1.8) Site Verification (via customer/site-survey)
- Rack space for new controllers
- Access to cabs
- Available power
- Cable lengths, SFPs, I/O cards...
Note: This is not an exhaustive list.

2) Prepare New Controllers

2.1) Power on New Controllers
Interrupt the boot process by pressing Ctrl-C to access the LOADER> environment
IMPORTANT: If you get a warning “The (NVRAM) battery is unfit to retain data”, allow the battery to charge (do not override).
On both controllers, with no disks attached, run from LOADER>

set-defaults
setenv bootarg.init.boot_clustered true
setenv AUTOBOOT false
saveenv
boot_ontap prompt

Note: We set AUTOBOOT to false to give more control over the boot process (we set it back to true after the ARL). To boot normally, type boot_ontap at the LOADER> prompt.

2.2) System Images
Upgrade/downgrade both software images using boot menu selection 7 “Install new software first.”
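
The flow looks something like the below (illustrative only - prompts vary by ONTAP release, and the web server URL is a placeholder for wherever you host the image .tgz):

Selection (1-8)? 7
...
What is the URL for the package? http://webserver/ontap_image.tgz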

2.3) Wipeconfig (Used Controllers Only)
If either of the controllers was previously used, run wipeconfig from the boot menu:

Selection (1-8)? wipeconfig

Let the controller reboot. The wipeconfig is successful if you see:

The boot device has changed. System configuration information could be lost. Use option (6) to restore the system configuration, or option (4) to initialize all disks and setup a new system.
Normal Boot is prohibited.


Note: If you are upgrading to a system with both nodes in the same chassis, install both nodes in the chassis. Both nodes can be left at the LOADER> prompt.

IMPORTANT NOTE: Switchless Clusters
If you’re headswapping a switchless cluster, be mindful of the bootarg setting -
bootarg.init.switchless_cluster.enable
- it defaults to false, so if it is not set to true on both replacement heads (in the switchless cluster) prior to boot, you’ll end up with a support case.
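
To set it, run from LOADER> on each replacement head (alongside the other bootargs set in 2.1):

setenv bootarg.init.switchless_cluster.enable true
saveenv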

3) Prepare Existing Cluster and Controllers

3.1) Install licenses for new controllers
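For example (the 28-character license code below is just a placeholder)::>

system license add -license-code AAAAAAAAAAAAAAAAAAAAAAAAAAAA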

3.2) Verify Storage Encryption is disabled*
*storage encryption is not covered in this guide

(pre-8.3.1)::>  node run NODENAME -command disk encrypt show
(post-8.3.1)::> security key-manager show -status


3.3) (For 4+ node clusters) Verify epsilon is not on the HA pair being ARL-ed
If it is, then move epsilon::>

set -c off; set adv; cluster show
cluster modify -node NODE_WITH_EPSILON -epsilon false
cluster modify -node NODE_NOT_BEING_ARL-ed -epsilon true


3.4) Verify Cluster and Nodes::>

cluster show
set -c off; set adv
cluster ping-cluster -node NODE-A
cluster ping-cluster -node NODE-B
version
system node image show -node NODE-A,NODE-B -iscurrent true


3.5) Verify Storage Failover::>

storage failover show


3.6) (For 2 node cluster) Verify Cluster HA::>

cluster ha show


3.7) Verify aggregates are owned by their home node::>

storage aggregate show -nodes NODE-A -is-home false -fields owner-name,home-name,state
storage aggregate show -nodes NODE-B -is-home false -fields owner-name,home-name,state
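
Both commands should return no entries. If an aggregate is reported as not home, one way to send it home before proceeding (a sketch - node and aggregate names are placeholders; a storage failover giveback may be more appropriate depending on state)::>

storage aggregate relocation start -node CURRENT_OWNER -destination HOME_NODE -aggregate-list aggr1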


3.8) Verify disks::>

storage failover show -node NODE-A,NODE-B -fields local-missing-disks,partner-missing-disks
storage disk show -nodelist NODE-A,NODE-B -broken


3.9) Data Collection
Note: This is far from an exhaustive list. The idea is to take a snapshot of information that can be used for comparison purposes later on.

storage aggregate show -node NODE-A -state online
storage aggregate show -node NODE-B -state online
volume show -node NODE-A -state offline
volume show -node NODE-B -state offline
ucadmin show -node NODE-A
ucadmin show -node NODE-B
system node service-processor show -node NODE-A -instance
system node service-processor show -node NODE-B -instance
event log show -messagename scsiblade.*


3.10) Node and Cluster Backups::>

security login unlock -username diag
security login password -username diag
set d
systemshell -node NODE-A
cp /mroot/etc/varfs.tgz /mroot/etc/varfs.bak
exit
systemshell -node NODE-B
cp /mroot/etc/varfs.tgz /mroot/etc/varfs.bak
exit
system configuration backup create -node NODE-A -backup-type node -backup-name ACheadswap
system configuration backup create -node NODE-B -backup-type node -backup-name BDheadswap
system configuration backup create -node NODE-A -backup-type cluster -backup-name clusbackup
job show -name *backup*

Wait for the backups to complete.
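The backups can then be confirmed with::>

system configuration backup show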

3.11) Send AutoSupports and Verify Sent Successfully::>

system node autosupport invoke -node NODE-A -type all -message "Starting ARL process"
system node autosupport invoke -node NODE-B -type all -message "Starting ARL process"
system node autosupport history show -node NODE-A
system node autosupport history show -node NODE-B


Comments

  1. Hi Vidad, we have some FAS8080EX systems which require a head swap out to FAS9000 units. Would you happen to know what needs to be done via a non-disruptive head swap with the ifgrps when the physical ports differ between the source and destination heads?

    Replies
    1. Hello Unknown, anything but cluster ports is fairly easy to handle. You'll have to move all the LIFs from those ifgrps onto another node in the cluster (host any node local LIFs on an appropriate port, can just be temporary), destroy the ifgrp, perform the headswap of the node, recreate the ifgrps, move LIFs back, and repeat. Cheers, VC

    2. Thanks for the reply Vidad. Is there any particular document on the process with more detailed steps anyway?

    3. Hello David, the official guide is here:
      https://library.netapp.com/ecm/ecm_download_file/ECMLP2659356
      Cheers, VC

    4. Thanks Vidad, much appreciated.

      David.

  2. Issue I am having with FAS9000 is the 40G cluster ports. If you have a switch with only 10G ports (e.g. CN1610 or NX5596), and use breakout cables from the FAS9000, you must then use 8 switch ports per node (according to HWU). It is for a 4 node cluster, so I end up using the remaining ports only because the 4 FAS8060s are only using 2 of the 4 cluster ports (e0a/e0c). After two of the nodes are upgraded, additional disk shelves are added to it, and volumes are all moved to FAS9000, I actually need to then make this switched 4 node cluster into a 2 node switchless cluster. This will require switching the FAS9000s back to 40G during the process which appears to require a complete outage to enter the command from maintenance mode. Any ideas on avoiding this outage, assuming that the end requirement is to have a 2-node switchless cluster? At this time, it seems like the only option to make this non-disruptive is to purchase/borrow a 3132 switch from NetApp.

    Replies
    1. Hi csdragon43. That's a good point.
      How many 40GbE cards do you have per controller?
      If you have 2 per controller (4 ports), one card can run at 40GbE and the other at 10GbE (I can't remember if you can have one port as 4x10GbE and the other as 40GbE), and you could move your cluster LIFs across from 10GbE to 40GbE as part of the switched-to-switchless conversion. You will need a TO/GB (takeover/giveback) to sort the ports to the bandwidth you want. Cheers, VC

