Performing a Non-Disruptive ARL (Aggregate Relocation) Headswap for NetApp Clustered Data ONTAP - Step-by-Step Walkthrough Series
Caveat Lector: Unofficial information!
0) Introduction
At a high level, the steps involved in the ARL process are:
1) Due Diligence
2) Prepare New Controllers
3) Prepare Existing Cluster and Controllers
4) Use ARL to Relocate Aggregates from NODE-A to NODE-B
5) Record NODE-A info
6) Migrate Data LIFs off NODE-A
7) Handle NODE-A’s Other LIFs and Networking
8) Disable SFO
9) Retire NODE-A
10) Replace NODE-A with NODE-C
11) Return Data LIFs to NODE-C
12) Handle NODE-C’s Other LIFs
13) Use ARL to Relocate Aggregates from NODE-B to NODE-C
14) Record NODE-B info
15) Migrate Data LIFs off NODE-B
16) Handle NODE-B’s Other LIFs and Networking
17) Retire NODE-B
18) Replace NODE-B with NODE-D
19) Return Data LIFs to NODE-D
20) Handle NODE-D’s Other LIFs
21) Use ARL to Relocate (Selected) Aggregates to NODE-D
22) Re-Enable SFO
23) ARL Finishing Touches
24) Test Failover
Image: ARL Headswap Process
Note 1: This series doesn’t cover all scenarios (e.g. V-Series, Storage Encryption ...)
Note 2: To keep this post short I only include key commands. Advanced ONTAP skills are a prerequisite for performing the ARL headswap.
1) Due Diligence
1.1) Information Gathering and Planning
Gather the following information:
- Existing Cluster Information
- New Controllers Information
- Platform Mixing Rules Verification (from HWU) - for 4+ node clusters
- On-Board Ports for Old Controllers (from HWU)
- On-Board Ports for New Controllers (from HWU)
- Cards in Old Controllers
- Slot Configuration for New Controllers
- Mapping Physical Ports on Old Controllers to Physical Ports on New Controllers
IMPORTANT: At least one cluster port needs to map (at least temporarily) from Old to New Controllers (see Scenario 5 of When Ports Go Missing)
- Physical Networking on Old Controllers (ports, IFGRPs, VLANs)
- Logical Networking on Old Controllers (failover-groups, broadcast domains)
- LIFs (Logical Interfaces) on Old Controllers
- Licenses on Old Controllers
- Licenses for New Controllers
- Service Processor Info
- Non-Root (Data) Aggregates
Note: This is not an exhaustive list.
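Most of the cluster-side information can be collected from the clustershell. A hedged sketch of the kind of commands involved (scope and field selections are suggestions only; hardware details such as on-board ports, cards and platform mixing rules come from the Hardware Universe rather than the CLI)::>
network port show
network port ifgrp show
network port vlan show
network port broadcast-domain show
network interface failover-groups show
network interface show
system license show
system node service-processor show
storage aggregate show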
1.2) Verify AutoSupports and perform any required remediation(s)
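For example (a hedged sketch - your AutoSupport checks may go further than this)::>
system node autosupport show -node * -fields state,transport,mail-hosts
system node autosupport history show -node NODE-A
system node autosupport history show -node NODE-B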
1.3) Verify Config Advisor outputs and perform any required remediation(s)
1.4) Obtain software images as required
Note: Recommend using the exact same version (including P/D release) on the new controllers.
1.5) Obtain official ARL headswap documentation
Read and understand the process!
1.6) Obtain FAS/AFF controller documentation set
1.7) Obtain ONTAP documentation set
1.8) Site Verification (via customer/site-survey)
- Rack space for new controllers
- Access to cabs
- Available power
- Cable lengths, SFPs, I/O cards...
Note: This is not an exhaustive list.
2) Prepare New Controllers
2.1) Power on New Controllers
Interrupt the boot process by pressing Ctrl-C to access the LOADER> environment.
IMPORTANT: If you get a warning “The (NVRAM) battery is unfit to retain data”, allow the battery to charge (do not override).
On both controllers, with no disks attached, run from LOADER>
set-defaults
setenv bootarg.init.boot_clustered true
setenv AUTOBOOT false
saveenv
boot_ontap prompt
Note: We set AUTOBOOT to false to give more control over the boot process (we set it back to true after the ARL). To boot normally type> boot_ontap
2.2) System Images
Upgrade/downgrade both software images using boot menu selection 7 “Install new software first.”
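Option 7 pulls the image over the network, so the new controllers need a reachable web server hosting the software image. A hedged sketch of the interaction (the URL is a placeholder and the exact prompts vary by release):
Selection (1-8)? 7
... provide the package URL when prompted, e.g. http://<webserver>/ontap_image.tgz ...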
2.3) Wipeconfig (Used Controllers Only)
If either of the controllers was previously used, run wipeconfig from the boot menu:
Selection (1-8)? wipeconfig
Let the controller reboot. The wipeconfig is successful if you see:
The boot device has changed. System configuration information could be lost. Use option (6) to restore the system configuration, or option (4) to initialize all disks and setup a new system.
Normal Boot is prohibited.
Note: If you are upgrading to a system with both nodes in the same chassis, install both nodes in the chassis. Both nodes can be left on the LOADER> prompt.
IMPORTANT NOTE: Switchless Clusters
If you’re headswapping a switchless cluster, be mindful of the bootvar bootarg.init.switchless_cluster.enable - it defaults to false, so if it is not set to true on both replacement heads (in the switchless cluster) prior to boot, it’s a support case.
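To check and set it, something along these lines at the LOADER prompt on each replacement head (a hedged sketch, following the same pattern as section 2.1):
printenv bootarg.init.switchless_cluster.enable
setenv bootarg.init.switchless_cluster.enable true
saveenv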
3) Prepare Existing Cluster and Controllers
3.1) Install licenses for new controllers
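For example (a hedged sketch - the license keys are placeholders tied to the new controllers’ serial numbers)::>
system license add -license-code <NODE-C_KEY>,<NODE-D_KEY>
system license show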
3.2) Verify Storage Encryption is disabled*
*storage encryption is not covered in this guide
(pre-8.3.1)::> node run NODENAME -command disk encrypt show
(post-8.3.1)::> security key-manager show -status
3.3) (For 4+ node cluster) Verify Epsilon not on the HA-pair being ARL-ed
If it is then move epsilon::>
set -c off; set adv; cluster show
cluster modify -node NODE_WITH_EPSILON -epsilon false
cluster modify -node NODE_NOT_BEING_ARL-ed -epsilon true
3.4) Verify Cluster and Nodes::>
cluster show
set -c off; set adv
cluster ping-cluster -node NODE-A
cluster ping-cluster -node NODE-B
version
system node image show -node NODE-A,NODE-B -iscurrent true
3.5) Verify Storage Failover::>
storage failover show
3.6) (For 2 node cluster) Verify Cluster HA::>
cluster ha show
3.7) Verify aggregates are owned by their home node::>
storage aggregate show -nodes NODE-A -is-home false -fields owner-name,home-name,state
storage aggregate show -nodes NODE-B -is-home false -fields owner-name,home-name,state
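If anything is reported here (i.e. an aggregate is not on its home node), send it home before starting. A hedged example using aggregate relocation, assuming an aggregate homed on NODE-A is currently owned by NODE-B (the aggregate name is a placeholder)::>
storage aggregate relocation start -node NODE-B -destination NODE-A -aggregate-list aggr_not_home
storage aggregate relocation show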
3.8) Verify disks::>
storage failover show -node NODE-A,NODE-B -fields local-missing-disks,partner-missing-disks
storage disk show -nodelist NODE-A,NODE-B -broken
3.9) Data Collection
Note: This is far from an exhaustive list. The idea is to capture a snapshot of information which can be used for comparison purposes later on.
storage aggregate show -node NODE-A -state online
storage aggregate show -node NODE-B -state online
volume show -node NODE-A -state offline
volume show -node NODE-B -state offline
ucadmin show -node NODE-A
ucadmin show -node NODE-B
system node service-processor show -node NODE-A -instance
system node service-processor show -node NODE-B -instance
event log show -messagename scsiblade.*
3.10) Node and Cluster Backups::>
security login unlock -username diag
security login password -username diag
set d
systemshell -node NODE-A
cp /mroot/etc/varfs.tgz /mroot/etc/varfs.bak
exit
systemshell -node NODE-B
cp /mroot/etc/varfs.tgz /mroot/etc/varfs.bak
exit
system configuration backup create -node NODE-A -backup-type node -backup-name ACheadswap
system configuration backup create -node NODE-B -backup-type node -backup-name BDheadswap
system configuration backup create -node NODE-A -backup-type cluster -backup-name clusbackup
job show -name *backup*
Wait for the backups to complete.
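To confirm the backups are present once the jobs finish (a hedged example)::>
system configuration backup show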
3.11) Send AutoSupports and Verify Sent Successfully::>
system node autosupport invoke -node NODE-A -type all -message "Starting ARL process"
system node autosupport invoke -node NODE-B -type all -message "Starting ARL process"
system node autosupport history show -node NODE-A
system node autosupport history show -node NODE-B
Comments
David: Hi Vidad, we have some FAS8080EX systems which require a head swap out to FAS9000 units. Would you happen to know what needs to be done, in a non-disruptive head swap, with the ifgrps when the physical ports differ between the source and destination heads?
VC: Hello, anything but cluster ports is fairly easy to handle. You’ll have to move all the LIFs from those ifgrps onto another node in the cluster (host any node-local LIFs on an appropriate port - this can just be temporary), destroy the ifgrp, perform the headswap of the node, recreate the ifgrps, move the LIFs back, and repeat. Cheers, VC
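A hedged sketch of the kind of commands that workflow involves (the SVM, LIF, ifgrp and port names are all placeholders, and any VLANs or broadcast-domain memberships riding on the ifgrp need the same remove/recreate treatment)::>
network interface migrate -vserver SVM1 -lif lif1 -destination-node NODE-B -destination-port a0a
network interface modify -vserver SVM1 -lif lif1 -home-node NODE-B -home-port a0a
network port ifgrp delete -node NODE-A -ifgrp a0a
... headswap NODE-A to NODE-C ...
network port ifgrp create -node NODE-C -ifgrp a0a -distr-func ip -mode multimode_lacp
network port ifgrp add-port -node NODE-C -ifgrp a0a -port e0e
network port ifgrp add-port -node NODE-C -ifgrp a0a -port e0f
network interface modify -vserver SVM1 -lif lif1 -home-node NODE-C -home-port a0a
network interface revert -vserver SVM1 -lif lif1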
David: Thanks for the reply Vidad. Is there any particular document on the process with more detailed steps?
VC: Hello David, the official guide is here: https://library.netapp.com/ecm/ecm_download_file/ECMLP2659356 Cheers, VC
David: Thanks Vidad, much appreciated.
csdragon43: Issue I am having with FAS9000 is the 40G cluster ports. If you have a switch with only 10G ports (i.e. CN1610 or NX5596), and use breakout cables from the FAS9000, you must then use 8 switch ports per node (according to HWU). It is for a 4 node cluster, so I end up using the remaining ports only because the 4 FAS8060s are only using 2 of the 4 cluster ports (e0a/e0c). After two of the nodes are upgraded, additional disk shelves are added to it, and volumes are all moved to FAS9000, I actually need to then make this switched 4 node cluster into a 2 node switchless cluster. This will require switching the FAS9000s back to 40G during the process which appears to require a complete outage to enter the command from maintenance mode. Any ideas on avoiding this outage, assuming that the end requirement is to have a 2-node switchless cluster? At this time, it seems like the only option to make this non-disruptive is to purchase/borrow a 3132 switch from NetApp.
VC: Hi csdragon43. That’s a good point. How many 40GbE cards do you have per controller? If you have 2 per controller (4 ports), 1 card can be 40GbE, the other card can be 10GbE (I can’t remember if you can have 1 port 4x10 and the other 40GbE), and you could move your cluster LIFs across from 10GbE to 40GbE as part of the switched-to-switchless conversion. Will need a TO/GB (takeover/giveback) to sort the ports to the bandwidth you want. Cheers, VC
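For reference, once the 40GbE ports are up and in the Cluster IPspace, moving a cluster LIF across is typically done at advanced privilege along these lines (a hedged sketch - the LIF and port names are placeholders and exact syntax may vary by ONTAP release)::>
set adv
network interface migrate -vserver Cluster -lif NODE-C_clus1 -destination-node NODE-C -destination-port e4a
network interface modify -vserver Cluster -lif NODE-C_clus1 -home-port e4a
network interface revert -vserver Cluster -lif NODE-C_clus1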