Sunday, 6 May 2018

System Node Migrate-Root - Experience and Tips

I needed to test ‘system node migrate-root’ in preparation for potentially using it before performing a headswap. There is no physical hardware here, just an ONTAP 9.3 HA Simulator lab. These are some observations, with a couple of tips.

Documentation

System Node Migrate-Root is documented here:

The syntax is very simple, for example:


cluster1::*>  system node migrate-root -node cluster1-01 -disklist VMw-1.17,VMw-1.18,VMw-1.19 -raid-type raid_dp

Warning: This operation will create a new root aggregate and replace the existing root on the node "cluster1-01". The existing root aggregate will be discarded.
Do you want to continue? {y|n}: y

Info: Started migrate-root job. Run "job show -id 86 -instance" command to
      check the progress of the job.


The process (see below) starts straight away.

The official documentation mentions:

The command starts a job that backs up the node configuration, creates a new aggregate, set it as new root aggregate, restores the node configuration and restores the names of original aggregate and volume. The job might take as long as a few hours depending on time it takes for zeroing the disks, rebooting the node and restoring the node configuration.

The Process and Timings

The SIM has tiny disks of 10.66GB and 28.44GB usable size. The 10.66GB disks were originally used for the 3-disk root aggregates, and migrate-root moved the root aggregate to the slightly larger virtual disks. On a physical system with disks much bigger than 28.44GB, I would expect the timings to be considerably longer than those below. The timings below were taken by capturing the ‘Execution Progress’ string from the job show output every second.

0-27 seconds: Starting node configuration backup. This might take several minutes.
28-146 seconds: Starting aggregate relocation on the node "cluster1-02"
147-212 seconds: Rebooting the node to create a new root aggregate.
213-564 seconds: Waiting for the node to create a new root and come online. This might take a few minutes.
565-682 seconds: Making the old root aggregate online.
683-686 seconds: Copying contents from old root volume to new root volume.
687-864 seconds: Starting removal of old aggregate and volume and renaming the new root.
865-1653 seconds: Starting node configuration restore.
1654-1772 seconds: Enabling HA and relocating the aggregates. This might take a few minutes.
1773 seconds: Complete: Root aggregate migration successfully completed [0]
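For anyone wanting to build a similar list, here is a minimal Python sketch of how per-second samples of the ‘Execution Progress’ string could be collapsed into ranges like the ones above. It assumes the strings have already been captured into a list, one entry per second; the capture itself is not shown, and the function name and the shortened sample data are made up for illustration.

```python
from itertools import groupby

def collapse_samples(samples):
    """Collapse per-second progress strings into (first, last, message) ranges.

    samples[i] is the progress string observed at second i.
    """
    ranges = []
    second = 0
    for message, run in groupby(samples):
        length = len(list(run))  # how many consecutive seconds showed this string
        ranges.append((second, second + length - 1, message))
        second += length
    return ranges

# Hypothetical, shortened sample stream (not the real job output).
samples = ["backup"] * 3 + ["relocate"] * 2 + ["complete"]

for first, last, message in collapse_samples(samples):
    label = f"{first}-{last}" if first != last else f"{first}"
    print(f"{label} seconds: {message}")
```

Grouping consecutive identical strings means a message that reappears later in the job would get its own range rather than being merged with the earlier one.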

Nearly 30 minutes for migrate-root on one node of a tiny ONTAP 9.3RC1 HA SIM! And you still need to do a takeover/giveback of the node whose root aggregate was moved (see below).
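To see where the time went, the ranges listed above can be tallied with a few lines of Python. This is only a sketch: the boundaries are copied from the list, and the short phase names are my own abbreviations of the progress strings.

```python
# (phase, first second, last second), copied from the timings listed above.
phases = [
    ("Node configuration backup", 0, 27),
    ("Aggregate relocation", 28, 146),
    ("Reboot to create new root aggregate", 147, 212),
    ("Wait for new root to come online", 213, 564),
    ("Bring old root aggregate online", 565, 682),
    ("Copy old root volume to new root volume", 683, 686),
    ("Remove old aggregate/volume, rename new root", 687, 864),
    ("Node configuration restore", 865, 1653),
    ("Enable HA and relocate aggregates", 1654, 1772),
]

def duration(first, last):
    """Seconds covered by an inclusive range."""
    return last - first + 1

total = sum(duration(first, last) for _, first, last in phases)

# Print the phases, longest first, with their share of the total run time.
for name, first, last in sorted(phases, key=lambda p: duration(p[1], p[2]), reverse=True):
    secs = duration(first, last)
    print(f"{secs:5d} s  {100 * secs / total:5.1f}%  {name}")

minutes, seconds = divmod(total, 60)
print(f"total: {total} s ({minutes} min {seconds} s)")
```

On these numbers the node configuration restore is the longest phase, at 789 of the 1773 seconds, followed by the wait for the new root to come online at 352 seconds.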

Tips

1) The process disables HA and Storage Failover, and Aggregate Relocation is used to move the data aggregates to the node that’s staying up. The process does not move data LIFs. These will fail over automatically, but I noticed a bit of a delay (my test CIFS share was down for 45 seconds), so I’d recommend first moving data LIFs onto the node that’s going to stay up.

2) I noticed - consistently - that if you run ‘system health alert show’ after the migrate-root completes, you get RPC ‘Remote system error’ warnings for the health monitors on the affected node (see the output below). Perform a takeover/giveback of the affected node after the migrate-root completes to correct this.


cluster1::*> system health alert show
This table is currently empty.

Warning: Unable to list entries for schm on node "cluster1-02": RPC: Remote
         system error [from mgwd on node "cluster1-02" (VSID: -1) to schm at
         127.0.0.1].
         Unable to list entries for shm on node "cluster1-02": RPC: Remote
         system error [from mgwd on node "cluster1-02" (VSID: -1) to shm at
         127.0.0.1].
         Unable to list entries for cphm on node "cluster1-02": RPC: Remote
         system error [from mgwd on node "cluster1-02" (VSID: -1) to cphm at
         127.0.0.1].
         Unable to list entries for cshm on node "cluster1-02": RPC: Remote
         system error [from mgwd on node "cluster1-02" (VSID: -1) to cshm at
         127.0.0.1].

cluster1::*> Replaying takeover WAFL log
May 06 14:01:04 [cluster1-01:monitor.globalStatus.critical:EMERGENCY]: This node has taken over cluster1-02.

cluster1::*> system health alert show
This table is currently empty.

cluster1::*>


Image: Remote system error [from mgwd on node...
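If you want to check for this condition in a script rather than by eye, a rough Python sketch follows. It assumes you already have the text of ‘system health alert show’ in a string (how you capture it is up to you), and it relies only on the warning wording shown in the output above; the function name is made up for illustration.

```python
import re

# Matches one warning entry, e.g.:
#   Unable to list entries for schm on node "cluster1-02"
PATTERN = re.compile(r'Unable to list entries for (\w+) on node "([^"]+)"')

def failing_monitors(cli_output):
    """Return {node: [monitor, ...]} for every RPC warning found in the text."""
    result = {}
    for monitor, node in PATTERN.findall(cli_output):
        result.setdefault(node, []).append(monitor)
    return result

# Excerpt of the output shown above.
output = '''This table is currently empty.

Warning: Unable to list entries for schm on node "cluster1-02": RPC: Remote
         system error [from mgwd on node "cluster1-02" (VSID: -1) to schm at
         127.0.0.1].
         Unable to list entries for shm on node "cluster1-02": RPC: Remote
         system error [from mgwd on node "cluster1-02" (VSID: -1) to shm at
         127.0.0.1].
'''

print(failing_monitors(output))
```

An empty result means no such warnings were found in the text; a non-empty one names the node that still needs the takeover/giveback.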

Further Reading:

How to non-disruptively migrate a node's root aggregate onto new disks in ONTAP 9
https://kb.netapp.com/app/answers/answer_view/a_id/1086971

Real World Experience!

If you log in to the above KB with your NetApp account, there's a mention:

"Normally, the root-migration will be successful and transparent, however, there are bugs that can cause the script to fail. The node will boot up with a new root aggregate, however, additional clean up steps must be taken. Technical Support should be engaged to document the failure and assist in recovery."

It's very possible you will need to log a support call, and support will follow certain recovery steps in the KB. So give yourself time. Don't plan to cram root vol moves followed by an ARL headswap into the same day - it might be a long day.

Bug Summary

Fixed: ONTAP 9.3+
Workaround:
Perform the following steps to resolve this issue:
 1. Run the 'aggr status' command on nodeshell to check if the new aggregate 'new_root' is created.
 2. If the aggregate is created and online, perform either of the following steps to resume the process:
 - For ONTAP 9.2 or later, run the 'system node migrate-root -resume' command.
 - For earlier ONTAP versions, contact technical support for assistance.

Fixed: ONTAP 9.0P1+
Workaround:
If you encounter this issue, contact technical support to manually complete the root replacement procedure.

Fixed: ONTAP 9.2+
Workaround:
Perform the following steps to resolve this issue:
  - If the command fails with this error message "Internal error: Failed to verify the new root aggregate status" for ONTAP 9.2, resume the migrate-root operation using the "system node migrate-root -resume" command. If you see the same issue in ONTAP 9.1, call NetApp Support.
  - For other errors, contact NetApp Support for assistance.

Fixed: ONTAP 9.6+
Workaround:
Perform the root replacement using the manual root migration procedure.

The Manual Method
