Sunday, 6 May 2018

System Node Migrate-Root - Experience and Tips

I needed to test out ‘system node migrate-root’ in preparation for potentially using it prior to performing a headswap. No physical hardware here, just an ONTAP 9.3 HA Simulator lab. These are some observations, with a couple of tips.

Documentation

System Node Migrate-Root is documented here:

The syntax is very simple, for example:


cluster1::*>  system node migrate-root -node cluster1-01 -disklist VMw-1.17,VMw-1.18,VMw-1.19 -raid-type raid_dp

Warning: This operation will create a new root aggregate and replace the existing root on the node "cluster1-01". The existing root aggregate will be discarded.
Do you want to continue? {y|n}: y

Info: Started migrate-root job. Run "job show -id 86 -instance" command to
      check the progress of the job.


The process (see below) starts straight away.

The official documentation mentions:

The command starts a job that backs up the node configuration, creates a new aggregate, set it as new root aggregate, restores the node configuration and restores the names of original aggregate and volume. The job might take as long as a few hours depending on time it takes for zeroing the disks, rebooting the node and restoring the node configuration.

The Process and Timings

The SIM has tiny disks of 10.66GB and 28.44GB usable size. The 10.66GB disks were originally used for the 3-disk root aggregates, and the migrate-root moved the root aggregate to the slightly larger virtual disks. On a physical system with disks much bigger than 28.44GB, I would expect the timings to be considerably longer than those below. The timings below were taken by capturing the ‘Execution Progress’ string from the job show output every second.

0-27 seconds: Starting node configuration backup. This might take several minutes.
28-146 seconds: Starting aggregate relocation on the node "cluster1-02"
147-212 seconds: Rebooting the node to create a new root aggregate.
213-564 seconds: Waiting for the node to create a new root and come online. This might take a few minutes.
565-682 seconds: Making the old root aggregate online.
683-686 seconds: Copying contents from old root volume to new root volume.
687-864 seconds: Starting removal of old aggregate and volume and renaming the new root.
865-1653 seconds: Starting node configuration restore.
1654-1772 seconds: Enabling HA and relocating the aggregates. This might take a few minutes.
1773 seconds: Complete: Root aggregate migration successfully completed [0]
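
As a rough illustration of how per-second samples turn into ranges like those above, here is a minimal Python sketch. It assumes the samples have already been collected as (second, progress string) pairs; the function name and the shortened sample data are made up for the example and are not the actual capture from the lab.

```python
from itertools import groupby


def collapse_samples(samples):
    """Collapse (second, progress) samples into (first, last, progress) ranges.

    Consecutive samples with the same progress string become one range.
    """
    ranges = []
    for progress, group in groupby(samples, key=lambda s: s[1]):
        seconds = [sec for sec, _ in group]
        ranges.append((seconds[0], seconds[-1], progress))
    return ranges


# Made-up, shortened sample data for illustration only.
samples = (
    [(s, "Starting node configuration backup.") for s in range(0, 3)]
    + [(s, "Rebooting the node to create a new root aggregate.") for s in range(3, 5)]
    + [(5, "Complete")]
)

for first, last, progress in collapse_samples(samples):
    label = f"{first}-{last}" if first != last else f"{first}"
    print(f"{label} seconds: {progress}")
```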

Nearly 30 minutes for migrate-root on one node of a tiny ONTAP 9.3RC1 HA SIM! And you still need to do a takeover/giveback of the node whose root aggregate was moved (see below).
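
To see where that time goes, the ranges above can be turned into per-phase durations. This is a small Python sketch; the start and end seconds are copied from the timings list, while the phase labels are my own shortened paraphrases of the progress strings.

```python
# Inclusive start/end seconds copied from the timings list; labels are paraphrased.
phases = [
    (0, 27, "Node configuration backup"),
    (28, 146, "Aggregate relocation"),
    (147, 212, "Reboot to create new root aggregate"),
    (213, 564, "Wait for new root and node online"),
    (565, 682, "Bring old root aggregate online"),
    (683, 686, "Copy old root volume to new root volume"),
    (687, 864, "Remove old aggregate/volume, rename new root"),
    (865, 1653, "Node configuration restore"),
    (1654, 1772, "Enable HA and relocate aggregates"),
]


def phase_durations(phases):
    """Return (label, seconds) per phase; ranges are inclusive, hence the +1."""
    return [(label, end - start + 1) for start, end, label in phases]


durations = phase_durations(phases)
total = sum(secs for _, secs in durations)

for label, secs in durations:
    print(f"{label:<45} {secs:>5} s  {100 * secs / total:5.1f}%")
print(f"Total: {total} s")
```

On this run the node configuration restore accounts for 789 of the 1773 seconds, making it the longest single phase, followed by the 352 seconds spent waiting for the node to create the new root and come online.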

Tips

1) The process disables HA and Storage Failover, and uses Aggregate Relocation to move the data aggregates to the node that’s staying up. It does not move data LIFs. These will fail over automatically, but I noticed a bit of a delay (my test CIFS share was down for 45 seconds), so I’d recommend first moving data LIFs onto the node that’s going to stay up.

2) I noticed - consistently - that if you run ‘system health alert show’ after the migrate-root completes, the output includes ‘RPC: Remote system error’ warnings for the node whose root aggregate was moved. Perform a takeover/giveback of the affected node after the migrate-root completes to correct this.


cluster1::*> system health alert show
This table is currently empty.

Warning: Unable to list entries for schm on node "cluster1-02": RPC: Remote
         system error [from mgwd on node "cluster1-02" (VSID: -1) to schm at
         127.0.0.1].
         Unable to list entries for shm on node "cluster1-02": RPC: Remote
         system error [from mgwd on node "cluster1-02" (VSID: -1) to shm at
         127.0.0.1].
         Unable to list entries for cphm on node "cluster1-02": RPC: Remote
         system error [from mgwd on node "cluster1-02" (VSID: -1) to cphm at
         127.0.0.1].
         Unable to list entries for cshm on node "cluster1-02": RPC: Remote
         system error [from mgwd on node "cluster1-02" (VSID: -1) to cshm at
         127.0.0.1].

cluster1::*> Replaying takeover WAFL log
May 06 14:01:04 [cluster1-01:monitor.globalStatus.critical:EMERGENCY]: This node has taken over cluster1-02.

cluster1::*> system health alert show
This table is currently empty.

cluster1::*>
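
For tip 1, moving the data LIFs ahead of time comes down to one ‘network interface migrate’ command per LIF. The Python sketch below only builds the command strings so they can be reviewed before pasting into the clustershell. The SVM and LIF names are hypothetical, and it assumes cluster1-01 is the node staying up, as in the takeover output above.

```python
def lif_migrate_commands(lifs, destination_node):
    """Build one 'network interface migrate' command per (vserver, lif) pair."""
    return [
        f"network interface migrate -vserver {vserver} -lif {lif} "
        f"-destination-node {destination_node}"
        for vserver, lif in lifs
    ]


# Hypothetical SVM and LIF names; substitute the data LIFs homed on the
# node whose root aggregate is about to be migrated.
data_lifs = [("svm1", "cifs_lif1"), ("svm1", "cifs_lif2")]

for cmd in lif_migrate_commands(data_lifs, "cluster1-01"):
    print(cmd)
```

Once the migrate-root and the takeover/giveback are done, the LIFs can be sent back to their home ports with ‘network interface revert’.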


