I needed to test out ‘system node migrate-root’ in preparation for potentially using it prior to performing a headswap. No physical hardware here, just an ONTAP 9.3 HA Simulator lab. These are some observations with a couple of tips.
Further Reading:
How to non-disruptively migrate a node's root aggregate onto new disks in ONTAP 9
https://kb.netapp.com/app/answers/answer_view/a_id/1086971
Real World Experience!
If you log in to the above KB with your NetApp account, there's a mention:
"Normally, the root-migration will be successful and transparent, however, there are bugs that can cause the script to fail. The node will boot up with a new root aggregate, however, additional clean up steps must be taken. Technical Support should be engaged to document the failure and assist in recovery."
It's very possible you will need to log a support call, and support will follow certain recovery steps in the KB. So give yourself time. Don't plan to cram root vol moves followed by an ARL headswap into the same day - it might be a long day.
Documentation
System Node Migrate-Root is documented in the ONTAP 9 command reference. The syntax is very simple, for example:
cluster1::*> system node migrate-root -node cluster1-01 -disklist VMw-1.17,VMw-1.18,VMw-1.19 -raid-type raid_dp

Warning: This operation will create a new root aggregate and replace the existing root on the node "cluster1-01". The existing root aggregate will be discarded.
Do you want to continue? {y|n}: y

Info: Started migrate-root job. Run "job show -id 86 -instance" command to check the progress of the job.
The process (see below) starts straight away.
The official documentation mentions:
The command starts a job that backs up the node configuration, creates a new aggregate, sets it as the new root aggregate, restores the node configuration and restores the names of the original aggregate and volume. The job might take as long as a few hours depending on the time it takes for zeroing the disks, rebooting the node and restoring the node configuration.
The Process and Timings
The SIM has tiny disks of 10.66GB and 28.44GB usable size. The 10.66GB disks were originally used for 3-disk root aggregates, and the migrate-root moved the root aggregate to the slightly larger virtual disks. On a physical system with much bigger than 28.44GB disks, I would expect the timings to be considerably longer than the below. The below timings were taken by acquiring the 'Execution Progress' string - from the job show output - every second.
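(The 'Execution Progress' string comes from the "job show -id 86 -instance" output mentioned in the Info message above; I simply polled it once a second. I believe 'job watch-progress' can also refresh it for you, but treat the parameters below as an assumption and double-check them on your ONTAP version.)

cluster1::*> job show -id 86 -instance
cluster1::*> job watch-progress -id 86 -interval 1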
0-27 seconds: Starting node configuration backup. This might take several minutes.
28-146 seconds: Starting aggregate relocation on the node "cluster1-02"
147-212 seconds: Rebooting the node to create a new root aggregate.
213-564 seconds: Waiting for the node to create a new root and come online. This might take a few minutes.
565-682 seconds: Making the old root aggregate online.
683-686 seconds: Copying contents from old root volume to new root volume.
687-864 seconds: Starting removal of old aggregate and volume and renaming the new root.
865-1653 seconds: Starting node configuration restore.
1654-1772 seconds: Enabling HA and relocating the aggregates. This might take a few minutes.
1773 seconds: Complete: Root aggregate migration successfully completed [0]
Nearly 30 minutes for migrate-root on one node of a tiny ONTAP 9.3RC1 HA SIM! And you still need to do a takeover/giveback of the node whose root aggregate was moved (see below).
Tips
1) The process disables HA and Storage Failover, and Aggregate Relocation is used to move the data aggregates to the node that's staying up. The process does not move data LIFs - these will fail over automatically, but I noticed a bit of a delay (my test CIFS share was down for 45 seconds), so I'd recommend moving data LIFs onto the node that's going to stay up first, something like the below.
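(A rough sketch only - the SVM, LIF and port names here are made up for illustration, so check what's currently homed where and substitute your own.)

cluster1::*> network interface show -role data -curr-node cluster1-02
cluster1::*> network interface migrate -vserver svm1 -lif cifs_lif1 -destination-node cluster1-01 -destination-port e0c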
2) I noticed - consistently - that if you run ‘system health alert show’ after the migrate-root completes, you get some weird output. Perform a takeover/giveback of the affected node after the migrate-root completes to correct this (example failover commands are after the output below).
cluster1::*> system health alert show
This table is currently empty.

Warning: Unable to list entries for schm on node "cluster1-02": RPC: Remote system error [from mgwd on node "cluster1-02" (VSID: -1) to schm at 127.0.0.1].
Unable to list entries for shm on node "cluster1-02": RPC: Remote system error [from mgwd on node "cluster1-02" (VSID: -1) to shm at 127.0.0.1].
Unable to list entries for cphm on node "cluster1-02": RPC: Remote system error [from mgwd on node "cluster1-02" (VSID: -1) to cphm at 127.0.0.1].
Unable to list entries for cshm on node "cluster1-02": RPC: Remote system error [from mgwd on node "cluster1-02" (VSID: -1) to cshm at 127.0.0.1].

cluster1::*> Replaying takeover WAFL log
May 06 14:01:04 [cluster1-01:monitor.globalStatus.critical:EMERGENCY]: This node has taken over cluster1-02.

cluster1::*> system health alert show
This table is currently empty.

cluster1::*>
Image: Remote system error [from mgwd on node...
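For reference, the takeover/giveback I'm referring to is just a standard storage failover of the affected node (cluster1-02 in my lab - adjust to suit), plus reverting any data LIFs you moved beforehand; the LIF names below are the same made-up ones from tip 1:

cluster1::*> storage failover takeover -ofnode cluster1-02
cluster1::*> storage failover show
cluster1::*> storage failover giveback -ofnode cluster1-02
cluster1::*> network interface revert -vserver svm1 -lif cifs_lif1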
Bug Summary
Fixed: ONTAP 9.3+
Workaround: Perform the following steps to resolve this issue:
1. Run the 'aggr status' command on nodeshell to check if the new aggregate 'new_root' is created (see the example after this bug list).
2. If the aggregate is created and online, perform either of the following steps to resume the process:
- For ONTAP 9.2 or later, run the 'system node migrate-root -resume' command.
- For earlier ONTAP versions, contact technical support for assistance.
Fixed: ONTAP 9.0P1+
Workaround: If you encounter this issue, contact technical support to manually complete the root replacement procedure.
Fixed: ONTAP 9.2+
Workaround: Perform the following steps to resolve this issue:
- If the command fails with the error message "Internal error: Failed to verify the new root aggregate status" on ONTAP 9.2, resume the migrate-root operation using the "system node migrate-root -resume" command. If you see the same issue in ONTAP 9.1, call NetApp Support.
- For other errors, contact NetApp Support for assistance.
Fixed: ONTAP 9.6+
Workaround: Perform the root replacement using the manual root migration procedure.
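In case it's useful, here is roughly what the nodeshell check and the resume command from the workarounds above look like - the node name is just from my lab, and it's worth confirming the exact resume parameters on your ONTAP version before relying on this:

cluster1::*> system node run -node cluster1-01 -command "aggr status"
cluster1::*> system node migrate-root -node cluster1-01 -resume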
The Manual Method