Thursday, 9 October 2014

Testing System Configuration Recovery on a Single Node Cluster

CAVEAT LECTOR! This is a totally unofficial post by a NetApp enthusiast, and these actions were performed on a SIM, not a real live production system!


Our starting point is a brand new, post-‘Cluster Setup Wizard’, Clustered ONTAP 8.2.1 single-node cluster.

We’ve completed a very basic cluster setup so that we have something to get back when we do the cluster restore:

storage disk assign -node NACLU7-01 -all
storage aggregate create aggr1 -diskcount 12
system license add NFSLICENCECODE
system timeout modify 0
vserver setup

In the Vserver setup script, we create a Vserver for NFS, with 3 data volumes, a data LIF for NFS ...

Configuring System Configuration Backup

The open-source FileZilla FTP Server is the destination for uploaded System Configuration Backups. The following commands configure System Configuration Backup and test it:

set -privilege advanced
system configuration backup settings modify -destination -username ftpuser
system configuration backup settings set-password

system configuration backup show
system configuration backup create -node NACLU7-01 -backup-type node -backup-name 20141009NACLU7-01
system configuration backup create -node NACLU7-01 -backup-type cluster -backup-name 20141009NACLU7
system configuration backup upload -node NACLU7-01 -backup 20141009NACLU7-01.7z -destination
system configuration backup upload -node NACLU7-01 -backup 20141009NACLU7.7z -destination
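The backup names above follow a simple pattern: a YYYYMMDD date stamp followed by the node or cluster name, with .7z appended when the package is referenced for upload. Here is a minimal Python sketch of that naming convention. The helper names backup_name and package_file are hypothetical and are not part of ONTAP.

```python
from datetime import date


def backup_name(day: date, target: str) -> str:
    """Build a backup name: YYYYMMDD date stamp followed by the node or cluster name."""
    return f"{day:%Y%m%d}{target}"


def package_file(name: str) -> str:
    """Return the package file name, which carries a .7z suffix in the upload commands."""
    return name + ".7z"


if __name__ == "__main__":
    day = date(2014, 10, 9)
    # One node backup and one cluster backup, as in the commands above.
    for target in ("NACLU7-01", "NACLU7"):
        print(package_file(backup_name(day, target)))
```

Deriving both names from one date and one name keeps the create and upload commands in step with each other.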

Breaking the Single-Node Cluster

We reboot the node, go into Maintenance mode, and destroy the root aggregate aggr0!

::> reboot

Ctrl-C for Boot Menu
Selection (5) Maintenance mode boot

*> aggr status
*> aggr offline aggr0
*> aggr destroy aggr0
*> halt

When the node reboots, it hits the error below, reboots again, and continues in this loop until we fix it:

[raid.assim.tree.noRootVol:error]: No usable root volume was found!
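If the console output is being captured to a text file, the reboot loop can be spotted by counting how often the raid.assim.tree.noRootVol event appears. The Python sketch below assumes the capture is available as a list of lines. The function names and the threshold of two occurrences are illustrative assumptions, not anything ONTAP provides.

```python
ROOT_VOL_EVENT = "raid.assim.tree.noRootVol"


def count_missing_root_events(console_lines):
    """Count captured console lines that report the missing-root-volume event."""
    return sum(1 for line in console_lines if ROOT_VOL_EVENT in line)


def looks_like_boot_loop(console_lines, threshold=2):
    """Assumption: the event showing up `threshold` or more times means a reboot loop."""
    return count_missing_root_events(console_lines) >= threshold


if __name__ == "__main__":
    captured = [
        "raid.assim.tree.noRootVol:error]: No usable root volume was found!",
        "raid.assim.tree.noRootVol:error]: No usable root volume was found!",
    ]
    print(looks_like_boot_loop(captured))
```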

Fixing the Single-Node Cluster

If we were moving the root aggregate, we would have pre-created a new one. In this instance, we have to run option (4) from the Boot Menu to create a new root aggregate. Because option (4) initializes every disk the node owns, for this to work without wiping the data aggregate, which is still intact, we must first remove ownership of all disks and then assign just the 3 disks we want to use for the new root aggregate.

An Aside...

Note: If you do not know which disks are spare, you can temporarily set the data aggregate as root, boot from it, and run:

::> storage disk show -container spare

This requires booting into Maintenance mode (option 5 from the boot menu):

Ctrl-C for Boot Menu
Selection (5) Maintenance mode boot

We set the existing data aggregate, aggr1, to ha_policy cfo and mark it as root. After the reboot, the node boots from a newly created skeleton root volume (AUTOROOT).

*> aggr options aggr1 ha_policy cfo
*> aggr options aggr1 root
*> halt

Fixing the Single-Node Cluster (Continued)

Ctrl-C for Boot Menu
Selection (5) Maintenance mode boot

*> aggr offline aggr1
*> disk remove_ownership all
*> disk show # should show no disks
*> aggr status # should show no aggregates
*> disk assign v5.28
*> disk assign v5.29
*> disk assign v5.32
*> halt
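The order of the Maintenance mode steps above matters: the data aggregate goes offline and all ownership is removed before the root disks are assigned. Here is a small Python sketch that builds the same command sequence from a list of aggregates and disks. The function name root_rebuild_commands is hypothetical, and the sketch only prints commands for review; it does not talk to a node.

```python
def root_rebuild_commands(data_aggregates, root_disks):
    """Return the Maintenance mode commands, in order, for the steps shown above."""
    if not root_disks:
        raise ValueError("at least one disk is needed for the new root aggregate")
    commands = [f"aggr offline {aggr}" for aggr in data_aggregates]
    commands.append("disk remove_ownership all")
    commands.append("disk show")    # should show no disks
    commands.append("aggr status")  # should show no aggregates
    commands.extend(f"disk assign {disk}" for disk in root_disks)
    commands.append("halt")
    return commands


if __name__ == "__main__":
    for command in root_rebuild_commands(["aggr1"], ["v5.28", "v5.29", "v5.32"]):
        print(f"*> {command}")
```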

After the node reboots:

Ctrl-C for Boot Menu
Selection (4) Clean configuration and initialize all disks

The node will boot to a login prompt, and when you log in with the original credentials you will get this System Message:

A new root volume was detected. This node is not fully operational.

Fixing the Single-Node Cluster: Recovering the Data Aggregate(s)

Reboot again and go back into Maintenance mode to reassign the remaining disks and bring the data aggregate back online. (This cannot be done from the Clustershell in the cluster's current state.)

::> reboot

Ctrl-C for Boot Menu
Selection (5) Maintenance mode boot

*> disk assign all
*> aggr status
*> aggr online aggr1
*> aggr options aggr0 root
*> halt

Fixing the Single-Node Cluster: Recovery

After the system reboots, and we’ve logged back into the ‘temporary’ Clustershell, run these commands (Note: The node mgmt1 LIF is remembered):

storage aggregate show
network interface show

set diag
system configuration backup download -node local -source
system configuration backup download -node local -source
system configuration recovery cluster recreate -from backup -backup 20141009NACLU7.7z

Note: We recover from the cluster backup since this contains the node information.

Warning: This command will destroy your existing cluster. It will rebuild a new single-node cluster consisting of this node by using the contents of the specified backup package. This command should only be used to recover from a disaster. Do not perform any other recovery operations while this operation is in progress. This command will cause all the cluster applications on this node to restart, causing an interruption in CLI and Web interface.

Then halt the node:

::> halt

On reboot, interrupt the boot to access the loader prompt, and run these commands:

VLOADER> unsetenv bootarg.init.boot_recovery
VLOADER> boot_ontap

It’s Fixed!

When the Single-Node Cluster boots up this time, log in and run:

cluster show
set advanced
cluster ring show

volume show

Everything should be back to full health!
