Saturday, 11 February 2017

Overview: Replacing the Controller Module for a FAS 8040/8060/8080 running Clustered Data ONTAP

Aka: swapping out the Motherboard.
Note: Storage Encryption is not covered here.
Note: Stand-alone controllers are not covered here.

Introduction

Firstly, I need to point out that I’m not a Field Engineer, but I nearly got to do one of these. Long story short-ish: we had issues updating the service processor firmware (it was taking too long to update), and a particular process we were doing needed a takeover/giveback. After the takeover, the node with issues rebooted to “Waiting for SP” and would never boot any further. The moral of the story is “always leave updating of service processor firmware to the last task - and be patient with it”. Of course, we’ll never know if there was some issue before the reboot; perhaps the node would have panicked after we’d left site, which would have been worse. So, the motherboard needed replacing.

This post is only an overview; if you’re going to do this, you should follow the official procedure instead:
http://mysupport.netapp.com/documentation/productsatoz/index.html > FAS 8000 Series > (FAS80?0) All Documents > ‘Replacing the Controller Module’ PDF/EPUB

Interestingly, the Field Engineer followed a PDF from elsewhere - personally, I would have followed my notes (which are based on the official documentation). For what it’s worth, the system was running ONTAP 8.3.2.

Before you begin

- All disk shelves must be working properly.
- If your system is in an HA pair, the healthy node must be able to take over the node that is being replaced (the impaired node).

About this task

In this procedure, the boot device is moved from the impaired node to the replacement node so that the replacement node boots up to the same version of ONTAP as the old controller. It is important that you apply the commands in these steps on the correct systems:

Impaired node = node that is being replaced.
Replacement node = new node that is replacing the impaired node.
Healthy node = surviving node.

High-Level Steps

1. Preparing for controller replacement
2. Replacing the controller module
3. Restoring and verifying the system configuration
4. Running diagnostics tests
5. Completing re-cabling and restoration of operations
6. Completing the replacement process

1. Preparing for controller replacement

Note: I skim over this section since the controller was down already.
1.1: Collect CNA information
1.2: Check SCSI blade is operational and in quorum
1.3: Shut down the impaired controller
1.4: Confirm the impaired controller has been taken over
1.5: Shut down power through SP
1.6: (1 controller in chassis) Turn off and disconnect the power supplies

Image: Check the NVRAM LED is not flashing

If the NVRAM LED is not flashing, there is no content in the NVRAM.

2. Replacing the controller module

2.1: Remove the impaired controller module
2.2: Move components to replacement controller module

Locate and remove the boot device.

Attention: Always lift the boot device straight up out of the housing. Lifting it out at an angle can bend or break the connector pins in the boot device.

Image: Locate and remove the boot device

Locate and remove the NVRAM battery.

Attention: Do not connect the NVRAM keyed battery plug into the socket until after the NVRAM DIMM has been installed.

Image: Locate and remove the NVRAM battery.

- Open the CPU cover.
- Open the CPU air duct and locate the NVRAM battery.
- Locate the battery plug and squeeze the clip on the face of the battery plug to release the plug from the socket, and then unplug the battery cable from the socket.
- Press the blue locking tab or tabs on the edge of the battery, and then slide the battery out of the controller module.
- In the new controller module, seat the battery in the holder: align the tab or tabs on the battery holder with the notches in the new controller module, and gently push down on the battery housing until the battery housing clicks into place.

Note: Ensure that you push down squarely on the battery housing.
Attention: Do not connect the NVRAM battery plug into the socket until after the NVRAM DIMM has been installed.

- Leave the CPU air duct open until you have moved the system DIMM.

Moving the DIMMs to the new controller module.

Image: Moving the DIMMs to the new controller module.

Attention: Do not reconnect the NVRAM battery plug until instructed to do so!

Image: Moving the DIMMs to the new controller module.

Finally, plug the battery back into the controller module.

2.3: Install in chassis

Note: For HA pairs with two controllers in the same chassis, the sequence in which you reinstall the controller module is especially important because it attempts to reboot as soon as you completely seat it in the chassis. The system might update the system firmware when it boots - do not abort this process!

2.3.1: Push the controller module halfway into the system.
Note: Do not completely insert the controller module in the chassis!

2.3.2: Re-cable the console port so you can follow the boot process.
2.3.3: Complete the reinstall of the controller module.

2.4: (1 controller in chassis) Power on
2.5: Boot to Maintenance mode

Important: During the boot process, you might see the following:
- A prompt warning of a system ID mismatch and asking to override the system ID
- A prompt warning when entering Maintenance mode
You can safely respond Y to these prompts.

3. Restoring and verifying the system configuration

3.1: Verify HA state (ha-config show)
If your system is...
- In an HA pair, HA state for all components = ha
- In a 4-node MetroCluster, HA state for all components = mcc
- In a 2-node MetroCluster, HA state for all components = mcc-2n
- Stand-alone, HA state for all components = non-ha

*> ha-config show
*> ha-config modify controller {SEE ABOVE}
*> ha-config modify chassis {SEE ABOVE}
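
As a rough aid for step 3.1, the mapping above can be written down as a small lookup that prints the Maintenance mode commands to type. This is only a sketch: the dictionary keys (such as "ha-pair") are labels I made up for illustration, and the command strings are simply the ones listed in this post with the state filled in.

```python
# Expected HA state per system type, taken from the list in step 3.1.
# The keys are hypothetical labels, not ONTAP terms.
EXPECTED_HA_STATE = {
    "ha-pair": "ha",
    "metrocluster-4-node": "mcc",
    "metrocluster-2-node": "mcc-2n",
    "stand-alone": "non-ha",
}


def ha_config_commands(system_type):
    """Return the Maintenance mode commands that set, then re-check, the HA state."""
    state = EXPECTED_HA_STATE[system_type]
    return [
        f"ha-config modify controller {state}",
        f"ha-config modify chassis {state}",
        "ha-config show",
    ]


if __name__ == "__main__":
    for line in ha_config_commands("ha-pair"):
        print("*>", line)
```

The script only builds text; the commands still have to be entered by hand at the Maintenance mode prompt.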

3.2: Verify FC Configuration
*> ucadmin show
*> ucadmin modify -mode fc -type initiator|target adapter_name
*> unified-connect modify -mode fc -type initiator|target adapter_name

3.3: Exit Maintenance Mode
*> halt
Note: If you made changes above, boot back into maintenance mode to verify the changes.

3.4: Verify System Time
On the healthy node:
::> date

On the replacement node:
LOADER>
show date
set date mm/dd/yyyy
set time hh:mm:ss
show date
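
To avoid fat-fingering the mm/dd/yyyy and hh:mm:ss formats in step 3.4, a few lines of Python can format a timestamp into the LOADER commands shown above. A minimal sketch, assuming the LOADER clock is set in GMT/UTC (check the official procedure for your platform); the function name is hypothetical.

```python
from datetime import datetime, timezone


def loader_clock_commands(now_utc):
    """Format a datetime as the LOADER 'set date' / 'set time' command sequence."""
    return [
        "show date",
        now_utc.strftime("set date %m/%d/%Y"),  # mm/dd/yyyy
        now_utc.strftime("set time %H:%M:%S"),  # hh:mm:ss, 24-hour clock
        "show date",
    ]


if __name__ == "__main__":
    # Assumption: the time read from the healthy node has been converted to UTC.
    for line in loader_clock_commands(datetime.now(timezone.utc)):
        print(line)
```

In practice the time should come from the `date` output on the healthy node rather than from the laptop clock used here.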

3.5: Install SP and other F/W

4. Running diagnostics tests

Note: I skim this section. It is recommended to run diags.
i) Boot diags
ii) Clear status logs
iii) Display and note available devices on the controller module
iv) Display and note available devices on the other modules
v) Modify tests
vi) Run sldiag on all devices
vii) Or on individual devices
viii) Review sldiag status

LOADER>
boot_diags
sldiag device clearstatus
sldiag device show -dev mb
sldiag device show -dev dev_name
sldiag device modify -dev dev_name -index test_index_number -selection enable|disable
sldiag device run
sldiag device run -dev dev_name
sldiag device status -long -status failed
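
The sequence above (clear, show, run, review) can be laid out as an ordered checklist. The sketch below just assembles the command strings copied from this post; it does not talk to a controller, the function name is made up, and I have not verified the strings against a live system.

```python
def sldiag_plan(devices=None):
    """Build the ordered sldiag command list from section 4.

    devices=None runs the tests on all devices; otherwise one run per named device.
    """
    commands = ["sldiag device clearstatus", "sldiag device show -dev mb"]
    if devices:
        commands += [f"sldiag device run -dev {dev}" for dev in devices]
    else:
        commands.append("sldiag device run")
    # Review step, copied as written in this post.
    commands.append("sldiag device status -long -status failed")
    return commands


if __name__ == "__main__":
    for line in sldiag_plan(["mb"]):
        print(line)
```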

5. Completing re-cabling and restoration of operations

5.1: Re-cable the system
5.2: Reassign the disks
Verify automatic system ID change on cDOT HA-Pair 8.3+
LOADER> boot_ontap menu

Wait until “Waiting for giveback...” appears on the replacement node.
Then on the healthy node:
::> storage failover show

Note: You should see:
Healthy Node: “System ID changed on partner (Old: XXXXXXXXX, New: YYYYYYYYY), ...”
Replacement Node: “Waiting for giveback (HA mailboxes)”
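
If you want to record the old and new system IDs for your notes, the “System ID changed on partner” message can be picked apart with a regular expression. A minimal sketch, assuming the message format is exactly as quoted above; the function name and the sample IDs are made up.

```python
import re

# Matches e.g. "System ID changed on partner (Old: 1111111111, New: 2222222222)"
SYSTEM_ID_CHANGE = re.compile(
    r"System ID changed on partner \(Old:\s*(\d+),\s*New:\s*(\d+)\)"
)


def parse_system_id_change(state_description):
    """Return (old_id, new_id) as strings, or None if the message is absent."""
    # Collapse whitespace so a message wrapped across lines still matches.
    text = " ".join(state_description.split())
    match = SYSTEM_ID_CHANGE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    sample = "System ID changed on partner (Old:\n  1111111111, New: 2222222222), In takeover"
    print(parse_system_id_change(sample))
```

A result of None simply means the message was not found in the text you pasted in, for example because the automatic system ID change has not happened yet.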

From healthy node, verify that any coredumps are saved
::> set adv
::> system node run -node {HEALTHY_NODE} partner savecore

Wait for savecore command to complete before issuing giveback. To view the status:
::> system node run -node {HEALTHY_NODE} partner savecore -s

Running ONTAP 8.2.2+
::> storage failover giveback -ofnode {REPLACEMENT_NODE}
Note: Type Y at the warning of a system ID mismatch.

::> storage failover giveback -ofnode {REPLACEMENT_NODE}
Note i: Type Y at the warning of a system ID mismatch.
Note ii: If giveback is vetoed, you will need to override the vetoes.

::> storage failover show-giveback
::> storage failover show

And perform standard Cluster Health checks!

5.3: Install licenses on the replacement node

6. Completing the replacement process
The old controller module will need to be returned to NetApp.
