Aka: swapping out the motherboard.
Note: Storage Encryption is not covered here.
Note: Stand-alone controllers are not covered here.
Introduction
Firstly, I should point out that I'm not a Field Engineer, but I nearly got to do one of these. Long story short-ish: we had issues updating the service processor firmware - it was taking too long to update - and a particular process we were doing needed a takeover/giveback. After doing the takeover, the node with issues rebooted to "Waiting for SP" and would never boot any further. The moral of the story is "always leave updating of service processor firmware to the last task - and be patient with it". Of course, we'll never know if there was some issue before the reboot; perhaps the node would have panicked after we'd left site, which would have been worse.
So, the motherboard needed replacing.
This post is an overview because - if you're going to do this - you should follow this instead:
http://mysupport.netapp.com/documentation/productsatoz/index.html > FAS 8000 Series > (FAS80?0) All Documents > 'Replacing the Controller Module' PDF/EPUB
Interestingly, the Field Engineer followed a PDF from elsewhere - personally, I would have followed my notes (which are based on the official documentation). For what it's worth, the system was running ONTAP 8.3.2.
Before you begin
- All disk shelves must be working properly.
- If your system is in an HA pair, the healthy node must be able to take over the node that is being replaced (the impaired node).
About this task
In this procedure, the boot device is moved from the impaired node to the replacement node so that the replacement node boots up to the same version of ONTAP as the old controller. It is important that you apply the commands in these steps on the correct systems:
Impaired node = node that is being replaced.
Replacement node = new node that is replacing the impaired node.
Healthy node = surviving node.
High-Level Steps
1. Preparing for controller replacement
2. Replacing the controller module
3. Restoring and verifying the system configuration
4. Running diagnostics tests
5. Completing re-cabling and restoration of operations
6. Completing the replacement process
1. Preparing for controller replacement
Note: I skim over this section since the controller was already down.
1.1: Collect CNA information
1.2: Check the SCSI blade is operational and in quorum
1.3: Shut down the impaired controller
1.4: Confirm the impaired controller has been taken over
1.5: Shut down power through the SP
1.6: (1 controller in chassis) Turn off and disconnect the power supplies
Image: Check the NVRAM LED is not flashing
If the NVRAM LED is not flashing, there is no content in the NVRAM.
2. Replacing the controller module
2.1: Remove the impaired controller module
2.2: Move components to the replacement controller module
Locate and remove the boot device.
Attention: Always lift the boot device straight up out of the housing. Lifting it out at an angle can bend or break the connector pins in the boot device.
Image: Locate and remove the boot device
Locate and remove the NVRAM battery.
Attention: Do not connect the NVRAM keyed battery plug into the socket until after the NVRAM DIMM has been installed.
Image: Locate and remove the NVRAM battery
- Open the CPU air duct and locate the NVRAM battery.
- Locate the battery plug and squeeze the clip on the face of the battery plug to release the plug from the socket, and then unplug the battery cable from the socket.
- Press the blue locking tab or tabs on the edge of the battery, and then slide the battery out of the controller module.
- In the new controller module, align the tab or tabs on the battery holder with the notches, and gently push down on the battery housing until the battery housing clicks into place, seating the battery in the holder.
Note: Ensure that you push down squarely on the battery housing.
Attention: Do not connect the NVRAM battery plug into the socket until after the NVRAM DIMM has been installed.
- Leave the CPU air duct open until you have moved the system DIMMs.
Move the DIMMs to the new controller module.
Image: Moving the DIMMs to the new controller module
Attention: NVRAM battery plug - do not reconnect until instructed to do so!
Image: Moving the DIMMs to the new controller module
Finally, plug the battery back into the controller module.
2.3: Install in chassis
Note: For HA pairs with two controllers in the same chassis, the sequence in which you reinstall the controller module is especially important because it attempts to reboot as soon as you completely seat it in the chassis. The system might update the system firmware when it boots - do not abort this process!
2.3.1: Push the controller module halfway into the system.
Note: Do not completely insert the controller module in the chassis!
2.3.2: Re-cable the console port so you can follow the boot process.
2.3.3: Complete the reinstall of the controller module.
2.4: (1 controller in chassis) Power on
2.5: Boot to Maintenance mode
Important: During the boot process, you might see the following:
- A prompt warning of a system ID mismatch and asking to override the system ID
- A prompt warning when entering Maintenance mode
You can safely respond Y to these prompts.
3. Restoring and verifying the system configuration
3.1: Verify HA state (ha-config show)
If your system is...
- In an HA pair: HA state for all components = ha
- In a 4-node MetroCluster: HA state for all components = mcc
- In a 2-node MetroCluster: HA state for all components = mcc-2n
- Stand-alone: HA state for all components = non-ha
*> ha-config show
*> ha-config modify controller {SEE ABOVE}
*> ha-config modify chassis {SEE ABOVE}
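For illustration, on a standard HA pair the output should look something like the below (approximated from memory, not captured from a live system); if either value is wrong, correct it with ha-config modify:
*> ha-config show
Chassis HA configuration: ha
Controller HA configuration: ha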
3.2: Verify FC Configuration
*> ucadmin show
*> ucadmin modify -mode fc -type initiator|target adapter_name
*> unified-connect modify -mode fc -type initiator|target adapter_name
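For example, to convert a CNA port to an FC target (the adapter name 0e here is purely illustrative - check the ucadmin show output for your actual adapter names; mode changes take effect on the next reboot):
*> ucadmin modify -mode fc -type target 0e
*> ucadmin show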
3.3: Exit Maintenance Mode
*> halt
Note: If you made changes above, boot back into Maintenance mode to verify the changes.
3.4: Verify System Time
On the healthy node:
::> date
On the replacement node:
LOADER> show date
LOADER> set date mm/dd/yyyy
LOADER> set time hh:mm:ss
LOADER> show date
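For example, to set the replacement node's clock to the 31st of October 2016, 2:30 in the afternoon (illustrative values - set it to match the healthy node):
LOADER> set date 10/31/2016
LOADER> set time 14:30:00
LOADER> show date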
3.5: Install SP and other F/W
4. Running diagnostics tests
Note: I skim this section. It is recommended to run diags.
i) Boot diags
ii) Clear status logs
iii) Display and note available devices on the controller module
iv) Display and note available devices on the other modules
v) Modify tests
vi) Run sldiag on all devices
vii) Or on individual devices
viii) Review sldiag status
LOADER> boot_diags
sldiag device clearstatus
sldiag device show -dev mb
sldiag device show -dev dev_name
sldiag device modify -dev dev_name -index test_index_number -selection enable|disable
sldiag device run
sldiag device run -dev dev_name
sldiag device status -long -status failed
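As an illustration, a minimal pass over just the motherboard devices with the default test selections (a sketch, not the full recommended diagnostics run):
LOADER> boot_diags
sldiag device clearstatus
sldiag device run -dev mb
sldiag device status -long -status failed
The final command should return no entries if no tests failed.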
5. Completing re-cabling and restoration of operations
5.1: Re-cable the system
5.2: Reassign the disks
Verify automatic system ID change on cDOT HA-Pair 8.3+:
LOADER> boot_ontap menu
Wait until "Waiting for giveback..." appears on the replacement node.
Then on the healthy node:
::> storage failover show
Note: You should see:
Replacement Node: "Waiting for giveback (HA mailboxes)"
Healthy Node: "System ID changed on partner (Old: XXXXXXXXX, New: YYYYYYYYY), ..."
From the healthy node, verify that any coredumps are saved:
::> set adv
::> system node run -node {HEALTHY_NODE} partner savecore
Wait for the savecore command to complete before issuing giveback. To view the status:
::> system node run -node {HEALTHY_NODE} partner savecore -s
Running ONTAP 8.2.2+:
::> storage failover giveback -ofnode {REPLACEMENT_NODE}
Note i: Type Y to the warning of a system ID mismatch.
Note ii: If the giveback is vetoed, you will need to override the vetoes.
::> storage failover show-giveback
::> storage failover show
And perform standard Cluster Health checks!
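A few typical post-giveback checks (an illustrative, non-exhaustive selection):
::> cluster show
::> storage failover show
::> network interface show -is-home false
::> system health alert show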
5.3: Install licenses on the replacement node
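Licenses in clustered ONTAP are keyed to the node serial number, so the replacement node will need new license keys from NetApp (the license code below is a placeholder):
::> system license add -license-code XXXXXXXXXXXXXXXXXXXXXXXXXXXX
::> system license show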
6. Completing the replacement process
The old controller module will need to be returned to NetApp.