Saturday, 7 September 2013

Notes on Planning, Implementing, and Using MetroCluster

Following on from the previous two posts on SyncMirror (Part 1 & Part 2), this post is my home for a few notes on MetroCluster: planning, implementing, and using. MetroCluster is about as complicated as it gets with NetApp but, once you get your head around it, it is a surprisingly simple and very elegant solution!

First things first

You’ll want to read:

Best Practices for MetroCluster Design and Implementation (78 pages)
By Jonathan Bell, April 2013, TR-3548
NOTE: I’ve found that the US links tend to be the most up-to-date ones!

Brief Synopsis

Two types of MetroCluster - Stretch (SMC) and Fabric-Attached (FMC):
1) SMC spans a maximum distance of 500m*
2) FMC spans a maximum distance of 200km*
*Correct at the time of writing.

Image: Types of NetApp MetroCluster
MetroCluster uses SyncMirror to replicate data.
FMC requires 4 fabric switches (NetApp-provided Brocade/Cisco only), 2 in each site, and additionally 4 ATTO bridges, 2 in each site, if using SAS shelves.
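
As a rough planning aid, the limits and component counts above can be written down as a tiny Python sketch. The function names are my own (hypothetical), and the numbers are only the ones quoted in this post, so check TR-3548 for your ONTAP version before relying on them. It returns the simplest type that can span a given distance; you can of course still choose FMC for a short distance.

```python
# Planning sketch only: limits and counts are those quoted in this post
# (correct at the time of writing), not an authoritative sizing tool.
SMC_MAX_METRES = 500        # Stretch MetroCluster
FMC_MAX_METRES = 200_000    # Fabric-Attached MetroCluster (200km)


def simplest_metrocluster_type(distance_m):
    """Return 'SMC', 'FMC', or None if the distance exceeds both limits."""
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    if distance_m <= SMC_MAX_METRES:
        return "SMC"
    if distance_m <= FMC_MAX_METRES:
        return "FMC"
    return None


def fmc_hardware_per_site(sas_shelves):
    """Per-site extras for FMC as listed above: 2 switches, plus 2 bridges for SAS."""
    per_site = {"fabric_switches": 2}
    if sas_shelves:
        per_site["atto_bridges"] = 2
    return per_site


if __name__ == "__main__":
    print(simplest_metrocluster_type(300))
    print(simplest_metrocluster_type(30_000))
    print(fmc_hardware_per_site(sas_shelves=True))
```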

1. Planning

Essential Reading/Research:

1.1) MetroCluster Planning Worksheet
NOTE: This can be found on pages 70-71 of the document below.

1.2) High-Availability and MetroCluster Configuration Guide (232 pages)

NOTE: The above link is for Data ONTAP 8.1. For Data ONTAP documentation specific to your 8.X version, go to > Documentation > Product Documentation > Data ONTAP 8

1.3) MetroCluster Product Documentation

This link currently links to:

Configuring a MetroCluster system with SAS disk shelves and FibreBridge 6500N bridges (28 pages)

Fabric-attached MetroCluster Systems Cisco Switch Configuration Guide (35 pages)

Fabric-attached MetroCluster™ Systems: Brocade® Switch Configuration Guide (32 pages)

Instructions for installing Cisco 9148 switch into NetApp cabinet (5 pages)

Specifications for the X1922A Dual-Port, 2-Gb, MetroCluster Adapter (2 pages)

1.4) Interoperability Matrix Tool (new location for the MetroCluster Compatibility Matrix)

The old PDF which goes up to ONTAP versions 7.3.6/8.0.4/8.1.2 is here:

1.5) Product Documentation for FAS/V-Series Controller Model > Documentation > Product Documentation

1.6) Product Documentation for Shelves > Documentation > Product Documentation

2. Implementing

NOTE: This post is only intended as rough notes and is missing a lot of detail. For more detail, please refer to the above documents.

2.1) Rack and Stack
2.2) Cabling, then shelf power-up and setting shelf IDs
2.3) Configuring ATTO SAS bridges {FMC}
2.4) Configuring fabric switches {FMC}
2.5) Controller power-up and setup
2.6) Licensing
2.7) Assigning Disks
2.8) Configuring SyncMirror*
*Check out my previous SyncMirror posts (Part 1 & Part 2).

2.6) Licensing

license add XXXXXXX # cf
license add XXXXXXX # cf_remote
license add XXXXXXX # syncmirror_local

NOTE: syncmirror_local requires a reboot to enable, but you can just use cf takeover/giveback!

2.7) Assigning Disks

disk show -v
disk assign -p 1 -s 1234567890 0b.23 {SMC}
disk assign -p 0 sw1:3* {FMC}
disk assign -p 1 sw3:5* {FMC}
storage show disk -p

NOTE: “When assigning disk ownership, always assign all disks on the same loop or stack to the same storage system and pool as well as the same adaptor.” Use Pool 0 for the local disks and Pool 1 for the remote disks.
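
If you have a lot of switch ports to work through, the Pool 0 = local / Pool 1 = remote rule is easy to script. Below is a minimal Python sketch that only builds the command strings in the same form as the FMC examples above. It does not talk to a controller, the function name is hypothetical, and the switch:port values are just the ones from this post.

```python
def fmc_disk_assign_commands(local_ports, remote_ports):
    """Build 'disk assign' lines: Pool 0 for local disks, Pool 1 for remote."""
    commands = []
    for pool, ports in ((0, local_ports), (1, remote_ports)):
        for port in ports:
            # Same switch:port wildcard form as the FMC examples in this post.
            commands.append(f"disk assign -p {pool} {port}*")
    return commands


if __name__ == "__main__":
    # Paste the printed lines into the console after reviewing them.
    for line in fmc_disk_assign_commands(["sw1:3"], ["sw3:5"]):
        print(line)
```

Generating the lines first and reviewing them before pasting makes it harder to put a whole stack in the wrong pool by accident.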

3. Using

3.1) Recovering from a Site Failure

3.1.1) Restrict access to the failed site (FENCING - IMPORTANT)
3.1.2) Force the surviving node into takeover mode

SiteB> cf forcetakeover -d

3.1.3) Remount volumes from the failed node (if needed)
3.1.4) Recover LUNs of the failed node (if needed)

NOTE: If you have an iSCSI-attached host and the option cf.takeover.change_fsid is on (the default), you will need to recover the LUNs from the failed node.

3.1.5) Fix failures caused by the disaster
3.1.6) Reestablish the MetroCluster configuration (including giveback)

NOTE: Here, SiteB controller has “taken over” a failed controller SiteA (which is in the “disaster site”)

To validate that you can access the storage in Site A:

SiteB(takeover)> aggr status -r

To switch to the console of the recovered Site A controller:

SiteB(takeover)> partner

Once you have determined that the remote site is accessible, turn the Site A controller on. To determine the status of the aggregates for both sites:

SiteB/SiteA> aggr status

If the aggregates in the disaster site are showing online, you need to change their state to offline:

SiteB/SiteA> aggr offline aggr_SiteA_01

To re-create the mirrored aggregate (here we choose the “disaster site” aggregate as the victim):

SiteB/SiteA> aggr mirror aggr_SiteA_rec_SiteB_01 -v aggr_SiteA_01

To check resyncing progress:

SiteB/SiteA> aggr status

NOTE: When aggregates are ready they transition to mirrored.

After all aggregates have been re-joined, return to the SiteB node and do the giveback:

SiteB/SiteA> partner
SiteB(takeover)> cf giveback

One final thing you might want to do is rename the aggregates back to how they were before the disaster:

SiteA> aggr rename aggr_SiteA_rec_SiteB_01 aggr_SiteA_01
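
With more than one aggregate to re-join, it helps to lay the whole sequence out before touching anything. The Python sketch below just produces the ordered command list from the steps above for each (surviving, disaster) aggregate pair. The function name is hypothetical, nothing is executed, and it assumes (as in my example) that you want each aggregate renamed back to the disaster-site aggregate's original name.

```python
def recovery_plan(pairs):
    """Ordered command list for re-joining mirrors after a site failure.

    pairs: list of (surviving_aggr, disaster_aggr) names.
    """
    plan = []
    for surviving, disaster in pairs:
        # Only needed if the disaster-site aggregate is showing online.
        plan.append(f"aggr offline {disaster}")
        # The disaster-site aggregate is the victim of the re-mirror.
        plan.append(f"aggr mirror {surviving} -v {disaster}")
    # Repeat this one until every aggregate has gone from resyncing to mirrored.
    plan.append("aggr status")
    plan.append("partner")
    plan.append("cf giveback")
    for surviving, disaster in pairs:
        # Run after giveback, on the recovered controller.
        plan.append(f"aggr rename {surviving} {disaster}")
    return plan


if __name__ == "__main__":
    for step in recovery_plan([("aggr_SiteA_rec_SiteB_01", "aggr_SiteA_01")]):
        print(step)
```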

3.2) Maintenance

If you’re just after simple maintenance (not site fail-over or anything like that):

cf disable

Do your work (e.g. re-cabling, powering things down or up), then - when finished:

aggr status

Wait* for the aggregates to transition from resyncing to mirrored:
*Here you need to watch the rate of change on either side. Too many changes and it might take forever for the resync to complete, hence the need for a maintenance window!

cf enable

You see, I told you it was simple :)

NOTE: You might want to do an options autosupport.doit "MC Maintenance" to let NetApp support know.
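
If you would rather not sit there re-typing aggr status, the "wait for mirrored" step can be sketched as a small polling loop in Python. This is only a sketch under assumptions: fetch_aggr_status is a hypothetical callable that returns the text of aggr status (how you get it, e.g. over SSH, is up to you), and the check is a coarse keyword match on "resyncing", "degraded" and "mirrored" rather than a proper parse of the output.

```python
import time


def wait_until_mirrored(fetch_aggr_status, poll_seconds=60, max_polls=120):
    """Poll until the status text shows mirrored with no resync in progress.

    fetch_aggr_status: hypothetical callable returning 'aggr status' text.
    Returns True once the check passes, False if max_polls is exhausted.
    """
    for attempt in range(max_polls):
        text = fetch_aggr_status().lower()
        # Coarse keyword check; it does not look at aggregates individually.
        done = (
            "mirrored" in text
            and "resyncing" not in text
            and "degraded" not in text
        )
        if done:
            return True
        if attempt < max_polls - 1:
            time.sleep(poll_seconds)
    return False


if __name__ == "__main__":
    # Canned outputs stand in for a real controller here.
    canned = iter(["aggr0 online resyncing", "aggr0 online mirrored"])
    print(wait_until_mirrored(lambda: next(canned), poll_seconds=0))
```

Only run cf enable once this returns True and you have eyeballed aggr status yourself.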
