Following on from the previous two posts on SyncMirror (Part 1 & Part 2), this post is my home for a few notes on MetroCluster - planning, implementing, and using. MetroCluster is about as complicated as it gets with NetApp, but, once you get your head around it, it is a surprisingly simple and elegant solution!
First things first
You’ll want to read:
Best Practices for MetroCluster Design and Implementation (78 pages), by Jonathan Bell, April 2013, TR-3548
NOTE: I’ve found that the US links tend to be the most up-to-date ones!
Brief Synopsis
Two types of MetroCluster - Stretch (SMC) and Fabric-Attached (FMC):
1) SMC spans a maximum distance of 500m*
2) FMC spans a maximum distance of 200km*
*Correct at the time of writing.
Image: Types of NetApp MetroCluster
MetroCluster uses SyncMirror to replicate data.
FMC requires 4 fabric switches (NetApp-provided Brocade/Cisco only) - 2 in each site (and additionally 4 ATTO bridges - 2 in each site - if using SAS shelves).
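Since the replication engine is plain SyncMirror, a mirrored aggregate on a MetroCluster is created like any other SyncMirror aggregate. A minimal sketch, assuming disks are already assigned to pool 0 (local) and pool 1 (remote); the aggregate name and disk count here are hypothetical:
aggr create aggr_mirrored -m 24
With -m, Data ONTAP builds two plexes, taking 12 disks from pool 0 for one and 12 disks from pool 1 for the other.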
1. Planning
Essential Reading/Research:
1.1) MetroCluster Planning Worksheet
NOTE: This can be found on pages 70-71 of the document below.
1.2) High-Availability and MetroCluster Configuration Guide (232 pages)
NOTE: The above link is for Data ONTAP 8.1. For Data ONTAP documentation specific to your 8.X version, go to http://support.netapp.com > Documentation > Product Documentation > Data ONTAP 8
1.3) MetroCluster Product Documentation
This currently links to:
Configuring a MetroCluster system with SAS disk shelves and FibreBridge 6500N bridges (28 pages)
Fabric-attached MetroCluster Systems: Cisco Switch Configuration Guide (35 pages)
Fabric-attached MetroCluster Systems: Brocade Switch Configuration Guide (32 pages)
Instructions for installing Cisco 9148 switch into NetApp cabinet (5 pages)
Specifications for the X1922A Dual-Port, 2-Gb, MetroCluster Adapter (2 pages)
1.4) Interoperability Matrix Tool (the new location for the MetroCluster Compatibility Matrix)
The old PDF, which goes up to ONTAP versions 7.3.6/8.0.4/8.1.2, is here.
1.5) Product Documentation for FAS/V-Series Controller Model
http://support.netapp.com > Documentation > Product Documentation
1.6) Product Documentation for Shelves
http://support.netapp.com > Documentation > Product Documentation
2. Implementing
NOTE: This post is only intended as rough notes and is missing a lot of detail; for more detail, please refer to the above documents.
2.1) Rack and Stack
2.2) Cabling
2.3) Shelf power-up and setting shelf IDs
2.4) Configuring ATTO SAS bridges {FMC}
2.5) Configuring fabric switches {FMC}
2.6) Controller power-up and setup
2.7) Licensing
2.8) Assigning Disks
2.9) Configuring SyncMirror
2.7) Licensing
license add XXXXXXX # cf
license add XXXXXXX # cf_remote
license add XXXXXXX # syncmirror_local
NOTE: syncmirror_local requires a reboot to enable, but you can just use cf takeover/giveback!
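To pick up the syncmirror_local license without a separate reboot, a hedged sketch of the takeover/giveback approach, run from the partner of the node that needs the reboot (here SiteB reboots SiteA; the names are hypothetical):
SiteB> cf takeover # SiteA reboots and is served by SiteB
SiteB> cf giveback # once SiteA is at "waiting for giveback"
Then repeat in the other direction for the other controller.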
2.8) Assigning Disks
sysconfig
disk show -v
disk assign 0b.23 -p 1 -s 1234567890 {SMC}
disk assign sw1:3* -p 0 {FMC}
disk assign sw3:5* -p 1 {FMC}
storage show disk -p
NOTE: “When assigning disk ownership, always assign all disks on the same loop or stack to the same storage system and pool as well as the same adaptor.” Use Pool 0 for the local disks and Pool 1 for the remote disks.
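Before creating mirrored aggregates, it’s worth sanity-checking which pool each disk actually landed in:
sysconfig -r
The output includes a Pool column for every disk (including spares), so you can confirm the local disks are in Pool0 and the remote disks are in Pool1.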
3: Using
3.1) Recovering from a Site Failure
3.1.1) Restrict access to the failed site (FENCING - IMPORTANT)
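Exactly how you fence off the failed site depends on your environment. As a hedged illustration only, assuming NetApp-supplied Brocade fabric switches, you could disable the ISL ports from the surviving site’s switches (switch names and port numbers here are hypothetical):
switch_B1:admin> portdisable 13
switch_B2:admin> portdisable 13
The point is that the failed site’s controller must not be able to come back up and serve data while the survivor is in takeover.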
3.1.2) Force the surviving node into takeover mode
SiteB> cf forcetakeover -d
3.1.3) Remount volumes from the failed node (if needed)
3.1.4) Recover LUNs of the failed node (if needed)
NOTE: If you have an iSCSI-attached host and “options cf.takeover.change_fsid” is set to on (the default), you will need to recover LUNs from the failed node.
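A hedged sketch of the LUN recovery in 3.1.4 (the volume, LUN, and prompt names are hypothetical): with change_fsid on, the failed node’s LUNs come up offline after the forced takeover and need bringing back online:
SiteB(takeover)> partner
SiteB/SiteA> lun online /vol/vol_SiteA_01/lun0
SiteB/SiteA> lun show
Hosts may then need a rescan to pick the LUNs back up.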
3.1.5) Fix failures caused by the disaster
3.1.6) Reestablish the MetroCluster configuration (including giveback)
NOTE: Here, the SiteB controller has “taken over” the failed SiteA controller (which is in the “disaster site”).
To validate you can access the storage in Site A:
SiteB(takeover)> aggr status -r
To switch to the console of the recovered Site A controller:
SiteB(takeover)> partner
Once you have determined the remote site is accessible, turn the Site A controller on. To determine the status of aggregates for both sites:
SiteB/SiteA> aggr status
If the aggregates in the disaster site are showing online, you need to change their state to offline:
SiteB/SiteA> aggr offline aggr_SiteA_01
To re-create the mirrored aggregate (here we choose the “disaster site” aggregate as the victim):
SiteB/SiteA> aggr mirror aggr_SiteA_rec_SiteB_01 -v aggr_SiteA_01
To check resyncing progress:
SiteB/SiteA> aggr status
NOTE: When aggregates are ready, they transition to mirrored.
After all aggregates have been re-joined, return to the SiteB node and do the giveback:
SiteB/SiteA> partner
SiteB(takeover)> cf giveback
One final thing you might want to do is rename the aggregates back to how they were before the disaster:
SiteA> aggr rename aggr_SiteA_rec_SiteB_01 aggr_SiteA_01
3.2) Maintenance
If you’re just after simple maintenance (not site fail-over or anything like that):
cf disable
Do your work (e.g. re-cabling, powering things down or up), then - when finished:
aggr status
Wait* for the aggregates to transition from resyncing to mirrored.
*Here you need to watch the rate of change on either side - too many changes and it might take forever for the resync to complete, hence the need for a maintenance window!
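To keep an eye on the resync, a minimal sketch (the aggregate name is hypothetical):
aggr status -r aggr1
The plex that is catching up shows as resyncing with a percent-complete figure; once both plexes are back to normal, the aggregate reports mirrored.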
cf enable
You see, I told you it was simple :)
NOTE: You might want to do an options autosupport.doit “MC Maintenance” to let NetApp support know.