Focusing on a 4-node fabric-attached MetroCluster.
Image: MetroCluster FC Diagram
The main components of a Fabric MetroCluster are:
- 1) Storage Controllers/ONTAP Cluster
- 2) ATTO Bridges
- 3) Storage Shelves & Disks
- 4) Fabric Switches (usually Brocade)
- 5) Cluster Switches (if not switchless)
- 6) ISLs
- 7) MetroCluster Configuration
- 8) OnCommand Unified Manager
- 9) (Optional) Tiebreaker
Each component should be monitored, and alerts should be delivered if anything goes wrong.
Tools
- IMT
- Config Advisor
- AutoSupport/ActiveIQ
- MetroCluster Data Collector (MCDC - formerly FMC_DC): https://mysupport.netapp.com/tools/info/ECMLP2434886I.html?productID=61930&pcfContentID=ECMLP2434886
- OnCommand Unified Manager
- OnCommand Insight
- NetApp Harvest for Grafana
- Brocade Network Advisor (BNA) Pro+
Main Components
1) Storage Controllers/ONTAP Cluster
Check ONTAP version, SP/BMC, shelf firmware, DQP, and disk firmware.
storage aggregate show
storage failover show
storage failover show-giveback
upgrade-revert show
system health alert show
system health subsystem show
system health config show
system environment sensors show -state fault
(DIAG) debug vreport show
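The controller checks above lend themselves to a scheduled run over SSH. The following is a minimal sketch, not a tested collector: it only builds the `ssh` argument lists for the admin-level commands in the checklist, leaving `debug vreport show` out because it needs diag privilege. The host name, the `admin` user, and the helper names `build_ssh_argv` and `plan_checks` are assumptions for illustration.

```python
# Admin-level commands copied from the checklist above.
CONTROLLER_CHECKS = [
    "storage aggregate show",
    "storage failover show",
    "storage failover show-giveback",
    "upgrade-revert show",
    "system health alert show",
    "system health subsystem show",
    "system health config show",
    "system environment sensors show -state fault",
]


def build_ssh_argv(host, user, command):
    """Return the argv list that would run one ONTAP CLI command over SSH."""
    return ["ssh", f"{user}@{host}", command]


def plan_checks(host, user="admin"):
    """Return one argv list per checklist command (hypothetical helper)."""
    return [build_ssh_argv(host, user, command) for command in CONTROLLER_CHECKS]
```

Each returned list can be passed to `subprocess.run(argv, capture_output=True, text=True)`; what to do with the output (store it, diff it against the previous run, alert on it) is left open here.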
2) ATTO Bridges
storage bridge show
*provides information to OCUM
“While the management port on the 7500 and 7600 still is there in 9.5 and later we highly recommend inband monitoring and disabling the port.” (Q: How do you upgrade ATTOs then?)
How to change an ATTO FibreBridge to in-band management in 9.5 and later:
How to secure an ATTO FibreBridge in 9.5 and later:
3) Storage Shelves & Disks
storage shelf show -errors
disk show -pool Pool0
disk show -pool Pool1
4) Fabric Switches
storage switch show
*provides information to OCUM
Included with the Brocade switches that are ordered through NetApp:
- An Enterprise license bundle:
-- Fabric Vision - includes Fabric Watch and Advanced Performance Monitoring
- Brocade Network Advisor Pro+ (which includes the capability for switches to send event messages to AutoSupport and can be used with any of the Brocade switches in your environment)
5) Cluster Switches (if not switchless)
system cluster-switch show
6) ISLs
Use BNA Pro+
Tracking NVlog latency over FCVI:
set diag -confirmations off; statistics show -object fcvi -raw true -node local; set admin
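Raw counters are cumulative, so a latency figure only becomes meaningful once two samples are compared. A minimal sketch of that step follows, assuming the usual convention that an average-latency counter is divided by the change in its matching operations counter. The counter names `rdma_write_avg_latency` and `rdma_write_ops` are assumptions here and should be checked against the actual `statistics show -object fcvi` output, as should the unit of the result.

```python
def average_latency(prev, curr,
                    latency_key="rdma_write_avg_latency",  # assumed counter name
                    ops_key="rdma_write_ops"):             # assumed counter name
    """Average latency per operation between two raw samples.

    prev and curr are dicts of raw (cumulative) counter values taken at two
    points in time. Returns None when no operations completed in the interval
    or when the counters went backwards (for example after a reboot).
    """
    delta_ops = curr[ops_key] - prev[ops_key]
    if delta_ops <= 0:
        return None
    delta_latency = curr[latency_key] - prev[latency_key]
    return delta_latency / delta_ops
```

For example, if the latency counter grows by 3000 while the operations counter grows by 15, the interval average is 200 in whatever unit the latency counter uses.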
Image: FCVI Write Latency with Grafana
Image: MetroCluster Dashboard in Grafana
7) MetroCluster Configuration
metrocluster check run
metrocluster check show
metrocluster show
metrocluster show -periodic-check-status
metrocluster node show
metrocluster interconnect mirror show
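The output of `metrocluster check show` can be scanned automatically so that only components needing attention are reported. The sketch below assumes a two-column Component/Result table under a dashed separator line; that layout, the sample text, and the function name `failed_components` are assumptions for illustration rather than a guaranteed output format.

```python
def failed_components(check_output):
    """Return {component: result} for every table row whose result is not 'ok'."""
    failures = {}
    in_table = False
    for line in check_output.splitlines():
        stripped = line.strip()
        if not stripped:
            if in_table:
                break          # a blank line after the rows ends the table
            continue
        if stripped.startswith("---"):
            in_table = True    # the dashed separator marks the start of the rows
            continue
        if not in_table or stripped.endswith("displayed."):
            continue           # skip headers and the "N entries were displayed." footer
        parts = stripped.split()
        if len(parts) < 2:
            continue
        component, result = parts[0], " ".join(parts[1:])
        if result != "ok":
            failures[component] = result
    return failures


# Assumed sample layout, not captured from a real system.
SAMPLE = """
Last Checked On: 1/1/2019 12:00:00

Component           Result
------------------- ---------
nodes               ok
lifs                ok
config-replication  warning
aggregates          ok
4 entries were displayed.
"""
```

With the sample above, `failed_components(SAMPLE)` reports only the `config-replication` row, which is the kind of result worth turning into an alert.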
Q: cron jobs synchronized?
MetroCluster ConfigReplication failed - Job Schedules not replicated
https://kb.netapp.com/app/answers/answer_view/a_id/1014596/loc/en_US
NetApp recommends redundant networks for the cluster peering network. If the cluster peering network is unavailable, health monitor alerts and EMS messages are sent, and planned switchover is not possible until at least one link is restored.
The primary value of Automatic Unattended Switchover (AUSO) is to improve the HA capabilities of MetroCluster systems.
NVFAIL: Any database volume on ONTAP storage should have the nvfail parameter set to on. This setting protects the volume from a catastrophic failure of NVRAM journaling that puts data integrity in question. The nvfail parameter takes effect during startup. If NVRAM errors are detected, then there might be uncommitted changes that have been lost, and the drive state might not match the database cache. ONTAP then sets volumes with an nvfail parameter of on to in-nvfailed-state. As a result, any database process attempting to access the data receives an I/O error, which leads to a protective crash or shutdown of the database.
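To apply the NVFAIL recommendation across several database volumes, the `volume modify` commands can be generated from a list. This is a minimal sketch that only builds command strings; the SVM and volume names in the usage example are hypothetical, and the helper name `nvfail_commands` is an assumption.

```python
def nvfail_commands(volumes):
    """Build one 'volume modify ... -nvfail on' command per (svm, volume) pair."""
    return [
        f"volume modify -vserver {svm} -volume {volume} -nvfail on"
        for svm, volume in volumes
    ]
```

For instance, `nvfail_commands([("svm_db", "ora_data")])` yields a single command for the hypothetical volume `ora_data` on the hypothetical SVM `svm_db`, ready to be reviewed before it is run on the cluster.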
system health alert definition show -subsystem MetroCluster
system health alert show -subsystem MetroCluster
8) OnCommand Unified Manager
Unified Manager uses the information that is collected by the MetroCluster health monitors to gather information about the configuration and to collect events that are related to the components. The health monitors use SNMP to monitor the switches and bridges. For current MetroCluster versions, in-band monitoring is available for bridges (7500N or later models) for configurations where SNMP is not desired.
9) (Optional) Tiebreaker
NetApp provides a fully supported capability, MetroCluster Tiebreaker software, which is installed at a third site with independent connections to each of the two clusters. The purpose of the Tiebreaker software is to monitor and to detect both individual site failures and intersite link failures. MetroCluster Tiebreaker software can raise an SNMP trap if a site disaster occurs. It operates in observer mode and can detect and send an alert if a disaster requiring switchover occurs. The switchover then can be issued manually by the administrator. Tiebreaker software can be configured to automatically issue the command for switchover if a disaster occurs.
ClusterLion is an advanced MetroCluster monitoring appliance that functions as a virtual third site. This approach allows MetroCluster to be safely deployed in a two-site configuration with fully automated switchover capability.
APPENDIX: CLI Commands
MetroCluster Modify
-auto-switchover-failure-domain
This parameter specifies configuration of the automatic switchover. Supported values are as follows:
auso-on-cluster-disaster - triggers an unplanned switchover if all nodes in a DR cluster are down (default)
auso-on-dr-group-disaster - triggers an unplanned switchover if both nodes of a DR group are down
auso-disabled - automatic switchover is disabled
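The three values can be read as a simple decision rule over the state of the partner cluster's nodes. The sketch below models only the one-line descriptions given above; real AUSO behavior depends on further conditions that are not represented here, and the `Node` type and the `auso_triggers` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    dr_group: int
    up: bool


def auso_triggers(setting, partner_nodes):
    """Decide whether the described rule would trigger an unplanned switchover.

    partner_nodes holds the nodes of the remote (DR partner) cluster.
    """
    if setting == "auso-disabled":
        return False
    if setting == "auso-on-cluster-disaster":
        # Every node of the partner cluster must be down.
        return bool(partner_nodes) and all(not n.up for n in partner_nodes)
    if setting == "auso-on-dr-group-disaster":
        # It is enough that all partner nodes of one DR group are down.
        groups = {}
        for node in partner_nodes:
            groups.setdefault(node.dr_group, []).append(node)
        return any(all(not n.up for n in members) for members in groups.values())
    raise ValueError(f"unknown setting: {setting}")
```

In a 4-node MetroCluster the partner cluster holds a single DR group, so the two enabled settings give the same answer in this model; they differ only when a cluster contains more than one DR group.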
-node-object-limit
If all the nodes in a DR group have this option enabled (the default state), then all the nodes in that DR group are protected from double failures.
-automatic-switchover-onfailure
This parameter is used to enable automatic switchover on failure on a node when it is disabled because of internal errors. All the nodes in MCC configuration must have this option enabled (the default state) to enable automatic switchover on failure.
MetroCluster Operation Show
APPENDIX: MetroCluster (FC) New Features in ONTAP 9.5
- Active storage virtual machines (SVMs) in a MetroCluster configuration can be used as sources with the SnapMirror SVM disaster recovery feature.
- Cluster update with OnCommand System Manager.
APPENDIX: Miscellaneous
“For most storage platforms, Utilization is CPU utilization. For ONTAP, it’s a fancier hybrid metric that incorporates CPU utilization, back-to-back checkpoints, Kahuna domain utilization, and some other secret ONTAP sauce to provide a more accurate metric of controller performance headroom.”
“Memory utilization might not be a meaningful metric in ONTAP.”
The old sysstat commands -x/-m/-M continue to be available, so these counters will be available (for Grafana/Harvest - https://community.netapp.com/t5/OnCommand-Storage-Management-Software-Discussions/NetApp-Harvest-1-4-1-released/td-p/142907):
node run local sysstat -x 1
node run local sysstat -m 1
node run local sysstat -M 1
References
ONTAP 9 Documentation Center (see ‘MetroCluster configuration’)
NetApp MetroCluster FC for ONTAP 9.5
Oracle on MetroCluster: Integrated Data Protection, Disaster Recovery, and High Availability
NetApp MetroCluster: Solution Architecture and Design
ONTAP 9: MetroCluster Service and Expansion Guide
Tiebreaker Software 1.21 Installation and Configuration Guide
OnCommand Unified Manager
OnCommand Unified Manager: Documentation Resources