Notes: MetroCluster Monitoring and Alerting (with Performance Reporting)

These notes focus on a 4-node fabric-attached MetroCluster.

Image: MetroCluster FC Diagram

The main components of a Fabric MetroCluster are:

- 1) Storage Controllers/ONTAP Cluster
- 2) ATTO Bridges
- 3) Storage Shelves & Disks
- 4) Fabric Switches (usually Brocade)
- 5) Cluster Switches (if not switchless)
- 6) ISLs
- 7) MetroCluster Configuration
- 8) OnCommand Unified Manager
- 9) (Optional) Tiebreaker

Each component should be monitored, and alerts received if anything goes wrong.

Tools

- IMT
- Config Advisor
- AutoSupport/ActiveIQ
- OnCommand Unified Manager
- OnCommand Insight
- NetApp Harvest for Grafana
- Brocade Network Advisor (BNA) Pro+

Main Components

1) Storage Controllers/ONTAP Cluster

Check ONTAP Version, SP/BMC, shelf firmware, DQP, disk firmware.

storage aggregate show
storage failover show
storage failover show-giveback
upgrade-revert show
system health alert show
system health subsystem show
system health config show
system environment sensors show -state fault
(DIAG) debug vreport show
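
To run the admin-level checks above on a schedule, they can be wrapped in a small script. The following is only a sketch under assumptions, not a NetApp tool: it assumes key-based SSH access to the cluster management LIF, and the host name cluster1-mgmt and user monitor are hypothetical. It builds the ssh argument lists only; executing them (for example with subprocess.run) is left out.

```python
import shlex

# Admin-level health commands taken from the list above.
HEALTH_COMMANDS = [
    "storage aggregate show",
    "storage failover show",
    "system health alert show",
    "system health subsystem show",
    "system environment sensors show -state fault",
]


def build_ssh_command(host, user, ontap_command):
    """Return an argv list that runs one ONTAP CLI command over SSH."""
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # which is what an unattended job needs.
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", ontap_command]


def build_all(host, user, commands=HEALTH_COMMANDS):
    """Build one ssh argv list per health command."""
    return [build_ssh_command(host, user, c) for c in commands]


if __name__ == "__main__":
    # Hypothetical host and user, for illustration only.
    for argv in build_all("cluster1-mgmt", "monitor"):
        print(shlex.join(argv))
```

Passing each argv list to subprocess.run with capture_output=True would return the command output for parsing or alerting.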

2) ATTO Bridges

storage bridge show
*provides information to OCUM

“While the management port on the 7500 and 7600 still is there in 9.5 and later we highly recommend inband monitoring and disabling the port.” (Q: How do you upgrade ATTOs then?)
How to change an ATTO FibreBridge to in-band management in 9.5 and later:
How to secure an ATTO FibreBridge in 9.5 and later:

3) Storage Shelves & Disks

storage shelf show -errors
disk show -pool Pool0
disk show -pool Pool1

4) Fabric Switches

storage switch show
*provides information to OCUM

Included with the Brocade switches that are ordered through NetApp:
- An Enterprise license bundle:
-- Fabric Vision - includes Fabric Watch and Advanced Performance Monitoring
- Brocade Network Advisor Pro+ (which includes the capability for switches to send event messages to AutoSupport and can be used with any of the Brocade switches in your environment.)

5) Cluster Switches (if not switchless)

system cluster-switch show

6) ISLs

Use BNA Pro+

Tracking NVlog latency over FCVI:
set diag -confirmations off; statistics show -object fcvi -raw true -node local; set admin
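
Since -raw true prints raw counter values rather than computed averages, a latency figure has to be derived from two samples taken some time apart. A minimal sketch of that calculation follows. It assumes the fcvi object exposes a cumulative latency total and a cumulative operation count; the counter names write_latency_total and write_ops are hypothetical placeholders, not confirmed fcvi counter names, and the result is in whatever unit the latency counter uses.

```python
def average_latency(sample_before, sample_after,
                    latency_key="write_latency_total",  # hypothetical name
                    ops_key="write_ops"):               # hypothetical name
    """Average latency per operation between two raw counter samples.

    Each sample is a dict of cumulative counter values. Returns None
    when no operations completed in the interval (or a counter reset
    made the delta negative), because no average can be computed.
    """
    delta_latency = sample_after[latency_key] - sample_before[latency_key]
    delta_ops = sample_after[ops_key] - sample_before[ops_key]
    if delta_ops <= 0 or delta_latency < 0:
        return None
    return delta_latency / delta_ops


if __name__ == "__main__":
    before = {"write_latency_total": 1000, "write_ops": 10}
    after = {"write_latency_total": 1600, "write_ops": 30}
    print(average_latency(before, after))  # 600 / 20 = 30.0
```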

Image: FCVI Write Latency with Grafana

Image: MetroCluster Dashboard in Grafana

7) MetroCluster Configuration

metrocluster check run
metrocluster check show
metrocluster show
metrocluster show -periodic-check-status
metrocluster node show
metrocluster interconnect mirror show

Q: cron jobs synchronized?
MetroCluster ConfigReplication failed - Job Schedules not replicated
https://kb.netapp.com/app/answers/answer_view/a_id/1014596/loc/en_US

NetApp recommends redundant networks for the cluster peering network. If the cluster peering network is unavailable, health monitor alerts and EMS messages are sent, and planned switchover is not possible until at least one link is restored.

The primary value of Automatic Unattended Switchover (AUSO) is to improve the HA capabilities of MetroCluster systems.

NVFAIL: Any database volume on ONTAP storage should have the nvfail parameter set to on. This setting protects the volume from a catastrophic failure of NVRAM journaling that puts data integrity in question. The nvfail parameter takes effect during startup. If NVRAM errors are detected, then there might be uncommitted changes that have been lost, and the drive state might not match the database cache. ONTAP then sets volumes with an nvfail parameter of on to in-nvfailed-state. As a result, any database process attempting to access the data receives an I/O error, which leads to a protective crash or shutdown of the database.
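
The NVFAIL rule described above can be modelled in a few lines. This is a toy model of the rule as stated in the text, not ONTAP code; the Volume class, its field names, and the volume names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    name: str
    nvfail: bool                  # True models "nvfail on"
    nvfailed_state: bool = False  # True models in-nvfailed-state


def startup(volumes, nvram_errors_detected):
    """Apply the startup rule: on NVRAM errors, flag nvfail-enabled volumes."""
    for vol in volumes:
        if nvram_errors_detected and vol.nvfail:
            vol.nvfailed_state = True
    return volumes


def read(vol):
    """Model a database read: fails with an I/O error on a flagged volume."""
    if vol.nvfailed_state:
        raise IOError(f"{vol.name}: volume is in-nvfailed-state")
    return "data"


if __name__ == "__main__":
    vols = startup([Volume("oradata", nvfail=True),
                    Volume("scratch", nvfail=False)],
                   nvram_errors_detected=True)
    for v in vols:
        try:
            print(v.name, read(v))
        except IOError as err:
            print("I/O error:", err)
```

In the model, only the volume with nvfail on refuses I/O after a startup with NVRAM errors, which is the signal that makes the database shut down instead of continuing on possibly stale data.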

system health alert definition show -subsystem MetroCluster
system health alert show -subsystem metroCluster

8) OnCommand Unified Manager

Unified Manager uses the information that is collected by the MetroCluster health monitors to gather information about the configuration and to collect events that are related to the components. The health monitors use SNMP to monitor the switches and bridges. For current MetroCluster versions, in-band monitoring is available for bridges (7500N or later models) in configurations where SNMP is not desired.

9) (Optional) Tiebreaker

NetApp provides a fully supported capability, MetroCluster Tiebreaker software, which is installed at a third site with independent connections to each of the two clusters. The purpose of the Tiebreaker software is to monitor and to detect both individual site failures and intersite link failures. MetroCluster Tiebreaker software can raise an SNMP trap if a site disaster occurs. It operates in observer mode and can detect and send an alert if a disaster requiring switchover occurs. The switchover then can be issued manually by the administrator. Tiebreaker software can be configured to automatically issue the command for switchover if a disaster occurs.
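
The difference between observer mode and automatic switchover can be sketched as a small decision function. This is a simplified illustration and not how Tiebreaker is implemented: the classification rule (a site counts as failed only when the third site cannot reach it and the intersite link is also down) is an assumption made for the example, and all function and event names are hypothetical.

```python
def classify(a_reachable, b_reachable, intersite_link_up):
    """Classify what the third site observes (simplified assumption)."""
    if a_reachable and b_reachable:
        return "healthy" if intersite_link_up else "intersite link failure"
    # Exactly one site unreachable AND the sites cannot see each other.
    if a_reachable != b_reachable and not intersite_link_up:
        return "site failure"
    # Anything else could be a problem on the monitoring path itself.
    return "undetermined"


def respond(event, auto_switchover=False):
    """Return the actions taken for an event.

    Observer mode (auto_switchover=False) only notifies; the
    administrator issues the switchover manually.
    """
    if event == "healthy":
        return []
    if event != "site failure":
        return ["alert"]
    actions = ["snmp trap", "alert"]
    actions.append("issue switchover" if auto_switchover
                   else "wait for administrator")
    return actions


if __name__ == "__main__":
    event = classify(a_reachable=True, b_reachable=False,
                     intersite_link_up=False)
    print(event, respond(event))
    print(event, respond(event, auto_switchover=True))
```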

ClusterLion is an advanced MetroCluster monitoring appliance that functions as a virtual third site. This approach allows MetroCluster to be safely deployed in a two-site configuration with fully automated switchover capability.

APPENDIX: CLI Commands

MetroCluster Modify

-auto-switchover-failure-domain
This parameter specifies configuration of the automatic switchover. Supported values are as follows:
auso-on-cluster-disaster - triggers an unplanned switchover if all nodes in a DR cluster are down (default)
auso-on-dr-group-disaster - triggers an unplanned switchover if both nodes of a DR group are down
auso-disabled - automatic switchover is disabled
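
The three values differ only once there is more than one DR group (for example in an 8-node configuration). The sketch below mirrors just the wording of the descriptions above; real AUSO has further preconditions that are not modelled here, and the function name and input layout are made up for illustration.

```python
def auso_triggers(failure_domain, partner_down_by_group):
    """Decide whether an unplanned switchover would be triggered.

    partner_down_by_group holds one list per DR group with a boolean
    per partner-cluster node (True = node is down).
    """
    if failure_domain == "auso-disabled" or not partner_down_by_group:
        return False
    if failure_domain == "auso-on-dr-group-disaster":
        # Both nodes of at least one DR group are down.
        return any(all(group) for group in partner_down_by_group)
    if failure_domain == "auso-on-cluster-disaster":
        # Every node in the DR cluster is down.
        return all(all(group) for group in partner_down_by_group)
    raise ValueError(f"unknown failure domain: {failure_domain}")


if __name__ == "__main__":
    # Hypothetical 8-node case: DR group 1 is down, DR group 2 is up.
    state = [[True, True], [False, False]]
    for domain in ("auso-on-cluster-disaster",
                   "auso-on-dr-group-disaster",
                   "auso-disabled"):
        print(domain, auso_triggers(domain, state))
```

With a single DR group, as in the 4-node configuration these notes focus on, the first two values behave identically in this model.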

-node-object-limit
If all the nodes in a DR group have this option enabled (the default state), then all the nodes in that DR group are protected from double failures.

-automatic-switchover-onfailure
This parameter is used to enable automatic switchover on failure on a node when it is disabled because of internal errors. All the nodes in an MCC configuration must have this option enabled (the default state) to enable automatic switchover on failure.

MetroCluster Operation Show

APPENDIX: MetroCluster (FC) New Features in ONTAP 9.5

- Active storage virtual machines (SVMs) in a MetroCluster configuration can be used as sources with the SnapMirror SVM disaster recovery feature.

- Cluster update with OnCommand System Manager.

APPENDIX: Miscellaneous

“For most storage platforms, Utilization is CPU utilization. For ONTAP, it’s a fancier hybrid metric that incorporates CPU utilization, back-to-back checkpoints, Kahuna domain utilization, and some other secret ONTAP sauce to provide a more accurate metric of controller performance headroom.”
“Memory utilization might not be a meaningful metric in ONTAP.”

The old sysstat commands -x/-m/-M continue to be available, so these counters will be available (for Grafana/Harvest - https://community.netapp.com/t5/OnCommand-Storage-Management-Software-Discussions/NetApp-Harvest-1-4-1-released/td-p/142907):

node run local sysstat -x 1
node run local sysstat -m 1
node run local sysstat -M 1

Comments

  1. Hi,
    Good post. Does monitoring the MetroCluster with ONTAP 9.5 and NetApp Harvest 1.4.2 work for you? Since we upgraded ONTAP to 9.5 we have no more graphs.
