Saturday, 1 February 2014

Brief Notes on Advanced Troubleshooting in CDOT

A few rough notes that might save one's bacon one day! It's unlikely you'll ever have a need to use any of the information in this post - if you have a problem you'll be calling NetApp Global Support, not trying stuff found on a completely unofficial NetApp enthusiast's blog post - stuff which is not going to get updated, and which the author never really expects anyone to actually read - still, it's interesting to know these things are there! As always - with this and any other unofficial blog - caveat lector!

The Obligatory Image: Apologies if the sight of juicy succulent bacon offends - no offense intended!
Most of the commands below are available from the clustershell at the diag privilege level (::> set d), and a lot of the others via the systemshell (%).

::> set d
::*> sec login unlock diag
::*> sec login password diag
::*> systemshell -node NODENAME
%

To get a new clustershell from the systemshell:

% ngsh
::*> exit
%

Note: If you log into the console as diag it takes you to the systemshell.

Cluster Health Basics

::*> cluster show
::*> cluster ring show

Effect of the following replicated database (RDB) applications not running

mgwd … there is no clustershell
vifmgr … you cannot manage networking
vldb … you cannot create volumes
bcomd … you cannot manage SAN data access

Moving Epsilon

::*> system node modify -node OLDNODEwEPSILON -epsilon false
::*> system node modify -node NEWNODEwEPSILON -epsilon true

You have to set the original to false and the new owner to true. If you try to set the new owner to true without first setting the original to false you get this error:
Error: command failed: Could not change epsilon of specified node: SL_EPSILON_ERROR (code 36). Epsilon manipulation error: The epsilon must be assigned to at most one eligible node, and it is required for a single node cluster. In two node cluster HA configurations epsilon cannot be assigned.

Note: Also remember the use of the below when a system is being taken down for prolonged maintenance -
::*> system node modify -node NODENAME -eligibility false

Types of Failover

::*> aggr show -fields ha-policy
ha-policy = cfo (cluster failover) for mroot aggregates
ha-policy = sfo (storage failover) for data aggregates
::*> storage failover giveback -ofnode NODENAME -only-cfo-aggregates true
The above only gives back the root aggregate!
::*> sto fail progress-table show -node NODENAME

Some LOADER Environment Stuff

Un-Muting Console Logs:
LOADER> setenv bootarg.init.console_muted false

Setting to boot as clustered:
LOADER> setenv bootarg.init.boot_clustered true

Configuring an interface for netboot:
LOADER> ifconfig e0a -addr=X.X.X.X -mask=X.X.X.X -gw=X.X.X.X -dns=X.X.X.X -domain=DNS_DOMAIN
LOADER> netboot http://X.X.X.X/netboot/kernel

Note: To see the boot loader environment variables in the clustershell or systemshell:
::*> debug kenv show
% kenv

To start a node without the job manager (also see "User-Space Processes" below):
LOADER> setenv bootarg.init.mgwd_jm_nostart true

For a list of job manager types

::*> job type show

Job Manager Troubleshooting

::*> job initstate show
::*> job show
::*> job schedule show
::*> job history show
::*> job store show -id JOB_UUID

% cat /mroot/etc/log/mlog/command-history.log
% cat /mroot/etc/cluster_config/mdb/mgwd/job_history_table
% cat /mroot/etc/log/mlog/jm-restart.log
% cat /mroot/etc/log/ems

To keep an eye on the “tail” of a log:
% tail -f LOGNAME
% tail -f /var/log/notifyd*

% tail -100 /mroot/etc/log/mlog/mgwd.log | more

Logs

Location:
/mroot/etc/log/mlog

Includes logs for:
Message, mgwd, secd, vifmgr, vldb, notifyd

::> event log show -severity emergency
::> event log show -severity alert
::> event log show -severity critical
::> event log show -severity error
::> event log show -severity warning
::> event log show -time "01/21/2014 09:00:00".."01/22/2014 09:00:00" -severity !informational,!notice,!debug

::*> debug log files show
::*> debug log show ?
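
If mgwd is down and event log show is unavailable, a rough equivalent of the severity filtering above can be done from the systemshell with grep. A minimal sketch against a synthetic log - the line format and message names here are made-up stand-ins for illustration, not the real EMS layout:

```shell
# Build a small sample log in a temp file (illustrative format only).
log=$(mktemp)
cat > "$log" <<'EOF'
1/21/2014 09:05:11 [node1:vifmgr.lifs.noredundancy:ERROR]: LIF has no failover target.
1/21/2014 09:06:02 [node1:kern.syslog.msg:informational]: System rebooted.
1/21/2014 09:07:45 [node1:callhome.battery.low:ALERT]: NVRAM battery low.
EOF

# Keep only ERROR/ALERT/EMERGENCY lines - the grep equivalent of
# "event log show -severity !informational,!notice,!debug".
matches=$(grep -E ':(ERROR|ALERT|EMERGENCY)\]' "$log")
echo "$matches"
```

The same pattern works against the real files under /mroot/etc/log/mlog once you know the severity token your version embeds in each line.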

Autosupport

Invoke autosupport:
::> system autosupport invoke -node * -type all

Autosupport trigger:
::*> system node autosupport trigger show -node NODENAME
::*> system node autosupport trigger modify -node NODENAME -?

Troubleshooting asup with debug smdb:
::*> debug smdb table nd_asup_lock show

Unmounting mroot then remounting

% cd /etc
% sudo ./netapp_mroot_unmount
% sudo mgwd

Mounting /mroot for an HA Partner in Takeover (for core/logs/… collection)

% sudo mount
% sudo mount_partner
% sudo umount

Unlock diag user with mgwd not functioning

Option 1) Reboot to Ctrl-C for Boot Menu and option (3) Change password.
Option 2) If option (3) doesn’t work from the boot menu do:
Selection (1-8)? systemshell
# /usr/bin/passwd diag
# exit

Note: The same method can be used to reset admin, but you must update the password quickly after logging into the clustershell, otherwise the new password is overwritten by the original password from the RDB:
::> security login password -username admin

Panic Testing and System Coredumps

::*> system node run -node NODENAME -command panic

An in-state core from RLM/SP:
> system core

Out-state core from Clustershell:
::> reboot -node NODENAME -dump true

Out-state core from Nodeshell:
> halt -d

Out-state core from systemshell:
% sysctl debug.debugger_on_panic=0
% sysctl debug.panic=1

Controlling automatic coring, and core type (sparse contains no user data):
::> storage failover modify -onpanic true -node NODENAME
::> system coredump config modify -sparsecore-enabled true -node NODENAME

Reviewing and uploading:
::> coredump show
::> coredump upload
% scp /mroot/etc/crash/COREFILE REMOTEHOST:/path
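
Before uploading, it can help to see what is actually sitting in the crash directory. A minimal sketch with ordinary tools, using a throwaway directory in place of /mroot/etc/crash - the core file names here are invented for illustration:

```shell
# Throwaway directory standing in for /mroot/etc/crash.
crashdir=$(mktemp -d)
touch "$crashdir/core.101.nz" "$crashdir/core.102.nz" "$crashdir/notes.txt"

# List just the core files - the candidates for coredump upload or scp.
cores=$(find "$crashdir" -name 'core.*' | sort)
echo "$cores"
```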

User-Space Processes and Cores

% ps -aux
% ps -aux | grep PROCESSNAME
% pgrep PROCESSNAME
% sudo kill -TRAP PID
Note: Processes monitored by spmd restart automatically
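
The ps/pgrep/kill pattern above is plain FreeBSD and can be tried safely with a throwaway process; the sleep below stands in for a daemon like mgwd. Note this sketch sends SIGTERM to clean up - kill -TRAP, as shown above, would instead force a user-space core for analysis:

```shell
# Disposable background process standing in for a daemon.
sleep 300 &
pid=$!

# Find it the same way you would find mgwd or vifmgr with pgrep.
found=$(pgrep -f 'sleep 300' | head -1)

# Clean up with SIGTERM (kill -TRAP would force a core dump instead;
# spmd-monitored processes restart automatically afterwards).
kill -TERM "$pid"
wait "$pid" 2>/dev/null || true
```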

Root Volume and RDB (Node/Cluster) Backup and Recovery

::*> system configuration backup ?
::*> system configuration recovery ?

Items in a node configuration backup include:
- Persistent bootarg variables in cfcard/env
- varfs.tgz and oldvarfs.tgz in /cfcard/x86_64/freebsd/
- Any configuration file under /mroot/etc

In a cluster configuration backup:
- Replicated records of all the RDB rings - mgwd, VLDB, vifmgr, BCOM

Backup files live at /mroot/etc/backups and can be redirected to a URL.

Simple Management Framework (SMF)

To see all the simple management database (SMDB) iterator objects in the SMF:
::*> debug smdb table {TAB}

Examples:
::*> debug smdb table bladeTable show
::*> debug smdb table cluster show

Another way to get Volume Information

Note: Not to be used on production systems unless instructed by NGS!

::*> vol show -vserver VS1 -volume VOLNAME -fields uuid
::*> net int show -role cluster
% zsmcli -H CLUSTERICIP d-volume-list-info id=UUID desired-attrs=name
% zsmcli -H CLUSTERICIP d-volume-list-info id=UUID

User-Space Processes

RDB Applications (MGWD, VifMgr, VLDB, BCOM):
% ps aux | grep mgwd
etcetera…

Non-RDB Applications (secd, NDMP, spmd, mlogd, notifyd, schmd, sktlogd, httpd):
% ps aux | grep /sbin

List managed processes:
% spmctl -l

Stop monitoring a process:
% spmctl -dh PROCESSNAME

SPMCTL help:
% spmctl --help

List of well-known handles:
vldb, vifmgr, mgwd, secd, named, notifyd, time_state, ucoreman, env_mgr, spd, mhostexecd, bcomd, cmd, ndmpd, schmd, nchmd, shmd, nphmd, cphmd, httpd, mdnsd, sktlogd, kmip_client, raid_lm, #upgrademgr, mntsvc, coresegd, hashd, servprocd, cshmd, fpolicy, ntpd, memevt

Set MGWD to start without job manager running:
% spmctl -s -h mgwd
% sudo /sbin/mgwd --jm-nostart

To stop spmd from monitoring a process:
% spmctl -s -h vifmgr
To restart spmd monitoring:
% spmctl -e -c /sbin/vifmgr -h vifmgr

Notable processes:
secd = security daemon
notifyd = notify daemon (required for autosupport to run)
httpmgr = manager daemon for Apache httpd daemon
schmd = monitors SAS connectivity across an HA pair
nchmd = monitors SAS connectivity per node
ndmpd = used for NDMP backups

Debugging M-Host Applications:
::*> debug mtrace show

To restart vifmgr (and flush the cache):
% spmctl -h vifmgr -s
% spmctl -h vifmgr -e

Viewing the Configuration Database Info

Essentially: the CDB is local, the RDB is clusterwide.
% cdb_viewer /var/CDB.mgwd
::*> net int cdb show
::*> net port cdb show
::*> net routing-groups cdb show
::*> net routing-groups route cdb show
etcetera…
::*> debug smdb table {AS ABOVE}

For RDB status information

% rdb_dump --help
% rdb_dump

MSIDs

MSIDs appear as FSID in a packet trace and are also located in the VLDB. Look for errors in event log show referencing the MSID.

::*> vol show -vserver SVM -fields msid -volume VOLNAME

Identify which clients have NFS connectivity to a node and unmount clients if mounted to a volume that no longer exists:

::*> network connections active show -node NODENAME -service nfs*
::*> network connections active show -service ?

Using vReport to scan for VLDB and D-blade discrepancies

::*> debug vreport show
::*> debug vreport fix ?

DNS-Based Load Balancing

::> network interface show-zones
::> network interface modify -vserver SVM -lif LIFNAME -dns-zone CMODE.DOMAIN.COM

C:\> nslookup CMODE
C:\> nslookup CMODE.DOMAIN.COM

Note: Regular round-robin DNS can overload LIFs.

Automatic LIF Rebalancing

::*> net int show -fields allow-lb-migrate,lb-weight -vserver SVM

Note: You should create separate LIFs for CIFS and NFSv4 traffic. NFSv3 LIFs can auto-rebalance (not for use with VMware); stateful protocols like CIFS and NFSv4 cannot auto-rebalance.

Firewall

::*> debug smdb table firewall_policy_table show
::*> firewall policy service create -service SERVICENAME -protocol tcp/udp -port PORT
::*> firewall policy service show
::*> firewall policy create -policy POLICYNAME -service SERVICENAME -action deny -ip-list X.X.X.X/X
::*> firewall policy show

Cluster Session Manager

::*> debug csm ?
::*> debug csm session show -node NODENAME
::*> event log show -messagename csm*
::*> cluster ping-cluster -node NODENAME
% sysctl sysvar.csm
% sysctl sysvar.nblade

Execution Thresholds

% sysctl -a sysvar.nblade.debug.core | grep exec

Local Fastpath (Cannot see any reason to change this but the option’s there!)

% sudo sysctl kern.bootargs=bootarg.nblade.localbypass_in=0
% sudo sysctl kern.bootargs=bootarg.nblade.localbypass_out=0

Nitrocfg Configures and Queries Basic N-Blade Components

% nitrocfg

The RDB Site List

% cat /var/rdb/_sitelist

LIF Connection Errors

::> event log show -message Nblade.*
::> event log show -instance -seqnum SEQUENCE_ID
% sysctl -a | grep sysvar.csm.rc
% sysctl sysvar.csm.session

Examine the:
/mroot/etc/log/mlog/vifmgr.log
/mroot/etc/log/mlog/mgwd.log

LIF Failure to Migrate

::> net port show -link !up
::> net int show -status-oper !up
::> cluster show
% rdb_dump
::> event log show -node NODENAME -message VifMgr*

Unbalanced LIFs

::> network connections active show

Duplicate LIF IDs

::*> net int show -fields numeric-id
::*> net int cdb show
::*> net int ids show
% cdb_viewer /var/CDB.mgwd
::*> net int ids delete -owner SVM -lif LIFNAME
::*> debug smdb table {vifmgr? or virtual_interface?} show

Nodeshell Network-Related Commands

> ifconfig -a
> netstat
> route -gsn
> ping -i e0d X.X.X.X
> traceroute

> pktt start e0c
> pktt dump e0c
> pktt stop e0c

Systemshell Tools

% ifconfig
% nitrocfg
% netstat -rn
% route
% vifmgrclient
% pcpconfig
% ipfstat -ioh
% rdb_dump
% tcpdump -r /mroot/e0d_DATE_TIME.trc

vifmgrclient for debugging vifmgr

% vifmgrclient --verbosity-on
% vifmgrclient -set-trace-level 0xcfff
% vifmgrclient -debug
::*> debug smdb table rg_routes_view show

Cluster Peer Ping (Intercluster)

::*> cluster peer ping

Miscellaneous - Using Command History

::> history
::> !LINE
