A few rough notes that might save one's bacon one day! It's unlikely you'll ever need to use any of the information in this post - if you have a problem you'll be calling NetApp Global Support, not trying stuff found on a completely unofficial NetApp enthusiast's blog post, stuff which is not going to get updated, and which the author never really expects anyone to actually read - still, it's interesting to know these things are there! As always - with this and any other unofficial blog - caveat lector!
The Obligatory Image:
Apologies if the sight of juicy succulent bacon offends - no offense intended!
Most of the commands below are available from the clustershell diag privilege level (::> set d), and a lot of the others via the systemshell (%).
::> set d
::*> sec login unlock diag
::*> sec login password diag
::*> systemshell -node NODENAME
%
To get a new clustershell from the systemshell:
% ngsh
::*> exit
%
Note: If you log into the console as diag it takes you to the systemshell.
Cluster Health Basics
::*> cluster show
::*> cluster ring show
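To look at the state of one specific ring, something like the following should work (the -unitname filter is an assumption based on the fields cluster ring show reports):
::*> cluster ring show -unitname vldb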
Effect of the following replicated database (RDB) applications not running:
mgwd … there is no clustershell
vifmgr … you cannot manage networking
vldb … you cannot create volumes
bcomd … you cannot manage SAN data access
Moving Epsilon
::*> system node modify -node OLDNODEwEPSILON -epsilon false
::*> system node modify -node NEWNODEwEPSILON -epsilon true
You have to set the original to false and the new owner to true. If you try to set the new owner to true without first setting the original to false you get this error:
Error: command failed: Could not change epsilon of specified
node: SL_EPSILON_ERROR (code 36). Epsilon manipulation error: The epsilon must
be assigned to at most one eligible node, and it is required for a single node
cluster. In two node cluster HA configurations epsilon cannot be assigned.
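To confirm which node currently holds epsilon, a quick sketch - assuming the epsilon field is exposed to cluster show at this privilege level:
::*> cluster show -fields epsilon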
Note: Also remember to use the below when a system is being taken down for prolonged maintenance -
::*> system node modify -node NODENAME -eligibility false
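And when the maintenance is over, flip the same switch back to make the node eligible again:
::*> system node modify -node NODENAME -eligibility true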
Types of Failover
::*> aggr show -fields ha-policy
ha-policy = cfo (cluster failover) for mroot aggregates
ha-policy = sfo (storage failover) for data aggregates
::*> storage failover giveback -ofnode NODENAME -only-cfo-aggregates true
The above only gives back the root aggregate!
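A normal giveback - the same command without the -only-cfo-aggregates flag - returns the SFO data aggregates as well:
::*> storage failover giveback -ofnode NODENAME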
::*> sto fail progress-table show -node NODENAME
Some LOADER Environment Stuff
Un-Muting Console Logs:
LOADER> setenv bootarg.init.console_muted false
Setting to boot as clustered:
LOADER> setenv bootarg.init.boot_clustered true
Configuring an interface for netboot:
LOADER> ifconfig e0a -addr=X.X.X.X -mask=X.X.X.X -gw=X.X.X.X -dns=X.X.X.X -domain=DNS_DOMAIN
LOADER> netboot http://X.X.X.X/netboot/kernel
Note: To see the boot loader environment variables in the clustershell or systemshell:
::*> debug kenv show
% kenv
To start a node without job manager (also see “User Space Processes” below):
LOADER> setenv bootarg.init.mgwd_jm_nostart true
For a list of job manager types:
::*> job type show
Job Manager Troubleshooting
::*> job initstate show
::*> job show
::*> job schedule show
::*> job history show
::*> job store show -id JOB_UUID
% cat /mroot/etc/log/mlog/command-history.log
% cat /mroot/etc/cluster_config/mdb/mgwd/job_history_table
% cat /mroot/etc/log/mlog/jm-restart.log
% cat /mroot/etc/log/ems
To keep an eye on the “tail” of a log:
% tail -f LOGNAME
% tail -f /var/log/notifyd*
% tail -100 /mroot/etc/log/mlog/mgwd.log | more
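For example, to follow the security daemon log (secd.log under the log location given in the next section is an assumed file name):
% tail -f /mroot/etc/log/mlog/secd.log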
Logs Location: /mroot/etc/log/mlog
Includes logs for: Message, mgwd, secd, vifmgr, vldb, notifyd
::> event log show -severity emergency
::> event log show -severity alert
::> event log show -severity critical
::> event log show -severity error
::> event log show -severity warning
::> event log show -time "01/21/2014 09:00:00".."01/22/2014 09:00:00" -severity !informational,!notice,!debug
::*> debug log files show
::*> debug log show ?
Autosupport
Invoke autosupport:
::> system autosupport invoke -node * -type all
Autosupport trigger:
::*> system node autosupport trigger show -node NODENAME
::*> system node autosupport trigger modify -node NODENAME -?
Troubleshooting asup with debug smdb:
::*> debug smdb table nd_asup_lock show
Unmounting mroot then remounting
% cd /etc
% sudo ./netapp_mroot_unmount
% sudo mgwd
Mounting /mroot for an HA Partner in Takeover (for core/logs/… collection)
% sudo mount
% sudo mount_partner
% sudo umount
Unlock diag user with mgwd not functioning
Option 1) Reboot, press Ctrl-C for the Boot Menu, and use option (3) Change password.
Option 2) If option (3) doesn’t work from the boot menu do:
Selection (1-8)? systemshell
# /usr/bin/passwd diag
# exit
Note: The same method can be used to reset admin, but you must update the password quickly after logging into the clustershell, otherwise the new password is overwritten by the original password from the RDB:
::> security login password -username admin
Panic Testing and System Coredumps
::*> system node run -node NODENAME -command panic
An in-state core from RLM/SP:
> system core
Out-state core from Clustershell:
::> reboot -node NODENAME -dump true
Out-state core from Nodeshell:
> halt -d
Out-state core from systemshell:
% sysctl debug.debugger_on_panic=0
% sysctl debug.panic=1
Controlling automatic coring, and core type (sparse contains no user data):
::> storage failover modify -onpanic true -node NODENAME
::> system coredump config modify -sparsecore-enabled true -node NODENAME
Reviewing and uploading:
::> coredump show
::> coredump upload
% scp /mroot/etc/crash/COREFILE REMOTEHOST:/path
User-Space Processes and Cores
% ps -aux
% ps -aux | grep PROCESSNAME
% pgrep PROCESSNAME
% sudo kill -TRAP PID
Note: Processes monitored by spmd restart automatically.
Root Volume and RDB (Node/Cluster) Backup and Recovery
::*> system configuration backup ?
::*> system configuration recovery ?
Items in a node configuration backup include:
- Persistent bootarg variables in cfcard/env
- varfs.tgz and oldvarfs.tgz in /cfcard/x86_64/freebsd/
- Any configuration file under /mroot/etc
In a cluster configuration backup:
- Replicated records of all the RDB rings - mgwd, VLDB, vifmgr, BCOM
Backup files @ /mroot/etc/backups
Can be redirected to a URL
Simple Management Framework (SMF)
To see all the simple management database (SMDB) iterator objects in the SMF:
::*> debug smdb table {TAB}
Examples:
::*> debug smdb table bladeTable show
::*> debug smdb table cluster show
Another way to get Volume Information
Note: Not to be used on production systems unless instructed by NGS!
::*> vol show -vserver VS1 -volume VOLNAME -fields UUID
::*> net int show -role cluster
% zsmcli -H CLUSTERICIP d-volume-list-info id=UUID desired-attrs=name
% zsmcli -H CLUSTERICIP d-volume-list-info id=UUID
User-Space Processes
RDB Applications (MGWD, VifMgr, VLDB, BCOM):
% ps aux | grep mgwd
etcetera…
Non-RDB Applications (secd, NDMP, spmd, mlogd, notifyd, schmd, sktlogd, httpd):
% ps aux | grep /sbin
List managed processes:
% spmctl -l
Stop monitoring a process:
% spmctl -dh PROCESSNAME
SPMCTL help:
% spmctl --help
List of well-known handles:
vldb, vifmgr, mgwd, secd, named, notifyd, time_state, ucoreman, env_mgr, spd, mhostexecd, bcomd, cmd, ndmpd, schmd, nchmd, shmd, nphmd, cphmd, httpd, mdnsd, sktlogd, kmip_client, raid_lm, #upgrademgr, mntsvc, coresegd, hashd, servprocd, cshmd, fpolicy, ntpd, memevt
Set MGWD to start without job manager running:
% spmctl -s -h mgwd
% sudo /sbin/mgwd --jm-nostart
To stop spmd from monitoring a process:
% spmctl -s -h vifmgr
To restart spmd monitoring:
% spmctl -e -c /sbin/vifmgr -h vifmgr
Notable processes:
secd = security daemon
notifyd = notify daemon (required for autosupport to run)
httpmgr = manager daemon for Apache httpd daemon
schmd = monitors SAS connectivity across an HA pair
nchmd = monitors SAS connectivity per node
ndmpd = used for NDMP backups
Debugging M-Host Applications:
::*> debug mtrace show
To restart vifmgr (and flush the cache):
% spmctl -h vifmgr -s
% spmctl -h vifmgr -e
Viewing the Configuration Database Info
Essentially: the CDB is local, the RDB is clusterwide.
% cdb_viewer /var/CDB.mgwd
::*> net int cdb show
::*> net port cdb show
::*> net routing-groups cdb show
::*> net routing-groups route cdb show
etcetera…
::*> debug smdb table {AS ABOVE}
For RDB status information:
% rdb_dump --help
% rdb_dump
MSIDs
MSIDs appear as the FSID in a packet trace and are also located in the VLDB. Look for errors in event log show referencing the MSID.
::*> vol show -vserver SVM -fields msid -volume VOLNAME
Identify which clients have NFS connectivity to a node and unmount clients if mounted to a volume that no longer exists:
::*> network connections active show -node NODENAME -service nfs*
::*> network connections active show -service ?
Using vReport to scan for VLDB and D-blade discrepancies
::*> debug vreport show
::*> debug vreport fix ?
DNS-Based Load Balancing
::> network interface show-zones
::> network interface modify -vserver SVM -lif LIFNAME -dns-zone CMODE.DOMAIN.COM
C:\> nslookup CMODE
C:\> nslookup CMODE.DOMAIN.COM
Note: Regular round-robin DNS can overload LIFs.
Automatic LIF Rebalancing
::*> net int show -fields allow-lb-migrate,lb-weight -vserver SVM
Note: You should create separate LIFs for CIFS and NFSv4 traffic. NFSv3 can auto-rebalance (not for use with VMware). Stateful protocols like CIFS and NFSv4 cannot auto-rebalance.
Firewall
::*> debug smdb table firewall_policy_table show
::*> firewall policy service create -service SERVICENAME -protocol tcp/udp -port PORT
::*> firewall policy service show
::*> firewall policy create -policy POLICYNAME -service SERVICENAME -action deny -ip-list X.X.X.X/X
::*> firewall policy show
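As a quick illustration of the above - the policy name and subnet here are invented purely for the example, and ssh is assumed to be a pre-defined firewall service:
::*> firewall policy create -policy blockssh -service ssh -action deny -ip-list 192.168.0.0/24
::*> firewall policy show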
Cluster Session Manager
::*> debug csm ?
::*> debug csm session show -node NODENAME
::*> event log show -messagename csm*
::*> cluster ping-cluster -node NODENAME
% sysctl sysvar.csm
% sysctl sysvar.nblade
Execution Thresholds
% sysctl -a sysvar.nblade.debug.core | grep exec
Local Fastpath (Cannot see any reason to change this but the option’s there!)
% sudo sysctl kern.bootargs=bootarg.nblade.localbypass_in=0
% sudo sysctl kern.bootargs=bootarg.nblade.localbypass_out=0
Nitrocfg
Configures and Queries Basic N-Blade Components
% nitrocfg
The RDB Site List
% cat /var/rdb/_sitelist
LIF Connection Errors
::> event log show -message Nblade.*
::> event log show -instance -seqnum SEQUENCE_ID
% sysctl -a | grep sysvar.csm.rc
% sysctl sysvar.csm.session
Examine the:
/mroot/etc/log/mlog/vifmgr.log
/mroot/etc/log/mlog/mgwd.log
LIF Failure to Migrate
::> net port show -link !up
::> net int show -status-oper !up
::> cluster show
% rdb_dump
::> event log show -node NODENAME -message VifMgr*
Unbalanced LIFs
::> network connections active show
Duplicate LIF IDs
::*> net int show -fields numeric-id
::*> net int cdb show
::*> net int ids show
% cdb_viewer /var/CDB.mgwd
::*> net int ids delete -owner SVM -lif LIFNAME
::*> debug smdb table {vifmgr? or virtual_interface?} show
Nodeshell Network-Related Commands
> ifconfig -a
> netstat
> route -gsn
> ping -i e0d X.X.X.X
> traceroute
> pktt start e0c
> pktt dump e0c
> pktt stop e0c
Systemshell Tools
% ifconfig
% nitrocfg
% netstat -rn
% route
% vifmgrclient
% pcpconfig
% ipfstat -ioh
% rdb_dump
% tcpdump -r /mroot/e0d_DATE_TIME.trc
vifmgrclient for debugging vifmgr
% vifmgrclient --verbosity-on
% vifmgrclient -set-trace-level 0cfff
% vifmgrclient -debug
::*> debug smdb table rg_routes_view show
Cluster Peer Ping (Intercluster)
::*> cluster peer ping
Miscellaneous - Using Command History
::> history
::> !LINE
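For example, to re-run the command shown at line 5 of the history output (the line number is just for illustration):
::> !5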