Saturday, 14 September 2019

Why you need to give a FEC (with 25GbE networking)

Image: Forward Error Correction in Practice

Some reading on FEC (Forward Error Correction):


3 instances where you need to give a FEC!

1) E-Series with 25GbE for iSCSI

If you’ve got E-Series with 25GbE for iSCSI and the ports don’t come up (you don’t even get a link light, even though you can see light coming out of the SFP on one side and light arriving from the LC/LC cable on the other, so it’s not a cable transposition error), then it might be FEC.

I wasn’t getting any connection until the network guy disabled FEC on the ports.

2) 25Gb card connections on an A300/FAS8200 MCIP using Cisco 3232 switches

From the MCIP guide:

If your AFF A300 or FAS8200 system is configured using 25-Gbps connectivity, you need to set the Forward Error Correction (FEC) parameter manually to off after applying the RCF file.

The RCF file does not apply this setting if the cable connecting the controller module is not inserted into the port.

This task must be performed on all four switches in the MetroCluster IP configuration.

Steps:

Set the fec parameter to off on each 25-Gbps port, and then copy the running configuration to the startup configuration:

a. Enter configuration mode:
config t
b. Specify the 25-Gbps interface to configure:
interface interface-ID
c. Set fec to off:
fec off
d. Repeat the previous steps for each 25-Gbps port on the switch.
e. Exit configuration mode:
exit

Do a copy run start to save the changes.
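Since the same steps have to be repeated per port on all four switches, it can help to generate the command list once and paste it in. A minimal Python sketch, assuming hypothetical interface IDs (anything like `Ethernet1/9` is a placeholder; use the 25-Gbps ports on your own switches). It uses `end` rather than `exit` so the session is back at the EXEC prompt before saving, and spells out `copy run start` as `copy running-config startup-config`; both are my choices, not something the guide prescribes.

```python
def fec_off_commands(interface_ids):
    """Build the CLI lines for turning FEC off on each listed 25-Gbps port."""
    lines = ["config t"]
    for interface_id in interface_ids:
        lines.append(f"interface {interface_id}")  # step b
        lines.append("fec off")                    # step c
    lines.append("end")  # leave configuration mode entirely
    lines.append("copy running-config startup-config")
    return lines
```

Calling `fec_off_commands(["Ethernet1/9", "Ethernet1/10"])` returns the lines in paste order; print them with `"\n".join(...)`.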

3) On StorageGrid SGA6060 with 25G ports when connected to Cisco

You might need to change the autonegotiation to auto, using ethtool.

The first time is when the Pre-Grid environment is booted: log in and run ethtool against ports eth3-6.
After the reboot that is part of joining the grid, log in again and this time run ethtool against hic1-4.

Fixed in 11.3 or in a hotfix.
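I didn’t record the exact ethtool invocation, so treat the following as a sketch under an assumption: that the change needed is `ethtool -s <port> autoneg on` (a standard ethtool option). The port names come from the notes above; the function name is made up for illustration.

```python
PRE_GRID_PORTS = ["eth3", "eth4", "eth5", "eth6"]   # names seen before joining the grid
POST_JOIN_PORTS = ["hic1", "hic2", "hic3", "hic4"]  # names seen after the reboot


def autoneg_commands(ports):
    """Return one ethtool command per port (assumed form: autoneg on)."""
    return [f"ethtool -s {port} autoneg on" for port in ports]
```

Run the commands from `autoneg_commands(PRE_GRID_PORTS)` at the first login and those from `autoneg_commands(POST_JOIN_PORTS)` after the reboot.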

Lessons Learned Deploying NetApp HCI (to NDE 1.6P1)

I’ve not done a massive amount of NetApp HCI installations (not yet double figures). All my experiences so far have been with the H410C (Compute) and H410S (Storage) nodes.

The NetApp Deployment Engine (NDE) is very cool. You fill in your variables, click the button to go, and it sets up your storage cluster, ESXi hosts, vCenter Server, mNode, and vSphere Plugins.

When NDE fails to deploy to 100%, most commonly it is due to network setup issues. It’s rarely an actual NDE issue. Arguably, it is a good thing NDE does fail as it tells you there’s some issue with your network, and you don’t want to go into production if the network is not 100%!

Image: What you want to see every time you run the NDE “Your installation is complete”

Here are my lessons learned (so far):

1) Configure LACP for Bond10G on the storage nodes.

Setting up compute and storage nodes in order to run the NetApp Deployment Engine (NDE) is straightforward:

Rack and stack
Cable
(Optional) RTFI to a desired version

BIOS:
- configure IPMI + Date & Time (BIOS uses UTC-0 time)

Compute TUI:
- Don’t need to configure anything unless you are putting a VLAN tag on the iSCSI Storage Network, in which case for Bond10G configure VLAN (and temp IP)*

Storage TUI:
- For one storage node configure a temporary IP on the Bond1G interface
- Set Bond10G to LACP (for every storage node)
- Don’t need to configure anything else unless you are putting a VLAN tag on the iSCSI Storage Network, in which case for Bond10G configure VLAN (and temp IP)*

*I’ve configured temp IP on the storage network when using VLANs, but not 100% sure I needed to...

The documentation suggests LACP is optional but, in my experience, it is necessary; without it storage performance is poor (NDE failed after timing out deploying the mNode).

2) Configure ActiveIQ after NDE has completed!

This is a tip! NDE will fail if you tick the box to configure ActiveIQ and the firewall ports to let the mNode talk to NetApp are not open. It’s easy to configure ActiveIQ post NDE deployment, just read the deployment guide.

3) Remember you can often resume NDE from another storage node if you’ve corrected the issue that caused NDE to fail.

Just configure another Bond1G temp IP and continue.

4) If you use the ‘NDE Settings Easy Form’ and don’t want the Management VLAN tagged, remember to untag it before running the NDE.

My bad.

Once when I filled out the ‘NDE Settings Easy Form’ (you need to put in a VLAN ID for management), I forgot to later delete all references to that management VLAN tag from the configuration form before submitting it. And understandably NDE failed.

5) NDE 1.6 Doesn’t Support $ in the Password.

I had the NDE fail to deploy the mNode with NDE 1.6 because we had a $ sign in the password. ‘.’ and ‘!’ are fine. Don’t have a password with $ in!
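A quick pre-flight check on the password saves a failed run. This is a minimal sketch based only on what I saw with NDE 1.6: `$` caused the failure, while `.` and `!` were fine. The function name and the idea of a configurable forbidden set are mine; NDE may reject other characters I haven’t hit.

```python
def password_problems(password, forbidden="$"):
    """Return the forbidden characters that appear in the password."""
    return sorted({ch for ch in password if ch in forbidden})
```

An empty list means the password passed this one check; `password_problems("Pa$$w0rd!")` flags the `$`.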

6) If NDE fails, don’t try to recover manually; work out what the problem was and re-deploy.

Carrying on from 5 above. It is possible to manually deploy the mNode and we were able to successfully do this. Unfortunately, after having spent nearly an entire morning doing this (including updating the mNode from 2.0 to 2.1), we then couldn’t get NDE Scaler to work, so couldn’t expand the HCI cluster with additional nodes (could still have manually expanded). We reset** and re-ran NDE afresh.

7) Don’t try to be too ambitious with running NDE. Better to start with minimal NDE deploy, then scale later.

If you have lots of storage and compute nodes to deploy, I’d recommend running NDE with the minimum 4 Storage + 2 Compute first, so that the probability of encountering an issue with the networking of a node (which might cause NDE to fail) is reduced. And if NDE does fail, you don’t need to reset** as many nodes before you try again.

8) If the 10/25GbE ports aren’t coming up, it might very well be the Cisco Switch firmware.

Had an issue where the 10/25GbE ports simply would not come up. Lights out of the cable, and out of the SFP were fine. The problem was the 7.0(3)I4(2) firmware on the Cisco switches. Once the switch firmware was updated all was fine.

Note: Apply licenses to the switches beforehand. In one instance we had issues enabling any speed greater than 1000 on the switch ports without the correct license.

9) If you can ping the gateway from a compute node, but can’t traceroute to the gateway, it means the firewall is blocking traffic out.

If – from a device on the network – you can’t ping the IPMI or temp Bond1G IP Address, but you can ping the default gateway from a compute node (but can’t traceroute to the gateway), it’s likely a firewall ACL that’s at fault.

10) Ensure Jumbo Frames is set on all 10/25GbE switch ports

“Compute nodes cannot be displayed as their software version is not supported by the NDE” is a false error! You get this when jumbo frames have not been set correctly on all the 10/25GbE connections. Make sure jumbo frames are set correctly.
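One way to prove jumbo frames end to end is a ping that is not allowed to fragment, sized to fill the MTU exactly. The arithmetic is sketched below, assuming an MTU of 9000 (a common jumbo value; I haven’t stated the MTU above, so check what your switches and nodes actually use) and plain IPv4 with no header options.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_ping_payload(mtu):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
```

For an MTU of 9000 this gives 8972, which is the size to pass to something like `vmkping -d -s 8972 <ip>` on ESXi or `ping -M do -s 8972 <ip>` on Linux. If that fails while a 1472-byte ping succeeds, some hop is still at a 1500 MTU.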

11) Setting the time is important (remember the IPMI uses UTC-0)

Another NDE failure I’ve seen at the ‘Deploying vCenter’ stage results in lots of “Waiting for VMware APIs to come online” in the NDE log, and eventually a timeout. I’m not sure if this was fixed by opening firewall ports, or correcting the time setting via the IPMI (mybad, wasn’t set to the correct UTC-0 time.)
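Because the BIOS/IPMI clock expects UTC, it’s worth converting from local wall-clock time rather than doing it in your head. A small sketch using only the standard library; the function name is illustrative, and it deliberately refuses naive datetimes so a missing time zone can’t slip through.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_time):
    """Convert a timezone-aware local time to the UTC value to enter."""
    if local_time.tzinfo is None:
        raise ValueError("local_time must carry a time zone")
    return local_time.astimezone(timezone.utc)
```

For example, 10:30 in a UTC+1 zone (`timezone(timedelta(hours=1))`) converts to 09:30 UTC, which is what goes into the BIOS.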

12) Portfast on all switch edge ports!

If the NDE fails at the vCenter stage and you see lots of “Network configuration change disconnected the host 'X.X.X.X’ from vCenter server and has been rolled back.”, there is a KB: https://kb.netapp.com/app/answers/answer_view/a_id/1092527/loc/en_US

The fix is to enable portfast on all the ports on the network switch connected to the NetApp HCI nodes (I believe setting portfast on edge ports is a common best practice.)
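As with FEC, this is a per-port setting, so generating the lines can save typing. A hedged sketch: the interface names are placeholders, and the exact keyword depends on the switch OS. To my knowledge NX-OS uses `spanning-tree port type edge` (with a trailing `trunk` on trunk ports) while classic IOS uses `spanning-tree portfast`; confirm against your switch documentation before pasting.

```python
def edge_port_commands(interface_ids, nxos=True):
    """Build config lines that mark each listed port as an STP edge port."""
    # NX-OS calls portfast "port type edge"; classic IOS keeps "portfast".
    stp_line = "spanning-tree port type edge" if nxos else "spanning-tree portfast"
    lines = ["config t"]
    for interface_id in interface_ids:
        lines.append(f"interface {interface_id}")
        lines.append(stp_line)
    lines.append("end")  # back to the EXEC prompt
    return lines
```

Remember to save the running configuration afterwards, as in the FEC steps.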

THE END

Note: These 3rd-party SFPs are known to work with the H410C and H410S nodes: Flexoptix P.8525G.01 (25G SFP28 SR with dual CDR)

**Yes, you can reset the storage and compute nodes if the NDE fails at a step and the option to restart NDE from another storage node isn’t available to you. You don’t have to go and RTFI everything all over again.

Compute Node Reset:

1) Power Reset
2) At the ESXi Boot Menu, choose Safe Mode (Note: This is AFTER the NetApp Splash Screen)
3) Go to Reset Node to Factory in the TUI

Storage Node Reset:

Contact Tech Support to perform a storage node reset. Only they know the password!
Special Login for Storage Node RESET (don’t need to power reset)

1) Alt+F2
2) Login with user = root, pass = ***********
3) Run: /sf/hci/nde_reset

Monday, 2 September 2019

RTFI-ing NetApp HCI Nodes (1.6P1 Update)

Thought I’d do an update to my November 2018 post ‘Creating Bootable USB Sticks to RTFI* NetApp HCI Nodes’, since I had to create some new USB Sticks for HCI 1.6P1.

Download HCI 1.6P1 from:

solidfire-rtfi-sodium-patch3-11.3.1.5.iso
solidfire-compute-sodium-patch3-11.3.0.14235.iso

Creating the bootable USB sticks is very simple:

1) Use Rufus 3.3 for the Storage Node image

Default settings in Rufus 3.3 are fine (except I select cluster size = 8192 bytes, but don’t think it really makes a difference).

Image: Using Rufus 3.3 for Storage Node

Note: On my PC with new USB keys it took 3 minutes.

2) Use Rufus 2.6 for the Compute Node image

Default settings in Rufus 2.6 are fine.
It is important to use DD image mode (not ISO mode, which is what the Storage Node image uses).

Image: Using Rufus 2.6 for Compute Node

Note: On my PC with new USB keys it took 20 minutes.
Note: We do the Storage Node images first using Rufus 3.3, since it downloads the correct version of ISOLINUX for Rufus 2.6 to use later (Rufus 2.6 cannot download the correct version).

Further Reading

HCI - How to RTFI using a USB key

"Power on or Power Reset and during boot up process press F11 for selecting Boot Device.
In the boot device selection menu, highlight the USB option."

HCI - How to RTFI a HCI Compute Node (via BMC)

Rufus.log

What Rufus sees when a newly created compute node (DD image) key is inserted:

Found USB device 'SanDisk Ultra USB 3.0 USB Device' (0781:5591) [GP]
1 device found
No volume information for drive 0x81
Disk type: Removable, Sector Size: 512 bytes
Cylinders: 3740, TracksPerCylinder: 255, SectorsPerTrack: 63
Partition type: MBR, NB Partitions: 1
Disk ID: 0x4CE4C097
Drive has an unknown Master Boot Record
Partition 1:
  Type: Hidden NTFS (0x17)
  Size: 16.5 GB (17755537408 bytes)
  Start Sector: 0, Boot: Yes, Recognized: Yes
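
The numbers in that log can be sanity-checked with a little arithmetic. The sketch below multiplies out the reported geometry and converts byte counts to binary gigabytes; the helper names are mine, and the CHS geometry is only an approximation of the real device size.

```python
def chs_capacity_bytes(cylinders, tracks_per_cylinder, sectors_per_track, sector_size):
    """Capacity implied by a cylinder/head/sector geometry."""
    return cylinders * tracks_per_cylinder * sectors_per_track * sector_size


def to_gib(num_bytes):
    """Convert bytes to binary gigabytes (GiB, 2**30 bytes)."""
    return num_bytes / 2**30
```

With the values above, `chs_capacity_bytes(3740, 255, 63, 512)` is 30,762,547,200 bytes (roughly 30.8 GB in decimal units), and `to_gib(17755537408)` is about 16.5, which matches the "16.5 GB" Rufus prints for the partition. So Rufus appears to be reporting binary units, and the DD image occupies only part of the stick.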

Monday, 12 August 2019

Tech Roundup – 11th August 2019

Stuff collated/new since Tech Roundup – 23 June 2019 with headings:
FlexPod, Kubernetes, Microsoft, NetApp, NetApp Cloud, NetApp E-Series, NetApp HCI, NetApp.io, NetApp Tech ONTAP Podcast, NetApp TRs, Python, Security, Storage Industry News, Ubuntu, Veeam, Miscellaneous

FlexPod

Introducing Memory-Accelerated Data for FlexPod
- Optimized integration of Cisco UCS B200 M5 with Intel Optane DCPMM into the FlexPod design
- MAX Data on FlexPod is capable of 5 times more I/O operations with 25 times less latency

Image: AFF A300 v AFF A300 + MAX Data on FlexPod B200 M5

Kubernetes

Kubernetes Cheat Sheet

Image: Kubernetes Cheat Sheet 1/2

Image: Kubernetes Cheat Sheet 2/2

Also, from Linux Academy see:

AWS Developer Tools Overview and CodeCommit Cheat Sheet

Ansible Roles Explained | Cheat Sheet

Your AWS Terminology Cheat Sheet

Microsoft

Azure migration center

About the Azure Site Recovery Deployment Planner for VMware to Azure

Microsoft’s new Windows Terminal now available to download for Windows 10

The new Windows Terminal

Windows Terminal (Preview)

The system registry is no longer backed up to the RegBack folder starting in Windows 10 version 1803
This change is by design and is intended to help reduce the overall disk footprint size of Windows. To recover a system with a corrupt registry hive, Microsoft recommends that you use a system restore point.

Microsoft Teams usage passes Slack in new survey
IT pros expect its presence to double by 2020

NetApp (General)

NetApp NVMe for your database

AFF A320: NVMe Building Block for the Modern SAN

MAX Data
Turbo-charge your applications
Rocket Fuel for Your Enterprise Apps

How to add storage capacity to a NetApp ONTAP Select 9.6 cluster

Protecting Your Data: Perfect Forward Secrecy (PFS) with NetApp ONTAP

Updated MetroCluster resources page

NetApp SnapCenter Plug-in for VMware vSphere 4.2 (NetApp Data Broker 1.0):

NetApp Data Broker 1.0

NetApp Data Broker 1.0: Release Notes

NetApp Data Broker 1.0: Deployment Guide for SnapCenter Plug-in for VMware vSphere

NetApp Data Broker 1.0: Data Protection Guide for VMs, Datastores, and VMDKs using the SnapCenter Plug-in for VMware vSphere

NetApp Data Broker 1.0 Documentation

NetApp SnapCenter 4.2:

Key Value Proposition of SnapCenter 4.2 – Simplicity:
- Simplified Installation
- Simplified Operations
- Continued quality enhancements

New Features in Version 4.2:
- SnapCenter Plug-in for VMware vSphere is part of NetApp Data Broker
- Simplified Storage management (Cluster Management LIF support)
- Simplified Host Management
- Enhanced Dashboard and Monitoring
- Simplified RBAC
- SnapCenter Custom plugin Integration with Linux File system
- Configuration Checker Integration

SnapCenter 4.2 Documentation

NetApp Cloud

Deploy SQL Server Over SMB with Azure NetApp Files

Azure Migration: The Keys to a Successful Enterprise Migration to Azure

Get the Most Out of Your Oracle Databases in Cloud Volumes Service for AWS

Cloud OnAir: New high-performance storage with NetApp and Google Cloud

What's New in the Beta Release of Cloud Volumes Service for GCP?

Cloud Volumes for GCP Technical Architecture and Automated Access

Global User Accessible API with Cloud Volumes for GCP

Cloud Volumes Service for Google Cloud – bringing high-performance file storage as a service to you

Lift and DON’T shift
Free up high-performance AFF storage space by automatically tiering infrequently used data to the cloud.

Any Cloud. One Experience.

The Route to Data is Now a Multi-Lane Super-Highway with ONTAP

Manage Your Data on the World's Biggest Clouds (NetApp Cloud Volumes Service (CVS))

A CEO Speaks: Why Azure NetApp Files Delivers Better Cloud Transformation

Cloud File Sharing: Backup and Archiving

Monitoring the Costs of Underutilized EBS Volumes

Get a First Look at Cloud Volumes ONTAP for Google Cloud (Webinar)

A Tour of NetApp Cloud Insights

Microsoft Announces Azure NetApp Files is Available

NetApp E-Series

Introducing Powerful New Analytics and Orchestration for E-Series:

Solution Brief: NetApp E-Series + Grafana: Performance Monitoring

eseries-perf-analyzer

NetApp E-Series Performance Analyzer

Grafana Handout

Solution Brief: Improve IT Automation with NetApp E-Series & Ansible

Ansible Galaxy: nar_santricity_host

TR-4574: Deploying NetApp E-Series with Ansible

NetApp HCI

Disaggregated HCI Becomes a Thing
“IDC has announced a new ‘disaggregated’ subcategory of the HCI market in its most recent Worldwide Quarterly Converged Systems Tracker.  IDC is expanding the definition of HCI to include a disaggregated category with products that allow customers to scale in a non-linear fashion”

NetApp HCI Reference Architecture with Veeam Backup and Replication 9.5 Update 4

Image: NetApp HCI Reference Architecture with Veeam B&R

Element 11.3 and HCI 1.6 available on NSS (11 July):

- For Element 11.3 upgrades, use the mNode 11.1 with the latest HealthTools from NSS.
- mNode 11.3 and management services require Element 11.3 on the storage cluster (refer to the Management Node User Guide).
- NetApp HCI 1.6 Compute node image will now update the firmware and Bootstrap OS leaving ESXi and configuration data intact. Use Factory Reset option from the Compute TUI for reimaging. 

Download Links:

Element Plug-in for vCenter Server 4.3: https://mysupport.netapp.com/products/p/epvcenter.html
Element 11.3 Postman collection on GitHub:  https://github.com/solidfire/postman

Documentation Links:

HCI Documentation Center: http://docs.netapp.com/hci/index.jsp
SolidFire Documentation Center: http://docs.netapp.com/sfe-113/index.jsp
Management Node User Guide (also available from the Doc center links above): https://library.netapp.com/ecm/ecm_download_file/ECMLP2858123
Firmware and driver versions for NetApp HCI and NetApp Element software:  https://kb.netapp.com/app/answers/answer_view/a_id/1088658
Detailed procedure with screenshots to update NetApp HCI compute node firmware and driver: https://kb.netapp.com/app/answers/answer_view/a_id/1088186
In-place upgrade procedure for existing management node 11.0 or 11.1 to management node 11.3 (without requiring a new VM deployment): https://kb.netapp.com/app/answers/answer_view/a_id/1088660

NetApp.io

From June to now:

Dealing with the Unexpected

Trident 19.07 is now GA

Running a Playbook Against Multiple ONTAP Clusters

Using On-Demand Snapshots with CSI Trident

All New CSI Trident!

Welcome to Trident 19.07 Alpha!

Extending Kubernetes to Manage Trident’s State

Simple Made Simpler – Ansible Roles for ONTAP Select

NetApp Tech ONTAP Podcast

From Episode 196 to now:

Episode 203: Intel and NetApp - Edge to Core to Cloud to VMworld 2019

Episode 202: TCP Performance Enhancements in ONTAP 9.6 - CUBIC TCP

Episode 201: NVIDIA, NetApp and AI

Episode 200 - Cloud, Mentorship and Comics with Kaslin Fields

VMworld 2019 Tech ONTAP Podcast Playlist

Episode 198: NetApp A-Team ETL 2019

Episode 197: NetApp Accelerates Genomics

Episode 196: Intel, NetApp and High-Performance Computing

NetApp TRs + NVAs + Solution Briefs + White Papers

New NetApp TRs since Tech Roundup – 23 June 2019:

TR-4793: NetApp ONTAP AI and OmniSci GPU-Accelerated Analytics Platform: ONTAP in an OmniSci Environment

TR-4790: FlexPod Solution Delivery Guide

TR-4789: VMware Configuration Guide for E-Series SANtricity iSCSI Integration with ESXi 6.X: Solution Design

TR-4788: Architecting I/O-Intensive MongoDB Databases on NetApp

TR-4785: AI Deployment with NetApp E-Series and BeeGFS

TR-4760: NetApp for Oracle Database 18c: Solution Delivery Guide

TR-4758: Microsoft SQL Server 2017 on NetApp ONTAP: Solution Delivery Guide

NVAs:

NetApp HCI for Multicloud Data Protection with Cloud Volumes ONTAP

Solution Briefs:

SB-3997: AI on E-Series and BeeGFS

SB-3996: NetApp E-Series + Grafana: Performance Monitoring

SB-3995: Improve IT Automation with NetApp E-Series & Ansible

SB-3994: NetApp SANtricity Cloud Connector

SB-3831: NetApp SaaS Backup for Salesforce

White Papers:

NetApp HCI for DevOps with NetApp Kubernetes Service

Python

Recommended Python course:
Pre-requisites: Python 3 and a decent editor.

Security

Using Windows FSRM to build a Killswitch for Ransomware
“I wanted to share a solution that uses resources already built into Windows”

Florida city gives in to $600,000 bitcoin ransomware demand

Storage Industry News

The Digitization of the World: From Edge to Core

Image: Tape is making a comeback!

The LTO Program Announces Fujifilm and Sony Are Now Both Licensees of Generation 8 Technology
“LTO Seeing Continued Relevance for Archive and Offline Long-Term storage”

Boeing’s 737 Max Software Outsourced to $9-an-Hour Engineers

Ubuntu

Statement on 32-bit i386 packages for Ubuntu 19.10 and 20.04 LTS
“Thanks to the huge amount of feedback this weekend from gamers, Ubuntu Studio, and the WINE community, we will change our plan and build selected 32-bit i386 packages for Ubuntu 19.10 and 20.04 LTS.”

Veeam

IT Guide to Build Converged and Hyper-Converged Infrastructures (White Paper)

Veeam Agent for Microsoft Windows FREE

Hyper-V Virtual lab Tips and Tricks

A few articles that allow you to fully unleash the power of Hyper-V:

What’s new in Hyper-V 2016

How to configure Hyper-V virtual switch

Resource Metering in Hyper-V

Setting up Hyper-V Failover Cluster in Windows Server 2012 R2

12 things you should know about Hyper-V snapshots

Miscellaneous

NetApp ActiveIQ PAS:

You can deploy PAS on premises:

There are a few KBs out there on https://kb.netapp.com/. Do a keyword search for PAS (login first).

There’s also a 3rd party site, and if you have questions you can post to the AIQ PAS Yammer or e-mail the NG on the download site.

Saturday, 27 July 2019

How to Restore SnapCenter database - SnapCenter Repository Backup Restore

Carrying on from the previous post - Disaster Recovery of SnapCenter Server with IP Address Change - restoring SnapCenter from a repository backup is straightforward.
Note: Using a SnapCenter 4.1 lab.

Restoring with MySQL and SnapManagerCoreService Up and Running


PS> Open-SmConnection -SMSbaseUrl https://SnapCtr.demo.corp.com:8146

cmdlet Open-SmConnection at command pipeline position 1
Supply values for the following parameters:
(Type !? for Help.)
Credential

PS> (Get-SmHost).HostName
DAG1.demo.corp.com
mb1.demo.corp.com
mb2.demo.corp.com
mb3.demo.corp.com
snapctr.demo.corp.com

PS> Get-SmRepositoryBackups

Backup Name
-----------
MYSQL_DS_SC_Repository_snapctr_07-26-2019_09.34.20.3512
MYSQL_DS_SC_Repository_snapctr_07-26-2019_10.35.02.3271
MYSQL_DS_SC_Repository_snapctr_07-26-2019_11.33.04.0570
MYSQL_DS_SC_Repository_snapctr_07-26-2019_12.33.03.8936

PS> Restore-SmRepositoryBackup -BackupName MYSQL_DS_SC_Repository_snapctr_07-26-2019_12.33.03.8936 -HostName SnapCtr.demo.corp.com

SnapCenter respository restored successfully!


Note: You don’t need to use Open-SmConnection and Get-SmHost before running the SmRepositoryBackup cmdlets. I just did this to get the correct hostname.
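
If you have a long list from Get-SmRepositoryBackups and want the newest name to pass to -BackupName, the timestamp embedded in each name can be parsed. A minimal Python sketch, assuming the names always end in `MM-DD-YYYY_HH.MM.SS.<digits>` as they do in my lab output; the trailing digits are ignored because I don’t know for certain what they represent, and the function names are illustrative.

```python
import re
from datetime import datetime

# Matches e.g. "_07-26-2019_12.33.03" followed by ".8936" at the end of the name.
_STAMP = re.compile(r"_(\d{2}-\d{2}-\d{4}_\d{2}\.\d{2}\.\d{2})\.\d+$")


def backup_timestamp(name):
    """Extract the date and time embedded in a repository backup name."""
    match = _STAMP.search(name)
    if match is None:
        raise ValueError(f"unrecognised backup name: {name}")
    return datetime.strptime(match.group(1), "%m-%d-%Y_%H.%M.%S")


def latest_backup(names):
    """Return the backup name with the most recent embedded timestamp."""
    return max(names, key=backup_timestamp)
```

Fed the four names listed above, `latest_backup` picks the `12.33.03.8936` one, which is the backup restored in the example.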

One Slightly Odd Thing

One slightly odd thing is that when you restore, you lose the MySQL dmp from your backup location (perhaps this is to be expected).


PS> Restore-SmRepositoryBackup -BackupName MYSQL_DS_SC_Repository_snapctr_07-26-2019_12.33.03.8936 -HostName SnapCtr.demo.corp.com

Restore-SmRepositoryBackup: Backup files doesn't exist. Please provide a valid location that contains the SnapCenter backup files or specify RestoreFileSystem option to restore from the snapshot.

PS> Get-SmRepositoryBackups

Backup Name
-----------
MYSQL_DS_SC_Repository_snapctr_07-26-2019_09.34.20.3512
MYSQL_DS_SC_Repository_snapctr_07-26-2019_10.35.02.3271
MYSQL_DS_SC_Repository_snapctr_07-26-2019_11.33.04.0570
MYSQL_DS_SC_Repository_snapctr_07-26-2019_12.33.03.8936

PS> Restore-SmRepositoryBackup -BackupName MYSQL_DS_SC_Repository_snapctr_07-26-2019_11.33.04.0570 -HostName SnapCtr.demo.corp.com
SnapCenter respository restored successfully.


Image: After 2 restores, I’ve lost 2 dmp’s!

Q: How to Restore if the Database is Completely Gone?

If the database is completely gone and MySQL won’t start, what to do?

I’ve tried a few things but have not yet found the magic trick using DB restore CLI commands (MySQL or SnapCenter). The easiest way is going to be to revert the VM and database to a known good date.

This link looked promising (not for SnapCenter database though):

Or simply, with SnapManagerCoreService and MySQL57 stopped, replace these files with files from a known good backup -

C:\ProgramData\NetApp\SnapCenter\MySQL Data\Data

ib_buffer_pool
ib_logfile0
ib_logfile1
ibdata1

- restart MySQL57 and SnapManagerCoreService. And – hey presto – should all be back up and running again.
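
The file-copy part of that can be scripted so nothing is half-copied if the backup turns out to be incomplete. A sketch only: it assumes you have already stopped SnapManagerCoreService and MySQL57 yourself, the backup folder path is whatever you have, and the function name is made up. It checks that all four files exist before overwriting anything.

```python
import shutil
from pathlib import Path

IB_FILES = ("ib_buffer_pool", "ib_logfile0", "ib_logfile1", "ibdata1")


def restore_ib_files(backup_dir, data_dir):
    """Copy the four ib files from a known good backup into the MySQL data folder."""
    backup_dir, data_dir = Path(backup_dir), Path(data_dir)
    # Refuse to start if the backup is incomplete, so the data folder is not left mixed.
    missing = [name for name in IB_FILES if not (backup_dir / name).is_file()]
    if missing:
        raise FileNotFoundError("backup is missing: " + ", ".join(missing))
    copied = []
    for name in IB_FILES:
        shutil.copy2(backup_dir / name, data_dir / name)  # copy2 keeps timestamps
        copied.append(data_dir / name)
    return copied
```

The `data_dir` here would be the `C:\ProgramData\NetApp\SnapCenter\MySQL Data\Data` folder shown above; start the two services again once the copy returns.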

Image: Need to restore these ib files