Saturday, 14 September 2019

Lessons Learned Deploying NetApp HCI (to NDE 1.6P1)

I’ve not done a massive number of NetApp HCI installations (not yet double figures). All my experiences so far have been with the H410C (Compute) and H410S (Storage) nodes.

The NetApp Deployment Engine (NDE) is very cool. You fill in your variables, click the button to go, and it sets up your storage cluster, ESXi hosts, vCenter Server, mNode, and vSphere Plugins.

When NDE fails to deploy to 100%, most commonly it is due to network setup issues. It’s rarely an actual NDE issue. Arguably, it is a good thing NDE does fail as it tells you there’s some issue with your network, and you don’t want to go into production if the network is not 100%!

Image: What you want to see every time you run the NDE “Your installation is complete”

Here are my lessons learned (so far):

1) Configure LACP for Bond10G on the storage nodes.

Setting up compute and storage nodes in order to run the NetApp Deployment Engine (NDE) is straightforward:

Rack and stack
Cable
(Optional) RTFI to a desired version

BIOS:
- configure IPMI + Date & Time (BIOS uses UTC-0 time)

Compute TUI:
- Don’t need to configure anything unless you are putting a VLAN tag on the iSCSI Storage Network, in which case for Bond10G configure VLAN (and temp IP)*

Storage TUI:
- For one storage node configure a temporary IP on the Bond1G interface
- Set Bond10G to LACP (for every storage node)
- Don’t need to configure anything else unless you are putting a VLAN tag on the iSCSI Storage Network, in which case for Bond10G configure VLAN (and temp IP)*

*I’ve configured temp IP on the storage network when using VLANs, but not 100% sure I needed to...

The documentation suggests LACP is optional, but in my experience it is necessary: without it, storage performance is poor (NDE failed after timing out deploying the mNode).

2) Configure ActiveIQ after NDE has completed!

This is a tip! NDE will fail if you tick the box to configure ActiveIQ and the firewall ports to let the mNode talk to NetApp are not open. It’s easy to configure ActiveIQ post NDE deployment, just read the deployment guide.
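Before ticking the ActiveIQ box, a quick outbound TCP test from a machine on the same management network can show whether the firewall is likely to get in the way. The following is a minimal Python sketch, not anything NDE itself provides: the function name `tcp_port_open` is my own, and the host and port in the usage comment are placeholders, so take the real endpoints from the deployment guide.

```python
import socket

def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures
        return False

# Usage (placeholder endpoint, replace with the ones from the deployment guide):
#   tcp_port_open("activeiq.example.invalid", 443)
```

A result of False only says the connection did not complete from where you ran the test; it does not tell you which device dropped it.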

3) Remember you can often resume NDE from another storage node if you’ve corrected the issue that caused NDE to fail.

Just configure another Bond1G temp IP and continue.

4) If you use the ‘NDE Settings Easy Form’ and don’t want the Management VLAN tagged, remember to untag it before running the NDE.

My bad.

Once when I filled out the ‘NDE Settings Easy Form’ (you need to put in a VLAN ID for management), I forgot to later delete all references to that management VLAN tag from the configuration form before submitting it. And understandably NDE failed.

5) NDE 1.6 Doesn’t Support $ in the Password.

I had the NDE fail to deploy the mNode with NDE 1.6 because we had a $ sign in the password. ‘.’ and ‘!’ are fine. Don’t have a password with $ in!
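If you want to check a planned password before typing it into NDE, something like the Python sketch below would do. It only encodes what I saw: `$` caused a failure with NDE 1.6, while `.` and `!` were fine. The names `KNOWN_BAD_CHARS` and `problem_chars` are made up for this example, and other characters are simply untested here.

```python
# Only "$" is known (from this post) to break NDE 1.6; "." and "!" were fine.
# Any other character is untested, so an empty result is not a guarantee.
KNOWN_BAD_CHARS = {"$"}

def problem_chars(password: str) -> list:
    """Return the known-problematic characters found in the password, sorted."""
    return sorted(set(password) & KNOWN_BAD_CHARS)

print(problem_chars("Pa$$word!"))   # contains "$"
print(problem_chars("Pass.word!"))  # no known-bad characters
```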

6) If NDE fails, don’t try to manually recover; work out what the problem was and re-deploy.

Carrying on from 5 above. It is possible to manually deploy the mNode and we were able to successfully do this. Unfortunately, after having spent nearly an entire morning doing this (including updating the mNode from 2.0 to 2.1), we then couldn’t get NDE Scaler to work, so couldn’t expand the HCI cluster with additional nodes (could still have manually expanded). We reset** and re-ran NDE afresh.

7) Don’t try to be too ambitious with running NDE. Better to start with minimal NDE deploy, then scale later.

If you have lots of storage and compute nodes to deploy, I’d recommend running NDE with the minimum 4 Storage + 2 Compute first, as this reduces the probability of encountering an issue with the networking of a node (which might cause NDE to fail). And if NDE does fail, you don’t need to reset** as many nodes before you try again.
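To put a rough number on that, here is a small Python sketch. It assumes each node independently has the same chance of a networking problem, which is a simplification, and the 5% per-node figure is purely illustrative, not something I measured.

```python
def p_any_node_fails(p_node: float, nodes: int) -> float:
    """Probability that at least one of `nodes` has an issue, assuming independence."""
    return 1.0 - (1.0 - p_node) ** nodes

# Illustrative per-node probability of a network issue (an assumption, not a measurement)
P_NODE = 0.05

for nodes in (6, 12, 20):
    print(nodes, "nodes:", round(p_any_node_fails(P_NODE, nodes), 3))
```

With those made-up numbers, the minimum 6-node deploy hits at least one problem node roughly 26% of the time, against roughly 64% for 20 nodes in one go.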

8) If the 10/25GbE ports aren’t coming up, it might very well be the Cisco Switch firmware.

Had an issue where the 10/25GbE ports simply would not come up. Lights out of the cable, and out of the SFP were fine. The problem was the 7.0(3)I4(2) firmware on the Cisco switches. Once the switch firmware was updated all was fine.

Note: Apply licenses to the switches beforehand. In one instance we had issues enabling any speed greater than 1000 on the switch ports without the correct license.

9) If you can ping the gateway from a compute node, but can’t traceroute to the gateway, it means the firewall is blocking traffic out.

If – from a device on the network – you can’t ping the IPMI or temp Bond1G IP Address, but you can ping the default gateway from a compute node (but can’t traceroute to the gateway), it’s likely a firewall ACL that’s at fault.

10) Ensure Jumbo Frames is set on all 10/25GbE switch ports

“Compute nodes cannot be displayed as their software version is not supported be the NDE” is a false error! You get this when jumbo frames have not been correctly set for all the 10/25GbE connections, so check the jumbo frame setting on every one of those switch ports.
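A common way to test jumbo frames end to end is a ping with the don't-fragment bit set and a payload that exactly fills the MTU. The Python sketch below just works out the payload size and builds the command lines. It assumes an MTU of 9000 (this post doesn't state the value, so use whatever your design specifies), the helper names are my own, and 192.0.2.10 is a documentation placeholder address.

```python
IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header

def max_ping_payload(mtu: int) -> int:
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HEADER - ICMP_HEADER

def linux_ping_cmd(target: str, mtu: int = 9000, count: int = 3) -> list:
    # iputils ping: -M do forbids fragmentation, -s sets payload size, -c sets count
    return ["ping", "-M", "do", "-s", str(max_ping_payload(mtu)), "-c", str(count), target]

def esxi_vmkping_cmd(target: str, mtu: int = 9000) -> list:
    # ESXi vmkping: -d sets the don't-fragment bit, -s sets payload size
    return ["vmkping", "-d", "-s", str(max_ping_payload(mtu)), target]

print(" ".join(linux_ping_cmd("192.0.2.10")))
print(" ".join(esxi_vmkping_cmd("192.0.2.10")))
```

If the full-size ping fails while a normal ping works, some hop in between is not passing jumbo frames.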

11) Setting the time is important (remember the IPMI uses UTC-0)

Another NDE failure I’ve seen at the ‘Deploying vCenter’ stage results in lots of “Waiting for VMware APIs to come online” in the NDE log, and eventually a timeout. I’m not sure if this was fixed by opening firewall ports, or by correcting the time setting via the IPMI (my bad, it wasn’t set to the correct UTC-0 time).
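Since the mistake here is usually typing local time where UTC is expected, a tiny Python sketch for converting one to the other may help. The function name `utc_for_bios` is made up for this example, and the UTC+1 offset in the usage lines is only an illustration.

```python
from datetime import datetime, timedelta, timezone

def utc_for_bios(local_dt: datetime) -> str:
    """Convert a timezone-aware datetime to the UTC wall-clock string to enter."""
    if local_dt.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    return local_dt.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# Example: 10:30 local time in a UTC+1 zone (illustrative offset)
local = datetime(2019, 9, 14, 10, 30, tzinfo=timezone(timedelta(hours=1)))
print(utc_for_bios(local))  # 2019-09-14 09:30:00

# Current time, already in UTC
print(utc_for_bios(datetime.now(timezone.utc)))
```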

12) Portfast on all switch edge ports!

If the NDE fails at the vCenter stage and you see lots of “Network configuration change disconnected the host ‘X.X.X.X’ from vCenter server and has been rolled back.”, there is a KB: https://kb.netapp.com/app/answers/answer_view/a_id/1092527/loc/en_US

The fix is to enable portfast on all the ports on the network switch connected to the NetApp HCI nodes (I believe setting portfast on edge ports is a common best practice.)

THE END

Note: These 3rd-party SFPs are known to work with the H410C and H410S nodes: Flexoptix P.8525G.01 (25G SFP28 SR with dual CDR)

**Yes, you can reset the storage and compute nodes if the NDE fails at a step and the option to restart NDE from another storage node isn’t available to you. You don’t have to go and RTFI everything all over again.

Compute Node Reset:

1) Power Reset
2) At the ESXi Boot Menu, choose Safe Mode (Note: This is AFTER the NetApp Splash Screen)
3) Go to Reset Node to Factory in the TUI

Storage Node Reset:

Contact Tech Support to perform a storage node reset. Only they know the password!
Special Login for Storage Node RESET (don’t need to power reset)

1) Alt+F2
2) Login with user = root, pass = ***********
3) Run: /sf/hci/nde_reset
