I haven’t done a huge number of NetApp HCI
installations (not yet into double figures). All my experience so far has been
with the H410C (Compute) and H410S (Storage) nodes.
The NetApp Deployment Engine (NDE) is very cool. You fill
in your variables, click the button to go, and it sets up your storage cluster,
ESXi hosts, vCenter Server, mNode, and vSphere Plugins.
When NDE fails to deploy to 100%, most commonly it is due
to network setup issues. It’s rarely an actual NDE issue. Arguably, it is a
good thing NDE does fail as it tells you there’s some issue with your network,
and you don’t want to go into production if the network is not 100%!
Image: What you want to see every time you run the NDE
“Your installation is complete”
Here are my lessons learned (so far):
1) Configure LACP for
Bond10G on the storage nodes.
Setting up compute and storage nodes in order to run the
NetApp Deployment Engine (NDE) is straightforward:
Rack and stack
Cable
(Optional) RTFI to a desired version
BIOS:
- Configure IPMI + Date & Time (BIOS uses UTC-0 time)
Compute TUI:
- Don’t need to configure anything unless you are putting
a VLAN tag on the iSCSI Storage Network, in which case for Bond10G configure
VLAN (and temp IP)*
Storage TUI:
- For one storage node configure a temporary IP on the
Bond1G interface
- Set Bond10G to LACP (for every storage node)
- Don’t need to configure anything else unless you are
putting a VLAN tag on the iSCSI Storage Network, in which case for Bond10G configure
VLAN (and temp IP)*
*I’ve configured temp IP on the storage network when
using VLANs, but not 100% sure I needed to...
The documentation suggests LACP is optional, but in my experience
it is necessary: without it, storage performance is poor (NDE
failed after timing out while deploying the mNode).
2) Configure ActiveIQ
after NDE has completed!
This is a tip! NDE will fail if you tick the box to configure
ActiveIQ and the firewall ports to let the mNode talk to NetApp are not open.
It’s easy to configure ActiveIQ post NDE deployment, just read the deployment
guide.
3) Remember you can
often resume NDE from another storage node if you’ve corrected the issue that
caused NDE to fail.
Just configure another Bond1G temp IP and continue.
4) If you use the ‘NDE Settings
Easy Form’ and don’t want the Management VLAN tagged, remember to untag it
before running the NDE.
My bad.
Once when I filled out the ‘NDE Settings Easy Form’ (you
need to put in a VLAN ID for management), I forgot to delete all
references to that management VLAN tag from the configuration form before submitting
it. Understandably, NDE failed.
5) NDE 1.6 Doesn’t
Support $ in the Password.
I had the NDE fail to deploy the mNode with NDE 1.6
because we had a $ sign in the password. ‘.’ and ‘!’ are fine. Don’t have a
password with $ in!
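A quick pre-flight check can catch this before NDE runs. The following is a minimal Python sketch, not an official NDE rule: the function name `find_unsafe_chars` and the `UNSAFE_CHARS` set are hypothetical, and the set contains only `$` because that is the one character observed to cause a failure here.

```python
# Characters observed to break the NDE 1.6 mNode deployment in this post.
# Only '$' is known from experience; this set is an assumption, not a
# documented NetApp restriction. Extend it if you hit other characters.
UNSAFE_CHARS = {"$"}


def find_unsafe_chars(password: str) -> list:
    """Return the sorted unsafe characters that appear in the password."""
    return sorted(set(password) & UNSAFE_CHARS)


if __name__ == "__main__":
    # Example passwords are made up for illustration.
    for candidate in ("Netapp.123!", "Netapp$123"):
        bad = find_unsafe_chars(candidate)
        status = "ok" if not bad else "contains " + " ".join(bad)
        print(f"{candidate!r}: {status}")
```

An empty list means the password avoids every character in the set; it says nothing about characters that have not been tested.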
6) If NDE fails, don’t
try to manually recover; work out what the problem was and re-deploy.
Carrying on from 5 above. It is possible to manually
deploy the mNode and we were able to successfully do this. Unfortunately, after
having spent nearly an entire morning doing this (including updating the mNode
from 2.0 to 2.1), we then couldn’t get NDE Scaler to work, so couldn’t expand
the HCI cluster with additional nodes (could still have manually expanded). We
reset** and re-ran NDE afresh.
7) Don’t try to be too
ambitious with running NDE. Better to start with minimal NDE deploy, then scale
later.
If you have lots of storage and compute nodes to deploy,
I’d recommend running NDE with the minimum 4 Storage + 2 Compute first. That way,
the probability of encountering an issue with the networking of a node (which
might cause NDE to fail) is reduced. And if NDE does fail, you don’t need to reset**
as many nodes before you try again.
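The split described above can be planned in a few lines. This is a rough Python sketch under stated assumptions: the function `plan_initial_deploy` and the node names are hypothetical, and the minimums of 4 storage and 2 compute nodes are taken from this post rather than from any NDE API.

```python
# Minimum initial deployment size as described in this post.
MIN_STORAGE = 4
MIN_COMPUTE = 2


def plan_initial_deploy(storage_nodes: list, compute_nodes: list) -> tuple:
    """Split nodes into an initial NDE run and a later scale-out batch."""
    if len(storage_nodes) < MIN_STORAGE or len(compute_nodes) < MIN_COMPUTE:
        raise ValueError("need at least 4 storage and 2 compute nodes")
    initial = {
        "storage": storage_nodes[:MIN_STORAGE],
        "compute": compute_nodes[:MIN_COMPUTE],
    }
    later = {
        "storage": storage_nodes[MIN_STORAGE:],
        "compute": compute_nodes[MIN_COMPUTE:],
    }
    return initial, later


if __name__ == "__main__":
    # Node names are placeholders for illustration.
    first, rest = plan_initial_deploy(
        ["s1", "s2", "s3", "s4", "s5", "s6"], ["c1", "c2", "c3"]
    )
    print("Initial NDE run:", first)
    print("Scale later:   ", rest)
```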
8) If the 10/25GbE
ports aren’t coming up, it might very well be the Cisco Switch firmware.
Had an issue where the 10/25GbE ports simply would not
come up. Lights out of the cable, and out of the SFP were fine. The problem was
the 7.0(3)I4(2) firmware on the Cisco switches. Once the switch firmware was
updated all was fine.
Note: Apply licenses to the switches beforehand. In one
instance we had issues enabling any speed greater than 1000 on the switch ports
without the correct license.
9) If you can ping the
gateway from a compute node, but can’t traceroute to the gateway, it means the
firewall is blocking traffic out.
If – from a device on the network – you can’t ping the
IPMI or temp Bond1G IP Address, but you can ping the default gateway from a
compute node (but can’t traceroute to the gateway), it’s likely a firewall ACL
that’s at fault.
10) Ensure Jumbo Frames
is set on all 10/25GbE switch ports
“Compute nodes cannot be displayed as their software
version is not supported be the NDE” is a false error! You get this when jumbo
frames have not been correctly set for all the 10/25GbE connections. Make sure jumbo
frames is set correctly.
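One common way to confirm jumbo frames end to end is a ping with fragmentation disabled and a payload sized to fill the MTU. The sketch below works out that payload size and builds the command. It assumes an MTU of 9000 (check your own network design) and a Linux host with the iputils `ping`, where `-M do` prohibits fragmentation and `-s` sets the payload size; the function names and the target address are hypothetical.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def jumbo_ping_command(target_ip: str, mtu: int = 9000) -> list:
    """Build a Linux (iputils) ping command with the don't-fragment bit set."""
    payload = icmp_payload_for_mtu(mtu)
    return ["ping", "-M", "do", "-s", str(payload), "-c", "3", target_ip]


if __name__ == "__main__":
    # 192.0.2.10 is a documentation address; substitute a real storage IP.
    print(" ".join(jumbo_ping_command("192.0.2.10")))
```

If the large ping fails while a default-size ping succeeds, a port or device somewhere on the path is probably not passing jumbo frames.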
11) Setting the time is
important (remember the IPMI uses UTC-0)
Another NDE failure I’ve seen at the ‘Deploying vCenter’
stage results in lots of “Waiting for VMware APIs to come online” in the NDE
log, and eventually a timeout. I’m not sure if this was fixed by opening firewall
ports, or by correcting the time setting via the IPMI (my bad, it wasn’t set to the
correct UTC-0 time).
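Since the clock needs UTC rather than local time, it helps to do the conversion explicitly rather than in your head. Here is a small Python sketch that converts a timezone-aware local time to the UTC value to enter; the function names and the sample site time (UTC+1) are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def to_utc(moment: datetime) -> datetime:
    """Convert a timezone-aware datetime to UTC."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    return moment.astimezone(timezone.utc)


def bios_clock_string(moment: datetime) -> str:
    """Format the UTC equivalent of a moment as date and 24-hour time."""
    return to_utc(moment).strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    # Hypothetical site clock reading at UTC+1.
    site_time = datetime(2020, 6, 1, 14, 30, 0, tzinfo=timezone(timedelta(hours=1)))
    print("Site time:      ", site_time.isoformat())
    print("Enter in BIOS:  ", bios_clock_string(site_time))
    print("Current UTC now:", bios_clock_string(datetime.now(timezone.utc)))
```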
12) Portfast on all
switch edge ports!
If the NDE fails at the vCenter stage and you see lots of
“Network configuration change disconnected the host 'X.X.X.X' from vCenter
server and has been rolled back.”, there is a KB: https://kb.netapp.com/app/answers/answer_view/a_id/1092527/loc/en_US
The fix is to enable portfast on all the ports on the network
switch connected to the NetApp HCI nodes (I believe setting portfast on edge
ports is a common best practice).
THE END
Note: These 3rd-party
SFPs are known to work with the H410C and H410S nodes: Flexoptix P.8525G.01
(25G SFP28 SR with dual CDR)
**Yes, you can reset the storage and compute nodes if
the NDE fails at a step and the option to restart NDE from another storage node
isn’t available to you. You don’t have to go and RTFI everything all over
again.
Compute Node Reset:
1) Power Reset
2) At the ESXi Boot Menu,
choose Safe Mode (Note: This is AFTER the NetApp Splash Screen)
3) Go to Reset Node to
Factory in the TUI
Storage Node Reset:
Contact Tech Support to
perform a storage node reset. Only they know the password!
Special Login for Storage
Node RESET (don’t need to power reset)
1) Alt+F2
2) Login with user = root,
pass = ***********
3) Run: /sf/hci/nde_reset