[StorageGRID][FabricPool] Changing From Dual Copy to EC 2+1

** Caveat lector! Totally unofficial writings. **

Scenario

We have an ONTAP lab cluster (running 9.9.1) using StorageGRID (11.5.0) as a cloud tier.

cluster2::storage aggregate object-store> show-space
                                                            
Aggregate      Object Store Name Provider Type Used Space   
-------------- ----------------- ------------- -----------
aggr1_cluster2 sgws_71           SGWS               3.04GB
aggr2_cluster2 sgws_71           SGWS               2.61GB
aggr3_cluster2 sgws_71           SGWS              771.0MB

Note: This is a lab system, so the used space is quite small. But the volumes are set to the "All" tiering policy, so the above represents pretty much all the data in our lab cluster being on StorageGRID.
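A quick way to confirm which volumes are set to tier (the tiering-policy field is standard ONTAP):

cluster2::> volume show -fields tiering-policy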

Our StorageGRID grid consists of 3 storage nodes. Our ILM policy contains one rule called "Make 2 Copies".

We want to change the "Make 2 Copies" rule to an "EC 2+1" rule.
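As a quick sanity check on the motivation: the three aggregates above hold roughly 6.4GB of tiered data. "Make 2 Copies" consumes 2.0x that in grid capacity (about 12.8GB), while EC 2+1 stores 2 data fragments plus 1 parity fragment, i.e. (2+1)/2 = 1.5x (about 9.6GB), ignoring metadata overhead.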


** See below: "Things to Check Before and After Making the Change" **

Steps

Note: You may need to acquire some information before starting, such as bucket names.

cluster2::storage aggregate object-store> config show

Name: sgws_71
Server: dc1-adm1.demo.company.com
Container Name: fabricpool-cluster2
Provider Type: SGWS
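If you want to double-check the bucket from the S3 side, the AWS CLI works against StorageGRID. A minimal sketch, assuming you have the tenant's S3 keys in a profile and know your grid's S3 endpoint (both are placeholders here):

aws s3 ls --endpoint-url https://<s3-endpoint> --profile <tenant-profile>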
  • 1) Create a Storage Pool to contain All Storage Nodes in the site
    • ILM > Storage Pools
      • Click Create to create a new Storage Pool
        • Name: Data Center 1
        • Site: Data Center 1
        • Storage Grade: All Storage Nodes
        • Click Save
  • 2) Create the new Erasure Coding Profile (skip for 11.8; see below)
    • ILM > Erasure Coding
      • Click Create to create a new EC Profile
        • Profile Name: DC1_EC_2plus1
        • Storage Pool: Data Center 1
        • Scheme: 2+1
        • Click Save
  • 3) Create a new ILM rule with the EC 2+1
    • Note: Here we could have left out the bucket name, because this lab StorageGRID has one site and only serves FabricPool as a cloud tier. But then it would be a default rule that applies to everything, which I didn't want.
    • ILM > Rules
      • Click Create to create a new ILM Rule
        • >> Step 1 of 3 <<
        • Name: DC1_EC_2plus1
        • Description:
        • Tenant Accounts (optional):
        • Bucket Name: fabricpool-cluster2
        • Click Next
        • >> Step 2 of 3 <<
        • Reference Time: Ingest Time (default)
        • From day 0 store forever (default)
        • Type: erasure coded
        • Location: Data Center 1 (DC1_EC_2plus1)
        • Click Next
        • >> Step 3 of 3 <<
        • Select Balanced for the ingest behaviour
        • Click Save
  • 4) Clone the ILM Policy
    • ILM > Policies
      • Click Clone to clone the Active policy
      • Name: 2024_06_25 Policy
      • Reason for change: Converting 2 Copies to EC 2+1 for FabricPool
      • Click the button Select Rules
        • Select Default Rule: Make 2 Copies
        • Select Other Rules: DC1_EC_2plus1
        • Click Apply
        • Verify the order of the rules is correct (the default rule is always last).
        • Click Save
  • 5) Activate the new Proposed ILM policy
    • ILM > Policies
      • Select the new Proposed policy
      • Click Activate
      • Click OK to the "Activate the proposed policy. Errors in an ILM policy can cause irreparable data loss. Review and test the policy carefully before activating. Are you sure you want to activate the proposed policy?"
All being well, the cluster won't notice any change. (The important thing is the "From day 0 store forever" placement, which means StorageGRID itself never deletes anything; objects are only deleted by the S3 client, which is ONTAP in this case.) You can check with:

event log show
system health status show
system health subsystem show
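You can also sanity check from ONTAP that the cloud tier is still reachable (the availability column should report available):

cluster2::storage aggregate object-store> show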

Monitoring ILM Progress

The best way to monitor the ILM change's progress is to watch the ILM queue build up and then settle down. It is also possible to monitor the ILM queue/scan rate using Grafana (but this is not super insightful).
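If you prefer the command line, the Grid Management API can proxy Prometheus queries. A minimal sketch, assuming the admin node from earlier and grid admin credentials (verify the endpoint path and metric names against your version's API docs; storagegrid_ilm_awaiting_client_objects is one of the documented ILM queue metrics):

TOKEN=$(curl -sk -X POST "https://dc1-adm1.demo.company.com/api/v3/authorize" \
  -H "Content-Type: application/json" \
  -d '{"username":"root","password":"<password>","cookie":false,"csrfToken":false}' | jq -r .data)

curl -skG "https://dc1-adm1.demo.company.com/api/v3/grid/metric-query" \
  -H "Authorization: Bearer $TOKEN" \
  --data-urlencode "query=sum(storagegrid_ilm_awaiting_client_objects)"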

Note: In my lab, because the data set is so small and the load non-existent, I saw pretty much nothing in the graphs except a blip on Network Traffic.


Note: In 11.8, under Support > Metrics > ILM, there are Grafana graphs, which are quite good. The example below has no data (no ILM activity).


Difference in StorageGRID 11.8

In StorageGRID 11.8, you can skip (2) above and go straight to creating the rule with an Erasure Coding profile.

Things to Check Before and After Making the Change
  1. Check Alerts
  2. Check Support > Diagnostics
  3. Check Dashboard > ILM Tab*
  4. Check Nodes > Data Center > ILM graph*
  5. Check Nodes > Storage Nodes > ILM statistics and graph*
  6. Check Nodes > Storage Nodes > Hardware For CPU and Memory utilization*
  7. Check Nodes > Storage Nodes > Network for network utilization*
*We expect all these to increase whilst the ILM conversion is happening.

You also want to capture full details of the existing ILM policy and rules, and knowledge of the buckets might be useful too (see the sketch after this list):
  1. ILM > Rules
  2. ILM > Policies
  3. ILM > Erasure coding
  4. Tenants > [Your Tenant] > Bucket details
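To record the bucket's object count and total size (useful for a before/after comparison), this AWS CLI one-liner works against StorageGRID; the endpoint and profile are placeholders for your tenant's S3 credentials:

aws s3 ls s3://fabricpool-cluster2 --recursive --summarize --endpoint-url https://<s3-endpoint> --profile <tenant-profile>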

APPENDIX: Checks Included in StorageGRID 11.8 Diagnostics

StorageGRID 11.8 Diagnostics are under SUPPORT > Diagnostics and they include many checks:
  • Node uptime
  • Cassandra automatic restarts
  • Cassandra blocked task queue too large
  • Cassandra commit log latency
  • Cassandra commit log queue depth
  • Cassandra compaction queue too large
  • Cassandra deleted data errors
  • Cassandra dropped messages
  • Cassandra garbage collection
  • Cassandra imbalanced SSTables
  • Cassandra memory
  • Cassandra memtable flushes
  • Cassandra offheap memory too high
  • Cassandra pending message queue too large
  • Cassandra read latency consistently high
  • Cassandra reclaimable space
  • Cassandra repair progress
  • Cassandra request timeouts
  • Cassandra requests unable to achieve consistency
  • Cassandra table partitions too large
  • CPU IO wait
  • CPU utilization
  • Custom SSH settings
  • Dirty page ratio
  • Disk read latency
  • Disk write latency
  • Erasure-coded groups in repair change over time
  • Erasure-coded groups repair health
  • Erasure-coded groups writable counts
  • Grid options
  • Invalid prefix corrections for bucket listing
  • LDR Storage Desired State
  • Load balancer - request timeouts
  • Load balancer - upstream connection problems
  • Load balancer - upstream retries exceeded
  • Network MTU values
  • Replicated repair jobs not progressing
  • SSD Cache Hit Rate
  • Storage Node client connections
  • Storage used - object data
  • StorageGRID version consistency
  • TCP connection tracking utilization
  • TCP retransmission rate
