Saturday, 12 October 2019

Q: Can you Resync a Mirror-Vault with a Sync-SnapMirror Destination post Cutover?

A: No

Below is an experiment to see whether a mirror-vault relationship can be resynced to a Sync (here StrictSync) SnapMirror destination volume post cutover. The result was no.

SnapMirror Synchronous (SM-S) has two modes:
- SnapMirror Synchronous Mode
- SnapMirror Strict Synchronous Mode

Note: SM-S is targeted at relatively short distances, with a round-trip time (RTT) of less than 10ms.
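
The two modes map to the built-in SnapMirror policies "Sync" and "StrictSync". As a quick check (a hedged example - policy and field names as I recall them for ONTAP 9.5), the policy type should show as sync-mirror for Sync and strict-sync-mirror for StrictSync:

cluster2::> snapmirror policy show -policy Sync -fields type
cluster2::> snapmirror policy show -policy StrictSync -fields type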

1) Pre-requisites for Synchronous SnapMirror (SM-S)

Source and destination clusters must be running ONTAP 9.5 or later
Clusters must be peered
SVMs must be peered


cluster1::> version
cluster1::> cluster peer show
cluster1::> vserver peer show


2) Setup the Strict Synchronous Mirror Relationships


cluster2::> vol create -vserver svm_dst1 -volume OCF_dest_sm_s -aggregate aggr_dp -size 32.99G -type DP
cluster2::> vol create -vserver svm_dst1 -volume ODATA_dest_sm_s -aggregate aggr_dp -size 40G -type DP
cluster2::> vol create -vserver svm_dst1 -volume OFRA_dest_sm_s -aggregate aggr_dp -size 30G -type DP

cluster2::> snapmirror create -source-path svm_src1:OCF -destination-path svm_dst1:OCF_dest_sm_s -policy StrictSync
cluster2::> snapmirror create -source-path svm_src1:ODATA -destination-path svm_dst1:ODATA_dest_sm_s -policy StrictSync
cluster2::> snapmirror create -source-path svm_src1:OFRA -destination-path svm_dst1:OFRA_dest_sm_s -policy StrictSync

Warning: You are creating a SnapMirror relationship with a policy of type "strict-sync-mirror" that only supports all LUN based applications with FCP and iSCSI protocols, as well as NFSv3 protocol for enterprise applications such as databases, VMWare, etc.
Warning: For a SnapMirror relationship with a policy of type "strict-sync-mirror", client I/O will fail in order to maintain strict synchronization when the secondary is inaccessible.
Do you want to continue? y

cluster2::> snapmirror initialize -destination-path svm_dst1:OCF_dest_sm_s
cluster2::> snapmirror initialize -destination-path svm_dst1:ODATA_dest_sm_s
cluster2::> snapmirror initialize -destination-path svm_dst1:OFRA_dest_sm_s

cluster2::> snapmirror show -policy StrictSync -fields status,policy,lag-time
source-path    destination-path         policy     status lag-time
-------------- ------------------------ ---------- ------ --------
svm_src1:OCF   svm_dst1:OCF_dest_sm_s   StrictSync InSync 0:0:0
svm_src1:ODATA svm_dst1:ODATA_dest_sm_s StrictSync InSync 0:0:0
svm_src1:OFRA  svm_dst1:OFRA_dest_sm_s  StrictSync InSync 0:0:0
3 entries were displayed.


3) Mirror-Vault the Source Volumes

cluster1::> snapshot policy create -vserver cluster1 -policy 24_hourly -enabled true -schedule1 hourly -count1 24 -snapmirror-label1 hourly -prefix1 hourly
cluster1::> volume modify -volume OCF,ODATA,OFRA -snapshot-policy 24_hourly -vserver svm_src1

cluster2::> vserver create -vserver svm_mirror_vault -subtype default -rootvolume svm_root -rootvolume-security-style unix -language C.UTF-8 -snapshot-policy none -aggregate aggr_dp
cluster2::> vserver peer create -vserver svm_mirror_vault -peer-vserver svm_src1 -peer-cluster cluster1 -applications snapmirror

cluster1::> vserver peer accept -vserver svm_src1 -peer-vserver svm_mirror_vault

cluster2::> snapmirror policy create -vserver cluster2 -policy 96_hourly -tries 8 -transfer-priority normal -ignore-atime false -restart always -type mirror-vault
cluster2::> snapmirror policy add-rule -vserver cluster2 -policy 96_hourly -snapmirror-label hourly -keep 96
cluster2::> vol create -volume OCF_mv -aggregate aggr_dp -size 32.99G -type DP -vserver svm_mirror_vault -language en_US.UTF-8
cluster2::> vol create -volume ODATA_mv -aggregate aggr_dp -size 40G -type DP -vserver svm_mirror_vault -language en_US.UTF-8
cluster2::> vol create -volume OFRA_mv -aggregate aggr_dp -size 30G -type DP -vserver svm_mirror_vault -language en_US.UTF-8
cluster2::> snapmirror create -source-path svm_src1:OCF -destination-path svm_mirror_vault:OCF_mv -policy 96_hourly -schedule hourly
cluster2::> snapmirror create -source-path svm_src1:ODATA -destination-path svm_mirror_vault:ODATA_mv -policy 96_hourly -schedule hourly
cluster2::> snapmirror create -source-path svm_src1:OFRA -destination-path svm_mirror_vault:OFRA_mv -policy 96_hourly -schedule hourly
cluster2::> snapmirror initialize -destination-path svm_mirror_vault:*
cluster2::> snapmirror show -destination-path svm_mirror_vault:* -fields state,status
source-path    destination-path          state        status
-------------- ------------------------- ------------ ------
svm_src1:OCF   svm_mirror_vault:OCF_mv   Snapmirrored Idle
svm_src1:ODATA svm_mirror_vault:ODATA_mv Snapmirrored Idle
svm_src1:OFRA  svm_mirror_vault:OFRA_mv  Snapmirrored Idle
3 entries were displayed.
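
To confirm the vault side is retaining the labelled Snapshot copies, a hedged check (the snapmirror-label field is exposed by volume snapshot show):

cluster2::> snapshot show -vserver svm_mirror_vault -volume OCF_mv -fields snapmirror-label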


4) Failover to Secondary Site on Primary Site Failure


cluster1::> vol offline -volume OCF -vserver svm_src1
cluster1::> vol offline -volume ODATA -vserver svm_src1
cluster1::> vol offline -volume OFRA -vserver svm_src1

Volume "OFRA" in Vserver "svm_src1" has LUNs associated with it. Taking this volume offline will disable the LUNs until the volume is brought online again.
The LUNs will not appear in the output of "lun show" and any clients using the LUNs will experience a data service outage. Are you sure you want to continue? y

Warning: Volume "OFRA" in Vserver "svm_src1" is currently part of a SnapMirror Synchronous relationship. Taking the volume offline operation will disrupt the zero RPO protection. Do you want to continue? y

cluster2::> snapmirror quiesce -destination-path svm_dst1:OCF_dest_sm_s
cluster2::> snapmirror quiesce -destination-path svm_dst1:ODATA_dest_sm_s
cluster2::> snapmirror quiesce -destination-path svm_dst1:OFRA_dest_sm_s
cluster2::> snapmirror break -destination-path svm_dst1:OCF_dest_sm_s
cluster2::> snapmirror break -destination-path svm_dst1:ODATA_dest_sm_s
cluster2::> snapmirror break -destination-path svm_dst1:OFRA_dest_sm_s
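
At this point the destination volumes are writable. A quick sanity check - a hedged sketch, with the LUN mapping work on svm_dst1 omitted - is to confirm the relationships are Broken-off and the volumes now report type RW:

cluster2::> snapmirror show -destination-path svm_dst1:* -fields state
cluster2::> vol show -vserver svm_dst1 -volume *_dest_sm_s -fields type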


5) See if we can resync our Mirror-Vaults


cluster2::> snapshot policy create -vserver cluster2 -policy 24_hourly -enabled true -schedule1 hourly -count1 24 -snapmirror-label1 hourly -prefix1 hourly
cluster2::> volume modify -vserver svm_dst1 -volume OCF_dest_sm_s -snapshot-policy 24_hourly
cluster2::> volume modify -vserver svm_dst1 -volume ODATA_dest_sm_s -snapshot-policy 24_hourly
cluster2::> volume modify -vserver svm_dst1 -volume OFRA_dest_sm_s -snapshot-policy 24_hourly

cluster2::> vserver peer create -vserver svm_dst1 -peer-vserver svm_mirror_vault -applications snapmirror
cluster2::> snapmirror delete -destination-path svm_mirror_vault:OCF_mv
cluster2::> snapmirror delete -destination-path svm_mirror_vault:ODATA_mv
cluster2::> snapmirror delete -destination-path svm_mirror_vault:OFRA_mv
cluster2::> snapmirror create -source-path svm_dst1:OCF_dest_sm_s -destination-path svm_mirror_vault:OCF_mv -policy 96_hourly -schedule hourly
cluster2::> snapmirror create -source-path svm_dst1:ODATA_dest_sm_s -destination-path svm_mirror_vault:ODATA_mv -policy 96_hourly -schedule hourly
cluster2::> snapmirror create -source-path svm_dst1:OFRA_dest_sm_s -destination-path svm_mirror_vault:OFRA_mv -policy 96_hourly -schedule hourly
cluster2::> snapmirror show -destination-vserver svm_mirror_vault
Source Path              Type Destination Path          Mirror State Relationship Status Total Progress Healthy Last Updated
------------------------ ---- ------------------------- ------------ ------------------- -------------- ------- ------------
svm_dst1:OCF_dest_sm_s   XDP  svm_mirror_vault:OCF_mv   Broken-off   Idle                -              true    -
svm_dst1:ODATA_dest_sm_s XDP  svm_mirror_vault:ODATA_mv Broken-off   Idle                -              true    -
svm_dst1:OFRA_dest_sm_s  XDP  svm_mirror_vault:OFRA_mv  Broken-off   Idle                -              true    -
3 entries were displayed.

cluster2::> snapmirror delete -destination-path svm_dst1:OCF_dest_sm_s
cluster2::> snapmirror delete -destination-path svm_dst1:ODATA_dest_sm_s
cluster2::> snapmirror delete -destination-path svm_dst1:OFRA_dest_sm_s

cluster2::> snapmirror resync -destination-path  svm_mirror_vault:OCF_mv
Error: command failed: No common Snapshot copy found between svm_dst1:OCF_dest_sm_s and svm_mirror_vault:OCF_mv.

cluster2::> snapmirror resync -destination-path  svm_mirror_vault:ODATA_mv
Error: command failed: No common Snapshot copy found between svm_dst1:ODATA_dest_sm_s and svm_mirror_vault:ODATA_mv.

cluster2::> snapmirror resync -destination-path  svm_mirror_vault:OFRA_mv
Error: command failed: No common Snapshot copy found between svm_dst1:OFRA_dest_sm_s and svm_mirror_vault:OFRA_mv.

cluster2::> snapshot show -volume OCF_mv
                                                                 ---Blocks---
Vserver  Volume   Snapshot                                  Size Total% Used%
-------- -------- ------------------------------------- -------- ------ -----
svm_mirror_vault OCF_mv
         hourly.2019-10-12_1205                                                          348KB     0%    0%
         snapmirror.022e075b-ecdf-11e9-9600-005056b01916_2160175154.2019-10-12_120500  249.6MB     1%    2%
         hourly.2019-10-12_1305                                                          308KB     0%    0%
         snapmirror.022e075b-ecdf-11e9-9600-005056b01916_2160175154.2019-10-12_130500  147.1MB     0%    1%

cluster2::> snapshot show -volume OCF_dest_sm_s
                                                                 ---Blocks---
Vserver  Volume   Snapshot                                  Size Total% Used%
-------- -------- ------------------------------------- -------- ------ -----
svm_dst1 OCF_dest_sm_s
         snapmirror.12ceb7f0-b078-11e8-baec-005056b013db_2160175147.2019-10-12_120504  272.3MB     1%    2%
         snapmirror.12ceb7f0-b078-11e8-baec-005056b013db_2160175147.2019-10-12_130504  30.17MB     0%    0%
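
The Snapshot listings explain the failure: OCF_mv's Snapshot copies all come from the old relationship with svm_src1 (the snapmirror.022e075b... copies), while the new source volume OCF_dest_sm_s only holds Snapshot copies from its own Sync relationship (the snapmirror.12ceb7f0... copies). With no Snapshot copy in common between the two volumes, SnapMirror has no base from which to resync.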


6) Performing the Reverse Resync

After failing to resync the mirror-vault relationships to the Sync-SnapMirror destination volumes, I found I also could not reverse resync the Sync-SnapMirror relationships back to cluster1 - deleting the original Sync-SnapMirror relationships above was likely the problem. OnCommand System Manager has a Reverse Resync button that makes the reverse resync - restoring services back to the primary cluster - easy.

Image: Reverse Resync button in OnCommand System Manager
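
For reference, a reverse resync from the CLI would normally look something like the below (a hedged sketch, assuming the original relationships - and therefore common Snapshot copies - still exist, using this lab's volume paths):

cluster1::> vol online -volume OCF -vserver svm_src1
cluster1::> snapmirror create -source-path svm_dst1:OCF_dest_sm_s -destination-path svm_src1:OCF -policy StrictSync
cluster1::> snapmirror resync -destination-path svm_src1:OCF

Once the reverse relationship is InSync, failing back is the same quiesce/break routine in the opposite direction, followed by a resync of the original relationships.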

Friday, 11 October 2019

Trident Resources and Collateral

20191009: Stateful Workloads in Kubernetes with Trident — The NetApp Way:

Trident: Automate and Orchestrate Storage in a Container World:

Image: Trident: Automate and Orchestrate Storage in a Container World.

20191008: Trident and Disaster Recovery Pt. 1:

Trident GitHub page (download):

ThePub @ NetApp YouTube channel:
With Trident and Ansible and more - installation, configuration and how-to videos:

Trident Documentation:
Step-by-step instructions, sample code, detailed description of features/functions:

Customer Success Story:

20190911: Storage in a DevOps World:
An enlightening DevOps podcast from Chris Merz.


And if you’re attending NetApp Insight 2019 Las Vegas (or catching up on videos from Insight), check out these Trident Sessions.

Trident Sessions at Insight 2019

1362-3 CSI Trident: State of the Art storage provisioning for Kubernetes
Trident has adopted the container storage interface (CSI) while retaining the ability to innovate outside the limited confines of CSI, and Trident now supports NetApp Snapshot copies and clones for CSI volumes. Trident has moved from etcd to Kubernetes-native custom resource definitions (CRDs) for its internal state, which greatly simplifies Trident deployment.

1361-3 Hybrid Cloud and Containers: A Perfect Match
Container orchestrators such as Kubernetes enable the automation of deployment, scaling, and management of applications in your cloud of choice.

2022-2 Verizon: Persistent Storage and Cloud Bursting for Kubernetes
Customer presentation.


Finally, stuff on the NetApp Field Portal (for NetApp Partners and NetApp Internal):

When Data Disks in a RAID Group can be Smaller than Other Data Disks in the RAID Group

Something I noticed recently that I didn't know (not sure how useful this knowledge actually is): you can have smaller data disks in a RAID group alongside larger data disks, and the larger data disks are not right-sized down to the size of the smaller disk!

The rule is that data disks in a RAID group can be smaller than the largest data disk without causing the larger data disks to be down-sized, as long as the parity disks are as large as the largest data disk. The smaller disks simply contribute their smaller usable size.
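
If you want to preview the effect before committing, the layout previews shown in the tests below can be produced non-destructively (hedged: the -simulate parameter and the usable-size field are as I recall them; check them against your ONTAP version):

CLU01::> storage disk show -fields physical-size,usable-size
CLU01::> aggr create -aggregate TEST_SSD2 -disklist NET-1.37,NET-1.38,NET-1.39,NET-1.40,NET-1.41 -simulate true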

Here are a few tests/examples:

TEST 1: Create an aggregate with 5 * 113.4MB SSD and usable size = 233.8MB
TEST 2: Add a 527.6MB SSD to the aggregate created in TEST 1, and - as expected - the disk would be downsized
TEST 3: Create an aggregate with 6 * 527.6MB SSD + 1 * 113.4MB SSD and usable size = 1.83GB (not 389.5MB as might have been expected)
TEST 4: Create an aggregate with 7 * 527.6MB SSD and add 1 * 113.4MB SSD
TEST 5: What happens when we try to replace the 113.4MB SSD with a 527.6MB SSD?

Test Results

TEST 1: Create an aggregate with 5 * 113.4MB SSD and usable size = 233.8MB


CLU01::> aggr create -aggregate TEST_SSD2 -disklist NET-1.37,NET-1.38,NET-1.39,NET-1.40,NET-1.41
Info: The layout for aggregate "TEST_SSD2" on node "CLU01-01" would be:
      First Plex
        RAID Group rg0, 5 disks (block checksum, raid_dp)
                                                            Usable Physical
          Position   Disk                      Type           Size     Size
          ---------- ------------------------- ---------- -------- --------
          dparity    NET-1.37                  SSD               -        -
          parity     NET-1.38                  SSD               -        -
          data       NET-1.39                  SSD         86.56MB  113.4MB
          data       NET-1.40                  SSD         86.56MB  113.4MB
          data       NET-1.41                  SSD         86.56MB  113.4MB
      Aggregate capacity available for volume use would be 233.8MB.


TEST 2: Add a 527.6MB SSD to the aggregate created in TEST 1, and - as expected - the disk would be downsized


CLU01::> aggr add-disk -aggregate TEST_SSD2 -disklist NET-1.8 -raidgroup rg0
Info: Disks would be added to aggregate "TEST_SSD2" on node "CLU01-01" in the following manner:
      First Plex
        RAID Group rg0, 6 disks (block checksum, raid_dp)
                                                            Usable Physical
          Position   Disk                      Type           Size     Size
          ---------- ------------------------- ---------- -------- --------
          data       NET-1.8                   SSD         86.56MB  527.6MB
      Aggregate capacity available for volume use would be increased by 77.93MB.
      WARNING: One or more disks to be added will be downsized.


TEST 3: Create an aggregate with 6 * 527.6MB SSD + 1 * 113.4MB SSD and usable size = 1.83GB (not 389.5MB as might have been expected)


CLU01::> aggr create -aggregate TEST_SSD -disklist NET-1.1,NET-1.46,NET-1.47,NET-1.48,NET-1.49,NET-1.50,NET-1.51
Info: The layout for aggregate "TEST_SSD" on node "CLU01-01" would be:
      First Plex
        RAID Group rg0, 7 disks (block checksum, raid_dp)
                                                            Usable Physical
          Position   Disk                      Type           Size     Size
          ---------- ------------------------- ---------- -------- --------
          dparity    NET-1.46                  SSD               -        -
          parity     NET-1.47                  SSD               -        -
          data       NET-1.1                   SSD         86.56MB  113.4MB
          data       NET-1.48                  SSD           500MB  527.6MB
          data       NET-1.49                  SSD           500MB  527.6MB
          data       NET-1.50                  SSD           500MB  527.6MB
          data       NET-1.51                  SSD           500MB  527.6MB
      Aggregate capacity available for volume use would be 1.83GB.

CLU01::> aggr show

Aggregate     Size Available Used% State   RAID Status
--------- -------- --------- ----- ------- ------------
TEST_SSD    1.83GB    1.83GB    0% online  raid_dp,normal


Image: The smaller data disk doesn’t affect the usable size of the larger data disks - it just gets truncated!

TEST 4: Create an aggregate with 7 * 527.6MB SSD and add 1 * 113.4MB SSD


CLU01::> aggr create -aggregate TESTSSD3 -disklist NET-1.6,NET-1.7,NET-1.52,NET-1.53,NET-1.54,NET-1.55,NET-1.56
Info: The layout for aggregate "TESTSSD3" on node "CLU01-01" would be:
      First Plex
        RAID Group rg0, 7 disks (block checksum, raid_dp)
                                                            Usable Physical
          Position   Disk                      Type           Size     Size
          ---------- ------------------------- ---------- -------- --------
          dparity    NET-1.6                   SSD               -        -
          parity     NET-1.7                   SSD               -        -
          data       NET-1.52                  SSD           500MB  527.6MB
          data       NET-1.53                  SSD           500MB  527.6MB
          data       NET-1.54                  SSD           500MB  527.6MB
          data       NET-1.55                  SSD           500MB  527.6MB
          data       NET-1.56                  SSD           500MB  527.6MB
      Aggregate capacity available for volume use would be 2.20GB.

CLU01::> aggr add-disk -aggr "TESTSSD3" -disklist NET-1.4 -raidgroup rg0
Info: Disks would be added to aggregate "TESTSSD3" on node "CLU01-01" in the following manner:
      First Plex
        RAID Group rg0, 8 disks (block checksum, raid_dp)
                                                            Usable Physical
          Position   Disk                      Type           Size     Size
          ---------- ------------------------- ---------- -------- --------
          data       NET-1.4                   SSD         86.56MB  113.4MB
      Aggregate capacity available for volume use would be increased by 77.93MB.

CLU01::> aggr show
Aggregate     Size Available Used% State   RAID Status
--------- -------- --------- ----- ------- ------------
TESTSSD3    2.27GB    2.27GB    0% online  raid_dp,normal


TEST 5: What happens when we try to replace the 113.4MB SSD with a 527.6MB SSD?


CLU01::> disk replace -disk NET-1.4 -replacement NET-1.51 -action start
Warning: Replacement 520.5MB disk will be downsized to 107.1MB. Continue? {y|n}: n
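
So once a data disk position has been established at the smaller size, even a physically larger replacement is right-sized down to match it (here 520.5MB down to 107.1MB raw); answering n avoids wasting a 527.6MB SSD in a 113.4MB slot.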