Using Active/Active DFS Namespace as a Site Failover Mechanism

Introduction

As DFS namespaces grow, managing DR efficiently becomes a matter of interest. One idea for a large namespace is to use Active/Active DFS: each folder has two targets, one at the primary site (with read-write data) and one at the secondary site (with read-only data). Both targets are enabled in DFS, but the DR path is kept down at the storage layer, so clients should simply use the active primary one!

Image: Example of a DFS Namespace and a target with active/active (enabled/enabled) targets
In the above image, PRICLU1V1 is the Enabled (UP) path, and SECCLU1V1 is the Enabled (DOWN) path. We can control the UP/DOWN state on a NetApp Clustered ONTAP system by simply running the command:

::> net int modify -vserver VSERVERNAME -lif LIFNAME -status-admin {up|down}

But Does it Work?

Not quite as we’d like. The first time you connect to one of the folders in the DFS namespace, there is a timeout of many seconds (around 20 in the lab test below) before the folder share opens, if you have been referred to the down target first. This delay is easy to demonstrate by creating and running a batch file like the one below:

ECHO %TIME%
net use * \\lab.priv\NASTEST\TEST1
ECHO %TIME%
net use * /delete /YES
ECHO %TIME%
net use * \\lab.priv\NASTEST\TEST1
ECHO %TIME%
net use * /delete /YES
REM Repeat the above as many times as required!
PAUSE

As an example in a test lab:

C:\Users\Administrator\Desktop>ECHO 13:03:05.23

13:03:05.23

C:\Users\Administrator\Desktop>net use * \\lab.priv\NASTEST\TEST2

Drive Z: is now connected to \\lab.priv\NASTEST\TEST2.
The command completed successfully.

C:\Users\Administrator\Desktop>ECHO 13:03:26.44

13:03:26.44

C:\Users\Administrator\Desktop>net use * /delete /YES
C:\Users\Administrator\Desktop>ECHO 13:03:26.46

13:03:26.46

C:\Users\Administrator\Desktop>net use * \\lab.priv\NASTEST\TEST2

Drive Z: is now connected to \\lab.priv\NASTEST\TEST2.
The command completed successfully.

C:\Users\Administrator\Desktop>ECHO 13:03:26.47

13:03:26.47

Notice that the first connection took over 20 seconds (13:03:05.23 to 13:03:26.44), while the second took just 0.01 of a second!
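The elapsed times can be checked by subtracting the %TIME% stamps. Below is a minimal Python sketch, assuming the HH:MM:SS.cc format shown in the lab output and that both stamps fall on the same day. The function name is illustrative only, and a locale that prints a comma as the decimal separator would need a different format string.

```python
from datetime import datetime

def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two %TIME% stamps such as '13:03:05.23'."""
    fmt = "%H:%M:%S.%f"  # %f pads '23' to 230000 microseconds
    t0 = datetime.strptime(start.strip(), fmt)
    t1 = datetime.strptime(end.strip(), fmt)
    return round((t1 - t0).total_seconds(), 2)

# Stamps taken from the lab transcript
first = elapsed_seconds("13:03:05.23", "13:03:26.44")
second = elapsed_seconds("13:03:26.46", "13:03:26.47")
print(first, second)
```

With the lab stamps this gives 21.21 seconds for the first connection and 0.01 seconds for the second.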

Why?

“... there was still a link in the namespace to a server that was down, so the long pause when opening DFS was because it was searching for that server and failing.”

And remember, this applies to every DFS folder/link that hasn’t already been cached (and where you’re referred to the down target first). Once you’re connected there is no problem, but this delay can impact login times, and if a drive gets disconnected, the wait to reconnect will contribute to a poor end-user experience, which is unacceptable!
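To make the cause of the delay concrete, here is a deliberately simplified Python model of a client walking its referral list in order. It is only a sketch: the 21-second timeout and 0.01-second connect time are assumptions taken from the lab numbers above, the function name is made up for illustration, and a real DFS client (SMB timeouts, referral caching) is more involved than this.

```python
def connect_delay(referrals, reachable, timeout_s=21.0, connect_s=0.01):
    """Return (target, seconds) for a client that tries referrals in order.

    Simplified model: every unreachable target costs one full timeout
    before the client moves on to the next target in the list.
    """
    waited = 0.0
    for target in referrals:
        if target in reachable:
            return target, round(waited + connect_s, 2)
        waited += timeout_s  # target is down: wait out the timeout
    return None, round(waited, 2)  # no target answered

up = {"PRICLU1V1"}  # SECCLU1V1 has its LIF administratively down

primary_first = connect_delay(["PRICLU1V1", "SECCLU1V1"], up)
secondary_first = connect_delay(["SECCLU1V1", "PRICLU1V1"], up)
print(primary_first)    # up target referred first: no delay
print(secondary_first)  # down target referred first: one timeout
```

In this model the whole penalty comes from the order of the referral list, which is why the sections that follow concentrate on referral ordering.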

What can we do?

DFS Properties in DFS Management

What options are there in DFS Management around this?

Namespace Referrals Settings

Image: Namespace Referrals Settings
Image: Namespace Referrals Ordering Method

Folder Referrals Settings

Image: Folder Referrals Settings

Folder Target Referrals Settings

Image: Folder Target Referrals Settings

Note: The DFS server in the images and examples above is a Windows Server 2008 R2 DFS box!

How to Fix the Problem - Part 1?

After going to the effort of creating subnets and sites in Active Directory Sites and Services, setting the site link costs to favour primary (most things are in the default site), and configuring referral ordering, the behaviour was more predictable. There was still a ~20 second timeout when connecting to a link after failover (but not for every link). When primary was up, there was never a delay. And once connected (whether in failover or not), connections were consistently quick, both while the DFS targets were cached (for 1800 seconds) and even beyond that time.
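The effect of site link costs on referral order can be sketched in a few lines of Python. This is only an illustration of "lowest cost" ordering: the site names (Branch, Primary, Secondary) and the cost values are hypothetical, the function is not part of any DFS tooling, and ties are broken by name here purely to keep the output deterministic.

```python
def order_referrals(targets, client_site, site_cost):
    """Order folder targets by cost from the client's site (lowest first).

    targets:   {target_name: site_name}
    site_cost: {(site_a, site_b): cost}; a target in the client's own
               site is treated as cost 0.
    """
    def cost(target):
        site = targets[target]
        if site == client_site:
            return 0
        for key in ((client_site, site), (site, client_site)):
            if key in site_cost:
                return site_cost[key]
        return float("inf")  # no known link: sort last

    return sorted(targets, key=lambda t: (cost(t), t))

# Hypothetical sites and costs, for illustration only
targets = {"PRICLU1V1": "Primary", "SECCLU1V1": "Secondary"}
normal = order_referrals(
    targets, "Branch",
    {("Branch", "Primary"): 100, ("Branch", "Secondary"): 200})
failover = order_referrals(
    targets, "Branch",
    {("Branch", "Primary"): 100, ("Branch", "Secondary"): 50})
print(normal)
print(failover)
```

With the first set of costs the primary target is referred first. Lowering the cost to the secondary site puts the secondary target at the head of the list instead.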

Remember “The DFSN client connectivity design isn’t for instant failover; it’s for geographical high availability and closest targeting. If you need instant failover, clustering is the way to go.”

Still, why the 20 second timeout? Can we not reduce or fix it?

Image: DFS Folder Targets (no longer using “Default-First-Site-Name”)
Image: Site Link Costs
Image: Folder Target Override Referral Ordering for Primary
Image: Folder Target Override Referral Ordering for Secondary

How to Fix the Problem - Part 2?

A very useful tool for troubleshooting DFS referrals is dfsutil. For this topic, the relevant commands are the ones that display and flush the client’s referral cache (the PKT):

dfsutil /pktinfo
dfsutil /pktflush


One thing that had popped up a few times whilst researching this is:


And “DFSDnsConfig registry key must be added to each server that will participate in the DFS namespace for all computers to understand fully qualified names ... which includes the DFS Namespace Servers and the Domain Controllers.”

So, following the instructions which are essentially:

1) Run the following on all DCs and DFS servers:

Dfsutil.exe server registry dfsdnsconfig set

2) Restart DFS on all the DCs and DFS servers:

net stop dfs
net start dfs

3) Recreate the namespace and create all folder targets with FQDNs:

Image: Folder Targets with FQDN path

... still had the 20 second timeouts! (Perhaps something was missed...)

To be continued...

PS Another idea: if active/active DFS is desired, but we want the secondary referred first in a failover, could we update the site cost to make it more attractive (so that it is referred first)?
