Recently I had the honour of performing a NetApp 7-Mode MetroCluster DR Test. After my previous outing, which can be read in its full gory detail in another blog post, I was suitably apprehensive about performing the test once again. Following the last test I worked with NetApp Support to find the root cause of the DR failure. The final synopsis is that it was due to the Service Processor being online while the DR site was down, which caused hardware-assisted takeover to kick in automatically. This meant that a takeover was already running when the ‘cf forcetakeover -d’ command was issued. If the Service Processor stays online for even a fraction of a second longer than the controller, it will initiate a takeover. Local NetApp engineers confirmed this was the case thanks to another customer suffering a similar issue, and they performed multiple tests both with the Service Processor connected and disconnected. Only those tests that had the Service Processor disconnected were successful.

However, it wasn’t just the Service Processor. The DR procedure that I followed was not suitable for the test. WARNING: DO NOT USE TR-3788 FROM NETAPP AS THE GUIDELINE FOR FULL SITE DR TESTING. You’ll be in a world of pain if you do.
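Before physically disconnecting the Service Processor, it's worth checking the SP and hardware-assist state from the console. The sketch below uses 7-Mode commands as I understand them (the hostname is made up, and you should verify the `cf.hw_assist.enable` option name against your Data ONTAP release):

```
# On the Disaster Site controller, check the Service Processor and
# hardware-assisted takeover status before pulling the SP cable
site-a> sp status
site-a> cf hw_assist status

# Optionally disable hardware-assisted takeover for the duration of the
# test as an extra safety net (verify this option on your release)
site-a> options cf.hw_assist.enable off
```

Disconnecting the SP network cable remains the belt-and-braces approach; the commands above just let you confirm nothing is primed to fire before you do.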
I had intended this to be just one blog post, but it escalated quickly and had to be broken out. The first post covers the overview of the steps followed and the health checks carried out in advance. Part 2 covers the physical kit shutdown and the failover process. Part 3 goes into detail on the giveback process and some things that were noted during the DR test. To access the other parts of the post quickly you can use the links below.
- NetApp 7-Mode MetroCluster Disaster Recovery – Part 2
- NetApp 7-Mode MetroCluster Disaster Recovery – Part 3
So without further ado, let’s get to the meat of the sandwich. I’m going to cover this DR test in two sections: the planning (Part 1) and the execution (Part 2 & Part 3). Each is as important as the other. Just in case you don’t want to read through the rest of this post, here are the high-level steps performed for the DR test from a storage perspective:
- Disconnect the Service Processor connection on the Disaster Site storage controller (extremely important!)
- Disconnect the ISL links on the Cisco MDS 9148 fibre switches
- Turn off power to the cabinet containing the NetApp controller, disk shelves, ATTO bridges and Cisco MDS switches (all of these need to be shut down at the same time)
- Initiate takeover on Recovery Site storage controller
- Perform application and infrastructure checks
- Power on the failed site’s disk shelves
- Power on the ATTO bridges and reconnect the ISLs on the MDS switches
- Power on the failed site’s storage controller
- Re-create mirrors / plexes and wait for resync
- Perform giveback
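From a storage point of view, the steps above boil down to a handful of 7-Mode console commands. This is only a sketch (hostnames and aggregate names are illustrative, and the exact re-mirror syntax should be checked against the HA and MetroCluster Configuration Guide for your release):

```
# On the Recovery Site controller, once the Disaster Site is completely down
site-b> cf forcetakeover -d

# After powering the failed site's shelves, bridges and ISLs back on,
# check the split aggregates and re-join the mirrors
site-b> aggr status -r
site-b> aggr mirror aggr0 -v aggr0(1)    # names are illustrative

# Once resync is complete and support has reviewed the AutoSupports
site-b> cf giveback
```

Note that `cf giveback` should only be run after the mirrors have fully resynced; attempting it earlier is one of the easier ways to turn a test into a real disaster.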
Before proceeding I’ll give a quick overview of the environment involved in the MetroCluster and the DR test. There is a stretched-fabric MetroCluster with a NetApp FAS6250 on each site. Both sites connect to their own pair of ATTO 6500N bridges for fibre conversion and a pair of Cisco MDS 9148 fibre storage switches. The connection between sites from a storage perspective is across an aggregated 2 x 1Gb dark fibre connection. Each controller node has its Service Processor connected to an out-of-band management switch. Each of the controller nodes is connected to a pair of Cisco Nexus 5548s, which are in turn cross-connected to a pair of Cisco Nexus 7009s. The Nexus 7Ks on both sites are connected to each other, and OTV is used to enable a VMware stretched cluster. This link is made up of 2 x 2Gb fibre connections. At the compute level, Cisco UCS 5108 chassis and Cisco UCS 6248 Fabric Interconnects are being used and are connected into the Nexus 5Ks. There are two UCS chassis within each UCS domain. Each site is made up of one UCS domain, which is managed centrally using Cisco UCS Central.
As you can see there are quite a lot of moving parts, and each one needs to be taken into account for the DR test. The decision was made not to power down the Nexus 7Ks as they provide routing for other production systems, namely the factory production systems including SCADA. All other components were included in the test. Below is the 10,000-foot view of the infrastructure.
Before this DR test I did a lot of research into MetroCluster to get more knowledge on how it worked. The primary documents I found beneficial were:
Data OnTap – High Availability and MetroCluster Configuration Guide for 7-Mode (Requires NetApp login)
I would highly recommend that anyone who wants to know more about how MetroCluster works have a read through these documents. It’s not the easiest reading you’ll ever do, but it’s definitely worthwhile. I would also recommend checking out the NetApp from the Ground Up series by Will Robinson (@oznetnerd); most of all, his write-up on SyncMirror and Plexes is excellent.
Following on from all my studious reading, I installed the Fabric MetroCluster Data Collector (FMC_DC) tool from NetApp. FMC_DC is a useful tool for showing whether the MetroCluster environment is configured correctly and will allow failover of one site to the other. Download the software from http://mysupport.netapp.com/NOW/download/tools/FMC_DC/. You may need a NetApp NOW account to download the software.
Follow the installation instructions:

1. Unzip the archive into a folder on the local system. This folder needs to have sufficient space to hold the collections for the nodes.
2. Run the FMC_DC_GUI.jar file. When it launches you will see the scan and schedules begin.
3. Read the FMC_DC.txt file for more information on configuring and monitoring nodes.

To add your devices, click Node -> Add Node, then click Add and enter the device name, the IP address and the login for the device. Once all devices have been added you’ll see something like this:
Once all devices are added, FMC_DC will go out and scan the entire MetroCluster infrastructure and verify there is sufficient multipathing, that configurations are correct, and that there are no alerts that would cause a MetroCluster failover to fail. A green object is a correctly configured item; in my case the yellow item still had checks outstanding while its components were being added.
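Alongside FMC_DC, it's worth running a few quick manual health checks on each controller before the test day. A sketch using standard 7-Mode console commands (hostname is illustrative):

```
# Confirm controller failover is enabled and healthy
site-a> cf status

# Confirm all aggregates are online and mirrored (both plexes present)
site-a> aggr status -v

# Check for any hardware warnings or degraded components
site-a> sysconfig -a

# Confirm disks are seen down multiple paths
site-a> storage show disk -p
```

Any unmirrored aggregate, failed disk or single-pathed shelf found here should be resolved before the DR test goes ahead; FMC_DC should flag the same issues, but a second, manual pass costs nothing.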
PHYSICAL EQUIPMENT CHECK:
Before beginning the test I would highly recommend going to your data center and giving everything in your environment a visual check. This may seem obvious, and in reality all of the steps in the planning phase are, but you’d be surprised at the number of people who don’t give their kit a visual once it’s been installed. You’ll want to know exactly where everything is connected physically so that, in the event you need to disconnect or connect anything, you’ll know where it needs to go. In my case I had to check that all related equipment was in the same rack cabinet and that I could find the power feeds for the PDUs, so I could turn off the cabinet power at the source to simulate a power outage. This should be done for both sites, but if you can’t manage that, at least make sure the site you’re shutting down for the DR test has been reviewed.
DISASTER RECOVERY TEST PLAN:
Any DR test should be well documented to ensure that it can be easily reviewed later, both by IT personnel and by quality assurance/audit teams. While it adds extra work, it is invaluable in the long run. There are quite a few steps involved in performing a DR test within a MetroCluster, and my advice is to work out all the tasks, break them into the steps in which they will need to be performed, and from there place them into test cases. Each test case should be reliant on the previous one succeeding, and if not, it should reference another successful test case. Good documentation means that you can return to it later and still understand exactly the process that was followed. It’s pretty much what I did in order to write this blog post. It’s been almost 6 weeks since I did the test, and I’ve been quite busy since then, so the chances of remembering each step were minimal; I reverted to my documentation to remind me of the things I had forgotten.
OPEN A PRE-EMPTIVE SUPPORT CALL:
I would highly recommend opening a pre-emptive support call with NetApp a few days before you are about to proceed with the DR test. This will shorten the amount of time it takes to get the call escalated should you need it on the test day, and it will also make support aware that the test is taking place, so they will know where to begin looking within the AutoSupport files to find any problems faster. Go to https://support.netapp.com and open a service request. You may not have to use it, but most likely you will, as you’ll want the support engineer to check over your environment and the AutoSupport files before performing the giveback. This was even recommended by our local NetApp engineers, as it was at this point they always saw problems, usually caused by the SP being connected when the DR test was run.
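To give the support engineer a fresh baseline to work from, you can also trigger a manual AutoSupport from each controller just before the test starts. In 7-Mode this is done with the `autosupport.doit` option (the subject text below is just an example):

```
# Send an on-demand AutoSupport tagged so support can find it easily
site-a> options autosupport.doit "Pre_DR_Test_Baseline"
```

Mentioning the subject line you used on the support case makes it trivial for the engineer to locate the right AutoSupport bundle later.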
The next part of the process is the execution of the plan, which is covered in NetApp 7-Mode MetroCluster Disaster Recovery – Part 2.