There aren’t many tools available specifically for MetroCluster but I’ve added the ones I found below. If anyone knows of any others please let me know and I’ll update this post.
The FMC_DC can be downloaded from here -> http://mysupport.netapp.com/NOW/download/tools/FMC_DC/. It will require a NetApp NOW account.
The FMC_DC is the Fabric Metro Cluster Data Collector which can be configured to gather information on all components (controllers, switches, bridges etc.) of the MetroCluster infrastructure. Once the components have been added a health check can be run. This health check appears as a card on the application and will show whether the components are healthy or need further investigation.
I’d recommend having a look over this document to get started with FMC_DC
While the FMC_DC doesn’t provide any management features it does provide peace of mind that all components are configured so that failover can be successful. If you’re doing a DR test I’d definitely recommend using it.
These are some of the things to look out for with MetroCluster and can be considered best practices and recommendations.
One very important configuration change on MetroCluster controllers is to disable the change_fsid option immediately. If it is not disabled, the file system IDs (FSIDs) of all volumes change during a site failover, so hosts can no longer reference the volumes and LUNs they were using. This is really critical for LUNs.
To avoid the FSID change in the case of a site takeover, you can set the change_fsid option to off (the default is on). Setting this option to off has the following results if a site takeover is initiated by the cf forcetakeover -d command:
- Data ONTAP refrains from changing the FSIDs of volumes and aggregates.
- Users can continue to access their volumes after site takeover without remounting.
- LUNs remain online.
If you don’t disable the change_fsid option in MetroCluster configurations the following happens when the cf forcetakeover -d command is run:
- Data ONTAP changes the file system IDs (FSIDs) of volumes and aggregates because ownership changes.
- Because of the FSID change, clients must remount their volumes if a takeover occurs.
- If using Logical Units (LUNs), the LUNs must also be brought back online after the takeover.
options cf.takeover.change_fsid off
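Before a DR test it can be worth confirming that the option really is off on both controllers. Below is a minimal Python sketch that checks console output you have already captured. It assumes the controller prints the option name followed by its value on a single line; that output format and the helper name `change_fsid_is_off` are my own assumptions for illustration, so adjust the parsing to match what your controller actually prints.

```python
def change_fsid_is_off(console_output: str) -> bool:
    """Return True if the captured output shows cf.takeover.change_fsid set to off.

    Assumes a line of the form '<option name> <value>' (an assumption, not a
    documented format).
    """
    for line in console_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "cf.takeover.change_fsid":
            return parts[1].lower() == "off"
    raise ValueError("cf.takeover.change_fsid not found in output")


if __name__ == "__main__":
    # Hypothetical captured output from one controller.
    sample = "cf.takeover.change_fsid      off"
    print(change_fsid_is_off(sample))
```

Raising an error when the option is missing, rather than returning False, makes it harder to mistake a bad capture for a bad setting.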
MetroCluster RC file
Failover/Failure Scenarios for MetroCluster
I’m not going to re-invent the wheel here. These failure scenarios are all pretty self-explanatory and can be found in TR-3788.pdf. There are far more scenarios in that document but here I’ll cover off some of the most common types.
Scenario: Loss of power to disk shelf
Expected behaviour: The relevant disks go offline and the plex is broken. There’s no disruption to data availability for hosts running HA (VMware High Availability) or FT (Fault Tolerance), and no change is detected by the ESXi server. When the shelf is powered back on the plexes will resync automatically.
Impact on data availability: None
Scenario: Loss of one link in one disk loop
Expected behaviour: A notification appears on the controller to advise that disks are only accessible via one switch. There’s no disruption to data availability for hosts running HA or FT, and no change is detected by the ESXi server. When the connection is restored an alert on the controller will advise of connectivity across two switches.
Impact on data availability: None
Scenario: Failure and Failback of Storage Controller
Cabling of Fabric MetroCluster
The cabling of a MetroCluster is the key. Outside of some licensing, the cabling is really the only difference between a MetroCluster and a Mirrored HA pair. Yes, it’s a bit more complex for failover and failback, but the main difference from a setup point of view is the cabling. There’s a large number of cables and the configuration should be fully mapped out before you begin putting equipment into your racks. I would heartily recommend reading the MetroCluster and High Availability Guide before starting so that you understand your cabling requirements. Below is not a how-to on how to connect everything, it’s just an overview with a brief explanation. The above NetApp document is very detailed and should answer any questions you may have.
A simplified view of a Fabric MetroCluster is as follows:
I found this workflow from NetApp documentation which is quite useful as a guideline on how the bridges should be cabled.
What is Fabric-Attached MetroCluster?
A Fabric-Attached MetroCluster configuration, which can be implemented for distances greater than 500 meters, connects the two storage nodes by using four Brocade or Cisco Fibre Channel switches in a dual-fabric configuration for redundancy. Each site has two Fibre Channel switches, each of which is connected through an inter-switch link to a partner switch at the other site.
The inter-switch links are fibre connections which extend the storage fabric path so that it provides a greater distance between nodes than other HA pair solutions. By using four switches instead of two, redundancy is in place to avoid single-points-of-failure in the switches and their connections.
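To make the redundancy argument concrete, here is a small Python sketch that models the two fabrics and checks what survives when a component fails. The switch and ISL names are hypothetical labels I made up for illustration, and the model is deliberately simplified: a fabric counts as usable only if both of its switches and its inter-switch link are up.

```python
# Hypothetical component names: each fabric is one switch per site plus the
# inter-switch link (ISL) that joins them.
FABRICS = {
    "fabric1": {"siteA_sw1", "siteB_sw1", "isl1"},
    "fabric2": {"siteA_sw2", "siteB_sw2", "isl2"},
}


def surviving_fabrics(failed):
    """Return the fabrics that still have every component up, in name order."""
    failed = set(failed)
    return sorted(name for name, parts in FABRICS.items() if not (parts & failed))


def has_single_point_of_failure():
    """True if losing any one switch or ISL leaves no working fabric."""
    components = set().union(*FABRICS.values())
    return any(not surviving_fabrics({c}) for c in components)


if __name__ == "__main__":
    print(surviving_fabrics({"siteA_sw1"}))
    print(has_single_point_of_failure())
```

With only two switches (a single fabric) the same check would report a single point of failure, which is the reason for the four-switch design.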
The advantages of a fabric-attached MetroCluster configuration over a stretch MetroCluster configuration include the following:
- Increased disaster protection via nodes being in separate geographical locations
- Disk shelves and nodes are not connected directly to each other, but are connected to a fabric with multiple data routes ensuring no single point of failure.
The disadvantage is that there’s more cabling and there are more components involved in the way of fibre switches.
Fabric MetroCluster requirements
What is a Stretch MetroCluster
To understand Stretch MetroCluster you first need to understand how a HA pair operates. A Stretch MetroCluster is basically the next level up from a HA pair, or a HA Mirrored pair to be more specific.
In a HA pair, which is a very common NetApp controller deployment, the cluster can handle the outage of a physical link or the entire controller and still provide access to the underlying storage without impacting data access for end users. The controllers in the HA pair either share the same set of disks or each own a distinct set of disks, but either way in the event of a controller failure all reads/writes are sent to the remaining controller, which can still access the failed controller’s disks. There is a HA interconnect between the controllers that’s used for both keep-alive monitoring and mirroring of NVRAM. The HA pair provides fault tolerance and allows non-disruptive upgrades, as a takeover and giveback can be performed for planned migration of reads/writes to the second controller in the HA pair.
Next up is a Mirrored HA pair. Mirrored HA pairs maintain two copies of all data in the form of plexes. These are continually updated synchronously using SyncMirror and provide protection in the event of disk failures. You can also set the mirroring to be asynchronous if there is a need for that. The major drawback to a Mirrored HA pair is that it does not provide failover to the partner node in the event of a controller failure. The Mirrored HA controller pair needs to be within the 5 metre SAS limit. This is where stretch MetroCluster comes in.
During the past couple of years I’ve been working on FlexPod solutions and even more recently than that I’ve been exposed to NetApp FlexPod MetroCluster. This has led me to doing quite a bit of research and reading about MetroCluster solutions and I thought I’d share some of that knowledge. I wanted to put together a post to help anyone else that needs to get a better understanding of MetroCluster infrastructure. It kind of got out of hand a bit and to make it easier to read I’ve split it out into a number of parts.
MetroCluster is a term that is often heard but I believe rarely understood. It adds some extra complexity into every aspect of the infrastructure but from my own technical bias I love that as it gives me something else to learn and play with.
Multiple different vendors now provide a MetroCluster or metro-availability solution but my focus is on NetApp and in particular 7-Mode MetroCluster. Any reference I make to MetroCluster from here on will only be in reference to NetApp MetroClusters. If your clients or company value disaster avoidance, business continuity, fault tolerance and overall infrastructure resilience then you really need to look at a MetroCluster solution. You will also need to have some deep pockets.
As the engineer supporting such solutions and performing disaster recovery tests I can attest to the power of a MetroCluster solution that can attain zero downtime and data resilience. Even in one instance where I almost brought it to its knees it still soldiered on.
What is a MetroCluster?
This is the last of the 3-part post about MetroCluster failover procedure. This section covers the giveback process and also a note about a final review. The other sections of this blog post can be found here:
- NetApp 7-Mode MetroCluster Disaster Recovery – Part 1
- NetApp 7-Mode MetroCluster Disaster Recovery – Part 2
Test Case 4 – Virtual Infrastructure Health Check
This test case covers a virtual infrastructure system check, not only to get an insight into the current status of the system but to also compare against the outcomes from test case 1.
4.1 – Log into vCenter using the desktop client or web client. Expand the virtual data center and verify all SiteB ESXi hosts are online
4.2 – Log onto NetApp OnCommand System Manager. Select the primary storage controller and open the application
4.3 – Expand SiteA/SiteB and expand primary storage controller, select Storage and Volumes. Volumes should all appear with online status
4.4 – Log into SolarWinds – Check the events from the last 2 hours and take note of all devices from the Node List which are currently red
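Since the point of this test case is to compare against the outcomes from test case 1, it helps to record the red devices from both checks and diff them. A minimal Python sketch of that comparison is below; the node names are hypothetical and the lists would be typed in by hand from whatever your monitoring tool shows.

```python
def compare_health(baseline_red, current_red):
    """Compare the devices that were red in test case 1 with those red now."""
    baseline, current = set(baseline_red), set(current_red)
    return {
        "newly_red": sorted(current - baseline),   # problems introduced by the test
        "recovered": sorted(baseline - current),   # red before, healthy now
        "still_red": sorted(baseline & current),   # pre-existing issues
    }


if __name__ == "__main__":
    # Hypothetical device names noted during test case 1 and test case 4.
    before = ["siteA-esx03", "siteB-sw2"]
    after = ["siteB-sw2", "siteB-esx01"]
    for category, nodes in compare_health(before, after).items():
        print(category, nodes)
```

Anything in the newly red list is what needs investigating before the test can be signed off.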
Recently I had the honour of performing a NetApp 7-Mode MetroCluster DR Test. After my previous outing, which can be read in its full gory detail in another blog post, I was suitably apprehensive about performing the test once again. Following the last test I worked with NetApp Support to find a root cause of the DR failure. The final synopsis is that it was due to the Service Processor being online while the DR site was down, which caused hardware-assisted takeover to kick in automatically. This meant that a takeover was already running when the ‘cf forcetakeover -d’ command was issued. If the Service Processor is online for even a fraction of a second longer than the controller is, it will initiate a takeover. Local NetApp engineers confirmed this was the case thanks to another customer suffering a similar issue, and they performed multiple tests both with the Service Processor connected and disconnected. Only those tests that had the Service Processor disconnected were successful.
However it wasn’t just the Service Processor. The DR procedure that I followed was not suitable for the test. WARNING: DO NOT USE TR-3788 FROM NETAPP AS THE GUIDELINE FOR FULL SITE DR TESTING. You’ll be in a world of pain if you do.
I had intended on this being just one blog post but it escalated quickly and had to be broken out. The first post is around the overview of steps followed and the health check steps carried out in advance. Part 2 covers the physical kit shutdown and the failover process. Part 3 goes into detail around the giveback process and some things that were noted during the DR test. To access the other parts of the post quickly you can use the links below.
- NetApp 7-Mode MetroCluster Disaster Recovery – Part 2
- NetApp 7-Mode MetroCluster Disaster Recovery – Part 3