During the past couple of years I’ve been working on FlexPod solutions, and more recently I’ve been exposed to NetApp FlexPod MetroCluster. That has led me to do quite a bit of research and reading about MetroCluster solutions, and I thought I’d share some of that knowledge. I wanted to put together a post to help anyone else who needs a better understanding of MetroCluster infrastructure. It got a bit out of hand, so to make it easier to read I’ve split it into a number of parts.
MetroCluster is a term that is often heard but, I believe, rarely understood. It adds extra complexity to every aspect of the infrastructure, but with my own technical bias I love that, as it gives me something else to learn and play with.
Multiple vendors now provide a MetroCluster or metro-availability solution, but my focus is on NetApp and in particular 7-Mode MetroCluster; any reference I make to MetroCluster from here on is to NetApp MetroCluster only. If your clients or company value disaster avoidance, business continuity, fault tolerance and overall infrastructure resilience, then you really need to look at a MetroCluster solution. You will also need deep pockets.
As the engineer supporting such solutions and performing disaster recovery tests, I can attest to the power of a MetroCluster solution to attain zero downtime and data resilience. Even in one instance where I almost brought it to its knees, it still soldiered on.
What is a MetroCluster?
A MetroCluster is essentially a stretched, mirrored HA pair of storage nodes in a cluster configuration, with the two nodes up to a maximum of 160 km apart. There are two types of MetroCluster: a stretch MetroCluster, where the two nodes are within 500 metres of each other, and a fabric-attached MetroCluster, where the nodes are more than 500 metres and less than 160 km apart. Aggregate mirroring of I/O via SyncMirror technology means that you never lose a transaction, as a write to a mirrored aggregate (its plexes) is only committed once it has been mirrored to the remote aggregate.
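As a rough sketch of what this looks like from the 7-Mode CLI (the aggregate names and disk count here are only examples), an aggregate can either be created mirrored from the start or have a second plex added later:

```
# Create a new aggregate with 10 disks, mirrored across pools
# (-m tells Data ONTAP to build it as two plexes of 5 disks each)
aggr create aggr1 -m 10

# Or mirror an existing unmirrored aggregate by adding a second plex
aggr mirror aggr0
```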
When reading any documentation on MetroCluster you’ll come across references to plexes quite a lot. A plex is, for want of a better term, a sub-aggregate. Aggregate 1 in site A is composed of plex0 in site A and plex1 in site B; plex1 is a mirror of plex0 and can be described as a RAID 1 mirror of it. SyncMirror allows these plexes to mirror at the RAID level. As with an aggregate, each plex is made up of a number of RAID groups, depending on the number of disks in the environment and your configuration. An example of what this looks like is: <enter diagram here>
MetroCluster leverages NetApp HA cluster failover (CFO) functionality to automatically protect against controller failures, and layers SyncMirror on top to provide cluster failover on disaster (CFOD), hardware redundancy and geographical separation, attaining ridiculous levels of availability. Each node in the MetroCluster continually monitors its partner via a cluster peer connection and mirrors its non-volatile memory (NVRAM) via the FC-VI connection to ensure that even in the event of a controller failure there is no data loss.
From a virtual infrastructure perspective, MetroCluster provides the back end to a VMware Metro Storage Cluster (VMSC), which allows a storage failure to occur in one site while VMs are migrated to the second site without any downtime. I will discuss VMware Metro Storage Cluster in a separate blog post. One of the areas that requires a more in-depth look is how OTV plays a part in enabling a VMSC.
What are Plexes?
Plexes are a key part of MetroCluster and SyncMirror, but they are not tied solely to those solutions. When an aggregate is created, a plex is created by default, and all of the aggregate’s RAID groups are tied to it. When you run the command ‘aggr status -v’ you will see something similar to the following above the disk list: /aggr0/plex0/rg1. A mirrored aggregate consists of two plexes; an unmirrored aggregate consists of one. A plex can be online or offline, but if it’s offline it’s not available for read or write access. One key point is that the disks in a plex are not allowed to span disk pools. This makes sense, as mirroring across disk pools would make providing consistent writes a real challenge.
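To make that concrete, here is an abbreviated, purely illustrative ‘aggr status -v’ for a mirrored aggregate, showing both plexes and the RAID groups beneath them (the exact layout varies by Data ONTAP version):

```
> aggr status -v aggr0
  Aggr    State    Status            Options
  aggr0   online   raid_dp, aggr     root
                   mirrored

    Plex /aggr0/plex0: online, normal, active
      RAID group /aggr0/plex0/rg0: normal

    Plex /aggr0/plex1: online, normal, active
      RAID group /aggr0/plex1/rg0: normal
```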
For want of a better explanation, plexes are sub-aggregates: a virtual representation of the physical disks beneath them. An example of how this can be visualised is below.
Why would I choose a MetroCluster?
As virtualization has proliferated, improving resource utilization, power usage, data centre footprint and so on, it has also meant that an unplanned outage on a physical server has a far greater disruptive impact on end users. As virtual machine densities continue to grow, high availability, fault tolerance and resilience must underpin an enterprise’s infrastructure to protect its business-critical applications. A large part of IT is risk mitigation.
Sometimes those of us in infrastructure lose sight of the goal of infrastructure itself, which is to support the applications and provide the resources necessary to satisfy their performance requirements. To ensure that applications remain operational at all times and that all failures are accounted for as part of business continuity, you must look at secondary-site availability. For most people this takes the form of a co-lo data centre that hosts all your applications, leveraging the business continuity contingencies of the co-lo operator. Another option is an active-passive configuration with asynchronous replication, which in certain circumstances can mean data loss. A MetroCluster, by contrast, is a cross-site, active-active, synchronous-write, no-data-loss, disaster-avoidance infrastructure that is essentially bomb proof, with absolutely no single point of failure. It can be an incredibly complex system and one that’s very difficult for vendors to implement, as can be seen from the extremely small number of vendors with such an offering in their product portfolios.
The value behind MetroCluster is zero RPO and a very low RTO (a test done on a 7-Mode MetroCluster took 40 minutes for failover and 12 minutes for failback). A MetroCluster provides an answer to multiple failure scenarios and means that your business can continue with ease. Some of those failure scenarios are listed below, and I’ll go into them in more depth later in the series.
- Power outage in DC
- Loss of disk shelf
- Loss of storage controller
- Loss of networking on storage and at the compute level
- Entire DC blown off the face of the planet
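For the worst of those scenarios, the loss of an entire site, recovery on a 7-Mode MetroCluster is driven from the surviving node. Sketched very roughly (the aggregate names are placeholders, and the exact resync steps depend on the state of the recovered site):

```
# On the surviving node: declare a site disaster and take over
# the failed partner, serving its data from the local mirrored plexes
cf forcetakeover -d

# Once the failed site is back: rejoin the split aggregates so the
# plexes resynchronise, then hand the partner's resources back
aggr mirror aggr0 -v aggr0(1)
cf giveback
```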
A MetroCluster can also be run as part of a FlexPod validated design. The FlexPod design includes Cisco UCS chassis, blades and fabric interconnects at each site, functioning as two separate UCS domains that can be managed via UCS Central. It also includes Cisco Nexus 5k and 7k switching, which makes use of OTV to provide the cross-site layer-2 domain that VMware Metro Storage Cluster needs to operate successfully. A FlexPod MetroCluster architecture looks very similar to the diagram below.
Why would I not choose MetroCluster?
This comes down largely to cost. A MetroCluster solution is expensive, as you have to double all your storage capacity so that both sites maintain equal capacity; for all intents and purposes you have one entire set of disks sitting unusable at all times. This can be hard to justify to the business when the RTOs and RPOs could be satisfied by a regular HA pair using something like SnapMirror to a remote DR site, with something like SRM to orchestrate failover. As SnapMirror is asynchronous there is going to be an RPO gap: you can only restore back to the last time the SnapMirror job ran.
MetroCluster requires expensive fibre switching fabrics which may sit outside your budget. These switches have to be dedicated to MetroCluster and are not supported for general use.
To make use of MetroCluster you’ll also need a layer-2 network stretched across the two buildings or sites, depending on which MetroCluster solution you go with, and the required networking in place to allow this. To run VMware Metro Storage Cluster on top, which is essentially a vSphere cluster stretched across both sites under one vCenter, you will need something that can extend layer 2 over layer 3 by encapsulating layer-2 frames in layer-3 packets, such as OTV. This is another expense that needs to be factored in when looking at MetroCluster solutions.
Licensing is another reason why some people don’t consider MetroCluster: you’ll need pretty much every license going. Your local NetApp rep will love you, though, and I’m sure they’ll buy you a drink if you drop a PO for MetroCluster.
Components of a MetroCluster
As there are two types of MetroCluster, the components are not identical across both. The majority of components are the same in both configurations, with the Fabric MetroCluster additionally requiring ATTO bridges to link the storage shelves to the fabric. The bridges form part of the storage loop and extend the SAS loops to the second site.
- Two NetApp controllers running compatible versions of Data ONTAP
- FC-VI cluster adapter, one per controller, required for the cluster interconnect (when installed, the local InfiniBand card is disabled)
- Two ATTO FibreBridge 6500N SAS-to-Fibre Channel bridges (two new bridges are required for each new stack)
- FC-HBAs, two or four, of at least 4Gbps each for the controllers to join the fabric
- MetroCluster licensing: cluster_remote for site failover, syncmirror_local for synchronous mirroring across sites, and cluster for controller failover
- SAS cables
- Fibre cables
- Ethernet cables (management)
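For reference, those licenses are added in the usual 7-Mode way (the codes below are placeholders, not real license keys):

```
license add <cluster-license-code>
license add <cluster_remote-license-code>
license add <syncmirror_local-license-code>
```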
Fabric MetroCluster specific
- Dedicated fibre channel between sites (also known as dark fibre)
- Four Fibre Channel SAN Switches, two per site (Cisco MDS or Brocade switches). These must be dedicated for MetroCluster and not run any other traffic.
- Additional cabling