NetApp MetroCluster Overview – Part 5 – Failure Scenarios for MetroCluster


Failover/Failure Scenarios for MetroCluster

I’m not going to re-invent the wheel here. These failure scenarios are all pretty self-explanatory and can be found in TR-3788.pdf. There are far more scenarios in that document but here I’ll cover off some of the most common types.

Scenario: Loss of power to disk shelf

MetroCluster Failure Disk Shelf

Expected behaviour: The relevant disks go offline and the plex is broken. There’s no disruption to data availability for hosts running HA (VMware High Availability) or FT (Fault Tolerance), and no change is detected by the ESXi server. When the shelf is powered back on the plexes will resync automatically.

Impact on data availability: None
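
To make the plex behaviour a bit more concrete, here is a minimal sketch in Python. It is only a toy model of the scenario described above, not anything NetApp ships: the class names, the plex names and the methods are all made up for illustration. It assumes a mirrored aggregate with two plexes, where data stays available as long as one plex is online and in sync.

```python
from dataclasses import dataclass, field


@dataclass
class Plex:
    # Hypothetical stand-in for one copy of the mirrored data.
    name: str
    online: bool = True
    in_sync: bool = True


@dataclass
class MirroredAggregate:
    """Toy model: one aggregate mirrored across two plexes."""
    plexes: list = field(default_factory=lambda: [Plex("plex0"), Plex("plex1")])

    def data_available(self) -> bool:
        # Hosts keep access as long as one online, in-sync plex survives.
        return any(p.online and p.in_sync for p in self.plexes)

    def shelf_power_loss(self, plex_name: str) -> None:
        # Disks behind the shelf go offline, so that plex drops out of the mirror.
        for p in self.plexes:
            if p.name == plex_name:
                p.online = False
                p.in_sync = False

    def shelf_power_restored(self, plex_name: str) -> None:
        # Simplification: the returning plex is marked as resynced straight away.
        for p in self.plexes:
            if p.name == plex_name:
                p.online = True
                p.in_sync = True


aggr = MirroredAggregate()
aggr.shelf_power_loss("plex1")
print(aggr.data_available())                 # True: plex0 still serves data
aggr.shelf_power_restored("plex1")
print(all(p.in_sync for p in aggr.plexes))   # True: mirror is whole again
```

In the model, data only becomes unavailable when both plexes are lost at the same time, which is why a single shelf losing power has no impact on hosts.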


Scenario: Loss of one link in one disk loop

MetroCluster Failure Inter-Switch Link

Expected behaviour: A notification appears on the controller to advise that disks are only accessible via one switch. There’s no disruption to data availability for hosts running HA or FT, and no change is detected by the ESXi server. When the connection is restored an alert on the controller will advise of connectivity across two switches.

Impact on data availability: None


Scenario: Failure and Failback of Storage Controller


NetApp MetroCluster Overview – Part 4 – Cabling of Fabric MetroCluster


Cabling of Fabric MetroCluster

The cabling of a MetroCluster is the key. Outside of some licensing, the cabling is really the only difference between a MetroCluster and a Mirrored HA pair. Yes, it’s a bit more complex for failover and failback, but the main difference from a setup point of view is the cabling. There’s a large number of cables and the configuration should be fully mapped out before you begin putting equipment into your racks. I would heartily recommend reading the MetroCluster and High Availability Guide before starting so that you understand your cabling requirements. Below is not a how-to on connecting everything, it’s just an overview with a brief explanation. The NetApp guide mentioned above is very detailed and should answer any questions you may have.

A simplified view of a Fabric MetroCluster is as follows:

Fabric Switch

I found this workflow from NetApp documentation which is quite useful as a guideline on how the bridges should be cabled.

Cable connection workflow



NetApp MetroCluster Overview – Part 3 – Fabric-Attached MetroCluster


What is Fabric-Attached MetroCluster?

A Fabric-Attached MetroCluster configuration can be implemented for distances greater than 500 metres. It connects the two storage nodes by using four Brocade or Cisco Fibre Channel switches in a dual-fabric configuration for redundancy. Each site has two Fibre Channel switches, each of which is connected through an inter-switch link to a partner switch at the other site.

The inter-switch links are fibre connections which extend the storage fabric path so that it provides a greater distance between nodes than other HA pair solutions. By using four switches instead of two, redundancy is in place to avoid single-points-of-failure in the switches and their connections.

The advantages of a fabric-attached MetroCluster configuration over a stretch MetroCluster configuration include the following:

  • Increased disaster protection via nodes being in separate geographical locations
  • Disk shelves and nodes are not connected directly to each other, but are connected to a fabric with multiple data routes ensuring no single point of failure.

The disadvantage is that there’s more cabling and there are more components involved in the way of fibre switches.

Fabric MetroCluster requirements



NetApp MetroCluster Overview – Part 2 – Stretch MetroCluster


What is a Stretch MetroCluster

To understand Stretch MetroCluster you first need to understand how a HA pair operates. A Stretch MetroCluster is basically the next level up from a HA pair, or a HA Mirrored pair to be more specific.

Standard HA Pair

In a HA pair, which is a very common NetApp controller deployment, the cluster can handle the outage of a physical link or of an entire controller and still provide access to the underlying storage without impacting data access for end users. Each controller in the HA pair either shares the same set of disks or owns its own distinct set of disks, but either way, in the event of a controller failure all reads/writes are sent to the remaining controller, which can still access the failed controller’s disks. There is a HA interconnect between the controllers that’s used for both keep-alive monitoring and mirroring of NVRAM. The HA pair provides fault tolerance and allows non-disruptive upgrades, as a takeover and giveback can be performed for a planned migration of reads/writes to the second controller in the HA pair.

Next up is a Mirrored HA pair. Mirrored HA pairs maintain two copies of all data in the form of plexes. These are continually updated synchronously using SyncMirror and provide protection in the event of disk failures. You can also set the mirroring to be asynchronous if there is a need for that. The major drawback to Mirrored HA pairs is that they do not provide failover to the partner node in the event of a controller failure. The controllers in a Mirrored HA pair need to be within the 5 metre SAS limit. This is where stretch MetroCluster comes in.



NetApp MetroCluster Overview – Part 1 – What is MetroCluster?

During the past couple of years I’ve been working on FlexPod solutions and even more recently than that I’ve been exposed to NetApp FlexPod MetroCluster. This has led me to doing quite a bit of research and reading about MetroCluster solutions and I thought I’d share some of that knowledge. I wanted to put together a post to help anyone else that needs to get a better understanding of MetroCluster infrastructure. It kind of got out of hand a bit, so to make it easier to read I’ve split it out into a number of parts.


MetroCluster is a term that is often heard but I believe rarely understood. It adds some extra complexity into every aspect of the infrastructure but from my own technical bias I love that as it gives me something else to learn and play with.

Multiple different vendors now provide a MetroCluster or metro-availability solution but my focus is on NetApp and in particular 7-Mode MetroCluster. Any reference I make to MetroCluster from here on will only be in reference to NetApp MetroClusters. If your clients or company value disaster avoidance, business continuity, fault tolerance and overall infrastructure resilience then you really need to look at a MetroCluster solution. You will also need to have some deep pockets.

As the engineer supporting such solutions and performing disaster recovery tests I can attest to the power of a MetroCluster solution that can attain zero downtime and data resilience. Even in one instance where I almost brought it to its knees it still soldiered on.

What is a MetroCluster?


Fix: VMware – Quiesced Snapshots failing – Unexpected error DeviceIoControl

I ran into an interesting problem that took a bit of digging around, both to find the root cause and to find the final fix. When running backups on VMware 5.5 running on NetApp storage I could see some, but not all, VMs failing and throwing up the below errors in the event logs:

Event ID 57 ntfs Warning
The system failed to flush data to the transaction log. Corruption may occur.

Event ID: 137 ntfs Error
The default transaction resource manager on volume \\?\Volume{806289e8-6088-11e0-a168-005056ae003d} encountered a non-retryable error and could not start. The data contains the error code.

Event ID: 12289 VSS Error
Volume Shadow Copy Service error: Unexpected error DeviceIoControl(\\?\fdc#generic_floppy_drive#6&2bc13940&0&0#{53f5630d-b6bf-11d0-94f2-00a0c91efb8b} - 00000000000004A0,0x00560000,0000000000000000,0,0000000000353B50,4096,[0]). hr = 0x80070001, Incorrect function.

The key alert here is Event ID 12289. It was also the most off-putting. It initially looked like a floppy drive issue but there was no floppy drive attached to the VM, nor were there any floppy drivers installed on it. A look around the VMware community forums led me to a relevant posting. It was focused more on vSphere 4.1, however, and most of the advice was around installing an older version of VMware Tools. Comment 27 was the jackpot winner. The System Reserved partition was causing the issue.
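
If you need to check a batch of exported event text for the same signature, something like the following Python sketch may help. It is an illustration only: the function name and the regular expressions are my own, and the sample text is the Event ID 12289 message from above with the leading path prefix left off. It just pulls out the device string and the hr code so you can see at a glance that the error points at a floppy device.

```python
import re

# Event ID 12289 message text (the leading path prefix is left off).
EVENT_TEXT = (
    r"Volume Shadow Copy Service error: Unexpected error "
    r"DeviceIoControl(fdc#generic_floppy_drive#6&2bc13940&0&0#"
    r"{53f5630d-b6bf-11d0-94f2-00a0c91efb8b} - 00000000000004A0,"
    r"0x00560000,0000000000000000,0,0000000000353B50,4096,[0]). "
    r"hr = 0x80070001, Incorrect function."
)

# Device string: everything between "DeviceIoControl(" and the first " - ".
DEVICE_RE = re.compile(r"DeviceIoControl\((?P<device>.+?) - ")
# Result code: the eight hex digits that follow "hr = ".
HRESULT_RE = re.compile(r"hr = (?P<hr>0x[0-9A-Fa-f]{8})")


def summarise_vss_error(text):
    """Extract the device string and hr code from a VSS 12289 message."""
    device = DEVICE_RE.search(text)
    hresult = HRESULT_RE.search(text)
    return {
        "device": device.group("device") if device else None,
        "hresult": hresult.group("hr") if hresult else None,
        "mentions_floppy": "floppy" in text.lower(),
    }


summary = summarise_vss_error(EVENT_TEXT)
print(summary["hresult"])          # 0x80070001
print(summary["mentions_floppy"])  # True
```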

So what does the System Reserved partition do?

The System Reserved partition contains the Boot Manager and Boot Configuration data that are read on start up of the virtual machine. The VM boots from the boot loader in the System Reserved partition and then boots Windows from the System drive. It is also used as a location for the start up files for BitLocker Drive Encryption. If you need BitLocker then you’ll need to have a System Reserved partition. For Windows client OSes that’s a great feature to have, but from a server OS perspective, where BitLocker just isn’t used, it’s superfluous. The System Reserved partition is created by default on OS installation so there are two options to remediate:

  1. Remove the partition manually post installation
  2. Remove the partition from your Windows OS templates

I won’t go into the details on how to remove the partition from your templates here, but more information on how to do it is available online. After finding the root cause of the initial error I ran through the steps myself for all of our Windows templates.

As per one of the links mentioned in Comment 27 in the VMware communities post, it’s possible to change the location of the boot files so that the partition can be removed. This information can be found over on geekshangout. However, the steps didn’t include how to re-claim that partition so that there isn’t an unallocated disk partition sitting in front of the C drive (disk 0). While I haven’t tested backups in this configuration I wouldn’t be surprised if it caused other issues during backup. So below I’ve listed the steps to follow so you can successfully remove the partition as per the steps on geekshangout and then re-claim the space using GParted.

Delete System Reserved partition and reclaim space



NetApp DFM Snapshot Management – Remove Orphaned snapshots

Orphaned Snapshot Removal – Identify Orphans

I’ve been banging my head against the screen for the past few weeks looking at storage issues and finding orphaned volumes with reams of snapshots using up valuable disk space. In some cases it was due to manual intervention and a snapmirror or snapvault relationship was broken, in others it was caused by DFM creating new instances of the volume but not cleaning up old volumes and associated snapshots and in other cases, well I’ve no idea how they occurred. Hence why I’ve been slapping my brain around the inside of my skull. I’d be interested to know if this is still an issue with OCUM, answers on a postcard.

There’s no pretty way to clean up orphaned snapshots that are essentially owned by DFM. It’s messy, convoluted and requires that you’re very careful and precise about what you’re removing, otherwise you’ll make things worse. There are a number of reasons why orphans can occur. One is down to the way SnapProtect and DFM work together. If a VM is deleted or moved to another volume, and no other VMs that are part of that same backup subclient exist on that volume, the snapshots will not age and will require a manual clean-up process. This seems to limit the use of automated DRS in VMware, but that’s a separate issue really. Another reason, and what looks to be the cause in my case, is that DFM has intermittent issues communicating with the storage controller and thinks the volume doesn’t exist, so DFM may create a FlexClone of the volume and index it with a new suffix while still being able to access the snapshots that were already captured. This can be caused by network drop-outs at the controllers or by the CPUs maxing out and not being able to reply to DFM. I’m still investigating the cause of this. If a new storage policy was created in SnapProtect with these volumes assigned it would clear out the orphans, but that would involve re-baselining the backups, which is not something you’d want to do, unless of course that data had no value to you.
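
As a rough illustration of the "identify" step, here is a small Python sketch. It is not how DFM or SnapProtect track anything internally; it simply assumes you have already gathered two inventories by hand (the snapshots per volume as seen on the controller, and the volumes the management tool currently knows about) and compares them. The function name, volume names and snapshot names are all hypothetical.

```python
def find_orphan_candidates(controller_snapshots, managed_volumes):
    """Return volumes that hold snapshots but are not in the managed list.

    controller_snapshots: dict of volume name -> list of snapshot names
    managed_volumes: iterable of volume names the management tool tracks
    """
    managed = set(managed_volumes)
    return {
        volume: sorted(snaps)
        for volume, snaps in controller_snapshots.items()
        if snaps and volume not in managed
    }


# Hypothetical inventories: the tool now tracks the "_1" instance of the
# volume, while the original still carries snapshots on the controller.
controller_snapshots = {
    "vm_datastore01": ["snap_0001", "snap_0002"],
    "vm_datastore01_1": ["snap_0003"],
    "iso01": [],
}
managed_volumes = ["vm_datastore01_1", "iso01"]

for volume, snaps in find_orphan_candidates(controller_snapshots, managed_volumes).items():
    print(volume, snaps)   # vm_datastore01 ['snap_0001', 'snap_0002']
```

Anything this flags is only a candidate for manual review. Given how easy it is to make things worse, nothing should be deleted on the strength of a simple comparison like this.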



How-To: NetApp SnapProtect – Service Pack Upgrade

Oh SnapProtect, how you taunt me! It’s one of those products that’s been OEM’d from another vendor, so it’s missing some functionality and the documentation specific to it can be sparse. Commvault documentation will most likely be sufficient, but ideally documentation would exist on how to perform Service Pack upgrades specifically for SnapProtect. I have a few issues with SnapProtect but I’ll leave that rant for another time. When I recently came to upgrade SnapProtect I couldn’t find documentation that clarified the process, so I thought I’d capture it so I can at least return to it in the future if I need to. Below is a list of the steps carried out to perform the upgrade. I understand that the media agent upgrade may be flawed, and in my case I couldn’t get it to work correctly, so if someone knows what I’ve done wrong please feel free to leave a comment. I’m not an expert in either SnapProtect or its cousin Commvault. For the vast majority of SnapProtect admins this document may be superfluous but hopefully someone finds it useful.

Pre-Upgrade Task

1. Open a preemptive support case with NetApp.

2. Download the software from the NetApp support site and copy the installation file to the local drive on the server. The software can be accessed with a NetApp login.

3. Open the SnapProtect Administrative console on the CommServe/SnapProtect Server. In the console right-click on the commserve, select All Tasks and take a backup of SnapProtect using Disaster Recovery backup.

snapProtect-upgrade step 1

Select the option as a Full backup and click Ok

snapProtect-upgrade step 2

4. Find the SET_XXX folder, in this case in the SnapProtectDR folder and zip it.

snapProtect-upgrade step 3
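
If you’d rather script step 4 than zip the folder by hand, a minimal Python sketch could look like the one below. It is my own illustration rather than anything from the SnapProtect documentation: the function name is made up, and it assumes that the folder you want is the most recently modified SET_ folder under the DR backup directory, which you should confirm before relying on it.

```python
import shutil
from pathlib import Path


def zip_latest_dr_set(dr_root, dest_dir):
    """Zip the newest SET_* folder found directly under dr_root.

    Assumption: "newest" means most recently modified. Returns the
    path of the zip archive that was written into dest_dir.
    """
    dr_root = Path(dr_root)
    set_dirs = [p for p in dr_root.glob("SET_*") if p.is_dir()]
    if not set_dirs:
        raise FileNotFoundError(f"no SET_* folder found under {dr_root}")

    latest = max(set_dirs, key=lambda p: p.stat().st_mtime)
    archive_base = Path(dest_dir) / latest.name

    # base_dir keeps the SET_ folder name as the top level inside the archive.
    archive = shutil.make_archive(
        str(archive_base), "zip", root_dir=latest.parent, base_dir=latest.name
    )
    return Path(archive)
```

A call would look something like `zip_latest_dr_set(r"D:\SnapProtectDR", r"D:\Temp")`, where both paths are placeholders for wherever your DR backup and scratch folders actually live.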


The Life of NetApp – Bring out your dead!

There’s a quality scene in Monty Python and the Holy Grail where the dead are being called out to be loaded onto a cart to be taken away. Are new players in the market doing the same to NetApp? Even though they continue to say that they’re not dead, everyone is writing them off and chucking them on the death-cart.

It’s easy to see why NetApp is being called to bring out its dead. There are more and more players appearing in the storage market with serious differentiators to NetApp. Just look at the list of potential competitors like Pure Storage, Tintri, Simplivity, Nutanix and Nimble. And that’s not including the fully software defined storage groups such as Maxta, Stratoscale and a host of others. There’s also the old adversary EMC. All of these vendors have released new and innovative products in the past year and they have managed their marketing message far better than NetApp has. NetApp has been painfully slow at getting a smooth transition in place for its 7-Mode customers to Clustered Data ONTAP (cDOT). A lot of critics of NetApp also point to the fact that they are so heavily reliant on the ONTAP software. I personally don’t see an issue with that reliance. Don’t change something just to create a new release for the sake of it. But the marketing message and the perception of NetApp in the community have caused a number of issues for them.


NetApp – Create a new volume on vFiler (7-mode)

I had to create a new volume on a vfiler recently. This is a fairly straightforward task for long term NetApp admins but I thought I’d write up the process for the next time that I forget. In this example the vfiler already exists and has been exported on a different subnet than the root vfiler, vfiler0. If you’re new to vfilers then you’ll immediately notice that once you change the vfiler context to the vfiler you want to add a volume to, you don’t have the option to create a new volume. The new volume needs to be created at the root vfiler level and then assigned to the vfiler you wish. In this example I am creating a new ISO datastore on a vfiler so that one of our tenants can have their own ISO datastore. We could present out the ISO datastore from vfiler0 but that would break the security model we worked hard to put in place.

The first thing to do is change the vfiler context and then run the vol command. You will see from this that it’s not possible to create the volume directly on the vfiler.

vfiler context <tenant-vfiler>
tenant-vfiler@NTAPcontroller> vol
The following commands are available; for more information
type "vol help <command>"
offline             options             restrict            status
tenant-vfiler@NTAPcontroller> vol create iso01 aggr1 200g
vol: No such command "create".
The following commands are available; for more information
type "vol help <command>"
offline             options             restrict            status

So go back to the parent vfiler, vfiler0, and then create the new volume. From there you can add it to the tenant-vfiler. Before transferring the volume to the tenant-vfiler I have also changed the options to make the volume thin provisioned using the “guarantee none” setting and also set fractional_reserve to 0. The commands used to create the new volume, modify the settings and add it to the tenant-vfiler were:

tenant-vfiler@NTAPcontroller> vfiler context vfiler0
NTAPcontroller> vol create iso01 -s volume aggr1 200g
NTAPcontroller> vol options iso01 guarantee none
NTAPcontroller> vol options iso01 fractional_reserve 0
NTAPcontroller> vol status iso01
  Volume State          Status            Options
  iso01 online          raid_dp, flex     create_ucode=on, convert_ucode=on,
                        mirrored  guarantee=none, fractional_reserve=0
                        Volume UUID: 0df82cec-fdb8-11e4-a27a-123478563412
                Containing aggregate: 'aggr1'

NTAPcontroller> vfiler add tenant-vfiler /vol/iso01
WARNING: reassigning storage to another vfiler does not change the security information on that storage. If the security domains are not identical, unwanted access may be permitted, and wanted access may be denied.
Tue May 19 09:47:47 EST [NTAPcontroller:cmds.vfiler.path.move:notice]: Path /vol/iso01 was moved to vFiler unit "tenant-vfiler".
Tue May 19 09:47:47 EST []: /etc/exports was not updated for iso01 when the vol destroy command was run. Please either manually update /etc/exports or copy /etc/ to it.
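
If you find yourself repeating this for several tenants, the command sequence is easy to generate. The Python sketch below only builds the same 7-mode commands used above as text, so you can review them before pasting them into the controller; it doesn’t connect to anything. The function name and the simple name check are my own additions, not part of any NetApp tooling.

```python
import re

# Assumed naming rule for this sketch only: letters, digits, underscore, hyphen.
_NAME_RE = re.compile(r"^[A-Za-z0-9_-]+$")


def build_vfiler_volume_commands(volume, aggregate, size, vfiler):
    """Return the 7-mode commands to create a thin volume and hand it to a vfiler."""
    for value in (volume, aggregate, size, vfiler):
        if not _NAME_RE.match(value):
            raise ValueError(f"unexpected characters in {value!r}")
    return [
        "vfiler context vfiler0",                             # back to the root vfiler
        f"vol create {volume} -s volume {aggregate} {size}",  # create the volume
        f"vol options {volume} guarantee none",               # thin provision
        f"vol options {volume} fractional_reserve 0",
        f"vol status {volume}",                               # check the options took
        f"vfiler add {vfiler} /vol/{volume}",                 # assign to the tenant
    ]


for command in build_vfiler_volume_commands("iso01", "aggr1", "200g", "tenant-vfiler"):
    print(command)
```

Keeping it to plain text means the warning from vfiler add about security domains still gets read by a human before anything is moved.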
