I’ve been working on an issue over the past couple of days where a backup has been constantly failing. The problem was isolated to a warning on the VM that its disks required consolidation. Nothing major, or so I thought. I had a look at the datastore where the VM resides and found 185 snapshot vmdk disks. Well, that can’t be right! After a bit of investigation I found a number of VMware KB articles covering the problem. The basic option is to follow KB 2003638 and run a standard consolidation by going to Snapshot -> Consolidate.
You’ll then be prompted with a Yes/No dialog asking whether to consolidate the redo logs. Select Yes.
At this point it looked as if the consolidation was going to work, but at about 20% it failed. The next error showed that the file was locked.
There are a number of recommendations for removing the lock on a file. One is to vMotion/svMotion the VM to another host. Unfortunately, these are both standalone ESXi hosts with no vMotion network or capabilities, so that couldn’t be done. Some people recommend rebooting the ESXi host to release the lock, but as above there was no vMotion network, these hosts run production manufacturing systems that cannot just be randomly rebooted, and waiting on a downtime approval would take too long. The next step was to restart the management agents on the ESXi host. This was done by connecting to the ESXi host via SSH and running the following commands:
/etc/init.d/hostd restart
/etc/init.d/vpxa restart
This caused the host to be unmanageable for a brief moment. I re-ran the consolidation task I had tried earlier but got the same error message. Next I started working through KB 10051 – Investigating virtual machine file locks on ESXi/ESX. This was a good article up to a point, but it could be clearer. I connected to the ESXi host via SSH. As I already knew which VM was affected, I moved straight to locating the lock and removing it. Instead of using /var/log/messages as mentioned in the KB article, I opened vmware.log in vi and ran a search for “lock“. What I found was that the 1-000830-delta.vmdk was locked.
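The same search can be done without opening vi at all. A minimal sketch using grep (the sample log line below is made up for illustration; on the real host you would run the grep against the vmware.log in the VM’s directory under /vmfs/volumes/):

```shell
# Create a tiny stand-in vmware.log so the example is self-contained;
# on the ESXi host you'd skip this and grep the real log instead.
printf 'vcpu-0| I120: DISKLIB-LINK : Opened test_1-000830-delta.vmdk\nvcpu-0| I120: [msg.fileio.lock] Failed to lock the file\n' > vmware.log

# Case-insensitive search for lock-related messages
grep -i "lock" vmware.log
```

This prints only the `msg.fileio.lock` line, which names the delta vmdk that is locked.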
Based on the disk number it was possible to run the command ‘vmkfstools -D <vm_name>_1-000830-delta.vmdk‘, which returned the MAC address of the owner of the read-only (RO) lock. I didn’t capture this at the time, but it will look similar to the example in the VMware KB article:
[root@test-esx1 testvm]# vmkfstools -D test-000008-delta.vmdk
Lock [type 10c00001 offset 45842432 v 33232, hb offset 4116480
gen 2397, mode 2, owner 00000000-00000000-0000-000000000000 mtime 5436998] <-- MAC address of lock owner
RO Owner HB offset 3293184 4f284470-4991d61b-4b28-001a64c335dc <-- MAC address of read-only lock owner
Addr <4, 80, 160>, gen 33179, links 1, type reg, flags 0, uid 0, gid 0, mode 100600
len 738242560, nb 353 tbz 0, cow 0, zla 3, bs 2097152
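The MAC address is the last segment of that UUID-style owner string. As a quick sketch of pulling it out on the command line (using the sample RO Owner line from the output above):

```shell
# Sample "RO Owner" line copied from the vmkfstools -D output above
line='RO Owner HB offset 3293184 4f284470-4991d61b-4b28-001a64c335dc'

# The last dash-separated field is the owner's MAC address (12 hex digits);
# sed reinserts the colons for readability.
mac=$(echo "$line" | awk -F- '{print $NF}' | sed 's/../&:/g; s/:$//')
echo "$mac"    # 00:1a:64:c3:35:dc
```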
With the MAC address you can check in vCenter to see which vNIC and vSwitch the MAC address was assigned to. In my case it was the ESXi management vNIC.
From here I deviated from the KB article. I ran the following command:
# lsof | grep <vm_name>_1-000830-delta.vmdk
This returned two processes.
9778541 vpxa-worker 11 51 /vmfs/volumes/5356c1f8-55d703b6-d4b5-b83861d73252/<VM_Name>/<name_of_locked_file>
1053523 vpxa-worker 2 8 /vmfs/volumes/5356c1f8-55d703b6-d4b5-b83861d73252/<VM_Name>/<name_of_locked_file>
I killed the older of the two vpxa-worker processes, as that was the one holding the lock.
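Assuming the PIDs haven’t wrapped around, the lower PID is the older worker. A sketch of picking it out of the lsof output (the sample lines mirror the ones above with shortened placeholder paths; the actual kill is left as a comment so nothing gets terminated by accident):

```shell
# Stand-in for the two lsof lines captured above
printf '9778541 vpxa-worker 11 51 /vmfs/volumes/.../locked-delta.vmdk\n1053523 vpxa-worker 2 8 /vmfs/volumes/.../locked-delta.vmdk\n' > locks.txt

# Oldest (lowest) PID, assuming no PID wraparound on the host
old_pid=$(awk '{print $1}' locks.txt | sort -n | head -1)
echo "$old_pid"    # 1053523

# On the ESXi host you would then release the lock with:
#   kill -9 "$old_pid"
```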
Now that the lock was removed I was able to re-run the Consolidate Snapshots and it ran successfully.
Not all vmdks were removed as part of the consolidation. Some remained, because those disks were still mounted via hot-add to the backup server.
Cause of the issue:
So why did this happen? The backup software was failing to clean up a snapshot, which had a knock-on effect: every backup after that created more new delta disks. The hot-add feature of the backup software meant that the backup server VM held a lock on one of the vmdks. The lock was never released due to a failed backup at some point in the past, and every time a new backup was taken the disk chain just kept growing. Consolidating the snapshots actually caused the backup server to shut down, and it could not be powered back on again until all the hot-added disks had been removed. I chose not to delete them from disk and will perform that cleanup as a manual task.