vSphere 6.5 guest UNMAP may cause VM I/O latency spikes – fixed in Update 1

I converted some VMs to thin provisioning and upgraded the VM hardware version to 13 to test out the space savings. The initial retrim caused a transient I/O slowdown in the VM, but the issue kept reappearing randomly: I/O latency would spike to 400ms for minutes at a time for no apparent reason. It also seemed to affect other surrounding VMs, just not as badly. After several days, I converted the VMs back to thick and the issues disappeared.

I’m not sure where the problem is and I can’t look into it anymore. Might be a bug in vSphere. Might be the IBM v7000 G2 SAN that goes crazy. As I said, I cannot investigate it any further but I’ll update the post if I ever hear anything.

PS! Savings were great, on some systems nearly 100% from VMFS perspective. On some larger VMs with possible alignment issues, reclamation takes several days though. For example, a 9TB thick file server took 3 days to shrink to 5TB.
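To put the file server number in perspective, here is a back-of-the-envelope calculation of the average reclaim rate. It is only arithmetic on the figures above; I am assuming decimal terabytes, since I never checked whether those were TB or TiB.

```python
# Average reclaim rate for the 9TB -> 5TB file server example.
TB = 10**12  # assumption: decimal terabytes


def reclaim_rate_mb_per_s(before_tb: float, after_tb: float, days: float) -> float:
    """Average space reclaimed per second, in decimal MB/s."""
    reclaimed_bytes = (before_tb - after_tb) * TB
    seconds = days * 24 * 3600
    return reclaimed_bytes / seconds / 10**6


rate = reclaim_rate_mb_per_s(before_tb=9, after_tb=5, days=3)
print(f"{rate:.1f} MB/s")  # about 15.4 MB/s averaged over the three days
```

So "several days" works out to a fairly modest average rate, which fits with reclamation being a background activity rather than a bulk operation.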

Update 2017.06.29:

Veeam’s (or Anton Gostev’s) newsletter mentioned a similar issue just as I came across this issue again in a new vSphere cluster. In the end, VMware support confirmed the issue; the fix is expected in 6.5 Update 1 at the end of July.

Update much later in November:

I’ve been running Update 1 since pretty much the release date and UNMAP works great! No particular performance hit. Sure, it might be a bit slower during an UNMAP run but it’s basically invisible for most workloads.

I’ve noticed that for some VMs, you don’t get space back immediately. On some of the more internally fragmented huge (multi-TB) VMs, particularly those with 4K clusters, space usage seems to drop slowly over days or weeks. I’m not sure what’s going on but perhaps ESXi is doing some kind of defrag operation in the VMDK…? And yeah, doing a defrag (you can do it manually from the command line in Windows 2012+) and then an UNMAP helps too.
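The defrag-then-UNMAP step can be scripted. Below is a minimal sketch that builds the two defrag.exe invocations. The choice of switches is my assumption, since I didn't write down exactly what I ran: /D is the traditional defrag and /L is the retrim in Windows 2012+. The function names are made up for illustration.

```python
import subprocess
import sys


def defrag_then_retrim_commands(drive: str) -> list[list[str]]:
    """Return the defrag.exe command lines: traditional defrag, then retrim."""
    volume = drive.rstrip(":\\") + ":"  # accept "C", "C:" or "C:\"
    return [
        ["defrag", volume, "/D"],  # /D: traditional defragmentation
        ["defrag", volume, "/L"],  # /L: retrim, i.e. send TRIM/UNMAP for free space
    ]


def run_defrag_then_retrim(drive: str) -> None:
    """Run both steps in order; needs Windows and an elevated prompt."""
    if sys.platform != "win32":
        raise RuntimeError("defrag.exe is only available on Windows")
    for cmd in defrag_then_retrim_commands(drive):
        subprocess.run(cmd, check=True)  # stop if the defrag step fails
```

Calling run_defrag_then_retrim("D") on the guest would run both steps back to back. On a multi-TB volume expect the defrag part to take a long time and to generate plenty of I/O of its own.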

IBM Tivoli Storage Manager excludes most VSS protected files

Let’s say we’re using IBM TSM with agents on Windows. It supports VSS snapshots, so you might expect that when you perform a backup, you can restore any file on the system.

Wrong!

TSM will hard-exclude any VSS-protected files except for those belonging to a short list of supported inbox writers. The most recent list is here:
http://www.ibm.com/support/knowledgecenter/SSGSG7_7.1.0/com.ibm.itsm.client.doc/t_bac_sysstate.html
Don’t worry, it hasn’t changed since ever. I count 16.

And now take a look at just the list of Windows inbox writers:
https://msdn.microsoft.com/en-us/library/windows/desktop/bb968827
I currently counted 34 items (it may change in future).
For example, WDS, WID, RMS and Certificate Services are absent from IBM’s list.

Now think of VSS-aware products, with SQL Server, Oracle and Exchange among the big names. In some cases you just might not care about application-specific backups; an application-consistent, VSS-based file backup will do just fine. SQL Server database crashed? OK, let’s copy the database files back in place and start the engine. Good enough.

Now what will Tivoli do?

  • Take a VSS snapshot, like pretty much every other product
  • Query VSS for the list of writers and writer-protected files
  • Hard-exclude ANY file protected by ANY VSS writer not included in the list
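The exclusion logic in the steps above can be modelled in a few lines. This is only a sketch of the behaviour as I observed it, not TSM's actual code. The writer names and file paths are illustrative placeholders, not output from a real VSS query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Writer:
    """A VSS writer and the files it reports as protected (illustrative model)."""
    name: str
    files: tuple[str, ...]


def hard_excluded(writers, supported_names):
    """Return files owned by any writer that is NOT on the supported list."""
    supported = {name.lower() for name in supported_names}
    excluded = set()
    for writer in writers:
        if writer.name.lower() not in supported:
            excluded.update(writer.files)
    return sorted(excluded)


# Placeholder data: names and paths are made up for the example.
writers = [
    Writer("Registry Writer", (r"C:\Windows\System32\config\SYSTEM",)),
    Writer("WID Writer", (r"C:\Windows\WID\Data\SUSDB.mdf",)),
    Writer("WDS Writer", (r"C:\RemoteInstall\Mgmt\example.dat",)),
]
supported = ["Registry Writer"]

for path in hard_excluded(writers, supported):
    print("excluded:", path)
```

Note what is missing from the model: nothing checks whether the file is actually consistent in the snapshot. The only thing that matters is whether the owning writer is on the list.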

Say you have WSUS running on WID. The WID database files are hard-excluded even though they are consistent in the VSS snapshot. I repeat, you cannot back up these files as Tivoli will just not let you. You have WDS to PXE boot systems? Nope. SQL Express running in simple recovery mode for some tool, where all you care about is having the database file in the backup? Tough luck, excluded.

The cynical part is that when you query TSM for excluded files, it will say they are excluded by the operating system. No, they are not excluded by the operating system, they are excluded by IBM! Look around in forums and the same opinion reigns. Wrong! The operating system does not exclude them. Do a backup snapshot with diskshadow and mount it. The files are there.
There are also claims that these files should be excluded because they may be volatile and inconsistent. Wrong! The whole point of VSS writers is to make them consistent. Not crash-consistent but cleanly consistent! Do a backup snapshot with diskshadow. The files are there. They are consistent. It seems that either IBM sales/marketing are really, I mean like REALLY, greedy or the tech guys are really incompetent.

Oh boy… I guess some guys have only seen LVM snapshots…
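If you want to repeat the diskshadow check mentioned above, here is a small sketch that generates a diskshadow script. The diskshadow commands themselves are standard; the function name, the alias and the default drive letters are just my picks for the example.

```python
def diskshadow_script(volume: str = "C:", expose_as: str = "X:",
                      alias: str = "BackupVol") -> str:
    """Build a diskshadow script that snapshots a volume and exposes it."""
    lines = [
        "set context persistent",             # persistent shadow copy, writers involved
        "set verbose on",
        "begin backup",                       # start a full backup session
        f"add volume {volume} alias {alias}",
        "create",                             # take the snapshot
        f"expose %{alias}% {expose_as}",      # mount it as a drive letter
        "end backup",
    ]
    return "\n".join(lines) + "\n"


print(diskshadow_script())
```

Save the output to a file and run it from an elevated prompt with diskshadow /s followed by the file name, then browse X: and look for the "excluded" files. The shadow copy is persistent, so remove it when done (delete shadows exposed X: inside diskshadow should do it).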

When we contacted support, the response was “by design”. I cannot comprehend the stupidity of this response. A backup product that refuses to protect OS components.

I dug around a bit and it seems that TSM used to work fine until about version 5.5 when this “functionality” was introduced. https://adsm.org/forum/index.php?threads/files-missing-in-windows-server-2008-backup.17112

Workaround 1: Use PRESCHEDULECMD to run pretty much anything that dumps or copies data before the backup. The bad part is that it is only invoked automatically when the backup is started from a schedule.
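A pre-schedule command can be any script. Here is a minimal sketch of one that copies already-dumped files into a staging directory that TSM is allowed to back up. The script name, paths and function names are hypothetical. It assumes the files were produced by a dump step first, because copying a live database file directly will either fail on a lock or give you an inconsistent copy. As far as I understand the client documentation, a nonzero exit code from the pre-schedule command keeps the scheduled backup from running, so the script reports failure that way.

```python
import shutil
import sys
from pathlib import Path


def stage_files(sources, staging_dir):
    """Copy each source file into staging_dir and return the copied paths."""
    staging = Path(staging_dir)
    staging.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in map(Path, sources):
        dest = staging / src.name
        shutil.copy2(src, dest)  # copy2 keeps timestamps where possible
        copied.append(dest)
    return copied


def main(argv):
    """Usage: prebackup.py STAGING_DIR FILE [FILE...]; returns an exit code."""
    if len(argv) < 3:
        print("usage: prebackup.py STAGING_DIR FILE [FILE...]", file=sys.stderr)
        return 2
    try:
        stage_files(argv[2:], argv[1])
    except OSError as exc:
        print(f"staging failed: {exc}", file=sys.stderr)
        return 1
    return 0

# When saved as a script, finish the file with: sys.exit(main(sys.argv))
```

In dsm.opt this would be hooked up with something like PRESCHEDULECMD "python C:\scripts\prebackup.py D:\staging D:\dumps\wsus.bak", where every path is a placeholder.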

Workaround 2: Dump TSM and get anything else.

Workaround 3: Adding these options to your dsm.opt file might help. I didn’t bother to try, I voted with my wallet.
TESTFLAG VSSDISABLEEXCL
TESTFLAG SKIPSYSTEMEXCLUDE

TL;DR: After having been forced to work with Tivoli Storage Manager for years: avoid it like the plague, burn it with fire. Expensive, slow, plain stupid.