vSphere 6.5 guest UNMAP may cause VM I/O latency spikes – fixed in update 1

I converted some VMs to thin and upgraded VM hardware version to 13 to test out savings. Initial retrim caused transient I/O slowdown in VM but the issue kept reappearing randomly. I/O latency just spikes to 400ms for minutes for no apparent reason. It also seems to affect other surrounding VMs, just not as badly. After several days, I converted VMs back to thick and issues disappeared.

I’m not sure where the problem is and I can’t look into it anymore. Might be a bug in vSphere. Might be the IBM v7000 G2 SAN that goes crazy. As I said, I cannot investigate it any further but I’ll update the post if I ever hear anything.

PS! Savings were great, on some systems nearly 100% from VMFS perspective. On some larger VMs with possible alignment issues, reclamation takes several days though. For example, a 9TB thick file server took 3 days to shrink to 5TB.

Update 2017.o6.29:

Veeam’s (or Anton Gostev’s) newsletter mentioned a similar issue just as I came across this issue again in a new vSphere cluster. In the end VMware support confirmed the issue with expected release of 6.5 Update 1 at the end of July.

Update much later in november

I’ve been running Update 1 since pretty much release date and UNMAP works great! No particular performance hit. Sure, it might be a bit slower during UNMAP run but it’s basically invisible for most workloads.

I’ve noticed that for some VM’s, you don’t space back immediately. On some more internally fragmented huge (multi-TB) VMs, particularly those with 4K clusters, space usage seems to reduce slowly over days or weeks. I’m not sure what’s going on but perhaps ESXi is doing some kind of defrag operation in VMDK…? And yeah, doing a defrag (you can do it manually form command line in Windows 2012+) and then UNMAP helps too.

vSphere 6.5 virtual NVMe does not support TRIM/UNMAP/Deallocate

Update 2018.10.15

It works more-less fine in 6.7. Known issues/notes so far:

  • Ugly warning/errors is Linux kernel log if Discard is blocked (snapshot create/commit) – harmless
  • Linux NVMe controller has a default timeout of 30s. With VMTools, only SCSI gets increase to 180s so you might want to manually increase nvme module timeout just in case. “CRAZY FAST, CRAZY LOW LATENCY!!!” you scream? Well fabrics and transport layers still may have hickups and tolerating transient issues might be better than being broken.
  • When increasing VMDK sizes, Linux NVME driver doesn’t notice namespace resize. Newer kernels (4.9+ ?) have configuration device to rescan, older require VM reboot
  • One VMFS6 locking issue that may or not be related to vNVME. Will update if I remember to (or get feedback from VMware).
  • It seems to be VERY slightly faster and have VERY slightly lower CPU overhead. It’s within the margin of error, in real life it’s basically the same as PVSCSI.
  • The nice thing is that it works with Windows 7 and Windows 2008 R2! Remember that they don’t support SCSI UNMAP. However NVME Discard seems to work. Delete reclaims space, (ironically) manual defrag frees space, also sdelete zero successfully reclaims space.

I was playing with guest TRIM/UNMAP the other day and looked at new shiny virtual NVMe controller. While it would not help much in my workloads, cutting overhead never hurts. So I tried to do “defrag /L” in VM and it return that device doesn’t support it.

So I looked up release notes. Virtual NVMe device: “Supports NVMe Specification v1.0e mandatory admin and I/O commands”.

The thing is that NVMe part that deals with Deallocate (ATA TRIM/SCSI UNMAP in NVMe-speak) is optional. So back to pvscsi for space savings…

An unpopular opinion about Vista

I have said it again and again. I think Vista was not a bad OS at all. Not the greatest but somewhere between good and great.

While I missed very early teething issues, I did catch a few. I didn’t get to use Vista until I completed my military service, in summer of 2007. This was the first and last OS that caused me to say “wow” on first boot. It just looked so great! Sure, Linux had all the bells and whistles and XP had WindowBlinds but they never looked as clean and classy. But to get that far, I had to remove some RAM as setup hung when you had more than 2GB… And then I got a BSOD due to Bluetooth stack. 🙂
I did keep on using Vista personally until a few months after 7 came out.

I did plenty of Vista rollouts in 2008 and 2009 and… it worked great. By that time SP1 was out and drivers had stabilized. On most of hardware it ran just fine. Maybe not as fast but XP the difference was not noticeable and people actually liked Vista. For most of enterprises, I think it was a mistake to skip Vista. As tooling and many OS concepts had changed considerably, I saw many people complaining after Windows 7 release. They hadn’t even touched Vista and were surprised how similar Vista and 7 were.

Security was better. UAC was actually great (it had some nice side-effects). Quite a few features actually became usable compared to XP. It had some nice features for sysadmins that went relatively unnoticed. On the other hand, early tools sucked big time. Later WAIKs were much better and by SP2 it pretty much looked as it does today.

I switched jobs in 2010 and didn’t get to professionally touch Vista since. Kind of sad actually. Technology was solid but teething issues caused an unrecoverable PR nightmare.