vSphere 6.5 and 6.7 qfle3 driver is really unstable

Edit 2019.03.08

In the end, RSS and iSCSI were separate issues. RSS is to be fixed in vSphere 6.7U2 sometime this spring. Updated Marvell (wow, Broadcom -> QLogic -> Cavium -> Marvell, I’m not sure what to call it by now) drivers are on VMware’s support portal. I haven’t tested them yet as I don’t currently have any Marvell NICs to try out.

Some details from my ServerFault answer to a similar issue: https://serverfault.com/a/950302/8334

Edit 2018.10.15

Three months have passed and the QLogic/Cavium drivers are still broken. I’ve gotten a few debug drivers (and so have others) but there’s no solution. The initial suspicion about bad optics was a red herring (the optics really were bad, but that was unrelated). Currently there are two issues:

  • Hardware iSCSI offload will PSOD the system (in my case within 5-30 minutes, in other cases randomly)
  • NIC RSS configuration will randomly fail (once every few weeks), causing total loss of network connectivity, a PSOD, or an NMI by BIOS/BMC (or a combination of the three).

So far I’ve had to swap everything to Intels (being between a rock and a hard place). They have their own set of problems, but at least no PSODs or network losses. Beacon probing doesn’t seem to work with Intel X710 based cards (confirmed by HPE): incoming packets just disappear in the NIC/driver. Compared to random PSODs, I can live with that.

Edit 2018.07.11

HPE support confirmed that the qfle3 bundle is dead in the water. Our VAR was astonished that the sales branch was completely unaware of the severe stability issues. Edited the subject to reflect these findings.

Edit 2018.07.09

QLogic qfle3i (and the whole QLogic 57810 driver bundle) seems to be just fucked. qfle3i crashes no matter what. Even the basic NIC driver qfle3 crashes occasionally. So if you’re planning to switch from bnx2 to qfle3 as required by HPE, don’t! bnx2 is at least stable for now. The latest HPE images already contain this fix; however, it doesn’t fix these specific crashes. VMware support also confirmed that there’s an ongoing investigation into this known common issue and that it also affects vSphere 6.5. I’m suffering on HPE 534FLR-SFP+ adapters, but your OEM may have other names for the QLogic/Cavium/Broadcom 57810 chipset.

A few days ago I was setting up a new green-field VMware deployment. As a team effort, we were ironing out configuration bugs and oversights, but despite all the fixes, the vSphere hosts kept PSODing consistently. The stack trace showed crashes in the QLogic hardware iSCSI adapter driver qfle3i.

Firmware was updated and patches were installed, to no effect. After some digging and trial and error, one fiber cable turned out to be faulty, causing occasional packet loss on the SAN-to-switch path. In theory TCP should compensate for that, but hardware adapters seem to be much pickier. Monitoring was not yet configured, so it was quite annoying to track down. Also, as the SAN was not properly accessible, there was no persistent storage for logs or dumps.

So if you’re using hardware adapters and seeing PSODs, check for packet loss in the switches. I won’t engage support on this as I have no logs or dumps. But if you see “qfle3i_tear_down_conn” in a PSOD, look for Ethernet problems.
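If you don’t have switch-side counters handy, the host-side NIC statistics are a quick first check. A sketch using PowerCLI’s Get-EsxCli (the host and vmnic names are examples; exact counter names vary by driver):

```powershell
# Pull physical NIC statistics from an ESXi host and look for growing
# receive/CRC error counters, which point at a bad cable, optic or switch port.
# 'esx01.example.local' and 'vmnic0' are placeholders for your environment.
$esxcli = Get-EsxCli -VMHost (Get-VMHost 'esx01.example.local') -V2
$esxcli.network.nic.stats.get.Invoke(@{nicname = 'vmnic0'})
```

Run it periodically; a counter that keeps climbing while traffic is otherwise healthy is the kind of intermittent loss that upset the hardware iSCSI adapter here.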

VMware EVC now exposes Spectre mitigation MSRs with latest patches

Edit: speak of the devil… new vCenter and vSphere patches were just released: https://www.vmware.com/security/advisories/VMSA-2018-0004.html Headline revised to reflect the update.

Edit 2: As this update requires shutting down and starting VMs (a full power cycle; a simple restart does not work), use this PowerCLI command to find VMs that don’t yet have the new features exposed:

Get-VM |? {$_.extensiondata.runtime.featurerequirement.key -notcontains 'cpuid.IBRS'  -or $_.extensiondata.runtime.featurerequirement.key -notcontains 'cpuid.IBPB'}

While you can apply VMware patches and BIOS microcode updates, guests will not see any mitigation options if EVC is enabled (as these options were not in the original CPU specification). It’s the same with KVM/QEMU CPU masking; Hyper-V, however, allows exposing the new flags (probably because it doesn’t have anything like EVC besides a “compatibility” flag).

I haven’t yet tested without EVC, but with everything patched up, clients in a Broadwell EVC cluster don’t see the required MSRs on ESXi 6.5.

Running UNMAP on snapshotted VMware hardware 11+ thin VMs may cause them to inflate to full size

Scenario:

  1. ESXi 6.X (6.5 in my case)
  2. VMware hardware 11+ (13)
  3. Thin VM
  4. UNMAP aware OS (Windows 2012+)

If you snapshot a VM and run UNMAP (for example a retrim from the defrag utility), the VM may (not always) inflate to full size during the snapshot commit. It also results in really long commit times.

I’ve seen it happen quite a few times, and it’s really annoying if, for some historical reason, you have tons of free space on the drives (for example, NTFS dedup was enabled long after deployment). It may even cause datastores to become full (needless to say, really bad). It also tends to happen during backup windows that keep snapshots open for quite a while (usually at night, terrible). While I could disable automatic retrim (bad with lots of small file operations, since normal UNMAP isn’t very effective on them due to alignment issues) or UNMAP entirely (even worse), it’s an acceptable risk for now if you keep enough free space on the datastore to absorb the inflation of the biggest VM. You can retrim after the snapshot commit and the VM drops back to its normal size quickly (minutes).
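To estimate how much headroom a worst-case inflation would need, the standard PowerCLI VM properties are enough; a sketch (the derived column name is my own):

```powershell
# Worst case, a thin VM inflates to its provisioned size, so the potential
# growth is provisioned minus currently used space. Sort descending to see
# which VM needs the most datastore headroom during a snapshot commit.
Get-VM |
    Select-Object Name, ProvisionedSpaceGB, UsedSpaceGB,
        @{N = 'PotentialGrowthGB'; E = { [math]::Round($_.ProvisionedSpaceGB - $_.UsedSpaceGB, 1) }} |
    Sort-Object PotentialGrowthGB -Descending
```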

I haven’t seen this reported anywhere else, but I guess I’ll put together a reproducible PoC, contact VMware support and post an update.

vSphere 6.5 guest UNMAP may cause VM I/O latency spikes – fixed in update 1

I converted some VMs to thin and upgraded the VM hardware version to 13 to test out the savings. The initial retrim caused a transient I/O slowdown in the VM, but the issue kept reappearing randomly: I/O latency just spikes to 400ms for minutes at a time for no apparent reason. It also seems to affect other surrounding VMs, just not as badly. After several days, I converted the VMs back to thick and the issues disappeared.

I’m not sure where the problem is and I can’t look into it anymore. It might be a bug in vSphere. It might be the IBM v7000 G2 SAN going crazy. As I said, I cannot investigate any further, but I’ll update the post if I ever hear anything.

PS! The savings were great, on some systems nearly 100% from the VMFS perspective. On some larger VMs with possible alignment issues, reclamation takes several days though. For example, a 9TB thick file server took 3 days to shrink to 5TB.

Update 2017.06.29:

Veeam’s (or Anton Gostev’s) newsletter mentioned a similar issue just as I came across it again in a new vSphere cluster. In the end, VMware support confirmed the issue, with a fix expected in 6.5 Update 1 at the end of July.

Update, much later in November

I’ve been running Update 1 since pretty much its release date and UNMAP works great! No particular performance hit. Sure, it might be a bit slower during an UNMAP run, but it’s basically invisible for most workloads.

I’ve noticed that for some VMs, you don’t get space back immediately. On some more internally fragmented, huge (multi-TB) VMs, particularly those with 4K clusters, space usage seems to decrease slowly over days or weeks. I’m not sure what’s going on, but perhaps ESXi is doing some kind of defrag operation in the VMDK…? And yes, doing a defrag first (you can do it manually from the command line in Windows 2012+) and then an UNMAP helps too.
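Inside a Windows 2012+ guest, that defrag-then-retrim sequence can be done with the built-in Optimize-Volume cmdlet (the drive letter is an example):

```powershell
# Consolidate file data first, then issue TRIM/UNMAP for the freed ranges.
# Run inside the guest as administrator; requires Windows 8 / Server 2012+.
Optimize-Volume -DriveLetter D -Defrag -Verbose
Optimize-Volume -DriveLetter D -ReTrim -Verbose
```

The -Verbose output shows how much space was actually retrimmed, which is handy for checking whether the reclaim reached the VMDK.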

vSphere 6.5 virtual NVMe does not support TRIM/UNMAP/Deallocate

Update 2018.10.15

It works more or less fine in 6.7. Known issues/notes so far:

  • Ugly warnings/errors in the Linux kernel log if Discard is blocked (snapshot create/commit) – harmless
  • The Linux NVMe controller has a default timeout of 30s. With VMware Tools, only SCSI disks get increased to 180s, so you might want to manually increase the nvme module timeout just in case. “CRAZY FAST, CRAZY LOW LATENCY!!!” you scream? Well, fabrics and transport layers may still have hiccups, and tolerating transient issues might be better than being broken.
  • When increasing VMDK sizes, the Linux NVMe driver doesn’t notice the namespace resize. Newer kernels (4.9+?) can rescan the device; older ones require a VM reboot
  • One VMFS6 locking issue that may or may not be related to vNVMe. Will update if I remember to (or get feedback from VMware).
  • It seems to be VERY slightly faster and to have VERY slightly lower CPU overhead. It’s within the margin of error; in real life it’s basically the same as PVSCSI.
  • The nice thing is that it works with Windows 7 and Windows 2008 R2! Remember that they don’t support SCSI UNMAP, but NVMe Discard seems to work: deletes reclaim space, (ironically) a manual defrag frees space, and sdelete zeroing also successfully reclaims space.
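For the guest-side timeout and rescan points above, the Linux-side knobs are a config fragment and a sysfs attribute; a sketch, assuming a reasonably recent kernel with the nvme_core module (the controller path and 180s value mirror VMware’s SCSI default and are examples):

```shell
# Check the current NVMe I/O timeout (seconds) inside the Linux guest:
cat /sys/module/nvme_core/parameters/io_timeout

# Persist a 180s timeout across reboots via a modprobe.d config fragment:
echo 'options nvme_core io_timeout=180' > /etc/modprobe.d/nvme-timeout.conf

# On kernels that support namespace rescan, pick up a VMDK resize without
# a reboot (nvme0 is an example controller):
echo 1 > /sys/class/nvme/nvme0/rescan_controller
```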

I was playing with guest TRIM/UNMAP the other day and looked at the shiny new virtual NVMe controller. While it wouldn’t help much with my workloads, cutting overhead never hurts. So I tried to run “defrag /L” in a VM and it returned that the device doesn’t support it.

So I looked it up in the release notes. Virtual NVMe device: “Supports NVMe Specification v1.0e mandatory admin and I/O commands”.

The thing is that the part of NVMe that deals with Deallocate (NVMe-speak for ATA TRIM/SCSI UNMAP) is optional. So back to pvscsi for space savings…