vSphere 6.5 and 6.7 qfle3 driver is really unstable

Edit 2019.03.08

In the end, RSS and iSCSI turned out to be separate issues. The RSS issue is to be fixed in vSphere 6.7U2 sometime this spring. Updated Marvell (wow, Broadcom -> QLogic -> Cavium -> Marvell, I’m not sure what to call it by now) drivers are on VMware’s support portal. I haven’t tested them yet as I don’t currently have any Marvell NICs to try out.

Some details from my ServerFault answer to a similar issue: https://serverfault.com/a/950302/8334

Edit 2018.10.15

Three months have passed and the QLogic/Cavium drivers are still broken. I’ve gotten a few debug drivers (and others have as well) but there’s no solution. The initial suspicion about bad optics was a red herring (the optics really were bad, but that was unrelated). Currently there are two issues:

  • Hardware iSCSI offload will PSOD the system (in my case within 5-30 minutes, in other cases randomly).
  • NIC RSS configuration will randomly fail (once every few weeks), causing total loss of network connectivity, a PSOD, or an NMI raised by the BIOS/BMC (or a combination of the three).

So far I’ve had to swap everything to Intels (being between a rock and a hard place). They have their own set of problems, but at least no PSODs or networking losses. Beacon probing doesn’t seem to work with Intel X710-based cards (confirmed by HPE): incoming packets just disappear in the NIC/driver. Compared to random PSODs, I can live with that.

Edit 2018.07.11

HPE support confirmed that the qfle3 bundle is dead in the water. Our VAR was astonished that the sales branch was completely unaware of the severe stability issues. I edited the subject to reflect the findings.

Edit 2018.07.09

QLogic qfle3i (and the whole QLogic 57810 driver bundle) seems to be just fucked. qfle3i crashes no matter what. Even the basic NIC driver, qfle3, crashes occasionally. So if you’re planning to switch from bnx2 to qfle3 as required by HPE, don’t! bnx2 is at least stable for now. The latest HPE images already contain this fix; however, it doesn’t fix these specific crashes. VMware support also confirmed that there’s an ongoing investigation into this known, common issue and that it also affects vSphere 6.5. I’m suffering on HPE 534FLR-SFP+ adapters, but your OEM may have other names for the QLogic/Cavium/Broadcom 57810 chipset.

A few days ago I was setting up a new green-field VMware deployment. As a team effort, we were ironing out configuration bugs and oversights, but despite all the fixes, the vSphere hosts kept PSODing consistently. The stack trace showed crashes in the QLogic hardware iSCSI adapter driver, qfle3i.

Firmware was updated and updates were installed, to no effect. After looking around and some trial and error, one fiber cable turned out to be faulty and caused occasional packet loss on the SAN-to-switch path. In theory TCP is supposed to recover from that, but hardware adapters seem to be much pickier. Monitoring was not yet configured, so it was quite annoying to track down. Also, as the SAN was not properly accessible, there was no persistent storage for logs or dumps.
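Without monitoring in place, one crude way to spot a lossy link is to read each switch port's error counters twice and see which ones grew. Here is a minimal Python sketch of that comparison. The function name, the port names and the counter values are all hypothetical, and how you actually read the counters depends on your switch.

```python
def ports_with_new_errors(before, after):
    """Compare two snapshots of per-port error counters and return ports whose count grew.

    Both arguments map a port name to a cumulative error count (for example
    CRC or input errors read from the switch). Ports missing from `before`
    are treated as starting at zero. Counters that went down (for example
    after a reset) are ignored.
    """
    growth = {}
    for port, count in after.items():
        delta = count - before.get(port, 0)
        if delta > 0:
            growth[port] = delta
    return growth


# Hypothetical snapshots taken a few minutes apart; names and numbers are made up.
snapshot_1 = {"port1": 0, "port2": 120}
snapshot_2 = {"port1": 0, "port2": 154}
suspects = ports_with_new_errors(snapshot_1, snapshot_2)
```

Any port that shows up in `suspects` on a SAN path is worth a closer look at its cable and optics.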

So if you’re using hardware adapters and seeing PSODs, check for packet loss in the switches. I won’t engage support for this as I have no logs or dumps. But if you see “qfle3i_tear_down_conn” in the PSOD, look for Ethernet problems.
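If you have the PSOD text saved somewhere (for example, typed up from a screenshot), a few lines of Python can do that check for you. This is only a sketch: qfle3i_tear_down_conn is the one symbol taken from this post, while the function names and the regex are my own assumptions about what qfle3/qfle3i symbols look like.

```python
import re

# Matches symbols that look like they belong to the qfle3 or qfle3i drivers.
# The pattern is an assumption for illustration; only qfle3i_tear_down_conn
# is a symbol actually mentioned in this post.
QFLE3_SYMBOL = re.compile(r"\bqfle3[a-z]?_\w+")


def find_qfle3_symbols(backtrace_text):
    """Return the distinct qfle3* symbols found in the text, in order of first appearance."""
    seen = []
    for match in QFLE3_SYMBOL.finditer(backtrace_text):
        symbol = match.group(0)
        if symbol not in seen:
            seen.append(symbol)
    return seen


def points_to_ethernet_problem(backtrace_text):
    """True if the backtrace contains the connection teardown symbol from this post."""
    return "qfle3i_tear_down_conn" in find_qfle3_symbols(backtrace_text)
```

A hit from `points_to_ethernet_problem` is just a hint to go and check the physical path; it doesn’t prove anything on its own.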