vSphere 6.5 and 6.7 qfle3 driver is really unstable

Edit 2019.03.08

In the end RSS and iSCSI were separate issues. RSS is to be fixed in vSphere 6.7U2 sometime this spring. Updated Marvell (wow, Broadcom -> QLogic -> Cavium -> Marvell, I’m not sure what to call it by now) drivers are on VMware’s support portal. I haven’t tested them yet as I don’t currently have any Marvell NICs to try out.

Some details from my ServerFault answer to a similar issue: https://serverfault.com/a/950302/8334

Edit 2018.10.15

Three months have passed and the QLogic/Cavium drivers are still broken. I’ve gotten a few debug drivers (and others have as well) but there’s no solution. The initial suspicion about bad optics was a red herring (the optics really were bad, but that was unrelated). Currently there are two issues:

  • Hardware iSCSI offload will PSOD the system (in my case in 5-30 minutes, in other cases randomly)
  • NIC RSS configuration will randomly fail (once every few weeks), causing total loss of network connectivity, a PSOD, or an NMI by BIOS/BMC (or a combination of the three).
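Until a fixed driver lands, one workaround for the first issue (assuming software iSCSI is acceptable in your setup) is to disable the hardware iSCSI module entirely so the offload adapters never load. A sketch, using the module name as shipped in the qfle3 bundle:

```shell
# Disable the qfle3i hardware iSCSI offload module; takes effect after a reboot.
esxcli system module set --enabled=false --module=qfle3i

# Confirm the module state afterwards.
esxcli system module list | grep qfle3
```

The software iSCSI initiator can then run over the plain qfle3 NIC instead.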

So far I’ve had to swap everything to Intels (caught between a rock and a hard place). They have their own set of problems, but at least no PSODs or network losses. Beacon probing doesn’t seem to work with Intel X710-based cards (confirmed by HPE) – incoming packets just disappear in the NIC/driver. Compared to random PSODs, I can live with that.
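As a stopgap for the beacon-probing problem, falling back to link-status-only failure detection on the affected vSwitches works. A sketch for a standard vSwitch (vSwitch0 is a placeholder name):

```shell
# Switch failure detection from beacon probing back to link status only.
# vSwitch0 is a placeholder for your actual vSwitch.
esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 --failure-detection=link
```

You lose beacon probing’s ability to detect upstream failures, but at least packets stop silently disappearing.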

Edit 2018.07.11

HPE support confirmed that the qfle3 bundle is dead in the water. Our VAR was astonished that the sales branch was completely unaware of the severe stability issues. Edited the subject to reflect the findings.

Edit 2018.07.09

QLogic qfle3i (and the whole QLogic 57810 driver bundle) seems to be just fucked. qfle3i crashes no matter what. Even the basic NIC driver qfle3 crashes occasionally. So if you’re planning to switch from bnx2 to qfle3 as required by HPE, don’t! bnx2 is at least stable for now. The latest HPE images already contain this fix – however, it doesn’t fix these specific crashes. VMware support also confirmed that there’s an ongoing investigation into this known issue and that it also affects vSphere 6.5. I’m suffering on HPE 534FLR-SFP+ adapters, but your OEM may have other names for the QLogic/Cavium/Broadcom 57810 chipset.
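If you have already switched and want to go back, ESXi falls back to the legacy vmklinux driver when the native one is disabled. A sketch, assuming the bnx2x driver is still present in your image (a reboot is required for the binding to change):

```shell
# Disable the native qfle3 driver...
esxcli system module set --enabled=false --module=qfle3
# ...and make sure the legacy bnx2x driver is enabled.
esxcli system module set --enabled=true --module=bnx2x
# Reboot the host so the NICs are reclaimed by bnx2x.
```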

A few days ago I was setting up a new green-field VMware deployment. As a team effort, we were ironing out configuration bugs and oversights, but despite all the fixes, the vSphere hosts kept PSODing consistently. The stack trace showed crashes in the QLogic hardware iSCSI adapter driver qfle3i.

Firmware and software updates were installed, to no effect. After some digging and trial and error, one fiber cable turned out to be faulty, causing occasional packet loss on the SAN-to-switch path. In theory TCP should cope with that, but the hardware adapters seem to be much pickier. Monitoring was not yet configured, so the fault was quite annoying to track down. And since the SAN was not properly accessible, there was no persistent storage for logs or dumps either.

So if you’re using hardware adapters and seeing PSODs, check for packet loss in your switches. I won’t engage support for this as I have no logs or dumps. But if you see “qfle3i_tear_down_conn” in a PSOD, look for Ethernet problems.
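Driver-side error counters are a quick first check before you start pulling switch statistics; the per-NIC stats command has been available since ESXi 6.5:

```shell
# Show per-NIC statistics; watch the receive errors / drops counters.
# vmnic0 is a placeholder for the affected uplink.
esxcli network nic stats get -n vmnic0
```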

10 thoughts on “vSphere 6.5 and 6.7 qfle3 driver is really unstable”

  1. I am having the exact same issue on my new HPE hardware. Your blog post was very informative. I can trigger the issue very reliably when I push high volumes of data over the hardware iSCSI interfaces. Deploying 4 VMs at a time does the trick to cause a crash. I am waiting to hear back from HPE and VMware support on this issue. Hopefully there is a new driver soon.

    1. Sorry, I haven’t checked my blog for a while.
      The issue still exists. iSCSI really sucks, I can crash it by just configuring initiator and letting it sit for a while.
      Even worse, the base NIC crashes/hangs every once in a while (RSS fails or it PSODs), usually after vMotion but occasionally without any provocation. By now both HPE and VMware are involved. I got a few NICs swapped to Intels by HPE, and for some critical stuff we just bought new Intel ones. I’m still looking forward to a solution though, as we have a ton of QLogics…

  2. I was told by Dell/EMC that a patch for this is supposedly being released this week…..

    1. It was released to VMware portal late February. See my update and ServerFault link for more details.

  3. I’m having similar issues on my HP430s… I recently upgraded from 6.0 to 6.7U1 and am having nothing but issues. I was getting PSODs on my older firmware on my QLogic 10GbE NICs, and now, since upgrading the firmware as suggested by VMware, ESXi doesn’t even show the NICs in the NIC list… Any ideas?

  4. Just curious…
    Did you ever get your BCM578xx NICs stable with the QFLE3 drivers?
    We have just had our first PSOD on Dell ESXi 6.7u2 hosts with 57800 NICs and v1.077.2 drivers.
    VMware recommends updating to the 1.0.84 driver – which unfortunately isn’t supported on the latest Dell firmware 🙂

    1. Yes, I currently have a few hosts that have run for about two months. They even ran iSCSI fine, but I reverted to software iSCSI (bnx2 hardware iSCSI was faster, and software iSCSI outperforms qfle3i).
      I can’t recall the exact driver and firmware revisions (I’m on vacation currently), but they were pretty recent – HPE-specific though.
      What’s your stack trace? It might be a different issue.

  5. Don – any indication whether the updated qfle3 driver resolved the issues you were seeing? I see from above that you reverted to bnx2, but have you done any further testing on qfle3? HPE only officially supports qfle3 on their Broadcom -> QLogic -> Cavium -> Marvell 57810/57840 chipset-based devices. I should note that with the new driver HPE also recommends Firmware Upgrade Utility 1.25.11 (12 Aug 2019), which contains combo image 7.18.2 / MFW boot code 7.15.68. The combination of the firmware upgrade and new driver will load firmware:

    esxcli network nic get -n vmnic0
       Advertised Auto Negotiation: true
       Advertised Link Modes: Auto, 20000BaseKR2/Full
       Auto Negotiation: true
       Cable Type: FIBRE
       Current Message Level: 0
       Driver Info:
          Bus Info: 0000:1b:00:0
          Driver: qfle3
          Firmware Version: Storm: MFW: 7.15.68

    1. I haven’t used qfle3 iSCSI in a while as I had everything replaced with Intels. I might try it again soon though, as I recently got another environment with these NICs. There’s a slightly newer driver here too: https://my.vmware.com/web/vmware/details?downloadGroup=DT-ESXI67-QLOGIC-QFLE3-10870&productId=742
      There were other problems too, luckily much rarer – https://kb.vmware.com/s/article/71361 – so I just turned the Option ROM off and left it as is. The newest drivers should have fixed it though.
      If that firmware was included in the latest SPP, then I’ve deployed it – so it’s probably fine.
